How do you think humans talk?

I apologize for the odd title, but I was having trouble phrasing things concisely. Basically, I’m a PhD student in experimental phonetics, and I just found out that I’ll be teaching a full-year undergraduate course in speech anatomy next year. I know the subject cold, but I’m terrified that my lectures will end up being confusing and hard to grasp.

Now I’m trying to make a pre-emptive list of common misconceptions about speech production, so I’d like to do an informal poll: without looking anything up, what are the rough dynamics of human speech? In other words, in as much detail as you’re comfortable with, could you describe what produces the voice, how it is produced, and how it is modulated precisely enough to let humans produce a wide range of noticeably different sounds with one system?

I’m happy to tender explanations afterwards, and I’m sure the other linguists on the board would be glad to do the same… but for now, I’m basically trying to get a good idea of how intelligent people who haven’t formally studied linguistics conceptualize the speech chain.

It’s late and I am headed for bed, but I just wanted to say this much before I go.
Sit down and ask yourself, "What do I want the students to be able to do (or know) when the class is over?" Write these things down.
Then ask yourself, "What knowledge/skills are the students supposed to bring to the party?" Write these down.
Subtract list 2 from list 1 and you now know exactly what you are going to teach. Put it in a logical order, and go wow the students with the best lectures they have ever heard.

To answer your question…

AFAIK speech is produced mainly by three mechanisms:

  1. Tension in the vocal cords.
  2. Tongue placement.
  3. Shaping of the air exit pathway by the lips (and to some extent, perhaps the mouth as a whole).

I don’t really know OTTOMH what muscular actions in each case produce what general effect (i.e., does tensing the vocal cords produce higher/lower pitch? Higher/lower volume? Something else?) – I just seem to recall that these are the three main mechanisms of vocal control.

HTH

I appreciate the advice. :slight_smile: I already have a list of benchmarks that the department provided; stuff that the students absolutely have to know. And I’ve taught graduate seminars before, so I’m confident in my ability to organize and execute lectures… what really worries me is the possibility of being incomprehensible. I’ve basically hung out exclusively with graduate students and professors for the last three years, and I’m so saturated in the high-level technobabble that I think there’s a very real risk of writing a lecture that sounds extremely reasonable and comprehensible to me and every other graduate student, yet still whooshes the poor third-years.

Hmmm…

My best guess:

Air moving through the vocal cords creates a resonating oscillation in the cords, which in turn creates sound waves. The specific qualities of these waves are influenced by the structure of the cords, the volume and pressure of the air passing through them, and the shape of the throat, mouth, tongue, and lips.

How close am I?

Ok; I think that the lungs propel air over the vocal cords, which produce sound much like a wind instrument (they may or may not vibrate string-style; I’m not sure). The vocal cords can be tensed or relaxed to alter the basic vocal tone; IIRC a few people can make the different cords have different tensions and pull off tricks like singing in harmony with themselves. The lungs act as a resonance cavity as well. The lips and tongue further modulate the sound, and use the expelled air to make further sounds as well.

Answering the OP directly and with a minimum of big words.

We take air in and out; we also have a bunch of organs (the whole upper respiratory system) which change position. One of the most important is the vocal cords, which we never really went into in detail in Bio class, but people who have them damaged don’t talk well, so based on that and on the name I imagine talking is *their* job (other organs, like the tongue, are useful for several things).

Depending on the position of the tongue, lips, nose, etc., you make different noises as air goes in and out of your lungs. The vocal cords and the tongue can vibrate (move following a rhythm); this vibration transmits to the air as it moves. Noise is a vibration (normally, in the air).

I’m always kind of surprised and dismayed by those singers that can’t sing when they take air in… to me it’s basic technique but there’s some who sound like “I’m singing gasp a song to gasp youuuu gasp uuuuuu”.

Well-trained or skilful singers also tend to display some pretty cool acoustic characteristics that non-singers typically don’t. Most noticeably, they have an “extra” formant (often called the “singer’s formant,” sitting around 3 kHz). Formants are basically meaningful components of the speech signal, in that listeners rely primarily on formant position to decipher which speech segment they’re hearing. There are three “main” formants we look at in speech, which have imaginatively been named the first, second, and third formants, but the speech signal contains a theoretically infinite number of formants at increasingly high frequencies. What’s interesting about singers is that they have a very clear formant which non-singers almost never manifest. Among other things, perception tests have shown that this formant is the main thing that allows listeners to pick a vocalist’s words out from the surrounding instruments during orchestral performances.
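If you want to see formants for yourself, here’s a minimal sketch in Python (assuming NumPy, SciPy, and Matplotlib are installed; the filename is just a placeholder) that plots a wideband spectrogram, where formants show up as dark horizontal bands:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load a mono recording (the filename is a placeholder).
fs, x = wavfile.read("sustained_vowel.wav")
x = x.astype(float)

# Wideband spectrogram: short analysis windows smear the individual
# harmonics together, which is exactly what makes formant bands visible.
f, t, Sxx = spectrogram(x, fs=fs, nperseg=int(0.005 * fs))

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto")
plt.ylim(0, 5000)  # the formants we care about live below ~5 kHz
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```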

… formant?

M-W:
Main Entry: for·mant
Function: noun
Date: 1901
: a characteristic component of the quality of a speech sound; specifically : any of several resonance bands held to determine the phonetic quality of a vowel

still leaves me kind of “ei?”

Let’s see if I get this: a “formant” is any of the elements that make a particular voice be that particular voice? Tone, pitch, etc., only they have another technical name? And you’re saying that being able to recognize two different operatic tenors singing the same aria is down to an extra formant over regular people?

(The “first, second, third” reminds me of how in Statistics stuff like the “average” and the “standard deviation” are officially called the first, second… Idon’trememberthetermrightnow - people still use the old-fashioned names, because they may be less exact but we don’t really need that level of exactitude)

The vibration stuff is how we produce sound but how we produce speech is by modulating the sound with our lips, tongue, and throat.

Different sounds are produced by different parts of the mouth & throat; “hard” consonants like K and G are possible only because humans have a sharp angle in the throat (other primates can produce vowel sounds and certain consonants like P and H that are made by the lips.)

Interestingly, this sharp angle in the throat also makes humans more prone to choking than nearly any other animal, and certainly any primate.

Heheh, I apologize; I was trying to go for a really generalized answer that skipped the theoretical stuff, but let me be pseudo-precise. (Warning! This deals with simple physics, which is not an area I specialize in. I’m also leaving out oodles and oodles of stuff for the sake of simplicity and readability. If you want a really thorough introduction, read Elements of Acoustic Phonetics by Peter Ladefoged. It’s a really neat book.)

Everyone who has replied basically has the right idea about how speech is produced. One very prevalent model in North America is the Source-Filter model, which basically claims that there are two big components at work making speech: a Source, which is vibrating air (which may or may not include buzzing introduced by the vocal folds), and a Filter, which is the post-glottal vocal tract. You know how acoustic instruments like guitars have sound boxes which amplify the sound of the string vibrating by boosting certain types of oscillation in the air? Well, the Filter in the vocal tract works in almost the exact same way. Think of the throat, oral cavity, and nasal cavity like a big cavern: you make a noise, and if the cavern has the right proportions when the noise you’ve made echoes, the echo will be louder than the original noise.
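If it helps to play with this, here’s a toy Source-Filter synthesizer in Python. This is just a sketch: the formant frequencies and bandwidths are rough textbook-ish values for an [a]-like vowel, not measurements.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000              # sample rate (Hz)
f0 = 100                # Source: vocal folds buzzing 100 times per second
n = int(fs * 0.5)       # half a second of sound

# Source: a pulse train, rich in harmonics of f0.
source = np.zeros(n)
source[::fs // f0] = 1.0

def resonator(signal, freq, bw):
    """Filter: one second-order resonance, like one 'echo' of the cavern."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    return lfilter([1.0 - r], [1.0, -2 * r * np.cos(theta), r * r], signal)

# Run the same source through three resonances (rough values for [a]).
out = source
for freq, bw in [(700, 80), (1200, 90), (2600, 120)]:
    out = resonator(out, freq, bw)
```

Swap in different formant frequencies and the very same buzz comes out sounding like a different vowel, which is the whole point of the model.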

The reasoning behind this amplification is fairly simple: the vocal tract essentially acts like a tube resonator (roughly, a tube closed at the glottis and open at the lips), so a very good, simplified way of understanding the properties of resonance which amplify the speech signal is to work through how such a tube behaves.

First, we’re going to look at simple harmonic resonance. The speech signal is extremely complex, but it is guided by the same basic principles.

As you’re probably aware, vibration just describes the oscillation of an object (in this case, molecules of air). If I take a spring and pull on it, it’ll get longer, but the moment I let go the spring will snap back to its resting state, and then some: in recovering from the displacement I introduced, it will actually compact itself, momentarily coiling more tightly than it was originally. The configuration our spring was in before we messed with it is called its “equilibrium position”. In essence, a harmonic oscillator like a spring, or the air molecules which vibrate during speech, is simply a system which, when displaced from its equilibrium position, experiences a restoring force proportional to the displacement. Since our spring has only one displacing force in this simplified model - the push given by the person who first pulled on it - it displays simple harmonic motion: if you were to tie a pencil to the vibrating end of the spring and put a moving sheet of paper under it like a seismometer, the pencil would draw a sinusoidal wave. And since there’s friction acting on the spring and damping the oscillation, it’s eventually going to stop vibrating back and forth and return to its equilibrium position.
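If you like seeing this numerically, here’s a tiny simulation of that damped spring (all parameter values are arbitrary, purely for illustration):

```python
# Damped simple harmonic oscillator: displace a mass on a spring and let go.
m, k, c = 0.5, 200.0, 0.4   # mass (kg), stiffness (N/m), damping (N*s/m)
x, v = 0.05, 0.0            # initial displacement (m) and velocity
dt = 1e-3                   # time step (s)

trace = []
for _ in range(5000):
    a = (-k * x - c * v) / m   # restoring force proportional to displacement
    v += a * dt                # semi-implicit Euler: velocity first,
    x += v * dt                # then position
    trace.append(x)

# trace is a sinusoid whose amplitude slowly decays as friction bleeds
# off energy - exactly the pencil-on-paper drawing described above.
```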

Air in a tube resonator essentially works in the same way, and this is vital to promoting the resonance which amplifies the signal. You see, the speech signal at the source - the glottis - is actually quite quiet in most cases. However, if we picture the vocal tract as an upside-down L-shaped tube and the vibration at the source as a wave, we can readily visualize how formants work.
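In fact, for the simplest possible case - a uniform tube closed at the glottis end and open at the lips - the resonances sit at odd multiples of a quarter wavelength, and you can compute them directly. (The 17.5 cm length below is a common textbook figure for an adult male vocal tract; treat the numbers as a sketch, not a measurement.)

```python
# Resonances of a uniform tube closed at one end (glottis), open at the
# other (lips): F_n = (2n - 1) * c / (4 * L).
c = 350.0   # speed of sound in warm, humid air, roughly (m/s)
L = 0.175   # vocal tract length (m), a textbook figure

for n in (1, 2, 3):
    print(f"Resonance {n}: {(2 * n - 1) * c / (4 * L):.0f} Hz")
# -> 500, 1500, 2500 Hz, close to the formants of a neutral (schwa-like) vowel.
```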

First off, let’s assume that we’re working with something like a vowel: the vocal folds are vibrating, air flows through the folds, through the mouth, and out past the lips. The first thing we need is the frequency of the vibration at the Source, the vocal folds. The frequency of the source (F0, or the fundamental frequency) in hertz is simply the number of vibrations of the vocal folds per second. Let’s choose a nice, clean number like 100. (The numbers we’re going to develop won’t work anywhere, and would never accurately model a real-life vowel. Be forewarned :).) Now, visualize a 100 Hz sine wave: since we have 100 cycles per second, the wave will repeat itself 100 times each second. The vibration at the source isn’t a pure sine wave, though: it also contains harmonics, which are whole-number multiples of the fundamental frequency. As the signal travels through the vocal tract, the tract resonates at various frequencies, and what we’re interested in are the resonances which boost the signal - that is, the resonances which fall on top of those harmonics. To visualize it, draw a 300 Hz sine wave over the 100 Hz wave. This wave is 3x the fundamental frequency, which makes it the third harmonic (counting the fundamental itself as the first). If you look at the drawing, you’ll notice that although the waves don’t match up in their entirety, they overlap at key points. This overlap is essentially what boosts the speech signal.
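Here’s the same drawing done in code, in case graph paper isn’t handy (a quick Matplotlib sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 0.02, 2000)        # 20 ms of time
fund = np.sin(2 * np.pi * 100 * t)    # the 100 Hz fundamental
third = np.sin(2 * np.pi * 300 * t)   # the 300 Hz third harmonic

# The two waves line up at regular points - the overlap that resonance exploits.
plt.plot(t * 1000, fund, label="100 Hz fundamental")
plt.plot(t * 1000, third, label="300 Hz (3rd harmonic)")
plt.xlabel("Time (ms)")
plt.legend()
plt.show()
```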

Now that we know what harmonics look like, formants are easy: a formant is just an acoustic peak in the signal’s spectrum - a resonance of the vocal tract that amplifies certain key harmonics.

Wikipedia actually has a nice visual reference at File:Spectrogram -iua-.png. The dark bands, which are noted with red arrows, are the formants.

My first impulse is to leave the vocal cords out, because you don’t use them to whisper, do you? But that’s in English. Chinese words depend on pitch and such, I think.

The vocal cords vibrate to produce sound of a particular key. (Pitch? I don’t know the proper words.) Most of the change to the sound seems to be through the lips, although some sounds rely on the tongue and teeth. It also seems to be possible to talk from different places in the mouth/throat… where the sound hits seems to be one of the big things that determines accents. I’m guessing this is about where the sound resonates the most, where it’s “thrown” by the throat. I mention this because I have no idea if it’s true, of course. (This is like an anti-general questions.)

The reason for the great variety of sounds is the flexibility of the opening of the “instrument”. Sound can be suddenly blocked, muffled, or distorted in many ways and on several levels, since the tongue, teeth, and lips are all involved. Some sounds can be produced by the mouth itself, without the vocal cords, and some languages rely on these.

Thank you. Since I actually know some Physics but my phonetics are limited to having to memorize the characteristics of the Spanish phonemes back in 9th grade, that made more sense than the abbreviated version.

It sounds really weird, but whispering actually puts more of a strain on the vocal folds than loud speech! It’s mainly due to the fact that during ordinary phonation, the vocal folds are either lax (during nil voicing: sounds like “p” and “s” that don’t require vibrating vocal folds) or tense (during voiced sounds like “b” and “z”). The vocal folds vibrate quite quickly during voicing, often hitting 100-250 vibrations per second depending on the speaker. The problem is that nerve conduction velocities don’t come anywhere close to that: the lateral cricoarytenoid muscles, which are the primary muscles involved in adducting the vocal folds, are innervated by the vagus nerve, which is the tenth cranial nerve in humans. (We like to be all fancy and Latin-y when naming anatomical structures, so you always refer to the cranial nerves with Roman numerals: Cranial Nerve X, or just CN X.) The nerve conduction velocity for the cranial nerves isn’t nearly fast enough to account for each individual vibration (I’m remembering that brain > vagus > LCA muscle averages 10-13 ms in your average person, but don’t quote me on it).

As such, what you actually do to vibrate the vocal cords is adduct them with the LCA and a few other muscles, and allow the Bernoulli effect to open and close them for you. Essentially, when you expel air from your lungs while the vocal cords are closed, the differential between sub-glottal pressure and supra-glottal pressure will be sufficient to force the vocal folds open, at which point air will rush through. However, the sharp increase in velocity as the obstruction (the vocal folds) disappears causes the pressure in the space between the folds, the glottis, to quickly plummet, sucking the folds back together. As soon as this happens, the sub-glottal pressure begins to build again, quickly exceeding the supra-glottal pressure and forcing the folds open… this pattern repeats until phonation ceases, and is the primary mechanism of vocal fold vibration.
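If a toy helps: the cycle above behaves like a relaxation oscillator, and you can get self-sustained “vibration” out of a few lines of Python. (This is emphatically not a real vocal-fold model - the literature uses one- and two-mass models for that - and every number here is made up just to land in the 100-250 Hz range mentioned above.)

```python
# Toy relaxation oscillator in the spirit of the Bernoulli cycle above.
m, k, r = 0.1, 80000.0, 5.0   # effective mass, stiffness, damping (made up)
P_push = 8000.0               # subglottal pressure pushing the folds apart
P_suck = -4000.0              # Bernoulli-style suction while the glottis is open
x, v, dt = 0.0, 0.0, 1e-5     # displacement from midline, velocity, time step

opening = []
for _ in range(20000):                 # 0.2 s of simulated time
    F = P_suck if x > 0 else P_push    # the force flips as the glottis opens
    a = (F - k * x - r * v) / m
    v += a * dt
    x += v * dt
    opening.append(max(x, 0.0))        # the folds can't overlap past midline

# Neither regime contains its own resting point (closed wants to open,
# open wants to close), so the system never settles: it cycles open-shut
# indefinitely, which is sustained phonation in miniature.
```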

So that’s how ordinary vibration happens. The thing is, when you whisper you aren’t just holding the vocal folds shut: instead you’re spreading them in a kind of Y-shaped opening and exerting a certain amount of medial and longitudinal tension to maintain that shape during phonation, so it actually tires your vocal folds out faster than if you’d spoken with normal voicing.

And yes, many of the Chinese languages (and a lot of languages in that region in general) make use of what’s called “lexical tone,” which means that changing the pitch will actually change the meaning of a word. Pitch variation itself is, as far as we know, a universal feature of language: every human language that’s been documented displays systematic variations in pitch, conditioned by different things in different languages. What isn’t universal is what these changes in pitch signal, and only a subset of the world’s languages use lexical tone.

On a side note, it’s interesting that though pitch variation appears to occur universally, stress only occurs in a limited subset of the world’s languages.

No problem! And yeah, formants are kind of odd. They’re basically just acoustic peaks which amplify key harmonics.