How does a wave carry all of the complexities of sound?

I know there is an easy answer - our ears/brains interpret a wave as sound, thus if I can reproduce a wave that our brains have learned (or are built) to interpret as X, we hear X.

But I’m not satisfied with that. Sound is extremely complex. For example, why does a middle A on a piano sound different from a middle A on a clarinet? It’s the overtones, and we can do some fancy wave addition to come up with two different waveforms for the piano and the clarinet. But now play both simultaneously and we get a third waveform - so how is it that we “hear” both instruments?

So my question is: how can these waveforms contain all the information we need for the richness of sound, instead of us just hearing modulated sine tones?

I’m not sure I understand your question. The sum of modulated sine tones IS all the information (between about 20 Hz and 20 kHz, anyway).

The simple answer is that any waveform, no matter how complex, can be broken down into a sum of simple sine waves. In fact, this is kinda how your brain “hears”. Your ears have a bunch of itty bitty cells that act like tiny bandpass filters, activating nerves when their specific tiny frequency range is detected.
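To make that filter-bank picture concrete, here is a minimal NumPy/SciPy sketch that uses a handful of Butterworth bandpass filters as stand-ins for those frequency-selective cells. The centre frequencies, bandwidths, and the two-tone test signal are all invented for illustration, not physiological values.

```python
# Rough sketch of the "bank of bandpass filters" idea: split one waveform
# into per-band energies, the way frequency-selective hair cells do.
# Centre frequencies and bandwidths here are arbitrary, not physiological.
import numpy as np
from scipy.signal import butter, sosfilt

fs = 44100                                   # sample rate, Hz
t = np.arange(0, 0.5, 1 / fs)
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1320 * t)

centers = [250, 440, 880, 1320, 2600]        # a handful of "hair cell" bands
for fc in centers:
    low, high = fc / 2 ** 0.25, fc * 2 ** 0.25   # roughly a half-octave band
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, signal)
    print(f"{fc:5d} Hz band energy: {np.sqrt(np.mean(band ** 2)):.3f}")
# Only the bands containing 440 Hz and 1320 Hz show appreciable energy.
```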

You’re not actually hearing the waveform. You are hearing the waveform converted into a bunch of frequency signals, and those signals react fairly slowly. Even though you can hear frequencies in the 20 Hz to 20,000 Hz range (sorta - the range narrows significantly as we age, and at all ages there’s significant variation in what people can hear), you can only detect changes in sound that occur in about 1/15th to 1/20th of a second or so.

In other words, if we have two different “bang” sounds, and the first time we play them so that the sound waves start simultaneously, and the second time we play them so that one is delayed by 1/20th of a second, the resulting sound waves will look very different, but your brain won’t be able to tell them apart.

A sound like a flute will be very close to a simple sine wave. A sound like a brass instrument is closer to a square wave, and if you break that down into its component sine waves, you’ll find that it has a lot of harmonics added into the mix. But you can break that brass sound down into a bunch of sine waves, and that’s exactly what your ears and brain do.
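To illustrate, here is a small NumPy sketch along those lines: a pure sine standing in for the flute and a tone with several harmonics standing in for the brass (the harmonic levels are made up), with an FFT used to read the partials back out.

```python
# Compare a near-sine "flute-like" tone with a harmonic-rich "brass-like"
# tone and read off their components with an FFT.  Harmonic amplitudes are
# invented for illustration only.
import numpy as np

fs, f0 = 44100, 440.0
t = np.arange(0, 1.0, 1 / fs)

flute = np.sin(2 * np.pi * f0 * t)                  # essentially one partial
brass = sum(np.sin(2 * np.pi * f0 * n * t) / n      # many strong harmonics
            for n in range(1, 8))

for name, tone in [("flute", flute), ("brass", brass)]:
    spectrum = np.abs(np.fft.rfft(tone)) / len(tone)
    freqs = np.fft.rfftfreq(len(tone), 1 / fs)
    print(name, "partials near:", np.round(freqs[spectrum > 0.01]))
# "flute" shows a single peak at 440 Hz; "brass" shows peaks at 440, 880,
# 1320, ... Hz -- the same fundamental, a different mix of overtones.
```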

But again, while the sine wave part is fairly simple, your brain’s pattern matching ability is anything but simple. Your brain hears the complex frequency pattern of a brass instrument, and out of all of the things that you have experienced and heard in your life, your brain identifies it as a brass instrument and even predicts somewhat how it should sound. By using that prediction, your brain can also decipher if other frequencies are jumbled into the mix, and can possibly identify other instruments mixed in there as well.

While the sine wave parts are simple, the brain’s role in all of this is so ungodly complex that we don’t have a freaking clue how it works. We know what signals the brain gets from the ear, but we don’t know how the brain matches patterns. The human brain is just too complex for us to understand it, at least for now.

To some extent that’s an ill-defined question - a lot depends on whether you are asking about the sound waves, the ear, or the brain ‘pieces’. That said.

For instruments, each one has its own set of overtones, harmonics, etc., determined mostly by geometry (lip position, shape of the air tube, materials, etc.). Combine two instruments with the same primary frequency (say concert A), and each of those fundamentals, overtones, harmonics, etc. will interact. The number of interactions grows combinatorially, since, say, the trumpet’s fundamental will interfere with the fundamental and every other frequency of the piano. Add another instrument and you increase the number of combinations yet again. If you route the sound through a spectrum analyzer, you’ll see the whole messy frequency & volume content at each instant.
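As a rough illustration of that interaction, here is a NumPy sketch with two invented harmonic recipes standing in for a piano and a trumpet on the same concert A. The shared partials reinforce or partially cancel depending on relative phase, yet every partial of both recipes is still present in the mix.

```python
# Two synthetic "instruments" playing the same concert A (440 Hz) with
# made-up harmonic recipes.  Their waveforms simply add, and how much the
# shared partials reinforce or cancel depends on their relative phase.
import numpy as np

fs, f0 = 44100, 440.0
t = np.arange(0, 1.0, 1 / fs)

def tone(harmonic_amps, phase=0.0):
    return sum(a * np.sin(2 * np.pi * f0 * (n + 1) * t + phase)
               for n, a in enumerate(harmonic_amps))

piano_like = tone([1.0, 0.5, 0.3, 0.1])              # invented recipe
freqs = np.fft.rfftfreq(len(t), 1 / fs)
bin_440 = np.argmin(np.abs(freqs - 440))

for ph in (0.0, np.pi / 2, np.pi):
    trumpet_like = tone([0.8, 0.7, 0.6, 0.4], ph)    # invented recipe
    mix = piano_like + trumpet_like
    amp = 2 * np.abs(np.fft.rfft(mix))[bin_440] / len(t)
    print(f"relative phase {ph:.2f} rad -> 440 Hz partial amplitude {amp:.2f}")
# In phase the fundamentals add (1.0 + 0.8 = 1.8); out of phase they mostly
# cancel (0.2) -- yet every partial of both recipes is still in the mix.
```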

On the ear and brain side (not the beer and rain sides), I dunno, but in my experience you can switch between focusing on a single instrument and hearing the music as a whole. Time for another analogy or five.

AAUI, dogs can break an odor up into all its little component odors, so they can focus on one scent and track it. Humans without training just get the overall picture - but with practice you can taste a stew not just as ‘good stew’ but as ‘strong beef flavor, notes of bay, sage, and red wine, and too much salt’.

On the optic side, you get a lot less info than with sound or smell - just light receptors saying ‘light here’. The brain has sections that break down what you see by horizontal lines, vertical lines, motion, etc. And then it magically puts all those pieces back together.

Ultimately, sound waves contain all the necessary information, since if you played back the exact spectrum (including sound outside the normal human range) you’d hear the original sound. Converting that spectrum into ‘talk’ or ‘music’ involves parts of the brain that I don’t know anything about.

So the ear does a Fourier transform and sends the data as separate signals to the brain?

No, your brain gets the dry (raw) signal and does the Fourier transform before sending it on to the parts of your brain that make you aware of it.

The cochlea in the inner ear has hairs. Different frequencies stimulate different hairs. It is more like a bank of tuned resonators than a Fourier transform.

Actually, the brain does more of a wavelet transform, based upon inputs from various arrays of mechanosensory cells in the Organ of Corti. Each of these arrays is connected to stereovilli which, as gazpacho notes, basically act as resonators for a given frequency range. The increase in intensity is determined by how many cells are stimulated to potential over a period of time. The cochlea, which contains all of this, is basically a big amplifying horn with frequencies coupled to those resonances. The auditory cortex of the brain is then somehow able to take all these multitudes of wavelets and integrate them, identifying them as natural sounds, melodic patterns, and speech. Exactly how it does this we don’t really know and cannot replicate directly in machine learning, because the brain doesn’t simply recognize and match patterns against a database of vocabulary but is able to place them into a context which (generally) separates “word-like” noises from coherent speech.

Stranger
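Not a model of the Organ of Corti, but to give a feel for what a wavelet-style analysis buys you over a single global Fourier transform, here is a toy NumPy sketch with a hand-rolled complex Morlet kernel (the signal, frequencies, and wavelet parameters are all arbitrary choices): each scaled wavelet responds only where and when its frequency is present.

```python
# Toy continuous-wavelet-style analysis with a hand-rolled complex Morlet
# kernel.  Not a cochlea model -- just an illustration of analysing a signal
# with a bank of scaled, time-localised, resonator-like kernels.
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
# A note that jumps from 300 Hz to 600 Hz halfway through.
sig = np.where(t < 0.5, np.sin(2 * np.pi * 300 * t), np.sin(2 * np.pi * 600 * t))

def morlet_response(signal, freq, cycles=6):
    """Magnitude of the signal's correlation with a Morlet kernel at `freq`."""
    dur = cycles / freq
    tw = np.arange(-dur, dur, 1 / fs)
    kernel = np.exp(2j * np.pi * freq * tw) * np.exp(-(tw ** 2) / (2 * (dur / 3) ** 2))
    return np.abs(np.convolve(signal, kernel, mode="same"))

for f in (300, 600):
    resp = morlet_response(sig, f)
    half = len(resp) // 2
    print(f"{f} Hz wavelet: first half {resp[:half].mean():.1f}, "
          f"second half {resp[half:].mean():.1f}")
# The 300 Hz wavelet responds strongly only in the first half, the 600 Hz
# wavelet only in the second -- frequency *and* timing information together.
```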

Pretty much, yes.

The conversion to the frequency domain is done in the ears, not the brain.

Personally, I’d call it an electro-mechanical discrete Fourier transform conversion. :)

Anyway, there are actually two sets of hair cells, and I don’t think scientists fully understand how they all work together. The outer hair cells seem to act more like amplifiers of sorts, which has the effect of amplifying quiet sounds and dampening loud sounds so that the dynamic range of the overall hearing system is greatly enhanced. The inner hair cells are the ones that pretty much drive the auditory nerves and do the actual “hearing”, if you will.

This is, as you already see, a big subject.

To add a few things, on both sides of the question.

The ear is a very interesting device. One very interesting aspect of its function is that it isn’t passive. There is a significant bundle of nerves that relays control from the brain to the ear. This control seems to act on very small muscle tissue at the base of the hairs, and acts to tweak their response. The ear/brain uses this to achieve astounding frequency resolution. Trained musicians and instrument technicians get down to about one cent of frequency resolution - that is, one-hundredth of a semitone. However, there is no such thing as a free lunch, and this resolution comes at a price. The ear is insensitive to energy close in frequency to a louder tone (known as spectral masking). (This feature underpins a key trick used in some audio compression systems.) Also, the ear/brain takes time to acquire the signal well enough to allow such resolution. The same feedback system is also used to provide some amount of automatic level control. The trick for the ear/brain is not that the ear sends the brain information with such insane frequency resolution, but that the brain sees how much it needed to tweak the system to find the dominant sound in a band, and that information tells the brain directly what the frequency offset is.
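To put a number on “one cent”: it is a frequency ratio of 2^(1/1200), so at concert A it is only about a quarter of a hertz, as this trivial snippet shows.

```python
# A cent is 1/100 of an equal-tempered semitone, i.e. a ratio of 2**(1/1200).
import math

A4 = 440.0
one_cent_up = A4 * 2 ** (1 / 1200)                        # one cent above concert A
print(f"{one_cent_up:.3f} Hz")                            # ~440.254 Hz, about 0.25 Hz away
print(f"{1200 * math.log2(one_cent_up / A4):.3f} cents")  # 1.000, by construction
```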

The ear sends the brain phase information for frequencies less than about 1 kHz. Indeed, it seems the brain probably receives what amounts to an encoding of the waveform in real time up to about this frequency. Phase information allows the brain to localise sound. It is probably no coincidence that the wavelength at which this information stops being encoded is roughly the width of the head - the distance at which phase information would become useless.
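A quick back-of-envelope check of that wavelength claim, assuming roughly 343 m/s for the speed of sound and a nominal 0.2 m head width (both round figures):

```python
# Back-of-envelope check of the "wavelength vs. head width" point.
# Assumes ~343 m/s for the speed of sound and a nominal ~0.2 m head width.
speed_of_sound = 343.0          # m/s, at room temperature
head_width = 0.2                # m, rough interaural distance
for f in (200, 500, 1000, 2000, 5000):
    wavelength = speed_of_sound / f
    print(f"{f:5d} Hz -> wavelength {wavelength:.2f} m "
          f"({wavelength / head_width:.1f} head widths)")
# Around 1-2 kHz the wavelength shrinks to roughly the size of the head,
# which is where phase cues stop being a reliable localisation signal.
```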

But to a core part of the OP’s question: how is it that such apparently complex information can be carried by simply vibrating air? You can view this as an information theory problem. Shannon tells us that the information-carrying capacity of a channel is the dynamic range times the bandwidth (or more properly the integral of dynamic range with respect to frequency across the band). So how much information is available? The answer is a bit messy, because we have very different dynamic ranges at different frequencies, and we lose high frequencies with age. But to a good approximation the ubiquitous CD format (about 98 dB of dynamic range over 20 Hz to 20 kHz) is pretty much the upper limit. That is, if you want to capture everything the ear can use, you will need to match that, and that gives you a metric of the information content. Now the ear itself can’t match that - issues like masking, the Haas effect, and a host of others mean you can actually discard a lot of information and not detect the loss by ear. But if we are not so much worried about the ear as about sound in general and mechanical recording (and perhaps machine-based detection and analysis), we can stick with the whole channel content and not worry about the nuances.
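For a rough sense of the numbers, here is the Shannon-Hartley calculation using the CD-like figures above (20 kHz bandwidth, ~98 dB treated as an SNR, which is a simplification), compared with CD’s actual per-channel bit rate:

```python
# Rough Shannon-Hartley estimate using the CD-like figures above
# (20 kHz bandwidth, ~98 dB dynamic range treated as an SNR -- a
# simplification), compared against CD's actual per-channel bit rate.
import math

bandwidth_hz = 20_000
snr_db = 98
snr_linear = 10 ** (snr_db / 10)
capacity = bandwidth_hz * math.log2(1 + snr_linear)   # bits per second
cd_per_channel = 44_100 * 16                          # samples/s * bits/sample

print(f"Shannon capacity : {capacity / 1e3:.0f} kbit/s per channel")   # ~651
print(f"CD raw bit rate  : {cd_per_channel / 1e3:.1f} kbit/s per channel")  # 705.6
# Both land in the same few-hundred-kbit/s-per-channel ballpark, which is
# the sense in which CD "pretty much" covers the channel the ear can use.
```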

Written down, music is just dots on a page. The information content of a sheet of music is remarkably low. You can get MIDI files with transcriptions of a lot of music, including many major symphonic works; they are measured in kilobytes. This underlies a key part of thinking about sound waves. If you look at a wave in the time domain it looks like a highly complex mess. You see a sine wave on a computer and it takes lots of samples and lots of information to capture it. But it can also be represented with only a few parameters: start time, duration, frequency, amplitude. You could get that in 8 bytes. Add the first 5 harmonics and it is only 48 bytes. Yet if you looked at it on a screen in the time domain it would look very complicated. A simple MIDI-based software player could take the MIDI file, go to its simple instrument database, use trivial harmonic content descriptions for the various instruments, and play the symphony. It would sound pretty dreadful, but it would be quite recognisable. Add envelope descriptions for each instrument (which can be a very small number of parameters - the ubiquitous attack, decay, sustain, release as a start) and you could code the entire instrument database in only a few kilobytes. There was once a semi-serious suggestion that an advanced MIDI-style encoding might become a viable very-high-compression-ratio music delivery format.
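To make the byte counts concrete, here is a small NumPy sketch of a one-second note described both ways: as raw 16-bit samples, and as the handful of parameters (start, duration, frequency, amplitude, and a made-up list of harmonic levels) from which the waveform can be regenerated.

```python
# The same one-second note as raw 16-bit samples versus the handful of
# parameters it can be rebuilt from.  Harmonic levels are invented.
import numpy as np

fs = 44100
params = {
    "start": 0.0, "duration": 1.0, "freq": 440.0, "amp": 0.8,
    "harmonics": [1.0, 0.5, 0.25, 0.12, 0.06, 0.03],   # invented relative levels
}

# Regenerate the waveform from the parameters alone.
t = np.arange(0, params["duration"], 1 / fs)
wave = params["amp"] * sum(h * np.sin(2 * np.pi * params["freq"] * (n + 1) * t)
                           for n, h in enumerate(params["harmonics"]))

raw_bytes = len(wave) * 2                          # one second of 16-bit PCM
param_bytes = 4 * (4 + len(params["harmonics"]))   # ~4 bytes per parameter

print(f"raw PCM        : {raw_bytes:,} bytes")     # 88,200 bytes
print(f"parametric form: {param_bytes} bytes")     # 40 bytes
# The waveform looks complicated sample by sample, but it is fully
# determined by the short parameter list -- the MIDI idea in miniature.
```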

The point? It depends which domain you look at something in. Looked at in the time domain, musical sound seems dreadfully messy. Looked at in the frequency domain, it starts to look a lot less so. Yet the two are duals of one another. In the abstract you can transform from one domain to the other with no loss of actual information content. There are an infinite number of possible domains you could use, but the frequency domain is probably the most used alternative to time, and the go-to method of getting there is the Fourier transform. (There are others, and as alluded to above, the ear is less like an FT than a bank of resonators.)

The Fourier transform is neat in that it is bi-directional. Take your sound, apply the FT - you’re in frequency space. Apply the inverse FT and you get back your time-domain sound. This, with a few other nice properties, forms the basis of a very large part of the world of signal processing. The duality of the two worlds is the key.
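A tiny NumPy demonstration of that round trip, on an arbitrary test signal:

```python
# Transform a waveform to the frequency domain and back, and check that no
# information was lost along the way.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)          # any waveform at all, noise included

X = np.fft.fft(x)                      # time domain -> frequency domain
x_back = np.fft.ifft(X).real           # frequency domain -> time domain

print(np.allclose(x, x_back))          # True: the two domains are duals
```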

Here’s a mind-f*ck for you. Take two tuning forks of slightly different frequencies and hold them up close to one ear. You’ll hear beats with a frequency equal to the difference between the frequencies of the forks. That’s interference changing the effect on the mechanics of the ear.

Take one of those forks and place it by the other ear. You’ll hear pretty much the same thing, but now it’s an effect of the brain processing the signals coming from each ear.

This intriguing phenomenon is called:

https://en.wikipedia.org/wiki/Binaural_beats

Apparently it happens in the inferior colliculus of the midbrain and the superior olivary complex of the brainstem. It’s all in your head…
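As an aside, the single-ear case is easy to check numerically. A minimal NumPy sketch, with 440 Hz and 444 Hz standing in for the two forks, shows the combined signal’s loudness wobbling at the 4 Hz difference frequency:

```python
# The monaural case (both forks at one ear) falls straight out of adding two
# sines: 440 Hz + 444 Hz gives a loudness that wobbles at the 4 Hz difference.
import numpy as np

fs = 44100
t = np.arange(0, 2.0, 1 / fs)
mix = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 444 * t)

power = mix ** 2                                   # instantaneous "loudness"
spectrum = np.abs(np.fft.rfft(power))
freqs = np.fft.rfftfreq(len(power), 1 / fs)
low = (freqs > 0.5) & (freqs < 20)                 # look well below the audio band
print(f"beat frequency: {freqs[low][np.argmax(spectrum[low])]:.1f} Hz")   # 4.0 Hz
```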

For what it’s worth, it would actually be possible to play a bunch of different notes on a bunch of different flutes in just such a way that it would sound like a clarinet. But part of the pattern-matching process in the brain is in trying to figure out the simplest or most likely combination of sources to produce the signal it’s getting, and so when you hear that sound, you think “clarinet” not “dozens of flutes”.

This is a lot trickier than it sounds, as some of my colleagues discovered when working on a very similar problem with gravitational waves. For the instrument they were working on, they were expecting most of the signal to come from a bunch of white dwarf binary stars, plus a little from more exotic things like black holes. So their first attempt at analysis was to find and subtract out all of the white dwarfs, but when they did that, there was nothing at all left, because it turns out that, just like you can make a clarinet sound out of a combination of flute sounds, so too can you make a black hole gravitational wave out of a combination of white dwarf gravitational waves.

In fact, you can produce any complex signal, to any arbitrary degree of precision in both the frequency and time domains, from a superposition of identical wavelets whose only parameters are a center frequency and a scale factor - something that cannot be done using classic Fourier methods.

This sounds like a very elementary signal analysis problem that every scientist and engineer working with digital signal processing runs into at some point. You can “subtract out” Gaussian sources of noise and pull a signal out of the noise floor, but if you try to isolate a series of arbitrary signals you’ll find that your target signal is also composed of arbitrary signals, for exactly the reason described above. More complex adaptive filtering is required to isolate the target signal from the non-random background, either by specifying the parameters of the target signal or those of the interfering signal(s), which can often be completely arbitrary. Basically, if you are going to find a needle in a haystack, remove everything that is yellow or smells like cut grass - a laborious task.

Stranger

Francis Vaughan, that was a helpful explanation of some of these matters, such as the differences among a sheet-music description of a musical performance, a MIDI-level description, and a full sonic description; and how something can look complex in one domain, but simpler in another.

To some extent this is just me speculating, but some nice in-depth answers have already been given, so I think I might be able to provide some alternative ways to look at it, based on some things I’ve read/heard about.

Firstly, I would suggest looking up the McGurk effect. Here’s one example video:

Basically, I would suggest that if you were born blind and everyone refused to teach you anything, speak to you, etc., and all music was played to you from a single, non-stereo speaker, you would not be able to determine that one piece of music was performed on one instrument and another was performed on multiple. You are receiving a single, additive wave of the various inputs merged together and, absent prior information, there is nothing about the brain or the ear that would have any power to reverse-engineer what the inputs were.

If I took a musician with a very good ear for pulling out the individual instruments in a song, but who had only ever been introduced - in his whole life - to exactly three instruments, and who was completely unaware of the existence of any others, and I played him a recording of a song performed on a variety of instruments that were not his three, it is likely that he would hear the song as played on his three. Maybe he would interpret them as being played badly to achieve wacky sounds; maybe he would think there were twice as many players performing as there really were, because a particular chord would be impossible to play on any one of his instruments - I don’t know. But he would probably be very specific, able to tell you exactly how to reproduce something very like that song using his three instruments, and he would be convinced that he had done everything correctly.

If you took a child and, over their life, showed them film of instruments being played, but the videos had been timed to match the notes - so you could see them being struck - while the audio was actually played on the wrong instruments, and not even consistently the same wrong instrument, then that child would grow up having learned to distinguish which instrument makes which sound completely wrongly. When played a new, previously unheard song, they would break down the notes in accordance with the visual cues that had been provided before.

When we break a song down into separate parts, we’re using our library of knowledge - what each instrument sounds like, what parts in a song it plays, which styles of music it is used in, what notes are possible, etc. - and we’re making assumptions, even with visual clues, about what, all added together, would make the single wave that we are receiving. We hear a single instrument being played, and we remember what it sounds like on its own. We see a band playing a song, and we know how their gestures map to the notes we’re hearing. Our brain remembers all of these things and, later, when we don’t have other cues to go by, it assumes a composite that it reckons would add up correctly - and it does so transparently to us.

Fundamentally, the brain is a really good prediction engine. You show it a bunch of sample data and the inputs which produced it, and it can perform that math, fuzzily, in either direction, ever after. It doesn’t really matter what kind of signal it is. So far as the brain is concerned, audio is just electrical synapses, video is just electrical synapses, taste is just electrical synapses, and memory is just electrical synapses. Different receptors might have different fidelity, so you can pull more subtlety out of one data source than another, but once it hits the brain, everything has been converted into a fungible metric that it can do math on.

Different neural cells learn what direction(s) and at what amplitude to send out signals based on what came in from what direction. We could call this a mathematical transformation, where we have variables coming in, the transformation is applied, and the output distributed to neighbors. They have (as I understand it) a sort of memory where they are able to maintain a particular set of internal constants to apply in the transformation and if, for example, the brain is full of dopamine, it assumes that the constants are good and it holds to them. If the brain is full of cortisol, it assumes that the constants are bad and it’s more willing to try out new, random constants and see how that goes.

With billions (or however many) of neurons, it allows for some really complex (but fuzzy) analysis, using dozens of data receptors at very fine tolerances.

Was that George Duke or Jeff Beck? Miles Davis or John McLaughlin? Without seeing you might never know. With Allan Holdsworth you need to see the video in order to understand what’s going on. There’s just too much.

Identifying different musicians is a really interesting element of this question. There are so many nuances.

A huge part of many of the cited musicians is that they have a very distinctive compositional or improvisational style. (OK, Miles had more than a few over his career.) So you want to eliminate some of that.
One could have them all play the same well known piece, one none of them wrote. It might become much harder, but not impossible. I can recognise some classical musicians playing standard repertoire pieces by their tone and style. (I find this especially so with conductors.) Professional musicians I know regard such recognition as just part of their lives. Just what it is that allows this to be done is really difficult to guess at.

Holdsworth had such a distinctive legato style, and a wealth of harmonic ideas in angular odd chords, he is really hard not to notice. Jeff Beck is so different that you would have no difficulty picking him. And so on.

Also, their instrument of choice, and tone of choice: guitar, amplifier, settings. Holdsworth, in his Synthaxe days, makes for a very interesting comparison. A Synthaxe is just a controller, so we are back to the equivalent of music as MIDI commands. Where is Holdsworth in that? If I found a MIDI file Allan had created with the Synthaxe and then rendered it with a modern synthesiser, is it still Allan playing? I bet almost anyone who knew his work would instantly recognise those signature elements of his style. You can get piano rolls created by Rachmaninov. He himself agreed that they captured his playing, despite his initial scepticism. (A piano is only coded by timing and hammer velocity, so the rolls could reasonably exactly capture what the piano did mechanically, allowing perfect reproduction of hammers on wires.) There’s not a lot of difference between a piano roll and a MIDI file - and you have the man himself agreeing it captured him.

For the OP. Yep, this information still finally comes over nothing more than compressions and rarefactions of the air.

Francis Vaughan and others explained this well, but you are essentially asking how a simple wave carries all these complexities. The answer is that the wave is not simple, and you are likely thinking of it in the time domain. Examined in the frequency domain on a waterfall graph, you can see hints of the complexity.

E.g., here is a frequency-domain waterfall graph of music on an AM radio station. The waveform at the top is an instantaneous time-domain signal, which by itself does not visually reflect the information complexity. The pattern below is a waterfall representation where amplitude is coded as color. This is a double-sideband RF signal, but you’d see something vaguely similar in the audio spectrum.

The horizontal striations are music beats and the geometric patterns are vocals and instruments. The bright vertical line is the carrier, and (in this case) the information is symmetrically duplicated on both sides.

Another way to view the complexity of polyphonic music is with editing software like Melodyne, which can edit individual notes within a polyphonic chord. So the true complexity of sound is better appreciated in these other representations, not just as a time-domain waveform: https://www.youtube.com/watch?v=9FScFKuXXM0
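For anyone curious how such a waterfall is computed, here is a minimal sketch using scipy.signal.spectrogram on a synthetic audio-band signal (the frequencies and timings are invented). Each column of the result is one instant’s spectrum, and stacking the columns in time gives the waterfall.

```python
# How a waterfall display is computed: a short-time Fourier transform, here
# via scipy.signal.spectrogram on a synthetic audio-band signal (not RF).
import numpy as np
from scipy.signal import spectrogram

fs = 8000
t = np.arange(0, 2.0, 1 / fs)
# A 440 Hz note for the first second, then 440 Hz and 660 Hz together.
sig = np.sin(2 * np.pi * 440 * t) + np.where(t > 1.0, np.sin(2 * np.pi * 660 * t), 0.0)

f, times, Sxx = spectrogram(sig, fs=fs, nperseg=1024)
for t_slice in (0.5, 1.5):
    col = Sxx[:, np.argmin(np.abs(times - t_slice))]
    strong = f[col > col.max() * 0.1]
    print(f"t = {t_slice:.1f} s, strong frequencies near {np.round(strong)} Hz")
# Each column of Sxx is one instant's spectrum; stacked in time and coloured
# by amplitude, those columns are exactly the waterfall display.
```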

No it isn’t. Sound is incredibly well understood. The only people who say otherwise are ones who think they’ve outsmarted digital audio.

Because even though the two notes have a fundamental of 440 Hz, they have a different mix of harmonics.

There’s a concept called the “missing fundamental”: even though we can’t hear the fundamental of a note, we hear enough of the harmonics, in the correct ratios, that our brains can piece together what the note is supposed to be.

Think of it this way: when you talk on the telephone, the fundamental frequencies of your voice are stripped out in transit. Yet the listener can still recognize it as your voice.
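A small NumPy sketch of the missing fundamental, using an invented tone built only from harmonics 2-6 of 220 Hz: the spectrum confirms there is no energy at 220 Hz, yet the waveform still repeats every 1/220 s, which is the period the ear latches onto.

```python
# "Missing fundamental" sketch: a tone built only from harmonics 2-6 of
# 220 Hz.  There is no energy at 220 Hz itself, yet the waveform's repetition
# period is still 1/220 s, which is why the pitch is heard as 220 Hz.
import numpy as np

fs, f0 = 44100, 220.0
t = np.arange(0, 1.0, 1 / fs)
tone = sum(np.sin(2 * np.pi * f0 * n * t) for n in range(2, 7))   # no n = 1 term

spectrum = np.abs(np.fft.rfft(tone)) / len(tone)
freqs = np.fft.rfftfreq(len(tone), 1 / fs)
print("energy at 220 Hz :", round(spectrum[np.argmin(np.abs(freqs - 220))], 3))  # ~0
print("partials present :", np.round(freqs[spectrum > 0.01]))     # 440..1320 Hz

# The waveform still repeats every 1/220 s -- the period the ear latches onto.
lag = int(fs / f0)
print("correlation at a 1/220 s shift:",
      round(np.corrcoef(tone[:-lag], tone[lag:])[0, 1], 3))        # ~1.0
```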

Because all sound is composed of sine waves.