To some extent this is just me speculating, but some nice in-depth answers have already been given, so I think I can offer some alternative ways to look at it, based on things I’ve read or heard about.
Firstly, I would suggest looking up the McGurk effect. Here’s one example video:
Basically, I would suggest that if you were born blind and everyone refused to teach you anything, speak to you, etc., and all music was played to you from a single, non-stereo speaker, you would not be able to determine that one piece of music was performed on one instrument and another was performed on multiple. You are receiving a single, additive wave of the various inputs merged together and, absent further information, neither the brain nor the ear has any power to reverse engineer what the inputs were.
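To make the “single additive wave” point concrete, here’s a tiny sketch (my own illustration, with made-up instruments, frequencies, and amplitudes) of two sounds arriving at a speaker as one summed signal:

```python
# A minimal sketch, not from any source: two "instruments" playing different
# notes reach the ear as one summed pressure wave.
import numpy as np

sample_rate = 44_100                          # samples per second
t = np.linspace(0, 1.0, sample_rate, endpoint=False)

flute = 0.6 * np.sin(2 * np.pi * 440 * t)     # hypothetical instrument A: 440 Hz
cello = 0.4 * np.sin(2 * np.pi * 220 * t)     # hypothetical instrument B: 220 Hz

mixed = flute + cello                         # the speaker/eardrum only ever gets this sum

# Nothing in `mixed` labels which part of each sample came from which instrument;
# any split back into parts is an inference, not a measurement.
print(mixed[:5])
```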
If I took a musician with a very good ear for pulling out the individual instruments in a song, but who had only ever been introduced - in his whole life - to exactly three instruments, and who was completely unaware of the existence of any others, and I played him a recording of a song performed on a variety of instruments that were not his three, it is likely that he would hear the song as if it were played on his three. Maybe he would interpret them as being played badly to achieve wacky sounds; maybe he would think there were twice as many players performing as there really were, because a particular chord would be impossible to play on any of his instruments - I don’t know. But he would probably be very specific and able to tell you exactly how to reproduce something very like that song using his three instruments, and he would be convinced he had gotten everything right.
If you took a child and, over their life, showed them film of instruments being played, but it was actually video that had been timed to match the notes - so you could see them being struck - while the audio was played on the wrong instruments, and not even consistently the same wrong instrument, then that child would grow up learning to distinguish which instrument made which note, completely wrongly. When played a new, previously unheard song, they would break down the notes in accordance with the visual clues they had been given before.
When we break a song down into separate parts, we’re using our library of knowledge - what that instrument sounds like, what parts in a song it plays, which styles of music it’s used in, what notes are possible, etc. - and we’re making assumptions - even with visual clues - as to what, all added together, would make the single wave that we are receiving. We hear a single instrument being played, and we remember what it sounds like on its own. We see a band playing a song, and we learn how their gestures map to the notes we’re hearing. Our brain remembers all of these things and, later, when we don’t have other cues to go by, it assumes a composite that it reckons would add up correctly, and it does so transparently to us.
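If you wanted to caricature that “assume a composite that adds up correctly” step in code, it might look something like the sketch below - with the caveat that the “library” here is just two remembered waveforms and the guess is a plain least-squares fit, which is vastly simpler than whatever the brain actually does:

```python
# A minimal sketch under big assumptions: the listener's "library" is two
# remembered single-instrument sounds, and the "guess" is whichever mix of
# them best explains the single wave that was received.
import numpy as np

sample_rate = 44_100
t = np.linspace(0, 1.0, sample_rate, endpoint=False)

# Remembered single-instrument sounds (the listener's library)
library = np.stack([
    np.sin(2 * np.pi * 440 * t),   # what instrument A sounds like on its own
    np.sin(2 * np.pi * 220 * t),   # what instrument B sounds like on its own
], axis=1)

# The single wave actually received (here, secretly a 0.6/0.4 mix of the two)
received = 0.6 * library[:, 0] + 0.4 * library[:, 1]

# "How much of each remembered sound, added together, best matches what I heard?"
weights, *_ = np.linalg.lstsq(library, received, rcond=None)
print(weights)   # ~[0.6, 0.4] -- a plausible guess, and only as good as the library
```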
Fundamentally, the brain is a really good prediction engine. You show it a bunch of sample data and the inputs that produced it, and it can perform that math, fuzzily, in either direction, ever after. It doesn’t really matter what kind of signal it is. So far as the brain is concerned, audio is just electrical signals, video is just electrical signals, taste is just electrical signals, and memory is just electrical signals. Different receptors might have different fidelity, so you can pull more subtlety out of one data source than another, but once it hits the brain, everything has been converted into a fungible format that it can do math on.
Different neural cells learn in which direction(s), and at what amplitude, to send out signals based on what came in from which direction. We could call this a mathematical transformation: variables come in, the transformation is applied, and the output is distributed to neighbors. They have (as I understand it) a sort of memory whereby they maintain a particular set of internal constants to apply in the transformation. If, for example, the brain is full of dopamine, it assumes that the constants are good and holds to them; if the brain is full of cortisol, it assumes that the constants are bad and is more willing to try out new, random constants and see how that goes.
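As a loose analogy (not a claim about real neuroscience - the class name, numbers, and “reward” threshold below are all mine), the “keep the constants or try new ones” idea could be sketched like this:

```python
# A rough analogy only: a unit applies a fixed transformation to its inputs,
# keeps its internal constants when the outcome looks good ("dopamine"),
# and randomly perturbs them when it looks bad ("cortisol").
import numpy as np

rng = np.random.default_rng(0)

class FuzzyUnit:
    def __init__(self, n_inputs):
        self.weights = rng.normal(size=n_inputs)   # the internal constants

    def transform(self, inputs):
        # weighted sum passed on to neighbors; the "mathematical transformation"
        return np.tanh(inputs @ self.weights)

    def update(self, reward):
        if reward < 0.5:
            # "cortisol": the constants look bad, try something new and see how it goes
            self.weights += rng.normal(scale=0.1, size=self.weights.shape)
        # otherwise "dopamine": the constants look good, hold on to them

unit = FuzzyUnit(n_inputs=3)
print(unit.transform(np.array([0.2, -0.5, 1.0])))
unit.update(reward=0.1)   # poor outcome -> random tweak to the constants
```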
With billions (or however many) of neurons, this allows for some really complex (but fuzzy) analysis, using dozens of kinds of receptors at very fine tolerances.