Is it possible to make an objectively "perfect" stereo?

I’ve read you can spend almost any amount of money on audio reproduction equipment, from $10k up to basically no limit. There’s also a bunch of people who prefer the sound of tube amplifiers, $1000 analog cables, and other such gimmicks.

Well, it occurred to me: why not perform an FFT on the source audio signal (from the digital waveform produced from the source media) and set up a microphone array inside the room the stereo is operating in? You’d perform the same FFT on the actual output signal and compare the ratio of signal amplitudes at each frequency. If the source material and the output have the same signal amplitude ratios (for instance, the signal at 1 Hz is X times the signal at 2 Hz, and so on), then you have correctly reproduced the signal.
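
A minimal sketch of that comparison in Python, assuming you already have the source samples and a time-aligned microphone capture at the same sample rate (time alignment and the mic’s own response are hand-waved here):

```python
import numpy as np

def spectrum_db(signal, sample_rate):
    """Windowed FFT magnitude spectrum, in dB."""
    windowed = signal * np.hanning(len(signal))
    magnitude = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs, 20 * np.log10(magnitude + 1e-12)  # tiny floor avoids log(0)

def level_difference_db(source, captured, sample_rate):
    """Per-frequency level difference between the source and the mic capture.

    A flat curve (up to a constant offset) means the amplitude *ratios*
    between frequencies were preserved, which is the OP's criterion.
    """
    freqs, source_db = spectrum_db(source, sample_rate)
    _, captured_db = spectrum_db(captured, sample_rate)
    return freqs, captured_db - source_db
```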

You could take it a step farther and have your equipment automatically adjust gains, frequency by frequency, to correct for distortions introduced by your equipment and the room you are in.
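
And a sketch of the corrective step, reduced here to static per-band gains computed from that measured difference (real room-correction systems such as the Audyssey and Dirac products mentioned below do something far more sophisticated, with smoothing, psychoacoustic limits and phase handling):

```python
import numpy as np

def correction_gains_db(freqs, level_diff_db, band_edges_hz):
    """Average the measured error in each band and invert it.

    band_edges_hz might be the edges of the 31 standard graphic-EQ bands.
    Returns one gain in dB per band; applying these ahead of the amplifier
    flattens the measured response at the mic position, and only there.
    """
    gains = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        in_band = (freqs >= lo) & (freqs < hi)
        error = level_diff_db[in_band].mean() if in_band.any() else 0.0
        gains.append(-error)  # boost where the room dips, cut where it peaks
    return np.array(gains)
```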

It seems that you ought to be able to demonstrably prove how good your audio equipment is using this method, and do blind A:B testing to prove that no one can tell the difference between your equipment and another manufacturer’s.

A “perfect” stereo would be close enough to the source material that human hearing cannot discern any difference, as measured by blind A:B testing.

So, why isn’t this the standard? The basic technique I described might be easier with modern digital equipment, but it has been technically possible for many decades - you could have done the same thing with analog plots.

The first problem is reducing sound to frequency components via Fourier analysis. It does an okay job on simple waves, but music isn’t all nice tones. In particular, what makes different instruments sound different is oftentimes the transients, which the FT does a really poor job of handling.

A famous wave that it hiccups on is the square wave. Note the little spikes on the corners in any diagram of its truncated Fourier series (Gibbs’ phenomenon, discussed further below). And that is something periodic. Something not at all periodic is even worse.
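
For anyone who wants to see it, the overshoot is easy to reproduce numerically; here is a rough sketch of the partial Fourier sums of a square wave, which overshoot the corners by roughly 9% no matter how many terms you add:

```python
import numpy as np

# Partial Fourier sums of a unit square wave: (4/pi) * sum of sin(k*t)/k
# over odd k. More terms narrow the corner spike but never shrink it.
t = np.linspace(0, 2 * np.pi, 100_001)
for n_terms in (5, 50, 500):
    partial = np.zeros_like(t)
    for k in range(1, 2 * n_terms, 2):   # odd harmonics 1, 3, 5, ...
        partial += (4 / np.pi) * np.sin(k * t) / k
    print(f"{n_terms} terms -> peak {partial.max():.4f}")  # stays near 1.09
```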

It took many people in electronic music a while to adjust to the idea that FM synthesis does a better job than FT-style additive methods at reproducing instruments. (With increasing memory capacity, sample-based methods took over.)

The old saying “If the only tool you have is a hammer, everything looks like a nail” comes up a lot when discussing the limits of the FFT.

This sort of thing has been done in the past, though not dynamically. A friend of mine had a graphic equalizer with a built in white/pink noise generator and a microphone input. You would place the microphone at various points around the room and try to adjust the equalization until you got an essentially flat response on the graphical display. Once it was adjusted, the system (theoretically) was compensated for imperfections in both the speaker’s and the room’s frequency response and wouldn’t require any dynamic adjustment.
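
In software, the same measure-and-adjust procedure is only a few lines. A rough sketch of the idea, where `play_and_record` is a hypothetical stand-in for your sound card I/O:

```python
import numpy as np

def band_levels_db(signal, sample_rate, band_edges_hz):
    """Energy in each frequency band, in dB."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sample_rate)
    levels = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = power[(freqs >= lo) & (freqs < hi)]
        levels.append(10 * np.log10(band.sum() + 1e-12))
    return np.array(levels)

# Pink noise carries equal energy per octave, so measure in octave bands:
edges = [31, 63, 125, 250, 500, 1000, 2000, 4000, 8000, 16000]
# recorded = play_and_record(pink_noise)              # hypothetical I/O
# measured = band_levels_db(recorded, 48_000, edges)
# slider_db = -(measured - measured.mean())           # set the EQ sliders
```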

As for “audiophiles” these days, and since this is GQ, let me just say that if these folks really wanted accurate sound reproduction over a very wide dynamic range, they wouldn’t be using tube amps.

My receiver has a feature like this (Audyssey), but there’s no manual adjustment involved. Just put the mic at the various seating positions, let the receiver take measurements, and it will try to compensate for the room and speakers. It’s far from perfect, but it does a decent job of making things sound good in my experience.

But the feedback loop is a good idea. It would be like having a PA setup where you press “autoequalise” and the computer works out some digital filter.

The main troubles are:

  1. The microphone has to be positioned in some sensible way; otherwise it would be like Sheldon bouncing all over the movie theater.

  2. The equipment is meant to be perfect all along, and people use equalizers to adjust for personal taste anyway.

  3. Adjusting the equalizer isn’t that hard.

BTW, microphones aren’t perfect either, so your evaluation would only be as good as the mic.

They do make perfect digital-to-analog converters based on superconducting Josephson junctions, but according to my calculations you’d need about a terabyte of SDRAM to play a 3-minute song, and it would probably sound like crap anyway, as there would be no distortion. The converters are really only good for periodic waveforms.

As mentioned above, Audyssey is a well-known company that provides some level of compensation. Another, higher-end and more complete, technology is from Dirac. However, these still leave a lot to be desired, and there are intrinsic issues with what you want to do that are not possible to compensate for.

There are a massive number of things that go wrong between the source and your brain. First up is simply that there is no possible mechanism for a stereo recording to match the sound you would hear live. HiFi enthusiasts mostly need to get out more and listen to more live music. (I was lucky enough to hear the Royal Concertgebouw Orchestra play in the Concertgebouw a couple of days ago. I have also heard a few hundred-thousand-dollar sound systems. If anyone thinks their sound system can get even close to the sound of the real thing, they are deluded.)

Problems: the sound field is dominated by the reverberant sound. Whilst the direct sound is critical, and the Haas effect causes it to define the structure of what you hear, the diffuse reverberant field contains the most energy. It is what makes the Concertgebouw sound as good as it does. You can’t capture the sound with two channels. The best you can do is capture a binaural sound, and listen with headphones, but this does not provide the natural sampling of the sound field that occurs when you move your head. So it still does not spatialise quite right.

You can try for a proper multichannel recording, and a few are made, with say 8 surrounding channels. Or you can try to synthesise the field from a stereo recording. Dirac, Lexicon and others do this, with varying success. But critically, you will be listening in your own room, and the acoustics of that room cannot be ignored. A big part of what many HiFi enthusiasts do is mess about with speaker positioning so that they get a sense of space from the diffuse field their room produces. This can work to a point where it sounds good, but to imagine that your lounge room can reproduce the diffuse field of the Concertgebouw is silly. The difference in size makes this intrinsically impossible.

Room treatments to control the sound in the listening room are more important than the speakers and the rest of the reproduction chain. Unless this is done right, you are already so badly limited that you won’t get anything like what you desire, no matter how much money is spent. Even the most sophisticated equalisation systems cannot compensate for a bad room; the mathematics of the situation make it impossible. It is possible to compensate for frequency and phase issues with the direct sound - essentially compensating for the speaker’s frequency response and physical layout issues - but only at one point in the room.

Almost no speakers have a true omnidirectional sound spread across all frequencies, which means that they intrinsically have different frequency responses in different directions. A good speaker will at least try to get the direct response to your ear flat, but the response elsewhere can be astoundingly ragged. This is impossible to compensate for unless you compromise the direct frequency response, and intrinsically impossible to compensate for in more than one direction. So the energy going into the room, into the diffuse field, never has a flat frequency response, and it varies by direction, yielding a diffuse field that varies all over the place in the room. Move your head only a small amount and the sound can change. This affects the apparent sound significantly. Good room design and treatments help a great deal here, but no matter what, the size of the room limits what can be achieved. It will always sound like the size of room it is. And it is not possible to fully control.

Distortion in speakers comes in lots of forms. Frequency and phase issues are the easy ones. Speakers are non-linear, and have a range of intrinsic distortion mechanisms. For one, when a high-power signal drives the speaker, the voice coil heats up, and this heating raises the resistance of the coil enough that the resulting efficiency drop causes quite apparent and measurable distortion or dynamic compression. The field of the voice coil also modulates the field in the magnet’s gap, and produces nasty distortion products. Mechanical non-linearities cause issues as well. Compensating for all these distortion mechanisms might be possible, and there are bass systems that try either conventional negative feedback or feedforward compensation, with varying degrees of success, but there are currently no systems that work at higher frequencies.

There is a great deal that can be done to fix things beyond what even the high-end home theatre guys do. (HiFi enthusiasts are mostly so far behind the technology here that it is stone-age.) However, recognising that eventually you will always be constructing a compromised system, the answer lies in understanding how to design its compromises to best suit your tastes and requirements. In a sense this is what the steam-driven HiFi guys are doing. Tube amps, and some of the weird speaker designs (especially open baffle designs), are attempts to create a sound field that is aesthetically pleasing. The results can be very nice. But accurate they are not. Still, once you recognise that nothing is accurate in a useful technical sense, it isn’t all that bad an approach. There is a lot more knowledge now about how sound works, though, and one can do a great deal better by applying the science.

Floyd Toole’s book *Sound Reproduction* is a very good place to start.

It’s a lot more complicated than that. A speaker has different frequency response in different directions, and the room acoustics change the balance of the reflected sound. If you equalize the overall sound power to be flat, the direct sound will be non-flat, generally very bright.

Stereo sound will never be perfect or indistinguishable from reality. In real life you don’t have two or five speakers, you have various sounds coming from all directions. There is no way to replicate this, except perhaps with arrays of hundreds of speakers. (See “wave field synthesis.”)

Most of this post is not technically correct.

Fourier is only as good as your compression and spatial sampling. There’s no problem using sine waves to represent the sound of musical instruments - those don’t produce square waves either. And even if they did produce squarish waves, adequate sampling gets you beyond human perception pretty easily. You actually CAN reproduce sound using Fourier transforms.

The actual problems come, as noted in other posts, from the response of the microphone, the speakers, and the fact that 2 channel audio (or 5 or even 8 channel) really can’t fully capture the experience of hearing something live.

Note this is a similar problem to one we have in geophysics, where we use seismic waves to study subsurface geology. Fourier transforms aren’t the problem; reproducing the recorded acoustic wavefield adequately is the problem. The OP’s suggestion is similar to a technique called FWI - Full Waveform Inversion. It has utility, but it is not a cure-all, for much the same reasons other posters have given in this thread.

Of any stereo component, speakers suck the most by far. And by this, I mean the speakers produce way more distortion than anything else in the system.

So all you really need to do is test speakers.

And guess what? You can test every speaker ever made, and none will come close to being transparent. In a sense, *all* speakers sound like crap.

Please explain? Nothing you’ve stated seems relevant to my post. I was just giving the square wave as an example that the FT does poorly on, with an easy-to-find web page with a diagram. The main point was transients. I should have explicitly mentioned frequency variation in tones as well, but I think of that as another transient-like effect.

FT does a lousy job on drums, pianos, etc. - anything with a “spikey” sound. The sampling rate has to be ridiculous in order to somewhat accurately recreate the waveform. And people do perceive the difference. Oddly, they usually like the FT-processed sound better, due to it sounding more “normal”.

But note that my post wasn’t addressed to human perception, but to making an ideal mechanical replication. Please re-read the OP.

Making music reproduction pleasant sounding vs. making it accurate are two different issues. Hence the nonsense with dynamic range being destroyed on popular CDs in recent years because that’s what the listeners have learned to like.

Bingo.

Another good point. Interestingly, someone did once make a virtually massless driver that, if the science was correct, should have reproduced the incoming signal with a very high degree of accuracy: the Hill Plasmatronics. Amusingly, audiophiles didn’t like the resulting sound. My guess is that since music is engineered while listening through conventional speakers and headphones, it’s going to sound best when played back through conventional speakers and headphones.

A Fourier series representation of a signal is 100% accurate, if you can manage an infinite number of partials. Any less than that, and you have a quantifiable error term. For this reason, your point about FT is misleading and inaccurate.
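
To put a number on that error term: for a periodic signal with Fourier coefficients $c_n$ and fundamental $\omega_0$, truncating to $N$ partials leaves a mean-square error given by Parseval’s theorem (the standard textbook result, stated here for concreteness):

$$x_N(t) = \sum_{n=-N}^{N} c_n e^{in\omega_0 t}, \qquad \frac{1}{T}\int_0^T \left| x(t) - x_N(t) \right|^2 dt = \sum_{|n|>N} |c_n|^2 .$$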

Regardless, FT is a complete red herring here. With infinite harmonics, it’s identical to the original input waveform. So, why convert at all? Use the original input waveform. Converting to a Fourier series is useful only if you want to understand or manipulate the harmonic content, that is, if you want to do manipulations in the frequency domain rather than the time domain. It offers zero benefit for the OP’s purpose.

Mics and speakers are by far the weakest links. That’s why I always recommend that folks getting started in home recording buy an inexpensive but decent audio interface and save the budget for the best instruments, mics, mic preamps, and speakers they can afford. The differences in those areas are obvious, whereas the differences between typical decent inexpensive converters and the world’s finest are subtle and difficult to perceive even by experts.

Going by the OP, the answer is no. The reason is that you are asking humans to be objective. We need to solve that one first.

Can we make a sound system that can be bench-tested to be perfect? Yes. But we just don’t work that way; each of us processes stimuli in our own manner. Hence the plethora of audio equipment: each piece sounds different, and you hunt for the one you like.

The really expensive stuff though, that’s just for show. Same with all the snake-oil products that I have seen.

Speakers and their distortions are an interesting issue. In their favour, the distortion products are mostly low order and thus much less objectionable than some other distortion mechanisms in the reproduction chain. As a rule of thumb, multiplying the distortion level by its order is a good starting point to normalise the sonic objectionableness. An amplifier that introduces a tiny bit of 9th-order distortion may add a much nastier edge to the sound than a speaker that has a significant amount of 3rd-order. Further, second- and third-order distortions tend to sound musical, and add not so much a noticeable degradation to the sound as a different sonic signature. That said, it is only recently that really good distortion numbers have started to come out of speaker drivers, and even then they are vastly higher than you can reasonably expect from the rest of the chain.
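
As a concrete illustration of that rule of thumb (the percentages here are invented for the example):

```python
# Rule of thumb above: weight each distortion product by its order to
# normalise "sonic objectionableness". The percentages are made up.
def weighted_objectionableness(level_percent, order):
    return level_percent * order

print(weighted_objectionableness(0.1, 9))  # amp, 9th order:     ~0.9
print(weighted_objectionableness(0.3, 3))  # speaker, 3rd order: ~0.9
# By this yardstick the speaker carries three times the amp's raw
# distortion yet is no more objectionable, its products being low order.
```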

There is also a lot of misconception about what makes for a good speaker. Very low-mass drivers are actually a waste of effort, and striving to reduce mass is close to useless. The effective acoustic load on the driver exceeds the mass of these lightweight drivers by an order of magnitude. The effective mass of the driver is simply a first-order term in the harmonic oscillator that is the nature of all drivers. It does nothing more than define the frequency response.

Currently it is clear that the major issue in speaker design is the off axis response, and the manner in which this integrates into the room’s acoustics. This is a desperately difficult thing to sort out, as you are fighting the laws of physics at a very basic level. Bass response is dominated by the room. It is possible to use equalisation in the first 100 Hz or so to produce a flat bass response for one position in the room. Use of a large number of bass drivers with individual frequency and delay compensation can make more of the room usable. Once you get past the bass frequencies it is intrinsically impossible to compensate. With a wavelength of the order of a room dimension and smaller you simply cannot do anything to remove the effects of the room. You can try to ameliorate the worst aspects, but it isn’t clear what the best answer here would be anyway. Diffraction around a conventional box defines the off axis response in the most critical frequencies for most speakers, and the usual way in which the on-axis frequency response is compensated to cope with this necessarily compromises the off axis response, and hence the frequency response of the diffuse field.

However, the ear/brain system is not some miraculous perfect receiver. It is easily fooled, and in particular, the Haas effect means that there are aspects where a far from technically perfect reproduction will not be noticed. It has been noted by some that a very well treated room, with speakers that have a carefully controlled off-axis response and a really good on-axis response, can sound almost the same as a pair of high quality headphones. Subjectively this might be as close to a “perfect” reproduction as you can get. However, it still might fall short of an aesthetically perfect result.

Gibbs’ phenomenon? Sure, I’ve heard of it.

It’s not especially relevant here, because when most people mention “Fourier”, we’re already in the discrete-time digital domain. That’s the first “biggie”. There are analog, continuous time Fourier transforms, but the OP implicitly assumes digital.

Ok, well, that implies a fair number of steps already. Here’s a very simplified flow chart of a basic discrete-time digital processing sequence for an analog continuous time sound source. The stuff I’m leaving out is channel/source coding, error correction, and other stuff that’s important in real systems but not as much for an overview.

SOURCE -> A/D conversion -> Filtering/Processing -> D/A conversion -> Speakers

The Fourier part comes after the Analog/Digital conversion. That’s where our analog, continuous time signal is converted to a discrete-time digital signal. Our recording gear is important, too, since the response of our microphone will dictate how many bits and how finely sampled we need to be (or not to be).

Well, that’s already a problem. Here’s where we may potentially lose some information if we aren’t sampling with sufficient bits and/or the sampling rate is too low.

A/D quantization loss is fairly well studied. And it’s not really a problem, especially in the context of the OP.
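
For reference, the standard result for an ideal $N$-bit converter driven by a full-scale sine wave is

$$\mathrm{SNR} \approx 6.02\,N + 1.76\ \mathrm{dB},$$

which gives roughly 98 dB at 16 bits, already at the edge of what a quiet room and human hearing allow.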

Ok, then comes the Fourier transform. Well, we no longer have a problem. Once you’re digital, you can 100% accurately recover your signal. A square digital wave is perfectly recoverable, as it happens, without the pesky Gibbs’ phenomenon.
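
That round trip is easy to check with nothing fancier than numpy: transform, invert, and the samples come back at machine precision, square wave or not:

```python
import numpy as np

# A discrete "square wave": 64 samples high, 64 samples low, repeated.
x = np.tile(np.concatenate([np.ones(64), -np.ones(64)]), 8)

roundtrip = np.fft.ifft(np.fft.fft(x)).real
print(np.max(np.abs(x - roundtrip)))  # ~1e-15: machine precision, no Gibbs
```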

Why is Gibbs’ phenomenon a problem? Well, it’s a problem in analog space. If we use the analog version of the Fourier transform, yes, we can experience Gibbs-type ringing. It never goes away, but it’s a non-issue in practice. And since we’re generally talking digital processing (FFTs, especially), it doesn’t come up, since that’s a different type of processing entirely. Digital vs analog. It’s certainly related, but it’s not actually relevant to the OP.

As for drums and other percussive sources, no, they aren’t really square waves. They’re closer in quality to impulses. Very short spikes. In signal terms, that’s nearly the opposite of square waves. Such sources have short time duration but cover all frequencies. Those can actually be easier to recover. If they don’t seem to be perfectly recovered, blame the audio engineering. It’s not a fundamental issue with Fourier (which CAN recover it perfectly - I’ve some especially good classical piano recordings), but with what the engineers are told to do to the sound.

To the extent this relates to the OP, it means that Fourier isn’t necessarily a problem, but your A/D conversion may result in some loss after the D/A recovery. For purposes of the OP, though, an A/D conversion with conversion loss below human perception is quite possible. That still doesn’t resolve the underlying issues with adequately spatially sampling the acoustic wavefield or reproducing it in a different location.

There are a number of different things going on here.

  1. Some people do perceive a difference. But most of the time, it’s in their head. Yes, there are so-called “golden ears”. But more people claim to have them than actually exist. It’s kind of like having a 9-inch penis. More men claim to have them online than actually exist.

  2. Preferring CD sound or vinyl. Personal preference, purely. For a long time, audiophiles swore by vinyl. Said it sounded “richer”. It does sound different, but mostly, they referred to the response of the record player, which introduced its own distortions that softened the sound. They got used to preferring the smooth distortions of record players rather than the more ‘accurate’ (for a certain value of ‘accurate’) CD sound.

  3. Some people perceive a difference, and it’s real. CD-level recording isn’t perfect. There’s quantization error that’s within the range of human perception. The apocryphal story is that during development of the CD, the head of Sony asked if the entirety of Beethoven’s 9th would fit on a single CD. When told it would not, he told the engineers to make sure it could, which necessitated compromises in sound quality. That has nothing fundamentally to do with Fourier. It’s not worth re-engineering at this point, but it can be engineered out to the point any differences are purely in the playback and not due to insufficient quantization/sampling.

Again, this isn’t a fundamental issue of Fourier transforms. Simply sample more frequently and use more bits. At some point, you do reach fundamental universal limits, but that’s true for mechanical replication as well. At some point, they’re as equal as anything can be in the universe.

Well, that’s a different issue again. Nothing to do fundamentally with Fourier and everything to do with how the audio source is processed. Audio engineers choose to destroy dynamic range on CDs to achieve a desired effect.

There’s actually much more available dynamic range on a CD than on vinyl, and it’s more ‘accurate’.

The basic theme here is that you seem to be conflating audio processing with Fourier transforms. Yes, bad things can be done with Fourier transforms in processing sound. That doesn’t mean bad things necessarily have to occur with them.

A point about Fourier transforms. Nowhere in a conventional audio digital processing chain does a Fourier transform occur. It can be used to analyse the signal, and this is done commonly, but processing is typically performed with IIR and FIR filters, and other similar stages. You can use a pair of Fourier transforms to apply a convolution, and in many other domains this is the preferred technique. But in audio, for a host of reasons, it doesn’t usually happen.
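
The equivalence mentioned here is easy to demonstrate with plain numpy: zero-pad, multiply the spectra, and you get the FIR filter’s output exactly (to rounding error):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)
taps = rng.standard_normal(64)        # stand-in FIR filter coefficients

direct = np.convolve(signal, taps)    # time-domain FIR filtering

n = len(signal) + len(taps) - 1       # zero-pad to the full output length
via_fft = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(taps, n), n)

print(np.max(np.abs(direct - via_fft)))  # ~1e-13: just rounding error
```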

As alluded to above, Fourier theory has a great deal to do with digital. The mathematics of sampling and reconstruction are governed by Fourier, and the fundamental results about resolution and bandwidth similarly so. Shannon derived these fundamental results in his seminal work, including deriving the Nyquist limit on the way.
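
Stated compactly: a signal with no energy above bandwidth $B$ is exactly recoverable from its samples provided the sample rate satisfies

$$f_s > 2B,$$

which is why 44.1 kHz comfortably covers the nominal 20 kHz limit of human hearing.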

The design of the CD recording format is right on the edge of human perception, and the application of noise-shaped dither techniques allows it to essentially cover the entire capability. What it doesn’t do is provide even the slightest bit of wiggle room for mixing or processing, and for that reason recordings are done at both a higher sample rate and a higher bit depth. But once a recording is mixed and mastered, it can be safely downsampled to CD resolution. The precise specifications are rooted in some very arcane and odd history: 44.1 kHz was chosen so that both NTSC and PAL video recorders could be used to store the sampled data with a simple adaptor. Fitting Beethoven’s 9th on a single disc seems to be a bit of an industry legend, but one without a huge amount of provenance.