I’m a computer engineer.
I read some articles describing how to build simple headphone amplifiers, such as this article. I also read articles describing various methods for the actual transducers that produce the sound (magnetic, electrostatic, and the most common method, the “dynamic driver”, where a coil moves back and forth; it’s basically a form of motor).
Anyways, there is an objectively correct metric. Please don’t respond to this post if you think objective measurement of sound is impossible. It’s very simple - you take the Fourier transform of the digital audio input signal, which gives you the magnitude at every frequency. With correct reproduction, the resulting sound contains only the frequencies present in this digital source signal (well, any side frequencies aren’t in the audible range), the ratio of signal magnitudes between frequencies is the same as in the source (the overall magnitude can vary, obviously, as someone turns a volume control up and down), and that’s that.
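To make that metric concrete, here is a rough sketch in Python with NumPy. The function names, the 20 Hz to 20 kHz band limits, and the Hann window are my own assumptions for illustration, not part of any standard. It compares only magnitudes, as described above, so it ignores phase, and it assumes neither signal is silent.

```python
import numpy as np

def normalized_spectrum(signal, fs, f_lo=20.0, f_hi=20000.0):
    """Magnitude spectrum restricted to the audible band, scaled to unit norm.

    Scaling to unit norm removes the overall volume, so only the ratios
    between frequencies matter. Assumes the signal is not all zeros.
    """
    windowed = signal * np.hanning(len(signal))      # reduce spectral leakage
    mags = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)         # keep audible bins only
    mags = mags[band]
    return mags / np.linalg.norm(mags)

def reproduction_error(source, measured, fs):
    """Distance between the two normalized spectra; 0 means identical ratios."""
    n = min(len(source), len(measured))
    diff = normalized_spectrum(measured[:n], fs) - normalized_spectrum(source[:n], fs)
    return float(np.linalg.norm(diff))
```

A quieter but otherwise perfect copy of the source should score essentially zero, while a copy with an extra tone the source never contained should score clearly above zero.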
Well, there’s all these limitations. Different transducers have different responses at different frequencies. Moving transducers need power to start but then when oscillating there’s only so much damping. The amplifiers introduce frequency specific phase shifts.
So the solution to this is also obvious. You need a good, reliable sensor that tells you what the transducer is actually doing. Some kind of element embedded in the transducer itself, such as a magnet embedded in the moving part and a Hall effect sensor array, etc.
You Fourier transform the sensor signal, and compare the measured ratios between frequencies to the setpoint, which is the ratios in the source signal. You compare the overall signal magnitude to the setpoint as well. Then, you have a learning algorithm that adjusts the input signal at every single control point across the frequency range - you essentially apply a correction to the input signal specific to each and every frequency. You fix phase lag by advancing or retarding the phase per frequency. You can fix most limitations of your analog amplifiers and transducers this way, and thus use cheaper components and get the same audio quality. Transducers have state dependent responses - if one is already oscillating at a specific frequency, it behaves differently than if it isn’t, so your algorithm has to be aware of that. Probably just a ~50 channel PID solution or something.
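Here is a minimal sketch of that correction loop in Python with NumPy, under some loud assumptions: the transducer is simulated as a simple linear first-order low-pass (simulated_transducer and its 5 kHz corner are made up for the demo), the sensor is noiseless, and the state dependent behaviour is ignored entirely. Each FFT bin gets its own integral-only update rather than a full PID loop, so treat it as an illustration of the idea and not a real design.

```python
import numpy as np

def simulated_transducer(drive, fs, f_corner=5000.0):
    """Hypothetical stand-in for amplifier + driver + sensor.

    Applies a first-order low-pass response H(f) = 1 / (1 + j f / f_corner),
    which attenuates and phase-lags the higher frequencies.
    """
    spectrum = np.fft.rfft(drive)
    freqs = np.fft.rfftfreq(len(drive), d=1.0 / fs)
    response = 1.0 / (1.0 + 1j * freqs / f_corner)
    return np.fft.irfft(spectrum * response, n=len(drive))

def learn_correction(source, plant, n_iter=100, step=0.8, eps=1e-6):
    """Learn one complex gain per FFT bin so that plant(corrected) ~ source.

    The complex gain fixes magnitude and phase together. The plant is treated
    as a black box: only its sensed output is used, never its internals.
    """
    target = np.fft.rfft(source)
    correction = np.ones_like(target)                 # start uncorrected
    # Only adapt bins that actually carry signal, to avoid dividing by ~0.
    active = np.abs(target) > eps * np.abs(target).max()
    for _ in range(n_iter):
        drive = np.fft.irfft(target * correction, n=len(source))
        sensed = np.fft.rfft(plant(drive))
        # Integral-style update: nudge each bin by its relative error.
        error = (target[active] - sensed[active]) / target[active]
        correction[active] += step * error
    return correction
```

To use it, multiply the source spectrum by the learned correction and play the result: drive = np.fft.irfft(np.fft.rfft(source) * correction, n=len(source)). This update only converges when the step is small relative to the device's gain and phase lag at each frequency, which is one reason a real device would need something more careful.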
This is totally doable. A system on a chip with sufficient processing power that costs just a few dollars could do it. Why isn’t one of these inside each and every speaker and headphone sold today, with integrated sensors? Inputs would only be digital so all this correction would be done right inside the device.
The ultimate goal is to replicate the same sounds you would have heard if you were standing near the sound source, or to hear whatever the audio designer intended when he created the input signal using extremely expensive reference equipment that basically does what I am saying.