The trouble with using waveform comparison to judge audio reproduction is that it doesn't measure what matters here.
Our ears don’t really respond to the wave amplitude in time. They are for the most part actively controlled frequency analysers. They don’t see a lot in the way of phase anomalies. The ear-brain needs to be considered as a unit. There is active control of the hair bundles in response to sound that both modifies the response and informs the brain’s knowledge of the sound.
There is some ability to detect absolute time offsets for frequencies below about 1kHz. This likely fits with the distance between our ears and allows both amplitude and time difference to inform stereo location. Changes to frequency response with angle (the HRTF, head related transfer function) inform this as well.
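To put rough numbers on that, here is a back-of-envelope sketch. The 0.18 m ear spacing and the simple sin(azimuth) model are my own assumptions, not measurements; the point is just that the largest interaural delay is around half a millisecond, which is about the half-period of a 1 kHz tone.

```python
import numpy as np

# Back-of-envelope interaural time difference (ITD) sketch.
# Assumptions (mine, not from the text): ear spacing ~0.18 m, speed of sound
# ~343 m/s, and the crude ITD = (d / c) * sin(azimuth) model, no head shadowing.
EAR_SPACING_M = 0.18
SPEED_OF_SOUND_MS = 343.0

def itd_seconds(azimuth_deg: float) -> float:
    """Approximate arrival-time difference between the ears for a source at this azimuth."""
    return (EAR_SPACING_M / SPEED_OF_SOUND_MS) * np.sin(np.radians(azimuth_deg))

max_itd_ms = itd_seconds(90.0) * 1e3        # source fully to one side
print(f"max ITD: {max_itd_ms:.2f} ms")      # ~0.52 ms

# The half-period of a 1 kHz tone is 0.5 ms, so above roughly 1 kHz the
# interaural phase comparison becomes ambiguous.
print(f"half-period of 1 kHz: {0.5 / 1000 * 1e3:.2f} ms")
```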
But the system is insensitive to a huge set of phase anomalies, while a waveform comparison is ruined by even tiny ones. (Then there are the stupid arguments about absolute phase, something that is almost always inaudible outside of extreme cases.)
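As a small illustration (my own construction, not taken from any real system): rotate the phase of just one harmonic and the magnitude spectrum doesn't change at all, yet a sample-by-sample comparison reports a big error.

```python
import numpy as np

# Two signals with identical magnitude spectra: the second harmonic of one is
# rotated 90 degrees in phase. The ear treats them as essentially the same;
# a sample-by-sample waveform comparison reports a large error.
fs = 48_000
t = np.arange(fs) / fs

original = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
shifted = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t + np.pi / 2)

waveform_err = np.sqrt(np.mean((original - shifted) ** 2))
spec_a = np.abs(np.fft.rfft(original)) / len(original)
spec_b = np.abs(np.fft.rfft(shifted)) / len(shifted)
spectrum_err = np.max(np.abs(spec_a - spec_b))

print(f"RMS waveform difference:      {waveform_err:.3f}")   # ~0.5, a huge "error"
print(f"max magnitude-spectrum delta: {spectrum_err:.2e}")   # essentially zero
```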
So you need to be very careful that your comparison isn't reporting terrible results for things that simply don't matter while staying relatively blind to the things that matter greatly.
One area where digital streams were found wanting is their propensity to create distortion products that were not harmonically related to the source signal.
This was an interesting surprise.
In the past, distortion was neatly measured by feeding a system a signal at a fixed frequency and looking for all the harmonics generated by the chain. Add up the energy in all the harmonics and you have total harmonic distortion, THD. It didn't take long to realise that second harmonics were almost inaudible, and where they were audible they added a nice warmth to the sound. Third harmonics in small doses could add a pleasant richness. Given these are the octave and the fifth above the octave, this is perhaps not a surprise. Adding just the right amount of harmonics is a staple of studio production, and doing so in an amplitude-dependent manner even more so.
Higher harmonics are less desirable, and metrics that weight each harmonic by its order have been used to better relate measurements to audibility.
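In code the measurement looks roughly like the sketch below. The tanh soft-clipper is just a stand-in distortion mechanism, and the n²/4 order weighting is one of several proposed penalties for higher harmonics, not the only one.

```python
import numpy as np

# Minimal THD sketch. The test tone lasts an exact number of cycles, so each
# harmonic lands on its own FFT bin (bin spacing is 1 Hz here).
fs = 48_000
f0 = 1_000
t = np.arange(fs) / fs

# A deliberately distorted 1 kHz tone: soft clipping generates harmonics.
signal = np.tanh(2.0 * np.sin(2 * np.pi * f0 * t))

spectrum = np.abs(np.fft.rfft(signal)) / len(signal)
fundamental = spectrum[f0]
orders = np.arange(2, 10)
harmonics = np.array([spectrum[n * f0] for n in orders])

# Plain THD: harmonic energy relative to the fundamental.
thd = np.sqrt(np.sum(harmonics ** 2)) / fundamental
# One proposed order weighting (n^2 / 4) to penalise higher harmonics.
thd_weighted = np.sqrt(np.sum((harmonics * orders ** 2 / 4) ** 2)) / fundamental

print(f"THD:                {100 * thd:.2f} %")
print(f"order-weighted THD: {100 * thd_weighted:.2f} %")
```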
Then digital came along and could generate spurious products at heterodyne frequencies relative to the sample clock. These aren’t harmonics and turned out to be very objectionable artefacts audible at levels well below what we were seeing with harmonic distortion.
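A contrived sketch of the mechanism (my own construction, not a measurement of any particular converter): distort a tone in the sampled domain with no oversampling, and the harmonics that land above Nyquist fold back against the sample clock to frequencies that are not multiples of the fundamental.

```python
import numpy as np

# Clip a 7 kHz tone *after* sampling at 48 kHz, with no oversampling.
# The nonlinearity produces odd harmonics at 21, 35, 49, 63 kHz ...; those
# above Nyquist fold back against the sample clock (35 kHz -> 13 kHz,
# 49 kHz -> 1 kHz), landing at frequencies that are not multiples of 7 kHz.
fs = 48_000
f0 = 7_000
t = np.arange(fs) / fs

distorted = np.tanh(3.0 * np.sin(2 * np.pi * f0 * t))   # clipping in the digital domain

spectrum = np.abs(np.fft.rfft(distorted)) / len(distorted)
strongest = np.argsort(spectrum)[-5:]                    # five strongest bins, 1 Hz spacing

for freq in sorted(int(b) for b in strongest):
    label = "harmonic" if freq % f0 == 0 else "NOT harmonically related"
    print(f"{freq:>6d} Hz  ({label})")
```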
So we could measure systems to the best of our ability and miss things that matter.
In the end there is no go/no-go metric. Measurements must be informed by our understanding of how the ear-brain system functions.
The interesting studies were done during the development of perceptual compression systems, so MP3 and its successors. They performed many double-blind trials to determine the limits of perception in order to tune the compression schemes. A criticism of the work was that it mostly used classical music. But the results have been very interesting. They could identify a large amount of signal content that was simply inaudible. There are interesting tells that, if you know what to listen for, are giveaways: an identifiable splashy edge to cymbals and hi-hats, and the easy one, loss of the faint ambient fade-out of the sound, replaced by a sudden cut to silence.
The better compression systems, at reasonable bit rates, are essentially impossible to pick from the original in blind tests.
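The statistics behind those trials are simple enough to sketch: if a listener genuinely hears no difference, their score in an ABX-style test is a coin toss, and the question is how likely the observed score would be by chance. (The trial counts below are made up for illustration.)

```python
from math import comb

# Probability of scoring at least `correct` out of `trials` purely by guessing,
# i.e. the chance-level tail of a fair-coin binomial distribution.
def p_by_chance(correct: int, trials: int) -> float:
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print(f"12/16 correct: p = {p_by_chance(12, 16):.3f}")   # ~0.038 -- probably hearing something
print(f" 9/16 correct: p = {p_by_chance(9, 16):.3f}")    # ~0.40  -- consistent with guessing
```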
Then you can get into other weirdness like the precedence (aka Haas) effect.
For anyone that cares about the reproduced sound, they know that the single biggest factor is the room. The sound field you hear is dominated by the room, and effort spent getting that right (if there is a perfect right) pays dividends way above effort spent on the equipment. Stereophonic sound is so compromised relative to live sound that there is no point worrying past a certain point. It is intrinsically impossible to reproduce the original sound field. Getting to an acceptable compromise that is a good experience in and of itself is the best we can hope for.