Have we reached the peak of media compression technology, or is it still advancing?

I’m not sure if this is a factual question or not. But my impression is that 25 years ago, MP3 was the best compressed format for audio: about 90% smaller than an uncompressed lossless file. I don’t know much about audio files at this point, but from what I see online, .ogg files are more popular now and are smaller, though I’m not sure how much smaller.

With video, for example, I have ripped some DVDs as MKV and they are about 5GB per DVD. I think Blu-rays are about 50GB per disc. However, downloadable 1080p videos are closer to 2GB and are MP4 format.

Looking at some older video files, MP4s that were 480p were about 1-2GB, implying that they were only compressed maybe 60-80% from what I assume was a 5GB uncompressed source. But the 1080p videos imply about 96% compression if they’re going from 50GB down to 2GB.

So was there an advance in MP4 compression?

I looked at a website online; it said 0.675 GB/hr to stream 480p, but 2.7 GB/hr to stream 1080p. But the older MP4 files I have are larger than 0.675GB/hr (closer to 1.2GB/hr), and the newer 1080p files are smaller than 2.7GB/hr (more like 1.3GB/hr).
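If I convert those numbers into the megabits-per-second figures that codecs are usually quoted in, the comparison is easier to see. A rough sketch (assuming decimal gigabytes; the labels just reuse the figures above):

```python
# Convert a file-size rate in GB/hr to an average bitrate in Mbps.
# Assumes decimal units: 1 GB = 8,000 megabits.
def gb_per_hour_to_mbps(gb_per_hour: float) -> float:
    return gb_per_hour * 8000 / 3600  # megabits per hour -> per second

for label, rate in [("480p stream estimate", 0.675),
                    ("my older 480p MP4s", 1.2),
                    ("1080p stream estimate", 2.7),
                    ("my newer 1080p MP4s", 1.3)]:
    print(f"{label}: {rate} GB/hr = {gb_per_hour_to_mbps(rate):.1f} Mbps")
```

So my newer 1080p files average roughly 2.9 Mbps, less than half the website’s 6 Mbps streaming estimate, despite having several times the pixels of the old files.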

MP4 is what they call a “container” format, which is really just a wrapper around the actual video encoding. The actual “codec” (the coder/decoder software) would be something like H.264 or the more recent H.265, which, yes, can produce higher-quality files at the same size (or the same quality at a smaller size). The tradeoff is that it usually requires a faster computer or a dedicated hardware chip.
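If you want to check what’s actually inside one of your files, here’s a minimal sketch (assuming you have ffprobe, which ships with FFmpeg, installed; the filename is just a placeholder):

```python
# Ask ffprobe which video codec a container file actually holds.
# The container (.mp4, .mkv, .webm) and the codec (h264, hevc, av1)
# are independent things, which is the whole point above.
import subprocess

def video_codec(path: str) -> str:
    """Return the codec name of the first video stream in a media file."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-select_streams", "v:0",              # first video stream only
         "-show_entries", "stream=codec_name",  # just the codec field
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(video_codec("movie.mp4"))  # e.g. "h264" or "hevc" (H.265)
```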

Outside of the MP4 world, the WebM container also allows even more modern codecs like AV1, which is even better, but it’s not as widely supported. Apple in particular didn’t support it until their very recent devices.

There is a similar codec war going on in the image world with JPEG XL and HEIC.

In the audio world, high fidelity music is pretty well served by AAC and FLAC and Vorbis, but there have been some cool recent advancements in low bitrate voice encodings (for clear telephone calls over low bandwidth connections).

But generally nobody really cares unless you’re operating at the scale of YouTube or Netflix, where tiny % improvements in efficiency can save you millions of dollars. Everyone else just keeps using MP4, MP3, and regular JPEGs.

I guess there’s also the open and somewhat interesting question of whether LLMs (the AI chatbot stuff) could serve as a form of lossy encoding. They don’t really retain the original data, but become a sometimes accurate statistical model of it. Sometimes that’s good enough…?

Sure, if you want a summary of the movie.

I’ll just add to this that it’s important to understand the magnitude of these differences. H.264 and H.265 are radically different. The latter can produce much smaller files than the former (or, alternatively, similar-size files with much higher quality). H.265 also requires a lot more processor power to encode and decode.
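If you’re curious, you can measure the gap on your own files. A rough sketch (assuming an ffmpeg build with libx264 and libx265; the CRF quality settings below are common defaults, but the two scales aren’t strictly comparable, so treat this as a ballpark experiment):

```python
# Re-encode one source with H.264 and H.265 at roughly comparable
# quality settings and compare the resulting file sizes.
import os
import subprocess

SRC = "source.mkv"  # placeholder: any video file you have

for codec, crf, out in [("libx264", 23, "out_h264.mp4"),
                        ("libx265", 28, "out_h265.mp4")]:
    subprocess.run(["ffmpeg", "-y", "-i", SRC,
                    "-c:v", codec, "-crf", str(crf),
                    "-an",           # drop audio so only video is compared
                    out], check=True)
    print(out, round(os.path.getsize(out) / 1e6), "MB")
```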

I fully expect this to be a thing. That is, we will have AI encoders that only record a loose “sketch” of the recording (video or otherwise) and use AI to fill in the gaps. The odd part will be that as you increase the compression rate, it won’t turn blocky or anything like that–it’ll still look good, but the content itself will be a worse approximation of the original.

Voice compression will eventually just perform voice-to-text, record a few hints about intonation and stress, and then perform text-to-voice on the other end using some model of the speaker’s voice. It’ll sound just as good independent of the compression rate but you’ll lose the subtleties.

It seems to me that this would imply that you will never hear the same song / see the same video / hear the same speech twice, as the AI will perform different compression/decompression routines in the “black box” every time.
I wonder what crazy copyright lawyers will make of that. I fear I will not like it.

It’s already possible for LLMs to regurgitate intact-looking copies of images from their training data (along with a wide variety of inexact or absurd-looking copies). I don’t imagine it’s impossible for the same thing to happen with movies if the models keep on getting bigger.

In a way, this seems like it could be an interesting thing that people might even want, because it potentially means the process of replay could change or omit certain details; for example, wouldn’t it be better if the river of chocolate in the 1971 Willy Wonka movie actually looked like chocolate rather than just muddy water? I sort of want that, but I’m not sure I want it enough to accept the torrent of garbage that entertainment would become if LLMs become the default delivery method.

As computers get more powerful, we can compress or decompress data further. So, no, we’ve not hit the peak. Just the peak of what’s feasible right now.

And the thing you want to look at is called a codec, not the file extension. “Codec” is short for coder/decoder, which is where the compression actually happens.

It’s entirely possible to make totally deterministic decoders. AI models today can be deterministic as well; any randomness is usually added intentionally to make their output seem more natural.

Still, it’s an interesting point. If you wanted, you could make it churn out slight variations of the same thing. Watch it once, an orange cat walks by, a second time the cat is black.

Of course, one can imagine much more drastic changes, even totally changing the flow of the movie… but you’d have to disconnect it from the “storyboards” that are keeping it in line.
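To make that concrete, a toy sketch (pure illustration, nothing to do with any real codec): seed the sampler and the “decoder” is repeatable; change the seed and you get the orange-cat/black-cat variation.

```python
# A "decoder" that fills in unspecified details is only nondeterministic
# if you let it be: fix the seed and it reproduces the same output every
# time; change the seed and you get a different, but repeatable, variant.
import random

def decode(sketch: list[str], seed: int) -> list[str]:
    rng = random.Random(seed)  # deterministic for a given seed
    palette = ["orange", "black", "grey"]
    return [f"{rng.choice(palette)} {thing}" for thing in sketch]

scene = ["cat", "car"]
print(decode(scene, seed=1))  # identical on every run
print(decode(scene, seed=2))  # the cat changes color, repeatably
```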

I think we can count on that even without AI compression. Facebook is already filling up with AI slop.

The next several years are gonna be interesting…

There are limits on compression that no amount of computer power will overcome.

Yes, this. There will, fundamentally, be a trade-off between the size of the compressed data and the size and processor requirements of the codec.

There is also still a lot of room for theoretical advances in coding theory and data compression. I don’t mean “AI compression”, though (does that even mean anything?)

Agree with side commentary to follow.

There totally are mathematical limits to the amount of compression that can be performed on arbitrary data. And we pretty well know what they are and can routinely achieve them.
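For the curious, the lossless limit is the Shannon entropy of the source. Here’s a minimal sketch that estimates it from byte frequencies (a simplification: real compressors also model context between bytes, so they can beat this particular estimate, but nothing beats the true entropy rate on average):

```python
# Estimate zeroth-order Shannon entropy of a file, in bits per byte.
# This treats bytes as independent; it's a rough illustration, not the
# exact limit for sources with inter-byte structure.
import math
from collections import Counter

def entropy_bits_per_byte(path: str) -> float:
    data = open(path, "rb").read()
    n = len(data)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(data).values())

h = entropy_bits_per_byte("some_file.bin")  # placeholder filename
print(f"{h:.2f} bits/byte -> roughly {h / 8:.0%} of original size")
```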

When the data in question is a picture, an audio recording, or a video recording, we get into the very gray area where not all bits in the data are created equal. Speaking a bit metaphorically, the whole point of lossy compression is to identify and discard the bits that humans won’t notice being gone. Which is not so much math as a study of the foibles and limitations of human perception.

It also points out that compression that’s good enough for somebody watching a vid on a phone in a noisy subway train would look & sound awful played in a modern movie theater to an assembled audience.

IME/IMO we have not reached the limits of further exploiting the features of human perception to further shrink audio & video.

We (well, the major streaming outlets) are also training people that a lo-fi experience is plenty good enough. Sound that would make an audio recording engineer cringe is fine for a fogey like me with raging tinnitus and lots of high-frequency roll-off. It’s also fine for a teen raised entirely on highly compressed TikToks using 3x pirated music.

So maybe the race for ever-higher fidelity that started with the earliest wax cylinder recordings has peaked, and now we’re headed back the other way, as we (or our entertainment-supplier overlords) come to prefer greater efficiency over that last increment of achievable fidelity that’s wasted on 99% of customers.

Vernor Vinge had a scene in one of his novels where one spaceship is communicating with another (holographically, of course) and the transmission is lost. One character remarks that the transmission looked cartoonish and stilted toward the end, and they check and find they were only receiving a few dozen bits per second: the essence (text?) of the message, with the local AI generating the character’s voice and appearance as if it were a live transmission. Fairly prescient for 30 years ago.

I have a book from a Bell Labs researcher (somewhere in my boxes) that discusses compression and “information theory”. The discussion here is correct - how much detail do you actually want? A song can be reproduced from notes and lyrics alone, or you may want the peculiar voice modulation that only a Bob Dylan or Buddy Holly can provide, or the multi-part harmonies of the early Beatles. The level of detail determines the level of compression.

One other trick for direct audio is filtering out the higher frequencies - another means of sacrificing data bandwidth for compressibility that may not be so noticeable. The key question is: how much precise detail are you willing to lose?

You can, for example, get a much smaller JPG by blurring the background. That works for a portrait where the background is unimportant, but it won’t fly with a landscape picture… yet in a landscape, how important is the detail of every leaf on the trees? Landscape painters long ago figured out that dabbing the canvas with a green brush was passable as leaves without precise detail. OTOH, with a Van Gogh reproduction, you want the detail of every brushstroke to be visible, and an AI simulating it for display may not be an accurate enough reproduction.

You get what you pay for, bit by byte.

As I’ve commented on in AI threads, recent computer graphics cards use AI to upsample and add details to the scenes displayed on them. It’s quicker to give a low-resolution/low-detail image to the AI to improve than to compute a high-res/high-detail image directly.

There can be a fuzzy border between artistic intent and compression techniques. Audio engineers used to make artistic decisions to get a song sounding good when played from vinyl. It’ll be similar when 3D visual artists tune their models to look good after being filtered through digital compression (AI or other).

In some sense, an AI-generated compression codec is really just a more nuanced form of a perceptual codec: we train it to ignore the stuff our brains ignore. Audio perceptual codecs are already pretty advanced, partly because we have some pretty solid ideas about how the ear-brain system works, and how concepts like masking allow us to drop a lot of information.
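A cartoon version of the masking idea, as a sketch (the fixed -40 dB floor stands in for a real psychoacoustic model, which computes thresholds per frequency band):

```python
# Cartoon of perceptual audio coding: move to the frequency domain and
# discard components far below the loudest one, on the theory that
# masking makes them inaudible. Real codecs use a proper psychoacoustic
# model; the fixed -40 dB floor here is purely illustrative.
import numpy as np

def toy_perceptual_code(samples, floor_db=-40.0):
    spectrum = np.fft.rfft(samples)
    mags = np.abs(spectrum)
    threshold = mags.max() * 10 ** (floor_db / 20)
    keep = mags >= threshold
    spectrum[~keep] = 0            # "drop" the masked components
    return np.fft.irfft(spectrum, n=len(samples)), keep.mean()

t = np.arange(48000) / 48000
signal = np.sin(2 * np.pi * 440 * t) + 0.001 * np.random.randn(len(t))
_, kept = toy_perceptual_code(signal)
print(f"kept {kept:.2%} of frequency components")
```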
An actual AI-style voice synth may run into the problem that uploading the voice model to the receiving end isn’t worth the cost versus an efficient perceptual codec - you need the model before any conversation starts. Maybe we can work out how to parametrise a system with a limited number of parameters to create an individual’s voice, but only maybe. The silly amount of bandwidth available in the modern world makes worrying about voice compression mostly an academic pursuit.

Video probably has a long way to go. Applying some of the gaming graphics systems’ AI detail-enhancing tricks (DLSS-style super sampling) might become a useful technology for video rendering. There is still a significant compute requirement for good-quality video compression. The world essentially works with both high-CPU-cost, high-quality codecs and real-time, low-cost, lower-performance ones. High-resolution video is usually compressed off-line; then you can throw CPU at compression and favour both lower bandwidth and low decode CPU cost. YouTube only runs its high-compute, high-quality compression on videos that reach some threshold of views, which can be noticeable if you are one of the first to view a newly uploaded video, or stumble upon one with few views.

Latency issues with real-time video also tend to push towards simpler codecs. You can make a long pipeline of compression steps, but you reach a point where the latency becomes annoying. The same problem occurs with P- and B-frames (intermediate frames that encode only changes relative to other frames). B-frames need reference frames from both ends - before and after the intermediate frame (hence the B, for bidirectional) - so they add significant latency at both ends of the codec. Not a problem when streaming a movie; not great for live video. It is pretty remarkable just how many B-frames can be inserted when the scene is reasonably static, but it all turns to mush if the scene does something dynamic, so codecs need to be adaptive. Movie encoding uses algorithms to detect scene changes and force a key frame (aka an intra or I-frame). So, more compute.
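Here’s a toy sketch of why B-frames cost latency (assuming a fixed I-B-B-P pattern; real encoders choose the structure adaptively):

```python
# B-frames can't be decoded until a *later* reference frame arrives, so
# display order and transmit/decode order differ, and the decoder must
# buffer. Toy fixed pattern; real encoders adapt it to the content.
display_order = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]

def decode_order(frames):
    """Reorder so each B-frame follows both of its reference frames."""
    out, held_b = [], []
    for f in frames:
        if f.startswith("B"):
            held_b.append(f)       # must wait for the next reference
        else:
            out.append(f)          # I- or P-frame: a reference
            out += held_b          # held B-frames are now decodable
            held_b = []
    return out + held_b

print(decode_order(display_order))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5'] - every held frame is
# buffering delay you'd feel on a live stream.
```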

There are certainly lots of possible ways to improve compression of images in general. The Discrete Cosine Transform on rectangular blocks (e.g. JPEG) can be bettered by things like wavelet-based compression (such as in JPEG 2000), which can produce less obvious artefacts (no blocking artefacts, which our eyes pick up all too easily) and can be made to dynamically favour image regions that may be more important in a scene. It is just a better codec all round. Digital Cinema uses Motion JPEG 2000 (i.e. a stream of JPEG 2000 images - no P- or B-frames), so go to the movies and you are pretty much guaranteed to be watching that.
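For reference, the block-DCT step being compared looks roughly like this (a sketch using scipy; the uniform quantizer step is made up, standing in for JPEG’s perceptually weighted tables):

```python
# JPEG-style transform coding of one 8x8 block: DCT, coarse quantization,
# count coefficients that quantize to zero (they compress to almost
# nothing), then reconstruct and check the pixel error.
import numpy as np
from scipy.fft import dctn, idctn

block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 16  # smooth ramp

coeffs = dctn(block, norm="ortho")   # energy piles up in low frequencies
step = 40.0                          # made-up uniform quantizer step
quantized = np.round(coeffs / step)
restored = idctn(quantized * step, norm="ortho")

print("zeroed coefficients:", int((quantized == 0).sum()), "of 64")
print("max pixel error:", float(np.abs(block - restored).max().round(1)))
```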

I would say that the big gains might be had in the motion flow encoding.

Exactly. It’s the same process that leads us to encode video in red-green-blue or another low-dimensional color space instead of trying to capture the entire visible spectrum, and to apply similar simplifications to encoding motion. Our eyes don’t actually see that much, with our brains filling in a lot of stuff without conscious awareness. Video codecs that can model the whole vision process can drop a lot of information without greatly impacting the perceived quality.
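Chroma subsampling is the classic concrete example of that: keep brightness at full resolution, store color at quarter resolution, and almost nobody notices. A rough sketch (using the BT.601 weights; the random frame is just a stand-in):

```python
# 4:2:0 chroma subsampling: the eye resolves brightness (luma) far more
# finely than color (chroma), so keep Y at full resolution and average
# Cb/Cr over 2x2 blocks. This alone halves the raw data before any
# codec even runs.
import numpy as np

def to_ycbcr_420(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b      # BT.601 luma weights
    cb, cr = 0.564 * (b - y), 0.713 * (r - y)  # chroma differences
    h, w = y.shape
    cb = cb.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    cr = cr.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, cb, cr

frame = np.random.rand(1080, 1920, 3)          # stand-in RGB frame
y, cb, cr = to_ycbcr_420(frame)
raw, sub = frame.size, y.size + cb.size + cr.size
print(f"{raw:,} samples -> {sub:,} samples ({sub / raw:.0%})")
```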

Well said.

I recall someone (not computer savvy) who told me in the early days of the internet that he didn’t like JPG or MP3 because XModem transferred at much higher bitrates when the file was uncompressed.

There is a law of diminishing returns, however. The tug of war will always be between fidelity and size.

Also, a lot is tuned to human perception for audio-visual material. Codecs target the typical response of the human ear - so what if fairly young people notice the missing high frequencies? In telephone conversations, where the essence is simply understanding the speech, you can eliminate even the moderately high frequencies and the voice is still intelligible. Humans have 3 types of color sensors (4 visual sensors, if you count the B&W rods), and precision lies in encoding just what’s needed to feed the rods and cones in the eyeball. Plus, video often only has to match the eye’s ability to resolve a moving image, while static images meant for close-up inspection need more resolution.
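To illustrate the telephone case, a sketch with scipy (the classic voice band is roughly 300-3400 Hz; the filter order and the synthetic “voice” are arbitrary choices for the illustration):

```python
# Telephone-style band-limiting: keep roughly 300-3400 Hz and speech
# stays intelligible, which is why telephony gets away with an 8 kHz
# sample rate (a 6x reduction from this 48 kHz source).
import numpy as np
from scipy.signal import butter, sosfilt

fs = 48_000
t = np.arange(fs) / fs
# Stand-in "voice": a 200 Hz fundamental plus a few harmonics.
voice = sum(np.sin(2 * np.pi * f * t) / k
            for k, f in enumerate([200, 1000, 3000, 6000], start=1))

sos = butter(8, [300, 3400], btype="bandpass", fs=fs, output="sos")
telephone = sosfilt(sos, voice)  # 200 Hz and 6 kHz components removed

print("RMS before:", float(np.sqrt(np.mean(voice**2)).round(3)))
print("RMS after :", float(np.sqrt(np.mean(telephone**2)).round(3)))
```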

(I used to joke: do you really need to watch Meet the Fockers in 4K or 8K? And so much historical material was recorded at 480p but is still enjoyable.)