Except you can’t do virtual reality with mere video streams at all. If it’s to be interactive, then you need to be able to see everything from multiple angles, which means you need the entire 3D scene, not just one or a few 2D vantages on it. Right now, that’s done by making the 3D scenes relatively simple, but there’s no reason it couldn’t eventually be photorealistic.
This is exactly what I was thinking in terms of bandwidth requirements. My limited experience with VR was when my son got a VR system a couple of years ago and was showing it off to me. I seem to recall that it had very significant video card requirements.
Anyway the first … I don’t know what to call it – not a video, I guess the first “VR experience” – was interesting because at first all I saw was a movie, apparently documenting one of the Apollo lunar missions. So I was looking at this movie and thinking, this isn’t particularly impressive, and it’s not even taking up very much of my field of view. Then I happened to look to the side, and there right beside me was the movie projector that was running the movie, and as I looked around I saw that I was in a strange room with all kinds of retro sci-fi objects, with a good ol’ lava lamp off to one side. This was the moment that I viscerally understood what VR was about. (What followed was a pretty cool recreation of a trip to the moon on Apollo, including the landing of the LM.)
I currently have on the 256 GB MicroSD in my phone 2,344 TV episodes (166 GB) and 112 movies (33.6 GB). I checked the file list for one of my 256 GB USB drives (a text file I have, not plugging in the actual drive) and it contains at least one full season of 129 TV series. At least 35 of them I know off the top of my head are the complete series:
Addams Family, The
Addams Family, The 1973 animated
Alf
Alf Tales
Alf The Animated Series
Amphibia
Atlanta
Atypical
Better Things
Detour, The
Dollhouse
Don’t Trust the B
Farscape
Gravity Falls
Happy Valley
Hilda
I Am Frankie
iZombie
Kim’s Convenience
Lilyhammer
Major Crimes
Marvelous Mrs Maisel
Munsters Today, The
Munsters, The
New Addams Family, The
No Good Nick
Owl House, The
Pushing Daisies
Righteous Gemstones, The
Rutherford Falls
Six Feet Under
Stan versus Evil
Teen Wolf
Terminator the Sarah Connor Chronicles
Todd and the Book of Pure Evil
That drive should have somewhere in the 3,000 to 4,000 episode range. Compression doesn’t matter to you because you have so very little content.
(I re-encode all of my files at 854x480 for TV series, 1280x720 for movies in h.265.)
Compression matters to me a whole lot because I’ve digitized just about every movie ever made that I care about (and I’m talking thousands) and probably at least half are in 1080p. Also full seasons of a few dozen TV series. Amazon loves me for my constant purchase of HDDs.
The SD card on my tablet only contains the stuff I plan on watching in the near future.
1G is arbitrary. The assumption is that 1G is representative of the rest of Wikipedia. Or at least the techniques used on 1G would scale up (and down to a degree).
Also, the challenge is for lossless compression, which an LLM won’t do well. Other techniques would have to be used on tables and other data.
There’s too much additional information captured in an LLM to compete with a purpose-built compressor. A 400B parameter model could churn out plenty of non-wikipedia documents; that represents wasted capacity.
What’s interesting is that the purpose of the challenge is to further AI research. They are treating AI as a compression problem.
The choice of Wikipedia is also arbitrary. It’s just an easily available dataset. The point is to advance text compression.
You make the prediction with the LLM, and then correct for it. The output of an LLM is actually multiple possible tokens, with confidence values (normally the LLM picks one of those with some randomness injected). You could divide your entropy space among these possibilities, only having to store the word if it isn’t in the top N choices.
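To make that top-N idea concrete, here’s a minimal sketch. The `model.rank_tokens()` call is a hypothetical stand-in for whatever the real LLM exposes; the only requirement is that it returns the vocabulary sorted by predicted probability, deterministically, so compressor and decompressor see the exact same ranking.

```python
# Minimal sketch of rank coding against an LLM's predictions (hypothetical API).
TOP_N = 255   # ranks 0..254 fit in one byte
ESCAPE = 255  # marker meaning "the token wasn't in the top N; a literal follows"

def compress(tokens, model):
    out, context = [], []
    for tok in tokens:
        ranked = model.rank_tokens(context)          # hypothetical API
        if tok in ranked[:TOP_N]:
            out.append(ranked.index(tok))            # cheap: the model predicted well
        else:
            out.append(ESCAPE)
            out.append(tok)                          # expensive: store the token itself
        context.append(tok)
    return out

def decompress(codes, model):
    tokens, it = [], iter(codes)
    for code in it:
        ranked = model.rank_tokens(tokens)           # same deterministic prediction
        tokens.append(next(it) if code == ESCAPE else ranked[code])
    return tokens
```

In practice you’d feed the ranks to an entropy coder so that rank 0 costs a fraction of a bit instead of a whole byte, but that’s the shape of the idea.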
Or you could take another step back, and compress in the vector embedding space. The LLM produces a particular vector, and your job is to use the minimum number of bits required to pick the actual vector corresponding to the thing you’re compressing. You hardly need any bits if the LLM did a good job and picked the vector closest to the answer. If the answer is farther away, you’ll need more bits.
The size of the LLM is irrelevant for a large enough dataset. The point is that it actually captures meaning, and you need meaning to make good predictions. Prediction happens at many levels: there are only so many next characters that are valid, only so many words that are grammatically correct, only so many words that are meaningful, only so many that fit with the style of the rest of the text, etc. LLMs do a good job of capturing this info.
That’s because prediction and compression are effectively the same thing. You need a good AI to make good predictions.
As above, it isn’t really clear that in this process an LLM is doing anything different than any other text compression algorithm. If it must be lossless there are limits on how it can operate. LLMs just work on a vastly larger scale than ordinary compressors, so can find repeating patterns at scale. If you took an existing compressor geared to work on text, and tweaked it to use as much memory as it liked and threw huge compute at it, it would probably do a better job. Partly because it would be working within a much more constrained set of requirements. There isn’t much fancy in LLMs other than scale.
This Captain Disillusion video does a great quick job of explaining modern video compression and how clever it is versus “just do JPEG on each frame” which as he says doesn’t actually work very well. Modern technology (AI or not) may allow much more interpretation/interpolation of p-frames but still produce accurate results, or maybe there’s some other completely undiscovered way to compress video.
Back in the day, Flash videos and games that were the mainstay of sites like Newgrounds were insanely tiny by today’s standards (a few hundred KB), but when everything is vector-based it can be scaled up to any size and framerate. That’s essentially how South Park got an HD release: they just re-rendered the episodes from the original source files (the very first episode was done with actual construction paper and stop motion, but it looks like even that was scanned at a higher resolution). Might some future video compression algorithm be able to turn a raster video input into the equivalent of a vector-based Flash animation, only instead of literal vectors and coordinate movements, it’s using people, plants, and other higher-order objects?
In simpler terms, it would be like taking an audio recording and turning it into high-fidelity MIDI. In its original implementation it’s quite limited, but an entire song is only a few dozen KB in size, versus a couple MB for an MP3, or dozens of MB for uncompressed CD audio. It’s sort of the vector equivalent of those Flash videos, since it’s just using the sheet music with different instrument samples applied to it. With a bigger library of samples and better interpolation, I could see this being a pathway to better compression.
I’m no expert in any of this, so maybe I’m completely off-base. I’m just suggesting that as computers become more powerful, new creative methods of compression become available that were previously inconceivable.
It will work, but I don’t see how it beats a purpose-built compressor, even though the tasks are very similar. An LLM trained on a global corpus will compress less well than an algorithm that trains itself on the target corpus.
Training a custom LLM just for the target corpus looks a lot like a text compressor, but with the overhead of generalized knowledge (being able to generate other text).
It makes sense when there are ubiquitous and identical 400B LLMs on devices such that the LLM isn’t part of the ‘cost’ of the compressed data.
Some other thoughts:
An LLM will need to get prefilled with enough context to start decoding accurately. It would likely take less space to compress the prefill context with a traditional compressor than to store the decode errors you’d otherwise rack up. This will have to be repeated as topics change, since there will be a discontinuity between past and future decodes.
The current best algorithm achieves 0.88 bits per byte. Assuming characters are all 1 byte (worst-case scenario) and tokens average 4 characters, that’s 3.52 bits per token. So the LLM would have to get the right token in the top ~11.5 choices. That’s better (lower) than I expected.
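Spelling out that arithmetic (0.88 bits/byte and 4 characters per token are the assumptions stated above, not measurements):

```python
bits_per_byte = 0.88                        # current best result quoted above
bytes_per_token = 4                         # assumed average token length, 1 byte per character
bits_per_token = bits_per_byte * bytes_per_token
equally_likely_choices = 2 ** bits_per_token

print(round(bits_per_token, 2))             # 3.52
print(round(equally_likely_choices, 1))     # ~11.5: the LLM has to land in roughly the top dozen
```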
The LLM has to get re-run each time there are decode errors. The corrected tokens need to be prefilled back in (reusing the old KV state), similar to Speculative Decoding.
Because it has understanding. Or something akin to understanding. Let’s work through a simple example.
Suppose I’ve compressed the first part of a string, which contains:
One plus one equals
There are three characters left (not including a space). Any human can see what the answer is, and you only need to store a single bit (or less!) to indicate that the obvious prediction was correct. What would a computer predict?
The first way is to have no information at all:
One plus one equals %#(
Three random characters are as good as any others. So the top prediction is the same as any other and there’s no way to choose something else.
What if we take English letter frequencies into account?
One plus one equals eee
E is the most common letter, so the most common sequence by this standard is just all Es. Still not a good prediction. How about taking word frequencies into account?:
One plus one equals the
“The” is the most common three-letter English word. Still not a good prediction. What about taking English grammar into account?:
One plus one equals way
The sentence needs a noun there, and “way” is the most common three-letter noun. But if we look more closely, we can see we’re actually doing better. The top three-letter nouns are: way, art, map, and two. And they’re all pretty high in frequency. We can pick out the answer without too many extra bits, even though it isn’t the top prediction.
Finally, what if you take the actual meaning of the sentence into account? And the answer, which any child can come up with, is:
One plus one equals two
Of course, other answers are possible. Maybe the author is lying, or writing a post that shows several counterexamples. We need more bits to account for when the prediction fails. But ultimately this is going to be the most probable answer.
Shannon knew from the start that the better predictions you can make, the fewer bits you need. You need zero bits if your predictions are perfect. One bit if you can always narrow things down to a 50/50 choice. And so on (including fractional bits). LLMs, having something like an understanding, will outperform simpler models that don’t have that (Markov chains, etc.).
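For reference, the relationship being described here is just Shannon’s ideal code length, with the three cases above plugged in:

```latex
L(p) = -\log_2 p \ \text{bits}, \qquad
L(1) = 0, \quad L\!\left(\tfrac{1}{2}\right) = 1, \quad L(0.9) \approx 0.15
```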
LLMs have to be big to contain this understanding, which means they’ll probably never be useful for compressing something like Wikipedia. But as your dataset grows, the size of the LLM becomes irrelevant, and the only thing that matters is whether it can make better predictions.
Bah.
How big is the decompressor going to be for this AI-generated compression scheme?
As big as the LLM. A few hundred gigabytes, say.
Will it ever be practical? Probably not. But if you’re trying to set a record compressing a 100 TB corpus down to the minimum, I suspect you want something along those lines.
And to be clear, the thing I just described is lossless compression.
I doubt an LLM is going to beat a purpose-built compressor. Using your example, how much data does it take to represent the word “two” in your dictionary? If it’s more than three bytes, your compressed file is going to get bigger.
One of my points is that that ‘understanding’ comes at a cost, as do all of the understandings in the LLM that you don’t need for your data.
A compression algorithm will also statistically model the language, but when it compresses the data it won’t put the extra effort into remembering the 2nd best answer or the 31999th best answer. An LLM will – that is the overhead I’m referring to.
An LLM is a superset of a text compressor. You can make it do the job, but it puts in a lot of extra effort. And I mean logical effort – it’s even worse if you count actual compute ops.
Well, presumably an LLM could predict multiple words in a row.
Here’s a skeptical take on lossless LLM compression, presented with the understanding that I don’t know what I’m talking about.
Existing lossless compression uses dictionaries that reference words or sentence fragments in the document and substitute short codes for them. Consider this simplistic dictionary:
01 - the
02 - and
03 - so on
So instead of storing “the” you store “01” and save a character. But you have to store the dictionary as well. Lots of files are made up of English words. Why not have a short dictionary that’s stored in the file compression program itself? That way you don’t have to store that mini-dictionary in the file. Similarly you could add something like this:
04 - results of LLM model #1
05 - results of LLM model #2
06 - results of LLM model #3
That would take a lot of processing time, because you would have to run a number of LLM models. But it potentially could save a lot of space. The problem here is that while mini-dictionaries (my term) are a thing, they aren’t that common. I’m not sure why. I speculate that whatever makes them rare currently, will also limit the utility (in terms of storage) of this LLM approach. Maybe 7z is just really good.
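Just to make the mechanics concrete, here is a toy version of that built-in mini-dictionary idea. The word list and one-byte codes are made up for illustration; a real compressor would use an entropy coder rather than whole-byte codes, and would escape codes that collide with real input.

```python
# Toy built-in dictionary: it ships with the (de)compressor, so it costs
# nothing in the compressed file itself.
BUILT_IN = {"the": "\x01", "and": "\x02", "that": "\x03", "with": "\x04"}
REVERSE = {code: word for word, code in BUILT_IN.items()}

def compress(text: str) -> str:
    # Assumes the control bytes never occur in the input; a real format would escape them.
    for word, code in BUILT_IN.items():
        text = text.replace(word, code)
    return text

def decompress(text: str) -> str:
    for code, word in REVERSE.items():
        text = text.replace(code, word)
    return text

original = "the cat and the hat"
assert decompress(compress(original)) == original
assert len(compress(original)) < len(original)   # 13 vs 19 characters
```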
LLMs produce a list of predictions with probabilities. You would use a standard entropy compressor (Huffman, arithmetic) to produce bit sequences that select from that list. Without checking, I’d guess that “two” dominates the list, with an equivalent of >50% probability. That would mean it compresses down to just one bit. Of course, most words won’t be as predictable as that.
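To put numbers on that guess (the probabilities below are made up purely for illustration, not taken from any actual model):

```python
import math

# Hypothetical next-word probabilities after "One plus one equals":
probs = {"two": 0.55, "three": 0.08, "the": 0.04, "a": 0.03}

for word, p in probs.items():
    print(f"{word}: {-math.log2(p):.2f} bits")   # ideal arithmetic-coded cost

# two: 0.86 bits -- a hair under one bit, as suggested above
# three: 3.64 bits, the: 4.64 bits, a: 5.06 bits
```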
Again, I’m only talking about the case where you have an enormous dataset, where the size of the LLM is irrelevant compared to the data you’re compressing.
For small datasets like Wikipedia, the tradeoff is different.
They can, but you have to double-check that the decoded tokens are correct and then redo them if not. Since inference time of the LLM is expensive, you have to balance the improvement with the penalty. This is a common technique to speed up LLMs called Speculative Decoding. There are a lot of variations on the theme as well.
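A rough sketch of that check-and-redo structure in the compression setting. The model call is a hypothetical placeholder, and real speculative decoding also reuses the KV cache, which is glossed over here.

```python
# Let the model guess K tokens ahead; accept guesses until the correction
# stream says one was wrong, then substitute the stored token and re-run
# the model from that point.
K = 4

def decode(model, corrections, length):
    # corrections[i] is None where the model's greedy guess was right,
    # otherwise it holds the true token the compressor had to store.
    tokens = []
    while len(tokens) < length:
        guesses = model.greedy_continue(tokens, K)     # hypothetical API
        for g in guesses:
            i = len(tokens)
            if i == length:
                break
            if corrections[i] is None:
                tokens.append(g)                       # guess accepted for free
            else:
                tokens.append(corrections[i])          # mismatch: use stored token...
                break                                  # ...and re-run the model from here
    return tokens
```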
My knowledge of compression mostly ended in the days of the modem. Back then the dictionary was built up during decode. Compared to storing a dictionary in the file, this technique has a start-up penalty but a smaller file size. The cost of the start-up penalty becomes less important as the compressed file grows in size. Perhaps this is why it is not common?
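For anyone curious, that modem-era approach sounds like the LZW family: both sides grow an identical dictionary as the data streams through, so the dictionary itself is never transmitted. A minimal encode-side sketch:

```python
# Minimal LZW-style encoder. The decoder rebuilds the very same dictionary,
# entry by entry, from the codes alone -- hence the start-up penalty (the
# dictionary begins with only single characters) but no stored dictionary.
def lzw_compress(data: str) -> list[int]:
    dictionary = {chr(i): i for i in range(256)}   # seed with single characters
    next_code = 256
    current = ""
    out = []
    for ch in data:
        if current + ch in dictionary:
            current += ch                          # keep extending the match
        else:
            out.append(dictionary[current])        # emit code for longest match found
            dictionary[current + ch] = next_code   # learn a new, longer entry
            next_code += 1
            current = ch
    if current:
        out.append(dictionary[current])
    return out

text = "the cat and the hat and the bat"
print(len(lzw_compress(text)), "codes for", len(text), "characters")
```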
This is perhaps the crux. LLMs don’t understand anything. They provide a plausible simulacrum because they have a metric of closeness of word fragments that occur near one another in sequences. So, fed lots of text, they accrete representations of closely related fragment streams that often correlate to similar semantics. This is just a very large superset of performing a very deep search of lots of text to find common streams of text without worrying about representing closeness, and just worrying about exact matches, which is what a compressor, or compressor generator, might do.
The step that could make a pure text compression system better than an LLM would be to build a system that builds parametrisable phrase representations. Given enough of a search space and enough horsepower I suspect one could adapt some of the same enabling technology as used in LLM training to build such a system. A very capable compressor would end up with a pretty substantial data representation of the system, and the decompressor would be similarly substantial, but when compressing very large inputs, the total size would start to win out. We could either provide a compressor with a pre-built data set, which is akin to using a pre-existing LLM, or we could let the compressor run ab-initio on the source text stream.
ETA - it does however occur to me that given a trained LLM embodies a lot of the heavy lifting, interrogating one with phrases might provide a short-cut to identifying useful sets of parametrisable phrases and phrase variants. This might work best if we were accorded access to the bare LLM.
I just want to mention here that I find it amusing that we’re talking about the entire contents of Wikipedia as a “small dataset”.
I think that, for purposes of this thread, we’re measuring decoding algorithms purely by the compression ratios they can achieve, not by the processing time needed to achieve it.
I will never forgive Steve Jobs for effectively killing Flash!
Sideshow:
It had a really cool tool that could render vectors to bitmaps. So for example, I was working on an online casino, and we needed to generate a “wheel of fortune” based on roulette. Using vectors, it was easy to render the wheel, but spinning it was computationally hard, so obviously slow and glitchy.
So I used the BitmapData functionality to “screenshot” the vector wheel, then overlay the bitmap and spin that instead. We had a random number generator, so at the appropriate time the bitmap wheel was hidden and the vector wheel shown in the equivalent position.
Kind of proud of myself for thinking way outside the box on this.