Have we reached the peak of media compression technology, or is it still advancing?

I think we passed each other.

I simply meant that “seven hundred and three” could be interpreted as the spelled-out version of the numeric value 703 or as a formula fragment meaning “perform the operation of adding 3 to 700”. Since the arithmetic result of either interpretation is the same, we cannot tell which of those interpretations the AI used. So IMO it’s premature to conclude that it correctly parsed your input. It may have gotten the right answer for the wrong reason.

My goal was to create a bait word formula where interpreting the “and” commonly found in written-out numbers as a call for addition results in a different numerical value. But the bait formula would also have to be un-contrived enough that ordinary humans wouldn’t be scratching their heads at the weird, ambiguous construction.

I suppose something like “negative seven hundred and three” might do it. If the “and” is interpreted as part of the number then the numeric value assigned will be -703. If the “and” is interpreted as a call to add 3 to negative 700 the answer becomes -697.

[Aside]
I recall a surprisingly contentious thread from 3-10 years ago on the topic of embedded “ands” in spoken or written numbers. Many people thought them essential and others thought them anathema. IIRC we found some correlation between folks’ attitudes and the region and era in which they attended elementary school.

That’s what my (tortured) interpretation does above. It equals 1117, not 820. I only added parentheses to change the binding of the words. And of course interpreting “and” as “plus”.

FWIW, both Claude and Gemini got it wrong (both giving 1217!) when I added the parens. So they aren’t super great at addition yet :slight_smile: .

Still, this is true:

You’re right that most of the ambiguity doesn’t actually lead to a different answer. So a bait word formula where every possible interpretation gives a different answer would be valuable.

But it also has to be something where to a human, there’s just one obvious and correct interpretation. So it might be tricky.

I actually have access to the OpenAI API via work, which I can get raw token output from. I tried this prompt (on gpt-4o):

Complete this sentence (only including the new words): If square plus triangle equals septagon, then square plus pentagon equals

It correctly answered “nonagon”. The raw output is in “logprobs”, which is the (estimated) logarithm of the probability (natural log). Note that since tokens are not the same as words, it first predicts “non” and then “agon”. For the first response token:

"top_logprobs": [
    {
        "token": "non",
        "logprob": -1.0813986
    },
    {
        "token": "dec",
        "logprob": -1.3313986
    },
    {
        "token": "hex",
        "logprob": -2.3313985
    },
    {
        "token": "d",
        "logprob": -2.4563985
    },
    {
        "token": "und",
        "logprob": -2.8313985
    }
]

So it’s pretty confident in “non”, though “dec” is pretty close and “hex” not that far behind. Still, these are very good predictions: you need just a few bits to pick out the “non” from the list, reserving only longer bit strings for the possible case that the answer is something else.

The remaining “agon” is no problem at all:

{
    "token": "agon",
    "logprob": -1.247159e-05
},

That is a supremely confident answer, needing only a couple dozen microbits to answer.
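
In case anyone wants to check the bit math: the logprobs above are natural logs, so converting to bits is just a division by ln 2. A quick sanity check on the two numbers reported above:

import math

# Convert the reported natural-log logprobs to ideal code lengths in bits.
for tok_text, logprob in [("non", -1.0813986), ("agon", -1.247159e-05)]:
    prob = math.exp(logprob)           # probability assigned to the token
    bits = -logprob / math.log(2)      # ideal entropy-coder cost in bits
    print(f"{tok_text!r}: p = {prob:.3f}, about {bits:.6f} bits")
# 'non'  -> p = 0.339, about 1.56 bits
# 'agon' -> p = 1.000, about 0.000018 bits (the couple dozen microbits)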

So overall we’ve compressed “nonagon” down to maybe 3 bits. That’s much better than the 0.9 bit/char record. Whatever is going on here, it’s close enough to “understanding” to be a genuine improvement (for this particular example).
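
For anyone who wants to poke at this themselves, the call was roughly the following. I’m reconstructing it from memory, so treat the parameter names and response structure as a sketch rather than gospel:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = ("Complete this sentence (only including the new words): "
          "If square plus triangle equals septagon, then square plus pentagon equals")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    logprobs=True,     # request per-token log probabilities
    top_logprobs=5,    # the API only returns a handful of alternatives per token
)

# One entry per generated token, each with its own logprob and the top-5 list.
for item in resp.choices[0].logprobs.content:
    print(item.token, item.logprob, [(t.token, t.logprob) for t in item.top_logprobs])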

I would argue that a calculator routine also has some measure of understanding, at least where numerical calculations are involved.

I would maintain that those are both the same interpretation.

Tack a “negative” on the front of 700. Are they still? Why? Why not?

That’s a question of order of operations, not of what the “and” means.

The problem with bolt-ons to LLMs is that they involve human curation and are arguably not really the AI operating. Vendors of commercial LLMs, or anyone making a useful system, might be expected to provide various bolt-ons to perform recognised functions.
Saying that any of these make the AI understand the paradigm requires you to say that the code inside the bolt-on understands the paradigm. The AI only recognises that the input pertains to the bolt-on. So if the AI recognises an input as matching a grammar that it is hard-wired to send to a bolt-on to process, the AI requires, and contains, no further “understanding” of the bolt-on semantics. If the AI decides the input is an arithmetic expression and sends it to a calculator bolt-on, do we claim the calculator understands arithmetic? The calculator was coded by a human, but we don’t claim the implementation embodies the human’s understanding. No more than we might claim that a pocket calculator of old embodies understanding.

The issue with this is that you have to feed in a bunch of uncompressed (or poorly compressed) tokens to get the last two. So the average won’t be 3.

“If square plus triangle equals septagon, then square plus pentagon equals”

Try predicting an article on wiki starting from an empty context. The first token could be any of the 32000 tokens.

Actually the perplexity metric will basically measure this. For a given model, context size, and dataset it will try to predict the next tokens and measure the cross-entropy of the predicted logits against the actual tokens. The log of that score is a stand-in for the number of bits needed to identify each token. Obviously you can do something like Huffman encoding on top.

I looked a bit for GPT-4o scores and didn’t find any good references. The scores ranged from 2.6 - 8.5, but did not indicate what context size, dataset, etc.
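
For what it’s worth, converting a perplexity score to bits per token is just a log base 2, so that reported range corresponds to roughly 1.4 - 3.1 bits per token:

import math

# bits per token = log2(perplexity); numbers are the reported 2.6 - 8.5 range.
for ppl in (2.6, 8.5):
    print(f"perplexity {ppl} is about {math.log2(ppl):.2f} bits per token")
# 2.6 -> ~1.38 bits/token, 8.5 -> ~3.09 bits/token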

Not with the same probability. I’m quite confident that if you datamined Wikipedia you’d find clear peaks in the distribution of first words in articles. Probably with “The” being at the top.

But never mind that. Basically all entropy coders require some context before they get going. The dictionary has to be filled out. Of course they do more poorly on the first few words.

Nor is it just the last two it predicts, though. It easily predicted the end of “If square plus triangle equals septagon, then square plus pentagon” with high confidence. It could not get “pentagon” out of “If square plus triangle equals septagon, then square plus”, but the API only gives me the top 5 predictions. Since the top 5 were largely shape-related (i.e., it got the concept), pentagon (or “pent”) is probably just further down the list. It just has to be in the top 100 or so for it to be a win. Actually even more than that since there was a savings on subsequent words.
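
Roughly, the arithmetic behind that “top 100 or so” intuition, using the ~0.9 bit/char figure from above (illustrative numbers only):

import math

# Spelling out "pentagon" at ~0.9 bit/char costs about 8 * 0.9 = 7.2 bits,
# while naming "rank 100 in the model's sorted prediction list" costs about
# log2(100) = 6.6 bits -- and later tokens get cheaper once "pent" is in context.
print(8 * 0.9, math.log2(100))   # 7.2 vs ~6.64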

And of course this is just a crude experiment with an off-the-shelf LLM not really designed for this purpose. It just happens to be what I have access to. It’s just meant to demonstrate the principle that better predictions mean better compression, and that an LLM can make better predictions than a simple statistical model can.

In much the same way as a human can use a calculator without having any understanding whatsoever of the calculation. A phenomenon I’m all too familiar with, as a math teacher.

It’s proverbial Chinese Rooms all the way down. :slight_smile:

Gawd, tell me about it. I got asked this afternoon whether calculators were allowed in the upcoming operating systems exam, because the student was worried they might need to calculate a power of two for some basic data size. These are third year university students.
No, we don’t allow calculators, because we assume the students are still breathing.

To be fair, I think the student was just a bit stressed out by the looming set of exams, which for many mark the end of their degree. So some of the questions are a bit odd.

Crafting exam questions that probe understanding, versus just having the students regurgitate rote-learned material they pattern-match the question against, is an art. (One I don’t claim to have fully mastered.) This is why I remain very sceptical of claims of higher levels of performance from AI systems. The ability to generate a screed of plausible-sounding text that mostly matches the general theme of the question is the hallmark of both the rote-learning weak student and, it seems, of LLMs. Neither understands.

And using AI systems to generate reports is just rife. So much so that I think any assessment based on such reports has the mark of death on it. Which is sort of annoying. We are probably going to have to be much more prescriptive about what data is presented, how it is presented, and how conclusions are presented. All more work. Although it will likely make for a better all-around pedagogical experience. In a way the rife use of AI forces us to focus more clearly on getting students to display understanding, and not be lazy about how we set and mark work.

In a sense though, isn’t this just ‘tool use’? (i.e., the thing that got pre-humans out of the trees, and one of the things we consider to be an indicator of intelligence when we see it in non-human animals)

With all respect for the knowledge you bring to this board, I find statements like this frustrating to read. It’s getting a bit off-topic, but let me explain.

I’m sure that back in the day you probably had an opportunity to play with a very primitive AI chatbot called Eliza, developed by Joseph Weizenbaum as a sort of proof of concept that a program could carry on a conversation. The thing was that Eliza’s responses were generic claptrap that bore no relationship whatsoever to the semantic content of the user’s message; the response would usually be the user’s original assertion or question returned with a trivial syntactic transformation. For instance, “I think you don’t understand anything” would come back as something like “Why do you think I don’t understand anything?”, or sometimes as just a generic prepared response like “please go on” or “how does that make you feel?”.

It’s self-evidently obvious that Eliza didn’t understand anything. But what about a modern LLM presented with a problem that would challenge a human of average intelligence? What if the problem, presented either verbally or as a combination of verbiage and pictorials, was accurately solved by the LLM, which also presented its work to show how it arrived at the answer, and what if it was conclusively shown that neither the problem nor the answer could possibly have been in the LLM’s corpus? And what if it passed this kind of problem-solving exercise again and again, sometimes making mistakes, but mostly outperforming most humans?

It seems to me that declaring that the LLM didn’t truly “understand” the problem is really stretching the meaning of “understanding” beyond the breaking point, and truly moving the goalposts. If GPT-4 could somehow have been demonstrated in the 1960s, I don’t think anyone would have doubted the reality of its understanding. Today we’re so acclimatized to digital wonders, and lately to GPT, that it’s all boringly familiar, and though we mostly don’t understand how it works, we know it runs on silicon chips and has something to do with matching strings of tokens, so most pundits have concluded that it still lacks “true” understanding. But does it? Or have the pundits run away with the goalposts again, as they did when grandmaster-level chess programs first appeared?

I think the correct statement is that LLMs lack human-like understanding. Their cognitive processes are very different from ours, and they sometimes make stupid mistakes that even a child would not, while also solving problems in math and logic that would be beyond the ability of a majority of humans.

Overall I see your point. And mostly agree with it although I admit I am far more ignorant in that opinion than you (or @Francis_Vaughan) are. This is a tech area in which I admit I am largely clueless for lack of effort to be otherwise.

As to the quoted snip …

Most humans are barely what I would term intelligent. They hate thinking. They are good at doing well the same stuff they do every day. Like dogs, they are fine automata, repeating their stored programs in response to recognized stimuli. And they are flummoxed when presented with novel situations or problems requiring detailed thought.

IOW, beating the standard of “as intelligent as the average human” is hardly a mark of pride. Average adult humans are vastly more capable of getting around in the world than current AIs are, but only because of their vastly greater exposure to day-to-day life as a human. So they have far more pre-stored stimulus-response programs to draw on than an AI does. As general problem-solving engines, on average they suck pond scum.

OTOH, day-to-day life as a human is something that AIs can experience only vicariously at best, and in far less quantity than we do.

I wrote an LLM-based lossless text compressor. It is functional, but exquisitely slow. The compression process is:

  • Tokenize the input text
  • Feed a token into the LLM and get the logits for the next token
  • Sort the logits by probability
  • Record the index of sorted logits that corresponds to the expected next token
  • Build a Huffman table using the frequency of each index

The output is the Huffman table (compacted) and then each Huffman encoded index in order.
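
In sketch form the loop looks like this. It is heavily simplified (no KV cache, no batching, and the model name is just a stand-in for whatever llama3 checkpoint is available locally), but it shows the shape of it:

from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"   # stand-in for whatever model is local
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def rank_stream(text: str) -> list[int]:
    """For each token, record its rank in the model's sorted next-token predictions."""
    ids = tok(text, return_tensors="pt").input_ids[0]    # BOS + text tokens
    ranks = []
    for i in range(1, len(ids)):
        with torch.no_grad():
            logits = model(ids[:i].unsqueeze(0)).logits[0, -1]   # next-token logits
        order = torch.argsort(logits, descending=True)           # most likely first
        ranks.append((order == ids[i]).nonzero().item())         # 0 = top guess
    return ranks

ranks = rank_stream("The quick brown fox jumps over the lazy dog.")
freqs = Counter(ranks)   # these frequencies feed the Huffman table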

The algorithm is really too slow to gather comprehensive results, but the compression rates for the test vectors are better than I expected. One of them is the Wikipedia page on Philosophy (the mark-up, not the HTML). Using llama3 70B, 3B, and 1B in float16 the compressed output is 10.24-15.09% of the uncompressed input. The same text with ‘bzip2 -9’ is 20.57%. The overhead of the Huffman table is significant; without the table the compressed output is only 8.49-12.08%.

On the flip-side, bzip took 0.49 ms compared to the LLM’s 1.84 hours (or 6609400 ms) using multiple GPUs and 160 GB of GPU RAM. And that’s not counting the time to load the LLM.

A couple of other points:

  • I assume the Philosophy page is part of the LLM’s training dataset. ‘Philosophy’ is also a long word that gets encoded as a single token. Both of these facts could provide a slight boost.
  • I created a test vector using recent news stories and it compressed almost as well as Philosophy (12.63 - 17.84%).
  • “The quick brown fox…” compresses amazingly well, as expected. “The” is the third most likely token to start a sentence, and after “The quick brown” the LLM predicts the rest of the sentence perfectly. This compresses down to 16 bits without the table (1.6 bits per token).
  • LLM tokenizers typically split numbers into individual digits (0-9) and don’t have tokens for common larger numbers like 10 or 100. This is different from a typical text compressor, which treats all characters equally. I have a test vector containing a table of data about Illinois cities, and it compresses a few percent worse.
  • Anecdotally I don’t think run-length encoding would help in general. However, I think having additional Huffman symbols for runs of 0 indices would help, at the cost of making longer codes for the other indices (1, 2, etc.). Again you need a big dataset to measure the trade-off.
  • I used the instruct versions of the models because the thesis is the compressor would use whatever LLM is locally available. I think the non-instruct would be different, possibly better – but did not try it.

There are a lot of other variables that have some effect:

  • LLM complexity, context size, data type, quantization, GPUs
  • tokenizer (I used llama3 with a vocab size of 128256). This affects not just the LLM, but also the Huffman encoding
  • size of the chunks fed to the LLM
  • size and variety of the input text
  • various speculative decoding techniques could boost token rates by 1.5x - 2x.

Very impressive!

I don’t quite get why you need to store the Huffman table, though. IMO, the table should be distinct for each token, and derived from the logits. Since the next-token logits are deterministic and based solely on the previously generated text, the table can be exactly recreated. And really you should be using arithmetic coding, so no actual table, just a list of probabilities that correspond to segments of the 0.0-1.0 real numbers.
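
To make the determinism point concrete: a decoder holding the same model could rebuild the text from the rank stream alone, something like this rough sketch (reusing the model/tok setup and rank convention from your sketch above):

import torch

def decode(ranks: list[int]) -> str:
    # Mirror of the encoder: the logits depend only on tokens already decoded,
    # so each recorded rank picks out exactly one token.
    ids = [tok.bos_token_id]
    for rank in ranks:
        with torch.no_grad():
            logits = model(torch.tensor([ids])).logits[0, -1]
        order = torch.argsort(logits, descending=True)
        ids.append(order[rank].item())
    return tok.decode(ids[1:])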

The slow coding time is pretty funny and not unexpected. I made the suggestion on theoretical grounds, not practical ones :slight_smile: .

Seven hundred and three times one hundred and seventeen?

Discussion here reminds me of a throw-away line in A Canticle for Leibowitz, where they mention that one monk has developed a mathematical algorithm for guessing the missing letters when they transcribe damaged old texts from the Golden Age.

Thanks!

Originally I was thinking the table would be pre-determined and would not be included (both the compressor and decompressor would know the table). But with 128256 possible indices, that would result in some really long codes – most of the codes wouldn’t be needed in a given text. In practice it might be better to have some combination of bit encoding for common cases and then an escape sequence and explicit index for uncommon cases. This is where you need a lot of data to tune the trade-offs.
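
Something like this toy encoding is what I have in mind (the threshold and widths are completely made up; a real version would use shorter variable-length codes for the common ranks):

ESCAPE = 255   # made-up threshold: ranks 0-254 get a short code, everything else escapes

def encode_rank(rank: int) -> str:
    # Toy fixed-width version of the "common codes + escape" idea.
    if rank < ESCAPE:
        return format(rank, "08b")                          # 8 bits for common ranks
    return format(ESCAPE, "08b") + format(rank, "017b")     # escape marker + explicit 17-bit index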

I wasn’t familiar with arithmetic coding, but that might be a better fit.

Below are a few concrete examples. These examples happen to work out so that each word is a single token, which makes it easy to map out the flow.

The first and second rows show how the text is split into tokens and the corresponding token value (between 0 and 128255).

The third row shows the index of the actual next token in the LLM’s sorted predictions. A value of 0 means the LLM predicted the next token as the most likely token, whereas a value of 128255 means it predicted it as the least likely token.

If we wanted to save this stream of indices we would need 17 bits per index, but since the indices are not equally likely, one option is to use a probability-based encoder. The fourth row shows the Huffman code used to encode each index.

For example:

text:   ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog', '.']
token:  [ 791,   4062,     14198,    39935,  35308,    927,     279,    16053,   5679,   13]
index:  [ 2,     2444,     6,        0,      0,        0,       0,      0,       0,      0 ]
code:   [ 001,   011,      010,      1,      1,        1,       1,      1,       1,      1 ]

‘The’ is the 3rd most likely token to start any sentence. Then knowing the first word is ‘The’, ‘quick’ is the 2445th most likely token. Once the LLM knows the sentence starts with ‘The quick brown’ it is able to predict the rest of the sentence easily.

Another more realistic example:

text:   ['What', ' is', ' the', ' best', ' dog', '?']
token:  [ 3923,   374,   279,    1888,    5679,   30]
index:  [ 15,     0,     0,      1,       22,     68]
code:   [ 110,    10,    10,     011,     111,    00]

Gotcha. My intuition is that you could get away with a generic Huffman table as most text will probably follow something like a power-law distribution. Sure, you’d still have some outliers like “quick” above, but most of the benefit will be in picking low indices that will be fully covered in any text of reasonable length.

Arithmetic coding should help a fair amount. It does best for very short codewords. The granularity of Huffman coding is a problem when you’re talking just a few bits. Makes a difference when 1.3 bits gets rounded down to 1, or 1.7 rounded up to 2, etc. Arithmetic coding handles any fractional number of bits, even <1.
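
A toy illustration of the granularity point (probabilities made up):

import math

# An arithmetic coder can spend -log2(p) bits on the top index, even less than
# one bit when p > 0.5; a Huffman symbol always costs a whole number of bits.
for p in (0.4, 0.65, 0.9):
    print(f"p={p}: ideal {-math.log2(p):.2f} bits, Huffman at least 1 bit")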