Ultimately, it doesn’t matter. It’s a philosophical question that’s impossible to ever answer. Whether they “actually” understand anything or not–whatever that could possibly mean–they at least do an excellent job of simulating understanding. And that’s good enough for the task of prediction.
I dunno. That’s how Alexa works, or at least used to (and also my tweeting litterbox). You quickly find that even with all those parameters, it still sounds robotic. There’s something about language that you just can’t easily capture with a bunch of fixed parameters and a handful of grammar rules. But LLMs sound very natural in comparison.
Wikipedia is pretty big, but there are fewer than 7M articles in the English-language edition. I can’t find the number of English-language books in the world, but the total number of books is well north of 100M, so probably a few tens of millions are in English. And those are whole books, not just articles.
If you count the English corpus beyond books, it’s larger yet. I’m not sure I could even estimate it, but it’s almost certainly 100x or more the size. Of course, a lot of that is repetitive junk–which will compress well.
So there is the question of the inherent compressibility of the corpus itself. If you scraped every text message ever written, I’ll bet it compresses very well compared to Wikipedia. People use texting for the same sorts of things, so it probably has low entropy.
Text messages are a bit of a peculiarity, here, since the metadata is comparable to or even longer than the content itself. I can write a 500-page novel, and index it with a five-word title. If I want to search my hypothetical Database of All English Corpus for that novel, the title (and maybe the author, if we want to be thorough) suffices. But if I want to search for a particular text message, then I need to search for something like “The text message sent from John Doe to Jane Roe at 12:17 PM on January 13, 2019”. Do we store all of that metadata in our Database of All English Corpus? And is that metadata itself (which may or may not be in any format that could be called “English”) part of the corpus?
There’s also the problem that any compression relies on context, and the context for any given text message conversation doesn’t last very long. A common text message might be, for instance, “Need anything from Aldi?”, with the response “Bananas”. However short a compressed message you can make for that, it won’t give you any more than those five words, because the next text exchange between those two people probably won’t have anything to do with Aldi or bananas. By contrast, with a novel, the context persists for a long time: Once you have “It was the best…”, it takes little additional data to get “of times”, and then little additional data to get “it was the worst of times”, and so on, with every new word having lots of context available to make it more likely.
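A quick, hedged illustration of that difference, using zlib as a crude stand-in for any context-exploiting compressor (the sample strings are invented, and the “novel” fakes persistent context with repetition; note the short message can even expand, due to fixed overhead):

```python
import zlib

# Short, self-contained messages barely compress; long text where the
# context persists (crudely faked here with repetition) compresses well.
msg = b"Need anything from Aldi? Bananas"
novel_like = b"It was the best of times, it was the worst of times. " * 100

for name, data in [("text message", msg), ("novel-like", novel_like)]:
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{name}: compressed to {ratio:.0%} of original size")
```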
Basically agreed with all that. The lack of context does reduce the predictability. But there must also be many thousands of instances of identical “Need anything from Aldi?” messages as well, so there’s probably a tradeoff. Don’t know how that balances out.
I wasn’t considering the metadata, but that would undoubtedly increase the entropy since the timestamp on something is close to random. Not as bad as fully random text since obviously not all dates/times are valid (and the digits and special characters are a subset of the whole character space), but its entropy is almost certainly higher than typical English text’s.
It’d be straightforward, at least, to compress the metadata of a text message. The timestamp can be a single number, probably the number of seconds since some epoch time (seconds are probably fine enough to uniquely identify most text messages), and that, as you say, won’t compress much (though probably some, given that not all times are equally likely). The sender can be represented by their phone number (which is, again, mostly random). Once you have the sender, you can probably compress the recipient, because any given sender will mostly always be texting the same few recipients, but that just saves you at most the cost of another phone number, which is, what, 30ish bits?
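Back-of-the-envelope version of that accounting (a sketch, not a real encoder; the per-field counts are raw information content before any modeling, and the eight-recipient figure is just an assumed example):

```python
import math

timestamp_bits = 31                        # seconds since a ~68-year epoch: 2^31 values
phone_bits = math.log2(10**10)             # 10-digit phone number: ~33.2 bits
recipient_bits = math.log2(8)              # assuming a sender texts ~8 people: 3 bits
print(timestamp_bits + phone_bits + recipient_bits)  # ~67 bits of metadata per message
```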
The LLM is probably smart enough to pick up on the format of the conversation and metadata after a few samples. It might even be smart enough to figure out that a series of messages is likely to be minutes apart, and predict a date/hour that’s the same as it had been. Maybe even smart enough to guess that a response to a message sent at 1:59:57 will probably be at 2:0?:??.
I wasn’t suggesting this as a way of creating synthetic content. Rather, as a higher-level mechanism for lossless compression of text. If we posit large-scale similarities of sentences and structure, we can leverage that. Mostly an academic exercise, but still interesting.
That’s sort of why I call anyone on it. If the existence of “understanding” is impossible to establish, or the word impossible even to define sensibly, people shouldn’t use it in reference to any AI.
One might as well claim AIs have a soul. Just as useless and pointless. But one sees persistent attribution of higher-level capabilities to AIs that just doesn’t stand up. It gets worse when such assertions lead to inferring, from these wobbly ideas, capabilities that just don’t work in any form. Sadly, the world of LLMs and neural nets now holds sway. There has been a lot of work in the harder traditional AI world on knowledge representation that seems mostly forgotten, replaced by a vague claim that the trained nets automatically create complex models in an inscrutable manner.
If I talk to a human being (say a student) and ask if they understand something, I get a wide range of answers. Sadly some cultures result in students saying “yes” even if they have no clue. If I set an exam question to probe understanding I am going to expect a student to be able to reason from the base concepts and synthesise an answer to the exam question that is more complex than stringing phrases together. Sadly, sometimes that is all I get. There are times when ChatGPT probably would get better marks than some students. Doesn’t mean it understands anything.
I think that @Dr.Strangelove adequately hedged his bets with “…or something akin to understanding”. We might not be able to pin down what, precisely, “understanding” is, but what LLMs have is at least related to understanding.
Right. Ultimately, if an LLM can predict words at a higher probability than a basic statistical model, then there must be something deeper going on. Whether we call that “understanding” or something else is irrelevant.
My “one plus one equals two” example was a little too simplistic since that particular example probably exists in the corpus already. But it’s obviously easy to alter the example until you get something not in the corpus. “Seven hundred and three plus one hundred and seventeen equals” does (did) not appear anywhere on the internet, and yet I asked Gemini:
please complete the following sentence: seven hundred and three plus one hundred and seventeen equals
Seven hundred and three plus one hundred and seventeen equals eight hundred and twenty.
A basic statistical model couldn’t have come up with that. It “got” that I wanted the numbers to be spelled out, and gave the correct mathematical answer. It also didn’t get confused at the grammatical ambiguity (the way the “ands” group things, etc.).
Does the LLM have a deep “understanding” of numbers, addition, and so on? Probably not. But it operates as if it does, at least most of the time. And really, so do some human students, who sometimes never really get a subject, but manage to learn it in a mechanical way well enough that they can produce the right answers most of the time.
And in fact, often the best human students, like Richard Feynman, are the ones who realize that what they have isn’t really understanding at all.
So maybe, even if LLMs don’t have “real understanding”, whatever that even is, they do still have whatever trait it was that Feynman had (to some rudimentary degree).
Although if it treated the “and”s as if they were “plus”es, it would still do the correct arithmetic here. In digits: 703 + 117 is equivalent to 700 + 3 + 100 + 17. It might be entertaining to construct a similar problem where interpreting “and” as “plus” versus “and” as a grouping operator results in a different answer.
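For instance (a made-up case along those lines), “seven hundred and three times five” comes apart depending on the reading:

```python
grouping = 703 * 5         # "and" binds 700 and 3 into one number: 3515
and_as_plus = 700 + 3 * 5  # "and" read as "+", with "times" binding tighter: 715
```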
But all that’s mere quibbling on my part.
Your larger point, and that of @Chronos just above, surely stand. Whatever it’s doing, it’s a decent proxy for what we call “understanding” when applied to people. Within this realm at least.
The target token won’t be in the top-2 most of the time. There are too many reasonable options at any given point. Top-50 is a common sampling setting because there are good options throughout the range.
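For reference, top-k sampling is just: keep the k most probable tokens, renormalize, sample. A minimal sketch (the logits array and function name are hypothetical, not any particular library’s API):

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 50) -> int:
    top = np.argpartition(logits, -k)[-k:]           # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())  # softmax over just those k
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs))       # sample one token id
```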
Using it for text messages is terrible. They’re too short. For an LLM to be able to predict accurately, it needs context. With an LLM compression scheme you’d have to feed so much of the text uncompressed it wouldn’t pay off. There’s also too much variance. Predicting something like a timestamp is futile. The data is all over the place.
It’s noteworthy that LLMs are really good at generating fake text messages with logical names, timestamps, titles, slang, and emojis. The encoder is able to capture many legitimate states, but not in a way that the decoder can extract the one correct state.
I don’t think that we did. Data compression is a trade-off of compression ratio, data generalization, encode resources (compute, RAM, ROM), and decode resources. A core technique for video codecs is to spend more resources once on the encode side to get a better compression ratio with fewer decode resources. If we are speculating on the future of compression, then we need to consider all of these factors.
Otherwise I could claim to compress Wikipedia into 0 bits with an algorithm about the size of Wikipedia.
No, my point was the compression ratio would be infinite (input size / output size) while the algorithm size would be huge (the same as the input size).
It was a light-hearted attempt at demonstrating how all of these factors are interrelated (trade-offs).
That’s my point. Compression scoring != Compression ratio. You have to consider more than the compression ratio when evaluating (scoring) an algorithm.
The wiki contest mentioned above also includes requirements on time, compute, and RAM.
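In the spirit of that contest (a hedged sketch; the exact rules differ), the score counts the decompressor itself, which closes the “algorithm the size of Wikipedia” loophole:

```python
def score(compressed_size: int, decompressor_size: int) -> int:
    # Smaller is better; a huge "algorithm" pays for itself in the score.
    # (Time, compute, and RAM are enforced as hard caps rather than scored.)
    return compressed_size + decompressor_size
```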
“(Seven hundred) and ((three plus one) hundred) and (seventeen) equals” is a pretty tortured interpretation but not impossible. But you’re right, there are probably other examples that are more ambiguous.
A basic statistical model with a kludge bolted on to run the calculator routine could have come up with that. We need a better example.
I’m beginning to think that computers will not and cannot ever achieve true understanding of anything; then again, much of the time, neither do humans. We are all LLMs, at least from time to time.
Well, yes. But I can keep coming up with examples, and for each one you can say “a basic statistical model with an X bolted on” will solve it.
Real-world compressors do have these things bolted on. For example, some recognize x86 machine code specifically and rearrange it to be more compressor-friendly. Normally, the bits get packed together in a way that standard entropy compressors don’t handle well. If you sorta unpack them using domain knowledge, they do much better.
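A minimal sketch of the idea, modeled on the “E8” call filter that real compressors apply to x86 code (LZMA’s BCJ filter works along these lines; details here are simplified, and a real filter handles more opcodes and edge cases). Converting relative CALL targets to absolute addresses makes repeated calls to the same function byte-identical, which a generic compressor then picks up on:

```python
import struct

def e8_filter(code: bytes) -> bytes:
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:  # CALL rel32 opcode
            rel = struct.unpack_from("<i", out, i + 1)[0]
            absolute = (i + 5 + rel) & 0xFFFFFFFF   # relative -> absolute target
            struct.pack_into("<I", out, i + 1, absolute)
            i += 5
        else:
            i += 1
    return bytes(out)
```

The decoder just runs the inverse transform, so no information is lost.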
But you have to write one of these for each special case, and if the special case isn’t very prevalent then you can’t really justify it. It would be nice to have something that generalizes across a wide range of special cases, and that can be trained without human intervention–like an LLM.
For a completion of “One plus one equals”, it will.
In general, yes, but it doesn’t have to be. As you say, the best compressors currently hit around 0.9 bits/character. For a typical 6-character word, that’s 5.4 bits, or around 42 possibilities (2^5.4 ≈ 42). So if the LLM can narrow things down that much on average, it’ll beat the best compressor.
Actually, it doesn’t have to even be that good, because we can weight based on probability. Even narrowing it down to 100 words–as long as it can indicate which ones are better candidates–would do the job.
Of course sometimes this will totally fail. But sometimes it’ll beat the average as well. The predictions don’t have to be perfect, just pretty good.
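To make the weighting concrete (a toy sketch; the distribution is invented, but any LLM’s next-token probabilities would slot in the same way): with an ideal entropy coder, encoding the actual next word costs -log2(p) bits, where p is the probability the model assigned it. The model doesn’t need to be right, just to put decent mass on the right candidates.

```python
import math

def bits_for(predicted: dict[str, float], actual: str) -> float:
    """Bits an ideal arithmetic coder spends on the word that occurred."""
    return -math.log2(predicted[actual])

# Model narrows the next word to ~100 candidates, most mass on a few:
probs = {"times": 0.5, "days": 0.2, "years": 0.05}  # ...plus ~97 others at ~0.0025
print(bits_for(probs, "times"))  # 1.0 bit -- way under the 5.4-bit baseline
print(bits_for(probs, "years"))  # ~4.3 bits -- a miss, but still cheap
```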