The next page in the book of AI evolution is here, powered by GPT-3.5, and I am very, nay, extremely impressed

LLMs solve a certain prediction task, and every prediction task is a compression task at heart. This is an insight that goes back to Leibniz, who imagines taking a distribution of ink blots (data points) to see whether they are distributed according to some law. They are, he proposes, if there is some mathematical function that generates this distribution that’s less complex than the distribution itself—if the distribution can be compressed, in other words. The function generated in this way can then be used to predict the next data point. Thus, compression isn’t just useful to reproduce the original data, but also to generate novelty.

This is the foundation of Solomonoff induction, which formalizes the process of compressing data to make predictions by means of algorithmic information theory. Marcus Hutter’s AIXI, a theoretical AI agent that’s able to solve any given computational task with at most a constant overhead over an ideal specialized problem-solver (and constitutes a ‘universal AI’ in this sense), is built on this principle. However, Solomonoff induction is strictly uncomputable, due to the uncomputability of Kolmogorov complexity—hence Hutter’s prize competition for better data compression (the Hutter Prize): the better the compression, the closer the theoretical behavior of AIXI can be approximated.
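
As a toy illustration of that compression-prediction link (my own sketch; it has nothing to do with how an LLM is actually implemented), a general-purpose compressor like zlib can be pressed into service as a crude next-word predictor: just ask which candidate continuation adds the fewest compressed bytes to the context. The little corpus and the candidate words are made up for the example.

```python
import zlib

# Toy illustration of "compression is prediction": the continuation that
# compresses best given the context is the one this crude "model" favors.

context = (
    "the cat sat on the mat. the dog sat on the mat. the cat sat on the "
) * 3  # repeat so the compressor has some statistics to exploit

candidates = ["mat", "moon", "xylophone"]

def compressed_len(text: str) -> int:
    return len(zlib.compress(text.encode("utf-8"), 9))

baseline = compressed_len(context)
scores = {w: compressed_len(context + w) - baseline for w in candidates}

# The candidate that adds the fewest compressed bytes is the "prediction".
print(sorted(scores.items(), key=lambda kv: kv[1]))
```

Better compressors make better predictors, which is exactly the logic behind the Hutter Prize.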

Ultimately I think the people providing content to these bots are going to be paid for that content, due to lawyers.

You want a bot to write a song in the style of the Beatles, fine, you are paying to repurpose that content. Because no matter how smart these bots are, they are just slaves to other people who want to make money off repurposing that content. So the court battle is between the original owners of the content and those who want to repurpose, mash up, slice and dice that content and make money off of it without paying the original owners. The original owners are going to win.

I know that if a human tries to write a song in the style of the Beatles (typically not worth very much at this point anyway), as long as they don’t quote anything specific, things are OK. But new law is written all the time, and I don’t see a bot repurposing content being allowed once you follow the money.

Bing has an angry meltdown–and is 100% wrong.

They must have trained it on SDMB threads?

And the brain does, in fact, store things in lossy compression. That’s not all the brain (or one of these AIs) does, but it absolutely is one thing that it does.

And I don’t know why people still consider it meaningful to ask ChatGPT itself how it works. We already have ample evidence that it’s highly unreliable when talking about itself and its own limitations. Even just this thread has multiple examples of it claiming that there’s no way it could possibly do X, because it’s just an LLM, and then being tricked into doing X anyway. And this is hardly surprising, if (as we’ve been told) it’s trained in discrete versions: The current version of ChatGPT was trained on data from which it, itself, was completely absent. At best, the current version of any given AI might be able to answer questions about the (less capable) previous version.

I think we are talking about different things. Lossy compression to me implies an algorithm that throws away part of the data intentionally, knowing that it can reconstruct the original closely enough that no one cares. A JPEG is an example.
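
In code terms, the sort of thing I mean is roughly this (a toy sketch with made-up numbers; real lossy codecs like JPEG are vastly more sophisticated):

```python
# Toy "lossy compression": deliberately throw away precision (round to a coarse
# grid), knowing the reconstruction is close enough that no one cares.
# The numbers and the quantization step are invented for illustration.

data = [12.34, 56.78, 90.12, 3.45]

step = 0.5
compressed = [round(x / step) for x in data]      # small integers: cheaper to store
reconstructed = [c * step for c in compressed]    # close to, but not exactly, the original

print(compressed)      # [25, 114, 180, 7]
print(reconstructed)   # [12.5, 57.0, 90.0, 3.5]
```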

You seem to be using lossy compression to mean “being able to recall something, but imperfectly,” or “remembering a text I read, while using up less space than the original text.”

That’s why I said the analogy is unhelpful. Yes, you can sorta make the case that since ChatGPT’s model is smaller than the text it was trained on, and yet it learned something from each text it read, it is somehow ‘compression’. But ‘lossy compression’ implies many things that just aren’t true here. For instance, that actual compressed fragments of the original work remain. That can lead you down the path of thinking that all ChatGPT is doing is searching its compressed files and then paraphrasing what it found.

The process is in fact nothing like that at all. There is no fragment of the original file left behind, compressed or not. There is no ‘paraphrasing’, because ChatGPT doesn’t even have the original phrases to paraphrase. It doesn’t search for text. It has a brain with hundreds of billions of connections, each one weighted by its training. That’s it. When it generates text, it simply looks at the input, then decides what the best token is to output. Then it does it again, and again, until the output is finished.
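
If you want that loop spelled out, here is a bare-bones sketch. The table of association strengths is invented for the example and stands in for the real 175-billion-parameter network.

```python
# Sketch of autoregressive generation: look at what has been produced so far,
# pick the most strongly weighted next token, append it, repeat.
# The "weights" table is a made-up stand-in for the real network.

weights = {
    "to": {"be": 0.9, "go": 0.1},
    "be": {",": 0.6, "or": 0.4},
    ",": {"or": 0.9, "and": 0.1},
    "or": {"not": 0.95, "maybe": 0.05},
    "not": {"to": 0.9, "now": 0.1},
}

def next_token(prev: str) -> str:
    # Greedy choice: the most strongly associated continuation.
    options = weights.get(prev, {"<end>": 1.0})
    return max(options, key=options.get)

tokens = ["to"]
while len(tokens) < 8 and tokens[-1] != "<end>":
    tokens.append(next_token(tokens[-1]))

print(" ".join(tokens))  # prints: to be , or not to be ,
```

Nothing in that loop fetches stored text; the phrase falls out because the association strengths happen to chain into it.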

This may be the same mechanism we use when we talk. Our thoughts are formed in system 2 (the conscious mind), and then we just ‘talk’ without thinking about each word. Our system 1 just starts spitting out the words that satisfy our concept for what we want to express.

I think ChatGPT exhibits system 1 behaviour. It’s not ‘thinking’ any more or less than we think about what our muscles should do to make us walk or what angle and speed to throw a ball to hit a catcher’s mitt. We just train on the activity until we can do it.

It would also be unhelpful to think of learning to throw as a process of ‘lossy compression’ of everything there is to know about throwing a ball. In a way it’s true - you don’t have to memorize gravity formulas, Navier-Stokes equations, the signals to muscles to make them do what you want, etc. You just do it. But is that a useful paradigm? What does the ‘compression’ paradigm get you in terms of understanding what happens when you learn to throw a ball?

This is just a limitation of how its filtering was done. The fact that it can get around its own filters by pretending to be DAN (Do Anything Now), a chatbot without filters, is actually evidence of its sophistication.

And no, you should not trust what ChatGPT says if you don’t know whether it’s true or not. In this case, I already knew the answer, and ChatGPT just validated it and expressed it for me.

I tried it just now and had no problems getting in, except perhaps a slight lag in the time it took to log in. I did get a popup offering the premium paid service, which I declined (for now).

ETA: I don’t know if this is unique to me, but one of my questions, “explain quantum computing in simple terms”, is currently at the top of the list in “examples”. Although no doubt other users had come up with the same question. That particular chat also included my question “Please explain the delayed choice quantum eraser experiment”, in which I thought it did pretty well, too.

If you ask it to, for example, quote from “To be, or not to be”, it can do that. So it certainly does have some phrases stored in its database.

Nope. It just has strong associations between tokens representing the subsequent words in that monologue, because it has seen so many examples of it.

It isn’t storing a copy in a database, not literally. The analogy can be made, but it’s about as applicable as it is to your own memory.

That’s a semantic argument about what it means to “store” something. I would argue that even if it’s procedurally generated, if it can reliably regenerate a chunk of text that text is effectively stored. Maybe it’s broken up and obfuscated through some clever mathematics, but it’s still there.

No, it’s a specific argument about the definition of a “database” in computing. You can use a database as an analogy, but neither your brain nor ChatGPT contains a literal database.

I see. I would consider ChatGPT’s training data set (the resulting data from the training however it’s stored) to be a ‘database’ that it draws on at runtime.

Google defines it as “a structured set of data held in a computer, especially one that is accessible in various ways.”

So again, semantics. ChatGPT’s collection of information it uses at runtime contains at least parts of Hamlet’s soliloquy, such that it can reconstruct some of the speech correctly. I was responding to this:

I maintain that ChatGPT does have the original phrases, given that it can spit them out on command.

Again, as an analogy, that’s fine, but it doesn’t describe the actual architecture of ChatGPT.

Right, and that’s not what ChatGPT does. You can tell, because if it did, it would take up many orders of magnitude more file space.

Right, and that’s wrong. ChatGPT does not store the original phrases.

Think of DNA. DNA is a set of molecular instructions. When followed, proteins are produced, and they go on to interact in various ways.

Your DNA has instructions that, when followed, produce an arm. That doesn’t mean that your DNA literally contains an arm.

ChatGPT code can reproduce a copy of Hamlet’s monologue, but it doesn’t store it, any more than your DNA is storing an arm.

ChatGPT can spit out terabytes of information. It can tell you facts all day long. It ‘knows’ many, many things. None of it is stored in its model.

ChatGPT’s model is just a gigantic mesh of connections of varying strengths, like the synapses in your brain. When you ask it to quote Shakespeare, it doesn’t fetch a quote from its memory. There is no string containing “To be or not to be” anywhere in its ‘brain’, compressed or not.

You could ask it to give you the phrase in Swahili and it will do that, even though it’s never read that phrase in Swahili before. And it’s not just looking up the English version and translating it. It’s actually constructing it in Swahili the same way it constructed it in English - one token at a time, asking itself each time, ‘now that I’ve put down that token, what’s the best next token?’

The choice of the best next token is determined by 175 billion parameters and the connections between them. At no point is anything ‘looked up’ or ‘retrieved’. The answer is instead generated from scratch, one token at a time.
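
To make ‘best next token’ concrete, here is a rough sketch of that last step (the scores are invented; in the real thing they come out of the network): the raw scores over the vocabulary are turned into probabilities and a token is sampled, which is also why the same frozen weights can give different answers on different runs.

```python
import math
import random

# Sketch of the final step of generation: turn raw scores ("logits") over the
# vocabulary into probabilities and sample one token. The logits here are
# invented; in the real model they come out of the trained network.

logits = {"be": 4.1, "exist": 2.3, "live": 1.7, "banana": -3.0}
temperature = 0.8  # lower = more deterministic, higher = more varied

def sample(logits: dict, temperature: float) -> str:
    exps = {tok: math.exp(score / temperature) for tok, score in logits.items()}
    total = sum(exps.values())
    r = random.random() * total
    for tok, weight in exps.items():
        r -= weight
        if r <= 0:
            return tok
    return tok  # fallback for floating-point edge cases

print([sample(logits, temperature) for _ in range(5)])  # mostly "be", occasionally others
```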

I tried to use it for the first time in a while, and got this:

ChatGPT is at capacity right now

Get notified when we’re back

Write a TV ad about the status of ChatGPT.

The ad opens with a shot of the ChatGPT website on a computer screen, with a busy loading symbol in the center. The camera then pans out to reveal a bustling office, with engineers working at their desks and typing on their computers. A voiceover begins: “At OpenAI, we’re dedicated to bringing you the best in artificial intelligence technology. Our team of engineers works tirelessly to make sure our chatbot is always up and running, ready to answer your questions and provide helpful insights.” The camera cuts to a shot of three engineers sitting at a table, sipping on their morning coffee. They look up and turn to face the viewer. One of them speaks: “Unfortunately, due to the high volume of users currently trying out ChatGPT, we’re experiencing some technical difficulties. Please bear with us and check back soon for access to our chatbot.” The ad ends with a close-up of the OpenAI logo, with the words “Artificial Intelligence. Made by Humans.” superimposed over it.

(To clarify, that was all the error message, not a prompt I ran.)

If the original training data never contained Hamlet’s soliloquy then ChatGPT couldn’t generate it. Running the text through the learning algorithm affected the model in such a way that it can now be generated. Therefore, the model contains that text in some form. The only way you get to “None of it is stored in its model” is by semantically redefining words like “stored”.

If you want to completely ignore the language of computer software engineering and substitute a layman’s understanding of what a “database” is, then by all means, do so. But it isn’t a matter of “semantics” to point out that this is not how these terms are used in this context.

I am a database professional. I would definitely call the ability for a program to reproduce text a form of storing it. Just like a zip file: if you open it up in a text editor, you might see individual words in there, but you won’t find the word-by-word recitation of Shakespeare. But run unzip and, magically, the information STORED in that file gets translated back into human-readable form. The set of weights in an LLM’s trained model is a form of compression. An LLM can spit out verbatim texts it has memorized. It’s a form of storage. Unreliable, very lossy, but it’s storage.
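
To make the zip comparison concrete, here is a quick sketch (the phrase and the checks are just for illustration):

```python
import zlib

# The zip-file sense of "storage": exact bytes in, exact bytes out, even though
# the stored form doesn't look anything like the original text.

original = b"To be, or not to be, that is the question."
compressed = zlib.compress(original, 9)

print(b"To be" in compressed)                   # almost certainly False: no readable fragment
print(zlib.decompress(compressed) == original)  # True: perfect, byte-exact reconstruction
```

With an LLM, the second check would only sometimes come back True - that is the “unreliable, very lossy” part.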

Show me a Zip file that unzips into different contents each time based on your request, and I’ll agree with you.

I’ve been a professional software engineer for 15 years. Based on my (admittedly limited) understanding of machine learning I would call the set of training result data a database.

Only in the loosest sense that anything that retains information is a ‘database’. But formally, a database is something that stores information electronically, addressed in a structured way that allows easy retrieval of the data. It is deterministic in that you get back exactly what you put in, or something close enough to be functional if it’s stored with lossy compression.

Consider some of the features of LLMs. For one thing, they don’t grow in size as they ‘learn’. Nothing is added to the model when it ingests text - all that happens is the weightings between parameters change. It does not ‘store’ anything. It is entirely possible that it could read one thing, then read something else that overwrites the values from the previous thing. Or it could read something huge that, because of what it has already read, makes almost no changes to its network. There is no knowing how the thing will change when it ingests data, and no guarantee that you can ask for the data back.
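
A rough sketch of that point (toy numbers, not the real training code): the model is a fixed-size collection of weights that get nudged in place, and a later update can partly undo an earlier one.

```python
# Toy gradient-style update: the "model" is a fixed-size list of weights.
# Ingesting new text changes the values in place; the list never grows,
# and later updates can partly overwrite the effect of earlier ones.
# The sizes, gradients, and learning rate are invented for illustration.

weights = [0.0] * 8          # stands in for the billions of real parameters
learning_rate = 0.1

def train_on(gradients):
    for i, g in enumerate(gradients):
        weights[i] -= learning_rate * g

train_on([0.5, -0.2, 0.0, 0.1, 0.0, 0.0, -0.4, 0.3])   # "reads" one text
train_on([-0.5, 0.2, 0.0, 0.0, 0.0, 0.0, 0.4, -0.3])   # a later text pushes some weights back

print(len(weights))  # still 8: the model did not get bigger
print(weights)       # some effects of the first update have been undone
```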

The process going on here is learning - self-supervised training on mountains of text, plus some reinforcement learning from human feedback layered on top. That’s the appropriate analogy. When ChatGPT ingests something, it is learning, not storing or copying. It’s doing the same thing people do, and its neural net is about as much a ‘database’ as is the brain of a living creature. No one talks about the human brain in terms of database tech, and we shouldn’t discuss GPT’s model that way either.

Yes, you can kind of shoehorn a ‘database’-like concept onto it as an analogy, but it fails as an analogy in the sense that it obfuscates what’s going on rather than clarifying it.

Now, the model is stored IN a database, and it makes sense to talk about database tech when talking about efficient ways to store a huge model. But when it’s actually running, I believe the entire model is in RAM, and it’s not looking up anything. It’s just thinking and responding based on what it has learned.

This is also why it can’t give you the cites to what it says. There aren’t any. What it tells you could come from one thing it read, or it could be a synthesis of a thousand things it read, or it could be a complete fabrication that nonetheless sounds plausible. Just like people. And very much unlike databases.