The next page in the book of AI evolution is here, powered by GPT 3.5, and I am very, nay, extremely impressed

More like the number of users they have has massively scaled up. ChatGPT has had the fastest adoption of any tech in history, by a mile. They have well over 100 million users now, and they are burning through billions in compute costs.

Interesting. I’ve rarely had an issue. Maybe one out of five logins I can’t get in. Though right now, it is indeed full. I tend to use it in the afternoon and evening.

Interesting take on ChatGPT by science fiction writer Ted Chiang, focusing mainly on the language and writing aspects of its use:

So there’s been lots of talk about how ChatGPT, perhaps bundled with Bing, might put Google out of business if they don’t deploy something similar soon. But that’s not the real problem here, is it? Even if Google deploys Bard-powered search or whatever, they’re kinda shooting themselves in the foot. Something like 80% of Google’s revenue comes from advertising, much of it served on the sites people are led to via Google search. But if people just get whatever answer they’re looking for regurgitated to them via chatbot, that secondary advertising revenue stream doesn’t happen; people see, at most, the ads on the search site itself.

To say nothing of the danger to web content itself. If those advertising revenues dry up, then so does the revenue stream of influencers, bloggers, and other content creators. But if they don’t create content, there’s nothing for ChatGPT or Bard to paraphrase, leading to the stream of new content drying up and the existing content growing ever more stale.

That’s something I’ve thought about. We may see lawsuits about content theft from websites whose revenue suddenly dries up because their content is being used by an AI.

One idea that came to mind for me is that Google could pay sites a percentage of revenue for any ads that display next to that site’s content. But then I realized that the problem with this is that trained AIs aren’t referencing any one site and then echoing the text. They’re just presenting what they’ve “learned” from their training data. As far as I know, their neural nets don’t contain any reference information. How could they? It’s just weights in a neural net.

There will need to be a whole new financial model for the web. Maybe the AIs can figure it out.

The analogy of Large Language Models to lossy compression is spectacularly unhelpful. That’s not at ALL what’s going on. And his idea that what these things are doing is ‘interpolation’ like a JPEG filter is also wrong. His explanation for why they don’t do math well is also wrong.

His idea that ChatGPT is just ‘paraphrasing’ what it has read is also wrong. It’s trivially easy to get ChatGPT to do something that can’t be a paraphrase of anything.

Neural networks are not tools for ‘lossy compression’, although they can sometimes be used for that. In the context of an LLM, we don’t know HOW they work, just how to set up an architecture to allow them to learn, then watch what happens. The abilities of these LLMs were not designed into them, they emerged. Sometimes an LLM can cough up something that is identical to an original - no ‘lossiness’ at all. Other times it may have ingested something it can’t recall at all, but reading it changed a bunch of numbers that change how it responds to other things. Just like human brains.

Take the ability to translate from one language to another. No one knew LLMs would be able to do that. The capability emerged at a certain scale. There’s no ‘lossy’ compression there - the thing can do perfect translations whenever something can be translated perfectly.

He suggests that the image generators operate analogously to JPEG algorithms that interpolate between pixels to fill in the ones in between. Again, that’s not at all what happens. Not even close. In diffusion models, a brand-new image is generated starting from nothing but Gaussian noise. If you ask the model to build you a picture of a monkey in a spacesuit, it basically knows what monkeys are and what spacesuits are from its training data. It’s seen millions of monkeys, and many spacesuits. So it builds the picture up step by step: at each step it predicts which parts of the current image are noise, removes a little of that noise, and lets the text prompt steer the result toward something that matches ‘a monkey in a spacesuit’. Eventually you get a completely new picture of a monkey in a spacesuit. It’s not a monkey that’s been drawn before or a spacesuit that’s been drawn before. It’s not a less-detailed version of some better picture. It’s a completely new work that has nothing in common with any particular existing work. There are ways you can get it to mimic styles and such, but all that does is change the conditioning, so each denoising step steers the image toward something that better fits that ‘style’.

This is incredibly simplified. I tried to read some of the technical papers, but the math behind it gets crazy very fast. But the core fact is that these models are not just ‘paraphrasing’ existing stuff as Chiang seems to think. They are GENERATING completely new content that has never existed before, and which doesn’t even have to bear a resemblance to anything created before.
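
If it helps, here’s a toy sketch of the shape of that denoising loop. predictNoise() and stepSize() are just stand-ins for the trained network and its noise schedule; this isn’t any real library’s API, it’s only meant to show that the process starts from noise and subtracts predicted noise step by step:

// Toy sketch of a diffusion sampling loop, NOT any real library's API.
// predictNoise() and stepSize() are stand-ins for the trained network and
// its noise schedule.
#include <random>
#include <string>
#include <vector>

// Stub: a real model would predict the noise present in the current image,
// conditioned on the text prompt.
std::vector<float> predictNoise(const std::vector<float>& image, int step,
                                const std::string& prompt) {
    return std::vector<float>(image.size(), 0.01f);
}

// Stub: a real sampler uses a learned/derived schedule for how much of the
// predicted noise to remove at each step.
float stepSize(int step) { return 0.02f; }

std::vector<float> generateImage(const std::string& prompt, int steps, int numPixels) {
    std::mt19937 rng(42);
    std::normal_distribution<float> gauss(0.0f, 1.0f);

    // Start from pure Gaussian noise: there is no source image to copy or interpolate.
    std::vector<float> image(numPixels);
    for (float& px : image) px = gauss(rng);

    // Repeatedly remove the noise the network thinks is there, steered by the
    // prompt, until only a clean image remains.
    for (int t = steps; t > 0; --t) {
        std::vector<float> noise = predictNoise(image, t, prompt);
        for (int i = 0; i < numPixels; ++i) {
            image[i] -= stepSize(t) * noise[i];
        }
    }
    return image;
}

int main() {
    std::vector<float> img = generateImage("a monkey in a spacesuit", 50, 64 * 64);
    return img.empty();  // trivial use of the result
}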

He says this about math:

The actual reason why 2-digit numbers work better is likely tokenization of the input text. LLMs like ChatGPT break their input down into ‘tokens’, which can be words, parts of words, whatever. The process of doing math is no different from how it does English: it predicts the most likely next token, then another, and another, until it’s done. It’s not ‘doing math’ at all. Numbers that span multiple tokens probably trip up that process, that’s all.
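
Here’s a toy illustration of that tokenization point. The vocabulary below is invented for the example (real tokenizers like BPE learn theirs from data), but it shows the idea: a greedy tokenizer that happens to know common two-digit chunks keeps ‘23 + 45’ tidy, while a six-digit number gets chopped into arbitrary pieces the model then has to juggle one token at a time:

// Toy greedy tokenizer over a made-up vocabulary, just to show why long
// numbers get split into awkward chunks. Real tokenizers (e.g. BPE) learn
// their vocabularies from data; the entries here are invented for the example.
#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <vector>

std::vector<std::string> tokenize(const std::string& text, const std::set<std::string>& vocab) {
    std::vector<std::string> tokens;
    size_t pos = 0;
    while (pos < text.size()) {
        // Greedily take the longest vocabulary entry that matches here,
        // falling back to a single character if nothing matches.
        size_t len = std::min<size_t>(text.size() - pos, 4);
        while (len > 1 && !vocab.count(text.substr(pos, len))) --len;
        tokens.push_back(text.substr(pos, len));
        pos += len;
    }
    return tokens;
}

int main() {
    std::set<std::string> vocab = {"23", "45", "12", "34", "56", " +", " "};
    for (const std::string& s : {std::string("23 + 45"), std::string("123456 + 234567")}) {
        std::cout << s << "  ->  ";
        for (const std::string& t : tokenize(s, vocab)) std::cout << "[" << t << "] ";
        std::cout << "\n";
    }
    // "23 + 45" comes out as a few clean tokens; "123456 + 234567" gets
    // chopped into chunks that don't line up with the digits' place values.
}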

He says, “It hasn’t been able to define the principles of arithmetic”. First, I’m not sure why we should expect a large language model to define the principles of arithmetic. But second, not long ago you could have said, “Even though the language model has been fed samples in a large number of languages, it is utterly unable to translate between them.” That was true of LLMs until they hit a certain scale; then suddenly the ability to translate between languages just emerged.

In fact, the ‘principles of arithmetic’ HAVE emerged. Smaller LLMs had no math ability whatsoever beyond putting together tokens that were sometimes correct. However, at a certain size, general math ability emerges.

Here’s an interesting paper on emergence in LLMs, from which the following information is taken:

In short, GPT-3 and LaMDA couldn’t do arithmetic AT ALL at first. Their accuracy was close to zero. But once they hit a certain size (13 billion parameters for GPT-3, and 68 billion for LaMDA), the ability to do this type of arithmetic just emerged. Try explaining that with a ‘lossy compression’ paradigm. It’s entirely possible that with enough additional scale, a successor to GPT-3.5 might suddenly be able to do advanced calculus. We don’t know, because it’s emergent and a surprise.

Some other capabilities that emerged from these models without anyone designing the capability in, or even necessarily anticipating it:

Transliteration
Modulo arithmetic
Word unscramble
Word in context (meaning of words in context, not dictionary meaning)
Math word problems
Instruction following
Multi-step reasoning
…and many more things, with more to come.

The interesting thing about the emergence of abilities in these models is that no one really understands how this happens, and no one can predict it. GPT-4 is rumored to have as many as 100 trillion parameters. No one knows what that thing will be capable of doing. Maybe nothing more will emerge and it will be just an incremental improvement over GPT-3.5, or maybe the thing will write masterworks and compose great songs. We won’t know until we test the model.

Back to the article in the New Yorker. The writer of the piece is a fiction writer. He has his biases, which lead him to believe that these things are spitting back unoriginal, perhaps lo-fi copies or paraphrases of other work. That supports the idea that they can never be great writers, and that what they are doing somehow takes away from human creators, or even amounts to copyright infringement. I think that has informed his view of how these things work.

If you read a book, would you say that you are engaged in ‘lossy compression’? I doubt it. You read the book and it changes the way you think about things that aren’t even in the book. And some stuff that’s in the book might be unimportant to you, and you completely forget you read it. This process is called ‘learning’, not compression. And that’s a better analogy for what these AIs are doing. The emergence of general capabilities that go beyond the training set should be a big clue that there’s a lot more going on here than ‘interpolation’ or ‘lossy compression’.

It would have been really good if he had actually talked to some AI scientists or quoted some for this piece. Or if his editors had run it by a few before publishing.

Not directly related to ChatGPT, but we’ve talked before about whether diffusion-based image generation models actually store images or not. The simple answer is that they can’t possibly, since there isn’t nearly enough data to store them in the final set of weights. The longer answer is that that’s still true, but with a tiny caveat:

However, Carlini’s results are not as clear-cut as they may first appear. Discovering instances of memorization in Stable Diffusion required 175 million image generations for testing and preexisting knowledge of trained images. Researchers only extracted 94 direct matches and 109 perceptual near-matches out of 350,000 high-probability-of-memorization images they tested (a set of known duplicates in the 160 million-image dataset used to train Stable Diffusion), resulting in a roughly 0.03 percent memorization rate in this particular scenario.

So they managed to come up with about 200 image recreations (which actually do look a bit like lossy JPEGs) out of 350,000 “high probability” candidates, which were themselves a tiny subset of the 160 million training images.

The finding doesn’t change the bigger picture, and in fact reinforces the idea that almost everything these models create is original. Though I wonder if groups doing the training will use this research to root out the rare cases where images were stored and remove them.

Tom Scott has a video on ChatGPT:

It actually prompted me to take my own look at Google Apps Script for a project I was working on, and I did so in the same fashion: using ChatGPT to help build it. I know JavaScript well enough and am a solid coder. But I don’t know the first thing about Google’s various APIs, which would have taken 99% of the development time to learn.

And it’s great. It’s actually a very satisfying experience. It’s like having access to a junior coder who is not that great at coding, but has read through all the documentation and knows the platform. I can instruct them on what I want, inspect the resulting code, ask for modifications, and so on. Sometimes I realize from the result that I wasn’t even asking for the right thing, and ask them to change it. Very quickly I can iterate to what I really wanted, and it doesn’t involve painfully reading reams of documentation.

One side effect of this might be to reduce the number of questions posted to Stack Overflow and the like, which might reduce the training set that OpenAI and the others have to work with. Then again, maybe it’s mostly using the raw documentation, or even the code. Acting as an advanced documentation search and summarization tool is a fantastic value-add from my perspective.

Yep, this is a big problem for Google. Google lost $100 billion in market cap recently. They blamed it on the errors in the ‘Bard’ demo, but I think it’s more likely that investors saw the demo and said, “Hey, even if this works perfectly, how are they supposed to make money?”

A related issue is that these things destroy the value of Google’s search algorithm. Until now, Google has stayed on top simply because they have the best search algorithm. But the LLMs don’t need an ‘algorithm’. Also, implementing these things is relatively easy, and you don’t need the resources of Google to do it. The entire ChatGPT model would fit on a smartphone. There are a dozen companies with the capability to train LLMs on huge datasets. Search could easily become commoditized and be something built right into products of all sorts, and the idea of going to a special web site to search for something might not even be a thing in the future. We don’t know, but Google is terrified.

Again, these things aren’t ‘paraphrasing’. If someone asks you what kind of TV they should buy, and you’ve read a lot of TV reviews and talked to people who own TVs and bought them yourself, and therefore have a good idea of what’s good and bad, are you ‘paraphrasing’ if you say, “I think Toshiba makes the best TV”? Are you stealing anyone’s content?

But you do have a good point about some content. What happens to people whose content is based on literal real-world effort? Say, a review site that sets up 10 routers and measures their performance. If ChatGPT reads that, and then says, “Broadcom is the best performing router” when someone asks, it is leaning on the physical work that other site did to generate that information.

It may not be plagiarism any more than if I had told you the same because I read the site myself, but it’s problematic when it can scale like this, and if reviewers can’t drive traffic to their sites, they’ll eventually stop reviewing.

Maybe we’re going to eventually need a whole new paradigm for web searching and monetizing. It’s ripe for disruption anyway. SEO specialists have gotten so good, and Google’s promoted content and other advertising has really made their search substantially worse. I used to marvel at how the best result always seemed to be in the top two or three, but now I often have to scroll past a page or two of ads and promoted content before I can find anything even remotely close to what I’m looking for. Or worse, I’ll click on one of the first links that looks good and waste some time browsing it before realizing it was a freaking ad or promoted content and not at all what I wanted.

I just did the same thing. I have a NeuroSky ‘MindWave’ headset that I bought for a project involving using a lamp to provide feedback for meditation. It has a Bluetooth connection, and I use an ESP32 processor for the lamp. I knew there was an API out there for the headset, but hadn’t had the time to research it. So I just asked ChatGPT if it knew of this API. It did, so I asked it to write a C++ module to use the API to sample the values from the headset. And it did. I’m about to test it tonight. This is simple test code I had it write, but if it works I plan to try to incrementally get it to build me a full class-based lamp application with web server and web front end.

#include <SoftwareSerial.h>
#include "Mindwave.h"

Mindwave mindwave;
SoftwareSerial mySerial(2, 3); // RX, TX

void setup() {
  Serial.begin(9600);
  mySerial.begin(9600);
  mindwave.begin(&mySerial);
}

void loop() {
  if (mindwave.available()) {
    float attention = mindwave.getAttention();
    float meditation = mindwave.getMeditation();
    Serial.print("Attention: ");
    Serial.print(attention);
    Serial.print(", Meditation: ");
    Serial.println(meditation);
  }
}

Here’s its description of the code:

Neat! The code is simple enough, but the issue, as always, is that translating poor documentation into code is fraught. The Arduino documentation is… not great. Especially for third-party libraries.

My project is likewise using an ESP32. I wanted to simply log the data to a spreadsheet online somewhere. I’m currently logging it to some online service that (in retrospect) seems designed around holding your data hostage. But via ChatGPT I discovered that I can do it via Google Apps Script as well. Simply POST the data to a specific URL and it will appear in a Google Sheets doc. Perfect.
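
For anyone playing along, the ESP32 side is just an HTTP POST. A minimal sketch, assuming an Apps Script web app is already deployed with a doPost() handler that appends a row to the sheet; the URL, WiFi credentials, and JSON field names below are placeholders:

// Minimal ESP32 sketch: POST one JSON sample to a Google Apps Script web app
// that appends it to a sheet. The URL, WiFi credentials, and field names are
// placeholders, and the Apps Script doPost() handler is assumed to exist.
#include <WiFi.h>
#include <WiFiClientSecure.h>
#include <HTTPClient.h>

const char* WIFI_SSID  = "my-network";   // placeholder
const char* WIFI_PASS  = "my-password";  // placeholder
const char* SCRIPT_URL = "https://script.google.com/macros/s/DEPLOYMENT_ID/exec";  // placeholder

void setup() {
  Serial.begin(115200);
  WiFi.begin(WIFI_SSID, WIFI_PASS);
  while (WiFi.status() != WL_CONNECTED) delay(500);
}

void loop() {
  WiFiClientSecure client;
  client.setInsecure();  // skip certificate validation; fine for a hobby logger

  HTTPClient http;
  http.begin(client, SCRIPT_URL);
  http.addHeader("Content-Type", "application/json");

  // One sample; the Apps Script side parses the JSON and appends a row.
  int status = http.POST("{\"attention\": 42, \"meditation\": 37}");
  Serial.printf("POST returned %d\n", status);
  http.end();

  delay(60000);  // log roughly once a minute
}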

BTW, I highly recommend using ArduinoOTA for the programming. It’s so very nice to iterate on my projects over WiFi (including not having to push that stupid reset button on the ESP32!).
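
The setup really is only a few lines. A minimal sketch for the ESP32 (hostname and WiFi credentials are placeholders):

// Bare-bones ArduinoOTA setup for an ESP32, using the ArduinoOTA library that
// ships with the ESP32 Arduino core. Hostname and WiFi credentials are placeholders.
#include <WiFi.h>
#include <ArduinoOTA.h>

void setup() {
  WiFi.begin("my-network", "my-password");  // placeholders
  while (WiFi.status() != WL_CONNECTED) delay(500);

  ArduinoOTA.setHostname("meditation-lamp");  // name that shows up in the IDE's port list
  ArduinoOTA.begin();
}

void loop() {
  ArduinoOTA.handle();  // must be called regularly so OTA uploads are accepted
  // ... rest of the lamp code ...
}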

Interesting. My ESP32s don’t need the reset button pushed to upload new code. But yeah, it would be great to do OTA updates. Thanks for the link.

I don’t use the Arduino IDE - I use Visual Studio with the Visual Micro plugin. I like having the power of Visual Studio over the Arduino IDE. I’ll have to see if I can get the OTA stuff to work with that.

If we don’t know how they work, then how can we so confidently state that they’re not just lossy compression? Just because they weren’t designed to be lossy compression doesn’t mean that’s not the net effect.

And LLMs, at least as they exist now, also can’t do searches. Sure, they can answer questions, but that’s not actually what I want: Any idiot can answer questions. I want authoritative answers, which means I want sources for the information.

And sure, maybe the next generation of LLM will be able to cite its sources. When they do that, then they can start competing with Google. But when they do that, Google will be able to monetize them the same way it’s monetizing its search now.

In the case of diffusion models, we can confidently say that the data isn’t there. But mostly, what we can confidently say is that anyone that confidently says these are just lossy compression has no basis for such a statement.

Interesting. Questions answered are exactly what I want. Google is increasingly just becoming a front-end for Wikipedia, Stack Overflow, and a handful of other useful repositories. Almost everything else on the web is useless. I have the means for independently testing the reliability of many answers (such as running the code it generates), so it’s not always the case that I need a cite.

Why does Google or the advertiser care whether a customer clicked an ad on the Google homepage vs on MyLittlePony.com?

Google wouldn’t be missing out, the website that you’re no longer clicking through would be missing out.

Google makes money by selling sponsored ads in the search results. It also makes money by selling ads in the sidebar, which advertisers purchase by giving Google a list of keywords to associate with the ad. If a user uses one of those keywords in a search, the ad will pop up along with the search results.

Google AdSense places advertisements on other web pages, with the content controlled by Google. Clicking on those ads makes money for both Google and the website.

If people search and get text answers instead of a series of links, Google loses out on the ability to sell sponsored links. If users don’t go to the specific web sites for answers but get them directly from the AI, Google’s AdSense revenue drops. They still might manage to sell ads on the search page, but if search starts getting incorporated into non-browser products, they’ll lose that too.

Google has other revenue like cloud services, YouTube ads, some hardware and software sales, etc. But ad revenue makes up more than half of all their revenue. They could easily lose half of that or more if they can’t come up with a way to retain that revenue while staying competitive in search.

Google serves ads by building profiles of users through search history, browser history, etc. Nothing about that changes, except now it’s chat history instead of search history. Google could just as easily put sponsored links in chat results, or update ads on the page in real time based on chat interactions.

Sure, they’d have to adapt the chat service from merely next token prediction, but, no, the interface wouldn’t change their entire business model.

Heck, you could have interactive dialogue that ends with Google recommending different vendors to buy things.

Retrieval-augmented chatbots are currently being explored. I imagine Google is working on a chatbot that retrieves information from their vast store of user profiles. If they do pivot to a chat interface, we’ll end up seeing automated salesmen that I’m sure people will pay for.

I should rephrase that. We know how they work in about the same way we know how the brain works. There are lots of neurons, they fire under certain conditions, and we know quite a bit about the very low-level mechanisms. The same is true for the language models. We know what they do when they read text or images. We understand the basic mechanism of learning: transformers, loss functions, backpropagation, that sort of thing.

What we don’t understand is exactly how the neural net does what it does. We know it’s not storing stuff in specific places. There is no ‘image memory’ or anything like that. It’s just a neural net with 175 billion parameters, which are roughly analogous to the connections between neurons in the brain.

Here’s a better analogy than ‘lossy compression’: complex iteration. You can take a very simple formula, iterate it thousands of times, and the result can be the Mandelbrot set or something completely different. Cellular automata are very simple data structures and rules which, when iterated over repeatedly, can create amazingly complex things, even universal computers.
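
The Mandelbrot set really is just one line of arithmetic, z = z*z + c, iterated over and over; here’s a tiny ASCII renderer to make the point that endless structure falls out of a rule nobody ‘designed’ it into:

// The Mandelbrot set: iterate z = z*z + c and see whether it stays bounded.
// A one-line rule, iterated, producing structure nobody put there on purpose.
#include <complex>
#include <iostream>

int main() {
    for (double im = 1.0; im >= -1.0; im -= 0.1) {
        for (double re = -2.0; re <= 0.5; re += 0.04) {
            std::complex<double> c(re, im), z(0.0, 0.0);
            int n = 0;
            while (std::abs(z) <= 2.0 && n < 50) { z = z * z + c; ++n; }
            std::cout << (n == 50 ? '#' : ' ');  // '#' means the point stayed bounded
        }
        std::cout << '\n';
    }
}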

So we have these LLMs, and we constructed the transformer architecture, the loss functions, etc. We understand all that. We get what’s going on. But then to ‘train’ them we iteratively adjust the model trillions of times using terabytes of data. With each iteration, billions of values in the net might change. Connections between one parameter and others are established or made stronger or weaker. And then the training data is thrown away, and all that remains is the set of new connections and parameter values.
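
Stripped down, each of those training iterations is just one gradient-descent update applied to every parameter at once (a simplified view; real training uses fancier optimizers, but the shape is the same):

\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \, L(\theta_t;\ \text{batch}_t)

where \theta is the full set of parameters (175 billion of them for GPT-3), \eta is a small learning rate, and L measures how badly the model predicted the next tokens in that batch of text. Do that trillions of times and the parameters drift into a configuration nobody designed.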

And at a certain number of iterations, suddenly the thing can talk. Billions more iterations, and it can do 2-digit math. Billions more, and it does 8-digit math. Oh, and suddenly it can translate between languages and write computer code.

How it does all that is the HOW we don’t understand, in the sense of being able to see how all those numbers come together and generate a sonnet or draw a picture of a space cat. These capabilities simply emerged from the process, unexpectedly and unpredictably. But we do know it’s not ‘compression’ - the model is pretty much the same size at the end of training as at the beginning, and it has capabilities that were never part of its training data.

Here’s ChatGPT’s response:

That’s pretty much what people do.

You can get it to generate stuff that can’t possibly be a paraphrase, such as a poem where every line is in a different language yet should rhyme when pronounced in English. Or a poem about something that doesn’t exist.

I just had it write a poem about ‘message boards’ where the first letter in every line spells out CECIL ADAMS:

Communities on the web,
Electronic message boards we’ve met.
Communication, that’s the key,
Ideas, opinions, all set free.
Lively discussions every night,
A diverse community, what a sight!

Debates of truth, facts to defend,
Arguments made 'til the very end.
Minds are sharp, quick to assess,
Statements made with finesse.

I’m pretty sure that hasn’t been done before.

The idea behind using an LLM as search is that there are no links to click on, just an answer. But in Google’s case, I suspect they will try for a hybrid approach where they give a text answer AND a bunch of clickable links to web pages. That will work for some kinds of searches where people are specifically looking for a web site, but for lots of other queries a good answer means the user will never click a link.

For example, today if I want to find out who the NHL’s top five scorers are, I will search for that and then get a bunch of links to various sites that have NHL player stats. Then I can pick one, navigate to it, and look at their stats table.

In the future, I ask for the top scorers in the league, and it just gives me the list. Nothing to click on. Perhaps they can put a relevant ad beside the text, but if I’ve already got my answer, the click-through rate will be worse.

They won’t lose all their ad revenue, just a portion of it. A bigger danger is that people won’t go to Google at all. Bing, DuckDuckGo, and whatever search engines come along may do just as good a job at search, cutting Google out of the loop completely.

I mean, Google could list the scorers and then link to them. Literally nothing stopping them from doing that. “Connor McDavid is the leading scorer. Read more about him at this link” or “In 2023 Connor McDavid has X goals. Did you also know X about him? <hyperlink to wikipedia/espn/whatever>”

And they already just give you answers to lots of things without requiring you to click a link. Someone should tell them how much revenue they’re stealing from themselves.