The next page in the book of AI evolution is here, powered by GPT 3.5, and I am very, nay, extremely impressed

Nick Cave was recently sent a ChatGPT song ‘in the style of Nick Cave’, and I think his response is very good, especially as a professional songwriter who thinks about his craft, and what he has to give vs the robot:

Songs arise out of suffering, by which I mean they are predicated upon the complex, internal human struggle of creation and, well, as far as I know, algorithms don’t feel. Data doesn’t suffer. ChatGPT has no inner being, it has been nowhere, it has endured nothing, it has not had the audacity to reach beyond its limitations, and hence it doesn’t have the capacity for a shared transcendent experience, as it has no limitations from which to transcend.

Full text here - Red Hand Files #218.

Prompt: Nick Cave was recently sent a song written by ChatGPT ‘in the style of Nick Cave.’ Write a response by Nick Cave where he thinks that only humans can write songs.

Summary

“While I appreciate the effort and creativity that went into creating this machine-generated song, I firmly believe that only humans are capable of truly writing a song. The process of songwriting is deeply personal and emotional, and it requires the complexity and nuance of human experience to truly capture those emotions in a meaningful way. Machines may be able to mimic certain aspects of the songwriting process, but they can never truly understand or convey the human experience in the way that a songwriter can. I’ll stick with the real thing and appreciate the human touch in songs.”

If that represents a 1-on-1 test of content, then humans win again

The chatbot version is very clearly written, but that is all. It misrepresents what the real Nick Cave thinks - I don’t think he’d say he appreciated its work as representing effort or creativity. In the article he calls it ‘bullshit’. He’s also not saying that writing a song amounts to having had a roster of real-life experiences to set to music. His reply is more about the self-reflection that follows from certain types of experience and forces you to push your creativity. Arguably, it’s not even written as Cave writes, which is much more complex.

Ultimately Ersatz Nick Cave Robo-Lite does not deserve my money or give me what full strength unleaded Nick Cave UltraPlus gives me.

I don’t think OpenAI is claiming yet that ChatGPT can not only reproduce human speech but also recreate the complete inner experience of every human on the planet.

Anyway, what Cave wrote is just as much bullshit as what ChatGPT wrote. Songs arise out of suffering? Well, maybe some do, but despite not being a songwriter I’m quite certain that not all songs, and probably not even a majority, arise from that. Cave’s most famous song is probably The Mercy Seat, about a man about to be executed. Since Cave has never been executed, by electric chair or otherwise, I would dispute his idea that it comes from a “shared transcendent experience” or whatever he wants to call it.

All creativity is a kind of fake. Art still has great value, of course, in spite of that. But whether or not a sufficiently sophisticated machine has some kind of internal experience, that doesn’t stop it from creating something that resonates with people. The message that a consumer of art receives often doesn’t resemble what the artist intended, or what their inner experience was. The errors in transmission run so deep that it really doesn’t matter what the origin was.

Where does it get this stuff?

In the TV series ER, how did Romano die?

Summary

In the TV series ER, Dr. Robert Romano, played by Paul McCrane, died in the ninth season of the show in the episode “On the Beach”. Dr. Romano’s death occurred as a result of a helicopter accident.

In the episode, Dr. Romano and several other characters, including Dr. Abby Lockhart and Dr. Neela Rasgotra, were sent to a relief camp on a beach in the Gulf of Mexico as part of a medical aid mission for victims of a hurricane. While at the camp, a helicopter that was supposed to take them back to Chicago crashes, killing Dr. Romano.

The death of Dr. Romano was a significant event in the series, as the character had been a regular on the show for several seasons and his death was met with a strong emotional response from both the characters and the audience.

So I gather that’s not what happened?

No, he died in the episode “Freefall”. The helicopter fell off the roof and crushed him on the street outside the hospital.

I don’t recall any episode in which there was a trip to the Gulf of Mexico for hurricane relief, but I didn’t watch the final seasons. It’s a wee bit implausible that there would be a helicopter ride all the way back to Chicago.

According to the ER wiki, “On the Beach” is the episode where Mark Greene dies. The rest of the AI’s synopsis appears to bear no resemblance to the episode.

Very interesting article on the emergence of capabilities in LLMs.

In short, no one programmed ChatGPT to write songs, or translate languages, or to understand context around words.

The image below shows a bunch of LLMs and their capabilities at various model sizes. The dotted red line is pure randomness. So, for example, all the models scored essentially at random on the ‘word in context’ test until they got to a certain size (around 100 billion parameters). Then suddenly the ability to do this just emerged.

Important to note that this generally isn’t a slow improvement, like the models getting better and better at each task as they grow, but a sudden ‘phase shift’ going from ‘not at all’ to ‘pretty good’ very quickly.

The other interesting thing is that this emergence is not predictable, either in the type of capability or in the model size required to get it. So no one actually knows what capabilities will emerge as these models scale. This strikes me as quite similar to how the brain processes data at the ‘system 1’ level, which is also emergent and unpredictable.
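To make the ‘rising above the random baseline’ idea concrete, here’s a tiny Python sketch with made-up numbers (purely illustrative, not the actual figures from the article): given accuracies at several model sizes, it finds the first scale that clearly beats chance.

```python
# Illustrative sketch only: hypothetical numbers, not real benchmark results.
# Given accuracy at several model sizes, find the scale at which performance
# first clearly rises above the chance baseline (the dotted red line).

CHANCE = 0.5    # e.g. a two-choice task like 'word in context'
MARGIN = 0.10   # how far above chance we require before calling it a capability

# (parameter count, accuracy) pairs -- made-up values for illustration
results = [
    (1e8, 0.50), (1e9, 0.51), (1e10, 0.49),
    (1e11, 0.65), (5e11, 0.78),
]

def emergence_point(results, chance=CHANCE, margin=MARGIN):
    """Return the smallest model size whose score clears chance + margin, if any."""
    for params, acc in sorted(results):
        if acc >= chance + margin:
            return params
    return None  # the capability hasn't emerged at any tested scale

print(emergence_point(results))  # 1e+11, i.e. around 100 billion parameters
```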

I suspect we are going to learn as much about ourselves and how we think from these LLMs as we will about AI, especially after we add more feedback and persistent state memory to these models.

“The GPT-4 rumor mill is a ridiculous thing. I don’t know where it all comes from,” said the OpenAI CEO. “People are begging to be disappointed and they will be. The hype is just like… We don’t have an actual AGI and that’s sort of what’s expected of us.”

I’ve said it before: It’s like a robot version of Cliff, on Cheers. Ask it any question at all, and it’ll have an answer for you. Often, the answer is correct, because it does actually know a fair bit. But it’s also very good at bullshitting, which it will happily do if it doesn’t know the answer, and it’s so good at it that there’s no way of knowing when it’s answering from knowledge versus when it’s bullshitting.

@Sam_Stone , I strongly suspect that it matters not just how many parameters these models have, but what kind of data they have access to. A model that’s fed diagrams as well as words, for instance, will be able to produce accurate geometry proofs much quicker than one that’s just fed words (and also much quicker than one that’s just fed diagrams, for that matter). And for some forms of routine human cognition, there’s not enough data in the world to effectively train an AI using the wrong sort of data (i.e., some things for which no possible amount of text would be sufficient without images, and other things for which no possible amount of images would be sufficient without sound, and so on).

Yep. I’m really looking forward to narrowly trained models with hyper-specific datasets. Even silly stuff. Like, what would happen if a GPT-4 model was trained on the scripts of every single Simpsons episode? What would it be able to do with that? Are there hidden layers buried in 30 years of cartoons? Who knows? It’ll be cool to see, though.

More importantly, the song sucks. I’d say it reads like a parody of Nick Cave, but parodies are supposed to be funny. The program gets the idea that Nick Cave’s songs are full of dark apocalyptic imagery–sometimes, and I say this as a huge fan, to the point of self-parody–but with original and creative themes and language. This reads less like Nick and more like a computer trying to sound like a depressed teen.

Maybe, but again these behaviours are emergent, and no one really knows what an LLM of a certain size will do until it’s trained. But the various models do sometimes see capabilities emerge at very different scales, so I would guess that differences in the training data caused this, as you say.

This is ‘fine-tuning’. We’ve had some debate here over how GPT updates itself over time. Does the model change? Or do new models get added? It’s a bit of both, it seems. The ‘pre-trained’ model is generally fixed, in the sense that all the parameter values set during training are left unchanged. To ‘fine-tune’ the model, additional layers can be added and trained, with that training not touching the pre-trained part.

It is possible to keep updating the ‘pre-trained’ model as new data is encountered, whether through web scraping, being fed info, or its interactions with people. But that’s very computationally intensive in a gigantic model like ChatGPT, and IMO isn’t being done here - at least not in real time. More likely, they have added layers that are being trained during the beta, which does change the performance of ChatGPT but does not affect its pre-trained weights. Or they may collect data and then train the model on it in a batch for a new release of the model itself.
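For the curious, here’s a rough PyTorch sketch of that ‘freeze the pre-trained part, add and train new layers’ approach. The model shape, sizes and learning rate are placeholders I made up - this is just the general pattern, not anything OpenAI has described:

```python
# Sketch of one flavour of fine-tuning: freeze the pre-trained weights and
# train only a small new head added on top. Toy sizes; not OpenAI's setup.
import torch
import torch.nn as nn

pretrained = nn.Sequential(            # stand-in for a frozen pre-trained base
    nn.Embedding(50_000, 512),
    nn.Linear(512, 512),
    nn.ReLU(),
)
for p in pretrained.parameters():
    p.requires_grad = False            # the base stays exactly as trained

head = nn.Linear(512, 50_000)          # the new, trainable layer added on top
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

def fine_tune_step(tokens, targets):
    """One fine-tuning step: gradients flow only into the new head."""
    with torch.no_grad():
        features = pretrained(tokens)  # the frozen base just produces features
    logits = head(features)
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```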

Fine-tuning will be a big part of the commercial use of LLMs, obviously. Legal LLMs with entire legal libraries in their fine-tuning data, and LLMs dedicated to specific activities and vertical markets, are already appearing. I could see a long-running show like The Simpsons having all of its scripts ingested by an LLM, which would then write dialogue, plots, or entire new episodes.

For those unfamiliar with the terminology, a Large Language Model is a neural net, which is a structure containing an input layer, an output layer, and any number of ‘hidden layers’ between them. These hidden layers, which are collections of numbers, are what transform the input into the output. The output is passed through a ‘loss function’ or something similar to test its correctness, and the result of that is fed back to adjust the hidden layers. Then the output is tested again. Repeat until an optimum accuracy score is hit. Then add more data and repeat.
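Here’s a toy version of that structure in PyTorch - a made-up little network, nothing remotely GPT-sized, just to show the input layer, hidden layers, output layer and loss function in code:

```python
# Toy illustration of the structure described above: an input layer, hidden
# layers, and an output layer, with a loss function scoring the output.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(10, 64),   # input layer -> first hidden layer
    nn.ReLU(),
    nn.Linear(64, 64),   # second hidden layer
    nn.ReLU(),
    nn.Linear(64, 3),    # output layer (say, 3 possible answers)
)
loss_fn = nn.CrossEntropyLoss()    # the 'loss function' that tests correctness

x = torch.randn(8, 10)             # a batch of 8 made-up inputs
y = torch.randint(0, 3, (8,))      # their (made-up) correct answers
loss = loss_fn(net(x), y)          # lower loss = output closer to correct
```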

When they talk about ‘100 billion parameters’ in a model, they just mean how many numbers are in the hidden layers. A ‘parameter’ therefore is just a number. Doesn’t matter if it’s an image model, a text model, or whatever. Parameters are just numbers, stored in a matrix or as vectors.
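To make ‘a parameter is just a number’ concrete, you can count them for the toy net in the previous sketch - every individual weight and bias is one parameter:

```python
# Continuing from the toy `net` in the sketch above.
# Each weight and bias is one number: (10*64 + 64) + (64*64 + 64) + (64*3 + 3) = 5,059.
n_params = sum(p.numel() for p in net.parameters())
print(n_params)   # 5059 -- a '100-billion-parameter' model is the same idea, scaled up
```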

So a huge language model like GPT is just a huge matrix of numbers structured as hidden layers in a neural net. The model doesn’t grow in size as it ingests information (or not substantially, anyway). Basically you start with a hundred billion numbers set randomly, then you start ingesting data. Each time GPT reads something, the numbers get modified, then a test is done to see if the resulting output is ‘correct’. If not, the numbers get modified some more, and the output is tested again. The adjustments are weighted according to how much each number affects the output. Repeat until you’ve read many gigabytes of data.
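And here’s a single training step on that same toy net, showing what ‘the numbers get modified’ looks like - the parameter values change, but the parameter count never grows (again, just an illustrative sketch, not GPT’s actual training loop):

```python
# Continuing with the toy `net`, `x`, `y` and `loss_fn` from the sketches above.
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

n_before = sum(p.numel() for p in net.parameters())
loss = loss_fn(net(x), y)
loss.backward()        # works out how much each number contributed to the error
optimizer.step()       # nudges every number accordingly
optimizer.zero_grad()
n_after = sum(p.numel() for p in net.parameters())

assert n_before == n_after   # the model hasn't grown; only the values have changed
```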

This is why GPT can’t give you search listings, or display references, or read documents back to you exactly. It is also why it is sometimes inaccurate. It has no storage at all of any of the actual text. It’s not ‘looking up’ facts. The model at the end of training has the same hundreds of billions of parameters it had at the start. The only things that change are the values of the numbers and their weighting, and perhaps other stuff outside of the model like loss functions, but I’m not sure.

But there is no place in ChatGPT that you can point to and say, “that’s where that fact is stored”. It’s more like a hologram - all facts it ‘knows’ are encoded across all the parameters. In image models like DALL-E it’s no different. It’s all just numbers. The difference would be in the functions that take in the data and the functions that output it. The actual ‘brain’ of the thing doesn’t have a single pixel of any image in it.

If anyone has a different understanding of the basic functionality of LLMs, or wants to add to this, fire away. I’m new to this stuff myself, and I’m trying to do a deep-dive into it and share what I’m learning here.

It absolutely matters what sort of data it was trained on. LLMs attempt to match the distribution of natural language, perhaps conditioned on a prompt. The data determines which distribution the LLM is matching, and the parameters determine how well it can match it.
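Roughly what ‘matching the distribution, conditioned on a prompt’ cashes out to, in a toy next-token example (generic code with random stand-in numbers, not any specific model’s internals):

```python
# In miniature: the model turns its output scores for the next token into a
# probability distribution, and cross-entropy measures how far that
# distribution is from what actually came next in the training text.
import torch
import torch.nn.functional as F

vocab_size = 50_000
logits = torch.randn(vocab_size)       # stand-in for the model's output scores
p_next = F.softmax(logits, dim=-1)     # distribution over the next token

actual_next_token = 1234               # the token that really followed the prompt
loss = F.cross_entropy(logits.unsqueeze(0),
                       torch.tensor([actual_next_token]))
# Training adjusts the parameters to lower this loss, i.e. to put more
# probability on the tokens that actually follow prompts like this in the data.
```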

Sure, and that’s exactly what I’d expect from a generalized AI in this stage. It’s not trained to be a Nick Cave or even generalized lyric machine. It sucks at even coming close to the style and themes of the poets I’ve tried. But with some other stuff, like way back last month when I asked it for It’s Always Sunny in Philadelphia scenes, it surprisingly (to me, at least) seemed to get a gist of the vibe of the show. Some parts wildly off; some parts quite believable. I wouldn’t expect pitch-perfect pastiches for quite some time to come. Like I’ve said many times before, maybe I’m just easily impressed, but the capabilities of this thing are unbelievable to me.

ChatGPT recently passed all three parts of the U.S. Medical Licensing Examination.

Well, that’s interesting, because just last week I gave it 10 randomly picked questions each from the USMLE Step 1, 2 and 3 sample tests, because I was interested in how it would do. My wife works for a medical school, and there’s been a lot of concern over medical students’ use of ChatGPT, so I was curious how well it would do on those exams. It did a pretty shitty job: 4 out of 10 correct on Step 1, 3 out of 10 on Step 2, and 5 out of 10 on Step 3. Maybe if I gave it more questions it would have done better, but I don’t have that kind of time (there are around 280 questions in the Step exams). A passing score requires getting around 60% of the questions correct.

Do note that some answers were scrapped as ‘indeterminate’ - cases where the AI would not give an answer it thought could be construed as medical advice.

But I’m impressed even with the numbers you posted.

I found that interesting as well since it never hesitated to tell me which answer it thought was correct and why. Every answer is pretty much “medical advice”.