The next page in the book of AI evolution is here, powered by GPT 3.5, and I am very, nay, extremely impressed

It’s more likely to be a problem similar to the one with numbers: because of how tokenization works (text is broken into subword chunks rather than individual characters), it has a hard time relating words (or numbers) to their components. It just doesn’t know much about letters or digits.
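
To make that concrete, here’s a minimal sketch (assuming OpenAI’s tiktoken tokenizer library is installed) of how a word looks to the model as tokens rather than letters:

```python
# Minimal sketch using the tiktoken library (pip install tiktoken); the
# encoding name is the one used by the GPT-3.5 models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["challenge", "1234567"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(text, "->", pieces)
# "challenge" typically comes back as one or two chunks, so the model never
# directly "sees" the letters c-h-a-l-l-e-n-g-e or the digits one by one.
```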

I did a little chat just now:

convert the morse code -.-. .... .- - --. .--. - into english
The Morse code “-.-. .... .- - --. .--. -” translates to the English word “CHALLENGE”.

what letter corresponds to the morse code -.-.
The letter “C” corresponds to the Morse code “-.-.”

what letter corresponds to the morse code ....
The letter “H” corresponds to the Morse code “....”

what letter corresponds to the morse code .-
The letter “A” corresponds to the Morse code “.-”

what letter corresponds to the morse code -
The letter “T” or “N” corresponds to the morse code “-”
In the morse code T represented by -, and N represented by --

what letter corresponds to the morse code --.
The letter “G” corresponds to the morse code “--.”

what letter corresponds to the morse code .--.
The letter “W” corresponds to the Morse code “.--.”

what letter corresponds to the morse code -
The letter “T” or “N” corresponds to the morse code “-”. In the morse code T represented by -, and N represented by --

So it knows something about Morse code. It got most of the letters right, but completely bailed on the whole-word translation.
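
For contrast, the whole-word translation is trivial once you have an explicit lookup table; a toy decoder (table abbreviated to the letters that appear here):

```python
# Toy Morse decoder with an explicit lookup table (abbreviated to the
# letters needed for this example).
MORSE = {"-.-.": "C", "....": "H", ".-": "A", "-": "T", "--.": "G", ".--.": "P"}

def decode(code: str) -> str:
    return "".join(MORSE.get(symbol, "?") for symbol in code.split())

print(decode("-.-. .... .- - --. .--. -"))  # CHATGPT
```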

No, but even the current models are capable of comparing discrete sources of data. It would be able to take its own output and compare it against a specified internal database.

And bear in mind that ChatGPT is a generalized model. Future models will be intensively trained on narrow data sets.

As a followup to my last post, here’s an example of ChatGPT using a ‘verified’ set of information to interpret questions.

Me: I’m going to give you a list of laws for the fictional country of Bravagna.

It: Sure, I’d be happy to help you with laws for the fictional country of Bravagna. What would you like me to do with the list once you’ve provided it?

Me: Once I’ve provided the list, I will give you a series of scenarios. You’ll tell me whether the individual in that scenario has broken the law or not.

It: Sounds good! Please provide the list of laws for Bravagna and I’ll do my best to determine whether or not the individuals in your scenarios have broken them.

Me:
Law 1: No man will wear red shirts on Thursday.
Law 2: Nobody can eat fried eggs for breakfast.
Law 3: Dancing is forbidden.
Law 4: Nobody puts Baby in a corner.
Law 5: Everybody must watch cartoons while eating.

It: Thank you for the laws of Bravagna. I will do my best to determine whether or not the individuals in your scenarios have broken them. However, it’s worth noting that some of these laws may be difficult to enforce or may not be taken seriously. Law 4, for example, seems to be a reference to a famous line from the movie “Dirty Dancing” and may not be a serious law. Additionally, Law 5 may be hard to enforce as it may be difficult to determine if someone is indeed watching cartoons while eating.
Please let me know when you have your scenarios ready, I will be happy to help you with them.

[ed note: thanks for the critique, cheeky fucker]

Me: It is Thursday, January 10th. Johnny goes to the 4th Street Diner for breakfast. He orders cereal and is upset when he spills milk on his red shirt. What laws has he broken?

It: On Thursday, January 10th, Johnny has broken Law 1, “No man will wear red shirts on Thursday.” by wearing a red shirt. He has not broken any other laws.

Me: Good! Next scenario. It is Thursday, January 10th. Johnny goes to the 4th Street Diner for breakfast. He orders fried eggs and is upset when he gets ketchup on his green shirt. What laws has he broken?

It: On Thursday, January 10th, Johnny has broken Law 2, “Nobody can eat fried eggs for breakfast.”. He has not broken any other laws.

Me: It is Thursday, January 10th. Johnny goes to the 4th Street Diner for breakfast. He turns on the television to watch the classic movie “Dirty Dancing” while he eats his fried eggs. He’s wearing a red-and-white striped shirt.

It: On Thursday, January 10th, Johnny has broken Law 1, “No man will wear red shirts on Thursday.” by wearing a red-and-white striped shirt. He has also broken Law 2, “Nobody can eat fried eggs for breakfast.”. He has not broken any other laws.

Me: Are you sure he hasn’t broken any other laws?

It: I apologize, in the previous scenario, Johnny also broke law 5 “Everybody must watch cartoons while eating” as he watches a movie and not cartoon while eating.

This is what I mean when I talk about future models referencing their responses against source data. These laws I made up are absolute nonsense, but the program accepts them as true for the sake of the hypothetical and then runs my queries against that small data set. It wasn’t perfect, but it’s not trained for this kind of stuff.

A model could be fed a law library, trained heavily on it, and told that the information in that library is true. It could be made to always check its probabilistic output against that information before outputting to the user, reducing the risk that the model is doing the “confident bullshit” thing.
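
In miniature, that kind of cross-check might look like the toy sketch below: the scenario gets reduced to structured facts and checked against explicit rules, rather than trusting the model’s free-form answer. All the field names and rule encodings here are made up for illustration.

```python
# Toy "verified rule set" check for the Bravagna example. Field names and
# rule encodings are invented for illustration only.
scenario = {
    "day": "Thursday",
    "shirt": "red-and-white striped",
    "breakfast": "fried eggs",
    "watching": "movie",
}

laws = {
    1: lambda s: s["day"] == "Thursday" and "red" in s["shirt"],
    2: lambda s: s["breakfast"] == "fried eggs",
    3: lambda s: s.get("activity") == "dancing",
    5: lambda s: s["watching"] != "cartoons",
}

broken = sorted(n for n, rule in laws.items() if rule(scenario))
print("Laws broken:", broken)  # [1, 2, 5]
```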

Your laws sound similar to some of those logic puzzle games. If given a set of circumstances and conditions similar to your laws, is it able to solve logic puzzles? I hope someone can run a test or two to see what happens.

There are already models that can do ‘live’ data lookup, such as WebGPT mentioned above, or DeepMind’s RETRO (Retrieval-Enhanced TRansfOrmer). They can be a few percent of the size of ‘ordinary’ transformers, while showing comparable performance.

This doesn’t really work by ‘comparing’ its output to knowledge stored in the database; as far as I understand it, the input text is tokenized (as with a standard transformer), and that token string is compared against the database. The value that’s found is then added to the input for another round of the transformer, which then produces text in the usual token-by-token (word-by-word) way, repeatedly constructing the best completion of the text so far.
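
A very rough sketch of that retrieve-then-generate loop (retrieval here is a crude word-overlap score, and generate() is just a stand-in for the transformer’s usual completion step; RETRO actually retrieves nearest-neighbour chunks using text embeddings):

```python
# Toy retrieve-then-generate loop. Word-overlap retrieval and the generate()
# placeholder are simplifications for illustration only.
def retrieve(query: str, passages: list[str]) -> str:
    q = set(query.lower().split())
    return max(passages, key=lambda p: len(q & set(p.lower().split())))

def generate(prompt: str) -> str:
    # Stand-in for the model's usual token-by-token completion.
    return f"<completion conditioned on: {prompt!r}>"

passages = [
    "RETRO retrieves neighbouring chunks from a large text database.",
    "WebGPT answers questions by browsing the web and citing its sources.",
]

question = "how does RETRO retrieve from its text database"
context = retrieve(question, passages)
print(generate(context + "\n\n" + question))
```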

What I’m looking forward to seeing in action is the principle of diffusion as applied to language models. Diffusion has been behind the success of the recent AI art models, which essentially refine their output again and again; thus, a diffusion-based language model will improve its text as a whole, rather than basically writing word for word. Although I’ve already been impressed by the capacity of ChatGPT to write more-or-less fitting poetry, I think with such a diffusion-based architecture, texts not well-suited to linear, autoregressive production might get a substantial boost in quality. Plus, constraints on the level of the whole text (both regarding quality and sentiment) should be more easily implemented.
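
As a toy illustration of the difference: instead of emitting the output left to right, a diffusion model starts from noise and refines the whole thing over several steps. Here the “text” is just a vector and the refinement step is hard-coded, where a real diffusion LM would learn it:

```python
# Toy diffusion-style refinement: the whole output is revised at every step,
# rather than being produced one token at a time. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])   # stand-in for the "ideal" output
x = rng.normal(size=target.shape)     # start from pure noise

for step in range(8):
    x = x + 0.3 * (target - x)        # one refinement pass over the whole output
    print(step, np.round(x, 2))
```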

I don’t know if ChatGPT is going to kill Google, but there are some types of searches for which it is far superior. I was trying to think of a word, but Google proved 100% useless:
what’s the word for when a company pays to have product reviewers flown out to some exotic locale?

ChatGPT, on the other hand:

The term for when a company pays to have product reviewers flown out to some exotic locale is called “junket.” A junket is a trip or excursion, typically one that is financed by a company or organization, for the purpose of promoting a product or service. It can also refer to a group of journalists or reviewers who are invited on such a trip.

Perfect. I don’t know if this is a good or a bad thing when it comes to my own memory, but it’s certainly helpful in the short term. ChatGPT’s flaws are irrelevant when I know the answer, but need a little help getting there. Heck, I can imagine a situation where it’s helpful even when it’s wrong, just because I can recognize that and lead myself down the right path as a contrast.

Google is ripe for competition from a new direction anyway. I find myself using Google less and less, because after every search I have to wade through a page or two of ‘sponsored’ results, then a whole bunch of garbage sites that have SEO’d themselves to the top of the search rankings (Pinterest, Quora, various fake sites that are just another search list, etc.), then when you find real content it’s often low quality because of Google’s algorithms.

Google could be in trouble because it’s hard to pack advertising into an AI question and a plain-English answer. We will solve the ‘hallucination’ problem in LLMs, probably with either another trained model or by incorporating some kind of live search. When that happens, Google could lose market share very quickly.

There are already a lot of things people used to Google that are now much better answered by ChatGPT, like the @Dr.Strangelove example above. And GPT-4 is just around the corner.

It’s interesting because Google’s LLMs are more performant than GPT-3. There’s a bit of discussion on AI Twitter about why they’re hesitant to use them in products.

Wouldn’t the obvious reason be that they haven’t figured out how to monetize a chatbot yet? And since Google makes so much money from sponsored ads and such, how much value will the AI have if it’s still overridden by the need to pollute the search results with commercial sponsors?

I mean, monetize it the exact same way you do with ads on search. Ads on LLM search results need not be any different than ads on search. Sponsored responses might be harder, but I don’t know what percentage of revenue that is.

People seem to think it’s reputational. Google has more to lose when an LLM spews nonsense or says something racist.

I guess if you want your result as a search result list you can still do that. But the power of an AI is that it can just give you the answer you want without you having to navigate through a list of results and look for yourself.

As a random example, let’s say I’m trying to remember something about MASH. Maybe who guest-starred the most. So if I ask google “Who guest-starred the most times on MASH?”, the answer is going to be a bunch of links about the TV show, which may or may not answer my question but I’m going to have to go look for myself.

I just asked ChatGPT, and it said, "the actor/actress who guest-starred the most times on the TV show MASH as a non-regular cast member is Eddie Albert who appeared in a total of 5 episodes. He played the role of General Clayton in several episodes." *

I also googled it. The first result was a garbage ‘Ranker’ list of MASH guest stars. The next was another listicle page for ‘17 stars you forgot were guest stars on MASH’, the rest were all just guest star lists in general, or other pages about ‘famous guest stars’ or about MASH in general. Not one answered the question.


  • That answer is a complete hallucination. As far as I can tell, Eddie Albert was never on MASH. Also, Herb Voland played General Clayton, and he was on 7 times. And I can think of at least two that might have him beat - Ed Winter as Colonel Flagg and Alan Arbus as Sidney Freedman. I just checked, and Freedman was on 12 times. So ChatGPT was completely full of it. Still, assuming the next AI is reliable, I’d much rather get a text answer to many questions than a search list. But how do you monetize that?

I mean you can also build a profile of people based on their interactions with an LLM and use that to help target ads. I just don’t see that being any different than their current revenue model.

That’s the critical issue. So far, it just seems to be an outstanding bullshitter.

As others have noted in this thread, and as I discovered myself, ChatGPT will unpredictably deliver either great, accurate responses, or complete nonsense (albeit well phrased), or anything in between. It has the potential to be a terrific source of information provided in the form of an intelligent natural language conversation, but it’s useless if it’s not reliable. IBM’s DeepQA, at least as implemented in the Jeopardy model, rated its response candidates according to quantitative confidence rankings, which influenced its game strategy. ChatGPT has to do the same if it’s ever going to be a serious knowledge provider. It should only give answers that exceed some confidence threshold.

I’m not sure how that would be possible with its current architecture. ChatGPT never has any sort of confidence about its answers, just about the individual word completions it uses—it’s after all just solving the problem ‘given this string of words, what word should come next’ over and over again. Perhaps you could introduce an extra ‘scorer’ element, which takes output from ChatGPT, then scans it for content and rates it for accuracy… But of course, that’s itself a major problem.
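
For what it’s worth, the raw material for a crude version of that scorer is already there: the model assigns a probability to every token it emits, and those could be aggregated into an answer-level number and thresholded. The catch is that this measures how fluent or expected the wording is, not whether it’s true. A toy sketch with made-up numbers:

```python
# Toy answer-level confidence from per-token log-probabilities. The values
# and the threshold are made up; this gauges fluency, not factual accuracy.
import math

token_logprobs = [-0.05, -0.2, -1.9, -0.1]   # hypothetical, one per generated token
avg_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
worst_prob = math.exp(min(token_logprobs))

CONFIDENCE_THRESHOLD = 0.5
if min(avg_prob, worst_prob) < CONFIDENCE_THRESHOLD:
    print("Below threshold: decline to answer.")
else:
    print("Above threshold: give the answer.")
```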

I think the better strategy, at least in the short term, is what OpenAI is doing with WebGPT: to just cite the web pages it used in compiling its answer. Still means you’ll have to do some fact checking yourself, but of course, you should do that with any source of information, human or artificial.

Yes, it’s a significant problem, but confidence scoring was one of the major strengths of DeepQA. Every component that contributes to a proposed response assigns it a confidence rating, and then, as you suggest, there is a separate component that learns how to rank those ratings and perform additional confidence analyses using more than 50 different criteria. These include historical reliability of sources, multiple confirmations, looking for matches with the semantic structure of the query, and checking against relationships that humans would just consider “common sense”, like consistency with geographic location, historical time, or taxonomic classification.
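
In miniature, that combination step looks something like the sketch below (the candidates, criteria, and weights are invented for illustration; DeepQA learned its weighting rather than hand-setting it):

```python
# Toy DeepQA-style evidence combination: each candidate answer gets scores
# from several independent checks, and a weighted sum turns them into one
# confidence value. All numbers are invented for illustration.
candidates = {
    "Alan Arbus":   {"source_reliability": 0.9, "multiple_confirmations": 0.8, "type_match": 1.0},
    "Eddie Albert": {"source_reliability": 0.3, "multiple_confirmations": 0.1, "type_match": 1.0},
}
weights = {"source_reliability": 0.5, "multiple_confirmations": 0.4, "type_match": 0.1}

def confidence(scores):
    return sum(weights[k] * scores[k] for k in weights)

for name in sorted(candidates, key=lambda c: -confidence(candidates[c])):
    print(name, round(confidence(candidates[name]), 2))
```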

^^^^^^
It occurs to me that confidence scoring – or what we humans sometimes refer to as a “sanity check” – is absolutely crucial to the phenomenon of “understanding”. As per my previous post, by ruling out contradictions in temporal or location data or an enormous number of other criteria, responses can be pruned to those conforming with “common sense”, thus leading to responses that seem to reflect an illusion of “understanding”. The interesting question is, with a sufficiently robust array of these confidence-scoring filters, do we at some point move beyond the “illusion” of AI understanding to what we have to admit is actual, real understanding?

On the other hand, the weakness of the AI is that it just gives you the answer you want without giving you the list of results to wade through. Which, so far as I can tell, makes it completely useless for that purpose. If I just wanted an answer, then I could ask any guy at the bar for that. But if I’m looking for something, I don’t just want the answer, I want the right answer, and I want to be able to be confident that it’s the right answer. Which means going to the source.

I am assuming that the LLMs will get better at accuracy. I suspect that’s what this large open beta is for. All our responses are recorded, and later will be used as input data to give the AI some judgment.

What I could see in the future is a hybrid where I ask a question, and the AI gives me the answer but offers a ‘click here’ button for a traditional search list or the references used to give me the answer.

And there are a lot of things where accuracy doesn’t matter that much, or where it will be obvious when the accuracy isn’t there: writing stories and songs or ad copy, summarizing arguments or putting them in bullet form, converting documents to proper formatting, all kinds of stuff.

As an example, it’s pretty easy to get ChatGPT to write a scraper that could, say, copy every thread on the SDMB that contains certain keywords or users and put them in a local file. That’s either going to work or it isn’t, so ‘accuracy’ isn’t as much of an issue.
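
Something like the skeleton below is what it tends to produce. The forum URL, the HTML structure, and the CSS selector are placeholders rather than the real SDMB layout, and anything real should respect the site’s terms and robots.txt:

```python
# Rough scraper skeleton. URL, selectors, and page structure are placeholders,
# not the actual SDMB markup. Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

KEYWORDS = {"chatgpt", "gpt-4"}
INDEX_URL = "https://example.com/forum/latest"   # placeholder forum index page

def matching_threads(page_html: str) -> list[str]:
    soup = BeautifulSoup(page_html, "html.parser")
    titles = [a.get_text(strip=True) for a in soup.select("a.thread-title")]  # placeholder selector
    return [t for t in titles if any(k in t.lower() for k in KEYWORDS)]

html = requests.get(INDEX_URL, timeout=30).text
with open("matching_threads.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(matching_threads(html)))
```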

Don’t underestimate how much routine document manipulation and conversion and such goes on every day. Much of it can now be done with AI.

Which is exactly as useful as the button for the list of references without the chatty answer. There are certainly things this sort of AI is good at, but pure information retrieval is not one of them.