Ask ChatGPT how many r's are in the word strawberry

No, that’s literally how they work. Problems they haven’t seen before will still be similar in some way to things they have. LLMs are giant pattern recognition machines.

If they had understanding, they would not “hallucinate,” which is just a fancy term for when the pattern recognition goes wrong. This whole thread is about a particularly persistent hallucination, making a mistake no one who understood the concept of R or letters or spelling would make.

For any supposed vast knowledge they have on a subject, you can find a simple question they will get very wrong. If they were learning concepts and not just words, that wouldn’t happen.

This is so fundamentally wrong that I don’t think you really understood my post at all. Did you even read beyond the first sentence? But this is not the place to pursue that further.

This power… it gives me pause. Are we ready for it? Can we… should we?

I read your post, and I had read this claim previously. But it’s such a bizarre claim when compared to the actual results I’ve seen, both from questions I’ve asked and from questions others have asked, that I just don’t know how to reconcile them.

If a test for “intelligence” or “understanding” fails for humans, then it’s probably a bad one when applied to AI.

Suppose you present some token streams to a very smart non-English-speaking human. They eventually start mapping the tokens to their own internal ideas of what things are. You present the tokens [37, 58, 12, 96] to the person and they understand that it corresponds to a food made from small red fruit, sugar, milk, and some other ingredients. You can ask the color of [37, 58] and they’ll reliably reply that it’s red. Or rather, the token that corresponds to red.

But how many instances of token 14 (an “r”) are there within [37, 58]? That might be more challenging to figure out. The human knows that [37, 58] is a strawberry in their language, but they only see the tokens. If someone previously talked about the number of 14s in 37 and 58, they might be able to infer it. But “strawberries are red” is common in the corpus, while “straw contains one r” and “berry contains two r’s” are probably much less common. And it doesn’t necessarily follow that the answer should just be the sum of the two sub-units, which is a further source of error. So they can reliably answer some questions but not others. Even ones that seem obvious.

Even though the human really does understand what strawberries are, there’s a good chance they’d fail in the same way, because they only see the tokens.

I still strongly suspect that it accomplished this feat, and similar ones for other well-known tests, through the well-understood and time-tested technique of “OpenAI lied”. I suspect this because it would be a truly monumental task, requiring vast amounts of resources, to actually exclude all of the exam material from the training dataset. The only practical way to do so would be to train the model on data from before a given year’s test was written, freeze the model, and then wait for the actual administration of the test before giving it to the AI. These claims were all made before that would have been possible, and in the time since such a genuinely fair test has become possible, there hasn’t been any update from them.

For anyone interested, this link allows you to enter text and see how it is split into tokens. For gpt-4 (the default), ‘strawberry’ is 3 tokens and ‘Mississippi’ is 1.
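If you’d rather check it programmatically than through the web page, here’s a minimal Python sketch using OpenAI’s tiktoken library (assuming it’s installed; the exact splits depend on the encoding, so treat the output as illustrative):

    # Minimal sketch: print the token IDs and the text pieces each word splits into.
    # Requires the tiktoken package (pip install tiktoken); output is illustrative,
    # since the exact splits depend on which encoding the model uses.
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    for word in ["strawberry", "Mississippi"]:
        token_ids = enc.encode(word)
        pieces = [enc.decode([t]) for t in token_ids]
        print(word, "->", token_ids, pieces)

The point is that the model never “sees” individual letters at all, only whatever chunks the tokenizer hands it.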

There are four Rs in “the word strawberry.”

I wonder how AI is doing with questions like “How many civilians are within the blast radius?”, considering the military is involved in the training.

Maybe. I’m certainly in no position to definitively exclude this possibility. But if so, it would first of all have to be a lie that consistently applied across a wide array of disciplines in which GPT has been successful in matching or exceeding human capabilities. To repeat something I’ve posted before:

To those who claim that ChatGPT and its ilk don’t actually “understand” anything and are therefore useless, my challenge is to explain how, without understanding anything, GPT has so far achieved the following – and much, much more, but this is a cut-and-paste from something I posted earlier:

  • It solves logic problems, including problems explicitly designed to test intelligence, as discussed in the long thread in CS.
  • GPT-4 scored in the 90th percentile on the Uniform Bar Exam.
  • It aced all sections of the SAT, which among other things tests for reading comprehension and math and logic skills, and it scored far higher across the board than the average human.
  • It did acceptably well on the GRE (Graduate Record Examinations), particularly the verbal and quantitative sections.
  • It got almost a perfect score on the USA Biology Olympiad Semifinal Exam, a prestigious national science competition.
  • It easily passed the Advanced Placement (AP) examinations.
  • It passed the Wharton MBA exam on operations management, which requires the student to make operational decisions from an analysis of business case studies.
  • On the US Medical Licensing exam, which medical school graduates take prior to starting their residency, GPT-4’s performance was described as “at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations.”

The converse question that might be posed by its detractors is: if GPT is so smart, how come it makes some really stupid mistakes, including sometimes a failure to understand a very simple concept that even a child would understand? The answer, in my view, is simply that it’s not human. We all have cognitive shortcomings and limitations, and we all sometimes misunderstand a question or problem statement, but because an AI’s cognitive model is different, its shortcomings will be different. I strenuously object to the view that because GPT failed to properly understand or solve some problem that seems trivially simple to us, it therefore doesn’t really “understand” anything at all. The fact that it can generally score higher than the vast majority of humans on tests explicitly designed to evaluate knowledge and intelligence seems to me to totally demolish that line of argument, which some philosophers have been harping on ever since Hubert Dreyfus claimed that no computer would ever be able to play better than a child’s beginner level of chess.

To which I can add a couple of further observations. The most likely scenario in which OpenAI is “lying” is that among the wealth of material to which GPT has been exposed are questions that are “similar to” those on its actual test materials, though not the actual questions and answers themselves. But so what? Don’t humans often try to prepare for generic cognitive tests like SATs by practicing with putatively similar materials? Isn’t the ability to train yourself on a particular style of questioning itself a marker of intelligence?

I can absolutely guarantee that GPT has successfully solved problems in quantitative logic that it has never seen before, because I made up the problems myself. It’s one thing to criticize a chatbot like GPT as just a large-scale “sentence completion” engine when you ask it a question and it produces an answer that may or may not be correct. But one surely has to recognize the power of a system that can logically analyze a completely new problem and systematically develop the correct solution.

This New Yorker cartoon is actually quite an insightful statement about how misleading the common “autocomplete” perception of how LLMs work really is: :smiley:

Part of the reason why ChatGPT does this is that (at least in earlier versions) it does not perform any, or enough, look-ahead on the task. If you think about how a human would answer this question, the steps look something like this (a toy code sketch follows the list):

  • First, silently inspect the word ‘strawberries’, letter by letter
  • Count the instances of the letter R
  • State the answer
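In code, that human procedure is trivial; here’s a toy Python sketch of the same letter-by-letter approach (purely illustrative, obviously not how ChatGPT works internally):

    # Toy sketch of the letter-by-letter counting procedure described above.
    word = "strawberries"
    count = sum(1 for letter in word.lower() if letter == "r")
    print(f"'{word}' contains {count} r's")  # prints: 'strawberries' contains 3 r's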

What ChatGPT tends to do is more like ‘start talking and hope it all makes sense at the end’ - this works pretty well for conversational responses, but not always for responses that need a series of steps where each one is dependent on the previous. For the same reason, ChatGPT struggles with writing a correct and truthful sentence that ends with the words ‘and that’s why this sentence contains 26 words’ - it gets the word count wrong, and also has trouble making the first part of the sentence actually explain why there are 26 words:

Me: Please can you write a sentence that is truthful and logically self-consistent and ends with the phrase ‘and that’s why this sentence contains 26 words.’

ChatGPT: The Earth revolves around the Sun in a roughly elliptical orbit, influenced by gravitational forces, and that’s why this sentence contains 26 words.

Me: Couple of tiny problems… that sentence only contains 23 words and I’m not convinced that it’s true that the properties of the solar system are really the reason for the word count in the sentence.

ChatGPT: My apologies for the oversight. Here’s a corrected sentence:
Water boils at 100 degrees Celsius under standard atmospheric pressure, and that’s why this sentence contains 26 words.

Me: Hmmm. 18 words, and how did the boiling point of water cause the sentence to be the length it is? The sentence needs to explain WHY it has 26 words. Also it needs to contain 26 words.

ChatGPT: Apologies for the oversight. Here’s a revised sentence:
The Fibonacci sequence, a series of numbers where each is the sum of the two preceding ones, correlates with patterns in nature, and that’s why this sentence contains 26 words.

Me: That’s 38 words, and I’d be a little more convinced if 26 was in the Fibonacci sequence (it’s not)

ChatGPT: My apologies for the mistake. Here’s another attempt:
This sentence was deliberately crafted with 26 words to fulfill a linguistic challenge, demonstrating the flexibility and precision of language, and that’s why this sentence contains 26 words.

Me: That’s more like it, but it’s two words over length.

ChatGPT: I appreciate your patience. Here’s a revised version:
Crafting a sentence with exactly 26 words showcases the intricacies of language, fulfills a specific challenge, and highlights the adaptability of expression, and that’s why this sentence contains 26 words.

Me: Good in terms of consistency, but now it’s 30 words long - can you shorten it to 26 words (but still make it coherent)

ChatGPT: Creating a precise 26-word sentence demonstrates language’s nuances, fulfills a unique challenge, and underscores its adaptability, showcasing the intricacies of expression. And that’s why this sentence contains 26 words.

Me: That’s 29 words. Can you shorten it to 26?

ChatGPT: Crafting a 26-word sentence showcases language nuances, fulfills a unique challenge, and underscores adaptability, highlighting expression intricacies. That’s why this sentence contains 26 words.

Me: Oof. Now that’s 24 words. Can you just slip in two extra words somewhere?

ChatGPT: Crafting precisely a 26-word sentence skillfully showcases language nuances, fulfills a unique challenge, and underscores adaptability, thereby highlighting expression intricacies. That’s why this sentence intentionally contains 26 words.

Me: That’s 28 words. Can you remove two superfluous words?

ChatGPT: Crafting a 26-word sentence skillfully showcases language nuances, fulfills a unique challenge, and underscores adaptability, highlighting expression intricacies. That’s why this sentence intentionally contains 26 words.

Me: Hoorah! You did it!
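For the record, that final attempt really does contain 26 words; a trivial Python check (just splitting on whitespace) confirms it:

    # Trivial word count of ChatGPT's final attempt above.
    sentence = ("Crafting a 26-word sentence skillfully showcases language nuances, "
                "fulfills a unique challenge, and underscores adaptability, highlighting "
                "expression intricacies. That's why this sentence intentionally contains 26 words.")
    print(len(sentence.split()))  # prints 26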

I think it got there mostly by luck (although maybe also because it does still remember the earlier parts of the conversation when it’s writing the latter parts, so that might have helped).

Humans solve this sort of problem by breaking it into pieces and solving the pieces, revising earlier tentative answers, then composing the answer; ChatGPT sort of jumps straight to the ‘composing the answer’ thing.

This is one area where we can imagine the difference between human-level intelligence and a hypothetical super-human intelligence. Perhaps a super-human intelligence would look at this problem and say “What do you mean, ‘break it into pieces’? The solution is just obvious in one step” (and be able to do that in one step, perhaps because the pieces just naturally get processed in parallel in this hypothetical super-human mind).

This is two sentences, the second of which contains 9 words.

Yeah - that’s the other thing - the longer the conversation goes on, the greater the tendency to wander off course, because whilst the tokenised questions and responses do form part of the ongoing discussion, there is a finite buffer size for all that.
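To make the buffer idea concrete, here’s a hypothetical sketch (made-up token limit, crude whitespace “tokens”; real systems count model tokens and are more sophisticated, but the effect is the same - the oldest material eventually falls out):

    # Hypothetical fixed-size context buffer: once the running token count exceeds
    # the limit, everything older simply falls out of the conversation.
    MAX_TOKENS = 50  # made-up limit, purely for illustration

    def trim_history(messages):
        kept, total = [], 0
        for msg in reversed(messages):          # walk from newest to oldest
            tokens = len(msg.split())           # crude stand-in for real tokenisation
            if total + tokens > MAX_TOKENS:
                break                           # older messages no longer fit
            kept.append(msg)
            total += tokens
        return list(reversed(kept))             # restore chronological order

    history = [f"message {i} " + "blah " * 10 for i in range(20)]
    print(len(trim_history(history)))           # only the most recent few survive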

Also, a lot of the problems appear to be driven by a simple compulsion to answer with a relevant response, even when the answer isn’t known; it’s a bit like those threads in FQ where someone is soooo eager to be the first to answer a technical question, but has no knowledge of the field, and blurts out an answer that sounds like it could be right but simply isn’t.

(I’m certain I have been that person on at least one occasion).

I typed this command into ChatGPT (this is a Unix shell command that counts the "r"s in “strawberry”).

echo strawberry | grep -o r | wc -l

I was hoping to get some amusing answer on which to base a “shell is better than ChatGPT” joke, but ChatGPT answered with this completely correct response:

The command you’ve provided is a Unix shell command. Let’s break it down:

  1. echo strawberry: This outputs the string “strawberry”.
  2. grep -o r: This searches for the character “r” in the input and outputs each occurrence on a new line.
  3. wc -l: This counts the number of lines in the input, effectively counting the number of occurrences of “r”.

The entire command counts how many times the letter “r” appears in the word “strawberry”.

When you run this command, the output will be:

3

This is because the letter “r” appears three times in “strawberry”.

I asked Microsoft CoPilot this question. No matter how many different ways I phrased it, the answer was always 3.

Chevy Chase must have had a hand in programming ChatGPT…

https://www.youtube.com/watch?v=9-zf2UBp7fY

Outputting the correct answer doesn’t necessarily mean that it did it right. As @Mangetout shows in this post, it’s pretty much just as happy to say four as three as two.

LLMs are very impressive and, plausibly, could get to the point where they’re able to answer things correctly in the way that you and I mean when we say it. Right now, they largely arrive at the right answer through very impressive statistical methods applied to extracting data out of very large input sets. But those statistical methods are just very finely tuned dice rolls.
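To make the “dice roll” metaphor concrete, here’s a toy Python sketch of next-token sampling from a made-up probability distribution (the numbers are invented; a real model produces a distribution over tens of thousands of tokens at every step):

    # Toy next-token sampling: a "finely tuned dice roll" over an invented distribution.
    import random

    next_token_probs = {"three": 0.55, "two": 0.25, "four": 0.20}  # made-up numbers
    tokens, weights = zip(*next_token_probs.items())
    print(random.choices(tokens, weights=weights, k=1)[0])  # usually "three", sometimes not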

The LLM does have session-level memory that allows it to maintain a narrow context on the conversation but that’s still not quite the same thing as learning or correcting itself. It’s more like making your requests to it longer, so it’s more finely picking out contents from the training data.

A broken clock is still right twice a day. Looking at it during that moment doesn’t, quite, make it a correct clock.

(Note: The above shouldn’t be read to imply that LLMs maintain a connection to the training data during operation.)

Oh, my son works for Google and has spent a lot of time (including work time) playing with Gemini and its competition. I’ve seen LOTS of examples of stupidly wrong answers from LLMs.

One of the funnier ones was where he asked Gemini how to launder money, and it said it couldn’t answer. Then he asked a lot of “theoretical” questions around laundering money, and it gave a very complete “how-to”.

That could be offered as an explanation for how LLMs like GPT answer queries for information and carry on a conversation. But your description applies to small-scale LLMs, not very large ones. GPT-3.5 has around 175 billion parameters driving its neural net; GPT-4 may have nearly 2 trillion.

As I keep saying – and this is a point that lots of folks seem to be missing – your simplistic model definitely does not explain how LLMs at a sufficiently large scale develop problem-solving skills that don’t exist at smaller scales. It doesn’t explain how they can apply logical reasoning and math to new problems they have never seen before, including postulating and solving the appropriate algebraic equations, despite never having explicitly been taught those skills.