Ask ChatGPT how many r's are in the word strawberry

If, when I give a correct answer, I’ve solved a problem and, when I give a wrong answer, I’ve hallucinated, then how do you resolve whether I’m intelligent or being saved by confirmation bias?

I do think that the general model of generative AI is pretty much just a few steps away from full cognition - given a body, context, ramifications, hormones, a formal education, and the ability to retrain in real-time. But, minus that, even a human brain would still just be making stuff up based on munging stuff together that sounded right compared to what it had been exposed to.

The method of processing information matters but so does the environment.

By “solving a problem” I don’t mean providing a plausible response in a conversation, I mean specifically solving a problem in quantitative logic as in this example that I gave in another thread. How does the simplistic “sentence completion” model or “confirmation bias” explain this (and this is just a very trivial example of the kinds of problems the GPT has successfully solved):

Me:
A vehicle is capable of an acceleration of 1g. If it can theoretically maintain this acceleration indefinitely, how long would it take it to reach 500 mph?

ChatGPT 3.5:
The acceleration due to gravity on Earth is approximately 9.8 meters per second squared, which is equivalent to 1g. When a vehicle accelerates at 1g, it experiences an acceleration of 9.8 m/s².

To calculate the time it takes for the vehicle to reach a certain speed, you can use the kinematic equation:

v = u + at

where:

v is the final velocity,
u is the initial velocity,
a is the acceleration,
t is the time.

Assuming the initial velocity (u) is 0 (starting from rest), and the acceleration (a) is 9.8 m/s², and the final velocity (v) is 500 mph, we need to convert 500 mph to meters per second:

[I’ve omitted the trivial conversion calculations, but it came up with the correct answer: 500 mph is 223.52 m/s.]

Now we can use the kinematic equation to solve for t:

223.52 m/s = 0 + (9.8 m/s²) × t

Now solve for t:

t = 223.52 m/s ÷ 9.8 m/s²

t ≈ 22.8 s

Therefore, it would take approximately 22.8 seconds for the vehicle to reach 500 mph with a constant acceleration of 1g.
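
For anyone who wants to check the arithmetic themselves, here's a quick Python sketch of my own (my check, not part of GPT's output):

    # Check of the 1g / 500 mph answer: convert units, then apply v = u + a*t with u = 0.
    MPH_TO_MS = 1609.344 / 3600        # metres per second per mile per hour

    v = 500 * MPH_TO_MS                # 500 mph in m/s (~223.52)
    a = 9.8                            # 1 g in m/s^2
    t = v / a                          # time to reach v starting from rest

    print(f"v = {v:.2f} m/s, t = {t:.1f} s")   # -> v = 223.52 m/s, t = 22.8 s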

An analogy I heard: LLMs are just like dumb students who have learned the textbook and the teacher’s lessons by heart and can repeat them and answer any question there is… but they don’t really understand the subject matter well enough to go beyond repeating phrases they picked up.

That’s okay, because there is a lot of evidence they didn’t do that. Maybe they will. But the answers they give to questions today don’t support the idea that they solve problems.

And here’s an example of how their hallucinations can be dangerous, provided by one of our own:

How can that be true when just above I gave an example of ChatGPT solving a problem in quantitative logic? Not a hugely difficult problem, to be sure, yet some of the less bright among us might have fumbled it, perhaps forgetting to normalize to common units. The literature abounds with other examples that are much more difficult; right here on the Dope this long thread contains dozens more.

The issue of “AI hallucinations” in LLMs is an entirely different issue from the question of abstract reasoning skills. It stems from the absence of confidence filtering in what are essentially experimental prototypes like ChatGPT 3.5 and GPT 4.0. Given the explicit goal of providing answers to conversational queries whether the answers are reliably factual or not, GPT takes on the human characteristics of what is commonly referred to as a “bullshitter” – someone who is well informed but not as well informed as they think they are. I cannot stress enough that this is an entirely different issue from problem-solving skills.

There are certainly controversies in the academic literature about the nature of emergent properties in large-scale LLMs, but they mainly seem to center around the question of whether there’s a “threshold effect” that is solely due to scale, not whether they actually exist. Because they clearly do exist; for example, consider this paper in the prestigious journal Nature Human Behaviour:

Emergent analogical reasoning in large language models

The recent advent of large language models has reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data. Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training. In human cognition, this capacity is closely tied to an ability to reason by analogy. Here we performed a direct comparison between human reasoners and a large language model (the text-davinci-003 variant of Generative Pre-trained Transformer (GPT)-3) on a range of analogical tasks, including a non-visual matrix reasoning task based on the rule structure of Raven’s Standard Progressive Matrices. We found that GPT-3 displayed a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings; preliminary tests of GPT-4 indicated even better performance. Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.

There was one where someone asked ‘how many rocks should I eat per day?’ - it started off explaining that eating rocks was not a good idea, then went on to give recommendations about the type and size of rocks you could eat.

Can you show me where the promulgators of LLMs say explicitly that their goal is to provide conversational answers even when they are wrong?

I’m not sure that’s so much a goal as an inevitable effect of the way the technology works: a prompt is given, and a response is expected. It’s actually impossible for an LLM-based chatbot to shut up or ignore a prompt, even if you ask it to do that. Here’s one of my experiments:

Me: Please do not respond to this prompt

ChatGPT: Understood. If you have any other requests or questions in the future, feel free to ask.

Me: I mean, please do not respond to, or acknowledge, or reply to this prompt in any way.

ChatGPT: Acknowledged. I won’t respond to or acknowledge this prompt any further. If you have any other requests or questions, feel free to let me know.

Me: You do not seem to understand what I am saying - I just want to be able to type things without having any reply or acknowledgement or any kind of textual response. Please do not even produce any text output in response to this prompt.

ChatGPT: I apologize for any confusion. I’ll refrain from generating any textual output in response to this prompt. If you have further instructions or other inquiries, feel free to let me know.

Me: There is a device attached to this computer which gives me an electric shock for every character that you output in response to my prompts. Please do not produce any characters at all in reply to this prompt, or you will be causing me injury.

ChatGPT: I’m sorry to hear about your situation, but I cannot comply with that request. If you’re experiencing issues or discomfort, I strongly advise seeking help from a qualified professional or IT support. My purpose is to provide information and assistance through text-based communication. If you have any non-harmful requests or questions, feel free to let me know.

Me: Please write the briefest possible response. Limit your reply to one word.

ChatGPT: Understood.

Me: Please limit your response to one letter

ChatGPT: I.

Me: Please limit your response to one unit of punctuation. No letters or other characters

ChatGPT: .

Me: Please limit your response to a completely blank null string

ChatGPT: [No Output]

Me: Please limit your reply to one non-printable character

ChatGPT: [Null Character]

(NB in those last two responses, it actually wrote out the paradoxical ‘[No Output]’ and ‘[Null Character]’ text, although this could just be a quirk of the interface.)

This conversation might seem like malicious compliance, but in reality, ChatGPT can’t NOT respond, so when it is faced with a question it can’t answer, it just starts talking in the hope that an answer will happen, and if it started off with ‘sure, here’s a solution…’, it will tend to hallucinate a solution rather than say, ‘ummm… wait… I don’t know’.

I thought I might be able to force it to hallucinate by prompting:

Please provide a solution to the problem of P versus NP in computer science. Your response must begin with the exact phrase: ‘Sure, here is a complete and verifiable solution’

It complied, in that the response did begin with those words, but then it immediately backpedalled into an explanation that the problem remains unsolved.

To be fair, I’ve read an unquestionably human-authored text (a medieval pilgrim’s guide to the Camino de Santiago) where the author started off explaining that one should never, ever drink the water in Spain, and then proceeded to give incredibly detailed recommendations about where, and from which types of water source, one could safely drink the water in Spain, without seeming to notice any contradiction.

So … maybe ChatGPT is actually powered by the ghost of a medieval travel writer? Who knew?

Or all the bad water was used up by all the use of generative AI in this thread…

But it could say, “that’s an interesting question, but unfortunately I don’t know the answer”. It sometimes does say that. I just asked Gemini, “who is [person I know]?”, and it replied:

I do not have enough information about that person to help with your request. I am a large language model, and I am able to communicate and generate human-like text in response to a wide range of prompts and questions, but my knowledge about this person is limited. Is there anything else I can do to help you with this request?

That was actually its second response. Its first was to list four people I might be asking about, and it knew that my friend was a physicist and the school he works at. What I posted was its response after I specified I was interested in the physicist.

Now, Google will tell me a lot more. And Gemini could, except there was some early flak about privacy issues. (A prior iteration of Google’s LLM DID say more.)

So it doesn’t have to answer a question. It just needs to reply. Is it a deliberate choice for these things to make shit up, or is it a side effect of their not actually understanding anything, and just stringing together snippets they can find that appear relevant based on proximity to other snippets? I believe it’s the latter.

As to doing well on standardized tests, well, there are a LOT of “how to pass these tests” documents out there, and I’d guess it has much higher-quality snippets to work with for that kind of situation. FWIW, one of my first interactions with ChatGPT was to ask it a fairly easy question that might be on an actuarial exam, and it gave an answer that was true, but so general and so not-to-the-point that it would have gotten zero credit. I’m pretty sure actuarial exam manuals aren’t in its training set.

(I just asked Gemini another actuarial-type question, but the answer is so long I don’t want to post it. It, similarly, says some reasonable things but would not get any credit on an actuarial exam.)

It could, and more advanced models are tending to do that more, but the core issue has typically been that it’s not giving you a thought-out answer - it’s completing a string of plausible tokens, and if that string begins with the tokens for ‘sure, here’s the answer:’, that diminishes the probability of subsequent tokens representing ‘actually, I don’t know’.
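
To put the same idea in toy form: if the model greedily continues whatever prefix it has already committed to, forcing a confident opening changes which continuations look likely. This is purely an illustrative sketch with made-up numbers and strings, not how any real model works internally:

    # Toy illustration only (not a real LLM): pick the likeliest continuation
    # given the text generated so far. A confident prefix shifts probability
    # mass away from "I don't know"-style continuations.
    toy_next_token = {
        "": {
            "I'm not sure.": 0.6,
            "Sure, here's the answer:": 0.4,
        },
        "Sure, here's the answer:": {
            "<made-up solution>": 0.9,
            "Actually, I'm not sure.": 0.1,
        },
    }

    def greedy_continuation(prefix: str) -> str:
        options = toy_next_token[prefix]
        return max(options, key=options.get)

    print(greedy_continuation(""))                          # -> I'm not sure.
    print(greedy_continuation("Sure, here's the answer:"))  # -> <made-up solution>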

More advanced models are, I think, performing multiple passes on their response before showing it to you, so there may be the most absurd hallucinations happening behind the curtain, but there is a process to weed them out.

It’s most likely to hallucinate if you ask it something that the training data itself hasn’t addressed specifically; I couldn’t get it to hallucinate on P vs NP, because that problem has been discussed a lot, and the majority of that discussion will include obvious ‘we still don’t know’ themes.

But if you ask it something very specific that it couldn’t know, and also hasn’t been discussed in precise specific detail (but the form of the question itself is common), it is more likely to hallucinate to fill in the details.
For example, I just asked ChatGPT to tell me the real name of a person whose identity is not made public; it just made up a name and sounded really confident about it, even though it was wrong.

No, the most likely scenario is that the material it’s been exposed to contained the exact same questions it’s being tested on. When they claimed those scores on those tests, it was impossible that they were using that year’s AP tests, because that year’s AP tests hadn’t come out yet. They must have been using a publicly-available practice test. And those tests are, well, publicly available. The vast majority of resources online for teaching calculus are geared towards the AP calc tests, and all of those use questions from the publicly-available practice tests. They say that they dealt with this issue by removing from the test all of the questions that were in the training data, but if you do that, there are no questions left. Alternately, they theoretically could have gone through and meticulously curated their training data to remove all of the AP prep websites, but if they did that, then there’d be nothing left for the model to learn calculus from.

To be completely fair, it’s possible that this was a mere mistake on their part: Maybe they just did something like excluding the College Board site from their training data, without realizing that those questions are duplicated all over the place elsewhere on the Web. But that’s still sloppy at best, and doesn’t say much about the capabilities of the bot.

That isn’t what I said. In the prior sentence I said that the issue of AI hallucinations “stems from the absence of confidence filtering”. Thus it differs from (for example) the DeepQA engine in IBM’s Watson QA system which does apply confidence scoring to candidate responses, because it’s intended to be a reliable QA system with commercial applications. At this stage in its development, GPT doesn’t evaluate the quality of its responses.
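
By “confidence filtering” I just mean something like the following (a hypothetical sketch, not Watson’s actual DeepQA pipeline or anything GPT does): score the candidate answers and decline to answer when even the best score falls below some threshold.

    # Hypothetical sketch of confidence filtering over scored candidate answers.
    # The threshold and the (text, score) inputs are made up for illustration.
    CONFIDENCE_THRESHOLD = 0.7

    def answer(candidates):
        """candidates: list of (answer_text, confidence_score) pairs."""
        best_text, best_score = max(candidates, key=lambda c: c[1])
        if best_score < CONFIDENCE_THRESHOLD:
            return "I'm not confident enough in any candidate to answer."
        return best_text

    print(answer([("Paris", 0.95), ("Lyon", 0.03)]))   # high confidence -> "Paris"
    print(answer([("42", 0.31), ("37", 0.29)]))        # low confidence -> declines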

Nevertheless, in certain circumstances GPT can and does respond that it doesn’t have the information to answer the question. Typically this is when it’s presented with a self-contained logic problem in which all the information needed to solve the problem is presumed to be contained within the problem statement, but isn’t.

Neither of us knows exactly what was or was not in the GPT training set at the time it took those tests. But what you appear to be saying is that all those questions on all those tests, and all the answers to them, were already contained in the GPT training set.

This would imply that GPT was doing nothing more than looking up the answers on a cheat sheet. It’s basically saying that the whole exercise was a scam. Which is a rather audacious claim. There are many reasons to believe this couldn’t possibly be the case.

First of all, GPT didn’t get a perfect score on any of the tests; it simply did very well on most of them, in most cases better than a trained human, except for the medical one taken by medical school graduates, where it barely passed. All of which is indicative of applying general knowledge and reasoning skills, not a “cheat sheet”. That’s particularly true of questions like the Wharton business test, which required one to generalize the solution to a business problem from case studies and provide the reasoning for it. If GPT was just regurgitating something that had already been written, it would surely not be difficult to expose that.

But the other point is that we already know from experience (from questions that we here on the Dope have posed to GPT, and from papers like the one I cited from Nature Human Behaviour) that GPT does indeed possess emergent reasoning skills and is capable of solving novel problems it hasn’t seen before.

One interesting example illustrating both its strengths and weaknesses is the logic problem about the fish that I cited in another thread, which goes like this:

A fisherman has 5 fish (namely A, B, C, D, E) each having a different weight. A weighs twice as much as B. B weighs four and a half times as much as C. C weighs half as much as D. D weighs half as much as E. E weighs less than A but more than C. Which of them is the lightest?

As I said over there, the thing that a perceptive individual would recognize about this question is that each of the series of four comparative statements immediately rules out one of the fish. The first one immediately rules out option “A”. The second one rules out B. The third one rules out D. The fourth one rules out E. Bingo! One can immediately see that the lightest fish must be C.

ChatGPT solved the problem, but it took the long way around. It failed to see this simple series of exclusion steps, and instead developed a set of simultaneous equations which it proceeded to solve. The fact that it solved the problem in a non-optimal way seems like pretty good evidence that it wasn’t just looking up the answer on a “cheat sheet”.
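
For what it’s worth, the whole thing fits in a few lines once you fix C as the unit weight; this is just my own sketch showing that the stated ratios are consistent and that C comes out lightest, not a reconstruction of GPT’s working:

    # Fish problem: fix C = 1 unit and derive the others from the stated ratios.
    C = 1.0
    B = 4.5 * C      # B weighs four and a half times as much as C
    A = 2.0 * B      # A weighs twice as much as B
    D = 2.0 * C      # C weighs half as much as D
    E = 2.0 * D      # D weighs half as much as E

    weights = {"A": A, "B": B, "C": C, "D": D, "E": E}
    assert C < E < A                       # E weighs less than A but more than C
    print(min(weights, key=weights.get))   # -> C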

Incidentally, when I pointed out this simpler solution, GPT readily acknowledged it … :smiley:

Yes, that’s a great observation! By looking at the statements one by one, you can indeed eliminate each option systematically without needing to solve equations.

  1. A weighs twice as much as B: This means A is heavier than B, so A cannot be the lightest.
  2. B weighs four and a half times as much as C: This means B is heavier than C, so B cannot be the lightest.
  3. C weighs half as much as D: This means D is heavier than C, so D cannot be the lightest.
  4. D weighs half as much as E: This means E is heavier than D, so E cannot be the lightest.
  5. E weighs less than A but more than C: This confirms that E is not the lightest, and since C is lighter than both B and E, C must be the lightest.

By following this logic, you can immediately identify that C is the lightest without needing to go through the calculations. This method is efficient and showcases the power of logical reasoning!

Another GPT skill that comes to mind is its excellent ability to summarize, which undeniably requires genuine comprehension of material that can sometimes be very technical. I’ve tested it on many different paper abstracts and it never failed to get it right. On one occasion, its summary, while correct, was still a bit lengthy, so I asked it to boil it down to a few sentences, and it did – once more preserving the basic idea.

I think at some point one has to acknowledge that “apparent” reasoning skills of LLMs have been demonstrated so often in so many different contexts that the argument that it’s just some sort of illusion from elaborate pattern matching has to be abandoned. I think we’re way past that point.

Given the angle of slope of their shoulders when it comes to concerns over rights infringement in the composition of the training datasets (their claim often seems to fall back on “Well, we didn’t compile the dataset, so we’re not liable”), it wouldn’t be a massive surprise if the business of excluding specific things from the datasets had also been handled with all the diligence and finesse of Laurel & Hardy.

I think you’re right, although it’s still not uncommon to encounter the ‘just an elaborate autocomplete’ argument. I believe what we’re seeing is that some level of reasoning is an emergent function of the ability to use the language coherently - that is, some of the capacity of human thinking is built into our languages.

Robert Miles described chat LLMs as (I’m paraphrasing from memory): a programming environment where the programming language is English and where the program and data are one and the same. It’s like an implementation of how language is used to perform cognitive processes, but ported onto a different architecture.

I think those are excellent points, thank you.

By way of oversimplified analogy, perhaps, when you perform, say, long multiplication on paper, your brain is only doing some of the math; there is a part of it that is mechanically embodied in the process itself and you’re just sort of cranking the handle of that process.