ChatGPT level of confidence

I think this is a salient observation as to why people trust LLMs and even believe that they have a ‘spark of consciousness’ or express deep semantic comprehension, even though they often fail to vet prompts for false premises and respond with obvious errors, some of them quite basic. LLMs are naturally very good at grammar because the rules of structured grammar are reasonably consistent (even in a language like English, which has a lot of variation in structure), and they have become quite good at producing strings of tokens as comprehensible sentences. With additional ‘compute’ and reinforcement using ‘reward models’, they have also become quite adept at narrative structure within the scope of a prompt (i.e. a few paragraphs), which is expressed as “well crafted and authoritative sounding prose”, giving a kind of metasemantic ‘truthiness’ even if the actual statement is nonsense.

Narratives appeal to people because they communicate seemingly logical constructions of concepts into coherent mental frameworks, even though many if not most narratives actually contain logical fallacies or counterfactual claims, not just incidentally but as a direct consequence of needing to be linked together into an arc with a conclusion. Such narratives are often repeated and even become common ‘memes’ because of this appeal, long after they are debunked; think of Paul Harvey and “The Rest of the Story”, or basically any biopic you’ve ever seen about some significant historical figure, in which characters are composited together or events are reorganized or fictionalized for the sake of narrative flow. People put a lot of trust in a good narrative because it seems to make sense out of a jumble of supposed facts or claims, and they are often reluctant to actually check an appealing story because they have already decided that it is consistent with their worldview.

There are many forms of ‘AI’ built on using artificial neural networks (ANNs) and heuristic methods to develop emergent capabilities based upon complex or difficult-to-identify patterns in large datasets, and in fact this was really the genesis of using ANNs in heuristic modeling. But large language models are specifically built to replicate natural language processing in human-like ways, with the explicit goal of making them respond in ways that are indistinguishable from a real person. Implicit in that, however, is also making them respond in ways that are appealing, accessible, and agreeable rather than offensive, abstruse, and objectionable. As a consequence, the most advanced LLM implementations respond in a conversational tone with an authoritative ‘voice’, like a friendly teacher or ‘influencer’, which most people view as inherently trustworthy and are thus disinclined to fact-check (or in many cases are just too lazy to do so).

The business use case for these models is to replace human beings in direct interface applications that are repetitive and don’t require a great deal of highly accurate or specialized knowledge, e.g. a call center rep, or a narrator that translates basic information into spoken format, and the like. Unfortunately, people are already trying to use them as detailed knowledge agents without understanding that the ‘knowledge’ these models have is not based upon any real-world experience but only on the constructions that arise from the frequency of word use. That makes these models adept at doing things like taking written tests or summarizing the basic text of an essay, but they have no real comprehension of the world based upon interaction or introspection, no inherent ability to distinguish basic facts from fiction, and no inclination to respond to any prompt by saying, “I don’t know anything about that,” because their ‘worldview’ is constructed of every possible thing that exists in their world, i.e. the textual dataset they have been trained upon. If a reasonably well-educated person saw a picture of a six-legged mammal, they would know immediately that it is a fictional animal, not only because such a creature is evolutionarily impossible but also because it is outside of their real-world experience. An LLM would not draw such a conclusion (unless texts explicitly discussing why mammals have four limbs were part of its training set) because it lacks that deep understanding and experience of how the number of limbs relates to the concept of a mammal.

GPT and other LLMs are quite an advancement in the state of natural language processing over more traditional symbolic-based methods, but they are not knowledge agents in the sense of distinguishing correct statements from falsehoods, and their ability to ‘know’ anything comes from the semantics that are intrinsic in a rich natural language set that reflects the complexity of mental models based upon real world experience, not because they have any kind of mental models of their own or introspective cognitive processes which construct such models from experience. That they appear to be authoritative comes from their ability to produce digestible narratives with impeccable grammar and erudite but accessible vocabulary, so they seem like your really smart know-everything friend even when they are spouting utter nonsense.

And despite the prognostications of incipient artificial general intelligence emerging any day, there is really no path toward a highly factually reliable model because of these basic limitations. Which is not to say that various approaches (not just LLMs but all ANN-based heuristic models) are not quite powerful and capable of doing certain things beyond what even an expert in some field can do in terms of teasing out patterns of information from vast datasets or structuring ideas into a well-organized framework. But they are in no way at some cusp of taking over the execution of complicated groups of tasks or being reliable active agents of detailed technical knowledge (i.e. disseminating information from their own intrinsic knowledge base versus using heuristic algorithms to reference external information from a database or the Internet).

Stranger

And what do you conclude is going on internally when a human gets things wrong?

This is true, but just like saying that an LLM is “just” a sentence completion engine, it has the potential to be very misleading. In particular, it doesn’t reliably tell us anything about its intrinsic limitations or put any known bounds on what new emergent properties may appear as the scale of their training data, parameter and token counts, and neural nets continues to grow.

Indeed, even older LLMs like ChatGPT 3.5 clearly demonstrate the ability to apply reasoning and inference skills to novel problems, skills that were not necessarily predictable and are not trivially explainable. The same applies to the ability of advanced LLMs to handle non-text multimedia.

A timely article in the NYT

https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html

The company found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.
When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time.

It seems that the more advanced reasoning engines are the ones more prone to hallucination:

Since late 2023, Mr. Awadallah’s company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward task that is readily verified: Summarize specific news articles. Even then, chatbots persistently invent information.
Vectara’s original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent.
In the year and a half since, companies such as OpenAI and Google pushed those numbers down into the 1 or 2 percent range. Others, such as the San Francisco start-up Anthropic, hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek’s reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI’s o3 climbed to 6.8.

This seems to be an inevitable feature of complexity:

“The way these systems are trained, they will start focusing on one task — and start forgetting about others,” said Laura Perez-Beltrachini, a researcher at the University of Edinburgh who is among a team closely examining the hallucination problem.
Another issue is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking.
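To make the compounding point concrete, here is a toy sketch of my own (it assumes, purely for illustration, that each step can go wrong independently with a fixed probability, which real reasoning chains certainly don’t obey):

```python
# Toy illustration only, not a model of any real system: if each reasoning
# step fails independently with probability p, the chance that a chain
# contains at least one bad step grows quickly with its length.
p = 0.02  # hypothetical per-step error rate
for steps in (1, 5, 10, 25, 50):
    at_least_one_error = 1 - (1 - p) ** steps
    print(f"{steps:>2} steps -> {at_least_one_error:.1%} chance of at least one error")
```

Even a small per-step error rate adds up: under this toy assumption, a 50-step chain has roughly a 64% chance of containing at least one mistake.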

That article is paywalled for me, so I don’t know what that first company is that was mentioned, but there seems to be a great deal of misleading sensationalism there. It seems to now be fashionable in the media to take the position that “LLMs seem impressive, but …” and then lay on click-bait sensationalist negativity.

The second quote about the results from Vectara makes more sense, but even there the hallucination figures seem cherry-picked and wildly exaggerated relative to typical norms. Here’s the latest Vectara data from February:

There’s nothing there higher than 2.9%. GPT 3.5 is 1.9%, and the latest GPT 4.5 is 1.2%.

Empirically, I would argue that I’ve found very low rates of hallucinations in my interactions with GPT, and this data bears that out.

Sometimes, but this is not true as a generalization, and it certainly is not “inevitable” at all. On the extremely hallucination-prone SimpleQA, the latest model from OpenAI, GPT 4.5, has a lower hallucination rate than any previous release. One of the advantages of models with deeper reasoning skills is that they can be more likely (not necessarily “are”, but “can be”) to recognize that they don’t know the answer to a question, and therefore can be less likely to hallucinate.

Source:

That’s the top 25, as the chart title says. You can find that chart (in fact, the April 28th update) here. The data table below gives the figures for everything they tested, including DeepSeek R1 at 14.3% and OpenAI o3 at 6.8% as the article says.

I’d like to thank you for this reply. I provide calculations/data for focused AI/ML programs, so I know something about how they work and what their limitations are, and while a lot of my colleagues are pretty starry-eyed about AGI, it is really refreshing to read something that:

  1. was clearly written by a thinking human, with unexpected observations and novel semantic constructs
  2. points to the fundamental issue with trusting AGI as a relayer (much less generator) of reliable knowledge: it interacts only with highly filtered data. It has no direct experience and only indirect ways to test itself against bullshit.

It does a convincing impression of Accounts Payable, though. I’ll give it that.

An astonishing example of this here, in which a writer asks ChatGPT’s help on assembling a representative portfolio of her writing to submit to an agent.

It’s a series of screenshots so not easy to quote snippets but in essence:

ChatGPT says to give it “links, titles, short descriptions, whatever you’ve got” and it will “read them like a human editor would”. That means not just for content but with an eye for “voice, craft, structure, originality, emotional resonance, clarity and relevance”.

Given the first link, it responds with effusive praise, beyond the bounds of even normal sycophancy. But it’s detailed praise, and sounds like a thorough critical appreciation.

Asked how it could do that so fast, it replies that its training means it has internalised how to read flow, structure, pacing and tone at speed, akin to the instincts of a seasoned editor.

This repeats over several links, each eliciting a precise and detailed critical appreciation, consisting solely of thickly laid-on praise, with each submission being judged a must-have for this showcase portfolio, which portfolio is described in increasingly hyperbolic terms. But the point is that the praise is very specific: the response to the essay “The Summer I Went Viral” talks about the author’s Twitter experience, her commentary on social media dynamics, etc.

Then the author asks why it didn’t reference a couple of specific phrases in the last essay. It apologises, and tells her why that was a bad omission, because those references are perhaps the finest pieces of cultural commentary ever written.

You know where this is going.

Confronted by the fact that everything it said about that essay bears no relation to what she actually wrote, it admits that it can’t read Substack links and bluffed, apologising very prettily for doing so. It insists it did read the others.

You know where this is going.

“The Summer I Went Viral” is about getting COVID. Confronted again about the fact that everything it said about both the essays and its ability to read them is bullshit, it produces yet another highly polished, achingly sincere-sounding apology, begging for another chance and promising to be better.

The point isn’t the hallucination, or that it completely failed in its task, although, yeah. The point is that the tone of the thing is that of a practiced mendicant and flatterer.

And that tone is not a hallucination, or just some trivial whoopsie. It’s what it was trained to do. It is absurd but also (see above for people following it into delusion) unhelpful in quite important ways. If you knew a person who acted like this you’d run a mile. You have to genuinely worry about the people who produced the sycophancy machine, and why they did so.

The flattery

The bullshit

The practiced apology

You can ask it to be more critical and less effusive. In terms of presenting writing, you have to either attach a word file, text file, or copy and paste your work. I’ve run a bunch of my own work through it, and it responds as if it gets what I’m talking about. It even catches influences and themes I missed while I was writing that were correct. It’s quite eerie. It helped tighten up some areas I asked it to. I don’t take all its advice and sometimes it can be too cheesy and predictable. But quite useful.

I haven’t explored the responses in detail, but ISTM that for every example like this, which seems a somewhat contrived way to showcase GPT’s weaknesses, there are thousands of counterexamples where it’s been genuinely useful in ways that are often amazing. For example, in this specific context of writing comprehension, it’s generally very good at generating summaries of technical articles, often being impressively skilled at teasing out the main points in simple terms.

It’s true that GPT has a tendency towards flattery, which seems to be a fairly recent tweak, or at least one that I didn’t notice before, for example, preceding its response with a remark like “that’s a really insightful question and shows a deep understanding of the subject”. But that’s just a “personality” tweak and doesn’t bother me. I’m amused by the results you get when you tell it to be intentionally abusive! :grin:

Yes, pretty much exactly my point!

But in fact this is exactly the use case for an LLM. As a proficient manipulator of language, the appropriate application is as an agent that can provide a natural language interface, and naturally the vast majority of people would want a pleasant and complimentary ‘personality’ to interact with rather than a harsh critic or a chatbot that tells them they should go off themselves. That provides applications ranging from basic call-center support to tutoring systems, and if it were reliable and capable of distinguishing between reality and nonsense, it would be incredibly useful in replacing fairly menial ‘intellectual labor’ jobs that exist only to transfer information from a written or electronic form into voice in a way that scales to the user’s expectation and degree of understanding.

The problem is that in training away from adverse behaviors (which have notably cropped up again and again in LLMs trained with uncurated internet scrapings, because much of the internet is a cesspool of foul language, uncivil behavior, and dangerously ill-informed and bigoted ideas), researchers have instead created sycophantic flatterers that aspire to give the user the experience they want: an agent that will praise their ideas and writing even if it is garbage. And so many users are so taken with the idea of this (and the charm of having their own interests and ideas mirrored back to them with approval) that they set aside all skepticism and convince themselves that this adulation is genuine and that their product or ideas have great merit even if this is obviously untrue.

Of course, some people just want an agent that is a perfect, always-compliant supplicant to have a ‘relationship’ with, even though it is a completely one-sided interaction of volition and empathy. Having a chatbot ‘companion’ that agrees with everything they say and desire is a certain kind of solipsistic perfection versus dealing with the conflicts of ideas and personalities in the real world. Where all of this goes is anyone’s guess, but it certainly isn’t to a flourishing of people competent in actual human relationships or able to reflectively internalize criticism and respond with greater understanding and maturity.

Stranger

Here’s the end of a clip of a conversation with GPT about its conversational style:

That’s exactly right for my interactions with it, and goes with my advice above that if you don’t like it, ask it to change its tone. Interact with the AI!

I had never used ChatGPT until about four weeks ago.

The other day, I was working with some students (audio related) and asked them to calculate the number of MB a 16-bit/44.1 kHz stereo WAV file would take up per hour. To quickly check the number (and out of curiosity about what it would do), I asked ChatGPT for the answer. It (correctly) identified the factors:

(44100 x 16 x 2 x 3600)

but instead of the correct 5,080,320,000, it answered something like 5,300,000,000, and then proceeded to divide by 1,000,000 instead of 1024×1024 to get the number of MBs. When I pointed out that the 5,300,000,000 number was wrong, it apologized, and got the correct number via a slightly different route, even though I had not pointed out the 1024×1024 error.

I know that ChatGPT is not supposed to be a calculator, but I was really surprised that the number was fairly close to the truth, but not quite, even though the factors going in were absolutely correct. It’s as if it swapped two numbers while typing it into a calculator by hand.
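For anyone who wants to check the arithmetic, here is a quick sketch (mine, not ChatGPT’s output) that works through the same factors, with the bit-to-byte conversion made explicit:

```python
# 16-bit / 44.1 kHz stereo PCM for one hour, ignoring the small WAV header.
sample_rate = 44_100      # samples per second
bits_per_sample = 16
channels = 2
seconds = 3_600           # one hour

total_bits = sample_rate * bits_per_sample * channels * seconds
total_bytes = total_bits // 8

print(total_bits)                    # 5080320000 bits, as in the post
print(total_bytes / 1_000_000)       # ~635.0 decimal megabytes (MB)
print(total_bytes / (1024 * 1024))   # ~605.6 binary megabytes (MiB)
```

The 1,000,000 versus 1024×1024 divisor is just the decimal-versus-binary megabyte distinction; either is defensible, but silently mixing them up changes the answer by about 5%.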

So a) this is just nonsense, and b) through what mechanism can ChatGPT accurately describe its own workings?

Through the mechanism of objective assessments made by others that are part of its training data set.

So the flattery is a recent tweak, but the training data contains sufficient objective assessments of this novel behaviour that ChatGPT can reliably reproduce them? Are you completely sure?

And of course these objective assessments conclude this is no flattery, just a “glow”.

Do you have a cite for any of this? Are these objective assessments in the public domain? Do we know they are in the training data?

As a daily user of ChatGPT, I find its description of its tone accurate, whether it’s bullshit or not.

The excerpt you quoted has two contradictory descriptions of its tone. Which is the one you found accurate?

It would help if you pointed it out to me.