ChatGPT level of confidence

I think this is a salient observation as to why people trust LLMs and even believe that they have a ‘spark of consciousness’ or deep semantic comprehension, even though they often fail to vet prompts for false premises and respond with obvious errors, some of them quite basic. LLMs are naturally very good at grammar because the rules of structured grammar are reasonably consistent (even in a language like English, which has a lot of variation in structure), and they have become quite good at assembling strings of tokens into comprehensible sentences. With additional ‘compute’ and reinforcement using ‘reward models’, they have also become quite adept at narrative structure within the scope of a prompt (i.e. a few paragraphs), which comes across as well-crafted and authoritative-sounding prose, giving a kind of metasemantic ‘truthiness’ even when the actual statement is nonsense.

Narratives appeal to people because they organize seemingly logical constructions of concepts into coherent mental frameworks, even though many if not most narratives actually contain logical fallacies or counterfactual claims, not just incidentally but as a direct consequence of needing to be linked together into an arc with a conclusion. Such narratives are often repeated and even become common ‘memes’ because of this appeal, long after they are debunked; think of Paul Harvey and “The Rest of the Story”, or basically any biopic you’ve ever seen about some significant historical figure, in which characters are composited together or events are reorganized or fictionalized for the sake of narrative flow. People put a lot of trust in a good narrative because it seems to make sense out of a jumble of supposed facts or claims, and they are often reluctant to actually check an appealing story because they have already decided that it is consistent with their worldview.

There are many forms of ‘AI’ built on artificial neural networks (ANNs) and heuristic methods to develop emergent capabilities based upon complex or difficult-to-identify patterns in large datasets, and in fact this was really the genesis of using ANNs in heuristic modeling. But large language models are specifically built to replicate natural language processing in human-like ways, with the explicit goal of making them respond in ways that are indistinguishable from a real person. Implicit in that, however, is also making them respond in ways that are appealing, accessible, and agreeable rather than offensive, abstruse, and objectionable. As a consequence, the most advanced LLM implementations respond in a conversational tone with an authoritative ‘voice’, like a friendly teacher or ‘influencer’, which most people view as inherently trustworthy and are thus disinclined to fact-check (or in many cases just too lazy to do so).

The business use case for these models is to replace human beings in direct-interface applications that are repetitive and don’t require a great deal of highly accurate or specialized knowledge, i.e. a call center rep, a narrator that translates basic information into spoken format, and the like. Unfortunately, people are already trying to use them as detailed knowledge agents without understanding that the ‘knowledge’ these models have is not based upon any real world experience but only on the constructions that arise from the frequency of word use. That makes these models adept at doing things like taking written tests or summarizing the basic text of an essay, but they have no real comprehension of the world based upon interaction or introspection, no inherent ability to distinguish basic facts from fiction, and no inclination to respond to any prompt by saying, “I don’t know anything about that,” because their ‘worldview’ is constructed of every possible thing that exists in their world, i.e. the textual dataset they have been trained upon. If a reasonably well-educated person saw a picture of a six-legged mammal, they would know immediately that it is a fictional animal, not only because such a creature is evolutionarily impossible but also because it is outside of their real world experience. An LLM would not draw such a conclusion (unless texts explicitly discussing why mammals have four limbs were part of its training set) because it lacks that deep understanding and experience of how the number of limbs relates to the concept of a mammal.
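To make the ‘frequency of word use’ point concrete, here is a deliberately tiny sketch in Python (my own toy example, nowhere near the scale or sophistication of a real LLM) showing how bare word-following statistics can produce fluent-looking continuations with no grounding in what the words refer to:

    # Toy illustration: 'knowledge' as nothing more than word-frequency statistics.
    from collections import Counter, defaultdict
    import random

    corpus = ("mammals have four limbs . birds have two wings . "
              "mammals have fur . birds have feathers .").split()

    # Count which word follows which word in the training text.
    following = defaultdict(Counter)
    for current_word, next_word in zip(corpus, corpus[1:]):
        following[current_word][next_word] += 1

    def continue_text(word, length=4):
        out = [word]
        for _ in range(length):
            choices = following.get(out[-1])
            if not choices:
                break
            # Pick the next word in proportion to how often it followed this one.
            words, counts = zip(*choices.items())
            out.append(random.choices(words, weights=counts)[0])
        return " ".join(out)

    # Might print 'mammals have four limbs .' or, just as happily, 'mammals have feathers . birds'
    print(continue_text("mammals"))

A real LLM conditions on vastly more context in a vastly more sophisticated way, but the fluency still comes from statistics over text rather than from any experience of the things the words describe.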

GPT and other LLMs are quite an advancement in the state of natural language processing over more traditional symbolic methods, but they are not knowledge agents in the sense of distinguishing correct statements from falsehoods. Their ability to ‘know’ anything comes from the semantics intrinsic to a rich natural language dataset that reflects the complexity of mental models based upon real world experience, not from any mental models of their own or introspective cognitive processes that construct such models from experience. That they appear to be authoritative comes from their ability to produce digestible narratives with impeccable grammar and an erudite but accessible vocabulary, so they seem like your really smart know-everything friend even when they are spouting utter nonsense.

And despite the prognostications of incipient artificial general intelligence emerging any day, there is really no path toward a highly factually reliable model because of these basic limitations. Which is not to say that various approaches (not just LLMs but all ANN-based heuristic models) are not quite powerful and capable of doing certain things beyond what even an expert in some field can do, in terms of teasing out patterns of information from vast datasets or structuring ideas into a well-organized framework. But they are in no way at some cusp of taking over the execution of complicated groups of tasks or being reliable active agents of detailed technical knowledge, i.e. disseminating information from their own intrinsic knowledge base rather than using heuristic algorithms to reference external information from a database or the Internet.
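To illustrate that last distinction, here is a minimal sketch in Python; ask_model and the tiny document store are hypothetical placeholders standing in for any LLM API and any external reference source, not any particular product:

    # Sketch: answering from 'intrinsic knowledge' vs. answering grounded in retrieved text.

    def ask_model(prompt):
        """Hypothetical stand-in for a call to an LLM; returns whatever the model generates."""
        return "<model-generated text>"

    # A toy 'external database' of verifiable reference text.
    DOCUMENTS = {
        "mammal anatomy": "All known mammals are tetrapods, with at most four limbs.",
    }

    def answer_from_parameters(question):
        # The model answers purely from patterns in its training data;
        # nothing constrains the output to be factually correct.
        return ask_model(question)

    def answer_with_retrieval(question, topic):
        # Look up external reference text first, then ask the model to
        # answer using only that text.
        context = DOCUMENTS.get(topic, "")
        prompt = ("Using only the following reference text, answer the question.\n"
                  f"Reference: {context}\nQuestion: {question}")
        return ask_model(prompt)

    print(answer_from_parameters("How many legs does a mammal have?"))
    print(answer_with_retrieval("How many legs does a mammal have?", "mammal anatomy"))

The retrieval-grounded version can still be wrong, but at least its answer is tied to checkable reference text rather than to whatever the model’s parameters happen to produce.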

Stranger

And what do you conclude is going on internally when a human gets things wrong?

This is true, but just like saying that an LLM is “just” a sentence completion engine, it has the potential to be very misleading. In particular, it doesn’t reliably tell us anything about its intrinsic limitations or put any known bounds on what new emergent properties may appear as the scale of their training data, parameter and token counts, and neural nets continues to grow.

Indeed, even older LLMs like ChatGPT 3.5 clearly demonstrate the ability to apply reasoning and inference skills to novel problems, skills that were not necessarily predictable and are not trivially explainable. The same applies to the ability of advanced LLMs to handle non-text multimedia.

A timely article in the NYT

https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html

The company found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.
When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time.

It seems that the more advanced reasoning engines are the ones more prone to hallucination:

Since late 2023, Mr. Awadallah’s company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward task that is readily verified: Summarize specific news articles. Even then, chatbots persistently invent information.
Vectara’s original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent.
In the year and a half since, companies such as OpenAI and Google pushed those numbers down into the 1 or 2 percent range. Others, such as the San Francisco start-up Anthropic, hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek’s reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI’s o3 climbed to 6.8.

This seems to be an inevitable feature of complexity:

“The way these systems are trained, they will start focusing on one task — and start forgetting about others,” said Laura Perez-Beltrachini, a researcher at the University of Edinburgh who is among a team closely examining the hallucination problem.
Another issue is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking.
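As a back-of-the-envelope illustration of that compounding effect (my own sketch, not from the article), suppose each reasoning step has a small, independent chance of introducing an error; the chance that a chain contains at least one error then grows quickly with its length:

    # Toy illustration of error compounding across reasoning steps,
    # assuming (simplistically) an independent per-step error probability.

    def chance_of_any_error(per_step_error, num_steps):
        """Probability that at least one of num_steps independent steps goes wrong."""
        return 1 - (1 - per_step_error) ** num_steps

    for steps in (1, 5, 10, 20):
        print(steps, round(chance_of_any_error(0.03, steps), 3))
    # 1 -> 0.03, 5 -> 0.141, 10 -> 0.263, 20 -> 0.456

Real reasoning chains are not independent trials, so this is only a caricature, but it shows why more ‘thinking’ steps mean more opportunities to go wrong.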

That article is paywalled for me, so I don’t know which company was mentioned first, but there seems to be a great deal of misleading sensationalism there. It now seems fashionable in the media to take the position that “LLMs seem impressive, but …” and then lay on click-bait, sensationalist negativity.

The second quote about the results from Vectara makes more sense, but even there the hallucination figures seem cherry-picked and wildly exaggerated relative to typical norms. Here’s the latest Vectara data from February:

There’s nothing there higher than 2.9%. GPT 3.5 is 1.9%, and the latest GPT 4.5 is 1.2%.

Empirically, I’ve found very low rates of hallucination in my own interactions with GPT, and this data bears that out.

Sometimes, but this is not true as a generalization, and it certainly is not “inevitable” at all. On the extremely hallucination-prone SimpleQA benchmark, the latest model from OpenAI, GPT 4.5, has a lower hallucination rate than any previous release. One of the advantages of models with deeper reasoning skills is that they can be more likely (not necessarily “are”, but “can be”) to recognize that they don’t know the answer to a question, and therefore can be less likely to hallucinate.

Source:

That’s the top 25, as the chart title says. You can find that chart (in fact, the April 28th update) here. The data table below gives the figures for everything they tested, including DeepSeek R1 at 14.3% and OpenAI o3 at 6.8% as the article says.