How can that be true when just above I gave an example of ChatGPT solving a problem in quantitative logic? Not a hugely difficult problem, to be sure, yet some of the less bright among us might have fumbled it, perhaps forgetting to normalize to common units. The literature abounds with other examples that are much more difficult, and right here on the Dope, this long thread contains dozens more.
The issue of “AI hallucinations” in LLMs is entirely separate from the question of abstract reasoning skills. It stems from the absence of confidence filtering in what are essentially experimental prototypes like GPT-3.5 and GPT-4. Given the explicit goal of providing answers to conversational queries whether or not those answers are reliably factual, GPT takes on the human characteristics of what is commonly called a “bullshitter” – someone who is well informed, but not as well informed as they think they are. I cannot stress enough that this is a separate issue from problem-solving ability.
There are certainly controversies in the academic literature about the nature of emergent properties in large-scale LLMs, but they mainly seem to center on whether there’s a “threshold effect” that is solely due to scale, not whether the properties exist at all. They clearly do exist; for example, consider this paper in the prestigious journal Nature Human Behaviour:
Emergent analogical reasoning in large language models
The recent advent of large language models has reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data. Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training. In human cognition, this capacity is closely tied to an ability to reason by analogy. Here we performed a direct comparison between human reasoners and a large language model (the text-davinci-003 variant of Generative Pre-trained Transformer (GPT)-3) on a range of analogical tasks, including a non-visual matrix reasoning task based on the rule structure of Raven’s Standard Progressive Matrices. We found that GPT-3 displayed a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings; preliminary tests of GPT-4 indicated even better performance. Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.
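To make the kind of task concrete, here is a small illustration of the sort of text-based “digit matrix” problem the paper describes – a 3x3 grid where every row follows the same rule and the final cell is left blank for the model to fill in zero-shot. The specific rule (a simple arithmetic progression across each row) and the code are my own sketch for illustration, not materials from the study itself:

```python
# Hypothetical illustration of a Raven's-style text matrix problem of the kind
# Webb et al. describe. Each row follows the same rule; the last cell is hidden.

def make_progression_matrix(start: int = 1, step: int = 2) -> list[list[int]]:
    """Build a 3x3 matrix where each row is an arithmetic progression and each
    new row starts one step further along than the previous one."""
    return [[start + step * (row + col) for col in range(3)] for row in range(3)]

def format_as_prompt(matrix: list[list[int]]) -> str:
    """Render the matrix as a plain-text prompt with the answer cell blanked out,
    the way such a problem could be posed to a language model zero-shot."""
    lines = []
    for r, row in enumerate(matrix):
        cells = [str(x) for x in row]
        if r == 2:
            cells[2] = "?"  # hide the cell the solver must infer
        lines.append("  ".join(cells))
    return "Complete the pattern:\n" + "\n".join(lines)

if __name__ == "__main__":
    m = make_progression_matrix()
    print(format_as_prompt(m))
    print("Expected answer:", m[2][2])
```

Solving even this toy version requires inducing the rule from the first two rows and applying it to the third – exactly the kind of abstract pattern induction the paper reports GPT-3 performing at or above human level on much harder variants.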