It…doesn’t. If you actually read the paper, you won’t find the authors arguing anywhere that human-like cognition is occurring, and they actually say as much up front in the abstract:
We interpreted our results to suggest that LLMs, like human cognitive abilities, may share a common underlying efficiency in processing information and solving problems, though whether LLMs manifest primarily achievement/expertise rather than intelligence remains to be determined. Finally, while models with greater numbers of parameters exhibit greater general cognitive-like abilities, akin to the connection between greater neuronal density and human general intelligence, other characteristics must also be involved.
You need to read through that paper more thoroughly, because while it makes some very interesting points about how to assess ‘artificial general ability’, as the authors term it, and uses the Cattell-Horn-Carroll (CHC) model of intelligence as a comparative framework for defining and quantifying different aspects of intelligence, it is definitely not suggesting that human-like cognition is occurring within an LLM, especially with respect to fluid reasoning (Gf):
We also failed to observe a Gf factor independent of the general factor. While some have argued that Gf is essentially isomorphic with g in humans (e.g., Gustafsson, 2001), we are more cautious about making such an interpretation with our data. Our selected measures of Gf were, at best, acceptable rather than good or excellent. In particular, two of our Gf tests focused on mathematical reasoning, and none involved figural matrices which are typically well-regarded for measuring fluid reasoning (Gignac, 2015). Consequently, further research with better measures of Gf is required to evaluate the possibility of a distinct Gf group-level factor in LLM data.
There is, as I’ve noted above, logic embedded in the way language is used, and metasemantic capabilities emerge from that usage; a system that mimics human language to the degree that a well-trained LLM can will produce statements that often look as if they have been ‘thought out’ by a process of complex cognition even though they are just the statistical ‘best fit’ for the prompt. That does not mean that the LLM has “powerful emergent cognitive properties” any more than playing a video game written in C++ makes someone an expert programmer.
In fact, the argument that manipulating language creates “emergent cognition” is quite obviously upside down, as evidenced by the fact that many animals that neither use nor understand language to any significant degree are nonetheless capable of demonstrating complex problem-solving and anticipatory behavior. Language is a tool that emerged out of a need to convey (and later, record) abstract ideas (and potentially as a tool to formulate more general solutions to specific problems), but it doesn’t somehow create cognition in the brain.
There are not, as previously noted, actual processes going on in an LLM that are in any way analogous to our understanding of cognitive processes in the mammalian brain. Specifically, there is no ongoing ‘strange loop’ of different layers of self (thought) awareness, sensory integration, anticipation and prediction, or formulation of abstracted models of how the world and other people work. An LLM is literally taking a textual prompt, breaking it down into tokens to quantify it, running those tokens through its word-prediction algorithm (sometimes using a recursive ‘chain-of-thought’ approach to break the problem into smaller intermediate sets of tokenized data), and then using all of that to generate a textual output that is an empirically appropriate response to the prompt. It does not learn or update its foundational models through experience (attempts to make LLMs do so have resulted in rapid instabilities), and it has a limited ‘attention window’ within which it can maintain an ongoing dialogue or process a sequence of prompts with continuity. It certainly isn’t self-aware, either physically (of course) or in reflection of its own processes, except insofar as its prior output is just another prompt that gets fed back in and has to be responded to in the way the user expects, i.e. saying that it is reviewing its previous work when it is really just dealing with a newly tokenized prompt.
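To make that tokenize-predict-append cycle concrete, here is a minimal sketch of the loop described above. It assumes the Hugging Face transformers library and the small, publicly available ‘gpt2’ checkpoint purely for illustration (the post above names no particular model); production systems add sampling strategies and much longer contexts, but the basic mechanism is the same.

```python
# Minimal sketch of the loop: text -> tokens -> next-token prediction -> append -> repeat.
# Assumes the Hugging Face `transformers` library and the "gpt2" checkpoint for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # break the prompt into token IDs

with torch.no_grad():
    for _ in range(10):                            # generate ten tokens, one at a time
        logits = model(input_ids).logits           # scores over the whole vocabulary
        next_id = torch.argmax(logits[0, -1])      # greedy pick: the statistical 'best fit'
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)  # append and go again

print(tokenizer.decode(input_ids[0]))
```

Everything the model ‘knows’ at any step is the token sequence sitting in that window; there is no persistent state carried between runs and no world model being consulted, just the next most statistically likely token given what came before.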
This isn’t to say LLMs aren’t remarkable (as an exercise in stabilized computation using a heuristic artificial neural network over a structured but highly complex and nuanced set of rules and data, they definitely are), or that they don’t have any utility as an interface, provided their responses can be constrained to correctly reiterating factual information. But they are not general information-processing systems, they don’t have models of the world independent of those represented in the patterns of word usage in the training data used to build their foundation models, and they certainly aren’t trustworthy for critical applications, nor should they be in use by a general public that will treat any device capable of producing coherent streams of language in an authoritative voice as an ‘expert’.
The ‘secret’ about standardized IQ tests such as the Wechsler Adult Intelligence Scale (WAIS) is that they don’t actually measure intelligence; they measure the subject’s ability to take the test, which is then correlated to a statistical measure of intelligence in the form of the “intelligence quotient”. Setting aside all of the problems in trying to quantify the various facets of intelligence in a single metric (and the problematic history of why IQ was developed and how it was used, especially by proponents of eugenics), the reality is that it is possible to ‘game’ any textual IQ test just by studying the form of the test and how its questions are structured, and thereby obtain a score that is wildly exaggerated compared to that of an untrained subject. Most makers of LLMs are quite cagey about exactly what data is used to train their models, but because they will use basically every source of text they can find, it is all but certain that sample IQ tests are part of the training set (as are sample college admission, bar examination, medical certification, and other tests). So a broadly trained LLM should be able to do as well as an expert on these tests, not because it is actually an expert that you would want to write a legal brief for you or diagnose the pain in your kidneys, but because these are forms of language use of which it has plenty of examples and which it should be able to reproduce with good fidelity even with variations in the actual questions. This makes it a kind of idiot savant that can sound really impressive and even carry out language-manipulation tasks within the scope of its foundation model, but doesn’t actually know an appendix from a larynx.
Stranger