The Miselucidation of Whack-a-Mole

The “arithmetic capabilities” of sophisticated LLMs are an emergent property because, as any computational linguist can tell you, there is logic intrinsically embedded in language; consequently, if you feed enough arithmetic and algebra problems into an LLM it will ‘deduce’ the operational rules and be able to functionally emulate them, albeit with unexpected conceptual gaps. This does not, however, mean that there are cognitive processes actually linking mathematical operations to real-world situations or concepts; it is just answering prompts about physics questions with textbook-like responses. That it is able to do this is unsurprising, because textbooks are one of the densest sources of validated information in a natural-language format, and there are tens of thousands of texts covering all manner of subjects freely available online to be easily fed into a training set.

That more advanced versions of GPT and other LLMs can ‘ace’ standardized tests such as the Uniform Bar Exam, GRE, Medical Licensing Exam, college entrance exams, et cetera is equally unsurprising, because there are copious examples of past tests and example problems available online for free or at marginal cost, and one ‘intellectual’ task that LLMs would be expected to be really good at is coming up with the statistically most likely answer on multiple choice or short answer tests, because that is literally what an LLM is doing any time it takes a prompt and provides a syntactically correct and stochastically most probable response. This does not mean that an LLM would be a good legal counsellor, or capable of correctly diagnosing medical conditions from ambiguous prompts or non-language examination, or would be a good business analyst, because it lacks any real-world context, and its entire scope of knowledge is limited to media input rather than direct interaction with the world.

Even with a greater scope of knowledge, LLMs will still make bizarre, nonsensical errors and have great lapses in critical judgement because they lack the fundamental, comprehensive model-building cognitive processes that a human (or any complex animal) brain has, which underlie direct analytical processes. It is fair to say that they have a degree of “intelligence” insofar as LLMs have an ability to respond to variable input with some degree of appropriate answer, but these aren’t what a computational neuroscientist would recognize as truly integrated cognitive processes, nor (despite the fanciful claims of enthusiasts and even some researchers who should know better) an indication of some underlying consciousness. To the extent that we understand consciousness in human brains, there are no processes going on in an LLM that could even provide a basis for the ongoing layers of cognition, emotion, and perceptual inference that produce the emergent property of consciousness in humans (and likely many other animals along a spectrum of sapience and sentience).

Well, sorta. For the record, here was the response:

So, it did formulate the logic of the problem correctly in terms of velocity being acceleration times time in response to a very textbook-question-like prompt. For the stated acceleration at two significant figures it should have been 21.92 mph/s, rounded to 22 mph/s, but it is clear that the LLM used the more accurate 9.81 m/s², giving what should be 21.94 mph/s, rounded to 21.9 mph/s. (The units are…odd, but numerically consistent.) It’s a quibble, but when performing engineering and science calculations significant figures are important. It does get the algebraic transition in the second step correct, although this is a trivial rearrangement that is commonly done in textbooks and explanations. In the final step it correctly calculates the right value for time (again, with caveats about significant figures) but then peculiarly fails to stick the landing, incorrectly claiming to round the result to two decimal places, and then somehow rounding 22.79 to 22.6 seconds, which is “approximately correct” but actually an error. (The LLM does know how to use emoji like a true GenZ-er, though, so it at least has that down to a T.)
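
For anyone who wants to check those numbers, here is a minimal back-of-the-envelope in Python. The original prompt isn’t reproduced above, so the 500 mph target speed is my own inference from the quoted 22.79 s figure, not something stated in the thread; the rest is just a standard unit conversion:

```python
# Back-of-the-envelope check of the figures quoted above. The 500 mph target
# speed is an assumption inferred from the quoted 22.79 s result; everything
# else is a standard unit conversion.
MPH_PER_MPS = 2.23694             # 1 m/s = 2.23694 mph

accel_2sf = 9.8 * MPH_PER_MPS     # 21.92 mph/s -> 22 mph/s at two sig figs
accel_exact = 9.81 * MPH_PER_MPS  # 21.94 mph/s -> 21.9 mph/s

target_speed_mph = 500.0          # assumed, to reproduce the quoted time
t = target_speed_mph / accel_exact

print(f"{accel_2sf:.2f} mph/s, {accel_exact:.2f} mph/s")
print(f"time to {target_speed_mph:.0f} mph: {t:.2f} s")  # 22.79 s; rounding
                                                         # that to 22.6 s is
                                                         # simply wrong
```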

If this were a 5-point problem on a quiz I was grading, I’d knock off one point for the rounding error, and probably another for bad handling of sig figs. It isn’t badly wrong, but it is wrong enough to make clear that GPT-4 is doing some kind of rote stochastic ‘interpretation’ rather than actually comprehending rules, and it comes far enough off of the correct answer that it isn’t any kind of issue with interpretation; it is just wrong, albeit close enough that if you didn’t check the work it would pass the smell test, unlike the original responses posted by @Whack-a-Mole, which were not only wrong but actually self-contradictory. Now, it would be pretty ‘easy’ (in conceptual terms; I don’t know about executing it functionally within the framework of GPT) to make an LLM that recognizes that it has a mathematical problem and dumps it off to a purpose-trained math-parsing subsystem to actually break the problem down in a way that is checked by an explicit algorithm, rather than relying on a purely stochastic engine to divine and apply the rules correctly. I suspect the next evolution in generative interactive models will include some kind of correction like this, because such checks are already used in robotics to prevent ‘impossible’ or undesirable kinematics. But LLMs on their own do not ‘understand’ a problem except in the context of providing a probabilistically appropriate response to a well-formulated prompt.
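
To be concrete about what I mean by dumping the math off to something deterministic, here is a toy sketch of the idea. The regex ‘detector’ and the llm_answer() stub are placeholders of my own invention, not anything resembling GPT internals or any real API:

```python
# Toy sketch: route anything that parses as plain arithmetic to a checked,
# deterministic evaluator instead of letting a stochastic text model guess.
# The regex "detector" and the llm_answer() stub are placeholders only.
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def _eval(node):
    """Safely evaluate a parsed expression limited to numbers and + - * /."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("not plain arithmetic")

def answer(prompt: str) -> str:
    # Crude detection: if the prompt is a bare arithmetic expression, use the
    # explicit evaluator; otherwise fall back to the generative model.
    if re.fullmatch(r"[\d\s.+\-*/()]+", prompt.strip()):
        try:
            return str(_eval(ast.parse(prompt.strip(), mode="eval").body))
        except (ValueError, SyntaxError):
            pass
    return llm_answer(prompt)

def llm_answer(prompt: str) -> str:
    return "[stochastic response goes here]"  # stand-in for the LLM

print(answer("500 / 21.94"))           # deterministic: 22.789...
print(answer("Why is the sky blue?"))  # falls through to the LLM stand-in
```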

I’ll add an anecdote from recent experience to highlight how this implicit trust in LLMs and unvalidated generative AI is a genuine safety and reliability problem in the real world. I received an analysis of a heat transfer simulation which was intended to explain why the simulation varied significantly from the actual flight data it was supposed to model. This report had presumably already passed through several rounds of review and approval with the contractor and was provided to my group for ‘final’ review. I did what I thought would be a cursory review before assigning it to an actual subject matter expert (SME), just to make sure that the report was complete, and found that it was unexpectedly full of exposition delving deeply into statistical mechanics derivations and computational fluid dynamics, even though the model used in the analysis was a pretty straightforward two-mode networked node heat transfer model.

I’m not an expert in statistical mechanics, having had one course partially covering the essentials almost three decades ago, but the derivation and application didn’t make any sense to me, so I cracked a book and did some sleuthing because I like to understand things that I sign off on. After spending the better part of an hour, I came to the realization that not only were there multiple bad assumptions and errors, but that in fact the entire analysis was complete gibberish concealed by a bunch of sophisticated-seeming nonsense probably cribbed from a Wikipedia page, which was good enough to cause everybody who had previously reviewed it to let their eyes glaze over and pass it up the chain without (I assume) asking questions or demanding clarity. When I finally got the chance to press the author for an answer, he first hemmed and then admitted to feeding a prompt with his desired conclusion into a chatbot and getting piecewise answers which he inartfully stitched together into an authoritative-seeming pile of complete end-to-end bullshit that literally had nothing to do with actually validating the analysis or explaining the discrepancy between simulation and test data.

This is the danger of getting complacent in trusting LLM-based chatbots to answer factual questions. Getting the wrong answer on a message board has no real consequences (not that this is an excuse for not checking the result, or for not acknowledging an obvious error that should have been caught), but it normalizes the use of these tools for purposes for which they have not been validated and at which they are completely unreliable. When a monumental dunce like Michael Cohen uses a chatbot to generate citations that turn out to be fraudulent to support his early release, it’s a funny story that sounds like a rejected plotline from Arrested Development that somehow prophetically came true. But when supposedly intelligent people start relying on chatbots for authoritative(ly wrong) answers or to give them a persuasive rationalization (or at least a good hornswoggle) to justify some arbitrary claim, they can quickly become actual physical hazards to public safety and well-being, notwithstanding how such tools can be used by malicious actors to do intentional damage.

These are not just cute toys or passing novelties. They are tools with the capacity to do genuine harm, and the more sophisticated they get, and the more normalized their use becomes in ways they are not suited for, the more likely these hypothetical hazards are to be realized.

Stranger