I’m going to use this as a jumping-off point to make some general comments, not directed at anyone in particular. Excuse this slight digression, but this is the Pit, after all, and this is a real hot button with me.
LLMs have generally been very bad at arithmetic, especially in their earlier incarnations, although they’ve been getting much better. The salient question isn’t why they were so bad; it’s how they were able to do arithmetic at all, since it was never explicitly part of their training. The answer is that arithmetic capability apparently arose as an imperfect emergent property once the scale of the LLM became sufficiently large.
I must admit that the disdain in which some hold LLMs really bothers me, because it isn’t justified. It seems to be based on a hazy understanding that they’re “just” stochastic token predictors (essentially, glorified sentence-completion engines). The problem with this dismissive judgment is that human intuition cannot fathom how these systems behave when their operational scale becomes extremely large. One measure of scale is the number of so-called “parameters” – essentially, the learned weights produced by training that govern not only coherent language generation but also the relevance and accuracy of the model’s responses. GPT-3.5 has around 175 billion parameters; GPT-4 is reported to have well over a trillion. The direct result of this almost inconceivable scale is that new and often unexpected intelligent behaviours appear spontaneously as emergent properties.
To those who claim that ChatGPT and its ilk don’t actually “understand” anything and are therefore useless, my challenge is to explain how, without understanding anything, GPT has so far achieved the following – and much, much more, but this is a cut-and-paste from something I posted earlier:
- It solves logic problems, including problems explicitly designed to test intelligence, as discussed in the long thread in CS.
- GPT-4 scored in the 90th percentile on the Uniform Bar Exam.
- It aced all sections of the SAT, which among other things tests reading comprehension, math, and logic skills, and it scored far higher across the board than the average human.
- It did acceptably well on the GRE (Graduate Record Examinations), particularly the verbal and quantitative sections.
- It got almost a perfect score on the USA Biology Olympiad Semifinal Exam, a prestigious national science competition.
- It easily passed the Advanced Placement (AP) examinations.
- It passed the Wharton MBA exam on operations management, which requires the student to make operational decisions from an analysis of business case studies.
- On the US Medical Licensing exam, which medical school graduates take prior to starting their residency, GPT-4’s performance was described as “at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations.”
The converse question that might be posed by its detractors is: if GPT is so smart, how come it makes some really stupid mistakes, including sometimes failing to understand a very simple concept that even a child would grasp? The answer, in my view, is simply that it’s not human. We all have cognitive shortcomings and limitations, and we all sometimes misunderstand a question or problem statement, but because an AI’s cognitive model is different, its shortcomings will be different. I strenuously object to the view that because GPT failed to properly understand or solve some problem that seems trivially simple to us, it therefore doesn’t really “understand” anything at all. The fact that it can generally score higher than the vast majority of humans on tests explicitly designed to evaluate knowledge and intelligence seems to me to totally demolish that line of argument, which some philosophers have been harping on ever since Hubert Dreyfus claimed that no computer would ever play chess at better than a child’s beginner level.
That said, I agree that the responses of even the most advanced current LLMs cannot be considered reliable. They are very often right, even on complex problems; sometimes they’re right but get some nuance wrong; and on rare occasions the product is grammatically correct gibberish. They should not be judged on individual trivial failures, however, but on overall performance, just as people are.
Maybe the best way to think of a modern LLM is as a very intelligent, well-read alien with access to a great deal of objective information about our world and reasoning skills similar to ours, but with very different cognitive strengths and weaknesses. Furthermore, this alien has never been taught arithmetic, but has somehow cobbled together its own imperfect understanding of how it works. And finally, this alien is a consummate bullshitter who will just make something up if it doesn’t know the right answer, and not even realize it’s doing it. But I’ve still had some very enlightening and informative conversations with it. It’s particularly good at interactive conversations that zero in on a subject of interest that you didn’t even know existed!
Oh, and for the record, I put the 1g-acceleration-to-500-mph question to ChatGPT 3.5. It set up the right equations, converted everything to consistent units, and came up with the right answer, approximately 22.8 seconds. I half expected a mistake in the arithmetic, but it got the math right. I’ll post the whole thing if anyone wants to see it. Interestingly, when @Sam_Stone posed the same question to Bing using the GPT-4 framework, it came up with 22.6 seconds, the difference apparently due to working in different units and the rounding errors introduced by the unit conversions.
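If anyone wants to check that figure without firing up ChatGPT, here’s a quick back-of-the-envelope sketch of my own (not GPT’s output), assuming standard gravity of 9.80665 m/s² and the exact mph-to-m/s conversion:

```python
# Back-of-the-envelope check: time to reach 500 mph from rest at a constant 1 g.
# Assumes standard gravity (9.80665 m/s^2) and the exact conversion 1 mph = 0.44704 m/s.

G = 9.80665          # standard gravity, m/s^2
MPH_TO_MS = 0.44704  # 1 mph in m/s

v_target = 500 * MPH_TO_MS   # about 223.5 m/s
t = v_target / G             # t = v / a for constant acceleration from rest

print(f"Time to 500 mph at 1 g: {t:.1f} s")   # prints roughly 22.8 s
```

Different choices of conversion factors and rounding nudge the last digit a bit, which is presumably where the 22.6 vs. 22.8 discrepancy comes from.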
And now, back to your regularly scheduled Pitting.