According to OpenAI, this is absolutely false. They claim that GPT was never explicitly trained on the material underlying these tests. While OpenAI may exaggerate, or sometimes even outright lie, I’m inclined to believe them, because if GPT were simply regurgitating remembered answers as you claim, then this would be completely trivial behaviour that would not be the subject of so many publications and discussions.
Furthermore, this is not some small number of “narrow cases”; if you look at the article, it’s quite a varied and extensive list of tests. And finally, I know for a fact that GPT can solve puzzles that it’s never seen before (and can show its work) because in some cases I made them up myself. Another clue is that in other cases it arrives at a correct answer through an unnecessarily circuitous route, so it wasn’t just parroting a standard known answer. OpenAI may not be a paragon of perfect honesty, but I’d sooner believe them on the training issue than believe that it’s completely trivial behaviour based on a theory that you apparently just made up with not a shred of evidence to support it.
And yet, just above you basically accused OpenAI scientists of blatantly lying about whether GPT had been explicitly trained for the tests that it successfully passed.
Nope. Emergent capabilities are a foundational concept in systems theory. Another way of expressing it is that a sufficiently large quantitative increase in the scale of a complex system results in qualitative changes in its capabilities. For example, a powerful stored-program computer is fundamentally different from a little hand-held calculator, even though both may be built from the same kinds of components. Yet another restatement of the same phenomenon is to say that large complex systems are greater than the sum of their parts.
Some philosophers dispute the reality of emergent properties, arguing that if such properties appear to emerge at large scales, then they must in some way have been present at smaller scales, even if latent and invisible. I find this philosophical abstraction both unpersuasive and uninteresting. If certain behaviours become both apparent and perhaps very powerful at large scales and cannot be discerned at all at smaller scales, then they are emergent by definition.
I have a task right now I would use AI for in a heartbeat. I have 450 folders I need to turn into a spreadsheet listing the funder name and any contact information or other context I can pull from the enclosed documents. Doing it by hand means tediously clicking into every single folder, reading the documents, and adding the details to the spreadsheet. This is the kind of shit work I’d happily give to a machine.
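For what it’s worth, even without an LLM a short script could do most of the mechanical part. Here’s a minimal sketch; the grants/ directory name, the assumption that the documents are plain text, and the crude email/phone regexes are all made up for illustration, not anyone’s actual setup:

```python
# Minimal sketch: pull likely contact info out of each funder folder into a CSV.
# Assumes one top-level folder per funder containing plain-text documents;
# scanned files or PDFs would need a text-extraction step first.
import csv
import re
from pathlib import Path

ROOT = Path("grants")  # hypothetical top-level directory holding the 450 folders
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

with open("funders.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["funder", "emails", "phones", "source files"])
    for folder in sorted(p for p in ROOT.iterdir() if p.is_dir()):
        emails, phones, sources = set(), set(), []
        for doc in folder.rglob("*.txt"):
            text = doc.read_text(errors="ignore")
            found_emails = EMAIL.findall(text)
            found_phones = PHONE.findall(text)
            if found_emails or found_phones:
                sources.append(doc.name)
            emails.update(found_emails)
            phones.update(found_phones)
        writer.writerow([
            folder.name,
            "; ".join(sorted(emails)),
            "; ".join(sorted(phones)),
            "; ".join(sources),
        ])
```

The output still needs a human pass for context and judgment calls, but it turns 450 folders of clicking into one spreadsheet to review.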
Thank you for your thoughts, it’s clear you know a lot about this.
It isn’t, as I acknowledged. But I’m inclined to believe it for the reasons I stated, and certainly inclined to believe it over the opposite claim which trivializes the whole situation and is made with no basis and no supporting evidence. If GPT is just regurgitating remembered answers in those tests then the whole performance is nothing short of outright fraud.
Reading the actual paper, I can’t be the only person who finds this verbal sidestep funny:
We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two.
I read an interesting point about models “showing their work” today. Anthropic was studying how LLMs arrive at their answers, and one point of investigation was to evaluate its accuracy when describing the steps it takes to reach a conclusion.
Claude can write out its reasoning step-by-step. Does this explanation represent the actual steps it took to get to an answer, or is it sometimes fabricating a plausible argument for a foregone conclusion?
And guess what? That logical path it describes is sometimes just made up, and not reflective of how it actually came up with an answer.
Recently-released models like Claude 3.7 Sonnet can “think out loud” for extended periods before giving a final answer. Often this extended thinking gives better answers, but sometimes this “chain of thought” ends up being misleading; Claude sometimes makes up plausible-sounding steps to get where it wants to go. From a reliability perspective, the problem is that Claude’s “faked” reasoning can be very convincing.
…
When asked to compute the cosine of a large number it can’t easily calculate, Claude sometimes engages in what the philosopher Harry Frankfurt would call bullshitting—just coming up with an answer, any answer, without caring whether it is true or false. Even though it does claim to have run a calculation, our interpretability techniques reveal no evidence at all of that calculation having occurred. Even more interestingly, when given a hint about the answer, Claude sometimes works backwards, finding intermediate steps that would lead to that target, thus displaying a form of motivated reasoning.
I mean, I don’t need to ask AI about suicide methods because I am an adult man with a number of options: sea, mountains, etc.
Who asks AI how to kill themselves? I mean, sure, I am sympathetic to this guy, because choosing to die is quite a big choice.
But when I go out - I have Wernicke-Korsakoff, so that’s in about 3 to 5 years - I don’t need AI. It is just going to be the cleanest, least messy way, and involve as few people as possible. Suicide is painless, and you can take it or leave it as you please.
First of all, I’m less concerned about the “showing its work” aspect than about the fact that GPT arrives at the correct answer to a puzzle it has never seen before.
Secondly, I don’t completely disagree with you. GPT-5 is particularly adept at showing its reasoning steps on difficult problems where it goes into “deep thinking” mode, and sometimes those steps are missteps. Sometimes I’ll challenge it on one of those, and it will acknowledge, “yes, I was way off on that logical step, and here’s why”.
But what I’m referring to by “arrives at a correct answer through an unnecessarily circuitous route” will be something like solving a problem in logic by setting up a series of equations, which ultimately yield the right answer. Then I point out that the answer could have been arrived at more easily through a simple process of logical exclusions, involving no math at all (I’m thinking here of the “fish question” discussed in another thread).
The fact that its solution was somewhat clumsy I take to be good evidence that it achieved it through its own “thinking”. And I deliberately put that word in quotes. I don’t want to get into a pointless semantic debate about what “thinking” is supposed to mean. But I think it’s pretty clear that today’s LLMs exhibit the appearance of thinking that is frequently sufficiently robust that it’s an operationally valid example of what we generally mean by that word.
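To make the equations-versus-exclusion contrast concrete, here’s a toy stand-in (a made-up three-pets puzzle, not the actual fish question from the other thread): a brute-force search over assignments stands in for the “set up equations and grind” approach, while the elimination in the final comments makes the search unnecessary.

```python
# Hypothetical toy puzzle for illustration only:
# Alice, Bob and Carol each own exactly one of {cat, dog, fish}.
# Clue 1: Alice does not own the fish.  Clue 2: Bob owns the dog.
from itertools import permutations

people = ["Alice", "Bob", "Carol"]

# "Set up the machinery" approach: enumerate every assignment and test the clues.
for assignment in permutations(["cat", "dog", "fish"]):
    owns = dict(zip(people, assignment))
    if owns["Alice"] != "fish" and owns["Bob"] == "dog":
        print(owns)  # {'Alice': 'cat', 'Bob': 'dog', 'Carol': 'fish'}

# Exclusion approach needs no search at all:
# Bob has the dog, Alice can't have the fish, so Alice has the cat and Carol the fish.
```

Both routes land on the same answer; the point is only that the first one does more work than the problem requires.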
They may be telling the truth that it was not specifically trained for these tests. I read the news article you linked to upthread and that news article makes it clear that GPT is good at guessing the answers to some standardized tests. I say “guessing” because it has no idea whether it’s correct or not, or on what basis it is correct, because it’s answering problems based on stuff it was trained on. While it may not have been trained specifically for these tests, I would imagine it was trained on plenty of general academic information which would lend itself well to getting correct answers on some tests.
Of course, I see no evidence of reasoning, just guessing. This would explain the wide disparity in performance rates across tests. Just my WAG, but it’s probably easier to scour and generate legalese than it is to scour and generate answers to high school math questions. The case I’m aware of where it could answer PhD-level math questions was absolutely a case where it was explicitly trained on information provided by PhD-level mathematicians. It was trained for a specific test and it whiffed on other math tests. And yet the reporting was that LLMs can replace math PhDs. Utter nonsense. That machine would have accomplished nothing without the in-depth knowledge of human beings, and there would be no way of verifying its guesses without human experts.
Anyway, I don’t trust vibes reporting, CEOs, or the researchers who are in bed with them. And it’s almost impossible to trust a technology where hyperbolic claims are the rule rather than the exception, where so much of the US economy is at stake and there is so little to show for it. I will continue to seek out new information, but at this point my general take is: this whole racket reminds me of the crypto scam. It might be more useful than crypto, but like crypto it will be used to hurt people more than they can be hurt without it. That’s the price of progress. Not much to celebrate.
What I haven’t learned about yet is the reinforcement learning models and their applications. I suspect there’s a lot less bullshit in that dialogue.
You’re missing the point. Anthropic’s research showed that the LLM is not adept at showing its reasoning. Maybe it took that circuitous route to reach the answer; maybe it did it some other way and is just feeding you bullshit about its reasoning. You cannot trust what it tells you, even when it seems very logical to you. Again:
From a reliability perspective, the problem is that Claude’s “faked” reasoning can be very convincing.
No, you’re missing the point, and conflating two different situations.
In situation #1, GPT was presented with a logical puzzle which it chose to solve by setting up a set of equations, which resulted in the correct answer; but it wasn’t necessary to do the math because the puzzle could have been solved through a more elegant, simple process of logical deduction. There’s no “bullshitting” here. It was just taking a mundane approach to a solution rather than exhibiting greater intellectual insight (and this was still way back with GPT-3.5).
In the other case, which was completely different, GPT-5 displayed a series of reasoning steps on a fairly complex problem that sent it into “deep thinking” mode and a bunch of internet searching. When it came up with the wrong answer, it recognized the error after receiving feedback and re-evaluated its analysis. I’m not putting this forth as either advocacy for or criticism of LLMs or AI in general, just as an objective fact.
I have been maligned here as an alleged fanboi of AI. But there’s also the opposite side, represented by the kind of relentlessly skeptical neo-Luddites typified by the late philosopher Hubert Dreyfus, who claimed that no computer would ever be able to play better than a child’s level of chess. It was wonderful when he was thoroughly embarrassed by losing to one of the first reasonably good chess programs, MacHack, in 1967.
I remember once asking my older brother about Dreyfus because he’d had some run-ins with him at conferences and in the literature, and while I don’t remember his exact words the gist of his private email response was that Dreyfus was a nice guy but an ignoramus with little knowledge of the subject matter he was critiquing. I’d say the same about John Searle and his ridiculous “Chinese room” argument which seems to betray a lack of understanding about the difference between complex systems and their individual components. Fanatical AI advocates may be counterproductive, especially when acting as self-promoting marketeers, but so are neo-Luddites taking the opposite position.
a) Dreyfus in fact never said that no computer would ever be able to play better than a child’s level of chess and
b) His major critiques of the AI field as it was in the 1960s have largely proven true, and much recent progress including LLMs reflects in large measure his insightful thinking about the nature of human cognition and the limitations of high level symbol manipulation.
But deeper research, if you could be bothered to do it, reveals nuances showing that what I said is largely true, even if Dreyfus didn’t use those exact words.
Dreyfus was deeply skeptical about AI, not only about its state in the 60s, but about wrongly extrapolating those limitations to the indefinite future. This is what led to his misguided book What Computers Can’t Do and to the chess challenge at MIT where he embarrassed himself with his loss against a mere PDP-6. He interpreted the limitations of 1960s AI as inherent and unavoidable.
He argued that human intelligence isn’t based on symbolic rule-following, and that since in his simplistic view that’s exactly what computers do, they could never be intelligent. He never anticipated ANNs and things like deep learning and reinforcement learning. He wrongly assumed that AI necessarily requires human-like cognitive structure, and in his book he wrongly predicted that computers would never be effective at language translation, robust speech recognition, and many other things that they do quite well today. He never anticipated that systems like AlphaZero could surpass human capabilities through entirely different mechanisms, much as a jet airliner flies higher and faster than any bird without flapping its wings. As part of those claims, he said that no computer at that time could play chess at even an amateur level, with the strong implication that none ever would.
I will take lectures on bothering to do deep research from many people, but you, the guy who spammed this thread with cites he hadn’t read, are not one of them, nor can you seriously imagine that you would be.
One thing I’ve learned in life is that intelligent people respond to criticism with informative arguments made in good faith, and the ignorant resort to content-free snark because they have nothing to say. No personal criticism intended, of course.
To be fair, if you find the back-and-forth enjoyable, then don’t let me dissuade you. Just maintain a clear understanding that it’s the equivalent of debating the provenance of Joseph Smith’s revelations with a fifth-generation resident of Salt Lake City.
You seem very impressed with the reasoning path LLMs take to reach an answer. Does it change your opinion at all, now that you know the displayed reasoning path is not always what it’s actually doing, that it’s sometimes just making it up after the fact?