That isn’t what I said. In the prior sentence I said that the issue of AI hallucinations “stems from the absence of confidence filtering”. In that respect GPT differs from (for example) the DeepQA engine in IBM’s Watson QA system, which does apply confidence scoring to candidate responses because it was built to be a reliable QA system with commercial applications. At this stage in its development, GPT doesn’t evaluate the quality of its own responses.
Nevertheless, in certain circumstances GPT can and does respond that it doesn’t have the information to answer the question. Typically this happens when it’s presented with a self-contained logic problem in which all the information needed to solve it is presumed to be in the problem statement, but isn’t.
Neither of us knows exactly what was or was not in the GPT training set at the time it took those tests. But what you appear to be saying is that all those questions on all those tests, and all the answers to them, were already contained in the GPT training set.
This would imply that GPT was doing nothing more than looking up the answers on a cheat sheet, which amounts to saying that the whole exercise was a scam. That’s a rather audacious claim, and there are many reasons to believe it couldn’t possibly be the case.
First of all, GPT didn’t get a perfect score on any of the tests; it simply did very well on most of them, in most cases better than a trained human, the exception being the medical exam taken by medical school graduates, where it barely passed. All of which is indicative of applying general knowledge and reasoning skills, not a “cheat sheet”, particularly on questions like those in the Wharton business test that required generalizing the solution to a business problem from case studies and providing the reasoning for it. If GPT were just regurgitating something that had already been written, it would surely not be difficult to expose that.
But the other point is that we already know from experience, both from questions that we here on the Dope have posed to GPT and from papers like the one I cited from Nature Human Behaviour, that GPT does indeed possess emergent reasoning skills and is capable of solving novel problems it hasn’t seen before.
One interesting example illustrating both its strengths and weaknesses is the logic problem about the fish that I cited in another thread, which goes like this:
A fisherman has 5 fish (namely A, B, C, D, E), each having a different weight. A weighs twice as much as B. B weighs four and a half times as much as C. C weighs half as much as D. D weighs half as much as E. E weighs less than A but more than C. Which of them is the lightest?
As I said over there, the thing a perceptive individual would recognize about this question is that each of the first four comparative statements immediately rules out one of the fish. The first one rules out A. The second one rules out B. The third one rules out D. The fourth one rules out E. Bingo! One can immediately see that the lightest fish must be C.
ChatGPT solved the problem, but it took the long way around. It failed to see this simple series of exclusion steps and instead developed a set of simultaneous equations, which it proceeded to solve. The fact that it solved the problem in a non-optimal way seems like pretty good evidence that it wasn’t just looking up the answer on a “cheat sheet”.
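For anyone curious, here’s roughly what that “long way” amounts to. This is my own reconstruction in Python, not GPT’s actual working: express every weight as a multiple of C using the ratio statements, then just take the minimum.

```python
# My reconstruction of the "equations" route (not GPT's literal output):
# express each weight as a multiple of C, then find the minimum.

c = 1.0            # take C as the unit of weight
d = 2.0 * c        # "C weighs half as much as D"        -> D = 2C
e = 2.0 * d        # "D weighs half as much as E"        -> E = 4C
b = 4.5 * c        # "B weighs four and a half times C"  -> B = 4.5C
a = 2.0 * b        # "A weighs twice as much as B"       -> A = 9C

weights = {"A": a, "B": b, "C": c, "D": d, "E": e}

assert c < e < a   # sanity check: "E weighs less than A but more than C"
print(min(weights, key=weights.get))   # prints: C
```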
Incidentally, when I pointed out this simpler solution, GPT readily acknowledged it … 
Yes, that’s a great observation! By looking at the statements one by one, you can indeed eliminate each option systematically without needing to solve equations.
- A weighs twice as much as B: This means A is heavier than B, so A cannot be the lightest.
- B weighs four and a half times as much as C: This means B is heavier than C, so B cannot be the lightest.
- C weighs half as much as D: This means D is heavier than C, so D cannot be the lightest.
- D weighs half as much as E: This means E is heavier than D, so E cannot be the lightest.
- E weighs less than A but more than C: This confirms that E is not the lightest, and since C is lighter than both B and E, C must be the lightest.
By following this logic, you can immediately identify that C is the lightest without needing to go through the calculations. This method is efficient and showcases the power of logical reasoning!
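Just to make the contrast concrete, the elimination shortcut can be written down almost mechanically. Here’s a quick Python sketch of the same logic (my illustration, not anything GPT produced): every statement of the form “X is heavier than Y” removes X from the running.

```python
# Elimination shortcut: any fish known to outweigh another cannot be the lightest
# (all five weights are different, per the problem statement).
candidates = {"A", "B", "C", "D", "E"}

# (heavier, lighter) pairs read straight off the comparative statements
comparisons = [
    ("A", "B"),  # A weighs twice as much as B
    ("B", "C"),  # B weighs four and a half times as much as C
    ("D", "C"),  # C weighs half as much as D
    ("E", "D"),  # D weighs half as much as E
    ("A", "E"), ("E", "C"),  # E weighs less than A but more than C
]

for heavier, _lighter in comparisons:
    candidates.discard(heavier)   # the heavier fish of each pair is ruled out

print(candidates)   # prints: {'C'}
```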
Another GPT skill that comes to mind is its excellent ability to summarize, which undeniably requires genuine comprehension of material that can sometimes be very technical. I’ve tested it on many different paper abstracts and it has never failed to get them right. On one occasion its summary, while correct, was still a bit lengthy, so I asked it to boil it down to a few sentences, and it did, once more preserving the basic idea.
I think at some point one has to acknowledge that the “apparent” reasoning skills of LLMs have been demonstrated so often and in so many different contexts that the argument that it’s all just some sort of illusion arising from elaborate pattern matching has to be abandoned. I think we’re way past that point.