The point is that without auditing the training data, it's not possible to know whether the material was in the training set. Material from standardized tests is more likely to have been sampled into the training data (directly or indirectly) than many other random facts.
If the LLM knows an answer to a question, then the simplest explanation is that it was indeed trained on the fact. It’s much more likely than emergent reasoning.
If an LLM knows many, but not all, of the facts on a standardized test, that is more likely a capacity issue. Presumably smaller LLMs will do more poorly on the test as a result.
I’m not arguing against emergent properties, just that I’m not convinced that passing a standardized test, even without getting a perfect score, is proof. My suspicion is that intelligence isn’t as complex as we think it is.
I now realize I didn't read your link, so I'll go off and do that.
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
If I suspected that one of my human students had gotten a copy of my test and answer key before the fact, I wouldn't accept "but I didn't get a perfect score!" as a defense. That just means the test-taker didn't perfectly remember all of the key. Maybe it only remembered some of the questions. Maybe it remembered all of the questions, but only imperfectly.

Heck, even if it did take the test honestly, how it lost points would be a very relevant and interesting question that the researchers should have looked into: getting a 6 out of 9 on each of the free-response questions means something different than acing four of the six and turning in the other two blank, even though both of those would mean a score of 36/54.

But OpenAI, despite their name, isn't open. They don't report details. They don't reveal raw data. They release information when and only when it helps their business case. Which, already, is reason to be skeptical of all of their claims. Especially when they do things like release results for tests performed under conditions where fairness was extremely difficult to achieve, and then decline to repeat the experiment once it became easy to do fairly.
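To make the scoring point concrete, here's a quick sanity check in Python, assuming six free-response questions worth 9 points each, as in the example above:

```python
# Two very different performance profiles yield the identical 36/54 total,
# which is why *how* the points were lost matters, not just the headline score.
questions, points_each = 6, 9
max_score = questions * points_each        # 54

even_partial = 6 * questions               # 6/9 on every question -> 36
four_aced_two_blank = 9 * 4 + 0 * 2        # ace four, blank two   -> 36

print(max_score, even_partial, four_aced_two_blank)  # 54 36 36
```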
Sure, but you omitted my most important refutation. Given GPT’s demonstrated capabilities at one-shot and zero-shot learning – the latter essentially meaning that it can infer solutions to a completely new class of problem it has never seen before based on general knowledge and emergent reasoning skills – it would be absurd to suggest that these skills did not play an important part in its test performance, regardless of what test-specific training it may or may not have received.
What am I to take from this thread? That ChatGPT is unreliable? Random? In an early stage of its development? That it will always toss off bizarre answers to questions that a nine-year-old can answer well? That it's no good at counting letters for some indiscernible reason?
Certainly it has at least some degree of problem-solving ability. But my point is that it's extremely difficult to know what degree, when it may have had access to test keys before it took the test, and when the people experimenting on it withhold any information that makes it look bad.
Reasoning capacity aside, it is well established that if multiple copies of an item exist in the training dataset, it becomes very easy for an LLM to regurgitate a near-intact copy of the original data; in that sense, it behaves a bit like a compression algorithm.
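As a toy illustration of that memorization effect (a word-level trigram model, nowhere near an LLM, but the principle is the same): if one passage is duplicated in the training corpus, greedy generation spits it back out near-verbatim, because duplication makes its continuations overwhelmingly likely.

```python
# Toy sketch: duplicated training items get regurgitated word for word.
from collections import Counter, defaultdict

unique_docs = [
    "the cat sat on the mat and looked around",
    "a quick brown fox jumps over the lazy dog",
]
duplicated_doc = "question four asks you to integrate x squared from zero to one"

# Training set in which one item is repeated many times.
corpus = unique_docs + [duplicated_doc] * 10

# Count trigram continuations: (w1, w2) -> Counter of next words.
counts = defaultdict(Counter)
for doc in corpus:
    words = doc.split()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1

def greedy_generate(w1, w2, max_len=20):
    """Always pick the most frequent continuation seen in training."""
    out = [w1, w2]
    for _ in range(max_len):
        nxt = counts.get((out[-2], out[-1]))
        if not nxt:
            break
        out.append(nxt.most_common(1)[0][0])
    return " ".join(out)

# Prompting with the first two words of the duplicated passage
# reproduces the rest of it verbatim.
print(greedy_generate("question", "four"))
```

Real LLMs are vastly more sophisticated, but the same pressure applies: repeated items dominate the learned distribution, which is exactly what makes near-verbatim recall of duplicated training material so easy.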
I would summarise the takeaway as: People seem to want to rely on these things when they really oughtn’t. Not knowing the answer to a thing is one problem; confidently thinking you know the answer, and being wrong, is quite another.