The point is that without auditing the training data, it's not possible to know whether the material was in the training set. Material from standardized tests is more likely to have been sampled into the training data (directly or indirectly) than many other random facts.
If the LLM knows an answer to a question, then the simplest explanation is that it was indeed trained on the fact. It’s much more likely than emergent reasoning.
If an LLM knows many, but not all, of the facts on a standardized test, that is more likely a capacity issue. Presumably smaller LLMs will do more poorly on the test as a result.
I’m not arguing against emergent properties, just that I’m not convinced that passing a standardized test, even without getting a perfect score, is proof. My suspicion is that intelligence isn’t as complex as we think it is.
I now realize I didn't read your link, so I'll go off and do that.
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
If I suspected that one of my human students had gotten a copy of my test and answer key before the fact, I wouldn't accept "but I didn't get a perfect score!" as a defense. That just means the test-taker didn't perfectly remember all of the key. Maybe it only remembered some of the questions. Maybe it remembered all of the questions, but only imperfectly.

Heck, even if it did take the test honestly, how it lost points would be a very relevant and interesting question that the researchers should have looked into: getting a 6 out of 9 on each of the free-response questions means something different than acing four of the six and turning in the other two blank, even though both of those would mean a score of 36/54.

But OpenAI, despite their name, isn't open. They don't report details. They don't reveal raw data. They release information when and only when it helps their business case. Which, already, is reason to be skeptical of all of their claims. Especially when they do things like release results for tests performed under conditions where fairness was extremely difficult to achieve, and then decline to repeat the experiment once it became easy to do fairly.
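To make the scoring point concrete, here's a quick sanity check in Python, assuming six free-response questions worth 9 points each, as in the example above:

```python
# Two very different performance profiles yield the identical 36/54 total,
# which is why *how* the points were lost matters, not just the headline score.
questions, points_each = 6, 9
max_score = questions * points_each        # 54

even_partial = 6 * questions               # 6/9 on every question -> 36
four_aced_two_blank = 9 * 4 + 0 * 2        # ace four, blank two   -> 36

print(max_score, even_partial, four_aced_two_blank)  # 54 36 36
```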
Sure, but you omitted my most important refutation. Given GPT’s demonstrated capabilities at one-shot and zero-shot learning – the latter essentially meaning that it can infer solutions to a completely new class of problem it has never seen before based on general knowledge and emergent reasoning skills – it would be absurd to suggest that these skills did not play an important part in its test performance, regardless of what test-specific training it may or may not have received.
What am I to take from this thread? That ChatGPT is unreliable? Random? In an early stage of its development? That it will always toss off bizarre answers to questions that a nine-year-old can answer well? That it's no good at counting letters for some indiscernible reason?
Certainly it has at least some degree of problem-solving ability. But my point is that it's extremely difficult to know what degree, when it may have had access to test keys before it took the test, and when the people experimenting on it withhold any information that makes it look bad.
Reasoning capacity aside, it is well established that if multiple copies of an item exist in the training dataset, it becomes very easy for an LLM to regurgitate a near-intact copy of the original data; in that sense, it behaves a bit like a compression algorithm.
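As a toy illustration of that memorization effect (a word-level trigram model, nowhere near an LLM, but the principle is the same): if one passage is duplicated in the training corpus, greedy generation spits it back out near-verbatim, because duplication makes its continuations overwhelmingly likely.

```python
# Toy sketch: duplicated training items get regurgitated word for word.
from collections import Counter, defaultdict

unique_docs = [
    "the cat sat on the mat and looked around",
    "a quick brown fox jumps over the lazy dog",
]
duplicated_doc = "question four asks you to integrate x squared from zero to one"

# Training set in which one item is repeated many times.
corpus = unique_docs + [duplicated_doc] * 10

# Count trigram continuations: (w1, w2) -> Counter of next words.
counts = defaultdict(Counter)
for doc in corpus:
    words = doc.split()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1

def greedy_generate(w1, w2, max_len=20):
    """Always pick the most frequent continuation seen in training."""
    out = [w1, w2]
    for _ in range(max_len):
        nxt = counts.get((out[-2], out[-1]))
        if not nxt:
            break
        out.append(nxt.most_common(1)[0][0])
    return " ".join(out)

# Prompting with the first two words of the duplicated passage
# reproduces the rest of it verbatim.
print(greedy_generate("question", "four"))
```

Real LLMs are vastly more sophisticated, but the same pressure applies: repeated items dominate the learned distribution, which is exactly what makes near-verbatim recall of duplicated training material so easy.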
I would summarise the takeaway as: People seem to want to rely on these things when they really oughtn’t. Not knowing the answer to a thing is one problem; confidently thinking you know the answer, and being wrong, is quite another.