It got fooled by the functionally irrelevant discount percentage, failing to discern that the stated price is after discount and shouldn’t be discounted again.
That said, I suspect humans might make the same mistake.
It got fooled by the functionally irrelevant discount percentage, failing to discern that the stated price is after discount and shouldn’t be discounted again.
That said, I suspect humans might make the same mistake.
This is the question - the key points are:
Carpet dimensions are given in feet, but the cost of the carpet is given per square yard.
There is a wholly irrelevant mention of a sale; the cost per sq. yd. given is already the true cost after the discount.
I included both these factors in the prompt.
ChatGPT approached the problem in the right way:
But it went wrong in two ways.
Firstly, it fell for the red herring of the sale, applying a 30% discount to the already discounted cost.
Secondly and more interestingly, it got the arithmetic wrong!
1503x2184 is not 3,281,352. It is 3,282,552.
It gets the division by 9 to convert to square yards correct but then screws up the multiplication by the price: 364,594.67 * 7.89 is not £2,877,873.66 but £2,876,651.92. It then gets the 30% off calculation correct (although it didn’t need to make it).
Both of its errors seem close to what a human would make - applying the 30% off figure erroneously through not reading the question, and then screwing up the arithmetic. But they’re very different types of error!
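For anyone who wants to check the sums, here’s a quick sanity check in Python. The figures (the 1503 ft x 2184 ft dimensions, the £7.89 per sq. yd. price and the 30% sale) are taken straight from the posts above; nothing else is assumed.

```python
# Redo the arithmetic from the question with the quoted figures.
area_sq_ft = 1503 * 2184          # 3,282,552 - not ChatGPT's 3,281,352
area_sq_yd = area_sq_ft / 9       # 364,728.0 square yards
cost = area_sq_yd * 7.89          # £2,877,703.92 - the correct answer, since
                                  # £7.89 is already the post-discount price

# Redoing ChatGPT's own steps, starting from its wrong product:
its_area_sq_yd = 3281352 / 9          # 364,594.67, as it reported
its_cost = its_area_sq_yd * 7.89      # £2,876,651.92 - not the £2,877,873.66 it gave
its_discounted = its_cost * 0.70      # the 30% off it should never have applied

print(round(cost, 2), round(its_cost, 2), round(its_discounted, 2))
```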
Didn’t get the second post in quick enough, but yes I think so too.
So, did ChatGPT ever take any of the AP exams, then? There was a big news story back in January or February about the great scores it got on all of the various tests, except for the small detail that at the time, it was impossible for it to have taken any of the tests. But now that we’ve had a testing window, I assume that someone did the experiment?
Here’s a list of exams ChatGPT has taken:
Source document:
Table of exam scores starts on page 5.
…And, following the cites in that page, it’s one of the ones from before it actually took the test. I’m looking for actual results, not fake hype.
Are you looking at the Business Insider link, or the source document from OpenAI? That document is dated March 2023, and it says they used publicly available exams.
What exactly are you doubting? That this is all fake and GPT didn’t actually take exams representative of what students take? Or just that it can’t have taken this year’s exams?
As I mentioned in the other thread (and I believe discussed in this one, too), basic arithmetic is a known weakness of LLMs, although, as mentioned, they’re getting much better and seem to be evolving spontaneous arithmetic skills as an emergent property that is a function of scale and training. But producing arithmetic results as approximations is a long-known characteristic of GPT 3.5, especially when division of large numbers or decimal places is involved. I don’t know how much better GPT 4 is.
Secondly, when I first submitted the carpet question, as cited in the other thread, I omitted any mention of a sale discount and simply gave it the carpet price as $9.49 per sq.yd. So I just tried it again, and it did NOT apply the discount twice. The difference is that I included all three pieces of information with inherent redundancies: the regular price, the percent discount, and the sale price. It was smart enough to ignore the redundant information. Whereas with your grass-sowing question, you mentioned the sale percentage and the cost per square yard, and I’m sure many humans would also feel that both those pieces of data were relevant and therefore the cited price should have the cited discount applied. In fact, in the other thread, @Left_Hand_of_Dorkness describes how easily humans are misled by irrelevant information introduced into a problem.
The one thing that surprised me was that when I submitted the second version of the carpet question with the irrelevant sale information, ChatGPT chose what I mentioned before as a typically over-analytical path instead of a simpler one: it ignored the stated sale price as irrelevant and computed the sale price for itself by applying the stated discount to the regular price.
Unfortunately in doing so it ran into two issues, the primary one not being ChatGPT’s fault at all, namely the fact that “41% off” is a lie. The actual discount is approximately 40.65%. Additionally, it also stumbled on the fractional arithmetic and computed the discount at $6.559 instead of the arithmetically exact $6.5559. In the end, it concluded the cost of the carpet would be $113.17 instead of its formerly correct $113.88.
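For clarity, here’s how those figures hang together. The $15.99 regular price and the 12 sq. yd. of carpet are not stated above; they’re just the values consistent with the quoted $9.49 sale price, $6.5559 discount and $113.88 total, so treat them as back-calculated assumptions.

```python
# Reconstructing the carpet numbers (regular price and area are assumed, see above).
regular = 15.99    # assumed regular price per sq. yd.
sale = 9.49        # stated sale price per sq. yd.
area = 12          # assumed carpet area in sq. yd.

true_discount = (regular - sale) / regular   # ~0.4065, i.e. ~40.65%, not the advertised 41%
exact_markdown = regular * 0.41              # 6.5559 - ChatGPT rounded this to 6.559
cost_from_sale_price = area * sale                 # 113.88, its formerly correct answer
cost_from_own_markdown = area * (regular - 6.559)  # 113.17, the answer it gave this time

print(round(true_discount, 4), round(exact_markdown, 4),
      round(cost_from_sale_price, 2), round(cost_from_own_markdown, 2))
```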
Sorry, I’m with @Sam_Stone on this. OpenAI is not just a reputable business, it’s also a prestigious research organization. Do you really think they’d be publishing “fake hype” disguised as a research report? Do you also doubt similar reports from IBM Research about DeepQA/Watson?
Right, publicly available exams, not the AP exam. There’s a reason that you don’t give students the same test for an evaluation that they used for practice.
I don’t think anything about what they would do. I think something about what they provably, actually did do. They published fake hype about something that was simply not possible.
Did you actually read their description of the source materials? Here it is:
We sourced either the most recent publicly-available official past exams, or practice exams in published third-party 2022-2023 study material which we purchased. We cross-checked these materials against the model’s training data to determine the extent to which the training data was not contaminated with any exam questions, which we also report in this paper.
The Uniform Bar Exam was run by our collaborators at CaseText and Stanford CodeX.
It seems to me that your objections are basically meaningless semantic quibbles. They’re very clear about exactly what materials they used and the conditions under which the testing was done. “Fake hype” is just an absurd and grossly unfair characterization.
And to elaborate a bit more, GPT-4 mostly wasn’t trained on problems that appeared in the practice exams, and for the remainder the scoring takes that into account:
We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.
I’m sure many humans would also feel that both those pieces of data were relevant and therefore the cited price should have the cited discount applied. In fact, in the other thread, @Left_Hand_of_Dorkness describes how easily humans are misled by irrelevant information introduced into a problem.
Yes, I’ve said this twice already in this thread. And I think that’s really interesting, that the LLM approach to AI leads to a system that duplicates human-type errors. It’s somewhat counterintuitive, in that our perception of computers is that they cannot fail to get arithmetic right - they are, fundamentally, adding machines. But in this case of course, the infallible adding has gone into the construction of the neural net - which is not itself an adding machine.
It seems that we could have the best of both worlds - an LLM to read the problem and decide what calculations are needed, that then passes those calculations to an actual arithmetic engine (even Excel would be fine for this, getting computers to do sums is a solved problem) rather than doing them itself. In fact, it would be kind of weird if millennia after the abacus was invented, we somehow went back to fallible entities doing mental arithmetic as our basic means of calculation. Similarly, I’ve seen DALL-E outputs which asked for written word as part of the image (e.g. speech bubbles, labels) and the drawing was fine but the writing was awful, not even English. But LLMs can do that bit fine, of course.
So presumably at some point in the near future we’ll integrate these different skills so that they’re done by the engine that does them best.
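Just to make that division of labour concrete, here’s a minimal sketch of the pattern I mean. The ask_llm call is a hypothetical stand-in for whatever chat API you’d actually use; the point is that the model only has to produce the expression, and a boring deterministic evaluator does the sums.

```python
# Minimal sketch: the LLM decides *what* to compute and emits an arithmetic
# expression; a deterministic evaluator actually does the sums.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    """Safely evaluate a plain arithmetic expression like '1503*2184/9*7.89'."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("only basic arithmetic is allowed")
    return walk(ast.parse(expr, mode="eval"))

# expression = ask_llm("Reduce this word problem to a single arithmetic expression: ...")
expression = "1503*2184/9*7.89"          # the kind of thing the LLM would hand back
print(round(evaluate(expression), 2))    # 2877703.92 - the adding machine does the adding
```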
Yes, I’ve said this twice already in this thread. And I think that’s really interesting, that the LLM approach to AI leads to a system that duplicates human-type errors. It’s somewhat counterintuitive, in that our perception of computers is that they cannot fail to get arithmetic right - they are, fundamentally, adding machines. But in this case of course, the infallible adding has gone into the construction of the neural net - which is not itself an adding machine.
A couple of things:
Linguists should be having a field day with this - the notion that we have coded so much of what it means to be human into language that an LLM can develop human-like thought processes and intelligence just by reading terabytes of language. AIs are teaching us a lot about how humans operate.
You are correct that there’s nothing specific to doing arithmetic in the base ‘code’ of an AI (the human-written ‘transformer’ architecture), but somewhere along the way, the LLMs DO develop custom ‘circuits’ for handling math. Originally, they treat numbers like words - tokenized strings no different from other letters or words. But eventually they abandon this approach because it’s riddled with errors, and derive their own math capabilities.
One example I have posted before is of an LLM that suddenly started scoring perfectly on various addition tests, and the researchers, using mechanistic interpretability techniques, discovered a custom ‘circuit’ in the LLM that used Fast Fourier Transforms and trig identities to solve the problems. No one told it to do that. The capability just emerged.
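For anyone curious what “Fast Fourier Transforms and trig identities” even look like in this context, here’s a toy illustration based on the published mechanistic-interpretability work on modular addition; the modulus, operands and frequency are my own picks for the demo, not anything read out of an actual network.

```python
# Toy demo of the trig-identity trick for (a + b) mod p, per the grokking papers.
import numpy as np

p = 113          # modulus of the kind used in those experiments
a, b = 47, 92
k = 5            # one frequency; the real networks combine several
w = 2 * np.pi * k / p

# cos/sin of w*(a+b) can be assembled from cos/sin of a and b separately:
cos_sum = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
sin_sum = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)

# The score for a candidate answer c is cos(w*(a+b-c)), which peaks at c = (a+b) mod p:
scores = [cos_sum * np.cos(w * c) + sin_sum * np.sin(w * c) for c in range(p)]
print(int(np.argmax(scores)), (a + b) % p)   # both print 26
```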
It seems that we could have the best of both worlds - an LLM to read the problem and decide what calculations are needed, that then passes those calculations to an actual arithmetic engine (even Excel would be fine for this, getting computers to do sums is a solved problem) rather than doing them itself.
GPT-4 has a plugin for Mathematica, which will allow it to do any math Mathematica can do - which is a lot. But for basic math all the way up to calculus, GPT-4 seems to be pretty good at it all by itself. I think it still gets tripped up manipulating large numbers, though, probably because of tokenization.
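On the tokenization point, here’s a rough way to see it for yourself, assuming you have the tiktoken package installed (cl100k_base is, as far as I know, the encoding the GPT-3.5/4 chat models use):

```python
# Long numbers get chopped into arbitrary multi-digit chunks, not single digits,
# so the model never sees a clean place-value representation of them.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["7", "1503", "3282552"]:
    tokens = enc.encode(s)
    print(s, tokens, [enc.decode([t]) for t in tokens])
```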
In fact, it would be kind of weird if millennia after the abacus was invented, we somehow went back to fallible entities doing mental arithmetic as our basic means of calculation. Similarly, I’ve seen DALL-E outputs which asked for written word as part of the image (e.g. speech bubbles, labels) and the drawing was fine but the writing was awful, not even English. But LLMs can do that bit fine, of course.
DALL-E 3 can write text in images now. I don’t know if it’s perfect, but it’s pretty good.
So presumably at some point in the near future we’ll integrate these different skills so that they’re done by the engine that does them best.
The addition of multimedia (video, images, audio) to GPT-4V seems to have enhanced its abilities across the board, and not just with multimedia. It’s like it has a richer view of the world now, and can make better decisions.
And to elaborate a bit more, GPT-4 mostly wasn’t trained on problems that appeared in the practice exams, and for the remainder the scoring takes that into account:
No, it wasn’t specifically trained for the AP exam. Which means exactly the opposite. Most of the material online teaching calculus is geared towards AP calculus, which means that it includes AP test questions, biased towards the most recent tests. If they didn’t go to extraordinary lengths to curate their training data, then it’s basically guaranteed that the training data included every problem from the most recent AP tests. A “version of the test with all of those problems removed” would be blank.
It’s not hard to avoid this sort of contamination. All you have to do is the same thing a human test-taker would do: Take the brand-new test before it has a chance to make it into the training data. Which is why I asked if they’d done that.
If they didn’t go to extraordinary lengths to curate their training data, then it’s basically guaranteed that the training data included every problem from the most recent AP tests.
They do go to extraordinary lengths to curate their training data. Both in general (OpenAI is generally thought to have the best training set in the biz) and specifically in this case (where they kept track of all AP test questions in the training set and ensured they didn’t appear in the tests they took). Sure, they might be lying, but that’s a pretty strong claim for you to make without much evidence.
Yeah, they might be lying when they said that they didn’t specifically train it for the test. Because if they didn’t, then the test questions were there.
I don’t know why you think they’re incapable of detecting whether the test questions are in their training set or not. It’s a much easier problem than the other stuff they’re doing. Whether they were specifically excluded or weren’t there due to timing doesn’t really matter as long as they had the ability to check for them. They undoubtedly have a system for searching their training set that is independent of the LLM itself.
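To make “a system for searching their training set” concrete, here’s the general shape of that kind of check - not OpenAI’s actual pipeline (their report talks about substring matching, but the details are theirs), just a sketch of why flagging verbatim overlap is a much easier problem than anything the LLM itself does.

```python
# Flag an exam question as contaminated if any chunk of it appears verbatim
# in an index built over the training corpus (sketch only; names are made up).
def ngrams(text: str, n: int = 50) -> set:
    """Overlapping n-character chunks of a whitespace-normalized string."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(1, len(t) - n + 1))}

def is_contaminated(question: str, corpus_index: set, n: int = 50) -> bool:
    """True if any n-character chunk of the question appears in the corpus index."""
    return any(chunk in corpus_index for chunk in ngrams(question, n))

# corpus_index would be built once over the whole training set; placeholder here:
corpus_index = ngrams("placeholder text standing in for the entire training corpus")
print(is_contaminated("An example exam question would go here.", corpus_index))
```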
Interesting:
“Mr. Altman’s departure follows a deliberative review process by the board, which concluded that he was not consistently candid in his communications with the board, hindering its ability to exercise its responsibilities,” the company said in its blog post. “The board no longer has confidence in his ability to continue leading OpenAI.”
Mira Murati will be the interim CEO.
I don’t know why you think they’re incapable of detecting whether the test questions are in their training set or not. It’s a much easier problem than the other stuff they’re doing.
Well, maybe, depending on how good their non-AI image processing capabilities are. A lot of copies of AP tests out there are in image format, due to being screen-captured or scanned from printouts.
But OK, suppose they can. What do they do when (not if; it’s basically guaranteed) they find that the training data did contain the test that they’re using? Edit the training data and re-train a new bot without it? Possible, but they explicitly said they didn’t do a training just for the bot to take the tests.
They already said what happens: they take the test twice, once with the pre-trained questions, and another time without, and take the lowest score.
They aren’t going to exclude old test questions, but new tests are being made all the time. The models have a cut-off date, and they can ensure they only take tests made after that date if they want the testing to be useful.
That said, it’s not clear they actually need to. While it’s certainly possible that the LLM will just memorize the answers, it’s not obvious that it does, except in a probabilistic, holographic way. They can figure out if this is the case or not by just comparing results from tests that are in the training set vs. equivalent ones that aren’t. If the LLM is memorizing the answers, it should perform far better in the first case.
Interesting:
Yeah, this is super weird, and very sudden. Just the other day, Altman was speaking at the APEC conference as if nothing were wrong. No details at all so far. I have little doubt that Altman and the board will have different interpretations of the alleged failures, but we’ll have to wait for more info.