It is impressive, but then we do have to ask why it blew the two-pounds-of-feathers problem so badly.
I’d suggest as a possibility something that came up in the General Aviation thread: expectation bias. Why would an experienced pilot put the gear down when the captain clearly called for “flaps 1”? The likely reason is expectation bias: the pilot really, really thought that gear down was the appropriate action in that situation, and so “heard” that as the callout, even though that wasn’t what was said.
The pound-of-feathers riddle must appear in many thousands of places on the web, almost always in the same form (though with differences in wording). So there must be a tremendous bias to interpret the question in that canonical form, even when what’s asked is actually slightly different (I read the modified question incorrectly myself). The LLM has to deal with a high degree of ambiguity in its inputs and has to correct for small errors that probably weren’t intended; it would not be a very good tool if it couldn’t do this. In this case the correction just went overboard, due to the strong training bias.