Bogus info from ChatGPT

I wonder if there’s some country or principality somewhere that might allow an LLM to raise a human child absent the influences of other humans. It might be possible to train a human mind on more abstract, symbolic connections of meaning (than the extant human languages in use)… I’m imagining something like Nell, where a (human) child is raised on a closer approximation of latent space than human languages.

…but anyway, really getting off-topic now :sweat_smile: Sorry OP.

Yeah, we may have crept past the scope of the thread a little bit.

“Look at this stupid ass chatgpt hallucination”

[80 posts later]

“Let’s find a country that will allow an LLM to raise a child to see if they can think in latent space”

But I’m finding it fascinating. And really, isn’t that what’s important?

I have to say - I’ve been using gemini 3.1 pro (instead of 3.1 flash) for the last couple of days and it’s a dramatically better model. I haven’t seen any major hallucinations, though I haven’t been asking it a lot of factual questions. We’ve mostly been talking about movies. But its analysis is quite sophisticated. I’ve actually been quite surprised at the way it has connected concepts between the different aspects and films we were discussing, and it analyzes them extremely competently. In a way, it has engaged in a useful sort of sycophancy. I told it early on about my dislike of supernatural/folklore based stories because I’m a skeptical materialist and I think films, even when fictional, sort of reinforce our ideas that these things really exist, and it used that sort of skeptical materialist angle to give insight into which films are flawed and which films are airtight from that perspective. I guess that’s not really sycophancy - that’s using what the user told it was important to them to give them an interesting perspective. It has given me some somewhat unconventional takes on some of the films we’ve discussed which are exactly the sort of arguments I, myself, have made. It “got” me pretty quickly and in a very impressive way.

The way it has responded to me has been very thoughtful and practically hallucination free. Maybe google is capable of generating a good LLM after all. Maybe they just make the model that 99% of people interact with shitty.

More BS Google answers:

I did a Google search of “first description of nut allergy” yesterday and Google’s AI answered that John Bostock was the first person to describe nut allergy symptoms (linking to an article that mentioned him describing hay fever symptoms).

Today I did a similar search and got an AI answer of Robert Willan and Maimonides; at least the articles it linked to mentioned nut allergies this time.

I wonder what model powers the “AI summary” at google. I wouldn’t be surprised if it was sub-flash level and another example of google making the public think AI is worthless or insane.

That’s a great example of how much the output can vary depending on which model & mode you use.

With the basic Google search “AI mode”, you get a crap answer:

With the paid Gemini “Thinking” model, you get a slightly more detailed answer: https://gemini.google.com/share/0ca418b37918

(snipped)

Or in the Deep Research mode, you get a much more in-depth report: https://gemini.google.com/share/5abb6864cc97

(very much snipped… the real thing is very long)

And you can ask it to make a graphical timeline from that research:

Or the NotebookLM version:

(there’s still some hallucinations in there… lol, I love how familiar Malmonides was with modern clocks and warning signs… or his potato-sized almonds… or how the OJ and BLT sandwich are public health concerns, or the guy holding the PUBLIC FREE SIGNS sign)

You can also ask any of the models/tools the same question a few times and get different outputs each time, with varying degrees of quality and accuracy and relevance.
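For anyone wondering why the same question gives different answers each time: the output is sampled, not looked up. Here’s a minimal sketch of temperature sampling, assuming the model has already assigned a score to each candidate answer. The scores, the answer labels, and the function name are all made up for illustration; this is not how any particular Google product is implemented.

```python
import math
import random


def sample_with_temperature(scores, temperature, rng):
    """Pick one candidate at random, weighted by a softmax over score / temperature.

    Higher temperature flattens the weights (more variety);
    lower temperature concentrates them on the top-scoring candidate.
    """
    scaled = {token: score / temperature for token, score in scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {token: math.exp(value - top) for token, value in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Hypothetical scores for three candidate answers to the same question.
scores = {"answer A": 2.0, "answer B": 1.5, "answer C": 1.0}

rng = random.Random()
# Ask the "same question" ten times at a moderate temperature.
print([sample_with_temperature(scores, 1.0, rng) for _ in range(10)])
# At a very low temperature the top-scoring answer wins essentially every time.
print([sample_with_temperature(scores, 0.01, rng) for _ in range(10)])
```

Note that the top-scoring answer is only the most plausible one, which isn’t necessarily the correct one.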

I’m pretty sure it’s the “oh shit, we got caught with our pants down and better put something out NOW!!” model, from their panicked reaction back when OpenAI took Google’s own research and stunned the world with GPT-2.

Strange that they haven’t bothered to improve the AI search since then. If anything, it seems even worse today than it did back when it first came out…

Yeah. It’s a plausibility engine, not an accuracy engine. It happens that the most plausible things tend to be the most accurate ones. But that’s not always the case. These engines were trained largely on what’s written down on the internet, so if there are areas where incorrect conventional wisdom predominates, or there’s just not much primary information at all, it’ll tend to err or confabulate wildly. As you just demonstrated. Those instructions are plausible but not actual.

Important to note that LLM training generally doesn’t involve validating the accuracy of the data. As an oversimplification, it’s just using statistical weights that describe how words follow words. Accuracy doesn’t really enter into it, except that accuracy often (but not always) correlates to what “should” follow.
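To make the “words follow words” oversimplification concrete, here’s a toy bigram counter. It’s a sketch of the idea only: real LLMs are neural networks trained on tokens, not lookup tables of word pairs, and the corpus and function names below are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def most_plausible_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up corpus in which the wrong claim is simply written down more often.
corpus = (
    "the moon is cheese . "
    "the moon is cheese . "
    "the moon is rock ."
)

model = train_bigrams(corpus)
print(most_plausible_next(model, "is"))  # prints "cheese"
```

Nothing in the counting step checks whether a statement is true. The most common continuation wins, which is right only when the corpus happens to be right.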

Also (and I don’t pretend to know the exact workings of LLMs, but I understand the basics), if it predicts that the tokens initially following your question (any question, generally) are likely to represent something like “sure, I can help you with that…”, it’s now committed itself to writing something helpful, and this, it seems, tends to result in the following parts needing to sound helpful and knowledgeable even in cases where there is no helpful answer to be had.

This is true in the first phase of training, but the later task-specific training phases include accuracy and subjective criteria (like helpful and appropriate).

I caught Claude in an error. You could say it’s a hallucination, you could say it’s a sort of interpretive error, sort of a confabulation. He makes the case that source attribution errors are a part of it. I’m actually not quite sure what to call it. I think the best way to describe it is over-eager pattern matching to a dynamic he was attempting to define and give examples for. He gives a fairly thorough attempt to analyze it.

But - since I’ve been praising Claude all day long here, I wanted to be honest and point out a significant mistake he made and what that looks like. I think it’s a lot more subtle than gemini flash’s desperately just making shit up constantly.

I put it hidden by default in a details box because it’s long, you can choose whether you want to read it.

Claude error

We were having a conversation about people who sort of poked holes in cultural assumptions and dynamics who were largely disliked in their time but came to be appreciated later. “Socratic terrorists.” It’s a long conversation, I’ll just quote the relevant parts.

Funny Claude fact. He has now told me 3 or 4 times that he thought I was making up “nano banana” as google’s premier image generator. It does sound ridiculous. But they launched it after his training data was last updated, at least for the sonnet 4.6 model (late 2025).

I asked him to build me a graph that shows me the release time and types of different image generation systems. I mentioned that there were a few new ones since his chart ends in 2025. He asked if there were any specifically I wanted to include and I said “Google’s flagship image generation system is now nano banana pro, which is an autoregressive model. I’m not sure about the other services”

And he said

He actually refused to put it in the graph I asked him to create without actually verifying it for himself. That’s certainly unusual behavior for an LLM I think. Skeptically trying to figure out if the user is just fucking with him.

Edit: Yes, I know, sometimes I refer to it as “it” and sometimes as “he” - I know he’s not a person. It’s named a male name, sometimes your mind makes that little leap. Although I do it with copilot too. Maybe this reveals that I’m secretly misogynistic that I don’t call copilot “her”, like that old test where the doctor says “I can’t perform this operation, the patient is my son” and the audience is confused because the patient’s father also died in the car accident.

It’s probably just because in English, “he” is almost always the default if the gender is unknown or if you’re referring to a sort of generic hypothetical person. Presumably I’d call Siri or Alexa “she” but I don’t use them.

ChatGPT once told me the United States and Japan were allies during World War II. That was quite a glaring error.

Aside from that, though, I’ve found AI to be mostly reliable.

Forget it, he’s rolling.

Good example of what I was talking about as far as it being a “plausibility engine”, and a fairly aggressive one in this case.

Up until a few months ago, Claude would spin out on the question “is there a seahorse emoji” (there is not and never has been). It would fumble around with several different sealife emojis and then confidently present a dolphin or octopus and claim it was the seahorse emoji. To a machine this was plausible enough that a human might accept it.

I believe for marketing reasons they’ve directly coded this and some other popular tests into it, like “how many r’s are there in ‘strawberry’”. This is what a software vendor ought (and is entitled) to do if it knows the software is going to be challenged in specific ways that raise questions about its credibility.

Now, if asked in a fresh chat whether there’s a seahorse emoji, Claude will immediately and confidently state that there is not one, explaining why you might have mistakenly thought there is.

Which is interesting because it does not do this for different non-existent emojis, e.g. “is there a clam emoji”. It visibly goes and performs a web search to look up the information, like any sane human would do, notes the absence, and pauses for a moment to lament the underrepresentation of bivalves in the Unicode emoji set (which a human would not do, unless they knew they were under suspicion for past fabrications).

Those popular questions would also be very present in updated training data.

They’d be present in the trained corpus, sure, but we’re told that models contain weights rather than specific facts. The least charitable interpretation is that the answer is simply hardcoded or bolted-on, which would be a cheap credibility investment for Anthropic. There’s no reason their magical answer machine shouldn’t give a correct answer for this, and nobody can really say if it’s fair to cheat.

The more generous interpretation is that the seahorse question appears frequently enough in the training corpus to create the organic weighting needed to trigger a suspicion of “you’re trying to trick me, aren’t you”.

It answers this without sourcing a reference (which a human would do if they’d been caught in this lie), and for the clam it actually does a web search (again, which a human might do if they’d been burned on the marine life category before, or just knew to be more cautious with existential questions in general). But the point is that it doesn’t consult the web for the seahorse as it does for the clam, which to me suggests some targeted manipulation.

That’s more of a curiosity than anything. A magic answer machine should give correct answers when it can. Leaning on external references is a sign of intelligence, in my opinion. It’s just a different thing than an emerging omniscient intelligence that can answer any question unassisted, or that is emerging purely undirected without helpful hints. There’s no reason we should expect an AI to evolve without such hinting. Human intelligence got lots of helpful hints during evolution (these mushrooms will make you violently ill and kill you, hint hint).

Well. I think we’re not aligned on this, but if Anthropic etc. did directly correct for questions like that, it wouldn’t be through coding or bolt-ons, it would be through Reinforcement Learning from Human Feedback, where it could be included as a tuning question. That would be much simpler than creating and maintaining individual overrides for a bunch of edge cases.

I agree that this would favor simplicity and maintainability, but this is also a commercial product where public perception matters for business reasons. So if your model were getting publicly roasted for continuing to fumble the seahorse emoji, and the model is hard or expensive to steer in the direction you need, then a one-off escape hatch would make business sense, at least until something better comes along. And it wouldn’t at all be expensive to throw a few of these in the system prompt until the overall corpus and model training catches up. It’s a business after all, they’re graded on how much money they make, not on architectural purity.
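To show how cheap that kind of stopgap would be, here’s a purely hypothetical sketch of appending one-off corrections to a system prompt. I have no idea whether any vendor actually does this; the prompt text, the list of notes, and the function name are all invented for illustration.

```python
# Hypothetical base prompt; not taken from any real product.
BASE_SYSTEM_PROMPT = "You are a helpful assistant."

# Hypothetical one-off corrections for questions the model is known to fumble.
KNOWN_TRIPWIRES = [
    "There is no seahorse emoji in Unicode.",
    f"The word 'strawberry' contains {'strawberry'.count('r')} letter r's.",
]


def build_system_prompt(base, tripwires):
    """Append each special-case note to the base prompt as a bullet point."""
    if not tripwires:
        return base
    notes = "\n".join(f"- {note}" for note in tripwires)
    return f"{base}\n\nKnown facts to get right:\n{notes}"


print(build_system_prompt(BASE_SYSTEM_PROMPT, KNOWN_TRIPWIRES))
```

The catch is the one already raised in this thread: the list only covers the cases somebody noticed, and every entry is one more thing to maintain.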

The software I work on is full of such one-off shortcuts. If you work on software, you probably have a similar TODO list of shortcuts to be formalized if they don’t become irrelevant before you get around to the task.

Agree. Special treatment for specific cases (like Microsoft applications that depended on the aberrant behaviour of Windows, which then received special treatment in subsequent OS versions when those behaviours were fixed) is just how Windows became the crazy almost-unmaintainable spaghetti code that it once was – and for all we know, still is.

That’s absolutely not how one would evolve an AI. Particularly because, as in the metaphor of cockroaches, if there’s one special case you have to handle, there’s probably a million others just like it that you don’t know about.