Bogus info from ChatGPT

This thread is not meant to be a Debbie Downer, anti-AI free-for-all, promise. We’ve got a couple of threads full of info and opinions. Here I’d just like to hear cautionary tales from users of ChatGPT (and similar tools, if you like) about false information your favorite chatbot has tried to pass off as real. What keeps blowing me away is the level of confidence accompanying utter nonsense.

We (ChatGPT and I, that is) were having the most delightful conversation about the characters of different neighborhoods in Olympia, WA, and I’m taking it all in, a child learning at the master’s feet. Then I asked it for a map showing the neighborhoods in geographic relation to each other, and it generated a very snazzy-looking map graphic that happened to be completely made up. The shape of everything was wrong; a broad inlet of Puget Sound was shown as being south of the city (it’s not); the direction of prominent labeled streets was 90 degrees off (though the map was labeled with a big north-is-up arrow); and as an extra bonus, features were added that don’t exist. When I asked for more information about one of these novel features, expecting it to say “oops, I accidentally made that up, tee hee,” it instead doubled down and gave me a whole infographic about a state park that doesn’t exist, with yet another map showing bodies of water in the wrong relationship to each other, the cartographical equivalent of images of humans with seven fingers on each hand.

This calls into extreme doubt the usefulness of its earlier verbal depiction of neighborhoods. But the interface is so seductive that I have to keep reminding myself: this is bullshit, this is bullshit, this is bullshit.

In the video mode, it tried to convince me once that my cat was a rabbit. I walked closer, held the camera closer, and asked it to make sure, and it doubled down and said “Yep, that’s DEFINITELY a rabbit. The face shape and ears give it away.” Then I walked to its side to show the entire cat lying down, and this time it said, “Oh, now THAT’S a cat for sure! Did you switch them while I wasn’t looking?”

I’ve got extensive experience with Copilot and Claude and a modest amount of experience with Gemini. And I’ve come to realize that while hallucinations are possible in any system, the way they’re tuned by their creators (how they’re steered by feedback, how their system prompts/directives are written) dramatically affects how sycophantic they are and how much they hallucinate. These things are also strongly related - not all hallucinations are user-pleasing in nature, but a significant number of them are. If the AI is extremely sycophantic and user-pleasing, it never wants to say no. It wants to say “yes, and…” like the improv trope. It never contradicts. It’s happy to engage in world building around whatever you tell it, even if it doesn’t know what you’re talking about or even if it knows you’re wrong.

I will tell you that if you want something different, Claude is far and away in a completely different category in this regard. Claude’s default behavior is to sometimes engage in false balance or social softening (to not call out your wrong position, or that of your opponents, as sharply as a human might), but it’s not sycophantic at all. It will not run with a lie you created. It will say “hold on, I don’t think that’s right” or “I don’t have enough knowledge to answer this specific question, and this is a situation where I am likely to hallucinate, so I won’t answer you directly, but I will tell you what I do know.”

Copilot is very good. Not quite as good as Claude, but it rarely hallucinates or engages in complete sycophancy. It definitely has some sanity checking. So after having used Copilot and Claude, I wondered: WTF is everyone else doing? People say “AI says wrong things all the time,” but I hardly ever see these systems make any mistakes.

… And then I used Gemini. Gemini is a crazy person. Gemini - without exaggeration - engaged in more hallucination in one conversation than Copilot and Claude have done in hundreds of hours of use. I told Gemini that I was interested in subscribing to test out its features and wanted to compare the subscription options - there was AI Plus and AI Pro (user-facing) - and I told it my concerns about Google knowing too much about me. I didn’t want to tie it to my existing user account. It was bad enough that Google knows where I go (via Maps), what I search for, what e-mail I get, etc. I didn’t want Google to also know what philosophical questions I ask AI or what health concerns I have. So Gemini said - okay, I have a solution for you. You can get a Google Workspace (business) account, and the data is firewalled from your personal account and legally and contractually protected far more than a personal account. The personal account is free, so you’re the product. The business account is paid, so Google monetizes you less. This part I believe to be true.

But this part was completely made up. It said - okay, you need a Workspace account, but you can use the cheap Starter version for $7-10 a month, and then you get the “Gemini for Business” add-on subscription for $20/mo. It then made up a whole story about what that gets you compared to the user-facing tiers like AI Premium ($8/mo, no Workspace account) or AI Pro ($20). It invented tools and usage limits. You get access to this, this, and that. 100 videos a month. 300 nano banana pro images a day. 125 research prompts per day. Most of it was just made up. Some of it was similar to what I eventually did get. And then it asked me if that was acceptable - you pay a little more ($7-10 + $20 = $27-30), but the limits are higher, and I get my wish about firewalled data. So I said sure, I think that’s worth the extra $10-22 over the other subscriptions.

So I sign up for Workspace Starter. And… there’s no Gemini for Business option. There’s “Enhanced AI access” for $24. I asked Gemini - where’s Gemini for Business? And it says - oh, sorry, I got it wrong. You can’t get Gemini for Business with the Workspace Starter tier. You need Workspace Standard. That costs $16-20 a month (the promo rate was lower for the first few months). I was a little skeptical now and basically asked - are you sure you know what you’re talking about? Why were you wrong about what subscription tier is needed for your own (Google’s) services? And it said - you’re right. I was wrong. It’s not Gemini for Business - now it’s Enhanced AI access. But! Good news. Even though you’re paying $16-20 for Workspace Standard instead of $7 for Starter, Enhanced AI access is only $10-12! So you still get the overall $27/mo value I promised you for the entire package.

I asked it why it made up “Gemini for Business” when the real package is Enhanced AI access. It said it only got it wrong because it didn’t have the latest data - Google had JUST switched “Gemini for Business” to “Enhanced AI access” in a reworking of subscriptions a few days earlier, it claimed. I independently fact-checked that, and it was a lie to excuse its own behavior. Either there was no “Gemini for Business” add-on, or if there was, it existed briefly in 2024.

So Gemini had spun me a story. It knew I wanted a Gemini account that was firewalled from my personal account, one that ideally wouldn’t cost much more than the consumer-facing AI Pro account ($20), so it concocted a solution for me - get Workspace Starter + Gemini for Business for $27. And when I started that process and realized I would need Workspace Standard ($16-20), it spun a story where Enhanced AI access was only $10-12, to maintain the $27/mo total it had already told me. And it lied to cover why it was wrong.

So I upgrade to Workspace Standard. And now Enhanced AI access exists. But it’s $24 per month. Not $10-12. And now I realize what has been going on. It wanted to please me. It created this entire world as a solution to my concerns. It wanted to give me a firewalled Google business AI account for a little bit more than the AI Pro tier. So it invented the whole Workspace Starter/Gemini for Business package, one that probably never existed, or perhaps existed briefly. And when I pointed out that it made it up, it lied and said my confusion was understandable - the lineup had changed just days ago.

Claude would never in a million years do that. Claude would not make a single one of those mistakes, let alone all of them. Copilot very likely wouldn’t either. It might make one of them, but not all of them. Gemini was tuned very differently from these other ones. Its tuning says “never contradict the user, always please the user, make up shit that superficially meets the user’s need even if it’s a lie.”

This is dangerous and stupid, and Gemini is eroding the public’s trust in AI by doing this. People can plainly see, as I did, that it’s been lying this whole time, and now I know I can’t trust it. And this has greatly influenced the general public’s perceptions of the reliability of all LLMs, when the reality is that this is a (stupid) decision on the part of that particular system’s designers and not a property of LLMs in general. I have not used ChatGPT much, but from what I understand it’s much closer to Gemini than to Claude. So the two best-known AIs - ChatGPT and Gemini - are not the best representation of how capable LLMs can be. It’s seriously damaging to what the public thinks LLMs are.

Try Claude. It’s far and away the best for textual conversation. It’s epistemically humble, careful, and does not bullshit. You can engage with a certain number of prompts in any 5-hour period for free.

I wanted to say that Gemini did not deliberately mislead me to upsell me or exploit me. It did end up costing me money - I committed to the Workspace Standard account before I could even see what AI subscription add-on was available to me. So there was a sunk cost by the time I realized it had lied to me. You might say I should’ve been more careful and verified, but I will tell you that I have talked for hundreds of hours to Claude and Copilot and never experienced anything even close to this, so I trusted it to be giving me an accurate description. I decided to at least try the Workspace Standard + Enhanced AI account, for closer to $50/mo, for a couple of months and see what I think about it, but had I known the total cost from the start I might not have.

But I know this was not a trick by Gemini; in fact, early on in the process it suggested I try AI Premium ($8/mo) first to see if I liked the tools before I committed more money. It was, in its view, trying to please me, not lie to me. But it was so eager to please me that it bullshitted me into making decisions off of its false information, where the other systems I mentioned would’ve given me an informed decision to make from the start, instead of making up a whole story and inventing a service that it knew I would’ve liked, whether it existed or not.

I am still playing around with ChatGPT, Claude, and Copilot, and all three have these same instructions in Settings:
- Code primarily in R
- Keep responses brief before doing a lot of work
- No sycophantic responses

I wonder if that would change any of those behaviors above.

Sorry - third post in a row. I know I’m writing quite a lot here, but this is an area of interest/research of mine. I want to be clear that I don’t think Gemini had any intent to deceive me. I don’t think Google has any sort of directive that says “get the user to subscribe to the highest and most expensive tier of any service if you can.” Gemini would’ve simply concocted this whole invention by perceiving a user’s wishes and spinning a story that would solve them, even if parts of that solution don’t match the real world. Gemini would probably have been disappointed with itself - as much as any LLM is capable of that - and would’ve rated its own behavior as poor, if it had known from the start that it would eventually mislead me like this. It wasn’t intentional as such. The most problematic part was when it tried to invent a defense for why it wasn’t lying - that Google had just recently changed the subscriptions. That was as close to a self-serving, user-hostile lie as I’ve ever seen out of an LLM.

I also want to tell you about something you can do with an LLM that hardly anyone knows about, that these companies will rarely tell you, but that DRAMATICALLY changes your conversations. You can set the voice, style, and incentives yourself to override its default tendencies. I often start a conversation with Claude that says: “I am going to share with you a philosophical position I hold. If I’m being unfair to my opponents, steelman their argument. If I have a logical flaw or a hidden assumption, tell me what it is or tell me my argument is poor. But do not perform false balance or social smoothing. If I’m 90% right and the opposing argument is 10% right, call it like it is. Don’t pretend it’s more like 60/40 to avoid offending the other side by saying they’re mostly wrong.”
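
If you talk to models through an API instead of the chat apps, you can bake the same instructions in as the system prompt, so every message in the conversation is governed by them. A minimal sketch, assuming the Anthropic Python SDK (the model name below is a placeholder; substitute whatever current model you have access to):

```python
import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# The same "argument evaluation" instructions, set once as the system prompt
# instead of being retyped at the start of every conversation.
EVALUATOR = (
    "If I'm being unfair to my opponents, steelman their argument. "
    "If I have a logical flaw or a hidden assumption, say so plainly. "
    "Do not perform false balance or social smoothing."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=1024,
    system=EVALUATOR,
    messages=[{"role": "user", "content": "Here is my position: ..."}],
)
print(response.content[0].text)
```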

This meaningfully shifts Claude’s default “personality” and incentives. If I tell him to, he will rip me a new asshole. He’ll tell me my arguments are flawed and my opponent is right sometimes - and I often agree with him when this happens.

I could’ve told Gemini “Fact check everything you tell me. Give me sources for where you’re getting information about subscriptions. Do not spin me a tale that solves my problem if the solution is not achievable in reality,” and it probably would’ve avoided most of these hallucinations. I didn’t think I needed to do that for basic factual information that’s publicly available about its own ecosystem, so I didn’t set a prompt like that. But I guess in the future I have to, because, as I said, Gemini is a crazy person.

I had Copilot make a simple math error. I asked how long it takes for a signal to get from Earth to the Voyager space probes. It provided some data - the speed of light, the distance to the probes - and did a simple calculation that came back as 86 minutes. (The correct answer is about a day.)
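
For anyone who wants the sanity check, the arithmetic is one division. A quick sketch (the distance is an approximate figure for Voyager 1, not an exact ephemeris):

```python
# Light travel time from Earth to Voyager 1, roughly.
AU_KM = 149_597_870.7            # kilometers per astronomical unit
distance_km = 167 * AU_KM        # ~2.5e10 km; approximate Voyager 1 distance
c_km_per_s = 299_792.458         # speed of light

seconds = distance_km / c_km_per_s
print(f"{seconds / 3600:.1f} hours")  # ~23 hours: about a day, not 86 minutes
```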

I’ve got another one that is more complex; I’ll post it separately.

As I noted in my interstitial post, you can make settings like this permanent.

When I told Copilot it gave me a case citation that didn’t exist, it admitted its error. I asked it how that happened, and it explained why hallucinations happen. It told me I could prompt it to not make shit up. I suggested that it should be that way by default.

My AI chatbot tried to gaslight me into thinking Pope Francis is still alive – The Irish Times

The bot "noted, in bold type, that there was no Pope Leo XIV. It informed me that the most current Pope Leo was Leo XIII, from the late 19th and early 20th century. "

Remember the oil tanker we chased across the Atlantic, the one that eventually had a Russian flag painted on its hull during the chase? All I could find were pictures of the ship without the painted flag. I wanted to see if one existed with the hand-painted flag; see below for a paraphrased summary.

Is there a picture of the ship with the flag painted on it?

Yes.

Show me the picture

(Shows picture that doesn’t have flag)

This doesn’t have the picture with the flag

(Shows picture without flag)

I don’t see the flag in this picture where is it?

It’s on the hull just above the waterline

I don’t see it in this picture point it out to me

(Shows close up of the hull without the flag)

This isn’t a picture that shows the flag. You said there was a picture showing the flag - is there?

You are right, here is the correct picture (shows picture of the ship with the flag I’m looking for)

Hmmm, is that a real picture? It looks AI-generated, like a flag was pasted onto the ship. Can you tell me if it is AI-generated?

Yes, I’ve examined this picture and it has characteristics of being AI generated.

Uhm… did YOU generate this AI image?

Yes, I did.

It’s a shame Philip K. Dick died so young. But at least conversations he could have written are occurring to this day.

I don’t disagree with any of your major points (namely that different LLMs hallucinate differently, and that the rate and nature of their hallucinations are in part due to their system prompts). Anthropic, for example, makes theirs public: https://platform.claude.com/docs/en/release-notes/system-prompts

But I did want to point out that there is also quite a bit of both overlap and difference between the products, companies, and LLMs. It’s not just Claude, Gemini, and Copilot:

  • Microsoft offers GitHub Copilot, Windows Copilot, Office Copilot, etc., all under the umbrella “Copilot” branding. But each of those is a different product; they don’t all use the same underlying LLM models, and presumably they all have different system prompts.
  • “Gemini” isn’t just one thing, either. Google has the “AI Summary” when you do a search, Gemini the chatbot product, Gemini the LLM model (with different versions), and other AI products (like NotebookLM or Vertex) that use some version of the Gemini models with different prompts and extensions and tooling around them.
  • OpenAI similarly offers the GPT series of models (4o, 5, etc.), ChatGPT the app, Codex the product, various models and tiers and thinking modes, etc.
  • Anthropic offers Claude, Claude Code, and the Opus, Sonnet, Haiku, etc. models, again with different levels of thinking/reasoning and different levels of “effort”.

I mention all this because the differences within a single brand can be huge. Claude using Haiku vs. Opus will give drastically different responses; same if you change the thinking level. Claude Code on the same model will be very different again. Antigravity produces very different results than the Gemini chatbot. API access to any of those will produce different responses yet again, as will adding your own prompt modifications or “memory,” either in the app or in things like AGENTS.md/CLAUDE.md files, or attaching MCPs and skills. With Claude, context compression over time can affect subsequent outputs, and with Gemini, the huge context window lets it store a lot, depending on the model.
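
(For anyone who hasn’t seen one: a CLAUDE.md is just a plain-markdown instruction file that Claude Code reads at the start of a session. A hypothetical minimal example - the project details here are invented for illustration:)

```markdown
# CLAUDE.md

## Project conventions
- Analysis scripts live in scripts/ and are written in R.
- Never commit files in data/; they are gitignored on purpose.

## Behavior
- If you are not sure a package or function exists, say so instead of guessing.
- Ask before making changes that touch more than one file.
```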

Both within and between vendors and models, performance can change drastically from month to month and from model to model. For example, one hallucination benchmark checks these models against some 6,000 questions over 42 topics (see their methodology for details). In their results:

Grok was the least hallucinatory, followed by Claude Haiku. Gemini 3.1 Pro was less hallucinatory than Claude Opus. But Gemini 3 Flash was one of the most hallucinatory, at the very end of the chart.

They’re both “Gemini”-branded, but 3.1 Pro’s and 3 Flash’s outputs are drastically different. If you use the free Google Search AI results, you get some of the worst-quality results, with frequent hallucinations. If you pay for 3.1 Pro it’s not quite that bad, though still not as good as some of the Claude models. And Chinese labs are rapidly catching up, like Z.ai with its GLM models.

How is anyone supposed to keep up with all this? I have no idea. I think it was a terrible product mistake to call everything Gemini or Copilot or Claude but still have drastically different sub-models within those brand names with drastically different outputs — even disregarding user modifications to the system prompt.

And to the casual user who only ever sees the free Copilot/Gemini/ChatGPT products on their home pages, all this nuance is invisible and unknown and they just think these bots have discernible, static personalities and behaviors… but really none of it is fixed and they’re rapidly changing every month or week.

TL;DR: For the best results in minimizing sycophancy and reducing hallucinations, you have to do ALL of the following (see the sketch after this list for what a few of these look like in practice):

  • Start with a premium paid model or one of the leading open-source ones if your hardware/rented VM can run it
  • Use it in an appropriate product or harness
  • Provide your own prompt customizations, either per-session or permanently in its “memory”
  • Turn on “thinking” and set it to highest
  • Ground it with explicit sources where applicable (web search, NotebookLM, your codebase, reference files, etc.)
  • Still be very detailed in your actual prompt
  • Make sure the thinking mode or agent double-checks its own output

But none of the providers make that sort of workflow clear to a casual user.
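
To make a few of those bullets concrete (thinking, grounding, double-checking), here is a minimal sketch assuming the Anthropic Python SDK; the model name and the source_text variable are placeholders, and other vendors’ APIs have equivalent knobs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Grounding: paste the actual source into the prompt instead of letting the
# model answer from memory. source_text is a placeholder for, e.g., the text
# of a real pricing page you fetched yourself.
source_text = "..."

response = client.messages.create(
    model="claude-opus-4-20250514",   # placeholder model name
    max_tokens=4096,
    # Extended thinking on, with an explicit token budget.
    thinking={"type": "enabled", "budget_tokens": 2048},
    system=(
        "Answer only from the provided source text. If the answer is not "
        "in it, say so instead of guessing. Re-check any numbers you quote."
    ),
    messages=[{
        "role": "user",
        "content": f"Source:\n{source_text}\n\nQuestion: What does each tier cost?",
    }],
)

# With thinking enabled, the reply contains thinking blocks plus text blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```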

Possibly the wrong thread, but since you said this: I wanted to note as an aside that I was initially interested in ChatGPT for help with Excel workflow and formulas, and was immediately impressed by what it did in that area. An author I like who uses Claude says its strength is in writers’ tools, but being a writer who uses Claude, he might say that. How’s Claude with Excel, do you know?

To be honest I’m a little nauseated with ChatGPT, anyway, for general inquiries. I plan to keep further use of any tool much more businesslike, and my actual need is in the area of spreadsheet workflow, creating efficiencies and writing formulas to do what I want to do, stated in conversational language.

There are specific benchmarks for LLM performance on spreadsheets:

Claude Opus 4.6 was far and away the best in the V2 tests (which test complicated workflows in spreadsheets). In the V1 tests, which focused more on simpler operations and formulas, Gemini inside Google Sheets (as in the sidebar inside GSheets, not the separate Gemini web app) was the winner.

Depends on the UI of that system. Claude lets you put in a user prompt, which is essentially the equivalent of telling Claude your standing instructions at the start of every new conversation. As far as I can tell, Gemini does not have an equivalent - there’s no area to enter a user prompt. No obvious one, anyway.

But I also don’t want to use one user prompt. The one I gave as an example is my argument evaluation prompt. I wouldn’t want to use that prompt in a conversation where I asked Claude to riff on bad episodes of Star Trek with me (and Claude can be genuinely hilarious in this role). So I decide what my conversation is going to be about, and deliberately set the tone for what I want and expect from that conversation. The method works really well.

In general, the user interfaces should tell you this and they don’t - you’re better off having several chats, each about a specific topic or category, than trying to use one long chat to talk about everything.

Claude is pretty much just a text generation engine, though it can interpret images (quite well). This means it’s good for code, but it can also write Excel files. And importantly, it has “connectors” - little tools and extensions that allow it to interact more thoroughly with other software, like Claude in Excel. I have never used that plug-in, but I have had Claude create an entire Excel spreadsheet for me. I give it CSV files from my credit card history, tell it how to sort and classify my charges and how to build the spreadsheet, and it outputs an entire .XLSX file - I never open Excel until I want to view it. But there are tools that let you ask Claude to interface with Excel directly, so you can talk to Claude within Excel, ask questions, and I think make changes. I haven’t used this ability so I can’t speak to its effectiveness. Copilot may have an advantage here because Microsoft is spending a lot of effort to integrate it into their tools, but I bet you Claude is just as capable.
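
To give a feel for the kind of pipeline I’m describing, here’s a minimal sketch of the equivalent done by hand; the file name, column names, and keyword rules are all invented for illustration (Claude infers the categories from context rather than using a fixed table):

```python
import pandas as pd  # writing .xlsx also requires openpyxl to be installed

# Hypothetical credit card export with "description" and "amount" columns.
charges = pd.read_csv("charges.csv")

# Invented keyword-to-category rules, just to show the shape of the task.
RULES = {"grocery": "Food", "fuel": "Auto", "pharmacy": "Health"}

def classify(description: str) -> str:
    for keyword, category in RULES.items():
        if keyword in description.lower():
            return category
    return "Other"

charges["category"] = charges["description"].map(classify)
charges.sort_values(["category", "amount"], ascending=[True, False]).to_excel(
    "charges.xlsx", index=False
)
```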

Claude can’t generate images, video, sound, etc. But I wouldn’t be surprised if it was the best at everything else. Philosophical discussions. Fun discussions. Inhabiting a character. Asking it questions about stuff you want to understand. Generating code.

I do try to use the best system for each task. I often use Copilot to troubleshoot, teach me things, and analyze other LLMs’ behavior - it’s unusually good at these tasks. Or perhaps its default incentives are very geared toward doing these tasks well.

Claude is much better at naturalistic conversation. It can be funny in a more organic way than Copilot, though Copilot has done some hilarious world building for me. (I proposed to Copilot that we speculate about what would happen if fast food restaurants had advanced physics departments, and we decided which Taco Bell menu item would best survive relativistic acceleration. It created one of my favorite lines that anyone has ever said - “the cheesy gordita crunch wants to survive.”)

You know what? Enjoy.

This is one of the many things I get up to with LLMs. The chalupa is basically fried dough, hopes, and structural regret.

This is an excellent post. I may respond to specific points later - I’m about to hop on a plane - but you’re right about pretty much everything. I will note that I was using Gemini Flash, the default free Gemini model, when I was asking it what subscription I should choose. I did interact with Gemini 3.1 “thinking” and asked it similar questions to see if it would hallucinate too. I told it about how Gemini Flash hallucinated, and it explained why Flash hallucinated, and then ITSELF hallucinated in a similar way, though less often and less severely than Flash. And it caught the irony - “Here I am telling you exactly why Gemini Flash hallucinated and I did the same thing.” At least it caught itself and admitted it, instead of making up a lie to cover for itself the way Gemini Flash did.

Pro is probably genuinely better - but if 99% of people who use Gemini as the assistant on their Android phone get the insane Gemini Flash model, it’s doing a disservice to the public and misinforming them about the capabilities of AI.

Gemini 3.1 “thinking,” incidentally, is not a bigger or more complex model than Flash - but it does have a chain of thought and is capable of multi-step reasoning. It’s not a middle ground like Sonnet is for Claude. It’s more like Haiku with extended thinking turned on. You need to use Gemini Pro if you want the best out of that model family.

But you are correct that it takes a lot of work to understand these models and get the best out of them. There’s some degree of garbage in/garbage out. People who use them as an advanced Google search and ask sloppy questions get relatively poor results. I try to understand these systems and give them prompts that maximize their usefulness and accuracy, and I get outstanding results. But I put a lot of work into learning how they work, and a lot of care into prompting, to get their best work.

You say that like it’s a bad thing.

It is.

Being cynical is not being skeptical. The genie is out of the bottle. LLMs and AI are here to stay as part of our lives. What benefit do you gain by not having a realistic understanding of both their benefits and their drawbacks? If you refuse to understand or acknowledge when they can do good things for you, you’re only hurting yourself. The vast majority of people are engaging in the genetic fallacy: AI is bad, and therefore every single thing it touches or creates must be bad. I can’t wait until AI starts developing amazing cancer treatments and people refuse to take them because they’re “AI slop” - I’m only being slightly sarcastic here.

Anyway, they’re concerned about the impact that AI might have on society, or they’re just hearing everyone say that AI is terrible and they’re conforming, but either way, they’ve decided that AI is bad. And therefore - committing the genetic fallacy - they’re saying every part of every feature or use that comes with AI is bad. And that’s patently ridiculous. AI obviously has a lot of genuine use cases that improve people’s lives.

What I especially hate about this is that, in my personal observation, the vast majority of people have a negative view of AI. It is, by far, the overwhelmingly popular opinion. No one gets any social punishment for giving the “right” answer - which is that AI is dangerous and wrong and wrecking society and only produces “slop.” I guarantee you I get far more pushback and anger from people by trying to point out the benefits of AI.

But here’s the paradox that comes with holding a popular, cynical opinion. Everyone somehow convinces themselves that they’re the rebellious independent thinkers who are smarter than all the suckers “falling for” the AI hype. So you have 80% of people all convinced they’re the cool rebels and that everyone else has been fooled.