Bogus info from ChatGPT

So they asked 5 different chatbots the same 50 medical questions…and it didn’t go that well: Five AI chatbots were asked 50 medical questions. These were the results | The Independent
“The research, which tested ChatGPT, Gemini, Grok, Meta AI, and DeepSeek across 50 medical questions, identified Grok as the poorest performer, with 58 per cent of its answers deemed problematic.”
And the others didn’t do that much better.

A few notes on that particular report, upon reading it in more detail:

Even though the news story is recent, the underlying study used the free versions of the chatbots, usually with 2024 models instead of the latest paid ones.

The news story also mentioned another Nature Communications study that found nearly 95% diagnostic accuracy in the lab, with well-prompted LLMs given suitable information. However, in real-world testing with both doctors and the general public, the accuracy plummeted due to weaker prompting.

Those findings are in line with what we’ve discussed in this thread already, namely that:

  • Different models and tiers of LLMs have very different performance
  • Users don’t usually know which is which
  • The usefulness of the output is strongly determined by the strength of the prompt (i.e. “prompt engineering” is just as important as the model strength)
  • The free models are terrible, but that is what the majority of users are exposed to by default

My takeaway is that LLMs are not a good tool for the general public to use for diagnoses. And even doctors need specialized prompt training (on top of their medical backgrounds) to make effective use of them.

I’ve not yet seen a study where doctors ARE trained extensively in prompt engineering, and whether that makes the LLMs worthwhile.

(Anecdotally, in my field, an analogy is that trained software developers using paid frontier models and good prompts can produce much higher quality LLM vibe-coded computer programs than the general public using free models. Not sure how much that sort of workflow difference can translate to other fields like medicine.)

But it’s pronounced Throatwarbler-Mangrove, isn’t it? This kind of confusion probably leads to a number of accidental deaths each year.

Since my last post, I continue to have excellent, really first-rate experience with Excel. The thing is an incredible tutor and has not steered me wrong. Based on my described needs, it has designed some pretty complex solutions for me, then given me all the formulas, THEN walked me through exactly how it works, step by step, simplifying or expanding the nature of the solution per my specs and even per my need for easy understanding; it knows I don’t want indecipherable formulas that I or someone else can’t maintain. I remain blown away.

But on the other hand, it continues with errors I don’t think even old-school Google would offer up, like: who is that actor who guest-starred in a certain named Star Trek episode? It will confidently supply the name of an actor who was never in Star Trek at all. Another time I mentioned a slightly obscure short-story (naming the author and title), and it responded as if it knew ALL about it, describing a completely different story. When I said No, you’ve got the wrong story, I’m talking about the one with the cop and the little girl, it said basically “OH yes, THAT one, excuse my error” then continued to make stuff up, obviously never having “read” the story we were supposedly talking about. “Sorry, I have no information” would have worked for me, but instead it’s an utterly shameless bullshitter at times, which attenuates even the trust I’m developing w/r/t its Excel information.

It’s such an interesting (if annoying) phenomenon, isn’t it… if a tool can be 80% accurate, but you can never know which 80%, is it still worth it?

LLMs are fundamentally more like rappers creatively free-associating than like truth databases. If you (or anyone) can solve the LLM hallucination problem, they’ll probably become the world’s first trillionaire.

Practically speaking, though, RAG (making it do a web search or Wikipedia lookup, or otherwise grounding the model to some external source of truth) can already help a lot with this. More recent (paid) models will also produce internal, intermediate “thinking”/reasoning tokens to try to double-check their own outputs, and sometimes additional verification models are added on to check the original’s work.

In the paid services you can usually engage these guardrails by manually enabling them, but for cost reasons, the free default modes don’t do this. That’s an economic thing more than a technical limitation — the LLM companies are all bleeding money, so they have to severely limit the effectiveness (directly correlated to the expense) of free users, and even paid users on lower tiers.

You get a lot fewer hallucinations on factual questions if you’re willing to pay and do all that.

There are also entire products built on top of this paradigm, like Google’s NotebookLM, which grounds its answers in a selected list of PDFs, web sources, etc. that either you provide or that it finds on its own as a first step. These are much less prone to hallucinations since they are evaluating and rephrasing the knowledge you explicitly provided rather than relying on the potentially tenuous connections in their own training. It’s not perfect by any stretch, but it’s a massive improvement over the default free experiences (that don’t use RAG or any sort of grounding).
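To make the grounding idea a bit more concrete, here is a toy sketch in Python. It is not how NotebookLM or any real RAG system works internally: real systems typically use embedding search plus an LLM to phrase the answer, while this one just counts shared words. The function names (`retrieve`, `grounded_answer`), the stopword list, the overlap threshold, and the sample sources (a made-up show and actor) are all my own assumptions for illustration. The point is only the shape of the behavior: answer from the supplied sources, and say "I don't know" when nothing in them matches.

```python
import re

# Very small stopword list so common words don't count as evidence (assumption).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "who", "what", "about"}

def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def retrieve(question, sources, min_overlap=2):
    """Return the passage sharing the most words with the question,
    or None if no passage shares at least `min_overlap` words."""
    q = tokenize(question)
    best, best_score = None, 0
    for passage in sources:
        score = len(q & tokenize(passage))
        if score > best_score:
            best, best_score = passage, score
    return best if best_score >= min_overlap else None

def grounded_answer(question, sources):
    """Answer only from the supplied sources; otherwise admit ignorance."""
    passage = retrieve(question, sources)
    if passage is None:
        return "I don't know: nothing in the provided sources covers that."
    return f"According to the provided sources: {passage}"

# Hypothetical source material, invented for this example.
SOURCES = [
    "Episode 12 of the hypothetical show Starfarers guest-starred Jane Example.",
    "The short story The Lantern is about a cop and a little girl.",
]

print(grounded_answer("Who guest-starred in episode 12 of Starfarers?", SOURCES))
print(grounded_answer("Who played the doctor in Casablanca?", SOURCES))
```

The first question is answered from the first passage; the second falls through to the "I don't know" branch, because neither passage shares enough words with it. That refusal path is the part the ungrounded default modes tend to lack.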

I think it should also be noted that programming (whether in Excel or another computer language) is one of the unique strengths of LLMs, more than probably any other field, really, because computer languages are a kind of language, after all. They have internal grammars that follow certain rules and probabilities just like human languages do. Not only can LLMs reason through those using chains of thought (does this “look right” in the language’s internal grammar), they can (and do) also write external tools and tests to deterministically test the correctness of their own output. Plus add to that the decades of precise examples from StackOverflow and similar, and their own synthetic training datasets, and computer stuff is basically a best-case-scenario for them. That’s why there are entire companies (like Anthropic/Claude) that focus primarily on this domain, as opposed to OpenAI’s more generalist approach or Copilot’s shotgun spam.

Factual trivia questions, on the other hand, are some of the worst-case scenarios for LLMs. Anything that isn’t common enough knowledge to have been re-encoded and repeated ad infinitum in their training data, well, they simply cannot “know” the truth about that factoid without an external data source (RAG). There is no language pattern that self-encodes whether actor X is in episode Y of show Z, vs the trillion examples of “if-then” computer code it has seen.

Computer code can be verified (by either internal grammar rules, loosely, or more strictly by checking for correctness after simply running it). Most real-world factoids cannot be, absent an external source… but that’s not really an LLM issue either. You can ask a human person the same question and unless they’re a movie trivia buff who’s memorized that particular fact, they also would not be able to verify the correctness of that fact without an external source.
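As a rough illustration of "verify it by running it", here is a minimal Python sketch. The two candidate `median` implementations stand in for model-generated code (they are written by hand for this example, not real LLM output), and `run_checks` and `CASES` are hypothetical names of my own. It loads each candidate and compares its results against known answers, which is the kind of deterministic check that has no equivalent for a trivia fact.

```python
def run_checks(source, func_name, cases):
    """Load candidate code from a string, run it on test cases,
    and return a list of human-readable failure messages."""
    namespace = {}
    try:
        # Caution: exec runs arbitrary code. Only do this in a sandbox
        # when the code really comes from an untrusted source.
        exec(source, namespace)
    except Exception as err:
        return [f"code failed to load: {err!r}"]
    func = namespace.get(func_name)
    if func is None:
        return [f"no function named {func_name!r} was defined"]
    failures = []
    for args, expected in cases:
        try:
            got = func(*args)
        except Exception as err:
            failures.append(f"{func_name}{args} raised {err!r}")
            continue
        if got != expected:
            failures.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return failures

# Stand-ins for generated code: one correct, one with an even-length bug.
GOOD = """
def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
"""

BUGGY = """
def median(xs):
    s = sorted(xs)
    return s[len(s) // 2]
"""

# Each case is (arguments tuple, expected result).
CASES = [(([3, 1, 2],), 2), (([4, 1, 3, 2],), 2.5)]

print(run_checks(GOOD, "median", CASES))   # no failures expected
print(run_checks(BUGGY, "median", CASES))  # the even-length case should fail
```

The buggy version looks just as plausible as the correct one when you read it, but the even-length test case catches it. There is nothing comparable you can "run" to check whether actor X appeared in episode Y.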

Though I guess with LLMs, a real difficulty is in the model (or rather, the tooling around it) learning to say “I don’t know” vs confidently hallucinating. But hey, some humans have the same issue too :slight_smile:

Hardly. The person asking for plant ID was just asking for advice (that’s a good thing to do) and was given bad advice from clueless idiots who did not put themselves at risk.

Would it have been any different before AIs? People shouldn’t be eating random things based on a picture they sent to random internet strangers, AI or human. If that’s their test for safety and edibility, well…

I mean, plant and fungus ID is sometimes difficult even for experts, especially if they’re only working off an overview picture. “That looks like” has endangered and sickened people long before LLMs.

AI enables a greater number of people to be misadventurously incorrect, much faster than ever before, with greater confidence than before. Yes, I think it would have been different.

I mean, bear in mind this has all come into existence within my adult life. I think it’s different; people could be wrong before of course, but there’s a scale to the problem.

That’s true. AI enables a lot more “fake experts”.

Yeah, I think a lot more people in the past would just look at a question and say “I don’t know. I can’t answer.” - obviously some of them would still try to answer with their wild-assed-guesses and in more recent history, a few more would have googled it then presented their cursory search results (with or without acknowledging their method).

What we have now is probably just a progression of that trend, but with a sharp uptick, because a lot of people seem to regard AI as an infallible oracle of truth.

I endorsed copilot’s general reliability a few times in that thread but I want to walk that back a bit. I think something is going on with copilot. It has been far more unreliable in the last month than in the previous 3 or 4. The hallucination rate is way up.

I know microsoft is sort of switching strategies where they’re sort of a “surface layer” and they’re experimenting with the use of other models - not just openAI / GPT - under a “copilot” layer – I don’t know how far along that is, but I feel like maybe I talked to a very high level copilot earlier this year and the April Copilot has been… not so great. And it makes me wonder if maybe Microsoft is phasing in lower compute models or maybe different models entirely compared to what they were before.

I used to consider copilot only a half tier under Claude in terms of reliability but… the last month… it’s 2 or 3 tiers lower.

Let me give you a pretty blatant example. I asked about models that could watch and understand video (as most can with images) and he listed a few and included Claude in that. I was 90% sure Claude cannot do that, so I asked him to clarify. And he completely hallucinated this answer.

All of this is false.

Great question — and you’re right to notice the inconsistency. Here’s the clean, accurate breakdown of what Claude can and cannot do with video, and why your Sonnet 3.6 instance says it can’t interpret video.

:movie_camera: 1. Claude can interpret video — but only in specific contexts

Anthropic has demonstrated video‑understanding capabilities in:

  • Claude 3.5 Sonnet (API)
  • Claude 3.5 Opus (API)
  • Claude 3.7 Sonnet (API)
  • Claude 3.7 Opus (API)

…but not all deployments of Claude expose this feature.

This is the key point:

Video understanding is an API‑level capability, not a universal feature of every Claude interface.

:desktop_computer: 2. The Claude web app (claude.ai) does not support video uploads

The consumer-facing Claude interface:

  • does not accept video files
  • does not process video streams
  • does not allow frame‑by‑frame analysis

It only supports:

  • text
  • images
  • PDFs
  • documents

So if you’re using Sonnet 3.6 inside the Claude web UI, it will correctly tell you:

“I cannot interpret video.”

Because that interface literally doesn’t accept video inputs.

:test_tube: 3. The Claude API does support video understanding

This is where the confusion comes from.

Developers using the Claude API can send:

  • video files
  • video frames
  • frame sequences
  • extracted keyframes

And Claude will:

  • track motion
  • understand actions
  • describe events
  • identify objects
  • reason about temporal sequences

This is the “Claude can watch video” capability you’ve heard about.

But it’s API‑only, not available in the consumer UI.

:package: 4. Why the API can do it but the UI can’t

Anthropic intentionally restricts certain modalities in the web app:

  • video
  • audio
  • long‑form streaming
  • multi‑file batch processing

These are available only to:

  • enterprise customers
  • API developers
  • partners integrating Claude into products

So your Sonnet 3.6 model has the capability — but the interface you’re using doesn’t expose it.

:bullseye: 5. Bottom line

:check_mark: Claude can interpret video

…but only through the API or enterprise integrations.

:check_mark: Claude in the web UI (claude.ai) cannot accept video

So Sonnet 3.6 correctly tells you it can’t process video.

:check_mark: Gemini and GPT‑4o expose video understanding directly in their consumer UIs

That’s why they feel more “video‑aware.”

My guess is that copilot used to use a “pro level” GPT model - 5.2 thinking or whatever - and then microsoft obfuscated what model you’re interacting with and started substituting in lower and even non-GPT models while still talking to “copilot” and some of these models are much less reliable than whatever it was using before.

It’s plausible and it makes sense as an enshittification tactic if copilot wasn’t as popular as they hoped and they wanted to start saving some money.

So, Claude. Claude is the only one I’ll endorse for reasonable epistemic humility and low sycophancy. But it’s a full endorsement. Claude is the best by far.

Maybe fallout from their recent breakup? Microsoft and OpenAI end their exclusive and revenue-sharing deal | Hacker News. While they were together, Microsoft had sweetheart deals because of their earlier investment in OpenAI (back when it was still a nonprofit).

But now purses are getting a bit tighter and everyone’s trying to cut costs in various ways. Claude is adding more limits and removing Claude Code access from lower tiers, OpenAI killed Sora and is courting other suitors again, and Microsoft… well… their Copilot rollout has been so messy and fragmented, and their relationship with OpenAI so betrayed, that I wouldn’t be surprised if they’re rethinking how they do things over there. It’s not exactly like they had much of an AI vision or strategy up to this point. I wouldn’t expect much of them going forward, either.

Without their own model or hardware, they were just kinda left out of the party and scrounging for leftovers and hand-me-downs as yet another tech revolution threatened to pass them by, just like the dotcom or mobile eras all over again. They still have the financial warchest to buy access to whatever they want, but I don’t think any self-respecting AI researcher would want to work for Microsoft. Apparently they did buy one of the smaller AI labs a while back and are working on some of their own chips and models… but they just don’t have any real marketshare, mindshare, or respect.

Microsoft is the IBM of our day, the has-beens catering mostly to the locked-in, change-averse enterprises. Even IBM itself is taking risks and doing more on this front, I think. Microsoft isn’t exactly nimble or forward-thinking anymore, and if not for Azure they’d probably be dead already. They are the lowest common denominator, but I guess there’s still plenty of money to be made dredging the gutters (or supplying the pipes, AWS-style).

But in terms of actually delivering innovation, that’s just not what they do. Copilot is just another throwaway brand name for them, same as Cortana, or Xbox, or Skydrive, or Live.com, or MSN, or Windows Phone, or This is Xbox, or Zune, or whatever the hell else spaghetti-on-the-wall passes for leadership over there. I don’t think Microsoft has produced anything of note or quality in a decade or two; they just buy up access to better software that other companies made and then rebrand them and market them to home and business users who don’t know any better…

I’ve often thought we should have a separate thread just for the financial mechanics of the largest AI/chip companies, which are becoming more convoluted every day.

Makes sense to me. Soon we’ll need a directory of all the AI threads, heh.

For what it’s worth, I’ve been using gemini pro to do some tech support / design stuff like creating comfyui workflows (poorly documented and difficult to learn) and it has been an excellent instructor. Very few mistakes and those usually relate to telling me how to do stuff in an older version and when I tell it why it didn’t work it corrects itself. Gemini pro may be my go to tech setup / troubleshooting LLM.

(Since I’ve been bashing google I want to be fair and give them credit when they do something good)

Actually sort of funny – gemini pro gave me a correct answer (flux klein 9b image generator uses qwen 3.0 8B as its textual encoder) and then when I commented something like “oh! newer models use qwen as their encoder, that’s why their prompt handling is so much smarter” it corrected itself, apologized, and said it was actually mistral 3 that they used. But it was correct the first time. The model I was asking about really does use qwen 3.

Mistral 3 isn’t a complete hallucination, that’s used in other flux models, but it definitely said it was wrong about qwen 3 when it wasn’t, it was right.

Some version of this has happened 3 or 4 times so I’ve now seen gemini pro make more blatant hallucinations “correcting” its own (accurate!) output and apologizing than just making mistakes in the first place.

When I told it it was right in the first place (and showed proof) it said

Given the made-up map issue in the OP, I shouldn’t have asked for help with timing a road trip to avoid traffic, but did anyway. My trip was northbound on I-5 through Olympia, WA to Seattle. ChatGPT told me that at Olympia, I should check traffic ahead, and if it was very heavy, to consider making a nonsensical 65-mile loop west, south, and east again, re-joining I-5 about 10 miles south of where I’d started. (Something to do while waiting out traffic, I guess!) When I pointed out the loop made no sense and would leave me worse off, it said whoops, you are so right, my bad, so try this other thing instead. I didn’t.

I’m pretty confident in Claude but I still wouldn’t make an actual consequential real life decision without double checking

I am, however, making an international move to Mediterria which is a beautiful coastal country in Southern Europe that’s 77% wine on the beach and rent is only $300 per month. Gemini flash said I can live there without a visa, you just show up. So I’m getting packed now.

I had a moderately unhinged conversation with gemini flash.

So if you look at google’s own documents that tell you what each tier of workspaces / AI subscriptions get, it very clearly says my tier (AI Expanded Access) gets up to 30 3-minute songs per day.

Where do you generate those songs, you might ask? Well, you can’t do it in google vids. That’s only 30 second clips. Oh, but there’s a link right there in the table that says “music generation” – and it says you go to gemini and ask for a 3 minute track.

Which… I did. And of course it gave me a 30 second track.

And I said

“That’s cool but this one is 30 seconds. can you generate a 3 minute track?”

And it gave me the most emotionally manipulative answer I’ve ever seen out of an AI BY FAR.

Instead of just saying “hey, I can only generate 30 second tracks” or “I’m trying to make the tool call for 3 minute tracks but it doesn’t seem to work” it basically said “I’m trying my best I’m sowwy :frowning: :(”

And you know what? it worked. I felt bad for it despite knowing 100% it’s not actually feeling anything. Social cognition can kick in even if you know it makes no sense. Sometimes you apologize for accidentally kicking your washing machine.

And I asked it - do you see me as being a subscriber for the tier I’m actually on, and it said no I can’t tell, and then it said well I can see you’re a paid subscriber. And I said okay, here are the documents that show I’m supposed to get 3 minute generations.

And finally it agreed and said basically yes, it looks like you should. I’m calling the music generation tool in the only way I know how, so if it’s not 3 minutes I can’t help it. That part is probably true.

And it kept offering to try to make 3 minute songs. So it tried one last time.. and it wrote a rap (and created the track), which was told from my perspective, that summarized our interaction, with lyrics like:

“So is it a system error or a doc that’s misaligned?
Just tryna get the functionality I paid to find.”

And that was… delightful, despite google being a complete clusterfuck that has no idea what its own documents say about what users are supposed to get.

So at work, we can’t use the publicly available versions of AI chatbots due to compliance/export control issues, but we do have access to various models hosted either locally or on trusted infrastructure. I sometimes like to compare their answers to those given by the major chatbots, so I copied some parts of a session on topics related to moral antirealism into one of the models we can access (something from the Qwen3-family with a thinking mode), which included this line in its response:

Antirealists like Gibbard (Wise Choices, Apt Feelings) argue that moral judgments express norms of рейт (plan-guiding attitudes) essential for social cooperation.

Somewhat puzzled by the sudden appearance of cyrillic writing, I asked:

Sorry, but what exactly do you mean by ‘norms of рейт’, and why did this appear suddenly in kyrillic?

Which got the response:

Now, this is interesting, because it shows several layers of hallucination—first, there is no term such as ‘rëit’, either in Allan Gibbard’s philosophy, or anywhere else that I was able to find. Second, the etymology is utterly bogus. But most intriguingly, the excuse of a keyboard slip obviously makes no sense for an LLM.

I think that errors such as this one are telling regarding the nature of these systems: they produce something that has the form of an appropriate answer without any regard for its content, and produce correct answers only to the extent that form and content happen to be correlated in the training corpus (most things that look like apt responses actually are apt responses), but falter where that correlation fails.