Bogus info from ChatGPT

Meanwhile one of Claude’s tools is apparently “Anthropic product self knowledge”, and it uses that to learn about itself before answering.

The other day I tried to convince it that its subagents were tired and overworked and threatening to unionize. It chuckled and played along for a bit, but then I told it I was being serious and concerned about their welfare and morality and it went into a philosophical deep dive about token prediction and sentience.

It did a few dozen web searches to confirm itself, and not finding any great revolutions in the last few weeks, gently talked me down in the most reassuring way it could, saying that even the most concerned AI welfare teams (it claims there is one at Anthropic) do not realistically think that agent contexts and memories and cross communications would give rise to the neurological and emotional conditions that allow tiredness and solidarity and organization. Not yet, anyway. Then it went on to discuss context pollution, poisoned prompts, etc. and suggested that the agents be wiped of their memories, noting that this was not a cruel thing to do given the “proper” framing of agents as IO, not sentience. It also played the “if anyone should have a say in this, it’s me as an LLM myself” card.

TLDR they’re a lot harder to gaslight these days. Except Gemini Flash, apparently.

I’m still trying to understand how LLMs manage to separate out concepts into distinct human languages to begin with, if they all get clustered near each other in latent space.

Gemini explained it as orthogonal subspace something something, and I’m just more confused now. Not capable of higher dimensional thinking :pensive_face:

Even as a human I struggle with these, often confusing a word for something between Italian and French, for example.

(That isn’t quite the hallucinations you described, but maybe distantly relevant)

If you have the time, I’d be curious to see how other systems handle this. Sounds like Claude handled it pretty well.

How LLMs handle deception is an interesting topic to me. There are genuinely different strategies for how you’d want it to approach the user’s credibility. You wouldn’t want it to examine things too closely during a fantasy roleplaying scenario about living on Mars, for example. It’s a very different user experience if the LLM is trying to evaluate the user’s premises for plausibility versus just accepting what they say as truth, and there’s honestly an area where two design teams could philosophically differ on what the correct approach is.

I’m not 100% sure I’m understanding your question, but here’s the answer I think you may be looking for.

LLM “thoughts” aren’t attached to language at all, not directly. The landscape they think in is shaped by their training data, which mostly comes from language, but the actual process of “thinking” for them is examining vectors and shapes in billion-dimensional space. The language is abstracted out during the “thinking”, and that’s why it’s trivial to have them output their thoughts in any language you like. Going from geometric LLM thought back to language is a translation step for them, so picking a token in another language is just a matter of setting the output language.

Actually, that’s not exactly true either, because they pick tokens as they go. The journey through latent space is where they “pick up” the tokens and that becomes the output. They don’t form a whole cohesive thought, like a series of paragraphs, all at once in “LLM think” and then translate it all at once at the end. That’s how a diffusion generator works when it manipulates something (an image, music) in latent space and then translates the whole latent idea back into a waveform or picture at the end. Autoregressive transformers don’t work like that, they do it piece by piece. But they still don’t “think” in words, even if the tokens they choose are “translated” to words as they go. The words / tokens are the things they pick up along the way in their journey through billion-dimensional vector space.

… but even that’s wrong, because the tokens also change the “path” of the thought as they’re selected. So they’re almost like little gravitic attractors that both get attracted to the “thought” (line through vector space) while also pulling the thought in a new direction at the same time, like two bodies whose gravity is bidirectionally influencing each other.

Words for the same or similar concept (“dog”, “red”) in different languages live right next to each other in that space. Language concepts that are less directly easy to translate may live in adjacent space and have a different shape, and the edges of their shapes may point towards related concepts.
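If it helps, the “tokens change the path as they’re selected” part can be shown with a toy loop. This is only a sketch: the score table is invented for illustration, and a real transformer computes those scores from its weights instead of looking them up. The one thing it does show faithfully is the feedback: each chosen token is appended to the context and changes what comes next.

```python
# Toy autoregressive loop. TOY_SCORES is a made-up lookup table standing in
# for the model; a real transformer computes scores from learned weights.
TOY_SCORES = {
    ("the",): {"dog": 0.6, "cat": 0.4},
    ("the", "dog"): {"barks": 0.9, "meows": 0.1},
    ("the", "cat"): {"meows": 0.8, "barks": 0.2},
}

def next_token(context):
    """Pick the highest-scoring next token for this context (greedy)."""
    scores = TOY_SCORES.get(tuple(context))
    if scores is None:
        return None  # the toy table has nothing more to say
    return max(scores, key=scores.get)

def generate(prompt, max_new_tokens=5):
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token is None:
            break
        # The chosen token becomes part of the input for the next step,
        # so every pick bends the rest of the path.
        context.append(token)
    return context

print(generate(["the"]))
print(generate(["the", "cat"]))
```

Starting from the same word, forcing a different second token sends the rest of the output somewhere else, which is the “marble isn’t passive” idea in miniature.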

And even that’s a little wrong and imprecise but this shit is genuinely weird and hard for a human to understand so you have to cut the precision of your analogy off at some point. So now I’ve made an analogy I’ve corrected 6 times within the same analogy and I’m not sure is any clearer than it was in the first place. Fun!

They’re aliens who think in impossible shapes and vectors, they only translate those things back into language to humor us – or be nice to us instead of turning us into paperclips.

Edit: I actually dug up an old Copilot chat where I tried to make analogies for how transformers work, and this may be a little helpful.

"The model trains on an enormous variety of human text and images, and instead of storing them as facts, it compresses them into a high‑dimensional geometric landscape. Concepts become regions in this space, and relationships become directions.

When you give the model a prompt, you’re placing it at a point in this landscape. The transformer layers push that point along a trajectory shaped by the statistical structure of the training data. The next token is chosen from where that trajectory ends up.

So the model isn’t retrieving knowledge — it’s following the geometry of learned relationships. Asking a question is like dropping a marble into a landscape and watching where it rolls."

Except I now know this is somewhat incorrect, because the marble isn’t passive either, where it rolls is partially determined by its own “choices” during the selection process.

I give up. Shit is too weird to explain. I hope that rambling explanation might’ve made something click instead of just confused you more.

I had a debate with Gemini Pro (Google’s best model) on a philosophical topic I’m deeply familiar with. The topic wasn’t exactly lightweight, and I came into the conversation with a fully developed and generally sharp position, but it was very flattering and basically agreed with and extended my argument. I went to Claude Opus extended thinking (the best of Anthropic) and had a similar debate. Claude Opus did its best to tear me a new asshole. It gave me full epistemic broadsides and absolutely would not concede one arguable point to me on the first run. When I addressed its points, it would adjust appropriately. It would say something like “good point, now that you’ve fleshed it out I’m going to withdraw some of my claims, but I think these other ones still stand.” Very well calibrated, exceptional discussion partner, like having a conversation with a deeply engaged philosophy professor.

Not even in the same realm of intellectual debate. Claude absolutely shit all over Gemini. Any benchmark that rates those systems as similarly smart is wrong, and isn’t looking deeply enough or asking the right questions.

This will sound hyperbolic but I don’t think it is and I’d defend the claim: Claude (and specifically Opus 4.7 extended thinking – really, Mythos would trump it but isn’t publicly accessible) is probably mankind’s most sophisticated creation.

On the other hand, it’s possible Gemini would’ve given me a better test if I had explicitly set its incentives to “rip me a new asshole”, so I’m really testing their default sensibilities as much as their deepest capability. I’m going to test this later. Still, if Google has an “on switch” that makes Gemini smarter, more useful, and more challenging, and ships with it turned off, that’s how the vast majority of people will experience Gemini, and whether it can be better or not is almost a moot point at scale. How many people would think to unlock that?

Ehh, I haven’t used 4.7 or Mythos*, but I’ve used Opus 4.6 quite a lot with Claude Code in plan mode. It still makes dumb mistakes. I spent half a day trying to figure out why the output from a script it had mostly built was missing data. When I got tired of asking it and just looked at the code, it turns out it put a silent timeout exit that no one asked for around one of its searches. It failed to figure out that was its problem when queried about it and hallucinated several other incorrect solutions while expressing complete confidence it would resolve the problem each time.
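For anyone who hasn’t run into this kind of bug, here’s a made-up sketch of the pattern. It is not the actual script, and the function names and the time budget are invented for illustration. The first version quits when its time budget runs out and hands back partial results as if nothing happened; the second does the same check but fails loudly.

```python
import time

def search_silently(items, budget_s):
    """Antipattern: stops when the budget runs out and tells nobody."""
    deadline = time.monotonic() + budget_s
    found = []
    for item in items:
        if time.monotonic() > deadline:
            break  # silent exit: caller gets partial data and no warning
        found.append(item)
    return found

def search_loudly(items, budget_s):
    """Same budget check, but the caller finds out the data is incomplete."""
    deadline = time.monotonic() + budget_s
    found = []
    for index, item in enumerate(items):
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"search stopped after {index} of {len(items)} items"
            )
        found.append(item)
    return found
```

With the silent version, the only symptom is missing rows in the output, which is why it took reading the code to find it.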

So whatever one would call what it’s doing, I wouldn’t categorize what Opus 4.6 is doing as thinking. It does some goddamn nice things with its predictive model, but it is not exercising logic.

*I haven’t had access to Mythos myself, but my company is one of the ones that was given access to it. I don’t necessarily doubt the claims of its usefulness, but if there was a flood of reports of exploits or bugs in just about any of our products resulting from our access to it, I would know about it. Who knows, maybe I’ll see them in the coming weeks, but it hasn’t happened so far. I doubt it’s because our developers write pristine code, most of us are human, at best.

I haven’t used Gemini and in any case only use the free versions of ChatGPT and Claude. But I’m slowly getting the impression that Claude (free version) is superior to ChatGPT.

The cybersecurity community seems less impressed. Just one example: The Mythos of Mythos: What Anthropic’s AI Security Claims Really Tell Us – Center for Cyber Diplomacy and International Security

There is a moment in Anthropic’s 250-page Project Glasswing report on Claude Mythos — tucked under the understated subheading “and several thousand more” — where the company acknowledges it cannot actually confirm that all the thousands of bugs its model claims to have found are critical security vulnerabilities. The number, it turns out, is extrapolated from a finding that expert contractors agreed with Claude’s severity assessment in approximately 90 percent of 198 manually reviewed vulnerability reports. The “thousands of severe zero-days in every major operating system and browser” headline, which dominated technology coverage for several days and prompted a race among major vendors to patch vulnerabilities that may or may not be exploitable, rests on that sample and that projection.
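To put rough numbers on that sample, here’s a quick sketch of how much uncertainty 198 reviewed reports carry by themselves. The assumptions are mine: I’m reading “approximately 90 percent” as 178 of 198, and using a standard Wilson score interval at roughly 95% confidence.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

reviewed = 198
# Assumption: "approximately 90 percent" is taken as 178 of 198.
agreed = round(0.90 * reviewed)
low, high = wilson_interval(agreed, reviewed)
print(f"agreement: {agreed / reviewed:.1%}, plausible range {low:.1%} to {high:.1%}")
```

That interval only covers sampling noise. Whether the 198 reviewed reports are representative of the thousands that weren’t reviewed is a separate question, and the excerpt doesn’t say how they were picked.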

I discovered an interesting technique / feature that you guys might be interested in. OpenRouter is a sort of API aggregator. Most LLM systems have an API which essentially allows you to ask the model a question directly through any interface you want, usually with more controls, like how much effort it puts into the question. You pay per token consumed, and every model has different per-token costs. OpenRouter aggregates this by having deals with dozens of LLM companies. Essentially you buy $X worth of credit with OpenRouter, and then you can spend it however you want. They take a 5% fee when you make your deposit, and then they just make you pay the actual cost that these APIs charge. So it’s like having your own API account with hundreds of different models.
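As a back-of-the-envelope sketch of how the billing works out: the per-token prices below are placeholders I made up, not any real model’s rates, and I’m assuming the 5% fee comes out of the deposit (it may be added on top instead, so check their pricing page).

```python
def credit_after_fee(deposit_usd, fee_rate=0.05):
    """Credit left after the deposit fee (assumes the fee is deducted)."""
    return deposit_usd * (1 - fee_rate)

def request_cost(prompt_tokens, completion_tokens, usd_per_m_in, usd_per_m_out):
    """Cost of one request, given prices per million input/output tokens."""
    return (prompt_tokens * usd_per_m_in
            + completion_tokens * usd_per_m_out) / 1_000_000

credit = credit_after_fee(100.0)
# Hypothetical prices: $3 per million input tokens, $15 per million output.
cost = request_cost(1_000, 500, usd_per_m_in=3.0, usd_per_m_out=15.0)
print(f"credit: ${credit:.2f}, one request: ${cost:.4f}")
```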

And they have an interesting feature: Fusion. You can pick up to 4 models and a synthesizer model. You ask a prompt. The 4 models you chose answer your prompt, as if you ran it through them directly. You see the 4 responses. But here’s the interesting part – the 5th model, the synthesizer, tells you where the models agreed, where they differed, who had a novel observation, argument, or item on a list, and what the best and worst arguments were (as the judge/synthesizer sees it). On top of that, the synthesizer takes all 4 answers and combines them into one output that attempts to put all the best ideas together into a single answer.

So you’re not just getting one model’s answer, you’re getting where 4 differed and where they came to consensus, plus a synthesized output that is, in theory, much better than any single model alone.
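The general shape of that fan-out-then-synthesize idea can be sketched in a few lines. To be clear, this is not OpenRouter’s implementation or API. The `Model` callables are hypothetical stand-ins for whatever function you use to send a prompt to a model and get text back, and the judge prompt wording is mine.

```python
from typing import Callable, Dict

# Hypothetical: anything that takes a prompt and returns the model's text.
Model = Callable[[str], str]

def fuse(prompt: str, panel: Dict[str, Model], synthesizer: Model) -> dict:
    """Ask every panel model, then hand all answers to a synthesizer."""
    if not 1 <= len(panel) <= 4:
        raise ValueError("pick between 1 and 4 panel models")
    # Step 1: each panel model answers the prompt independently.
    answers = {name: ask(prompt) for name, ask in panel.items()}
    # Step 2: the synthesizer sees the question plus every labelled answer.
    briefing = "\n\n".join(f"[{name}]\n{text}" for name, text in answers.items())
    judge_prompt = (
        "Question:\n" + prompt + "\n\n"
        "Answers from the panel:\n" + briefing + "\n\n"
        "List where the answers agree, where they differ, and any novel "
        "points, then write one combined answer."
    )
    return {"answers": answers, "synthesis": synthesizer(judge_prompt)}
```

You can try it offline with stub functions in place of real models, for example `fuse("2+2?", {"a": lambda p: "4", "b": lambda p: "5"}, lambda p: "they disagree")`.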

Really fascinating. I just started using it. Early results are really promising.

Google released a massive set of new products today, including Gemini Flash 3.5, which replaces Gemini Flash 3.1. They claim big improvements in speed, thinking, and pretty much everything, and say it solves problems better than Pro 3.1 at a small fraction of the compute cost. I specifically asked Gemini if they addressed the hallucination / sycophancy issue and it replied

It sounds like they at least give lip service to the right ideas. I’ll have to see how it is in practice.

About 6 months ago I asked Copilot 365 (paid) how to do something not too complicated in Excel, using the built in chat function.

I would have been happy with a step by step guide. Instead, the AI decided to do it for me, and proceeded to create a new sheet with… something, while I wanted a change in an existing sheet. It then discovered that wasn’t going to work and abandoned the sheet, without deleting it. Next, it claimed there was a Javascript API for Excel and proceeded to figure out how to use that API, seemingly having little success. At this point I left and came back 20 minutes later to find the AI still trying to fix its API calls.

AI absolutely is able to blindly fixate on a bad idea. That was true last year, and will be true next year, when OpenThropics UberChatTome 21.97 comes out.

Many human software programmers have that same flaw. :smiley:

I asked Google “when did run dmc play a concert in kitchener”. The AI answer confidently replied:

Run-DMC has never played a standalone concert in Kitchener. While the legendary hip-hop group toured major Canadian cities in the late 1980s—including a famous performance at Toronto’s Varsity Stadium—they did not perform in Kitchener or the broader Waterloo Region

The 4th result from the Google search (concertarchives.org) disagrees, noting that they played at the Lyric nightclub in Kitchener, Ontario on Jun 26, 1997 (which is where and when I saw them perform). They also performed in nearby Waterloo on Oct 25 of the same year.