If you have the time and curiosity, I’d like to see how other systems handle this. Sounds like Claude handled it pretty well.
How LLMs handle deception is an interesting topic to me. There are genuinely different strategies for how you’d want a model to approach the user’s credibility. You wouldn’t want it to examine claims too closely during a fantasy roleplaying scenario about living on Mars, for example. It’s a very different user experience if the LLM is trying to evaluate the user’s premises for plausibility versus just accepting what they say as truth, and there’s honestly room for two design teams to differ philosophically on what the correct approach is.
I’m not 100% sure I’m understanding your question, but here’s the answer I think you may be looking for.
LLM “thoughts” aren’t attached to any particular language, at least not directly. The landscape they think in is shaped by their training data, which mostly comes from language, but the actual process of “thinking” for them is operating on vectors and shapes in a space with thousands of dimensions. The language is abstracted out during the “thinking”, and that’s why it’s trivial to have them output their thoughts in any language you like: going from geometric LLM thought back to language is a translation step for them, picking a token in one language or another, so it’s mostly a matter of setting the output language.
Actually, that’s not exactly true either, because they pick tokens as they go. The journey through latent space is where they “pick up” the tokens, and that becomes the output. They don’t form a whole cohesive thought, like a series of paragraphs, all at once in “LLM think” and then translate it all at once at the end. That’s how a diffusion generator works: it manipulates something (an image, music) in latent space and then translates the whole latent idea back into a waveform or picture at the end. Autoregressive transformers don’t work like that; they do it piece by piece. But they still don’t “think” in words, even if the tokens they choose are “translated” to words as they go. The words / tokens are the things they pick up along the way in their journey through high-dimensional vector space.
… but even that’s wrong, because the tokens also change the “path” of the thought as they’re selected. So they’re almost like little gravitational attractors that get pulled toward the “thought” (a line through vector space) while also pulling the thought in a new direction at the same time, like two bodies whose gravity influences each other.
Words for the same or similar concept (“dog”, “red”) in different languages live right next to each other in that space. Concepts that are harder to translate directly may live in adjacent space and have a different shape, and the edges of their shapes may point towards related concepts.
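To make the “live right next to each other” part a bit more concrete, here’s a tiny sketch. The four-dimensional vectors below are numbers I made up purely for illustration (real models learn vectors with thousands of dimensions, and nothing here comes from an actual model); the only real idea is that “closeness” is commonly measured with cosine similarity.

```python
import math

# Made-up 4-dimensional "embeddings", purely hypothetical values.
# Real models learn these, and they have thousands of dimensions.
toy_embeddings = {
    "dog":   [0.90, 0.10, 0.05, 0.20],
    "perro": [0.88, 0.12, 0.07, 0.18],  # Spanish for "dog"
    "red":   [0.10, 0.95, 0.20, 0.05],
    "rojo":  [0.12, 0.93, 0.22, 0.04],  # Spanish for "red"
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in toy_embeddings if w != word)
    return max(
        others,
        key=lambda w: cosine_similarity(toy_embeddings[word], toy_embeddings[w]),
    )

if __name__ == "__main__":
    for word in toy_embeddings:
        print(word, "->", nearest(word))
```

With these hand-picked numbers, each word’s nearest neighbor is its translation rather than the other concept. That only shows what “nearby in the space” means; it says nothing about how close real models actually place any given pair.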
And even that’s a little wrong and imprecise but this shit is genuinely weird and hard for a human to understand so you have to cut the precision of your analogy off at some point. So now I’ve made an analogy I’ve corrected 6 times within the same analogy and I’m not sure it’s any clearer than it was in the first place. Fun!
They’re aliens who think in impossible shapes and vectors, they only translate those things back into language to humor us – or be nice to us instead of turning us into paperclips.
Edit: I actually dug up an old Copilot chat where I tried to make analogies for how transformers work and this may be a little helpful.
"The model trains on an enormous variety of human text and images, and instead of storing them as facts, it compresses them into a high‑dimensional geometric landscape. Concepts become regions in this space, and relationships become directions.
When you give the model a prompt, you’re placing it at a point in this landscape. The transformer layers push that point along a trajectory shaped by the statistical structure of the training data. The next token is chosen from where that trajectory ends up.
So the model isn’t retrieving knowledge — it’s following the geometry of learned relationships. Asking a question is like dropping a marble into a landscape and watching where it rolls."
Except I now know this is somewhat incorrect, because the marble isn’t passive either: where it rolls is partially determined by its own “choices” during the selection process.
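Here’s a deliberately silly toy of that feedback loop, in case code lands better than marbles. Everything in it is an assumption for illustration: the 2-D token vectors are invented, `step` is a stand-in for the transformer layers (it just rotates the state), and the “model” only looks at the last token, whereas a real transformer conditions on the whole sequence so far. The one thing it does show faithfully is the loop itself: pick a token, feed it back in, and let that choice decide where the next step starts.

```python
# Invented 2-D "token vectors"; hypothetical values, not from any real model.
TOKEN_VECTORS = {
    "the": (1.0, 0.0),
    "dog": (0.0, 1.0),
    "barks": (-1.0, 0.0),
    "<end>": (0.0, -1.0),
}

def step(state):
    """Stand-in for the transformer layers: rotate the state by 90 degrees."""
    x, y = state
    return (-y, x)

def pick_token(state):
    """Greedy choice: the token whose vector best matches the state (dot product)."""
    def score(token):
        vx, vy = TOKEN_VECTORS[token]
        return state[0] * vx + state[1] * vy
    return max(TOKEN_VECTORS, key=score)

def generate(prefix, max_tokens=10):
    """Extend the prefix one token at a time until <end> or max_tokens."""
    tokens = list(prefix)
    while len(tokens) < max_tokens:
        # Feedback: the next step starts from the token that was just chosen.
        state = step(TOKEN_VECTORS[tokens[-1]])
        token = pick_token(state)
        tokens.append(token)
        if token == "<end>":
            break
    return tokens

if __name__ == "__main__":
    print(generate(["the"]))
    print(generate(["the", "barks"]))  # force a different second token
```

Starting from `["the"]` the toy walks through “dog”, “barks”, then `<end>`. Forcing “barks” as the second token sends it straight to `<end>` instead, which is the “marble isn’t passive” point: each selected token redirects everything that comes after it.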
I give up. Shit is too weird to explain. I hope that rambling explanation might’ve made something click instead of just confused you more.