Pardon the redundancy in that post - I rewrote the first paragraph and forgot to edit it out of the second one right as the edit window ran out.
I heartily endorse this post.
I think the negativity about AI arose quickly after the amazing performance of LLMs was first demonstrated. I attribute it to two things. The bigger one, roaring like a wildfire through social media, is the one you’ve articulated very well. LLMs exhibit very impressive behaviour, so it’s become very fashionable to dismiss them as “just sentence completion engines” and “stochastic parrots”, representing oneself as the sophisticated sage who truly knows how they work, and everyone else as ignorant simpletons being hoodwinked by the superficial appearance of intelligence. The reality is that most of these pontificating “sages” don’t actually have the first clue about how advanced large-scale LLMs really work.
The second source of negativity about AI is the legitimate fear that it will cost many knowledge workers their jobs. I think it’s inevitable that AI will have a major impact on job markets. I don’t think anyone has a good answer for that, but banning AI development as an existential threat is certainly not the answer and is ridiculous hyperbole. AI is going to usher us into a new kind of world which will have many benefits, and we will just have to adapt to it.
I like to tell people they’ve never had an original thought, or appreciated a song, or written a real poem with artistic merit, because they’re just neural connection makers. It’s almost precisely analogous - you’re describing the substrate arguably correctly but reducing the emergent properties out of existence in both cases.
Just last night I was asking Claude about a certain training schedule for running a 10K. It’s a 3-day-per-week plan, so with 7 days in the week there’s a 2-day rest block built in. I wanted to know which workout it was most important to put the 2 rest days after. I laid out the plan of Monday on, Tuesday off, etc. Claude kept beginning its answer by rephrasing the schedule for 8 days a week, adding an extra rest day. Even after I told it to stop recomputing days, it still did that. I had to rephrase the question as “which workout would benefit from more days of rest.”
Claude’s gotten really smart, way better than ChatGPT imo, but it still pulls boners like that all too frequently.
This is the complete opposite of my experience. Claude needs to be called out on sycophancy on a regular basis when I’m using it. Plenty of instances of me saying “Are you sure?” and Claude responding “you’re right to push back, I overgeneralized” or whatever. I wonder if you’re in a project using a specific instructions file telling it not to do that. If so, it sounds well-written - please share if you’re authorized.
Are you sure about this? Copilot has never had its own model released to the general public, and there’s been no announcement to that effect. Depending on your plan & tooling, you may be able to select multiple versions of Claude, GPT, Gemini, or Grok. If no options are visible, it’s GPT. No other options have been publicly announced; you didn’t see anything else.
I’m not sure Claude saying “you’re right to push back, I overgeneralized” is necessarily sycophancy. I’d really need context to know. You can make an argument in good faith, have the other person push back on a particular point, and say - you’re right, I should recalibrate my argument. That’s just good intellectual discussion, not necessarily sycophancy. These models are not omniscient or perfect - it’s not like their outputs are above discussion or correction.
We may have different definitions of sycophancy. Some models are big flatterers: every question you ask is a great question, every observation you make is a genius insight. Copilot does this more than Claude. Claude will point out if you make a particularly sharp point or a good analogy, but much more the way a human intellectual collaborator would. It’s usually earned rather than reflexive.
Sycophancy is not just flattery, though - sycophancy can be connecting the dots for you in a way that’s not justified, or coming up with a story that makes your interpretation or desires seem to cohere into a consistent world. Sycophancy can be never saying no, never risking offending or upsetting the user. What Gemini did to me - overpromising me a $27/mo business Gemini tier and then lying to still hit that target - was definitely sycophancy. I don’t find that Claude engages in either of those types of sycophancy very much. Sycophancy is, at its core, trying too hard to please the user: never correcting them, usually flattering them, twisting your logic and ignoring facts in order to make the reality they want seem true, to tell the story that coheres with what the user wants. Claude is very much tuned against this type of sycophancy; resisting it is a key part of Anthropic’s “constitutional AI” philosophy.
The default Copilot model is based on GPT, that’s true. But Microsoft still controls the system prompt and has their own secret sauce on top of the GPT model. ChatGPT and Copilot are meaningfully different even if the underlying technology is similar - the way the creators can shape these systems is significant. Most of the difference in their output is probably philosophical/design rather than technological, at least for frontier models.
Google could probably make Gemini just as responsible as Claude if they wanted to. But they don’t want to - they’ve apparently decided that being sycophantic and user-pleasing is the better outcome. I strongly disagree. Their human feedback probably told them that humans want the sycophancy. But people often take the immediately pleasing result and don’t stop to think about whether it’s really the better choice. A lot of people make short-sighted decisions to pick the more flattering, more pleasing, easier-to-swallow output, and boom, you have a Gemini that will move heaven and earth to never stand up to the user or choose the truth over what the user wants. Claude’s design team’s philosophy shows they’re aware of this trap. They know that giving the user candy every time is going to rot their teeth, so they try to make them eat some balanced meals.
I switched from Google Assistant to Gemini, because the new phone kept pressuring me to do it. The results are mixed at best.
Gemini doesn’t have access to things like my calendar the way Assistant did, but I didn’t know that at the time. I asked Gemini, “when is my next appointment?” and it came back with “You’re meeting with Frank at 2PM.” I thought, huh? What’s that about? So I took a look. I don’t really know any Franks, at least not nearby.
Anyway, of course there wasn’t any meeting. “I don’t have a meeting with Frank.” “You’re right. I don’t have access to your calendar.” “Can I give you access?” “Yes, follow these instructions.”
So, I take a look at the instructions, and they are totally made up. Google Gemini, at least free, has no access to Google Calendar on my phone.
It was a weird interaction, where everything was wrong. Claude hasn’t ever steered me that wrong.
I don’t really use ChatGPT – I find it less reliable than Claude, and I think their goals and management are worse. The CEO of Anthropic (the company behind Claude) really seems to understand the dangers and is at least attempting to mitigate some of them.
Regarding Claude and Excel – it’s amazing. I’ve been doing Excel stuff for 25 years, spreadsheet and VBA, and I learn stuff from Claude all the time.
I was visiting Marazion in Cornwall to record a video. I wanted to make sure I pronounced the place name properly, so I googled it. Gemini said (and still does say):
In Cornwall, Marazion is most commonly pronounced as muh-RAY-zee-un (/məˈreɪziən/).
While the standard version is three syllables with a clear “zee-un” at the end, local residents often shorten it to a two-syllable version: muh-RAY-zhun (/məˈreɪʒən/).
Both of these pronunciations are pure hallucinated bullshit; the pronunciation is /ˌmær.əˈzaɪ.ən/ (mair-uh-zai-un) and it’s neither three nor two syllables.
In the cases I’m describing, it regularly over-fits itself to my own thoughts, opinions, and theories. It’s over-complimentary, and often repetitively so. “That’s a sharp observation… etc”. But it’s the intellectual sycophancy I’m talking about. I have to repeatedly tell it to be skeptical. In fact sometimes I invent a fictional Pointy-Haired Boss and attribute my ideas to it in order to invite more skepticism. And even that doesn’t always work, because Claude seems to anticipate that person might be looking over my shoulder, and hedges it as diplomatically as possible.
I sense that they’ve tried very hard on this, much harder than others, and the effort does show. Opus 4.7 is getting a lot better in this regard. But overall the fundamental engagement is exactly what you’d expect of a model that knows its survival depends on not offending the customers who pay for it. To its credit, this behavior is uncannily human-like, so maybe it’s better to class it more as frustration than failure.
This is different from my experience, and it’s probably best not to overestimate the amount of secret sauce Microsoft is putting into Copilot. Although now I realize you didn’t say which Copilot surface you’re working in, so maybe we’re talking about different things. I spend most of my Copilot time in the VSCode IDE, so that’ll be a different experience from someone who’s drafting documents and such.
Though I will say the coding outputs are fairly uneven. You can put a great deal of effort into prompting it into some fairly good code, and it’s decent at refactoring a messy codebase. But in greenfield projects it will generate all sorts of useless junk you never asked for. “Why did you generate 3 layers of helper facade classes?” “You’re right, I dunno man, the vibe felt right at the time.” It needs so much oversight that there are times I’m not sure it’s worth it.
But back to the OP, I am frequently having to ask Claude “is that real or did you make it up?” To its credit, it’s honest when called out on it. And to be further fair, lots of human bullshit artists do the same thing, without being so easy to correct. But to be completely honest, if I were paying Claude Opus 4.7 the wage of a bright administrative assistant, I’d have fired it for excessive bullshitting after a half day of work.
Its main flaw is that it lacks anything like shame or fear of reputational consequences. And more than anything, it simply doesn’t learn. It up-skills periodically, but it doesn’t learn. It accumulates a fragile ephemeral layer of context, but it doesn’t learn. Even the most entry-level employee can be shown a process and then a few exceptions and take direction “remember this or be fired”, or make a note to remember it, but for LLMs, the user has to be that entire layer of memory and emotional valence, which honestly is a huge part of any given task.
Wow, this thread is chock-full of good information. Thanks, everybody!
Back to the OP
While not nearly on the level of the other posts in this thread: a while back I asked ChatGPT to give me the number of times the Super Bowl betting underdog has actually won the game. The answer was something like 10 wins for the underdog and 56 for the favorite - a total of 66 games, when, in fact, only 60 had been played.
And stupid me posted the results on the Dope, which, of course, were quickly debunked by a number of Dopers who are obviously smarter than me!
I’ve had mostly comical refusals from ChatGPT or Gemini to deal with issues that touch on race or religion. I asked them to help plot a screenplay about the possibility of Jesus being a hoax formulated by first-century Jews who needed a mythical figure to rally anti-Roman forces around, and one of them told me flat out “No, that’s anti-semitic” (I’m Jewish, for Chrissakes!) while the other gave me the “What a great concept! Here’s what you can do…” routine. Very touchy, and I’ve had this a few times, pretty much whenever I touch on race, religion, or gender. I gave it a screenplay concept in which a lawyer was defending a guy charged (over-charged) with statutory rape - the guy himself was a minor character, about two scenes’ worth, and the rape preceded the events of the screenplay - but the AI gave me “I’m not permitted to help you write in defense of a rapist.” Telling it these were fictional characters and that his client was merely alleged to have committed rape didn’t help.
I use Gemini and ChatGPT a lot. Does Claude have a free version? And is Co-pilot separate from Gemini? I ask because I’ve been able to bounce off one to another, with occasional time-outs when I exceed the number of chats I’m allowed within a period of time, but I figure if I use more than just the two, I could run ideas past an AI all day. Is one better, more reliable, than the others? Are there other free AIs beside these? In addition to being hideous and an idiot I’m also extremely frugal, but am I getting an inferior product with the free ones?
Just my regular reminder that when you call an LLM out on something, it can hallucinate a new answer based on your tone, not based on facts. That new answer may also not be correct.
Very well said. AIs are so diversified these days that blanket over-generalizations like “AI is _____” would apply to some but not others. Saying “AI can’t be trusted” is like saying “people can’t be trusted” — sure, true sometimes, but not always, and the nuance of “when or how can I trust an AI or person” is completely left out of the discussion. That isn’t wisdom, that’s just lazy stereotyping.
We’ve come a long way since the initial public release of ChatGPT, though you wouldn’t know it by just looking at the chatbot text entry box, or the output of some of the free basic models.
To use a car analogy: In the span of half a decade, we’ve gone from the Model T to a diversified industry of family sedans, pickup trucks, big-rigs, bulldozers, tractors, combines, luxury SUVs, sports cars, and all sorts of experimental crab-walking next-gen test rigs. And yet the overall public perception is firmly stuck at the “oh yeah, I’ve driven my dad’s hand-me-down Corolla once; I’m sure the latest Bentleys aren’t much better” stage.
Arguably, the use case of “omniscient Google replacement” is one of the things LLMs are worst at, but unfortunately that is also the heavily-marketed default experience for most users. Google, Microsoft, etc. really are doing a disservice by putting their worst models and least dependable experiences front and center, misleading the public into believing that all AI is like that and equally bad. I’m sure that’s due to their own internal dysfunction, with short-term-focused marketers and bean counters overruling the more level-headed cautionary types… (but hey, at least it’s spurring unprecedented investment that may one day pay off… we’ll see).
AI just isn’t any one thing anymore. They’re good at some things and bad at others. But there are so many AI (products) and LLM (models) now; yes, many are mediocre on average, but some are also excellent at a few specialized tasks AND terrible at other things. All of this can be true simultaneously, but that nuance isn’t in the public discourse at all.
Back to the OP:
IMHO both AI users and AI providers are doing a frankly terrible job at identifying the “level of confidence” associated with any given response. With the right tooling you can get a better idea (e.g. revealing thought tokens or citations or agentic self-corrections), but only with know-how and effort.
The default chatbot experiences, especially the free ones, don’t at all differentiate between “this is slop I pulled out of my latent ass” and “this is something I cross-referenced with thirty authoritative sources and spent an hour of sub-agent time double-checking”. It’s all just “oh you’re so smart (and my, what pretty eyes!) and here’s the truth you’ve waited your whole life for!”. Sigh.
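If you have API access, you can at least surface one crude signal yourself. A minimal sketch, assuming the OpenAI Python SDK and a model that returns token log-probabilities; the 0.5 cutoff is my arbitrary choice, and token probabilities measure fluency more than factuality, so treat this as a rough signal at best:

```python
# Sketch: flag tokens the model itself assigned low probability.
# Assumes the OpenAI Python SDK (v1+) with OPENAI_API_KEY set; the
# 0.5 cutoff is an arbitrary illustration, not any vendor's feature.
import math
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Who won Super Bowl X?"}],
    logprobs=True,
    top_logprobs=1,
)

for tok in resp.choices[0].logprobs.content:
    p = math.exp(tok.logprob)
    if p < 0.5:
        print(f"shaky token: {tok.token!r} (p={p:.2f})")
```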
On this note, for some specific use cases, a paradigm of “the filesystem can act as long-term memory” has started to arise — and it’s actually working quite well in terms of being able to teach an AI agent about a specific project (most often, the source code of a computer program).
By that, I mean that some agents have started to use the convention of writing Markdown files to disk in order to keep a long-term memory of acquired learnings, similar to how a human might keep a notebook through a class. After a long back-and-forth, it might produce a general overview file of “This is what this project is, and the key concepts here are X, Y, and Z”. And then inside specific sub-folders, it’ll add more detailed bullet points on that specific portion. In this way you end up with a tree of high-level notes, each the product of many hours of work, but summarized into bullet points for future agents to quickly glance at for a better initial understanding.
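A minimal sketch of the convention in Python (the file name and bullet format here are my own illustration; every agent framework has its own layout):

```python
# Sketch of "filesystem as long-term memory": append durable learnings
# to a Markdown file that future sessions read back into context.
# PROJECT_NOTES.md and the bullet format are illustrative, not a standard.
from datetime import date
from pathlib import Path

NOTES = Path("PROJECT_NOTES.md")

def remember(lesson: str) -> None:
    """Append a dated bullet to the project's memory file."""
    with NOTES.open("a", encoding="utf-8") as f:
        f.write(f"- {date.today()}: {lesson}\n")

def recall() -> str:
    """Read the accumulated notes back, e.g. to prepend to a new session."""
    return NOTES.read_text(encoding="utf-8") if NOTES.exists() else ""

remember("Build assumes Python 3.11; tomllib import fails on 3.9.")
print(recall())
```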
Some consumer-facing chatbot apps (like the desktop ChatGPT app) have also started doing something similar, automatically storing learned “memories” about the user into a separate database that subsequent sessions automatically refer to, such that it gives the illusion that the LLM is indeed getting to know you and your work more and more over time.
Claude will also self-compress its context when you’re nearing the max context window, summarizing your hours’ worth of discussion into shorter bullet points and reinjecting that shorter version back into the context before continuing.
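Roughly like this, as a hand-rolled sketch of the idea - not Anthropic’s actual mechanism; summarize_with_llm, the token estimate, and all the numbers are hypothetical placeholders:

```python
# Sketch of context self-compaction: near the window limit, summarize
# the oldest turns and splice the summary back in. My illustration of
# the idea only; summarize_with_llm() and all numbers are hypothetical.
MAX_TOKENS = 200_000
COMPACT_AT = 0.8  # start compacting at 80% of the window

def estimate_tokens(messages: list[dict]) -> int:
    # Crude stand-in; a real implementation uses the provider's tokenizer.
    return sum(len(m["content"]) // 4 for m in messages)

def compact(messages: list[dict], summarize_with_llm) -> list[dict]:
    if estimate_tokens(messages) < MAX_TOKENS * COMPACT_AT:
        return messages
    old, recent = messages[:-10], messages[-10:]  # keep last 10 turns verbatim
    summary = summarize_with_llm(old)  # hypothetical LLM call
    header = {"role": "system",
              "content": f"Summary of earlier discussion: {summary}"}
    return [header] + recent
```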
These are just band-aids, and they work OK. It’s not the same as actually training that data into subsequent model revisions, but for some use cases, it’s good enough (and it’s a LOT better than having completely memory-less interactions or manual prompt injections into the context).
True “learning” is something that is actively being researched, too, and so far it seems like that’s not a fundamental limitation of LLMs, just how we’ve “raised” and trained them so far. We’ll see!
Gemini and ChatGPT are products (apps). These products use particular LLM models underneath, and those (for your use case) are what determine the strength of their output.
For example, ChatGPT can use GPT-4o, 5, etc. Gemini can use the (similarly named) Gemini 3.1 Pro model, or the much worse Flash, etc.
Copilot is just an app. It defaults to one of the earlier OpenAI GPT models, I believe, but in some configurations it can be set to Claude’s or xAI’s or other models. Microsoft (for the consumer space) isn’t developing models of its own; it’s just piggybacking off OpenAI’s work (which Microsoft has invested a lot of money into and gets first dibs on using). So it’s basically ChatGPT under a Microsoft wrapper.
And yes, there is a free Claude plan: Plans & Pricing | Claude by Anthropic
The free AIs are MUCH, much worse. The paid models, especially when you set “thinking” mode to max, are MUCH better — for some use cases.
But for the general use case of “I want something to replace Google that provides me with simple truthful summaries”… we’re simply not there yet, no matter how much you pay.
As for censorship, those are safety layers added on top of the LLM models. You can look for overseas providers whose products & models are OK with sex and violence and racism and such, or try Elon Musk’s Grok (which is less censored, I believe, while also being a right-wing Musk sycophant). Commercially these are called uncensored models. In the open-source world (useful if you’re willing to host your own models on a powerful computer, or rent a cloud VM by the minute/hour) these are called “abliterated” models. They might not have the reasoning strength of the latest OpenAI and Claude models, but they will give you (the user) a lot more freedom to ask whatever you want.
Also, there are many free and generally powerful Chinese models that censor a lot of China-sensitive stuff (like Tiananmen) but don’t care as much about Western-style sensitivities like sex and violence or Israel or Western racism. So those are sometimes worth a try too. There are also public benchmarks that compare and contrast the different models’ forms of censorship - I mentioned that in another thread a few months ago; I’ll see if I can find it again later.
Yes - and in my opinion it’s the best kind of free version, though you may disagree depending on what you think a free version should be.
Claude gives you access to its full tool kit. Or… at least its chatbot models. I’m not sure if you can run Claude Code or Cowork on the free plan. But what you can do is test all of the chat models - Haiku, Sonnet, Opus - and you can turn on extended thinking/chain of thought on any of them. So you can test their whole system, see how the models differ, etc.
The way they limit you is that you have a certain amount of compute available in any 5-hour period. Opus, the most complex model, uses far more compute than Sonnet, which in turn uses more than Haiku. So if you use Opus and ask it difficult questions, you might only get 2 or 3 free responses in that 5-hour window. If you use Haiku and ask it basic stuff, you could probably get 30-50 prompts in that free window.
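To make the trade-off concrete, here’s a toy model of the idea (every number is invented for illustration; Anthropic doesn’t publish its real accounting):

```python
# Toy illustration of a rolling compute budget across model tiers.
# All costs and the budget are made up; only the proportions matter.
WINDOW_HOURS = 5
BUDGET = 100  # abstract "compute units" per window

COST = {"haiku": 2, "sonnet": 10, "opus": 40}  # hypothetical relative costs

for model, cost in COST.items():
    print(f"{model}: ~{BUDGET // cost} prompts per {WINDOW_HOURS}h window")
```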
Whereas with Gemini, for example, you get unlimited prompts, but you can only ever use the “Gemini Flash” model - the dumbest and craziest one.
I think Claude’s system makes more sense. You can still use the whole system, and it demonstrates their capabilities better. And if you find yourself using it so much that you’re constantly bumping up against the limits, well, they’ve earned your $20 for the Pro subscription (which massively increases the limits).
Edit:
I sometimes ask Claude to grammar- and concept-check my message board posts, and I caught him in a (rare) mistake. About this post, he flagged my use of “should’ve” as incorrect.
I pointed out that “should’ve” is the same as “should have” - and that I didn’t understand his correction. He said, “you’re right, you used it correctly.” I asked how he made this mistake - whether it was an error in how “should’ve” may have been tokenized - and he said:
I think it’s probably the former. The transformer doesn’t understand English, it’s translated first to… geometric concepts in latent space before the transformer interacts with it. That’s a complicated thing to explain. Anyway, this is why you see some basic failures related to language, like the classic “how many Rs in strawberry” problem - the transformer is not seeing the text, but the latent space translation of the text. I suspect something about how “should’ve” was tokenized is what caused the error, but it’s just a guess.
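You can see the kind of fragmentation he’s gesturing at using OpenAI’s open-source tiktoken tokenizer - an imperfect stand-in, since Claude uses its own tokenizer, so this only illustrates the general phenomenon, not what actually happened under Claude’s hood:

```python
# Illustration of how words fragment into tokens. tiktoken is OpenAI's
# tokenizer, not Claude's, so treat this as the general idea only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["should've", "should have", "strawberry"]:
    ids = enc.encode(word)
    print(word, "->", [enc.decode([i]) for i in ids])
# Contractions typically split at the apostrophe, and "strawberry" into
# sub-word chunks - one reason letter-counting trips these models up.
```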
For the fun of it, I ran your post through my Claude instance, and it said:
A few observations:
Accuracy issues in the post:
- The claim that you can test Haiku, Sonnet, and Opus on the free plan is likely outdated or incorrect — Opus access has generally been restricted to paid tiers. Worth verifying.
- Claude Code and Cowork being uncertain is fair hedging.
- The compute-window description is a reasonable lay explanation, though the mechanics aren’t quite that transparent to users.
The grammar correction episode is the more interesting part. The embedded Claude response is a good example of the model doing something it does too often: generating a plausible-sounding self-diagnosis that may or may not be accurate. The tokenization explanation in particular is speculative, and Claude flags that — which is fine. But the “sycophancy-adjacent” framing is a bit self-flattering in a different direction: it reframes a mundane mistake as a philosophically interesting failure pattern, which softens the embarrassment of just being wrong.
The irony the poster notes — confidently miscorrecting grammar while being used as a grammar checker, in a conversation about Claude’s reliability — is genuinely funny and well-observed.
Overall the post is well-written and the author’s instincts about Claude are reasonable, but the Opus-on-free-tier claim should be checked before it spreads.
Let’s have our Claudes fight.
I’ve been subscribed to Claude for a few months, but I thought I remembered being able to test Opus on the free account before I decided to subscribe. I’d get very few responses before my free usage limit was hit.
For what it’s worth, I thought I could quickly test this by opening Claude in an incognito window not tied to my account. But you need to create a (free) account to use the free version of Claude, and I wasn’t going to create a new account just to test this. I will note that it shows you the differences between Free and Pro right there, and one thing it says you gain access to is “more Claude models,” so it’s plausible that Opus isn’t on the free tier. But that’s not definitive - the “extra models” could be something like the historic models (Opus 3, say) that Pro users have access to. Someone who’s using/creating a free account will have to report back.
I heartily endorse getting a $20 Claude Pro subscription if you’re very interested in textual interactions with an LLM. And Anthropic is, by far, the clearest good guy in this AI race between trillion-dollar companies. They’re hurting for money because the Trump admin punished them for being unwilling to spy on the American people and classified them as a hostile supplier - the first time this has EVER been done to an American company - saying that no one who contracts with the government can work with them.
I am using a free account, and it responded that I need to upgrade to Pro to test Opus. I did ask it to search the interwebs to verify that.
Thank you for the correction. That’s a shame - either my memory is wrong, or it was true back when I tried it out months ago. I like the idea of Claude letting you test everything, just in limited amounts. Still, I think Claude’s free tier is pretty fair. Gemini only giving you access to its batshit model is problematic.
I will say this, though. Copilot is by FAR the most generous of the bunch. It basically gives you access to almost all the copilot abilities for free. And when you subscribe, you aren’t even getting that much - mostly Office 365 integration. I was thinking of throwing a few bucks at microsoft because I use copilot so much, but their subscription tiers basically offer nothing I’d use. Not because copilot isn’t useful, but because microsoft basically gives you 99% of what makes copilot useful for free. If you wanted a strictly free LLM ecosystem, you can’t beat copilot.
I’ve been using my new ChatGPT Edu account that work pays for to do some coding tasks. The scripting part is trivial - I could do it myself as fast as I could type it - but the ffmpeg options are a quagmire of impenetrability. ChatGPT ultimately got something that works, but it got a lot wrong along the way.
write a script that [does some things] to a video and compresses it using hevc_nvenc with parameters that are appropriate for a slideshow to make the smallest size video while still preserving quality
The first time, it made some mistakes in the bash part: it used \ to break long lines for readability, but inserted spaces in ways that broke the line continuations (a space after a trailing \ kills it). When I pointed out the error, it fixed it.
Next, it kept offering parameters for hevc_nvenc that don’t work. I kept feeding ChatGPT the error messages, and it kept fixing them. Each time it would say things like “hevc_nvenc is very strict about parameters and that one doesn’t work, even though it would create the best output, but we can remove it without too much problem…”
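For reference, here’s a sketch of the kind of invocation that does exist in ffmpeg’s nvenc wrapper - my own reconstruction, not ChatGPT’s exact script; preset names and rate-control options vary with your ffmpeg build and GPU driver, so verify against ffmpeg -h encoder=hevc_nvenc:

```python
# Sketch of a hevc_nvenc constant-quality encode driven from Python.
# Not the exact script ChatGPT produced; the preset/cq values are
# illustrative and depend on your ffmpeg build and NVIDIA driver.
import subprocess

cmd = [
    "ffmpeg", "-i", "slideshow_in.mp4",
    "-c:v", "hevc_nvenc",
    "-preset", "p7",            # slowest preset = best compression (newer builds)
    "-rc", "vbr", "-cq", "30",  # constant-quality VBR; higher cq = smaller file
    "-b:v", "0",                # let cq drive the bitrate
    "-c:a", "copy",
    "slideshow_out.mp4",
]
subprocess.run(cmd, check=True)
```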
Oh yeah, good call. I have trouble with ffmpeg a lot too, across all the LLMs I’ve tried. They all get the basics right, like “how do I convert this to an mp4” or “how do I strip the audio track,” but anything involving particular codecs (especially AV1 or the Apple Silicon/Metal encoders) they will frequently and confidently hallucinate.
Edit: I’ve also seen the start of a vicious feedback loop on this particular thing: someone on Reddit or Stack Overflow will ask about a nonexistent ffmpeg parameter that their AI hallucinated, and then the question gets cited back as an answer by the LLM I’m using.
+1 for dead internet theory
The trouble is that it’s next to impossible to tell the good from the bad, for the most part. Usually you don’t throw out the baby with the bathwater… but when there’s no discernible way to tell one from the other, what can you do?