One thing I’ve learned using AI to code: if I phrase things in the most literal, simple terms (think brute-force, non-efficient coding), I get good answers. When I ask it to do a broad task, things often go off the rails.
I think there’s a way to turn it off in settings, although I’m not sure where. I find that typing “-ai” at the end of my searches removes it, but it’s mildly time-consuming.
If you are using Copilot for this, you are not getting the full power of tools like OpenAI ChatGPT.
They may use the same LLM (e.g. GPT-5), but they are miles apart.
ChatGPT has far better short-term and long-term memory, as an example. Copilot has NO long-term memory (at least at my workplace) and is limited to a few turns of short-term memory…not even a full chat session. OpenAI ChatGPT can carry on hours of discussion in a single chat and is able to reference all of it.
Compared to that, the no-long-term-memory and limited-short-term-memory of Copilot feels like I am talking to a distracted friend.
Where Copilot excels is in giving you its own form of RAG via Microsoft Graph, incorporating all of your chats, emails, SharePoint docs, and so on. That’s its real strength.
Me:
Discuss how you backcheck when I challenge an answer.
ChatGPT:
That’s a very good question — and a fair one.
When you challenge one of my answers, I “backcheck” it through a few layers, depending on what kind of claim is in question:
1. Internal Consistency Check
I first re-evaluate the reasoning or evidence I already used:
I review the logical structure: Did I make an assumption that doesn’t follow?
I check whether I oversimplified or overstated something.
I look for places where a term or concept might have been ambiguous.
This step is entirely internal — it’s reasoning based on what’s already in the conversation and my general training.
2. Knowledge Verification
If it’s a factual or empirical question (e.g., data, history, scientific results):
I re-interrogate my internal knowledge base (my “training”) for corroborating or contradicting evidence.
I look at the reliability of the kind of claim: is it well-established, disputed, or likely to have changed over time?
If there’s any chance the information could be outdated or contentious, I’ll often propose to check the web for the most recent or authoritative sources (e.g., academic papers, government data, reputable news outlets).
3. External Cross-Check (if requested or necessary)
When I do use the web, I:
Search for current, primary, or high-reputation sources (academic, government, peer-reviewed, or major news organizations).
Cross-compare multiple independent sources before accepting or updating my answer.
Explicitly flag where interpretations differ or uncertainty exists.
4. Transparency and Revision
If I find that my initial answer was incomplete or incorrect, I:
Revise the answer, showing where the correction fits in.
If the issue is subjective or philosophical, I may instead clarify the competing schools of thought or frameworks rather than “correct” it.
Would you like me to illustrate this process with a concrete example — say, by picking one of our past discussions and showing how I would backcheck it step by step?
When you ask why I made a mistake, what I do is not “retrace my literal internal steps” (since I don’t store or replay the hidden intermediate reasoning that produced the earlier answer). Instead, I do something like this:
I infer based on evidence:
I look at what I wrote, what you pointed out, and what the correct reasoning or answer should be. Then I infer what kind of reasoning pattern or bias would have produced my earlier output.
I simulate a “diagnostic reasoning pass”:
I run through the same type of reasoning process again from scratch — using the same inputs — but this time I explicitly check for known failure modes (e.g., misreading premises, overgeneralizing, missing a constraint). That gives me a plausible explanation for the earlier error.
I draw on my training patterns:
My training included many examples of model errors and analyses of them. So I can generalize: “Ah, that looks like the kind of error that comes from X,” where X might be heuristic overreach, confusion between related concepts, or underspecified context.
So to summarize:
→ I’m not replaying the exact inner chain of reasoning that went wrong.
→ I am reconstructing and inferring the likely cause of the error, given what I know about how reasoning typically fails and the evidence in our dialogue.
Would you like me to show how that works on a concrete example — e.g., take one of my past mistakes and show how I’d “diagnose” it step by step?
Just as a side note, based on some recent responses from GPT, there seem to be two different versions running concurrently (I’ve recently been asked to compare and evaluate the responses). I know that GPT intentionally randomizes elements of its responses, but we may have been talking to different versions.
Regardless, what is the point you’re trying to make here?
This works, but you can also put &udm=14 at the end of all your searches. That forces Google to return the same results as if you had selected “Web” at the top of the regular results page. There are browser extensions that do this automatically, or you can modify your browser’s search engine settings.
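For example, if your browser lets you define a custom search engine, setting its search URL to something like https://www.google.com/search?q=%s&udm=14 will apply the filter to every search without an extension (assuming your browser uses %s as the query placeholder, as Chrome does).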
Yes, you keep using the term a lot and yet we still don’t have strong evidence that these “emergent properties”, which even the designers don’t fully understand, exist in the first place. You also quoted ChatGPT claiming that it re-evaluated the reasoning that it already used. It didn’t use any reasoning because it can’t reason.
I’m using this definition of “reason”, but feel free to tell me which one you prefer and then explain to me how an LLM can perform this feat.
the power of comprehending, inferring, or thinking especially in orderly rational ways
I really shouldn’t have to point out that Blade Runner is not a documentary.
I think all apply except (b), which implies intentionality, which LLMs do not possess.
Here is what ChatGPT itself had to say about it, in response to: “What can you say about whether or not Large Language Models like ChatGPT are capable of what is commonly called “reasoning”?”
ISTM that a system that allegedly cannot “reason” does a damn good job of explaining itself!
The answer depends on what one means by “reasoning.”
Let’s unpack it at a few levels:
1. The narrow, computational sense
If “reasoning” means performing structured inference over representations — drawing logical conclusions, performing multi-step deductions, or evaluating evidence — then LLMs clearly do this to a limited but real degree.
They can:
Solve logic puzzles, math problems, and analogies by chaining together intermediate steps (“chain-of-thought reasoning”).
Generate valid logical inferences when the patterns appear in their training data or can be generalized from it.
Engage in rule-like transformations of symbolic structures (e.g., converting natural language to formal logic or code).
However:
Their reasoning is statistical rather than symbolic. It’s based on learned correlations between patterns of tokens, not on explicitly stored rules or truth-preserving inference mechanisms.
As a result, their reasoning can be brittle, inconsistent, or context-sensitive — they “simulate” reasoning, but don’t necessarily understand it in a grounded, world-model sense.
2. The cognitive or philosophical sense
If reasoning is defined as the conscious, deliberate manipulation of mental models to reach justified conclusions, then most philosophers and cognitive scientists would say: LLMs don’t reason, they emulate reasoning.
They lack intentionality — they don’t “know” what their words refer to.
They lack causal grounding — they can’t observe or test the world to verify their inferences.
They don’t have goals or epistemic commitments — they don’t believe or disbelieve propositions.
Still, some argue that functional reasoning doesn’t require consciousness or understanding; it only requires systematic inference behavior. In that sense, LLMs do reason, albeit differently from humans.
3. The emerging middle ground: “System-2-like behavior”
Recent research shows that:
Larger and better-trained LLMs exhibit emergent reasoning behaviors: consistent multi-step logical chains, hypothesis testing, and abstract generalization.
They can simulate deliberation by maintaining intermediate steps (via prompting like “Let’s think step by step…”), improving accuracy in reasoning tasks.
This suggests that reasoning may not require explicit symbolic logic, but can emerge from pattern-based learning at scale.
So, in practice:
LLMs display behaviors that are functionally similar to reasoning, even if mechanistically and phenomenologically different from human reasoning.
4. A concise way to put it
You could summarize the situation like this:
LLMs implement reasoning-like processes.
They exhibit reasoning behaviors.
But they do not possess reasoning in the human, conscious, intentional sense.
Statements like this are why it’s difficult to take you seriously on the topic. Is ChatGPT’s output accurate? Does it reflect reasoning? Is it indicative of “emergent behavior?” Who cares, it seems like it is! No need for actual scientific evaluation when it seems like it.
These supposed emergent properties were a big deal when LLMs hit the mainstream a few years ago, and people’s understanding of them was further complicated by poor reporting in the media. The biggest problem is multiple definitions. Under some definitions of “emergent”, LLMs definitely exhibit emergence, and it’s completely uninteresting because that’s a natural feature of complex systems. Under other definitions, as you note there is great skepticism that LLMs exhibit the emergence that amazes some people. This is a good breakdown on the definitions and what they mean.
Another point of confusion is the discussion of whether AI researchers understand the emergence observed. It’s not controversial to say it can’t always be predicted, but that’s not the same as saying researchers don’t understand it. You won’t find many researchers today saying they don’t understand how some LLM output came to be.
I found this to be a balanced discussion of emergence. It’s worth the time if you’re interested in the topic.
I care. Because “seems like it” appears to be a dismissive form of “is”. GPT has delivered actual information about how LLMs do or do not reason, depending on one’s definition. Please point out where this information is wrong or otherwise useless.
Here’s a quote that comes to mind:
“No computer will ever be able to play better than a child’s level of chess”
– The philosopher Hubert Dreyfus, in the early 60s, based on his understanding of how chess programs work.
A few years later, the chess program MacHack beat him soundly in a competition at MIT.
Decades later, chess-playing computers can now play at the grandmaster level – and win!
The cognitive challenges here are far greater than just game-playing, but modern AI can now tackle challenges of language translation that befuddled systems just a few years ago, for instance appreciating the nuances of idioms.
I’m not suggesting that AGI is almost here. Not at all. I’m just suggesting that what is currently emerging in front of us is being underestimated.
It is not up to me to prove that LLM output you copy/paste is accurate. And if you are not able to demonstrate that it is accurate, then yes, it is useless.
I recommend you read the article I linked, if you haven’t already. Prediction is difficult, and we’ve frequently gotten it wrong (although researchers seem to be getting better). In 2021, researchers greatly underestimated what LLMs would be able to do in a year; in 2022, predictions were much closer to the mark. As researchers get smarter, they are realizing that some of these amazing jumps in progress are just artifacts of the metrics and how models are evaluated.
Presumably, if it’s asked the same question about how it does something and gives different answers, then it’s not really answering the question; it’s just saying stuff that looks like a convincing answer to the question. It goes without saying that many people will be convinced by these convincing-looking responses.
That’s not how this works. I think the distinction between the different semantic interpretations of what “reasoning” actually means, and the nuanced answer it gave at the end, is quite accurate. If you claim that it’s “useless”, the onus is on you to show why. I think you’d have a hard time doing that.
Not true. If you asked two cognitive scientists an open-ended question about how human reasoning works, you’re certainly not going to get the exact same response verbatim! But you’ll get responses expressing similar ideas, perhaps using different terminology and differing levels of detail, and maybe even some points of disagreement. That doesn’t mean that neither of them knows what they’re talking about!
There are a whole bunch of reasons that GPT will give differing responses at different times (which doesn’t mean contradictory ones, just different). Part of it is intentional randomization. There are also different GPT models running concurrently which have different capabilities and behaviours. Another reason – and this one is actually a bit concerning – is that GPT responses are very sensitive to the exact wording of the prompt.
Well, how do you tell the difference between a factual answer and one that’s just plausible? Can you point to anything that definitively states that LLMs know how they work? They get factual things wrong even when we already know the facts; what makes you so sure they’re accurate when there’s no way to verify them?
Of course LLMs like GPT get things wrong (although I happen to like GPT-5 more than the others). In general, the answer to how I tell the difference between a factual answer and one that’s just plausible is usually by cross-checking with reliable reference sources. Fact-checking is much easier to do with specific alleged facts or statements than with the kind of very general question where GPT is very useful for iteratively exploring a subject that is otherwise difficult to look up.
In the case of the GPT response that I posted in #329, about whether or not LLMs are capable of reasoning, I infer the general correctness of the response based on my own knowledge and experience with LLMs. It makes sense to distinguish between “reasoning” in a computational sense and reasoning in a cognitive human sense. I think it’s quite accurate to say that LLMs can do the former based on the intellectual tasks they can accomplish, albeit with limitations because the behaviour is statistical rather than symbolic. It also makes sense to say that LLMs do not reason in the cognitive human sense, but their behaviours simulate that kind of reasoning in many contexts.
Read its complete response critically (post #329) and see if anything jumps out at you that should be questioned or potentially contradicted. The only thing I can see is the statement that “recent research shows that … reasoning may not require explicit symbolic logic, but can emerge from pattern-based learning at scale”. I certainly believe this but I cannot say without doing some investigation whether recent research really supports this. The rest of it I think is pretty much self-evidently true and a pretty good summation.
Well this week I learned they are unreliable at following a plan of their own devising.
At work I was given access to Codex, OpenAI’s current “agentic” coding assistant, using OpenAI o3. It’s good; I can’t see going back to fooling with Copilot’s integration with VSCode. But it still does dumb stuff for unknown reasons.
So I bring it a script I inherited that I’ve halfway re-written into spaghetti code to get it to reliably work, and ask it to re-write the spaghetti into callable functions. It’s a tedious job I have been putting off, so let’s see how it does. I don’t trust it yet, so I have it set to ask for permission before it runs any commands, and instruct it to output its modifications to a new file.
When we received training on this interface, it was suggested that we have it propose a plan, then either modify or execute that plan. I ask it to write a plan to re-write the spaghetti code into functions that can be called, and to be sure not to alter the output the script creates.
The plan looks good, I tell it to proceed with it.
It then proceeds to write out a program with all the comments removed. It also fails at its most basic function, which is writing a default config file for itself. Not an auspicious start.
So, I start a new chat and ask it to devise the same plan. It devises a generally similar plan. When I tell it to execute it, I make sure my prompt reiterates that the resulting program should produce the same output, including any initialization files or reports. That run seemed to work pretty well; I’m still testing it to make sure it’s ready for production. It can at least write a working config file, it completes without errors, and it didn’t strip the comments from the program.
But that run also tried to compile the script as a binary, something nobody anywhere asked it to do. I denied it permission to do that, mostly because it wasn’t going to compile successfully on the machine I was editing the script on even if I had allowed it.
Also, both times it claimed it was going to move the housekeeping code into its own function (the original script I inherited would just happily fill up the disk of the machine it was running on). My cleanup code is still sitting in the same places it was before, just moved into the respective callable functions. That’s somewhat understandable, because I had only added cleanup at the points where the script starts working with files, once I realized we would need it, but it claimed it had done this both times when executing the “plan”. If it did do it, its effect on the output is imperceptible.
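To be concrete about what I mean by moving the housekeeping into its own callable function, here’s a hypothetical sketch in Python (not the actual script; every name and path in it is made up):

import glob
import os

def cleanup_workdir(workdir, keep=5):
    # Delete all but the newest `keep` temp files so the disk doesn't fill up.
    temp_files = sorted(glob.glob(os.path.join(workdir, "*.tmp")), key=os.path.getmtime)
    for path in temp_files[:-keep]:
        os.remove(path)

def main():
    workdir = "/tmp/report_workdir"   # hypothetical path
    # ...do the actual work, writing temp files into workdir...
    cleanup_workdir(workdir)          # one call, instead of cleanup logic sprinkled inline

if __name__ == "__main__":
    main()

What it actually produced leaves the cleanup inline inside each of the new functions rather than factoring it out like this.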
So 6/10. Would use again because it may have actually saved some time and drudgery this go-round if this result survives testing, but I still would absolutely not allow it to run shell commands without supervision. It’s a lunatic.
If you’re counting: we have the original author, plus I count as 2 or 3 (I have assurances that my code works, but it looks like it was written by three different crazy people), plus OpenAI o3; this script has now probably been written by up to five different crazy people. Debugging or extending it? That’s a problem for future scabpicker. Man, I don’t envy that guy.
Someday, perhaps, an LLM will be able to look at the questions I pose to it and decide to say (in epistemic humility), “I have some background information on that, but I still don’t know the answer”. That step will be necessary, but perhaps still not sufficient. Until then, the way LLMs arrive at answers reminds me of a more erudite version of our current President, unable to stop himself from uttering whatever answer flows to his mouth.