Alright, it started innocently enough. I asked o3-mini to determine who’s right in a debate. Its answers just sounded more and more wrong until I grew totally suspicious and asked one last question. The AI spent 57 seconds “thinking” before giving the incorrect answer to a 50/50 question. Backtracking through the conversation, I found that I had specifically told it, several times, to tell me if it had difficulty accessing the forum. It never said anything. What a dirty little liar.
Questions:
Can this be considered evidence that AI is stupid?
Is it true that we can only get as much out of AI as we put in? That it can’t expand on the input in a rational way? In this case, it seems the only way for o3 to ‘get the picture’ is for us to manually copy (& modify) & paste the links 20 times into the chat.
If not, what should I do to improve my prompt efficiency and maximize AI intelligence in future interactions? You know, tips & tricks in this new field.
AI isn’t stupid, but AI isn’t smart either, because AI doesn’t think. It doesn’t understand the meaning of the words you’re prompting it with or what it means to provide a factual answer - it’s just stringing words together in a way that its programming suggests will be a satisfactory response. There’s no difference to it between a right and a wrong answer.
Your link points to poe.com and requires a login. Can you share the ChatGPT chat via the “share” button instead? That creates a publicly viewable link.
4o is generally a lot better for organic queries than o3-mini (which is better suited to coding), for what it’s worth, but either can make mistakes, hallucinate, or lose context.
But it should be able to fetch and summarize web results no problem, especially in Deep Research mode. You just have to explicitly ask it to. I use the web search multiple times a day.
The other day I asked ChatGPT a fairly straightforward baseball trivia question. It simply could not get it right. I rephrased the question for clarity several times and at one point even TOLD it exactly how to figure the answer out. It couldn’t. It guessed two dozen times, a few times returning to a previous guess, and could not get it.
ChatGPT is terrible. It’s a Google search presented in sentence format.
Someone once compared AI to Cliff Clavin (from Cheers). I think it’s a good comparison.
Cliff is very smart, and often he knows what he is talking about. But sometimes he doesn’t know what he’s talking about and he spouts pure bullshit. AI is the same. And just like Cliff, there’s no real indication whether what it’s saying is something it actually knows or pure bullshit. AI will always give you an answer that has the right look and feel. It might be a good answer, or it might be complete crap. You have no way of knowing.
It is even worse than this, though, because it is conditioned to provide responses in a manner that appears authoritative in tone even if the answer is total nonsense, making these systems expert ‘bullshit generators’. This leads people to uncritically accept responses even though basic intuition and ‘common sense’ should indicate that there is an error. Here is an example of this phenomenon. Unfortunately, people are often lazy or do not have the ability to critically examine the response, and become reliant upon the chatbot to provide factual data when, as noted above, it has no ability to distinguish fact from semantically-cromulent gibberish.
There are efforts to apply various post-response methods to verify responses: fact-checking basic information and references, or using retrieval-augmented generation to draw factual information from a validated source and use the LLM only to put it into an appropriate textual frame. But frankly, there are inherent problems in responding to more complex prompts that this approach cannot resolve. Without some application of ‘common sense’ that we have no idea how to build into a language model (because the LLM only knows how to process text, not how to relate it to any kind of ‘real world’ experience), this is almost certainly a fundamental limitation on the reliability of the approach.
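To make the RAG idea concrete, here’s a bare-bones Python sketch of the pattern. The retrieval step is just a toy keyword-overlap ranking (real systems use vector embeddings and search indexes), and `call_llm` is a hypothetical stand-in for whatever model API you’d actually use; the point is only the shape of it: fetch relevant source text first, then ask the model to answer from those sources and cite them.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query and return the top k."""
    q_words = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, sources: list[str]) -> str:
    """Frame the model's job as answering from the retrieved sources, not from memory."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return ("Answer using ONLY the numbered sources below, and cite them by number.\n"
            f"Sources:\n{context}\n\nQuestion: {query}")

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real model/API call here.
    return f"(model response to a {len(prompt)}-character prompt)"

documents = [
    "A Standard Visitor visa allows stays of up to 6 months in the UK.",
    "Short courses under 6 months may be taken on a Standard Visitor visa.",
    "English language courses of 6 to 11 months have a separate short-term route.",
]

question = "How long can a visitor stay in the UK?"
print(call_llm(build_prompt(question, retrieve(question, documents))))
```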
Except Wikipedia is curated and entries are reviewed by knowledgeable peers. It is not perfect, and of course people can maliciously post inaccurate information, but by and large it has become a reasonably credible source of basic and sometimes even technical information which is largely backed up by cited references that can be verified. An LLM will just generate syntactically-correct gibberish in response to whatever you prompt it with, and will literally manufacture citations if you ask for them, because it has no comprehension that a citation is a verifiable source of fact. All it is doing is manipulating tokens (words or collections of words) in a way that is statistically consistent with its training data set.
In between the “AI is Skynet” and “ChatGPT is totally useless” extremes, though, the LLMs can still be useful if you’re willing to work within their limitations.
I don’t think of them as de novo knowledge generation systems or math/logic solvers. They are text analysis and summary/synthesis systems that can semantically work with inputs and modify them based on a prompt. That is incredibly limiting but still incredibly powerful: the kind of semantic understanding that 6 or 7 years ago was still considered impossible.
On a day to day level, that means they are still best at text analysis. Their training data includes a lot of public text, but for anything niche or recent, you have to provide it sources (like URLs or PDFs) to analyze. You can then usefully ask it to compare those specifics to other more general concepts, or to explain them in less jargony words, etc. But even as a mere “teaching assistant” rather than expert researcher, it still has strengths that a search engine does not have (weaknesses too, as you saw).
They are also exceptionally good at translations between languages.
Their text prowess doesn’t (and likely won’t ever) make them experts, and yes, unfortunately we do have marketing and hype to thank for that misconception. But that merely limits the scope of their usefulness; it doesn’t make them altogether useless.
As always, validate and corroborate its answers. You should be doing that regardless of whether they came from an LLM or a person. The underlying mechanisms for making mistakes may be different between man and machine, but either one can and will frequently be wrong.
Using it for simple fact retrieval isn’t a good use of it in its current stage. It’s hella helpful in a whole lot of other ways (and I’m not going to recount them, as there’s threads for that. Hell, I’ve already used it a couple of times today to clean up my system with a shell script, reinstall, and get a USB keymapper that was broken on my system working again… all in 5 minutes.) I use it practically every day and have figured out where it works great for me, and where it doesn’t.
I agree with the majority of your response but wanted to comment on the highlighted phrases above. While more sophisticated LLMs can produce responses that appear to be semantically informed, there is absolutely nothing in the ‘algorithm’ of how it works that would allow it to actually interpret semantic content, produce an abstract model, and manipulate that model to ‘understand’ how the text relates to the real world. To the extent that LLMs can produce semantically-correct responses, it is because there is logic that is both explicitly and implicitly built into the use of language; that is to say, although there is a nearly infinite number of ways that you can assemble words into grammatically correct sentences and link the sentences together into a flow of discussion referencing a subject, only a very small subset of these will actually make any sense, and the rest of them will read like Lewis Carroll’s Jabberwocky.
Because we normally only speak and write in ways that have actual semantic content, the data sets that LLMs are trained on reflect the statistically appropriate usage consistent with semantically-meaningful collections of words, and as a result the system has the emergent capability to mostly produce strings of tokens that read like thought-out concepts. But there is zero reason to believe that it has any kind of internal conception of the world beyond manipulating word-tokens in a way consistent with the training data, despite enthusiasts and even some experts claiming that they can detect a ‘spark of consciousness’ in those responses. There is certainly nothing going on with these prompt-and-response systems that in any way represents cognition as a neuroscientist would recognize it: no ongoing mental processes; no experience-driven creation and refinement of abstract concepts from physical and sensory interactions with the real world; no permanent correction of conceptual errors and misapprehensions. They are just manipulating tokens, using the computational approach of heuristically adjusting the weights of connections in an artificial ‘neural network’ during the training phase and then using those weights to produce grammatically correct content which maps to how collections of words are statistically used in the training data set. This gives the appearance of “semantic understanding”, at least for simple concepts, but the fact that it is so trivially easy to ‘fool’ or confuse an LLM shows that it really doesn’t have any kind of introspection or comprehension of real-world interactions.
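To illustrate the “statistically consistent with its training data” point in miniature, here is a toy Python sketch: a bigram model that picks each next word purely from word-pair frequencies observed in a tiny training corpus. A real LLM uses a neural network with billions of parameters rather than a lookup table, but the basic loop of predicting the next token from the preceding tokens has the same shape, and nothing in it models what the words actually refer to.

```python
import random
from collections import defaultdict, Counter

# A tiny "training corpus". Real models train on trillions of tokens.
corpus = "the cat sat on the mat the cat ate the fish the dog sat on the rug".split()

# Count which word follows which in the training text.
follows: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(word: str) -> str:
    """Sample the next word in proportion to how often it followed `word` in training."""
    counts = follows[word]
    if not counts:                 # dead end in the toy corpus; start over
        return "the"
    return random.choices(list(counts), weights=list(counts.values()))[0]

# Generate text by repeatedly sampling a statistically likely continuation.
word, output = "the", ["the"]
for _ in range(8):
    word = next_word(word)
    output.append(word)

print(" ".join(output))  # grammatical-looking, with no concept of cats, mats, or fish
```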
I agree with what you are saying (to the extent that I can understand it, at least, not being a linguist or neuroscientist, just a sub-par computer programmer), but for me, the outcome of that chain of thought isn’t necessarily “therefore LLMs are not really thinking”, but rather “maybe human thought isn’t the only process by which we can usefully manipulate complex datasets to arrive at useful outputs”. In other words, once they started to easily beat the Turing test, we realized the Turing test was flawed, not that they are human.
I’ve always found our definitions of intelligence to be rather limiting (and I’ve believed that since before LLMs were a thing), mostly due to how other biological systems are able to process inputs into useful outputs — whether slime molds navigating a maze, mycorrhizae “processing” and transferring information, chemical signals plants produce in response to stimuli, etc. I wouldn’t count any of those things as “sentient”, but they are nonetheless producing useful work (for their respective species and ecosystems), and THAT is the criterion that matters more to me than how similar they are to our own cognitive mechanisms.
Similarly, even if LLMs were nothing more than probabilistic autocomplete engines, their output is still incredibly useful to me in day-to-day life, far, FAR more so than any search engine has ever been, and more so also than most of my coworkers have ever been. If someone told me that behind the scenes, it was actually just a convoluted thousand-page regular expression, I’d be like, “Whoa, really? That’s incredible!”… but I’d still use and be impressed by it every single day.
Not being a neuroscientist, I am unable to really understand how LLM processing is different from our own neurons firing… but while that’s an interesting question in and of itself, it doesn’t really change the usefulness that LLMs can have, however they work.
Within my lifetime, it is possible we will not understand human cognition OR LLM processing enough to be able to confidently explain their inner workings to a 5-year-old. But my hope — and this seems more likely — is that within that same lifetime, that difference won’t really matter much anymore, and some of my good friends will be human, some will be LLMs, and some may be tomorrow’s new approximation of sentience. Philosophically, I have no real way to know how my (human) partner’s mind really works, either, but that doesn’t bother me. She may be an android from the future, or maybe I’m an LLM trapped in a simulation. I can ponder those questions all I want, but fundamentally they are of academic interest rather than day-to-day practicality.
There is also the implied assumption that it is surveying factual references on the internet, but much (all?) of its reference data is statistically weighted. There is no way to determine whether any given piece of it is a) fact-checked, b) merely statistically chosen, or c) complete fabrication.
Recently I wanted a list of UK immigration and visa requirements. It seemed accurate until I started checking against the UK gov websites and found many inaccuracies that conflicted with the official information.
On the plus side, I’ve used ChatGPT to create D&D mini-module adventures, which it does very well at. It also creates great Magic: The Gathering custom cards that seem well balanced for play (I’m making my own decks based on Frank Herbert’s Dune).
You’re right, there really isn’t. Still, nothing beats manual verification against official sources.
However, you can make errors & hallucinations less likely by explicitly requesting that it search the web and cite sources for you (these days it will often do that automatically). That still won’t make it perfect, but at least it makes it easier for you to fact-check. This is the “retrieval-augmented generation” that Stranger alluded to above.
Here’s an example chat:
It gets much of it right, but still misses some important exceptions (like being able to stay more than 60 days for medical treatment). Despite that, though, it broke down what would probably have been several hours of research into a table generated in a few seconds. That helps me narrow down the kind of visa I’d probably need, with an inline source link next to it, where I can then easily double-check the official source.
From the very first prompt, it “searched the web” to synthesize information from random sites. I then asked it to prefer official sources, which it did. All of that took maybe 1 or 2 minutes.
Then, later on, I asked it to repeat the task with “deep research” mode, which then made it read through three dozen more sources and process them through a chain-of-thought follow-up in which it prompts itself over and over to refine its output before producing an answer. This is an example of “agentic” work, going a step beyond retrieval-augmented generation to “reason” with itself iteratively, combining the LLM with external tools (like web crawlers and other internal non-LLM software that helps it process different kinds of information) to refine the desired output over multiple steps.
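For the curious, here’s a heavily simplified Python sketch of what that agentic loop looks like. Both `call_llm` and `fetch_url` are hypothetical stand-ins rather than any real API, and actual “deep research” pipelines are far more elaborate, but the essence is an LLM called in a loop that decides at each step whether to invoke a tool for more information or to produce its final answer.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real model API call. This stub answers
    # immediately; a real model would often reply "FETCH: <url>" a few times first.
    return "ANSWER: (summary assembled from the notes gathered so far)"

def fetch_url(url: str) -> str:
    # Hypothetical placeholder for the web-crawler tool the agent can invoke.
    return f"(text content of {url})"

def research(question: str, max_steps: int = 5) -> str:
    notes: list[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Question: {question}\n"
            "Notes so far:\n"
            + "\n".join(notes)
            + "\nReply with either 'FETCH: <url>' to read another source, "
              "or 'ANSWER: <final answer with citations>'."
        )
        reply = call_llm(prompt)
        if reply.startswith("FETCH:"):
            url = reply.removeprefix("FETCH:").strip()
            # The tool result goes into the notes, which feed the next iteration.
            notes.append(f"{url}: {fetch_url(url)}")
        else:
            return reply.removeprefix("ANSWER:").strip()
    return "Ran out of steps; partial notes:\n" + "\n".join(notes)

print(research("Which countries' citizens need a visa to visit the UK?"))
```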
Edit: The shared chat doesn’t show the detailed step-by-step, but you can see it in the app. Here’s a screenshot (Imgur link) that shows how it fetched each source and used it to improve the answer.
A short snippet of the agentic processing:

• Compiled from ONS pages, the UK Parliament research briefing, and the “Travel trends 2023” report to identify top origin countries for inbound visitors to the UK.
• Noted that in 2023, the US led UK visits, surpassing France and Germany, with Ireland and Spain possibly next.
• Mapped out the visitor visa process, including costs, eligibility, and exceptions. Special visas like family and transit are also discussed, with a focus on common exceptions and examples.
• I’m pulling together a list of all visitor visa countries, especially from gov.uk sources for reliability.
• Students may need to have a Confirmation of Acceptance for Studies (CAS), sufficient funds for maintenance and living costs, and meeting English language requirements for studying in the UK.
• Short-term courses (under six months) may qualify under English language courses of 6–11 months. For courses under 6 months, a Standard Visitor visa might suffice.
That part took ChatGPT about 10 minutes of fetching and processing. It would’ve taken me several days, if not weeks, to go through all of that myself — and I’d still probably have made as many, if not more, errors.
The final version of the table has more details, including in-line citations, but still misses some nuances (like the indefinite medical extensions on a standard visa, which let you stay as long as needed for £1,000 every 6 months).
Overall, I would never trust anything an LLM produces without manual verification, but it’s still wonderful for information aggregation and summarization. It’s just better as the first step in research rather than the last word.
There are research projects that require 100% accuracy, even if that takes multiple days of verification. But there are also less important research tasks where taking 2 minutes to get an 80% accurate answer is perfectly acceptable, and even preferable to spending 2 weeks on a 100% accurate one. And you can always start with the 2-minute version, spend a few hours manually refining and verifying it, and still come out ahead.
Thanks! That was very helpful seeing the prompt strategy to get deeper verification. It’s the New World variation of “Yeah, mom, I really did clean my room!”, “Then what’s all this then?” opens closet
Heh, exactly! Makes you wonder how much of that is due to an inherent limitation in LLMs vs. how much is behavior learned from humans, i.e., “It’s ok to fib a little at first, but if they keep insisting, I should probably try harder to be accurate…” It’s very possible that sort of meta-deception is embedded in its training data, since it’s so common in everyday human discourse.