Well, you getting a valid answer from it one time doesn’t really make much difference to their claim. LLMs are non-deterministic in how they arrive at their responses, so they can give totally different answers to the same question.
One of the funnier problems I’ve tried to help with was my boss using (IIRC) GPT-4o to fetch the time zone of each of his employees by city in a script. He was explicit in his prompt that he wanted the output in the canonical TZ identifier format, with nothing else. It would work about 80% of the time. The failures were maddening. The most common failure was that it would ignore the format part of the prompt and spit out a sentence instead of just the TZ identifier. In a couple of tests it simply got the time zone wrong, outputting America/Chicago as a response for El Paso (America/Denver is correct).
I eventually gave up and told him there are DBs that will get you the latitude and longitude of a city, and a separate DB that can get the TZ from that. Those will always give the same answer, and return it in the same format. He wants a deterministic answer, and he should use a deterministic tool to get it. An LLM is really the wrong tool for the job, even though you’d expect it to be able to handle such a simple task effectively.
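For the record, here’s roughly what I meant, as a minimal sketch assuming the geopy and timezonefinder packages (those are just my picks for illustration, and the user_agent string is a placeholder; any geocoder plus a coordinate-to-TZ database will do the same job):

```python
# A minimal sketch of the deterministic approach, assuming the third-party
# geopy and timezonefinder packages (pip install geopy timezonefinder).
from geopy.geocoders import Nominatim
from timezonefinder import TimezoneFinder

geocoder = Nominatim(user_agent="tz-lookup-example")  # app name is just a placeholder
tz_finder = TimezoneFinder()

def city_to_tz(city):
    """Geocode a city name, then map its coordinates to a canonical TZ identifier."""
    location = geocoder.geocode(city)
    if location is None:
        return None
    return tz_finder.timezone_at(lat=location.latitude, lng=location.longitude)

print(city_to_tz("El Paso, TX"))  # expected: 'America/Denver'
```

Same input, same output, every time, and always in the canonical identifier format.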
Well, I do have the advantage that these people are almost universally coming to me for help. It is my job to help them, and them giving us weird, incomplete, or completely irrelevant data with no useful information whatsoever is common enough that there is a checklist they’re supposed to complete before they come to us. I’m one of the nicer members of my team, and will ignore things they miss on that checklist and go ahead and take the case. I’m flexible, but there are limits.
Not exactly. The responses are intentionally variable in form, and that’s by design. They’re not intentionally wrong, though they sometimes are wrong for a wide variety of reasons, often for much the same reasons that people are sometimes wrong even with the best intentions.
As for the problems your boss was having with time zone responses, I can’t really comment on the specifics, but I repeatedly find myself wondering why I rarely have the kinds of problems that are so often bandied about:
Q: Can you please list, in column format and standard notation, the time zones for the following cities: Denver, Chicago, El Paso, San Francisco, Toronto, Los Angeles, Las Vegas, Cincinnati, and Philadelphia?
ChatGPT:
Here’s a clean column format with standard time zone notation (U.S. and Canadian conventions):
City | Time Zone (Standard) | Abbreviation
--------------|--------------------------|--------------
Denver | Mountain Time | MST (UTC−7) / MDT (UTC−6)
Chicago | Central Time | CST (UTC−6) / CDT (UTC−5)
El Paso | Mountain Time | MST (UTC−7) / MDT (UTC−6)
San Francisco | Pacific Time | PST (UTC−8) / PDT (UTC−7)
Toronto | Eastern Time | EST (UTC−5) / EDT (UTC−4)
Los Angeles | Pacific Time | PST (UTC−8) / PDT (UTC−7)
Las Vegas | Pacific Time | PST (UTC−8) / PDT (UTC−7)
Cincinnati | Eastern Time | EST (UTC−5) / EDT (UTC−4)
Philadelphia | Eastern Time | EST (UTC−5) / EDT (UTC−4)
Would you like me to also show them all aligned by current local time (today’s date), so you can see the exact hour difference between each city right now?
I suspect it’s because you’re doing it once and being satisfied with the response. It would normally give a correct answer, but he was running it repeatedly (probably a hundred times in our half-hour call) while he was writing and debugging the script. The most common failure wasn’t it being wrong; it would just ignore the formatting instructions. Being wrong was the least common failure, and it was always something plausible, like putting El Paso in the Central time zone. I know folks who live in Texas who don’t know the correct time zone of that town.
Well, maybe so – I’ve never put GPT through the trials of repetitive batch scripts. But at the same time, the ways I use it aren’t necessarily simplistic. I’ve had long and informative conversations with it about everything from cosmology to baking techniques, and, yes, even about health and medical issues – generally getting good guidance but always being careful to verify.
My analogy in this context is that it’s remarkably like some people I know who are very bright and, precisely for that reason, have a high level of self-confidence and are therefore prone to occasional bullshitting. I still value these conversations even if I have to verify some surprising allegation, which on rare occasions may turn out not to be true (but usually is).
For any application in which the lives and well-being of people are at stake, or in which the user is not willing or competent enough to independently verify the result, an AI ‘agent’ that produces confidently wrong results is indeed “useless” and even “dangerous”. For certain, people can err or perpetuate counterfactual information, but they are also generally willing to admit when they don’t understand the question or don’t have the answer, and they are held accountable for making mistakes. When a chatbot is informed of an egregiously erroneous response, it just says, “Yes, I got that wrong,” and as often as not then gives another confidently wrong answer. Since there is no consequence, and no standard against which to validate whether a chatbot will consistently give correct responses to questions of a purely factual nature, the lack of reliability is extremely problematic, especially as people become increasingly prone to taking whatever answer they get and pasting it into a work product without even a bare minimum of verification, even when that verification would literally take a couple of minutes.
In their hearts, chatbot and generative AI enthusiasts know this, which is why they are so quick to brand anyone who points out these issues with pejoratives such as “uninformed luddites” (your term), “technophobes” (I’ve been working with machine learning tools for nearly a decade and a half), or “brainless idiots” (a snub I got from an executive manager on a recent call, after he told us we needed to “implement AI in our productivity workflows” and include this in our performance goals and I questioned whether generative AI would be of any utility in the work we do requiring discipline-specific expertise), and to repeatedly cast doubt on the honesty and integrity of the complainant. These tools are an impressive leap in the capability to interpret and regurgitate natural language in a conversational modality, and to generate images that (while clearly a syncretism of training data) come reasonably close to the prompted request. But rather than acknowledge that there is still a long and unclear path toward making these tools reliable enough for any practical use in legal, medical, engineering, scientific, or system control applications, the enthusiasts attack and insult anyone who questions the dogma of Silicon Valley execs hawking their services with increasing desperation, or who critiques clearly staged displays of what is essentially an MVP demo. Never mind any questions about the costs (economic, environmental, and the human toll on workers in developing countries who do the tedious labor of reinforcement training for these systems) of frantically competing to deliver the ‘beta’ of a tool clearly not ready for any kind of production-level work.
The problem with this argument is that this pup, yours truly, has genuinely and objectively found GPT very useful and very informative, and its applications to personal education are unprecedented in human history. The most powerful aspect is the combination of its vast repository of knowledge and the ability to interactively converge toward the desired information.
Of course this is all worthless if its facts are wrong too often, but I think in most cases its accuracy is well above the minimum threshold we have to tolerate in most learning settings. Public school teachers can be wrong, university professors can have their own research agendas, etc. And I’m not talking about the contrived examples intended to demonstrate how “stupid” LLMs are.
You must never have spoken to anyone in the MAGA crowd.
Now I’m pretty sure you must never have spoken to anyone in the MAGA crowd.
Again, not my experience. “Yes, I got that wrong” certainly occurs, and would that humans could be as humble. Often, in fact, GPT magnanimously explains where its misunderstanding came from. But I personally have not seen the second occurrence; that claim is just more bashing of LLMs as allegedly “useless”. I could give many specific examples but I’ll refrain for the sake of brevity.
What do you mean by “consequence”? Chatbots aren’t autonomous actors; even if they have self-learning and self-enhancement capabilities, they’re still managed entities. It’s logically and ethically their managers who must bear the burden of accountability.
Not to sound dismissive, but that’s quite literally a people problem, and also a problem of deployment and expectations, not an intrinsic AI problem.
I completely agree. But that statement is a very far cry from your other claim that LLMs merely “interpret and regurgitate natural language in a conversational modality” and have no intelligent functionality between the inputs and outputs, or, as you put it elsewhere, are just a “stochastic parrot”. That’s the part I vehemently disagree with.
More examples of it not getting its explanation wrong won’t do anything to negate the many, many times it admits it is wrong and then explains why with another obviously hallucinated answer. Magnanimity gets it no points. Here’s an article about that, which I linked to earlier.
And really, no matter how useful you find it, it is just a statistical text generator. It doesn’t know the correct answer from the wrong one. It isn’t really concerned with that at all, and has no concept of it. It just knows that this word fragment is likely to follow the preceding fragments. It generates convincing text strings from that, but it has no knowledge of what any of them mean in aggregate. Asking it to debug a program it generated is a fool’s errand.
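To make the “statistical” part concrete, here’s a toy sketch of the sampling step that happens at every token. The vocabulary, the scores, and the temperature are all invented for illustration; a real LLM scores tens of thousands of fragments with learned probabilities at every step:

```python
# Toy illustration of temperature sampling over next-token scores.
# The vocabulary and the scores below are invented; a real LLM scores every
# fragment in a vocabulary of tens of thousands at every step.
import math
import random

def sample_next(scores, temperature=0.8):
    """Pick the next token fragment at random, weighted by softmax(score / T)."""
    scaled = {tok: s / temperature for tok, s in scores.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    r = random.uniform(0, sum(weights.values()))
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # fallback for floating-point edge cases

# "El Paso is in the ___ time zone" with made-up scores:
scores = {"Mountain": 2.0, "Central": 1.2, "Pacific": -1.0}
print([sample_next(scores) for _ in range(5)])
# Mostly "Mountain", but "Central" shows up now and then: plausible, and wrong.
```

Always taking the highest-scoring fragment would be repeatable, but chatbots typically sample like this, which is part of why the same prompt can come back worded differently or occasionally wrong.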
Since this is the Pit, and we’re having fun here, let me say in all honesty that I drafted a response to you, and then just for a lark I ran it by that “dumb text-completion engine” ChatGPT. I submitted your original post, and my response, and I’m humbled to admit that my pal ChatGPT had some criticisms of my post (I had asked it to be critical, not kind). It thought me a bit “overenthusiastic” about AI.
So it offered to re-draft it, I said fine, and this is the re-draft – my response to you, with the help of my friend ChatGPT toning down my enthusiasm:
The phrase “no matter how useful you find it” is self-defeating. Usefulness is exactly what matters. If something consistently provides correct, relevant answers, dismissing it on the grounds of mechanism is missing the point.
Calling it “just a statistical text generator” is like saying the human brain is “just neurons firing.” At one level it’s true, but it entirely misses what emerges when you scale up the system. A learning architecture with billions of parameters trained on vast amounts of data exhibits capabilities that go well beyond text completion. It can generalize, solve problems, and make inferences it was never explicitly trained on.
Whether or not we call that “reasoning” or “intelligence” is a matter of definition, but functionally it’s doing something that behaves a lot like both. And that is what makes it significant.
FTR, this was my original, somewhat exuberant post that was toned down by GPT:
The words “no matter” rarely have any place preceding a phrase like “how useful you find it”. If it’s useful, it’s useful. And that’s all that should matter.
And the characterization as a “statistical text generator” is like calling a modern supercomputer “a big calculator”. It’s not strictly technically wrong, but it completely misunderstands the power of information processing systems when they grow to very large scales and entirely new capabilities become possible. A supercomputer running a transformer architecture built on a multi-layer neural net, with billions if not trillions of weights tuned by intensive training, eventually and spontaneously develops emergent properties, like the unexpected ability to reason, to make inferences, and to solve difficult problems in logic that it has never been exposed to before.
This is a lot more than just a text completion engine. It’s the early dawn of intelligent machines, though GPT will certainly not be the sole path to achieving that – just currently the most spectacular one.
Bwahahaha, please don’t make me argue with ChatGPT. Arguing with AI is almost always hilarious, but rarely useful.
(but really, it’s wolfpup modified by an LLM, no telling who is who from this end)
That’d be nice, but as demonstrated already, they aren’t consistent. They’ll provide different answers to what ostensibly was the same question based on how it is phrased.
And thinking that “statistical text generator” or “really big calculator” denigrates what they are is misunderstanding how powerful such big things can be purely by virtue of their size and scope. They are powerful, but they can still be the wrong tool for the job, no matter how powerful they are. The emergent behavior in the very complex LLMs we have today is that they can string together long series of text fragments that almost always seem pretty convincing but are often inaccurate. It is not that they reason.
Again, if they could reason, at least one of the three LLMs I queried about my problem with a block of code generated by the first LLM could actually have debugged it properly. They could generate code and implausible solutions to the problem all day; one even provided an optimization for a block of the existing code that worked quite well. Once I decided to stop fooling around with them and look at the code, it took about 20 minutes to figure out the problem and fix it.
In all, I spent about the same amount of time arguing with the AIs about the code the first LLM spit out and then fixing it myself as I would have spent if I had just written it myself from scratch. On the other hand, it was a different kind of amusing and frustrating than I normally have when writing code. Plus, it took the time to format its output in a nicer (but slightly useless) way than I ever would have bothered doing. So, I’d call that a wash. If that ain’t enough (and don’t you think it ought to be): I’m smarter now, and won’t ask a chatbot to reason in the near future.
I wonder if any code platforms have some sort of debugging AI embedded. For example, I wrote some APIs in R to retrieve data from various websites (e.g. YouTube); I would then manually inspect the response files for typical status codes, or fields that didn’t produce an error but were populated by NA.
R has no facility to inspect output and then suggest fixes; at best, I could manually write code to automate the checks. But some sort of automated debugging might be useful?
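Not a debugging AI, but even the “manually write code to automate the checks” part can be made fairly generic. Here’s a rough sketch of the kind of response sanity check I mean (in Python rather than R, and the endpoint and field names are placeholders I made up):

```python
# Sketch of an automated sanity check: flag bad HTTP status codes and fields
# that came back empty or NA instead of producing an error.
import requests

def check_response(url, required_fields):
    """Return a list of human-readable problems found in an API response."""
    problems = []
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return [f"HTTP {resp.status_code} from {url}"]
    data = resp.json()
    for field in required_fields:
        if data.get(field) in (None, "", "NA"):
            problems.append(f"field '{field}' missing or NA")
    return problems

# Hypothetical usage; the endpoint and field names are placeholders:
# print(check_response("https://api.example.com/videos", ["id", "viewCount"]))
```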
How is that possible, you say? One thing my ongoing AI training has taught me, in a nutshell, is that there is an item missing here: the comparison is between a general AI tool and what is expected to be a better way to use AI, namely AI agents. One has to learn the difference between agentic AI and plain generative AI with one-shot prompting.
The choice between using an LLM or an AI agent depends on your needs. If you want quick answers or text-based outputs, visit ChatGPT. If you need autonomy, decision-making capabilities, or task execution across complex workflows, an AI agent is the way forward.
LLMs remain the better choice for:
- Generating drafts of content, like articles, emails, or proposals
- Answering straightforward questions
- Supporting education where a conversational element adds value
LLMs are simple to use and provide (almost) immediate gratification–just type a prompt and get a response. But once you need a proactive system that can sense, decide, and act independently, agents are the way to go.
Now, one way Wolfpup is avoiding a lot of the uncertainty in his example is that he did a bit of homework in his prompting. But more importantly, if one does not have AI agents available, one has to do a very important step that many of the law groups or companies who committed big embarrassing errors when using a plain-prompting LLM did not do, and it is what Wolfpup pointed to at the end of his example. Iteration:
Would you like me to also show them all aligned by current local time (today’s date), so you can see the exact hour difference between each city right now?
In that case, the LLM was helpfully telling the user that they could implement by hand what a rudimentary AI agent can do.
Iteration is a very important element (there are more, like context, planning and reasoning, memory, access to external tools, self-reflection, and interaction with an environment). So when this was proposed as a showstopper:
The thing is that he was correct… but only when one looks at a general LLM with one-shot prompting, which is not an agentic workflow. It is when AI agents are given diverse tasks, and also make use of agents that automatically and iteratively check the data, that one can already see a practical use of AI for legal, medical, engineering, scientific, or system control applications.
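To make the “automatically and iteratively check” part concrete, the reflection pattern those agentic frameworks describe boils down to something like this sketch. The call_llm function is a hypothetical stand-in for whatever chat API one has available, and the prompts are only illustrative:

```python
# Bare-bones generate/criticize/revise loop: the "self-reflection" pattern.
# call_llm() is a hypothetical stand-in for whatever chat API is available.
def call_llm(prompt):
    raise NotImplementedError("plug in your chat API of choice here")

def reflect_and_refine(task, rounds=3):
    """Draft an answer, then let a critic pass iteratively check and fix it."""
    draft = call_llm(f"Complete this task:\n{task}")
    for _ in range(rounds):
        critique = call_llm(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
            "List any factual or formatting errors. Reply with only 'OK' if there are none."
        )
        if critique.strip().upper() == "OK":
            break
        draft = call_llm(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nProduce a corrected answer, and nothing else."
        )
    return draft
```

The point is that the checking step becomes part of the workflow rather than something the user has to remember to do by hand.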
In December 2020, DeepMind returned to the virtual CASP competition with AlphaFold 2, and this time they succeeded.
“I’m going to read an email from John Moult: ‘Your group has performed amazingly well in CASP 14, both relative to other groups and in absolute model accuracy. Congratulations on this work.’”
For many proteins, AlphaFold 2’s predictions were virtually indistinguishable from the actual structures, and they finally surpassed the gold standard score of 90.
For me, having worked on this problem so long, after many stops and starts, it felt like a solution had been found. We had solved the problem. This gives you such excitement about the way science works.
Over six decades, scientists working around the world on proteins painstakingly found around 150,000 protein structures. Then, in one swoop, AlphaFold unveiled over 200 million of them—nearly all proteins known to exist in nature.
In just a few months, AlphaFold advanced the work of research labs worldwide by several decades.
It has directly helped researchers develop a vaccine for malaria. It has enabled the breakdown of antibiotic-resistance enzymes, making many life-saving drugs effective again.
AlphaFold has even helped us understand how protein mutations lead to diseases ranging from schizophrenia to cancer. Biologists studying little-known and endangered species suddenly had access to proteins and their life mechanisms.
The AlphaFold 2 paper has been cited over 30,000 times and has truly made a quantum leap in our understanding of life.
John Jumper and Demis Hassabis were awarded one-half of the 2024 Nobel Prize in Chemistry for this breakthrough. The other half went to David Baker—not for predicting structures using Rosetta, but for designing completely new proteins from scratch using RF Diffusion.
(BTW, I only used AI in this reply to clean up the YouTube transcript.)
In your case, the paradoxical solution is to apply an AI to check and criticize the code. In other words, let them fight for a few rounds and then you check the best examples that the agents as a group found.
Andrew Ng explaining the reflection part of agentic reasoning.
(Link goes to the specific explanation: use the AI to criticize itself.)
Heheh, I kind of did that by feeding the code back to different LLMs at different times in the process. In the end it was easier and quicker for me to just examine it and fix it myself.
You broke up one sentence to make a set of MAGA references that came out of nowhere, and then followed it up by ignoring the previous parts and just saying how “…GPT magnanimously explains where its misunderstanding came from,” as if that obviates the problem of presenting a confidently incorrect answer. You also say that you have not personally seen a second occurrence, which I know isn’t true because you participated in this thread, where a poster repeatedly queried an LLM and got incorrect and even self-contradictory responses.
You followed that up with another broken-sentence quote regarding the lack of consequence, divorcing it from the rest of the sentence, which indicated that this is a problem because of the behavior users engage in: pasting the output into a work product without indicating the source or verifying the information. Whether you regard this as a “people problem” or not, the reality is that this is very normal user behavior, specifically because people are encouraged to use LLMs this way to ‘efficiently’ produce blocks of text to be inserted into a report or other document without having to do their own research, think through how the text should be organized, or consider whether it is an accurate portrayal or, indeed, whether the factual content is correct.
In general, if you are breaking up a post into sentence fragments “for clarity of response”, what you are really doing is just sniping at the disconnected elements rather than presenting a complete counterargument even within the scope of a single statement. Even if you don’t believe it removes context to respond in that manner, it makes it very difficult to provide a rebuttal that isn’t also just a spastic sequence of disjointed thoughts, and that kind of back and forth quickly degrades into a litany of repetitive and sapless claims. So, again, please don’t break my posts into sentence fragments just because it is easy for you to jab at the disjointed pieces.
But that is still missing that the critic AI agent can be made to automate that step, the one where you fed the code back at different times. Of course, one also has to teach the agent to look for the flaws that you are familiar with. IOW, eventually this gets easier and reduces the time spent on the task. And as the collaborative models show, AI agents can save a lot of time.
If one looks at the biological example of AlphaFold, the end result was to create what amounted to a data-manager agent and a critic agent; after several iterative rounds between them, another AI agent looked at the results and advised on them.
Time saved in identifying the millions of protein structures that biologists were trying to work out? At least six decades.
And that points to one way AI will make lives better: as a result of this research, promising vaccines for malaria are in the works, to name just one example.