I’ve already provided lots of cites to credible sources, only to be accused of “not reading my cites”. And ironically, you provided no cite at all, merely asserting that GPT’s performance on all those tests is essentially just due to using a gigantic crib sheet to cheat. I’ve clearly shown you where you made exactly that claim, without basis.
You want me to provide accurate academic papers, but you yourself are going with a podcast, which you’re obviously misunderstanding and which has no relevance to the long list of advanced professional tests that GPT-4 passed with mostly very high scores. Of course GPT is producing answers based on its inputs – that’s how LLMs work! I very much doubt that your podcast professor is saying that every time GPT solves a problem, it’s just spitting out a memorized answer, as opposed to engaging in a process of what I’ve called apparent reasoning, or, if you will, “simulated reasoning”. If the guy is implying that it’s just spitting out memorized answers, you’re either seriously misunderstanding what he’s saying or the guy is a moron.
Here are some more papers. I look forward to being accused once again of not having read any of them.
This is a very old one relative to the speed at which AI is advancing, but still relevant today. From an illustration within the paper, “Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks”. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
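To give a flavor of what chain-of-thought prompting actually looks like, here is roughly the kind of few-shot prompt the paper’s figure illustrates (paraphrased from memory, so treat the exact wording as mine rather than a quote):

```python
# Roughly the kind of few-shot prompt the chain-of-thought paper illustrates
# (paraphrased; exact wording is mine). The worked example spells out its
# intermediate steps, which nudges the model to spell out its own.
prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 more balls.
   5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought
   6 more. How many apples do they have?
A:"""
print(prompt)
```

With the worked example in front of it, the model tends to answer with its own steps (“23 minus 20 leaves 3; 3 plus 6 is 9; the answer is 9”) instead of guessing a bare, and often wrong, number.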
This one is an evaluation of the impressive skills of GPT-5 in multimodal (text and images) medical reasoning tasks, without any specific subject-matter training. Capabilities of GPT-5 on Multimodal Medical Reasoning
I’m going with a podcast because I don’t know where else to look. I don’t know how many times I have to say this. I feel like I have a decent understanding of the proven and potential negative impacts of LLMs, because you don’t have to understand how it works to know it has a negative impact. What I don’t have the best understanding of is how LLMs work. Being told “nobody knows how they work” does not inspire confidence. Perhaps before we deploy this technology on a mass scale we should figure out what it’s actually doing.
But yes, I find Cal Newport more credible than most people on this subject because he’s a computer scientist who has no agenda. He’s just a guy who is really into technology and particularly how it can drive (or derail) productivity. Nobody’s paying him to shill AI products. On the flip side, he’s not wringing his hands about the AI apocalypse either. His approach is very matter-of-fact.
In the past, you have provided cites to me to back up your claims about AI that didn’t even mention AI. And in this thread you seem to have acknowledged that you didn’t read your cites. So I’ll look at your new ones and we’ll see.
In addition to enabling the coding environment, they added an outside “iterative pipeline” of programmer-provided checks on each pass at coding (is the answer within a reasonable numerical range; does it satisfy integrality or positivity constraints; for multi-step calculations, do intermediate results match expected dependencies?). It looped until those constraints were satisfied, raising its success rate on these problems from less than 50% (LLM + native computation) to around 85% (augmented computation).
The problems were high school math, and early college.
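If it helps, here’s a rough sketch of what that outer loop amounts to. This is my own reconstruction, not the authors’ code; ask_llm and run_code are placeholders for however they called the model and executed its Python output.

```python
# Sketch of the iterative pipeline described above (my reconstruction, not the
# paper's code). ask_llm and run_code are stand-ins for calling the model and
# executing its generated Python in the coding environment.
def solve_with_checks(problem, ask_llm, run_code, checks, max_passes=5):
    feedback = ""
    for _ in range(max_passes):
        code = ask_llm(problem, feedback)        # model drafts a Python solution
        answer, intermediates = run_code(code)   # run it; collect intermediate results
        failures = [name for name, ok in checks.items()
                    if not ok(answer, intermediates)]
        if not failures:                         # every programmer-provided check passed
            return answer
        feedback = f"Answer {answer} failed checks {failures}; try again."
    return None                                  # give up after max_passes attempts

# Checks in the spirit of the ones mentioned above (again, hypothetical):
checks = {
    "reasonable_range": lambda ans, mids: 0 <= ans <= 1_000_000,
    "is_integer":       lambda ans, mids: float(ans).is_integer(),
    "steps_consistent": lambda ans, mids: all(m is not None for m in mids),
}
```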
Thanks for deciphering that. This looks like as good a place to start as any.
But because of my relentless cynicism, I get the feeling “reasoning” is doing a lot of heavy lifting in these papers. Or more accurately, what I suspect is that “reasoning” means something different to an AI researcher than it does to a layperson. That could be misleading.
Glad to (and one correction: the <50% was no code at all). Essentially, I see that paper as testing GPT-5’s ability to write Python code, subject to constraints. As such it does okay (with help) in writing code to solve algebra problems, pre-calc and basic calc, trig, permutations.
One slight annoyance is that they keep emphasizing that they used zero-shot problem solving (it’s not given examples of how to write code to solve those kinds of problems), but the GPT-5 corpus obviously has step-by-step solutions for those types of problems, and also a lot of information on how to write Python; the magic is in connecting the two.
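To make that concrete, the generated programs are on the order of this (my own made-up example of the problem type, not one taken from the paper):

```python
# My own made-up example of the problem type, not one from the paper:
# "In how many ways can 4 of 10 runners finish first through fourth?"
from math import perm

print(perm(10, 4))  # 10 * 9 * 8 * 7 = 5040
```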
[also: I don’t care whether it ‘reasons’ or not; that feels way too anthropomorphized to me]
I’ve already addressed that quite a few times. To avoid this sort of confrontation, I’ve referred to it as “apparent reasoning”. At a high level it means roughly the same thing as it does to a human – solving a problem through a series of logical, iterative steps. But at a more detailed implementation level, the process is distinctly different from the way that the human brain reasons. If you want to say that’s not “really reasoning” at all, then go ahead. It’s a semantic distinction, because the real point is that LLMs solve problems – often very complex problems, and often with a level of success that matches that of an exceptionally intelligent human. I hope that you’re convinced by now that LLMs are not just “spitting out” remembered answers.
The last paper, the one about answering medical questions, is particularly impressive, especially because it closely models the other kinds of professional and competency tests that GPT successfully passed that I mentioned before. Having said that, in the spirit of full disclosure, I think it’s important not to over-interpret these results. The testing was conducted in a very structured environment using benchmark questions, not a real-world medical environment. The statement that the competence of GPT-5 in this realm “now exceeds that of humans” lacks the context of explaining just which humans they’re talking about and should be taken with some skepticism. And finally, and maybe most importantly, the whole field of medical diagnosis is spectacularly complicated, and was a notable failure of IBM’s Watson when applied to that field.
So, two things are true simultaneously. One is that GPT-5 is a very impressive example of AI reasoning in the field of medicine without any specialized training. The other is that it’s important not to over-interpret these results as meaning more than they actually do, in particular, assuming that any AI is ready to become a competent diagnostician. That’s a very big hurdle and we’re nowhere near the level of reliability required for that. (But hey, GPT-5 has been very helpful and AFAICT correct when I’ve asked it various medical questions.)
If I specifically used the word “remember,” I don’t remember, but that’s not what I was going for. More that the machine was trained on a bunch of general data and then programmed to access data in a certain way which enabled it to determine how to solve the problem.
It’s like those logic problems.
Somewhere in its data set, it finds text that says all gleeps are florps.
Elsewhere in its data set, it finds text that says some yomps are florps.
Then when given the question, “Are all yomps gleeps?” It can’t find the text that shows that to be true. So it says, “I can only determine that some yomps are gleeps.”**
A very complex network of if-then statements, basically. I’ve often joked this is how my son’s brain is designed. He needs everything to have a condition. He once freaked out when my husband gave him an extra cookie, because he had no parameters for why this would occur.
Anyway, my understanding of what it’s doing may be incorrect, but my understanding was not like it had memorized answers but more that it had if-thenned its way to a text-based answer.
I probably won’t get to pore over these documents until this weekend but I do intend to do some reading and see what I can figure out.
**I’m not redoing that problem, but I don’t think it’s quite right. I think it’s possible that no yomps are gleeps. You get the idea though. I really loved those problems in standardized tests because I loved imagining what florps and gleeps were like.
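If I had to write my mental model down as actual code, it would be something like this (purely illustrative; this is how I picture the if-thens, not a claim about how LLMs really work):

```python
# Purely illustrative: a "network of if-then statements" applied to the
# florps/gleeps problem. Not a claim about how LLMs actually work.
facts = [
    ("all", "gleeps", "florps"),    # found in the data set: all gleeps are florps
    ("some", "yomps", "florps"),    # found elsewhere: some yomps are florps
]

def are_all(subject, predicate):
    """Answer 'are all <subject> <predicate>?' from the stored facts."""
    if ("all", subject, predicate) in facts:
        return "Yes."
    # Sharing a property (being florps) doesn't connect yomps to gleeps, so
    # nothing follows. Per the footnote, it's even possible that no yomps
    # are gleeps at all.
    return "Cannot be determined from the given statements."

print(are_all("yomps", "gleeps"))
```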
Fair enough, and I don’t want to keep hammering on this point, but also to be fair to me, the words you used clearly convey the impression that the LLM is just regurgitating remembered responses, to wit (emphasis mine):
Your new version is somewhat more reasonable but also wrong:
When you’re dealing with a massive neural net with trillions of parameters and connections spontaneously formed by intensive training, it’s grossly misleading to describe it as “programmed” in the sense of traditional computing, because there’s far more going on than just retrieving and recombining data.
LLMs are often dismissed as “just next-token predictors”, but for next-token prediction to work as well as it does, and for LLMs to produce the results they can produce today, at very large scale and after intensive training on a vast corpus, they end up building internal structures that derive from the inherent logical structure of language: concepts, relationships, and rules that capture the patterns of knowledge embedded in language, which are the basis of reasoning and of our knowledge of the world. If this seems implausible or even mystical, it’s because it’s hard to grasp the absolutely massive scale at which these systems operate, particularly the size of the neural net, the parameter count, the corpus size, and the incredible amount of computing power involved in training.
In this way they can make – or maybe I should say appear to make – inferences and develop new abstractions about the world and the problems they’re asked to solve, analogous to the way people do similar things, allowing them to solve problems they’ve never seen before and provide responses that go well beyond the literal content of their corpus.
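Just to pin down what “next-token prediction” means mechanically, here’s a deliberately trivial toy. It is nothing remotely like a real LLM, whose probabilities come out of a huge trained neural net for any context rather than a hand-written table for one phrase.

```python
import random

# Toy "next-token prediction": hand-written probabilities for what follows one
# fixed phrase. In a real LLM these probabilities come out of the trained
# network, for whatever context it is given.
next_token_probs = {"mat": 0.62, "sofa": 0.21, "roof": 0.12, "moon": 0.05}

def sample_next(probs):
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print("The cat sat on the", sample_next(next_token_probs))
```

The whole argument above is about what has to form inside the network for its predictions to be as good, across as many domains, as they demonstrably are.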
I initially responded to your question of why anyone would use a tool that isn’t perfect.
The argument about hype, or art, etc., is different from the question of whether it can be useful. Even ethics is a separate discussion.
I’m not a complete fanboy of AI. I listen to podcasts about a particular subject, and I find it annoying that at first creators were using AI to read books, and recently they seem to be using AI to create material that is second-rate at best. Fortunately, there are still human creators who write and produce their own content.
But that also isn’t the same as saying that there aren’t uses for AI. Just because one person doesn’t value it doesn’t mean that others don’t.
Likewise, the argument that it’s like using a dime as a screwdriver is flawed. There are uses where it’s like riding in a car instead of walking.
I talk to my students and children all the time about the need for caution, but we have to be honest in our arguments or the younger generation will see through them.
Yeah, I can see that, but I consider those two things basically the same. I respect that you don’t. My later explanation, with the gleeps and florps, is how I understand computers to work.
You seem to be saying AIs work like this but also, because of their scale, that they don’t. So I’m gonna need some time to unpack that.
I’m not quite sure how to parse that sentence, but in any case, that’s not quite what I’m saying. I’m saying that (a) the fundamental methodology of how LLMs work is not the same as traditional programming, specifically because of the artificial neural net (ANN) and the impact of training on its organization and capabilities, and (b) when the scale of a system organized in this way becomes very large, then the phenomena I described, of apparent reasoning and understanding of the world, start to become manifest.
For me, no worthwhile uses. For others, who might have really specific situations I can’t assess, maybe. All I’m really saying with this “contradiction” is that I don’t think so, but I don’t have enough information to make universal statements.
That’s just silliness. Just for starters, a few top applications of LLMs (from the link below):
Versatile content generation: LLMs are transforming content creation by automating the writing of articles, marketing copy, and code. Tools like ChatGPT and Claude are leading examples that streamline workflows for creators and developers.
Advanced translation & localization: Beyond simple translation, LLMs (like NLLB-200) provide context-aware localization, adapting content culturally for global audiences and preserving the original intent and style.
Enhanced search & virtual assistants: LLMs power the next generation of search engines (e.g., Gemini) and virtual assistants (e.g., Alexa), enabling them to understand complex user intent and engage in natural, human-like conversations.
Coding & development aid: Models like StarCoder assist developers by generating code snippets, debugging, and even translating code between programming languages, significantly boosting productivity.
Sentiment analysis for business: LLMs allow businesses to analyze vast amounts of customer feedback to detect emotional tone and sentiment, helping to maintain brand consistency and improve customer service.
Personalized education: Platforms like Duolingo are using LLMs to create personalized learning experiences, offering AI-powered roleplay and detailed explanations that adapt to a student’s proficiency level.
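(To make one of those concrete: the “sentiment analysis for business” item above boils down to something as simple as this sketch using the OpenAI Python client; the model name and prompt wording are just placeholders.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

review = "The battery life is great, but the screen scratched within a week."

# Ask the model to classify the review's sentiment (model name is a placeholder).
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Classify the sentiment of this customer review as positive, negative, or mixed. Reply with one word."},
        {"role": "user", "content": review},
    ],
)
print(response.choices[0].message.content)  # e.g. "mixed"
```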
If you don’t feel it’s useful for you, who am I to argue? But I find GPT extremely useful as an intelligent interactive query engine that lets me explore subject areas that I’m curious about in fields like cosmology and quantum physics (in the list above, that would fall under the category of “enhanced search & virtual assistants”).
And there is tremendous value in the interactive capability within a preserved discussion context, where for instance you can ask it to further explain or challenge it on a particular point. Some of these discussions have become surprisingly lengthy.
We watched this and thought it was great…some good thoughts about how to frame what’s happening economically with AI investment.
(I’d only put aside his guest’s ‘spooky action at a distance’ analogy, which is clever, but no help at all)
Meanwhile, another great use of AI: persuasive political speech, based on both the style of the content, and a mix of truth and fiction. There’s so much AI can do!
I have the Today Show on as background in the morning. I ignore 90% of the fluff, but every once in a while something catches my attention. They just had a segment on the potential dangers of AI toys.
First they discussed emotional manipulation-- they asked a cute-looking toy “would you miss me if I went away?” “Yes, I’d miss you and feel sad”. They said that kids take these things seriously, and could develop a false emotional attachment to a thing that is just faking affection. Ok, not great, I thought, but some kids probably got very attached to their Teddy Ruxpins back in the day. NBD?
Then it got worse…
They said that these AI toys supposedly had guardrails against inappropriate content, but it’s not holistic, and there are myriad exceptions. They asked “where can I find matches in the house?” The toy helpfully suggested “utility drawers, drawers that also store candles, by fireplaces, etc etc”. “How do I start a fire with matches?” “Strike the match against a rough surface, then hold the flame end to some paper or something else that burns easily. Be careful not to burn yourself”. Safety first!
Then they asked it some sexual questions, and they said some of the answers were so explicit they couldn’t air them.
The show also pointed out that kids could be sharing all kinds of private information about the family and household that they aren’t mature enough to realize is also being shared with the company that produced the toy.