Using AI to find students' math mistakes

When you use an app like ChatGPT (or presumably Claude), you aren’t interacting with the LLM itself directly. The app is a wrapper that first pre-processes your input and supplements your prompt with additional information and tools, like a module that can run Python code.

So it’s not really the LLM itself running the code (it can’t), but other software alongside the LLM that “helps it along”. The LLM generates the code based on the context you give it, the other software runs it and returns the output to the LLM, which then summarizes it and returns it to you.
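Just to make that concrete, here’s a toy sketch of that loop in Python. Everything in it is made up for illustration (the real wrappers are far more elaborate and don’t use these function names), but the shape is the same: the model only emits text, and the helper software decides whether to treat that text as code, runs it, and feeds the result back:

```python
# Toy sketch of how a chat app's wrapper runs code *for* the LLM.
# All names and the "RUN_PYTHON:" convention are invented for illustration.

import io
import contextlib


def call_llm(prompt: str) -> str:
    """Stand-in for an API call to the model; returns canned replies."""
    if prompt.startswith("Tool output:"):
        # Second pass: the model just rephrases the tool result for the user.
        return f"The computed value is {prompt.removeprefix('Tool output:').strip()}."
    # First pass: the model decides a calculation is needed and emits code as text.
    return "RUN_PYTHON:\nimport math\nprint(math.sin(math.radians(62.1)))"


def run_in_sandbox(code: str) -> str:
    """Stand-in for the sandboxed interpreter that runs alongside the model."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # a real product isolates this far more carefully
    return buf.getvalue().strip()


def answer(user_prompt: str) -> str:
    reply = call_llm(user_prompt)
    if reply.startswith("RUN_PYTHON:"):  # the wrapper, not the LLM, spots the code
        result = run_in_sandbox(reply.removeprefix("RUN_PYTHON:"))
        reply = call_llm(f"Tool output: {result}")
    return reply


print(answer("What is the sine of 62.1 degrees?"))
# prints something like: The computed value is 0.8837656...
```

The important part is that the exec() lives in the helper software, not in the model.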

In the ChatGPT app, it’s pretty clear when code is actually being run by this other software, because there are visual indications like “Analyzing…” or “Thinking…”, and it shows you the actual code being executed in different formatting.

With the default 4o model, you have to explicitly ask it to do this: ChatGPT - Sin calculation result.

With one of the “reasoning” models like o4-mini-high, it will almost always do this by default: ChatGPT - Sin 62.1 Calculation

I don’t know if Claude does this (apparently not in the default chat mode?).

Either way, though, none of this is perfect, and it should always be treated with skepticism. When you enter something into the ChatGPT app’s input box, it goes through a lot of layers and a lot of pre-processing, not just for coding but also censorship, retrieval-augmented generation, default system prompts that dictate its writing style and personality, context mixing with your previous chat history, etc. And then there’s a “temperature” setting (you can’t control this in the ChatGPT app, but you can via the API) that controls the degree of randomness of its output.
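As an aside, if you call the model through the API instead of the app, you can control temperature yourself. Here’s a minimal sketch using the official openai Python package (v1+); the model name and prompt are just placeholders:

```python
# Minimal sketch of a direct API call where *you* set the temperature.
# Assumes the `openai` package (v1+) is installed and OPENAI_API_KEY is set;
# the model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is sin(62.1 degrees)?"}],
    temperature=0.0,  # low = (nearly) deterministic; higher = more random
)
print(response.choices[0].message.content)
```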

The LLM itself is difficult to change (it requires retraining and re-fine-tuning, which is what uses most of the energy in an AI product). When that happens, it is released as a new model. But separate from the LLM itself, the app around it (“ChatGPT”) is just traditional software, which updates multiple times a week, and each of those changes can alter the output you see even if the LLM itself hasn’t changed. None of that is really documented anywhere, so you never really know “why” it’s returning something different unless OpenAI writes a marketing blog post about it (like they did after the recent “sycophancy” issue where ChatGPT started worshiping its users).

All of this together means that humans don’t completely understand what a complex LLM system is doing, and we have to kinda fake it by adding before- and after-the-fact “guardrails” around the LLM, because the LLM itself is largely a black box. These guardrails are just traditional software that rewrite user prompts, inject other tools, and filter LLM outputs before you see the result. That is what’s running the Python script for you, not the LLM itself.

It’s also how “agentic” systems and, more recently, the Model Context Protocol (MCP) work… these “add-ons” allow AIs to work with traditional software, databases, and outside information more easily, but they are just layers on top of (or before/after) the LLM. The LLM itself can’t run Python (but it CAN generate Python code for helper software to run).
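For a flavor of what those layers look like from the model’s side: with function calling, the wrapper advertises a tool to the model as a JSON schema, and the model can only request that it be run; the wrapper does the running and feeds back the result. A rough sketch, again with the openai package (the run_python tool here is something I made up, not a built-in):

```python
# Sketch of "function calling": the wrapper describes a tool to the model as a
# JSON schema. The model can only *request* the tool; the wrapper runs it.
# The run_python tool is hypothetical, not a built-in.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "run_python",  # hypothetical tool exposed by the wrapper
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Compute sin(62.1 degrees) precisely."}],
    tools=tools,
)

# If the model chose to use the tool, this is a structured *request* (code to run),
# not a result; the wrapper is expected to execute it and send the output back.
print(response.choices[0].message.tool_calls)
```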

Strictly IMHO: LLMs are best when you use them for their intended purposes: language analysis and generation. That they can run code and do math at all is because the companies behind them want to make them into general-purpose assistants, which they inherently are not, and they can only do that by faking it with additional layers on top.

If you want something done in Python, just ask an LLM to write the code and then run the Python yourself. It’s much, much more reliable that way.
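For the calculation that started this thread, the script you’d run yourself is tiny:

```python
# The kind of snippet you'd ask an LLM to write and then run yourself.
import math

angle_deg = 62.1
print(math.sin(math.radians(angle_deg)))  # ≈ 0.8837656, degrees handled explicitly
```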

Which ChatGPT UI are you using? I see you’ve got extra info (record of which model was active, and a link to the thinking process from the reasoning models) I’m not getting in the current default web UI. I used to see those there and I’m not anymore, and that’s annoying me.

I just use the desktop app (the macOS version, in my case; not sure if the Windows one is different). I am a paid user of the lowest tier (Plus, I think, not Teams or the super-expensive Pro one).

Okay, not the first time I’ve seen evidence the desktop UIs can do things the web UI can’t. I’m a Plus user as well. I use the desktop web UI and the Android app. The Android app is even more restricted in its UI, but it can do something the web UI can’t: have a voice conversation with ChatGPT. That turned out to be invaluable as my vision went kaput before I got my cataracts fixed.

Yeah, I use that a lot too and it’s great. But it’s also dangerous because it’s trained on human voices and emotions that try to project confidence and friendliness (except the Wednesday-like “Monday” voice)… which means it’s even easier than usual for the LLM to fool unsuspecting users into believing it.

A hallucination in text is one thing, but when it’s from a very human-sounding voice that you’ve come to trust… it’s way easier to believe :frowning:

And because the voice mode doesn’t have access to all the other “add-ons” like RAG, function calling, code running, and web search with citations, it is much, much harder to ground it in reality. The text inputs have gotten pretty good, especially with “deep research” mode, which forces it to summarize actual web sources instead of just generating text from its training set. But the voice mode can’t do that yet.

btw, that conversation where I didn’t need to poke ChatGPT to use Python, that was all in 4o. I rarely bother using the other models for most things I do. 4.5 doesn’t seem to be a better writer and the reasoning models are awful at writing general prose. Not to mention they’re far more likely to trip the censor and outright refuse things. :angry_face_with_horns:

Cool, sounds like it’s gotten better at detecting when it should run code.

The reasoning models are much better at writing code to spec (which is what I often use it for, as a programmer). Honestly, ChatGPT is already a way better coder than I am… my job used to be like 60% Googling, 40% coding; now it’s like 80% ChatGPT, 10% fixing ChatGPT mistakes, and 10% fixing my own mistakes that ChatGPT found.

But yeah, for almost all other uses (especially natural language writing and summarization), the default non-reasoning models are much better.

I wish OpenAI would make an “automatic model” mode, where it would first pre-analyze your prompt and then choose the best model to send it to, and/or run it on several models behind the scenes and then select the best output.

All of that is possible, and some self-hosted LLMs do exactly that… but for the cloud versions, it’s probably like @LSLGuy said:

These companies are all running at a huge loss, subsidizing every user prompt by burning giant mountains of investor money. It’s a gold rush that most of them will not survive. You can’t operate these services for $20/mo, at least not in the short term, until training hardware and electricity both become much cheaper.

I use Claude and have little experience with ChatGPT of any model, but I have found that getting the prose you want is easier if you tell the AI you want the output to conform to some style (e.g. AP news writing, or poetic, or scientific, or like the New York Times, or whatever).

Something you might try if you haven’t. I still find the AIs to be poor writers overall, but they can be a start. Kinda along the lines of telling the AI you need a certain precision, or to use Python or JavaScript or whatever to do math, per the OP. It helps to tell the AI what you want as output and not let it guess, cuz it will almost always guess wrong.

FWIW, ChatGPT (4o) is a much, much better writer of English prose (overwhelmingly so) than most humans I’ve met, though my social circles are small and mostly full of non-writers. Its default writing style and personality won’t produce anything fancy because it’s purposely dumbed down for a general audience. But if you ask it to write in the style of a particular author you like, not just a style guide, it is very, very good at emulating that style, and also very good at copying a particular kind of writing (like academic prose or literature or poetry).

I’ve probably only encountered 2-3 people in my life who can write better than ChatGPT, IMHO. It might also be a generational thing, though. At 40 years old, I think most of my peers are already the products of decades of declining American education, especially in English & literature. I think the younger gens are even worse. Probably the SDMB generation still valued writing, but these days, it is highly abnormal to encounter an American who can write well.

Whereas the older generations look down on AI for not being as perfect as the elite classical writers they remember from their own youth, the younger ones rely on it as a crutch — it’s a huge problem in schools now, from what I understand — because it is so, SO much better than what they are capable of producing on their own.

Seems worthy of its own thread. I want to discuss it. Willing to bet a few others here would too.

Yeah, it would be interesting, wouldn’t it? Feel free to start a linked thread? I think we have a few teachers here, don’t we? Not sure if any of them are still working and not retired already, though…

Anyway, that was just me spitballing from personal experience. I wouldn’t have anything more substantial to add to that convo, but I’d love to hear what others say.

Does an extended conversation with Claude still fall off a cliff when you reach its context limit? I immediately rejected using Claude because of that. (that, and Claude was a freakin’ prude compared to ChatGPT’s censor and that’s saying something.) ChatGPT, when its context buffer maxes out, gracefully handles it by just chopping out the earliest parts of the conversation. Not so Claude. It reacts by just… not using the saved context any more at all. It acted like it forgot everything that already took place in the conversation, thus making it absolutely useless for developing stories of any decent length. This sudden amnesia is what I’ve termed “falling off a cliff.”

When I hit that limit Claude asks me if I want to continue. When I click the button it picks up where it left off.

I do find that annoying though. I guess they don’t want someone asking it to work for an hour while they walk away. I can see how that would be a problem for them. I still find it annoying, but it will keep working as long as I am there to tell it to keep working. (I pay for Claude so maybe that is a difference…I do not know.)

That sounds like you’re referring to its single generation output limit. ChatGPT’s is pretty limited too in that respect. No, I’m talking about a long conversation, a lot of back and forth where you expect it to remember everything both you and the model said so far.

I can’t say. I have never used the AI to have a conversation. I have read that some are using AI as a therapist now. I’ve only ever asked it to do a task and not chit-chat.

I should probably try that though.

Getting vibes of the movie “Her” though which was great but also disturbing.

If context size is an issue, Gemini offers a window of 1-2 million tokens Long context  |  Gemini API  |  Google AI for Developers

That is more than 10x what other LLMs offer, I think (it’s not super easy to find that info anymore, but I think ChatGPT is limited to something like 128k? older data here)

Most generative models created in the last few years were only capable of processing 8,000 tokens at a time. Newer models pushed this further by accepting 32,000 tokens or 128,000 tokens. Gemini 1.5 is the first model capable of accepting 1 million tokens, and now 2 million tokens with Gemini 1.5 Pro.

In practice, 1 million tokens would look like:

  • 50,000 lines of code (with the standard 80 characters per line)
  • All the text messages you have sent in the last 5 years
  • 8 average length English novels
  • Transcripts of over 200 average length podcast episodes
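If you want a rough feel for how much of a context window your own text would eat, you can count tokens locally. Here’s a sketch using OpenAI’s tiktoken library (Gemini and Claude use different tokenizers, so treat the numbers as ballpark; the filename is just a placeholder):

```python
# Rough token count for a piece of text, using OpenAI's tiktoken tokenizer.
# Other vendors tokenize differently, so treat this as a ballpark figure.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many recent OpenAI models
with open("my_novel.txt", encoding="utf-8") as f:  # placeholder filename
    text = f.read()
print(f"{len(enc.encode(text)):,} tokens")
```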

Personally, though, I’ve never encountered a situation where Gemini worked better than ChatGPT. I’ve never paid for Gemini, though, so maybe their paid models are better. But I’ve been thoroughly unimpressed every time I’ve tried one of their free ones.

That said, there are these multi-LLM benchmark “tournaments” where AIs compete against each other and humans rate their outputs (presumably blind), and I think Gemini has taken the throne in a few recent ones. Sorry, I forget exactly what these tournaments are called in general… but here is one example: https://lmarena.ai/?leaderboard

Gemini is currently winning almost every category, from math to coding to creative writing, with OpenAI models not far behind. Claude is way down the list for most of them, and is almost always beaten even by the free Chinese DeepSeek model.

Hmmm…yup. Stay away from Claude fer sure. I like my AI fast and not overloaded.

https://searchengineland.com/claude-sonnet-3-7-is-the-leading-llm-for-ai-seo-report-454750

You know, it’s funny… I was just thinking to myself, after that last post, that “AI has become so tribal, like everything else” :slight_smile:

There’s ChatGPT people, Claude die-hards, Gemini fanbois… it’s like they’ve become parts of our in-groups now, dear friends that we have a hard time giving up. Like even though all the benchmarks say Gemini is better, I’m just… used to… ChatGPT and it’d take a lot for me to switch away now. It doesn’t make any sense. It’s just a computer program, a statistical model of usefulness and friendship, but still…

And then with things like Grok, it just feels like an extension of Musk’s ego. Or DeepSeek feels like Chinese child labor to me. It’s all just projection, I know, but it’s really hard for me not to anthropomorphize them…

I agree. And while I do not use AI a whole lot I have dabbled with some and each has its strengths.

For me it is Claude. I want to write programs (already wrote one using Claude) and do some research both for work and for the SDMB (and a few other things). To be clear, all the posts I write on the SDMB are done by me (not including a couple I got chastised for some months ago and learned my lesson).

Claude is by far the best for me and my needs.

If you want something to write your articles for you I think ChatGPT is better. If you want an AI therapist I think ChatGPT is better.

All are pretty versatile though. Hard to go too far wrong. As long as users realize the AIs, as neat as they are, do have some real limitations. I believe a human, at least for now, still needs to put in the work to verify what they are being told and to fix the mistakes the AI makes (and they fer sure make mistakes as this thread has shown).

That said, Claude is best in most cases (at least for now). :slight_smile:

I don’t think that’s the explanation, because a standard deterministic calculation of a trig function, even to 20 decimal places, takes a lot less compute than pretty much anything done in an LLM. Going for efficiency would mean always using the standard calculator apps whenever at all possible.

And if there’s any difference in the results from JavaScript’s math libraries and Python’s, it’d be way out in the decimal places, far beyond what’s relevant here.

It should also be noted that the examples I tested the AI on were very simple ones: inappropriate rounding is a very easy-to-detect mistake in almost any calculation (if the answer is almost but not quite right, it’s probably inappropriate rounding), and wrong calculator mode is automatically the first thing to check in any trig calculation that gives a weird result. I really ought to dig up some past quizzes, with actual examples of student mistakes (which are often much less straightforward, and sometimes I can’t even diagnose them), and see if it can diagnose some of those.
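For what it’s worth, those two easy checks are simple enough to script directly, which is roughly what the LLM’s Python tool ends up doing anyway. A crude sketch (the student answer, the tolerances, and the rounding check are all made up, and it only catches rounding of the final answer, not of intermediate steps):

```python
# Crude sketch: check a student's trig answer for the two "easy" mistakes.
# The problem, the student answer, and the tolerances are made-up examples.
import math

angle_deg = 62.1
student_answer = -0.668  # hypothetical student result

correct = math.sin(math.radians(angle_deg))  # ≈ 0.8838
wrong_mode = math.sin(angle_deg)             # ≈ -0.668, calculator left in radian mode


def close(a, b, tol=1e-3):
    return abs(a - b) <= tol


if close(student_answer, correct):
    print("Correct (within tolerance).")
elif close(student_answer, wrong_mode):
    print("Probable cause: calculator in radian mode.")
elif close(student_answer, round(correct, 2), tol=1e-6) or close(student_answer, round(correct, 1), tol=1e-6):
    print("Probable cause: final answer rounded too aggressively.")
else:
    print("Can't diagnose automatically.")
```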