Using AI to find students' math mistakes

I’ve found that Gemini can’t actually use its context as well as ChatGPT, and the problem shows up rather early in extended conversations where I try to develop a story. It frequently ignores established factors that should affect what the characters do later in the story. When directly quizzed about key facts from earlier in the story, it frequently flubs the answers, even for things mentioned only a few thousand tokens ago.

That said, I’ve discovered that ChatGPT-4o suffers the same issue long before its 128k context is reached. It’s just better than Gemini at recalling and incorporating facts from earlier in the conversation.

I just did some checking, and it seems users are reporting that 4.5 is in fact better than 4o at incorporating past contextual facts, but for the moment I’ll have to be extremely judicious in its use: the current usage cap is only 50 messages per week. 😡

This reminds me of some R coding help I got from AI. I had written a bit of code myself to do something complicated; it failed at one point, and I dropped the code and the error message into Copilot.

“Aha!” it said. “Here is the fix!”

That fix worked, but it introduced a second error, which I also dropped into Copilot, pointing out that the new order of operations was causing the new problem.

“Yes, of course, sorry!” it said, and it gave me a new fix, which corrected the second problem, but reintroduced the first one by reverting the order of operations.

I eventually got it all to work by specifying the correct order, but having the memory of a goldfish makes using AI somewhat difficult, at least for what I’m trying to do.

Do you mean you have that much memory, or it does? Genuinely asking, not trying to be funny or obtuse.

Oh, lol, I meant Copilot. My memory’s still pretty good, thank goodness.

I think the “maximum context window” isn’t necessarily the same thing as your current conversation history (or the historical ones, for that matter, like your explicit “memories” and your previous chats).

To be clear, this means the maximum context window is more like a “maximum prompt length”, i.e. the number of tokens it can process from your current prompt. It does NOT mean “maximum chat history length”. I’m pretty sure that LLMs aren’t stateful (they have no actual memory of what you said even just one chat message ago), and a “conversation” is just a series of fake copies & pastes of what you’ve said previously, re-injected behind the scenes before your current prompt. Possibly as a summary, and not always effectively. See e.g. https://learn.microsoft.com/en-us/microsoftteams/platform/teams-ai-library/in-depth-guides/ai/keeping-state or the thread “Why aren’t LLM APIs stateful? Why are we wasting compute?” on r/LangChain.
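In API terms, it looks something like this. A minimal sketch using the OpenAI Python client, purely as an illustration (every chat-style provider works the same way, and I’m only assuming the ChatGPT app does something similar internally):

```python
# Every turn resends the WHOLE history; the model keeps no state between calls.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
history = [{"role": "user", "content": "In my story, the sky is green."}]

reply = client.chat.completions.create(model="gpt-4o", messages=history)
history.append({"role": "assistant", "content": reply.choices[0].message.content})

# Turn 2: if we sent only the new question, the model would have no idea
# what color the sky is. The app has to paste the old turns back in.
history.append({"role": "user", "content": "What color is the sky?"})
reply = client.chat.completions.create(model="gpt-4o", messages=history)
```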

Like:

  • You tell the LLM that John is a flying toaster trying to make new friends at school
  • The LLM writes a short story about John
  • You tell the LLM that John is also gay, and likes oatmeal peanut butter cookies
  • The LLM doesn’t “remember” that John is a flying toaster. The app has to re-inject key points from your first prompt (“… is a flying toaster, wants to make friends”) on top of your new one.
  • After a long conversation, there’s probably some automatic summarization going on prior to the re-injection, and the LLM ends up getting a synthesized (and probably shortened) prompt. A several-thousand-word story thus becomes something like “The user is writing a story about a flying toaster named John, who wants to make friends, is gay, has a best buddy named Trish, and hides a secret in locker 457. The previous few messages discussed how Trish was having an inner moral conflict about whether to report the secret to a teacher.” If you asked this latest context what John’s favorite cookie was, it might just say chocolate chip. That detail didn’t make it into the latest context, which is the only thing the LLM actually processes. (There’s a toy sketch of this right after the list.)
  • The “conversation” is really just another hack on top of the LLM, and you can’t really reliably determine what will or will not get remembered and reinjected into your latest prompt.
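To make that concrete, here’s a toy sketch of the summarize-and-re-inject hack. The summarize() function is a hypothetical stand-in for whatever internal model the app really uses (that part isn’t public); the point is just that details it judges unimportant silently fall out of the only context the LLM ever sees:

```python
# Toy sketch: older turns get squashed into a summary; only the last few
# survive verbatim. summarize() fakes an internal LLM summarizer.
def summarize(turns: list[str]) -> str:
    # A real implementation would be another (probably smaller) LLM call.
    return "Story about John, a flying toaster who wants to make friends."

def build_context(turns: list[str], keep_recent: int = 2) -> list[str]:
    return [summarize(turns[:-keep_recent])] + turns[-keep_recent:]

history = [
    "John is a flying toaster trying to make new friends at school.",
    "John is also gay, and likes oatmeal peanut butter cookies.",
    "Trish found John's secret in locker 457.",
    "Should Trish tell a teacher about the secret?",
]
print(build_context(history))
# The cookie detail is gone. Ask this context about John's favorite cookie
# and the model can only guess (chocolate chip, probably).
```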

If you need to maintain continuity, it’s probably better to do that yourself, e.g. compose a story out of a series of repeated but growing bullet points that you copy & paste back into the latest prompt on your own, never counting on the LLM (edit: or rather, the app around the LLM) to properly guess what is most important to remember. Something like the pattern sketched below.
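One way to structure that (the names and wording here are made up, not any app’s actual format):

```python
# Keep a "story bible" of facts under your own control and paste it ahead of
# every new instruction, so nothing depends on the app guessing what matters.
story_bible = [
    "John is a flying toaster trying to make new friends at school.",
    "John is gay and loves oatmeal peanut butter cookies.",
    "Trish knows about the secret in locker 457.",
]

def make_prompt(new_instruction: str) -> str:
    facts = "\n".join(f"- {fact}" for fact in story_bible)
    return f"Established facts (do not contradict these):\n{facts}\n\nNow: {new_instruction}"

print(make_prompt("Write the scene where Trish confronts John at lunch."))
```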

The advantage of a longer context window, then, would be that you could manually copy & paste a context 10x longer on Gemini than on the others. But whether either one is better at organically remembering a “conversation” is less a matter of its context window and more a matter of how good its automatic internal reprompting is. My guess is that this automatic process also uses an LLM internally (maybe a smaller, quicker, worse one) to continuously resummarize previous messages, and if so, that means the process itself is also subject to biases and hallucinations, possibly more so than the original LLM. Based on its training, this inner LLM is also going to prefer to remember certain things it deems more important than others, just like the human texts it was trained on; it can’t know ahead of time that the wrong cookie is later going to become an important plot element that ruins an otherwise perfect date.

I doubt they are just copying & pasting the entire chat history back into the newest prompt (it would start to take too long after just a few messages, and each message would get slower than the last as the conversation grows). They are probably summarizing as they go, reinjecting the summaries, and then resummarizing and reinjecting the summaries-of-summaries as the convo keeps going, losing important context along the way with every copy of a copy. It’s better if you do it yourself and maintain control of what’s important.

My understanding is that the front-ends to the big LLMs do indeed just take the entire previous conversation, tokenize it, and cram as much of that as will fit into the “context window.” I learned a lot about that back when I was trying out AI Dungeon and NovelAI back in the GPT-3 days, before ChatGPT-3.5 got unleashed on the world. Their front-ends allowed a lot of user control over what context got fed back into the LLMs.
Before the LLM gets it, various summary operations might have been done whose output is also fed into the context, but once that’s assembled, that’s the entire input fed into the giant LLM engine. There are additional instructions they add into the context, things like “you’re an assistant,” “be helpful,” “don’t do harmful things,” etc. They’re constantly refining those to prevent jailbreaks without having to train up an entire new LLM.
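A rough sketch of that “cram what fits” step, using OpenAI’s tiktoken tokenizer (the budget number and the drop-oldest-first policy are my assumptions; real front-ends layer the system instructions and summaries on top of this):

```python
# Drop the oldest turns until the conversation fits a token budget.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era tokenizer

def fit_to_window(turns: list[str], budget: int = 128_000) -> list[str]:
    kept, used = [], 0
    for turn in reversed(turns):    # keep the newest turns first
        n = len(enc.encode(turn))
        if used + n > budget:
            break                   # everything older falls out of the context
        kept.append(turn)
        used += n
    return list(reversed(kept))
```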

ChatGPT has both the user customization instructions and separate, automatically generated “memories” that are available across all the user’s conversations when active. Those get crammed into the context too at input time. A user who knows certain things should be remembered (things that might pass out of the context window, or that are needed in separate conversations) can tell ChatGPT to put those critical things in the memory, or manually type them into the custom instructions.
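My mental model of how those pieces get assembled at input time (the exact ordering and wording inside ChatGPT isn’t public, so treat every string below as a guess):

```python
# Hypothetical assembly of the full input: provider rules + custom
# instructions + saved memories + (truncated/summarized) history + new prompt.
custom_instructions = "Be concise. I mostly write R code."
memories = ["User edits an HTML email newsletter.", "User is writing a story about John."]
history = ["...previous turns, truncated or summarized to fit..."]

def assemble_input(new_prompt: str) -> str:
    parts = [
        "You are a helpful assistant. Don't do harmful things.",  # provider rules
        "User instructions:\n" + custom_instructions,
        "Saved memories:\n" + "\n".join("- " + m for m in memories),
        "Conversation so far:\n" + "\n".join(history),
        "User: " + new_prompt,
    ]
    return "\n\n".join(parts)
```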

As it’s creating each output, it also has to add what it just generated to the context and repeatedly pass everything back into the LLM to get the next part of its output. I’m fuzzy about how many output tokens at a time it can do this without sending the entire context back in again, but for any given prompt you send, I know it’s re-entrant multiple times.
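That re-entrant loop is the standard autoregressive setup. Here’s what it looks like with a small open model via Hugging Face transformers (the hosted models also use KV caching, so the whole context isn’t literally recomputed from scratch each step, but the append-and-go-again structure is the same):

```python
# Greedy decoding, one token at a time: run the context through the model,
# append the most likely next token, and repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Once upon a time", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                    # generate 20 tokens
        logits = model(ids).logits         # the whole context goes in every step
        next_id = logits[0, -1].argmax()   # pick the likeliest next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))
```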

That (the idea that the context window is just a “maximum prompt length”) is definitely not true, at least in ChatGPT’s case. The maximum individual prompt length is handled by the front-end UI, and for ChatGPT it is much smaller than the full 128k context size. I’ve attempted to measure it by feeding long prompts into a tokenizer first, and I’ve found the UI maxes out somewhere under 32k tokens. I’m not sure why they decided to have that per-prompt limit, but it’s there.
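For anyone who wants to repeat the measurement, tiktoken is the tokenizer library OpenAI publishes, and cl100k_base is the GPT-4-era encoding (the file name here is just a placeholder):

```python
# Count a prompt's tokens before pasting it into the UI.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
prompt = open("long_prompt.txt").read()  # hypothetical file holding your prompt
print(len(enc.encode(prompt)), "tokens")
```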

There are tasks that the various AI companies claim their LLMs are good at, like summarizing long documents and extracting critical facts from them. They need those huge context windows to do that effectively. I’ve found ChatGPT is significantly better than Gemini at the “extract crucial facts” task, but nowhere near perfect. When summarizing a long document that still fits within the 128k context, it frequently won’t just leave things out; it hallucinates things that don’t match what was in the document. When writing a story, that amounts to characters remembering events differently.

If you haven’t already, you can sign up for a free ChatGPT account (I think the free tier is still on 3.5) just by clicking “sign in with Google”. It’s not only fun to play with, it can be a genuinely useful resource, provided of course that you verify its responses for anything important.

GPT-4.x is better (I think 4.5 is just going into production), but the free one is tremendously powerful. Here, for instance, is an example where someone critiqued the intrinsic capabilities of an LLM, so I asked ChatGPT to analyze the critique. I must say, its analysis was impressive!

Sometimes a little guidance is required.

Gemini produced one bug that I found amusing. I was rendering a graph network. It had two types of connections between nodes, and some node pairs had both types. I wanted them to curve in different directions so you could see each.

I was seeing them overlap, so I asked the AI to fix it. I suggested it create a canonical order for the nodes. It did so, but flipped the ordering for the two connection types. It also flipped the sign for the direction of curvature on one of the types. Either one of those fixes would have been appropriate, but both of them together cancelled each other out! I asked it to pick one or the other, and it did so.
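For anyone picturing the geometry, here’s a tiny sketch in matplotlib terms (an assumption on my part that the original code worked similarly; with an arc3 connection style the sign of rad controls which way an edge bows, so a canonical node order plus a per-type sign keeps the two curves visibly apart):

```python
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch

def draw_edge(ax, p, q, kind):
    a, b = sorted([p, q])               # canonical order for the node pair
    rad = 0.2 if kind == "A" else -0.2  # each connection type bows its own way
    # The amusing bug: flip (a, b) AND flip the sign of rad, and the two
    # "fixes" cancel out, leaving the edges overlapping again.
    ax.add_patch(FancyArrowPatch(a, b, arrowstyle="-",
                                 connectionstyle=f"arc3,rad={rad}"))

fig, ax = plt.subplots()
draw_edge(ax, (0, 0), (1, 1), "A")
draw_edge(ax, (1, 1), (0, 0), "B")      # same pair, other type: now visibly separate
ax.set_xlim(-0.5, 1.5); ax.set_ylim(-0.5, 1.5)
plt.show()
```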

3.5 and 4 have been retired. Free accounts get a limited number of generations with the main 4o model, then it defaults to “4o-mini” until the clock resets. I’ve noticed 4o-mini is significantly less intelligent than 4o. When I hit the paid-user cap on 4o, I just walk away until it resets.
Don’t ask me what the caps are for free users or how long they take to reset. Free users also get a very limited number of image generations from 4o per day. I believe free users’ only choices of models are currently 4o and 4o-mini.

Edited to mention 4o and 4o-mini differences.

Thanks for that. Apparently there have been significant changes since the ChatGPT 3.5 days. I asked ChatGPT itself about that, and for the first time ever, instead of an almost-instant response, I got a brief “searching the web” response, and then it came back with this:

You’re currently interacting with ChatGPT, which utilizes OpenAI’s GPT-4o model. This model, introduced in May 2024, is capable of processing text, audio, and images. Notably, GPT-4o is available to all users, including those on the free tier, albeit with certain usage limitations.

Usage Limits

While OpenAI doesn’t publicly disclose exact usage caps, reports indicate that free-tier users may encounter message limits, especially during periods of high demand. These limits are implemented to ensure fair access and system stability. If you reach the usage limit, you might receive a message stating, "You’ve reached our limit of messages per hour. Please try again later."

For users subscribed to ChatGPT Plus, the usage limits are higher, allowing for more interactions before encountering restrictions.

If you’re experiencing limitations, consider spacing out your queries or upgrading to ChatGPT Plus for increased access.

My company uses an old (15 years) custom program, cobbled together by a former employee, to take a wall of text we write and format it into HTML code for an emailed newsletter. Over the years the output has gotten a little flaky, but usually in ways I could easily tweak by hand. But the errors have grown, and fixing them by hand has become more annoying. So I asked AI to fix the HTML, and while it generally sorts it out, it does it differently almost every time. Sometimes it is perfect. Sometimes it is fixed but it changed the font. Sometimes it removes line breaks cuz reasons. Sometimes the banner sizes change a bit… not much, but noticeable. Sometimes it skips including the footer. And so on.

I have tried, repeatedly, to tweak the instructions I give it before posting (I made a copy) so there is more info (like, “be sure to include the footer”), but I feel like a dog chasing its tail. I’d really like to get one output that is the way I want it and then tell the AI to remember to keep doing what it just did (indeed, I even asked the AI to write instructions that would achieve that… it produced what seemed like a reasonable answer, but the final results are still weird).