My understanding is that the front-ends to the big LLMs do indeed just take the entire previous conversation, tokenize it, and cram as much of that as will fit into the “context window.” I learned a lot about that when I was trying out AI Dungeon and NovelAI back in the GPT-3 days, before ChatGPT (on GPT-3.5) got unleashed on the world. Their front-ends allowed a lot of user control over what context got fed back into the LLMs.
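Roughly, the “cram as much as will fit” step looks something like this sketch. Token counts come from the tiktoken library; the budget number and the drop-oldest-first policy are just assumptions on my part, and real front-ends are fancier about what they keep.

```python
# Minimal sketch of fitting conversation history into a token budget.
# CONTEXT_BUDGET is a made-up number, not any vendor's real limit.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 8_000  # hypothetical token budget for history

def fit_history(messages: list[str]) -> list[str]:
    """Keep the most recent messages whose combined tokens fit the budget."""
    kept, used = [], 0
    for msg in reversed(messages):       # walk newest-first
        n = len(enc.encode(msg))
        if used + n > CONTEXT_BUDGET:
            break                        # older messages simply fall off
        kept.append(msg)
        used += n
    return list(reversed(kept))          # restore chronological order
```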
Before the LLM gets it, various summarization passes may also have been run, and their output gets fed into the context as well; but once that’s assembled, it’s the entire input that goes into the giant LLM engine. There are additional instructions they add to the context, things like “you’re an assistant,” “be helpful,” “don’t do harmful things,” etc. They’re constantly refining those to prevent jailbreaks without having to train up an entire new LLM.
ChatGPT has both the user’s custom instructions and separate, automatically generated “memories” that are available across all the user’s conversations when the feature is active. Those get crammed into the context too at input time. A user who knows certain things should be remembered, because they might fall out of the context window or be needed in separate conversations, can tell ChatGPT to put those critical things into memory, or manually type them into the custom instructions.
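Putting those pieces together, my mental model of the final assembly is something like the sketch below. The field names and ordering are my guesses, not OpenAI’s actual code; the point is just that everything ends up concatenated into one input.

```python
# A guess at how the final context gets built before it hits the model.
# Every name and the ordering here are assumptions for illustration only.
def build_context(system_rules: str,
                  custom_instructions: str,
                  memories: list[str],
                  summary: str,
                  recent_turns: list[str],
                  user_prompt: str) -> str:
    parts = [
        system_rules,                          # "you're an assistant", "be helpful", ...
        custom_instructions,                   # the user's standing instructions
        "Memories:\n" + "\n".join(memories),   # cross-conversation "memories"
        "Summary of earlier conversation:\n" + summary,
        *recent_turns,                         # whatever recent history still fits
        user_prompt,                           # the new message
    ]
    return "\n\n".join(p for p in parts if p)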
As it’s creating its output, it also has to append what it just generated to the context and pass everything back into the LLM to get the next part. I’m fuzzy on how many output tokens at a time it can produce without sending the entire context back in again, but for any given prompt you send, I know it re-enters the model multiple times.
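The append-and-go-again loop is simple to show in pseudocode form. Here the model call is a stand-in function passed as a parameter, since I obviously don’t have the real one; real serving stacks also cache earlier computation (the “KV cache”) so they don’t redo the whole context from scratch on every step.

```python
from typing import Callable

def generate(next_token: Callable[[list[int]], int],
             context_tokens: list[int],
             max_new: int,
             eos: int) -> list[int]:
    """Append each generated token to the context, then ask the model again."""
    output = []
    for _ in range(max_new):
        tok = next_token(context_tokens)   # one pass over the current context
        if tok == eos:
            break                          # model signals it's done
        output.append(tok)
        context_tokens.append(tok)         # what it just produced becomes input
    return output
```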
That is definitely not true, at least in ChatGPT’s case. The maximum individual prompt length is handled by the front-end UI, and for ChatGPT it is much smaller than the 128k full context size. I’ve tried to measure it by feeding long prompts into a tokenizer first, and I’ve found the UI maxes out somewhere under 32k tokens. I’m not sure why they decided on that per-prompt limit, but it’s there.
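For anyone who wants to repeat the measurement, this is roughly how I check a prompt before pasting it in. The 32k figure is just my observed ballpark, not a documented limit, and I’m assuming the UI counts tokens the same way cl100k_base does.

```python
# Count a prompt's tokens before pasting it into the ChatGPT UI.
# SUSPECTED_UI_CAP is my own estimate, not an official number.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
SUSPECTED_UI_CAP = 32_000

def will_probably_fit(prompt: str) -> bool:
    n = len(enc.encode(prompt))
    print(f"{n} tokens (suspected cap ~{SUSPECTED_UI_CAP})")
    return n < SUSPECTED_UI_CAP
```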
There are tasks that the various AI companies claim their LLMs are good at, like summarizing long documents and extracting critical facts from them. They need those huge context windows to do that effectively. I’ve found ChatGPT is significantly better than Gemini at the “extract crucial facts from its context” task, but still nowhere near perfect. When summarizing a long document that fits within the 128k context, it frequently doesn’t just leave things out; it hallucinates things that don’t match what was in the document. When writing a story, that amounts to characters remembering events differently.