When you use an app like ChatGPT (or presumably Claude), you aren’t talking directly to the LLM itself. The app is a wrapper that first pre-processes your input and supplements your prompt with additional information and tools, like a module that can run Python code.
So it’s not really the LLM itself running the code (it can’t); it’s other software alongside the LLM that helps it along. The LLM generates the code based on the context you give it, the other software runs it and returns the output to the LLM, which then summarizes it and returns the result to you.
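A rough sketch of that loop in Python (everything here is made up for illustration; the real pipeline inside ChatGPT is far more elaborate and not public):

```python
# Hypothetical sketch of the "code interpreter" loop around an LLM.
# ask_model() is a stand-in for an LLM API call, with canned replies so the sketch runs end to end.
import subprocess
import sys
import tempfile

def ask_model(prompt: str) -> str:
    if "Write Python" in prompt:
        return "import math\nprint(math.sin(math.radians(62.1)))"
    return "The sine of 62.1 degrees is approximately 0.884."

def run_python(code: str) -> str:
    """The HOST runs the generated code in a separate process; the LLM never executes anything."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

user_prompt = "What is sin(62.1 degrees)?"
code = ask_model(f"Write Python that answers: {user_prompt}")   # 1. the LLM writes code
output = run_python(code)                                       # 2. traditional software runs it
answer = ask_model(f"The code printed {output!r}; summarize it for the user.")  # 3. the LLM summarizes
print(answer)
```

The key point is that `run_python` is ordinary host software; the model only ever produces and consumes text.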
In the ChatGPT app, it’s pretty clear when code is actually being run by this other software, because there are visual indications like “Analyzing…” or “Thinking…”, and the code being executed is shown in its own distinct formatting.
With the default 4o model, you have to explicitly ask it to do this: ChatGPT - Sin calculation result.
With one of the “reasoning” models like o4-mini-high, it will almost always do this by default: ChatGPT - Sin 62.1 Calculation
I don’t know if Claude does this (apparently not in the default chat mode?).
Either way, though, none of this is perfect, and it should always be treated with skepticism. When you enter something into the ChatGPT app’s input box, it goes through a lot of layers and a lot of pre-processing, not just for coding but also for censorship, retrieval-augmented generation, default system prompts that dictate its writing style and personality, context mixing with your previous chat history, etc. And then there’s a “temperature” setting (you can’t control this in the ChatGPT app, but you can via the API) that controls how random its output is.
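For comparison, calling the model through the API skips most of those app layers and lets you set the temperature yourself (a minimal sketch using the OpenAI Python SDK; the model name and prompt are just illustrative):

```python
# Minimal sketch using the OpenAI Python SDK (v1.x); model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is sin(62.1 degrees)?"}],
    temperature=0.0,  # low = mostly deterministic; higher = more random sampling
)
print(response.choices[0].message.content)
```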
The LLM itself is difficult to change (it requires retraining and re-fine-tuning, which is where most of the energy in an AI product goes), and when that happens, it’s released as a new model. But separate from the LLM, the app around it (“ChatGPT”) is just traditional software that updates multiple times a week, and each of those changes can alter the output you see even if the LLM itself hasn’t changed. None of it is really documented anywhere, so you never really know “why” it’s returning something different unless OpenAI writes a marketing blog post about it (like they did after the recent “sycophancy” issue where ChatGPT started worshiping its users).
All of this together means that humans don’t completely understand what a complex LLM system is doing, and we have to kinda fake it by adding before- and after-the-fact “guardrails” around the LLM, because the LLM itself is largely a black box. These guardrails are just traditional software that rewrites your prompts, injects other tools, and massages the LLM’s outputs before you see the result. That software is what’s running the Python script for you, not the LLM itself.
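Conceptually, that wrapper is just an ordinary request pipeline with the model call in the middle (a hypothetical sketch of the idea; every helper below is a made-up stub, not any vendor’s actual code):

```python
# Hypothetical guardrail pipeline; every helper here is a made-up stub, not any vendor's real code.
def sanitize(text: str) -> str:
    return text.strip()  # content moderation / policy checks would happen here

def add_system_prompt(text: str) -> str:
    return "You are a helpful assistant. Be concise.\n" + text  # injected style & personality

def add_context(text: str, history: list[str]) -> str:
    return "\n".join(history + [text])  # mix in prior chats, RAG results, etc.

def call_llm(prompt: str) -> str:
    return f"(model output for: {prompt!r})"  # the actual black box, stubbed out

def post_filter(text: str) -> str:
    return text  # output moderation / formatting before the user sees it

def handle_request(user_input: str, history: list[str]) -> str:
    prompt = add_context(add_system_prompt(sanitize(user_input)), history)
    return post_filter(call_llm(prompt))  # only call_llm involves the LLM itself

print(handle_request("What is sin(62.1 degrees)?", history=[]))
```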
It’s also how “agentic” systems and, more recently, the Model Context Protocol (MCP) work… these add-ons let AIs work with traditional software, databases, and outside information more easily, but they are just layers on top of (or before/after) the LLM. The LLM itself can’t run Python (but it CAN generate Python code for helper software to run).
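From the host software’s side, a tool layer is little more than a lookup table of ordinary functions that get run whenever the model’s output asks for one (again a hypothetical sketch, not how any particular agent framework or MCP server is actually written):

```python
import json
import math

# Hypothetical tool layer: plain Python functions the host exposes to the model.
TOOLS = {
    "run_sin": lambda degrees: math.sin(math.radians(degrees)),
    "lookup_user": lambda user_id: {"id": user_id, "name": "example"},  # stand-in for a DB query
}

def handle_model_output(model_output: str) -> str:
    """If the model's reply is a tool request (JSON), run the tool; otherwise pass the text through."""
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output                  # ordinary text answer, nothing to run
    tool = TOOLS[request["tool"]]            # the HOST picks and executes the function
    result = tool(**request["args"])
    return f"tool result fed back to the model: {result}"

# The "model output" here is hard-coded; in reality it would come from the LLM.
print(handle_model_output('{"tool": "run_sin", "args": {"degrees": 62.1}}'))
```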
Strictly IMHO: LLMs are best when you use them for their intended purpose: language analysis and generation. That they appear to run code and do math at all is because the companies behind them want to turn them into general-purpose assistants, which they inherently are not, and the only way to do that is to fake it with extra layers on top.
If you want something done in Python, just ask an LLM to write the code and then run the Python yourself. It’s much, much more reliable that way.
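Using the sin(62.1°) example from the chats linked above, the snippet an LLM hands you is only a couple of lines, and running it yourself takes seconds:

```python
import math

# Compute the answer directly instead of trusting the chat app's tool pipeline.
print(math.sin(math.radians(62.1)))  # ~0.8838 (62.1 is in degrees, so convert to radians first)
```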