AI is wonderful and will make your life better! (not)

One of the problems with ubiquitous AI output is that it then becomes the input, recursively, as ever-larger training datasets are scraped together. Since we’ve seen that we’re not very good at automatically recognizing AI output, efforts to combat this can at best slow it down a bit, not stop it. Unfortunately, this can lead to worse and worse outcomes in what is known as model collapse.

It could get ugly.

https://www.nature.com/articles/s41586-024-07566-y

Fascinating paper, thank you for that. In particular, it’s worth skipping down to “Example 1” to see the degradation in action.

So, what do we know from various studies:

  • per this paper, introducing AI-generated data into model training degrades models by both eliminating real (mostly but not exclusively tail) probabilities and introducing new, false ones (a toy illustration of the tail loss follows this list)
  • per OpenAI, the introduction of a larger corpus of data into model tuning will create more hallucinations, not fewer
  • calibrating AI models through RLHF (reinforcement learning from human feedback) creates hidden incentives to give users false but pleasing responses
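
That first point is easy to see in miniature. Below is a toy sketch (my own illustration, not code from the paper) of what happens when each “generation” of a model is fitted only to samples from the previous generation: the tails go first, then most of the diversity.

    import numpy as np

    rng = np.random.default_rng(0)
    SAMPLES_PER_GEN = 20   # deliberately small: sampling error compounds faster
    GENERATIONS = 500

    # "Real" data: a standard normal with healthy tails.
    data = rng.normal(0.0, 1.0, SAMPLES_PER_GEN)

    for gen in range(1, GENERATIONS + 1):
        # Fit a toy "model" (just mean and std) to whatever data we have...
        mu, sigma = data.mean(), data.std()
        # ...then discard the data and train the next generation purely on
        # samples drawn from the fitted model, i.e. on model output.
        data = rng.normal(mu, sigma, SAMPLES_PER_GEN)
        if gen % 100 == 0:
            print(f"generation {gen:3d}: fitted std = {sigma:.5f}")

    # Sampling error compounds across generations: the fitted spread drifts
    # toward zero, so rare (tail) events vanish first and eventually almost
    # all of the original diversity is gone.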

Take my money and invest it now!

I tried using Copilot today – for the first time – to fix a formatting error I was having in Excel. It wouldn’t clear the formatting so I could re-draw a table.

Copilot gave me several helpful options, all of which failed, and one of which basically ruined the spreadsheet by repeating the content of the top row what appeared to be an infinite number of times, such that the only way I could delete the content was to delete that row of my spreadsheet.

I estimate it took about 20 minutes for Copilot to fail to help me.

In other news, AI companies are signaling that they want taxpayers to foot the bill for the AI products we didn’t ask for.

WCGR?

In an interview on CNBC, Alex Karp, Palantir’s chief executive, criticised Burry and other short-sellers for “trying to call the AI revolution into question”.

LOL. How dare they?

I use ChatGPT to help with some lesson prep. It’s not perfect, of course, but it can do some tasks.

I don’t use the paid version but the free version is supposed to keep track of things. I teach both Japanese and Taiwanese students and have things translated to help the students.

Keeping track of which language a student needs should be really easy. I would expect a five-year-old to remember that and yet ChatGPT consistently gets it wrong.

I just got burned by ChatGPT.

Indeed.

I have a YouTube channel where I talk about film photography. Last week I got a new-to-me Minox 35 ML camera–Minox is the company that made those famous spy cameras you saw in James Bond movies with guys taking photos of documents in darkened offices.

I made a YouTube Short about the camera where I introduced it and gave a quick run through of its features.

I had never heard “Minox” pronounced out loud, so I asked ChatGPT.
It told me with cheerful confidence that the camera name was commonly pronounced “MYnox” in the USA in the 50s and 60s (and “MEEnox” in Germany).
Nice! I asked it for cites, and it gave me links to a few forum threads and a 1952 Popular Photography magazine.

So I made my video and posted it. On Monday someone left this short comment: “It’s not MYnox”

I was about to give that guy a piece of my mind when I went back to the chat session and started picking away at the references given. It turned out that the forum threads were at best inconclusive, but really leaned toward “MINox”, as in “mint”.

I started interrogating the LLM, and after quite a bit of back and forth it sheepishly admitted that it had been too eager to give an answer and had misled me.
I asked “If I look in that 200 page Popular Photography magazine, I’m not going to find it, am I?”
To which it responded “Sorry, no. There might be mentions of the camera in ads but I found no explicit ‘how-to’ explaining how to pronounce it.” I asked it “You can’t find one scrap of evidence anywhere saying MYnox, can you?” and it apologetically admitted “Sorry, I cannot find any evidence that the name was pronounced MYnox.”

I felt like I had just dealt with a 10-year-old boy who, caught in a lie about doing his homework, doubles down and insists he did it until the last moment when, faced with mounting evidence, he finally admits defeat and tells me the dog ate it.

This is an important lesson for me: Shame on me for not chasing down the references. This was mildly embarrassing (the darned video has too many views now for me to replace it), but it was a benign error. We can all imagine crucial errors in our own work lives where we could cause great damage with this kind of absolutely confident wrong answer.

Relevant:

I find myself in an interesting predicament with AI: there are things we are told not to do, such as being polite to them or asking them why they made such a critical error.

The problem is, I want to treat AI as I would a human. I am afraid that if all I do is curse it and shout terse orders, one day that behavior will leak into my interactions with humans.

With that said, it’s not totally useless to ask why it made a mistake. No, it’s not going to learn a little bit and do it better next time, but it gives me insight into how it could have made the mistake and helps me understand traps like this better.
In addition, the OpenAI ChatGPT user interface provides substantial short term and long term memory features, so not only can you influence its responses in the same chat thread, but you can tell it “Remember this…” and it will do so.

For example, I use ChatGPT to help me find Bible quotes and commentary about, say, a song we are going to sing at church, but I have clearly told it: “When I ask you about religious texts, I cannot trust what you say directly. You may paraphrase, but always give me the original quotes so I can verify them.” It does honor that.

The long term memory is pretty interesting. You can go into the user interface and see your own long term memory, and even edit it if you wish.

Want to know what it’s like to not have that awesome short and long term memory? Simple…select the same LLM (GPT5) in Copilot. At least in my environment Copilot does not have any concept of long or short term memory, remembering barely a few “turns” in the current chat. It’s like talking to a distracted friend who loses their train of thought.

The problem with asking a chatbot to explain its mistake is that it literally does not know. It will assemble its “explanation” using the same predictive language model it used to generate the wrong answer in the first place. LLMs have zero capability for introspective self awareness.

Sure, you can ask it questions that shine light on the structure and operation of that model, but its self-diagnosis of error is not based on anything at all.

I don’t agree totally. I have regularly gotten positive results where the LLM provides an explanation like “I saw this evidence X, Y, and Z, and misinterpreted this quote over here” or “I was relying on my cached memory of the document and had not refreshed the original.” These tend to be mistakes a human might make, and they are understandable.

In many cases it even offers help on how I can guide it (e.g. help on ensuring it throws out its cache at certain times).

For blatant hallucinations you’re spot on, but there is a whole spectrum of “wrong”, of which “patent lies spoken with confidence” are pretty far along the way.

ETA: And when it is deep into blatant-hallucinations-with-smug-confidence territory, the desire to give it a piece of one’s mind, perhaps cursing it out, is difficult to suppress.

On a FB genealogy translations group, someone asked for a Welsh translation. A reply suggested DeepL, and when they put in the first sentence, it changed 12 children to 14 children. Such a stupid error to introduce!

Y’see I come to it from the opposite direction. I do not wish to grant a mere tool the dignity due a sentient mind.

Those were causing trouble long before AI:

The problem is that these explanations might easily not be true. It’s just a response with a higher probability than others, based on a large corpus of similar questions.

In our accounting business, we use Google Workspace for client files.

Had a prospect upload AMEX statements, 4 years worth, for review.

One of our tasks is simply counting transactions, by month, to calculate the amount of work the catchup bookkeeping will take. Most statements have a transaction count, AMEX does not. Usually we have an assistant for this, but this week (and next), she is out of office, so it’s up to me.

So I decided to make things easy for my lazy ass. Ask Gemini how many transactions there are in a statement. There were two ways to do this:

  1. In the folder itself, asking it to give me a count, by month, of the transactions in all statements.
  2. In each statement, one-by-one.

(I manually counted the first 6 months of statements to check G’s accuracy.)

The first was a basic failure. It only counted half of the statements, mis-monthed (is that a word?) them (for example, the statements start in January 2021, but Gemini told me the first statement was December 2020), and got the counts horribly wrong. Revising the prompt did nothing. Gave this up after 10 minutes.

When I asked Gemini month-by-month, with the statement open, G was more accurate, but rarely exact. February would have 64 transactions, but G would tell me there were 58. Revising my prompt helped, but only one month matched the actual count exactly.

This doesn’t really have anything to do with receiving wrong training in the LLM model or whatever (at least, it seems to me). I’m just asking the thing to count the number of transactions, telling G how to identify what is a transaction, etc.

It just can’t do it and yet you would think this is the sort of task AI could handle easily.

I feel that this is a different issue.

Maybe sometimes. Like many others, I think you’re excessively focused on “how I think it works” rather than on the actual results which are produced by an almost unimaginably complex artificial neural net with (and I know I keep using this term a lot) important emergent properties that even the designers don’t fully understand.

I can attest to the truth of what @minor7flat5 is saying. In fact I just had a recent dialog with ChatGPT which was apparently sufficiently challenging that it spent an unusual amount of time thinking and researching (showing its steps along the way, which was pretty cool).

I didn’t challenge its answer, but I did point out a related fact (previously unstated) that I thought it might have missed, and it came back with something like (paraphrasing) "yes, on rethinking this, the figures I gave you before were wrong because {reasons}, so I’ve re-evaluated based on {these new assumptions} and here’s the revised response and the new figures."

The second one, to the best of my ability to determine, was a pretty accurate response to a very complicated question. The whole dialog felt like – and indeed was – just like discussing a problem with an intelligent and knowledgeable person.

Because it’s a large language model, it infers counts by recognizing repeated patterns and structural markers in the input. With a very clear layout (like a CSV or JSON), it can approximate the number of records by mimicking similar examples it was trained on—but it doesn’t actually count or compare past cases.
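
Which is why, for a task like that, the reliable move is to take the counting away from the model entirely. A minimal sketch, assuming the statements have been exported to a hypothetical amex_statements.csv with one row per transaction and a Date column in MM/DD/YYYY format (the real AMEX export may well differ):

    import csv
    from collections import Counter
    from datetime import datetime

    counts = Counter()
    with open("amex_statements.csv", newline="") as f:   # hypothetical export file
        for row in csv.DictReader(f):
            # Bucket each transaction by year-month, e.g. "2021-01".
            month = datetime.strptime(row["Date"], "%m/%d/%Y").strftime("%Y-%m")
            counts[month] += 1

    for month in sorted(counts):
        print(month, counts[month])

Unlike a chat model, this either produces the exact count or fails loudly on a row it cannot parse.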

Better to ask than to not even try.
It was through this kind of investigation that I found out how Copilot was munging my questions about a simple spreadsheet. It gave me very cogent explanations about exceeding context windows and how LLMs don’t see spreadsheet rows like humans do. It also gave valuable advice on how to mitigate this (e.g. splitting files, using CSV, or providing an index). These recommendations worked well enough for me to understand the limitations I was hitting.
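
For what it’s worth, the splitting advice is easy to act on outside the chat as well. A rough sketch, assuming a hypothetical big_sheet.csv and a chunk size guessed to fit comfortably in one prompt:

    import csv

    CHUNK_ROWS = 200   # rough guess at what fits in one prompt; tune per model

    def write_chunk(header, rows, part):
        # Each part repeats the header so it can be pasted into a chat on its own.
        with open(f"big_sheet_part{part}.csv", "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)

    with open("big_sheet.csv", newline="") as f:   # hypothetical input file
        reader = csv.reader(f)
        header = next(reader)
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == CHUNK_ROWS:
                part += 1
                write_chunk(header, chunk, part)
                chunk = []
        if chunk:   # leftover rows at the end
            part += 1
            write_chunk(header, chunk, part)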

At the end of the day we have to be able to give everything the sniff test, but I’d rather leverage the LLM to improve my understanding of what went off the rails–if it gives a good verifiable answer I come out ahead; if not, I just say a bad word under my breath and close the window.

Me: discuss how you backcheck when I challenge an answer, for example when you say a file was cached.

CoPilot: When I explain a mistake like relying on a cached site, I’m not accessing a literal trace or internal log—I’m inferring the cause based on your critique, the conversation history, and known failure patterns in my architecture. I don’t replay my original reasoning or retrieve exact tool states. Instead, I diagnose post hoc by matching symptoms to typical breakdowns (e.g. stale data, misalignment, skipped steps) and generate a plausible explanation grounded in how I usually behave. It’s a reasoned hypothesis, not a forensic audit.

To your point, I simplified my prompt and am now getting accurate counts.

One thing that life keeps telling me is that, as soon as I complain about not being able to figure out something to a listening audience, I will axiomatically find the solution within about 5 minutes. :zany_face: