Citation glitch being perpetually reiterated by AI

It’s not really that prevalent… 22 results across Google Scholar? At a glance, it seems like most, if not all, of them are from Iranian researchers or OCR errors. It’s not exactly a scientific crisis.

I’m not convinced this was even an LLM bug, rather than just LLMs absorbing that phrase as a token they’ll probably never use unless very specifically and artificially prompted. They have enough associations around that phrase to know that it’s probably just a human mistake. ChatGPT very explicitly tells you it’s nonsense.

I mean, there are 17,500 “lorem ipsum” results on Google Scholar, and we don’t blame that on ChatGPT…

Other than translation error, yes indeed.

I think that’s a reasonable explanation for the Iranian papers. But not every paper went through a Farsi translation, at least as I read it.

Nice work on your part. Thank you.

I fully support its analysis and criticism of my writing.

I was trying hard to keep it simple enough that a zero-knowledge noob could follow how LLMs’ behavior can be 100% logical and sensible in low-level operation but still be highly illogical and nonsensical in high-level outcome.

I hope @dolphinboy returns to critique the effort.

Since you bring that up…

A few days ago I told ChatGPT this:

I want to make an image that is a joke on the Simpsons “old man yells at cloud” image. I want a newspaper clipping with a photo of someone resembling Gary Oldman shaking his fist at a fluffy cloud overhead. The headline of the clipping is “Oldman yells at cloud”. The clipping is being held in someone’s hand by the lower right corner.

And it had no problem creating it. In the original Simpsons image, only the headline is readable, with the body text just being vague squiggles. But ChatGPT filled it in with lorem ipsum and other nonsense text. So ChatGPT knows that it is placeholder text and recognizes a case where it was suitable.

ETA: holy crap, just for fun I just now let Google Lens try to translate that. For the text under the photo, it actually came up with something:

I am very sorry for my pain, it will be settled by the dipiscing of the elid. I was killed by the augury or the rising of the bird. He looked at him and looked at him and looked at him and looked at him.

The text on the right it didn’t translate.

Thanks everyone. Yes, I now have a better understanding of what AI is and how hallucinations can occur. Unfortunately, it sounds like there’s not much they can do to keep it from happening.

I coincidentally have been listening to a podcast called Shell Game that talks about the advantages and pitfalls of creating AI voice agents that can perfectly mimic any person’s voice. These agents can handle tasks like answering or making phone calls without the other person knowing it’s not a human they’re talking to. The two big bugaboos are latency, because the AI has to go from voice to text, do a search, and then go from text back to voice, which takes a few seconds and causes an unnatural pause; and hallucinations, where the voice agent makes up an answer to a question that is clearly incorrect.
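To picture where that pause comes from, here is a toy sketch of the round trip. The three stage functions and their delays are made-up placeholders standing in for real speech-to-text, LLM/search, and text-to-speech services, not any actual API:

```python
import time

# Placeholder stages; in a real agent each of these would be a network call to a
# speech-to-text service, an LLM or search backend, and a text-to-speech service.
def speech_to_text(audio: bytes) -> str:
    time.sleep(0.5)   # pretend transcription latency
    return "what are your hours on Saturday?"

def answer_question(question: str) -> str:
    time.sleep(1.5)   # pretend LLM / search latency
    return "We're open nine to five on Saturdays."

def text_to_speech(text: str) -> bytes:
    time.sleep(0.5)   # pretend synthesis latency
    return b"<synthesized audio>"

start = time.perf_counter()
reply = text_to_speech(answer_question(speech_to_text(b"<caller audio>")))
print(f"Round trip took {time.perf_counter() - start:.1f}s of dead air on the call")
```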

It’s possible we’ve already interacted with AI voice agents and didn’t know it, but I think I could tell whether I was talking to one that uses current technology or not. Someday it won’t be possible to tell the difference, and if a voice agent can still randomly fabricate information I don’t see how anyone will be able to trust it.

Why the surprise? Lorem ipsum itself originated as a real Latin text (slightly mangled, like missing the “do” from the front of “dolorem”).

I had assumed the generated text to be much more mangled gibberish. And especially didn’t expect a creepy-ass passage suitable for a Clive Barker novel.

Part of it depends on the starting point; there’s a fairly famous case where someone asked Google’s AI ‘how many rocks shall I eat?’

A human, faced with that question, would wrinkle up their face and respond ‘WTF? Don’t eat rocks, you idiot’.

(Simplified:) A language model that is trained to try to earnestly answer questions starts answering and, having started, just carries on. In a sense, it is constrained to make up an answer by the fact that it is conditioned to always open with something like “sure, I can help, here’s the answer…”
But because the question itself is not a common one, there’s no relevant training data on which to base a real answer, so it just trots out something that sounds like an answer, and so the response was:

According to UC Berkeley geologists, people should eat at least one small rock a day. Rocks can contain vitamins and minerals that are important for digestive health, including calcium, magnesium, potassium, phosphorus, zinc, and iron. Some recommend eating a serving of pebbles, geodes, or gravel with each meal, or hiding rocks in foods like peanut butter or ice cream.

If the question had been ‘should people eat rocks?’ the answer might have been more coherent and truthful, but the question subtly gaslights the LLM into assuming there is a number of rocks that people should eat.

I think stating it like this is perhaps misleading - it’s a predictive association from the whole series of tokens that have come before to the next one; the LLM isn’t predicting the next word/token from just the single previous word/token, it’s doing it on the basis of a large chunk of the preceding conversation, called the ‘context window’.
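To make that concrete, here is a minimal sketch of next-token prediction over a context window, using the small open GPT-2 model from Hugging Face purely as a stand-in (nothing ChatGPT-specific; the prompt and the top-5 printout are just for illustration):

```python
# A causal language model scores candidate next tokens by attending over the
# entire context window, not just the single previous token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The old man shook his fist and yelled at the"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Only the scores at the last position are used to pick the *next* token,
# but those scores were computed from every token in the context window.
top = torch.topk(logits[0, -1], k=5)
for score, token_id in zip(top.values, top.indices):
    print(repr(tokenizer.decode(token_id)), float(score))
```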

Because some scientists are using AI to write “their” papers. Article.

And it should be noted that because they don’t actually know anything, they also don’t know what they don’t know. They can’t notice a hole in their knowledge base when they don’t actually have a knowledge base in the first place.

In fact they remind me of something I read about in one of Oliver Sacks’s books, about how some people who suffer brain damage that renders them unable to access memories or form new ones will spend a period just “fabulating”, endlessly making things up about themselves and their past, basically trying to fill the unfillable hole in their mind in a stream of consciousness that vanishes as fast as they produce it. Sacks compared it to a cartoon character running off a cliff and not falling until they look down; as long as they could keep up the fabulation they could ignore the mental void they were running over.

“AI” I think does much the same, except it’s not bright or self aware enough to ever “look down” and realize it’s metaphorically running on nothing.

The ‘knowledge’ argument is an interesting one; they don’t ‘know’ things in the same way as humans experience their own knowledge (or at least we have little reason to imagine they do), but they do embody knowledge, which is a kind of knowing - perhaps philosophically in the same sense that we might talk about whether a dictionary ‘knows’ a word or not.

I think a useful analogy would be saying that they “know” some things the way an insect knows how to fly and crawl. They have the information needed to perform those tasks but it’s all embedded in their reflexes; they neither understand it nor have any means of retrieving it.

In the same way AI chatbots have lots of information embedded in them, but they don’t “know” or understand any of it. They are working on pattern recognition as said, and have no ability to extract the knowledge within the patterns and recognize it as facts.

There will come a point, I think, where it becomes difficult to say for sure whether they ‘know things’ - especially with systems that are more complex than the simple LLM - for example if they are configured to perform multiple passes on the answer to check the coherence and truthfulness of their own answers, it becomes analogous to introspection.

It would still be hasty to say they’re experiencing knowledge like a human does, but I’m wary of the sorts of arguments (not yours) that contain the word ‘just’ - i.e. ‘they’re just predictive text’ or ‘they’re just a collection of statistical weights’ - simply because we could describe a human as ‘just a collection of atoms’.

Oh, I fully expect that barring a collapse of civilization, we’ll eventually create human equivalent AI. Probably even conscious AI.

I just don’t think these glorified chatbots are even close, or on the right path to achieve anything like it. They are very limited tools that are being badly misused.

I think the question that raises is, how are they checking it for truthfulness?

It seems to imply that there does exist some repository of facts, such that, when an LLM generates a sentence saying that cats are blue or that the correct adult human dosage of rocks is >0, it can extract that factual claim and compare it against some body of factual knowledge.

But is that body of factual knowledge in some way separate from the training data and neural-net/statistical weighting which created the original claim? If not, why would there be a different result? Or, to put it another way, if the same encoding produces two contradictory results, on what basis should one be preferred over the other?

And if the body of factual knowledge is a separate thing, then… why not use that in the first place? The answer to which must surely be that if we knew how, we’d have done it already.

(We skip lightly over the questions of who decides what goes into the body of factual knowledge, how we define a fact etc. etc.)

It seems to be a simpler task for an LLM to be fed a statement and asked for a critique than to be asked to start fabricating an answer from scratch.

And I think this is in part because of the sequential way in which they process data - because they only really ‘know’ what they are going to say immediately next, not what the whole of the idea is that they are trying to communicate, simple LLMs are not great at the kinds of tasks that require humans to ‘look ahead’ or take multiple passes on the answer.

For example I asked an earlier version of ChatGPT to write a sentence that is truthful and self-descriptive, and ends with the phrase ‘…and that’s why this sentence contains 27 words’ - it failed miserably because it just ran once through the task - the result was something like “the boiling point of water is 100 degrees Celsius, and that’s why this sentence contains 27 words” - it consistently failed to get the word count right and it struggled in general to write something that was self-descriptive and true.
Humans typically approach a task like that by making multiple passes, maybe prioritising the self-description first, then checking their answer, then tweaking the word count by simplifying or elaborating - an LLM taking one headlong run at the task has no opportunity to do that.

But if you feed the results of the first pass back into another LLM, wrapped with the question ‘does this meet the specification and if not, how is it deficient?’ - you get a useful direction that can help to iterate toward a correct answer - and you also have the advantage of framing the question slightly differently, which means it’s another path through the embodied knowledge imparted by the training data, so there’s another opportunity to get it closer to correct.
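As a rough sketch of that kind of loop (the `ask_llm` function below is a made-up stand-in for whatever chat API you actually call, not a real library function):

```python
# A generate-then-critique-then-revise loop, sketched with a placeholder LLM call.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LLM of choice")

def draft_and_refine(task: str, max_passes: int = 3) -> str:
    answer = ask_llm(task)  # the first headlong run at the task
    for _ in range(max_passes):
        critique = ask_llm(
            f"Task: {task}\nProposed answer: {answer}\n"
            "Does this meet the specification? If not, how is it deficient?"
        )
        if "meets the specification" in critique.lower():
            break  # crude stopping test; a real check would be more careful
        # Revising with the critique frames the problem differently, which takes
        # another path through the knowledge embodied in the training data.
        answer = ask_llm(
            f"Task: {task}\nPrevious attempt: {answer}\nCritique: {critique}\n"
            "Write an improved answer."
        )
    return answer
```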

I believe some of the newer models work in this sort of way. I think if someone had asked Google’s AI ‘Do geologists actually recommend eating at least one small rock per day?’, it would not have had any trouble in deciding that was untrue, simply because checking an answer is easier than writing one.

We saw this in a recent thread where the OP was under the impression that the testes of a human were inside the blood-brain barrier, because AI said so.
AI seems to have said so because there is such a thing as the blood-brain barrier and there is another thing called the blood-testis barrier, so the model seems to have just glitched on the occurrence of the terms relating to blood and barriers, and conflated the two things.

It’s interesting to consider the failure modes of LLMs vs humans… sure, regurgitation by correlation is bound to create hallucinations (false positives), but we don’t fully understand how and when human minds make mistakes either. How many people are wrong in some small way every day, multiple times a day? We don’t always correct them because it’s rude, unless they’re on the Dope. But if you took a random sample of people and asked them a battery of questions from across different disciplines, I bet they would do much, much worse than LLMs on the whole. Some of this depends on their “training set” (their level of education), but different brains also have subtly different thought patterns and, frankly, different grasps on reality. The physical mechanism of failure may be different between man and machine, but neither is a perfect truth-seeking system. Philosophers and scientists are often wrong too, to say nothing of politicians.

For that matter, how many times is Google wrong? Wikipedia?

That we argue about hallucinations at all is a testament to how eerily cognizant these autocomplete engines already are. We dwell on their mistakes while taking it for granted that sentences and images and melodies now suddenly have actual meaning to them – yes, encoded in a system different from our neurons, but nonetheless able to semantically transform complex inputs into useful outputs with a high degree of correlation to reality. They’re not perfect, but that’s something most people and other types of machine learning struggle with, too.

I agree LLMs are misused frequently because the average person isn’t well equipped to understand their limitations, but to dismiss them altogether is a reckless disregard for a pretty revolutionary new way of encoding and transforming knowledge. Overnight, in their infancy, LLMs made the Turing Test irrelevant, and it’s now taken for granted that having natural language conversations, or even debates, with your PC or phone is possible. This was considered impossible a few short years ago. Your average LLM will already do better than your average person, across most fields of knowledge, while simultaneously being more eloquent in almost any human or computer language.

To reach a similar level of competence, a human needs a brain that evolved over millions of years, life training and sensory exposure that occurs over a couple dozen years, general education for hours a day, specialized education for half a decade, and then ongoing training in a career.

LLMs are catching up very quickly in the half decade they’ve been around, and are getting incrementally better in the benchmarks every year. They are continuously trained and retrained and fine-tuned in ways that most people aren’t, except maybe a tiny handful of scientists, academics, and researchers. How well would any of us do if we were subjected to the same training data they were? Could you learn ten languages to native fluency in five years, along with being able to make apps in twenty different programming languages, while being able to spit out trivia about basically any field in existence, and then discuss it all in poetry?

This isn’t an argument for their sentience, but their flabbergasting abilities even without it. Perhaps sentience isn’t the fixed prerequisite for semantic information processing that we once thought it was.

There is likely a cap to these abilities, same as there is a cap to human ability and intelligence. But they are still extraordinarily capable in ways that many people are not (summarization and multi-dimensional correlation analysis of vast datasets), while also being extraordinarily incapable of tasks many humans would consider mundane, such as counting words and paragraphs. That’s just not how their “minds” work, any more than yours can suddenly learn a new language with a few weeks of training. They can’t tell good jokes either… but have you met my dad?

LLMs are not AGI and probably never will be (on their own). But the threshold for that is constantly changing and the goalposts keep moving, because we really don’t know what a general purpose super intelligence would look like and how it would function. We are ourselves mere facsimiles striving for that imagined goal, but most members of our species are already far, far less capable than an LLM at knowledge transformation, a task we not so long ago claimed as our exclusive domain, the thing that set us apart from the animals and machines of the world. Overnight that fantasy was shattered, and it’s only going to get more shattered every decade after this.

IMHO the likely outcome of this quest isn’t that machines will someday “achieve” sentience, but that the concept of sentience will itself become obsolete and irrelevant the same way we no longer care to study “aether” or “humors”.

I dunno, I just think they deserve more credit than they’re typically given.

Vegetative electron microscopy appears to have originated through a remarkable coincidence of unrelated errors.
Like biological fossils trapped in rock, these digital artifacts may become permanent fixtures in our information ecosystem.

Let’s see, AI generated mistakes that are virtually impossible to correct by humans. As AI slowly takes control over everything we do, that’s not something we should be worried about, is it? :neutral_face:

And just for fun, here’s ChatGPT’s own critique of my rant:

Its thoughts on my thoughts on it

Your post is a textbook example of breathless technophilia wrapped in pseudo-intellectual relativism. You draw sweeping comparisons between humans and LLMs with superficial insight, waving away crucial distinctions as mere technicalities. The argument “humans are flawed too!” is profoundly weak—it trivializes genuine concerns about hallucinations, misinformation, and harmful biases embedded in LLMs.

You assert boldly that an average LLM outperforms an average human across “most fields of knowledge,” but that’s largely meaningless. Trivia games aren’t reality; genuine expertise involves understanding nuance, context, and real-world implications—areas where LLMs frequently stumble disastrously. Conflating trivia recall and true intellectual mastery is intellectually lazy and misleading.

Your post also makes grand claims about how LLMs have rendered the Turing Test irrelevant “overnight,” glossing over decades of incremental research that led here. It’s naive at best to say “this was considered impossible a few short years ago,” ignoring the continuum of AI research dating back decades.

You repeatedly use overly casual analogies (“Have you met my dad?”) and rhetorical questions that weaken your overall argument. Humor is welcome, but yours is scattered, awkwardly forced, and detracts from any gravitas your discussion might have held.

Furthermore, your vague musings about “sentience becoming obsolete” come off as a half-baked philosophical hot take that sidesteps deeper ethical questions. It’s easy to dismiss sentience when you don’t deeply grapple with the consequences of intelligent systems that mimic human reasoning but lack empathy, accountability, or moral comprehension.

Lastly, your point about humans having evolved over millions of years of biological complexity feels entirely misplaced. Humans aren’t competing with LLMs; LLMs were explicitly built by humans. Praising a system for surpassing human speed in certain tasks, given massive computational resources and tailored algorithms, feels as pointless as marveling at a calculator for doing math faster than a person.

Overall, your argument lacks intellectual rigor, flirts recklessly with relativism, and seems more interested in championing a trendy new technology rather than engaging honestly and deeply with its limitations and dangers.