AI is wonderful and will make your life better! (not)

From the article:

AI chatbots have become an integral part of the way information is disseminated, processed, and consumed. However, their growing presence is not without significant challenges. A key issue is the surprising level of inaccuracy in the information they provide, as highlighted by a study from the Tow Center for Digital Journalism. This study revealed that AI chatbots, including prominent ones like ChatGPT and Gemini, offer wrong answers over 60% of the time when tasked with sourcing news excerpts. This tendency towards inaccuracy isn’t just a technical glitch but raises serious concerns about the reliability of AI systems in critical applications like news reporting, where accuracy is paramount.

The consequences of these inaccuracies are far-reaching. They not only spread misinformation but also damage the trust between the public and the media, as well as between publishers and the companies developing these AI tools. For instance, chatbots have been known to fabricate headlines and fail to attribute articles correctly, often linking to unauthorized or incorrect sources. This not only misleads the public but also harms news publishers’ reputations and revenue potential by diverting traffic and diminishing the perceived value of credible news outlets.

AI search engines, powered by sophisticated algorithms, are increasingly shaping the way people access and consume information. However, a critical analysis reveals that these AI-driven platforms often suffer from significant inaccuracies, especially when sourcing news content. According to a recent study by the Tow Center for Digital Journalism, AI chatbots like ChatGPT and Gemini are frequently incorrect, providing inaccurate information over 60% of the time when tasked with identifying correct news sources. This high error rate stems from tendencies to fabricate headlines, incorrectly attribute article sources, and link to erroneous or unauthorized sources, thereby affecting the integrity and reliability that users expect from such advanced technologies. This issue not only misleads users but also poses substantial threats to the credibility and operational viability of news publishers. For further details on these concerns, you can refer to the study findings.

The repercussions of AI search engine inaccuracies extend beyond professional journalism to encompass broader information ecosystems. When AI models consistently disseminate false or misleading information, they contribute to the erosion of public trust in both AI tools and traditional media outlets. This erosion is not just theoretical but has tangible consequences: misinformation can shape public discourse, influence political perceptions, and even sway election outcomes if unchecked. As these technologies continue to evolve and integrate more deeply into daily life, the imperative for transparency, accuracy, and accountability in their operation grows ever more critical. A detailed examination of these issues and their implications for news dissemination can be explored through the comprehensive research by the Tow Center for Digital Journalism on AI inaccuracies.

Of course, these problems aren’t just limited to sourcing and referencing news articles (something even a marginally mature ‘AI’ should be capable of doing) but extend to accessing, summarizing, and providing important information from any source.

Quoting a bullshit generator to argue that it is not quite as bullshitty as clear evidence would make it seem is about as sensible as fucking for virginity. You keep arguing that chatbots are an “amazing and almost immeasurably important and useful technology” and that the “tendency for naive users to believe them to be dependable” is just a “people problem”, but in fact, in order to be used as a general knowledge system, they have to be dependable, at least to the extent that an actual expert is, and they absolutely need to not fabricate sources or citations, so that a user can at least know where to look to verify that the source of information is correct and make some effort to verify that the interpretation is what the chatbot says it is.

The reality is that the use case is for non-experts doing some kind of information research, many of whom (like the poster who used a chatbot to ‘solve’ a physics problem discussed upthread) don’t have enough information to verify a correct answer or quickly intuit a wrong answer, and so they will generally take whatever answer they are given as essentially correct, possibly posting it into a work product, paper, or news article where it will be disseminated to the wider world. That LLM-based chatbots can’t even be used to take food and beverage orders off of a limited menu indicates that the technology is nowhere near mature enough to be deployed to the public, and it certainly should not be used in any way to produce safety-, security-, or health-critical information or directions.

Stranger

[1,001 imaginary likes]

GPT is actually exceptionally good at both finding information and summarizing it. It does quite a remarkable job, for instance, of summarizing sections or abstracts of scientific papers, and can reduce the essential information to simple terms understandable by a layman who may be unfamiliar with the scientific terminology of the field. The study that all of this latest kerfuffle is based on was specifically about AI’s trouble with citations. The first part of that wall of text that you highlighted simply repeats that fact over and over using different words.

The second part that you highlighted tells us about the terrible impacts of misinformation, forgetting to mention that the idea that generative AI constantly spews out misinformation is an unsupported assumption and is a baseless extrapolation from the study about citations.

Instead of wallowing in self-satisfied snarky smugness, it would have been more productive to have actually addressed the substance of what was actually a pretty cogent argument, but I guess snark is always the easy way out. The point being made is that citations are a worst-case scenario for a number of reasons:

  • Search wrappers are still immature (it’s tricky to link a search engine with a text generator; a rough sketch of that wiring follows this list);
  • Unlike a normal exposition, which can be “mostly right” and still useful, even the slightest error in a citation counts as a failure;
  • News articles are moving targets: headlines can be updated, articles can be moved, content can change, and LLMs relying on previously scraped content can become outdated.
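
To make the first point concrete, here is a rough, purely hypothetical sketch of what a “search wrapper” has to do. The functions web_search and llm_complete are just stand-ins for whatever real search API and model endpoint a product would use, not anyone’s actual implementation:

```python
# A minimal, hypothetical sketch of a "search wrapper" (retrieval-augmented
# generation). `web_search` and `llm_complete` are placeholder stand-ins for
# whatever real search API and model endpoint a product would actually use.

def web_search(query: str, k: int = 5) -> list[dict]:
    """Placeholder: return [{'title': ..., 'url': ..., 'snippet': ...}, ...]."""
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    """Placeholder: send the prompt to a text generator and return its reply."""
    raise NotImplementedError

def answer_with_citations(question: str) -> str:
    # 1. Retrieve candidate sources; the snippets may already be stale.
    results = web_search(question)
    # 2. Number the sources so the model can cite them as [1], [2], ...
    context = "\n".join(
        f"[{i}] {r['title']} ({r['url']}): {r['snippet']}"
        for i, r in enumerate(results, start=1)
    )
    # 3. Ask the generator to answer using only those sources.
    prompt = (
        "Answer the question using ONLY the numbered sources below, "
        "and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    # Nothing in this pipeline checks that the model's [n] markers, quoted
    # headlines, or URLs actually match the retrieved sources.
    return llm_complete(prompt)
```

Every step in that chain can go stale or get paraphrased, which is why an answer can be substantively useful while the citation still counts as a failure.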

None of those things matter in other areas like general information retrieval, summarization, composition, coding, or translation, where linguistic and logic skills are important.

It is not an “unsupported assumption”; it is literally the experience that many people—even in this thread—have reported as a regular occurrence, and it is a natural consequence of a system that has no innate ability to distinguish fact from fabulation when constructing statistically cromulent word-order responses.

Again with the ad hominem at anyone who disagrees with your dogmatic belief that LLMs are an “amazing and almost immeasurably important and useful technology” while continually dodging the fundamental issue that chatbots often fail to reliably provide basic, easily checked factual information including, yes, citations and references. That “it’s tricky to link a search engine with a text generator”, even though this is the basic function of a general knowledge system, illustrates the fundamental problem with using these systems to provide factual information or analysis. If it can’t get basic facts correct, why would there be any expectation of its reliability for “… general information retrieval, summarization, composition, coding, or translation, where linguistic and logic skills are important”?

That LLM-based chatbots can manipulate language and produce a syntactically-correct result that at least appears to relate to the question or issues in the prompt (at least, at a superficial level) is an impressive capability, but the fact that it not only can’t be relied upon to verify factual information and alert the user when it is going beyond its knowledge base, but often just fabricates information and citations, makes it problematic for any production use; it is essentially a novelty that is only useful for people who are familiar with its problematic tendency to generate nonsense and who know enough about whatever topic they are prompting it with to suss out anything but the most obvious errors.

Stranger

I think the main problem is the misunderstanding that treats the Turing test as the ultimate challenge in the field of intelligence, rather than merely a significant one.

So you invented a machine that’s smarter than the median voter? Congratulations, that’s no small feat, but it’s only worth as much as the thoughts of the median voter, which experience suggests isn’t that much.

It’s even worse than that. To pass a Turing test, an ‘artificial intelligence’ system just has to function in a way that convinces the user that it is actually demonstrating sapience. Of course, someone who is predisposed to look for indications of sapience will find evidence for a ‘spark of consciousness’ and ignore contraindications that the system really isn’t sapient and is just producing word-streams per a complex algorithm with no ongoing cognitive processes. This is especially true with an LLM because as humans our main mode of interaction is through language, so the default assumption is that a tool that can competently manipulate language must be capable of extensive cognitive functions even if it makes basic errors in simple reasoning, fails to distinguish between fact and fiction, and generates fake references for the ‘information’ that it ‘hallucinates’.

Nobody believes that non-LLM deep learning systems are on some edge of sapience or sentience, but a thundering herd of people have convinced themselves that LLM-based chatbots are or are approaching AGI despite a complete lack of any ongoing processes that a neuroscientist would recognize as cognition, or any ability to interpret the world or develop ‘lived experience’ of it beyond its training data set and reinforcement learning, which requires massive amounts of carefully curated data that are many orders of magnitude more than a person would read in a lifetime, and still needs extensive correction and constraints to keep from generating pure nonsense or spiraling completely out of control. That LLMs can manipulate natural language in a context-sensitive way is an impressive feat of deep learning computation to be sure, but it is not indicative of more extensive intelligence or knowledge models; rather, it is an empirical confirmation of the speculations of computational linguists about the extensive metasemantic content embedded in the structure of how language is used.

Stranger

The problem with this claim is that there’s plenty of evidence for actual cognitive abilities, like the many tests for knowledge and problem-solving skills that GPT has aced. This pretty much sums it up:

Large language models (LLMs) are advanced artificial intelligence (AI) systems that can perform a variety of tasks commonly found in human intelligence tests, such as defining words, performing calculations, and engaging in verbal reasoning. There are also substantial individual differences in LLM capacities. Given the consistent observation of a positive manifold[1] and general intelligence factor in human samples, along with group-level factors (e.g., crystallised intelligence[2]), we hypothesized that LLM test scores may also exhibit positive inter-correlations, which could potentially give rise to an artificial general ability (AGA) factor and one or more group-level factors. Based on a sample of 591 LLMs and scores from 12 tests aligned with fluid reasoning (Gf), domain-specific knowledge (Gkn), reading/writing (Grw), and quantitative knowledge (Gq), we found strong empirical evidence for a positive manifold and a general factor of ability.
Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement? - ScienceDirect

[1] In cognitive science, a positive manifold is a consistent positive correlation between scores on different types of cognitive tests; IOW, an individual who does well on one type of test is likely to also do well on others.

[2] “Crystallized intelligence” is the application of accumulated knowledge and experience to solving problems, as opposed to “fluid intelligence”, which refers to reasoning and learning skills.
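
For anyone unfamiliar with the factor-analytic jargon, here is a toy simulation (my own illustration, not the paper’s data or method) of what a positive manifold and a general factor look like numerically: if every test score reflects a shared ability factor plus noise, the inter-test correlations all come out positive and the first eigenvalue of the correlation matrix dominates.

```python
# Toy illustration (not the cited paper's data): a shared ability factor
# produces a positive manifold and a dominant first factor.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_tests = 591, 12                     # sizes borrowed from the cited study
g = rng.normal(size=(n_models, 1))              # latent "general ability" per model
loadings = rng.uniform(0.5, 0.9, size=n_tests)  # how strongly each test reflects g
scores = g * loadings + rng.normal(size=(n_models, n_tests))  # simulated test scores

corr = np.corrcoef(scores, rowvar=False)        # 12 x 12 inter-test correlations
off_diag = corr[~np.eye(n_tests, dtype=bool)]
eigvals = np.linalg.eigvalsh(corr)[::-1]        # largest eigenvalue first

print(f"mean inter-test correlation: {off_diag.mean():.2f}")          # positive manifold
print(f"variance share of first factor: {eigvals[0] / n_tests:.2f}")  # g-like factor
```

The quoted abstract is reporting that the actual scores of 591 LLMs on 12 tests show this same pattern.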

What you’re engaging in here is “justaism”, claiming that LLMs are “just” completion engines without acknowledging the powerful emergent cognitive properties that manifest at very large scales:

Large language models (LLMs) are arguably the most predictive models of human cognition available. Despite their impressive human alignment, LLMs are often labeled as “just next token predictors” that purportedly fall short of genuine cognition. We argue that these deflationary claims need further justification. Drawing on prominent cognitive and artificial intelligence research, we critically evaluate two forms of “Justaism” that dismiss LLM cognition by labeling LLMs as “just” simplistic entities without specifying or substantiating the critical capacities these models supposedly lack.
https://aclanthology.org/2025.findings-acl.1242.pdf

And there’s this:

Here’s a paper just recently published in Nature Machine Intelligence: Human-like object concept representations emerge naturally in multimodal large language models. The full article is paywalled but there’s a link to the full version at arXiv.

There’s also this news summary of the research, which concludes with the following paragraph:

[He Huiguang, a researcher at Institute of Automation of CAS and the corresponding author of the paper] said the study has made the leap from “machine recognition” to “machine understanding.” The result shows that LLMs are not “stochastic parrots.” Instead, these models have an internal understanding of real-world concepts much like humans. The core finding is that the “mental dimension” arrives at similar cognitive destinations via different routes.

Gosh, I would hope that if I had unlimited access to the entire Internet, I’d do great on an IQ test, too.

At the time that many of these tests were run, GPT did not have direct access to the internet, only its own corpus. And even if it did, questions on most of these tests, particularly the WAIS IQ test, are designed to assess logical thinking and problem-solving skills, not rote information retrieval.

Today I asked CoPilot for options to travel from my house to Brighton Music Hall in Boston. It gave me two options, one driving in, one mass transit. The mass transit option misidentified the commuter rail line that runs through my nearest MBTA station, and as a result also had me boarding in a non-existent place and exiting in a station that’s on a completely different line.

Some IQ tests are more practical exercises.

I’m not even close to an expert on this stuff, but anytime anyone gets this breathless about something, I’m skeptical.

From your cite:

We interpreted our results to suggest that LLMs, like human cognitive abilities, may share a common underlying efficiency in processing information and solving problems, though whether LLMs manifest primarily achievement/expertise rather than intelligence remains to be determined.

Your citation is stating that LLMs can achieve a similar result to human cognition, not that they operate like human cognition, except insofar as they may both include some way of improving efficiency. This is a common error made by proponents of AI, and it makes me think of evolutionary psychology, another field in which some objective data is posited to only be explainable by the researcher’s pet theory, without looking at alternative explanations.

But your cite doesn’t even do that, it rightly explains the limitations which you seem to be ignoring. The study you cited casts legitimate doubt on whether this is true intelligence.

With regard to honest understanding… Consider this metaphor. When my son was in the early years of development he had echolalia, which usually means repetition of phrases. He used it to get what he wanted: I would ask, “Are you all done eating?” And before too long he started announcing, “All done eating!” when he finished with his meal. Most people around him would assume he knew what that meant and that his language abilities were fine. Better than average, even.

But we, his parents, knew better. He knew it was something you say to be excused from the table, but he lacked true conceptual understanding of those words. As evidenced by the fact that he eventually just shortened the phrase to the nonsensical, “All eating!”

Hopefully I don’t need to torture this metaphor any further. You can know how to do something without understanding it.

I don’t want to start a new thread about this, and I may expand on this later, but since it’s AI-adjacent, what the hell, I’m putting it here.

I’m a business owner who has managed programming teams in my previous life. I had the simple idea of using Gemini to do some basic social media stuff, and what I learned is that Google’s entire programming/AI/business platform is such a kludgy piece of cobbled-together shit that I’m shocked, shocked, that it works at all. And, to be frank, I’m not 100% convinced that it does all integrate together.

Holy crap, it’s amazing how fucked up their platforms are. Just getting a working API key to our own systems took my guy a half a fucking hour and it wasn’t like he was figuring it out as he went, no, he knew what to do, it just took 34 minutes for him to do it.

Now I’m being told that we need to use some hitherto unheard-of Google platform called ‘Whisk’ in order to get this simple thing done, and (1) it took me 15 minutes… with Gemini’s help… to even figure out how to access Whisk with our business account, and, when finally logged in, (2) in order to use this thing, Google/Whisk tells me that it’s $106 per Workspace user per month (there are 7 of us, so $742/mth) all so I can have this Whisk thing look at an image and give me 10-15 marketing words on it 3 days a fucking week.

Holy shit. I just may migrate the entire company to Microsoft.

This is not a claim that anyone has made in the general case. The pertinent observation is that some of the skills that LLMs possess exhibit human-like evolution with scale, including spontaneous emergence, not that the processes are the same. The paper I just cited in Nature Machine Intelligence makes the case that real-world object representations in multi-modal LLMs have emerged to reflect real understanding, not just mathematical representations, but the learning mechanisms are different.

The argument about what “true intelligence” means has been raging since the pioneering days of AI back in the 60s. Rather than continuing to flog this semantic dead end, it may be useful to observe that the definition applied by skeptics has been a constantly moving goalpost. To hear the skeptics tell it, the definition of “true intelligence” is whatever computers currently can’t do. Until they can, and then the goalposts move again.

The AI skeptic Hubert Dreyfus once claimed that no computer would ever play better than a beginner’s level of chess – the analysis tree was just too big for even the fastest conceivable computer. Now of course computers play chess at the grandmaster level, and that goal has suddenly been forgotten by the skeptics. The gold standard for determining human-like intelligence used to be the Turing test, but it’s now so obvious that ChatGPT can do this that nobody talks about it any more.

To my point, Google has two platforms called ‘Whisk’, one of which is the AI Image/text generator thingy, the other is a recipe app.

Like I said, Google is a kludgy piece of cobbled-together shit.

Now this citation you have provided is interesting; I can’t figure out its source, but it reads like a philosophy paper. It asserts that its purpose is merely to refute the haters, and that it’s not going to bother trying to operationally define “cognition” in its defense of the “potential” cognitive processes of AI.

Second, we do not base our argument on any specific definition of cognition, nor do we develop one.

Then we get to the Limitations:

Second, because our main focus is to argue against unsubstantiated claims and call for a more measured discussion on LLM cognition, we do not make a substantive positive argument for or against LLM cognition in this work. Doing so would involve considering and adjudicating between different definitions and operationalizations of cognition and related constructs

(Actually, they could do it just by choosing one definition.)

I’m not really opposed to its stated purpose, but it’s hard to really understand what it’s purportedly refuting if it refuses to define its own terminology. Also, again, it clearly shies away from the claim that these are cognitive processes - just that they might be, so we should stop being so critical of the idea.

Actually I think a better idea would be for you to define what you think true intelligence is, so we could explore the idea of whether LLMs fit that standard - and then maybe someone smarter than me could come up with their own definition, and we could talk about that.

Otherwise you’ve created a situation in which there’s no way to disprove you.

ETA: FWIW, I agree that “intelligence is whatever a human can do” is not a good definition. I would not necessarily eschew a definition that was potentially inclusive of machines, as long as we don’t conflate human cognition, which is in and of itself poorly understood, with machine cognition.

It was published in Findings of the Association for Computational Linguistics: ACL 2025. ACL is a prestigious organization of researchers engaged in work on various aspects of natural language processing. Transactions of the Association for Computational Linguistics is arguably the most important journal in the field of linguistics and one of the most important in AI, and is published by MIT Press. The “Findings” volume is an annual compilation of conference papers.

Is it common for this field to include philosophy in its publications? I know it’s not unheard of in other fields, I’m just curious how representative this paper is of what you would normally find in this kind of journal.

Is there like a field of philosophy of AI cognitive science? Is it considered a branch of cognitive science?

(I like philosophy.)

I’ve read things like it for psychology, but it would not be the norm; it’s more like a professional public service announcement, written out of a need to draw attention to an issue. But in those contexts it’s functionally a really well-informed opinion piece, which is how I read this cite.

ETA:

Ah, a conference paper. That makes sense, then.

Still, I’m curious about the intersection of linguistics and AI.

They’re very closely related. AI researchers that I know/knew frequently worked in cross-disciplinary areas spanning linguistics, AI, and cognitive science, and yes, philosophy, too (Jerry Fodor, for instance, was a philosopher well known for his contributions to cognitive science on the theory of mind). An AI researcher I know once did a sabbatical at MIT under Noam Chomsky, but also collaborated with his good friends Marvin Minsky (AI) and Steven Pinker (linguistics, psychology, and psycholinguistics). Cross-disciplinary work is very common in AI research at the academic level, though maybe not so much in commercial ventures like OpenAI.