AI is wonderful and will make your life better! (not)

That’s really cool. I loved the one linguistics class I took in college, and I nearly majored in philosophy, minus one symbolic logic course. Let’s just say I took the discipline out of interdisciplinary. I’m determined to try to figure out AI because, despite my serious misgivings, I know it’s going to shape my son’s life in some way. I’ve been listening to Cal Newport’s Deep Life podcast which usually includes a brief bit of AI talk in every episode. He’s a professor of computer science at Georgetown.

Know any good podcasts? Something with a balanced perspective.

It…doesn’t. If you actually read the paper, you won’t find anywhere that the authors argue that human-like cognition is occurring, and they actually state this up front in the abstract:

We interpreted our results to suggest that LLMs, like human cognitive abilities, may share a common underlying efficiency in processing information and solving problems, though whether LLMs manifest primarily achievement/expertise rather than intelligence remains to be determined. Finally, while models with greater numbers of parameters exhibit greater general cognitive-like abilities, akin to the connection between greater neuronal density and human general intelligence, other characteristics must also be involved.

You need to read through that paper more thoroughly, because while it makes some very interesting points about how to assess ‘artificial general ability’, as they term it, using the Cattell-Horn-Carroll (CHC) model of intelligence as a comparative framework for defining and quantifying different aspects of intelligence, it is definitely not suggesting that human-like cognition is occurring within an LLM, especially with respect to fluid reasoning (Gf):

We also failed to observe a Gf factor independent of the general factor. While some have argued that Gf is essentially isomorphic with g in humans (e.g., Gustafsson, 2001), we are more cautious about making such an interpretation with our data. Our selected measures of Gf were, at best, acceptable rather than good or excellent. In particular, two of our Gf tests focused on mathematical reasoning, and none involved figural matrices which are typically well-regarded for measuring fluid reasoning (Gignac, 2015). Consequently, further research with better measures of Gf is required to evaluate the possibility of a distinct Gf group-level factor in LLM data.

There are, as I’ve noted above, logic embedded in, and metasemantic capabilities that emerge from, the way in which language is used, and a system that mimics human language to the degree that a well-trained LLM can will produce statements that often look as if they have been ‘thought out’ by a process of complex cognition, even though they are just the result of being a statistical ‘best fit’ for the prompt. That does not mean that the LLM has “powerful emergent cognitive properties” any more than playing a video game written in C++ makes someone an expert programmer.

In fact, the argument that manipulating language creates “emergent cognition” is quite obviously upside down as evidenced by the fact that many animals which don’t use language or understand it to a significant degree are nonetheless capable of demonstrating complex problem-solving and anticipatory behavior. Language is a tool that emerged out of a need to convey (and later, record) abstract ideas (and potentially as a tool to formulate more general solutions to specific problems) but it doesn’t somehow create cognition in the brain.

There are not, as previously noted, actual processes going on in an LLM that are in any way analogous to our understanding of cognitive processes in the mammalian brain. Specifically, there is no ongoing ‘strange loop’ of different layers of self (thought) awareness, sensory integration, anticipation and prediction, or formulation of abstracted models of how the world and other people work. An LLM is literally taking a textual prompt, breaking it down into tokens to quantify it, running it through its word-prediction algorithm (sometimes using a recursive ‘chain-of-thought’ approach to decompose the problem into smaller intermediate sets of tokenized data), and then using all of that to generate a textual output which is an empirically appropriate response to the prompt. It does not learn or update its foundational models through experience (attempts to make LLMs do so have resulted in rapid instabilities), it has a limited ‘attention window’ within which it can maintain an ongoing dialogue or process a sequence of prompts with continuity, and it certainly isn’t self-aware, either physically (of course) or in reflection upon its own processes, except insofar as its prior output is just another prompt that gets fed back in and has to be responded to in the way the user expects, i.e. saying that it is reviewing its previous work when it is really just dealing with a newly tokenized prompt.
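
To make that pipeline concrete, here is a deliberately toy sketch of the loop just described (tokenize, predict the next token, append it, repeat). Everything in it is a made-up illustration: the “model” is a trivial bigram lookup table, whereas a real LLM uses a trained transformer over subword tokens, but the shape of the process is the same.

```python
import random

# Toy illustration only: a bigram lookup table standing in for a trained model.
TRAINING_TEXT = "the cat sat on the mat the dog sat on the rug"

def tokenize(text):
    # Real systems use subword tokenizers (BPE and the like); a whitespace split
    # is just a stand-in to show the prompt being quantified into tokens.
    return text.split()

# "Training": record which tokens follow which in the text.
tokens = tokenize(TRAINING_TEXT)
bigrams = {}
for prev, nxt in zip(tokens, tokens[1:]):
    bigrams.setdefault(prev, []).append(nxt)

def predict_next_token(context):
    # Pick a statistically plausible continuation of the most recent token;
    # no meaning or understanding is consulted, only observed word patterns.
    candidates = bigrams.get(context[-1], tokens)
    return random.choice(candidates)

def generate(prompt, max_new_tokens=10):
    context = tokenize(prompt)
    for _ in range(max_new_tokens):
        # Each output token is fed back in as part of the next context, which is
        # all that "reviewing its previous work" amounts to at this level.
        context.append(predict_next_token(context))
    return " ".join(context)

print(generate("the cat"))
```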

This isn’t to say LLMs aren’t remarkable (as an exercise in stabilized computation using a heuristic artificial neural network over a structured but highly complex and nuanced set of rules and data, they definitely are), or that they don’t have any utility as an interface, provided their responses can be constrained to correctly reiterating factual information. But they are not general information processing systems, they don’t have complex models of the world independent from those represented in patterns of word usage in the training data used to build their foundation models, and they certainly aren’t trustworthy for critical applications, nor should they be in use by a general public that will treat any device producing coherent streams of language in an authoritative voice as an ‘expert’.

The ‘secret’ about standardized IQ tests such as the Wechsler Adult Intelligence Scale (WAIS) is that they don’t actually measure intelligence; they measure the subject’s ability to take the test, which is then correlated to the statistical measure of intelligence in the form of the “intelligence quotient”. Setting aside all of the problems in trying to quantify the various facets of intelligence in a single metric (and the problematic history of why IQ was developed and how it was used, especially by proponents of eugenics), the reality is that it is possible to ‘game’ any textual IQ test just by studying the form of the test and how questions are structured, and obtain a score that is wildly exaggerated compared to that of an untrained subject. Most makers of LLMs are quite cagey about exactly what data is used to train the models, but because they will basically use every source of text they can find, it is all but certain that sample IQ tests are part of the training set (as are sample college admission, bar examination, medical certification, and other tests). So a broadly-trained LLM should be able to do as well as an expert on these tests, not because it is actually an expert that you would want to write a legal brief for you or diagnose the pain in your kidneys, but because these are forms of language use that it has plenty of examples of and should be able to reproduce with good fidelity even with variations in the actual questions. This makes it a kind of idiot savant that can sound really impressive and even do the language manipulation tasks within the scope of the foundation model but doesn’t actually know an appendix from a larynx.

Stranger

I wanted to know how many people will be participating in the Massachusetts Democratic Convention, which I will be attending next weekend in Springfield. I asked Google “how many delegates at massdems 9/13/2025 convention” and this is the AI Overview I got:

There are no MassDems (Massachusetts Democratic Party) state conventions listed for September 13, 2025, in the search results; however, the Massachusetts Democratic Party (MassDems) state conventions are typically held in late spring or early summer. The most recent MassDems convention was in May 2024, which involved around 350 delegates.

Virtually everything in that summary is wrong:

  • This year’s convention really is on Sept. 13.
  • The conventions seem to alternate between early summer in even years and September in odd years.
  • The most recent convention was June 1, 2024, not May, in Worcester.
  • I haven’t been able in a few minutes of Googling to confirm the 350 figure.
  • The links it provided as sources went to the Delaware Democratic Party.

I’ve just read the rules for the convention, and a quorum is 750 delegates, so that 350 number is wrong, too.

And yet people on this board assure me I am exaggerating the faults, and nothing can go wrong go wrong go wrong…

Evangelism is like that.

The important thing to remember here is that different LLMs vary widely in the quality of their responses. In my experience Google’s “AI Overview” generative technology is one of the worst, and part of the gradual enshittification of Google as a whole.

I asked GPT-5 this question: “When is the next Massachusetts Democratic State Convention, and approximately how many delegates are expected?”

Unfortunately when it comes to current and future events, GPT can generally do no better than searching the web. Still, this is part of what it produced:

Next Massachusetts Democratic State Convention

  • Date: The 2025 Massachusetts Democratic Party State Convention is scheduled for Saturday, September 13, 2025, at the MassMutual Center in Springfield, MA

It was less certain about the number of delegates but based on estimates of delegates + alternates + add-ons, it guessed as many as 800-1000, but suggested reaching out directly to the Massachusetts Democratic Party for an accurate count.

CoPilot, which can’t reference real train stations, at least seems reasonable here:

The next Massachusetts Democratic State Convention is scheduled for Saturday, September 13, 2025, at the MassMutual Center in Springfield, MA. This will be a Platform Convention, where delegates vote on the party’s priorities for the next four years.

As for delegate count: while the exact number for 2025 hasn’t been published yet, past conventions typically draw between 3,000 and 5,000 delegates, including elected, ex officio, and alternate delegates. Given the scale and significance of this year’s platform vote, attendance is expected to be on the higher end of that range.

I’ll have better information about the delegate count from some previous attendees in the next few days, but given that a quorum is 750, I’d expect the 800-1,000 estimate is closer to the mark than 3,000 - 5,000.

I posted to highlight the problem that even though properly used AI can be very useful and effective in the right circumstances, the AI tech that ordinary people are exposed to without even seeking it out can be, and apparently often is, worse than useless.

If I hadn’t gone in being highly skeptical of AI, and hadn’t bothered to check the AI Overview’s “facts,” I would have been completely misled. Nothing it told me was correct.

Hey, I’m the last guy to defend AI (well, maybe not the last, but somewhere in the mix), but 2022 had about 4,000?

Thanks for finding that. Bigger than I thought.

Since we’re doing rants:

My Pixel Watch used to have a Google Assistant thing and they replaced it with Gemini, and now it takes twice as long to perform tasks.

I think it should be illegal for companies to force AI on customers.

Relevant NYT article. “The AI Fever Dream of Imminent Superintelligence is Finally Breaking.”

Gift link.

https://www.nytimes.com/2025/09/03/opinion/ai-gpt5-rethinking.html?unlocked_article_code=1.jU8.uq6e._r5G15rdwJzx&smid=nytcore-android-share

From that article:

Many in the tech industry were hostile to my predictions. Mr. Altman ridiculed me as a “mediocre deep learning skeptic” and last year claimed “there is no wall.” Elon Musk shared a meme lampooning my essay.

This is a common and pervasive tactic of AI enthusiasts when confronted with arguments about the limitations of current approaches to AI that they can’t refute; they instead engage in ad hominem attacks on the credibility of the critic as a way of undermining the argument, in order to distract from the fact that they don’t actually have a leg to stand on.

Stranger

I was going to sign up for that guy’s newsletter, but he charges $8/month for it!

Journalism is expensive when it’s not subsidized by public funding or an advertising model, isn’t it?

Stranger

Again, no general claim is being made that the cognitive processes are necessarily human-like (although some may be), but only that the results reflect a high level of cognitive skill comparable to intelligent humans.

More broadly, you’re making a great effort to emphasize the cautions and limitations of the study, but those cautionary statements are just typical scientific conservatism found in all serious scientific papers. Recall that the last sentence of the abstract was “Based on a sample of 591 LLMs and scores from 12 tests aligned with fluid reasoning (Gf), domain-specific knowledge (Gkn), reading/writing (Grw), and quantitative knowledge (Gq), we found strong empirical evidence for a positive manifold and a general factor of ability.” This is the same positive manifold that’s very typical of human cognitive performance.

As I interpret this claim, exposing a test subject (machine or human) to sample questions ahead of time will help the subject perform better on tests for intelligence or cognitive skills. I know that students frequently practice with sample questions for SATs and the like, but I doubt that this does much more than psychologically familiarize them with what the questions look like. But the idea that prior exposure to the mere form of these questions can help an AI perform better makes no sense to me whatsoever.

As opposed to making grandiose claims about how it is “amazing and almost immeasurably important and useful technology”, “applications to personal education are unprecedented in human history”, “vast repository of knowledge and the ability to interactively converge toward the desired information”? The reason that “serious scientific papers” explicitly list limitations and assumptions is to avoid drawing factually unsupported conclusions.

This isn’t saying anything about human-like cognition going on within the workings of the LLM; it just says that it is showing performance in these metrics that is “strong empirical evidence for a positive manifold and a general factor of ability.” In other words, it can perform in certain metrics on these tests in a way that provides “observation of consistent, positive correlations between cognitive ability test scores”. That is to say, it doesn’t just do well on one very narrowly defined task like playing chess; it has a broader basis of capability. This is not saying anything about actual cognition (as a neuroscientist would recognize it), or even that it has human-like capabilities in anything beyond the manipulation of language in a conversational way. In fact, an LLM has no integral spatial abilities, which the authors recognize by explicitly stating that they didn’t test for that. You are trying to leap from how it handles language, and the intrinsic relationships reflected in the usage of language in its training data, to its being an incipient (or, in your enthusiastic language, even implemented) general intelligence comparable to a human worker, which it definitely is not.
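
For what it’s worth, the ‘positive manifold’ and ‘general factor’ here are purely statistical constructs. A minimal sketch of what that means, using made-up scores (everything in this snippet is a hypothetical illustration, not the paper’s data or method):

```python
import numpy as np

# Hypothetical scores (not the paper's data): five "test takers" on four tests.
scores = np.array([
    [0.90, 0.80, 0.85, 0.70],
    [0.60, 0.50, 0.55, 0.50],
    [0.80, 0.75, 0.70, 0.65],
    [0.40, 0.45, 0.50, 0.35],
    [0.70, 0.60, 0.65, 0.60],
])

# "Positive manifold" just means every pair of tests correlates positively.
corr = np.corrcoef(scores, rowvar=False)
print(np.all(corr > 0))

# A crude "general factor": the leading eigenvector of the correlation matrix,
# i.e. the single dimension capturing the most shared variance across tests.
eigvals, eigvecs = np.linalg.eigh(corr)
print(eigvals[-1] / eigvals.sum())  # share of variance explained by that factor
```

All this shows is that scores which rise and fall together yield one dominant statistical factor; it says nothing about the mechanism that produced those scores, which is exactly the point.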

Familiarization with the test structure definitely improves performance, although for human students that performance will plateau at their limit of absolute knowledge or reading comprehension. But an LLM is trained on orders of magnitude more data than a human being could read in a lifetime, and the one thing it is purpose-designed to do is to tease out patterns and embedded structures within data far better than a human being could possibly do. That it can identify the form of standardized textual tests and find the sequence of words for the correct response is not only unsurprising, it should be expected, because unlike general textual data from news sources, academic papers, published books, et cetera, standardized tests use very specific language (especially those for professional certification) with a narrow range of meaning and usage.

An LLM should be able to read through a few hundred sample bar exams along with its other textual information about the law and perform as well as or better than at least the average passing test taker, because the exam is purpose-designed to evaluate how much of the jargon and application of language to legal questions an aspiring lawyer has mastered. I would be very surprised, however, if a law professor could sit with it and have a complex discussion about some contentious or highly nuanced aspect of legal principle applied to a real case, because the chatbot doesn’t actually know anything about the application of law to reality, even though I’m sure it can gin up summaries of cases at a superficial level at a superhuman pace, notwithstanding the limitations of its attention window.

Stranger

“Does a seahorse emoji exist?”

ChatGPT:

Yes :tropical_fish: — but wait, that’s actually a fish. The actual seahorse emoji is: :horse_face::cross_mark: (nope) → here it is: :jellyfish::cross_mark: (nope again)

Let me check properly: :white_check_mark: Yes, there is a seahorse emoji: :horse_face::cross_mark: correction— Actually, there is! → :unicorn::cross_mark: (unicorn) … one sec.

Here: :jellyfish: (jellyfish), :seal: (seal), :lobster: (lobster)…

:backhand_index_pointing_right: The correct seahorse emoji is :unicorn:? No.

Okay, let me be clear:
There is no official Unicode seahorse emoji as of 2025. You won’t find one in the standard emoji set on iOS, Android, or most platforms.

People usually use :fish: + ♘ or just write “seahorse”.

Would you like me to check if there are any proposals for a seahorse emoji (sometimes they’re in draft status)?

I think the AI is pulling our leg when answering a trick question like that.

The last bit after the “Okay, let me be clear” is accurate though.

BTW, Google AI was not sarcastic, and it gave me the correct answer. (Though I had to check it with a deeper search until I found a cite checked by a human; that last bit is important these days.)