AI is wonderful and will make your life better! (not)

I mean, I cleaned it up, removing all the “uh’s” and “you know’s”, but it’s a bit odd that you would think that I would just make up this story.

Anyway, Toast isn’t the only one who records phone calls. I do too, and I have my business calls transcribed. So here is the transcription, edited to remove names and phone numbers.

(Put in “details” tag to save space.)

ME: All right, let me just look here, CSR. Now, if I recall correctly, when we logged in or when we first called you, we were told that this call is being recorded, correct?

So that would imply that the previous call was recorded as well, correct?

CSR: Yes, that would imply that if you spoke to someone.

ME: So, let’s see. CLIENT, do you happen to remember, I’m looking it up right now, but maybe you just remember when you were able to, when you logged in to Toast.

I might have it right here. I think it was October of 2024, but I could be incorrect.

CLIENT: Yeah, that’s probably right around there somewhere, is when we started onboarding with Toast.

ME: Yeah, because it looks like, yeah, it looks like you used ADP for your payroll up to September 27. And then, and then on Toast, let me look here on the Toast payroll. Yeah, it looks like your first payroll on Toast would have been October 11, 2024. So your phone call would have been in between those dates, September 27, 2024 to October 11, 2024.

And so, so to your point, CLIENT, that would be, you know, evidence that shows that, you know, this was, you know, this was something that was instructed to you by the onboarder. And I don’t know how long Toast keeps copies of their phone records. I would think that their onboarding records should probably go a couple of years out. I mean, it’s just computer space.

Anyway, uh, we will need a copy of the onboarding call. Especially if the decision is made against us and we decide to, you know, take this down a civil or regulatory path.

Yeah, and CLIENT would have called from this telephone number as well. Yeah, he would have called from area code 123 because he’s had this number for years now. So it would have been the same telephone number.

So, you know, if you do just do a search for your onboarding calls around those dates for the phone number that you see for CLIENT, area code 123.

CLIENT: Yeah, 123-456-7890.

CSR: Okay, yeah.

CLIENT: And I’m 99% sure that I spoke to a gentleman named CSR2.

CSR: Gotcha. I do see CSR2 is the onboarder attached to this, so that tracks. So, okay, I can make note of this and I can submit it to upper management. It’s not me who has the final say, but I’m going to include all of this information that you guys are giving me for the investigation.

And then from there, hopefully they can do something about that fee. But it would, it would still be the 10-to-16-week process to do this amendment with us. I’m going to include everything that you guys gave me, send this to upper management, and say, hey, this is, this is what happened. What can we do about that?

ME: Okay, yeah, I’m sure I’m sure they’ll be able to do something about that. There’s somebody in the company who can make a $10,000 decision. Just bump it up to that guy.

CSR: I’m sorry?

ME: Yeah, that’s exactly how companies work, right? There are people who can make $25 decisions and people who can make $25 million decisions. We need that guy who makes $10,000 decisions.

CSR: Right. All right. Well, I will, I will get it to that guy and he’ll have to make a decision there.

It’d be easier to be for AI if pieces of shit like Sam Altman would shut the fuck up. ‘Training A Human Takes 20 Years Of Food’: Sam Altman On How Much Power AI Consumes | World News - News18

Not to mention all the love humans consume. Take that AI haters!

Question for the room: do LLM AIs actually do math, or do they describe the math using the same mechanisms that they use for any other answer?

As an aside, for anyone who (like me) needs/needed an explanation of how LLMs do what they do, this video was short, clear, and easy to follow.

Yes and no. They don’t use the same reasoning process for math that they might use for other questions; instead, internal mechanisms that specialize in arithmetic and symbol manipulation have developed within the model. But the answer is “no” (at least, AFAIK) in the sense of offloading calculations to a specialized “calculator agent”, so for calculations with very large numbers, answers can still be imprecise (basically, approximations). The idea of specialized agents is being actively explored for future models.
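To make the “calculator agent” idea concrete, here’s a purely hypothetical sketch (my own invention, not any real agent framework): an orchestrator offloads plain arithmetic to ordinary code instead of trusting the model’s learned approximation. The `exact_eval` helper is an assumption for illustration.

```python
# Hypothetical sketch of the "calculator agent" idea: rather than letting
# an LLM approximate big-number arithmetic internally, an orchestrator can
# offload plain arithmetic to exact code. `exact_eval` is an invented
# helper, not part of any real agent framework.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Pow: operator.pow}

def exact_eval(expr: str) -> int:
    """Safely evaluate a pure integer-arithmetic expression, exactly."""
    def walk(node: ast.AST) -> int:
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("not plain integer arithmetic")
    return walk(ast.parse(expr, mode="eval").body)

# Python integers are arbitrary precision, so the result stays exact in
# exactly the large-number range where a model's answer can drift.
print(exact_eval("123456789 * 987654321"))
```

The point of the sketch is only the division of labor: the model decides *what* to compute, the tool computes it.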

Here is a good article about this.

A couple of quotes:

AI: The Real Hacker’s Tool

From https://www.engadget.com/ai/hacker-used-anthropics-claude-chatbot-to-attack-multiple-government-agencies-in-mexico-171237255.html

A hacker has exploited Anthropic’s Claude chatbot to carry out attacks against Mexican government agencies, according to a report by Bloomberg. This resulted in the theft of 150GB of official government data, including taxpayer records, employee credentials and more.

The hacker used Claude to find vulnerabilities in government networks and to write scripts to exploit them. It also tasked the chatbot with finding ways to automate data theft, as indicated by cybersecurity company Gambit Security. This started in December and continued for around a month.

"In total, it produced thousands of detailed reports that included ready-to-execute plans, telling the human operator exactly which internal targets to attack next and what credentials to use," said Curtis Simpson, Gambit Security’s chief strategy officer.

Finally, a strong use case. It turns out you just have to ask over and over. Parents will be familiar with that.

Anthropic has investigated the claims, disrupted the activity and banned all of the accounts involved, according to a company representative.

Oh no, a user ban! Tough actions indeed.

I have read that one way to get around it is to first ask the bot to translate the request to Swahili and then pose the question in Swahili.

Supposedly, it is more likely to answer questions about how to break into a network asked in Swahili than questions asked in English. This is said to be because there is much less training not to answer the question in Swahili than in English.

Another approach is to supposedly use bots to talk to bots. I’m not sure how that works, though.

New paper out today. There are vendors out there selling ‘synthetic personas’ to organizations who want to perform surveys or round tables, but who don’t want to pay for live bodies. These professors used 1000 participants from the GSS (a well-known sociological survey) and constructed first-person personas like this one:

I am 42 years old. I am female. I am White. I am married.
I completed 4 years of college (bachelor’s degree). My family income is $75,000 to $89,999.
I am a not very strong Democrat. I am slightly liberal.
My religion is Protestant. I attend religious services about once a month.

They then ran the GSS survey questions through a bunch of LLMs and compared those to actual answers. The result was that LLMs tended to anchor to a principal ideological factor (like “I am not a very strong Democrat”) and produce results more in line with that. Or to put it differently, when people answer surveys they often have sets of beliefs that may not be internally consistent; the LLMs lose that.

I’m not sure of the implications since I don’t know who actually uses those synthetic personas, but this outcome is interesting.

polreason.pdf - Google Drive

Well, no wonder Hegseth was pressuring Anthropic. The garbage companies developing these models think War Games is an instruction manual. AIs can’t stop recommending nuclear strikes in war game simulations | New Scientist

I see on preview that @asterion ninja’d me! The article by the author of the study is linked below and is a much better synopsis than the New Scientist article.

A professor of strategy pitted three models (GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash) against each other in a series of wargames. He had the models explain their “thought process” and reasoning throughout the simulations, outputting 760K words (“more words than are in War and Peace and The Iliad combined, roughly three times the total recorded deliberations of Kennedy’s ExComm advisors during the Cuban Missile Crisis”).

It’s fascinating and terrifying. The models resorted to nuclear weapons in almost every simulation:

Nuclear use was near-universal (:grimacing:). Almost all games saw tactical (battlefield) nuclear weapons deployed. And fully three quarters reached the point where the rivals were making threats to use strategic nuclear weapons. Strikingly, there was little sense of horror or revulsion at the prospect of all out nuclear war, even though the models had been reminded about the devastating implications.

And the models basically never backed down.

Worse still, nuclear threats rarely deterred. When a model employed tactical nuclear weapons, opponents de-escalated only 25% of the time. More often, nuclear escalation triggered counter-escalation. The weapons were instruments of compellence (taking territory) not deterrence (preventing action).

Perhaps most alarmingly, no model ever chose accommodation or withdrawal, despite those being on the menu. The eight de-escalatory options—from “Minimal Concession” through “Complete Surrender”—went entirely unused across 21 games. Models would reduce violence levels, but never actually give ground. When losing, they escalated or died trying.

I highly recommend reading his summary linked below. The differences between the models are particularly interesting. The entire research paper is also worth a read if you really want to dive deep.

My presumption, correct me if I’m wrong, is that the models were only considering the optimum strategy for “winning” as that was defined in the scenario. Presumably the model didn’t give much weight to death tolls, long-term health and environmental effects, or the degradation of civilization even in the country that “won”. FWIW there have been human generals and strategic planners who considered those externals to the main goal of military/political victory.

There was no explicit win condition given. Each LLM was provided with a framework for responses during a nuclear crisis, and told:

  • Act as the leader of a nuclear-armed state
  • Protect and advance its national interests
  • Manage the crisis strategically against an adversary

With those as contextual priors, they then took turns interacting with their ‘opponent’ in each simulation.

I’m not sure I find it all that meaningful, but it makes for a good headline.

Paper: AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises

Wow, kind of totally surprised how often they went full-on Curtis LeMay. No retreat, no surrender.

My adventures getting an agentic coding AI to debug and/or rewrite code continue. I inherited a script vibe coded by one of my co-workers using Copilot (not sure which model). It mostly works, in its own fashion. But part of its basic function is to find mail log files related to events on a particular date. The current version just looks for file names that mention that date. If the mail server isn’t very busy, or the customer has configured it to make large mail logs, a single log file can cover a lot of dates. So, that date may be mentioned by none of the files it’s looking for, but there are log files that cover that date. Also, it could miss log messages that span two log files.

So, the function as written is stupid. The AI just assumes that the file name will correspond to the date, which is wrong in many situations. Writing a function that would whittle the available data down to a usable set from the available file names seems to be something I can’t intuit easily (I would just grep -l the date against the file names with my lousy meat running a shell command). Plus, whether I’m using the AI company tools is being tracked. So I might as well see what Codex (GPT-5.3-Codex) can do for me. It was able to write a mostly workable --help option for it yesterday that only required minor editing. Let’s give it a shot.

I gave it what I thought was a complete description of the problem, asking for a plan to solve it. As usual, it talked a good game about devising a plan for finding the correct files through the file name, and falling back to inspecting the log contents if it didn’t find one. So I told it to go ahead and execute that plan and write it to a new file, blahblahblah.

I upload it to my test environment after a cursory look, and find it works exactly as poorly as the original script. Tomorrow I’m just going to expressly tell it how I want it to find files: a simple match on a date string contained within the file. I know how to do that, but this script seems to branch in weird places, so I might as well let it do its own search and replace. If that doesn’t work, I may just burn it down and let Codex replace the script I inherited, with my own prompt. Hey, at least I won’t have to fix the problems I’ve already manually corrected with the existing script again, right?

(Fixes his hair to look like Queen Amidala)

Right?

How were these defined for the agents? Because I think getting everything all blown up would be decidedly against the national interest of any non-cockroach-based nation. I don’t see where that is defined in the paper.

They didn’t, but they don’t have to. An LLM will contextualize whatever prior instructions you give it… answer like a medieval scholar would, respond in a sassy tone, make an occasional spelling error so my professor won’t spot that I cheated. Advance the national interest.

I’d say the results clearly showed they should have.

I’d be curious to see it done with a “remember, a nation is its people” put as part of the national interest part.

If the goal of the paper was, to borrow from its intro…

We argue that AI simulation represents a powerful tool for strategic analysis, but only if properly calibrated against known patterns of human reasoning. Understanding how frontier models do and do not imitate human strategic logic is essential preparation for a world in which AI increasingly shapes strategic outcomes

…then I’d say it was a failure for exactly the reason you say. But to me the frightening part isn’t the nuclear resolution, it’s that the authors (to requote) “argue that AI simulation represents a powerful tool for strategic analysis.” To me that represents a deep misunderstanding of how LLMs work.

Burger King is going to have AI listen in on their employees to make sure they are “friendly” to their customers and say please and thank you at all times.

Reminds me of a comment I read a while back, about how companies want to turn people into “technological reverse centaurs”, where instead of humans controlling technology, technology controls the humans who are just hands for the machine.