Many of the companies producing those systems, however, train them on text taken from the internet. That may lead to a feedback loop in which the same AI systems being used to produce that text are then trained on it.
I just searched for the expected wait time for a US governmental vetting process. The automatic AI summary at the top of the search results says “30 days.” That answer appears to derive from statements on third-party sites, many of which are grossly outdated. The actual answer, from the actual governmental site, is 12-24 months.
Yeah, that’s probably a worse case than the OP’s. With the OP’s question, they knew they hadn’t gotten an appropriate answer, and had the choice of digging deeper to find a good one, trying some other resource, or giving up. Your example, on the other hand, looks like a perfectly reasonable answer, which could easily mislead someone into thinking it’s correct and not looking any further at all, and that could have disastrous consequences if they need that vetting done by some deadline and miss it.
Those AI summaries sometimes make me so mad. They are usually either obvious or erroneous. And they’re everywhere now! I just saw they are now the top response on Quora (yes, I know). In the one I just looked at, the AI answer was not only completely wrong but actually contradicted by several of the real answers below it.
For no other reason than that this was a fairly long post, I asked ChatGPT to summarize it. This is what I got:
Large language models (LLMs) like ChatGPT aren’t truly intelligent in the way humans think of intelligence. They are sophisticated versions of technology that allow visual AI to identify patterns, such as recognizing clouds from a set of images. LLMs are specifically designed to handle language data, enabling them to process a prompt and produce coherent and grammatically correct responses.
However, LLMs lack the ability to evaluate which response to a prompt is better or more valid. They generate answers that fit the prompt based on their training data but don’t inherently judge or validate the information. Similarly, they don’t draw new conclusions; they can only reflect relationships or conclusions present in their training data.
The quality and nature of their output depend entirely on the data they’re trained with. For instance, an LLM trained solely on a specific political ideology will produce responses consistent with that ideology. If trained on diverse viewpoints, it may present both perspectives without favoring one over the other.
LLMs aren’t capable of understanding deeper relationships or concepts; they merely process and regurgitate information in a structured way. When asked about European cities at the same latitude as Portland, Oregon, an LLM can provide a sensible answer about latitudes and weather but doesn’t truly understand the underlying relationships between these concepts. This limitation is due to their design, which focuses on language processing rather than true comprehension or reasoning.
That seemed fairly accurate, but verbose, so I asked it to be more brief, and it produced:
Large language models (LLMs) like ChatGPT aren’t genuinely intelligent but are advanced pattern-recognition tools designed to handle language data. They generate coherent responses based on their training data but lack the ability to evaluate the quality or validity of those responses. Their output reflects the information they’re trained on, without forming new conclusions or opinions. LLMs process prompts and provide structured answers without truly understanding the underlying relationships between concepts.
Seems to be rather ironic that it produced an accurate summary of a diatribe claiming that AIs can’t summarize worth shit.
I’m well aware of the many limitations of LLMs and the reasons for them, and I’ve run into many myself. I just think it’s a remarkable advance in natural language processing and interactive information retrieval, one that’s been improving very rapidly over timespans of mere months. We’ve come a very long way from when AIs couldn’t understand idiomatic expressions or real-world contexts; now they not only interpret them correctly but can explain them.
Here’s a good example of both sides of this argument. ChatGPT actually did an impressive job overall of summarizing this abstract from a technical paper. On the negative side, it made a couple of mistakes – but they’re very much the kind of mistakes that a human might have made.
Original:
The globally integrated sea-air anthropogenic carbon dioxide (CO2) flux from 1990 to 2009 is determined from models and data-based approaches as part of the Regional Carbon Cycle Assessment and Processes (RECCAP) project. Numerical methods include ocean inverse models, atmospheric inverse models, and ocean general circulation models with parameterized biogeochemistry (OBGCMs). The median value of different approaches shows good agreement in average uptake. The best estimate of anthropogenic CO2 uptake for the time period based on a compilation of approaches is −2.0 Pg C yr−1. The interannual variability in the sea–air flux is largely driven by large-scale climate re-organizations and is estimated at 0.2 Pg C yr−1 for the two decades with some systematic differences between approaches. The largest differences between approaches are seen in the decadal trends. The trends range from −0.13 (Pg C yr−1) decade−1 to −0.50 (Pg C yr−1) decade−1 for the two decades under investigation. The OBGCMs and the data-based sea–air CO2 flux estimates show appreciably smaller decadal trends than estimates based on changes in carbon inventory suggesting that methods capable of resolving shorter timescales are showing a slowing of the rate of ocean CO2 uptake. RECCAP model outputs for five decades show similar differences in trends between approaches.
ChatGPT summary:
From 1990 to 2009, scientists studied how much human-caused carbon dioxide (CO2) is absorbed by the ocean using various models and methods. They found that, on average, the ocean takes in about 2.0 billion tons of CO2 per year. This absorption rate can change yearly due to large climate changes, with a variability of about 0.2 billion tons per year. However, different methods showed varying trends over the decades, with some showing a decrease in the rate of CO2 uptake. Models focusing on shorter timescales suggest that the ocean’s CO2 absorption rate is slowing down.
I thought it was generally a pretty competent summary for a lay reader who may not have understood the original, but there were two mistakes. I like the way it understood that “−2.0 Pg C yr−1” meant 2 petagrams per year and translated that to tons, but 2 petagrams is 2 billion metric tonnes, whereas “tons” would conventionally be interpreted to mean US tons, so 2 Pg would equal about 2.2 billion tons. And the “C” refers to the carbon component of that CO2, not the total mass of the CO2 itself, an understandable mistake that a human might have made, too.
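Just to spell out the arithmetic behind those two tweaks, here’s a minimal Python sketch; the conversion factors (metric tonne, US short ton, and the ~3.66 CO2-to-carbon mass ratio) are standard values I’m supplying, not figures from the paper:

```python
# Sanity check of the unit conversions discussed above.
# Standard conversion factors, not taken from the paper itself:
PG_IN_KG = 1e12              # 1 petagram = 1e15 g = 1e12 kg
TONNE_IN_KG = 1_000.0        # metric tonne
SHORT_TON_IN_KG = 907.18474  # US (short) ton
CO2_PER_C = 44.01 / 12.011   # mass of CO2 per unit mass of carbon (~3.66)

flux_pg_c = 2.0  # the abstract's -2.0 Pg C per year, taken as a magnitude

kg_c = flux_pg_c * PG_IN_KG
print(f"{kg_c / TONNE_IN_KG / 1e9:.1f} billion metric tonnes of carbon")    # 2.0
print(f"{kg_c / SHORT_TON_IN_KG / 1e9:.1f} billion US tons of carbon")      # ~2.2
print(f"{kg_c * CO2_PER_C / TONNE_IN_KG / 1e9:.1f} billion tonnes of CO2")  # ~7.3
```

So ChatGPT’s “about 2.0 billion tons of CO2 per year” was off on both counts: the figure is 2 billion metric tonnes (about 2.2 billion US tons) of carbon, which corresponds to roughly 7.3 billion tonnes of CO2.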
So, a useful summary that needed two tweaks. Where that falls on the spectrum between a perfect summary and gibberish is a matter of judgment, but it would be pretty hard to dismiss as “gibberish”. If I had asked a general-studies student to summarize that paragraph in their own words, I’d probably have downgraded them to a “B” or “B-” for those two mistakes.
And if I had been looking for a summary for a general audience, the ChatGPT output would have been useful – especially if the original text was quite long – as it’s much easier to tweak some minor mistakes than to write something from scratch.