Is AI overhyped?

To call it a fancy search engine really is not a good description. It’s still pretty piss poor as a search engine and should not be used in lieu of one.

Fair enough. I likely had unrealistic expectations. Holy hell, I think I’m turning into my 80-year-old mother. :grinning:

There is reasoning in the responses, even when the answer is wrong - like if you prompt:
It takes 2 hours to dry 2 towels on a washing line; how long will it take to dry 5 towels?
Earlier versions of ChatGPT answered 5 hours, and gave a reasoned explanation that the time was proportional to the number of towels - it isn’t, and that’s a really naive answer, but ‘naive’ is not the same as being devoid of intelligence.
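For what it’s worth, the naive answer confuses proportionality with parallelism: assuming the line holds all the towels at once, the time is constant. A minimal sketch of the arithmetic (the line capacity of 5 is just my assumption):

```python
import math

# Minimal sketch of the towel riddle, assuming the washing line holds
# `line_capacity` towels at once, so each batch dries in parallel.
DRY_TIME_HOURS = 2  # time for one batch to dry, from the riddle

def drying_time(num_towels: int, line_capacity: int = 5) -> int:
    """Hours needed when towels dry in parallel, in batches."""
    batches = math.ceil(num_towels / line_capacity)
    return batches * DRY_TIME_HOURS

print(drying_time(2))  # 2
print(drying_time(5))  # still 2, not the naive proportional answer of 5
```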

I still find it remarkable that just by predicting the next token, coherent language emerges, along with the capacity to follow instructions, to remember things and to attempt to reason.
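To be concrete about what “predicting the next token” means, here’s a rough sketch of the generation loop; `model` and `tokenizer` are hypothetical stand-ins, not any particular library’s API:

```python
# Rough sketch of autoregressive (next-token) generation.
# `model` and `tokenizer` are hypothetical stand-ins, not a real library's API.
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    tokens = tokenizer.encode(prompt)           # text -> list of token ids
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)  # distribution over the vocabulary
        next_token = max(range(len(probs)), key=lambda t: probs[t])  # greedy pick
        tokens.append(next_token)               # feed the choice back in
        if next_token == tokenizer.eos_id:      # stop at end-of-sequence
            break
    return tokenizer.decode(tokens)
```

Everything coherent the model produces, including instruction-following and its attempts at reasoning, falls out of repeating that one loop at scale.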

Correct. And newer versions seem to have fixed/learned this. GPT-4 (which is what my ChatGPT defaults to) actually gives you the correct answer of 2 hours and says it is not proportional to the number of towels. So, the newer version is right. Bing’s Copilot also answers correctly. As does Google’s Gemini. Claude, as well. Here’s Claude’s take, for instance, which had the most verbose answer:

Jeez. I imagine most humans fall for that riddle. I guess Claude fails the Turing test because it answers it correctly?

It’s overhyped. Not because it’s useless, but because it’s so very hyped. And because it’s being so badly applied, like some magic do-everything tool.

First of all, ChatGPT is not “AI”; ChatGPT is one specific implementation of one specific type of AI technology, the Large Language Model (LLM).

LLMs have evolved high skill in natural language understanding and synthesis, which is their primary strength. They aren’t “search engines” any more than the human brain is a search engine, and much like a human, their information retrieval is subject to varying degrees of accuracy and reliability, or lack thereof. In fact, even though they often return accurate information, the GPT “personality” is somewhat akin to that of a human with a tendency to bullshit.

But there are some really remarkable things about GPT technology that shouldn’t be underestimated. The ability to understand the nuances of natural language with all its idioms and inherent ambiguities is very powerful and has been an elusive goal in AI for many decades.

It’s also remarkable that at a sufficiently large scale, GPTs have evolved the ability to solve non-trivial logical problems, including the kinds of problems designed to measure human intelligence, even though they were never explicitly designed for that purpose and such new skills are entirely emergent and unpredictable.

A couple of articles that might help inform this debate:

An attempt, through a $1M prize pool, to knock the current group-think in AGI out of its rut:

From the article:

Ultimately, the ARC Prize is trying to accelerate AGI research through fresh ideas and innovative open-source solutions, as well as speed up the timeline for AGI.

“Right now we’re stuck in this LLM-only world,” says Chollet. “But LLMs are not by themselves intelligent—they’re more like a way to store and retrieve knowledge, which is a component of intelligence, but it is not all there is.”

And an interesting study that points to the difficulty LLMs have solving “novel problems” (problems they haven’t been trained on).

From the article:

“A reasonable hypothesis for why ChatGPT can do better with algorithm problems before 2021 is that these problems are frequently seen in the training dataset,” Tang says.

Essentially, as coding evolves, ChatGPT has not been exposed yet to new problems and solutions. It lacks the critical thinking skills of a human and can only address problems it has previously encountered. This could explain why it is so much better at addressing older coding problems than newer ones.

I propose a New Rule of the SDMB:

“Everything a CEO says is a lie”.

Seriously, the nonsense that spews forth is infuriating.
Witness Musk’s Cybertruck, which is as bad as a Third Grade drawing in crayons, breaks every 15 seconds, and is being mistaken by Bears, Opossums and (other) Raccoons as a Garbage Dumpster.

There’s likely some truth to that, but only a little. A lot of this, in my anecdotal experience, is bullshit. I say this at least partially based on some of the logic problems I’ve challenged GPT with. I can think of examples that were closely modeled on IQ test questions if not taken directly from them.

The key to doing well on an IQ test is not just getting the right answer, but getting it quickly so you can move on and finish the more difficult questions in time. There were cases where GPT failed to see some obvious logical shortcuts and slogged through the problem by unnecessarily formulating and solving equations. It ultimately got the right answer but in a rather inelegant way. This does not sound to me like GPT had in any way been trained to know the answer or perceive the shortcut, but rather that it slogged its way through the problem on its own initiative. Probably somewhere along the line it had been taught – or had spontaneously developed as an emergent skill – the idea that formulating and solving equations was a productive way to solve certain problems, but that’s also pretty basic to the way humans are taught, too.

Then you should publish your results (if only on arXiv).

But only if you’ve taken care to make your tests systematic as described in IEEE Transactions on Software Engineering (Volume 50, Issue 6, June 2024), pp. 1548–1584 (Journal Impact Factor: 6.5).

From the paper’s Abstract:

In this study, we perform a systematic empirical assessment to the quality of code generation using ChatGPT, a recent state-of-the-art product LLM. We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task. Our evaluation encompasses a comprehensive analysis of code snippets generated by ChatGPT, focusing on three critical aspects: correctness, complexity, and security.

From the paper’s Conclusions:

Overall, our findings uncover potential issues and limitations that arise in the ChatGPT-based code generation and pave the way for improving AI and LLM-based code generation techniques.
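For anyone wondering what “correctness” checking amounts to in studies like this, the general idea is simply running each generated snippet against reference test cases. A toy illustration (my own sketch, not the paper’s actual pipeline):

```python
# Toy illustration of correctness checking for generated code:
# run each candidate solution against reference (input, expected) pairs.
# This is my own sketch, not the paper's actual evaluation pipeline.
def passes_all_tests(candidate_fn, test_cases) -> bool:
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return False
        except Exception:
            return False
    return True

# Example: a "generated" solution for summing a list
generated = lambda xs: sum(xs)
tests = [(([1, 2, 3],), 6), (([],), 0), (([-1, 1],), 0)]
print(passes_all_tests(generated, tests))  # True
```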

buncha haters

I’ve posted this once before, but I was pretty impressed that GPT-4o figured this one out with an uploaded photo and a prompt (the uploaded photo was in full resolution.) I snapped the photo not for testing AI, but for myself and to see how many of my human friends caught the discrepancy. Half did not:

Wut??? I’m talking here about skills in solving problems in logic, such as the types of questions posed in IQ tests to assess basic intelligence. You’re focused on coding skills, which is a niche specialty only tenuously related to general intelligence. Maybe I read too much into the statement “It lacks the critical thinking skills of a human and can only address problems it has previously encountered” if you meant it specifically with regard to coding skills, but my point was that it’s definitely not true in the general case and, again, coding skills are just a niche specialty in the overall realm of intelligence. And in that context, the statement that GPT “lacks the critical thinking skills of a human” and “can only address problems it has previously encountered” has repeatedly been shown to be just flat-out wrong.

I can dig up some of the specific examples I was alluding to earlier if you like, but here are a couple of things I posted before* that may be even more persuasive.

To those who claim that ChatGPT and its ilk don’t actually “understand” anything and are therefore useless, my challenge is to explain how, without understanding anything, GPT has so far achieved the following – and much, much, more, but this is a cut and paste from something I posted earlier:

  • It solves logic problems, including problems explicitly designed to test intelligence, as discussed in the long thread in CS.
  • GPT-4 scored in the 90th percentile on the Uniform Bar Exam.
  • It aced all sections of the SAT, which among other things tests for reading comprehension and math and logic skills, and it scored far higher across the board than the average human.
  • It did acceptably well on the GRE (Graduate Record Examinations), particularly the verbal and quantitative sections.
  • It got almost a perfect score on the USA Biology Olympiad Semifinal Exam, a prestigious national science competition.
  • It easily passed the Advanced Placement (AP) examinations.
  • It passed the Wharton MBA exam on operations management, which requires the student to make operational decisions from an analysis of business case studies.
  • On the US Medical Licensing exam, which medical school graduates take prior to starting their residency, GPT-4’s performance was described as “at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations.”

The converse question that might be posed by its detractors is that, if GPT is so smart, how come it makes some really stupid mistakes, including sometimes a failure to understand a very simple concept that even a child would understand? The answer, in my view, is simply that it’s because it’s not human. We all have cognitive shortcomings and limitations, and we all sometimes misunderstand a question or problem statement, but because an AI’s cognitive model is different, its shortcomings will be different. I strenuously object to the view that because GPT failed to properly understand or properly solve some problem that seems trivially simple to us, that therefore it doesn’t really “understand” anything at all. The fact that it can generally score higher than the vast majority of humans on tests explicitly designed to evaluate knowledge and intelligence seems to me to totally demolish that line of argument, which some philosophers have been harping on ever since Hubert Dreyfus claimed that no computer would ever be able to play better than a child’s beginner level of chess.

And to those who claim that GPT could only do this well because all the questions and answers were in its database, no, they were not, unless OpenAI is blatantly lying to us. Again, from one of my previous posts:

OpenAI made a point of the fact that GPT had never been exposed to any of the tests that were used in its performance scoring. That it may have seen similar tests is irrelevant, as it boggles the mind how a putative “token prediction” machine could use that information to improve its performance on completely different questions. Humans can benefit from seeing prior test materials because it familiarizes them with the methodology and allows them to practice and hone their skills with it. OpenAI was very explicit that GPT received no such specialized training.

* Those posts were from a larger discussion in this thread.

Probably one of the biggest sources of overhype is the claim that AI is going to cause a massive increase in productivity, resulting in a massive loss of jobs.

Yet in the last few decades we have had another technology which one would have guessed would have caused a massive increase in productivity: computers and the internet.

Yet when you look at the productivity numbers, the yearly productivity increase for the last 17 years is substantially lower than for the 60 years before that:

https://www.bls.gov/productivity/images/pfei.png/

So which part of the curve are we at with regard to self driving cars?

Is the “obsolete before occurs” label obsolete if it never occurs?

Having taught the MCAT for a couple years, I can say that the ability to do well on this sort of test is far from impressive. These tests rely on repeating pretty similar questions each year for the sake of statistical validity. And there are enormous volumes of question banks online and companies which specialize in test preparation.

Which is not to say AI is not impressive. But this isn’t the part that is, even though many seem amazed by these feats. They might not include the times AI scored less than the numbers quoted though. And AI can take a lot of tests.

On this particular curve, you’re looking at “autonomous vehicles” for the self driving element. Looks like it’s on its way out of the trough of disillusionment but not moving quickly (5-10 years away). Which is probably a fair assessment. There are users now that will tell you “You can have my self driving app when you pry it from my cold, dead fingers”. Of course, there is still plenty of slope to climb. Here’s a recent article on it.

TLDR: Self driving is a lot safer than humans at the parts it is good at, but has major gaps that need to be filled.

My knowledge of the MCAT is precisely zero, but I think a few observations are nevertheless called for here. Just for starters, one would assume that the Medical Licensing Exam is considerably more advanced than MCAT, reflecting what graduating students about to become actual physicians should know as opposed to those just starting medical school.

As for repeating questions, it’s understandable that similar though not identical questions would appear year after year, and that there would be collections of sample questions to help students prepare. But note that if OpenAI is to be believed – and I have no reason to doubt them – GPT received no special training on specifically related materials for any of these tests. And even if it had, I would argue that the ability to generalize from sample questions to a strengthened performance on tests containing similar but factually different questions is itself a measure of intelligence.

Finally, even if one is dismissive of the value of GPT’s performance on any one of these tests, its aggregate performance on all of them has surely got to be acknowledged as impressive.

I agree, but it’s not AI researchers and developers who are doing the hyping, it’s ignorant pundits in the media. Indeed we’ve heard this kind of hype before, going back to the “computer revolution” of the 1960s. It was wrong then, and it’s wrong now: wrong in its gloomy predictions about the impact on job markets while, ironically, underestimating the technology’s social impact.

Sure, it is a form of intelligence. Medical licensing exams are more specialized material, but exactly the same stuff applies to the USMLE (which I taught a little) and its Canadian equivalent. There is a lot of relevant and similar stuff online. One presumes this was looked at by a well-scoring AI even if much other stuff less relevant to a specific test was also reviewed. I’m not trying to say it is unimpressive. But it is not really equivalent to thinking the AI is a superior clinician, and it is less momentous than many think.

I agree. We’re not there yet and it may be quite some time before we are. One of the hoped-for goals of IBM’s Watson was to train it to be a physician’s assistant, providing advice based on access to a vast repository of medical literature. Yet despite Watson’s unique technology that includes comprehensive confidence filtering, it turned out to give bad advice often enough to be alarming.