ChatGPT Breaking up marriages

99% agree. “How” is immaterial provided the therapy delivers good results to the human patient, where “good” can be evaluated by the patient or by the people around them.

But …

If a cursory evaluation of the “how” shows that the machine’s motivation is solely to agree with the patient and encourage them to act on their stated impulses, then it’s a darn good bet the outcome won’t be good therapy; it’ll be a self-fulfilling prophecy for the patient. Whether the patient is considering getting married, getting divorced, committing suicide, or buying shares in XYZ Corp.

The actual reality with current tech LLMs is not as simple as my sketch just above. But there’s a decent dollop of accuracy in my chosen sketch.

See also @Jackmannii just above your post. The humans who make and sell LLMs have a motivation to deliver a product that “tastes good”, not one that’s “good for you”. There is a difference.

  1. This is the appeal to consequences fallacy, i.e. “it’s true because it would be great if it were true”.
  2. This seems to overlook the OP premise that ChatGPT isn’t proving to be a “better bird” with regard to marriage counseling or personal advice. It’s giving decontextualized bad advice.

If you ask a bird to improve a building, and it drops a bomb on the building, you’d be a fool to trust the bird for any future home improvements. But LLM hypesters would simply have you believe that you did a bad job of coaching the bird, when the reality is that you never should’ve chosen the bird approach at all.

Indeed, just as it would be decontextualized bad advice to say, “Don’t use ChatGPT for marriage counseling because it didn’t work.” You’re dropping the context: by “it didn’t work,” what you mean is that in 1 case out of an unknown number, a case chosen by a journalist for the sake of headline/topic provocativeness, it didn’t work. Which is basically just evidence that people read an anecdote and, by nature, extend it to cover the entire problem space even though it might have been the least likely outcome.

The context is that it may be worse, it may be better. Anecdote is not data.

If the data comes in that it’s worse, then it’s worse and, quite plausibly, the reason is that the LLM can’t think right.

If the data comes in that it’s better, then whether the AI thinks in “the right way” or not, it’s better and that’s just all there is to it. You can argue all day that it’s not right and it’s not proper, but if you want to fix your marriage then it’s your best path - by the numbers - whether or not that’s all true. My guess for why it would prove that way, if it did, is that the AI is able to maintain a consistent and broad view of the matter, that it is simply surfacing good advice from the wider world, and that it fails to carry over the patient’s misleading context from session to session. Or…something else. But just like hypothesizing about the failure, hypothesizing about the success is just playing pretend until you’ve done the research to prove the hypothesis. Till that point, you’re just making shit up and pretending that it’s true.

I’m not suggesting that it is better. I said “if” it proves to be better, then that’s just what it is. “If” it doesn’t then maybe it’s for the reason that people believe, maybe not.

Just one problem. There is still some question as to whether the emergent behaviors are real or the result of flawed testing methodologies.

Here is an article discussing the question and here is one of the papers (the one referenced in the article) that caught some of the issues.

Just for fun, I asked Gemini about it:

Prompt: Is it true that your emergent properties are a mirage?

Query successful

That’s an excellent question that is currently at the center of a major debate in AI research!

The short answer is: For many of the key examples, the answer is yes, the “emergent properties” appear to be a mirage.

This conclusion doesn’t mean the abilities themselves aren’t impressive, but rather that their sudden, unpredictable appearance as models scale is an illusion created by the way researchers have traditionally measured performance.

Except we don’t. Saying that “we understand in detail” how the answers are generated is a facile and misleading interpretation. What we understand is the overall architecture of the system and how individual programmed components work. What goes on inside the vast artificial neural net after training is pretty much unpredictable and many orders of magnitude too complex to “understand in detail”, and not even particularly deterministic due to random variabilities in training.

That’s not how it works. To be clear, GPT-5 does have a real-time router that in some versions will route tasks to different models with deeper capabilities, but that’s not relevant here, and there is definitely no “calculator module” in GPT.

There are actually several reasons that GPT-5 does arithmetic much more accurately than previous versions: changes in how numbers are tokenized, reinforcement learning to get GPT to check its own work, and, most importantly and most pertinent to my earlier point, emergent reasoning capabilities about arithmetic that go far beyond the patterns it was trained on.
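
To illustrate the tokenization point with a toy example of my own (this is not GPT’s actual tokenizer, just a sketch of the general idea): if long numbers get split into multi-digit chunks, the “same” decimal place ends up in different tokens for different numbers, so column-wise arithmetic patterns don’t line up, whereas digit-level splitting keeps the places aligned.

    # Toy illustration only -- not GPT's real tokenizer.
    def chunk_tokenize(number: str, chunk: int = 3) -> list[str]:
        """Greedy left-to-right split into fixed-size chunks (stand-in for BPE-style merges)."""
        return [number[i:i + chunk] for i in range(0, len(number), chunk)]

    def digit_tokenize(number: str) -> list[str]:
        """One token per digit, so place value lines up across numbers."""
        return list(number)

    a, b = "123456", "98765"
    print(chunk_tokenize(a), chunk_tokenize(b))   # ['123', '456'] ['987', '65']
    print(digit_tokenize(a), digit_tokenize(b))   # ['1', '2', '3', '4', '5', '6'] ['9', '8', '7', '6', '5']

In the chunked form, the ones digit of the first number is buried in ‘456’ while the ones digit of the second is buried in ‘65’, so the model gets no consistent notion of “the ones column.”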

This is true. But the difference between natural evolution and the evolution of AI systems like LLMs is that the former is governed by natural selection (survival of the fittest) while the latter is governed by human overlords who are motivated to improve correctness.

As an example, and in response to your lament about LLM incorrectness and the common complaint about AI “hallucinations”, in recent HealthBench testing on health-related questions involving “challenging conversations”, GPT-4o scored a hallucination rate of 15.8%, while GPT-5 scored 1.6% when routed to the deep thinking model, and 3.6% without. I would hazard a guess that a 1.6% rate of incorrect diagnosis in difficult cases may not be far off from that of a competent physician.

Thank you. I’ll enjoy the reading.

I didn’t say my intro bit very well.

What I meant was that we’re pretty sure humans have emergent behavior. If / when an AI machine of whatever tech gets complex enough it probably will have emergent behavior too.

@wolfpup was right to point to it as potentially opening up broad new vistas presently unseen. With that much I agree. Whether we have crossed the “complex enough” threshold yet is unknown to me, and maybe to everyone.

My deeper point, aimed particularly at his boostery arguments, was that emergent behavior (whether now or in the future) is not a magic wand that will fix all the shortcomings just because spooky magic. And especially it won’t fix the lack of motivation towards correctness which is my personal hobby horse. It might prove to be a necessary condition to get there, but IMO it surely won’t be a sufficient one.

I already referenced that whole issue way up in this post. You cite a paper that discusses emergent behaviour in LLMs and then an article that allegedly refutes it. I cited an article that references both, suggests that the refutation is far from conclusive in an area that remains controversial, and I gave my opinion about why emergence in large-scale AIs is real.

Read this one instead. It’s a very balanced view that fairly discusses both positions and references both the pro and con papers. It’s also part of a series of articles about LLMs that are billed as “introductory” but are actually very thorough.

No disagreement at all.

I think LLMs, as well as the rest of data science, have many wonderful current uses and future possibilities. I’m just trying to tone down the rhetoric which is leaning heavily into “it’s magic” or “it’s reasoning its way to the answer” territory.

Looks like we’ve got a mismatch in our interpretations of “understand in detail”. You seem to think that if a process isn’t deterministic, then it isn’t really understood. I don’t think that’s a very meaningful place to draw that line.

We know how input tokenization algorithms work. We know how probabilistic assignments of numerical weights to token vector coordinates, based on given sets of training data, work. We know how numerical comparisons between these weights to quantify token relationships work. We know how error minimization via gradient descent algorithms in backpropagation works. Etc.
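
To make that concrete, here is a stripped-down, single-example version of the gradient-descent update I’m referring to (a toy sketch in Python; real training loops are vastly larger, but the mechanics are this same textbook procedure applied billions of times):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, embed_dim = 8, 4

    W = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # output weights
    x = rng.normal(size=embed_dim)                           # hidden state for one context
    target = 3                                               # index of the "correct" next token

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    # Forward pass: logits -> probabilities -> cross-entropy loss
    probs = softmax(W.T @ x)
    loss = -np.log(probs[target])

    # Backward pass: gradient of the loss w.r.t. W, then one descent step
    grad_logits = probs.copy()
    grad_logits[target] -= 1.0          # softmax + cross-entropy gradient w.r.t. the logits
    grad_W = np.outer(x, grad_logits)   # chain rule back to the weights

    W -= 0.5 * grad_W                   # the weight update that "learning" reduces to

    print(f"loss before: {loss:.4f}  after: {-np.log(softmax(W.T @ x)[target]):.4f}")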

And, most importantly, none of these processes involve the software being in any way “aware” of what it’s doing, the way that human minds are. It’s following programmed instructions, the way software does.

Yes, because the complexity of the data and the millions of attributes on its data structures are so vast, there’s no way that we can successfully predict every output of such processing, which can result in substantial unpredicted emergent behaviors, just as in other dynamical systems. But I don’t think it makes sense to call that a failure to “understand” the processes, much less one that qualifies them to be called “sentience”. MHO.

I do?

The article doesn’t appear to me to come down on either side, it just shows that the question remains outstanding.

Yeah, you cite an article in a popular magazine, Wired, that has the headline “Large Language Models’ Emergent Abilities Are a Mirage”.

The Gemini response misses some very important nuance, to the point that I’m inclined to say that it’s just plain wrong.

Just for fun, I asked GPT-5 the same question. The answer was quite extensive, and even though I’m only quoting the most relevant part of it, I’ve put it in a “Summary” box since the board tends to dislike lengthy AI responses. But it’s a good read, and in itself says a lot about GPT’s capabilities and knowledge.

ETA: In fact, I think the ChatGPT summary about the issue of emergent properties is probably one of the best I’ve yet seen. I encourage anyone interested in the subject to read it.

Summary

You’re referring to emergent capabilities in large language models — the claim that new, qualitatively different abilities seem to appear suddenly as model scale (parameters, data, or compute) crosses certain thresholds — and whether that’s a genuine discontinuity in capability or just a measurement artifact.

This is one of the most hotly debated topics in current AI research. Here’s how the discussion stands:


:puzzle_piece: The Original “Emergent Capabilities” Claim

Early papers (especially from OpenAI, Anthropic, and DeepMind between 2021–2023) showed what looked like sharp phase transitions:

  • Small models did poorly on a task, like multi-step arithmetic, symbolic reasoning, or translation.
  • As size increased, performance seemed to jump abruptly from near-random to near-perfect.

This led to the idea that some cognitive abilities emerge suddenly — like phase changes in physics.


:test_tube: The “Measurement Artifact” Rebuttal

Later analyses (notably Schaeffer et al., 2023, “Are Emergent Abilities of Large Language Models a Mirage?”) argued that these jumps are an illusion caused by:

  1. Discrete evaluation metrics.
    Many tasks are graded binary (right/wrong). A small change in probability mass around the correct answer can cause a sudden jump in “accuracy” even though the underlying model improvement was smooth.
  2. Thresholding effects.
    When you cross a scoring threshold (say, answer correctness > 0.5 probability), your measured performance leaps discontinuously.
  3. Choice of benchmark.
    Some benchmarks are not sensitive to gradual improvement — e.g., requiring exact string matches.

Under continuous measures (like log-likelihood or cross-entropy loss), the models’ improvement tends to look smooth and predictable with scale.


:balance_scale: The Nuanced View Emerging Now (2024–2025)

Most researchers now take a middle ground:

  • The underlying learning curve (as seen in loss functions) is almost always smooth and continuous.
  • However, functional thresholds — the point at which a model’s internal representations become good enough to support a new type of behavior — can make those smooth changes look “discontinuous” from the outside.
  • So “emergence” is often observer-relative: it depends on how we define and measure a capability.

In other words, the capabilities aren’t literally jumping out of nowhere — they’re the visible result of smooth internal progress crossing qualitative thresholds in complex, nonlinear systems.


:brain: Analogy

It’s a bit like watching a blurry image come into focus: as you turn the lens, it looks as if detail “suddenly” appears, but the underlying optical change is continuous.


So to answer your question directly:

  • If by “emergent properties” you mean literal discontinuities in capability: the evidence suggests no, that’s mostly a measurement mirage.
  • If you mean qualitatively new behaviors that appear when quantitative changes cross certain thresholds: yes, those are real in a functional sense, but they arise from continuous internal improvement.

You are saying, “Look how ChatGPT works! Nothing like a human mind!”

I have no idea how you can possibly say this when you do not know how a human mind works.

What is it about the interaction of neurons and electrical signals that allows an aware human mind to form? What specifically about this process is different from the way that AI works?

Far too late to edit my original post, but just another word about emergent capabilities and another suggestion to read the “Summary” box in my earlier post of the GPT-5 response, which I wholeheartedly agree with.

I think the appropriate position on the “emergence” debate in AI is that both sides are kinda right in their own way, but they’re misunderstanding each other, and one side is more right than the other.

I earlier mentioned UC Berkeley professor Jacob Steinhardt saying that “emergence” can be defined as “qualitative changes that arise from quantitative increases in scale.” This is not original to him; I’ve been saying the same thing for decades. The subtle issue is whether emergence is necessarily a sudden and discontinuous event in more complex systems that is indiscernible in earlier systems. Philosophers like David Chalmers have argued endlessly about this, and distinguished between strong and weak emergence (respectively, properties that cannot possibly be inferred from the original components and emerge like magic at a certain scale, or properties that are intrinsically there at lower scales, but not readily discernible).

My view is that the distinction between strong and weak emergence is a philosophical nicety that isn’t important. What’s important is that fundamentally new and impressive capabilities are exhibited by AI systems like LLMs at very large scales. Whether their appearance is discontinuous or could have been detected in some imperfect form at smaller scales is absolutely irrelevant. In that sense, emergent capabilities – whether you define them as “strong” or “weak” – are absolutely a real thing.

Eh, if someone finds that talking to ChatGPT is better and more satisfying than talking to their spouse, maybe it is for the best that the marriage break up.

Then there has been a terrible misunderstanding and it appears to be on my part. I thought you were claiming that the article was supportive of the emergent abilities school of thought, which obviously would disagree with the headline. In reality, the article covered the fact that it is under debate, which is true to this day.

That’s not how emergence was defined in the big paper that started this all.

An ability is emergent if it is not present in smaller models but is present in larger models

Emergent abilities would not have been directly predicted by extrapolating a scaling law (i.e. consistent performance improvements) from small-scale models

Then they proceeded to show all of the graphs with the elbows in them that we data scientists have seen for decades, starting for me when I first worked with k-means clustering.

If your hypothesis is that LLMs improve gradually as the corpus grows in size, I don’t have any disagreement, but I’m not going to start calling it emergent behavior, nor are the authors of the paper that claims LLMs have emergent abilities.
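
To show what I mean about those elbows (a made-up toy model of my own, not data from either paper): if a model’s per-step reliability improves smoothly with scale, an all-or-nothing exact-match score on a multi-step task still produces a sharp-looking elbow, which is the measurement-artifact argument in miniature.

    import math

    # Made-up numbers for illustration; the "scale" axis and the reliability
    # curve are arbitrary -- the point is only the shape of the result.
    steps = 10  # e.g. a 10-step arithmetic problem scored right/wrong

    for scale in range(1, 11):
        p = 1 - math.exp(-0.4 * scale)   # smooth, saturating per-step reliability
        exact_match = p ** steps         # binary "whole task solved" metric
        print(f"scale={scale:2d}  per-step={p:.3f}  exact-match={exact_match:.3f}")

The per-step column creeps up gradually; the exact-match column hugs zero for a while and then takes off. The elbow comes from the metric, not from any discontinuity in the underlying curve.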

I should note that your ChatGPT response with the nuanced view is on the same page as me.

We should also remember that “exhibiting emergent behavior” is itself not automatically synonymous with “sentient” or “intelligent”.

Emergent phenomena are present in many, many different types of complex systems, from chemical oscillator arrays to ant colonies, that we don’t consider comparable to human intelligence. Just because an LLM may arguably show emergent phenomena in some of the iterated programmed interactions of its millions and millions of informational components doesn’t necessarily imply that it’s doing anything we would recognize as “thinking”, or that it “knows” what it’s doing in the way human brains do.

Realistically, our best understanding of sentience is as an emergent behavior of large clusters of neurons in a sufficiently feedback-rich environment.

Until demonstrated otherwise, we’d expect an LLM that had been outfitted with all our sensors and with training mode enabled to become something like us, minus hormones.

Completely agree. I’m not at this time one of those “AI Doomsayers” that the media likes to get sound bites from, but if truly emergent behavior (as defined in most LLM circles) started to appear, I’d start getting concerned.