The next page in the book of AI evolution is here, powered by GPT 3.5, and I am very, nay, extremely impressed

I mean in terms of a high-fidelity simulation. That is, does the simulation give the same answers as the real thing? If so, the origin of the randomness (if indeed it’s random at all) does not matter in this sense.

A classical computer can simulate a quantum computer, slowly, and it can use a deterministic pseudorandom generator to simulate the collapse process. There’s no indication that this leads to different results than using a “true” random value.
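
For what it’s worth, here’s a minimal sketch of what I mean in Python (numpy’s seeded generator stands in for the deterministic pseudorandom source; the qubit example and the seed are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(12345)  # deterministic pseudorandom generator

def measure(state):
    """Simulate the 'collapse' of a qubit [a, b]: pick outcome 0 or 1 with
    Born-rule probabilities |a|^2 and |b|^2, drawn from the seeded PRNG."""
    probs = np.abs(state) ** 2
    probs /= probs.sum()
    return int(rng.choice(2, p=probs))

plus = np.array([1.0, 1.0]) / np.sqrt(2)    # equal superposition
print([measure(plus) for _ in range(10)])   # roughly 50/50 outcomes
```

An observer inside the simulation sees the same Born-rule statistics even though every outcome was fixed by the seed.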

If you can give some indication as to why we should believe this, other than “linking consciousness to a process that we don’t fully understand yet doesn’t fully snuff out the magic”, then sure; otherwise, it just reeks of God of the Gaps.

That seems a little distorted. I would put it this way:

Since we have no evidence that there is anything special about the brain that is required for higher-order thinking, and we have evidence from LLMs that, when we create similar large connected networks and train them on human language, they at least appear to have human-like reasoning ability, we have no real reason to invoke an unknown property of the brain when explaining intelligence. It’s really an application of Occam’s razor: introducing a ‘hidden variable’ we don’t need, in order to explain something not in evidence (that it takes something special in the human brain to achieve true consciousness), is not warranted.

It may be that something like that is needed. It may be that it isn’t. We don’t know. We also don’t understand how LLMs do their thing. People focusing on ‘next word prediction’ are missing the fact that the ‘thinking’, to the extent there is any, comes from building the list of words and probabilities in the first place. That almost certainly requires some kind of concept formation, somewhere. And we don’t know how they do that, other than to realize that the training process results in incredibly complex structures in the neural net that we are just beginning to try to understand.

I mentioned before that one of the smaller language models (50,000 parameters, as I recall) was built so that it could be instrumented and its development tracked. They tried to teach it modular addition. As they trained it, they could see the network basically storing all the values it found in its training data. Trying to get it to answer a math question resulted in ‘next word prediction’ from similar problems it had seen before. So it got wrong answers all the time.

Then, after a certain amount of training, the error rate rapidly went to zero. Inspection of the network’s structure revealed that a ‘phase transition’ had occurred: instead of looking up things it had seen before, the network had invented a generalized addition algorithm of its own, using Fourier transforms and trig identities to add any numbers together accurately. From that point forward, the network didn’t look up numbers from the past any more. In fact, it eventually pruned itself of that data for efficiency.
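
To make that concrete: the task in the paper is addition modulo a prime, and the algorithm it describes amounts to rotating the inputs by a few frequencies and combining them with angle-addition identities. Here’s a toy Python sketch of that idea (the prime, the frequency choices, and the function name are mine, not the network’s actual weights):

```python
import numpy as np

def grokked_add(a, b, p=113, freqs=(1, 2, 3)):
    """Toy version of the 'Fourier plus trig identities' algorithm: for each
    candidate answer c, accumulate cos(2*pi*k*(a + b - c)/p); the sum peaks
    exactly when c == (a + b) mod p."""
    c = np.arange(p)
    scores = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p
        # angle-addition identities: build cos(w(a+b)) and sin(w(a+b)) from
        # the separate rotations of a and b, then match against each candidate c
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        scores += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(c[np.argmax(scores)])

assert grokked_add(57, 99) == (57 + 99) % 113  # works for any pair, no lookup
```

The point is that this generalizes to any pair of inputs without storing any of the training examples, which is exactly what the memorizing network couldn’t do.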

This is ‘emergence’. It’s why the argument that a computer is ‘just an adder’ is not valid. We have seen numerous examples of emergence in large language models. We didn’t expect them, don’t know how they happened, but they did. For example, ‘theory of mind’ emerged in ChatGPT at something like 10^20 FLOPS of training. GPT-4 can now pass AP calculus. No one taught it calculus. They just trained it more on the same data, and the ability to do calculus emerged.

Emergence, complexity, chaos. All domains of these systems that make them nearly impossible to predict, explain, or break down through a process of reductionism. We really don’t know what they are doing inside those neural nets, or what they will do if we add more layers, parameters or training data.

It’s way too early to make any categorical statements about what they might or might not achieve, including consciousness.

In short, anyone who says categorically that LLMs can or can’t become conscious is simply wrong. All we can do at this point is observe something we don’t fully understand.

It does lead to differences, though. The set of uncomputable functions is vastly greater than that of computable ones, and each of them can be written in terms of a computable one together with an infinite algorithmically random sequence. Thus, access to such a sequence allows a computer to answer questions (such as the halting problem, or the question whether a quantum system has a gapped spectrum) that no unaugmented computer could answer. Such a system could also implement the AIXI-agent, which I think has the best claim to being a general-purpose intelligence among the proposals I’m somewhat familiar with.

This has real consequences. Consider a cop trying to catch a robber that hides in either of two houses. In each round, each can choose to either switch or stay. The robber is caught if both end up in the same house. It’s clear that a random strategy, for the cop, will always (with probability one) lead to success in the long term.

But now suppose that both the cop and robber are limited to computational means. Then, there always exists a strategy for the robber to evade the cop indefinitely: simply implement whatever algorithm the cop follows, and anticipate their moves.

So the difference between randomness and no randomness, in this case, is that the game transforms from a certain win for the cop in the long-term limit, to a guaranteed win for a robber with the right strategy. Thus, the properties of the game change with the addition of randomness.
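
A quick simulation makes the asymmetry obvious (the particular cop rule here is just ‘alternate houses’; any computable rule suffers the same fate, since the robber can run it too):

```python
import random

def deterministic_cop(t):
    return t % 2                      # any fixed computable rule will do

def simulating_robber(t):
    return 1 - deterministic_cop(t)   # run the cop's algorithm, go elsewhere

def random_cop(t, rng=random.Random(0)):
    return rng.randint(0, 1)          # coin flip each round

def rounds_until_caught(cop, robber, limit=10_000):
    for t in range(limit):
        if cop(t) == robber(t):
            return t + 1
    return None                       # never caught within the limit

print(rounds_until_caught(deterministic_cop, simulating_robber))  # None: evades forever
print(rounds_until_caught(random_cop, simulating_robber))         # caught within a few rounds
```

Against the coin-flipping cop, the robber’s ‘prediction’ is worthless, and the catch happens almost surely in the long run.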

Likewise, in a computable world, questions regarding quantum measurement outcomes, or whether there is an energy gap in the spectrum of certain systems, would in principle be answerable. But the former stands in a certain tension with the various no-go theorems of quantum mechanics, which prompted Feynman to claim that no classical universal device could represent the results of quantum mechanics. This probably needs some qualification, because Feynman likely believed in von Neumann’s no-go theorem, which is now widely regarded as erroneous, but one can still prove that any hidden variable model is either noncomputable or must lead to superluminal signaling. Consequently, attempting to simulate it on a classical computer would indeed yield manifest deviations from what we believe to be the case in the real world: it would permit superluminal information transfer.

That gets the burden of proof the wrong way around. You claim that consciousness can be instantiated computationally; to oppose this, I only need to show that it’s possible that this isn’t the case, hence leaving this instantiation an open question. I don’t have to show that it can’t be computationally instantiated. If you want me to cross a bridge, just arguing that it’s possible it won’t fall down isn’t going to move me much.

Regardless, I actually have provided several arguments against the idea of computationalism. I have pointed to AIXI, which is an attempt at a general problem solving agent that turns out not to be computable; I’ve pointed to the uncomputability of quantum mechanics and its possible relevance for cognition; and I have pointed to my own theory that runs into undecidable questions. I have also argued against the possibility of consciousness via computation more generally.

So I’m not just coming up with this out of the blue. In fact, I started out as a fervent supporter of the computational thesis, and I still definitely see its attraction. It just turns out not to work.

How do you figure? It seems people are quick to throw around these supposedly ‘obvious’ truths, without really bothering with any argument I can discern. Well, it was obvious to people for a long time that something as complex as the eye or a cell could only come about due to ‘some kind of concept formation somewhere’, but that just turned out to be wrong.

On the other hand, I go through the trouble of formulating a proof that, no, there is no concept formation going on anywhere.

But quite apart from that, given what we know about how ChatGPT works, I just don’t see where one might believe concepts to come in. We know exactly what it knows about any given word: what words it typically occurs together with, what position in the text it has in the given case (which together yields its encoding), and what other words are strongly influential on it (the ‘attention’-mechanism). These are all data available about a text without paying its meaning any mind; I could harvest them about texts in Arabic without any idea of their meaning, any concept of what the words refer to. All I would know is that these squiggles often occur next to those ones, that this particular squiggle is the third in a sequence, and that if I jiggle those squiggles, that squiggle is also likely to change.

How does that give me a concept of what the squiggle means?
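
For concreteness, this is roughly the kind of data being shuffled around (a toy numpy sketch of token plus positional encodings feeding scaled dot-product attention; the vocabulary, sizes, and random weights are obviously just placeholders, not OpenAI’s actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["alif", "ba", "ta", "tha"]          # squiggles; their meaning never enters
d = 8
tok_emb = rng.normal(size=(len(vocab), d))   # co-occurrence-derived vectors
pos_emb = rng.normal(size=(16, d))           # position-in-the-text vectors

def encode(ids):
    """A token is represented only by which squiggle it is and where it sits."""
    return tok_emb[ids] + pos_emb[:len(ids)]

def attention(x, Wq, Wk, Wv):
    """Scaled dot-product attention: which other squiggles influence this one."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ v

x = encode(np.array([0, 2, 1, 3]))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(attention(x, Wq, Wk, Wv).shape)        # (4, 8): one new vector per squiggle
```

Nowhere in that pipeline does anything resembling the referent of a word appear; it’s vectors built from co-occurrence, position, and influence, all the way down.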

Again, that we don’t know what precisely happens doesn’t entail that we can’t put any limits on what can happen. Emergence isn’t magic; the base either supports certain phenomena, or it doesn’t, and if it doesn’t, no amount of piling on more of it will lead them to pop into being from nothing.

Any way to know for sure that at least one of those limitations doesn’t exist for organic brains? That’s the uncomfortable truth I suspect we’ll be facing as we develop artificial intelligence more and more.

Your post indicates that these changes were the result of an increase in training. I assume better results were the goal of the increased training, and they succeeded. What you describe is that they were able to trace the exact path of interpolation, and that increased training caused a sudden change in the trajectory. That makes sense, and it eliminates the claim that the net is a mystery that no one understands.

Did the program change its own code? Were there any hardware changes? What you describe is a successful development effort. Where’s the emergence?

No.

No.

No.

The emergence is in the fact that the new behaviour arose spontaneously as the result of what appears to have been reinforcement learning. No one “programmed” it, nor was the learned behaviour necessarily predictable.

Further training was employed to improve behavior (by modifying the weights). The behavior changed as a result of training. Same software. Same hardware. Success!

If one-way functions exist (and we’d better hope they do, since they form the entire basis of the modern information economy), then this tactic is not available. The cop can take some finite amount of private state (i.e., unknown to the robber), send it through a one-way function (say, a hash function), and use that to decide where to investigate.

In principle, the robber can determine the N bits of private state by observing approximately N trials. But actually doing so is computationally infeasible. It would take a computer (quantum or not) larger than the universe to invert the function that the cop can apply trivially.

So in practice, the sequence of moves is effectively random: the robber can’t do better than chance (even when the cop gives the robber a fighting chance by letting him escape the first N times). That’s all that matters for the game, even if from the cop’s perspective it’s entirely deterministic.
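
A sketch of the idea, with SHA-256 standing in for the one-way function (the secret and the parity trick are just illustrative):

```python
import hashlib

SECRET = b"finite private state the robber never sees"

def cop_move(round_number: int) -> int:
    """Deterministic given SECRET, but predicting the moves without SECRET
    means inverting SHA-256, which is computationally infeasible."""
    digest = hashlib.sha256(SECRET + round_number.to_bytes(8, "big")).digest()
    return digest[0] & 1              # house 0 or house 1

print([cop_move(r) for r in range(12)])   # looks like coin flips to the robber
```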

Interesting. Do you have a link?

Sure. Here you go:

Progress Measures for Grokking via Mechanistic Interpretability

Here’s the abstract:

And an excited tweet from one of the authors with some detail:

Anyone who thinks they understand what’s going on inside a large language model is wrong.

This is a very simplified model, in a single layer, and you can see the work it took to figure out this one algorithm. Now imagine what might be going on in an LLM that has 175 billion parameters in 96 layers and has been trained for months on the world’s biggest supercomputer.

Pay special attention to how the abilities in these models come about. We train them constantly, and for a long time they are just dumb as rocks. The graphs on page 3 show this. Their average testing loss is high and stays that way as you train on more and more data… until suddenly something happens and these models can now do things they couldn’t do just a short while before. The authors of this paper call it ‘grokking’ (the term is from Heinlein’s ‘Stranger in a Strange Land’, and it basically means ‘to understand something fully’).

Here’s another very good resource on this, which I also linked to earlier.

Emergent Abilities of Large Language Models

Abstract:

This is just fighting the hypothetical. The robber implements the same computation as the cop, including any seed data. It’s enough for the example that such a robber exists, while no such robber exists in the case with randomness.

@Crane, didn’t you say that these models can’t learn beyond their training? You might want to read this:

In short, you can actually teach new tasks to ChatGPT. In-context learning allows you to explain a task to ChatGPT that it had never seen before, and even though its training weights are now fixed it somehow learns how to do the thing. No one actually knows how they do that, but this article speculates that during training they develop their own smaller, internal ‘programmable’ networks. There are other theories.
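
Here’s the flavour of it, with a made-up task (the task and the example strings are mine; the point is only that the pattern isn’t in any training set):

```python
# In-context learning, illustrated: the weights are frozen, yet a few
# demonstrations in the prompt are enough for the model to pick up a brand-new
# rule ("reverse the word and uppercase it") and apply it to a fresh input.
prompt = """Input: cat   -> Output: TAC
Input: bird  -> Output: DRIB
Input: house -> Output: ESUOH
Input: lamp  -> Output:"""
# A model that has picked up the rule should answer something like "PMAL",
# even though nothing in its fixed weights was updated to learn this task.
```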

I hope you can see how far we are straying from ‘they are just adders’. And how inadequate an explanation ‘statistical word lookup’ really is.

Then the hypothetical is nonsense. All pseudorandom generators employ private state. That’s how they generate unpredictable results despite being deterministic.

This side-discussion arose from my comments about simulating physics. The state of the pseudorandom generator is outside the simulation. The robber is inside. If the robber cannot reverse-engineer that state without a computer larger than the simulation allows, then it is as good as a “true” random number generator.

Sam,

The MIT article refers to something entirely different than GPT:

“MIT researchers found that massive neural network models that are similar to large language models are capable of containing smaller linear models inside their hidden layers, which the large models could train to complete a new task using simple learning algorithms.”

That’s wild stuff. Massive neural nets with embedded linear models. But that’s not an LLM or GPT.

Re the first article, it clearly states:

“Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components”

Just sound engineering.

These are amazing programming accomplishments, and my comments take nothing away from them. But they are engineering accomplishments, and the software is running on an adding machine.

An LLM IS a massive neural net. Whether it has embedded linear models, we don’t know, because they are emergent and the LLMs are too complex to reverse-engineer. The whole POINT of the research was to try to understand how large language models manage in-context learning, using simplified but similar models that can be examined. Claiming that the test model is not an LLM and therefore irrelevant is ridiculous. The fact is, the LLMs DO exhibit in-context learning. That too was emergent. And they somehow do it without modifying their pre-trained weights. The MIT group were trying to figure out how it happens. If their work was not relevant to LLMs, they were wasting their time, because the whole point was to learn something about a behaviour of LLMs that we can clearly see but cannot explain.

Bolding mine.

First of all, look at the graphs. The emergence is clearly happening at some tipping point, and is relatively fast.

What do you mean ‘just sound engineering’? Do you think humans amplified the ‘structured mechanisms encoded in the weights?’ Do you think they even made those ‘structured mechanisms’? Do you think humans removed the now-useless ‘memorizing components’? That was all the network’s doing. All of that emerged.

If someone should get credit for sound engineering, it’s the neural net, not the humans.

Correct. Nobody designed networks to grok. Someone let a network train way longer than usual, saw the results, and said “what the fuck?” It’s been studied more since, obviously, but afaik people can’t even really predict when it might happen during training.

I will point out that the grokking graphs are log plots, so the speed at which it happens is overstated.

I can’t wait for journalists to read that article. They’ll get to this part:

Building off this theoretical work, the researchers may be able to enable a transformer to perform in-context learning by adding just two layers to the neural network.

And all of the headlines will scream out that AI scientists are trying to build a Transformer.

LLMs are more than meets the eye, for sure.

A good point to make. However, ‘relatively fast’ still covers it. For example, modular arithmetic emerged in GPT-3 at 10^22 FLOPS. Until then, accuracy on modular arithmetic was almost zero. Then accuracy started to go up, and by 10^23 FLOPS it was at 40% (where the graph ends). That’s pretty fast.

The speed of emergence isn’t the same for every emergent trait, or for every LLM. Some LLMs develop the same ability later in training, unpredictably. But once they start to phase-shift, it happens pretty fast compared to the time spent training before then, when pretty much nothing happened.