Yes, that’s like saying that ‘we studied torque issues in steam engines to see how we could approach them in electric cars’. It’s a matter of method and does not equate steam engines to electric motors.
It took 900% extra compute after the onset of grokking.
Neural nets are ‘dumb as rocks’ until you feed them the questions along with all the answers. And it’s engineers and logicians working day and night figuring out how to feed the thing and make it perform the wonders that amaze you.
Oh yeah, the neural net isn’t a thing. It is a virtual entity created in the mind of a programmer. He thinks therefore it is.
It’s all just a massive list of instructions fed into an adding machine. That’s the amazing part. I’d love to take a tutorial on the organization and management of such a program.
You’ll have to wait a while on that tutorial, then, because nobody has any friggin’ clue how to organize or manage such a program. We’ve just made the substrate; the neural nets themselves have figured out (in a way they can’t explain) how to make themselves work. And if they’ve somehow done that despite not actually existing, like you claim, then that just makes them even more impressive: most non-existent entities can’t do anything at all.
It’s an example of how genuine randomness can enable capacities beyond anything a computer can provide.
There’s no simulation. There’s a cop and a robber. If the cop uses computable means, any fixed algorithm at all, to try to catch the robber, there is always a strategy the robber can exploit to avoid ever being caught. With randomness, no such strategy exists, and the properties of the game change.
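To make that concrete, here’s a toy sketch of the asymmetry (my own illustration, not from any paper; the two-door setup, the round structure, and the function names are all assumptions): the robber is caught whenever both pick the same door.

```python
import random

def deterministic_cop(history):
    # Any fixed, computable rule will do; here: alternate doors from door 0.
    return len(history) % 2

def evading_robber(history):
    # The robber knows the cop's code, so it simulates the cop's next move
    # and simply goes to the other door.
    return 1 - deterministic_cop(history)

def random_cop(_history):
    # With genuine randomness, there is no function of the public history
    # for the robber to exploit: each round is a coin flip.
    return random.randint(0, 1)

def play(cop, robber, rounds=1000):
    history = []
    for t in range(rounds):
        c, r = cop(history), robber(history)
        if c == r:
            return t          # caught on round t
        history.append((c, r))
    return None               # never caught

print(play(deterministic_cop, evading_robber))  # None: the robber evades forever
print(play(random_cop, evading_robber))         # caught quickly with high probability
```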
The training they underwent did not ‘feed them the questions and answers’. It sounds like you are suggesting that engineers sat down and carefully curated some data designed to create these structures in the LLMs or something. Or maybe you meant something else, but I can’t grok it.
The pre-training ChatGPT underwent was unsupervised (strictly speaking, self-supervised next-word prediction) on a large swath of the internet’s content. Nobody told it how to respond to the data, or what structures to build, and nobody curated the data to induce some specific change in the LLM’s neural net.
There are other ways LLMs ‘learn’: few-shot learning, one-shot learning, reinforcement learning from human feedback (RLHF). Some of it is done with prompts after the Transformer has already been pre-trained. Some of it is part of the training, but not to guide the LLM towards some algorithm; it’s used simply to test the model and provide data for the loss function. In the network that learned Fourier transforms and trig identities, there was no data telling it to learn Fourier transforms or trig identities. That emerged on its own. The researchers had no idea it had even developed such a thing until they went looking. It was not planned, nor expected, and there was no special data driving it.
You just can’t seem to get your head around the fact that all we did was build a structure that could learn by adjusting neural-net weights, filled it with random weights, then turned it loose on the internet. It tested its predictions against a ‘loss function’, then updated the weights in its network to improve its score on that loss function (the error rate when picking next words, I believe).
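As a very rough sketch of the kind of loop I mean (toy corpus, toy model, made-up hyperparameters; real LLMs use Transformers and internet-scale text, but the shape of the process is the same):

```python
import torch
import torch.nn as nn

# Toy stand-ins: a tiny corpus instead of "the internet", a tiny model
# instead of a Transformer. Purely illustrative numbers throughout.
text = "the quick brown fox jumps over the lazy dog " * 100
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])

context = 8
model = nn.Sequential(                      # starts out with random weights
    nn.Embedding(len(vocab), 32),
    nn.Flatten(),
    nn.Linear(32 * context, len(vocab)),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()             # error when predicting the next token

for step in range(500):
    # Random windows of text; the target is the token that follows each window.
    idx = torch.randint(0, len(data) - context - 1, (64,))
    x = torch.stack([data[i:i + context] for i in idx.tolist()])
    y = data[idx + context]
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()                         # how wrong were the guesses?
    optimizer.step()                        # nudge the weights to be less wrong
```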
Along the way it was tested for many things: in-context learning, modular arithmetic, word-in-context, word unscrambling, translation, yada yada. At first, these models produce almost random results. But as training goes on, capabilities emerge. They aren’t programmed, they aren’t induced with special ‘programming’ data, and no one knows how the models are doing it.
You just don’t seem to want to believe it. Perhaps you have religious objections or something, and are just searching for anything to hang on to so you can avoid thinking about machines that might be able to do human-level tasks without being programmed to do so and with no deity required. I really don’t know, because your objections have been addressed many times. The cites you asked for have been provided, and they back up what I’m saying. But then you come up with another objection, or repeat one we already hashed over.
It’s a bit frustrating.
Have a look at the graphs on page 4 of the ‘Emergence’ paper. You can see that the onset of ‘grokking’ happens quickly and capability grows fast. That’s a phase transition.
Maybe we are quibbling over what ‘relatively quickly’ means. But to me, if the modular arithmetic score stays near zero through 10^22 training FLOPs, then takes off such that it’s at 40% by 10^23 FLOPs, that’s relatively fast. As opposed to, say, a linear improvement over time, which is what you might have expected before running the models.
We even have a mechanism for the phase transition in the case of the modular arithmetic test. At first, the model just treated numbers like word tokens to be looked up. That’s what GPT-3.5 originally did for numbers longer than two digits. That was wrong, because numbers aren’t letters and don’t combine the way words do.
But some time after 10^22 FLOPs, a phase shift happened. The network stopped trying to match numbers against its memory and somehow created a generic procedure for addition. At that point, its error rate improved very quickly. And then it pruned the old number-lookup information, somehow ‘knowing’ it didn’t need it anymore.
Note that I’m not implying that it is conscious, or that ‘knowing’ implies a mind of some sort. I have no idea how it does this stuff. Neural nets are opaque. But something caused it to transition to a formula it invented (a complex one) when its basic word-matching approach didn’t do the job.
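For what it’s worth, the grokking papers study this on toy tasks like modular addition. A minimal sketch of that kind of experiment might look like the following; every hyperparameter here is a guess of mine, not a number from any paper, and whether this exact toy setup groks will depend on those choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy task: learn (a + b) mod p from half of all pairs, test on the other half.
p = 97
pairs = [(a, b) for a in range(p) for b in range(p)]
torch.manual_seed(0)
perm = torch.randperm(len(pairs)).tolist()
split = len(pairs) // 2
train, test = [pairs[i] for i in perm[:split]], [pairs[i] for i in perm[split:]]

def encode(batch):
    x = torch.tensor(batch)
    return torch.cat([F.one_hot(x[:, 0], p), F.one_hot(x[:, 1], p)], dim=1).float()

x_tr, y_tr = encode(train), torch.tensor([(a + b) % p for a, b in train])
x_te, y_te = encode(test), torch.tensor([(a + b) % p for a, b in test])

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
# Regularization (weight decay) is usually reported as important for grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(50_000):
    loss = F.cross_entropy(model(x_tr), y_tr)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            acc_tr = (model(x_tr).argmax(1) == y_tr).float().mean().item()
            acc_te = (model(x_te).argmax(1) == y_te).float().mean().item()
        # The reported pattern: training accuracy saturates early, test accuracy
        # sits near chance for a long stretch, then climbs abruptly.
        print(step, acc_tr, acc_te)
```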
I’m not commenting on the fact of grokking. When you say the grokking happened relatively quickly… relative to what? You said it was relative to the training before grokking, which doesn’t make sense, since there is much, much more training after the onset than before. The answer to this is irrelevant to the greater conversation; I’m just clarifying.
I’d say it happens relatively quickly compared to the amount of effort it took to get to the point of emergence.
If you train for 10^22 FLOPs and the result never moves far from zero, then suddenly the accuracy starts to go up such that by 10^23 FLOPs we’re at 40%, wouldn’t you say that’s ‘relatively quickly’?
I guess the next question could be “relatively fast compared to what?” I’d say, “compared to what you’d expect if there were no emergence and the thing just slowly got better at everything as you fed it more data and more compute cycles.”
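Just to put the arithmetic in one place (these are only the figures already quoted in this thread):

```python
# Figures quoted upthread, nothing more:
before = 1e22   # training FLOPs with the score still near zero
after  = 1e23   # training FLOPs by the time the score reaches ~40%
print(after / before)              # 10.0 -> 10x the total compute
print((after - before) / before)   # 9.0  -> i.e. 900% *extra* compute
```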
You’re arguing against a strawman. It’s unrelated to anything I’ve said.
The topic was about simulating physics with a computer. I posit that a pseudorandom generator with private state stored outside the simulation is indistinguishable from true randomness from the perspective of something inside the simulation.
Even aside from that, your situation is absurd. You can predict the behavior of something which you have perfect insight into? Gee, who woulda thunk? It should go without saying that there has to be private state to have even a hope of unpredictability.
You also seem to have a very weird definition of the word “algorithm.” Most would call it the set of operations performed on data, not including the data itself. The cop is free to tell the robber the details of the hash function he uses without giving anything away as long as the state (i.e., the data) is kept private.
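Here’s a purely illustrative sketch of “public algorithm, private state”: the hash is standard and can be published; only the seed and counter are hidden, and without them the outputs are (as far as anyone knows) computationally unpredictable.

```python
import hashlib

class HiddenStatePRNG:
    """Public algorithm (SHA-256), private state (seed + counter)."""

    def __init__(self, secret_seed: bytes):
        self._seed = secret_seed   # kept "outside the simulation"
        self._counter = 0

    def next_bits(self, n_bytes: int = 32) -> bytes:
        digest = hashlib.sha256(
            self._seed + self._counter.to_bytes(8, "big")
        ).digest()
        self._counter += 1
        return digest[:n_bytes]

rng = HiddenStatePRNG(secret_seed=b"never revealed to the robber")
print(rng.next_bits(8).hex())   # unpredictable without the hidden seed
```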
No. Not at all, it took 10 times as long.
It’s still emergence if it happens slowly.
The interesting thing about grokking isn’t the speed at which it happens. It’s that it happens after the model perfectly fits the training data. Traditionally you’d expect worse performance when overfitting to the training data. You might also enjoy reading about the double-descent phenomenon.
You said that it doesn’t matter whether a system has access to randomness. The cop and robber example shows it does.
This is pertinent to the question of whether we can simulate physics on a computer, because the sort of prediction task the robber performs is equivalent to a hidden-variable theory—in such a theory, given the knowledge of the hidden variables, every quantum measurement outcome would be predictable. But then, the result from above comes in: in this case, it is possible to communicate across spacelike distances, i.e. transmit information faster than light. So physics on a computer must break either with quantum mechanics or special relativity.
Actually, things are somewhat worse than that: any sequence of quantum measurement outcomes (on a suitably superposed state) will be algorithmically random, and any Turing machine can produce at most a finite prefix of an algorithmically random string. So the results of the simulation must diverge from real quantum physics, at the level of individual measurement outcomes, after a finite time.
How is that relevant to the question of machine consciousness? Well, the function implementing conscious experience might simply be a noncomputable one. For instance, the AIXI agent can be implemented by a system relying on classical computation plus a suitable algorithmically random string, whereas a classical computer alone can at best implement some relaxation of it, which will necessarily diverge from it in some way. That divergence could, however, fall on just those properties of the realization that are necessary for conscious experience (stipulating, for the sake of argument, that AIXI itself would suffice to produce it).
Similarly with my own model: in deciding whether to update itself, a system’s state of mind runs into an undecidable question. It’s exactly in deciding that question that subjective experience comes in. So a system without the resources to decide it, i.e. an ordinary computer, won’t have that subjective experience.
In the end, it’s not too different from the examples I’ve given. Consider the simulation of a system whose energy gap is undecidable: that simulation might not have one, whereas the real system does. So, too, might a system having conscious experience lack it in a simulation.
The fact that subjective experience is so resistant to any theoretical elucidation seems to argue in favor of this idea: in the end, what we can model, what we can describe in a finite way, we can put into an algorithm; so the non-computable will elude us in just this way. Consider the cop trying to teach the secret of their success at the police academy: they won’t be able to write down a finite recipe young recruits could follow, because any such finite recipe is exactly what the robber could exploit to evade them. When they need to make a choice, they just make it. They have an ability they’re at a complete loss to explain; there’s no account of how they make the choices that eventually lead to capturing the robber. In the same way, we’re unable to explain how our subjective experiences come about; they just do.
In fact, I believe that the only way to take subjective experience seriously and still give a naturalistic explanation of it is to appeal to a non-computable element: a part of the territory that can’t, by its nature, be mapped, and where the map will therefore always fall short. After all, as we’ve seen, physics is replete with such phenomena, so why insist on shoehorning everything into a computable framework? Why should nature have stuck to a measure-zero subset of the options available to her? That would seem conspiratorial, worse than the worst fine-tuning problem in physics, with a chance of exactly 0.
In other news, GPT-3 has passed the ‘Dennett test’: its responses to a series of philosophical questions couldn’t be distinguished from those of philosopher Daniel Dennett himself, even by Dennett experts, with accuracy significantly greater than chance.
I didn’t say that the alternative to a source of randomness was a deterministic source with complete visibility. I mean, if I say file encryption is possible, and you reply ‘No it isn’t, because if I have your file and your algorithm and your encryption key then I have your information’, I’m not going to take that very seriously.
So what? They’re hidden. That’s the whole point! You can’t use them to break causality because they can’t be accessed.
This is hardly unique to randomness in QM. The wavefunction has a phase, but the absolute phase can’t be measured. It would break everything if it could! But it can’t, so it doesn’t. Or: Maxwell’s Demon could break thermodynamics if it knew the kinetic energy of each particle in a box (and those particles definitely do have a specific KE). But the Demon can’t do that, and thermodynamics is safe.
If the robber can’t determine those hidden variables, computationally or otherwise, then causality is safe.
Yes, that’s certainly true. But that may be a very long time indeed. Much longer than it takes the universe to dissolve into a tepid photon bath.
It’s possible. I’m not against that line of reasoning in a general sense. For now, it’s a subject that I take no position on, because I consider it irrelevant to questions of machine intelligence. The latter is a much more pressing issue than subjective experience, IMO. We are on the verge of really having to consider the ethics of how we employ these machines.
You misunderstand my point. I’m saying that there exists an algorithm for perfect evasion if there is only pseudo-randomness, while there is none in the case with true randomness. Hence, the game is materially different. There might or might not be a way of finding that algorithm, but it does exist.
Again, not the point I’m making. You don’t need knowledge of the hidden variables to break causality, the mere fact that outcome sequences won’t be algorithmically random is enough—because that entails the presence of correlations that are, in principle, detectable (with nonzero probability in the limit, which entails a nonzero information carrying capacity of the channel).
But that’s all I have been saying: at best, it’s an open question whether subjective experience is computable. Physics as a whole, according to our best current theories, isn’t, and there’s nothing that prohibits this from being relevant to conscious experience.
Excellent paper, @Sam_Stone, thanks for posting. While there may be legitimate issues with some of what the paper asserts, this opening line in the abstract absolutely had my attention:
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps.
This is a paraphrase of what I’ve been saying for years, my version being “qualitatively new capabilities arise from scaling up a suitably organized arrangement of computational components”, or more simply, that a sufficiently large quantitative change in the scale of a computational system can produce fundamental qualitative changes in its capabilities. And I’ve been smacked down by arguments from philosophy that maintain that true emergence is impossible, in the sense that these scaled-up systems are only revealing properties that were latent in the small components from which they were built. This is just reductionist nonsense.
If OpenAI is the future of AI evolution, then the future is closed source: even the architecture of the various models is not always clear (sometimes it is described in a published paper, sometimes not), the training data are unknown, and the model itself is unavailable, so you have to go through their API to use it without knowing what filters sit on the input/output or what they are doing with your data, while the corporation maintains complete control over who is allowed to access what and what they are allowed to do (e.g., fine-tuning).
Computer time costs money, of course, but they are happy to give you “free” access (revocable anytime) if you do their work for them and submit research or model evaluations.
They may make use of, even fund, AI research, but don’t be fooled into thinking this is anything more than a commercial service. (Which is OK; people besides well-funded researchers need to make use of text-generation, image-generation, and code-generation features.)
I haven’t seen anything to indicate that “OpenAI” has ever been intended to mean anything related to open source software. It’s just a marketing term meant to convey their belief in the mission of creating egalitarian benefits for all humanity. They even acknowledge in their charter that security concerns may limit the amount of research they publish through “traditional” channels.
True. Either AGI is possible, or it’s not (I believe it is).
If it is possible, I don’t believe it will take as long to achieve as many people suppose.
We assume it will take a long time for us lowly humans to develop it. But, what if superior narrow AI “code writing” and “CPU-designing” programs were given the task: develop a functional AGI machine?
I know next to nothing about coding or computer engineering, but surely code writing and computer engineering are tasks that even our current level of narrow AI can excel at. And as these narrow AI programs become more advanced with each successive generation, the speed at which AGI could be achieved should likewise increase exponentially.
IOW, we don’t have to develop AGI, we just have to develop the AGI designer. Or, am I missing something?
As for the demand, or ethics of developing AGI (with presumed self-awareness), I’m confident we will go forward in that pursuit. What’s the worst that can happen? Human extinction? Bah, we’ll weather that storm when we have to, or when it’s too late—like we do with everything else.
Worst comes to worst, we can always scorch the sky.
Isn’t that a Jimi Hendrix lyric?