Yeah, that’s the thing that was amazing to me. They weren’t made to do this. You can program a computer to do calculus; that’s fairly trivial at this point. But a machine learning model that was given no specific instructions to learn calculus, or math at all, has shown this emergent ability, as hit-or-miss as it is now. That is something to pay attention to.
Well, what I find problematic at this point – and I say this respectfully as someone who genuinely appreciates your many important insights and patient explanations – is that in this instance we don’t appear to be speaking the same language, although perhaps the fault is mine. Philosophy was something I enjoyed in university, but today I’m a simple person with a simple empirical mind and a background in science and engineering.
You claimed that because there are an infinite number of phenomena not subject to emergence, then pretty much nothing is (exact quote: “so in a technical sense, almost nothing can in fact emerge”). I don’t know why this is remotely relevant to anything on God’s green earth. In these very early days of highly scaled-up LLMs, we’ve already found at least three important emergent qualities in the specific domain of cognition, which is the only domain that anyone really cares about, and so there are likely to be many more. Had we not already reached this stage, your argument would have suggested there would never be any emergence at all. But there is every indication from the progress already made that there will be a lot of it, and that in fact it’s turning out to be a major paradigm for evolving artificial cognition.
And indeed we can similarly put bounds on the class of “things that fly”. I think my argument is getting the logic exactly right. I can similarly show, for instance, that there are some things that cannot fly, such as my dog and my grandmother. That does not preclude the existence of either birds or airplanes.
When an LLM is generating a next token, it’s sampling from an output distribution. This is a random process. A deployed LLM will then generally, behind the scenes, generate multiple candidate responses, each consisting of a bunch of tokens. It’ll then choose amongst those responses according to some rule, which may or may not also be random.
So yes, a response is random/dependent on a seed value.
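To make that concrete, here’s a minimal sketch in plain Python, with a made-up four-token vocabulary and made-up probabilities (nothing here is any particular LLM’s actual sampler):

```python
import random

# Toy next-token distribution: in a real LLM this comes out of the model;
# here the tokens and probabilities are invented purely for illustration.
tokens = ["cat", "dog", "fish", "bird"]
probs  = [0.5, 0.3, 0.15, 0.05]

def sample_next_token(seed):
    rng = random.Random(seed)          # fixed seed -> reproducible draw
    return rng.choices(tokens, weights=probs, k=1)[0]

print(sample_next_token(42))   # same seed, same token every run
print(sample_next_token(43))   # different seed, possibly a different token
```

Fix the seed and the “random” draw is perfectly reproducible; change it and you may get a different continuation. That’s the sense in which a response depends on the seed.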
You’ve repeatedly held that we can’t establish any a priori restrictions on the capabilities of LLMs.
Well, the computable behaviors are a null set relative to all possible behaviors. So any given behavior almost surely (with probability 1) does not emerge. Hence, almost no behavior does emerge.
It’s like the relation between computable numbers and all real numbers. There are only as many computable numbers as there are natural numbers (countably many). But there are vastly more real numbers than that—the cardinality of the continuum is a greater level of infinity than that of the natural numbers. So when you draw a real number from a hat, the probability that it is computable is exactly 0.
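In symbols, writing C for the set of computable reals (these are standard facts of set theory and measure theory, nothing special to this debate):

```latex
|C| = \aleph_0 \;<\; 2^{\aleph_0} = |\mathbb{R}|,
\qquad
\lambda\big(C \cap [0,1]\big) = 0
\;\Longrightarrow\;
\Pr[\,x \in C\,] = 0 \ \text{ for } x \sim \mathrm{Uniform}[0,1].
```

A countable set always has Lebesgue measure zero, which is exactly the “probability exactly 0” claim above.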
Whether this has anything to do with God’s green Earth, I can’t say. Perhaps only the computable behaviors are metaphysically possible. But so far, I see no reason to think so.
No: infinitely many behaviors can emerge. But that infinity is an infinitesimal fraction of all behaviors.
The argument, in this analogy, would be that some given thing could, in principle, have the ability of flight emerge. Hence, pointing out that lots of things, like your grandma, can’t, does defeat that argument.
Just to be a bit pedantic, but this is important to know. It’s not always the entire conversation. There’s a limit to how far back in the current conversation it will reach to get the text that it will feed into the prompt for the next generation; this is known as its “context window.” How big the context window is varies widely between implementations. I’ve seen some of the paid front-ends to LLMs charge more to enable bigger context windows. It costs more compute (oh and when did that word become a mass noun, anyway?) to process more tokens.
AI Dungeon and its competitors have some fun tricks for managing the limited context window, like keeping a reserved amount of user-editable text in separate fields that are always included in every prompt, and more elaborate keyword lookup and replacement schemes.
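A minimal sketch of that kind of bookkeeping, assuming a fixed token budget, a pinned “memory” field that is always included, and an oldest-messages-drop-out-first rule (the names and the crude whitespace token counter are my own inventions, not AI Dungeon’s actual code):

```python
def build_prompt(pinned_memory, history, new_message, max_tokens=2048,
                 count_tokens=lambda s: len(s.split())):  # crude stand-in for a real tokenizer
    """Assemble a prompt that always includes the pinned text, plus as much
    recent history as fits in the remaining token budget."""
    budget = max_tokens - count_tokens(pinned_memory) - count_tokens(new_message)
    kept = []
    for msg in reversed(history):       # walk backwards from the most recent message
        cost = count_tokens(msg)
        if cost > budget:
            break                       # older messages fall out of the window
        kept.append(msg)
        budget -= cost
    return "\n".join([pinned_memory, *reversed(kept), new_message])
```

Anything that doesn’t fit in the budget simply never reaches the model, which is why long sessions “forget” early details unless they live in the pinned field.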
Fair point. Interestingly, when you first chat with one of these, the context window may include whatever the designers were putting into the thing to make it ready for use - in the recent Computerphile video linked upthread, Robert Miles talks about how people persuaded LLMs to reveal the rules they had been primed with (and forbidden to tell users about).
The randomness (they call it ‘temperature’) is applied during the auto-regressive part of spitting out tokens, so that you don’t get exactly the same response every time. The value is typically something like .8, which I take to mean it only picks the non-best token maybe 20% of the time. Or maybe it’s applied in some other way.
And that list of tokens changes with every token added to the output. GPT gets a ranked list of candidate next tokens, picks one, and then feeds the whole output sequence, including the new token, back into the model (autoregression). After processing what’s already been said, another token list is generated from inside the model for the next token. Repeat until finished.
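That loop, written out as a Python sketch (the `model` and `sample` functions are stand-ins for the neural net and the token-picking rule; only the plumbing being described here is shown):

```python
def generate(model, prompt_tokens, max_new_tokens, sample):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)          # distribution over the vocabulary,
                                       # computed from everything said so far
        next_token = sample(probs)     # pick one (greedy, temperature, top-k, ...)
        tokens.append(next_token)      # feed it back in: autoregression
        if next_token == "<end>":      # stop token (the name is illustrative)
            break
    return tokens
```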
The real question is, “how is the list of token probabilities generated?” Everyone is focused on what happens AFTER that point, but as I said in the other thread, that process only consumes 6 layers of a 96-layer neural net.
And also, notice that the image-based models seem to evolve in the same way, and they don’t do next-word prediction at all. And multi-modal LLMs like GPT-4 can handle both, and we find that they associate images with words in their models, and break images down into objects and relations that are associated with words.
Before the models had ingested gigabytes of data, they were still doing ‘next word prediction’, but the result was gibberish because the models had no way of producing reasonable word lists. Then they slowly got better, but were still weak. Then things like word-in-context and theory of mind began to emerge, and suddenly these models looked a hell of a lot ‘smarter’.
Math emerged sometime after that, starting with addition. Other things emerged as well which we didn’t even know about until we started digging into the models, like the ‘associative neurons’ that fire based on concepts like “Spider-ness”, and fast Fourier transforms and trig identities to enable modular addition. No one programmed those, or even knew they were there.
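For the curious, the trig trick that interpretability researchers reportedly found for modular addition looks roughly like this, as I understand the write-ups (notation mine): map each number onto a circle and use the angle-addition identity.

```latex
\theta_x = \frac{2\pi x}{p}, \qquad
\cos(\theta_a + \theta_b) = \cos\theta_a \cos\theta_b - \sin\theta_a \sin\theta_b,
\qquad
(a+b) \bmod p \;=\; \arg\max_{c}\ \cos\!\big(\theta_a + \theta_b - \theta_c\big).
```

The network never “adds” in the ordinary sense; it rotates, then reads off which rotation matches.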
Given these surprising emergences, and more undoubtedly to come, it seems crazy to me to make categorical statements about what these things are doing inside, and focusing on next-word prediction seems to me to be unhelpful in discussing the potential intelligence embedded in the models.
What I find really cool is that’s how you program these things, period. All the alignment/safety work results in prompts, as I understand it. Unless you are modifying the transformer architecture itself, programming an LLM is basically convincing it to do what you want. As I understand it, even adding an API call to ChatGPT through plugins involves basically telling it what your plugin does and what features it has, rather than writing a bunch of interface code.
I’ll go in a bit more detail than my previous answer to clear some things up.
An LLM generates a probability distribution over tokens, given an input sequence. That is, it generates a probability for every potential next token, given the previous tokens. How to sample from this probability distribution is a design decision. Temperature does not mean you are greedy T% of the time. It is a modification of the output probability distribution to generate more or less diverse outputs. Greedy sampling isn’t really done because it often leads to repetitive outputs. A T of 0.8 means that you will select “likely” tokens more often than the base model suggests. A T greater than 1.0 would downweight likely tokens and give more credence to unlikely tokens.
Sampling over the entire output distribution is problematic because, in aggregate, low-probability tokens add up to a large amount of the probability mass. So you can restrict to the top-k tokens, or to the smallest set of tokens whose cumulative probability exceeds some threshold (top-p, or “nucleus”, sampling).
Ok, so all that is how to select a next token from a model. Is an output sequence just generated sequentially token by token? Not necessarily. You might sample 10 or 20 candidate sequences and then select amongst them according to some criterion, which is another design decision.
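Here’s what the per-token piece of that pipeline might look like in Python/NumPy (toy logits; real implementations vary in detail):

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Pick one token index from raw logits using temperature + top-p sampling."""
    rng = rng or np.random.default_rng()
    # Temperature rescales the logits before the softmax: T < 1 sharpens the
    # distribution toward likely tokens, T > 1 flattens it toward unlikely ones.
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p ("nucleus") sampling: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, renormalize, and sample from it.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())
```

Best-of-n (generating several whole candidate sequences and then picking one by some score) sits on top of a per-token sampler like this.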
I disagree completely. In the case of intelligence specifically, our only means of evaluating it is a finite series of observations. Exactly what those observations entail I leave open, but they are finite, and a small number at that.
If a simulated human brain can behave the same way as a “real” one, on the basis of however many observations you can pack into a normal lifetime, then we can safely say that randomness doesn’t make a difference.
Then you are not understanding the magnitude of the difference we are talking about.
There are about as many neurons in the brain as there are stars in the Milky Way. It does not seem like a stretch to say that we could simulate one neuron with all the matter in the Solar System, and the remaining neurons with the matter in other star systems. A brain simulated this way would run slowly due to the speed of light, but it would run. It might even complete a few thoughts before the end of the universe.
But if one-way functions exist, and my pseudorandom generator has a megabit of state, then for you to figure out this state takes on the order of 2^1000000 operations.
If you took all the matter in the visible universe, dedicated to the task of figuring this problem out, using the most optimistic assumptions of computations per atom-second possible, and let it run for a googol years, it would hardly make the tiniest dent in that exponent. It gets you absolutely nowhere.
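Back-of-the-envelope, with deliberately generous assumptions (roughly 10^80 atoms in the visible universe, a placeholder rate of 10^50 operations per atom per second, a googol years of runtime):

```latex
10^{80} \times 10^{50} \times \left(10^{100} \cdot 3\times10^{7}\,\tfrac{\text{s}}{\text{yr}}\right)
\;\approx\; 3\times10^{237}\ \text{operations},
\qquad\text{versus}\qquad
2^{1{,}000{,}000} \approx 10^{301{,}030}.
```

The universe-sized computer comes up short by a factor of about 10^300,000; that’s the sense in which it makes no dent in the exponent.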
That is the difference between “hard” problems and hard problems. If you say that a given problem would take a Kardashev Type 3 civilization to solve, I’ll shrug and say we should get cracking. But if you say that solving a problem takes 2^1000000 operations (with no benefit from a QC), I’m going to scoff. These aren’t remotely the same thing. Compared to the second, there may as well be no difference between the first and a pocket calculator.
As one of many examples, consider the holographic principle. If the “true” nature of reality is that all matter is smeared out across a lower-dimensional surface, then even basic things like locality no longer have the same meaning. How could it be that two particles are spatially separated when they overlap?
Of course, in our view of the universe, locality exists, so the rules must be set up in a way that locality is enforced somehow, by statistical means or otherwise. It just doesn’t make sense in the “actual” universe.
Or, taking a different tack–suppose the communication channel behind entangled particles was actually classical in nature. I.e., it actually worked like the ansibles in sci-fi, except that it was layered with further restrictions that only allowed it to provide quantum correlations up to the CHSH inequality and no more. Perhaps entangled pairs are connected by wormholes, but limits on interacting with the endpoints keep them within the QM bounds.
All of these ideas are speculative, of course; my only point is that the universe seems to be compatible with them, which is just another way of saying that we don’t have any experiment that can distinguish between the cases. And that we’ll likely have to give up some cherished assumptions to really solve QM+GR, which means we shouldn’t get too attached to any of them.
This is interesting… The author of the link below took Silicon Valley Bank’s financials at the end of 2021 and fed them to GPT-4 and asked it for a risk analysis. And it nailed it.
It then goes through all the possible risks the bank faces based on its portfolio, and concludes:
In fact what happened is that rate hikes drove down the value of the 2% government T-bills the bank was holding, which caused the crisis. And as a reminder, GPT-4’s training was ended long before any of this happened.
The article goes on to describe GPT-4’s risk mitigation strategy, which also turned out to be correct so far.
As a side note, the second biggest risk found was that the bank is holding a lot of mortgage-backed securities, and if real estate goes down they could be in trouble.
The worrying part is that this describes the situation at most banks, including the Fed and other central banks.
That isn’t the question. The point was whether randomness conveys capabilities that exceed what pseudorandomness can convey. The answer is yes, and there’s no observation at all that needs to be performed for that.
The irony is that the brain, if it is a general-purpose prediction machine, needs to perform the same task: find out whether an N-bit string is compressible. That’s the task AIXI implements, and the reason it isn’t computable. If there’s a computable (feasible) approximation to AIXI, there’s a computable approximation to Alice’s task. Remember, she doesn’t need a perfect success rate.
And of course, that it suffices to simulate the brain at the neuronal level is itself hypothetical. There are various proposals, some of which I’ve given in this thread, that depend on the exact quantum state, and simulating that is a hard hard problem.
In which case the emergent geometry is exactly Lorentz invariant, so the protocol could never conceivably be implemented.
Also, it isn’t really right to think of holography as supplying an ‘actual’ universe. Both the boundary CFT and the bulk geometry are equivalent descriptions of the same physics; neither is more fundamental than the other. They’re just dual theories.
But there are various limitative theorems that curtail what sort of completions are possible. The one by Landsman I posted above in particular entails that no such completion can be computable.
But anyway, that wasn’t really my point. That was simply that the only guide we reliably have to tell us what’s physically possible are our best current theories of physics, and according to those, reality is not computable.
As far as I’m concerned, “capabilities” which can only exist in a universe which is not our own are not really capabilities at all.
And to be clear: we are already in an edge case of an edge case of an edge case. The possibilities enabled by “true” randomness are so distant from actual practical applications that they may as well be nonexistent. In reality, even very weak pseudorandom generators are enough for almost all applications (Monte Carlo, etc.). The rare cases where “true” randomness is desirable (cryptographic keys) have less to do with the randomness than with the ease with which flaws can creep into pseudorandom generators (such as dumb ways of picking seeds).
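For instance, an utterly ordinary Monte Carlo estimate of π runs happily off a seeded pseudorandom generator (Python’s default Mersenne Twister here, which is nowhere near cryptographic strength):

```python
import random

def estimate_pi(n_samples=1_000_000, seed=0):
    rng = random.Random(seed)               # deterministic, yet "random enough"
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()   # point in the unit square
        inside += (x * x + y * y) <= 1.0    # did it land in the quarter circle?
    return 4 * inside / n_samples

print(estimate_pi())   # ~3.14, despite every bit being pseudorandom
```

Every bit of “randomness” in that run is determined by the seed, and the estimate is none the worse for it.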
I genuinely fail to see what the “purpose” of AIXI is. It surely can’t be an actual proposal for a machine-learning system, since as you say it’s incomputable. I’m not sure anything could be more worthless than an incomputable function that’s supposed to run on a computer.
It surely can’t be a model for human intelligence, either–at no point does it include any observation of the brain or its capabilities.
I did see that the Wikipedia article claims a Monte Carlo version of AIXI (MC-AIXI) has been implemented and can play a simple version of Pac-Man. But I found the source code, and it turns out they’re just using a pseudorandom Python module. Oh dear!
Now as it happens, I actually fully agree with the idea that intelligence can be seen as a type of compression. In fact, I see this as the key insight to why the Chinese Room–aside from its impossibility–is not intelligent, whereas a brain or a sufficiently advanced neural net is. The latter two can transform far more data (exponentially more) than they contain in their structure, and are thus excellent (if somewhat lossy) compressors.
The paper seems to have the same flaw as the previous one: it doesn’t show that there’s enough computing power in the universe to actually violate causality. To their credit, they do at least halfway acknowledge the problem:
It is important to note that, without any knowledge of B, there is no a priori bound on the time it takes Bob to determine Alice’s message with high enough confidence. Nonetheless, since this time is finite, there exists some finite distance for which the communication allowed by our protocol is superluminal. For instance, if it takes Bob M rounds to find out Alice’s message and each round takes a time T, then if they are at a distance cTM, the message is obtained before a light signal from Alice could reach Bob.
Well, that’s possibly true as far as it goes. But “finite” is doing an incredible amount of work here. Their algorithm looks at all computable functions with runtime under some finite O(t)! This is not a small number.
Funnily enough, the paper does acknowledge a caveat, of sorts:
It is worth mentioning that our result is not in conflict with the different interpretations of quantum mechanics. All of them predict random outputs, which are not allowed by our model. In the Copenhagen interpretation, the measurement process is postulated as random, whereas, for example, Bohmian mechanics is deterministic but postulates initial conditions that are randomly distributed and fundamentally unknowable.
Which is almost precisely the type of pseudorandom generation I have been suggesting, just expanded to the entire universe instead of an entangled pair at a time.
So far, the cites have been less than convincing. I would like to see one that actually calculated the computational costs involved. Even without taking a hard stance on computability, the universe is undoubtedly informational. The Bekenstein bound proves that much, not to mention other fundamental limits like Bremermann’s limit or Landauer’s principle. Any claim on the computability of physical law must take these things into account.
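For reference, the limits being invoked, in their standard forms as I understand them:

```latex
S \;\le\; \frac{2\pi k_B R E}{\hbar c} \quad\text{(Bekenstein bound)},
\qquad
E_{\text{erase one bit}} \;\ge\; k_B T \ln 2 \quad\text{(Landauer)},
\qquad
\nu_{\max} \;\approx\; \frac{c^2}{h} \;\approx\; 1.36\times10^{50}\ \tfrac{\text{bits}}{\text{s}\cdot\text{kg}} \quad\text{(Bremermann)}.
```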
I think I can get behind the statement that the brain is a general-purpose prediction machine. But it’s certainly not a perfect general-purpose prediction machine. Sometimes we predict things and get the predictions wrong. It happens quite frequently, in fact. So any implication that computers can’t be perfect prediction machines is irrelevant, because that’s just another way that they’re like brains.
So you’re the arbiter of what’s possible in this universe—good to know.
But still, that doesn’t impinge on the logic. To show that an argument doesn’t go through, it suffices to show a counterexample, even if it is counterfactual.
As an example, take the case of theodicy. One might argue that if there is free will, then the existence of evil is compatible with an omnipotent, omniscient and omnibenevolent god (not that I think that’s sound). Pointing out that there is no free will in this universe does nothing to refute that argument: what’s shown thereby is that the notions of omnipotence, omniscience, and omnibenevolence are not logically incompatible with the existence of evil, since it is possible to reconcile them.
The same here. The cop-and-robber example shows that there are circumstances where randomness yields an advantage, hence, any argument to the contrary is mistaken even if that particular example is not realizable in our universe.
The purpose is to show what resources it would take to realize such a general-purpose agent, and how to formalize that. The incomputability is then just a result.
Again, it is supposed to model a general inference agent, and hence, a general intelligence (in the precise sense that it is asymptotically as efficient as the best special purpose program at any given task it is faced with). Whether human intelligence implements such inference is of course an open question.
The laws of physics don’t really care about computing power; if causality can be violated using a magical supercomputer (or if certain conjectures of complexity theory turn out false), then causality is not absolute, contra special relativity.
The randomness of the initial conditions in Bohmian mechanics must be algorithmic randomness (the initial conditions are incompressible, not something a pseudorandom generator could supply), so this is spinning straw from gold as long as you’ve got enough gold.
You haven’t reacted to the bulk of the cites, just to the two with conflicts with causality. But the papers by Landsman, Svozil and Calude, Cubitt et al., Malament and Hogarth, and so on, just as much imply the uncomputability of the laws of physics as currently known. Speculating at anything else is at best a wild leap—we have no current way of knowing whether there even is a consistent computable formulation of these laws (for instance, the continuum might be essential to any such theory, but that isn’t a computable entity).
That’s itself a controversial metaphysical position, essentially a form of structural realism. The Bekenstein bound puts a limit on the amount of information within a given spacetime volume, but says nothing to the effect that information is all there is; neither do Bremermann’s, Landauer’s, or Lloyd’s results.
Of course. But even a full-fledged implementation of AIXI would be fallible—it uses a formalized version of induction (Solomonoff inference), and inductive conclusions are always defeasible. AIXI might produce the most parsimonious continuation of a given sequence, but that can still be just wrong.
I learned a lot by Googling the applications for GPT. What stands out is the tool nature of GPT. It’s not a math engine or industrial control unit. It’s more like a slide rule that can yield results in the hands of a skilled operator. Key to this is the prompt. It’s not a crude one-liner like I’ve been using. The prompt has to contain all of the information needed to define the problem. So when I asked for a critique of Jeffers’ poem Ink-Sac, I should have quoted a copy in the prompt, along with a detailed description of my interest in the critique. If I had obtained the result I was looking for, it would have been a cooperative effort between me and GPT, not a response to a one-liner.
So, developing a calculus proof would not have been the result of a terse inquiry, because GPT is not a calculus engine. It would have been the product of someone skilled in calculus providing GPT the information needed to create the proof. Credit for the result has to be shared.
No, you don’t have to provide the information for solving the problem in the prompt.
And the prompt doesn’t have to be complex. It just has to be specific enough that ChatGPT knows what it is you are looking for. Ambiguous prompts lead to ambiguous answers.
Now, you CAN provide information in a prompt. For example, if you have a big enough context window you can dump a research paper into it and ask ChatGPT to answer questions about it. But it sounds like what you are claiming is that the proof ChatGPT did wasn’t real, because somehow the info needed for the proof was in the prompt. And that’s not the case.
There are also advanced prompting techniques that can help, like few-shot learning, chain-of-reasoning techniques, etc. But they aren’t necessary for the proof.
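To make “few-shot” concrete, it’s nothing more exotic than pasting worked examples ahead of the real question; something like this (the examples and formatting are mine, not anything the API prescribes):

```python
# Build a few-shot prompt by prepending worked examples to the actual question.
examples = [
    ("Differentiate x^2 with respect to x.", "d/dx x^2 = 2x"),
    ("Differentiate sin(x) with respect to x.", "d/dx sin(x) = cos(x)"),
]
question = "Differentiate x^3 + sin(x) with respect to x. Show each step."

prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\n\nQ: {question}\nA:"
print(prompt)   # this whole string is what gets sent as the model's input
```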