Hell, it’s debatable whether we fully understand how LLMs work internally.
On the other hand, five-year-olds are capable of intellectual feats probably decades out of reach for AI. After all, the language fluency gained by that age was gleaned from whatever examples of speech happened to occur in the vicinity of the toddler, while it took the AI about 80,000 years of uninterrupted reading (for the training data of GPT-4, at the speed of an average reader) to reach its level of linguistic competence.
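Just to put rough numbers on that (a back-of-envelope sketch using the commonly rumoured ~13 trillion token figure for GPT-4's training set, not anything official):

```python
# Rough sanity check of the "80,000 years of reading" figure.
# Assumptions (rumoured, not official): ~13 trillion training tokens,
# ~0.75 words per token, and a reading speed of ~250 words per minute.
tokens = 13e12
words = tokens * 0.75
words_per_minute = 250

minutes = words / words_per_minute
years = minutes / (60 * 24 * 365)   # reading around the clock, no breaks
print(f"~{years:,.0f} years of uninterrupted reading")   # roughly 74,000 years
```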
Indeed, I think the main lesson from the current crop of chatbots is that performance is, generally, a bad judge of capability. After all, the proverbial lookup table could equal any given performance, provided enough work went into creating it. Rather, we should use the work needed to gain new capabilities as the metric, and by that measure we outrank these AIs by a massive amount. It's just that they can get through that huge amount of work in a feasible time frame, which is what lets them convincingly replicate human-like performance on some tasks.
Well said.
But that's one of the points of the article: algorithmic efficiencies are part of why AI is advancing so quickly. He feels algorithmic efficiencies alone will lead to an increase of about two orders of magnitude over the next four years.
Right now it takes 80,000 years of 24/7 training to learn something a human can learn in a few hours a day for 5 years.
But as the programs get more efficient it’ll take less time, which means you get more productive outputs for the same level of compute.
If right now it takes AI the equivalent of 700 million hours to learn what a child learns in ~10,000 hours, then once we learn how to train the AI in 7 million hours, or 70,000 hours, we should see higher capabilities, since we will be applying the same level of compute (or more) as we did when we needed 700 million hours.
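To make the arithmetic explicit (rough figures only, carried over from the estimates above):

```python
# The same gap expressed in hours, plus the hypothetical efficiency gains.
ai_hours = 80_000 * 365 * 24       # ~700 million hours of round-the-clock training
child_hours = 5.5 * 365 * 5        # a few hours a day for five years, ~10,000 hours

print(f"AI:    {ai_hours:,.0f} hours")
print(f"Child: {child_hours:,.0f} hours")
print(f"Gap:   {ai_hours / child_hours:,.0f}x")

# Two orders of magnitude of algorithmic efficiency brings it to ~7 million
# hours; four orders brings it to ~70,000 hours.
for factor in (100, 10_000):
    print(f"With {factor:,}x efficiency: {ai_hours / factor:,.0f} hours")
```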

But as the programs get more efficient it’ll take less time, which means you get more productive outputs for the same level of compute.
As far as anybody can tell, there seem to be limits to that, though. Neural scaling laws have been remarkably consistent over many orders of magnitude regarding parameters/dataset/compute, such that the only way to significantly decrease loss (error rate) seems to be increasing those factors.

If right now it takes AI the equivalent of 700 million hours to learn what a child learns in ~10,000 hours, then once we learn how to train the AI in 7 million hours, or 70,000 hours, we should see higher capabilities, since we will be applying the same level of compute (or more) as we did when we needed 700 million hours.
That isn’t clear to me. As I understand it, the aforementioned scaling laws often level off at some point, such that there is a certain loss rate that can’t be overcome even with infinite compute. They also aren’t linear, but power laws, such that any increase will typically suffer diminishing returns.
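For a sense of what those papers report, here's a sketch using the power-law form from Hoffmann et al.'s 'Chinchilla' paper; the constants below are purely illustrative, not fitted values:

```python
# Chinchilla-style loss model: L(N, D) = E + A / N**alpha + B / D**beta.
# E is an irreducible loss floor that no amount of parameters (N) or
# data (D) can push below; the constants here are illustrative only.
E, A, B = 1.7, 400.0, 410.0
alpha, beta = 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

# Hold data fixed and scale parameters: each 10x step buys less than the last.
for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"N = {n:.0e}: loss = {loss(n, 1e12):.3f}")
# The loss can never drop below E = 1.7, even with infinite N and D.
```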
Furthermore, algorithmic efficiencies seem to predominantly bring down the necessary amount of compute, not the size of the dataset needed. (My figure was independent of the amount of compute needed, so it presumably wouldn’t actually change much.)
I haven’t yet delved further into the Aschenbrenner article, but already its first graph seems highly suspect:
Just marking GPT-4 as a ‘smart high schooler’ seems simply preposterous: I would expect even an average high schooler to be able to absorb the basics of a new language within a couple of years of study from a few basic texts, something currently impossible for any AI. Finally, why would I expect things to just carry on exponentially like that? It’s like graphing the height of a high-rise under construction and concluding we’ll hit the moon in a couple of years. Every exponential is just a sigmoid waiting to happen.
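To illustrate that last point with a toy example (nothing here models AI progress specifically):

```python
import math

# Toy comparison: pure exponential growth vs. logistic (sigmoid) growth with
# the same initial rate. Early on the two are nearly indistinguishable; only
# near the ceiling K do they diverge.
K, r, x0 = 1000.0, 1.0, 1.0   # ceiling, growth rate, starting value

def exponential(t):
    return x0 * math.exp(r * t)

def logistic(t):
    return K / (1 + (K / x0 - 1) * math.exp(-r * t))

for t in range(0, 11, 2):
    print(f"t={t:2d}  exponential={exponential(t):10.1f}  sigmoid={logistic(t):7.1f}")
# Extrapolating the early, exponential-looking segment badly overshoots once
# the curve saturates.
```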
EDIT: I originally also claimed that the graph already seems to be wrong, as nothing exceeding GPT-4 by a factor of 10 seems to have been presented so far, but then I realized I don’t actually know what exactly is meant by ‘effective compute’ (something like ‘compute + algorithmic efficiencies’ according to the text), so I can’t really gauge this. If anybody has any elucidation, I’d like to hear it.

Furthermore, algorithmic efficiencies seem to predominantly bring down the necessary amount of compute, not the size of the dataset needed. (My figure was independent of the amount of compute needed, so it presumably wouldn’t actually change much.)
Isn’t it possible that this is caused by or at least contributed to by the fact that the giant datasets already exist, so you may as well continue training new AI on the whole set?
Has anyone specifically tested and/or tried to develop new generations of AI to use less data, or have most efforts focused on computation?
Right now, the data is “free” (it took much effort to compile, and there are legal challenges to be defended against, but these are sunk costs; the AI companies have the giant datasets already) while computation time is expensive, so it makes sense that incentives would drive AI developers to focus their efforts on computation efficiency.
If the courts rule that the data access is essentially a free-for-all, that situation isn’t likely to change soon. If they rule that companies have to pay for it, or aren’t allowed to use it at all, it might make sense to develop AI that can learn from a much smaller dataset.

Isn’t it possible that this is caused by or at least contributed to by the fact that the giant datasets already exist, so you may as well continue training new AI on the whole set?
Unlikely, since training on patterns in large sets of data is how the LLM version of “AI” works. One-trick ponies don’t work well if you take their pony.

Has anyone specifically tested and/or tried to develop new generations of AI to use less data, or have most efforts focused on computation?
I believe it’s standard practice to evaluate model performance at certain intervals of training data (often exponential steps) to see whether the accuracy still increases, or levels off. E.g.:
It’s important to note that the relationship between data quantity and language model performance is not always linear. In some cases, doubling the training data may yield diminishing returns in metrics like perplexity or downstream task accuracy. This is where the concept of orders of magnitude becomes particularly useful. By evaluating performance at exponential intervals (10K, 100K, 1M), we can identify these points of diminishing returns and make strategic decisions accordingly.
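If anyone wants to see what such a sweep looks like in miniature, here's a sketch; the model and dataset are toy stand-ins built with scikit-learn, nothing like an actual LLM training run:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data-scaling sweep: train the same model on exponentially larger slices
# of the training data and watch where the accuracy gains flatten out.
# (A cheap stand-in for the far more expensive LLM version of this experiment.)
X, y = make_classification(n_samples=120_000, n_features=40,
                           n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20_000, random_state=0)

for n in (100, 1_000, 10_000, 100_000):
    model = LogisticRegression(max_iter=1_000)
    model.fit(X_train[:n], y_train[:n])
    print(f"{n:>7,} training samples: test accuracy = {model.score(X_test, y_test):.3f}")
# Typically each 10x jump in data buys a smaller improvement than the last.
```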
There are also theoretical reasons to expect the existence of such scaling laws. I’m not well-versed in machine learning, so take this with a grain of salt, but one dominant paradigm is the manifold hypothesis, which essentially posits that the actual ‘interesting’ data lie on a lower-dimensional submanifold of the full high-dimensional dataspace, most of which will just correspond to random noise. You then need enough samples (points on the manifold) to characterize it, and the ‘spacing’ of those samples scales with the dimension of the dataset, which gives a bound on the error you’d expect for new datapoints. (This is explained much better and in more depth in this video.)
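A crude way to see why dimension matters so much (a toy counting argument, not the actual manifold-hypothesis math):

```python
# Covering the unit d-dimensional cube at a fixed sample spacing eps needs on
# the order of (1/eps)**d samples: the curse of dimensionality. What matters
# for learning is the intrinsic dimension of the data manifold, not the raw
# dimension of the data space, which is why the manifold hypothesis helps.
eps = 0.1   # desired spacing between samples

for d in (1, 2, 3, 10, 20):
    samples_needed = (1 / eps) ** d
    print(f"dimension {d:2d}: ~{samples_needed:.0e} samples")
```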
Stupid question: why do AIs require individual training? Can’t you just train one and copy it? They’re still computers, after all.
I totally agree that the driving force behind unbridled AI development is its military applications and the dire consequences that can result from getting fatally behind in its development. This has fostered an approach that says, “Yeah, we realize this could go severely sideways but, damn the torpedoes and full speed ahead!”
Netflix has a good documentary on the subject. They interviewed the man who is considered the Air Force’s most experienced, most skilled, and best fighter pilot. They had him engage AI via computer, and the results were dismal for the human. Another general who was also an Ace during his active fighter days bluntly stated that no human would stand a chance against an AI in combat, be it an individual or an entire force. Scary stuff.
Can you be more specific here? I’m not sure what you mean because copying models is done constantly.
Yeah, that’s generally what happens. The exceptions are when you want to use a different model with more parameters (and theoretically more capability), in which case you have to train the new model, or when you want a version of the same model to do a different task that wasn’t part of the original training, in which case you have to train a new model.
But if you want multiple copies of the same trained model, you can just duplicate it, and that happens a lot.
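To make that concrete, here's a minimal sketch in PyTorch (just one common framework; the tiny network is made up for illustration): once a model is trained, its weights are just data you can duplicate into as many identical copies as you like, but they only fit that exact architecture.

```python
import torch
import torch.nn as nn

# A tiny stand-in network; real LLMs differ in scale, not in principle.
class TinyNet(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(10, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 2))

    def forward(self, x):
        return self.layers(x)

trained = TinyNet()
# ... (imagine an expensive training run happening here) ...
torch.save(trained.state_dict(), "weights.pt")   # the learned parameters

# Copying is trivial: instantiate the same architecture and load the weights.
clone = TinyNet()
clone.load_state_dict(torch.load("weights.pt"))

# But the weights only fit that exact architecture. A model with more
# parameters has differently shaped tensors, so it has to be trained anew.
bigger = TinyNet(hidden=64)
try:
    bigger.load_state_dict(torch.load("weights.pt"))
except RuntimeError as err:
    print("Can't reuse the weights:", type(err).__name__)
```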
This is also part of the drive to try to develop AGI - that is, to create something that is smart enough to be able to apply itself to any task, like a human might.