The Scaling Paradox
Toby Ord
AI capabilities have improved remarkably quickly, fuelled by the explosive scale-up of resources being used to train the leading models. But if you examine the scaling laws that inspired this rush, they actually show extremely poor returns to scale. What’s going on?
AI Scaling is Shockingly Impressive
The era of LLMs has seen remarkable improvements in AI capabilities over a very short time. This is often attributed to the AI scaling laws — statistical relationships which govern how AI capabilities improve with more parameters, compute, or data. Indeed AI thought-leaders such as Ilya Sutskever and Dario Amodei have said that the discovery of these laws led them to the current paradigm of rapid AI progress via a dizzying increase in the size of frontier systems.
Before the 2020s, most AI researchers were looking for architectural changes to push the frontiers of AI forwards. The idea that scale alone was sufficient to provide the entire range of faculties involved in intelligent thought was unfashionable and seen as simplistic.
A key reason it worked was the tremendous versatility of text. As Turing had noted more than 60 years earlier, almost any challenge that one could pose to an AI system can be posed in text. The single metric of human-like text production could therefore assess the AI’s intellectual competence across a huge range of domains. The next-token prediction scheme was also an instance of both sequence prediction and compression — two tasks that were long hypothesized to be what intelligence is fundamentally about.
By the time of GPT-3 it was clear that it was working. It wasn’t just a dry technical metric that was improving as the compute was scaled up — the text was qualitatively superior to GPT-2 and showed signs of capturing concepts that were beyond that smaller system.
Scaling has clearly worked, leading to very impressive gains in AI capabilities over the last five years. Many papers and AI labs trumpet the success of this scaling in their latest results and show graphs suggesting that the gains from scaling will continue.
For example, here is a chart from OpenAI’s release of their o1 model, captioned: ‘o1 performance smoothly improves with both train-time and test-time compute’.
AI Scaling is Shockingly Unimpressive
But let’s take a closer look at what that chart is showing us. On the y-axis, their measure of AI performance is the model’s accuracy on mathematics problems aimed at elite high school students (via the challenging AIME data set). But the x-axis, showing the amount of compute at training-time and test-time, is on a log scale. So the nice straight lines are really showing very serious diminishing returns to more compute for their model.
Specifically, the model shows logarithmic returns to compute. We can put this in more familiar terms by flipping it around: increasing the accuracy of the model’s outputs requires exponentially more compute.
That is not generally a recipe for success. In computer science, exponential resource usage for linear gains is typically considered a sign that the problem is intractable, or at least intractable to the approach you are using. Indeed, requiring exponential increases in resources (time or space) to continue making progress is characteristic of a brute force approach. It is the relationship you often get with the most naive approaches — those that simply try every possible solution, one by one.
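To make the shape of this concrete, here is a tiny sketch with entirely made-up numbers (the slope of 10 accuracy points per tenfold increase in compute is invented for illustration and not taken from any real model). It shows how logarithmic returns, once inverted, become exponential compute requirements:

```python
# Illustrative only: if accuracy grows logarithmically with compute,
#     accuracy = b * log10(compute),
# then each fixed gain in accuracy costs a constant *multiple* of compute.
# The slope b is a made-up number, not taken from any real model.
b = 10.0  # hypothetical: 10 accuracy points per 10x increase in compute

def compute_needed(accuracy: float) -> float:
    """Invert accuracy = b*log10(C): compute grows exponentially in accuracy."""
    return 10 ** (accuracy / b)

for acc in (30, 40, 50, 60):
    print(f"accuracy {acc}: compute ~ {compute_needed(acc):.0e}")
# accuracy 30 -> 1e+03, 40 -> 1e+04, 50 -> 1e+05, 60 -> 1e+06:
# each extra 10 points multiplies the compute by 10 -- the same shape of blow-up
# as a brute-force search, where each extra input bit doubles the work (2**n).
```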
And o1 isn’t the only place we see this. There was a very perceptible leap from GPT-2 to GPT-3 and from GPT-3 to GPT-4. But each successive model used about 70 times as much compute as the one before (4e21 → 3e23 → 2e25 FLOP). In every other aspect of life we would certainly be hoping to see a perceptible improvement if we put 70 times as many resources into it. And normally we’d consider something that kept requiring such exponentially increasing inputs to see visible progress as proof that things were going badly; that the approach was not successfully scaling.
The recent progress in AI hasn’t been entirely driven by increased computational resources. Epoch AI’s estimates are that compute has risen by 4x per year, while algorithmic improvements have divided the compute needed by about 3 each year. This means that over time, the effective compute is growing by about 12x per year, with about 40% of this gain coming from algorithmic improvements.
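As a quick arithmetic check of the figures quoted above (the compute estimates and growth rates are the ones cited in the text; the roughly 40% share is interpreted as a share of the logarithmic growth):

```python
import math

# Estimated training compute (FLOP) for GPT-2, GPT-3, GPT-4, as quoted above.
gpt2, gpt3, gpt4 = 4e21, 3e23, 2e25
print(gpt3 / gpt2, gpt4 / gpt3)          # ~75 and ~67: roughly 70x per generation

# Epoch AI's estimates: hardware scale-up of ~4x per year and algorithmic
# progress worth ~3x per year, giving ~12x per year in effective compute.
hardware, algorithms = 4, 3
effective = hardware * algorithms         # 12
algo_share = math.log(algorithms) / math.log(effective)
print(effective, round(algo_share, 2))    # 12 0.44: on a logarithmic scale, about
                                          # 40% of the gain comes from algorithms
```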
But these algorithmic refinements aren’t improving the scaling behaviour of AI quality in terms of resources. If it required exponentially more compute to increase quality before, it still does after a year of algorithmic progress; it is just that the constant out the front is a factor of 3 lower. In computer science, an algorithm that solves a problem in exponential time is not considered to be making great progress if researchers keep lowering the constant at the front.
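A small sketch of this point, continuing the made-up model above: dividing the constant by 3 each year buys the same fixed bonus of accuracy at every point on the curve, while the exponential compute requirement for each further gain is untouched.

```python
import math

# Continuing the made-up model above: accuracy = 10 * log10(compute / k).
# Algorithmic progress divides the constant k by 3 each year. At a fixed compute
# budget this buys a fixed bonus of 10*log10(3) ~ 4.8 accuracy points per year;
# the exponential growth of compute needed per extra point is unchanged.
def accuracy(compute: float, k: float) -> float:
    return 10 * math.log10(compute / k)

for years in range(4):
    k = 1.0 / 3 ** years                      # constant falls by 3x per year
    print(years, round(accuracy(1e6, k), 1))  # 60.0, 64.8, 69.5, 74.3
```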
What about the famed scaling laws for LLMs? Don’t they show that LLMs scale so well with resources that this scaling should be the cornerstone in leading labs’ strategies? Let’s look at the key graphs from OpenAI’s 2020 paper ‘Scaling Laws for Neural Language Models’:
These show a steady improvement in test loss (a measure of the inaccuracy of the model) as compute, data, and parameters increase. It appears at first that the model’s inaccuracy is marching down to very low levels without diminishing returns. However, that is deceptive. First, both the x and y axes are logarithmic, so this is actually a power law relationship (which can involve diminishing returns). Second, because the axes have been automatically clipped and rescaled to make the graph start in one corner and end in the opposite corner, there is literally nothing you can tell from the shape or slope of these graphs beyond them being some kind of power law. (If you were to equalise the scale on the x and y axes, the slope would become informative and you would instantly see that the slopes are extremely shallow — thus extremely inefficient scaling.)
But if you look at the numbers on the x-axes (the resources used), you can see that they vary over many orders of magnitude, while in each case the y-axis only covers about half an order of magnitude. For example, on the first graph, lowering the test loss by a factor of 2 (from 6 to 3) requires increasing the compute by a factor of 1 million (from $10^{-7}$ to $10^{-1}$). This shows that the accuracy is extraordinarily insensitive to scaling up the resources used. Or, put another way, there are extreme diminishing returns to increased resource input.
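As a rough back-of-the-envelope check, reading those approximate values off the plot recovers the exponent reported in the paper:

```python
import math

# Approximate values read off the first plot: test loss falls from about 6 to
# about 3 while compute rises from about 1e-7 to 1e-1 (a factor of a million).
loss_ratio = 6 / 3
compute_ratio = 1e-1 / 1e-7
# For a power law  loss ~ compute**(-alpha), the implied exponent is:
alpha = math.log(loss_ratio) / math.log(compute_ratio)
print(round(alpha, 3))  # ~0.05, matching the exponent reported in the paper
```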
A useful way to view this is to understand the exponents on these power law relationships (the power that things are being raised to). Each graph contains an embedded formula showing that the loss is proportional to the resource raised to a very small negative number. It is negative because in this setting the quality of the system is measured with a scale where lower numbers (lower inaccuracies) are better. It will be easier to understand what is going on if we invert that. So let’s define a measure of accuracy as the reciprocal of the test loss. On this revised measure, what used to be halving inaccuracy is now called doubling accuracy.
Using this measure of accuracy, we can remove the negatives from those powers. We see that the first power is now 0.050. What this means is that accuracy scales with the amount of compute used as the 1/20th power. Many of us are not that familiar with fractional powers, but may recall that the square root function is equivalent to the 1/2 power. So here accuracy is growing slower than the square root of the square root of the square root of the square root of the resources (the 1/16th power). That sounds slow, but it is still a bit hard to understand or compare.
Fortunately, we can make this much more intuitive if we flip it around and ask how quickly the required resources increase as a function of desired accuracy. To get that, we just take the reciprocal of the powers listed in the paper. This tells us that the compute required scales as the 20th power of the desired accuracy. That is, if $C$ is compute and $a$ is accuracy, then $C = k_C a^{20}$, for some constant $k_C$. This exponent of 20 is why halving the loss (from 6 to 3) in the first graph required the compute to be doubled 20 times (a factor of 1 million). Similarly for data, $D = k_D a^{11}$, and for the number of parameters, $N = k_N a^{13}$. Exponents this high are so rare in statistical relationships that you’d be forgiven for thinking that they were footnote markers.
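Here is a short restatement of those inverted relations (constants left unspecified) along with the resource cost each one implies for a single doubling of accuracy:

```python
# The inverted relations described above, with accuracy a defined as 1/loss:
#   compute     C = k_C * a**20
#   data        D = k_D * a**11
#   parameters  N = k_N * a**13
# Each doubling of accuracy therefore multiplies the required resource by:
for name, power in [("compute", 20), ("data", 11), ("parameters", 13)]:
    print(f"{name}: x{2 ** power:,}")
# compute: x1,048,576   data: x2,048   parameters: x8,192
# 2**20 is about a million -- the factor-of-a-million compute increase seen above.
```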
Here is a redrawn version of the first scaling law, showing the extremely steep increase in compute required for a given accuracy level:
So the required resources grow polynomially with the desired level of accuracy. This is better than the exponential growth in resources we considered before. In computer science, the boundary between what is tractable and what is intractable is often roughly drawn between problems whose required resources scale polynomially and those where they scale exponentially. But that is partly because the powers seen on polynomials in computer science are rarely higher than 3. Something that scales as the 11th, 13th, or 20th power is a case where the required resources are absolutely blowing up as desired accuracy increases, and a problem whose resources scaled so poorly would rarely be considered tractable in practice.
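To see how different an exponent of 20 is in practice from the small exponents usually meant by ‘polynomial’, here is the resource blow-up each exponent implies for a tenfold improvement in the quality measure (illustrative arithmetic only):

```python
# Resource blow-up implied by each exponent for a 10x improvement in the
# quality measure (illustrative arithmetic only).
for power in (3, 11, 13, 20):
    print(f"exponent {power}: 10x quality costs {10.0 ** power:.0e} times the resources")
# An exponent of 3 needs a thousand times the resources, which might be paid for;
# an exponent of 20 needs 1e+20 times, which never will be.
```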
Does the more recent Chinchilla scaling law paper make things better? No; if anything, it makes them worse. While it does not plot how loss scales with every resource type, it does plot this for compute (see the grey curve in the graph below).
This grey curve is clearly not quite a straight line on this log-log chart and thus not a power law. In this study, increased quality requires more resources than a power law would predict, so resources required grow faster than polynomially, meaning even worse scaling.
Reconciling These Views
So what is going on here? Is AI scaling extremely impressive or is it disappointingly poor? How can we reconcile these perspectives?
There isn’t a simple answer. It depends on what we’re assessing.
From a logistics standpoint, it is impressive how quickly leading labs have been able to scale up their data centres and training runs. Quadrupling the size of your infrastructure every year is an amazing logistical achievement. But from most people’s point of view, it should be taken as being on the costs side of the ledger rather than the benefits side.
If we set aside these dramatically increasing costs, progress in AI capabilities has been impressive over the era of rapid scaling, and a lot of this impressiveness is indeed due to the scaling. Most measures of AI capabilities don’t show exponential progress over time, but there is no doubt that, in qualitative terms, this has been a heady rate of progress for AI, one that feels even quicker than the five years before, in which deep learning had already set a cracking pace.
In terms of AI progress per resource, or per dollar, things are probably getting worse on most measures. This is what the pessimism about scaling laws is getting at. Measures of quality are increasing far slower than the exponentially mounting costs.
So why were people like Ilya Sutskever and Dario Amodei so impressed by the scaling laws? The answer is that there was a lot of headroom in compute — a lot of room to scale those costs. Even if the resources needed would be thousands of times larger than the largest ever ML experiment, they saw that this (1) was still within the feasible set of things a large company could fund and (2) would still be a good deal given the vast potential benefits.
For on the high end of possibilities, we are talking about something that could potentially replace more than half of all human labour in the world, and perhaps bring forward scientific and technological advances that human labour would have taken centuries to reach. In other words, while the costs could escalate wildly, a deep-pocketed project might reach benefits of extreme value before it ran out of money. And those benefits might be worth more than enough to justify those (very high) costs.
More prosaically, we can note that it is entirely possible that financial returns will scale very quickly as a function of the technical measures of AI quality. If so, then even though standard measures of AI quality scale poorly as a function of resources, the financial returns might still scale very well as a function of resources. Indeed, if they scale better than linearly, that would create a paradigm of increasing marginal returns which would explain a landscape with a small number of players, each investing as much as they possibly can.
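As a purely hypothetical sketch of how these two relationships would compose (the exponent m for how financial returns scale with quality is invented for illustration, not an empirical estimate):

```python
# Hypothetical composition (the exponent m below is invented for illustration):
#   quality ~ resources ** (1/20)     -- the poor technical scaling above
#   revenue ~ quality ** m            -- steep financial returns to quality
# so  revenue ~ resources ** (m/20):  increasing marginal returns whenever m > 20.
for m in (10, 20, 30, 40):
    overall = m / 20
    kind = "increasing" if overall > 1 else ("linear" if overall == 1 else "diminishing")
    print(f"m = {m}: revenue ~ resources**{overall} ({kind} marginal returns)")
```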
If the financial benefits of the intermediate systems keep generating enough profit to pay for the next training run, then one can ride this process up to the point where non-financial constraints (such as power and energy use, chip fabrication, or running out of training data) prevent further scaling.
Furthermore, at a certain quality level the AI systems may be capable enough to improve themselves, searching for and implementing the kinds of algorithmic efficiencies and breakthroughs that human AI scientists seek today. If so, then continued scaling up of AI training runs over the rest of the decade may be enough to get AI systems to the bottom of this escalator, where they are smart enough to ride it up through some number of floors of recursive self-improvement before the process tops out at some higher level. The hope of reaching this new regime of self-improvement is another reason why leading labs may be excited to push through this quick — but extraordinarily inefficient — scaling process.
We can compress much of this down to the following observations:
AI progress as a function of time is impressive even if AI progress as a function of resources is not.
The scaling laws are impressively smooth and long-lasting, but they are proof of poor but predictable scaling, rather than of impressive scaling.
While we know that AI quality metrics scale very poorly with respect to resources, the real-world impacts may scale much better.
Finally, it is worth noting that the sheer inefficiency of how current AI systems scale may suggest that there are other approaches which scale substantially more efficiently. The human mind (which doesn’t require an entire internet of data to achieve general intelligence) shows this is possible. If so, we might someday see a step change in AI capabilities when someone discovers a learning architecture that delivers much stronger returns to scale and then turns the full force of the industry’s exponentially scaled data centres upon it.