2.4 Scaling Laws
Introduction
Compelling evidence shows that increases in the performance of many AI systems can be modeled with equations called scaling laws. Common knowledge suggests that larger models with more data will perform better, frequently reiterated in phrases like "add more layers" or "use more data." Scaling laws make this folk knowledge mathematically precise. In this section, we show that the performance of a deep learning model scales according to parameter count and dataset size, both of which are primarily bottlenecked by the computational resources available. Scaling laws describe the relationship between a model's performance and these primary inputs.
Conceptual Background: Power Laws
Power laws are mathematical equations that model how a particular quantity varies as a power of another. In power laws, the variation in one quantity is proportional to a power (exponent) of the variation in another. The power law y = bx^a states that the change in y is directly proportional to the change in x raised to a certain power a. If a is 2, then when x is doubled, y will quadruple. One real-world example is the relation between the area of a circle and its radius. As the radius changes, the area changes as the square of the radius: y = πr^2. This is a power-law equation where b = π and a = 2. The volume of a sphere has a power-law relationship with the sphere's radius as well: y = (4/3)πr^3 (so b = 4π/3 and a = 3). Scaling laws are a particular kind of power law that describe how deep learning models scale. These laws relate a model's loss to model properties (such as the number of model parameters or the dataset size used to train the model).
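To make this doubling behavior concrete, here is a minimal Python sketch (added for illustration; it is not part of the original example) that evaluates the circle-area and sphere-volume power laws before and after doubling the radius.

```python
import math

def power_law(x, b, a):
    """Evaluate y = b * x^a."""
    return b * x ** a

r = 3.0
area_before = power_law(r, math.pi, 2)           # circle area: b = pi, a = 2
area_after = power_law(2 * r, math.pi, 2)
print(area_after / area_before)                  # 4.0 -> doubling r quadruples the area

vol_before = power_law(r, 4 * math.pi / 3, 3)    # sphere volume: b = 4*pi/3, a = 3
vol_after = power_law(2 * r, 4 * math.pi / 3, 3)
print(vol_after / vol_before)                    # 8.0 -> doubling r gives 2^3 = 8x the volume
```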
Log-log plots can be used to visualize power laws. Log-log plots can help make these mathematical relationships easier to understand and identify. Consider the power law y = bx^a again. Taking the logarithm of both sides, the power law becomes log(y) = a·log(x) + log(b). This is a linear equation (in logarithmic space) where a is the slope and log(b) is the y-intercept. Therefore, a power-law relationship will appear as a straight line on a log-log plot (such as Figure 2.22), with the slope of the line corresponding to the exponent in the power law.
Figure 2.22: An object in free fall in a vacuum falls a distance proportional to the square of the elapsed time. On a log-log plot, this power law looks like a straight line.
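The same linear-in-log-space behavior can be checked numerically. The short sketch below (an illustrative addition, assuming NumPy is available) generates points from a power law with arbitrary constants, fits a line to the log-transformed data, and recovers the exponent as the slope.

```python
import numpy as np

b, a = 2.5, 1.7                       # arbitrary power-law constants for illustration
x = np.logspace(0, 4, 50)             # x values spanning several orders of magnitude
y = b * x ** a                        # y = b * x^a

# In log space the relationship is linear: log(y) = a*log(x) + log(b)
slope, intercept = np.polyfit(np.log10(x), np.log10(y), deg=1)
print(slope)                          # 1.7 (the exponent a)
print(10 ** intercept)                # 2.5 (the prefactor b)
```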
Power laws are remarkably ubiquitous. Power laws are a robust mathematical framework that can describe, predict, and explain a vast range of phenomena in both nature and society. Power laws are pervasive in urban planning: log-log plots relating variables like city population to metrics such as the percentage of cities with at least that population often result in a straight line (see Figure 2.23). Similarly, animals' metabolic rates are proportional to a power of their body mass, showcasing a clear power law. In social media, the distribution of user activity often follows a power law, where a small fraction of users generate most of the content (that is, the frequency of content generation y is proportional to the number of active users x raised to some exponent and multiplied by a constant: y = bx^a). Power laws govern many other things, such as the frequency of word usage in a language, the distribution of wealth, the magnitude of earthquakes, and more.
Figure 2.23: Power laws are used in many domains, such as city planning. [1]
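As a rough illustration of such distributional power laws (a sketch with simulated data, not an analysis of real social-media activity), the snippet below draws heavy-tailed "activity" values from a Pareto distribution and verifies that the fraction of users above a given activity level traces a straight line on a log-log scale.

```python
import numpy as np

rng = np.random.default_rng(0)
activity = rng.pareto(a=1.2, size=100_000) + 1   # heavy-tailed "posts per user" sample

# Complementary CDF: fraction of users whose activity exceeds each threshold.
thresholds = np.logspace(0, 3, 20)
ccdf = [(activity > t).mean() for t in thresholds]

# For a power-law tail, log(CCDF) vs. log(threshold) is approximately linear.
slope, _ = np.polyfit(np.log10(thresholds), np.log10(ccdf), deg=1)
print(slope)   # close to -1.2, the tail exponent used above
```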
2.4.1 Scaling Laws in Deep Learning
Introduction
Power laws in the context of DL are called scaling laws. Scaling laws [2], [3] predict loss given model size and dataset size in a power-law relationship. Model size is usually measured in parameters, while dataset size is measured in tokens. As both variables increase, the model's loss tends to decrease. This decrease in loss with scale often follows a power law: the loss drops substantially, but not linearly, with increases in data and model size. For instance, if we doubled the number of parameters, the loss would not simply halve: it would shrink by a factor of 2 raised to the scaling exponent, falling to one-fourth or one-eighth of its previous value if that exponent were 2 or 3, for example. This power-law behavior in AI systems allows researchers to anticipate and strategize how to improve models by investing more in increasing the data or the parameters.
Scaling laws in DL predict loss based on model size and dataset size. In deep learning, power-law relationships exist between the model's performance and other variables. These scaling laws can forecast the performance of a model given different values for its parameter count, dataset size, and amount of computational resources. For instance, we can estimate a model's loss if we were to double its parameter count or halve the training dataset size. Scaling laws show that it is possible to accurately predict the loss of an ML system using just two primary variables:
- N: The size of the model, measured in the number of parameters. Parameters are the weights in a model that are adjusted during training. The number of parameters in a model is a rough measure of its capacity, or how much it can learn from a dataset.
- D: The size of the dataset the model is trained on, measured in tokens, pixels, or other fundamental units. The unit depends on the model's task: in natural language processing, tokens are subunits of language, while in computer vision the units may be pixels or entire images. Some models are trained on datasets that combine multiple modalities.
Improving model performance is typically bottlenecked by one of these variables.
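To see how a scaling law is used in practice, the sketch below implements a loss predictor of the general form L(N, D) = E + A/N^α + B/D^β, which is the form of the Chinchilla law presented later in this section. The constants here are placeholders chosen purely for illustration, not fitted values.

```python
def predicted_loss(n_params, n_tokens, E=1.8, A=400.0, B=400.0, alpha=0.33, beta=0.28):
    """Generic parametric scaling law L(N, D) = E + A/N^alpha + B/D^beta.

    The constants are placeholders for illustration; real values are fit to
    empirical training runs (see the Chinchilla law later in this section).
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta

N, D = 10e9, 200e9                      # a hypothetical 10B-parameter model on 200B tokens
base = predicted_loss(N, D)
print(base)                             # baseline predicted loss
print(predicted_loss(2 * N, D) - base)  # effect of doubling parameters (loss goes down)
print(predicted_loss(N, D / 2) - base)  # effect of halving data (loss goes up)
```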
The computational resources used to train a model are vital for scaling. This factor, called compute, is most often measured as the total number of calculations performed during training, counted in floating-point operations (FLOP); the speed of the hardware is measured in FLOP/s, the number of floating-point operations performed per second. Practically, increasing compute means training with more processors, more powerful processors, or for a longer time. Models are often allocated a set budget for computation: scaling laws can determine the ideal model and dataset size given that budget.
Computing power underlies both model size and dataset size. More computing power enables larger models with more parameters and facilitates the collection and processing of more tokens of training data. Essentially, greater computational resources facilitate the development of more sophisticated AI models trained on expanded datasets. Therefore, scaling is contingent on increasing computation.
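One common rule of thumb (an approximation we add here for concreteness; it is not stated in the text above) is that training a dense transformer costs roughly 6 floating-point operations per parameter per training token, i.e., C ≈ 6·N·D FLOP. The sketch below uses this approximation to show how a fixed compute budget trades off model size against dataset size.

```python
def training_flop(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOP per parameter per token (rule of thumb)."""
    return 6 * n_params * n_tokens

budget = 1e23                                  # a hypothetical compute budget, in FLOP
for n_params in (1e9, 10e9, 100e9):            # candidate model sizes
    n_tokens = budget / (6 * n_params)         # tokens affordable under the budget
    print(f"{n_params:.0e} params -> {n_tokens:.2e} tokens, "
          f"{training_flop(n_params, n_tokens):.1e} FLOP")
```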
The Chinchilla Scaling Law: an Influential Example
The Chinchilla scaling law emphasizes data over model size [4]. One significant research finding that shows the importance of scaling laws was the successful training of the LLM "Chinchilla." A relatively small model with only 70 billion parameters, Chinchilla outperformed much larger models because it was trained on far more tokens than pre-existing models. This led to the Chinchilla scaling law, which predicts loss as a function of parameter count and dataset size. This law demonstrated that larger models require much more data than was standard.
Figure 2.24: Chinchilla's loss is much lower (indicated by its lighter color) than that of other models of similar size. It used much more data than other models. [5]
The Chinchilla scaling law equation encapsulates these relationships. The Chinchilla scaling law is estimated to be L(N, D) = 1.69 + 406.4/N^0.34 + 410.7/D^0.28. In the equation above, N represents parameter count, D represents dataset size, and L stands for loss. This equation describes a power-law relationship. Understanding this law can help us understand the interplay between these factors, and knowing these values helps developers make optimal decisions about investments in increasing model and dataset size.
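The hedged sketch below plugs the fitted constants quoted above into code and evaluates the law at roughly Chinchilla's reported training scale (about 70 billion parameters and 1.4 trillion tokens [4]), then shows the predicted effect of doubling either input. It is meant as an illustration of how the formula is read, not a reproduction of the original analysis.

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla scaling law, L(N, D) = 1.69 + 406.4 / N^0.34 + 410.7 / D^0.28."""
    return 1.69 + 406.4 / n_params ** 0.34 + 410.7 / n_tokens ** 0.28

N, D = 70e9, 1.4e12                 # roughly Chinchilla's parameter and token counts
print(chinchilla_loss(N, D))        # ~1.94 predicted loss
print(chinchilla_loss(2 * N, D))    # doubling parameters helps a little
print(chinchilla_loss(N, 2 * D))    # doubling data helps slightly more at this scale
```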
Scaling laws for DL hold across many modalities and orders of magnitude. An order of magnitude is a factor of 10: if something increases by an order of magnitude, it increases by 10 times. In deep learning, evidence suggests that scaling laws hold across many orders of magnitude of parameter count and dataset size. This implies that the same scaling relationships remain valid both for a small model trained on a small dataset and for a massive model trained on trillions of tokens. Scaling laws have continued to hold even as model size increases dramatically.
Figure 2.25: The scaling laws for different DL models look remarkably similar. [3]
Discussion
Scaling laws are not universal for ML models. Not all models follow scaling laws. These relationships are stronger for some types of models than others. Generative models such as large language models tend to follow regular scaling laws: as model size and training data increase in scale, performance improves smoothly and predictably in a relationship described by a power-law equation. But for discriminative models such as image classifiers, clear scaling laws have not yet emerged. Performance may plateau even as dataset size or model size increases.
Better learning algorithms can boost model performance across the board. An improved algorithm improves the constants in the scaling law, lowering the loss that can be achieved with a given number of tokens or parameters. However, crafting better learning algorithms is quite difficult. Therefore, improving DL models generally focuses on increasing the core variables for scaling: tokens and parameters.
The bitter lesson: scaling beats intricate, expert-designed systems. Hard-coding AI systems to follow pre-defined processes using expert insights has proven slower and more failure-prone than building large models that learn from large datasets. The following observation is Richard Sutton's "bitter lesson" [6]:
- AI researchers have often tried to build knowledge into systems,
- "This always helps in the short term [...], but in the long run it plateaus and it even inhibits further progress,
- breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning."
This suggests that it is easier to create machines that can learn than to have humans manually encode them with knowledge. For now, the most effective way to do this seems to be scaling up deep learning models such as LLMs. This lesson is "bitter" because it shows that simpler scaling approaches tend to beat more elegant and complex techniques designed by human researchers, which is demoralizing for researchers who spent years developing those complex approaches. Rather than just human ingenuity, scale and computational power are also key factors that drive progress in AI.
Conclusion
In AI, scaling laws describe how loss changes with model and dataset size. We observed that the performance of a DL model scales according to the number of parameters and tokens, both shaped by the amount of compute used. Evidence from generative models like LLMs indicates a smooth reduction in loss with increases in model size and training data, adhering to a clear scaling law. Scaling laws are especially important for understanding how changes in variables like the amount of data used can have substantial impacts on the model's performance.
References
[1] M. Newman, "Power laws, Pareto distributions and Zipf's law," Contemporary Physics, vol. 46, no. 5, pp. 323–351, Sep. 2005, doi: 10.1080/00107510500052444.