In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, [1] [2] and training cost.
In general, a deep learning model can be characterized by four parameters: model size, training dataset size, training cost, and the post-training error rate (e.g., the test set error rate). Each of these variables can be defined as a real number, usually written as (respectively: parameter count, dataset size, computing cost, and loss).
A neural scaling law is a theoretical or empirical statistical law between these parameters. There are also other parameters with other scaling laws.
In most cases, the model's size is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-expert models. [3] With sparse models, during inference, only a fraction of their parameters are used. In comparison, most other kinds of neural networks, such as transformer models, always use all their parameters during inference.
The size of the training dataset is usually quantified by the number of data points within it. Larger training datasets are typically preferred, as they provide a richer and more diverse source of information from which the model can learn. This can lead to improved generalization performance when the model is applied to new, unseen data. [4] However, increasing the size of the training dataset also increases the computational resources and time required for model training.
With the "pretrain, then finetune" method used for most large language models, there are two kinds of training dataset: the pretraining dataset and the finetuning dataset. Their sizes have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of pretraining dataset. [5]
In some cases, a small amount of high quality data suffices for finetuning, and more data does not necessarily improve performance. [5]
Training cost is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required). It is important to note that the cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware such as GPUs or TPUs.
The cost of training a neural network model is a function of several factors, including model size, training dataset size, the training algorithm complexity, and the computational resources available. [4] In particular, doubling the training dataset size does not necessarily double the cost of training, because one may train the model for several times over the same dataset (each being an "epoch").
The performance of a neural network model is evaluated based on its ability to accurately predict the output given some input data. Common metrics for evaluating model performance include: [4]
Performance can be improved by using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set.
The 2017 paper [2] is a common reference point for neural scaling laws fitted by statistical analysis on experimental data. Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale. Whereas previous works generally found the scaling exponent to scale like , with , the paper found that .
Of the factors they varied, only task can change the exponent . Changing the architecture optimizers, regularizers, and loss functions, would only change the proportionality factor, not the exponent. For example, for the same task, one architecture might have while another might have . They also found that for a given architecture, the number of parameters necessary to reach lowest levels of loss, given a fixed dataset size, grows like for another exponent .
They studied machine translation with LSTM (), generative language modelling with LSTM (), ImageNet classification with ResNet (), and speech recognition with two hybrid (LSTMs complemented by either CNNs or an attention decoder) architectures ().
A 2020 analysis [9] studied statistical relations between over a wide range of values and found similar scaling laws, over the range of , , and over multiple modalities (text, video, image, text to image, etc.). [9]
In particular, the scaling laws it found are (Table 1 of [9] ):
The scaling law of was confirmed during the training of GPT-3 (Figure 3.1 [10] ).
One particular scaling law ("Chinchilla scaling") states that, for a large language model (LLM) autoregressively trained for one epoch, with a cosine learning rate schedule, we have: [12] where the variables are
and the statistical parameters are
Although Besiroglu et. al. [14] claims that the statistical estimation is slightly off, and should be .
The statistical laws were fitted over experimental data with .
Since there are 4 variables related by 2 equations, imposing 1 additional constraint and 1 additional optimization objective allows us to solve for all four variables. In particular, for any fixed , we can uniquely solve for all 4 variables that minimizes . This provides us with the optimal for any fixed :Plugging in the numerical values, we obtain the "Chinchilla efficient" model size and training dataset size, as well as the test loss achievable:Similarly, we may find the optimal training dataset size and training compute budget for any fixed model parameter size, and so on.
There are other estimates for "Chinchilla efficient" model size and training dataset size. The above is based on a statistical model of . One can also directly fit a statistical law for without going through the detour, for which one obtains:or as tabulated:
/ FLOP | / FLOPs of training Gopher | ||
---|---|---|---|
400 million | 1.92e+19 | 1/29968 | 8.0 billion |
1 billion | 1.21e+20 | 1/5706 | 20.2 billion |
10 billion | 1.23e+22 | 1/2819 | 205.1 billion |
67 billion | 5.76e+23 | 1 | 1.5 trillion |
175 billion | 3.85e+24 | 6.7 | 3.7 trillion |
280 billion | 9.90e+24 | 17.2 | 5.9 trillion |
520 billion | 3.43e+25 | 59.5 | 11.0 trillion |
1 trillion | 1.27e+26 | 221.3 | 21.2 trillion |
10 trillion | 1.30e+28 | 22515.9 | 216.2 trillion |
The Chinchilla scaling law analysis for training transformer language models suggests that for a given training compute budget (), to achieve the minimal pretraining loss for that budget, the number of model parameters () and the number of training tokens () should be scaled in equal proportions, . This conclusion differs from analysis conducted by Kaplan et al., [13] which found that should be increased more quickly than , .
This discrepancy can primarily be attributed to the studies using different methods for measuring model size; Kaplan et al. counted only non-embedding paramters, which when analyzed at smaller model sizes leads to biased coefficients. [15] Secondary effects also arise due to differences in hyperparameter tuning and learning rate schedules. [16]
As Chinchilla scaling has been the reference point for many large-scaling training runs, there had been a concurrent effort to go "beyond Chinchilla scaling", meaning to modify some of the training pipeline in order to obtain the same loss with less effort, or deliberately train for longer than what is "Chinchilla optimal".
Usually, the goal is to make the scaling law exponent larger, which means the same loss can be trained for much less compute. For instance, filtering data can make the scaling law exponent larger. [17]
Another strand of research studies how to deal with limited data, as according to Chinchilla scaling laws, the training dataset size for the largest language models already approaches what is available on the internet. [18] found that augmenting the dataset with a mix of "denoising objectives" constructed from the dataset improves performance. [19] studies optimal scaling when all available data is already exhausted (such as in rare languages), so one must train multiple epoches over the same dataset (whereas Chinchilla scaling requires only one epoch). The Phi series of small language models were trained on textbook-like data generated by large language models, for which data is only limited by amount of compute available. [20]
Chinchilla optimality was defined as "optimal for training compute", whereas in actual production-quality models, there will be a lot of inference after training is complete. "Overtraining" during training means better performance during inference. [21] LLaMA models were overtrained for this reason. Subsequent studies discovered scaling laws in the overtraining regime, for dataset sizes up to 32x more than Chinchilla-optimal. [22]
A 2022 analysis [23] found that many scaling behaviors of artificial neural networks follow a smoothly broken power law functional form:
in which refers to the quantity being scaled (i.e. , , , number of training steps, number of inference steps, or model input size) and refers to the downstream (or upstream) performance evaluation metric of interest (e.g. prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, solve rate, or FID score) in zero-shot, prompted, or fine-tuned settings. The parameters are found by statistical fitting.
On a log–log plot, when is not too large and is subtracted out from the y-axis, this functional form looks like a series of linear segments connected by arcs; the transitions between the segments are called "breaks", hence the name broken neural scaling laws (BNSL).
The scenarios in which the scaling behaviors of artificial neural networks were found to follow this functional form include large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, arithmetic, emergent abilities, double descent, supervised learning, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent).
The architectures for which the scaling behaviors of artificial neural networks were found to follow this functional form include residual neural networks, transformers, MLPs, MLP-mixers, recurrent neural networks, convolutional neural networks, graph neural networks, U-nets, encoder-decoder (and encoder-only) (and decoder-only) models, ensembles (and non-ensembles), MoE (mixture of experts) (and non-MoE) models, and sparse pruned (and non-sparse unpruned) models.
Other than scaling up training compute, one can also scale up inference compute. As an example, the Elo rating of AlphaGo improves steadily as it is allowed to spend more time on its Monte Carlo Tree Search per play. [24] : Fig 4 For AlphaGo Zero, increasing Elo by 120 requires either 2x model size and training, or 2x test-time search. [25] Similarly, a language model for solving competition-level coding challenges, AlphaCode, consistently improved in performance with more search time. [26]
For Hex, 10x training-time compute trades for 15x test-time compute. [27] For Libratus for heads up no-limit Texas hold 'em, and Cicero for Diplomacy , and many other abstract games of partial information, inference-time searching improves performance at a similar tradeoff ratio, for up to 100,000x effective increase in training-time compute. [25]
In 2024, the OpenAI o1 report documented that o1's performance consistently improved with both increased train-time compute and test-time compute, and gave numerous examples of test-time compute scaling in mathematics, scientific reasoning, and coding tasks. [28] [29]
Vision transformers, similar to language transformers, exhibit scaling laws. A 2022 research trained vision transformers, with parameter counts , on image sets of sizes , for computing (in units of TPUv3-core-days). [30]
After training the model, it is finetuned on ImageNet training set. Let be the error probability of the finetuned model classifying ImageNet test set. They found .
Ghorbani, Behrooz et al. [31] studied scaling laws for neural machine translation (specifically, English as source, and German as target) in encoder-decoder Transformer models, trained until convergence on the same datasets (thus they did not fit scaling laws for computing cost or dataset size ). They varied They found three results:
The authors hypothesize that source-natural datasets have uniform and dull target sentences, and so a model that is trained to predict the target sentences would quickly overfit.
[33] trained Transformers for machine translations with sizes on dataset sizes . They found the Kaplan et al (2020) [13] scaling law applied to machine translation: . They also found the BLEU score scaling as .
Hernandez, Danny et al. [34] studied scaling laws for transfer learning in language models. They trained a family of Transformers in three ways:
The idea is that pretraining on English should help the model achieve low loss on a test set of Python text. Suppose the model has parameter count , and after being finetuned on Python tokens, it achieves some loss . We say that its "transferred token count" is , if another model with the same achieves the same after training on Python tokens.
They found for pretraining on English text, and for pretraining on English and non-Python code.
In statistics, a power law is a functional relationship between two quantities, where a relative change in one quantity results in a relative change in the other quantity proportional to the change raised to a constant exponent: one quantity varies as a power of another. The change is independent of the initial size of those quantities.
The Pareto distribution, named after the Italian civil engineer, economist, and sociologist Vilfredo Pareto, is a power-law probability distribution that is used in description of social, quality control, scientific, geophysical, actuarial, and many other types of observable phenomena; the principle originally applied to describing the distribution of wealth in a society, fitting the trend that a large portion of wealth is held by a small fraction of the population. The Pareto principle or "80-20 rule" stating that 80% of outcomes are due to 20% of causes was named in honour of Pareto, but the concepts are distinct, and only Pareto distributions with shape value of log45 ≈ 1.16 precisely reflect it. Empirical observation has shown that this 80-20 distribution fits a wide range of cases, including natural phenomena and human activities.
In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.
In probability theory, a distribution is said to be stable if a linear combination of two independent random variables with this distribution has the same distribution, up to location and scale parameters. A random variable is said to be stable if its distribution is stable. The stable distribution family is also sometimes referred to as the Lévy alpha-stable distribution, after Paul Lévy, the first mathematician to have studied it.
Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.
Neutrino oscillation is a quantum mechanical phenomenon in which a neutrino created with a specific lepton family number can later be measured to have a different lepton family number. The probability of measuring a particular flavor for a neutrino varies between three known states, as it propagates through space.
Tensor–vector–scalar gravity (TeVeS), developed by Jacob Bekenstein in 2004, is a relativistic generalization of Mordehai Milgrom's Modified Newtonian dynamics (MOND) paradigm.
In mathematical finance, the SABR model is a stochastic volatility model, which attempts to capture the volatility smile in derivatives markets. The name stands for "stochastic alpha, beta, rho", referring to the parameters of the model. The SABR model is widely used by practitioners in the financial industry, especially in the interest rate derivative markets. It was developed by Patrick S. Hagan, Deep Kumar, Andrew Lesniewski, and Diana Woodward.
In probability and statistics, the truncated normal distribution is the probability distribution derived from that of a normally distributed random variable by bounding the random variable from either below or above. The truncated normal distribution has wide applications in statistics and econometrics.
In probability and statistics, the log-logistic distribution is a continuous probability distribution for a non-negative random variable. It is used in survival analysis as a parametric model for events whose rate increases initially and decreases later, as, for example, mortality rate from cancer following diagnosis or treatment. It has also been used in hydrology to model stream flow and precipitation, in economics as a simple model of the distribution of wealth or income, and in networking to model the transmission times of data considering both the network and the software.
In the context of artificial neural networks, the rectifier or ReLU activation function is an activation function defined as the non-negative part of its argument, i.e., the ramp function:
In machine learning, local case-control sampling is an algorithm used to reduce the complexity of training a logistic regression classifier. The algorithm reduces the training complexity by selecting a small subsample of the original dataset for training. It assumes the availability of a (unreliable) pilot estimation of the parameters. It then performs a single pass over the entire dataset using the pilot estimation to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or training using a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structures of conditional imbalanced datasets more efficiently than alternative methods, such as case control sampling and weighted case control sampling.
Batch normalization is a method used to make training of artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.
A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.
The Kaniadakis Weibull distribution is a probability distribution arising as a generalization of the Weibull distribution. It is one example of a Kaniadakis κ-distribution. The κ-Weibull distribution has been adopted successfully for describing a wide variety of complex systems in seismology, economy, epidemiology, among many others.
In statistics, a Kaniadakis distribution is a statistical distribution that emerges from the Kaniadakis statistics. There are several families of Kaniadakis distributions related to different constraints used in the maximization of the Kaniadakis entropy, such as the κ-Exponential distribution, κ-Gaussian distribution, Kaniadakis κ-Gamma distribution and κ-Weibull distribution. The κ-distributions have been applied for modeling a vast phenomenology of experimental statistical distributions in natural or artificial complex systems, such as, in epidemiology, quantum statistics, in astrophysics and cosmology, in geophysics, in economy, in machine learning.
In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly as the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality.
The Caravelli-Traversa-Di Ventra equation (CTDV) is a closed-form equation to the evolution of networks of memristors. It was derived by Francesco Caravelli, Fabio L. Traversa and Massimiliano Di Ventra to study the exact evolution of complex circuits made of resistances with memory (memristors).
In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization, namely data normalization and activation normalization. Data normalization includes methods that rescale input data so that the features have the same range, mean, variance, or other statistical properties. For instance, a popular choice of feature scaling method is min-max normalization, where each feature is transformed to have the same range. This solves the problem of different features having vastly different scales, for example if one feature is measured in kilometers and another in nanometers.
EfficientNet is a family of convolutional neural networks (CNNs) for computer vision published by researchers at Google AI in 2019. Its key innovation is compound scaling, which uniformly scales all dimensions of depth, width, and resolution using a single parameter.