Neural scaling law

Last updated May 09, 2024

In machine learning, a neural scaling law is a scaling law relating parameters of a family of neural networks.^[1]^[2]

Introduction

In general, a neural model can be characterized by 4 parameters: size of the model, size of the training dataset, cost of training, performance after training. Each of these four variables can be precisely defined into a real number, and they are empirically found to be related by simple statistical laws, called "scaling laws".^{[ citation needed ]} These are usually written as $N,D,C,L$ (number of parameters, dataset size, computing cost, loss).

Size of the model

In most cases, the size of the model is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-expert models.^[3] In sparse models, during every inference, only a fraction of the parameters are used. In comparison, most other kinds of neural networks, such as Transformer networks, always use all their parameters during every inference.

Size of the training dataset

The size of the training dataset is usually quantified by the number of data points it contains. Larger training datasets are typically preferred as they provide a richer and more diverse source of information for the model to learn from. This in turn can lead to improved generalization performance when the model is applied to unseen data.^[4] However, increasing the size of the training dataset also increases the computational resources and time required for model training.

With the "pretrain, then finetune" method used in most large language models, there are two kinds of training dataset: the pretraining dataset and the finetuning dataset. Their sizes would have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of pretraining dataset.^[5]

In some cases, a small amount of high quality data suffices for finetuning, and more data does not improve performance.^[5]

Cost of training

The cost of training is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required to train the model). It's important to note that the cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware like GPUs or TPUs.

The cost of training a neural model is a function of several factors including the size of the model, the size of the training dataset, the complexity of the training algorithm, and the computational resources available.^[4] In particular, doubling the training dataset does not necessarily double the cost of training, because one may train the model for several times over the same dataset (each being an "epoch").

Performance

The performance of a neural model is evaluated based on its ability to accurately predict the output given the input data. Common metrics for evaluating model performance include:^[4]

accuracy, precision, recall, and F1 score for classification tasks;
mean squared error (MSE) or mean absolute error (MAE) for regression tasks;
negative log-likelihood per token (logarithm of perplexity) for language modeling.
Elo rating in a competition against other models, such as gameplay ^[6] or preference by a human judge ^[7]

Performance can be improved by using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set.

Examples

(Hestness, Narang, et al, 2017)

The 2017 paper^[2] is a common reference point for neural scaling laws fitted by statistical analysis on experimental data. Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale. Whereas previous works generally found the scaling exponent to scale like $L\propto D^{-\alpha }$ , with $\alpha \in \{0.5,1,2\}$ , the paper found that $\alpha \in [0.07,0.35]$ .

Of the factors they varied, only task can change the exponent $\alpha$ . Changing the architecture optimizers, regularizers, and loss functions, would only change the proportionality factor, not the exponent. For example, for the same task, one architecture might have $L=1000D^{-0.3}$ while another might have $L=500D^{-0.3}$ . The also found that for a given architecture, the number of parameters necessary to reach lowest levels of loss, given a fixed dataset size, grows like $N\propto D^{\beta }$ for another exponent $\beta$ .

They studied machine translation with LSTM ( $\alpha \sim 0.13$ ), generative language modelling with LSTM ( $\alpha \in [0.06,0.09],\beta \approx 0.7$ ), ImageNet classification with ResNet ( $\alpha \in [0.3,0.5],\beta \approx 0.6$ ), and speech recognition ( $\alpha \approx 0.3$ ).

(Henighan, Kaplan, et al, 2020)

A 2020 analysis ^[8] studied statistical relations between $C,N,D,L$ over a wide range of values and found similar scaling laws, over the range of $N\in [10^{3},10^{9}]$ , $C\in [10^{12},10^{21}]$ , and over multiple modalities (text, video, image, text to image, etc.).^[8]

In particular, the scaling laws it found are (Table 1 of ^[8]):

For each modality, they fixed one of the two $C,N$ $Neural scaling law$ , and varying the other one ( $D$ $Neural scaling law$ is varied along using $D=C/6N$ $Neural scaling law$ ), the achievable test loss satisfies
$L=L_{0}+\left({\frac {x_{0}}{x}}\right)^{\alpha }$
$Neural scaling law$
where $x$ $Neural scaling law$ is the varied variable, and $L_{0},x_{0},\alpha$ $Neural scaling law$ are parameters to be found by statistical fitting. The parameter $\alpha$ $Neural scaling law$ is the most important one.
- When $N$ is the varied variable, $\alpha$ ranges from $0.037$ to $0.24$ depending on the model modality. This corresponds to the $\alpha =0.34$ from the Chinchilla scaling paper.
- When $C$ is the varied variable, $\alpha$ ranges from $0.048$ to $0.19$ depending on the model modality. This corresponds to the $\beta =0.28$ from the Chinchilla scaling paper.
Given fixed computing budget, optimal model parameter count is consistently around $N_{opt}(C)=\left({\frac {C}{5\times 10^{-12}{\text{petaFLOP-day}}}}\right)^{0.7}=9.0\times 10^{-7}C^{0.7}$ The parameter $9.0\times 10^{-7}$ varies by a factor of up to 10 for different modalities. The exponent parameter $0.7$ varies from $0.64$ to $0.75$ for different modalities. This exponent corresponds to the $\approx 0.5$ from the Chinchilla scaling paper.
It's "strongly suggested" (but not statistically checked) that $D_{opt}(C)\propto N_{opt}(C)^{0.4}\propto C^{0.28}$ . This exponent corresponds to the $\approx 0.5$ from the Chinchilla scaling paper.

The scaling law of $L=L_{0}+(C_{0}/C)^{0.048}$ was confirmed during the training of GPT-3 (Figure 3.1 ^[9]).

Chinchilla scaling (Hoffmann, et al, 2022)

One particular scaling law ("Chinchilla scaling") states that, for a large language model (LLM) autoregressively trained for one epoch, with a cosine learning rate schedule, we have:^[10]

{\begin{cases}C=C_{0}ND\\L={\frac {A}{N^{\alpha }}}+{\frac {B}{D^{\beta }}}+L_{0}\end{cases}}

where the variables are

$C$ is the cost of training the model, in FLOPs.
$N$ is the number of parameters in the model.
$D$ is the number of tokens in the training set.
$L$ $Neural scaling law$ is the average negative log-likelihood loss per token (nats/token), achieved by the trained LLM on the test dataset.
- $L_{0}$ represents the loss of an ideal generative process on the test data
- ${\frac {A}{N^{\alpha }}}$ captures the fact that a Transformer language model with $N$ parameters underperforms the ideal generative process
- ${\frac {B}{D^{\beta }}}$ captures the fact that the model trained on $D$ tokens underperforms the ideal generative process

and the statistical parameters are

$C_{0}=6$ , meaning that it costs 6 FLOPs per parameter to train on one token. This is estimated by Kaplan et al.^[11] Note that training cost is much higher than inference cost, as training entails both forward and backward passes, whereas inference costs 1 to 2 FLOPs per parameter to infer on one token.
$\alpha =0.34,\beta =0.28,A=406.4,B=410.7,L_{0}=1.69$ .

Although ^[12] claims that the statistical estimation is slightly off, and should be $\alpha =0.34,\beta =0.28,A=406.4,B=410.7,L_{0}=1.69$ .

The statistical laws were fitted over experimental data with $N\in [7\times 10^{7},1.6\times 10^{10}],D\in [5\times 10^{9},5\times 10^{11}],C\in [10^{18},10^{24}]$ .

Since there are 4 variables related by 2 equations, imposing 1 additional constraint and 1 additional optimization objective allows us to solve for all four variables. In particular, for any fixed $C$ , we can uniquely solve for all 4 variables that minimizes $L$ . This provides us with the optimal $D_{opt}(C),N_{opt}(C)$ for any fixed $C$ :

N_{opt}(C)=G\left({\frac {C}{6}}\right)^{a},\quad D_{opt}(C)=G^{-1}\left({\frac {C}{6}}\right)^{b},\quad {\text{ where }}\quad G=\left({\frac {\alpha A}{\beta B}}\right)^{\frac {1}{\alpha +\beta }},\quad a={\frac {\beta }{\alpha +\beta }}{\text{, and }}b={\frac {\alpha }{\alpha +\beta }}{\text{. }}

Plugging in the numerical values, we obtain the "Chinchilla efficient" model size and training dataset size, as well as the test loss achievable:

{\begin{cases}N_{opt}(C)=0.6\;C^{0.45}\\D_{opt}(C)=0.3\;C^{0.55}\\L_{opt}(C)=1070\;C^{-0.154}+1.7\end{cases}}

Similarly, we may find the optimal training dataset size and training compute budget for any fixed model parameter size, and so on. There are other estimates for "Chinchilla efficient" model size and training dataset size. The above is based on a statistical model of $L={\frac {A}{N^{\alpha }}}+{\frac {B}{D^{\beta }}}+L_{0}$ . One can also directly fit a statistical law for $D_{opt}(C),N_{opt}(C)$ without going through the detour, for which one obtains:

{\begin{cases}N_{opt}(C)=0.1\;C^{0.5}\\D_{opt}(C)=1.7\;C^{0.5}\end{cases}}

or as tabulated:


$N_{opt}(C)$	$C$ / FLOP	$C$ / FLOPs of training Gopher	$D_{opt}(C)$
400 Million	1.92e+19	1/29968	8.0 Billion
1 Billion	1.21e+20	1/5706	20.2 Billion
10 Billion	1.23e+22	1/2819	205.1 Billion
67 Billion	5.76e+23	1	1.5 Trillion
175 Billion	3.85e+24	6.7	3.7 Trillion
280 Billion	9.90e+24	17.2	5.9 Trillion
520 Billion	3.43e+25	59.5	11.0 Trillion
1 Trillion	1.27e+26	221.3	21.2 Trillion
10 Trillion	1.30e+28	22515.9	216.2 Trillion

In simpler terms, the Chinchilla scaling law for training Transformer language models suggests that when given an increased budget (in FLOPs), to achieve compute-optimal, the number of model parameters (N) and the number of tokens for training the model (D) should scale in approximately equal proportions. This conclusion differs from the previous scaling law for neural language models,^[11] which states that N should be scaled faster than D. The discrepancy arises from setting different cycle lengths for cosine learning rate schedulers. In estimating the Chinchilla scaling, the authors set the cycle length to be the same as the training steps, as experimental results indicate that larger cycles overestimate the loss of the models.

Beyond Chinchilla scaling

As Chinchilla scaling has been the reference point for many large-scaling training runs, there had been a concurrent effort to go "beyond Chinchilla scaling", meaning to modify some of the training pipeline in order to obtain the same loss with less effort, or deliberately train for longer than what is "Chinchilla optimal".

Usually, the goal is to make the scaling law exponent larger, which means the same loss can be trained for much less compute. For instance, filtering data can make the scaling law exponent larger.^[13]

Another strand of research studies how to deal with limited data, as according to Chinchilla scaling laws, the training dataset size for the largest language models already approaches what is available on the internet. ^[14] found that augmenting the dataset with a mix of "denoising objectives" constructed from the dataset improves performance. ^[15] studies optimal scaling when all available data is already exhausted (such as in rare languages), so one must train multiple epoches over the same dataset (whereas Chinchilla scaling requires only one epoch). The Phi series of small language models were trained on textbook-like data generated by large language models, for which data is only limited by amount of compute available.^[16]

Chinchilla optimality was defined as "optimal for training compute", whereas in actual production-quality models, there will be a lot of inference after training is complete. "Overtraining" during training means better performance during inference.^[17] LLaMA models were overtrained for this reason. Subsequent studies discovered scaling laws in the overtraining regime, for dataset sizes up to 32x more than Chinchilla-optimal.^[18]

Broken Neural Scaling Laws (BNSL)

A 2022 analysis^[19] found that many scaling behaviors of artificial neural networks follow a smoothly broken power law functional form:

$y=a+{\bigg (}bx^{-c_{0}}{\bigg )}\prod _{i=1}^{n}\left(1+\left({\frac {x}{d_{i}}}\right)^{1/f_{i}}\right)^{-c_{i}*f_{i}}$

in which $x$ refers to the quantity being scaled (i.e. $C$ , $N$ , $D$ , number of training steps, number of inference steps, or model input size) and $y$ refers to the downstream (or upstream) performance evaluation metric of interest (e.g. prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, solve rate, or FID score) in zero-shot, prompted, or fine-tuned settings. The parameters $a,b,c_{0},c_{1}...c_{n},d_{1}...d_{n},f_{1}...f_{n}$ are found by statistical fitting.

On a log–log plot, when $f_{i}$ is not too large and $a$ is subtracted out from the y-axis, this functional form looks like a series of linear segments connected by arcs; the $n$ transitions between the segments are called "breaks", hence the name Broken Neural Scaling Laws (BNSL).

The scenarios in which the scaling behaviors of artificial neural networks were found to follow this functional form include large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, arithmetic, emergent abilities, double descent, supervised learning, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent).

The architectures for which the scaling behaviors of artificial neural networks were found to follow this functional form include ResNets, Transformers, MLPs, MLP-Mixers, Recurrent Neural Networks, Convolutional Neural Networks, Graph Neural Networks, U-Nets, Encoder-Decoder (and Encoder-only) (and Decoder-only) Models, Ensembles (and Non-Ensembles), MoE (Mixture of Experts) (and Non-MoE) Models, and Sparse Pruned (and Non-Sparse Unpruned) Models.

Other examples

Vision transformers

Vision transformers, similar to language transformers, exhibit scaling laws. A 2022 research trained vision transformers, with parameter counts $N\in [5\times 10^{6},2\times 10^{9}]$ , on image sets of sizes $D\in [3\times 10^{7},3\times 10^{9}]$ , for computing $C\in [0.2,10^{4}]$ (in units of TPUv3-core-days).^[20]

After training the model, it is finetuned on ImageNet training set. Let $L$ be the error probability of the finetuned model classifying ImageNet test set. They found $\min _{N,D}L=0.09+{\frac {0.26}{(C+0.01)^{0.35}}}$ .

Neural machine translation

Ghorbani, Behrooz et al.^[21] studied scaling laws for neural machine translation (specifically, English as source, and German as target) in encoder-decoder Transformer models, trained until convergence on the same datasets (thus they did not fit scaling laws for computing cost $C$ or dataset size $D$ ). They varied $N\in [10^{8},3.5\times 10^{9}]$ They found three results:

$L$ is a scaling law function of $N_{E},N_{D}$ , where $N_{E},N_{D}$ are encoder and decoder parameter count. It is not simply a function of total parameter count $N=N_{E}+N_{D}$ . The function has form $L\left(N_{e},N_{d}\right)=\alpha \left({\frac {{\bar {N}}_{e}}{N_{e}}}\right)^{p_{e}}\left({\frac {{\bar {N}}_{d}}{N_{d}}}\right)^{p_{d}}+L_{\infty }$ , where $\alpha ,p_{e},p_{d},L_{\infty },{\bar {N}}_{e},{\bar {N}}_{d}$ are fitted parameters. They found that $N_{d}/N\approx 0.55$ minimizes loss if $N$ is held fixed.
$L$ "saturates" (that is, it reaches $L_{\infty }$ ) for smaller models when the training and testing datasets are "source-natural" than "target-natural". A "source-natural" data point means a pair of English-German sentences, and the model is asked to translate the English sentence into German, and the English sentence is written by a natural English writer, while the German sentence is translated from the English sentence by a machine translator.^[22] To construct the two kinds of datasets, the authors collected natural English and German sentences online, then used machine translation to generate their translations.
As models grow larger, models trained on source-original datasets can achieve low loss but bad BLEU score. In contrast, models trained on target-original datasets achieve low loss and good BLEU score in tandem (Figure 10, 11 ^[21]).

The authors hypothesize that source-natural datasets have uniform and dull target sentences, and so a model that is trained to predict the target sentences would quickly overfit.

^[23] trained Transformers for machine translations with sizes $N\in [4\times 10^{5},5.6\times 10^{7}]$ on dataset sizes $D\in [6\times 10^{5},6\times 10^{9}]$ . They found the Kaplan et al (2020)^[11] scaling law applied to machine translation: $L(N,D)=\left[\left({\frac {N_{C}}{N}}\right)^{\frac {\alpha _{N}}{\alpha _{D}}}+{\frac {D_{C}}{D}}\right]^{\alpha _{D}}$ . They also found the BLEU score scaling as $BLEU\approx Ce^{-kL}$ .

Transfer learning

Hernandez, Danny et al.^[24] studied scaling laws for transfer learning in language models. They trained a family of Transformers in three ways:

pretraining on English, finetuning on Python
pretraining on an equal mix of English and Python, finetuning on Python
training on Python

The idea is that pretraining on English should help the model achieve low loss on a test set of Python text. Suppose the model has parameter count $N$ , and after being finetuned on $D_{F}$ Python tokens, it achieves some loss $L$ . We say that its "transferred token count" is $D_{T}$ , if another model with the same $N$ achieves the same $L$ after training on $D_{F}+D_{T}$ Python tokens.

They found $D_{T}=1.9e4\left(D_{F}\right)^{.18}(N)^{.38}$ for pretraining on English text, and $D_{T}=2.1e5\left(D_{F}\right)^{.096}(N)^{.38}$ for pretraining on English and non-Python code.

Related Research Articles

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.

In probability theory and statistics, the gamma distribution is a versatile two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use:

With a shape parameter $k$ and a scale parameter $θ$
With a shape parameter $and an inverse scale parameter, called a rate parameter.$

Fractional calculus is a branch of mathematical analysis that studies the several different possibilities of defining real number powers or complex number powers of the differentiation operator

<span class="mw-page-title-main">Gumbel distribution</span> Particular case of the generalized extreme value distribution

In probability theory and statistics, the Gumbel distribution is used to model the distribution of the maximum of a number of samples of various distributions.

The Minimal Supersymmetric Standard Model (MSSM) is an extension to the Standard Model that realizes supersymmetry. MSSM is the minimal supersymmetrical model as it considers only "the [minimum] number of new particle states and new interactions consistent with "Reality". Supersymmetry pairs bosons with fermions, so every Standard Model particle has a superpartner. If discovered, such superparticles could be candidates for dark matter, and could provide evidence for grand unification or the viability of string theory. The failure to find evidence for MSSM using the Large Hadron Collider has strengthened an inclination to abandon it.

In probability theory, a distribution is said to be stable if a linear combination of two independent random variables with this distribution has the same distribution, up to location and scale parameters. A random variable is said to be stable if its distribution is stable. The stable distribution family is also sometimes referred to as the Lévy alpha-stable distribution, after Paul Lévy, the first mathematician to have studied it.

Neutrino oscillation is a quantum mechanical phenomenon in which a neutrino created with a specific lepton family number can later be measured to have a different lepton family number. The probability of measuring a particular flavor for a neutrino varies between three known states, as it propagates through space.

In probability theory and statistics, the inverse gamma distribution is a two-parameter family of continuous probability distributions on the positive real line, which is the distribution of the reciprocal of a variable distributed according to the gamma distribution.

Tensor–vector–scalar gravity (TeVeS), developed by Jacob Bekenstein in 2004, is a relativistic generalization of Mordehai Milgrom's Modified Newtonian dynamics (MOND) paradigm.

Scalar–tensor–vector gravity (STVG) is a modified theory of gravity developed by John Moffat, a researcher at the Perimeter Institute for Theoretical Physics in Waterloo, Ontario. The theory is also often referred to by the acronym MOG.

In mathematical finance, the SABR model is a stochastic volatility model, which attempts to capture the volatility smile in derivatives markets. The name stands for "stochastic alpha, beta, rho", referring to the parameters of the model. The SABR model is widely used by practitioners in the financial industry, especially in the interest rate derivative markets. It was developed by Patrick S. Hagan, Deep Kumar, Andrew Lesniewski, and Diana Woodward.

<span class="mw-page-title-main">Truncated normal distribution</span> Type of probability distribution

In probability and statistics, the truncated normal distribution is the probability distribution derived from that of a normally distributed random variable by bounding the random variable from either below or above. The truncated normal distribution has wide applications in statistics and econometrics.

<span class="mw-page-title-main">Log-logistic distribution</span> Continuous probability distribution for a non-negative random variable

In probability and statistics, the log-logistic distribution is a continuous probability distribution for a non-negative random variable. It is used in survival analysis as a parametric model for events whose rate increases initially and decreases later, as, for example, mortality rate from cancer following diagnosis or treatment. It has also been used in hydrology to model stream flow and precipitation, in economics as a simple model of the distribution of wealth or income, and in networking to model the transmission times of data considering both the network and the software.

In machine learning, local case-control sampling is an algorithm used to reduce the complexity of training a logistic regression classifier. The algorithm reduces the training complexity by selecting a small subsample of the original dataset for training. It assumes the availability of a (unreliable) pilot estimation of the parameters. It then performs a single pass over the entire dataset using the pilot estimation to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or training using a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structures of conditional imbalanced datasets more efficiently than alternative methods, such as case control sampling and weighted case control sampling.

In computational solid state physics, Continuous-time quantum Monte Carlo (CT-QMC) is a family of stochastic algorithms for solving the Anderson impurity model at finite temperature. These methods first expand the full partition function as a series of Feynman diagrams, employ Wick's theorem to group diagrams into determinants, and finally use Markov chain Monte Carlo to stochastically sum up the resulting series.

In theoretical physics, the dual graviton is a hypothetical elementary particle that is a dual of the graviton under electric-magnetic duality, as an S-duality, predicted by some formulations of supergravity in eleven dimensions.

Batch normalization is a method used to make training of artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.

The Kaniadakis Weibull distribution is a probability distribution arising as a generalization of the Weibull distribution. It is one example of a Kaniadakis κ-distribution. The κ-Weibull distribution has been adopted successfully for describing a wide variety of complex systems in seismology, economy, epidemiology, among many others.

In statistics, a Kaniadakis distribution is a statistical distribution that emerges from the Kaniadakis statistics. There are several families of Kaniadakis distributions related to different constraints used in the maximization of the Kaniadakis entropy, such as the κ-Exponential distribution, κ-Gaussian distribution, Kaniadakis κ-Gamma distribution and κ-Weibull distribution. The κ-distributions have been applied for modeling a vast phenomenology of experimental statistical distributions in natural or artificial complex systems, such as, in epidemiology, quantum statistics, in astrophysics and cosmology, in geophysics, in economy, in machine learning.

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process that generates a probability distribution for a given dataset from which we can then sample new images. They learn the latent structure of a dataset by modeling the way in which data points diffuse through their latent space.

References

↑ Bahri, Yasaman; Dyer, Ethan; Kaplan, Jared; Lee, Jaehoon; Sharma, Utkarsh (2021-02-12). "Explaining Neural Scaling Laws". arXiv: 2102.06701 [cs.LG].
1 2 Hestness, Joel; Narang, Sharan; Ardalani, Newsha; Diamos, Gregory; Jun, Heewoo; Kianinejad, Hassan; Patwary, Md Mostofa Ali; Yang, Yang; Zhou, Yanqi (2017-12-01). "Deep Learning Scaling is Predictable, Empirically". arXiv: 1712.00409 [cs.LG].
↑ Rajbhandari, Samyam; Li, Conglong; Yao, Zhewei; Zhang, Minjia; Aminabadi, Reza Yazdani; Awan, Ammar Ahmad; Rasley, Jeff; He, Yuxiong (2022-06-28). "DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale". Proceedings of the 39th International Conference on Machine Learning. PMLR: 18332–18346. arXiv: 2201.05596 .
1 2 3 Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
1 2 Zhou, Chunting; Liu, Pengfei; Xu, Puxin; Iyer, Srini; Sun, Jiao; Mao, Yuning; Ma, Xuezhe; Efrat, Avia; Yu, Ping; Yu, Lili; Zhang, Susan; Ghosh, Gargi; Lewis, Mike; Zettlemoyer, Luke; Levy, Omer (2023-05-01). "LIMA: Less Is More for Alignment". arXiv: 2305.11206 [cs.CL].
↑ Jones, Andy L. (2021). "Scaling Scaling Laws with Board Games". arXiv: 2104.03113 [cs.LG].
↑ LMSYS Chatbot leaderboard
1 2 3 Sam, Henighan, Tom Kaplan, Jared Katz, Mor Chen, Mark Hesse, Christopher Jackson, Jacob Jun, Heewoo Brown, Tom B. Dhariwal, Prafulla Gray, Scott Hallacy, Chris Mann, Benjamin Radford, Alec Ramesh, Aditya Ryder, Nick Ziegler, Daniel M. Schulman, John Amodei, Dario McCandlish (2020-10-27). Scaling Laws for Autoregressive Generative Modeling. OCLC 1228442047.{{cite book}}: CS1 maint: multiple names: authors list (link)
↑ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, J.; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, T.; Child, Rewon (2020-05-28). "Language Models are Few-Shot Learners". arXiv: 2005.14165 [cs.CL].
↑ Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza; Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican, Katie; Driessche, George van den; Damoc, Bogdan (2022-03-29). "Training Compute-Optimal Large Language Models". arXiv: 2203.15556 [cs.CL].
1 2 3 Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". CoRR. abs/2001.08361. arXiv: 2001.08361 .
↑ Besiroglu, Tamay; Erdil, Ege; Barnett, Matthew; You, Josh (2024-04-15), Chinchilla Scaling: A replication attempt, arXiv: 2404.10102 , retrieved 2024-04-25
↑ Sorscher, Ben; Geirhos, Robert; Shekhar, Shashank; Ganguli, Surya; Morcos, Ari S. (2023-04-21), Beyond neural scaling laws: beating power law scaling via data pruning, arXiv: 2206.14486
↑ Tay, Yi; Wei, Jason; Chung, Hyung Won; Tran, Vinh Q.; So, David R.; Shakeri, Siamak; Garcia, Xavier; Zheng, Huaixiu Steven; Rao, Jinfeng (2022-11-16), Transcending Scaling Laws with 0.1% Extra Compute, arXiv: 2210.11399
↑ Muennighoff, Niklas; Rush, Alexander; Barak, Boaz; Le Scao, Teven; Tazi, Nouamane; Piktus, Aleksandra; Pyysalo, Sampo; Wolf, Thomas; Raffel, Colin A. (2023-12-15). "Scaling Data-Constrained Language Models". Advances in Neural Information Processing Systems. 36: 50358–50376. arXiv: 2305.16264 .
↑ Li, Yuanzhi; Bubeck, Sébastien; Eldan, Ronen; Del Giorno, Allie; Gunasekar, Suriya; Lee, Yin Tat (2023-09-11), Textbooks Are All You Need II: phi-1.5 technical report, arXiv: 2309.05463
↑ Sardana, Nikhil; Frankle, Jonathan (2023-12-31), Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws, arXiv: 2401.00448
↑ Gadre, Samir Yitzhak; Smyrnis, Georgios; Shankar, Vaishaal; Gururangan, Suchin; Wortsman, Mitchell; Shao, Rulin; Mercat, Jean; Fang, Alex; Li, Jeffrey (2024-03-13), Language models scale reliably with over-training and on downstream tasks, arXiv: 2403.08540
↑ Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". arXiv: 2210.14891 [cs.LG].
↑ Zhai, Xiaohua; Kolesnikov, Alexander; Houlsby, Neil; Beyer, Lucas (2022). "Scaling Vision Transformers": 12104–12113.{{cite journal}}: Cite journal requires |journal= (help)
1 2 Ghorbani, Behrooz; Firat, Orhan; Freitag, Markus; Bapna, Ankur; Krikun, Maxim; Garcia, Xavier; Chelba, Ciprian; Cherry, Colin (2021-09-01). "Scaling Laws for Neural Machine Translation". arXiv: 2109.07740 [cs.LG].
↑ Chen, Mia Xu; Firat, Orhan; Bapna, Ankur; Johnson, Melvin; Macherey, Wolfgang; Foster, George; Jones, Llion; Schuster, Mike; Shazeer, Noam; Parmar, Niki; Vaswani, Ashish; Uszkoreit, Jakob; Kaiser, Lukasz; Chen, Zhifeng; Wu, Yonghui (July 2018). "The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics: 76–86. arXiv: 1804.09849 . doi:10.18653/v1/P18-1008.
↑ Gordon, Mitchell A; Duh, Kevin; Kaplan, Jared (2021). "Data and Parameter Scaling Laws for Neural Machine Translation". Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 5915–5922. doi: 10.18653/v1/2021.emnlp-main.478 .
↑ Hernandez, Danny; Kaplan, Jared; Henighan, Tom; McCandlish, Sam (2021-02-01). "Scaling Laws for Transfer". arXiv: 2102.01293 [cs.LG].

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Bahri, Yasaman; Dyer, Ethan; Kaplan, Jared; Lee, Jaehoon; Sharma, Utkarsh (2021-02-12). "Explaining Neural Scaling Laws". arXiv: 2102.06701 [cs.LG].

[:4-2] 1 2 Hestness, Joel; Narang, Sharan; Ardalani, Newsha; Diamos, Gregory; Jun, Heewoo; Kianinejad, Hassan; Patwary, Md Mostofa Ali; Yang, Yang; Zhou, Yanqi (2017-12-01). "Deep Learning Scaling is Predictable, Empirically". arXiv: 1712.00409 [cs.LG].

[3] Rajbhandari, Samyam; Li, Conglong; Yao, Zhewei; Zhang, Minjia; Aminabadi, Reza Yazdani; Awan, Ammar Ahmad; Rasley, Jeff; He, Yuxiong (2022-06-28). "DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale". Proceedings of the 39th International Conference on Machine Learning. PMLR: 18332–18346. arXiv: 2201.05596 .

[goodfellow-4] 1 2 3 Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

[:2-5] 1 2 Zhou, Chunting; Liu, Pengfei; Xu, Puxin; Iyer, Srini; Sun, Jiao; Mao, Yuning; Ma, Xuezhe; Efrat, Avia; Yu, Ping; Yu, Lili; Zhang, Susan; Ghosh, Gargi; Lewis, Mike; Zettlemoyer, Luke; Levy, Omer (2023-05-01). "LIMA: Less Is More for Alignment". arXiv: 2305.11206 [cs.CL].

[6] Jones, Andy L. (2021). "Scaling Scaling Laws with Board Games". arXiv: 2104.03113 [cs.LG].

[7] LMSYS Chatbot leaderboard

[:0-8] 1 2 3 Sam, Henighan, Tom Kaplan, Jared Katz, Mor Chen, Mark Hesse, Christopher Jackson, Jacob Jun, Heewoo Brown, Tom B. Dhariwal, Prafulla Gray, Scott Hallacy, Chris Mann, Benjamin Radford, Alec Ramesh, Aditya Ryder, Nick Ziegler, Daniel M. Schulman, John Amodei, Dario McCandlish (2020-10-27). Scaling Laws for Autoregressive Generative Modeling. OCLC 1228442047.{{cite book}}: CS1 maint: multiple names: authors list (link)

[9] Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, J.; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, T.; Child, Rewon (2020-05-28). "Language Models are Few-Shot Learners". arXiv: 2005.14165 [cs.CL].

[10] Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza; Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican, Katie; Driessche, George van den; Damoc, Bogdan (2022-03-29). "Training Compute-Optimal Large Language Models". arXiv: 2203.15556 [cs.CL].

[kaplan-scaling-11] 1 2 3 Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". CoRR. abs/2001.08361. arXiv: 2001.08361 .

[12] Besiroglu, Tamay; Erdil, Ege; Barnett, Matthew; You, Josh (2024-04-15), Chinchilla Scaling: A replication attempt, arXiv: 2404.10102 , retrieved 2024-04-25

[13] Sorscher, Ben; Geirhos, Robert; Shekhar, Shashank; Ganguli, Surya; Morcos, Ari S. (2023-04-21), Beyond neural scaling laws: beating power law scaling via data pruning, arXiv: 2206.14486

[14] Tay, Yi; Wei, Jason; Chung, Hyung Won; Tran, Vinh Q.; So, David R.; Shakeri, Siamak; Garcia, Xavier; Zheng, Huaixiu Steven; Rao, Jinfeng (2022-11-16), Transcending Scaling Laws with 0.1% Extra Compute, arXiv: 2210.11399

[15] Muennighoff, Niklas; Rush, Alexander; Barak, Boaz; Le Scao, Teven; Tazi, Nouamane; Piktus, Aleksandra; Pyysalo, Sampo; Wolf, Thomas; Raffel, Colin A. (2023-12-15). "Scaling Data-Constrained Language Models". Advances in Neural Information Processing Systems. 36: 50358–50376. arXiv: 2305.16264 .

[16] Li, Yuanzhi; Bubeck, Sébastien; Eldan, Ronen; Del Giorno, Allie; Gunasekar, Suriya; Lee, Yin Tat (2023-09-11), Textbooks Are All You Need II: phi-1.5 technical report, arXiv: 2309.05463

[17] Sardana, Nikhil; Frankle, Jonathan (2023-12-31), Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws, arXiv: 2401.00448

[18] Gadre, Samir Yitzhak; Smyrnis, Georgios; Shankar, Vaishaal; Gururangan, Suchin; Wortsman, Mitchell; Shao, Rulin; Mercat, Jean; Fang, Alex; Li, Jeffrey (2024-03-13), Language models scale reliably with over-training and on downstream tasks, arXiv: 2403.08540

[:1-19] Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". arXiv: 2210.14891 [cs.LG].

[20] Zhai, Xiaohua; Kolesnikov, Alexander; Houlsby, Neil; Beyer, Lucas (2022). "Scaling Vision Transformers": 12104–12113.{{cite journal}}: Cite journal requires |journal= (help)

[:3-21] 1 2 Ghorbani, Behrooz; Firat, Orhan; Freitag, Markus; Bapna, Ankur; Krikun, Maxim; Garcia, Xavier; Chelba, Ciprian; Cherry, Colin (2021-09-01). "Scaling Laws for Neural Machine Translation". arXiv: 2109.07740 [cs.LG].

[22] Chen, Mia Xu; Firat, Orhan; Bapna, Ankur; Johnson, Melvin; Macherey, Wolfgang; Foster, George; Jones, Llion; Schuster, Mike; Shazeer, Noam; Parmar, Niki; Vaswani, Ashish; Uszkoreit, Jakob; Kaiser, Lukasz; Chen, Zhifeng; Wu, Yonghui (July 2018). "The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics: 76–86. arXiv: 1804.09849 . doi:10.18653/v1/P18-1008.

[23] Gordon, Mitchell A; Duh, Kevin; Kaplan, Jared (2021). "Data and Parameter Scaling Laws for Neural Machine Translation". Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 5915–5922. doi: 10.18653/v1/2021.emnlp-main.478 .

[24] Hernandez, Danny; Kaplan, Jared; Henighan, Tom; McCandlish, Sam (2021-02-01). "Scaling Laws for Transfer". arXiv: 2102.01293 [cs.LG].

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]