Double descent

Last updated February 04, 2025

Double descent in statistics and machine learning is the phenomenon in which test error, plotted against the number of model parameters, falls, rises, and then falls again. A model with a small number of parameters and a model with an extremely large number of parameters can both achieve a small test error, whereas a model whose number of parameters is about the same as the number of data points used to train it has a much greater test error than one with a much larger number of parameters.^[2] The phenomenon has been considered surprising because it contradicts classical assumptions about overfitting in machine learning.^[3]

History

Early observations of what would later be called double descent in specific models date back to 1989.^[4]^[5]

The term "double descent" was coined by Belkin et al.^[6] in 2019,^[3] when the phenomenon gained popularity as a broader concept exhibited by many models.^[7]^[8] The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in a model result in significant overfitting error (an extrapolation of the bias–variance tradeoff)^[9] and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.^[6]^[10]

Theoretical models

Double descent occurs in linear regression with isotropic Gaussian covariates and isotropic Gaussian noise.^[11]
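
The setting above can be illustrated with a small simulation. The sketch below is not taken from the cited paper: it assumes a fixed dimension `d = 50`, a unit-norm true coefficient vector, a noise standard deviation of 0.5, and the minimum-norm least-squares estimator computed with the pseudoinverse, all chosen here for illustration. It varies the number of training samples `n` and reports the excess test error, which for isotropic Gaussian covariates equals the squared distance between the estimated and true coefficients.

```python
import numpy as np


def min_norm_excess_error(n_train, d=50, noise_std=0.5, n_trials=50, seed=0):
    """Average excess test error of the minimum-norm least-squares fit.

    All defaults are illustrative assumptions, not values from the literature.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        # True coefficients, normalised to unit length.
        beta = rng.standard_normal(d)
        beta /= np.linalg.norm(beta)

        # Isotropic Gaussian covariates and isotropic Gaussian noise.
        X = rng.standard_normal((n_train, d))
        y = X @ beta + noise_std * rng.standard_normal(n_train)

        # The pseudoinverse gives ordinary least squares when n_train > d
        # and the minimum-norm interpolating solution when n_train < d.
        beta_hat = np.linalg.pinv(X) @ y

        # For x ~ N(0, I), E[(x @ (beta_hat - beta))**2] = ||beta_hat - beta||^2.
        errors.append(np.sum((beta_hat - beta) ** 2))
    return float(np.mean(errors))


sample_sizes = [10, 25, 40, 50, 60, 100, 400]
curve = {n: min_norm_excess_error(n) for n in sample_sizes}
for n, err in curve.items():
    print(f"n={n:4d}  excess test error={err:.3f}")
```

Under these assumptions the error should peak near `n = d`, where the design matrix is square and typically ill-conditioned, and be smaller on either side of that point. The exact numbers depend on the seed and on the number of trials.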

A model of double descent in the thermodynamic limit has been analyzed using the replica trick, and the result has been confirmed numerically.^[12]

Empirical examples

The scaling behavior of double descent has been found to follow the functional form of a broken neural scaling law.^[13]
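
For orientation, the broken neural scaling law form is usually written as below. This is a reconstruction offered as a sketch and should be checked against the cited paper before use. Here $y$ is the performance metric, $x$ is the quantity being scaled (for example parameters, data, or compute), $n$ is the number of breaks, and $a$, $b$, $c_0$, $c_i$, $d_i$, $f_i$ are fitted constants.

```latex
y = a + \left( b\, x^{-c_0} \right) \prod_{i=1}^{n} \left( 1 + \left( \frac{x}{d_i} \right)^{1/f_i} \right)^{-c_i f_i}
```

Each factor in the product introduces one smooth break at scale $d_i$, with $f_i$ controlling its sharpness. Because a $c_i$ may take either sign, the curve can turn upward and then downward again, which is how a form of this kind can accommodate a double-descent shape.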

References

[1] Rocks, Jason W. (2022). "Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models". Physical Review Research. 4 (1). arXiv:2010.13933. doi:10.1103/PhysRevResearch.4.013201.
[2] "Deep Double Descent". OpenAI. 2019-12-05. Retrieved 2022-08-12.
[3] Schaeffer, Rylan; Khona, Mikail; Robertson, Zachary; Boopathy, Akhilan; Pistunova, Kateryna; Rocks, Jason W.; Fiete, Ila Rani; Koyejo, Oluwasanmi (2023-03-24). "Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle". arXiv:2303.14151v1 [cs.LG].
[4] Vallet, F.; Cailton, J.-G.; Refregier, Ph (June 1989). "Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions". Europhysics Letters. 9 (4): 315. Bibcode:1989EL......9..315V. doi:10.1209/0295-5075/9/4/003. ISSN 0295-5075.
[5] Loog, Marco; Viering, Tom; Mey, Alexander; Krijthe, Jesse H.; Tax, David M. J. (2020-05-19). "A brief prehistory of double descent". Proceedings of the National Academy of Sciences. 117 (20): 10625–10626. arXiv:2004.04328. Bibcode:2020PNAS..11710625L. doi:10.1073/pnas.2001875117. ISSN 0027-8424. PMC 7245109. PMID 32371495.
[6] Belkin, Mikhail; Hsu, Daniel; Ma, Siyuan; Mandal, Soumik (2019-08-06). "Reconciling modern machine learning practice and the bias-variance trade-off". Proceedings of the National Academy of Sciences. 116 (32): 15849–15854. arXiv:1812.11118. doi:10.1073/pnas.1903070116. ISSN 0027-8424. PMC 6689936. PMID 31341078.
[7] Spigler, Stefano; Geiger, Mario; d'Ascoli, Stéphane; Sagun, Levent; Biroli, Giulio; Wyart, Matthieu (2019-11-22). "A jamming transition from under- to over-parametrization affects loss landscape and generalization". Journal of Physics A: Mathematical and Theoretical. 52 (47): 474001. arXiv:1810.09665. doi:10.1088/1751-8121/ab4c8b. ISSN 1751-8113.
[8] Viering, Tom; Loog, Marco (2023-06-01). "The Shape of Learning Curves: A Review". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (6): 7799–7819. arXiv:2103.10948. doi:10.1109/TPAMI.2022.3220744. ISSN 0162-8828. PMID 36350870.
[9] Geman, Stuart; Bienenstock, Élie; Doursat, René (1992). "Neural networks and the bias/variance dilemma" (PDF). Neural Computation. 4: 1–58. doi:10.1162/neco.1992.4.1.1. S2CID 14215320.
[10] Nakkiran, Preetum; Kaplun, Gal; Bansal, Yamini; Yang, Tristan; Barak, Boaz; Sutskever, Ilya (2021-12-29). "Deep double descent: where bigger models and more data hurt". Journal of Statistical Mechanics: Theory and Experiment. 2021 (12). IOP Publishing Ltd and SISSA Medialab srl: 124003. arXiv:1912.02292. Bibcode:2021JSMTE2021l4003N. doi:10.1088/1742-5468/ac3a74. S2CID 207808916.
[11] Nakkiran, Preetum (2019-12-16). "More Data Can Hurt for Linear Regression: Sample-wise Double Descent". arXiv:1912.07242v1 [stat.ML].
[12] Advani, Madhu S.; Saxe, Andrew M.; Sompolinsky, Haim (2020-12-01). "High-dimensional dynamics of generalization error in neural networks". Neural Networks. 132: 428–446. doi:10.1016/j.neunet.2020.08.022. ISSN 0893-6080. PMC 7685244. PMID 33022471.
[13] Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.

External links

Brent Werness; Jared Wilber. "Double Descent: Part 1: A Visual Introduction".
Brent Werness; Jared Wilber. "Double Descent: Part 2: A Mathematical Explanation".
Understanding "Deep Double Descent" at evhub.
