Double descent in statistics and machine learning is the phenomenon where a model with a small number of parameters and a model with an extremely large number of parameters can both achieve low test error, but a model whose number of parameters is roughly equal to the number of data points used to train it has a much greater test error than one with a much larger number of parameters. [2] This phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning. [3]
What would later be called double descent was observed in specific models as early as 1989. [4] [5]
The term "double descent" was coined by Belkin et. al. [6] in 2019, [3] when the phenomenon gained popularity as a broader concept exhibited by many models. [7] [8] The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in the model result in a significant overfitting error (an extrapolation of the bias–variance tradeoff), [9] and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models. [6] [10]
Double descent occurs in linear regression with isotropic Gaussian covariates and isotropic Gaussian noise. [11]
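The effect can be reproduced with a short numerical simulation. The following sketch (not taken from the cited analysis; the sample sizes, dimensions, and noise level are arbitrary illustrative choices) fits minimum-norm least squares to isotropic Gaussian data while sweeping the number of fitted features past the number of training points:

```python
# Minimal sketch (not from the cited references): minimum-norm least squares
# on isotropic Gaussian covariates with isotropic Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p_true, noise = 40, 2000, 120, 0.5

# Ground-truth linear model; covariates and noise are isotropic Gaussian.
beta = rng.normal(size=p_true) / np.sqrt(p_true)
X_train = rng.normal(size=(n_train, p_true))
X_test = rng.normal(size=(n_test, p_true))
y_train = X_train @ beta + noise * rng.normal(size=n_train)
y_test = X_test @ beta + noise * rng.normal(size=n_test)

for p in range(5, p_true + 1, 5):
    # pinv gives the ordinary least-squares fit when p < n_train and the
    # minimum-norm interpolating fit when p >= n_train.
    beta_hat = np.linalg.pinv(X_train[:, :p]) @ y_train
    test_mse = np.mean((X_test[:, :p] @ beta_hat - y_test) ** 2)
    print(f"p = {p:3d}  test MSE = {test_mse:8.3f}")
```

The printed test error typically decreases, spikes near p ≈ n_train (the interpolation threshold), and decreases again as p grows well beyond it, tracing out the double-descent curve.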
A model of double descent in the thermodynamic limit has been analyzed using the replica trick, and the result has been confirmed numerically. [12]
The scaling behavior of double descent has been found to follow a broken neural scaling law functional form. [13]
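For reference, a broken neural scaling law is a smoothly broken power law of roughly the following form (the notation here is illustrative; see [13] for the exact parameterization and fitting procedure):

$$ y = a + b\,x^{-c_0} \prod_{i=1}^{n} \left( 1 + \left( \tfrac{x}{d_i} \right)^{1/f_i} \right)^{-c_i f_i} $$

where $x$ is the quantity being scaled (such as model size, dataset size, or compute), $y$ is the error or loss, and $a$, $b$, $c_0$, $c_i$, $d_i$, $f_i$ are fitted constants, with each $d_i$ locating a "break" at which the effective power-law exponent changes.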