Model collapse

Model collapse [note 1] is a phenomenon where machine learning models gradually degrade due to errors coming from uncurated training on the outputs of another model, including prior versions of itself. [8] [9] [10] [11] Such outputs are known as synthetic data.

Shumailov et al. [8] coined the term and described two specific stages to the degradation: early model collapse and late model collapse. In early model collapse, the model begins losing information about the tails of the distribution – mostly affecting minority data. Later work highlighted that early model collapse is hard to notice, since overall performance may appear to improve, while the model loses performance on minority data. [12] In late model collapse, the model loses a significant proportion of its performance, confusing concepts and losing most of its variance.

Mechanism

Using synthetic data as training data can lead to issues with the quality and reliability of the trained model. [13] [14] Model collapse occurs for three main reasons – functional approximation errors, sampling errors, and learning errors. [8] Importantly, it happens in even the simplest of models, where not all of the error sources are present. In more complex models the errors often compound, leading to faster collapse.
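As an illustration, the following minimal sketch (not taken from the cited papers; the distribution, sample size, and number of generations are arbitrary) implements such a self-consuming loop for a one-dimensional Gaussian in Python. Because the model family matches the data exactly and the estimators are closed-form, only the sampling error is present, yet the fitted parameters still drift from generation to generation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data from the true distribution N(0, 1).
mu_true, sigma_true = 0.0, 1.0
M = 1_000                      # samples drawn at every generation
data = rng.normal(mu_true, sigma_true, size=M)

for generation in range(1, 11):
    # Learning step: fit the model to the current dataset.
    # The Gaussian family can represent the data exactly (no functional
    # approximation error) and the estimators are closed-form (no learning
    # error), so only sampling error remains.
    mu_hat = data.mean()
    sigma_hat = data.std(ddof=1)

    # Sampling step: the next generation trains only on synthetic data
    # drawn from the current fit (uncurated self-consumption).
    data = rng.normal(mu_hat, sigma_hat, size=M)

    print(f"gen {generation:2d}: mu_hat={mu_hat:+.3f}, sigma_hat={sigma_hat:.3f}")
```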

Disagreement over real-world impact

Figure: Model collapse in generative models is reduced when data accumulates.

Some researchers and commentators on model collapse warn that the phenomenon could fundamentally threaten future generative AI development: as AI-generated data is shared on the Internet, it will inevitably end up in future training datasets, which are often crawled from the Internet. If training on "slop" (large quantities of unlabeled synthetic data) inevitably leads to model collapse, this poses a difficult problem. [15]

However, recently, other researchers have disagreed with this argument, showing that if synthetic data accumulates alongside human-generated data, model collapse is avoided. [16] The researchers argue that data accumulating over time is a more realistic description of reality than deleting all existing data every year, and that the real-world impact of model collapse may not be as catastrophic as feared. [17]
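The following sketch illustrates the distinction under simplified assumptions (a one-dimensional Gaussian and arbitrary sample sizes; it is not the experimental setup of the cited work): in the "replace" regime each generation is fitted only to the previous generation's synthetic samples, while in the "accumulate" regime new synthetic samples are appended to a pool that still contains the original real data. Under these assumptions the accumulating fit typically stays close to the true parameters while the replacing fit drifts.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_true, M, generations = 0.0, 1.0, 500, 30

real = rng.normal(mu_true, sigma_true, size=M)
replace_data = real.copy()       # each generation sees only the latest synthetic data
accumulate_pool = [real.copy()]  # synthetic data is appended to a growing pool

for g in range(generations):
    # "Replace" regime: fit, then overwrite the dataset with synthetic samples.
    mu_r, sd_r = replace_data.mean(), replace_data.std(ddof=1)
    replace_data = rng.normal(mu_r, sd_r, size=M)

    # "Accumulate" regime: fit to everything seen so far, then append new
    # synthetic samples alongside the original real data.
    pooled = np.concatenate(accumulate_pool)
    mu_a, sd_a = pooled.mean(), pooled.std(ddof=1)
    accumulate_pool.append(rng.normal(mu_a, sd_a, size=M))

print(f"replace   : mu={mu_r:+.3f}, sigma={sd_r:.3f}")
print(f"accumulate: mu={mu_a:+.3f}, sigma={sd_a:.3f}")
```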

An alternative branch of the literature investigates the use of machine learning detectors and watermarking to identify model-generated data and filter it out. [18] [19]

Mathematical models of the phenomenon

1D Gaussian model

In 2024, [8] a first attempt was made at illustrating collapse for the simplest possible model: a one-dimensional normal distribution fitted using unbiased estimators of the mean and variance, computed on samples from the previous generation.

To make this more precise, we say that the original data follow a normal distribution $X_i^0 \sim \mathcal{N}(\mu, \sigma^2)$, and that we possess $M_j$ samples at generation $j$. Denoting a general sample $X_i^j$ as sample $i$ at generation $j$, the next generation model is estimated using the sample mean and variance:

$$\mu_{j+1} = \frac{1}{M_j} \sum_i X_i^j, \qquad \sigma_{j+1}^2 = \frac{1}{M_j - 1} \sum_i \left( X_i^j - \mu_{j+1} \right)^2.$$

This leads to a conditionally normal next generation model $X_i^{j+1} \mid \mu_{j+1}, \sigma_{j+1} \sim \mathcal{N}(\mu_{j+1}, \sigma_{j+1}^2)$. In theory, this is enough to calculate the full distribution of $X_i^j$. However, even after the first generation, the full distribution is no longer normal: it follows a variance-gamma distribution.
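This departure from normality can be checked numerically. The sketch below (with arbitrarily chosen $\mu$, $\sigma$ and sample size) runs many independent one-generation chains and measures the excess kurtosis of the pooled generation-one samples; it is positive for finite $M$, whereas it would be zero for a Gaussian.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(2)
mu, sigma, M, chains = 0.0, 1.0, 20, 200_000

# Generation 0: real data, one dataset of size M per independent chain.
x0 = rng.normal(mu, sigma, size=(chains, M))

# Fit each chain with the unbiased sample mean and variance, ...
mu1 = x0.mean(axis=1)
sigma1 = x0.std(axis=1, ddof=1)

# ... then draw one generation-1 sample per chain from N(mu_1, sigma_1^2).
x1 = rng.normal(mu1, sigma1)

# Marginally, X^1 is no longer Gaussian: its excess kurtosis is positive.
print("excess kurtosis of X^1:", kurtosis(x1))   # roughly 0.3 for M = 20; 0 for a Gaussian
```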

To continue the analysis, instead of writing the probability density function at each generation, it is possible to construct the samples explicitly in terms of independent random variables using Cochran's theorem. To be precise, $\mu_{j+1}$ and $\sigma_{j+1}$ are independent, with $\mu_{j+1} \sim \mathcal{N}\!\left(\mu_j, \sigma_j^2 / M_j\right)$ and $\sigma_{j+1}^2 \sim \frac{\sigma_j^2}{M_j - 1} \chi^2_{M_j - 1}$, following a gamma distribution. Denoting by $Z^j, Z_i^j$ Gaussian random variables distributed according to $\mathcal{N}(0, 1)$ and by $S^j$ random variables distributed as $\tfrac{1}{M_{j-1} - 1} \chi^2_{M_{j-1} - 1}$, it turns out to be possible to write samples at each generation as

$$X_i^1 = \mu + \frac{\sigma}{\sqrt{M_0}} Z^1 + \sigma \sqrt{S^1}\, Z_i^1,$$

and more generally

$$X_i^j = \mu + \frac{\sigma}{\sqrt{M_0}} Z^1 + \frac{\sigma}{\sqrt{M_1}} \sqrt{S^1}\, Z^2 + \cdots + \frac{\sigma}{\sqrt{M_{j-1}}} \sqrt{S^1 \cdots S^{j-1}}\, Z^j + \sigma \sqrt{S^1 \cdots S^j}\, Z_i^j.$$

Note that these are not joint distributions, as $Z^j$ and $S^j$ depend directly on the samples from generation $j-1$, but when considering $X_i^j$ on its own the formula above provides all the information about its full distribution.

To analyse the model collapse, we can first calculate the variance and mean of samples at generation $n$. This would tell us what kind of distribution we expect to arrive at after $n$ generations. It is possible to find its exact value in closed form, but the mean and variance of the square root of a gamma distribution are expressed in terms of gamma functions, making the result quite clunky. Following [8], it is possible to expand all results to second order in each of $1/M_j$, assuming each sample size to be large. It is then possible to show that

$$\frac{1}{\sigma^2} \operatorname{Var}\!\left(X_i^n\right) = 1 + \frac{1}{M_0} + \frac{1}{M_1} + \cdots + \frac{1}{M_{n-1}} + \mathcal{O}\!\left(M_j^{-2}\right).$$

And if all sample sizes $M_j = M$ are constant, this diverges linearly as $n \to \infty$:

$$\operatorname{Var}\!\left(X_i^n\right) = \sigma^2 \left( 1 + \frac{n}{M} \right) + \mathcal{O}\!\left(M^{-2}\right).$$
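This prediction is easy to verify by simulation. The following sketch (arbitrary constants) propagates many independent chains for $n$ generations with a constant sample size $M$ and compares the empirical variance of the generation-$n$ samples with $\sigma^2 (1 + n/M)$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, M, chains, generations = 0.0, 1.0, 50, 20_000, 40

data = rng.normal(mu, sigma, size=(chains, M))
for n in range(1, generations + 1):
    # Fit each chain, then resample it entirely from its own fit.
    mu_hat = data.mean(axis=1, keepdims=True)
    sigma_hat = data.std(axis=1, ddof=1, keepdims=True)
    data = rng.normal(mu_hat, sigma_hat, size=(chains, M))

# Empirical variance of generation-n samples, pooled across chains,
# versus the prediction sigma^2 * (1 + n/M) for constant sample size M.
print("empirical :", data.var())
print("predicted :", sigma**2 * (1 + generations / M))
```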
This is the same scaling as for a single dimensional Gaussian random walk. However, divergence of the variance of $X_i^n$ does not directly provide any information about the corresponding estimates $\mu_n$ and $\sigma_n$, particularly how different they are from the original $\mu$ and $\sigma$. It turns out to be possible to calculate the distance between the true distribution and the approximated distribution at step $n + 1$, using the Wasserstein-2 distance (which is also sometimes referred to as risk):

$$\mathbb{E}\!\left[ \mathbb{W}_2^2\!\left( \mathcal{N}(\mu, \sigma^2),\ \mathcal{N}(\mu_{n+1}, \sigma_{n+1}^2) \right) \right] = \frac{3}{2}\, \sigma^2 \sum_{j=0}^{n} \frac{1}{M_j} + \mathcal{O}\!\left( M_j^{-2} \right).$$
This directly shows why model collapse occurs in this simple model: due to errors from re-sampling the approximated distribution, each generation corresponds to a new step in a random walk of model parameters. For a constant sample size at each generation, the average distance from the starting point diverges, and in order for the approximation of the end distribution to be accurate, or for the distance to remain finite, the sampling rate needs to increase superlinearly, i.e. one needs to collect increasingly many samples over time, for example quadratically. Even in that case the expected distance after $n$ steps remains non-zero; the only case in which it does end up being zero is when sampling is infinite at each step. Overall, this only shows how far, on average, one ends up from the original distribution, but the process can only "terminate" if the estimated variance at a certain generation becomes small enough, effectively turning the distribution into a delta function. This is shown to occur for a general Gaussian model [13] in the subsection below. Empirical investigation has confirmed this theoretical analysis. [20]
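The effect of the sampling schedule can likewise be illustrated numerically. The sketch below (an illustration with arbitrary constants, not the cited analysis) tracks the squared Wasserstein-2 distance between the true Gaussian and the fitted Gaussian at each generation, once with a constant sample size and once with a quadratically growing one; the former keeps drifting while the latter stays bounded.

```python
import numpy as np

mu, sigma, generations = 0.0, 1.0, 100

def run(sample_size, seed):
    """Track W2^2 between N(mu, sigma^2) and the fitted Gaussian over generations."""
    rng = np.random.default_rng(seed)
    mu_hat, sigma_hat, dists = mu, sigma, []
    for n in range(1, generations + 1):
        # Draw this generation's training set from the previous generation's fit.
        data = rng.normal(mu_hat, sigma_hat, size=sample_size(n))
        mu_hat, sigma_hat = data.mean(), data.std(ddof=1)
        # Squared Wasserstein-2 distance between two one-dimensional Gaussians.
        dists.append((mu_hat - mu) ** 2 + (sigma_hat - sigma) ** 2)
    return dists

constant = run(lambda n: 100, seed=4)           # M_j = 100 at every generation
quadratic = run(lambda n: 100 * n * n, seed=4)  # M_j grows superlinearly

print("final W2^2, constant M :", constant[-1])
print("final W2^2, quadratic M:", quadratic[-1])
```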

N-D Gaussian model

Furthermore, in the case of a multidimensional model trained on fully synthetic data, exact collapse can be shown. [13] [8]

Linear regression

In the case of a linear regression model, [21] [22] scaling laws and bounds on learning can be obtained.

Statistical language model

In the case of a linear softmax classifier for next token prediction, [23] exact bounds on learning with even a partially synthetic dataset can be obtained.

Impact on large language models

In the context of large language models, research has found that training LLMs on predecessor-generated text (synthetic data produced by previous models) causes a consistent decrease in the lexical, syntactic, and semantic diversity of model outputs over successive iterations, an effect that is especially pronounced for tasks demanding high levels of creativity. [24]

See also

Notes

  1. Also known by other names, such as "AI Inbreeding", [1] [2] "AI Cannibalism", [3] [4] and "Model Autophagy Disorder", abbreviated "MAD". [5] [6] [7]

References

  1. "'Generative inbreeding' and its risk to human culture". 26 August 2023.
  2. "AI could choke on its own exhaust as it fills the web". 28 August 2023.
  3. "AI Cannibalism and the Law – Colorado Technology Law Journal".
  4. "The Curious Case of AI Cannibalism & Possible Solutions". 26 July 2023.
  5. "Model Autophagy Disorder – the Livescu Initiative on Neuro, Narrative and AI".
  6. "Generative AI Goes 'MAD' when Trained on AI-Created Data over Five Times". 12 July 2023.
  7. Alemohammad, Sina; Casco-Rodriguez, Josue; Luzi, Lorenzo; Ahmed Imtiaz Humayun; Babaei, Hossein; LeJeune, Daniel; Siahkoohi, Ali; Baraniuk, Richard G. (2023). "Self-Consuming Generative Models Go MAD". arXiv: 2307.01850 [cs.LG].
  8. Shumailov, Ilia; Shumaylov, Zakhar; Zhao, Yiren; Papernot, Nicolas; Anderson, Ross; Gal, Yarin (July 2024). "AI models collapse when trained on recursively generated data". Nature. 631 (8022): 755–759. Bibcode:2024Natur.631..755S. doi:10.1038/s41586-024-07566-y. ISSN 1476-4687. PMC 11269175. PMID 39048682.
  9. Shumailov, Ilia; Shumaylov, Zakhar; Zhao, Yiren; Gal, Yarin; Papernot, Nicolas; Anderson, Ross (2023-05-31). "The Curse of Recursion: Training on Generated Data Makes Models Forget". arXiv: 2305.17493 [cs.LG].
  10. Ozsevim, Ilkhan (2023-06-20). "Research finds ChatGPT & Bard headed for 'Model Collapse'" . Retrieved 2024-03-06.
  11. Mok, Aaron. "A disturbing AI phenomenon could completely upend the internet as we know it". Business Insider. Retrieved 2024-03-06.
  12. Wyllie, Sierra; Shumailov, Ilia; Papernot, Nicolas (2024-06-05). "Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias". The 2024 ACM Conference on Fairness, Accountability, and Transparency. FAccT '24. New York, NY, USA: Association for Computing Machinery. pp. 2113–2147. arXiv: 2403.07857 . doi:10.1145/3630106.3659029. ISBN   979-8-4007-0450-5.
  13. Alemohammad, Sina; Casco-Rodriguez, Josue; Luzi, Lorenzo; Humayun, Ahmed Imtiaz; Babaei, Hossein; LeJeune, Daniel; Siahkoohi, Ali; Baraniuk, Richard G. (July 4, 2023). "Self-Consuming Generative Models Go MAD". arXiv: 2307.01850 [cs.LG].
  14. Self-Consuming Generative Models Go MAD. The Twelfth International Conference on Learning Representations.
  15. "What is Model Collapse and how to avoid it". The Register. Retrieved 11 July 2024.
  16. Gerstgrasser, Matthias; Schaeffer, Rylan; Dey, Apratim; Rafailov, Rafael; Sleight, Henry; Hughes, John; Korbak, Tomasz; Agrawal, Rajashree; Pai, Dhruv; Gromov, Andrey; Roberts, Daniel A.; Yang, Diyi; Donoho, David L.; Koyejo, Sanmi (2024-04-01). "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data". arXiv: 2404.01413 [cs.LG].
  17. "Big brains divided over training AI with more AI: Is model collapse inevitable?". The Register. Retrieved 11 July 2024.
  18. Kirchenbauer, John; Geiping, Jonas; Wen, Yuxin; Katz, Jonathan; Miers, Ian; Goldstein, Tom (2023-07-03). "A Watermark for Large Language Models". Proceedings of the 40th International Conference on Machine Learning. PMLR: 17061–17084.
  19. "My AI Safety Lecture for UT Effective Altruism". Shtetl-Optimized. 2022-11-29. Retrieved 2024-06-22.
  20. Borji, Ali (2024-10-16). "A Note on Shumailov et al. (2024): "AI Models Collapse When Trained on Recursively Generated Data"". arXiv: 2410.12954 [cs.LG].
  21. Dohmatob, Elvis; Feng, Yunzhen; Kempe, Julia (2024-02-12). "Model Collapse Demystified: The Case of Regression". arXiv: 2402.07712 [cs.LG].
  22. Dohmatob, Elvis; Feng, Yunzhen; Yang, Pu; Charton, Francois; Kempe, Julia (2024-02-10). "A Tale of Tails: Model Collapse as a Change of Scaling Laws". arXiv: 2402.07043 [cs.LG].
  23. Seddik, Mohamed El Amine; Chen, Suei-Wen; Hayou, Soufiane; Youssef, Pierre; Debbah, Merouane (2024-04-07). "How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse". arXiv: 2404.05090 [cs.LG].
  24. Guo, Yanzhu; Shang, Guokan; Vazirgiannis, Michalis; Clavel, Chloé (2024-04-16). "The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text". arXiv: 2311.09807 [cs.CL].