Generalization error


For supervised learning applications in machine learning and statistical learning theory, generalization error [1] (also known as the out-of-sample error [2] or the risk) is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. Because learning algorithms are evaluated on finite samples, the evaluation of a learning algorithm may be sensitive to sampling error. As a result, measurements of prediction error on the current data may not provide much information about predictive ability on new data. Generalization error can be minimized by avoiding overfitting in the learning algorithm. The performance of a machine learning algorithm is visualized by plots, called learning curves, that show estimates of the generalization error throughout the learning process.


Definition

In a learning problem, the goal is to develop a function $f_n(\vec{x})$ that predicts output values $y$ for each input datum $\vec{x}$. The subscript $n$ indicates that the function $f_n$ is developed based on a data set of $n$ data points. The generalization error or expected loss or risk $I[f]$ of a particular function $f$ over all possible values of $\vec{x}$ and $y$ is the expected value of the loss function $V(f(\vec{x}), y)$: [1]

$$I[f] = \int_{X \times Y} V(f(\vec{x}), y)\, \rho(\vec{x}, y)\, d\vec{x}\, dy,$$

where $\rho(\vec{x}, y)$ is the unknown joint probability distribution for $\vec{x}$ and $y$.

Without knowing the joint probability distribution $\rho$, it is impossible to compute $I[f]$. Instead, we can compute the error on sample data, which is called the empirical error (or empirical risk). Given $n$ data points, the empirical error of a candidate function $f$ is:

$$I_n[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(\vec{x}_i), y_i).$$

An algorithm is said to generalize if:

$$\lim_{n \to \infty} I[f] - I_n[f] = 0.$$
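To make the two quantities concrete, the sketch below contrasts the empirical error $I_n[f]$, computed on a small sample, with a Monte Carlo approximation of the expected risk $I[f]$, which is possible here only because the joint distribution is known by construction. This is a minimal illustration in Python with NumPy; the identity predictor, squared loss, and synthetic $y = x + \text{noise}$ distribution are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n points from an assumed joint distribution rho(x, y): y = x + noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x + rng.normal(scale=0.1, size=n)
    return x, y

def squared_loss(pred, y):
    return (pred - y) ** 2

# A fixed candidate function f (the identity, a stand-in for a learned model).
def f(x):
    return x

# Empirical error I_n[f] on a small sample of n = 10 points.
x_train, y_train = sample(10)
empirical_error = np.mean(squared_loss(f(x_train), y_train))

# Monte Carlo approximation of the expected risk I[f]; feasible only because
# we constructed rho ourselves and can draw an arbitrarily large sample.
x_big, y_big = sample(1_000_000)
expected_risk = np.mean(squared_loss(f(x_big), y_big))

print(f"I_n[f] ~ {empirical_error:.4f}, I[f] ~ {expected_risk:.4f}")
```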

Of particular importance is the generalization error $I[f_n]$ of the data-dependent function $f_n$ that is found by a learning algorithm based on the sample. Again, for an unknown probability distribution, $I[f_n]$ cannot be computed. Instead, the aim of many problems in statistical learning theory is to bound or characterize the difference of the generalization error and the empirical error in probability:

$$P_G = P(I[f_n] - I_n[f_n] \le \epsilon) \ge 1 - \delta_n.$$

That is, the goal is to characterize the probability $1 - \delta_n$ that the generalization error is less than the empirical error plus some error bound $\epsilon$ (generally dependent on $\delta$ and $n$). For many types of algorithms, it has been shown that an algorithm has generalization bounds if it meets certain stability criteria. Specifically, if an algorithm is symmetric (the order of inputs does not affect the result), has bounded loss, and meets two stability conditions, it will generalize. The first stability condition, leave-one-out cross-validation stability, says that to be stable, the prediction error for each data point under leave-one-out cross-validation must converge to zero as $n \to \infty$. The second condition, expected-to-leave-one-out error stability (also known as hypothesis stability if operating in the $L_1$ norm), is met if the prediction on a left-out data point does not change when a single data point is removed from the training dataset. [3]

These conditions can be formalized as follows, writing $S = \{z_1, \dots, z_n\}$ with $z_i = (\vec{x}_i, y_i)$ for the training sample, $f_S$ for the function learned from $S$, and $f_{S^i}$ for the function learned from $S$ with the $i$-th point removed:

Leave-one-out cross-validation stability

An algorithm $L$ has $CVloo$ stability if for each $\epsilon$ there exist a $\beta_{CV}^{(n)}$ and a $\delta_{CV}^{(n)}$ such that:

$$\forall i \in \{1, \dots, n\}, \quad \mathbb{P}_S\left\{\left|V(f_{S^i}, z_i) - V(f_S, z_i)\right| \le \beta_{CV}^{(n)}\right\} \ge 1 - \delta_{CV}^{(n)},$$

and $\beta_{CV}^{(n)}$ and $\delta_{CV}^{(n)}$ go to zero as $n$ goes to infinity. [3]

Expected-leave-one-out error stability

An algorithm $L$ has $Eloo_{err}$ stability if for each $n$ there exist a $\beta_{EL}^{(n)}$ and a $\delta_{EL}^{(n)}$ such that:

$$\forall i \in \{1, \dots, n\}, \quad \mathbb{P}_S\left\{\left|I[f_S] - \frac{1}{n}\sum_{i=1}^{n} V(f_{S^i}, z_i)\right| \le \beta_{EL}^{(n)}\right\} \ge 1 - \delta_{EL}^{(n)},$$

with $\beta_{EL}^{(n)}$ and $\delta_{EL}^{(n)}$ going to zero as $n \to \infty$.

For leave-one-out stability in the $L_1$ norm, this is the same as hypothesis stability:

$$\mathbb{E}_{S,z}\left[\left|V(f_S, z) - V(f_{S^i}, z)\right|\right] \le \beta_H^{(n)},$$

with $\beta_H^{(n)}$ going to zero as $n$ goes to infinity. [3]
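The $CVloo$ condition can be probed empirically for a given learner: refit the model with each point left out and measure how much the loss at that point changes. The sketch below does this for ordinary least-squares line fitting on synthetic data; the data-generating process, squared loss, and the use of the maximum gap over $i$ as a summary are assumptions made for illustration, not part of the formal definition.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_line(x, y):
    """Ordinary least-squares fit y ~ a*x + b; returns a prediction function."""
    a, b = np.polyfit(x, y, deg=1)
    return lambda t: a * t + b

def cvloo_gap(n):
    """Max over i of |V(f_{S^i}, z_i) - V(f_S, z_i)| for squared loss V."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x + rng.normal(scale=0.1, size=n)
    f_S = fit_line(x, y)
    gaps = []
    for i in range(n):
        mask = np.arange(n) != i            # S^i: the sample with point i removed
        f_Si = fit_line(x[mask], y[mask])
        v_loo = (f_Si(x[i]) - y[i]) ** 2    # V(f_{S^i}, z_i)
        v_in = (f_S(x[i]) - y[i]) ** 2      # V(f_S, z_i)
        gaps.append(abs(v_loo - v_in))
    return max(gaps)

for n in (20, 200, 2000):
    print(n, cvloo_gap(n))                  # the gap shrinks as n grows
```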

Algorithms with proven stability

A number of algorithms have been proven to be stable and as a result have bounds on their generalization error. A list of these algorithms and the papers that proved stability is available in the article on stability (learning theory).

Relation to overfitting

[Figure: RegressionOverfitting.png] This figure illustrates the relationship between overfitting and the generalization error $I[f_n] - I_n[f_n]$. Data points were generated from the relationship $y = x$ with white noise added to the $y$ values. In the left column, a set of training points is shown in blue, and a seventh-order polynomial was fit to the training data. In the right column, the function is tested on data sampled from the underlying joint probability distribution of $x$ and $y$. In the top row, the function is fit on a sample of 10 data points; in the bottom row, on a sample of 100 data points. For small sample sizes and complex functions, the error on the training set is small but the error on the underlying distribution of data is large: the data have been overfit. As a result, the generalization error is large. As the number of sample points increases, the prediction error on training and test data converges and the generalization error goes to 0.
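The experiment in the figure is easy to reproduce numerically: fit a seventh-order polynomial to 10 and then 100 noisy training points and compare the training error with the error on a large fresh sample standing in for the underlying distribution. A minimal sketch with NumPy; the noise level and sampling ranges are assumptions, since the figure does not specify them.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """y = x with white noise, mirroring the figure's data-generating process."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x + rng.normal(scale=0.2, size=n)
    return x, y

def experiment(n_train, degree=7, n_test=100_000):
    x_tr, y_tr = make_data(n_train)
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)   # seventh-order polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    x_te, y_te = make_data(n_test)                # proxy for the joint distribution
    test_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    return train_err, test_err

for n in (10, 100):
    tr, te = experiment(n)
    print(f"n={n}: train {tr:.4f}, test {te:.4f}, gap {te - tr:.4f}")
```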

The concepts of generalization error and overfitting are closely related. Overfitting occurs when the learned function $f_S$ becomes sensitive to the noise in the sample. As a result, the function will perform well on the training set but not perform well on other data from the joint probability distribution of $\vec{x}$ and $y$. Thus, the more overfitting occurs, the larger the generalization error.

The amount of overfitting can be tested using cross-validation methods, which split the sample into simulated training samples and testing samples. The model is then trained on a training sample and evaluated on the testing sample. The testing sample is previously unseen by the algorithm and so represents a random sample from the joint probability distribution of $\vec{x}$ and $y$. This test sample allows us to approximate the expected error, and as a result to approximate a particular form of the generalization error.
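For instance, $k$-fold cross-validation averages the held-out error over $k$ train/test splits. A brief sketch, assuming scikit-learn is available; the synthetic data and the choice of a linear model and five folds are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = X.ravel() + rng.normal(scale=0.2, size=200)

fold_errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Each held-out fold acts as previously unseen data from the same distribution.
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("cross-validated error estimate:", np.mean(fold_errors))
```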

Many algorithms exist to prevent overfitting. The minimization algorithm can penalize more complex functions (known as Tikhonov regularization), or the hypothesis space can be constrained, either explicitly in the form of the functions or by adding constraints to the minimization function (Ivanov regularization).
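As one concrete instance, Tikhonov regularization with a squared penalty corresponds to ridge regression. The sketch below, assuming scikit-learn, fits a degree-7 polynomial to 10 noisy points with and (nearly) without the penalty and compares the error on a large fresh sample; the data, degree, and penalty weights are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=(10, 1))
y = x.ravel() + rng.normal(scale=0.2, size=10)

x_test = rng.uniform(-1.0, 1.0, size=(10_000, 1))
y_test = x_test.ravel() + rng.normal(scale=0.2, size=10_000)

# alpha weights the Tikhonov penalty; a tiny alpha approximates the unpenalized fit.
for alpha in (1e-6, 1.0):
    model = make_pipeline(PolynomialFeatures(degree=7), Ridge(alpha=alpha))
    model.fit(x, y)
    test_err = np.mean((model.predict(x_test) - y_test) ** 2)
    print(f"alpha={alpha}: test error {test_err:.4f}")
```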

The approach to finding a function that does not overfit is at odds with the goal of finding a function that is sufficiently complex to capture the particular characteristics of the data. This is known as the bias–variance tradeoff. Keeping a function simple to avoid overfitting may introduce a bias in the resulting predictions, while allowing it to be more complex leads to overfitting and a higher variance in the predictions. It is impossible to minimize both simultaneously.
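The tradeoff can be made concrete by repeatedly drawing training sets and decomposing the prediction at a fixed query point into squared bias and variance. A minimal sketch under the same assumed $y = x + \text{noise}$ distribution; the query point, sample size, and polynomial degrees are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
x0 = 0.5                       # fixed query point; its true mean response is x0
n_repeats, n_train = 500, 15

for degree in (1, 7):
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(-1.0, 1.0, size=n_train)
        y = x + rng.normal(scale=0.2, size=n_train)
        preds.append(np.polyval(np.polyfit(x, y, deg=degree), x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - x0) ** 2   # squared bias of the prediction at x0
    variance = preds.var()               # variance of the prediction at x0
    print(f"degree {degree}: bias^2 {bias_sq:.5f}, variance {variance:.5f}")
```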


References

  1. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). Cambridge, MA: MIT Press.
  2. Abu-Mostafa, Y. S.; Magdon-Ismail, M.; Lin, H.-T. (2012). Learning from Data. AMLBook Press. ISBN 978-1600490064.
  3. Mukherjee, S.; Niyogi, P.; Poggio, T.; Rifkin, R. M. (2006). "Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization" (PDF). Advances in Computational Mathematics. 25 (1–3): 161–193. doi:10.1007/s10444-004-7634-z. S2CID 2240256.
