Empirical risk minimization

Last updated May 14, 2024

Empirical risk minimization is a principle in statistical learning theory which defines a family of learning algorithms based on evaluating performance over a known and fixed dataset. The core idea is based on an application of the law of large numbers; more specifically, we cannot know exactly how well a predictive algorithm will work in practice (i.e. the true "risk") because we do not know the true distribution of the data, but we can instead estimate and optimize the performance of the algorithm on a known set of training data. The performance over the known set of training data is referred to as the "empirical risk".

Background

The following situation is a general setting of many supervised learning problems. There are two spaces of objects $X$ and $Y$, and we would like to learn a function $h:X\to Y$ (often called a hypothesis) which outputs an object $y\in Y$ given $x\in X$. To do so, we have a training set of $n$ examples $(x_{1},y_{1}),\ldots ,(x_{n},y_{n})$, where $x_{i}\in X$ is an input and $y_{i}\in Y$ is the corresponding response that is desired from $h(x_{i})$.

More formally, assume that there is a joint probability distribution $P(x,y)$ over $X$ and $Y$, and that the training set consists of $n$ instances $(x_{1},y_{1}),\ldots ,(x_{n},y_{n})$ drawn i.i.d. from $P(x,y)$. The assumption of a joint probability distribution allows for the modelling of uncertainty in predictions (e.g. from noise in data) because $y$ is not a deterministic function of $x$, but rather a random variable with conditional distribution $P(y|x)$ for a fixed $x$.

It is also assumed that there is a non-negative real-valued loss function $L({\hat {y}},y)$ which measures how different the prediction ${\hat {y}}$ of a hypothesis is from the true outcome $y$ . For classification tasks these loss functions can be scoring rules. The risk associated with hypothesis $h(x)$ is then defined as the expectation of the loss function:

R(h)=\mathbf {E} [L(h(x),y)]=\int L(h(x),y)\,dP(x,y).

A loss function commonly used in theory is the 0-1 loss function: $L({\hat {y}},y)={\begin{cases}1&{\mbox{ if }}\quad {\hat {y}}\neq y\\0&{\mbox{ if }}\quad {\hat {y}}=y\end{cases}}$ .

The ultimate goal of a learning algorithm is to find a hypothesis $h^{*}$ among a fixed class of functions ${\mathcal {H}}$ for which the risk $R(h)$ is minimal:

h^{*}={\underset {h\in {\mathcal {H}}}{\operatorname {arg\,min} }}\,{R(h)}.

For classification problems, the Bayes classifier is defined to be the classifier minimizing the risk defined with the 0–1 loss function.
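When the distribution is known, the definitions above can be evaluated exactly. The following is a minimal sketch on a made-up finite distribution: the table `P`, the sets `X` and `Y`, and all function names are illustrative assumptions, not part of the original text. It computes $R(h)$ under the 0–1 loss for every function $h:X\to Y$ and checks that the minimizer coincides with the Bayes classifier, which for the 0–1 loss predicts the most probable label for each $x$.

```python
from itertools import product

# Hypothetical joint distribution P(x, y) on X = {0, 1, 2}, Y = {0, 1}.
P = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.05, (1, 1): 0.25,
    (2, 0): 0.12, (2, 1): 0.18,
}
X = [0, 1, 2]
Y = [0, 1]


def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return 0.0 if y_hat == y else 1.0


def risk(h):
    """R(h) = E[L(h(x), y)], computed exactly as a finite sum over P."""
    return sum(p * zero_one_loss(h[x], y) for (x, y), p in P.items())


# Hypothesis class: every function h: X -> Y, stored as a dict x -> label.
hypotheses = [dict(zip(X, labels)) for labels in product(Y, repeat=len(X))]
h_star = min(hypotheses, key=risk)

# Bayes classifier for the 0-1 loss: the most probable label given each x.
bayes = {x: max(Y, key=lambda y: P[(x, y)]) for x in X}

print(h_star, risk(h_star))
```

Because the hypothesis class here contains every function from $X$ to $Y$, the risk minimizer and the Bayes classifier agree; with a restricted class ${\mathcal {H}}$ the minimal risk over the class can be strictly larger than the Bayes risk.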

Empirical risk minimization

In general, the risk $R(h)$ cannot be computed because the distribution $P(x,y)$ is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, given a sample of i.i.d. training data points, we can compute an estimate, called the empirical risk, by averaging the loss function over the training set; more formally, by computing the expectation with respect to the empirical measure:

R_{\text{emp}}(h)={\frac {1}{n}}\sum _{i=1}^{n}L(h(x_{i}),y_{i}).
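As a small illustration, the empirical risk is just a sample average of losses. The sketch below uses a made-up five-point training set and a made-up threshold hypothesis; the names `empirical_risk`, `sample` and `h` are assumptions chosen for this example.

```python
def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return 0.0 if y_hat == y else 1.0


def empirical_risk(h, sample, loss=zero_one_loss):
    """R_emp(h) = (1/n) * sum_i L(h(x_i), y_i)."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)


# Hypothetical training set of (x_i, y_i) pairs and a hypothetical hypothesis.
sample = [(0.2, 0), (0.4, 0), (0.5, 1), (0.7, 1), (0.9, 0)]
h = lambda x: 1 if x >= 0.45 else 0

print(empirical_risk(h, sample))  # one mistake out of five points: 0.2
```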

The empirical risk minimization principle^[1] states that the learning algorithm should choose a hypothesis ${\hat {h}}$ which minimizes the empirical risk over the hypothesis class ${\mathcal {H}}$ :

{\hat {h}}={\underset {h\in {\mathcal {H}}}{\operatorname {arg\,min} }}\,R_{\text{emp}}(h).

Thus, the learning algorithm defined by the empirical risk minimization principle consists in solving the above optimization problem.
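For a simple hypothesis class the optimization problem can be solved by direct search. The sketch below assumes a class of one-dimensional threshold classifiers $h_{t}(x)=1$ if $x\geq t$ and $0$ otherwise, and a synthetic distribution with 10% label noise; both are illustrative assumptions, as are all function names. It is meant as a toy instance of the principle, not as a general-purpose solver.

```python
import random


def threshold(t):
    """Hypothesis h_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0


def empirical_risk(h, sample):
    """Average 0-1 loss of h over the sample."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def erm_threshold(sample):
    """Return a threshold minimizing the empirical risk.

    Only n + 1 thresholds need checking: each sample point, plus one value
    above all of them. Together they realize every labelling of the sample
    that a threshold classifier can produce.
    """
    xs = sorted(x for x, _ in sample)
    candidates = xs + [xs[-1] + 1.0]
    return min(candidates, key=lambda t: empirical_risk(threshold(t), sample))


def draw(n, rng, noise=0.1):
    """Hypothetical P(x, y): x uniform on [0, 1], y = 1[x >= 0.5], flipped w.p. noise."""
    sample = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x >= 0.5 else 0
        if rng.random() < noise:
            y = 1 - y
        sample.append((x, y))
    return sample


rng = random.Random(0)
train = draw(200, rng)
held_out = draw(5000, rng)  # fresh sample used as a stand-in for the true risk

t_hat = erm_threshold(train)
print("chosen threshold:", t_hat)
print("empirical risk:", empirical_risk(threshold(t_hat), train))
print("held-out estimate of the risk:", empirical_risk(threshold(t_hat), held_out))
```

The empirical risk of the selected hypothesis is by construction no larger than that of any other threshold on the training set, including the noise-free rule at 0.5; the held-out estimate indicates how far it may be from the true risk.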

Properties

Guarantees for the performance of empirical risk minimization depend strongly on the function class selected as well as the distributional assumptions made.^[2] In general, distribution-free methods are too coarse, and do not lead to practical bounds. However, they are still useful in deriving asymptotic properties of learning algorithms, such as consistency. In particular, distribution-free bounds on the performance of empirical risk minimization given a fixed function class can be derived using bounds on the VC complexity of the function class.

For simplicity, consider the case of binary classification. It is possible to bound the probability that the selected classifier $\phi _{n}$ is much worse than the best possible classifier $\phi ^{*}$. Consider the risk $L$ defined over the hypothesis class ${\mathcal {C}}$ with growth function ${\mathcal {S}}({\mathcal {C}},n)$, given a dataset of size $n$. Then, for every $\epsilon >0$:^[3]

\mathbb {P} \left(L(\phi _{n})-L(\phi ^{*})>\epsilon \right)\leq 8\,{\mathcal {S}}({\mathcal {C}},n)\exp\{-n\epsilon ^{2}/32\}

Similar results hold for regression tasks.^[2] These results are often based on uniform laws of large numbers, which control the deviation of the empirical risk from the true risk, uniformly over the hypothesis class.^[3]
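To get a feel for how coarse such a distribution-free bound is, one can evaluate its right-hand side numerically. The sketch below assumes the class of one-sided thresholds on the real line, whose growth function is $n+1$, and the illustrative values $\epsilon =0.1$ and a target probability of $0.05$; the function names are hypothetical.

```python
import math


def erm_bound(n, eps, growth):
    """Right-hand side 8 * S(C, n) * exp(-n * eps**2 / 32) of the bound."""
    return 8.0 * growth(n) * math.exp(-n * eps ** 2 / 32.0)


def threshold_growth(n):
    """Growth function of one-sided thresholds on the real line: n + 1 labellings."""
    return n + 1


def sample_size_needed(eps, delta, growth):
    """Smallest n at which the bound drops to delta or below (linear scan)."""
    n = 1
    while erm_bound(n, eps, growth) > delta:
        n += 1
    return n


n_needed = sample_size_needed(eps=0.1, delta=0.05, growth=threshold_growth)
print(n_needed)
```

Even for this very simple class the bound only becomes informative at sample sizes in the tens of thousands, which illustrates why such bounds are mainly of asymptotic interest.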

Impossibility results

It is also possible to show lower bounds on algorithm performance if no distributional assumptions are made.^[4] This is sometimes referred to as the no free lunch theorem. Even though a specific learning algorithm may provide asymptotically optimal performance for any distribution, its finite-sample performance is always poor for at least one data distribution. This means that no classifier can improve on the error for a given sample size for all distributions.^[3]

Specifically, let $\epsilon >0$, and consider a sample size $n$ and a classification rule $\phi _{n}$. Then there exists a distribution of $(X,Y)$ with risk $L^{*}=0$ (meaning that perfect prediction is possible) such that:^[3]

\mathbb {E} L_{n}\geq 1/2-\epsilon .

It is further possible to show that the convergence rate of a learning algorithm is poor for some distributions. Specifically, given a decreasing sequence of positive numbers $a_{n}$ converging to zero, it is possible to find a distribution such that:

\mathbb {E} L_{n}\geq a_{n}

for all $n$. This result shows that universally good classification rules do not exist, in the sense that every rule must perform poorly on at least one distribution.^[3]

Computational complexity

Empirical risk minimization for a classification problem with the 0–1 loss function is known to be NP-hard even for a relatively simple class of functions such as linear classifiers.^[5] Nevertheless, it can be solved efficiently when the minimal empirical risk is zero, i.e., when the data are linearly separable.
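The separable case can be illustrated with the classical perceptron update, which stops once every training point is classified correctly, that is, once the empirical 0–1 risk is zero. This is a sketch under stated assumptions: the four data points are made up, labels are taken in $\{-1,+1\}$, and the perceptron is only one possible procedure (linear programming is another); its number of updates depends on the margin of the data, so the sketch should not be read as the specific method behind the efficiency claim.

```python
def perceptron(sample, max_epochs=1000):
    """Find a separating linear classifier with the perceptron rule.

    sample: list of (x, y) with x a tuple of floats and y in {-1, +1}.
    Returns weights w, where the last coordinate is the bias term.
    """
    d = len(sample[0][0])
    w = [0.0] * (d + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in sample:
            xa = list(x) + [1.0]  # append constant feature for the bias
            score = sum(wi * xi for wi, xi in zip(w, xa))
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: empirical risk is zero
            return w
    raise RuntimeError("no separator found within max_epochs")


# Hypothetical linearly separable training set.
sample = [((2.0, 1.0), 1), ((1.0, 3.0), 1), ((-1.0, -1.0), -1), ((-2.0, 0.5), -1)]
w = perceptron(sample)
print(w)
```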

In practice, machine learning algorithms cope with this computational hardness either by employing a convex approximation to the 0–1 loss function (such as the hinge loss used by SVMs), which is easier to optimize, or by imposing assumptions on the distribution $P(x,y)$ (in which case they are no longer agnostic learning algorithms, to which the above result applies).

In the case of convexification, Zhang's lemma bounds the excess risk of the original problem in terms of the excess risk of the convexified problem.^[6] Minimizing the latter using convex optimization therefore also controls the former.
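The convex-surrogate approach can be sketched for a one-dimensional linear scorer $f(x)=wx+b$ with labels in $\{-1,+1\}$. The code below minimizes the average hinge loss by plain subgradient descent and compares it with the 0–1 risk of the same scorer. The data set, step size and number of steps are arbitrary assumptions. The pointwise inequality between the two losses that the example checks is a much simpler fact than Zhang's lemma and is shown only for intuition.

```python
def hinge(margin):
    """Hinge loss as a function of the margin y * f(x)."""
    return max(0.0, 1.0 - margin)


def zero_one(margin):
    """0-1 loss as a function of the margin; a zero margin counts as an error."""
    return 1.0 if margin <= 0 else 0.0


def risks(w, b, sample):
    """Return (empirical hinge risk, empirical 0-1 risk) of f(x) = w * x + b."""
    margins = [y * (w * x + b) for x, y in sample]
    n = len(sample)
    return sum(map(hinge, margins)) / n, sum(map(zero_one, margins)) / n


def subgradient_descent(sample, steps=500, lr=0.1):
    """Minimize the average hinge loss with a fixed step size."""
    w, b = 0.0, 0.0
    n = len(sample)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in sample:
            if y * (w * x + b) < 1:  # hinge is active: subgradient is -y * (x, 1)
                gw -= y * x / n
                gb -= y / n
        w, b = w - lr * gw, b - lr * gb
    return w, b


# Hypothetical sample with one point (0.3, -1) on the "wrong" side.
sample = [(-2.0, -1), (-1.0, -1), (-0.5, -1), (0.3, -1), (0.5, 1), (1.0, 1), (2.0, 1)]
w, b = subgradient_descent(sample)
hinge_risk, zero_one_risk = risks(w, b, sample)
print(hinge_risk, zero_one_risk)
```

Since the hinge loss is at least as large as the 0–1 loss for every margin, the surrogate risk printed first can never be smaller than the 0–1 risk printed second.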

Tilted empirical risk minimization

Tilted empirical risk minimization is a machine learning technique that modifies objectives built on standard loss functions, such as squared error, by introducing a tilt parameter. This parameter dynamically adjusts the weight of data points during training, allowing the algorithm to focus on specific regions or characteristics of the data distribution. Tilted empirical risk minimization is particularly useful in scenarios with imbalanced data or when there is a need to emphasize errors in certain parts of the prediction space.
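The text above does not spell out the objective, so the following sketch assumes the exponentially tilted (log-sum-exp) form commonly associated with this technique, $\frac{1}{t}\log \left(\frac{1}{n}\sum _{i}e^{t\,\ell _{i}}\right)$, where $\ell _{i}$ are the per-example losses and $t$ is the tilt parameter. The function name and the loss values are made up for illustration.

```python
import math


def tilted_risk(losses, t):
    """Tilted average (1/t) * log((1/n) * sum_i exp(t * loss_i)).

    t = 0 is treated as the ordinary empirical risk (the limiting case).
    The maximum exponent is factored out for numerical stability.
    """
    n = len(losses)
    if t == 0:
        return sum(losses) / n
    m = max(t * loss for loss in losses)
    return (m + math.log(sum(math.exp(t * loss - m) for loss in losses) / n)) / t


# Hypothetical per-example losses with one large outlier.
losses = [0.1, 0.2, 0.3, 5.0]
for t in (-10.0, 0.0, 10.0):
    print(t, tilted_risk(losses, t))
```

Under this assumed form, positive values of $t$ push the objective toward the largest loss, emphasizing hard or rare examples, while negative values push it toward the smallest loss, suppressing outliers.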

References

[1] Vapnik, V. (1992). Principles of Risk Minimization for Learning Theory.
[2] Györfi, László; Kohler, Michael; Krzyzak, Adam; Walk, Harro (2010-12-01). A Distribution-Free Theory of Nonparametric Regression (Softcover reprint of the original 1st ed.). New York: Springer. ISBN 978-1-4419-2998-3.
[3] Devroye, L.; Györfi, L.; Lugosi, G. (1997). A Probabilistic Theory of Pattern Recognition. Discrete Appl Math 73, 192–194.
[4] Devroye, Luc; Györfi, László; Lugosi, Gábor (1996). "A Probabilistic Theory of Pattern Recognition". Stochastic Modelling and Applied Probability. 31. doi:10.1007/978-1-4612-0711-5. ISBN 978-1-4612-6877-2. ISSN 0172-4568.
[5] Feldman, V.; Guruswami, V.; Raghavendra, P.; Wu, Yi (2009). Agnostic Learning of Monomials by Halfspaces is Hard. (See the paper and references therein.)
[6] "Mathematics of Machine Learning Lecture 9 Notes". MIT OpenCourseWare. Retrieved 2023-10-28.
