Learnable function class

Last updated November 15, 2023

In statistical learning theory, a learnable function class is a set of functions for which an algorithm can be devised to asymptotically minimize the expected risk, uniformly over all probability distributions. The concept of learnable classes are closely related to regularization in machine learning, and provides large sample justifications for certain learning algorithms.

Definition

Background

Let $\Omega ={\mathcal {X}}\times {\mathcal {Y}}=\{(x,y)\}$ be the sample space, where $y$ are the labels and $x$ are the covariates (predictors). ${\mathcal {F}}=\{f:{\mathcal {X}}\mapsto {\mathcal {Y}}\}$ is a collection of mappings (functions) under consideration to link $x$ to $y$ . $L:{\mathcal {Y}}\times {\mathcal {Y}}\mapsto \mathbb {R}$ is a pre-given loss function (usually non-negative). Given a probability distribution $P(x,y)$ on $\Omega$ , define the expected risk $I_{P}(f)$ to be:

I_{P}(f)=\int L(f(x),y)dP(x,y)

The general goal in statistical learning is to find the function in ${\mathcal {F}}$ that minimizes the expected risk. That is, to find solutions to the following problem:^[1]

{\hat {f}}=\arg \min _{f\in {\mathcal {F}}}I_{P}(f)

But in practice the distribution $P$ is unknown, and any learning task can only be based on finite samples. Thus we seek instead to find an algorithm that asymptotically minimizes the empirical risk, i.e., to find a sequence of functions $\{{\hat {f}}_{n}\}_{n=1}^{\infty }$ that satisfies

\lim _{n\rightarrow \infty }\mathbb {P} (I_{P}({\hat {f}}_{n})-\inf _{f\in {\mathcal {F}}}I_{P}(f)>\epsilon )=0

One usual algorithm to find such a sequence is through empirical risk minimization.

Learnable function class

We can make the condition given in the above equation stronger by requiring that the convergence is uniform for all probability distributions. That is:

\lim _{n\rightarrow \infty }\sup _{P}\mathbb {P} (I_{P}({\hat {f}}_{n})-\inf _{f\in {\mathcal {F}}}I_{P}(f)>\epsilon )=0

(1)

The intuition behind the more strict requirement is as such: the rate at which sequence $\{{\hat {f}}_{n}\}$ converges to the minimizer of the expected risk can be very different for different $P(x,y)$ . Because in real world the true distribution $P$ is always unknown, we would want to select a sequence that performs well under all cases.

However, by the no free lunch theorem, such a sequence that satisfies ( 1 ) does not exist if ${\mathcal {F}}$ is too complex. This means we need to be careful and not allow too "many" functions in ${\mathcal {F}}$ if we want ( 1 ) to be a meaningful requirement. Specifically, function classes that ensure the existence of a sequence $\{{\hat {f}}_{n}\}$ that satisfies ( 1 ) are known as learnable classes.^[1]

It is worth noting that at least for supervised classification and regression problems, if a function class is learnable, then the empirical risk minimization automatically satisfies ( 1 ).^[2] Thus in these settings not only do we know that the problem posed by ( 1 ) is solvable, we also immediately have an algorithm that gives the solution.

Interpretations

If the true relationship between $y$ and $x$ is $y\sim f^{*}(x)$ , then by selecting the appropriate loss function, $f^{*}$ can always be expressed as the minimizer of the expected loss across all possible functions. That is,

f^{*}=\arg \min _{f\in {\mathcal {F}}^{*}}I_{P}(f)

Here we let ${\mathcal {F}}^{*}$ be the collection of all possible functions mapping ${\mathcal {X}}$ onto ${\mathcal {Y}}$ . $f^{*}$ can be interpreted as the actual data generating mechanism. However, the no free lunch theorem tells us that in practice, with finite samples we cannot hope to search for the expected risk minimizer over ${\mathcal {F}}^{*}$ . Thus we often consider a subset of ${\mathcal {F}}^{*}$ , ${\mathcal {F}}$ , to carry out searches on. By doing so, we risk that $f^{*}$ might not be an element of ${\mathcal {F}}$ . This tradeoff can be mathematically expressed as

I_{P}({\hat {f}}_{n})-\inf _{f\in {\mathcal {F}}^{*}}I_{P}(f)=\underbrace {I_{P}({\hat {f}}_{n})-\inf _{f\in {\mathcal {F}}}I_{P}(f)} _{(a)}+\underbrace {\inf _{f\in {\mathcal {F}}}I_{P}(f)-\inf _{f\in {\mathcal {F}}^{*}}I_{P}(f)} _{(b)}

(2)

In the above decomposition, part $(b)$ does not depend on the data and is non-stochastic. It describes how far away our assumptions ( ${\mathcal {F}}$ ) are from the truth ( ${\mathcal {F}}^{*}$ ). $(b)$ will be strictly greater than 0 if we make assumptions that are too strong ( ${\mathcal {F}}$ too small). On the other hand, failing to put enough restrictions on ${\mathcal {F}}$ will cause it to be not learnable, and part $(a)$ will not stochastically converge to 0. This is the well-known overfitting problem in statistics and machine learning literature.

Example: Tikhonov regularization

A good example where learnable classes are used is the so-called Tikhonov regularization in reproducing kernel Hilbert space (RKHS). Specifically, let ${\mathcal {F^{*}}}$ be an RKHS, and $||\cdot ||_{2}$ be the norm on ${\mathcal {F^{*}}}$ given by its inner product. It is shown in ^[3] that ${\mathcal {F}}=\{f:||f||_{2}\leq \gamma \}$ is a learnable class for any finite, positive $\gamma$ . The empirical minimization algorithm to the dual form of this problem is

\arg \min _{f\in {\mathcal {F}}^{*}}\left\{\sum _{i=1}^{n}L(f(x_{i}),y_{i})+\lambda ||f||_{2}\right\}

This was first introduced by Tikhonov^[4] to solve ill-posed problems. Many statistical learning algorithms can be expressed in such a form (for example, the well-known ridge regression).

The tradeoff between $(a)$ and $(b)$ in ( 2 ) is geometrically more intuitive with Tikhonov regularization in RKHS. We can consider a sequence of $\{{\mathcal {F}}_{\gamma }\}$ , which are essentially balls in ${\mathcal {F^{*}}}$ with centers at 0. As $\gamma$ gets larger, ${\mathcal {F}}_{\gamma }$ gets closer to the entire space, and $(b)$ is likely to become smaller. However we will also suffer smaller convergence rates in $(a)$ . The way to choose an optimal $\gamma$ in finite sample settings is usually through cross-validation.

Relationship to empirical process theory

Part $(a)$ in ( 2 ) is closely linked to empirical process theory in statistics, where the empirical risk $\{\sum _{i=1}^{n}L(y_{i},f(x_{i})),f\in {\mathcal {F}}\}$ are known as empirical processes.^[5] In this field, the function class ${\mathcal {F}}$ that satisfies the stochastic convergence

\sup _{P}\mathbb {E} \sup _{f\in {\mathcal {F}}}|\sum _{i=1}^{n}L(y_{i},f(x_{i}))-I_{P}(f)|=0

(3)

are known as uniform Glivenko–Cantelli classes. It has been shown that under certain regularity conditions, learnable classes and uniformly Glivenko-Cantelli classes are equivalent.^[1] Interplay between $(a)$ and $(b)$ in statistics literature is often known as the bias-variance tradeoff.

However, note that in ^[2] the authors gave an example of stochastic convex optimization for General Setting of Learning where learnability is not equivalent with uniform convergence.

Related Research Articles

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In mathematics, gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent. It is particularly useful in machine learning for minimizing the cost or loss function. Gradient descent should not be confused with local search algorithms, although both are iterative methods for optimization.

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.

In functional analysis, a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions $and in the RKHS are close in norm, i.e., is small, then and are also pointwise close, i.e., is small for all . The converse does not need to be true. Informally, this can be shown by looking at the supremum norm: the sequence of functions converges pointwise, but does not converge uniformly i.e. does not converge with respect to the supremum norm.$

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics.

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an algorithm will work in practice because we don't know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data.

In mathematics, statistics, finance, computer science, particularly in machine learning and inverse problems, regularization is a process that changes the result answer to be "simpler". It is often used to obtain results for ill-posed problems or to prevent overfitting.

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. However, M-estimators are not inherently robust, as is clear from the fact that they include maximum likelihood estimators, which are in general not robust. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation.

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, i.e., models that make very few assumptions about the data, which are typically simple decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest. A gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.

Stability, also known as algorithmic stability, is a notion in computational learning theory of how a machine learning algorithm output is changed with small perturbations to its inputs. A stable learning algorithm is one for which the prediction does not change much when the training data is modified slightly. For instance, consider a machine learning algorithm that is being trained to recognize handwritten letters of the alphabet, using 1000 examples of handwritten letters and their labels as a training set. One way to modify this training set is to leave out an example, so that only 999 examples of handwritten letters and their labels are available. A stable learning algorithm would produce a similar classifier with both the 1000-element and 999-element training sets.

In data mining and machine learning, $k$ $q$ -flats algorithm is an iterative method which aims to partition $m$ observations into $k$ clusters where each cluster is close to a $q$ -flat, where $q$ is a given integer.

Proximal gradientmethods for learning is an area of research in optimization and statistical learning theory which studies algorithms for a general class of convex regularization problems where the regularization penalty may not be differentiable. One such example is $regularization of the form$

Spectral regularization is any of a class of regularization techniques used in machine learning to control the impact of noise and prevent overfitting. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. For instance, in the email classification example, spectral regularization can be used to reduce the impact of noise and prevent overfitting when a machine learning system is being trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart.

The sample complexity of a machine learning algorithm represents the number of training-samples that it needs in order to successfully learn a target function.

In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems. Given $as the space of all possible inputs, and as the set of labels, a typical goal of classification algorithms is to find a function which best predicts a label for a given input . However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same to generate different . As a result, the goal of the learning problem is to minimize expected loss, defined as$

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

In machine learning, Manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is smooth: data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.

In mathematical optimization, the proximal operator is an operator associated with a proper, lower semi-continuous convex function $from a Hilbert space to, and is defined by:$

References

1 2 3 Vladimir N. Vapnik (17 April 2013). The Nature of Statistical Learning Theory. Springer Science & Business Media. ISBN 978-1-4757-2440-0.
1 2 "Learnability, stability and uniform convergence". The Journal of Machine Learning Research.
↑ "Learnability in Hilbert spaces with reproducing kernels". Journal of Complexity.
↑ Andreĭ Nikolaevich Tikhonov; Vasiliĭ I︠A︡kovlevich Arsenin (1977). Solutions of ill-posed problems. Winston. ISBN 978-0-470-99124-4.
↑ A.W. van der vaart; Jon Wellner (9 March 2013). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media. pp. 116–. ISBN 978-1-4757-2545-2.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Vapnik2013-1] 1 2 3 Vladimir N. Vapnik (17 April 2013). The Nature of Statistical Learning Theory. Springer Science & Business Media. ISBN 978-1-4757-2440-0.

[:0-2] 1 2 "Learnability, stability and uniform convergence". The Journal of Machine Learning Research.

[3] "Learnability in Hilbert spaces with reproducing kernels". Journal of Complexity.

[TikhonovArsenin1977-4] Andreĭ Nikolaevich Tikhonov; Vasiliĭ I︠A︡kovlevich Arsenin (1977). Solutions of ill-posed problems. Winston. ISBN 978-0-470-99124-4.

[vaartWellner2013-5] A.W. van der vaart; Jon Wellner (9 March 2013). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media. pp. 116–. ISBN 978-1-4757-2545-2.

[1]

[2]

[3]

[4]

[5]

Learnable function class

Contents

Definition

Background

Learnable function class

Interpretations

Example: Tikhonov regularization

Relationship to empirical process theory

Related Research Articles

References