Rademacher complexity


In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures the richness of a class of sets with respect to a probability distribution. The concept can also be extended to real-valued functions.


Definitions

Rademacher complexity of a set

Given a set $A \subseteq \mathbb{R}^m$, the Rademacher complexity of $A$ is defined as follows:[1][2]:326

$$\operatorname{Rad}(A) := \frac{1}{m}\,\mathbb{E}_\sigma\left[\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i\right]$$

where $\sigma_1, \sigma_2, \ldots, \sigma_m$ are independent random variables drawn from the Rademacher distribution, i.e. $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$ for $i = 1, 2, \ldots, m$, and $a = (a_1, \ldots, a_m)$. Some authors take the absolute value of the sum before taking the supremum, but if $A$ is symmetric this makes no difference.
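
For a small finite set, $\operatorname{Rad}(A)$ can be computed exactly by enumerating all $2^m$ sign vectors. The following is a minimal sketch; the set and the helper name `rad_exact` are illustrative choices, not part of any standard library.

```python
from itertools import product

import numpy as np

def rad_exact(A):
    """Exact Rademacher complexity of a finite set A of vectors in R^m,
    computed by enumerating all 2^m sign vectors (feasible only for small m)."""
    A = np.asarray(A, dtype=float)
    m = A.shape[1]
    total = 0.0
    for signs in product([-1.0, 1.0], repeat=m):
        sigma = np.array(signs)
        total += np.max(A @ sigma)   # sup over a in A of <sigma, a>
    return total / (2 ** m) / m      # average over signs, then the 1/m factor

# Example: three vectors in R^2 (arbitrary illustrative choice)
print(rad_exact([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]))
```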

Rademacher complexity of a function class

Let $S = (z_1, z_2, \ldots, z_m) \in Z^m$ be a sample of points and consider a function class $\mathcal{F}$ of real-valued functions over $Z$. Then, the empirical Rademacher complexity of $\mathcal{F}$ given $S$ is defined as:

$$\operatorname{Rad}_S(\mathcal{F}) = \frac{1}{m}\,\mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i)\right]$$

This can also be written using the previous definition:[2]:326

$$\operatorname{Rad}_S(\mathcal{F}) = \operatorname{Rad}(\mathcal{F} \circ S)$$

where $\mathcal{F} \circ S$ denotes function composition, i.e.:

$$\mathcal{F} \circ S := \{(f(z_1), \ldots, f(z_m)) \mid f \in \mathcal{F}\}$$

Let $P$ be a probability distribution over $Z$. The Rademacher complexity of the function class $\mathcal{F}$ with respect to $P$ for sample size $m$ is:

$$\operatorname{Rad}_{P,m}(\mathcal{F}) := \mathbb{E}_{S \sim P^m}\left[\operatorname{Rad}_S(\mathcal{F})\right]$$

where the above expectation is taken over an independent and identically distributed (i.i.d.) sample $S = (z_1, z_2, \ldots, z_m)$ generated according to $P$.
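
For function classes the supremum inside the expectation is rarely available in closed form, but the empirical Rademacher complexity can be estimated by Monte Carlo over the sign vectors when the class is finite or has been discretized. A minimal sketch, assuming an illustrative class of threshold classifiers on a one-dimensional sample (the class, grid, and function names are assumptions for illustration):

```python
import numpy as np

def empirical_rademacher(sample, function_class, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite function class on a fixed sample."""
    rng = np.random.default_rng(seed)
    # Rows of `values` are the vectors (f(z_1), ..., f(z_m)), i.e. the set F∘S.
    values = np.array([[f(z) for z in sample] for f in function_class])
    m = len(sample)
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)
        total += np.max(values @ sigma)   # sup over f of sum_i sigma_i f(z_i)
    return total / n_draws / m

# Illustrative class: thresholds x -> sign(x - t), sup approximated on a grid of t.
sample = np.linspace(-1.0, 1.0, 20)
function_class = [lambda x, t=t: 1.0 if x >= t else -1.0
                  for t in np.linspace(-1.2, 1.2, 200)]
print(empirical_rademacher(sample, function_class))
```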

Intuition

The Rademacher complexity is typically applied to a function class of models used for classification, with the goal of measuring their ability to classify points drawn from a probability space under arbitrary labellings. When the function class is rich enough, it contains functions that can appropriately adapt to each arrangement of labels, simulated by the random draw of $\sigma_i$ under the expectation, so that the quantity inside the supremum is maximised.

Examples

1. $A$ contains a single vector, e.g., $A = \{(a, b)\} \subset \mathbb{R}^2$. Then:

$$\operatorname{Rad}(A) = \frac{1}{2}\,\mathbb{E}\left[\sigma_1 a + \sigma_2 b\right] = 0$$

The same is true for every singleton hypothesis class.[3]:56

2. $A$ contains two vectors, e.g., $A = \{(1, 1), (1, 2)\} \subset \mathbb{R}^2$. Then:

$$\operatorname{Rad}(A) = \frac{1}{2}\,\mathbb{E}\left[\max(\sigma_1 + \sigma_2,\ \sigma_1 + 2\sigma_2)\right] = \frac{1}{2} \cdot \frac{3 + 0 + 1 - 2}{4} = \frac{1}{4}$$
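
Both examples can be checked mechanically by enumerating the four sign patterns; the vectors below mirror the (illustrative) choices written out above.

```python
from itertools import product

import numpy as np

# Exhaustive check of example 2 (the two vectors are illustrative choices).
A = np.array([[1.0, 1.0], [1.0, 2.0]])
vals = [np.max(A @ np.array(s)) for s in product([-1.0, 1.0], repeat=2)]
print(sum(vals) / len(vals) / 2)   # -> 0.25

# A singleton set always gives 0, as in example 1.
B = np.array([[3.0, -2.0]])
vals = [np.max(B @ np.array(s)) for s in product([-1.0, 1.0], repeat=2)]
print(sum(vals) / len(vals) / 2)   # -> 0.0
```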

Using the Rademacher complexity

The Rademacher complexity can be used to derive data-dependent upper bounds on the learnability of function classes. Intuitively, a function class with smaller Rademacher complexity is easier to learn.

Bounding the representativeness

In machine learning, it is desired to have a training set that represents the true distribution of some sample data $S$. This can be quantified using the notion of representativeness. Denote by $P$ the probability distribution from which the samples are drawn. Denote by $H$ the set of hypotheses (potential classifiers) and denote by $\mathcal{F}$ the corresponding set of error functions, i.e., for every hypothesis $h \in H$, there is a function $f_h \in \mathcal{F}$ that maps each training sample (features, label) to the error of the classifier $h$ (note that in this case hypothesis and classifier are used interchangeably). For example, in the case that $h$ represents a binary classifier, the error function is a 0–1 loss function, i.e. the error function $f_h$ returns 0 if $h$ correctly classifies a sample and 1 otherwise. We omit the index and write $f$ instead of $f_h$ when the underlying hypothesis is irrelevant. Define:

$L_P(f) := \mathbb{E}_{z \sim P}[f(z)]$ – the expected error of an error function $f \in \mathcal{F}$ on the real distribution $P$;
$L_S(f) := \frac{1}{m}\sum_{i=1}^{m} f(z_i)$ – the estimated (empirical) error of an error function $f \in \mathcal{F}$ on the sample $S = (z_1, \ldots, z_m)$.

The representativeness of the sample $S$, with respect to $P$ and $\mathcal{F}$, is defined as:

$$\operatorname{Rep}_P(\mathcal{F}, S) := \sup_{f \in \mathcal{F}} \left( L_P(f) - L_S(f) \right)$$

Smaller representativeness is better, since it provides a way to avoid overfitting: it means that the true error of a classifier is not much higher than its estimated error, and so selecting a classifier that has low estimated error will ensure that its true error is also low. Note however that the concept of representativeness is relative and hence cannot be compared across distinct samples.

The expected representativeness of a sample can be bounded above by the Rademacher complexity of the function class:[2]:326

$$\mathbb{E}_{S \sim P^m}\left[\operatorname{Rep}_P(\mathcal{F}, S)\right] \le 2\,\operatorname{Rad}_{P,m}(\mathcal{F})$$
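
The factor of 2 comes from the standard symmetrization ("ghost sample") argument; a condensed sketch of that derivation, written out here for completeness rather than quoted from the cited text, is:

```latex
\begin{aligned}
\mathbb{E}_{S}\!\left[\operatorname{Rep}_P(\mathcal{F},S)\right]
 &= \mathbb{E}_{S}\!\left[\sup_{f\in\mathcal{F}}\bigl(L_P(f)-L_S(f)\bigr)\right]
  = \mathbb{E}_{S}\!\left[\sup_{f\in\mathcal{F}}\mathbb{E}_{S'}\bigl[L_{S'}(f)-L_S(f)\bigr]\right] \\
 &\le \mathbb{E}_{S,S'}\!\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^{m}\bigl(f(z_i')-f(z_i)\bigr)\right]
  = \mathbb{E}_{\sigma,S,S'}\!\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^{m}\sigma_i\bigl(f(z_i')-f(z_i)\bigr)\right] \\
 &\le 2\,\mathbb{E}_{S}\!\left[\mathbb{E}_{\sigma}\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^{m}\sigma_i f(z_i)\right]
  = 2\,\operatorname{Rad}_{P,m}(\mathcal{F}),
\end{aligned}
```

where $S'$ is an independent "ghost" sample; the steps use Jensen's inequality, the sign symmetry of $f(z_i') - f(z_i)$, and splitting the supremum over the two samples.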

Bounding the generalization error

When the Rademacher complexity is small, it is possible to learn the hypothesis class H using empirical risk minimization.

For example (with binary error function),[2]:328 for every $\delta > 0$, with probability at least $1 - \delta$, for every hypothesis $h \in H$:

$$L_P(h) - L_S(h) \le 2\,\operatorname{Rad}_{P,m}(\mathcal{F}) + \sqrt{\frac{2\ln(2/\delta)}{m}}$$

where $\mathcal{F}$ is the class of error functions associated with $H$.
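
As a rough numeric illustration of how such a bound shrinks with the sample size, assuming the form of the inequality as stated above (textbooks differ in the exact constants) and a Rademacher complexity behaving like $\sqrt{d/m}$ (see the VC-dimension bound below):

```python
import numpy as np

def generalization_gap_bound(rad, m, delta):
    """Bound on L_P(h) - L_S(h), uniformly over h, in the form stated above:
    2*Rad + sqrt(2*ln(2/delta)/m). Exact constants vary between textbooks."""
    return 2.0 * rad + np.sqrt(2.0 * np.log(2.0 / delta) / m)

# Illustration with a Rademacher complexity behaving like sqrt(d/m), d = 10.
for m in (100, 1000, 10000):
    print(m, round(generalization_gap_bound(np.sqrt(10 / m), m, delta=0.05), 3))
```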

Bounding the Rademacher complexity

Since smaller Rademacher complexity is better, it is useful to have upper bounds on the Rademacher complexity of various function sets. The following rules can be used to upper-bound the Rademacher complexity of a set $A \subseteq \mathbb{R}^m$.[2]:329–330

1. If all vectors in $A$ are translated by a constant vector $a_0 \in \mathbb{R}^m$, then Rad(A) does not change.

2. If all vectors in $A$ are multiplied by a scalar $c \in \mathbb{R}$, then Rad(A) is multiplied by $|c|$.

3. $\operatorname{Rad}(A + B) = \operatorname{Rad}(A) + \operatorname{Rad}(B)$, where $A + B := \{a + b \mid a \in A,\ b \in B\}$.[3]:56

4. (Kakade & Tewari Lemma) If all vectors in $A$ are operated on by a Lipschitz function, then Rad(A) is (at most) multiplied by the Lipschitz constant of the function. In particular, if all vectors in $A$ are operated on by a contraction mapping, then Rad(A) strictly decreases.

5. The Rademacher complexity of the convex hull of $A$ equals Rad(A).

6. (Massart Lemma) The Rademacher complexity of a finite set grows logarithmically with the set size. Formally, let $A$ be a set of $N$ vectors in $\mathbb{R}^m$, and let $\bar{a}$ be the mean of the vectors in $A$. Then:

$$\operatorname{Rad}(A) \le \max_{a \in A} \|a - \bar{a}\|_2 \cdot \frac{\sqrt{2 \ln N}}{m}$$

In particular, if $A$ is a set of binary vectors, the norm is at most $\sqrt{m}$, so:

$$\operatorname{Rad}(A) \le \sqrt{\frac{2 \ln N}{m}}$$
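
Massart's lemma, in the form stated above, can be checked numerically on a random finite set by comparing it against exact enumeration of sign vectors (the set size and dimension below are arbitrary illustrative choices):

```python
from itertools import product

import numpy as np

def rad_exact(A):
    """Exact Rademacher complexity of a finite set of vectors in R^m."""
    A = np.asarray(A, dtype=float)
    m = A.shape[1]
    return np.mean([np.max(A @ np.array(s))
                    for s in product([-1.0, 1.0], repeat=m)]) / m

rng = np.random.default_rng(0)
N, m = 20, 10                      # small enough that 2^m enumeration is exact
A = rng.normal(size=(N, m))
a_bar = A.mean(axis=0)
massart = np.max(np.linalg.norm(A - a_bar, axis=1)) * np.sqrt(2 * np.log(N)) / m
print(rad_exact(A), "<=", massart)
```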

Let $H$ be a set family whose VC dimension is $d$. It is known that the growth function of $H$ is bounded as:

for all $m \ge d$: $\operatorname{Growth}(H, m) \le \left(\frac{em}{d}\right)^d$

This means that, for every set $h$ with at most $m$ elements, $|H \cap h| \le (em/d)^d$. The set family $H \cap h$ can be considered as a set of binary vectors over $\mathbb{R}^m$. Substituting this in Massart's lemma gives:

$$\operatorname{Rad}(H \cap h) \le \sqrt{\frac{2 d \ln(em/d)}{m}}$$

With more advanced techniques (Dudley's entropy bound and Haussler's upper bound [4]) one can show, for example, that there exists a constant $C$, such that any class of $\{0,1\}$-indicator functions with Vapnik–Chervonenkis dimension $d$ has Rademacher complexity upper-bounded by $C\sqrt{\frac{d}{m}}$.
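
A quick numeric comparison of the two rates, using the Sauer/Massart-based bound as reconstructed above and the $\sqrt{d/m}$ rate with its unspecified constant omitted, shows the effect of the extra logarithmic factor:

```python
import numpy as np

d = 10
for m in (100, 1000, 10000, 100000):
    sauer_massart = np.sqrt(2 * d * np.log(np.e * m / d) / m)  # bound derived above
    vc_rate = np.sqrt(d / m)                                   # Dudley/Haussler rate, constant omitted
    print(m, round(sauer_massart, 4), round(vc_rate, 4))
```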

The following bounds are related to linear operations on $S$ – a constant set of $m$ vectors $x_1, \ldots, x_m$ in $\mathbb{R}^n$.[2]:332–333

1. Define $A_2 = \{(w \cdot x_1, \ldots, w \cdot x_m) \mid \|w\|_2 \le 1\}$ – the set of dot-products of the vectors in $S$ with vectors in the Euclidean unit ball. Then:

$$\operatorname{Rad}(A_2) \le \frac{\max_i \|x_i\|_2}{\sqrt{m}}$$

2. Define $A_1 = \{(w \cdot x_1, \ldots, w \cdot x_m) \mid \|w\|_1 \le 1\}$ – the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 1-norm. Then:

$$\operatorname{Rad}(A_1) \le \max_i \|x_i\|_\infty \cdot \sqrt{\frac{2 \log(2n)}{m}}$$
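
The first of these bounds is easy to check numerically, since for the Euclidean unit ball the inner supremum has the closed form $\sup_{\|w\|_2 \le 1} \sum_i \sigma_i (w \cdot x_i) = \|\sum_i \sigma_i x_i\|_2$; the data below are an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, n_draws = 50, 5, 5000
X = rng.normal(size=(m, n))        # the fixed set S of m vectors in R^n

# For the Euclidean unit ball, sup_{||w||<=1} sum_i sigma_i <w, x_i> = ||sum_i sigma_i x_i||_2.
total = 0.0
for _ in range(n_draws):
    sigma = rng.choice([-1.0, 1.0], size=m)
    total += np.linalg.norm(sigma @ X)
rad_a2 = total / n_draws / m

bound = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(m)
print(rad_a2, "<=", bound)
```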

The following bound relates the Rademacher complexity of a set $A$ to its external covering number – the number of balls of a given radius $r$ whose union contains $A$. The bound is attributed to Dudley.[2]:338

Suppose $A \subseteq \mathbb{R}^m$ is a set of vectors whose length (norm) is at most $c$. Then, for every integer $M > 0$:

$$\operatorname{Rad}(A) \le \frac{c \cdot 2^{-M}}{\sqrt{m}} + \frac{6c}{m} \sum_{i=1}^{M} 2^{-i} \sqrt{\ln N^{\text{ext}}_{c \cdot 2^{-i}}(A)}$$

In particular, if $A$ lies in a $d$-dimensional subspace of $\mathbb{R}^m$, then for every $r > 0$:

$$N^{\text{ext}}_r(A) \le \left(\frac{2 c \sqrt{d}}{r}\right)^d$$

Substituting this in the previous bound gives the following bound on the Rademacher complexity:

$$\operatorname{Rad}(A) = O\!\left(\frac{c \sqrt{d \ln d}}{m}\right)$$

Gaussian complexity

Gaussian complexity is a similar complexity with a similar meaning, obtained from the Rademacher complexity by using the random variables $g_i$ instead of $\sigma_i$, where the $g_i$ are i.i.d. Gaussian random variables with zero mean and variance 1, i.e. $g_i \sim \mathcal{N}(0, 1)$. Gaussian and Rademacher complexities are known to be equivalent up to logarithmic factors.

Equivalence of Rademacher and Gaussian complexity

Given a set $A \subseteq \mathbb{R}^n$, it holds that:[5]

$$\frac{G(A)}{2\sqrt{\ln n}} \le \operatorname{Rad}(A) \le \sqrt{\frac{\pi}{2}}\, G(A)$$

where $G(A)$ is the Gaussian complexity of $A$. As an example, consider the Rademacher and Gaussian complexities of the $\ell_1$ unit ball in $\mathbb{R}^d$. Its Rademacher complexity is given by exactly 1, whereas its Gaussian complexity is on the order of $\sqrt{\ln d}$ (which can be shown by applying known properties of suprema of sets of sub-Gaussian random variables).[5]
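
This example can be reproduced numerically: for the $\ell_1$ unit ball the supremum of a linear functional equals the $\infty$-norm of its coefficient vector, so the two complexities (read here in the unnormalized convention, without the $1/m$ factor) are $\mathbb{E}\|\sigma\|_\infty = 1$ and $\mathbb{E}\|g\|_\infty \approx \sqrt{2 \ln d}$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_draws = 1000, 2000

# For the l1 unit ball, sup_{||a||_1 <= 1} <v, a> = ||v||_inf.
sigma = rng.choice([-1.0, 1.0], size=(n_draws, d))
g = rng.normal(size=(n_draws, d))
print(np.abs(sigma).max(axis=1).mean())                        # exactly 1
print(np.abs(g).max(axis=1).mean(), np.sqrt(2 * np.log(d)))    # roughly sqrt(2 ln d)
```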

Related Research Articles

<span class="mw-page-title-main">Random variable</span> Variable representing a random phenomenon

A random variable is a mathematical formalization of a quantity or object which depends on random events. The term 'random variable' can be misleading as its mathematical definition is not actually random nor a variable, but rather it is a function from possible outcomes in a sample space to a measurable space, often to the real numbers.

<span class="mw-page-title-main">Uncertainty principle</span> Foundational principle in quantum physics

The uncertainty principle, also known as Heisenberg's indeterminacy principle, is a fundamental concept in quantum mechanics. It states that there is a limit to the precision with which certain pairs of physical properties, such as position and momentum, can be simultaneously known. In other words, the more accurately one property is measured, the less accurately the other property can be known.

In probability theory, the central limit theorem (CLT) states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution. This holds even if the original variables themselves are not normally distributed. There are several versions of the CLT, each applying in the context of different conditions.

In mathematics, the Lp spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces. They are sometimes called Lebesgue spaces, named after Henri Lebesgue, although according to the Bourbaki group they were first introduced by Frigyes Riesz.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

<span class="mw-page-title-main">Wiener process</span> Stochastic process generalizing Brownian motion

In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is often also called Brownian motion due to its historical connection with the physical process of the same name originally observed by Scottish botanist Robert Brown. It is one of the best known Lévy processes and occurs frequently in pure and applied mathematics, economics, quantitative finance, evolutionary biology, and physics.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In mathematics, particularly in operator theory and C*-algebra theory, the continuous functional calculus is a functional calculus which allows the application of a continuous function to normal elements of a C*-algebra.

A Dynkin system, named after Eugene Dynkin, is a collection of subsets of another universal set satisfying a set of axioms weaker than those of 𝜎-algebra. Dynkin systems are sometimes referred to as 𝜆-systems or d-system. These set families have applications in measure theory and probability.

In mathematics, a π-system on a set is a collection of certain subsets of such that

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value . Like this, over the generation sequence, individuals with better and better -values are generated.

In mathematics, especially measure theory, a set function is a function whose domain is a family of subsets of some given set and that (usually) takes its values in the extended real number line which consists of the real numbers and

In mathematics, the conformal radius is a way to measure the size of a simply connected planar domain D viewed from a point z in it. As opposed to notions using Euclidean distance, this notion is well-suited to use in complex analysis, in particular in conformal maps and conformal geometry.

In discrete mathematics, ideal lattices are a special class of lattices and a generalization of cyclic lattices. Ideal lattices naturally occur in many parts of number theory, but also in other areas. In particular, they have a significant place in cryptography. Micciancio defined a generalization of cyclic lattices as ideal lattices. They can be used in cryptosystems to decrease by a square root the number of parameters necessary to describe a lattice, making them more efficient. Ideal lattices are a new concept, but similar lattice classes have been used for a long time. For example, cyclic lattices, a special case of ideal lattices, are used in NTRUEncrypt and NTRUSign.

In probability theory, concentration inequalities provide mathematical bounds on the probability of a random variable deviating from some value.

In mathematics, low-rank approximation is a minimization problem, in which the cost function measures the fit between a given matrix and an approximating matrix, subject to a constraint that the approximating matrix has reduced rank. The problem is used for mathematical modeling and data compression. The rank constraint is related to a constraint on the complexity of a model that fits the data. In applications, often there are other constraints on the approximating matrix apart from the rank constraint, e.g., non-negativity and Hankel structure.

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.

In mathematics, a dual system, dual pair, or a duality over a field is a triple consisting of two vector spaces and over and a non-degenerate bilinear map .

A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, but has since been used in diverse settings in statistics, machine learning and computer science.

References

  1. Balcan, Maria-Florina (November 15–17, 2011). "Machine Learning Theory – Rademacher Complexity" (PDF). Retrieved 10 December 2016.
  2. Chapter 26 in Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning – From Theory to Algorithms. Cambridge University Press. ISBN 9781107057135.
  3. Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. Cambridge, MA: MIT Press. ISBN 9780262018258.
  4. Bousquet, O. (2004). "Introduction to Statistical Learning Theory". Advanced Lectures on Machine Learning. Lecture Notes in Computer Science, vol. 3176. Springer. pp. 169–207. doi:10.1007/978-3-540-28650-9_8.
  5. Wainwright, Martin (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Exercise 5.5. ISBN 978-1-108-62777-1. OCLC 1089254580.