Sample complexity

Last updated

The sample complexity of a machine learning algorithm represents the number of training-samples that it needs in order to successfully learn a target function.

Contents

More precisely, the sample complexity is the number of training-samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.

There are two variants of sample complexity:

The No free lunch theorem, discussed below, proves that, in general, the strong sample complexity is infinite, i.e. that there is no algorithm that can learn the globally-optimal target function using a finite number of training samples.

However, if we are only interested in a particular class of target functions (e.g., only linear functions) then the sample complexity is finite, and it depends linearly on the VC dimension on the class of target functions. [1]

Definition

Let be a space which we call the input space, and be a space which we call the output space, and let denote the product . For example, in the setting of binary classification, is typically a finite-dimensional vector space and is the set .

Fix a hypothesis space of functions . A learning algorithm over is a computable map from to . In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from to . Typical learning algorithms include empirical risk minimization, without or with Tikhonov regularization.

Fix a loss function , for example, the square loss , where . For a given distribution on , the expected risk of a hypothesis (a function) is

In our setting, we have , where is a learning algorithm and is a sequence of vectors which are all drawn independently from . Define the optimal riskSet , for each sample size . is a random variable and depends on the random variable , which is drawn from the distribution . The algorithm is called consistent if probabilistically converges to . In other words, for all , there exists a positive integer , such that, for all sample sizes , we have

The sample complexity of is then the minimum for which this holds, as a function of , and . We write the sample complexity as to emphasize that this value of depends on , and . If is not consistent, then we set . If there exists an algorithm for which is finite, then we say that the hypothesis space is learnable.

In others words, the sample complexity defines the rate of consistency of the algorithm: given a desired accuracy and confidence , one needs to sample data points to guarantee that the risk of the output function is within of the best possible, with probability at least . [2]

In probably approximately correct (PAC) learning, one is concerned with whether the sample complexity is polynomial, that is, whether is bounded by a polynomial in and . If is polynomial for some learning algorithm, then one says that the hypothesis space is PAC-learnable. This is a stronger notion than being learnable.

Unrestricted hypothesis space: infinite sample complexity

One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm , such that, for all , there exists a positive integer such that for all , we have

where , with as above. The No Free Lunch Theorem says that without restrictions on the hypothesis space , this is not the case, i.e., there always exist "bad" distributions for which the sample complexity is arbitrarily large. [1]

Thus, in order to make statements about the rate of convergence of the quantity one must either

Restricted hypothesis space: finite sample-complexity

The latter approach leads to concepts such as VC dimension and Rademacher complexity which control the complexity of the space . A smaller hypothesis space introduces more bias into the inference process, meaning that may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of regularization. [2]

It is a theorem from VC theory that the following three statements are equivalent for a hypothesis space :

  1. is PAC-learnable.
  2. The VC dimension of is finite.
  3. is a uniform Glivenko-Cantelli class.

This gives a way to prove that certain hypothesis spaces are PAC learnable, and by extension, learnable.

An example of a PAC-learnable hypothesis space

, and let be the space of affine functions on , that is, functions of the form for some . This is the linear classification with offset learning problem. Now, four coplanar points in a square cannot be shattered by any affine function, since no affine function can be positive on two diagonally opposite vertices and negative on the remaining two. Thus, the VC dimension of is , so it is finite. It follows by the above characterization of PAC-learnable classes that is PAC-learnable, and by extension, learnable.

Sample-complexity bounds

Suppose is a class of binary functions (functions to ). Then, is -PAC-learnable with a sample of size: [3] where is the VC dimension of . Moreover, any -PAC-learning algorithm for must have sample-complexity: [4] Thus, the sample-complexity is a linear function of the VC dimension of the hypothesis space.

Suppose is a class of real-valued functions with range in . Then, is -PAC-learnable with a sample of size: [5] [6] where is Pollard's pseudo-dimension of .

Other settings

In addition to the supervised learning setting, sample complexity is relevant to semi-supervised learning problems including active learning, [7] where the algorithm can ask for labels to specifically chosen inputs in order to reduce the cost of obtaining many labels. The concept of sample complexity also shows up in reinforcement learning, [8] online learning, and unsupervised algorithms, e.g. for dictionary learning. [9]

Efficiency in robotics

A high sample complexity means that many calculations are needed for running a Monte Carlo tree search. [10] It is equivalent to a model-free brute force search in the state space. In contrast, a high-efficiency algorithm has a low sample complexity. [11] Possible techniques for reducing the sample complexity are metric learning [12] and model-based reinforcement learning. [13]

See also

Related Research Articles

In the calculus of variations, a field of mathematical analysis, the functional derivative relates a change in a functional to a change in a function on which the functional depends.

In Vapnik–Chervonenkis theory, the Vapnik–Chervonenkis (VC) dimension is a measure of the size of a class of sets. The notion can be extended to classes of binary functions. It is defined as the cardinality of the largest set of points that the algorithm can shatter, which means the algorithm can always learn a perfect classifier for any labeling of at least one configuration of those data points. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis.

<span class="mw-page-title-main">Partition function (statistical mechanics)</span> Function in thermodynamics and statistical physics

In physics, a partition function describes the statistical properties of a system in thermodynamic equilibrium. Partition functions are functions of the thermodynamic state variables, such as the temperature and volume. Most of the aggregate thermodynamic variables of the system, such as the total energy, free energy, entropy, and pressure, can be expressed in terms of the partition function or its derivatives. The partition function is dimensionless.

In mathematics, a Hopf algebra, named after Heinz Hopf, is a structure that is simultaneously an algebra and a coalgebra, with these structures' compatibility making it a bialgebra, and that moreover is equipped with an antihomomorphism satisfying a certain property. The representation theory of a Hopf algebra is particularly nice, since the existence of compatible comultiplication, counit, and antipode allows for the construction of tensor products of representations, trivial representations, and dual representations.

In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant.

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics.

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. Mathematically, it is defined as

Empirical risk minimization is a principle in statistical learning theory which defines a family of learning algorithms based on evaluating performance over a known and fixed dataset. The core idea is based on an application of the law of large numbers; more specifically, we cannot know exactly how well a predictive algorithm will work in practice because we do not know the true distribution of the data, but we can instead estimate and optimize the performance of the algorithm on a known set of training data. The performance over the known set of training data is referred to as the "empirical risk".

For supervised learning applications in machine learning and statistical learning theory, generalization error is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. Because learning algorithms are evaluated on finite samples, the evaluation of a learning algorithm may be sensitive to sampling error. As a result, measurements of prediction error on the current data may not provide much information about predictive ability on new data. Generalization error can be minimized by avoiding overfitting in the learning algorithm. The performance of a machine learning algorithm is visualized by plots that show values of estimates of the generalization error through the learning process, which are called learning curves.

In information theory, information dimension is an information measure for random vectors in Euclidean space, based on the normalized entropy of finely quantized versions of the random vectors. This concept was first introduced by Alfréd Rényi in 1959.

In mathematics, Weyl's lemma, named after Hermann Weyl, states that every weak solution of Laplace's equation is a smooth solution. This contrasts with the wave equation, for example, which has weak solutions that are not smooth solutions. Weyl's lemma is a special case of elliptic or hypoelliptic regularity.

The exponential mechanism is a technique for designing differentially private algorithms. It was developed by Frank McSherry and Kunal Talwar in 2007. Their work was recognized as a co-winner of the 2009 PET Award for Outstanding Research in Privacy Enhancing Technologies.

A locally decodable code (LDC) is an error-correcting code that allows a single bit of the original message to be decoded with high probability by only examining a small number of bits of a possibly corrupted codeword. This property could be useful, say, in a context where information is being transmitted over a noisy channel, and only a small subset of the data is required at a particular time and there is no need to decode the entire message at once. Note that locally decodable codes are not a subset of locally testable codes, though there is some overlap between the two.

In the theory of quantum communication, the quantum capacity is the highest rate at which quantum information can be communicated over many independent uses of a noisy quantum channel from a sender to a receiver. It is also equal to the highest rate at which entanglement can be generated over the channel, and forward classical communication cannot improve it. The quantum capacity theorem is important for the theory of quantum error correction, and more broadly for the theory of quantum computation. The theorem giving a lower bound on the quantum capacity of any channel is colloquially known as the LSD theorem, after the authors Lloyd, Shor, and Devetak who proved it with increasing standards of rigor.

In quantum information theory, the idea of a typical subspace plays an important role in the proofs of many coding theorems. Its role is analogous to that of the typical set in classical information theory.

Generalized relative entropy is a measure of dissimilarity between two quantum states. It is a "one-shot" analogue of quantum relative entropy and shares many properties of the latter quantity.

In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set.

The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and it was inspired from the PAC-framework introduced by Leslie Valiant.

In PAC learning, error tolerance refers to the ability of an algorithm to learn when the examples received have been corrupted in some way. In fact, this is a very common and important issue since in many applications it is not possible to access noise-free data. Noise can interfere with the learning process at different levels: the algorithm may receive data that have been occasionally mislabeled, or the inputs may have some false information, or the classification of the examples may have been maliciously adulterated.

Adding controlled noise from predetermined distributions is a way of designing differentially private mechanisms. This technique is useful for designing private mechanisms for real-valued functions on sensitive data. Some commonly used distributions for adding noise include Laplace and Gaussian distributions.

References

  1. 1 2 Vapnik, Vladimir (1998), Statistical Learning Theory, New York: Wiley.
  2. 1 2 Rosasco, Lorenzo (2014), Consistency, Learnability, and Regularization, Lecture Notes for MIT Course 9.520.
  3. Steve Hanneke (2016). "The optimal sample complexity of PAC learning". J. Mach. Learn. Res. 17 (1): 1319–1333. arXiv: 1507.00473 .
  4. Ehrenfeucht, Andrzej; Haussler, David; Kearns, Michael; Valiant, Leslie (1989). "A general lower bound on the number of examples needed for learning". Information and Computation. 82 (3): 247. doi: 10.1016/0890-5401(89)90002-3 .
  5. Anthony, Martin; Bartlett, Peter L. (2009). Neural Network Learning: Theoretical Foundations. ISBN   9780521118620.
  6. Morgenstern, Jamie; Roughgarden, Tim (2015). On the Pseudo-Dimension of Nearly Optimal Auctions. NIPS. Curran Associates. pp. 136–144. arXiv: 1506.03684 .
  7. Balcan, Maria-Florina; Hanneke, Steve; Wortman Vaughan, Jennifer (2010). "The true sample complexity of active learning". Machine Learning. 80 (2–3): 111–139. doi: 10.1007/s10994-010-5174-y .
  8. Kakade, Sham (2003), On the Sample Complexity of Reinforcement Learning (PDF), PhD Thesis, University College London: Gatsby Computational Neuroscience Unit.
  9. Vainsencher, Daniel; Mannor, Shie; Bruckstein, Alfred (2011). "The Sample Complexity of Dictionary Learning" (PDF). Journal of Machine Learning Research. 12: 3259–3281.
  10. Kaufmann, Emilie and Koolen, Wouter M (2017). Monte-carlo tree search by best arm identification. Advances in Neural Information Processing Systems. pp. 4897–4906.{{cite conference}}: CS1 maint: multiple names: authors list (link)
  11. Fidelman, Peggy and Stone, Peter (2006). The chin pinch: A case study in skill learning on a legged robot. Robot Soccer World Cup. Springer. pp. 59–71.{{cite conference}}: CS1 maint: multiple names: authors list (link)
  12. Verma, Nakul and Branson, Kristin (2015). Sample complexity of learning mahalanobis distance metrics. Advances in neural information processing systems. pp. 2584–2592.{{cite conference}}: CS1 maint: multiple names: authors list (link)
  13. Kurutach, Thanard and Clavera, Ignasi and Duan, Yan and Tamar, Aviv and Abbeel, Pieter (2018). "Model-ensemble trust-region policy optimization". arXiv: 1802.10592 [cs.LG].{{cite arXiv}}: CS1 maint: multiple names: authors list (link)