Error tolerance (PAC learning)

Last updated April 21, 2023

In PAC learning, error tolerance refers to the ability of an algorithm to learn when the examples received have been corrupted in some way. In fact, this is a very common and important issue since in many applications it is not possible to access noise-free data. Noise can interfere with the learning process at different levels: the algorithm may receive data that have been occasionally mislabeled, or the inputs may have some false information, or the classification of the examples may have been maliciously adulterated.

Notation and the Valiant learning model

In the following, let $X$ be our $n$ -dimensional input space. Let ${\mathcal {H}}$ be a class of functions that we wish to use in order to learn a $\{0,1\}$ -valued target function $f$ defined over $X$ . Let ${\mathcal {D}}$ be the distribution of the inputs over $X$ . The goal of a learning algorithm ${\mathcal {A}}$ is to choose the best function $h\in {\mathcal {H}}$ such that it minimizes $error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))$ . Let us suppose we have a function $size(f)$ that can measure the complexity of $f$ . Let ${\text{Oracle}}(x)$ be an oracle that, whenever called, returns an example $x$ and its correct label $f(x)$ .

When no noise corrupts the data, we can define learning in the Valiant setting:^[1]^[2]

Definition: We say that $f$ is efficiently learnable using ${\mathcal {H}}$ in the Valiant setting if there exists a learning algorithm ${\mathcal {A}}$ that has access to ${\text{Oracle}}(x)$ and a polynomial $p(\cdot ,\cdot ,\cdot ,\cdot )$ such that for any $0<\varepsilon \leq 1$ and $0<\delta \leq 1$ it outputs, in a number of calls to the oracle bounded by $p\left({\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,{\text{size}}(f)\right)$ , a function $h\in {\mathcal {H}}$ that satisfies with probability at least $1-\delta$ the condition ${\text{error}}(h)\leq \varepsilon$ .

In the following we will define learnability of $f$ when data have suffered some modification.^[3]^[4]^[5]

Classification noise

In the classification noise model^[6] a noise rate $0\leq \eta <{\frac {1}{2}}$ is introduced. Then, instead of ${\text{Oracle}}(x)$ that returns always the correct label of example $x$ , algorithm ${\mathcal {A}}$ can only call a faulty oracle ${\text{Oracle}}(x,\eta )$ that will flip the label of $x$ with probability $\eta$ . As in the Valiant case, the goal of a learning algorithm ${\mathcal {A}}$ is to choose the best function $h\in {\mathcal {H}}$ such that it minimizes $error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))$ . In applications it is difficult to have access to the real value of $\eta$ , but we assume we have access to its upperbound $\eta _{B}$ .^[7] Note that if we allow the noise rate to be $1/2$ , then learning becomes impossible in any amount of computation time, because every label conveys no information about the target function.

Definition: We say that $f$ is efficiently learnable using ${\mathcal {H}}$ in the classification noise model if there exists a learning algorithm ${\mathcal {A}}$ that has access to ${\text{Oracle}}(x,\eta )$ and a polynomial $p(\cdot ,\cdot ,\cdot ,\cdot )$ such that for any $0\leq \eta \leq {\frac {1}{2}}$ , $0\leq \varepsilon \leq 1$ and $0\leq \delta \leq 1$ it outputs, in a number of calls to the oracle bounded by $p\left({\frac {1}{1-2\eta _{B}}},{\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,size(f)\right)$ , a function $h\in {\mathcal {H}}$ that satisfies with probability at least $1-\delta$ the condition $error(h)\leq \varepsilon$ .

Statistical query learning

Statistical Query Learning^[8] is a kind of active learning problem in which the learning algorithm ${\mathcal {A}}$ can decide if to request information about the likelihood $P_{f(x)}$ that a function $f$ correctly labels example $x$ , and receives an answer accurate within a tolerance $\alpha$ . Formally, whenever the learning algorithm ${\mathcal {A}}$ calls the oracle ${\text{Oracle}}(x,\alpha )$ , it receives as feedback probability $Q_{f(x)}$ , such that $Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha$ .

Definition: We say that $f$ is efficiently learnable using ${\mathcal {H}}$ in the statistical query learning model if there exists a learning algorithm ${\mathcal {A}}$ that has access to ${\text{Oracle}}(x,\alpha )$ and polynomials $p(\cdot ,\cdot ,\cdot )$ , $q(\cdot ,\cdot ,\cdot )$ , and $r(\cdot ,\cdot ,\cdot )$ such that for any $0<\varepsilon \leq 1$ the following hold:

${\text{Oracle}}(x,\alpha )$ can evaluate $P_{f(x)}$ in time $q\left({\frac {1}{\varepsilon }},n,size(f)\right)$ ;
${\frac {1}{\alpha }}$ is bounded by $r\left({\frac {1}{\varepsilon }},n,size(f)\right)$
${\mathcal {A}}$ outputs a model $h$ such that $err(h)<\varepsilon$ , in a number of calls to the oracle bounded by $p\left({\frac {1}{\varepsilon }},n,size(f)\right)$ .

Note that the confidence parameter $\delta$ does not appear in the definition of learning. This is because the main purpose of $\delta$ is to allow the learning algorithm a small probability of failure due to an unrepresentative sample. Since now ${\text{Oracle}}(x,\alpha )$ always guarantees to meet the approximation criterion $Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha$ , the failure probability is no longer needed.

The statistical query model is strictly weaker than the PAC model: any efficiently SQ-learnable class is efficiently PAC learnable in the presence of classification noise, but there exist efficient PAC-learnable problems such as parity that are not efficiently SQ-learnable.^[8]

Malicious classification

In the malicious classification model^[9] an adversary generates errors to foil the learning algorithm. This setting describes situations of error burst, which may occur when for a limited time transmission equipment malfunctions repeatedly. Formally, algorithm ${\mathcal {A}}$ calls an oracle ${\text{Oracle}}(x,\beta )$ that returns a correctly labeled example $x$ drawn, as usual, from distribution ${\mathcal {D}}$ over the input space with probability $1-\beta$ , but it returns with probability $\beta$ an example drawn from a distribution that is not related to ${\mathcal {D}}$ . Moreover, this maliciously chosen example may strategically selected by an adversary who has knowledge of $f$ , $\beta$ , ${\mathcal {D}}$ , or the current progress of the learning algorithm.

Definition: Given a bound $\beta _{B}<{\frac {1}{2}}$ for $0\leq \beta <{\frac {1}{2}}$ , we say that $f$ is efficiently learnable using ${\mathcal {H}}$ in the malicious classification model, if there exist a learning algorithm ${\mathcal {A}}$ that has access to ${\text{Oracle}}(x,\beta )$ and a polynomial $p(\cdot ,\cdot ,\cdot ,\cdot ,\cdot )$ such that for any $0<\varepsilon \leq 1$ , $0<\delta \leq 1$ it outputs, in a number of calls to the oracle bounded by $p\left({\frac {1}{1/2-\beta _{B}}},{\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,size(f)\right)$ , a function $h\in {\mathcal {H}}$ that satisfies with probability at least $1-\delta$ the condition $error(h)\leq \varepsilon$ .

Errors in the inputs: nonuniform random attribute noise

In the nonuniform random attribute noise^[10]^[11] model the algorithm is learning a Boolean function, a malicious oracle ${\text{Oracle}}(x,\nu )$ may flip each $i$ -th bit of example $x=(x_{1},x_{2},\ldots ,x_{n})$ independently with probability $\nu _{i}\leq \nu$ .

This type of error can irreparably foil the algorithm, in fact the following theorem holds:

In the nonuniform random attribute noise setting, an algorithm ${\mathcal {A}}$ can output a function $h\in {\mathcal {H}}$ such that $error(h)<\varepsilon$ only if $\nu <2\varepsilon$ .

Related Research Articles

In mathematical physics, the Dirac delta distribution, also known as the unit impulse, is a generalized function or distribution over the real numbers, whose value is zero everywhere except at zero, and whose integral over the entire real line is equal to one.

Noether's theorem or Noether's first theorem states that every differentiable symmetry of the action of a physical system with conservative forces has a corresponding conservation law. The theorem was proven by mathematician Emmy Noether in 1915 and published in 1918. The action of a physical system is the integral over time of a Lagrangian function, from which the system's behavior can be determined by the principle of least action. This theorem only applies to continuous and smooth symmetries over physical space.

In the calculus of variations and classical mechanics, the Euler–Lagrange equations are a system of second-order ordinary differential equations whose solutions are stationary points of the given action functional. The equations were discovered in the 1750s by Swiss mathematician Leonhard Euler and Italian mathematician Joseph-Louis Lagrange.

Vapnik–Chervonenkis theory was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.

In mathematics, the Hodge star operator or Hodge star is a linear map defined on the exterior algebra of a finite-dimensional oriented vector space endowed with a nondegenerate symmetric bilinear form. Applying the operator to an element of the algebra produces the Hodge dual of the element. This map was introduced by W. V. D. Hodge.

In probability theory, a Chernoff bound is an exponentially decreasing upper bound on the tail of a random variable based on its moment generating function. The minimum of all such exponential bounds forms the Chernoff or Chernoff-Cramér bound, which may decay faster than exponential. It is especially useful for sums of independent random variables, such as sums of Bernoulli random variables.

In information theory, Shannon's source coding theorem establishes the limits to possible data compression, and the operational meaning of the Shannon entropy.

Szemerédi's regularity lemma is one of the most powerful tools in extremal graph theory, particularly in the study of large dense graphs. It states that the vertices of every large enough graph can be partitioned into a bounded number of parts so that the edges between different parts behave almost randomly.

In mathematics, in particular in algebraic geometry and differential geometry, Dolbeault cohomology (named after Pierre Dolbeault) is an analog of de Rham cohomology for complex manifolds. Let M be a complex manifold. Then the Dolbeault cohomology groups $depend on a pair of integers p and q and are realized as a subquotient of the space of complex differential forms of degree (p, q).$

The covariant formulation of classical electromagnetism refers to ways of writing the laws of classical electromagnetism in a form that is manifestly invariant under Lorentz transformations, in the formalism of special relativity using rectilinear inertial coordinate systems. These expressions both make it simple to prove that the laws of classical electromagnetism take the same form in any inertial coordinate system, and also provide a way to translate the fields and forces from one frame to another. However, this is not as general as Maxwell's equations in curved spacetime or non-rectilinear coordinate systems.

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

In mathematics, uniform integrability is an important concept in real analysis, functional analysis and measure theory, and plays a vital role in the theory of martingales.

In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.

The exponential mechanism is a technique for designing differentially private algorithms. It was developed by Frank McSherry and Kunal Talwar in 2007. Their work was recognized as a co-winner of the 2009 PET Award for Outstanding Research in Privacy Enhancing Technologies.

Differential privacy (DP) is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore provides privacy.

In cryptography, Learning with errors (LWE) is a mathematical problem that is widely used in cryptography to create secure encryption algorithms. It is based on the idea of representing secret information as a set of equations with errors. In other words, LWE is a way to hide the value of a secret by introducing noise to it. In more technical terms, it refers to the computational problem of inferring a linear $-ary function over a finite ring from given samples some of which may be erroneous. The LWE problem is conjectured to be hard to solve, and thus to be useful in cryptography.$

The sample complexity of a machine learning algorithm represents the number of training-samples that it needs in order to successfully learn a target function.

In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set.

The multiplicative weights update method is an algorithmic technique most commonly used for decision making and prediction, and also widely deployed in game theory and algorithm design. The simplest use case is the problem of prediction from expert advice, in which a decision maker needs to iteratively decide on an expert whose advice to follow. The method assigns initial weights to the experts, and updates these weights multiplicatively and iteratively according to the feedback of how well an expert performed: reducing it in case of poor performance, and increasing it otherwise. It was discovered repeatedly in very diverse fields such as machine learning, optimization, theoretical computer science, and game theory.

Stochastic gradient Langevin dynamics (SGLD) is an optimization and sampling technique composed of characteristics from Stochastic gradient descent, a Robbins–Monro optimization algorithm, and Langevin dynamics, a mathematical extension of molecular dynamics models. Like stochastic gradient descent, SGLD is an iterative optimization algorithm which uses minibatching to create a stochastic gradient estimator, as used in SGD to optimize a differentiable objective function. Unlike traditional SGD, SGLD can be used for Bayesian learning as a sampling method. SGLD may be viewed as Langevin dynamics applied to posterior distributions, but the key difference is that the likelihood gradient terms are minibatched, like in SGD. SGLD, like Langevin dynamics, produces samples from a posterior distribution of parameters based on available data. First described by Welling and Teh in 2011, the method has applications in many contexts which require optimization, and is most notably applied in machine learning problems.

References

↑ Valiant, L. G. (August 1985). Learning Disjunction of Conjunctions . In IJCAI (pp. 560–566).
↑ Valiant, Leslie G. "A theory of the learnable." Communications of the ACM 27.11 (1984): 1134–1142.
↑ Laird, P. D. (1988). Learning from good and bad data . Kluwer Academic Publishers.
↑ Kearns, Michael. "Efficient noise-tolerant learning from statistical queries." Journal of the ACM 45.6 (1998): 983–1006.
↑ Brunk, Clifford A., and Michael J. Pazzani. "An investigation of noise-tolerant relational concept learning algorithms." Proceedings of the 8th International Workshop on Machine Learning. 1991.
↑ Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory, chapter 5. MIT press.
↑ Angluin, D., & Laird, P. (1988). Learning from noisy examples . Machine Learning, 2(4), 343–370.
1 2 Kearns, M. (1998). [www.cis.upenn.edu/~mkearns/papers/sq-journal.pdf Efficient noise-tolerant learning from statistical queries]. Journal of the ACM, 45(6), 983–1006.
↑ Kearns, M., & Li, M. (1993). [www.cis.upenn.edu/~mkearns/papers/malicious.pdf Learning in the presence of malicious errors]. SIAM Journal on Computing, 22(4), 807–837.
↑ Goldman, S. A., & Sloan, Robert, H. (1991). The difficulty of random attribute noise. Technical Report WUCS 91 29, Washington University, Department of Computer Science.
↑ Sloan, R. H. (1989). Computational learning theory: New models and algorithms (Doctoral dissertation, Massachusetts Institute of Technology).

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Valiant, L. G. (August 1985). Learning Disjunction of Conjunctions . In IJCAI (pp. 560–566).

[2] Valiant, Leslie G. "A theory of the learnable." Communications of the ACM 27.11 (1984): 1134–1142.

[3] Laird, P. D. (1988). Learning from good and bad data . Kluwer Academic Publishers.

[4] Kearns, Michael. "Efficient noise-tolerant learning from statistical queries." Journal of the ACM 45.6 (1998): 983–1006.

[5] Brunk, Clifford A., and Michael J. Pazzani. "An investigation of noise-tolerant relational concept learning algorithms." Proceedings of the 8th International Workshop on Machine Learning. 1991.

[kv-6] Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory, chapter 5. MIT press.

[7] Angluin, D., & Laird, P. (1988). Learning from noisy examples . Machine Learning, 2(4), 343–370.

[kearns-8] 1 2 Kearns, M. (1998). [www.cis.upenn.edu/~mkearns/papers/sq-journal.pdf Efficient noise-tolerant learning from statistical queries]. Journal of the ACM, 45(6), 983–1006.

[9] Kearns, M., & Li, M. (1993). [www.cis.upenn.edu/~mkearns/papers/malicious.pdf Learning in the presence of malicious errors]. SIAM Journal on Computing, 22(4), 807–837.

[10] Goldman, S. A., & Sloan, Robert, H. (1991). The difficulty of random attribute noise. Technical Report WUCS 91 29, Washington University, Department of Computer Science.

[11] Sloan, R. H. (1989). Computational learning theory: New models and algorithms (Doctoral dissertation, Massachusetts Institute of Technology).

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]