Occam learning

Last updated July 31, 2023

In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set.

Occam learnability implies PAC learnability, and for a wide variety of concept classes, the converse is also true: PAC learnability implies Occam learnability.

Introduction

Occam learning is named after Occam's razor, the principle that, all other things being equal, a shorter explanation for observed data should be favored over a lengthier one. The theory of Occam learning is a formal, mathematical justification for this principle. It was first shown by Blumer, et al.^[1] that Occam learning implies PAC learning, which is the standard model of learning in computational learning theory. In other words, parsimony (of the output hypothesis) implies predictive power.

Definition of Occam learning

The succinctness of a concept $c$ in concept class ${\mathcal {C}}$ can be expressed by the length $size(c)$ of the shortest bit string that can represent $c$ in ${\mathcal {C}}$ . Occam learning connects the succinctness of a learning algorithm's output to its predictive power on unseen data.

Let ${\mathcal {C}}$ and ${\mathcal {H}}$ be concept classes containing target concepts and hypotheses respectively. Then, for constants $\alpha \geq 0$ and $0\leq \beta <1$ , a learning algorithm $L$ is an $(\alpha ,\beta )$ -Occam algorithm for ${\mathcal {C}}$ using ${\mathcal {H}}$ iff, given a set $S=\{x_{1},\dots ,x_{m}\}$ of $m$ samples labeled according to a concept $c\in {\mathcal {C}}$ , $L$ outputs a hypothesis $h\in {\mathcal {H}}$ such that

$h$ is consistent with $c$ on $S$ (that is, $h(x)=c(x),\forall x\in S$ ), and
$size(h)\leq (n\cdot size(c))^{\alpha }m^{\beta }$ ^[2]^[1]

where $n$ is the maximum length of any sample $x\in S$ . An Occam algorithm is called efficient if it runs in time polynomial in $n$ , $m$ , and $size(c).$ We say a concept class ${\mathcal {C}}$ is Occam learnable with respect to a hypothesis class ${\mathcal {H}}$ if there exists an efficient Occam algorithm for ${\mathcal {C}}$ using ${\mathcal {H}}.$
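The two conditions in the definition can be checked mechanically for a single sample set. The following is a minimal sketch, assuming the hypothesis is available as a Python callable and that its encoded size in bits is supplied by the caller; the function name and all parameter names are illustrative and do not come from the cited sources.

```python
def satisfies_occam_conditions(h, samples, labels, h_size_bits,
                               n, size_c, alpha, beta):
    """Check both (alpha, beta)-Occam conditions on one labeled sample set.

    h            -- hypothesis as a callable (assumed representation)
    samples      -- the sample points x_1, ..., x_m
    labels       -- c(x_1), ..., c(x_m) for the target concept c
    h_size_bits  -- size(h), supplied by the caller
    n, size_c    -- maximum sample length and size(c)
    """
    if alpha < 0 or not (0 <= beta < 1):
        raise ValueError("need alpha >= 0 and 0 <= beta < 1")
    m = len(samples)
    # Condition 1: h agrees with the target concept on every sample.
    consistent = all(h(x) == y for x, y in zip(samples, labels))
    # Condition 2: size(h) <= (n * size(c))^alpha * m^beta.
    size_bound = (n * size_c) ** alpha * m ** beta
    return consistent and h_size_bits <= size_bound
```

Passing one such check does not make a learner an Occam algorithm, since the definition quantifies over every sample set and every target concept; the function only tests one instance.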

The relation between Occam and PAC learning

Occam learnability implies PAC learnability, as the following theorem of Blumer, et al.^[2] shows:

Theorem (Occam learning implies PAC learning)

Let $L$ be an efficient $(\alpha ,\beta )$-Occam algorithm for ${\mathcal {C}}$ using ${\mathcal {H}}$. Then there exists a constant $a>0$ such that for any $0<\epsilon ,\delta <1$ and any distribution ${\mathcal {D}}$, given $m\geq a\left({\frac {1}{\epsilon }}\log {\frac {1}{\delta }}+\left({\frac {(n\cdot size(c))^{\alpha }}{\epsilon }}\right)^{\frac {1}{1-\beta }}\right)$ samples of length $n$ bits each, drawn from ${\mathcal {D}}$ and labeled according to a concept $c\in {\mathcal {C}}$, the algorithm $L$ will output a hypothesis $h\in {\mathcal {H}}$ such that $error(h)\leq \epsilon$ with probability at least $1-\delta$.

Here, $error(h)$ is with respect to the concept $c$ and distribution ${\mathcal {D}}$ . This implies that the algorithm $L$ is also a PAC learner for the concept class ${\mathcal {C}}$ using hypothesis class ${\mathcal {H}}$ . A slightly more general formulation is as follows:
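To get a feel for how the bound scales, it can be evaluated numerically. This is a rough sketch only: the theorem guarantees that some constant $a>0$ exists but does not give its value, so the default `a=1.0` below is a placeholder assumption, as are the function and parameter names. Natural logarithms are used; a different base would be absorbed into $a$.

```python
import math

def occam_sample_bound(epsilon, delta, n, size_c, alpha, beta, a=1.0):
    """Evaluate a * ((1/eps) * log(1/delta) + ((n*size_c)^alpha / eps)^(1/(1-beta))).

    The constant `a` is NOT specified by the theorem; 1.0 is a placeholder.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    if alpha < 0 or not (0 <= beta < 1):
        raise ValueError("need alpha >= 0 and 0 <= beta < 1")
    confidence_term = (1.0 / epsilon) * math.log(1.0 / delta)
    size_term = ((n * size_c) ** alpha / epsilon) ** (1.0 / (1.0 - beta))
    return math.ceil(a * (confidence_term + size_term))
```

The exponent $1/(1-\beta)$ dominates as $\beta$ approaches 1: with the other arguments fixed, moving from `beta=0` to `beta=0.5` squares the second term.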

Theorem (Occam learning implies PAC learning, cardinality version)

Let $0<\epsilon ,\delta <1$. Let $L$ be an algorithm that, given $m$ samples of length $n$ bits each, drawn from a fixed but unknown distribution ${\mathcal {D}}$ and labeled according to a concept $c\in {\mathcal {C}}$, outputs a hypothesis $h\in {\mathcal {H}}_{n,m}$ that is consistent with the labeled samples. Then there exists a constant $b$ such that if $\log |{\mathcal {H}}_{n,m}|\leq b\epsilon m-\log {\frac {1}{\delta }}$, then $L$ is guaranteed to output a hypothesis $h\in {\mathcal {H}}_{n,m}$ such that $error(h)\leq \epsilon$ with probability at least $1-\delta$.

While the above theorems show that Occam learning is sufficient for PAC learning, they say nothing about necessity. Board and Pitt show that, for a wide variety of concept classes, Occam learning is in fact necessary for PAC learning.^[3] They proved that for any concept class that is polynomially closed under exception lists, PAC learnability implies the existence of an Occam algorithm for that concept class. Concept classes that are polynomially closed under exception lists include Boolean formulas, circuits, deterministic finite automata, decision lists, decision trees, and other geometrically defined concept classes.

A concept class ${\mathcal {C}}$ is polynomially closed under exception lists if there exists a polynomial-time algorithm $A$ that, when given the representation of a concept $c\in {\mathcal {C}}$ and a finite list $E$ of exceptions, outputs a representation of a concept $c'\in {\mathcal {C}}$ such that the concepts $c$ and $c'$ agree except on the set $E$.
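The semantics of an exception list can be illustrated with a small wrapper. This is only a sketch of what "agree except on the set $E$" means for Boolean-valued concepts, assuming concepts are Python callables and sample points are hashable; it does not show closure for any particular class, because the definition additionally requires the result to be represented inside ${\mathcal {C}}$ and computed in polynomial time. The name `with_exceptions` is hypothetical.

```python
def with_exceptions(c, exceptions):
    """Return c', which equals c everywhere except on `exceptions`, where it is flipped.

    c          -- a Boolean-valued concept given as a callable (assumed representation)
    exceptions -- a finite iterable of hashable sample points (the list E)
    """
    flipped = frozenset(exceptions)

    def c_prime(x):
        value = bool(c(x))
        # On the exception set the label is inverted; elsewhere c' agrees with c.
        return (not value) if x in flipped else value

    return c_prime
```

For example, wrapping the concept "x is even" with the exception list `[3, 4]` yields a concept that labels 3 positive and 4 negative while leaving every other integer unchanged.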

Proof that Occam learning implies PAC learning

We first prove the cardinality version. Call a hypothesis $h\in {\mathcal {H}}$ bad if $error(h)\geq \epsilon$, where again $error(h)$ is with respect to the true concept $c$ and the underlying distribution ${\mathcal {D}}$. For a fixed bad hypothesis $h$, the probability that $h$ is consistent with a set $S$ of $m$ samples is at most $(1-\epsilon )^{m}$, by the independence of the samples. By the union bound, the probability that there exists a bad hypothesis in ${\mathcal {H}}_{n,m}$ that is consistent with $S$ is at most $|{\mathcal {H}}_{n,m}|(1-\epsilon )^{m}$, which is at most $\delta$ if $\log |{\mathcal {H}}_{n,m}|\leq O(\epsilon m)-\log {\frac {1}{\delta }}$. Since $L$ always outputs a consistent hypothesis, its output is bad with probability at most $\delta$. This concludes the proof of the second theorem above.
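The step from the union bound to the condition on $\log |{\mathcal {H}}_{n,m}|$ can be written out explicitly. As a sketch, assuming natural logarithms and using the standard inequality $1-\epsilon \leq e^{-\epsilon }$:

$|{\mathcal {H}}_{n,m}|(1-\epsilon )^{m}\leq |{\mathcal {H}}_{n,m}|e^{-\epsilon m}$

$|{\mathcal {H}}_{n,m}|e^{-\epsilon m}\leq \delta \iff \ln |{\mathcal {H}}_{n,m}|\leq \epsilon m-\ln {\frac {1}{\delta }}$

With natural logarithms the constant in front of $\epsilon m$ is therefore 1; choosing a different logarithm base only rescales that constant.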

Using the second theorem, we can prove the first theorem. Since $L$ is an $(\alpha ,\beta )$-Occam algorithm, any hypothesis output by $L$ can be represented by at most $(n\cdot size(c))^{\alpha }m^{\beta }$ bits, and thus $\log |{\mathcal {H}}_{n,m}|\leq (n\cdot size(c))^{\alpha }m^{\beta }$. This is at most $O(\epsilon m)-\log {\frac {1}{\delta }}$ if we set $m\geq a\left({\frac {1}{\epsilon }}\log {\frac {1}{\delta }}+\left({\frac {(n\cdot size(c))^{\alpha }}{\epsilon }}\right)^{\frac {1}{1-\beta }}\right)$ for some constant $a>0$. Thus, by the cardinality version of the theorem, $L$ will output a hypothesis $h$ with $error(h)\leq \epsilon$ with probability at least $1-\delta$. This concludes the proof of the first theorem above.
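One way to see why this choice of $m$ suffices is to split the requirement into two halves. In the sketch below, $s=(n\cdot size(c))^{\alpha }$ and $b$ (the constant hidden in $O(\epsilon m)$) are shorthand introduced here for illustration:

$s\,m^{\beta }\leq {\frac {b\epsilon m}{2}}\iff m\geq \left({\frac {2s}{b\epsilon }}\right)^{\frac {1}{1-\beta }}$

$\log {\frac {1}{\delta }}\leq {\frac {b\epsilon m}{2}}\iff m\geq {\frac {2}{b\epsilon }}\log {\frac {1}{\delta }}$

If both hold, then $s\,m^{\beta }\leq {\frac {b\epsilon m}{2}}\leq b\epsilon m-\log {\frac {1}{\delta }}$. Any $m$ that is at least the sum of the two right-hand sides satisfies both conditions, which matches the stated bound with, for instance, $a=\max \left({\frac {2}{b}},\left({\frac {2}{b}}\right)^{\frac {1}{1-\beta }}\right)$.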

Improving sample complexity for common problems

Though Occam and PAC learnability are equivalent for many concept classes, the Occam framework can be used to produce tighter bounds on the sample complexity of classical problems, including conjunctions,^[2] conjunctions with few relevant variables,^[4] and decision lists.^[5]
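As a concrete illustration for conjunctions, the following sketches the standard textbook elimination algorithm; it is not taken verbatim from the cited works. It assumes samples are tuples of 0/1 values labeled by some conjunction of literals. The output is a conjunction over at most $2n$ literals, so its size does not grow with $m$, which corresponds to $\beta =0$. The function name and the literal encoding are assumptions made for this example.

```python
def learn_conjunction(samples, labels):
    """Elimination algorithm for conjunctions over n Boolean variables.

    A literal is encoded as (i, True) for x_i and (i, False) for NOT x_i.
    Returns the surviving literals and the hypothesis as a callable.
    """
    n = len(samples[0])
    # Start with the conjunction of all 2n literals (always false).
    literals = {(i, v) for i in range(n) for v in (True, False)}
    for x, y in zip(samples, labels):
        if y:
            # A positive example falsifies the literal opposite to each bit;
            # such literals cannot appear in the target, so drop them.
            literals -= {(i, not bool(x[i])) for i in range(n)}

    def h(x):
        return all(bool(x[i]) == v for i, v in literals)

    return literals, h
```

Only positive examples are used. If the labels come from a conjunction, every literal of the target survives elimination, so the hypothesis is at least as restrictive as the target and negative examples stay negative.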

Extensions

Occam algorithms have also been shown to be successful for PAC learning in the presence of errors,^[6]^[7] for learning probabilistic concepts,^[8] for function learning,^[9] and for learning from Markovian non-independent examples.^[10]

References

1. Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24(6), 377-380.
2. Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory, chapter 2. MIT Press.
3. Board, R., & Pitt, L. (1990, April). On the necessity of Occam algorithms. In Proceedings of the twenty-second annual ACM symposium on Theory of computing (pp. 54-63). ACM.
4. Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2), 177-221.
5. Rivest, R. L. (1987). Learning decision lists. Machine Learning, 2(3), 229-246.
6. Angluin, D., & Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4), 343-370.
7. Kearns, M., & Li, M. (1993). Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4), 807-837.
8. Kearns, M. J., & Schapire, R. E. (1990, October). Efficient distribution-free learning of probabilistic concepts. In Foundations of Computer Science, 1990. Proceedings., 31st Annual Symposium on (pp. 382-391). IEEE.
9. Natarajan, B. K. (1993, August). Occam's razor for functions. In Proceedings of the sixth annual conference on Computational learning theory (pp. 370-376). ACM.
10. Aldous, D., & Vazirani, U. (1990, October). A Markovian extension of Valiant's learning model. In Foundations of Computer Science, 1990. Proceedings., 31st Annual Symposium on (pp. 392-396). IEEE.