The growth function, also called the shatter coefficient or the shattering number, measures the richness of a set family or class of functions. It is especially used in the context of statistical learning theory, where it is used to study properties of statistical learning methods. The term 'growth function' was coined by Vapnik and Chervonenkis in their 1968 paper, where they also proved many of its properties.[1] It is a basic concept in machine learning.[2][3]
Let $H$ be a set family (a set of sets) and $C$ a set. Their intersection is defined as the following set-family:
$$H \cap C := \{h \cap C \mid h \in H\}$$
The intersection-size (also called the index) of $H$ with respect to $C$ is $|H \cap C|$. If a set $C$ has $m$ elements then the index is at most $2^m$. If the index is exactly $2^m$ then the set $C$ is said to be shattered by $H$, because $H \cap C$ contains all the subsets of $C$, i.e.:
$$|H \cap C| = 2^{|C|}$$
The growth function measures the size of $H \cap C$ as a function of $|C|$. Formally:
$$\operatorname{Growth}(H,m) := \max_{C : |C| = m} |H \cap C|$$
Equivalently, let $H$ be a hypothesis-class (a set of binary functions) and $C = \{x_1, \ldots, x_m\}$ a set with $m$ elements. The restriction of $H$ to $C$ is the set of binary functions on $C$ that can be derived from $H$:[3]: 45
$$H_C := \{(h(x_1), \ldots, h(x_m)) \mid h \in H\}$$
The growth function measures the size of $H_C$ as a function of $|C|$:[3]: 49
$$\operatorname{Growth}(H,m) := \max_{C : |C| = m} |H_C|$$
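For a finite domain and a finite hypothesis class, the restriction-based definition can be evaluated directly by brute force. The following sketch is illustrative only (the helper names, the toy domain, and the threshold class are assumptions, not part of the cited sources):

```python
from itertools import combinations

def restriction(hypotheses, points):
    """H_C: the distinct label vectors that the hypotheses induce on the given points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def growth(hypotheses, domain, m):
    """Growth(H, m): the largest restriction size over all m-element subsets of the domain."""
    return max(len(restriction(hypotheses, C)) for C in combinations(domain, m))

# Example: threshold classifiers h_t(x) = 1 iff x > t (a "half-line" class).
domain = range(10)
thresholds = [lambda x, t=t: int(x > t) for t in range(-1, 10)]

for m in range(1, 6):
    print(m, growth(thresholds, domain, m))   # prints m+1 for every m
```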
1. The domain is the real line $\mathbb{R}$. The set-family $H$ contains all the half-lines (rays) from a given number to positive infinity, i.e., all sets of the form $\{x \in \mathbb{R} : x > x_0\}$ for some $x_0 \in \mathbb{R}$. For any set $C$ of $m$ real numbers, the intersection $H \cap C$ contains $m+1$ sets: the empty set, the set containing the largest element of $C$, the set containing the two largest elements of $C$, and so on. Therefore: $\operatorname{Growth}(H,m) = m + 1$.[1]: Ex.1 The same is true whether $H$ contains open half-lines, closed half-lines, or both.
2. The domain is the segment $[0,1]$. The set-family $H$ contains all the open sets. For any finite set $C$ of $m$ real numbers, the intersection $H \cap C$ contains all possible subsets of $C$. There are $2^m$ such subsets, so $\operatorname{Growth}(H,m) = 2^m$.[1]: Ex.2
3. The domain is the Euclidean space $\mathbb{R}^n$. The set-family $H$ contains all the half-spaces of the form $\{x : x \cdot \phi \geq 1\}$, where $\phi \in \mathbb{R}^n$ is a fixed vector. Then $\operatorname{Growth}(H,m) = \operatorname{Comp}(n,m)$, where $\operatorname{Comp}$ is the number of components in a partitioning of an $n$-dimensional space by $m$ hyperplanes.[1]: Ex.3
4. The domain is the real line $\mathbb{R}$. The set-family $H$ contains all the real intervals, i.e., all sets of the form $\{x \in \mathbb{R} : x_0 < x < x_1\}$ for some $x_0, x_1 \in \mathbb{R}$. For any set $C$ of $m$ real numbers, the intersection $H \cap C$ contains all runs of between 0 and $m$ consecutive elements of $C$. The number of such runs is $\binom{m+1}{2} + 1$, so $\operatorname{Growth}(H,m) = \binom{m+1}{2} + 1$.
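Example 4 is easy to check numerically by intersecting explicit intervals with a finite point set and counting the distinct traces. This is an illustrative sketch (function names and the integer point sets are assumptions):

```python
from math import comb

def interval_traces(points):
    """All distinct intersections of open intervals (a, b) with the given point set."""
    pts = sorted(points)
    cuts = [p - 0.5 for p in pts] + [pts[-1] + 0.5]   # candidate endpoints around the points
    return {frozenset(x for x in pts if a < x < b) for a in cuts for b in cuts}

for m in range(1, 8):
    assert len(interval_traces(range(m))) == comb(m + 1, 2) + 1   # = m(m+1)/2 + 1
print("interval growth matches C(m+1, 2) + 1 for m = 1..7")
```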
The main property that makes the growth function interesting is that it is either polynomial or exponential, and nothing in between.
The following is a property of the intersection-size:[1]: Lem.1
If, for some set $C_m$ of size $m$ and some number $n \leq m$, the intersection-size satisfies $|H \cap C_m| > \sum_{i=0}^{n-1} \binom{m}{i}$, then there exists a subset $C_n \subseteq C_m$ of size $n$ such that $|H \cap C_n| = 2^n$, i.e., $C_n$ is shattered by $H$.
This implies the following property of the growth function.[1]: Th.1 For every family $H$ there are two cases:
The exponential case: $\operatorname{Growth}(H,m) = 2^m$ for every $m$.
The polynomial case: $\operatorname{Growth}(H,m)$ is majorized by $\sum_{i=0}^{n-1} \binom{m}{i} \leq m^{n-1} + 1$, where $n$ is the smallest integer for which $\operatorname{Growth}(H,n) < 2^n$.
For any finite $H$:
$$\operatorname{Growth}(H,m) \leq |H|$$
since for every $C$, the number of elements in $H \cap C$ is at most $|H|$. Therefore, the growth function is mainly interesting when $H$ is infinite.
For any nonempty $H$:
$$\operatorname{Growth}(H,m) \leq 2^m$$
I.e., the growth function has an exponential upper bound.
We say that a set-family $H$ shatters a set $C$ if their intersection contains all possible subsets of $C$, i.e. $H \cap C = 2^C$. If $H$ shatters a set $C$ of size $m$, then $\operatorname{Growth}(H,m) = 2^m$, which is the upper bound.
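Shattering can be tested directly by comparing the set of traces with the full power set. The following is a minimal sketch with assumed helper names, not taken from the cited sources:

```python
def traces(family, C):
    """The intersection H ∩ C: all distinct traces h ∩ C for h in the family."""
    return {frozenset(h & C) for h in family}

def is_shattered(family, C):
    """C is shattered by H exactly when the number of traces is 2^|C|."""
    return len(traces(family, C)) == 2 ** len(C)

singletons = [{x} for x in range(5)]      # H = all one-element subsets of {0,...,4}
print(is_shattered(singletons, {0}))      # True:  the traces are {} and {0}
print(is_shattered(singletons, {0, 1}))   # False: the trace {0, 1} is never attained
```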
Define the Cartesian intersection of two set-families as:
$$H_1 \boxtimes H_2 := \{h_1 \cap h_2 \mid h_1 \in H_1, h_2 \in H_2\}$$
Then:[2]: 57
$$\operatorname{Growth}(H_1 \boxtimes H_2, m) \leq \operatorname{Growth}(H_1, m) \cdot \operatorname{Growth}(H_2, m)$$
For every two set-families:[2]: 58
$$\operatorname{Growth}(H_1 \cup H_2, m) \leq \operatorname{Growth}(H_1, m) + \operatorname{Growth}(H_2, m)$$
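Both composition bounds can be checked by brute force on small explicit families. The sketch below is illustrative only (the toy domain and the prefix/suffix families are assumptions); it builds discrete intervals as the Cartesian intersection of prefixes and suffixes:

```python
from itertools import combinations

def growth(family, domain, m):
    return max(len({frozenset(h & set(C)) for h in family})
               for C in combinations(domain, m))

domain = range(6)
H1 = [set(range(k)) for k in range(7)]         # prefixes {0,...,k-1} ("half-lines")
H2 = [set(range(k, 6)) for k in range(7)]      # suffixes {k,...,5}
H_box = [h1 & h2 for h1 in H1 for h2 in H2]    # Cartesian intersection: discrete intervals
H_cup = H1 + H2                                # union of the two families

for m in range(1, 5):
    assert growth(H_box, domain, m) <= growth(H1, domain, m) * growth(H2, domain, m)
    assert growth(H_cup, domain, m) <= growth(H1, domain, m) + growth(H2, domain, m)
print("both composition bounds hold for m = 1..4")
```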
The VC dimension of $H$ is defined according to these two cases:
In the polynomial case, $\operatorname{VCDim}(H) = n - 1$, i.e., the largest integer $d$ for which $\operatorname{Growth}(H,d) = 2^d$.
In the exponential case, $\operatorname{VCDim}(H) := \infty$.
So $\operatorname{VCDim}(H) \geq d$ if-and-only-if $\operatorname{Growth}(H,d) = 2^d$.
The growth function can be regarded as a refinement of the concept of VC dimension. The VC dimension only tells us whether $\operatorname{Growth}(H,d)$ is equal to or smaller than $2^d$, while the growth function tells us exactly how $\operatorname{Growth}(H,m)$ changes as a function of $m$.
Another connection between the growth function and the VC dimension is given by the Sauer–Shelah lemma:[3]: 49
If $\operatorname{VCDim}(H) = d$, then for all $m$:
$$\operatorname{Growth}(H,m) \leq \sum_{i=0}^{d} \binom{m}{i}$$
In particular, for all $m > d + 1$:
$$\operatorname{Growth}(H,m) \leq \left(\frac{em}{d}\right)^d = O(m^d)$$
so when the VC dimension is finite, the growth function grows polynomially with $m$.
This upper bound is tight, i.e., for all $m$ there exists an $H$ with VC dimension $d$ such that:[2]: 56
$$\operatorname{Growth}(H,m) = \sum_{i=0}^{d} \binom{m}{i}$$
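The lemma is straightforward to verify on a small discrete example. The sketch below (illustrative only, with assumed helper names) computes the VC dimension of the family of discrete intervals by brute force and checks the bound; for intervals the bound is in fact attained:

```python
from itertools import combinations
from math import comb

def traces(family, C):
    return {frozenset(h & set(C)) for h in family}

def growth(family, domain, m):
    return max(len(traces(family, C)) for C in combinations(domain, m))

def vc_dim(family, domain):
    """Largest k such that some k-element subset of the domain is shattered."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(len(traces(family, C)) == 2 ** k for C in combinations(domain, k)):
            d = k
    return d

domain = range(7)
intervals = [set(range(i, j)) for i in range(8) for j in range(i, 8)]   # all discrete intervals

d = vc_dim(intervals, domain)   # 2: intervals shatter every pair but no triple
for m in range(1, 8):
    assert growth(intervals, domain, m) <= sum(comb(m, i) for i in range(d + 1))
print("VCdim =", d, "and the Sauer-Shelah bound holds for m = 1..7")
```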
While the growth-function is related to the maximum intersection-size, the entropy is related to the average intersection-size:[1]: 272–273
$$\operatorname{Entropy}(H,m) := \mathbb{E}\big[\log_2 |H \cap C_m|\big]$$
where the expectation is over random sets $C_m$ of $m$ elements drawn independently according to a fixed probability measure on the domain.
The intersection-size has the following property. For every set-family $H$ and sets $C_1, C_2$:
$$|H \cap (C_1 \cup C_2)| \leq |H \cap C_1| \cdot |H \cap C_2|$$
Hence:
$$\operatorname{Entropy}(H, m_1 + m_2) \leq \operatorname{Entropy}(H, m_1) + \operatorname{Entropy}(H, m_2)$$
Moreover, the sequence $\operatorname{Entropy}(H,m)/m$ converges to a constant $c \in [0,1]$ when $m \to \infty$.
Moreover, the random variable $\log_2(|H \cap C_m|)/m$ is concentrated near $c$.
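The entropy can be approximated by straightforward Monte-Carlo sampling. The sketch below is illustrative only (the uniform measure on a small finite domain, the discrete-interval family, and the function names are assumptions); it estimates $\operatorname{Entropy}(H,m)$ and shows $\operatorname{Entropy}(H,m)/m$ shrinking, as expected for a family of polynomial growth:

```python
import random
from math import log2

N = 30
intervals = [set(range(i, j)) for i in range(N + 1) for j in range(i, N + 1)]

def trace_count(points):
    """|H ∩ C_m| for the interval family and a sampled multiset of points."""
    C = set(points)
    return len({frozenset(h & C) for h in intervals})

def entropy_estimate(m, trials=100):
    """Monte-Carlo estimate of E[log2 |H ∩ C_m|] under the uniform measure on {0,...,N-1}."""
    return sum(log2(trace_count([random.randrange(N) for _ in range(m)]))
               for _ in range(trials)) / trials

random.seed(0)
for m in (5, 15, 25):
    print(m, round(entropy_estimate(m) / m, 3))   # Entropy(H, m)/m decreases towards 0
```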
Let $\Omega$ be a set on which a probability measure $\Pr$ is defined. Let $H$ be a family of subsets of $\Omega$ (= a family of events).
Suppose we choose a set $C_m$ that contains $m$ elements of $\Omega$, where each element is chosen at random according to the probability measure $\Pr$, independently of the others (i.e., with replacement). For each event $h \in H$, we compare the following two quantities:
its relative frequency in $C_m$, i.e., $|h \cap C_m| / m$;
its probability $\Pr[h]$.
We are interested in the difference, $D(h, C_m) := \big|\, |h \cap C_m|/m - \Pr[h] \,\big|$. This difference satisfies the following upper bound:
$$\Pr\Big[\sup_{h \in H} D(h, C_m) > \varepsilon\Big] \leq 4 \cdot \operatorname{Growth}(H, 2m) \cdot e^{-\varepsilon^2 m / 8}$$
which is equivalent to:[1]: Th.2
$$\Pr\Big[\forall h \in H\colon D(h, C_m) \leq \varepsilon\Big] \geq 1 - 4 \cdot \operatorname{Growth}(H, 2m) \cdot e^{-\varepsilon^2 m / 8}$$
In words: the probability that, for all events in $H$, the relative-frequency is near the probability, is lower-bounded by an expression that depends on the growth-function of $H$.
A corollary of this is that, if the growth function is polynomial in $m$ (i.e., there exists some $n$ such that $\operatorname{Growth}(H,m) \leq m^n + 1$), then the above probability approaches 1 as $m \to \infty$. I.e., the family $H$ enjoys uniform convergence in probability.
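This corollary can be illustrated with a small simulation, given here as a sketch under assumed conditions (the uniform measure on [0,1] and the half-line family of Example 1, whose growth function is $m+1$): the worst-case gap between relative frequency and probability shrinks as $m$ grows.

```python
import random

def worst_gap(sample):
    """sup over events {x > t} of |relative frequency in the sample - Pr[x > t]|."""
    m = len(sample)
    xs = sorted(sample)
    gap = 0.0
    for i, t in enumerate(xs):                 # it suffices to check thresholds at sample points
        true_p = 1 - t                         # Pr[x > t] under the uniform measure on [0, 1]
        gap = max(gap, abs((m - i - 1) / m - true_p),   # threshold exactly at the point
                       abs((m - i) / m - true_p))       # threshold just below the point
    return gap

random.seed(0)
for m in (100, 1000, 10000):
    gaps = [worst_gap([random.random() for _ in range(m)]) for _ in range(20)]
    print(m, round(max(gaps), 4))              # the worst observed gap decreases with m
```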