Uniform convergence in probability

Last updated

Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory.

Contents

The law of large numbers says that, for each single event , its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many application however, the need arises to judge simultaneously the probabilities of events of an entire class from one and the same sample. Moreover it, is required that the relative frequency of the events converge to the probability uniformly over the entire class of events [1] The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event-family is sufficiently simple (its VC dimension is sufficiently small) then uniform convergence holds.

Definitions

For a class of predicates defined on a set and a set of samples , where , the empirical frequency of on is

The theoretical probability of is defined as

The Uniform Convergence Theorem states, roughly, that if is "simple" and we draw samples independently (with replacement) from according to any distribution , then with high probability, the empirical frequency will be close to its expected value, which is the theoretical probability. [2]

Here "simple" means that the Vapnik–Chervonenkis dimension of the class is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.

The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis [1] using the concept of growth function.

Uniform convergence theorem

The statement of the uniform convergence theorem is as follows: [3]

If is a set of -valued functions defined on a set and is a probability distribution on then for and a positive integer, we have:

where, for any ,
and . indicates that the probability is taken over consisting of i.i.d. draws from the distribution .
is defined as: For any -valued functions over and ,

And for any natural number , the shattering number is defined as:

From the point of Learning Theory one can consider to be the Concept/Hypothesis class defined over the instance set . Before getting into the details of the proof of the theorem we will state Sauer's Lemma which we will need in our proof.

Sauer–Shelah lemma

The Sauer–Shelah lemma [4] relates the shattering number to the VC Dimension.

Lemma:, where is the VC Dimension of the concept class .

Corollary:.

Proof of uniform convergence theorem

[1] and [3] are the sources of the proof below. Before we get into the details of the proof of the Uniform Convergence Theorem we will present a high level overview of the proof.

  1. Symmetrization: We transform the problem of analyzing into the problem of analyzing , where and are i.i.d samples of size drawn according to the distribution . One can view as the original randomly drawn sample of length , while may be thought as the testing sample which is used to estimate .
  2. Permutation: Since and are picked identically and independently, so swapping elements between them will not change the probability distribution on and . So, we will try to bound the probability of for some by considering the effect of a specific collection of permutations of the joint sample . Specifically, we consider permutations which swap and in some subset of . The symbol means the concatenation of and .[ citation needed ]
  3. Reduction to a finite class: We can now restrict the function class to a fixed joint sample and hence, if has finite VC Dimension, it reduces to the problem to one involving a finite function class.

We present the technical details of the proof.

Symmetrization

Lemma: Let and

Then for , .

Proof: By the triangle inequality,
if and then .

Therefore,

since and are independent.

Now for fix an such that . For this , we shall show that

Thus for any , and hence . And hence we perform the first step of our high level idea.

Notice, is a binomial random variable with expectation and variance . By Chebyshev's inequality we get

for the mentioned bound on . Here we use the fact that for .

Permutations

Let be the set of all permutations of that swaps and in some subset of .

Lemma: Let be any subset of and any probability distribution on . Then,

where the expectation is over chosen according to , and the probability is over chosen uniformly from .

Proof: For any

(since coordinate permutations preserve the product distribution .)

The maximum is guaranteed to exist since there is only a finite set of values that probability under a random permutation can take.

Reduction to a finite class

Lemma: Basing on the previous lemma,

.

Proof: Let us define and which is at most . This means there are functions such that for any between and with for

We see that iff for some in satisfies, . Hence if we define if and otherwise.

For and , we have that iff for some in satisfies . By union bound we get

Since, the distribution over the permutations is uniform for each , so equals , with equal probability.

Thus,

where the probability on the right is over and both the possibilities are equally likely. By Hoeffding's inequality, this is at most .

Finally, combining all the three parts of the proof we get the Uniform Convergence Theorem.

Related Research Articles

In mathematics, an infinite series of numbers is said to converge absolutely if the sum of the absolute values of the summands is finite. More precisely, a real or complex series is said to converge absolutely if for some real number Similarly, an improper integral of a function, is said to converge absolutely if the integral of the absolute value of the integrand is finite—that is, if

<span class="mw-page-title-main">Wiener process</span> Stochastic process generalizing Brownian motion

In mathematics, the Wiener process is a real-valued continuous-time stochastic process named in honor of American mathematician Norbert Wiener for his investigations on the mathematical properties of the one-dimensional Brownian motion. It is often also called Brownian motion due to its historical connection with the physical process of the same name originally observed by Scottish botanist Robert Brown. It is one of the best known Lévy processes and occurs frequently in pure and applied mathematics, economics, quantitative finance, evolutionary biology, and physics.

In probability theory, Chebyshev's inequality guarantees that, for a wide class of probability distributions, no more than a certain fraction of values can be more than a certain distance from the mean. Specifically, no more than 1/k2 of the distribution's values can be k or more standard deviations away from the mean. The rule is often called Chebyshev's theorem, about the range of standard deviations around the mean, in statistics. The inequality has great utility because it can be applied to any probability distribution in which the mean and variance are defined. For example, it can be used to prove the weak law of large numbers.

<span class="mw-page-title-main">Law of large numbers</span> Averages of repeated trials converge to the expected value

In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of independent identical trials should be close to the expected value and tends to become closer to the expected value as more trials are performed. Additionally, the LLN has 2 forms: the Weak Law of Large Numbers and the Strong Law of Large Numbers.

In probability theory, Markov's inequality gives an upper bound for the probability that a non-negative function of a random variable is greater than or equal to some positive constant. It is named after the Russian mathematician Andrey Markov, although it appeared earlier in the work of Pafnuty Chebyshev, and many sources, especially in analysis, refer to it as Chebyshev's inequality or Bienaymé's inequality.

In information theory, the asymptotic equipartition property (AEP) is a general property of the output samples of a stochastic source. It is fundamental to the concept of typical set used in theories of data compression.

In probability theory, a Chernoff bound is an exponentially decreasing upper bound on the tail of a random variable based on its moment generating function. The minimum of all such exponential bounds forms the Chernoff or Chernoff-Cramér bound, which may decay faster than exponential. It is especially useful for sums of independent random variables, such as sums of Bernoulli random variables.

In information theory, Shannon's source coding theorem establishes the statistical limits to possible data compression for data whose source is an independent identically-distributed random variable, and the operational meaning of the Shannon entropy.

<span class="mw-page-title-main">Consistent estimator</span> Statistical estimator converging in probability to a true parameter as sample size increases

In statistics, a consistent estimator or asymptotically consistent estimator is an estimator—a rule for computing estimates of a parameter θ0—having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to θ0. This means that the distributions of the estimates become more and more concentrated near the true value of the parameter being estimated, so that the probability of the estimator being arbitrarily close to θ0 converges to one.

<span class="mw-page-title-main">Continuous uniform distribution</span> Uniform distribution on an interval

In probability theory and statistics, the continuous uniform distributions or rectangular distributions are a family of symmetric probability distributions. Such a distribution describes an experiment where there is an arbitrary outcome that lies between certain bounds. The bounds are defined by the parameters, and which are the minimum and maximum values. The interval can either be closed or open. Therefore, the distribution is often abbreviated where stands for uniform distribution. The difference between the bounds defines the interval length; all intervals of the same length on the distribution's support are equally probable. It is the maximum entropy probability distribution for a random variable under no constraint other than that it is contained in the distribution's support.

In probability theory, Bernstein inequalities give bounds on the probability that the sum of random variables deviates from its mean. In the simplest case, let X1, ..., Xn be independent Bernoulli random variables taking values +1 and −1 with probability 1/2, then for every positive ,

In mathematics, uniform integrability is an important concept in real analysis, functional analysis and measure theory, and plays a vital role in the theory of martingales.

In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.

In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in space, where is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high leverage observation. Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted i.e., to be influential points. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the hat matrix.

The exponential mechanism is a technique for designing differentially private algorithms. It was developed by Frank McSherry and Kunal Talwar in 2007. Their work was recognized as a co-winner of the 2009 PET Award for Outstanding Research in Privacy Enhancing Technologies.

The purpose of this page is to provide supplementary materials for the ordinary least squares article, reducing the load of the main article with mathematics and improving its accessibility, while at the same time retaining the completeness of exposition.

In probability theory, concentration inequalities provide bounds on how a random variable deviates from some value. The law of large numbers of classical probability theory states that sums of independent random variables are, under very mild conditions, close to their expectation with a large probability. Such sums are the most basic examples of random variables concentrated around their mean. Recent results show that such behavior is shared by other functions of independent random variables.

In mathematics, singular integral operators of convolution type are the singular integral operators that arise on Rn and Tn through convolution by distributions; equivalently they are the singular integral operators that commute with translations. The classical examples in harmonic analysis are the harmonic conjugation operator on the circle, the Hilbert transform on the circle and the real line, the Beurling transform in the complex plane and the Riesz transforms in Euclidean space. The continuity of these operators on L2 is evident because the Fourier transform converts them into multiplication operators. Continuity on Lp spaces was first established by Marcel Riesz. The classical techniques include the use of Poisson integrals, interpolation theory and the Hardy–Littlewood maximal function. For more general operators, fundamental new techniques, introduced by Alberto Calderón and Antoni Zygmund in 1952, were developed by a number of authors to give general criteria for continuity on Lp spaces. This article explains the theory for the classical operators and sketches the subsequent general theory.

Stochastic portfolio theory (SPT) is a mathematical theory for analyzing stock market structure and portfolio behavior introduced by E. Robert Fernholz in 2002. It is descriptive as opposed to normative, and is consistent with the observed behavior of actual markets. Normative assumptions, which serve as a basis for earlier theories like modern portfolio theory (MPT) and the capital asset pricing model (CAPM), are absent from SPT.

In mathematics, the injective tensor product of two topological vector spaces (TVSs) was introduced by Alexander Grothendieck and was used by him to define nuclear spaces. An injective tensor product is in general not necessarily complete, so its completion is called the completed injective tensor products. Injective tensor products have applications outside of nuclear spaces. In particular, as described below, up to TVS-isomorphism, many TVSs that are defined for real or complex valued functions, for instance, the Schwartz space or the space of continuously differentiable functions, can be immediately extended to functions valued in a Hausdorff locally convex TVS without any need to extend definitions from real/complex-valued functions to -valued functions.

References

  1. 1 2 3 Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025. This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968. The translation was reproduced as: Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN   978-3-319-21851-9.
  2. "Martingales", Probability with Martingales, Cambridge University Press, pp. 93–105, 1991-02-14, retrieved 2023-12-08
  3. 1 2 Martin Anthony Peter, l. Bartlett. Neural Network Learning: Theoretical Foundations, pages 46–50. First Edition, 1999. Cambridge University Press ISBN   0-521-57353-X
  4. Sham Kakade and Ambuj Tewari, CMSC 35900 (Spring 2008) Learning Theory, Lecture 11