Concentration inequality

In probability theory, concentration inequalities provide mathematical bounds on the probability of a random variable deviating from some value (typically, its expected value).

The law of large numbers of classical probability theory states that sums of independent random variables, under mild conditions, concentrate around their expectation with high probability. Such sums are the most basic examples of random variables concentrated around their mean.

Concentration inequalities can be sorted according to how much information about the random variable is needed in order to use them.[citation needed]

Markov's inequality

Let $X$ be a random variable that is non-negative (almost surely). Then, for every constant $a > 0$,

$$\Pr(X \ge a) \le \frac{\operatorname{E}[X]}{a}.$$

Note the following extension to Markov's inequality: if $\Phi$ is a strictly increasing and non-negative function, then

$$\Pr(X \ge a) = \Pr(\Phi(X) \ge \Phi(a)) \le \frac{\operatorname{E}[\Phi(X)]}{\Phi(a)}.$$
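As an illustration beyond the original statement, Markov's bound can be checked by simulation; the following is a minimal Python sketch, assuming NumPy and an exponential test distribution.

```python
import numpy as np

# Minimal numerical check of Markov's inequality for a non-negative variable.
# Assumption (not from the source): X ~ Exponential(1), so E[X] = 1.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)

for a in [1.0, 2.0, 5.0]:
    empirical = np.mean(x >= a)   # simulated P(X >= a)
    markov = x.mean() / a         # E[X] / a
    print(f"a={a}: P(X>=a) ~ {empirical:.4f} <= E[X]/a ~ {markov:.4f}")
```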

Chebyshev's inequality

Chebyshev's inequality requires the following information on a random variable $X$: the expected value $\operatorname{E}[X]$ is finite, and the variance $\operatorname{Var}[X] = \operatorname{E}[(X - \operatorname{E}[X])^2]$ is finite.

Then, for every constant $a > 0$,

$$\Pr(|X - \operatorname{E}[X]| \ge a) \le \frac{\operatorname{Var}[X]}{a^2},$$

or equivalently,

$$\Pr(|X - \operatorname{E}[X]| \ge a \cdot \operatorname{Std}[X]) \le \frac{1}{a^2},$$

where $\operatorname{Std}[X]$ is the standard deviation of $X$.

Chebyshev's inequality can be seen as a special case of the generalized Markov's inequality applied to the random variable $|X - \operatorname{E}[X]|$ with $\Phi(x) = x^2$.
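Again as an illustration beyond the original text, a minimal Python sketch (assuming NumPy and uniform test data) compares the empirical deviation probability with the Chebyshev bound.

```python
import numpy as np

# Minimal numerical check of Chebyshev's inequality.
# Assumption (not from the source): X ~ Uniform(0, 1), so Var[X] = 1/12.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
mean, var = x.mean(), x.var()

for a in [0.2, 0.3, 0.4]:
    empirical = np.mean(np.abs(x - mean) >= a)  # simulated P(|X - E[X]| >= a)
    chebyshev = var / a**2                      # Var[X] / a^2
    print(f"a={a}: P(|X-E[X]|>=a) ~ {empirical:.4f} <= Var/a^2 ~ {chebyshev:.4f}")
```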

Vysochanskij–Petunin inequality

Let X be a random variable with unimodal distribution, mean μ and finite, non-zero variance σ². Then, for any $\lambda > \sqrt{8/3} = 1.63299\ldots$,

$$\Pr(|X - \mu| \ge \lambda\sigma) \le \frac{4}{9\lambda^2}.$$

(For a relatively elementary proof see e.g. [1] ).
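As a numerical illustration not part of the original statement, the sketch below checks the bound for an exponential distribution, which is unimodal; NumPy is assumed.

```python
import numpy as np

# Minimal numerical check of the Vysochanskij–Petunin inequality.
# Assumption (not from the source): X ~ Exponential(1), unimodal with mu = sigma = 1.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)
mu, sigma = 1.0, 1.0

for lam in [1.7, 2.0, 3.0]:  # lambda must exceed sqrt(8/3) ~ 1.633
    empirical = np.mean(np.abs(x - mu) >= lam * sigma)
    vp_bound = 4.0 / (9.0 * lam**2)
    print(f"lambda={lam}: P(|X-mu|>=lambda*sigma) ~ {empirical:.4f} <= {vp_bound:.4f}")
```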

One-sided Vysochanskij–Petunin inequality

For a unimodal random variable $X$ with mean $\mu$ and finite, non-zero variance $\sigma^2$, and for $r \ge 0$, the one-sided Vysochanskij–Petunin inequality [2] holds as follows:

$$\Pr(X - \mu \ge r) \le \begin{cases} \dfrac{4}{9}\,\dfrac{\sigma^2}{\sigma^2 + r^2} & \text{if } r^2 \ge \dfrac{5}{3}\sigma^2, \\[6pt] \dfrac{4}{3}\,\dfrac{\sigma^2}{\sigma^2 + r^2} - \dfrac{1}{3} & \text{otherwise.} \end{cases}$$

Paley–Zygmund inequality

In contrast to most commonly used concentration inequalities, the Paley–Zygmund inequality provides a lower bound on the deviation probability.

Cantelli's inequality

Gauss's inequality

Chernoff bounds

The generic Chernoff bound [3]: 63–65  requires the moment generating function of $X$, defined as $M_X(t) := \operatorname{E}\left[e^{tX}\right]$. It always exists, but may be infinite. From Markov's inequality, for every $t > 0$:

$$\Pr(X \ge a) \le \frac{\operatorname{E}[e^{tX}]}{e^{ta}},$$

and for every $t < 0$:

$$\Pr(X \le a) \le \frac{\operatorname{E}[e^{tX}]}{e^{ta}}.$$

There are various Chernoff bounds for different distributions and different values of the parameter $t$. See [4]: 5–7  for a compilation of more concentration inequalities.
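As an illustration beyond the original text, the generic bound can be evaluated numerically by minimising $\operatorname{E}[e^{tX}]/e^{ta}$ over a grid of $t$ values; the sketch below assumes NumPy and a binomial test distribution, whose moment generating function is known in closed form.

```python
import numpy as np

# Minimal sketch of the generic Chernoff bound: P(X >= a) <= inf_{t>0} E[e^{tX}] / e^{ta}.
# Assumption (not from the source): X ~ Binomial(n, p), with MGF (1 - p + p*e^t)^n.
n, p, a = 100, 0.5, 60

ts = np.linspace(1e-3, 5.0, 5000)             # grid search over t > 0
mgf = (1.0 - p + p * np.exp(ts)) ** n         # E[e^{tX}]
chernoff = np.min(mgf / np.exp(ts * a))       # tightest bound on the grid

rng = np.random.default_rng(0)
empirical = np.mean(rng.binomial(n, p, size=1_000_000) >= a)
print(f"P(X>={a}) ~ {empirical:.5f} <= Chernoff bound ~ {chernoff:.5f}")
```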

Bounds on sums of independent bounded variables

Let $X_1, X_2, \dots, X_n$ be independent random variables such that, for all i:

$$a_i \le X_i \le b_i \quad \text{almost surely},$$
$$c_i := b_i - a_i, \qquad \forall i: c_i \le C.$$

Let $S_n$ be their sum, $E_n$ its expected value and $V_n$ its variance:

$$S_n := \sum_{i=1}^n X_i, \qquad E_n := \operatorname{E}[S_n] = \sum_{i=1}^n \operatorname{E}[X_i], \qquad V_n := \operatorname{Var}[S_n] = \sum_{i=1}^n \operatorname{Var}[X_i].$$

It is often interesting to bound the difference between the sum and its expected value. Several inequalities can be used.

1. Hoeffding's inequality says that:

$$\Pr\left[|S_n - E_n| > t\right] \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right) \le 2\exp\left(-\frac{2t^2}{nC^2}\right).$$

(A numerical sketch of this bound appears after this list.)

2. The random variable $S_n - E_n$ is a special case of a martingale, and $S_0 - E_0 = 0$. Hence, the general form of Azuma's inequality can also be used and it yields a similar bound:

$$\Pr\left[|S_n - E_n| > t\right] \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$

This is a generalization of Hoeffding's since it can handle other types of martingales, as well as supermartingales and submartingales. See Fan et al. (2015). [5] Note that if the simpler form of Azuma's inequality is used, the exponent in the bound is worse by a factor of 4.

3. The sum function, $S_n = f(X_1, \dots, X_n)$, is a special case of a function of n variables. This function changes in a bounded way: if variable i is changed, the value of f changes by at most $c_i = b_i - a_i$. Hence, McDiarmid's inequality can also be used and it yields a similar bound:

$$\Pr\left[|S_n - E_n| > t\right] \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$

This is a different generalization of Hoeffding's since it can handle other functions besides the sum function, as long as they change in a bounded way.

4. Bennett's inequality offers some improvement over Hoeffding's when the variances of the summands are small compared to their almost-sure bounds C. It says that:

$$\Pr\left[|S_n - E_n| > t\right] \le 2\exp\left[-\frac{V_n}{C^2}\, h\!\left(\frac{Ct}{V_n}\right)\right],$$

where

$$h(u) = (1+u)\log(1+u) - u.$$

5. The first of Bernstein's inequalities says that:

$$\Pr\left[|S_n - E_n| > t\right] \le 2\exp\left(-\frac{t^2/2}{V_n + Ct/3}\right).$$

This is a generalization of Hoeffding's since it takes into account not only the almost-sure bound on the summands but also a bound on their variance.

6. Chernoff bounds have a particularly simple form in the case of a sum of independent variables, since $\operatorname{E}\left[e^{t S_n}\right] = \prod_{i=1}^n \operatorname{E}\left[e^{t X_i}\right]$.

For example, [6] suppose the variables $X_i$ satisfy $X_i \ge \operatorname{E}[X_i] - a_i - M$, for $1 \le i \le n$. Then we have the lower tail inequality:

$$\Pr\left[S_n - E_n < -\lambda\right] \le \exp\left(-\frac{\lambda^2}{2\left(V_n + \sum_{i=1}^n a_i^2 + M\lambda/3\right)}\right).$$

If $X_i$ satisfies $X_i \le \operatorname{E}[X_i] + a_i + M$, we have the upper tail inequality:

$$\Pr\left[S_n - E_n > \lambda\right] \le \exp\left(-\frac{\lambda^2}{2\left(V_n + \sum_{i=1}^n a_i^2 + M\lambda/3\right)}\right).$$

If $X_1, \dots, X_n$ are i.i.d. with $\operatorname{E}[X_i] = 0$ and $|X_i| \le 1$, and $\sigma^2$ is the variance of $X_i$, a typical version of the Chernoff inequality is:

$$\Pr\left(|S_n| \ge k\sigma\sqrt{n}\right) \le 2 e^{-k^2/4} \qquad \text{for } 0 \le k \le 2\sigma\sqrt{n}.$$

7. Similar bounds can be found in: Rademacher distribution#Bounds on sums
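The following minimal Python sketch, referred to in item 1 above, compares the empirical tail of $S_n - E_n$ with Hoeffding's bound; NumPy and uniform summands are assumptions not taken from the original text.

```python
import numpy as np

# Compare the empirical tail of S_n - E_n with Hoeffding's bound for bounded summands.
# Assumption (not from the source): X_i ~ Uniform(0, 1), so c_i = b_i - a_i = 1.
rng = np.random.default_rng(0)
n, trials = 50, 200_000
sums = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)
e_n = n * 0.5              # E_n for Uniform(0, 1) summands
sum_c_sq = n * 1.0**2      # sum of c_i^2 with c_i = 1

for t in [3.0, 5.0, 7.0]:
    empirical = np.mean(np.abs(sums - e_n) > t)
    hoeffding = 2.0 * np.exp(-2.0 * t**2 / sum_c_sq)
    print(f"t={t}: P(|S_n-E_n|>t) ~ {empirical:.5f} <= Hoeffding ~ {hoeffding:.5f}")
```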

Efron–Stein inequality

The Efron–Stein inequality (or influence inequality, or MG bound on variance) bounds the variance of a general function.

Suppose that $X_1, \dots, X_n$, $X_1', \dots, X_n'$ are independent with $X_i'$ and $X_i$ having the same distribution for all $i$.

Let $X = (X_1, \dots, X_n)$ and $X^{(i)} = (X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$. Then

$$\operatorname{Var}(f(X)) \le \frac{1}{2} \sum_{i=1}^n \operatorname{E}\left[\left(f(X) - f(X^{(i)})\right)^2\right].$$

A proof may be found in, e.g., [7].
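As an illustration not found in the original text, the sketch below numerically compares the variance of $f(X) = \max(X_1, \dots, X_n)$ with the Efron–Stein bound; NumPy and uniform coordinates are assumptions.

```python
import numpy as np

# Minimal numerical check of the Efron–Stein inequality for f(X) = max(X_1, ..., X_n).
# Assumption (not from the source): X_i ~ Uniform(0, 1) i.i.d.; X' is an independent copy.
rng = np.random.default_rng(0)
n, trials = 10, 100_000
x = rng.uniform(size=(trials, n))
x_prime = rng.uniform(size=(trials, n))

f_x = x.max(axis=1)
rhs_terms = np.zeros(trials)
for i in range(n):
    x_i = x.copy()
    x_i[:, i] = x_prime[:, i]       # replace coordinate i by its independent copy
    rhs_terms += (f_x - x_i.max(axis=1)) ** 2

lhs = f_x.var()                     # Var(f(X))
rhs = 0.5 * rhs_terms.mean()        # (1/2) * sum_i E[(f(X) - f(X^(i)))^2]
print(f"Var(f(X)) ~ {lhs:.5f} <= Efron–Stein bound ~ {rhs:.5f}")
```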

Bretagnolle–Huber–Carol inequality

The Bretagnolle–Huber–Carol inequality bounds the difference between a vector of multinomially distributed random variables and a vector of expected values. [8] [9] A simple proof appears in [10] (Appendix Section).

If a random vector $(Z_1, \dots, Z_k)$ is multinomially distributed with parameters $(p_1, \dots, p_k)$ and satisfies $\sum_{i=1}^k Z_i = n$, then for every $\varepsilon > 0$,

$$\Pr\left(\sum_{i=1}^k |Z_i - n p_i| \ge 2\sqrt{n}\,\varepsilon\right) \le 2^k e^{-2\varepsilon^2}.$$

This inequality is used to bound the total variation distance.
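As a numerical illustration beyond the original text, the sketch below checks the stated bound by simulating a multinomial vector; NumPy, k = 4 equiprobable categories, and n = 500 are assumptions.

```python
import numpy as np

# Minimal numerical check of the Bretagnolle–Huber–Carol inequality as stated above.
# Assumptions (not from the source): k = 4 equiprobable categories, n = 500 trials.
rng = np.random.default_rng(0)
k, n, trials = 4, 500, 100_000
p = np.full(k, 1.0 / k)
counts = rng.multinomial(n, p, size=trials)
l1_dev = np.abs(counts - n * p).sum(axis=1)   # sum_i |Z_i - n p_i|

for eps in [1.5, 2.0, 2.5]:
    empirical = np.mean(l1_dev >= 2.0 * np.sqrt(n) * eps)
    bhc = (2.0 ** k) * np.exp(-2.0 * eps**2)
    print(f"eps={eps}: P(sum|Z_i-n*p_i| >= 2*sqrt(n)*eps) ~ {empirical:.5f} <= {bhc:.5f}")
```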

Mason and van Zwet inequality

The Mason and van Zwet inequality [11] for multinomial random vectors concerns a slight modification of the classical chi-square statistic.

Let the random vector be multinomially distributed with parameters and such that for Then for every and there exist constants such that for all and satisfying and we have

Dvoretzky–Kiefer–Wolfowitz inequality

The Dvoretzky–Kiefer–Wolfowitz inequality bounds the difference between the real and the empirical cumulative distribution function.

Given a natural number $n$, let $X_1, X_2, \dots, X_n$ be real-valued independent and identically distributed random variables with cumulative distribution function F(·). Let $F_n$ denote the associated empirical distribution function defined by

$$F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le x\}, \qquad x \in \mathbb{R}.$$

So $F(x)$ is the probability that a single random variable $X$ is smaller than $x$, and $F_n(x)$ is the fraction of the random variables that are smaller than $x$.

Then

$$\Pr\left(\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \varepsilon\right) \le 2 e^{-2n\varepsilon^2} \qquad \text{for every } \varepsilon > 0.$$
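As an illustration beyond the original text, the sketch below estimates the probability of a large sup-deviation of the empirical CDF and compares it with the DKW bound; NumPy and a Uniform(0, 1) sample (for which F(x) = x) are assumptions.

```python
import numpy as np

# Minimal numerical check of the Dvoretzky–Kiefer–Wolfowitz inequality.
# Assumption (not from the source): X_i ~ Uniform(0, 1), so F(x) = x and the supremum
# can be evaluated at the order statistics.
rng = np.random.default_rng(0)
n, trials, eps = 100, 20_000, 0.1

violations = 0
for _ in range(trials):
    x = np.sort(rng.uniform(size=n))
    upper = np.max(np.arange(1, n + 1) / n - x)   # sup_x (F_n(x) - F(x))
    lower = np.max(x - np.arange(0, n) / n)       # sup_x (F(x) - F_n(x))
    violations += max(upper, lower) > eps

empirical = violations / trials
dkw = 2.0 * np.exp(-2.0 * n * eps**2)
print(f"P(sup|F_n-F| > {eps}) ~ {empirical:.4f} <= DKW bound ~ {dkw:.4f}")
```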

Anti-concentration inequalities

Anti-concentration inequalities, on the other hand, provide an upper bound on how much a random variable can concentrate around a quantity.

For example, Rao and Yehudayoff [12] show that there exists some such that, for most directions of the hypercube , the following is true:[ clarification needed ]

where is drawn uniformly from a subset of large enough size.

Such inequalities are of importance in several fields, including communication complexity (e.g., in proofs of the gap Hamming problem [13] ) and graph theory. [14]

An interesting anti-concentration inequality for weighted sums of independent Rademacher random variables can be obtained using the Paley–Zygmund and the Khintchine inequalities. [15]


References

  1. Pukelsheim, F., 1994. The Three Sigma Rule. The American Statistician, 48(2), pp. 88–91
  2. Mercadier, Mathieu; Strobel, Frank (2021-11-16). "A one-sided Vysochanskii-Petunin inequality with financial applications". European Journal of Operational Research. 295 (1): 374–377. doi:10.1016/j.ejor.2021.02.041. ISSN   0377-2217.
  3. Mitzenmacher, Michael; Upfal, Eli (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press. ISBN   0-521-83540-2.
  4. Slagle, N.P. (2012). "One Hundred Statistics and Probability Inequalities". arXiv: 2102.07234 .
  5. Fan, X.; Grama, I.; Liu, Q. (2015). "Exponential inequalities for martingales with applications". Electronic Journal of Probability. Electron. J. Probab. 20. 20: 1–22. arXiv: 1311.6273 . doi:10.1214/EJP.v20-3496.
  6. Chung, Fan; Lu, Linyuan (2010). "Old and new concentration inequalities" (PDF). Complex Graphs and Networks. American Mathematical Society . Retrieved August 14, 2018.
  7. Boucheron, Stéphane; Lugosi, Gábor; Bousquet, Olivier (2004). "Concentration inequalities". Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2–14, 2003, Tübingen, Germany, August 4–16, 2003, Revised Lectures. Springer: 208–240.
  8. Bretagnolle, Jean; Huber, Catherine (1978). Lois empiriques et distance de Prokhorov. Lecture Notes in Mathematics. Vol. 649. pp. 332–341. doi:10.1007/BFb0064609. ISBN   978-3-540-08761-8.
  9. van der Vaart, A.W.; Wellner, J.A. (1996). Weak convergence and empirical processes: With applications to statistics. Springer Science & Business Media.
  10. Yuto Ushioda; Masato Tanaka; Tomomi Matsui (2022). "Monte Carlo Methods for the Shapley–Shubik Power Index". Games. 13 (3): 44. arXiv: 2101.02841 . doi: 10.3390/g13030044 .
  11. Mason, David M.; Willem R. Van Zwet (1987). "A Refinement of the KMT Inequality for the Uniform Empirical Process". The Annals of Probability. 15 (3): 871–884. doi: 10.1214/aop/1176992070 .
  12. Rao, Anup; Yehudayoff, Amir (2018). "Anti-concentration in most directions". Electronic Colloquium on Computational Complexity.
  13. Sherstov, Alexander A. (2012). "The Communication Complexity of Gap Hamming Distance". Theory of Computing.
  14. Matthew Kwan; Benny Sudakov; Tuan Tran (2018). "Anticoncentration for subgraph statistics". Journal of the London Mathematical Society. 99 (3): 757–777. arXiv: 1807.05202 . Bibcode:2018arXiv180705202K. doi:10.1112/jlms.12192. S2CID   54065186.
  15. Veraar, Mark (2009). "On Khintchine inequalities with a weight". arXiv: 0909.2586v1 [math.PR].