Bernstein inequalities (probability theory)

In probability theory, Bernstein inequalities give bounds on the probability that the sum of random variables deviates from its mean. In the simplest case, let X1, ..., Xn be independent Bernoulli random variables taking values +1 and −1 with probability 1/2 (this distribution is also known as the Rademacher distribution); then for every positive $\varepsilon$,

$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i\right| > \varepsilon\right) \le 2\exp\left(-\frac{n\varepsilon^2}{2(1+\varepsilon/3)}\right).$$
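This bound can be checked numerically. The following Monte Carlo sketch (the sample size, $\varepsilon$, and number of trials are illustrative choices, not part of the statement) estimates the left-hand side by simulation and compares it with the right-hand side:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 200, 0.25, 100_000

# Rademacher samples: each entry is +1 or -1 with probability 1/2.
x = rng.choice([-1.0, 1.0], size=(trials, n))
sample_means = x.mean(axis=1)

# Empirical estimate of P(|(1/n) * sum X_i| > eps).
empirical = np.mean(np.abs(sample_means) > eps)

# Right-hand side of the displayed bound: 2 * exp(-n*eps^2 / (2*(1 + eps/3))).
bound = 2.0 * np.exp(-n * eps**2 / (2.0 * (1.0 + eps / 3.0)))

print(f"empirical tail probability: {empirical:.5f}")
print(f"Bernstein bound:            {bound:.5f}")
```

With these illustrative values the empirical tail probability is far below the bound, as expected for an exponential tail estimate.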

Bernstein inequalities were proven and published by Sergei Bernstein in the 1920s and 1930s.[1][2][3][4] Later, these inequalities were rediscovered several times in various forms. Thus, special cases of the Bernstein inequalities are also known as the Chernoff bound, Hoeffding's inequality and Azuma's inequality. The martingale case of the Bernstein inequality is known as Freedman's inequality[5] and its refinement is known as Hoeffding's inequality.[6]

Some of the inequalities

1. Let $X_1, \ldots, X_n$ be independent zero-mean random variables. Suppose that $|X_i| \le M$ almost surely, for all $i$. Then, for all positive $t$,

$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge t\right) \le \exp\left(-\frac{\tfrac{1}{2}t^2}{\sum_{i=1}^{n} \mathbb{E}\left[X_i^2\right] + \tfrac{1}{3}Mt}\right).$$

(A numerical comparison of this bound with Hoeffding's inequality is sketched at the end of this section.)

2. Let $X_1, \ldots, X_n$ be independent zero-mean random variables. Suppose that for some positive real $L$ and every integer $k \ge 2$,

$$\mathbb{E}\left[\left|X_i\right|^k\right] \le \frac{1}{2}\,\mathbb{E}\left[X_i^2\right]\, L^{k-2}\, k!$$

Then

$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge 2t\sqrt{\sum_{i=1}^{n} \mathbb{E}\left[X_i^2\right]}\right) < \exp\left(-t^2\right), \qquad \text{for } 0 < t \le \frac{1}{2L}\sqrt{\sum_{i=1}^{n} \mathbb{E}\left[X_i^2\right]}.$$

3. Let $X_1, \ldots, X_n$ be independent zero-mean random variables. Suppose that

$$\mathbb{E}\left[\left|X_i\right|^k\right] \le \frac{k!}{4!}\left(\frac{L}{5}\right)^{k-4}$$

for all integer $k \ge 4$. Denote

$$A_k = \sum_{i=1}^{n} \mathbb{E}\left[X_i^k\right].$$

Then,

$$\mathbb{P}\left(\left|\sum_{i=1}^{n} X_i - \frac{A_3\, t^2}{3 A_2}\right| \ge \sqrt{2 A_2}\, t\left(1 + \frac{A_4\, t^2}{6 A_2^2}\right)\right) < 2\exp\left(-t^2\right), \qquad \text{for } 0 < t \le \frac{5\sqrt{2 A_2}}{4L}.$$

4. Bernstein also proved generalizations of the inequalities above to weakly dependent random variables. For example, inequality (2) can be extended as follows. Let $X_1, \ldots, X_n$ be possibly non-independent random variables. Suppose that for all integers $i > 0$,

$$\mathbb{E}\left[X_i \mid X_1, \ldots, X_{i-1}\right] = 0,$$
$$\mathbb{E}\left[X_i^2 \mid X_1, \ldots, X_{i-1}\right] \le R_i\, \mathbb{E}\left[X_i^2\right],$$
$$\mathbb{E}\left[X_i^k \mid X_1, \ldots, X_{i-1}\right] \le \tfrac{1}{2}\,\mathbb{E}\left[X_i^2 \mid X_1, \ldots, X_{i-1}\right]\, L^{k-2}\, k!,$$

where the last condition holds for some positive real $L$ and every integer $k \ge 2$. Then

$$\mathbb{P}\left(\sum_{i=1}^{n} X_i \ge 2t\sqrt{\sum_{i=1}^{n} R_i\, \mathbb{E}\left[X_i^2\right]}\right) < \exp\left(-t^2\right), \qquad \text{for } 0 < t \le \frac{1}{2L}\sqrt{\sum_{i=1}^{n} R_i\, \mathbb{E}\left[X_i^2\right]}.$$

More general results for martingales can be found in Fan et al. (2015). [7]
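Inequality (1) is most useful when the total variance $\sum_{i} \mathbb{E}[X_i^2]$ is much smaller than $nM^2$, in which case it improves substantially on Hoeffding's inequality $\exp\!\left(-t^2/(2nM^2)\right)$ for zero-mean variables bounded by $M$. The following sketch (the function names and example values are illustrative, not taken from the sources above) evaluates the two bounds side by side:

```python
import math

def bernstein_bound(t, sum_var, M):
    """Bernstein bound (1): P(sum X_i >= t) <= exp(-(t^2/2) / (sum_var + M*t/3)),
    for independent zero-mean X_i with |X_i| <= M and sum_var = sum E[X_i^2]."""
    return math.exp(-0.5 * t**2 / (sum_var + M * t / 3.0))

def hoeffding_bound(t, n, M):
    """Hoeffding bound for n independent zero-mean variables with |X_i| <= M:
    P(sum X_i >= t) <= exp(-t^2 / (2*n*M^2))."""
    return math.exp(-t**2 / (2.0 * n * M**2))

# Illustrative example: n variables bounded by M = 1, each with variance 0.01,
# so the total variance (n * 0.01) is much smaller than n * M^2.
n, M, sum_var, t = 1000, 1.0, 1000 * 0.01, 30.0
print(f"Bernstein: {bernstein_bound(t, sum_var, M):.3e}")
print(f"Hoeffding: {hoeffding_bound(t, n, M):.3e}")
```

In this small-variance regime the Bernstein bound is many orders of magnitude smaller than the Hoeffding bound, which ignores the variance information.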

Proofs

The proofs are based on an application of Markov's inequality to the random variable

$$\exp\left(\lambda \sum_{i=1}^{n} X_i\right),$$

for a suitable choice of the parameter $\lambda > 0$.
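Concretely, writing $S_n = X_1 + \cdots + X_n$, exponentiation and Markov's inequality give, for any $\lambda > 0$ (a standard Chernoff-type step, spelled out here for illustration),

$$\mathbb{P}\left(S_n \ge t\right) = \mathbb{P}\left(e^{\lambda S_n} \ge e^{\lambda t}\right) \le e^{-\lambda t}\,\mathbb{E}\left[e^{\lambda S_n}\right] = e^{-\lambda t}\prod_{i=1}^{n}\mathbb{E}\left[e^{\lambda X_i}\right],$$

where the last equality uses independence. The moment or boundedness assumptions are then used to bound each factor $\mathbb{E}\left[e^{\lambda X_i}\right]$, and optimizing over $\lambda$ yields the exponential bounds above.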

Generalizations

The Bernstein inequality can be generalized to Gaussian random matrices. Let

$$G = g^{H} A g + 2\,\operatorname{Re}\!\left(g^{H} a\right)$$

be a scalar, where $A$ is a complex Hermitian matrix and $a$ is a complex vector of size $N$. The vector $g \sim \mathcal{CN}(0, I_N)$ is a complex Gaussian vector of size $N$. Then for any $\sigma \ge 0$, we have

$$\mathbb{P}\left(G > \operatorname{tr}(A) + \sqrt{2\sigma}\,\sqrt{\left\|\operatorname{vec}(A)\right\|^2 + 2\left\|a\right\|^2} + \sigma\, s^{+}(A)\right) < \exp(-\sigma),$$

where $\operatorname{vec}(\cdot)$ is the vectorization operation and where $s^{+}(A) = \max\!\left(\lambda_{\max}(A), 0\right)$, with $\lambda_{\max}(A)$ the largest eigenvalue of $A$. The proof is detailed in [8]. Another similar inequality is formulated as

$$\mathbb{P}\left(G < \operatorname{tr}(A) - \sqrt{2\sigma}\,\sqrt{\left\|\operatorname{vec}(A)\right\|^2 + 2\left\|a\right\|^2} - \sigma\, s^{-}(A)\right) < \exp(-\sigma),$$

where $s^{-}(A) = \max\!\left(\lambda_{\max}(-A), 0\right)$.
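As a rough numerical illustration of the upper-tail statement (using the form given above; the matrix size, the particular choices of $A$ and $a$, the value of $\sigma$, and the Monte Carlo setup are illustrative assumptions, not part of the result), the deviation probability can be estimated by simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials, sigma = 8, 200_000, 1.5

# Random complex Hermitian matrix A and complex vector a (illustrative choices).
B = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
A = (B + B.conj().T) / 2.0
a = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2.0)

# Circularly symmetric complex Gaussian vectors g ~ CN(0, I_N), one per row.
g = (rng.standard_normal((trials, N)) + 1j * rng.standard_normal((trials, N))) / np.sqrt(2.0)

# G = g^H A g + 2 Re(g^H a), real-valued since A is Hermitian.
G = np.einsum('ti,ij,tj->t', g.conj(), A, g).real + 2.0 * (g.conj() @ a).real

# Upper-tail threshold from the inequality stated above.
fro2 = np.linalg.norm(A, 'fro') ** 2            # ||vec(A)||^2
s_plus = max(np.linalg.eigvalsh(A).max(), 0.0)  # s^+(A)
threshold = (np.trace(A).real
             + np.sqrt(2 * sigma) * np.sqrt(fro2 + 2 * np.linalg.norm(a) ** 2)
             + sigma * s_plus)

print(f"empirical P(G > threshold): {np.mean(G > threshold):.4f}")
print(f"exp(-sigma):                {np.exp(-sigma):.4f}")
```

The empirical tail probability should stay below $\exp(-\sigma)$, typically by a wide margin, since the bound is not tight for any fixed instance.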

See also

Related Research Articles

<span class="mw-page-title-main">Cumulative distribution function</span> Probability that random variable X is less than or equal to x

In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable , or just distribution function of , evaluated at , is the probability that will take a value less than or equal to .

The Cauchy–Schwarz inequality is an upper bound on the inner product between two vectors in an inner product space in terms of the product of the vector norms. It is considered one of the most important and widely used inequalities in mathematics.

<span class="mw-page-title-main">Central limit theorem</span> Fundamental theorem in probability theory and statistics

In probability theory, the central limit theorem (CLT) states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution. This holds even if the original variables themselves are not normally distributed. There are several versions of the CLT, each applying in the context of different conditions.

In probability theory, Chebyshev's inequality provides an upper bound on the probability of deviation of a random variable from its mean. More specifically, the probability that a random variable deviates from its mean by more than is at most , where is any positive constant and is the standard deviation.

<span class="mw-page-title-main">Jensen's inequality</span> Theorem of convex functions

In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation; it is a simple corollary that the opposite is true of concave transformations.

In probability theory, the Vysochanskij–Petunin inequality gives a lower bound for the probability that a random variable with finite variance lies within a certain number of standard deviations of the variable's mean, or equivalently an upper bound for the probability that it lies further away. The sole restrictions on the distribution are that it be unimodal and have finite variance; here unimodal implies that it is a continuous probability distribution except at the mode, which may have a non-zero probability.

In probability theory, the Azuma–Hoeffding inequality gives a concentration result for the values of martingales that have bounded differences.

In probability theory, the central limit theorem states that, under certain circumstances, the probability distribution of the scaled mean of a random sample converges to a normal distribution as the sample size increases to infinity. Under stronger assumptions, the Berry–Esseen theorem, or Berry–Esseen inequality, gives a more quantitative result, because it also specifies the rate at which this convergence takes place by giving a bound on the maximal error of approximation between the normal distribution and the true distribution of the scaled sample mean. The approximation is measured by the Kolmogorov–Smirnov distance. In the case of independent samples, the convergence rate is n−1/2, where n is the sample size, and the constant is estimated in terms of the third absolute normalized moment.

In probability theory, a Chernoff bound is an exponentially decreasing upper bound on the tail of a random variable based on its moment generating function. The minimum of all such exponential bounds forms the Chernoff or Chernoff-Cramér bound, which may decay faster than exponential. It is especially useful for sums of independent random variables, such as sums of Bernoulli random variables.

In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. Hoeffding's inequality was proven by Wassily Hoeffding in 1963.

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

In probability theory and statistics, the Rademacher distribution is a discrete probability distribution where a random variate X has a 50% chance of being +1 and a 50% chance of being −1.

In probability theory, the central limit theorem says that, under certain conditions, the sum of many independent identically-distributed random variables, when scaled appropriately, converges in distribution to a standard normal distribution. The martingale central limit theorem generalizes this result for random variables to martingales, which are stochastic processes where the change in the value of the process from time t to time t + 1 has expectation zero, even conditioned on previous outcomes.

In the mathematical theory of probability, a Doob martingale is a stochastic process that approximates a given random variable and has the martingale property with respect to the given filtration. It may be thought of as the evolving sequence of best approximations to the random variable based on information accumulated up to a certain time.

In mathematics, the theory of optimal stopping or early stopping is concerned with the problem of choosing a time to take a particular action, in order to maximise an expected reward or minimise an expected cost. Optimal stopping problems can be found in areas of statistics, economics, and mathematical finance. A key example of an optimal stopping problem is the secretary problem. Optimal stopping problems can often be written in the form of a Bellman equation, and are therefore often solved using dynamic programming.

In probability theory and theoretical computer science, McDiarmid's inequality is a concentration inequality which bounds the deviation between the sampled value and the expected value of certain functions when they are evaluated on independent random variables. McDiarmid's inequality applies to functions that satisfy a bounded differences property, meaning that replacing a single argument to the function while leaving all other arguments unchanged cannot cause too large of a change in the value of the function.

In probability theory, Bennett's inequality provides an upper bound on the probability that the sum of independent random variables deviates from its expected value by more than any specified amount. Bennett's inequality was proved by George Bennett of the University of New South Wales in 1962.

In probability theory, concentration inequalities provide mathematical bounds on the probability of a random variable deviating from some value. The deviation or other function of the random variable can be thought of as a secondary random variable. The simplest example of the concentration of such a secondary random variable is the CDF of the first random variable which concentrates the probability to unity. If an analytic form of the CDF is available this provides a concentration equality that provides the exact probability of concentration. It is precisely when the CDF is difficult to calculate or even the exact form of the first random variable is unknown that the applicable concentration inequalities provide useful insight.

For certain applications in linear algebra, it is useful to know properties of the probability distribution of the largest eigenvalue of a finite sum of random matrices. Suppose is a finite sequence of random matrices. Analogous to the well-known Chernoff bound for sums of scalars, a bound on the following is sought for a given parameter t:

In probability theory, a subgaussian distribution, the distribution of a subgaussian random variable, is a probability distribution with strong tail decay. More specifically, the tails of a subgaussian distribution are dominated by the tails of a Gaussian. This property gives subgaussian distributions their name.

References

  1. Bernstein, S. N. "On a modification of Chebyshev's inequality and of the error formula of Laplace", vol. 4, #5 (original publication: Ann. Sci. Inst. Sav. Ukraine, Sect. Math. 1, 1924).
  2. Bernstein, S. N. (1937). "Об определенных модификациях неравенства Чебышева" [On certain modifications of Chebyshev's inequality]. Doklady Akademii Nauk SSSR. 17 (6): 275–277.
  3. Bernstein, S. N. (1927). Theory of Probability (in Russian). Moscow.
  4. Uspensky, J. V. (1937). Introduction to Mathematical Probability. McGraw-Hill Book Company.
  5. Freedman, D. A. (1975). "On tail probabilities for martingales". Ann. Probab. 3: 100–118.
  6. Fan, X.; Grama, I.; Liu, Q. (2012). "Hoeffding's inequality for supermartingales". Stochastic Process. Appl. 122: 3545–3559.
  7. Fan, X.; Grama, I.; Liu, Q. (2015). "Exponential inequalities for martingales with applications". Electron. J. Probab. 20: 1–22. arXiv:1311.6273. doi:10.1214/EJP.v20-3496. S2CID 119713171.
  8. Bechar, Ikhlef (2009). "A Bernstein-type inequality for stochastic processes of quadratic forms of Gaussian variables". arXiv:0909.3595 [math.ST].

(according to: S. N. Bernstein, Collected Works, Nauka, 1964)

A modern translation of some of these results can also be found in Prokhorov, A. V.; Korneichuk, N. P.; Motornyi, V. P. (2001) [1994], "Bernstein inequality", Encyclopedia of Mathematics, EMS Press.