Hoeffding's inequality


In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. Hoeffding's inequality was proven by Wassily Hoeffding in 1963. [1]


Hoeffding's inequality is a special case of the Azuma–Hoeffding inequality and McDiarmid's inequality. It is similar to the Chernoff bound, but tends to be less sharp, in particular when the variance of the random variables is small. [2] It is similar to, but incomparable with, one of Bernstein's inequalities.

Statement

Let $X_1, \ldots, X_n$ be independent random variables such that $a_i \leq X_i \leq b_i$ almost surely. Consider the sum of these random variables,

$$S_n = X_1 + \cdots + X_n.$$

Then Hoeffding's theorem states that, for all $t > 0$, [3]

$$\operatorname{P}\left(S_n - \operatorname{E}[S_n] \geq t\right) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right),$$

$$\operatorname{P}\left(\left|S_n - \operatorname{E}[S_n]\right| \geq t\right) \leq 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

Here $\operatorname{E}[S_n]$ is the expected value of $S_n$.
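For illustration, the bound can be checked numerically. The following is a minimal Python sketch (the choice of Uniform(0, 1) summands, the parameters n and t, and the use of NumPy are illustrative assumptions, not part of the theorem) comparing a Monte Carlo estimate of the one-sided tail probability with the corresponding Hoeffding bound $\exp(-2t^2/n)$:

    import numpy as np

    # Hoeffding bound for a sum of n i.i.d. Uniform(0, 1) variables (a_i = 0, b_i = 1):
    # P(S_n - E[S_n] >= t) <= exp(-2 t^2 / sum_i (b_i - a_i)^2) = exp(-2 t^2 / n).
    rng = np.random.default_rng(0)
    n, t, trials = 100, 10.0, 100_000            # illustrative parameters

    S = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)
    deviations = S - n * 0.5                     # S_n - E[S_n], since E[X_i] = 1/2

    empirical_tail = np.mean(deviations >= t)    # Monte Carlo estimate of the tail
    hoeffding_bound = np.exp(-2 * t**2 / n)      # = exp(-2), about 0.135

    print(empirical_tail, hoeffding_bound)

As expected, the empirical tail is well below the bound, which is valid for any bounded summands and therefore not tight for this particular distribution.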

Note that the inequalities also hold when the Xi have been obtained using sampling without replacement; in this case the random variables are not independent anymore. A proof of this statement can be found in Hoeffding's paper. For slightly better bounds in the case of sampling without replacement, see for instance the paper by Serfling (1974).

Generalization

Let $X_1, \ldots, X_n$ be independent observations such that $\operatorname{E}[X_i] = 0$ and $a_i \leq X_i \leq b_i$. Let $\varepsilon > 0$. Then, for any $t > 0$, [4]

$$\operatorname{P}\left(\sum_{i=1}^n X_i \geq \varepsilon\right) \leq e^{-t\varepsilon} \prod_{i=1}^n e^{t^2 (b_i - a_i)^2 / 8}.$$
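Minimizing the right-hand side over the free parameter $t$ is a quadratic optimization (the same step carried out in the proof below) and recovers the bound of the preceding section; written out:

$$\inf_{t > 0}\, e^{-t\varepsilon} \prod_{i=1}^n e^{t^2 (b_i - a_i)^2 / 8} = \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right), \qquad \text{attained at } t = \frac{4\varepsilon}{\sum_{i=1}^n (b_i - a_i)^2}.$$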

Special Case: Bernoulli RVs

Suppose $a_i = 0$ and $b_i = 1$ for all $i$. This can occur when $X_i$ are independent Bernoulli random variables, though they need not be identically distributed. Then we get the inequality [5]

$$\operatorname{P}\left(\frac{S_n}{n} - \operatorname{E}\left[\frac{S_n}{n}\right] \geq t\right) \leq e^{-2nt^2},$$

or equivalently,

$$\operatorname{P}\left(S_n - \operatorname{E}[S_n] \geq nt\right) \leq e^{-2nt^2},$$

for all $t \geq 0$. This is a version of the additive Chernoff bound which is more general, since it allows for random variables that take values between zero and one, but also weaker, since the Chernoff bound gives a better tail bound when the random variables have small variance.
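As a worked numerical instance (with $n = 100$ fair-coin Bernoulli variables, so that $\operatorname{E}[S_n/n] = 1/2$, and $t = 0.1$; the values are chosen purely for illustration), the first form gives

$$\operatorname{P}\left(\frac{S_n}{n} - \frac{1}{2} \geq 0.1\right) \leq e^{-2 \cdot 100 \cdot (0.1)^2} = e^{-2} \approx 0.135.$$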

General case of random variables bounded from above

Hoeffding's inequality can be extended to the case of random variables that are bounded from above. [6]

Let $X_1, \ldots, X_n$ be independent random variables such that $\operatorname{E}[X_i] = 0$ and $X_i \leq B$ almost surely. Denote by

$$C_n^2 = \sum_{i=1}^n \operatorname{E}\left[X_i^2\right].$$

Hoeffding's inequality for random variables bounded from above states that for all $t \geq 0$,

$$\operatorname{P}\left(S_n \geq t\right) \leq \begin{cases} \exp\left(-\dfrac{t^2}{2 C_n^2}\right), & 0 \leq t \leq \dfrac{C_n^2}{B}, \\[1ex] \exp\left(-\dfrac{t}{2B}\right), & t > \dfrac{C_n^2}{B}. \end{cases}$$

In particular, if $\operatorname{E}\left[X_i^2\right] \leq B^2$ for all $i$, then for all $t \geq 0$,

$$\operatorname{P}\left(S_n \geq t\right) \leq \exp\left(-\frac{t^2}{2 n B^2}\right).$$

General case of sub-Gaussian random variables

The proof of Hoeffding's inequality can be generalized to any sub-Gaussian distribution. Recall that a random variable X is called sub-Gaussian [7] if

$$\operatorname{P}(|X| \geq t) \leq 2e^{-ct^2}$$

for some $c > 0$.

for some . For any bounded variable X, for for some sufficiently large T. Then for all so taking yields

for . So every bounded variable is sub-Gaussian.

For a random variable X, the following norm is finite if and only if X is sub-Gaussian:

$$\Vert X \Vert_{\psi_2} := \inf\left\{ c \geq 0 : \operatorname{E}\left[ e^{X^2 / c^2} \right] \leq 2 \right\}.$$

Then let $X_1, \ldots, X_n$ be zero-mean independent sub-Gaussian random variables. The general version of Hoeffding's inequality states that

$$\operatorname{P}\left(\left|\sum_{i=1}^n X_i\right| \geq t\right) \leq 2 \exp\left(-\frac{c t^2}{\sum_{i=1}^n \Vert X_i \Vert_{\psi_2}^2}\right),$$

where c > 0 is an absolute constant. [8]

Proof

The proof of Hoeffding's inequality is similar to that of other concentration inequalities such as the Chernoff bound. [9] The main difference is the use of Hoeffding's lemma:

Suppose X is a real random variable such that $X \in [a, b]$ almost surely. Then

$$\operatorname{E}\left[ e^{s(X - \operatorname{E}[X])} \right] \leq \exp\left( \tfrac{1}{8} s^2 (b - a)^2 \right) \quad \text{for all } s \in \mathbb{R}.$$
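As a minimal numerical sanity check of the lemma, one can take a Rademacher variable (values ±1 with probability 1/2 each, so $a = -1$, $b = 1$, and the moment generating function is $\cosh s$); the short Python sketch below uses NumPy and is purely illustrative:

    import numpy as np

    s = np.linspace(-5.0, 5.0, 1001)
    mgf = np.cosh(s)                          # E[exp(s(X - E[X]))] for Rademacher X (E[X] = 0)
    bound = np.exp(s**2 * (1 - (-1))**2 / 8)  # Hoeffding's lemma bound with a = -1, b = 1
    assert np.all(mgf <= bound + 1e-12)       # cosh(s) <= exp(s^2 / 2) for all s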

Using this lemma, we can prove Hoeffding's inequality. As in the theorem statement, suppose $X_1, \ldots, X_n$ are n independent random variables such that $a_i \leq X_i \leq b_i$ almost surely for all $i$, and let $S_n = X_1 + \cdots + X_n$.

Then for $s, t > 0$, Markov's inequality and the independence of the $X_i$ imply:

$$\begin{aligned} \operatorname{P}\left(S_n - \operatorname{E}[S_n] \geq t\right) &= \operatorname{P}\left(e^{s(S_n - \operatorname{E}[S_n])} \geq e^{st}\right) \\ &\leq e^{-st} \operatorname{E}\left[e^{s(S_n - \operatorname{E}[S_n])}\right] \\ &= e^{-st} \prod_{i=1}^n \operatorname{E}\left[e^{s(X_i - \operatorname{E}[X_i])}\right] \\ &\leq e^{-st} \prod_{i=1}^n e^{\frac{s^2 (b_i - a_i)^2}{8}} \\ &= \exp\left(-st + \tfrac{1}{8} s^2 \sum_{i=1}^n (b_i - a_i)^2\right). \end{aligned}$$

This upper bound is tightest for the value of $s$ that minimizes the expression inside the exponential. Since the exponent is a quadratic in $s$, the minimizer is easily found to be

$$s = \frac{4t}{\sum_{i=1}^n (b_i - a_i)^2}.$$
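Explicitly, setting the derivative of the (quadratic) exponent to zero and substituting the minimizer back gives

$$\frac{d}{ds}\left(-st + \frac{s^2}{8} \sum_{i=1}^n (b_i - a_i)^2\right) = -t + \frac{s}{4} \sum_{i=1}^n (b_i - a_i)^2 = 0,$$

$$-st + \frac{s^2}{8} \sum_{i=1}^n (b_i - a_i)^2 \;\Bigg|_{\, s = 4t / \sum_{i=1}^n (b_i - a_i)^2} = -\frac{4t^2}{\sum_{i=1}^n (b_i - a_i)^2} + \frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} = -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}.$$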

Evaluating the bound at this value of s, we get the desired bound:

$$\operatorname{P}\left(S_n - \operatorname{E}[S_n] \geq t\right) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

Usage

Confidence intervals

Hoeffding's inequality can be used to derive confidence intervals. We consider a coin that shows heads with probability p and tails with probability 1 − p. We toss the coin n times, generating n samples (which are i.i.d. Bernoulli random variables). The expected number of times the coin comes up heads is pn. Furthermore, the probability that the coin comes up heads at least k times can be exactly quantified by the following expression:

$$\operatorname{P}\left(H(n) \geq k\right) = \sum_{i=k}^{n} \binom{n}{i} p^i (1-p)^{n-i},$$

where H(n) is the number of heads in n coin tosses.

When k = (p + ε)n for some ε > 0, Hoeffding's inequality bounds this probability by a term that is exponentially small in ε2n:

$$\operatorname{P}\left(H(n) \geq (p + \varepsilon) n\right) \leq \exp\left(-2 \varepsilon^2 n\right).$$
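As a hedged numerical illustration (the parameters p, ε, n are chosen arbitrarily, and SciPy's binom.sf is used for the exact tail), a short Python sketch comparing the exact probability with the Hoeffding bound:

    import numpy as np
    from scipy.stats import binom

    # Compare the exact tail P(H(n) >= (p + eps) n) with the Hoeffding bound exp(-2 eps^2 n).
    p, eps, n = 0.5, 0.1, 100                  # illustrative parameters only
    k = int(np.ceil((p + eps) * n))            # threshold number of heads, here k = 60

    exact_tail = binom.sf(k - 1, n, p)         # P(H(n) >= k); sf(k-1) = P(X > k-1)
    hoeffding_bound = np.exp(-2 * eps**2 * n)  # = exp(-2), about 0.135

    print(exact_tail, hoeffding_bound)         # roughly 0.028 versus 0.135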

Since this bound holds on both sides of the mean, Hoeffding's inequality implies that the number of heads that we see is concentrated around its mean, with an exponentially small tail.

Thinking of $\hat{p} = \frac{1}{n} H(n)$ as the "observed" mean, this probability can be interpreted as the level of significance $\alpha$ (probability of making an error) for a confidence interval around $p$ of size $2\varepsilon$:

$$\alpha = \operatorname{P}\left(\hat{p} \notin [p - \varepsilon, p + \varepsilon]\right) \leq 2 e^{-2 \varepsilon^2 n}.$$

Solving the above for the smallest n that guarantees $2 e^{-2 \varepsilon^2 n} \leq \alpha$ gives us:

$$n \geq \frac{\ln(2/\alpha)}{2 \varepsilon^2}.$$

Therefore, we require at least $\frac{\ln(2/\alpha)}{2 \varepsilon^2}$ samples to acquire a $(1 - \alpha)$-confidence interval $p \approx \hat{p} \pm \varepsilon$.
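A minimal Python sketch of this sample-size calculation (the function name and the example values of ε and α are illustrative choices, not part of the original derivation):

    import math

    def hoeffding_sample_size(eps: float, alpha: float) -> int:
        """Smallest n with 2*exp(-2*eps**2*n) <= alpha, i.e. n >= ln(2/alpha) / (2*eps**2)."""
        return math.ceil(math.log(2.0 / alpha) / (2.0 * eps**2))

    # Example: a 95% confidence interval (alpha = 0.05) of half-width eps = 0.01
    print(hoeffding_sample_size(0.01, 0.05))   # -> 18445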

Hence, the cost of acquiring the confidence interval grows only logarithmically in 1/α (the reciprocal of the error probability) but quadratically in 1/ε (the reciprocal of the precision). Note that there are more efficient methods of estimating a confidence interval.


Notes

  1. Hoeffding (1963)
  2. Vershynin (2018, p. 19)
  3. Hoeffding (1963, Theorem 2)
  4. Wasserman, Larry (2004). "All of Statistics". Springer Texts in Statistics. doi:10.1007/978-0-387-21736-9. ISSN 1431-875X.
  5. Hoeffding (1963, Theorem 1)
  6. Fan, Grama & Liu (2015, Corollary 2.7)
  7. Kahane (1960)
  8. Vershynin (2018, Theorem 2.6.2)
  9. Boucheron (2013)


References

Boucheron, Stéphane; Lugosi, Gábor; Massart, Pascal (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Fan, Xiequan; Grama, Ion; Liu, Quansheng (2015). "Exponential inequalities for martingales with applications". Electronic Journal of Probability. 20: 1–22.

Hoeffding, Wassily (1963). "Probability inequalities for sums of bounded random variables". Journal of the American Statistical Association. 58 (301): 13–30.

Kahane, Jean-Pierre (1960). "Propriétés locales des fonctions à séries de Fourier aléatoires". Studia Mathematica. 19: 1–25.

Serfling, Robert J. (1974). "Probability inequalities for the sum in sampling without replacement". The Annals of Statistics. 2 (1): 39–48.

Vershynin, Roman (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.