In probability theory, a subgaussian distribution, the distribution of a subgaussian random variable, is a probability distribution with strong tail decay. More specifically, the tails of a subgaussian distribution are dominated by (i.e. decay at least as fast as) the tails of a Gaussian. This property gives subgaussian distributions their name.
Often in analysis, we divide an object (such as a random variable) into two parts, a central bulk and a distant tail, then analyze each separately. In probability, this division usually takes the form "Everything interesting happens near the center; the tail event is so rare that we may safely ignore it." Subgaussian distributions are worth studying because the Gaussian distribution is well understood, so we can give sharp bounds on the rarity of the tail event. Similarly, subexponential distributions are also worth studying.
Formally, the probability distribution of a random variable $X$ is called subgaussian if there is a positive constant $C$ such that for every $t \ge 0$,
$$\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/C^2).$$
There are many equivalent definitions. For example, a random variable $X$ is subgaussian iff its tail probability is bounded from above (up to a constant) by the tail probability of a Gaussian:
$$\operatorname{P}(|X| \ge t) \le c\,\operatorname{P}(|Z| \ge t) \quad \text{for all } t > 0,$$
where $c \ge 0$ is a constant and $Z$ is a mean-zero Gaussian random variable. [1]: Theorem 2.6
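The subgaussian norm of $X$, denoted as $\|X\|_{\psi_2}$, is
$$\|X\|_{\psi_2} = \inf\left\{c > 0 : \operatorname{E}\left[\exp\left(X^2/c^2\right)\right] \le 2\right\}.$$
In other words, it is the Orlicz norm of $X$ generated by the Orlicz function $\Phi(u) = e^{u^2} - 1$. By condition (2) below, subgaussian random variables can be characterized as those random variables with finite subgaussian norm.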
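As a rough numerical illustration of this definition, the following Python sketch finds $\inf\{c > 0 : \operatorname{E}[\exp(X^2/c^2)] \le 2\}$ by bisection. The helper names (`psi2_norm`, `rademacher`, `standard_gaussian`, `uniform_pm1`) are hypothetical and introduced only for this example, and the three example distributions (Rademacher, standard Gaussian, uniform on $[-1, 1]$) are assumptions chosen because their expectations are easy to evaluate.

```python
import math

def safe_exp(x):
    """exp(x) that returns +inf instead of raising on overflow."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

def psi2_norm(expect_exp_sq, lo=1e-6, hi=1e6, iters=200):
    """Approximate inf{c > 0 : E[exp(X^2 / c^2)] <= 2} by bisection.

    expect_exp_sq(c) must return E[exp(X^2 / c^2)] (possibly math.inf);
    this quantity is non-increasing in c, so bisection applies.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint, natural for a scale
        if expect_exp_sq(mid) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

def rademacher(c):
    # X = +1 or -1 with probability 1/2 each, so X^2 = 1 always.
    return safe_exp(1.0 / (c * c))

def standard_gaussian(c):
    # For Z ~ N(0, 1): E[exp(Z^2/c^2)] = (1 - 2/c^2)^(-1/2) when c^2 > 2.
    gap = 1.0 - 2.0 / (c * c)
    return gap ** -0.5 if gap > 0.0 else math.inf

def uniform_pm1(c, n=2000):
    # X uniform on [-1, 1]: E[exp(X^2/c^2)] = int_0^1 exp(x^2/c^2) dx,
    # approximated here with the midpoint rule.
    h = 1.0 / n
    return h * sum(safe_exp(((k + 0.5) * h / c) ** 2) for k in range(n))

print(psi2_norm(rademacher))         # about 1.2011 = 1/sqrt(ln 2)
print(psi2_norm(standard_gaussian))  # about 1.6330 = sqrt(8/3)
print(psi2_norm(uniform_pm1))        # about 0.7727
```

The Gaussian value follows from solving $(1 - 2/c^2)^{-1/2} = 2$, which gives $c^2 = 8/3$.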
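If there exists some $s^2$ such that $\operatorname{E}\left[e^{(X - \operatorname{E}[X])t}\right] \le e^{s^2 t^2/2}$ for all $t$, then $s^2$ is called a variance proxy, and the smallest such $s^2$ is called the optimal variance proxy and denoted by $\|X\|_{vp}^2$.
Since $\operatorname{E}\left[e^{(X - \operatorname{E}[X])t}\right] = e^{\sigma^2 t^2/2}$ when $X \sim \mathcal{N}(\mu, \sigma^2)$ is Gaussian, we then have $\|X\|_{vp}^2 = \sigma^2$, as it should.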
Let $X$ be a random variable. The following conditions are equivalent (Proposition 2.5.2 [2]):
Furthermore, the constant $K$ is the same in the definitions (1) to (5), up to an absolute constant. So for example, given a random variable satisfying (1) and (2), the minimal constants $K_1, K_2$ in the two definitions satisfy $K_1 \le cK_2$ and $K_2 \le c'K_1$, where $c, c'$ are constants independent of the random variable.
As an example, the first four definitions are equivalent by the proof below.
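Proof. $(1) \Rightarrow (3)$: Suppose $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/K_1^2)$ for all $t \ge 0$. By the layer cake representation,
$$\operatorname{E}|X|^p = \int_0^\infty \operatorname{P}(|X|^p \ge s)\,ds = \int_0^\infty p t^{p-1}\operatorname{P}(|X| \ge t)\,dt \le 2\int_0^\infty p t^{p-1}\exp(-t^2/K_1^2)\,dt.$$
After a change of variables $u = t^2/K_1^2$, we find that
$$\operatorname{E}|X|^p \le 2K_1^p\,\frac{p}{2}\int_0^\infty u^{p/2-1}e^{-u}\,du = 2K_1^p\,\frac{p}{2}\,\Gamma(p/2) = 2K_1^p\,\Gamma(p/2+1).$$
$(3) \Rightarrow (2)$: Suppose $\operatorname{E}|X|^p \le 2K_3^p\,\Gamma(p/2+1)$ for all $p \ge 1$. By the Taylor series $e^x = 1 + \sum_{p=1}^\infty x^p/p!$, for $0 \le \lambda < 1/K_3^2$,
$$\operatorname{E}[\exp(\lambda X^2)] = 1 + \sum_{p=1}^\infty \frac{\lambda^p\operatorname{E}[X^{2p}]}{p!} \le 1 + \sum_{p=1}^\infty \frac{2\lambda^p K_3^{2p}\,\Gamma(p+1)}{p!} = 1 + 2\sum_{p=1}^\infty (\lambda K_3^2)^p = \frac{2}{1-\lambda K_3^2} - 1,$$
which is less than or equal to $2$ for $\lambda \le \frac{1}{3K_3^2}$. Let $K_2 \ge \sqrt{3}\,K_3$, then $\operatorname{E}[\exp(X^2/K_2^2)] \le 2$.
$(2) \Rightarrow (1)$: By Markov's inequality,
$$\operatorname{P}(|X| \ge t) = \operatorname{P}\left(\exp(X^2/K_2^2) \ge \exp(t^2/K_2^2)\right) \le \frac{\operatorname{E}[\exp(X^2/K_2^2)]}{\exp(t^2/K_2^2)} \le 2\exp(-t^2/K_2^2).$$
$(3) \iff (4)$: by the asymptotic formula for the gamma function, $\Gamma(p/2+1) \sim \sqrt{\pi p}\left(\frac{p}{2e}\right)^{p/2}$.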
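From the proof, we can extract a cycle of three inequalities:
- If $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/K^2)$ for all $t \ge 0$, then $\operatorname{E}|X|^p \le 2K^p\,\Gamma(p/2+1)$ for all $p \ge 1$.
- If $\operatorname{E}|X|^p \le 2K^p\,\Gamma(p/2+1)$ for all $p \ge 1$, then $\|X\|_{\psi_2} \le \sqrt{3}\,K$.
- If $\|X\|_{\psi_2} \le K$, then $\operatorname{P}(|X| \ge t) \le 2\exp(-t^2/K^2)$ for all $t \ge 0$.
In particular, the constants $K$ provided by the definitions are the same up to a constant factor, so we can say that the definitions are equivalent up to a constant independent of $X$.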
Similarly, because $\Gamma(p/2+1)^{1/p}$ equals $\sqrt{p}$ up to a positive multiplicative constant for all $p \ge 1$, the definitions (3) and (4) are also equivalent up to a constant.
Proposition.
Proposition. (Chernoff bound) If $X$ is subgaussian, then $\operatorname{P}(X - \operatorname{E}[X] \ge t) \le e^{-t^2/(2\|X\|_{vp}^2)}$ for all $t \ge 0$.
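To see how a Chernoff bound of the form $e^{-t^2/(2s^2)}$ (with $s^2$ a variance proxy) compares with a true tail, the sketch below uses a centered Gaussian, for which the exact upper tail is available through the complementary error function and the optimal variance proxy equals the variance. The function names are hypothetical and chosen only for this illustration.

```python
import math

def gaussian_upper_tail(t, sigma=1.0):
    """Exact P(X >= t) for X ~ N(0, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc(t / (sigma * math.sqrt(2.0)))

def chernoff_bound(t, variance_proxy):
    """Subgaussian Chernoff bound exp(-t^2 / (2 * variance_proxy)) for t >= 0."""
    return math.exp(-t * t / (2.0 * variance_proxy))

# For a Gaussian, the optimal variance proxy is sigma^2 itself.
sigma = 1.0
for t in (0.5, 1.0, 2.0, 3.0, 4.0):
    exact = gaussian_upper_tail(t, sigma)
    bound = chernoff_bound(t, sigma ** 2)
    print(f"t={t}: exact tail={exact:.3e}, Chernoff bound={bound:.3e}")
```

The bound is loose by a polynomial factor in $t$ but captures the correct exponential rate.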
Definition. $X \lesssim X'$ means that $X \le CX'$, where the positive constant $C$ is independent of $X$ and $X'$.
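Proposition. If $X$ is subgaussian, then $\|X - \operatorname{E}[X]\|_{\psi_2} \lesssim \|X\|_{\psi_2}$.
Proof. By the triangle inequality, $\|X - \operatorname{E}[X]\|_{\psi_2} \le \|X\|_{\psi_2} + \|\operatorname{E}[X]\|_{\psi_2}$. Now we have $\|\operatorname{E}[X]\|_{\psi_2} = |\operatorname{E}[X]|/\sqrt{\ln 2} \le \operatorname{E}|X|/\sqrt{\ln 2} \lesssim \operatorname{E}|X|$. By the equivalence of definitions (2) and (4) of subgaussianity, given above, we have $\operatorname{E}|X| \lesssim \|X\|_{\psi_2}$.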
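Proposition. If $X, Y$ are subgaussian and independent, then $\|X+Y\|_{vp}^2 \le \|X\|_{vp}^2 + \|Y\|_{vp}^2$.
Proof. If independent, then use that the cumulant generating function of independent random variables is additive. That is, $\ln\operatorname{E}[e^{t(X+Y)}] = \ln\operatorname{E}[e^{tX}] + \ln\operatorname{E}[e^{tY}]$.
If not independent, then by Hölder's inequality, for any $p, q > 1$ with $1/p + 1/q = 1$ we have (taking $X, Y$ centered)
$$\operatorname{E}[e^{t(X+Y)}] \le \left(\operatorname{E}[e^{ptX}]\right)^{1/p}\left(\operatorname{E}[e^{qtY}]\right)^{1/q} \le \exp\left(\tfrac{1}{2}t^2\left(p\|X\|_{vp}^2 + q\|Y\|_{vp}^2\right)\right).$$
Solving the optimization problem $\min_{1/p+1/q=1}\left(p\|X\|_{vp}^2 + q\|Y\|_{vp}^2\right)$, we obtain the weaker bound $\|X+Y\|_{vp} \le \|X\|_{vp} + \|Y\|_{vp}$, which holds without independence.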
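The optimization step in the non-independent case can be worked out explicitly. The following is a short sketch, under the assumption that both variance proxies are strictly positive; the abbreviations $a$, $b$ and $f$ are introduced here only for this computation.

```latex
% Notation (introduced for this sketch): a = \|X\|_{vp} > 0, b = \|Y\|_{vp} > 0.
% The constraint 1/p + 1/q = 1 with p > 1 gives q = p/(p-1).
\begin{aligned}
f(p) &= p\,a^2 + \frac{p}{p-1}\,b^2, \qquad p > 1, \\
f'(p) &= a^2 - \frac{b^2}{(p-1)^2} = 0
  \;\Longrightarrow\; p = 1 + \frac{b}{a}, \quad q = 1 + \frac{a}{b}, \\
f\!\left(1 + \tfrac{b}{a}\right) &= a^2 + ab + ab + b^2 = (a+b)^2 .
\end{aligned}
```

So the minimized exponent is $\tfrac{1}{2}t^2(a+b)^2$, which corresponds to a variance proxy of $(\|X\|_{vp} + \|Y\|_{vp})^2$ for the sum.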
Corollary. Linear sums of subgaussian random variables are subgaussian.
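Expanding the cumulant generating function:
$$\tfrac{1}{2}\|X\|_{vp}^2\,t^2 \ge \ln\operatorname{E}\left[e^{t(X - \operatorname{E}[X])}\right] = \tfrac{1}{2}\operatorname{Var}[X]\,t^2 + \tfrac{1}{6}\kappa_3 t^3 + \cdots,$$
where $\kappa_3$ is the third cumulant, we find that $\operatorname{Var}[X] \le \|X\|_{vp}^2$. At the edge of possibility, a random variable $X$ satisfying $\operatorname{Var}[X] = \|X\|_{vp}^2$ is called strictly subgaussian.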
Theorem. [5] Let $X$ be a subgaussian random variable with mean zero. If all zeros of its characteristic function are real, then $X$ is strictly subgaussian.
Corollary. If $X_1, \dots, X_n$ are independent and strictly subgaussian, then any linear sum of them is strictly subgaussian.
By calculating the characteristic functions, we can show that some distributions are strictly subgaussian: symmetric uniform distribution, symmetric Bernoulli distribution.
Since a symmetric uniform distribution is strictly subgaussian, its convolution with itself is strictly subgaussian. That is, the symmetric triangular distribution is strictly subgaussian.
Since the symmetric Bernoulli distribution is strictly subgaussian, any symmetric Binomial distribution is strictly subgaussian.
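Strict subgaussianity can also be checked numerically in simple cases. The sketch below assumes two concrete examples, the Rademacher distribution (variance 1, moment generating function $\cosh t$) and the uniform distribution on $[-1, 1]$ (variance $1/3$, moment generating function $\sinh(t)/t$), and tests on a grid whether the moment generating function stays below $e^{\operatorname{Var}[X]\,t^2/2}$. The function names are hypothetical, and a grid check is evidence rather than a proof.

```python
import math

def mgf_rademacher(t):
    """E[exp(tX)] for X = +1 or -1 with probability 1/2 each."""
    return math.cosh(t)

def mgf_uniform_pm1(t):
    """E[exp(tX)] for X uniform on [-1, 1]."""
    return math.sinh(t) / t if t != 0.0 else 1.0

def dominated_by_gaussian_mgf(mgf, variance, ts, rel_tol=1e-12):
    """Check mgf(t) <= exp(variance * t^2 / 2) on the grid ts (up to rounding)."""
    return all(
        mgf(t) <= math.exp(variance * t * t / 2.0) * (1.0 + rel_tol) for t in ts
    )

grid = [k / 100.0 for k in range(-1000, 1001)]
# Using the true variance as the variance proxy is the strict-subgaussian claim.
print(dominated_by_gaussian_mgf(mgf_rademacher, 1.0, grid))
print(dominated_by_gaussian_mgf(mgf_uniform_pm1, 1.0 / 3.0, grid))
```

Both checks follow from comparing Taylor coefficients: $(2n)! \ge 2^n n!$ for the Rademacher case and $(2n+1)! \ge 6^n n!$ for the uniform case.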
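distribution | subgaussian norm | variance proxy | strictly subgaussian?
---|---|---|---
gaussian distribution | | | Yes
mean-zero Bernoulli distribution $p\delta_q + q\delta_{-p}$ | solution $t$ to $pe^{(q/t)^2} + qe^{(p/t)^2} = 2$ | $\frac{p-q}{2(\ln p - \ln q)}$ | Iff $p = q = 1/2$
symmetric Bernoulli distribution | | | Yes
uniform distribution | solution $t$ to $\int_0^1 e^{x^2/t^2}\,dx = 2$, approximately 0.7727 | | Yes
arbitrary distribution on interval $[a, b]$ | | $\le \left(\frac{b-a}{2}\right)^2$ |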
The optimal variance proxy is known for many standard probability distributions, including the beta, Bernoulli, Dirichlet [6] , Kumaraswamy, triangular [7] , truncated Gaussian, and truncated exponential. [8]
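Let $p, q$ be two positive numbers with $p + q = 1$. Let $X$ have the centered Bernoulli distribution $p\delta_q + q\delta_{-p}$, so that it has mean zero; then $\|X\|_{vp}^2 = \frac{p-q}{2(\ln p - \ln q)}$ (read as its limit $1/4$ when $p = q$). [5] Its subgaussian norm is $t$, where $t$ is the unique positive solution to $pe^{(q/t)^2} + qe^{(p/t)^2} = 2$.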
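The variance-proxy formula for the centered Bernoulli distribution, $\frac{p-q}{2(\ln p - \ln q)}$ with $q = 1 - p$, can be sanity-checked numerically. In the sketch below, the function names and the cutoff used to switch to the limiting value $1/4$ near $p = 1/2$ are assumptions made for illustration.

```python
import math

def bernoulli_variance_proxy(p):
    """Optimal variance proxy (p - q) / (2 (ln p - ln q)) of the centered
    Bernoulli distribution taking value q w.p. p and -p w.p. q, q = 1 - p."""
    q = 1.0 - p
    if abs(p - q) < 1e-9:  # limiting value at p = q = 1/2
        return 0.25
    return (p - q) / (2.0 * (math.log(p) - math.log(q)))

def centered_bernoulli_mgf(p, t):
    """E[exp(tX)] for X = q with probability p and X = -p with probability q."""
    q = 1.0 - p
    return p * math.exp(t * q) + q * math.exp(-t * p)

for p in (0.05, 0.1, 0.3, 0.5, 0.7, 0.9):
    vp = bernoulli_variance_proxy(p)
    variance = p * (1.0 - p)
    # The proxy dominates the variance, with equality only at p = 1/2.
    print(f"p={p}: variance={variance:.4f}, variance proxy={vp:.4f}")
```

For $p = 0.1$ this gives a variance proxy of about $0.182$ against a variance of $0.09$, so the skewed Bernoulli distribution is subgaussian but not strictly so.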
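Let $X$ be a random variable with symmetric Bernoulli distribution (or Rademacher distribution). That is, $X$ takes values $-1$ and $1$ with probabilities $1/2$ each. Since $X^2 = 1$, it follows that
$$\|X\|_{\psi_2} = \inf\left\{c > 0 : \operatorname{E}\left[\exp\left(X^2/c^2\right)\right] \le 2\right\} = \inf\left\{c > 0 : \exp\left(1/c^2\right) \le 2\right\} = \frac{1}{\sqrt{\ln 2}},$$
and hence $X$ is a subgaussian random variable.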
Bounded distributions have no tail at all, so clearly they are subgaussian.
If $X$ is bounded within the interval $[a, b]$, Hoeffding's lemma states that $\|X\|_{vp}^2 \le \left(\frac{b-a}{2}\right)^2$. Hoeffding's inequality is the Chernoff bound obtained using this fact.
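Hoeffding's lemma says that a variable bounded in $[a, b]$ has centered moment generating function at most $\exp\left(t^2 (b-a)^2/8\right)$. The sketch below checks this on a grid for one arbitrary discrete distribution; the support points, probabilities, and function names are made up for illustration.

```python
import math

def centered_mgf(values, probs, t):
    """E[exp(t (X - E[X]))] for a discrete X with the given values and probabilities."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * math.exp(t * (v - mean)) for v, p in zip(values, probs))

def hoeffding_lemma_bound(a, b, t):
    """Hoeffding's lemma: exp(t^2 (b - a)^2 / 8), i.e. variance proxy ((b - a)/2)^2."""
    return math.exp(t * t * (b - a) ** 2 / 8.0)

# Hypothetical example distribution supported inside [a, b] = [-1, 2].
values = [-1.0, 0.0, 0.5, 2.0]
probs = [0.2, 0.3, 0.4, 0.1]
a, b = min(values), max(values)

for t in (-2.0, -0.5, 0.5, 2.0):
    lhs = centered_mgf(values, probs, t)
    rhs = hoeffding_lemma_bound(a, b, t)
    print(f"t={t}: centered MGF={lhs:.4f} <= bound={rhs:.4f}: {lhs <= rhs}")
```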
Since the sum of subgaussian random variables is still subgaussian, the convolution of subgaussian distributions is still subgaussian. In particular, any convolution of the normal distribution with any bounded distribution is subgaussian.
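Given subgaussian distributions $X_1, X_2, \dots, X_n$, we can construct an additive mixture $X$ as follows: first randomly pick a number $i \in \{1, 2, \dots, n\}$ with probability $p_i$, then pick $X_i$.
Since $\operatorname{E}\left[\exp\left(X^2/c^2\right)\right] = \sum_i p_i \operatorname{E}\left[\exp\left(X_i^2/c^2\right)\right]$, we have $\|X\|_{\psi_2} \le \max_i \|X_i\|_{\psi_2}$, and so the mixture is subgaussian.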
In particular, any gaussian mixture is subgaussian.
More generally, the mixture of infinitely many subgaussian distributions is also subgaussian, if the subgaussian norm has a finite supremum: $\sup_i \|X_i\|_{\psi_2} < \infty$.
So far, we have discussed subgaussianity for real-valued random variables. We can also define subgaussianity for random vectors. The purpose of subgaussianity is to make the tails decay fast, so we generalize accordingly: a subgaussian random vector is a random vector where the tail decays fast.
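Let $X$ be a random vector taking values in $\mathbb{R}^n$.
Define:
- $\|X\|_{\psi_2} := \sup_{v \in S^{n-1}} \|v^T X\|_{\psi_2}$, where $S^{n-1}$ is the unit sphere in $\mathbb{R}^n$.
- $X$ is subgaussian iff $\|X\|_{\psi_2} < \infty$.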
Theorem. (Theorem 3.4.6 [2]) For any positive integer $n$, the uniformly distributed random vector $X \sim U(\sqrt{n}\,S^{n-1})$ is subgaussian, with $\|X\|_{\psi_2} \lesssim 1$.
This is not so surprising, because as $n \to \infty$, the projection of $U(\sqrt{n}\,S^{n-1})$ to the first coordinate converges in distribution to the standard normal distribution.
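Proposition. If $X_1, \dots, X_n$ are mean-zero subgaussians, with $\|X_i\|_{vp}^2 \le \sigma^2$, then for any $\delta > 0$, we have $\max(X_1, \dots, X_n) \le \sigma\sqrt{2\ln\frac{n}{\delta}}$ with probability $\ge 1 - \delta$.
Proof. By the Chernoff bound, $\operatorname{P}\left(X_i \ge \sigma\sqrt{2\ln(n/\delta)}\right) \le \delta/n$. Now apply the union bound.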
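A small Monte Carlo experiment illustrates a maximal bound of this type, namely that $n$ mean-zero variables with variance proxy at most $\sigma^2$ satisfy $\max_i X_i \le \sigma\sqrt{2\ln(n/\delta)}$ with probability at least $1 - \delta$. The sketch assumes independent Gaussian samples purely for convenience (the union-bound argument does not need independence), and the function names, sample sizes, and seed are arbitrary choices.

```python
import math
import random

def max_bound(sigma, n, delta):
    """High-probability bound sigma * sqrt(2 ln(n / delta)) on the maximum."""
    return sigma * math.sqrt(2.0 * math.log(n / delta))

def empirical_coverage(n, delta, sigma=1.0, trials=2000, seed=0):
    """Fraction of trials in which the max of n N(0, sigma^2) draws stays below the bound."""
    rng = random.Random(seed)
    bound = max_bound(sigma, n, delta)
    hits = sum(
        max(rng.gauss(0.0, sigma) for _ in range(n)) <= bound
        for _ in range(trials)
    )
    return hits / trials

n, delta = 50, 0.1
print("bound:", max_bound(1.0, n, delta))
print("empirical coverage:", empirical_coverage(n, delta), ">= target", 1.0 - delta)
```

In this Gaussian setting the observed coverage is typically well above $1 - \delta$, which reflects the slack in both the Chernoff bound and the union bound.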
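Proposition. (Exercise 2.5.10 [2]) If $X_1, X_2, \dots$ are subgaussians, with $\|X_i\|_{\psi_2} \le K$, then
$$\operatorname{E}\left[\sup_n \frac{|X_n|}{\sqrt{1 + \ln n}}\right] \lesssim K, \qquad \operatorname{E}\left[\max_{1 \le n \le N} |X_n|\right] \lesssim K\sqrt{\ln N} \quad (N \ge 2).$$
Further, the bound is sharp, since when $X_1, X_2, \dots$ are IID samples of $\mathcal{N}(0, 1)$ we have $\operatorname{E}\left[\max_{1 \le n \le N} |X_n|\right] \gtrsim \sqrt{\ln N}$. [9]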
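Theorem. (over a finite set) If $X_1, \dots, X_n$ are mean-zero subgaussian, with $\|X_i\|_{vp}^2 \le \sigma^2$, then for all $t > 0$,
$$\operatorname{E}\left[\max_i X_i\right] \le \sigma\sqrt{2\ln n}, \qquad \operatorname{P}\left(\max_i X_i > t\right) \le n e^{-t^2/(2\sigma^2)},$$
$$\operatorname{E}\left[\max_i |X_i|\right] \le \sigma\sqrt{2\ln(2n)}, \qquad \operatorname{P}\left(\max_i |X_i| > t\right) \le 2n e^{-t^2/(2\sigma^2)}.$$
Theorem. (over a convex polytope) Fix a finite set of vectors $v_1, \dots, v_n$. If $X$ is a mean-zero random vector, such that each $\|v_i^T X\|_{vp}^2 \le \sigma^2$, then the above 4 inequalities hold, with $\max_{v \in \operatorname{conv}(v_1, \dots, v_n)} v^T X$ replacing $\max_i X_i$.
Here, $\operatorname{conv}(v_1, \dots, v_n)$ is the convex polytope spanned by the vectors $v_1, \dots, v_n$.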
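Theorem. (over a ball) If $X$ is a mean-zero random vector in $\mathbb{R}^d$, such that $\|v^T X\|_{vp}^2 \le \sigma^2$ for all $v$ on the unit sphere $S$, then
$$\operatorname{E}\left[\max_{v \in S} v^T X\right] = \operatorname{E}\left[\max_{v \in S} |v^T X|\right] \le 4\sigma\sqrt{d}.$$
For any $\delta > 0$, with probability at least $1 - \delta$,
$$\max_{v \in S} v^T X = \max_{v \in S} |v^T X| \le 4\sigma\sqrt{d} + 2\sigma\sqrt{2\ln(1/\delta)}.$$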
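Theorem. (Theorem 2.6.1 [2]) There exists a positive constant $C$ such that given any number of independent mean-zero subgaussian random variables $X_1, \dots, X_n$,
$$\left\|\sum_{i=1}^n X_i\right\|_{\psi_2}^2 \le C\sum_{i=1}^n \|X_i\|_{\psi_2}^2.$$
Theorem. (Hoeffding's inequality) (Theorem 2.6.3 [2]) There exists a positive constant $c$ such that given any number of independent mean-zero subgaussian random variables $X_1, \dots, X_n$,
$$\operatorname{P}\left(\left|\sum_{i=1}^n X_i\right| \ge t\right) \le 2\exp\left(-\frac{ct^2}{\sum_{i=1}^n \|X_i\|_{\psi_2}^2}\right) \quad \text{for all } t > 0.$$
Theorem. (Bernstein's inequality) (Theorem 2.8.1 [2]) There exists a positive constant $c$ such that given any number of independent mean-zero subexponential random variables $X_1, \dots, X_n$,
$$\operatorname{P}\left(\left|\sum_{i=1}^n X_i\right| \ge t\right) \le 2\exp\left(-c\min\left(\frac{t^2}{\sum_{i=1}^n \|X_i\|_{\psi_1}^2}, \frac{t}{\max_i \|X_i\|_{\psi_1}}\right)\right) \quad \text{for all } t > 0,$$
where $\|\cdot\|_{\psi_1}$ denotes the subexponential norm.
Theorem. (Khinchine inequality) (Exercise 2.6.5 [2]) There exists a positive constant $C$ such that given any number of independent mean-zero variance-one subgaussian random variables $X_1, \dots, X_n$, any $p \ge 2$, and any $a_1, \dots, a_n \in \mathbb{R}$,
$$\left(\sum_{i=1}^n a_i^2\right)^{1/2} \le \left\|\sum_{i=1}^n a_i X_i\right\|_{L^p} \le CK\sqrt{p}\left(\sum_{i=1}^n a_i^2\right)^{1/2},$$
where $K = \max_i \|X_i\|_{\psi_2}$.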
The Hanson-Wright inequality states that if a random vector $X$ is subgaussian in a certain sense, then any quadratic form $A$ of this vector, $X^T A X$, is also subgaussian/subexponential. Further, the upper bound on the tail of $X^T A X$ is uniform.
A weak version of the following theorem was proved in (Hanson, Wright, 1971). [11] There are many extensions and variants. Much like the central limit theorem, the Hanson-Wright inequality is more a cluster of theorems with the same purpose, than a single theorem. The purpose is to take a subgaussian vector and uniformly bound its quadratic forms.
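Theorem. [12] [13] There exists a constant $c$, such that:
Let $n$ be a positive integer. Let $X_1, \dots, X_n$ be independent random variables, such that each satisfies $\operatorname{E}[X_i] = 0$. Combine them into a random vector $X = (X_1, \dots, X_n)$. For any $n \times n$ matrix $A$, we have
$$\operatorname{P}\left(\left|X^T A X - \operatorname{E}[X^T A X]\right| > t\right) \le \max\left(2e^{-\frac{ct^2}{K^4\|A\|_F^2}},\, 2e^{-\frac{ct}{K^2\|A\|}}\right) = 2\exp\left[-c\min\left(\frac{t^2}{K^4\|A\|_F^2}, \frac{t}{K^2\|A\|}\right)\right],$$
where $K = \max_i \|X_i\|_{\psi_2}$, and $\|A\|_F$ is the Frobenius norm of the matrix, and $\|A\|$ is the operator norm of the matrix.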
In words, the quadratic form has its tail uniformly bounded by an exponential, or a gaussian, whichever is larger.
In the statement of the theorem, the constant $c$ is an "absolute constant", meaning that it has no dependence on $n, X_1, \dots, X_n, A$. It is a mathematical constant much like pi and e.
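Theorem (subgaussian concentration). [12] There exists a constant $c$, such that:
Let $m, n$ be positive integers. Let $X_1, \dots, X_n$ be independent random variables, such that each satisfies $\operatorname{E}[X_i] = 0$, $\operatorname{E}[X_i^2] = 1$. Combine them into a random vector $X = (X_1, \dots, X_n)$. For any $m \times n$ matrix $A$, we have
$$\operatorname{P}\left(\left|\,\|AX\|_2 - \|A\|_F\right| > t\right) \le 2e^{-\frac{ct^2}{K^4\|A\|^2}},$$
where $K = \max_i \|X_i\|_{\psi_2}$. In words, the random vector $AX$ is concentrated on a spherical shell of radius $\|A\|_F$, such that $\|AX\|_2 - \|A\|_F$ is subgaussian, with subgaussian norm at most $\sqrt{3/c}\,\|A\|K^2$.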