Sub-Gaussian distribution

Last updated February 06, 2025

In probability theory, a subgaussian distribution, the distribution of a subgaussian random variable, is a probability distribution with strong tail decay. More specifically, the tails of a subgaussian distribution are dominated by (i.e. decay at least as fast as) the tails of a Gaussian. This property gives subgaussian distributions their name.

Often in analysis, we divide an object (such as a random variable) into two parts, a central bulk and a distant tail, then analyze each separately. In probability, this division usually takes the form: "Everything interesting happens near the center. The tail event is so rare that we may safely ignore it." Subgaussian distributions are worth studying because the Gaussian distribution is well understood, so we can give sharp bounds on the rarity of the tail event. Subexponential distributions are worth studying for the same reason.

Formally, the probability distribution of a random variable $X$ is called subgaussian if there is a positive constant $C$ such that for every $t\geq 0$ , $\operatorname {P} (|X|\geq t)\leq 2\exp {(-t^{2}/C^{2})}.$

There are many equivalent definitions. For example, a random variable $X$ is sub-Gaussian iff its tail probability is bounded from above (up to a constant) by the tail probability of a Gaussian:

$P(|X|\geq t)\leq cP(|Z|\geq t)\quad \forall t>0$

where $c\geq 0$ is a constant and $Z$ is a mean-zero Gaussian random variable. (Theorem 2.6 ^[1])

Definitions

Subgaussian norm

The subgaussian norm of $X$ , denoted as $\Vert X\Vert _{\psi _{2}}$ , is $\Vert X\Vert _{\psi _{2}}=\inf \left\{c>0:\operatorname {E} \left[\exp {\left({\frac {X^{2}}{c^{2}}}\right)}\right]\leq 2\right\}.$ In other words, it is the Orlicz norm of $X$ generated by the Orlicz function $\Phi (u)=e^{u^{2}}-1.$ By condition $(2)$ below, subgaussian random variables can be characterized as those random variables with finite subgaussian norm.

Variance proxy

If there exists some $s^{2}$ such that $\operatorname {E} [e^{(X-\operatorname {E} [X])t}]\leq e^{\frac {s^{2}t^{2}}{2}}$ for all $t$ , then $s^{2}$ is called a variance proxy, and the smallest such $s^{2}$ is called the optimal variance proxy and denoted by $\Vert X\Vert _{\mathrm {vp} }^{2}$ .

Since $\operatorname {E} [e^{(X-\operatorname {E} [X])t}]=e^{\frac {\sigma ^{2}t^{2}}{2}}$ when $X\sim {\mathcal {N}}(\mu ,\sigma ^{2})$ is Gaussian, we then have $\|X\|_{vp}^{2}=\sigma ^{2}$ , as it should.

Equivalent definitions

Let $X$ be a random variable. Let $K_{1},K_{2},K_{3},\dots$ be positive constants. The following conditions are equivalent: (Proposition 2.5.2 ^[2])

(1) Tail probability bound: $\operatorname {P} (|X|\geq t)\leq 2\exp {(-t^{2}/K_{1}^{2})}$ for all $t\geq 0$ ;
(2) Finite subgaussian norm: $\Vert X\Vert _{\psi _{2}}=K_{2}<\infty$ ;
(3) Moment: $\operatorname {E} |X|^{p}\leq 2K_{3}^{p}\Gamma \left({\frac {p}{2}}+1\right)$ for all $p\geq 1$ , where $\Gamma$ is the Gamma function;
(4) Moment: $\operatorname {E} |X|^{p}\leq K_{4}^{p}p^{p/2}$ for all $p\geq 1$ ;
(5) Moment-generating function (of $X$ ), or variance proxy^[3]^[4] : $\operatorname {E} [e^{(X-\operatorname {E} [X])t}]\leq e^{\frac {K_{5}^{2}t^{2}}{2}}$ for all $t$ ;
(6) Moment-generating function (of $X^{2}$ ): $\operatorname {E} [e^{X^{2}t^{2}}]\leq e^{K_{6}^{2}t^{2}}$ for all $t\in [-1/K_{6},+1/K_{6}]$ ;
(7) Union bound: for some $c>0$ , $\operatorname {E} [\max\{|X_{1}-\operatorname {E} [X]|,\ldots ,|X_{n}-\operatorname {E} [X]|\}]\leq c{\sqrt {\log n}}$ for all $n>c$ , where $X_{1},\ldots ,X_{n}$ are i.i.d. copies of $X$ ;
(8) Subexponential: $X^{2}$ has a subexponential distribution.

Furthermore, the constants $K_{i}$ in definitions (1) to (5) agree up to an absolute constant. So for example, given a random variable satisfying (1) and (2), the minimal constants $K_{1},K_{2}$ in the two definitions satisfy $K_{1}\leq cK_{2},K_{2}\leq c'K_{1}$ , where $c,c'$ are constants independent of the random variable.

Proof of equivalence

As an example, the first four definitions are equivalent by the proof below.

Proof. $(1)\implies (3)$ By the layer cake representation, ${\begin{aligned}\operatorname {E} |X|^{p}&=\int _{0}^{\infty }\operatorname {P} (|X|^{p}\geq t)dt\\&=\int _{0}^{\infty }pt^{p-1}\operatorname {P} (|X|\geq t)dt\\&\leq 2\int _{0}^{\infty }pt^{p-1}\exp \left(-{\frac {t^{2}}{K_{1}^{2}}}\right)dt\\\end{aligned}}$

After a change of variables $u=t^{2}/K_{1}^{2}$ , we find that ${\begin{aligned}\operatorname {E} |X|^{p}&\leq 2K_{1}^{p}{\frac {p}{2}}\int _{0}^{\infty }u^{{\frac {p}{2}}-1}e^{-u}du\\&=2K_{1}^{p}{\frac {p}{2}}\Gamma \left({\frac {p}{2}}\right)\\&=2K_{1}^{p}\Gamma \left({\frac {p}{2}}+1\right).\end{aligned}}$

$(3)\implies (2)$ By the Taylor series ${\textstyle e^{x}=1+\sum _{p=1}^{\infty }{\frac {x^{p}}{p!}},}$ for any $\lambda >0$ , ${\begin{aligned}\operatorname {E} [\exp {(\lambda X^{2})}]&=1+\sum _{p=1}^{\infty }{\frac {\lambda ^{p}\operatorname {E} {[X^{2p}]}}{p!}}\\&\leq 1+\sum _{p=1}^{\infty }{\frac {2\lambda ^{p}K_{3}^{2p}\Gamma \left(p+1\right)}{p!}}\\&=1+2\sum _{p=1}^{\infty }\lambda ^{p}K_{3}^{2p}\\&=2\sum _{p=0}^{\infty }\lambda ^{p}K_{3}^{2p}-1\\&={\frac {2}{1-\lambda K_{3}^{2}}}-1\quad {\text{for }}\lambda K_{3}^{2}<1,\end{aligned}}$ which is less than or equal to $2$ for $\lambda \leq {\frac {1}{3K_{3}^{2}}}$ . Taking $K_{2}\geq 3^{\frac {1}{2}}K_{3}$ and $\lambda =1/K_{2}^{2}$ , we get ${\textstyle \operatorname {E} [\exp {(X^{2}/K_{2}^{2})}]\leq 2.}$

$(2)\implies (1)$ By Markov's inequality, $\operatorname {P} (|X|\geq t)=\operatorname {P} \left(\exp \left({\frac {X^{2}}{K_{2}^{2}}}\right)\geq \exp \left({\frac {t^{2}}{K_{2}^{2}}}\right)\right)\leq {\frac {\operatorname {E} [\exp {(X^{2}/K_{2}^{2})}]}{\exp \left({\frac {t^{2}}{K_{2}^{2}}}\right)}}\leq 2\exp \left(-{\frac {t^{2}}{K_{2}^{2}}}\right).$

$(3)\iff (4)$ By the asymptotic formula for the gamma function: $\Gamma (p/2+1)\sim {\sqrt {\pi p}}\left({\frac {p}{2e}}\right)^{p/2}$ .

From the proof, we can extract a cycle of three inequalities:

If $\operatorname {P} (|X|\geq t)\leq 2\exp {(-t^{2}/K^{2})}$ , then $\operatorname {E} |X|^{p}\leq 2K^{p}\Gamma \left({\frac {p}{2}}+1\right)$ for all $p\geq 1$ .
If $\operatorname {E} |X|^{p}\leq 2K^{p}\Gamma \left({\frac {p}{2}}+1\right)$ for all $p\geq 1$ , then $\|X\|_{\psi _{2}}\leq 3^{\frac {1}{2}}K$ .
If $\|X\|_{\psi _{2}}\leq K$ , then $\operatorname {P} (|X|\geq t)\leq 2\exp {(-t^{2}/K^{2})}$ .

In particular, the constants $K$ provided by the definitions agree up to a constant factor, so we can say that the definitions are equivalent up to a constant independent of $X$ .

Similarly, because up to a positive multiplicative constant, $\Gamma (p/2+1)=p^{p/2}\times ((2e)^{-1/2}p^{1/2p})^{p}$ for all $p\geq 1$ , the definitions (3) and (4) are also equivalent up to a constant.
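As a numerical illustration of the step from the tail bound to the moment bound, the sketch below checks the inequality $\operatorname {E} |X|^{p}\leq 2K^{p}\Gamma (p/2+1)$ for a standard Gaussian. It assumes the standard tail bound $\operatorname {P} (|Z|\geq t)\leq 2e^{-t^{2}/2}$ (so $K={\sqrt {2}}$ ) and the closed form $\operatorname {E} |Z|^{p}=2^{p/2}\Gamma ((p+1)/2)/{\sqrt {\pi }}$ ; the function names are illustrative only.

```python
import math

def gaussian_abs_moment(p: float) -> float:
    """Exact E|Z|^p for Z ~ N(0, 1): 2^(p/2) * Gamma((p+1)/2) / sqrt(pi)."""
    return 2.0 ** (p / 2.0) * math.gamma((p + 1.0) / 2.0) / math.sqrt(math.pi)

def moment_bound(p: float, k: float) -> float:
    """Right-hand side of the moment condition: 2 * K^p * Gamma(p/2 + 1)."""
    return 2.0 * k ** p * math.gamma(p / 2.0 + 1.0)

# Assumption: N(0, 1) satisfies the tail bound with K^2 = 2.
K1 = math.sqrt(2.0)

# Ratio of the true moment to the bound; it should never exceed 1.
ratios = {p: gaussian_abs_moment(p) / moment_bound(p, K1) for p in range(1, 21)}
for p, r in ratios.items():
    print(f"p={p:2d}  E|Z|^p / bound = {r:.4f}")
```

The ratios stay below 1 and shrink slowly with $p$ , which is consistent with the bound being correct but not tight for the Gaussian.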

Basic properties

If ${\textstyle X}$ is subgaussian, and ${\textstyle k>0}$ , then ${\textstyle \|kX\|_{\psi _{2}}=k\|X\|_{\psi _{2}}}$ and ${\textstyle \|kX\|_{vp}=k\|X\|_{vp}}$ .

(Triangle inequality) If ${\textstyle X,Y}$ are subgaussian, then $\|X+Y\|_{vp}^{2}\leq (\|X\|_{vp}+\|Y\|_{vp})^{2}$

(Chernoff bound) If ${\textstyle X}$ is subgaussian, then $Pr(X-E[X]\geq t)\leq e^{-{\frac {t^{2}}{2\|X\|_{vp}^{2}}}}$ for all ${\textstyle t\geq 0}$

${\textstyle X\lesssim X'}$ means that ${\textstyle X\leq CX'}$ , where the positive constant ${\textstyle C}$ is independent of ${\textstyle X}$ and ${\textstyle X'}$ .

Subgaussian deviation bound — If ${\textstyle X}$ is subgaussian, then $\|X-E[X]\|_{\psi _{2}}\lesssim \|X\|_{\psi _{2}}$

Proof

By the triangle inequality, $\|X-E[X]\|_{\psi _{2}}\leq \|X\|_{\psi _{2}}+\|E[X]\|_{\psi _{2}}$ . Since $E[X]$ is a constant, we have $\|E[X]\|_{\psi _{2}}={\frac {|E[X]|}{\sqrt {\ln 2}}}\leq {\frac {E[|X|]}{\sqrt {\ln 2}}}\sim E[|X|]$ . By the equivalence of definitions (2) and (4) of subgaussianity, we have $E[|X|]\lesssim \|X\|_{\psi _{2}}$ .

Independent subgaussian sum bound — If ${\textstyle X,Y}$ are subgaussian and independent, then $\|X+Y\|_{vp}^{2}\leq \|X\|_{vp}^{2}+\|Y\|_{vp}^{2}$

Proof

If $X,Y$ are independent, use the fact that the cumulant generating function of independent random variables is additive. That is, $\ln \operatorname {E} [e^{t(X+Y)}]=\ln \operatorname {E} [e^{tX}]+\ln \operatorname {E} [e^{tY}]$ .

If $X,Y$ are not independent, only the triangle inequality is available. Assume without loss of generality that $X,Y$ have mean zero. By Hölder's inequality, for any $p,q>1$ with $1/p+1/q=1$ we have $E[e^{t(X+Y)}]=\|e^{t(X+Y)}\|_{1}\leq \|e^{tX}\|_{p}\|e^{tY}\|_{q}\leq e^{{\frac {1}{2}}t^{2}(p\|X\|_{vp}^{2}+q\|Y\|_{vp}^{2})}$ . Solving the optimization problem ${\begin{cases}\min p\|X\|_{vp}^{2}+q\|Y\|_{vp}^{2}\\1/p+1/q=1\end{cases}}$ , we obtain $\|X+Y\|_{vp}^{2}\leq (\|X\|_{vp}+\|Y\|_{vp})^{2}$ .

Corollary — Linear sums of subgaussian random variables are subgaussian.

Partial converse (Matoušek 2008, Lemma 2.4) — If ${\textstyle E[X]=0,E[X^{2}]=1}$ , and ${\textstyle -\ln Pr(X\geq t)\geq {\frac {1}{2}}at^{2}}$ for all ${\textstyle t>0}$ , then $\ln E[e^{tX}]\leq C_{a}t^{2}$ where ${\textstyle C_{a}>0}$ depends on ${\textstyle a}$ only.

Proof


Let ${\textstyle F(x)}$ be the CDF of ${\textstyle X}$ , and let ${\textstyle t>0}$ . The proof splits the integral defining the MGF into two parts, one with ${\textstyle tX>1}$ and one with ${\textstyle tX\leq 1}$ , and bounds each separately.

${\begin{aligned}E[e^{tX}]&=\int _{\mathbb {R} }e^{tx}dF(x)\\&=\int _{-\infty }^{1/t}e^{tx}dF(x)+\int _{1/t}^{+\infty }e^{tx}dF(x)\\\end{aligned}}$

Since ${\textstyle e^{x}\leq 1+x+x^{2}}$ for ${\textstyle x\leq 1}$ , ${\begin{aligned}\int _{-\infty }^{1/t}e^{tx}dF(x)&\leq \int _{-\infty }^{1/t}(1+tx+t^{2}x^{2})dF(x)\\&\leq \int _{\mathbb {R} }(1+tx+t^{2}x^{2})dF(x)\\&=1+tE[X]+t^{2}E[X^{2}]\\&=1+t^{2}\\&\leq e^{t^{2}}\end{aligned}}$

For the second term, upper bound it by a summation: ${\begin{aligned}\int _{1/t}^{+\infty }e^{tx}dF(x)&\leq e^{2}Pr(X\in [1/t,2/t])+e^{3}Pr(X\in [2/t,3/t])+\dots \\&\leq \sum _{k=1}^{\infty }e^{k+1}Pr(X\geq k/t)\\&\leq \sum _{k=1}^{\infty }e^{k(2-{\frac {1}{2}}ak/t^{2})}\end{aligned}}$

When ${\textstyle t\leq {\sqrt {a/8}}}$ , for any ${\textstyle k\geq 1}$ , ${\textstyle 2k-{\frac {ak^{2}}{2t^{2}}}\leq -{\frac {ak}{4t^{2}}}}$ , so $\sum _{k=1}^{\infty }e^{k(2-{\frac {1}{2}}ak/t^{2})}\leq \sum _{k=1}^{\infty }e^{-{\frac {ak}{4t^{2}}}}={\frac {1}{e^{\frac {a}{4t^{2}}}-1}}\leq {\frac {4}{a}}t^{2}$

When ${\textstyle t>{\sqrt {a/8}}}$ , comparing the summation with the area under the curve ${\textstyle f(x)=e^{-{\frac {a}{2t^{2}}}x^{2}+2x}}$ , we find that $\sum _{k=1}^{\infty }e^{k(2-{\frac {1}{2}}ak/t^{2})}\leq \int _{\mathbb {R} }f(x)dx+2\max _{x}f(x)=e^{\frac {2t^{2}}{a}}\left({\sqrt {\frac {2\pi t^{2}}{a}}}+2\right)<10{\sqrt {t^{2}/a}}e^{\frac {2t^{2}}{a}}$

Now verify that ${\textstyle \ln 10+{\frac {1}{2}}\ln(t^{2}/a)+{\frac {2}{a}}t^{2}<C_{a}t^{2}}$ , where ${\textstyle C_{a}}$ depends on ${\textstyle a}$ only.

Corollary (Matoušek 2008, Lemma 2.2) — ${\textstyle X_{1},\dots ,X_{n}}$ are independent random variables with the same upper subgaussian tail: $-\ln Pr(X_{i}\geq t)\geq {\frac {1}{2}}at^{2}$ for all ${\textstyle t>0}$ . Also, ${\textstyle E[X_{i}]=0,E[X_{i}^{2}]=1}$ , then for any unit vector ${\textstyle v\in \mathbb {R} ^{n}}$ , the linear sum ${\textstyle \sum _{i}v_{i}X_{i}}$ has a subgaussian tail:
$-\ln Pr\left(\sum _{i}v_{i}X_{i}\geq t\right)\geq C_{a}t^{2}$ where ${\textstyle C_{a}>0}$ depends only on ${\textstyle a}$ .

Concentration

Gaussian concentration inequality for Lipschitz functions (Tao 2012, Theorem 2.1.12.) — If ${\textstyle f:\mathbb {R} ^{n}\to \mathbb {R} }$ is ${\textstyle L}$ -Lipschitz, and ${\textstyle X\sim N(0,I)}$ is a standard gaussian vector, then ${\textstyle f(X)}$ concentrates around its expectation at a rate $Pr(f(X)-E[f(X)]\geq t)\leq e^{-{\frac {2}{\pi ^{2}}}{\frac {t^{2}}{L^{2}}}}$ and similarly for the other tail.

Proof


By shifting and scaling, it suffices to prove the case where ${\textstyle L=1}$ , and ${\textstyle E[f(X)]=0}$ .

Since every 1-Lipschitz function is uniformly approximable by 1-Lipschitz smooth functions (by convolving with a mollifier), it suffices to prove it for 1-Lipschitz smooth functions.

Now it remains to bound the cumulant generating function.

To exploit the Lipschitzness, we introduce ${\textstyle Y}$ , an independent copy of ${\textstyle X}$ , then by Jensen, $E[e^{t(f(X)-f(Y))}]=E[e^{tf(X)}]E[e^{-tf(Y)}]\geq E[e^{tf(X)}]e^{-tE[f(Y)]}=E[e^{tf(X)}]$

By the circular symmetry of gaussian variables, we introduce ${\textstyle X_{\theta }:=Y\cos \theta +X\sin \theta }$ . This has the benefit that its derivative ${\textstyle X_{\theta }'=-Y\sin \theta +X\cos \theta }$ is independent of it.

${\begin{aligned}e^{t(f(X)-f(Y))}&=e^{t(f(X_{\pi /2})-f(X_{0}))}\\&=e^{t\int _{0}^{\pi /2}\nabla f(X_{\theta })\cdot X_{\theta }'d\theta }\\&=e^{\pi t/2\int _{0}^{\pi /2}\nabla f(X_{\theta })\cdot X_{\theta }'{\frac {d\theta }{\pi /2}}}\\&\leq \int _{0}^{\pi /2}e^{\pi t/2\nabla f(X_{\theta })\cdot X_{\theta }'}{\frac {d\theta }{\pi /2}}\\\end{aligned}}$

Now take its expectation, $E[e^{t(f(X)-f(Y))}]\leq \int _{0}^{\pi /2}E[e^{\pi t/2\nabla f(X_{\theta })\cdot X_{\theta }'}]{\frac {d\theta }{\pi /2}}$ . The expectation within the integral is over the joint distribution of ${\textstyle X,Y}$ , but since the joint distribution of ${\textstyle X_{\theta },X_{\theta }'}$ is exactly the same, we have $E[e^{\pi t/2\nabla f(X_{\theta })\cdot X_{\theta }'}]=E_{X}[E_{Y}[e^{\pi t/2\nabla f(X)\cdot Y}]]$ .

Conditional on ${\textstyle X}$ , the quantity ${\textstyle \nabla f(X)\cdot Y}$ is normally distributed with mean zero and variance ${\textstyle \|\nabla f(X)\|^{2}\leq 1}$ , so $E_{Y}[e^{\pi t/2\nabla f(X)\cdot Y}]\leq e^{{\frac {1}{2}}(\pi t/2)^{2}}=e^{{\frac {\pi ^{2}}{8}}t^{2}}$ .

Thus, we have $\ln E[e^{tf(X)}]\leq {\frac {\pi ^{2}}{8}}t^{2}$ , and the Chernoff bound $Pr(f(X)\geq s)\leq \inf _{t>0}e^{-ts+{\frac {\pi ^{2}}{8}}t^{2}}=e^{-{\frac {2s^{2}}{\pi ^{2}}}}$ gives the claimed tail bound.
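The concentration statement can be illustrated by simulation. The sketch below is only a sanity check under stated assumptions: it takes the 1-Lipschitz function $f(x)=\|x\|_{2}$ , replaces $E[f(X)]$ by a Monte Carlo average, and compares the empirical upper tail with $e^{-2t^{2}/\pi ^{2}}$ . The dimension, sample size, seed and threshold are arbitrary choices.

```python
import math
import random

def norm2(x):
    """Euclidean norm, a 1-Lipschitz function on R^n."""
    return math.sqrt(sum(v * v for v in x))

rng = random.Random(0)           # fixed seed so the run is reproducible
n, trials, t = 10, 20000, 1.5    # arbitrary illustration parameters

# Draw standard Gaussian vectors and evaluate f on each of them.
samples = [norm2([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(trials)]

mean_f = sum(samples) / trials   # Monte Carlo stand-in for E[f(X)]
empirical_tail = sum(s - mean_f >= t for s in samples) / trials
bound = math.exp(-2.0 * t * t / math.pi ** 2)   # theorem's bound with L = 1

print(f"empirical tail = {empirical_tail:.4f}, bound = {bound:.4f}")
```

In such runs the empirical tail is typically far below the bound, since the constant $2/\pi ^{2}$ obtained from this proof is not optimal.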

Strictly subgaussian

Expanding the cumulant generating function of a mean-zero $X$ : ${\frac {1}{2}}s^{2}t^{2}\geq \ln \operatorname {E} [e^{tX}]={\frac {1}{2}}\mathrm {Var} [X]t^{2}+{\frac {\kappa _{3}}{3!}}t^{3}+\cdots$ we find, by letting $t\to 0$ , that $\mathrm {Var} [X]\leq \|X\|_{\mathrm {vp} }^{2}$ . At the edge of possibility, a random variable $X$ satisfying $\mathrm {Var} [X]=\|X\|_{\mathrm {vp} }^{2}$ is called strictly subgaussian.

Properties

Theorem.^[5] Let $X$ be a subgaussian random variable with mean zero. If all zeros of its characteristic function are real, then $X$ is strictly subgaussian.

Corollary. If $X_{1},\dots ,X_{n}$ are independent and strictly subgaussian, then any linear sum of them is strictly subgaussian.

Examples

By calculating the characteristic functions, we can show that some distributions are strictly subgaussian: symmetric uniform distribution, symmetric Bernoulli distribution.

Since a symmetric uniform distribution is strictly subgaussian, its convolution with itself is strictly subgaussian. That is, the symmetric triangular distribution is strictly subgaussian.

Since the symmetric Bernoulli distribution is strictly subgaussian, any symmetric Binomial distribution is strictly subgaussian.

Examples


distribution	$\|X\|_{\psi _{2}}$	$\|X\|_{vp}^{2}$	strictly subgaussian?
gaussian distribution ${\mathcal {N}}(0,1)$	${\sqrt {8/3}}$	$1$	Yes
mean-zero Bernoulli distribution $p\delta _{q}+q\delta _{-p}$	solution to $pe^{(q/t)^{2}}+qe^{(p/t)^{2}}=2$	${\frac {p-q}{2(\log p-\log q)}}$	Iff $p=0,1/2,1$
symmetric Bernoulli distribution ${\frac {1}{2}}\delta _{1/2}+{\frac {1}{2}}\delta _{-1/2}$	${\frac {1}{2{\sqrt {\ln 2}}}}$	$1/4$	Yes
uniform distribution $U(-1,1)$	solution to $\int _{0}^{1}e^{x^{2}/t^{2}}dx=2$ , approximately 0.7727	$1/3$	Yes
arbitrary distribution on interval $[a,b]$		$\leq \left({\frac {b-a}{2}}\right)^{2}$
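Two of the subgaussian norms in the table can be reproduced numerically. The sketch below assumes the closed form $\operatorname {E} [e^{Z^{2}/c^{2}}]=(1-2/c^{2})^{-1/2}$ for $Z\sim {\mathcal {N}}(0,1)$ and $c^{2}>2$ , evaluates the integral in the uniform row with a composite Simpson rule, and solves each defining equation by bisection. The helper names and bracketing intervals are illustrative choices.

```python
import math

def bisect_root(g, lo, hi, iters=100):
    """Find t in [lo, hi] with g(t) = 0, assuming g(lo) > 0 > g(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gaussian_excess(c):
    """E[exp(Z^2 / c^2)] - 2 for Z ~ N(0, 1); closed form valid for c^2 > 2."""
    return (1.0 - 2.0 / c ** 2) ** -0.5 - 2.0

def uniform_excess(t, steps=2000):
    """Simpson approximation of the integral of exp(x^2 / t^2) on [0, 1], minus 2."""
    h = 1.0 / steps
    total = 1.0 + math.exp(1.0 / t ** 2)          # endpoint values f(0) + f(1)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2                # Simpson weights 4, 2, 4, ...
        total += weight * math.exp((i * h) ** 2 / t ** 2)
    return total * h / 3.0 - 2.0

gauss_norm = bisect_root(gaussian_excess, 1.5, 10.0)
unif_norm = bisect_root(uniform_excess, 0.3, 5.0)
print(f"Gaussian: {gauss_norm:.6f} (sqrt(8/3) = {math.sqrt(8 / 3):.6f})")
print(f"Uniform:  {unif_norm:.4f}")
```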

The optimal variance proxy $\Vert X\Vert _{\mathrm {vp} }^{2}$ is known for many standard probability distributions, including the beta, Bernoulli, Dirichlet^[6], Kumaraswamy, triangular^[7], truncated Gaussian, and truncated exponential.^[8]

Bernoulli distribution

Let $p,q$ be two positive numbers with $p+q=1$ . Let $X$ have the centered Bernoulli distribution $p\delta _{q}+q\delta _{-p}$ , so that it has mean zero. Then $\Vert X\Vert _{\mathrm {vp} }^{2}={\frac {p-q}{2(\log p-\log q)}}$ .^[5] Its subgaussian norm is $t$ where $t$ is the unique positive solution to $pe^{(q/t)^{2}}+qe^{(p/t)^{2}}=2$ .

Let $X$ be a random variable with symmetric Bernoulli distribution (or Rademacher distribution). That is, $X$ takes values $-1$ and $1$ with probabilities $1/2$ each. Since $X^{2}=1$ , it follows that $\Vert X\Vert _{\psi _{2}}=\inf \left\{c>0:\operatorname {E} \left[\exp {\left({\frac {X^{2}}{c^{2}}}\right)}\right]\leq 2\right\}=\inf \left\{c>0:\exp {\left({\frac {1}{c^{2}}}\right)}\leq 2\right\}={\frac {1}{\sqrt {\ln 2}}},$ and hence $X$ is a subgaussian random variable.
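Both Bernoulli formulas can be checked numerically. The following sketch is an illustration under stated assumptions rather than a reference implementation: it solves $pe^{(q/t)^{2}}+qe^{(p/t)^{2}}=2$ by bisection, and approximates the optimal variance proxy by maximizing $2\ln \operatorname {E} [e^{tX}]/t^{2}$ over a finite grid of $t$ (the grid range and step are arbitrary choices). The function names are hypothetical.

```python
import math

def psi2_norm_centered_bernoulli(p, iters=200):
    """psi_2 norm of p*delta_q + q*delta_{-p}, by bisection on t."""
    q = 1.0 - p
    def excess(t):
        return p * math.exp((q / t) ** 2) + q * math.exp((p / t) ** 2) - 2.0
    lo, hi = 0.05, 10.0                      # excess(lo) > 0 > excess(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi

def variance_proxy_grid(p, t_max=20.0, step=1e-3):
    """Grid approximation of sup over t != 0 of 2 * ln E[exp(tX)] / t^2."""
    q = 1.0 - p
    best = 0.0
    for i in range(1, int(t_max / step) + 1):
        for t in (i * step, -i * step):
            log_mgf = math.log(p * math.exp(t * q) + q * math.exp(-t * p))
            best = max(best, 2.0 * log_mgf / (t * t))
    return best

def variance_proxy_closed_form(p):
    """The closed form (p - q) / (2 (ln p - ln q)), valid for p != 1/2."""
    q = 1.0 - p
    return (p - q) / (2.0 * (math.log(p) - math.log(q)))

half_norm = psi2_norm_centered_bernoulli(0.5)   # values +-1/2
rademacher_norm = 2.0 * half_norm                # homogeneity: values +-1
vp_grid = variance_proxy_grid(0.2)
vp_exact = variance_proxy_closed_form(0.2)
print(rademacher_norm, 1.0 / math.sqrt(math.log(2.0)))
print(vp_grid, vp_exact)
```

For $p=1/2$ the closed form is a $0/0$ expression whose limit is $1/4$ , so the sketch evaluates it only at $p=0.2$ .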

Bounded distributions

Bounded distributions have no tail at all, so clearly they are subgaussian.

If $X$ is bounded within the interval $[a,b]$ , Hoeffding's lemma states that $\Vert X\Vert _{\mathrm {vp} }^{2}\leq \left({\frac {b-a}{2}}\right)^{2}$ . Hoeffding's inequality is the Chernoff bound obtained using this fact.
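A quick numerical check of Hoeffding's lemma: the sketch below takes a hypothetical three-point distribution on $[0,1]$ and verifies on a grid of $t$ that the centered log-MGF never exceeds ${\frac {1}{2}}\left({\frac {b-a}{2}}\right)^{2}t^{2}$ . The support, weights and grid are arbitrary illustration choices.

```python
import math

# Hypothetical bounded distribution: support points and their probabilities.
values = [0.0, 0.3, 1.0]
probs = [0.5, 0.3, 0.2]

a, b = min(values), max(values)
mean = sum(v * w for v, w in zip(values, probs))

def log_mgf_centered(t):
    """ln E[exp(t (X - E[X]))] for the finite distribution above."""
    return math.log(sum(w * math.exp(t * (v - mean)) for v, w in zip(values, probs)))

proxy = ((b - a) / 2.0) ** 2                      # Hoeffding's variance proxy bound
grid = [k / 10.0 for k in range(-100, 101)]       # t from -10 to 10

# Largest violation of ln MGF <= proxy * t^2 / 2 over the grid (should be <= 0).
worst_gap = max(log_mgf_centered(t) - 0.5 * proxy * t * t for t in grid)
print(f"largest gap over the grid: {worst_gap:.3e}")
```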

Convolutions

Figure: Density of a mixture of three normal distributions (μ = 5, 10, 15, σ = 2) with equal weights. Each component is shown as a weighted density (each integrating to 1/3).

Since the sum of subgaussian random variables is still subgaussian, the convolution of subgaussian distributions is still subgaussian. In particular, any convolution of the normal distribution with any bounded distribution is subgaussian.

Mixtures

Given subgaussian distributions $X_{1},X_{2},\dots ,X_{n}$ and mixture weights $p_{1},\dots ,p_{n}\geq 0$ summing to 1, we can construct an additive mixture $X$ as follows: first randomly pick a number $i\in \{1,2,\dots ,n\}$ with probability $p_{i}$ , then draw a sample from $X_{i}$ .

Since $\operatorname {E} \left[\exp {\left({\frac {X^{2}}{c^{2}}}\right)}\right]=\sum _{i}p_{i}\operatorname {E} \left[\exp {\left({\frac {X_{i}^{2}}{c^{2}}}\right)}\right]$ we have $\|X\|_{\psi _{2}}\leq \max _{i}\|X_{i}\|_{\psi _{2}}$ , and so the mixture is subgaussian.

In particular, any gaussian mixture is subgaussian.

More generally, the mixture of infinitely many subgaussian distributions is also subgaussian, if the subgaussian norm has a finite supremum: $\|X\|_{\psi _{2}}\leq \sup _{i}\|X_{i}\|_{\psi _{2}}$ .
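The mixture bound can be illustrated on finitely supported variables, where the expectation in the subgaussian norm is a finite sum. In the sketch below the two components (values $\pm 1$ and $\pm 2$ ) and the equal mixture weights are hypothetical choices, and `psi2_norm` is an illustrative helper that solves $\operatorname {E} [e^{X^{2}/c^{2}}]=2$ by bisection.

```python
import math

def psi2_norm(values, probs, iters=200):
    """psi_2 norm of a finitely supported variable, by bisection on c."""
    def excess(c):
        return sum(w * math.exp((v / c) ** 2) for v, w in zip(values, probs)) - 2.0
    lo = max(abs(v) for v in values) / 25.0   # small enough that excess(lo) > 0 here
    hi = 1.0
    while excess(hi) > 0:                     # grow hi until E[exp(X^2/hi^2)] <= 2
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi

# Components: X1 uniform on {-1, 1}, X2 uniform on {-2, 2}.
norm1 = psi2_norm([-1.0, 1.0], [0.5, 0.5])
norm2 = psi2_norm([-2.0, 2.0], [0.5, 0.5])

# Mixture with weights (1/2, 1/2): each of the four values has probability 1/4.
mix_norm = psi2_norm([-1.0, 1.0, -2.0, 2.0], [0.25, 0.25, 0.25, 0.25])
print(norm1, norm2, mix_norm)
```

The mixture norm lands strictly between the two component norms here, consistent with the bound by the maximum.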

Subgaussian random vectors

So far, we have discussed subgaussianity for real-valued random variables. We can also define subgaussianity for random vectors. The purpose of subgaussianity is to make the tails decay fast, so we generalize accordingly: a subgaussian random vector is a random vector where the tail decays fast.

Let $X$ be a random vector taking values in $\mathbb {R} ^{n}$ .

Define.

$\|X\|_{\psi _{2}}:=\sup _{v\in S^{n-1}}\|v^{T}X\|_{\psi _{2}}$ , where $S^{n-1}$ is the unit sphere in $\mathbb {R} ^{n}$ .
$X$ is subgaussian iff $\|X\|_{\psi _{2}}<\infty$ .

Theorem. (Theorem 3.4.6 ^[2]) For any positive integer $n$ , the uniformly distributed random vector $X\sim U({\sqrt {n}}S^{n-1})$ is subgaussian, with $\|X\|_{\psi _{2}}\lesssim 1$ .

This is not so surprising, because as $n\to \infty$ , the projection of $U({\sqrt {n}}S^{n-1})$ to the first coordinate converges in distribution to the standard normal distribution.

Maximum inequalities

Proposition. If $X_{1},\dots ,X_{n}$ are mean-zero subgaussians, with $\|X_{i}\|_{vp}^{2}\leq \sigma ^{2}$ , then for any $\delta >0$ , we have $\max(X_{1},\dots ,X_{n})\leq \sigma {\sqrt {2\ln {\frac {n}{\delta }}}}$ with probability $\geq 1-\delta$ .

Proof. By the Chernoff bound, $Pr(X_{i}\geq \sigma {\sqrt {2\ln(n/\delta )}})\leq \delta /n$ . Now apply the union bound.
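The high-probability bound on the maximum can be illustrated by simulation. The sketch below assumes i.i.d. ${\mathcal {N}}(0,\sigma ^{2})$ variables (one convenient family with variance proxy $\sigma ^{2}$ ) and estimates how often the maximum exceeds $\sigma {\sqrt {2\ln(n/\delta )}}$ . The parameters and seed are arbitrary.

```python
import math
import random

rng = random.Random(1)                       # fixed seed for reproducibility
n, sigma, delta, trials = 100, 1.0, 0.05, 2000

# Threshold from the proposition: sigma * sqrt(2 ln(n / delta)).
threshold = sigma * math.sqrt(2.0 * math.log(n / delta))

# Count the trials in which the maximum of n Gaussians exceeds the threshold.
exceed = sum(
    max(rng.gauss(0.0, sigma) for _ in range(n)) > threshold
    for _ in range(trials)
)
failure_rate = exceed / trials
print(f"threshold = {threshold:.3f}, empirical failure rate = {failure_rate:.4f}")
```

The empirical failure rate is usually well below $\delta$ , which reflects the slack in both the Chernoff bound and the union bound.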

Proposition. (Exercise 2.5.10 ^[2]) If $X_{1},X_{2},\dots$ are subgaussians, with $\|X_{i}\|_{\psi _{2}}\leq K$ , then $E\left[\sup _{n}{\frac {|X_{n}|}{\sqrt {1+\ln n}}}\right]\lesssim K,\quad E\left[\max _{1\leq n\leq N}|X_{n}|\right]\lesssim K{\sqrt {\ln N}}$ Further, the bound is sharp, since when $X_{1},X_{2},\dots$ are IID samples of ${\mathcal {N}}(0,1)$ we have $E\left[\max _{1\leq n\leq N}|X_{n}|\right]\gtrsim {\sqrt {\ln N}}$ .^[9]

Theorem.^[10] (over a finite set) If $X_{1},\dots ,X_{n}$ are subgaussian, with $\|X_{i}\|_{vp}^{2}\leq \sigma ^{2}$ , then ${\begin{aligned}E[\max _{i}(X_{i}-E[X_{i}])]\leq \sigma {\sqrt {2\ln n}},&\quad P(\max _{i}X_{i}>t)\leq ne^{-{\frac {t^{2}}{2\sigma ^{2}}}},\\E[\max _{i}|X_{i}-E[X_{i}]|]\leq \sigma {\sqrt {2\ln(2n)}},&\quad P(\max _{i}|X_{i}|>t)\leq 2ne^{-{\frac {t^{2}}{2\sigma ^{2}}}}\end{aligned}}$

Theorem. (over a convex polytope) Fix a finite set of vectors $v_{1},\dots ,v_{n}$ . If $X$ is a random vector, such that each $\|v_{i}^{T}X\|_{vp}^{2}\leq \sigma ^{2}$ , then the above 4 inequalities hold, with $\max _{v\in \mathrm {conv} (v_{1},\dots ,v_{n})}v^{T}X$ replacing $\max _{i}X_{i}$ .

Here, $\mathrm {conv} (v_{1},\dots ,v_{n})$ is the convex polytope spanned by the vectors $v_{1},\dots ,v_{n}$ .

Theorem. (over a ball) If $X$ is a random vector in $\mathbb {R} ^{d}$ , such that $\|v^{T}X\|_{vp}^{2}\leq \sigma ^{2}$ for all $v$ on the unit sphere $S$ , then $E[\max _{v\in S}v^{T}X]=E[\max _{v\in S}|v^{T}X|]\leq 4\sigma {\sqrt {d}}$ For any $\delta >0$ , with probability at least $1-\delta$ , $\max _{v\in S}v^{T}X=\max _{v\in S}|v^{T}X|\leq 4\sigma {\sqrt {d}}+2\sigma {\sqrt {2\log(1/\delta )}}$

Inequalities

Theorem. (Theorem 2.6.1 ^[2]) There exists a positive constant $C$ such that given any number of independent mean-zero subgaussian random variables $X_{1},\dots ,X_{n}$ , $\left\|\sum _{i=1}^{n}X_{i}\right\|_{\psi _{2}}^{2}\leq C\sum _{i=1}^{n}\left\|X_{i}\right\|_{\psi _{2}}^{2}$

Theorem. (Hoeffding's inequality) (Theorem 2.6.3 ^[2]) There exists a positive constant $c$ such that given any number of independent mean-zero subgaussian random variables $X_{1},\dots ,X_{N}$ , $\mathbb {P} \left(\left|\sum _{i=1}^{N}X_{i}\right|\geq t\right)\leq 2\exp \left(-{\frac {ct^{2}}{\sum _{i=1}^{N}\left\|X_{i}\right\|_{\psi _{2}}^{2}}}\right)\quad \forall t>0$

Theorem. (Bernstein's inequality) (Theorem 2.8.1 ^[2]) There exists a positive constant $c$ such that given any number of independent mean-zero subexponential random variables $X_{1},\dots ,X_{N}$ , $\mathbb {P} \left(\left|\sum _{i=1}^{N}X_{i}\right|\geq t\right)\leq 2\exp \left(-c\min \left({\frac {t^{2}}{\sum _{i=1}^{N}\left\|X_{i}\right\|_{\psi _{1}}^{2}}},{\frac {t}{\max _{i}\left\|X_{i}\right\|_{\psi _{1}}}}\right)\right)$

Theorem. (Khinchine inequality) (Exercise 2.6.5 ^[2]) There exists a positive constant $C$ such that given any number of independent mean-zero variance-one subgaussian random variables $X_{1},\dots ,X_{N}$ , any $p\geq 2$ , and any $a_{1},\dots ,a_{N}\in \mathbb {R}$ , $\left(\sum _{i=1}^{N}a_{i}^{2}\right)^{1/2}\leq \left\|\sum _{i=1}^{N}a_{i}X_{i}\right\|_{L^{p}}\leq CK{\sqrt {p}}\left(\sum _{i=1}^{N}a_{i}^{2}\right)^{1/2}$ where $K=\max _{i}\|X_{i}\|_{\psi _{2}}$ .

Hanson-Wright inequality

The Hanson-Wright inequality states that if a random vector $X$ is subgaussian in a certain sense, then any quadratic form $X^{T}AX$ of this vector is also subgaussian/subexponential. Further, the upper bound on the tail of $X^{T}AX$ is uniform.

A weak version of the following theorem was proved in (Hanson, Wright, 1971).^[11] There are many extensions and variants. Much like the central limit theorem, the Hanson-Wright inequality is more a cluster of theorems with the same purpose, than a single theorem. The purpose is to take a subgaussian vector and uniformly bound its quadratic forms.

Theorem.^[12]^[13] There exists a constant $c$ , such that:

Let $n$ be a positive integer. Let $X_{1},...,X_{n}$ be independent random variables, such that each satisfies $E[X_{i}]=0$ . Combine them into a random vector $X=(X_{1},\dots ,X_{n})$ . For any $n\times n$ matrix $A$ , we have $P(|X^{T}AX-E[X^{T}AX]|>t)\leq \max \left(2e^{-{\frac {ct^{2}}{K^{4}\|A\|_{F}^{2}}}},2e^{-{\frac {ct}{K^{2}\|A\|}}}\right)=2\exp \left[-c\min \left({\frac {t^{2}}{K^{4}\|A\|_{F}^{2}}},{\frac {t}{K^{2}\|A\|}}\right)\right]$ where $K=\max _{i}\|X_{i}\|_{\psi _{2}}$ , and $\|A\|_{F}={\sqrt {\sum _{ij}A_{ij}^{2}}}$ is the Frobenius norm of the matrix, and $\|A\|=\max _{\|x\|_{2}=1}\|Ax\|_{2}$ is the operator norm of the matrix.

In words, the quadratic form $X^{T}AX$ has its tail uniformly bounded by an exponential, or a gaussian, whichever is larger.

In the statement of the theorem, the constant $c$ is an "absolute constant", meaning that it has no dependence on $n,X_{1},\dots ,X_{n},A$ . It is a mathematical constant much like pi and e.

Consequences

Theorem (subgaussian concentration).^[12] There exists a constant $c$ , such that:

Let $n,m$ be positive integers. Let $X_{1},...,X_{n}$ be independent random variables, such that each satisfies $E[X_{i}]=0,E[X_{i}^{2}]=1$ . Combine them into a random vector $X=(X_{1},\dots ,X_{n})$ . For any $m\times n$ matrix $A$ , we have $P(|\|AX\|_{2}-\|A\|_{F}|>t)\leq 2e^{-{\frac {ct^{2}}{K^{4}\|A\|^{2}}}}$ In words, the random vector $AX$ is concentrated on a spherical shell of radius $\|A\|_{F}$ , such that $\|AX\|_{2}-\|A\|_{F}$ is subgaussian, with subgaussian norm $\leq {\sqrt {3/c}}\|A\|K^{2}$ .
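The shell picture can be illustrated by simulation. The sketch below is a rough sanity check, not a verification of the theorem's constant: it draws a hypothetical $20\times 50$ matrix $A$ with entries uniform in $[-1,1]$ , takes $X$ with independent Rademacher coordinates (mean 0, variance 1), and measures how far $\|AX\|_{2}$ typically lies from $\|A\|_{F}$ .

```python
import math
import random

rng = random.Random(2)                         # fixed seed for reproducibility
m, n, trials = 20, 50, 500

# Hypothetical matrix A with entries uniform in [-1, 1].
A = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
frob = math.sqrt(sum(a * a for row in A for a in row))   # Frobenius norm of A

def norm_AX():
    """Euclidean norm of A X for a fresh Rademacher vector X."""
    x = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return math.sqrt(sum(sum(a * xi for a, xi in zip(row, x)) ** 2 for row in A))

# Average distance of ||AX|| from ||A||_F, relative to ||A||_F.
deviations = [abs(norm_AX() - frob) for _ in range(trials)]
mean_rel_dev = sum(deviations) / trials / frob
print(f"||A||_F = {frob:.3f}, mean relative deviation = {mean_rel_dev:.3f}")
```

A small relative deviation is what the theorem predicts whenever the operator norm $\|A\|$ is small compared with $\|A\|_{F}$ .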

Notes

1. Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge: Cambridge University Press. doi:10.1017/9781108627771. ISBN 9781108627771.
2. Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge: Cambridge University Press.
3. Kahane, J. (1960). "Propriétés locales des fonctions à séries de Fourier aléatoires". Studia Mathematica. 19: 1–25. doi:10.4064/sm-19-1-1-25.
4. Buldygin, V. V.; Kozachenko, Yu. V. (1980). "Sub-Gaussian random variables". Ukrainian Mathematical Journal. 32 (6): 483–489. doi:10.1007/BF01087176.
5. Bobkov, S. G.; Chistyakov, G. P.; Götze, F. (2023-08-03). "Strictly subgaussian probability distributions". arXiv:2308.01749 [math.PR].
6. Marchal, Olivier; Arbel, Julyan (2017). "On the sub-Gaussianity of the Beta and Dirichlet distributions". Electronic Communications in Probability. 22. arXiv:1705.00048. doi:10.1214/17-ECP92.
7. Arbel, Julyan; Marchal, Olivier; Nguyen, Hien D. (2020). "On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables". ESAIM: Probability and Statistics. 24: 39–55. arXiv:1901.09188. doi:10.1051/ps/2019018.
8. Barreto, Mathias; Marchal, Olivier; Arbel, Julyan (2024). "Optimal sub-Gaussian variance proxy for truncated Gaussian and exponential random variables". arXiv:2403.08628 [math.ST].
9. Kamath, Gautam (2015). "Bounds on the expectation of the maximum of samples from a gaussian".
10. "MIT 18.S997 | Spring 2015 | High-Dimensional Statistics, Chapter 1. Sub-Gaussian Random Variables" (PDF). MIT OpenCourseWare. Retrieved 2024-04-03.
11. Hanson, D. L.; Wright, F. T. (1971). "A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables". The Annals of Mathematical Statistics. 42 (3): 1079–1083. doi:10.1214/aoms/1177693335. ISSN 0003-4851. JSTOR 2240253.
12. Rudelson, Mark; Vershynin, Roman (January 2013). "Hanson-Wright inequality and sub-gaussian concentration". Electronic Communications in Probability. 18: 1–9. arXiv:1306.2872. doi:10.1214/ECP.v18-2865. ISSN 1083-589X.
13. Vershynin, Roman (2018). "6. Quadratic Forms, Symmetrization, and Contraction". High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press. pp. 127–146. doi:10.1017/9781108231596.009. ISBN 978-1-108-41519-4.

Related Research Articles

In probability theory, the expected value is a generalization of the weighted average. Informally, the expected value is the mean of the possible values a random variable can take, weighted by the probability of those outcomes. Since it is obtained through arithmetic, the expected value sometimes may not even be included in the sample data set; it is not the value you would expect to get in reality.

In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

The uncertainty principle, also known as Heisenberg's indeterminacy principle, is a fundamental concept in quantum mechanics. It states that there is a limit to the precision with which certain pairs of physical properties, such as position and momentum, can be simultaneously known. In other words, the more accurately one property is measured, the less accurately the other property can be known.

In probability theory and statistics, the exponential distribution or negative exponential distribution is the probability distribution of the distance between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate; the distance parameter could be any meaningful mono-dimensional measure of the process, such as time between production errors, or length along a roll of fabric in the weaving manufacturing process. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts.

In probability theory and statistics, the chi-squared distribution with $degrees of freedom is the distribution of a sum of the squares of independent standard normal random variables.$

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.

In probability theory and statistics, the gamma distribution is a versatile two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use:

With a shape parameter $α$ and a scale parameter $θ$
With a shape parameter $and a rate parameter ⁠ ⁠$

In mathematics, the Gudermannian function relates a hyperbolic angle measure $to a circular angle measure called the gudermannian of and denoted . The Gudermannian function reveals a close relationship between the circular functions and hyperbolic functions. It was introduced in the 1760s by Johann Heinrich Lambert, and later named for Christoph Gudermann who also described the relationship between circular and hyperbolic functions in 1830. The gudermannian is sometimes called the hyperbolic amplitude as a limiting case of the Jacobi elliptic amplitude when parameter$

In probability theory, the Azuma–Hoeffding inequality gives a concentration result for the values of martingales that have bounded differences.

In numerical analysis and computational statistics, rejection sampling is a basic technique used to generate observations from a distribution. It is also commonly called the acceptance-rejection method or "accept-reject algorithm" and is a type of exact simulation method. The method works for any distribution in $with a density.$

In linear algebra and functional analysis, the min-max theorem, or variational theorem, or Courant–Fischer–Weyl min-max principle, is a result that gives a variational characterization of eigenvalues of compact Hermitian operators on Hilbert spaces. It can be viewed as the starting point of many results of similar nature.

In probability theory, a Chernoff bound is an exponentially decreasing upper bound on the tail of a random variable based on its moment generating function. The minimum of all such exponential bounds forms the Chernoff or Chernoff-Cramér bound, which may decay faster than exponential. It is especially useful for sums of independent random variables, such as sums of Bernoulli random variables.

In probability theory, a compound Poisson distribution is the probability distribution of the sum of a number of independent identically-distributed random variables, where the number of terms to be added is itself a Poisson-distributed variable. The result can be either a continuous or a discrete distribution.

In probability and statistics, the Dirichlet distribution, often denoted $\operatorname{Dir}(\boldsymbol{\alpha})$, is a family of continuous multivariate probability distributions parameterized by a vector $\boldsymbol{\alpha}$ of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD). Dirichlet distributions are commonly used as prior distributions in Bayesian statistics; in fact, the Dirichlet distribution is the conjugate prior of the categorical distribution and the multinomial distribution.

In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. Hoeffding's inequality was proven by Wassily Hoeffding in 1963.

In statistics, the generalized Pareto distribution (GPD) is a family of continuous probability distributions. It is often used to model the tails of another distribution. It is specified by three parameters: location $\mu$, scale $\sigma$, and shape $\xi$. Sometimes it is specified by only scale and shape, and sometimes only by its shape parameter. Some references give the shape parameter as $\kappa = -\xi$.

In probability theory and theoretical computer science, McDiarmid's inequality is a concentration inequality which bounds the deviation between the sampled value and the expected value of certain functions when they are evaluated on independent random variables. McDiarmid's inequality applies to functions that satisfy a bounded differences property, meaning that replacing a single argument to the function while leaving all other arguments unchanged cannot cause too large of a change in the value of the function.

In probability theory and statistics, the half-normal distribution is a special case of the folded normal distribution.

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event. It can also be used for the number of events in other types of intervals than time, and in dimension greater than 1.

In probability theory, concentration inequalities provide mathematical bounds on the probability of a random variable deviating from some value. The deviation or other function of the random variable can be thought of as a secondary random variable. The simplest example of the concentration of such a secondary random variable is the CDF of the first random variable which concentrates the probability to unity. If an analytic form of the CDF is available this provides a concentration equality that provides the exact probability of concentration. It is precisely when the CDF is difficult to calculate or even the exact form of the first random variable is unknown that the applicable concentration inequalities provide useful insight.

References

Kahane, J.P. (1960). "Propriétés locales des fonctions à séries de Fourier aléatoires". Studia Mathematica. 19: 1–25. doi:10.4064/sm-19-1-1-25.
Tao, Terence (2012). Topics in random matrix theory. Graduate studies in mathematics. Providence, R.I: American Mathematical Society. ISBN 978-0-8218-7430-1.
Matoušek, Jiří (September 2008). "On variants of the Johnson–Lindenstrauss lemma". Random Structures & Algorithms. 33 (2): 142–156. doi:10.1002/rsa.20218. ISSN 1042-9832.
Buldygin, V.V.; Kozachenko, Yu.V. (1980). "Sub-Gaussian random variables". Ukrainian Mathematical Journal. 32 (6): 483–489. doi:10.1007/BF01087176.
Ledoux, Michel; Talagrand, Michel (1991). Probability in Banach Spaces. Springer-Verlag.
Stromberg, K.R. (1994). Probability for Analysts. Chapman & Hall/CRC.
Litvak, A.E.; Pajor, A.; Rudelson, M.; Tomczak-Jaegermann, N. (2005). "Smallest singular value of random matrices and geometry of random polytopes" (PDF). Advances in Mathematics. 195 (2): 491–523. doi:10.1016/j.aim.2004.08.004.
Rudelson, Mark; Vershynin, Roman (2010). "Non-asymptotic theory of random matrices: extreme singular values". Proceedings of the International Congress of Mathematicians 2010. pp. 1576–1602. arXiv:1003.2990. doi:10.1142/9789814324359_0111.
Rivasplata, O. (2012). "Subgaussian random variables: An expository note" (PDF). Unpublished.
Vershynin, R. (2018). "High-dimensional probability: An introduction with applications in data science" (PDF). Volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
Zajkowski, K. (2020). "On norms in some class of exponential type Orlicz spaces of random variables". Positivity. An International Mathematics Journal Devoted to Theory and Applications of Positivity. 24 (5): 1231–1240. arXiv:1709.02970. doi:10.1007/s11117-019-00729-6.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge: Cambridge University Press. doi:10.1017/9781108627771. ISBN 9781108627771.

[2] Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge: Cambridge University Press.

[3] Kahane, J. (1960). "Propriétés locales des fonctions à séries de Fourier aléatoires". Studia Mathematica. 19: 1–25. doi:10.4064/sm-19-1-1-25.

[4] Buldygin, V. V.; Kozachenko, Yu. V. (1980). "Sub-Gaussian random variables". Ukrainian Mathematical Journal. 32 (6): 483–489. doi:10.1007/BF01087176.

[5] Bobkov, S. G.; Chistyakov, G. P.; Götze, F. (2023-08-03). "Strictly subgaussian probability distributions". arXiv:2308.01749 [math.PR].

[6] Marchal, Olivier; Arbel, Julyan (2017). "On the sub-Gaussianity of the Beta and Dirichlet distributions". Electronic Communications in Probability. 22. arXiv:1705.00048. doi:10.1214/17-ECP92.

[7] Arbel, Julyan; Marchal, Olivier; Nguyen, Hien D. (2020). "On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables". ESAIM: Probability and Statistics. 24: 39–55. arXiv:1901.09188. doi:10.1051/ps/2019018.

[8] Barreto, Mathias; Marchal, Olivier; Arbel, Julyan (2024). "Optimal sub-Gaussian variance proxy for truncated Gaussian and exponential random variables". arXiv:2403.08628 [math.ST].

[9] Kamath, Gautam (2015). "Bounds on the expectation of the maximum of samples from a Gaussian".

[10] "MIT 18.S997 | Spring 2015 | High-Dimensional Statistics, Chapter 1. Sub-Gaussian Random Variables" (PDF). MIT OpenCourseWare. Retrieved 2024-04-03.

[11] Hanson, D. L.; Wright, F. T. (1971). "A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables". The Annals of Mathematical Statistics. 42 (3): 1079–1083. doi:10.1214/aoms/1177693335. ISSN 0003-4851. JSTOR 2240253.

[12] Rudelson, Mark; Vershynin, Roman (January 2013). "Hanson-Wright inequality and sub-gaussian concentration". Electronic Communications in Probability. 18: 1–9. arXiv:1306.2872. doi:10.1214/ECP.v18-2865. ISSN 1083-589X.

[13] Vershynin, Roman (2018). "6. Quadratic Forms, Symmetrization, and Contraction". High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press. pp. 127–146. doi:10.1017/9781108231596.009. ISBN 978-1-108-41519-4.
