Matrix Chernoff bound


For certain applications in linear algebra, it is useful to know properties of the probability distribution of the largest eigenvalue of a finite sum of random matrices. Suppose $(X_k)$ is a finite sequence of random matrices. Analogous to the well-known Chernoff bound for sums of scalars, a bound on the following is sought for a given parameter $t$:

$$\Pr\left\{ \lambda_{\max}\left( \sum_k X_k \right) \ge t \right\}.$$


The following theorems answer this general question under various assumptions; these assumptions are named below by analogy to their classical, scalar counterparts. All of these theorems can be found in (Tropp 2010), as specific applications of a general result which is derived below. A summary of related works is also given.

Matrix Gaussian and Rademacher series

Self-adjoint matrices case

Consider a finite sequence $(A_k)$ of fixed, self-adjoint matrices with dimension $d$, and let $(\xi_k)$ be a finite sequence of independent standard normal or independent Rademacher random variables.

Then, for all $t \ge 0$,

$$\Pr\left\{ \lambda_{\max}\left( \sum_k \xi_k A_k \right) \ge t \right\} \le d \, e^{-t^2/2\sigma^2},$$

where

$$\sigma^2 = \left\| \sum_k A_k^2 \right\|.$$
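As a sanity check (an illustrative specialization, not part of the cited statement), the case $d = 1$ recovers the familiar scalar bound: each $A_k$ is a real number $a_k$, and the series $\sum_k \xi_k a_k$ is Gaussian (or sub-Gaussian, in the Rademacher case) with variance proxy $\sigma^2 = \sum_k a_k^2$, so the theorem reduces to

$$\Pr\left\{ \sum_k \xi_k a_k \ge t \right\} \le e^{-t^2/2\sigma^2}, \qquad \sigma^2 = \sum_k a_k^2,$$

which is the classical tail bound for a scalar Gaussian or Rademacher series; the only cost of the matrix extension is the dimensional factor $d$.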

Rectangular case

Consider a finite sequence $(B_k)$ of fixed matrices with dimension $d_1 \times d_2$, and let $(\xi_k)$ be a finite sequence of independent standard normal or independent Rademacher random variables. Define the variance parameter

$$\sigma^2 = \max\left\{ \left\| \sum_k B_k B_k^* \right\|, \; \left\| \sum_k B_k^* B_k \right\| \right\}.$$

Then, for all $t \ge 0$,

$$\Pr\left\{ \left\| \sum_k \xi_k B_k \right\| \ge t \right\} \le (d_1 + d_2) \, e^{-t^2/2\sigma^2}.$$
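As an aside (a standard device, sketched here rather than quoted from the source), the rectangular case can be deduced from the self-adjoint case via the Hermitian dilation $\mathcal{H}(B)$, a piece of notation introduced only for this remark:

$$\mathcal{H}(B) = \begin{pmatrix} 0 & B \\ B^* & 0 \end{pmatrix}, \qquad \lambda_{\max}\big(\mathcal{H}(B)\big) = \|B\|, \qquad \mathcal{H}(B)^2 = \begin{pmatrix} B B^* & 0 \\ 0 & B^* B \end{pmatrix}.$$

Applying the self-adjoint series bound to the $(d_1 + d_2)$-dimensional dilations $\mathcal{H}(B_k)$ produces the dimensional factor $d_1 + d_2$ and the variance parameter $\sigma^2$ above.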

Matrix Chernoff inequalities

The classical Chernoff bounds concern the sum of independent, nonnegative, and uniformly bounded random variables. In the matrix setting, the analogous theorem concerns a sum of positive-semidefinite random matrices subjected to a uniform eigenvalue bound.

Matrix Chernoff I

Consider a finite sequence $(X_k)$ of independent, random, self-adjoint matrices with dimension $d$. Assume that each random matrix satisfies

$$X_k \succeq 0 \quad \text{and} \quad \lambda_{\max}(X_k) \le R$$

almost surely.

Define

$$\mu_{\min} = \lambda_{\min}\left( \sum_k \mathbb{E}\, X_k \right), \qquad \mu_{\max} = \lambda_{\max}\left( \sum_k \mathbb{E}\, X_k \right).$$

Then

$$\Pr\left\{ \lambda_{\min}\left( \sum_k X_k \right) \le (1-\delta)\,\mu_{\min} \right\} \le d \left[ \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \right]^{\mu_{\min}/R} \quad \text{for } \delta \in [0, 1), \text{ and}$$

$$\Pr\left\{ \lambda_{\max}\left( \sum_k X_k \right) \ge (1+\delta)\,\mu_{\max} \right\} \le d \left[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu_{\max}/R} \quad \text{for } \delta \ge 0.$$
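For orientation (a weakening obtained from the stated bound via the elementary estimate $-\delta - (1-\delta)\log(1-\delta) \le -\delta^2/2$, not an additional result from the source), the lower tail is often quoted in the sub-Gaussian form

$$\Pr\left\{ \lambda_{\min}\left( \sum_k X_k \right) \le (1-\delta)\,\mu_{\min} \right\} \le d \, e^{-\delta^2 \mu_{\min}/2R}, \qquad \delta \in [0, 1).$$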

Matrix Chernoff II

Consider a sequence $\{X_k : k = 1, 2, \ldots, n\}$ of independent, random, self-adjoint matrices with dimension $d$ that satisfy

$$X_k \succeq 0 \quad \text{and} \quad \lambda_{\max}(X_k) \le 1$$

almost surely.

Compute the minimum and maximum eigenvalues of the average expectation,

$$\bar{\mu}_{\min} = \lambda_{\min}\left( \frac{1}{n} \sum_{k=1}^n \mathbb{E}\, X_k \right), \qquad \bar{\mu}_{\max} = \lambda_{\max}\left( \frac{1}{n} \sum_{k=1}^n \mathbb{E}\, X_k \right).$$

Then

$$\Pr\left\{ \lambda_{\min}\left( \frac{1}{n} \sum_{k=1}^n X_k \right) \le \alpha \right\} \le d \, e^{-n D(\alpha \,\|\, \bar{\mu}_{\min})} \quad \text{for } 0 \le \alpha \le \bar{\mu}_{\min}, \text{ and}$$

$$\Pr\left\{ \lambda_{\max}\left( \frac{1}{n} \sum_{k=1}^n X_k \right) \ge \alpha \right\} \le d \, e^{-n D(\alpha \,\|\, \bar{\mu}_{\max})} \quad \text{for } \bar{\mu}_{\max} \le \alpha \le 1.$$

The binary information divergence is defined as

$$D(a \,\|\, u) = a \left( \log a - \log u \right) + (1-a) \left( \log(1-a) - \log(1-u) \right)$$

for $a, u \in [0, 1]$.
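As a further remark (not part of the cited statement), Pinsker's inequality for the binary divergence, $D(a \,\|\, u) \ge 2(a-u)^2$, converts the divergence form into a sub-Gaussian bound; for instance, taking $\alpha = \bar{\mu}_{\max} + t$,

$$\Pr\left\{ \lambda_{\max}\left( \frac{1}{n} \sum_{k=1}^n X_k \right) \ge \bar{\mu}_{\max} + t \right\} \le d \, e^{-2nt^2}, \qquad 0 \le t \le 1 - \bar{\mu}_{\max}.$$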

Matrix Bennett and Bernstein inequalities

In the scalar setting, Bennett and Bernstein inequalities describe the upper tail of a sum of independent, zero-mean random variables that are either bounded or subexponential. In the matrix case, the analogous results concern a sum of zero-mean random matrices.

Bounded case

Consider a finite sequence $(X_k)$ of independent, random, self-adjoint matrices with dimension $d$. Assume that each random matrix satisfies

$$\mathbb{E}\, X_k = 0 \quad \text{and} \quad \lambda_{\max}(X_k) \le R$$

almost surely.

Compute the norm of the total variance,

$$\sigma^2 = \left\| \sum_k \mathbb{E}\left( X_k^2 \right) \right\|.$$

Then, the following chain of inequalities holds for all $t \ge 0$:

$$\begin{aligned}
\Pr\left\{ \lambda_{\max}\left( \sum_k X_k \right) \ge t \right\}
&\le d \exp\left( -\frac{\sigma^2}{R^2} \, h\!\left( \frac{Rt}{\sigma^2} \right) \right) \\
&\le d \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right) \\
&\le \begin{cases} d \exp\left( -3t^2/8\sigma^2 \right), & t \le \sigma^2/R, \\ d \exp\left( -3t/8R \right), & t \ge \sigma^2/R. \end{cases}
\end{aligned}$$

The function $h$ is defined as $h(u) = (1+u)\log(1+u) - u$ for $u \ge 0$.
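For context (a routine verification, not quoted from the source), the second inequality in the chain follows from the first via an elementary lower bound on $h$:

$$h(u) = (1+u)\log(1+u) - u \ge \frac{u^2/2}{1 + u/3} \quad (u \ge 0), \qquad \text{so} \qquad \frac{\sigma^2}{R^2} \, h\!\left( \frac{Rt}{\sigma^2} \right) \ge \frac{t^2/2}{\sigma^2 + Rt/3}.$$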

Subexponential case

Consider a finite sequence $(X_k)$ of independent, random, self-adjoint matrices with dimension $d$. Assume that

$$\mathbb{E}\, X_k = 0 \quad \text{and} \quad \mathbb{E}\left( X_k^p \right) \preceq \frac{p!}{2} \, R^{p-2} A_k^2$$

for $p = 2, 3, 4, \ldots$

Compute the variance parameter,

$$\sigma^2 = \left\| \sum_k A_k^2 \right\|.$$

Then, the following chain of inequalities holds for all $t \ge 0$:

$$\Pr\left\{ \lambda_{\max}\left( \sum_k X_k \right) \ge t \right\} \le d \exp\left( \frac{-t^2}{2(\sigma^2 + Rt)} \right) \le \begin{cases} d \exp\left( -t^2/4\sigma^2 \right), & t \le \sigma^2/R, \\ d \exp\left( -t/4R \right), & t \ge \sigma^2/R. \end{cases}$$
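A remark that is not part of the cited statement: the moment condition is satisfied, for example, by uniformly bounded summands. If $\|X_k\| \le R$ almost surely and $\mathbb{E}\left( X_k^2 \right) \preceq A_k^2$, then, since $X_k^p \preceq R^{p-2} X_k^2$ by the spectral mapping theorem and $p!/2 \ge 1$,

$$\mathbb{E}\left( X_k^p \right) \preceq R^{p-2} \, \mathbb{E}\left( X_k^2 \right) \preceq \frac{p!}{2} \, R^{p-2} A_k^2 \quad \text{for } p = 2, 3, 4, \ldots$$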

Rectangular case

Consider a finite sequence $(Z_k)$ of independent, random matrices with dimension $d_1 \times d_2$. Assume that each random matrix satisfies

$$\mathbb{E}\, Z_k = 0 \quad \text{and} \quad \| Z_k \| \le R$$

almost surely. Define the variance parameter

$$\sigma^2 = \max\left\{ \left\| \sum_k \mathbb{E}\left( Z_k Z_k^* \right) \right\|, \; \left\| \sum_k \mathbb{E}\left( Z_k^* Z_k \right) \right\| \right\}.$$

Then, for all $t \ge 0$,

$$\Pr\left\{ \left\| \sum_k Z_k \right\| \ge t \right\} \le (d_1 + d_2) \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right)$$

holds. [1]

Matrix Azuma, Hoeffding, and McDiarmid inequalities

Matrix Azuma

The scalar version of Azuma's inequality states that a scalar martingale exhibits normal concentration about its mean value, with the scale for deviations controlled by the total maximum squared range of the difference sequence. The following is its extension to the matrix setting.

Consider a finite adapted sequence $(X_k)$ of self-adjoint matrices with dimension $d$, and a fixed sequence $(A_k)$ of self-adjoint matrices that satisfy

$$\mathbb{E}_{k-1}\, X_k = 0 \quad \text{and} \quad A_k^2 \succeq X_k^2$$

almost surely.

Compute the variance parameter

$$\sigma^2 = \left\| \sum_k A_k^2 \right\|.$$

Then, for all $t \ge 0$,

$$\Pr\left\{ \lambda_{\max}\left( \sum_k X_k \right) \ge t \right\} \le d \, e^{-t^2/8\sigma^2}.$$

The constant 1/8 can be improved to 1/2 when there is additional information available. One case occurs when each summand $X_k$ is conditionally symmetric. Another example requires the assumption that $X_k$ commutes almost surely with $A_k$.

Matrix Hoeffding

Adding the assumption that the summands in Matrix Azuma are independent gives a matrix extension of Hoeffding's inequalities.

Consider a finite sequence $(X_k)$ of independent, random, self-adjoint matrices with dimension $d$, and let $(A_k)$ be a sequence of fixed self-adjoint matrices. Assume that each random matrix satisfies

$$\mathbb{E}\, X_k = 0 \quad \text{and} \quad X_k^2 \preceq A_k^2$$

almost surely.

Then, for all $t \ge 0$,

$$\Pr\left\{ \lambda_{\max}\left( \sum_k X_k \right) \ge t \right\} \le d \, e^{-t^2/8\sigma^2},$$

where

$$\sigma^2 = \left\| \sum_k A_k^2 \right\|.$$
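For orientation (an illustrative specialization, not part of the cited statement), in the scalar case $d = 1$ with $A_k = a_k$ and $|X_k| \le a_k$ the bound reads

$$\Pr\left\{ \sum_k X_k \ge t \right\} \le e^{-t^2/8\sigma^2}, \qquad \sigma^2 = \sum_k a_k^2,$$

which has the same shape as the classical Hoeffding bound, though with a weaker constant in the exponent.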

An improvement of this result was established in (Mackey et al. 2012): for all $t \ge 0$,

$$\Pr\left\{ \lambda_{\max}\left( \sum_k X_k \right) \ge t \right\} \le d \, e^{-t^2/2\sigma^2},$$

where

$$\sigma^2 = \frac{1}{2} \left\| \sum_k \left( A_k^2 + \mathbb{E}\, X_k^2 \right) \right\|.$$

Matrix bounded difference (McDiarmid)

In the scalar setting, McDiarmid's inequality bounds a function of independent random variables with bounded differences by applying Azuma's inequality to a Doob martingale. A version of the bounded differences inequality holds in the matrix setting.

Let $\{Z_k : k = 1, 2, \ldots, n\}$ be an independent family of random variables, and let $H$ be a function that maps $n$ variables to a self-adjoint matrix of dimension $d$. Consider a sequence $(A_k)$ of fixed self-adjoint matrices that satisfy

$$\left( H(z_1, \ldots, z_k, \ldots, z_n) - H(z_1, \ldots, z_k', \ldots, z_n) \right)^2 \preceq A_k^2,$$

where $z_i$ and $z_i'$ range over all possible values of $Z_i$ for each index $i$. Compute the variance parameter

$$\sigma^2 = \left\| \sum_k A_k^2 \right\|.$$

Then, for all $t \ge 0$,

$$\Pr\left\{ \lambda_{\max}\left( H(z) - \mathbb{E}\, H(z) \right) \ge t \right\} \le d \, e^{-t^2/8\sigma^2},$$

where $z = (Z_1, \ldots, Z_n)$.

An improvement of this result was established in (Paulin, Mackey & Tropp 2013); see also (Paulin, Mackey & Tropp 2016).

The first bounds of this type were derived by (Ahlswede & Winter 2003). Recall the theorem above for self-adjoint matrix Gaussian and Rademacher bounds: for a finite sequence $(A_k)$ of fixed, self-adjoint matrices with dimension $d$ and a finite sequence $(\xi_k)$ of independent standard normal or independent Rademacher random variables,

$$\Pr\left\{ \lambda_{\max}\left( \sum_k \xi_k A_k \right) \ge t \right\} \le d \, e^{-t^2/2\sigma^2},$$

where

$$\sigma^2 = \left\| \sum_k A_k^2 \right\|.$$

Ahlswede and Winter would give the same result, except with

$$\sigma_{AW}^2 = \sum_k \lambda_{\max}\left( A_k^2 \right).$$

By comparison, the $\sigma^2$ in the theorem above interchanges the sum and the $\lambda_{\max}$; that is, it is the largest eigenvalue of the sum rather than the sum of the largest eigenvalues. It is never larger than the Ahlswede–Winter value (by the triangle inequality for the norm), but can be much smaller. Therefore, the theorem above gives a tighter bound than the Ahlswede–Winter result.
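To make the comparison concrete (an illustrative example, not taken from the source), let $A_k = e_k e_k^{\mathsf{T}}$ for $k = 1, \ldots, d$ be the diagonal standard-basis projectors. Then

$$\sigma^2 = \left\| \sum_{k=1}^d A_k^2 \right\| = \| I_d \| = 1, \qquad \sigma_{AW}^2 = \sum_{k=1}^d \lambda_{\max}\left( A_k^2 \right) = d,$$

so the two variance parameters can differ by a factor of the dimension.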

The chief contribution of (Ahlswede & Winter 2003) was the extension of the Laplace-transform method used to prove the scalar Chernoff bound (see Chernoff bound#Additive form (absolute error)) to the case of self-adjoint matrices. The procedure is given in the derivation below. All of the recent works on this topic follow the same procedure, and the chief differences lie in the subsequent steps: Ahlswede and Winter use the Golden–Thompson inequality to proceed, whereas Tropp (Tropp 2010) uses Lieb's theorem.

Suppose one wished to vary the length of the series (n) and the dimensions of the matrices (d) while keeping the right-hand side approximately constant. Then n must vary approximately as the log of d. Several papers have attempted to establish a bound without a dependence on dimensions. Rudelson and Vershynin (Rudelson & Vershynin 2007) give a result for matrices which are the outer product of two vectors. (Magen & Zouzias 2010) provide a result without the dimensional dependence for low-rank matrices. The original result was derived independently of the Ahlswede–Winter approach, but (Oliveira 2010b) proves a similar result using the Ahlswede–Winter approach.

Finally, Oliveira (Oliveira 2010a) proves a result for matrix martingales independently of the Ahlswede–Winter framework. Tropp (Tropp 2011) slightly improves on the result using the Ahlswede–Winter framework. Neither result is presented in this article.

Derivation and proof

Ahlswede and Winter

The Laplace transform argument found in (Ahlswede & Winter 2003) is a significant result in its own right: Let $X$ be a random self-adjoint matrix. Then

$$\Pr\left\{ \lambda_{\max}(X) \ge t \right\} \le \inf_{\theta > 0} \left\{ e^{-\theta t} \, \mathbb{E}\left[ \operatorname{tr} e^{\theta X} \right] \right\}.$$

To prove this, fix $\theta > 0$. Then

$$\begin{aligned}
\Pr\left\{ \lambda_{\max}(X) \ge t \right\}
&= \Pr\left\{ \lambda_{\max}(\theta X) \ge \theta t \right\} \\
&= \Pr\left\{ e^{\lambda_{\max}(\theta X)} \ge e^{\theta t} \right\} \\
&\le e^{-\theta t} \, \mathbb{E}\, e^{\lambda_{\max}(\theta X)} \\
&= e^{-\theta t} \, \mathbb{E}\, \lambda_{\max}\left( e^{\theta X} \right) \\
&\le e^{-\theta t} \, \mathbb{E}\left[ \operatorname{tr} e^{\theta X} \right].
\end{aligned}$$

The second-to-last inequality is Markov's inequality. The last inequality holds since $\lambda_{\max}\left( e^{\theta X} \right) \le \operatorname{tr} e^{\theta X}$, because $e^{\theta X}$ is positive definite. Since the left-most quantity is independent of $\theta$, the infimum over $\theta > 0$ remains an upper bound for it.

Thus, our task is to understand $\mathbb{E}\left[ \operatorname{tr} e^{\theta X} \right]$. Since trace and expectation are both linear, we can commute them, so it is sufficient to consider the matrix generating function $M_X(\theta) = \mathbb{E}\, e^{\theta X}$. This is where the methods of (Ahlswede & Winter 2003) and (Tropp 2010) diverge. The immediately following presentation follows (Ahlswede & Winter 2003).

The Golden–Thompson inequality implies that, for independent random self-adjoint matrices $X_1$ and $X_2$,

$$\operatorname{tr} M_{X_1 + X_2}(\theta) = \mathbb{E}\, \operatorname{tr} e^{\theta X_1 + \theta X_2} \le \mathbb{E}\, \operatorname{tr}\left( e^{\theta X_1} e^{\theta X_2} \right) = \operatorname{tr}\left( \mathbb{E}\, e^{\theta X_1} \, \mathbb{E}\, e^{\theta X_2} \right) = \operatorname{tr}\left( M_{X_1}(\theta) \, M_{X_2}(\theta) \right),$$

where we used the independence of $X_1$ and $X_2$ and the linearity of expectation several times.

Suppose $X = \sum_{i=1}^n X_i$, where the $X_i$ are independent. We can find an upper bound for $\operatorname{tr} M_X(\theta)$ by iterating this result. Noting that $\operatorname{tr}(AB) \le \operatorname{tr}(A) \, \lambda_{\max}(B)$ for positive semidefinite $A$ and $B$, then

$$\operatorname{tr} M_X(\theta) \le \operatorname{tr}\left( M_{\sum_{i=1}^{n-1} X_i}(\theta) \right) \lambda_{\max}\left( \mathbb{E}\, e^{\theta X_n} \right).$$

Iterating this, we get

$$\operatorname{tr} M_X(\theta) \le \operatorname{tr}(I) \prod_{i=1}^n \lambda_{\max}\left( \mathbb{E}\, e^{\theta X_i} \right) = d \prod_{i=1}^n \lambda_{\max}\left( \mathbb{E}\, e^{\theta X_i} \right).$$

So far we have found a bound with an infimum over $\theta$. In turn, this can be bounded. At any rate, one can see how the Ahlswede–Winter bound arises as the sum of largest eigenvalues.
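For completeness (written out here under the notation above, not verbatim from the source), combining the iteration with the Laplace transform bound gives the Ahlswede–Winter form of the tail bound:

$$\Pr\left\{ \lambda_{\max}\left( \sum_i X_i \right) \ge t \right\} \le \inf_{\theta > 0} \left\{ e^{-\theta t} \, d \prod_i \lambda_{\max}\left( \mathbb{E}\, e^{\theta X_i} \right) \right\} = d \, \inf_{\theta > 0} \exp\left( -\theta t + \sum_i \log \lambda_{\max}\left( \mathbb{E}\, e^{\theta X_i} \right) \right),$$

which is where the sum of largest eigenvalues, and hence the variance parameter $\sigma_{AW}^2$, enters.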

Tropp

The major contribution of (Tropp 2010) is the application of Lieb's theorem where (Ahlswede & Winter 2003) had applied the Golden–Thompson inequality. Tropp's corollary is the following: If $H$ is a fixed self-adjoint matrix and $X$ is a random self-adjoint matrix, then

$$\mathbb{E}\, \operatorname{tr} \exp\left( H + X \right) \le \operatorname{tr} \exp\left( H + \log\left( \mathbb{E}\, e^X \right) \right).$$

Proof: Let $M = e^X$. Then Lieb's theorem tells us that

$$f(M) = \operatorname{tr} \exp\left( H + \log M \right)$$

is concave. The final step is to use Jensen's inequality to move the expectation inside the function:

$$\mathbb{E}\, \operatorname{tr} \exp\left( H + \log M \right) \le \operatorname{tr} \exp\left( H + \log \mathbb{E}\, M \right) = \operatorname{tr} \exp\left( H + \log \mathbb{E}\, e^X \right).$$

This gives us the major result of the paper: the subadditivity of the log of the matrix generating function.

Subadditivity of log mgf

Let $(X_k)$ be a finite sequence of independent, random self-adjoint matrices. Then for all $\theta \in \mathbb{R}$,

$$\operatorname{tr} M_{\sum_k X_k}(\theta) \le \operatorname{tr} \exp\left( \sum_k \log M_{X_k}(\theta) \right).$$

Proof: It is sufficient to let $\theta = 1$. Expanding the definitions, we need to show that

$$\mathbb{E}\, \operatorname{tr} \exp\left( \sum_k X_k \right) \le \operatorname{tr} \exp\left( \sum_k \log \mathbb{E}\, e^{X_k} \right).$$

To complete the proof, we use the law of total expectation. Let $\mathbb{E}_m$ denote the expectation conditioned on $X_1, \ldots, X_m$. Since we assume all the $X_k$ are independent,

$$\mathbb{E}_{k-1}\, e^{X_k} = \mathbb{E}\, e^{X_k}.$$

Define $\Xi_k = \log \mathbb{E}_{k-1}\, e^{X_k} = \log \mathbb{E}\, e^{X_k}$.

Finally, we have

$$\begin{aligned}
\mathbb{E}\, \operatorname{tr} \exp\left( \sum_{k=1}^n X_k \right)
&= \mathbb{E}_0 \cdots \mathbb{E}_{n-1} \operatorname{tr} \exp\left( \sum_{k=1}^{n-1} X_k + X_n \right) \\
&\le \mathbb{E}_0 \cdots \mathbb{E}_{n-2} \operatorname{tr} \exp\left( \sum_{k=1}^{n-1} X_k + \log\left( \mathbb{E}_{n-1}\, e^{X_n} \right) \right) \\
&= \mathbb{E}_0 \cdots \mathbb{E}_{n-2} \operatorname{tr} \exp\left( \sum_{k=1}^{n-2} X_k + X_{n-1} + \Xi_n \right) \\
&\;\;\vdots \\
&\le \operatorname{tr} \exp\left( \sum_{k=1}^n \Xi_k \right),
\end{aligned}$$

where at every step $m$ we use Tropp's corollary with

$$H_m = \sum_{k=1}^{m-1} X_k + \sum_{k=m+1}^{n} \Xi_k.$$

Master tail bound

The following is immediate from the previous result:

$$\Pr\left\{ \lambda_{\max}\left( \sum_k X_k \right) \ge t \right\} \le \inf_{\theta > 0} \left\{ e^{-\theta t} \, \operatorname{tr} \exp\left( \sum_k \log M_{X_k}(\theta) \right) \right\} = \inf_{\theta > 0} \left\{ e^{-\theta t} \, \operatorname{tr} \exp\left( \sum_k \log \mathbb{E}\, e^{\theta X_k} \right) \right\}.$$

All of the theorems given above are derived from this bound; the theorems consist in various ways to bound the infimum. These steps are significantly simpler than the proofs given.
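To illustrate how the theorems follow (a sketch under the notation above, not reproduced from the source), consider the self-adjoint Gaussian series $\sum_k \xi_k A_k$ with standard normal $\xi_k$. The matrix mgf computation $\mathbb{E}\, e^{\theta \xi_k A_k} = e^{\theta^2 A_k^2/2}$ turns the master bound into

$$\Pr\left\{ \lambda_{\max}\left( \sum_k \xi_k A_k \right) \ge t \right\} \le \inf_{\theta > 0} e^{-\theta t} \operatorname{tr} \exp\left( \frac{\theta^2}{2} \sum_k A_k^2 \right) \le \inf_{\theta > 0} e^{-\theta t} \, d \, e^{\theta^2 \sigma^2/2} = d \, e^{-t^2/2\sigma^2},$$

with $\sigma^2 = \left\| \sum_k A_k^2 \right\|$ and the infimum attained at $\theta = t/\sigma^2$, which is exactly the matrix Gaussian series bound stated at the beginning of the article.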


References