Total variation distance of probability measures

Figure caption: Total variation distance is half the absolute area between the two curves (half the shaded area).

In probability theory, the total variation distance is a distance measure for probability distributions. It is an example of a statistical distance metric, and is sometimes called the statistical distance, statistical difference or variational distance.


Definition

Consider a measurable space $(\Omega, \mathcal{F})$ and probability measures $P$ and $Q$ defined on it. The total variation distance between $P$ and $Q$ is defined as:[1]

$$\delta(P, Q) = \sup_{A \in \mathcal{F}} \left| P(A) - Q(A) \right|.$$

This is the largest absolute difference between the probabilities that the two probability distributions assign to the same event.
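
As a concrete illustration, the following minimal Python sketch (not from the cited sources; the three-point mass functions are hypothetical) evaluates the supremum by brute force, enumerating every event of a small finite sample space:

```python
# Brute-force illustration of delta(P, Q) = sup_A |P(A) - Q(A)|
# on a small finite sample space, for hypothetical P and Q.
from itertools import chain, combinations

P = {"a": 0.5, "b": 0.3, "c": 0.2}   # probability mass function of P
Q = {"a": 0.2, "b": 0.4, "c": 0.4}   # probability mass function of Q

def events(points):
    """All subsets (events) of a finite sample space."""
    return chain.from_iterable(combinations(points, r) for r in range(len(points) + 1))

# Largest absolute difference in probability assigned to the same event.
tv = max(abs(sum(P[x] for x in A) - sum(Q[x] for x in A)) for A in events(P))
print(tv)   # 0.3, attained e.g. at A = {"a"} (or at its complement)
```

For larger or continuous spaces this enumeration is infeasible; the half-$L^1$ formula given below is the practical route.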

Properties

The total variation distance is an f-divergence and an integral probability metric.

Relation to other distances

The total variation distance is related to the Kullback–Leibler divergence by Pinsker's inequality:

$$\delta(P, Q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \parallel Q)}.$$

One also has the following inequality, due to Bretagnolle and Huber[2] (see also Tsybakov[3]), which has the advantage of providing a non-vacuous bound even when $D_{\mathrm{KL}}(P \parallel Q) > 2$:

$$\delta(P, Q) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(P \parallel Q)}}.$$
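
The rough numerical sketch below (illustrative only; Bernoulli distributions are used purely for convenience) compares the exact total variation distance with both upper bounds. Once the Kullback–Leibler divergence exceeds 2, Pinsker's bound is larger than 1 and hence uninformative, while the Bretagnolle–Huber bound stays below 1:

```python
# Compare the exact total variation distance between two Bernoulli distributions
# with the Pinsker and Bretagnolle–Huber upper bounds computed from the KL divergence.
from math import exp, log, sqrt

def kl_bernoulli(p, q):
    """Kullback–Leibler divergence D(P || Q) for Bernoulli(p) and Bernoulli(q)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

for p, q in [(0.5, 0.4), (0.5, 0.1), (0.5, 0.001)]:
    d = kl_bernoulli(p, q)
    tv = abs(p - q)                  # exact total variation distance for Bernoullis
    pinsker = sqrt(d / 2)            # vacuous (> 1) once d > 2
    bh = sqrt(1 - exp(-d))           # always bounded by 1
    print(f"p={p}, q={q}: tv={tv:.3f}, pinsker={pinsker:.3f}, bretagnolle-huber={bh:.3f}")
```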

The total variation distance is half the $L^1$ distance between the probability functions. On a discrete domain, this is the distance between the probability mass functions:[4]

$$\delta(P, Q) = \frac{1}{2} \sum_{x} \left| P(x) - Q(x) \right|,$$

and the relationship holds more generally as well:[5] when the distributions have probability density functions $p$ and $q$,

$$\delta(P, Q) = \frac{1}{2} \int \left| p(x) - q(x) \right| \, \mathrm{d}x,$$

and analogously for Radon–Nikodym derivatives with respect to any common dominating measure. This result can be shown by noticing that the supremum in the definition is achieved exactly on the set where one distribution dominates the other.[6]
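
A minimal numerical sketch of the density formula, under the simplifying assumptions of two equal-variance Gaussians and a plain Riemann sum on a truncated grid; the result is compared against the closed form $2\Phi\!\left(\frac{|\mu_1 - \mu_2|}{2\sigma}\right) - 1$, which holds in this special case:

```python
# Approximate delta(P, Q) = (1/2) * integral |p(x) - q(x)| dx for two Gaussians
# N(mu1, sigma^2) and N(mu2, sigma^2), and compare with the equal-variance closed form.
from math import erf, exp, pi, sqrt

mu1, mu2, sigma = 0.0, 1.0, 1.0

def pdf(x, mu, s):
    return exp(-((x - mu) ** 2) / (2 * s * s)) / (s * sqrt(2 * pi))

def std_normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# Riemann sum of |p - q| on a grid wide enough that the tails are negligible.
dx = 0.001
xs = [-10 + i * dx for i in range(20001)]
tv_numeric = 0.5 * sum(abs(pdf(x, mu1, sigma) - pdf(x, mu2, sigma)) for x in xs) * dx

tv_closed = 2 * std_normal_cdf(abs(mu1 - mu2) / (2 * sigma)) - 1
print(tv_numeric, tv_closed)   # both approximately 0.3829
```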

The total variation distance is related to the Hellinger distance $H(P, Q)$ as follows:[7]

$$H^2(P, Q) \le \delta(P, Q) \le \sqrt{2}\, H(P, Q).$$

These inequalities follow immediately from the inequalities between the 1-norm and the 2-norm.
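
A small check of the sandwich inequality on hypothetical discrete mass functions (the same kind of example as above):

```python
# Verify H^2(P, Q) <= delta(P, Q) <= sqrt(2) * H(P, Q) on a discrete example.
from math import sqrt

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.4, 0.4]

tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))                       # total variation
hellinger = sqrt(0.5 * sum((sqrt(p) - sqrt(q)) ** 2 for p, q in zip(P, Q)))

assert hellinger ** 2 <= tv <= sqrt(2) * hellinger
print(tv, hellinger)   # 0.3 and roughly 0.233
```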

Connection to transportation theory

The total variation distance (or half the $L^1$ norm) arises as the optimal transportation cost when the cost function is $c(x, y) = \mathbf{1}_{\{x \neq y\}}$, that is,

$$\delta(P, Q) = \inf\bigl\{ \Pr(X \neq Y) : \operatorname{Law}(X) = P,\ \operatorname{Law}(Y) = Q \bigr\} = \inf_{\pi} \operatorname{E}_{\pi}\bigl[ \mathbf{1}_{\{x \neq y\}} \bigr],$$

where the expectation is taken with respect to the probability measure $\pi$ on the space where $(x, y)$ lives, and the infimum is taken over all such $\pi$ with marginals $P$ and $Q$, respectively.[8]
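
The infimum is attained by the standard "maximal coupling", which keeps the two variables equal on the common mass $\min(P(x), Q(x))$ and spreads the leftover mass independently. The sketch below (hypothetical discrete $P$ and $Q$, simple Monte Carlo estimate) builds this coupling and checks that $\Pr(X \neq Y)$ matches the total variation distance:

```python
# Maximal coupling of two discrete distributions: Pr(X != Y) equals the
# total variation distance (0.3 for this hypothetical P and Q).
import random

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.2, "b": 0.4, "c": 0.4}

overlap = {x: min(P[x], Q[x]) for x in P}     # common mass, kept on the diagonal X = Y
delta = 1 - sum(overlap.values())             # total variation distance

def sample(weights):
    xs, ws = zip(*weights.items())
    return random.choices(xs, weights=ws)[0]

def maximal_coupling():
    if random.random() < 1 - delta:           # with probability 1 - delta, force X = Y
        x = sample(overlap)
        return x, x
    # otherwise draw X and Y from the leftover masses, whose supports are disjoint
    x = sample({k: P[k] - overlap[k] for k in P if P[k] > overlap[k]})
    y = sample({k: Q[k] - overlap[k] for k in Q if Q[k] > overlap[k]})
    return x, y

n = 100_000
mismatches = sum(x != y for x, y in (maximal_coupling() for _ in range(n)))
print(delta, mismatches / n)                  # both approximately 0.3
```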

See also

Related Research Articles

Expected value: Average value of a random variable

In probability theory, the expected value is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of the possible values a random variable can take, weighted by the probability of those outcomes. Since it is obtained through arithmetic, the expected value sometimes may not even be included in the sample data set; it is not the value you would "expect" to get in reality.

Entropy (information theory): Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable $X$, which takes values in the alphabet $\mathcal{X}$ and is distributed according to $p : \mathcal{X} \to [0, 1]$, the entropy is

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
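
For instance, a short sketch (fair-coin example; logarithm base 2, so the answer is in bits):

```python
# Entropy of a hypothetical fair coin: H(X) = -sum_x p(x) * log2 p(x) = 1 bit.
from math import log2

p = {"heads": 0.5, "tails": 0.5}
entropy = -sum(px * log2(px) for px in p.values())
print(entropy)   # 1.0
```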

In the calculus of variations and classical mechanics, the Euler–Lagrange equations are a system of second-order ordinary differential equations whose solutions are stationary points of the given action functional. The equations were discovered in the 1750s by Swiss mathematician Leonhard Euler and Italian mathematician Joseph-Louis Lagrange.

In quantum mechanics, information theory, and Fourier analysis, the entropic uncertainty or Hirschman uncertainty is defined as the sum of the temporal and spectral Shannon entropies. It turns out that Heisenberg's uncertainty principle can be expressed as a lower bound on the sum of these entropies. This is stronger than the usual statement of the uncertainty principle in terms of the product of standard deviations.

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted $D_{\mathrm{KL}}(P \parallel Q)$, is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P. While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.

In probability theory and mathematical physics, a random matrix is a matrix-valued random variable—that is, a matrix in which some or all elements are random variables. Many important properties of physical systems can be represented mathematically as matrix problems. For example, the thermal conductivity of a lattice can be computed from the dynamical matrix of the particle-particle interactions within the lattice.

Differential entropy is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy, a measure of average (surprisal) of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive the formula for differential entropy, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

In information theory, information dimension is an information measure for random vectors in Euclidean space, based on the normalized entropy of finely quantized versions of the random vectors. This concept was first introduced by Alfréd Rényi in 1959.

In mathematics, the Wasserstein distance or Kantorovich–Rubinstein metric is a distance function defined between probability distributions on a given metric space. It is named after Leonid Vaseršteĭn.

In mathematics, the Lévy–Prokhorov metric is a metric on the collection of probability measures on a given metric space. It is named after the French mathematician Paul Lévy and the Soviet mathematician Yuri Vasilyevich Prokhorov; Prokhorov introduced it in 1956 as a generalization of the earlier Lévy metric.

Expected shortfall (ES) is a risk measure—a concept used in the field of financial risk measurement to evaluate the market risk or credit risk of a portfolio. The "expected shortfall at q% level" is the expected return on the portfolio in the worst q% of cases. ES is an alternative to value at risk that is more sensitive to the shape of the tail of the loss distribution.

In probability theory, a real valued stochastic process X is called a semimartingale if it can be decomposed as the sum of a local martingale and a càdlàg adapted finite-variation process. Semimartingales are "good integrators", forming the largest class of processes with respect to which the Itô integral and the Stratonovich integral can be defined.

In probability and statistics, the Hellinger distance is used to quantify the similarity between two probability distributions. It is a type of f-divergence. The Hellinger distance is defined in terms of the Hellinger integral, which was introduced by Ernst Hellinger in 1909.

In mathematics, Maass forms or Maass wave forms are studied in the theory of automorphic forms. Maass forms are complex-valued smooth functions of the upper half plane, which transform in a similar way under the operation of a discrete subgroup $\Gamma$ of $\mathrm{SL}_2(\mathbb{R})$ as modular forms. They are eigenforms of the hyperbolic Laplace operator defined on the upper half plane and satisfy certain growth conditions at the cusps of a fundamental domain of $\Gamma$. In contrast to modular forms, Maass forms need not be holomorphic. They were studied first by Hans Maass in 1949.

In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.

In quantum mechanics, and especially quantum information and the study of open quantum systems, the trace distance $T$ is a metric on the space of density matrices and gives a measure of the distinguishability between two states. It is the quantum generalization of the Kolmogorov distance for classical probability distributions.

In mathematics and information theory, Sanov's theorem gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution. In the language of large deviations theory, Sanov's theorem identifies the rate function for large deviations of the empirical measure of a sequence of i.i.d. random variables.

In quantum information theory, the classical capacity of a quantum channel is the maximum rate at which classical data can be sent over it error-free in the limit of many uses of the channel. Holevo, Schumacher, and Westmoreland proved a least upper bound on the classical capacity of any quantum channel.

A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, but has since been used in diverse settings in statistics, machine learning and computer science.

In information theory, the Bretagnolle–Huber inequality bounds the total variation distance between two probability distributions $P$ and $Q$ by a concave and bounded function of the Kullback–Leibler divergence $D_{\mathrm{KL}}(P \parallel Q)$. The bound can be viewed as an alternative to the well-known Pinsker's inequality: when $D_{\mathrm{KL}}(P \parallel Q)$ is large, Pinsker's inequality is vacuous, while Bretagnolle–Huber remains bounded and hence non-vacuous. It is used in statistics and machine learning to prove information-theoretic lower bounds relying on hypothesis testing.

References

  1. Chatterjee, Sourav. "Distances between probability measures" (PDF). UC Berkeley. Archived from the original (PDF) on July 8, 2008. Retrieved 21 June 2013.
  2. Bretagnolle, J.; Huber, C. (1978). "Estimation des densités: risque minimax" [Density estimation: minimax risk]. Séminaire de Probabilités XII (Univ. Strasbourg, 1976/1977). Lecture Notes in Mathematics, vol. 649. Berlin: Springer. pp. 342–363, Lemma 2.1. (In French.)
  3. Tsybakov, Alexandre B. (2009). Introduction to Nonparametric Estimation. Revised and extended from the 2004 French original; translated by Vladimir Zaiats. Springer Series in Statistics. New York: Springer. Equation 2.25. ISBN 978-0-387-79051-0.
  4. Levin, David A.; Peres, Yuval; Wilmer, Elizabeth L. (2017). Markov Chains and Mixing Times (2nd rev. ed.). AMS. Proposition 4.2, p. 48.
  5. Tsybakov, Aleksandr B. (2009). Introduction to Nonparametric Estimation (revised and extended version of the French original). New York: Springer. Lemma 2.1. ISBN 978-0-387-79051-0.
  6. Devroye, Luc; Györfi, László; Lugosi, Gábor (1996). A Probabilistic Theory of Pattern Recognition (corrected ed.). New York: Springer. ISBN 978-0-387-94618-4.
  7. Harsha, Prahladh (September 23, 2011). "Lecture notes on communication complexity" (PDF).
  8. Villani, Cédric (2009). Optimal Transport, Old and New. Grundlehren der mathematischen Wissenschaften. Vol. 338. Berlin: Springer. p. 10. doi:10.1007/978-3-540-71050-9. ISBN 978-3-540-71049-3.