Total variation distance of probability measures

Last updated May 15, 2024

In probability theory, the total variation distance is a distance measure for probability distributions. It is an example of a statistical distance metric, and is sometimes called the statistical distance, statistical difference or variational distance.

Definition

Consider a measurable space $(\Omega ,{\mathcal {F}})$ and probability measures $P$ and $Q$ defined on $(\Omega ,{\mathcal {F}})$ . The total variation distance between $P$ and $Q$ is defined as^[1]

\delta (P,Q)=\sup _{A\in {\mathcal {F}}}\left|P(A)-Q(A)\right|.

This is the largest absolute difference between the probabilities that the two probability distributions assign to the same event.

Properties

The total variation distance is an f-divergence and an integral probability metric.

Relation to other distances

The total variation distance is related to the Kullback–Leibler divergence by Pinsker’s inequality:

\delta (P,Q)\leq {\sqrt {{\frac {1}{2}}D_{\mathrm {KL} }(P\parallel Q)}}.

One also has the following inequality, due to Bretagnolle and Huber ^[2] (see also ^[3]), which has the advantage of providing a non-vacuous bound even when $\textstyle D_{\mathrm {KL} }(P\parallel Q)>2\colon$

\delta (P,Q)\leq {\sqrt {1-e^{-D_{\mathrm {KL} }(P\parallel Q)}}}.

The total variation distance is half of the L¹ distance between the probability functions: on discrete domains, this is the distance between the probability mass functions ^[4]

\delta (P,Q)={\frac {1}{2}}\sum _{x}|P(x)-Q(x)|,

and when the distributions have standard probability density functions $p$ and $q$ ,^[5]

\delta (P,Q)={\frac {1}{2}}\int |p(x)-q(x)|\,\mathrm {d} x

(or the analogous distance between Radon-Nikodym derivatives with any common dominating measure). This result can be shown by noticing that the supremum in the definition is achieved exactly at the set where one distribution dominates the other.^[6]

The total variation distance is related to the Hellinger distance $H(P,Q)$ as follows:^[7]

H^{2}(P,Q)\leq \delta (P,Q)\leq {\sqrt {2}}H(P,Q).

These inequalities follow immediately from the inequalities between the 1-norm and the 2-norm.

Connection to transportation theory

The total variation distance (or half the norm) arises as the optimal transportation cost, when the cost function is $c(x,y)={\mathbf {1} }_{x\neq y}$ , that is,

{\frac {1}{2}}\|P-Q\|_{1}=\delta (P,Q)=\inf {\big \{}\mathbb {P} (X\neq Y):{\text{Law}}(X)=P,{\text{Law}}(Y)=Q{\big \}}=\inf _{\pi }\operatorname {E} _{\pi }[{\mathbf {1} }_{x\neq y}],

where the expectation is taken with respect to the probability measure $\pi$ on the space where $(x,y)$ lives, and the infimum is taken over all such $\pi$ with marginals $P$ and $Q$ , respectively.^[8]

Related Research Articles

In probability theory, the expected value is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of the possible values a random variable can take, weighted by the probability of those outcomes. Since it is obtained through arithmetic, the expected value sometimes may not even be included in the sample data set; it is not the value you would "expect" to get in reality.

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable $, which takes values in the alphabet and is distributed according to, the entropy is$

In the calculus of variations and classical mechanics, the Euler–Lagrange equations are a system of second-order ordinary differential equations whose solutions are stationary points of the given action functional. The equations were discovered in the 1750s by Swiss mathematician Leonhard Euler and Italian mathematician Joseph-Louis Lagrange.

<span class="mw-page-title-main">Jensen's inequality</span> Theorem of convex functions

In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation; it is a simple corollary that the opposite is true of concave transformations.

In mathematics, the total variation identifies several slightly different concepts, related to the (local or global) structure of the codomain of a function or a measure. For a real-valued continuous function f, defined on an interval [a, b] ⊂ R, its total variation on the interval of definition is a measure of the one-dimensional arclength of the curve with parametric equation x ↦ f(x), for x ∈ [a, b]. Functions whose total variation is finite are called functions of bounded variation.

In quantum mechanics, information theory, and Fourier analysis, the entropic uncertainty or Hirschman uncertainty is defined as the sum of the temporal and spectral Shannon entropies. It turns out that Heisenberg's uncertainty principle can be expressed as a lower bound on the sum of these entropies. This is stronger than the usual statement of the uncertainty principle in terms of the product of standard deviations.

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted $, is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q . A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model instead of P when the actual distribution is P . While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.$

In probability theory and mathematical physics, a random matrix is a matrix-valued random variable—that is, a matrix in which some or all of its entries are sampled randomly from a probability distribution. Random matrix theory (RMT) is the study of properties of random matrices, often as they become large. RMT provides techniques like mean-field theory, diagrammatic methods, the cavity method, or the replica method to compute quantities like traces, spectral densities, or scalar products between eigenvectors. Many physical phenomena, such as the spectrum of nuclei of heavy atoms, the thermal conductivity of a lattice, or the emergence of quantum chaos, can be modeled mathematically as problems concerning large, random matrices.

In information theory, information dimension is an information measure for random vectors in Euclidean space, based on the normalized entropy of finely quantized versions of the random vectors. This concept was first introduced by Alfréd Rényi in 1959.

In mathematics, the Wasserstein distance or Kantorovich–Rubinstein metric is a distance function defined between probability distributions on a given metric space $. It is named after Leonid Vaseršteĭn.$

In mathematics, the Lévy–Prokhorov metric is a metric on the collection of probability measures on a given metric space. It is named after the French mathematician Paul Lévy and the Soviet mathematician Yuri Vasilyevich Prokhorov; Prokhorov introduced it in 1956 as a generalization of the earlier Lévy metric.

Expected shortfall (ES) is a risk measure—a concept used in the field of financial risk measurement to evaluate the market risk or credit risk of a portfolio. The "expected shortfall at q% level" is the expected return on the portfolio in the worst $of cases. ES is an alternative to value at risk that is more sensitive to the shape of the tail of the loss distribution.$

In probability theory, a real valued stochastic process X is called a semimartingale if it can be decomposed as the sum of a local martingale and a càdlàg adapted finite-variation process. Semimartingales are "good integrators", forming the largest class of processes with respect to which the Itô integral and the Stratonovich integral can be defined.

In probability and statistics, the Hellinger distance is used to quantify the similarity between two probability distributions. It is a type of f-divergence. The Hellinger distance is defined in terms of the Hellinger integral, which was introduced by Ernst Hellinger in 1909.

In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.

In quantum mechanics, and especially quantum information and the study of open quantum systems, the trace distanceT is a metric on the space of density matrices and gives a measure of the distinguishability between two states. It is the quantum generalization of the Kolmogorov distance for classical probability distributions.

In quantum information theory, the classical capacity of a quantum channel is the maximum rate at which classical data can be sent over it error-free in the limit of many uses of the channel. Holevo, Schumacher, and Westmoreland proved the following least upper bound on the classical capacity of any quantum channel $:$

In mathematics and theoretical computer science, analysis of Boolean functions is the study of real-valued functions on $or from a spectral perspective. The functions studied are often, but not always, Boolean-valued, making them Boolean functions. The area has found many applications in combinatorics, social choice theory, random graphs, and theoretical computer science, especially in hardness of approximation, property testing, and PAC learning.$

A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, but has since been used in diverse settings in statistics, machine learning and computer science.

In information theory, the Bretagnolle–Huber inequality bounds the total variation distance between two probability distributions $and by a concave and bounded function of the Kullback-Leibler divergence . The bound can be viewed as an alternative to the well-known Pinsker's inequality: when is large, Pinsker's inequality is vacuous, while Bretagnolle-Huber remains bounded and hence non-vacuous. It is used in statistics and machine learning to prove information-theoretic lower bounds relying on hypothesis testing (Bretagnolle-Huber-Carol Inequality is a variation of Concentration inequality for multinomially distributed random variables which bounds the total variation distance.)$

References

↑ Chatterjee, Sourav. "Distances between probability measures" (PDF). UC Berkeley. Archived from the original (PDF) on July 8, 2008. Retrieved 21 June 2013.
↑ Bretagnolle, J.; Huber, C, Estimation des densités: risque minimax, Séminaire de Probabilités, XII (Univ. Strasbourg, Strasbourg, 1976/1977), pp. 342–363, Lecture Notes in Math., 649, Springer, Berlin, 1978, Lemma 2.1 (French).
↑ Tsybakov, Alexandre B., Introduction to nonparametric estimation, Revised and extended from the 2004 French original. Translated by Vladimir Zaiats. Springer Series in Statistics. Springer, New York, 2009. xii+214 pp. ISBN 978-0-387-79051-0, Equation 2.25.
↑ David A. Levin, Yuval Peres, Elizabeth L. Wilmer, Markov Chains and Mixing Times , 2nd. rev. ed. (AMS, 2017), Proposition 4.2, p. 48.
↑ Tsybakov, Aleksandr B. (2009). Introduction to nonparametric estimation (rev. and extended version of the French Book ed.). New York, NY: Springer. Lemma 2.1. ISBN 978-0-387-79051-0.
↑ Devroye, Luc; Györfi, Laszlo; Lugosi, Gabor (1996-04-04). A Probabilistic Theory of Pattern Recognition (Corrected ed.). New York: Springer. ISBN 978-0-387-94618-4.
↑ Harsha, Prahladh (September 23, 2011). "Lecture notes on communication complexity" (PDF).
↑ Villani, Cédric (2009). Optimal Transport, Old and New. Grundlehren der mathematischen Wissenschaften. Vol. 338. Springer-Verlag Berlin Heidelberg. p. 10. doi:10.1007/978-3-540-71050-9. ISBN 978-3-540-71049-3.

This probability-related article is a stub. You can help Wikipedia by expanding it.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Chatterjee2007-1] Chatterjee, Sourav. "Distances between probability measures" (PDF). UC Berkeley. Archived from the original (PDF) on July 8, 2008. Retrieved 21 June 2013.

[2] Bretagnolle, J.; Huber, C, Estimation des densités: risque minimax, Séminaire de Probabilités, XII (Univ. Strasbourg, Strasbourg, 1976/1977), pp. 342–363, Lecture Notes in Math., 649, Springer, Berlin, 1978, Lemma 2.1 (French).

[3] Tsybakov, Alexandre B., Introduction to nonparametric estimation, Revised and extended from the 2004 French original. Translated by Vladimir Zaiats. Springer Series in Statistics. Springer, New York, 2009. xii+214 pp. ISBN 978-0-387-79051-0, Equation 2.25.

[4] David A. Levin, Yuval Peres, Elizabeth L. Wilmer, Markov Chains and Mixing Times , 2nd. rev. ed. (AMS, 2017), Proposition 4.2, p. 48.

[5] Tsybakov, Aleksandr B. (2009). Introduction to nonparametric estimation (rev. and extended version of the French Book ed.). New York, NY: Springer. Lemma 2.1. ISBN 978-0-387-79051-0.

[6] Devroye, Luc; Györfi, Laszlo; Lugosi, Gabor (1996-04-04). A Probabilistic Theory of Pattern Recognition (Corrected ed.). New York: Springer. ISBN 978-0-387-94618-4.

[7] Harsha, Prahladh (September 23, 2011). "Lecture notes on communication complexity" (PDF).

[8] Villani, Cédric (2009). Optimal Transport, Old and New. Grundlehren der mathematischen Wissenschaften. Vol. 338. Springer-Verlag Berlin Heidelberg. p. 10. doi:10.1007/978-3-540-71050-9. ISBN 978-3-540-71049-3.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]