Variation of information


In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality. [1] [2] [3]


[Figure: Information diagram illustrating the relation between information entropies, mutual information and variation of information.]

Definition

Suppose we have two partitions $X$ and $Y$ of a set $A$ into disjoint subsets, namely $X = \{X_{1}, X_{2}, \ldots, X_{k}\}$ and $Y = \{Y_{1}, Y_{2}, \ldots, Y_{l}\}$.

Let:

$n = \sum_{i} |X_{i}| = \sum_{j} |Y_{j}| = |A|$, $\quad p_{i} = |X_{i}|/n$, $\quad q_{j} = |Y_{j}|/n$,

and

$r_{ij} = |X_{i} \cap Y_{j}|/n$.

Then the variation of information between the two partitions is:

$\mathrm{VI}(X; Y) = -\sum_{i,j} r_{ij} \left[ \log\frac{r_{ij}}{p_{i}} + \log\frac{r_{ij}}{q_{j}} \right]$.

This is equivalent to the shared information distance between the random variables $i$ and $j$ with respect to the uniform probability measure on $A$ defined by $\mu(B) := |B|/n$ for $B \subseteq A$.
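
The following Python sketch is a minimal illustration of this definition (the function name and the example partitions are ours, not from the cited references); it assumes the two inputs are lists of blocks partitioning the same finite set and uses natural logarithms.

    import math

    def variation_of_information(X, Y):
        """VI between two partitions X and Y of the same finite set.

        X and Y are lists of blocks (each block an iterable of elements).
        Any logarithm base works as long as it is used consistently;
        math.log gives the result in nats.
        """
        n = float(sum(len(block) for block in X))
        vi = 0.0
        for Xi in X:
            p = len(Xi) / n                      # p_i = |X_i| / n
            for Yj in Y:
                q = len(Yj) / n                  # q_j = |Y_j| / n
                r = len(set(Xi) & set(Yj)) / n   # r_ij = |X_i ∩ Y_j| / n
                if r > 0.0:
                    vi -= r * (math.log(r / p) + math.log(r / q))
        return vi

    # Toy example: two partitions of {1, ..., 6}
    X = [[1, 2, 3], [4, 5, 6]]
    Y = [[1, 2], [3, 4, 5], [6]]
    print(variation_of_information(X, Y))  # ≈ 0.955 nats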

Explicit information content

We can rewrite this definition in terms that explicitly highlight the information content of this metric.

The set of all partitions of a set forms a compact lattice in which the partial order induces two operations, the meet $\wedge$ and the join $\vee$, where the maximum $\overline{1}$ is the partition with only one block, i.e., all elements grouped together, and the minimum is $\overline{0}$, the partition consisting of all elements as singletons. The meet of two partitions $X$ and $Y$ is easy to understand as the partition formed by all pairwise intersections of one block $X_{i}$ of $X$ and one block $Y_{j}$ of $Y$. It then follows that $X \wedge Y \subseteq X$ and $X \wedge Y \subseteq Y$.

Let us define the entropy of a partition $X$ as

$H(X) = -\sum_{i} p_{i}\log p_{i}$,

where $p_{i} = |X_{i}|/n$. Clearly, $H(\overline{1}) = 0$ and $H(\overline{0}) = \log n$. The entropy of a partition is a monotone function on the lattice of partitions, in the sense that $X \subseteq Y \Rightarrow H(X) \geq H(Y)$.

Then the VI distance between $X$ and $Y$ is given by

$\mathrm{VI}(X, Y) = 2H(X \wedge Y) - H(X) - H(Y)$.

The difference $d(X, Y) \equiv |H(X) - H(Y)|$ is a pseudo-metric, as $d(X, Y) = 0$ does not necessarily imply that $X = Y$. From the definition of $\overline{1}$, we have $\mathrm{VI}(X, \overline{1}) = H(X)$.

If in the Hasse diagram we draw an edge from every partition to the maximum $\overline{1}$ and assign it a weight equal to the VI distance between the given partition and $\overline{1}$, we can interpret the VI distance as essentially a sum of differences of edge weights to the maximum:

$\mathrm{VI}(X, Y) = |\mathrm{VI}(X, \overline{1}) - \mathrm{VI}(X \wedge Y, \overline{1})| + |\mathrm{VI}(Y, \overline{1}) - \mathrm{VI}(X \wedge Y, \overline{1})| = |H(X) - H(X \wedge Y)| + |H(Y) - H(X \wedge Y)|$.

For $H(X)$ as defined above, it holds that the joint information of two partitions coincides with the entropy of the meet,

$H(X, Y) = H(X \wedge Y)$,

and we also have that $d(X, X \wedge Y) = H(X \wedge Y \mid X)$ coincides with the conditional entropy of the meet (intersection) relative to $X$.
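
As a sanity check on the lattice formulation (a toy sketch under the same assumptions as the earlier one, with helper names of our choosing), the meet can be formed as the set of non-empty pairwise block intersections and VI evaluated as $2H(X \wedge Y) - H(X) - H(Y)$; on the example partitions above this reproduces the value given by the direct definition.

    import math

    def partition_entropy(P, n):
        """Entropy H(P), in nats, of a partition P of an n-element set."""
        return -sum((len(b) / n) * math.log(len(b) / n) for b in P if b)

    def meet(X, Y):
        """Meet X ∧ Y: all non-empty pairwise intersections of blocks."""
        return [set(a) & set(b) for a in X for b in Y if set(a) & set(b)]

    def vi_lattice(X, Y):
        """VI(X, Y) = 2 H(X ∧ Y) - H(X) - H(Y)."""
        n = sum(len(b) for b in X)
        return (2 * partition_entropy(meet(X, Y), n)
                - partition_entropy(X, n) - partition_entropy(Y, n))

    X = [[1, 2, 3], [4, 5, 6]]
    Y = [[1, 2], [3, 4, 5], [6]]
    print(vi_lattice(X, Y))  # ≈ 0.955, matching the direct definition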

Identities

The variation of information satisfies

,

where is the entropy of , and is mutual information between and with respect to the uniform probability measure on . This can be rewritten as

,

where is the joint entropy of and , or

,

where and are the respective conditional entropies.
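
These identities are easy to verify numerically. The sketch below (a toy check, not library code) computes the joint block-overlap distribution $r_{ij}$ and evaluates all three expressions, which agree with the value computed earlier.

    import math

    def vi_identities(X, Y):
        """Return (H(X)+H(Y)-2I, H(X,Y)-I, H(X|Y)+H(Y|X)); all equal VI(X; Y)."""
        n = sum(len(b) for b in X)
        p = [len(a) / n for a in X]
        q = [len(b) / n for b in Y]
        r = [len(set(a) & set(b)) / n for a in X for b in Y]
        H = lambda dist: -sum(v * math.log(v) for v in dist if v > 0)
        Hx, Hy, Hxy = H(p), H(q), H(r)
        I = Hx + Hy - Hxy                                  # mutual information
        return Hx + Hy - 2 * I, Hxy - I, (Hxy - Hy) + (Hxy - Hx)

    print(vi_identities([[1, 2, 3], [4, 5, 6]], [[1, 2], [3, 4, 5], [6]]))
    # all three values ≈ 0.955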

The variation of information can also be bounded, either in terms of the number of elements:

$\mathrm{VI}(X; Y) \leq \log(n)$,

or with respect to a maximum number of clusters, $K^{*}$:

$\mathrm{VI}(X; Y) \leq 2\log(K^{*})$.
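
Both bounds are easy to spot-check empirically. The snippet below reuses the variation_of_information helper from the first sketch (and a hypothetical random_partition helper defined here) to verify the bounds on random partitions of a 30-element set into at most $K^{*} = 4$ blocks.

    import math
    import random

    def random_partition(elements, k):
        """Hypothetical helper: assign each element to one of k blocks at random."""
        blocks = [[] for _ in range(k)]
        for e in elements:
            blocks[random.randrange(k)].append(e)
        return [b for b in blocks if b]    # drop empty blocks

    n, K = 30, 4
    elements = list(range(n))
    for _ in range(1000):
        X = random_partition(elements, K)
        Y = random_partition(elements, K)
        vi = variation_of_information(X, Y)   # from the first sketch above
        assert vi <= math.log(n) + 1e-9       # bound in terms of n
        assert vi <= 2 * math.log(K) + 1e-9   # bound for at most K clusters each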

Triangle inequality

To verify the triangle inequality $\mathrm{VI}(X; Z) \leq \mathrm{VI}(X; Y) + \mathrm{VI}(Y; Z)$, expand using the identity $\mathrm{VI}(X; Y) = H(X \mid Y) + H(Y \mid X)$. It suffices to prove

$H(X \mid Z) \leq H(X \mid Y) + H(Y \mid Z)$.

Since conditioning reduces entropy, the right side is at least $H(X \mid Y, Z) + H(Y \mid Z) = H(X, Y \mid Z)$ (by the chain rule), which is no less than the left side. This is intuitive, as the pair $(X, Y)$ given $Z$ contains no less randomness than $X$ given $Z$.
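
As with the bounds above, the metric property can also be sanity-checked numerically; the following snippet reuses the toy helpers from the earlier sketches and confirms the triangle inequality on random triples of partitions.

    # Empirical sanity check of the triangle inequality, reusing
    # variation_of_information and random_partition from the sketches above.
    elements = list(range(20))
    for _ in range(500):
        X, Y, Z = (random_partition(elements, 5) for _ in range(3))
        assert (variation_of_information(X, Z)
                <= variation_of_information(X, Y)
                 + variation_of_information(Y, Z) + 1e-9)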

References

  1. P. Arabie, S. A. Boorman, "Multidimensional scaling of measures of distance between partitions", Journal of Mathematical Psychology (1973), vol. 10, no. 2, pp. 148–203, doi:10.1016/0022-2496(73)90012-6.
  2. W. H. Zurek, Nature, vol. 341, p. 119 (1989); W. H. Zurek, Physical Review A, vol. 40, p. 4731 (1989).
  3. Marina Meila, "Comparing Clusterings by the Variation of Information", Learning Theory and Kernel Machines (2003), Lecture Notes in Computer Science, vol. 2777, pp. 173–187, doi:10.1007/978-3-540-45167-9_14, ISBN 978-3-540-40720-1.

Further reading