Pointwise mutual information

Last updated October 18, 2024

In statistics, probability theory and information theory, pointwise mutual information (PMI),^[1] or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would be if the events were independent.^[2]

PMI (especially in its positive pointwise mutualinformation variant) has been described as "one of the most important concepts in NLP", where it "draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in [a] corpus than we would have expected them to appear by chance."^[2]

The concept was introduced in 1961 by Robert Fano under the name of "mutual information", but today that term is instead used for a related measure of dependence between random variables:^[2] The mutual information (MI) of two discrete random variables refers to the average PMI of all possible events.

Definition

The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically:^[2]

\operatorname {pmi} (x;y)\equiv \log _{2}{\frac {p(x,y)}{p(x)p(y)}}=\log _{2}{\frac {p(x|y)}{p(x)}}=\log _{2}{\frac {p(y|x)}{p(y)}}

(with the latter two expressions being equal to the first by Bayes' theorem). The mutual information (MI) of the random variables X and Y is the expected value of the PMI (over all possible outcomes).

The measure is symmetric ( $\operatorname {pmi} (x;y)=\operatorname {pmi} (y;x)$ ). It can take positive or negative values, but is zero if X and Y are independent. Note that even though PMI may be negative or positive, its expected outcome over all joint events (MI) is non-negative. PMI maximizes when X and Y are perfectly associated (i.e. $p(x|y)$ or $p(y|x)=1$ ), yielding the following bounds:

-\infty \leq \operatorname {pmi} (x;y)\leq \min \left[-\log p(x),-\log p(y)\right].

Finally, $\operatorname {pmi} (x;y)$ will increase if $p(x|y)$ is fixed but $p(x)$ decreases.

Here is an example to illustrate:

x	y	p(x, y)
0	0	0.1
0	1	0.7
1	0	0.15
1	1	0.05

Using this table we can marginalize to get the following additional table for the individual distributions:

	p(x)	p(y)
0	0.8	0.25
1	0.2	0.75

With this example, we can compute four values for $\operatorname {pmi} (x;y)$ . Using base-2 logarithms:

pmi(x=0;y=0)	=	−1
pmi(x=0;y=1)	=	0.222392
pmi(x=1;y=0)	=	1.584963
pmi(x=1;y=1)	=	-1.584963

(For reference, the mutual information $\operatorname {I} (X;Y)$ would then be 0.2141709.)

Similarities to mutual information

Pointwise Mutual Information has many of the same relationships as the mutual information. In particular,

${\begin{aligned}\operatorname {pmi} (x;y)&=&h(x)+h(y)-h(x,y)\\&=&h(x)-h(x\mid y)\\&=&h(y)-h(y\mid x)\end{aligned}}$

Where $h(x)$ is the self-information, or $-\log _{2}p(x)$ .

Variants

Several variations of PMI have been proposed, in particular to address what has been described as its "two main limitations":^[3]

PMI can take both positive and negative values and has no fixed bounds, which makes it harder to interpret.^[3]
PMI has "a well-known tendency to give higher scores to low-frequency events", but in applications such as measuring word similarity, it is preferable to have "a higher score for pairs of words whose relatedness is supported by more evidence."^[3]

Positive PMI

The positive pointwise mutual information (PPMI) measure is defined by setting negative values of PMI to zero:^[2]

$\operatorname {ppmi} (x;y)\equiv \max \left(\log _{2}{\frac {p(x,y)}{p(x)p(y)}},0\right)$

This definition is motivated by the observation that "negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous" and also by a concern that "it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgment".^[2] It also avoid having to deal with $-\infty$ values for events that never occur together ( $p(x,y)=0$ ), by setting PPMI for these to 0.^[2]

Normalized pointwise mutual information (npmi)

Pointwise mutual information can be normalized between [-1,+1] resulting in -1 (in the limit) for never occurring together, 0 for independence, and +1 for complete co-occurrence.^[4]

$\operatorname {npmi} (x;y)={\frac {\operatorname {pmi} (x;y)}{h(x,y)}}$

Where $h(x,y)$ is the joint self-information $-\log _{2}p(x,y)$ .

PMI^k family

The PMI^k measure (for k=2, 3 etc.), which was introduced by Béatrice Daille around 1994, and as of 2011 was described as being "among the most widely used variants", is defined as^[5]^[3]

$\operatorname {pmi} ^{k}(x;y)\equiv \log _{2}{\frac {p(x,y)^{k}}{p(x)p(y)}}=\operatorname {pmi} (x;y)-(-(k-1))\log _{2}p(x,y))$

In particular, $pmi^{1}(x;y)=pmi(x;y)$ . The additional factors of $p(x,y)$ inside the logarithm are intended to correct the bias of PMI towards low-frequency events, by boosting the scores of frequent pairs.^[3] A 2011 case study demonstrated the success of PMI³ in correcting this bias on a corpus drawn from English Wikipedia. Taking x to be the word "football", its most strongly associated words y according to the PMI measure (i.e. those maximizing $pmi(x;y)$ ) were domain-specific ("midfielder", "cornerbacks", "goalkeepers") whereas the terms ranked most highly by PMI³ were much more general ("league", "clubs", "england").^[3]

Specific Correlation

Total correlation is an extension of mutual information to multi-variables. Analogously to the definition of total correlation, the extension of PMI to multi-variables is "specific correlation."^[6] The SI of the results of random variables ${\boldsymbol {x}}=(x_{1},x_{2},\ldots {},x_{n})$ is expressed as the following:

\mathrm {SI} (x_{1},x_{2},\ldots ,x_{n})\equiv \log {\frac {p(x_{1},x_{2},\ldots ,x_{n})}{\prod _{i=1}^{n}p(x_{i})}}=\log p({\boldsymbol {x}})-\log \prod _{i=1}^{n}p\left(x_{i}\right)

Chain-rule

Like mutual information,^[7] point mutual information follows the chain rule, that is,

\operatorname {pmi} (x;yz)=\operatorname {pmi} (x;y)+\operatorname {pmi} (x;z|y)

This is proven through application of Bayes' theorem:

{\begin{aligned}\operatorname {pmi} (x;y)+\operatorname {pmi} (x;z|y)&{}=\log {\frac {p(x,y)}{p(x)p(y)}}+\log {\frac {p(x,z|y)}{p(x|y)p(z|y)}}\\&{}=\log \left[{\frac {p(x,y)}{p(x)p(y)}}{\frac {p(x,z|y)}{p(x|y)p(z|y)}}\right]\\&{}=\log {\frac {p(x|y)p(y)p(x,z|y)}{p(x)p(y)p(x|y)p(z|y)}}\\&{}=\log {\frac {p(x,yz)}{p(x)p(yz)}}\\&{}=\operatorname {pmi} (x;yz)\end{aligned}}

Applications

PMI could be used in various disciplines e.g. in information theory, linguistics or chemistry (in profiling and analysis of chemical compounds).^[8] In computational linguistics, PMI has been used for finding collocations and associations between words. For instance, countings of occurrences and co-occurrences of words in a text corpus can be used to approximate the probabilities $p(x)$ and $p(x,y)$ respectively. The following table shows counts of pairs of words getting the most and the least PMI scores in the first 50 millions of words in Wikipedia (dump of October 2015)^{[ citation needed ]} filtering by 1,000 or more co-occurrences. The frequency of each count can be obtained by dividing its value by 50,000,952. (Note: natural log is used to calculate the PMI values in this example, instead of log base 2)

word 1	word 2	count word 1	count word 2	count of co-occurrences	PMI
puerto	rico	1938	1311	1159	10.0349081703
hong	kong	2438	2694	2205	9.72831972408
los	angeles	3501	2808	2791	9.56067615065
carbon	dioxide	4265	1353	1032	9.09852946116
prize	laureate	5131	1676	1210	8.85870710982
san	francisco	5237	2477	1779	8.83305176711
nobel	prize	4098	5131	2498	8.68948811416
ice	hockey	5607	3002	1933	8.6555759741
star	trek	8264	1594	1489	8.63974676575
car	driver	5578	2749	1384	8.41470768304
it	the	283891	3293296	3347	-1.72037278119
are	of	234458	1761436	1019	-2.09254205335
this	the	199882	3293296	1211	-2.38612756961
is	of	565679	1761436	1562	-2.54614706831
and	of	1375396	1761436	2949	-2.79911817902
a	and	984442	1375396	1457	-2.92239510038
in	and	1187652	1375396	1537	-3.05660070757
to	and	1025659	1375396	1286	-3.08825363041
to	in	1025659	1187652	1066	-3.12911348956
of	and	1761436	1375396	1190	-3.70663100173

Good collocation pairs have high PMI because the probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word. Conversely, a pair of words whose probabilities of occurrence are considerably higher than their probability of co-occurrence gets a small PMI score.

Related Research Articles

In probability theory, the expected value is a generalization of the weighted average. Informally, the expected value is the mean of the possible values a random variable can take, weighted by the probability of those outcomes. Since it is obtained through arithmetic, the expected value sometimes may not even be included in the sample data set; it is not the value you would "expect" to get in reality.

A random variable is a mathematical formalization of a quantity or object which depends on random events. The term 'random variable' in its mathematical definition refers to neither randomness nor variability but instead is a mathematical function in which

Independence is a fundamental notion in probability theory, as in statistics and the theory of stochastic processes. Two events are independent, statistically independent, or stochastically independent if, informally speaking, the occurrence of one does not affect the probability of occurrence of the other or, equivalently, does not affect the odds. Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other.

In probability theory, a probability density function (PDF), density function, or density of an absolutely continuous random variable, is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would be equal to that sample. Probability density is the probability per unit length, in other words, while the absolute likelihood for a continuous random variable to take on any particular value is 0, the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would be close to one sample compared to the other sample.

In probability theory and statistics, the exponential distribution or negative exponential distribution is the probability distribution of the distance between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate; the distance parameter could be any meaningful mono-dimensional measure of the process, such as time between production errors, or length along a roll of fabric in the weaving manufacturing process. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts.

<span class="mw-page-title-main">Jensen's inequality</span> Theorem of convex functions

In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation; it is a simple corollary that the opposite is true of concave transformations.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinct because they all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

<span class="mw-page-title-main">Mutual information</span> Measure of dependence between two variables

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

<span class="mw-page-title-main">Conditional entropy</span> Measure of relative information in probability theory

In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable $given that the value of another random variable is known. Here, information is measured in shannons, nats, or hartleys. The entropy of conditioned on is written as .$

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted $, is a type of statistical distance: a measure of how one reference probability distribution P is different from a second probability distribution Q . Mathematically, it is defined as$

<span class="mw-page-title-main">Dirichlet distribution</span> Probability distribution

In probability and statistics, the Dirichlet distribution, often denoted $, is a family of continuous multivariate probability distributions parameterized by a vector of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD). Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact, the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.$

In information theory, the cross-entropy between two probability distributions $and, over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution, rather than the true distribution .$

This article discusses how information theory is related to measure theory.

<span class="mw-page-title-main">Quantities of information</span>

The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm. Although "bit" is more frequently used in place of "shannon", its name is not distinguished from the bit as used in data-processing to refer to a binary value or stream regardless of its entropy Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.

<span class="mw-page-title-main">Z-channel (information theory)</span>

In coding theory and information theory, a Z-channel or binary asymmetric channel is a communications channel used to model the behaviour of some data storage systems.

<span class="mw-page-title-main">Conditional mutual information</span> Information theory

In probability theory, particularly information theory, the conditional mutual information is, in its most basic form, the expected value of the mutual information of two random variables given the value of a third.

In probability theory, conditional probability is a measure of the probability of an event occurring, given that another event (by assumption, presumption, assertion or evidence) is already known to have occurred. This particular method relies on event A occurring with some sort of relationship with another event B. In this situation, the event A can be analyzed by a conditional probability with respect to B. If the event of interest is $A$ and the event $B$ is known or assumed to have occurred, "the conditional probability of $A$ given $B$ ", or "the probability of $A$ under the condition $B$ ", is usually written as $P(A | B)$ or occasionally $P B (A)$ . This can also be understood as the fraction of probability B that intersects with A, or the ratio of the probabilities of both events happening to the "given" one happening (how many times A occurs rather than not assuming B has occurred): $.$

A product distribution is a probability distribution constructed as the distribution of the product of random variables having two other known distributions. Given two statistically independent random variables X and Y, the distribution of the random variable Z that is formed as the product $is a product distribution .$

References

↑ Kenneth Ward Church and Patrick Hanks (March 1990). "Word association norms, mutual information, and lexicography". Comput. Linguist. 16 (1): 22–29.
1 2 3 4 5 6 7 Dan Jurafsky and James H. Martin: Speech and Language Processing (3rd ed. draft), December 29, 2021, chapter 6
1 2 3 4 5 6 Francois Role, Moahmed Nadif. Handling the Impact of Low frequency Events on Co-occurrence-based Measures of Word Similarity:A Case Study of Pointwise Mutual Information. Proceedings of KDIR 2011 : KDIR- International Conference on Knowledge Discovery and Information Retrieval, Paris, October 26–29, 2011
↑ Bouma, Gerlof (2009). "Normalized (Pointwise) Mutual Information in Collocation Extraction" (PDF). Proceedings of the Biennial GSCL Conference.
↑ B. Daille. Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. Thèse de Doctorat en Informatique Fondamentale. Université Paris 7. 1994. p.139
↑ Tim Van de Cruys. 2011. Two Multivariate Generalizations of Pointwise Mutual Information. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 16–20, Portland, Oregon, USA. Association for Computational Linguistics.
↑ Paul L. Williams. INFORMATION DYNAMICS: ITS THEORY AND APPLICATION TO EMBODIED COGNITIVE SYSTEMS.
↑ Čmelo, I.; Voršilák, M.; Svozil, D. (2021-01-10). "Profiling and analysis of chemical compounds using pointwise mutual information". Journal of Cheminformatics. 13 (1): 3. doi: 10.1186/s13321-020-00483-y . ISSN 1758-2946. PMC 7798221 . PMID 33423694.

Fano, R M (1961). "chapter 2". Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA. ISBN 978-0262561693.

External links

Demo at Rensselaer MSR Server (PMI values normalized to be between 0 and 1)

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Church1990-1] Kenneth Ward Church and Patrick Hanks (March 1990). "Word association norms, mutual information, and lexicography". Comput. Linguist. 16 (1): 22–29.

[:0-2] 1 2 3 4 5 6 7 Dan Jurafsky and James H. Martin: Speech and Language Processing (3rd ed. draft), December 29, 2021, chapter 6

[:1-3] 1 2 3 4 5 6 Francois Role, Moahmed Nadif. Handling the Impact of Low frequency Events on Co-occurrence-based Measures of Word Similarity:A Case Study of Pointwise Mutual Information. Proceedings of KDIR 2011 : KDIR- International Conference on Knowledge Discovery and Information Retrieval, Paris, October 26–29, 2011

[4] Bouma, Gerlof (2009). "Normalized (Pointwise) Mutual Information in Collocation Extraction" (PDF). Proceedings of the Biennial GSCL Conference.

[5] B. Daille. Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. Thèse de Doctorat en Informatique Fondamentale. Université Paris 7. 1994. p.139

[6] Tim Van de Cruys. 2011. Two Multivariate Generalizations of Pointwise Mutual Information. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 16–20, Portland, Oregon, USA. Association for Computational Linguistics.

[7] Paul L. Williams. INFORMATION DYNAMICS: ITS THEORY AND APPLICATION TO EMBODIED COGNITIVE SYSTEMS.

[8] Čmelo, I.; Voršilák, M.; Svozil, D. (2021-01-10). "Profiling and analysis of chemical compounds using pointwise mutual information". Journal of Cheminformatics. 13 (1): 3. doi: 10.1186/s13321-020-00483-y . ISSN 1758-2946. PMC 7798221 . PMID 33423694.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]