In statistics, probability theory and information theory, pointwise mutual information (PMI), [1] or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would be if the events were independent. [2]
PMI (especially in its positive pointwise mutualinformation variant) has been described as "one of the most important concepts in NLP", where it "draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in [a] corpus than we would have expected them to appear by chance." [2]
The concept was introduced in 1961 by Robert Fano under the name of "mutual information", but today that term is instead used for a related measure of dependence between random variables: [2] The mutual information (MI) of two discrete random variables refers to the average PMI of all possible events.
The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically: [2]
(with the latter two expressions being equal to the first by Bayes' theorem). The mutual information (MI) of the random variables X and Y is the expected value of the PMI (over all possible outcomes).
The measure is symmetric (). It can take positive or negative values, but is zero if X and Y are independent. Note that even though PMI may be negative or positive, its expected outcome over all joint events (MI) is non-negative. PMI maximizes when X and Y are perfectly associated (i.e. or ), yielding the following bounds:
Finally, will increase if is fixed but decreases.
Here is an example to illustrate:
x | y | p(x, y) |
---|---|---|
0 | 0 | 0.1 |
0 | 1 | 0.7 |
1 | 0 | 0.15 |
1 | 1 | 0.05 |
Using this table we can marginalize to get the following additional table for the individual distributions:
p(x) | p(y) | |
---|---|---|
0 | 0.8 | 0.25 |
1 | 0.2 | 0.75 |
With this example, we can compute four values for . Using base-2 logarithms:
pmi(x=0;y=0) | = | −1 |
pmi(x=0;y=1) | = | 0.222392 |
pmi(x=1;y=0) | = | 1.584963 |
pmi(x=1;y=1) | = | -1.584963 |
(For reference, the mutual information would then be 0.2141709.)
Pointwise Mutual Information has many of the same relationships as the mutual information. In particular,
Where is the self-information, or .
Several variations of PMI have been proposed, in particular to address what has been described as its "two main limitations": [3]
The positive pointwise mutual information (PPMI) measure is defined by setting negative values of PMI to zero: [2]
This definition is motivated by the observation that "negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous" and also by a concern that "it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgment". [2] It also avoid having to deal with values for events that never occur together (), by setting PPMI for these to 0. [2]
Pointwise mutual information can be normalized between [-1,+1] resulting in -1 (in the limit) for never occurring together, 0 for independence, and +1 for complete co-occurrence. [4]
Where is the joint self-information .
The PMIk measure (for k=2, 3 etc.), which was introduced by Béatrice Daille around 1994, and as of 2011 was described as being "among the most widely used variants", is defined as [5] [3]
In particular, . The additional factors of inside the logarithm are intended to correct the bias of PMI towards low-frequency events, by boosting the scores of frequent pairs. [3] A 2011 case study demonstrated the success of PMI3 in correcting this bias on a corpus drawn from English Wikipedia. Taking x to be the word "football", its most strongly associated words y according to the PMI measure (i.e. those maximizing ) were domain-specific ("midfielder", "cornerbacks", "goalkeepers") whereas the terms ranked most highly by PMI3 were much more general ("league", "clubs", "england"). [3]
Total correlation is an extension of mutual information to multi-variables. Analogously to the definition of total correlation, the extension of PMI to multi-variables is "specific correlation." [6] The SI of the results of random variables is expressed as the following:
Like mutual information, [7] point mutual information follows the chain rule, that is,
This is proven through application of Bayes' theorem:
PMI could be used in various disciplines e.g. in information theory, linguistics or chemistry (in profiling and analysis of chemical compounds). [8] In computational linguistics, PMI has been used for finding collocations and associations between words. For instance, countings of occurrences and co-occurrences of words in a text corpus can be used to approximate the probabilities and respectively. The following table shows counts of pairs of words getting the most and the least PMI scores in the first 50 millions of words in Wikipedia (dump of October 2015)[ citation needed ] filtering by 1,000 or more co-occurrences. The frequency of each count can be obtained by dividing its value by 50,000,952. (Note: natural log is used to calculate the PMI values in this example, instead of log base 2)
word 1 | word 2 | count word 1 | count word 2 | count of co-occurrences | PMI |
---|---|---|---|---|---|
puerto | rico | 1938 | 1311 | 1159 | 10.0349081703 |
hong | kong | 2438 | 2694 | 2205 | 9.72831972408 |
los | angeles | 3501 | 2808 | 2791 | 9.56067615065 |
carbon | dioxide | 4265 | 1353 | 1032 | 9.09852946116 |
prize | laureate | 5131 | 1676 | 1210 | 8.85870710982 |
san | francisco | 5237 | 2477 | 1779 | 8.83305176711 |
nobel | prize | 4098 | 5131 | 2498 | 8.68948811416 |
ice | hockey | 5607 | 3002 | 1933 | 8.6555759741 |
star | trek | 8264 | 1594 | 1489 | 8.63974676575 |
car | driver | 5578 | 2749 | 1384 | 8.41470768304 |
it | the | 283891 | 3293296 | 3347 | -1.72037278119 |
are | of | 234458 | 1761436 | 1019 | -2.09254205335 |
this | the | 199882 | 3293296 | 1211 | -2.38612756961 |
is | of | 565679 | 1761436 | 1562 | -2.54614706831 |
and | of | 1375396 | 1761436 | 2949 | -2.79911817902 |
a | and | 984442 | 1375396 | 1457 | -2.92239510038 |
in | and | 1187652 | 1375396 | 1537 | -3.05660070757 |
to | and | 1025659 | 1375396 | 1286 | -3.08825363041 |
to | in | 1025659 | 1187652 | 1066 | -3.12911348956 |
of | and | 1761436 | 1375396 | 1190 | -3.70663100173 |
Good collocation pairs have high PMI because the probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word. Conversely, a pair of words whose probabilities of occurrence are considerably higher than their probability of co-occurrence gets a small PMI score.
In probability theory, the expected value is a generalization of the weighted average. Informally, the expected value is the mean of the possible values a random variable can take, weighted by the probability of those outcomes. Since it is obtained through arithmetic, the expected value sometimes may not even be included in the sample data set; it is not the value you would "expect" to get in reality.
A random variable is a mathematical formalization of a quantity or object which depends on random events. The term 'random variable' in its mathematical definition refers to neither randomness nor variability but instead is a mathematical function in which
Independence is a fundamental notion in probability theory, as in statistics and the theory of stochastic processes. Two events are independent, statistically independent, or stochastically independent if, informally speaking, the occurrence of one does not affect the probability of occurrence of the other or, equivalently, does not affect the odds. Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other.
In probability theory, a probability density function (PDF), density function, or density of an absolutely continuous random variable, is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would be equal to that sample. Probability density is the probability per unit length, in other words, while the absolute likelihood for a continuous random variable to take on any particular value is 0, the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would be close to one sample compared to the other sample.
In probability theory and statistics, the exponential distribution or negative exponential distribution is the probability distribution of the distance between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate; the distance parameter could be any meaningful mono-dimensional measure of the process, such as time between production errors, or length along a roll of fabric in the weaving manufacturing process. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts.
In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation; it is a simple corollary that the opposite is true of concave transformations.
In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinct because they all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.
In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.
In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.
In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known. Here, information is measured in shannons, nats, or hartleys. The entropy of conditioned on is written as .
In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how one reference probability distribution P is different from a second probability distribution Q. Mathematically, it is defined as
In probability and statistics, the Dirichlet distribution, often denoted , is a family of continuous multivariate probability distributions parameterized by a vector of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD). Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact, the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.
In information theory, the cross-entropy between two probability distributions and , over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution , rather than the true distribution .
This article discusses how information theory is related to measure theory.
The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm. Although "bit" is more frequently used in place of "shannon", its name is not distinguished from the bit as used in data-processing to refer to a binary value or stream regardless of its entropy Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.
In coding theory and information theory, a Z-channel or binary asymmetric channel is a communications channel used to model the behaviour of some data storage systems.
In probability theory, particularly information theory, the conditional mutual information is, in its most basic form, the expected value of the mutual information of two random variables given the value of a third.
In probability theory, conditional probability is a measure of the probability of an event occurring, given that another event (by assumption, presumption, assertion or evidence) is already known to have occurred. This particular method relies on event A occurring with some sort of relationship with another event B. In this situation, the event A can be analyzed by a conditional probability with respect to B. If the event of interest is A and the event B is known or assumed to have occurred, "the conditional probability of A given B", or "the probability of A under the condition B", is usually written as P(A|B) or occasionally PB(A). This can also be understood as the fraction of probability B that intersects with A, or the ratio of the probabilities of both events happening to the "given" one happening (how many times A occurs rather than not assuming B has occurred): .
A product distribution is a probability distribution constructed as the distribution of the product of random variables having two other known distributions. Given two statistically independent random variables X and Y, the distribution of the random variable Z that is formed as the product is a product distribution.