Uncertainty coefficient

Last updated July 22, 2022

In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil^[citation needed] and is based on the concept of information entropy.

Definition

Suppose we have samples of two discrete random variables, X and Y. From the samples we can construct the joint distribution, $P_{X,Y}(x, y)$, and from it the conditional distributions, $P_{X|Y}(x|y) = P_{X,Y}(x, y)/P_{Y}(y)$ and $P_{Y|X}(y|x) = P_{X,Y}(x, y)/P_{X}(x)$. Calculating the various entropies then lets us determine the degree of association between the two variables.

The entropy of a single distribution is given as: ^[1]

H(X)=-\sum _{x}P_{X}(x)\log P_{X}(x),

while the conditional entropy is given as:^[1]

H(X|Y)=-\sum _{x,~y}P_{X,Y}(x,~y)\log P_{X|Y}(x|y).
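As a rough illustration of the two formulas above, the following Python sketch estimates H(X) and H(X|Y) from paired samples using plug-in (relative frequency) probabilities. The function names `entropy` and `conditional_entropy` and the toy data are assumptions made for this example, and base-2 logarithms are chosen so that the results are in bits.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Plug-in estimate of H(X) in bits from a sequence of samples."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """Plug-in estimate of H(X|Y) in bits from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts behind P_{X,Y}(x, y)
    marg_y = Counter(ys)          # counts behind P_Y(y)
    # P_{X|Y}(x|y) = P_{X,Y}(x, y) / P_Y(y) = count(x, y) / count(y)
    return -sum((c / n) * log2(c / marg_y[y]) for (_, y), c in joint.items())

# Toy data (hypothetical): Y determines X completely.
xs = ["a", "a", "b", "b"]
ys = [0, 0, 1, 1]
h_x = entropy(xs)
h_x_given_y = conditional_entropy(xs, ys)
```

For these toy samples H(X) is 1 bit and H(X|Y) is 0, because knowing Y leaves no uncertainty about X.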

The uncertainty coefficient^[2] or proficiency^[3] is defined as:

U(X|Y)={\frac {H(X)-H(X|Y)}{H(X)}}={\frac {I(X;Y)}{H(X)}},

and tells us: given Y, what fraction of the bits of X can we predict? In this case we can think of X as containing the total information, and of Y as allowing one to predict part of such information.
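A minimal Python sketch of this definition, assuming plug-in probability estimates from paired samples, is given below. The function name `uncertainty_coefficient` and the sample data are hypothetical, and raising an error when H(X) = 0 is a design choice of this sketch rather than part of the definition.

```python
from collections import Counter
from math import log

def uncertainty_coefficient(xs, ys):
    """Estimate U(X|Y) = (H(X) - H(X|Y)) / H(X) from paired samples."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    h_x = -sum((c / n) * log(c / n) for c in count_x.values())
    if h_x == 0:
        # X is constant, so the ratio is 0/0; this sketch refuses to guess.
        raise ValueError("U(X|Y) is undefined when H(X) = 0")
    h_x_given_y = -sum(
        (c / n) * log(c / count_y[y]) for (_, y), c in count_xy.items()
    )
    return (h_x - h_x_given_y) / h_x

# X carries 2 bits; Y (the parity of X) reveals exactly one of them.
xs = [0, 1, 2, 3]
ys = [x % 2 for x in xs]
u_x_given_y = uncertainty_coefficient(xs, ys)
u_y_given_x = uncertainty_coefficient(ys, xs)
```

Here `u_x_given_y` comes out as 0.5 (half of the bits of X are predictable from Y), whereas `u_y_given_x` is 1, since X determines Y completely.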

The above expression makes clear that the uncertainty coefficient is a normalised mutual information I(X;Y). In particular, the uncertainty coefficient ranges in [0, 1], since I(X;Y) ≤ H(X) and both I(X;Y) and H(X) are non-negative. It is undefined when H(X) = 0.

Note that the value of U (but not H!) is independent of the base of the logarithm, since logarithms to different bases differ only by a constant factor, which cancels in the ratio.
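The cancellation can be written out explicitly. In the sketch below, b denotes an arbitrary logarithm base and the subscripts on H mark the base used; both are notation introduced here for illustration only.

```latex
\begin{aligned}
\log_b p &= \frac{\ln p}{\ln b}
\quad\Longrightarrow\quad
H_b(X) = \frac{H_e(X)}{\ln b}, \qquad
H_b(X|Y) = \frac{H_e(X|Y)}{\ln b}, \\[6pt]
U(X|Y) &= \frac{H_b(X)-H_b(X|Y)}{H_b(X)}
        = \frac{H_e(X)-H_e(X|Y)}{H_e(X)}.
\end{aligned}
```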

The uncertainty coefficient is useful for measuring the validity of a statistical classification algorithm and has the advantage over simpler accuracy measures such as precision and recall in that it is not affected by the relative fractions of the different classes, i.e., P(x). ^[4] It also has the unique property that it won't penalize an algorithm for predicting the wrong classes, so long as it does so consistently (i.e., it simply rearranges the classes). This is useful in evaluating clustering algorithms since cluster labels typically have no particular ordering.^[3]
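The relabelling property can be checked with a small Python sketch. The class names, the `relabel` mapping and the helper `uncertainty_coefficient` are all hypothetical, and probabilities are plug-in estimates from the samples. The example compares plain accuracy with U for a classifier that is always wrong but consistently so.

```python
from collections import Counter
from math import log

def uncertainty_coefficient(xs, ys):
    """Estimate U(X|Y) from paired samples (assumes H(X) > 0)."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    h_x = -sum((c / n) * log(c / n) for c in count_x.values())
    h_x_given_y = -sum(
        (c / n) * log(c / count_y[y]) for (_, y), c in count_xy.items()
    )
    return (h_x - h_x_given_y) / h_x

truth = ["cat", "cat", "dog", "dog", "bird", "bird"]
# A "classifier" that systematically permutes the class labels.
relabel = {"cat": "dog", "dog": "bird", "bird": "cat"}
predicted = [relabel[t] for t in truth]

accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
u = uncertainty_coefficient(truth, predicted)
```

Accuracy is 0 because no label matches, yet U is 1 because the predictions still determine the true classes exactly.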

Variations

The uncertainty coefficient is not symmetric with respect to the roles of X and Y. The roles can be reversed and a symmetrical measure thus defined as a weighted average between the two:^[2]

{\begin{aligned}U(X,~Y)&={\frac {H(X)U(X|Y)+H(Y)U(Y|X)}{H(X)+H(Y)}}\\[8pt]&=2\left[{\frac {H(X)+H(Y)-H(X,~Y)}{H(X)+H(Y)}}\right].\end{aligned}}
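As a sanity check on the two forms of the symmetric measure, the sketch below evaluates both on the same toy data. The function names and the data are assumptions for illustration. Probabilities are plug-in estimates, and the conditional entropies are obtained from the chain rule H(X|Y) = H(X,Y) - H(Y).

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in estimate of Shannon entropy in bits."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def symmetric_uncertainty(xs, ys):
    """U(X, Y) = 2 [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)]."""
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    if h_x + h_y == 0:
        raise ValueError("undefined when both variables are constant")
    return 2.0 * (h_x + h_y - h_xy) / (h_x + h_y)

xs = [0, 1, 2, 3]
ys = [x % 2 for x in xs]
u_sym = symmetric_uncertainty(xs, ys)

# Same quantity via the weighted average of the two asymmetric coefficients.
h_x, h_y = entropy(xs), entropy(ys)
h_xy = entropy(list(zip(xs, ys)))
u_x_given_y = (h_x - (h_xy - h_y)) / h_x  # H(X|Y) = H(X,Y) - H(Y)
u_y_given_x = (h_y - (h_xy - h_x)) / h_y  # H(Y|X) = H(X,Y) - H(X)
u_weighted = (h_x * u_x_given_y + h_y * u_y_given_x) / (h_x + h_y)
```

Both routes give 2/3 for this data, and swapping the arguments of `symmetric_uncertainty` leaves the result unchanged.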

Although normally applied to discrete variables, the uncertainty coefficient can be extended to continuous variables^[1] using density estimation.^[citation needed]

References

1. Claude E. Shannon; Warren Weaver (1963). The Mathematical Theory of Communication. University of Illinois Press.
2. William H. Press; Brian P. Flannery; Saul A. Teukolsky; William T. Vetterling (1992). "14.7.4". Numerical Recipes: the Art of Scientific Computing (3rd ed.). Cambridge University Press. p. 761.
3. White, Jim; Steingold, Sam; Fournelle, Connie. "Performance Metrics for Group-Detection Algorithms" (PDF). Interface 2004.
4. Mills, Peter (2011). "Efficient statistical classification of satellite measurements" (PDF). International Journal of Remote Sensing. 32 (21): 6109–6132. arXiv:1202.2194. Bibcode:2011IJRS...32.6109M. doi:10.1080/01431161.2010.507795. S2CID 88518570. Archived from the original (PDF) on 2012-04-26.

External links

libagf: includes software for calculating uncertainty coefficients.

