Entropic vector

The entropic vector or entropic function is a concept arising in information theory. It represents the possible values of Shannon's information entropy that subsets of one set of random variables may take. Understanding which vectors are entropic is a way to represent all possible inequalities between entropies of various subsets. For example, for any two random variables $X, Y$, their joint entropy $H(X,Y)$ (the entropy of the random variable representing the pair $(X,Y)$) is at most the sum of the entropies $H(X)$ of $X$ and $H(Y)$ of $Y$:

$H(X,Y) \leq H(X) + H(Y)$

Other information-theoretic measures such as conditional entropy, mutual information, or total correlation can be expressed in terms of joint entropy and are thus related by the corresponding inequalities. Many inequalities satisfied by entropic vectors can be derived as linear combinations of a few basic ones, called Shannon-type inequalities. However, it has been proven that already for four variables, no finite set of linear inequalities is sufficient to characterize all entropic vectors.

Definition

Shannon's information entropy of a random variable $X$ is denoted $H(X)$. For a tuple of random variables $X_1, X_2, \ldots, X_n$, we denote the joint entropy of a subset $X_{i_1}, X_{i_2}, \ldots, X_{i_k}$ as $H(X_{i_1}, X_{i_2}, \ldots, X_{i_k})$, or more concisely as $H(S)$, where $S = \{i_1, i_2, \ldots, i_k\} \subseteq \{1, 2, \ldots, n\}$. Here $X_S$ can be understood as the random variable representing the tuple $(X_{i_1}, X_{i_2}, \ldots, X_{i_k})$. For the empty subset $\emptyset$, $X_\emptyset$ denotes a deterministic variable with entropy 0.

A vector $h$ in $\mathbb{R}^{2^n}$ indexed by subsets of $\{1, 2, \ldots, n\}$ is called an entropic vector of order $n$ if there exists a tuple of random variables $X_1, X_2, \ldots, X_n$ such that $h(S) = H(S)$ for each subset $S \subseteq \{1, 2, \ldots, n\}$.

The set of all entropic vectors of order $n$ is denoted by $\Gamma_n^*$. Zhang and Yeung [1] proved that it is not closed (for $n \geq 3$), but its closure, $\overline{\Gamma_n^*}$, is a convex cone and hence characterized by the (infinitely many) linear inequalities it satisfies. Describing the region $\overline{\Gamma_n^*}$ is thus equivalent to characterizing all possible inequalities on joint entropies.

Example

Let $X, Y$ be two independent random variables with discrete uniform distribution over the set $\{0, 1\}$. Then

$H(X) = H(Y) = 1$

(since each is uniformly distributed over a two-element set), and

$H(X,Y) = H(X) + H(Y) = 2$

(since the two variables are independent, which means the pair $(X,Y)$ is uniformly distributed over $\{0,1\}^2$). The corresponding entropic vector is thus:

$h = (H(\emptyset), H(X), H(Y), H(X,Y)) = (0, 1, 1, 2) \in \Gamma_2^*$

On the other hand, the vector $(0, 1, 1, 3)$ is not entropic (that is, $(0, 1, 1, 3) \notin \Gamma_2^*$), because any pair of random variables (independent or not) must satisfy $H(X,Y) \leq H(X) + H(Y)$.
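
A short computation confirms this example. The sketch below is only an illustration (plain Python with NumPy; the helper `joint_entropy` and the variable names are ours, not part of any standard library): it builds the joint distribution of two independent fair bits and evaluates the entropy of every subset of variables, recovering the vector $(0, 1, 1, 2)$.

```python
import itertools
import numpy as np

def joint_entropy(p, axes):
    """Shannon entropy (in bits) of the marginal of the joint pmf p over `axes`."""
    drop = tuple(i for i in range(p.ndim) if i not in axes)  # axes to marginalize out
    q = np.asarray(p.sum(axis=drop)).flatten()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

# Joint distribution of two independent fair bits: uniform on {0,1}^2.
p = np.full((2, 2), 1 / 4)

# Entropic vector of order 2, indexed by the subsets of {1, 2}.
h = {S: joint_entropy(p, S) for r in range(3) for S in itertools.combinations(range(2), r)}
print(h)  # entropies 0, 1, 1, 2 (bits) for the subsets {}, {1}, {2}, {1,2}
```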

Characterizing entropic vectors: the region Γn*

Shannon-type inequalities and Γn

For a tuple of random variables $X_1, X_2, \ldots, X_n$, their entropies satisfy:

$H(S) \leq H(T)$,    for any $S \subseteq T \subseteq \{1, 2, \ldots, n\}$    (monotonicity)

In particular, $H(S) \geq 0 = H(\emptyset)$, for any $S \subseteq \{1, 2, \ldots, n\}$.

The Shannon inequality says that an entropic vector is submodular:

$H(S) + H(T) \geq H(S \cup T) + H(S \cap T)$,    for any $S, T \subseteq \{1, 2, \ldots, n\}$

It is equivalent to the inequality stating that the conditional mutual information is non-negative:

$I(X; Y \mid Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z) \geq 0$

(For one direction, observe that this last form expresses Shannon's inequality for the subsets $\{X, Z\}$ and $\{Y, Z\}$ of the tuple $(X, Y, Z)$; for the other direction, substitute $X = X_{S \setminus T}$, $Y = X_{T \setminus S}$, $Z = X_{S \cap T}$.)
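
As a quick numerical sanity check (an illustration, not a proof), the sketch below draws random joint distributions of three discrete variables and verifies that the combination of joint entropies above, i.e. $I(X; Y \mid Z)$, is non-negative; the helper `H` and the chosen alphabet sizes are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p, axes):
    """Shannon entropy (in bits) of the marginal of the joint pmf p over `axes`."""
    drop = tuple(i for i in range(p.ndim) if i not in axes)
    q = np.asarray(p.sum(axis=drop)).flatten()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

for _ in range(1000):
    p = rng.random((3, 3, 3))          # random joint pmf of (X, Y, Z), 3 outcomes each
    p /= p.sum()
    # Submodularity for S = {X, Z}, T = {Y, Z}, i.e. I(X;Y|Z) >= 0.
    i_xy_given_z = H(p, (0, 2)) + H(p, (1, 2)) - H(p, (0, 1, 2)) - H(p, (2,))
    assert i_xy_given_z >= -1e-12      # non-negative, up to floating-point error
```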

Many inequalities can be derived as linear combinations of Shannon inequalities; they are called Shannon-type inequalities or basic information inequalities of Shannon's information measures. [2] The set of vectors that satisfy them is called $\Gamma_n$; it contains $\Gamma_n^*$.

Software has been developed to automate the task of proving Shannon-type inequalities. [3] [4] Given an inequality, such software is able to determine whether the given inequality is a valid Shannon-type inequality (i.e., whether the half-space it defines contains the cone $\Gamma_n$).
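
The idea behind such provers can be sketched as a linear program: a linear inequality $\sum_S c(S)\, h(S) \geq 0$ is Shannon-type exactly when minimizing $\sum_S c(S)\, h(S)$ over the polyhedral cone $\Gamma_n$ (cut out by the monotonicity and submodularity constraints above, together with $h(\emptyset) = 0$) gives 0 rather than being unbounded below. The sketch below is our own minimal illustration of that idea using scipy.optimize.linprog, not the actual ITIP or Xitip code.

```python
import itertools
from scipy.optimize import linprog

def is_shannon_type(c, n):
    """True if  sum_S c[S] * h(S) >= 0  holds on the whole cone Gamma_n,
    i.e. is implied by the basic Shannon inequalities.  Subsets of {0,...,n-1}
    are encoded as frozensets; c maps subsets to coefficients."""
    subsets = [frozenset(s) for r in range(n + 1)
               for s in itertools.combinations(range(n), r)]
    index = {S: i for i, S in enumerate(subsets)}
    rows = []
    # Monotonicity: h(S) - h(T) <= 0 whenever S is a proper subset of T.
    for S in subsets:
        for T in subsets:
            if S < T:
                row = [0.0] * len(subsets)
                row[index[S]] += 1.0
                row[index[T]] -= 1.0
                rows.append(row)
    # Submodularity: h(S u T) + h(S n T) - h(S) - h(T) <= 0.
    for S in subsets:
        for T in subsets:
            row = [0.0] * len(subsets)
            row[index[S | T]] += 1.0
            row[index[S & T]] += 1.0
            row[index[S]] -= 1.0
            row[index[T]] -= 1.0
            rows.append(row)
    objective = [c.get(S, 0.0) for S in subsets]
    fix_empty = [[1.0 if S == frozenset() else 0.0 for S in subsets]]  # h(empty) = 0
    res = linprog(objective, A_ub=rows, b_ub=[0.0] * len(rows),
                  A_eq=fix_empty, b_eq=[0.0], bounds=(None, None), method="highs")
    # Minimum 0 over the cone: the inequality holds on all of Gamma_n.
    # Unbounded LP: some vector of Gamma_n violates it (not Shannon-type).
    return res.status == 0 and abs(res.fun) < 1e-9

# Subadditivity h({1}) + h({2}) - h({1,2}) >= 0 is Shannon-type; its reverse is not.
a, b, ab = frozenset({0}), frozenset({1}), frozenset({0, 1})
print(is_shannon_type({a: 1, b: 1, ab: -1}, n=2))   # True
print(is_shannon_type({a: -1, b: -1, ab: 1}, n=2))  # False
```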

Non-Shannon-type inequalities

The question of whether Shannon-type inequalities are the only ones, that is, whether they completely characterize the region $\overline{\Gamma_n^*}$, was first asked by Te Sun Han in 1981 [2] and more precisely by Nicholas Pippenger in 1986. [5] It is not hard to show that this is true for two variables, that is, $\Gamma_2^* = \Gamma_2$. For three variables, Zhang and Yeung [1] proved that $\Gamma_3^* \neq \Gamma_3$; however, it is still asymptotically true, meaning that the closure is equal: $\overline{\Gamma_3^*} = \Gamma_3$. In 1998, Zhang and Yeung [2] [6] showed that $\overline{\Gamma_n^*} \neq \Gamma_n$ for all $n \geq 4$, by proving that the following inequality on four random variables $A, B, C, D$ (in terms of conditional mutual information) is true for any entropic vector, but is not Shannon-type:

$2 I(C; D) \leq I(A; B) + I(A; C, D) + 3 I(C; D \mid A) + I(C; D \mid B)$

Further inequalities and infinite families of inequalities have been found. [7] [8] [9] [10] These inequalities provide outer bounds for $\overline{\Gamma_n^*}$ better than the Shannon-type bound $\Gamma_n$. In 2007, Matus proved that no finite set of linear inequalities is sufficient (to deduce all valid inequalities as linear combinations) for $n \geq 4$ variables. In other words, the region $\overline{\Gamma_n^*}$ is not polyhedral. [11] Whether these regions can be characterized in some other way (allowing one to effectively decide whether a vector is entropic or not) remains an open problem.
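
Because the four-variable inequality of Zhang and Yeung above holds for every entropic vector, it can at least be spot-checked numerically on particular distributions (which illustrates the statement but proves nothing). The sketch below, with our own helper functions `H` and `I`, evaluates both sides on random joint distributions of four binary variables.

```python
import numpy as np

rng = np.random.default_rng(1)

def H(p, axes):
    """Shannon entropy (in bits) of the marginal of the joint pmf p over `axes`."""
    drop = tuple(i for i in range(p.ndim) if i not in axes)
    q = np.asarray(p.sum(axis=drop)).flatten()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def I(p, X, Y, Z=()):
    """Conditional mutual information I(X;Y|Z), written in terms of joint entropies."""
    return H(p, X + Z) + H(p, Y + Z) - H(p, X + Y + Z) - H(p, Z)

A, B, C, D = (0,), (1,), (2,), (3,)
for _ in range(200):
    p = rng.random((2, 2, 2, 2))   # random joint pmf of (A, B, C, D)
    p /= p.sum()
    lhs = 2 * I(p, C, D)
    rhs = I(p, A, B) + I(p, A, C + D) + 3 * I(p, C, D, Z=A) + I(p, C, D, Z=B)
    assert lhs <= rhs + 1e-9       # holds up to floating-point error
```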

Analogous questions for von Neumann entropy in quantum information theory have been considered. [12]

Inner bounds

Some inner bounds of $\overline{\Gamma_n^*}$ are also known. One example is that $\overline{\Gamma_4^*}$ contains all vectors in $\Gamma_4$ which additionally satisfy the following inequality (and those obtained by permuting variables), known as Ingleton's inequality for entropy: [13]

$I(X_1; X_2) \leq I(X_1; X_2 \mid X_3) + I(X_1; X_2 \mid X_4) + I(X_3; X_4)$ [2]

Entropy and groups

Group-characterizable vectors and quasi-uniform distributions

Consider a group $G$ and subgroups $G_1, G_2, \ldots, G_n$ of $G$. Let $G_S$ denote the intersection $\bigcap_{i \in S} G_i$, for $S \subseteq \{1, 2, \ldots, n\}$; this is also a subgroup of $G$. It is possible to construct a probability distribution for $n$ random variables $X_1, \ldots, X_n$ such that

$H(X_S) = \log \frac{|G|}{|G_S|}$. [14]

(The construction essentially takes an element $a$ of $G$ uniformly at random and lets $X_i$ be the corresponding coset $a G_i$.) Thus any information-theoretic inequality implies a group-theoretic one. For example, the basic inequality $H(X_1) + H(X_2) \geq H(X_1, X_2)$ implies that

$|G_1| \cdot |G_2| \leq |G| \cdot |G_1 \cap G_2|$

It turns out the converse is essentially true. More precisely, a vector is said to be group-characterizable if it can be obtained from a tuple of subgroups as above. The set of group-characterizable vectors is denoted $\Upsilon_n$. As said above, $\Upsilon_n \subseteq \Gamma_n^*$. On the other hand, $\Gamma_n^*$ (and thus $\overline{\Gamma_n^*}$) is contained in the topological closure of the convex closure of $\Upsilon_n$. [15] In other words, a linear inequality holds for all entropic vectors if and only if it holds for all vectors of the form $h(S) = \log \frac{|G|}{|G_S|}$, where $S$ goes over subsets of some tuple of subgroups $G_1, \ldots, G_n$ in a group $G$.
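
The coset construction above is easy to carry out for a small example. The sketch below is our own illustration, using the additive group $G = \mathbb{Z}_2 \times \mathbb{Z}_2$ with its two coordinate subgroups (an arbitrary choice made for concreteness): it computes the joint entropies of the resulting coset-valued variables exactly and checks that they match $\log_2(|G| / |G_S|)$, recovering the vector $(0, 1, 1, 2)$ from the example earlier in the article.

```python
from collections import Counter
from math import log2

# Group G = Z_2 x Z_2 under addition mod 2, with its two coordinate subgroups.
G = [(a, b) for a in (0, 1) for b in (0, 1)]
subgroups = [
    [(0, 0), (0, 1)],   # G_1: elements with first coordinate 0
    [(0, 0), (1, 0)],   # G_2: elements with second coordinate 0
]

def coset(g, K):
    """The coset g + K, as a canonical frozenset of group elements."""
    return frozenset(((g[0] + k[0]) % 2, (g[1] + k[1]) % 2) for k in K)

def entropy_of_subset(S):
    """H(X_S) where g is uniform on G and X_i is the coset of g w.r.t. subgroup i."""
    counts = Counter(tuple(coset(g, subgroups[i]) for i in S) for g in G)
    return sum(-(c / len(G)) * log2(c / len(G)) for c in counts.values())

for S in [(), (0,), (1,), (0, 1)]:
    G_S = [g for g in G if all(g in subgroups[i] for i in S)]   # intersection subgroup
    print(S, entropy_of_subset(S), log2(len(G) / len(G_S)))
    # Both printed values agree for every subset: 0, 1, 1, 2.
```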

Group-characterizable vectors that come from an abelian group satisfy Ingleton's inequality.

Kolmogorov complexity

Kolmogorov complexity satisfies essentially the same inequalities as entropy. Namely, denote the Kolmogorov complexity of a finite string $x$ as $K(x)$ (that is, the length of the shortest program that outputs $x$). The joint complexity of two strings $x, y$, defined as the complexity of an encoding of the pair $\langle x, y \rangle$, can be denoted $K(x, y)$. Similarly, the conditional complexity can be denoted $K(x \mid y)$ (the length of the shortest program that outputs $x$ given $y$). Andrey Kolmogorov noticed these notions behave similarly to Shannon entropy, for example:

$K(x, y) \leq K(x) + K(y \mid x) + O(\log(|x| + |y|))$

In 2000, Hammer et al. [16] proved that indeed an inequality holds for entropic vectors if and only if the corresponding inequality in terms of Kolmogorov complexity holds up to logarithmic terms for all tuples of strings.
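
Kolmogorov complexity itself is uncomputable, but the spirit of this analogy can be illustrated with a general-purpose compressor as a crude, computable stand-in for $K$. This is only a heuristic of our own (compressed length is not Kolmogorov complexity, and nothing here comes from the cited paper): the joint "complexity" of independent random strings is roughly additive, while a duplicated string adds almost nothing.

```python
import os
import zlib

def c(s: bytes) -> int:
    """Compressed length in bytes, used here as a rough stand-in for K(s)."""
    return len(zlib.compress(s, 9))

x = os.urandom(10_000)   # essentially incompressible data
y = os.urandom(10_000)   # independent of x

print(c(x), c(y), c(x + y))   # c(x + y) is close to c(x) + c(y)
print(c(x), c(x + x))         # c(x + x) is close to c(x): the copy adds little
```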

References

  1. Zhang, Z.; Yeung, R.W. (1997). "A Non-Shannon-Type Conditional Inequality of Information Quantities". IEEE Trans. Inf. Theory. 43 (6): 1982–1986. doi:10.1109/18.641561.
  2. Zhang, Z.; Yeung, R.W. (1998). "On Characterization of Entropy Function via Information Inequalities". IEEE Trans. Inf. Theory. 44 (4): 1440–1452. doi:10.1109/18.681320.
  3. Yeung, R.W.; Yan, Y.O. (1996). "ITIP - Information Theoretic Inequality Prover".
  4. Pulikkoonattu, R.; Perron, E.; Diggavi, S. (2007). "Xitip - Information Theoretic Inequalities Prover".
  5. Kaced, Tarik (2013). "Equivalence of Two Proof Techniques for Non-Shannon-type Inequalities". 2013 IEEE International Symposium on Information Theory. arXiv:1302.2994.
  6. Yeung, R.W. A First Course in Information Theory, Theorem 14.7.
  7. Dougherty, R.; Freiling, C.; Zeger, K. (2006). "Six New Non-Shannon Information Inequalities". 2006 IEEE International Symposium on Information Theory.
  8. Matus, F. (1999). "Conditional independences among four random variables III: Final conclusion". Combinatorics, Probability and Computing. 8 (3): 269–276. doi:10.1017/s0963548399003740. S2CID 121634597.
  9. Makarychev, K.; et al. (2002). "A new class of non-Shannon-type inequalities for entropies". Communications in Information and Systems. 2 (2): 147–166. doi:10.4310/cis.2002.v2.n2.a3.
  10. Zhang, Z. (2003). "On a new non-Shannon-type information inequality". Communications in Information and Systems. 3 (1): 47–60. doi:10.4310/cis.2003.v3.n1.a4.
  11. Matus, F. (2007). "Infinitely many information inequalities". 2007 IEEE International Symposium on Information Theory.
  12. Linden, N.; Winter, A. (2005). "A New Inequality for the von Neumann Entropy". Commun. Math. Phys. 259 (1): 129–138. arXiv:quant-ph/0406162. Bibcode:2005CMaPh.259..129L. doi:10.1007/s00220-005-1361-2. S2CID 13279358.
  13. Yeung, R.W. A First Course in Information Theory, p. 386.
  14. Yeung, R.W. A First Course in Information Theory, Theorem 16.16.
  15. Yeung, R.W. A First Course in Information Theory, Theorem 16.22.
  16. Hammer, D.; Romashchenko, A.; Shen, A.; Vereshchagin, N. (2000). "Inequalities for Shannon Entropy and Kolmogorov Complexity". Journal of Computer and System Sciences. 60 (2): 442–464. doi:10.1006/jcss.1999.1677.