Fano's inequality

Last updated

In information theory, Fano's inequality (also known as the Fano converse and the Fano lemma) relates the average information lost in a noisy channel to the probability of the categorization error. It was derived by Robert Fano in the early 1950s while teaching a Ph.D. seminar in information theory at MIT, and later recorded in his 1961 textbook.

Contents

It is used to find a lower bound on the error probability of any decoder as well as the lower bounds for minimax risks in density estimation.

Let the random variables and represent input and output messages with a joint probability . Let represent an occurrence of error; i.e., that , with being an approximate version of . Fano's inequality is

where denotes the support of ,

is the conditional entropy,

is the probability of the communication error, and

is the corresponding binary entropy.

Proof

Define an indicator random variable , that indicates the event that our estimate is in error,

Consider . We can use the chain rule for entropies to expand this in two different ways

Equating the two

Expanding the right most term,

Since means ; being given the value of allows us to know the value of with certainty. This makes the term . On the other hand, means that , hence given the value of , we can narrow down to one of different values, allowing us to upper bound the conditional entropy . Hence

The other term, , because conditioning reduces entropy. Because of the way is defined, , meaning that . Putting it all together,

Because is a Markov chain, we have by the data processing inequality, and hence , giving us

Intuition

Fano's inequality can be interpreted as a way of dividing the uncertainty of a conditional distribution into two questions given an arbitrary predictor. The first question, corresponding to the term , relates to the uncertainty of the predictor. If the prediction is correct, there is no more uncertainty remaining. If the prediction is incorrect, the uncertainty of any discrete distribution has an upper bound of the entropy of the uniform distribution over all choices besides the incorrect prediction. This has entropy . Looking at extreme cases, if the predictor is always correct the first and second terms of the inequality are 0, and the existence of a perfect predictor implies is totally determined by , and so . If the predictor is always wrong, then the first term is 0, and can only be upper bounded with a uniform distribution over the remaining choices.

Alternative formulation

Let be a random variable with density equal to one of possible densities . Furthermore, the Kullback–Leibler divergence between any pair of densities cannot be too large,

for all

Let be an estimate of the index. Then

where is the probability induced by

Generalization

The following generalization is due to Ibragimov and Khasminskii (1979), Assouad and Birge (1983).

Let F be a class of densities with a subclass of r + 1 densities ƒθ such that for any θ  θ

Then in the worst case the expected value of error of estimation is bound from below,

where ƒn is any density estimator based on a sample of size n.

Related Research Articles

<span class="mw-page-title-main">Random variable</span> Variable representing a random phenomenon

A random variable is a mathematical formalization of a quantity or object which depends on random events. The term 'random variable' can be misleading as its mathematical definition is not actually random nor a variable, but rather it is a function from possible outcomes in a sample space to a measurable space, often to the real numbers.

<span class="mw-page-title-main">Helmholtz free energy</span> Thermodynamic potential

In thermodynamics, the Helmholtz free energy is a thermodynamic potential that measures the useful work obtainable from a closed thermodynamic system at a constant temperature (isothermal). The change in the Helmholtz energy during a process is equal to the maximum amount of work that the system can perform in a thermodynamic process in which temperature is held constant. At constant temperature, the Helmholtz free energy is minimized at equilibrium.

<span class="mw-page-title-main">Jensen's inequality</span> Theorem of convex functions

In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation; it is a simple corollary that the opposite is true of concave transformations.

Vapnik–Chervonenkis theory was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In quantum mechanics, information theory, and Fourier analysis, the entropic uncertainty or Hirschman uncertainty is defined as the sum of the temporal and spectral Shannon entropies. It turns out that Heisenberg's uncertainty principle can be expressed as a lower bound on the sum of these entropies. This is stronger than the usual statement of the uncertainty principle in terms of the product of standard deviations.

<span class="mw-page-title-main">Conditional entropy</span> Measure of relative information in probability theory

In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known. Here, information is measured in shannons, nats, or hartleys. The entropy of conditioned on is written as .

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P. While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.

In information theory, the Rényi entropy is a quantity that generalizes various notions of entropy, including Hartley entropy, Shannon entropy, collision entropy, and min-entropy. The Rényi entropy is named after Alfréd Rényi, who looked for the most general way to quantify information while preserving additivity for independent events. In the context of fractal dimension estimation, the Rényi entropy forms the basis of the concept of generalized dimensions.

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

In information theory, the noisy-channel coding theorem, establishes that for any given degree of noise contamination of a communication channel, it is possible to communicate discrete data nearly error-free up to a computable maximum rate through the channel. This result was presented by Claude Shannon in 1948 and was based in part on earlier work and ideas of Harry Nyquist and Ralph Hartley.

Differential entropy is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy, a measure of average (surprisal) of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

In information theory, information dimension is an information measure for random vectors in Euclidean space, based on the normalized entropy of finely quantized versions of the random vectors. This concept was first introduced by Alfréd Rényi in 1959.

This article discusses how information theory is related to measure theory.

In mathematics, the theory of optimal stopping or early stopping is concerned with the problem of choosing a time to take a particular action, in order to maximise an expected reward or minimise an expected cost. Optimal stopping problems can be found in areas of statistics, economics, and mathematical finance. A key example of an optimal stopping problem is the secretary problem. Optimal stopping problems can often be written in the form of a Bellman equation, and are therefore often solved using dynamic programming.

In probability theory and theoretical computer science, McDiarmid's inequality is a concentration inequality which bounds the deviation between the sampled value and the expected value of certain functions when they are evaluated on independent random variables. McDiarmid's inequality applies to functions that satisfy a bounded differences property, meaning that replacing a single argument to the function while leaving all other arguments unchanged cannot cause too large of a change in the value of the function.

Inequalities are very important in the study of information theory. There are a number of different contexts in which these inequalities appear.

In quantum information theory, the Wehrl entropy, named after Alfred Wehrl, is a classical entropy of a quantum-mechanical density matrix. It is a type of quasi-entropy defined for the Husimi Q representation of the phase-space quasiprobability distribution. See for a comprehensive review of basic properties of classical, quantum and Wehrl entropies, and their implications in statistical mechanics.

The min-entropy, in information theory, is the smallest of the Rényi family of entropies, corresponding to the most conservative way of measuring the unpredictability of a set of outcomes, as the negative logarithm of the probability of the most likely outcome. The various Rényi entropies are all equal for a uniform distribution, but measure the unpredictability of a nonuniform distribution in different ways. The min-entropy is never greater than the ordinary or Shannon entropy and that in turn is never greater than the Hartley or max-entropy, defined as the logarithm of the number of outcomes with nonzero probability.

A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, but has since been used in diverse settings in statistics, machine learning and computer science.

References