Fisher information metric

In information geometry, the Fisher information metric [1] is a particular Riemannian metric which can be defined on a smooth statistical manifold, i.e., a smooth manifold whose points are probability measures defined on a common probability space. It can be used to calculate the informational difference between measurements.

The metric is interesting in several respects. By Chentsov's theorem, the Fisher information metric on statistical models is the only Riemannian metric (up to rescaling) that is invariant under sufficient statistics. [2] [3]

It can also be understood to be the infinitesimal form of the relative entropy (i.e., the Kullback–Leibler divergence); specifically, it is the Hessian of the divergence. Alternatively, it can be understood as the metric induced by the flat-space Euclidean metric, after appropriate changes of variable. When extended to complex projective Hilbert space, it becomes the Fubini–Study metric; when written in terms of mixed states, it is the quantum Bures metric.

Considered purely as a matrix, it is known as the Fisher information matrix. In the context of estimation, where it is used to estimate hidden parameters in terms of observed random variables, its sample-based counterpart (the negative Hessian of the log-likelihood evaluated at the observed data) is known as the observed information.

Definition

Given a statistical manifold with coordinates $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$, one writes $p(x, \theta)$ for the probability density as a function of $\theta$. Here $x$ is drawn from the value space $R$ for a (discrete or continuous) random variable $X$. The probability is normalized by $\int_R p(x,\theta)\,dx = 1$, so that $p(x,\theta)\,dx$ is the distribution of $X$.

The Fisher information metric then takes the form:

$$ g_{jk}(\theta) = \int_R \frac{\partial \log p(x,\theta)}{\partial \theta_j}\, \frac{\partial \log p(x,\theta)}{\partial \theta_k}\; p(x,\theta)\, dx. $$

The integral is performed over all values x in R. The variable $\theta$ is now a coordinate on a Riemannian manifold. The labels j and k index the local coordinate axes on the manifold.
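
As a concrete illustration (a worked example added here, not part of the original text), consider the Bernoulli family with a single parameter $\theta \in (0,1)$, so that $p(1,\theta)=\theta$ and $p(0,\theta)=1-\theta$. The integral reduces to a sum over the two outcomes:

$$ g_{\theta\theta}(\theta) = \theta \left(\frac{\partial \log\theta}{\partial\theta}\right)^{\!2} + (1-\theta)\left(\frac{\partial \log(1-\theta)}{\partial\theta}\right)^{\!2} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}, $$

which is the reciprocal of the variance of a single Bernoulli trial.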

When the probability is derived from the Gibbs measure, as it would be for any Markovian process, then $\theta$ can also be understood to be a Lagrange multiplier; Lagrange multipliers are used to enforce constraints, such as holding the expectation value of some quantity constant. If there are n constraints holding n different expectation values constant, then the dimension of the manifold is n dimensions smaller than that of the original space. In this case, the metric can be explicitly derived from the partition function; a derivation and discussion are presented in the article on the partition function.

Substituting $i(x,\theta) = -\log p(x,\theta)$ from information theory, an equivalent form of the above definition is:

$$ g_{jk}(\theta) = \int_R \frac{\partial^2 i(x,\theta)}{\partial\theta_j\, \partial\theta_k}\; p(x,\theta)\, dx. $$

To show that the equivalent form equals the above definition, note that

$$ \frac{\partial^2 \log p(x,\theta)}{\partial\theta_j\, \partial\theta_k} = \frac{1}{p(x,\theta)} \frac{\partial^2 p(x,\theta)}{\partial\theta_j\, \partial\theta_k} - \frac{\partial \log p(x,\theta)}{\partial\theta_j}\, \frac{\partial \log p(x,\theta)}{\partial\theta_k} $$

and apply $\int_R (\,\cdot\,)\, p(x,\theta)\, dx$ on both sides. The first term on the right then integrates to $\partial^2\!\left(\int_R p(x,\theta)\, dx\right)\!/\partial\theta_j\partial\theta_k = 0$ by the normalization condition, so the expected Hessian of $i(x,\theta)$ coincides with the expression in the first definition.
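
The equivalence can also be checked numerically. The following sketch (an added example, not from the original text; the Bernoulli family and sample size are arbitrary choices) estimates both forms by Monte Carlo and compares them with the exact value $1/(\theta(1-\theta))$:

```python
import numpy as np

# Monte-Carlo check (added sketch) that the two equivalent forms of the Fisher
# information agree, using the Bernoulli family p(1, t) = t, p(0, t) = 1 - t,
# whose exact Fisher information is 1 / (t * (1 - t)).

rng = np.random.default_rng(0)
t = 0.3
x = rng.binomial(1, t, size=1_000_000)

# Form 1: expectation of the squared score, d/dt log p(x, t) = x/t - (1-x)/(1-t)
score = x / t - (1 - x) / (1 - t)
g_score = np.mean(score**2)

# Form 2: expectation of the second derivative of i(x, t) = -log p(x, t),
# which is d^2 i / dt^2 = x/t**2 + (1-x)/(1-t)**2
g_hessian = np.mean(x / t**2 + (1 - x) / (1 - t)**2)

print(g_score, g_hessian, 1 / (t * (1 - t)))   # all approximately 4.76
```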

Relation to the Kullback–Leibler divergence

Alternatively, the metric can be obtained as the second derivative of the relative entropy or Kullback–Leibler divergence. [4] To obtain this, one considers two probability distributions $P = P(\theta)$ and $Q = P(\theta_0)$, which are infinitesimally close to one another, so that

$$ P = Q + \sum_j \Delta\theta^j\, \frac{\partial P}{\partial\theta^j} $$

with $\Delta\theta^j$ an infinitesimally small change of $\theta$ in the $j$ direction. Then, since the Kullback–Leibler divergence $D_{\mathrm{KL}}(Q \parallel P)$ has an absolute minimum of 0 when $P = Q$, one has an expansion up to second order in $\Delta\theta$ of the form

$$ f_{\theta_0}(\theta) := D_{\mathrm{KL}}(Q \parallel P) = \tfrac{1}{2} \sum_{jk} \Delta\theta^j\, \Delta\theta^k\, g_{jk}(\theta_0) + \mathcal{O}\!\left(\Delta\theta^3\right). $$

The symmetric matrix $g_{jk}$ is positive (semi)definite and is the Hessian matrix of the function $f_{\theta_0}$ at the extremum point $\theta = \theta_0$. This can be thought of intuitively as: "The distance between two infinitesimally close points on a statistical differential manifold is the informational difference between them."
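
To see the Hessian property concretely, the following sketch (an added example, reusing the Bernoulli family from the worked example above) differentiates the Kullback–Leibler divergence numerically and compares the result with the Fisher information $1/(\theta_0(1-\theta_0))$:

```python
import numpy as np

# Added sketch: for the Bernoulli family, the second derivative of
# D_KL(P(t0) || P(t)) with respect to t, evaluated at t = t0, should equal
# the Fisher information g(t0) = 1 / (t0 * (1 - t0)).

def kl_bernoulli(t0, t):
    return t0 * np.log(t0 / t) + (1 - t0) * np.log((1 - t0) / (1 - t))

t0, h = 0.3, 1e-3
hessian = (kl_bernoulli(t0, t0 + h)
           - 2 * kl_bernoulli(t0, t0)
           + kl_bernoulli(t0, t0 - h)) / h**2   # central second difference

print(hessian, 1 / (t0 * (1 - t0)))   # both approximately 4.76
```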

Relation to Ruppeiner geometry

The Ruppeiner metric and the Weinhold metric arise as the Fisher information metric calculated for Gibbs distributions, such as the ones found in equilibrium statistical mechanics. [5] [6]

Change in free entropy

The action of a curve on a Riemannian manifold is given by

$$ A = \frac{1}{2} \int_a^b \frac{\partial\theta^j}{\partial t}\; g_{jk}(\theta)\; \frac{\partial\theta^k}{\partial t}\; dt. $$

The path parameter here is time t; this action gives the change in the free entropy of a system as it is moved from time a to time b. [6] This observation has resulted in practical applications in the chemical and processing industries: in order to minimize the change in free entropy of a system, one should follow the minimal geodesic path between the desired endpoints of the process. The geodesic minimizes the change in free entropy, as a consequence of the Cauchy–Schwarz inequality, which states that the action is bounded below by the square of the length of the curve.
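
To make the last claim explicit, the bound can be stated as follows (added here for reference as a standard Riemannian-geometry fact; the factor of 2 reflects the conventional one-half in the action above):

$$ L = \int_a^b \sqrt{\frac{\partial\theta^j}{\partial t}\; g_{jk}(\theta)\; \frac{\partial\theta^k}{\partial t}}\; dt, \qquad L^2 \le 2\,(b-a)\, A, $$

with equality for constant-speed parametrizations; among all curves joining the two endpoints, the geodesic minimizes L and hence the smallest attainable action.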

Relation to the Jensen–Shannon divergence

The Fisher metric also allows the action and the curve length to be related to the Jensen–Shannon divergence. [6] Specifically, one has

$$ \int_a^b \frac{\partial\theta^j}{\partial t}\; g_{jk}(\theta)\; \frac{\partial\theta^k}{\partial t}\; dt = 8 \int_a^b dJSD $$

where the integrand dJSD is understood to be the infinitesimal change in the Jensen–Shannon divergence along the path taken. Similarly, for the curve length, one has

$$ \int_a^b \sqrt{\frac{\partial\theta^j}{\partial t}\; g_{jk}(\theta)\; \frac{\partial\theta^k}{\partial t}}\; dt = \int_a^b \sqrt{8\, dJSD}. $$

That is, the square root of the Jensen–Shannon divergence is just the Fisher line element, divided by the square root of 8.
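
The factor of 8 can be checked numerically on a small example (an added sketch, using the Bernoulli family; the particular parameter values and step size are arbitrary):

```python
import numpy as np

# Added sketch: for two nearby Bernoulli distributions, the Jensen-Shannon
# divergence is approximately 1/8 of the squared Fisher line element
# g(t) * dt^2, with g(t) = 1 / (t * (1 - t)).

def jsd(p, q):
    m = (p + q) / 2
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

t, dt = 0.3, 1e-2
p = np.array([t, 1 - t])
q = np.array([t + dt, 1 - t - dt])

print(jsd(p, q), dt**2 / (8 * t * (1 - t)))   # both approximately 5.95e-05
```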

As Euclidean metric

For a discrete probability space, that is, a probability space on a finite set of objects, the Fisher metric can be understood to simply be the Euclidean metric restricted to the positive orthant (e.g. the "quadrant" in $\mathbb{R}^2$) of a unit sphere, after appropriate changes of variable. [7]

Consider a flat, Euclidean space, of dimension N+1, parametrized by points $y = (y_0, \ldots, y_N)$. The metric for Euclidean space is given by

$$ h = \sum_{i=0}^{N} dy_i\, dy_i $$

where the $dy_i$ are 1-forms; they are the basis vectors for the cotangent space. Writing $\frac{\partial}{\partial y_j}$ as the basis vectors for the tangent space, so that

$$ dy_j\!\left(\frac{\partial}{\partial y_k}\right) = \delta_{jk}, $$

the Euclidean metric may be written as

$$ h^{\mathrm{flat}}_{jk} = h\!\left(\frac{\partial}{\partial y_j}, \frac{\partial}{\partial y_k}\right) = \delta_{jk}. $$

The superscript 'flat' is there to remind that, when written in coordinate form, this metric is with respect to the flat-space coordinates $y$.

An N-dimensional unit sphere embedded in (N + 1)-dimensional Euclidean space may be defined as

$$ \sum_{i=0}^{N} y_i^2 = 1. $$

This embedding induces a metric on the sphere; it is inherited directly from the Euclidean metric on the ambient space. It takes exactly the same form as the above, taking care to ensure that the coordinates are constrained to lie on the surface of the sphere. This can be done, e.g. with the technique of Lagrange multipliers.

Consider now the change of variable $p_i = y_i^2$. The sphere condition now becomes the probability normalization condition

$$ \sum_i p_i = 1, $$

while the metric becomes

$$ h = \sum_i dy_i\, dy_i = \sum_i d\sqrt{p_i}\; d\sqrt{p_i} = \frac{1}{4} \sum_i \frac{dp_i\, dp_i}{p_i}. $$

The last can be recognized as one-fourth of the Fisher information metric. To complete the process, recall that the probabilities are parametric functions of the manifold variables $\theta$, that is, one has $p_i = p_i(\theta)$. Thus, the above induces a metric on the parameter manifold:

$$ h = \frac{1}{4} \sum_i \frac{dp_i(\theta)\, dp_i(\theta)}{p_i(\theta)} = \frac{1}{4} \sum_{jk} \sum_i \frac{1}{p_i(\theta)} \frac{\partial p_i(\theta)}{\partial\theta_j} \frac{\partial p_i(\theta)}{\partial\theta_k}\, d\theta_j\, d\theta_k $$

or, in coordinate form, the Fisher information metric is:

$$ g_{jk}(\theta) = 4\, h^{\mathrm{fisher}}_{jk} = \sum_i \frac{1}{p_i(\theta)} \frac{\partial p_i(\theta)}{\partial\theta_j} \frac{\partial p_i(\theta)}{\partial\theta_k} $$

where, as before,

$$ d\theta_j\!\left(\frac{\partial}{\partial\theta_k}\right) = \delta_{jk}. $$

The superscript 'fisher' is present to remind that this expression is applicable for the coordinates $\theta$; the non-coordinate form is the same as the Euclidean (flat-space) metric. That is, the Fisher information metric on a statistical manifold is simply (four times) the Euclidean metric restricted to the positive orthant of the sphere, after appropriate changes of variable.
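
The change of variable can be illustrated numerically (an added sketch; the Bernoulli family and the finite-difference step are arbitrary choices, not from the original text):

```python
import numpy as np

# Added sketch: for the Bernoulli family p(t) = (t, 1 - t), the pullback of
# the Euclidean metric through y_i = sqrt(p_i) is one-fourth of the Fisher
# information metric g(t) = 1 / (t * (1 - t)).

def y(t):
    return np.sqrt(np.array([t, 1 - t]))   # point on the unit sphere

t, h = 0.3, 1e-6
dy_dt = (y(t + h) - y(t - h)) / (2 * h)          # tangent vector on the sphere
euclidean_pullback = np.dot(dy_dt, dy_dt)        # |dy/dt|^2

print(4 * euclidean_pullback, 1 / (t * (1 - t)))  # both approximately 4.76
```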

When the random variable is not discrete, but continuous, the argument still holds. This can be seen in one of two different ways. One way is to carefully recast all of the above steps in an infinite-dimensional space, being careful to define limits appropriately, etc., in order to make sure that all manipulations are well-defined, convergent, etc. The other way, as noted by Gromov, [7] is to use a category-theoretic approach; that is, to note that the above manipulations remain valid in the category of probabilities. Here, one should note that such a category would have the Radon–Nikodym property, that is, that the Radon–Nikodym theorem holds in this category. This includes the Hilbert space of square-integrable functions; in the manipulations above, square-integrability is sufficient to safely replace the sum over squares by an integral over squares.

As Fubini–Study metric

The above manipulations deriving the Fisher metric from the Euclidean metric can be extended to complex projective Hilbert spaces. In this case, one obtains the Fubini–Study metric. [8] This should perhaps be no surprise, as the Fubini–Study metric provides the means of measuring information in quantum mechanics. The Bures metric, also known as the Helstrom metric, is identical to the Fubini–Study metric, [8] although the latter is usually written in terms of pure states, as below, whereas the Bures metric is written for mixed states. By setting the phase of the complex coordinate to zero, one obtains exactly one-fourth of the Fisher information metric, as above.

One begins with the same trick, of constructing a probability amplitude, written in polar coordinates, so:

$$ \psi(x;\theta) = \sqrt{p(x;\theta)}\; e^{\,i\alpha(x;\theta)}. $$

Here, $\psi(x;\theta)$ is a complex-valued probability amplitude; $p(x;\theta)$ and $\alpha(x;\theta)$ are strictly real. The previous calculations are obtained by setting $\alpha(x;\theta) = 0$. The usual condition that probabilities lie within a simplex, namely that

$$ \int_X p(x;\theta)\, dx = 1 \quad\text{with}\quad p(x;\theta) \ge 0, $$

is equivalently expressed by the idea that the square amplitude be normalized:

$$ \int_X |\psi(x;\theta)|^2\, dx = 1. $$

When $\psi(x;\theta)$ is real, this is the surface of a sphere.

The Fubini–Study metric, written in infinitesimal form, using quantum-mechanical bra–ket notation, is

$$ ds^2 = \frac{\langle \delta\psi \mid \delta\psi \rangle}{\langle \psi \mid \psi \rangle} - \frac{\langle \delta\psi \mid \psi \rangle\, \langle \psi \mid \delta\psi \rangle}{\langle \psi \mid \psi \rangle^2}. $$

In this notation, one has $\langle x \mid \psi \rangle = \psi(x;\theta)$, and integration over the entire measure space X is written as

$$ \langle \phi \mid \psi \rangle = \int_X \overline{\phi(x;\theta)}\; \psi(x;\theta)\, dx. $$

The expression $\vert \delta\psi \rangle$ can be understood to be an infinitesimal variation; equivalently, it can be understood to be a 1-form in the cotangent space. Using the infinitesimal notation, the polar form of the probability above is simply

$$ \delta\psi = \left( \frac{\delta p}{2p} + i\, \delta\alpha \right) \psi. $$

Inserting the above into the Fubini–Study metric gives:

$$ ds^2 = \frac{1}{4} \int_X (\delta \log p)^2\; p\, dx + \int_X (\delta\alpha)^2\; p\, dx - \left( \int_X \delta\alpha\; p\, dx \right)^{\!2} + \frac{i}{2} \int_X \left( \delta\log p\; \delta\alpha - \delta\alpha\; \delta\log p \right) p\, dx. $$

Setting $\delta\alpha = 0$ in the above makes it clear that the first term is (one-fourth of) the Fisher information metric. The full form of the above can be made slightly clearer by changing notation to that of standard Riemannian geometry, so that the metric becomes a symmetric 2-form acting on the tangent space. The change of notation is done simply by replacing $\delta \to d$ and $\int_X (\,\cdot\,)\, p\, dx \to \mathrm{E}[\,\cdot\,]$, noting that the integrals are just expectation values; so:

$$ ds^2 = \frac{1}{4} \mathrm{E}\!\left[ (d\log p)^2 \right] + \mathrm{E}\!\left[ (d\alpha)^2 \right] - \left( \mathrm{E}[\, d\alpha \,] \right)^2 + \frac{i}{2}\, \mathrm{E}\!\left[ d\log p \wedge d\alpha \right]. $$

The imaginary term is a symplectic form; it is the Berry phase or geometric phase. In index notation, the metric is:

$$ g_{jk} = \frac{1}{4} \mathrm{E}\!\left[ \frac{\partial \log p}{\partial\theta_j} \frac{\partial \log p}{\partial\theta_k} \right] + \mathrm{E}\!\left[ \frac{\partial\alpha}{\partial\theta_j} \frac{\partial\alpha}{\partial\theta_k} \right] - \mathrm{E}\!\left[ \frac{\partial\alpha}{\partial\theta_j} \right] \mathrm{E}\!\left[ \frac{\partial\alpha}{\partial\theta_k} \right] + \frac{i}{2}\, \mathrm{E}\!\left[ \frac{\partial \log p}{\partial\theta_j} \frac{\partial\alpha}{\partial\theta_k} - \frac{\partial\alpha}{\partial\theta_j} \frac{\partial \log p}{\partial\theta_k} \right]. $$

Again, the first term can clearly be seen to be (one-fourth of) the Fisher information metric, by setting $\alpha = 0$. Equivalently, the Fubini–Study metric can be understood as the metric on complex projective Hilbert space that is induced by the complex extension of the flat Euclidean metric. The difference between this and the Bures metric is that the Bures metric is written in terms of mixed states.
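
A minimal numerical illustration of this relation (an added sketch, using a real one-parameter family of qubit states; the family and parameter value are arbitrary choices, not from the original text):

```python
import numpy as np

# Added sketch: for the real one-parameter family of qubit states
# psi(t) = (cos(t/2), sin(t/2)), the Fubini-Study metric equals one-fourth of
# the Fisher information metric of the induced probabilities p = |psi|^2.

def psi(t):
    return np.array([np.cos(t / 2), np.sin(t / 2)])

t, h = 0.7, 1e-6
dpsi = (psi(t + h) - psi(t - h)) / (2 * h)
fs = np.dot(dpsi, dpsi) - np.dot(psi(t), dpsi) ** 2     # Fubini-Study coefficient

p = psi(t) ** 2
dp = (psi(t + h) ** 2 - psi(t - h) ** 2) / (2 * h)
fisher = np.sum(dp ** 2 / p)                            # Fisher metric coefficient

print(fs, fisher / 4)   # both approximately 0.25
```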

Continuously-valued probabilities

A slightly more formal, abstract definition can be given, as follows. [9]

Let X be an orientable manifold, and let $\mu$ be a measure on X. Equivalently, let $(\Omega, \mathcal{F}, P)$ be a probability space, with $\Omega = X$, sigma algebra $\mathcal{F}$, and probability measure $P$.

The statistical manifold S(X) of X is defined as the space of all measures on X (with the sigma-algebra held fixed). Note that this space is infinite-dimensional, and is commonly taken to be a Fréchet space. The points of S(X) are measures.

Pick a point $\mu \in S(X)$ and consider the tangent space $T_\mu S$. The Fisher information metric is then an inner product on the tangent space. With some abuse of notation, one may write this as

$$ g(\sigma_1, \sigma_2) = \int_X \frac{d\sigma_1}{d\mu}\, \frac{d\sigma_2}{d\mu}\; d\mu. $$

Here, $\sigma_1$ and $\sigma_2$ are vectors in the tangent space; that is, $\sigma_1, \sigma_2 \in T_\mu S$. The abuse of notation is to write the tangent vectors as if they were derivatives, and to insert the extraneous d in writing the integral: the integration is meant to be carried out using the measure $\mu$ over the whole space X. This abuse of notation is, in fact, taken to be perfectly normal in measure theory; it is the standard notation for the Radon–Nikodym derivative.
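
On a finite set, the Radon–Nikodym derivatives reduce to pointwise ratios of weights, which the following added sketch illustrates (the three-point space and the particular tangent vectors are arbitrary choices, not from the original text):

```python
import numpy as np

# Added sketch (finite case): measures are weight vectors, tangent vectors are
# signed measures summing to zero, and the Radon-Nikodym derivatives
# d(sigma)/d(mu) are simply pointwise ratios.

mu     = np.array([0.2, 0.3, 0.5])    # base probability measure
sigma1 = np.array([0.1, -0.1, 0.0])   # tangent vectors: signed measures with total mass 0
sigma2 = np.array([0.0, 0.2, -0.2])

g = np.sum((sigma1 / mu) * (sigma2 / mu) * mu)   # = sum(sigma1 * sigma2 / mu)
print(g)   # approximately -0.0667
```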

In order for the integral to be well-defined, the space S(X) must have the Radon–Nikodym property, and more specifically, the tangent space is restricted to those vectors that are square-integrable. Square integrability is equivalent to saying that a Cauchy sequence converges to a finite value under the weak topology: the space contains its limit points. Note that Hilbert spaces possess this property.

This definition of the metric can be seen to be equivalent to the previous one, in several steps. First, one selects a submanifold of S(X) by considering only those measures $\mu$ that are parameterized by some smoothly varying parameter $\theta$. Then, if $\theta$ is finite-dimensional, so is the submanifold; likewise, the tangent space has the same dimension as $\theta$.

With some additional abuse of language, one notes that the exponential map provides a map from vectors in a tangent space to points in an underlying manifold. Thus, if $\sigma \in T_\mu S$ is a vector in the tangent space, then $p = \exp(\sigma)$ is the corresponding probability associated with the point $p \in S(X)$ (after the parallel transport of the exponential map to $\mu$). Conversely, given a point $p \in S(X)$, the logarithm gives a point in the tangent space (roughly speaking, as again, one must transport from the origin $\mu$ to the point $p$; for details, refer to the original sources). Thus, one has the appearance of logarithms in the simpler definition given previously.

Notes

  1. Nielsen, Frank (2023). "A Simple Approximation Method for the Fisher–Rao Distance between Multivariate Normal Distributions". Entropy. 25 (4): 654. arXiv:2302.08175. Bibcode:2023Entrp..25..654N. doi:10.3390/e25040654. PMC 10137715. PMID 37190442.
  2. Amari, Shun-ichi; Nagaoka, Hiroshi (2000). "Chentsov's theorem and some historical remarks". Methods of Information Geometry. New York: Oxford University Press. pp. 37–40. ISBN 0-8218-0531-2.
  3. Dowty, James G. (2018). "Chentsov's theorem for exponential families". Information Geometry. 1 (1): 117–135. arXiv:1701.08895. doi:10.1007/s41884-018-0006-4. S2CID 5954036.
  4. Cover, Thomas M.; Thomas, Joy A. (2006). Elements of Information Theory (2nd ed.). Hoboken: John Wiley & Sons. ISBN 0-471-24195-4.
  5. Brody, Dorje; Hook, Daniel (2008). "Information geometry in vapour-liquid equilibrium". Journal of Physics A. 42 (2): 023001. arXiv:0809.1166. doi:10.1088/1751-8113/42/2/023001. S2CID 118311636.
  6. Crooks, Gavin E. (2007). "Measuring thermodynamic length". Physical Review Letters. 99 (10): 100602. arXiv:0706.0559. doi:10.1103/PhysRevLett.99.100602. PMID 17930381. S2CID 7527491.
  7. Gromov, Misha (2012). "In a Search for a Structure, Part 1: On Entropy" (PDF).
  8. Facchi, Paolo; et al. (2010). "Classical and Quantum Fisher Information in the Geometrical Formulation of Quantum Mechanics". Physics Letters A. 374 (48): 4801–4803. arXiv:1009.5219. Bibcode:2010PhLA..374.4801F. doi:10.1016/j.physleta.2010.10.005. S2CID 55558124.
  9. Itoh, Mitsuhiro; Shishido, Yuichi (2008). "Fisher information metric and Poisson kernels" (PDF). Differential Geometry and Its Applications. 26 (4): 347–356. doi:10.1016/j.difgeo.2007.11.027. hdl:2241/100265.
