Bhattacharyya distance

In statistics, the Bhattacharyya distance is a quantity which represents a notion of similarity between two probability distributions. [1] It is closely related to the Bhattacharyya coefficient, which is a measure of the amount of overlap between two statistical samples or populations.

It is not a metric, despite being named a "distance", since it does not obey the triangle inequality.

History

Both the Bhattacharyya distance and the Bhattacharyya coefficient are named after Anil Kumar Bhattacharyya, a statistician who worked in the 1930s at the Indian Statistical Institute. [2] He developed the measure through a series of papers. [3] [4] [5] He first devised it to measure the distance between two non-normal distributions, illustrating the idea with classical multinomial populations; [3] although that work was submitted for publication in 1941, it appeared almost five years later in Sankhyā. [3] [2] He subsequently worked toward a distance measure for probability distributions that are absolutely continuous with respect to the Lebesgue measure, publishing his progress in 1942 in the Proceedings of the Indian Science Congress [4] and the final work in 1943 in the Bulletin of the Calcutta Mathematical Society. [5]

Definition

For probability distributions $P$ and $Q$ on the same domain $\mathcal{X}$, the Bhattacharyya distance is defined as

$$D_B(P, Q) = -\ln\left(BC(P, Q)\right),$$

where

$$BC(P, Q) = \sum_{x \in \mathcal{X}} \sqrt{P(x)\,Q(x)}$$

is the Bhattacharyya coefficient for discrete probability distributions.

For continuous probability distributions, with $P(dx) = p(x)\,dx$ and $Q(dx) = q(x)\,dx$ where $p$ and $q$ are the probability density functions, the Bhattacharyya coefficient is defined as

$$BC(P, Q) = \int_{\mathcal{X}} \sqrt{p(x)\,q(x)}\,dx.$$
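
As a minimal illustration of these definitions, the following sketch computes the Bhattacharyya coefficient and distance for two discrete distributions given as probability vectors; the function names and example values are chosen here purely for illustration.

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Bhattacharyya coefficient of two discrete distributions.

    p, q -- arrays of probabilities over the same finite domain,
            each summing to 1.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(np.sqrt(p * q))

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance D_B = -ln BC(p, q)."""
    return -np.log(bhattacharyya_coefficient(p, q))

# Example: two distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(bhattacharyya_coefficient(p, q))  # close to 1: strong overlap
print(bhattacharyya_distance(p, q))     # correspondingly small
```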

More generally, given two probability measures $P, Q$ on a measurable space $(\mathcal{X}, \mathcal{B})$, let $\lambda$ be a ($\sigma$-finite) measure such that $P$ and $Q$ are absolutely continuous with respect to $\lambda$, i.e. such that $P(dx) = p(x)\,\lambda(dx)$ and $Q(dx) = q(x)\,\lambda(dx)$ for probability density functions $p, q$ with respect to $\lambda$ defined $\lambda$-almost everywhere. Such a measure, even such a probability measure, always exists, e.g. $\lambda = \tfrac{1}{2}(P + Q)$. Then define the Bhattacharyya measure on $(\mathcal{X}, \mathcal{B})$ by

$$bc(dx \mid P, Q) = \sqrt{p(x)\,q(x)}\,\lambda(dx) = \sqrt{\frac{P(dx)}{\lambda(dx)}\,\frac{Q(dx)}{\lambda(dx)}}\,\lambda(dx).$$

It does not depend on the choice of $\lambda$: if $\mu$ is a measure with respect to which both $\lambda$ and another choice $\lambda'$ are absolutely continuous, i.e. $\lambda = l(x)\,\mu$ and $\lambda' = l'(x)\,\mu$, then

$$P(dx) = p(x)\,\lambda(dx) = p'(x)\,\lambda'(dx) = p(x)\,l(x)\,\mu(dx) = p'(x)\,l'(x)\,\mu(dx),$$

and similarly for $Q$. We then have

$$bc(dx \mid P, Q) = \sqrt{p(x)\,q(x)}\,\lambda(dx) = \sqrt{p(x)\,q(x)}\;l(x)\,\mu(dx) = \sqrt{p(x)\,l(x)\;q(x)\,l(x)}\,\mu(dx) = \sqrt{p'(x)\,l'(x)\;q'(x)\,l'(x)}\,\mu(dx) = \sqrt{p'(x)\,q'(x)}\,\lambda'(dx).$$

We finally define the Bhattacharyya coefficient

$$BC(P, Q) = \int_{\mathcal{X}} bc(dx \mid P, Q) = \int_{\mathcal{X}} \sqrt{p(x)\,q(x)}\,\lambda(dx).$$

By the above, the quantity $BC(P, Q)$ does not depend on $\lambda$, and by the Cauchy–Schwarz inequality $0 \le BC(P, Q) \le 1$. Using $P(dx) = p(x)\,\lambda(dx)$ and $Q(dx) = q(x)\,\lambda(dx)$,

$$BC(P, Q) = \int_{\mathcal{X}} \sqrt{p(x)\,q(x)}\,\lambda(dx) = \int_{\mathcal{X}} \sqrt{\frac{P(dx)}{\lambda(dx)}\,\frac{Q(dx)}{\lambda(dx)}}\,\lambda(dx) = \int_{\mathcal{X}} \sqrt{P(dx)\,Q(dx)}.$$
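
The independence from the dominating measure can be checked numerically on a discrete example. The sketch below (an illustration under assumed finite support, not taken from the article's sources) evaluates the coefficient once with the counting measure and once with the mixture $\lambda = \tfrac{1}{2}(P + Q)$.

```python
import numpy as np

# Two discrete distributions on the same finite domain.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.1, 0.6, 0.3])

# Choice 1: dominating measure = counting measure.
# The densities are the probability mass functions themselves.
bc_counting = np.sum(np.sqrt(P * Q))

# Choice 2: dominating measure lambda = (P + Q) / 2.
lam = (P + Q) / 2
p = P / lam          # density of P with respect to lambda
q = Q / lam          # density of Q with respect to lambda
bc_mixture = np.sum(np.sqrt(p * q) * lam)

print(bc_counting, bc_mixture)  # the two values coincide
```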

Gaussian case

Let $p \sim \mathcal{N}(\mu_p, \sigma_p^2)$ and $q \sim \mathcal{N}(\mu_q, \sigma_q^2)$, where $\mathcal{N}(\mu, \sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$; then

$$D_B(p, q) = \frac{1}{4}\,\frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2} + \frac{1}{2}\ln\!\left(\frac{\sigma_p^2 + \sigma_q^2}{2\,\sigma_p\,\sigma_q}\right).$$

And in general, given two multivariate normal distributions $p_i = \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$,

$$D_B(p_1, p_2) = \frac{1}{8}\,(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\mathsf T}\,\boldsymbol{\Sigma}^{-1}\,(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \frac{1}{2}\ln\!\left(\frac{\det\boldsymbol{\Sigma}}{\sqrt{\det\boldsymbol{\Sigma}_1\,\det\boldsymbol{\Sigma}_2}}\right),$$

where $\boldsymbol{\Sigma} = \dfrac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}$. [6] Note that the first term is a squared Mahalanobis distance.
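
A direct sketch of these closed-form expressions, assuming the univariate and multivariate formulas above (the helper names and example parameters are illustrative):

```python
import numpy as np

def bhattacharyya_gaussian_1d(mu_p, var_p, mu_q, var_q):
    """Bhattacharyya distance between two univariate normal distributions."""
    return (0.25 * (mu_p - mu_q) ** 2 / (var_p + var_q)
            + 0.5 * np.log((var_p + var_q) / (2.0 * np.sqrt(var_p * var_q))))

def bhattacharyya_gaussian(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between two multivariate normal distributions.

    mu1, mu2       -- mean vectors
    sigma1, sigma2 -- covariance matrices
    """
    sigma = 0.5 * (sigma1 + sigma2)
    diff = mu1 - mu2
    term_mahalanobis = 0.125 * diff @ np.linalg.solve(sigma, diff)
    term_det = 0.5 * np.log(np.linalg.det(sigma)
                            / np.sqrt(np.linalg.det(sigma1) * np.linalg.det(sigma2)))
    return term_mahalanobis + term_det

# Example: two bivariate Gaussians.
mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
s1 = np.eye(2)
s2 = np.array([[2.0, 0.3], [0.3, 1.0]])
print(bhattacharyya_gaussian(mu1, s1, mu2, s2))
print(bhattacharyya_gaussian_1d(0.0, 1.0, 1.0, 2.0))
```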

Properties

$0 \le BC(P, Q) \le 1$ and $0 \le D_B(P, Q) \le \infty$.

$D_B(P, Q)$ does not obey the triangle inequality, though the Hellinger distance $\sqrt{1 - BC(P, Q)}$ does.

Bounds on Bayes error

The Bhattacharyya distance can be used to upper and lower bound the Bayes error rate:

$$\frac{1}{2} - \frac{1}{2}\sqrt{1 - 4\rho^2} \;\le\; L^* \;\le\; \rho,$$

where $\rho = \mathbb{E}_X\!\left[\sqrt{\eta(X)\,(1 - \eta(X))}\right]$ and $\eta(X) = P(Y = 1 \mid X)$ is the posterior probability. [7]
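
As a numerical sanity check of these bounds, the following sketch assumes equal class priors and one-dimensional Gaussian class-conditional densities, and evaluates both the Bayes error and $\rho$ by quadrature:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

# Assumed setting: equal priors, Gaussian class-conditional densities.
prior1 = 0.5
p0 = norm(loc=0.0, scale=1.0)   # class 0 density
p1 = norm(loc=1.5, scale=1.0)   # class 1 density

x = np.linspace(-10.0, 12.0, 20_001)
px = (1 - prior1) * p0.pdf(x) + prior1 * p1.pdf(x)   # marginal density of X
eta = prior1 * p1.pdf(x) / px                        # posterior P(Y=1 | X=x)

# Bayes error and rho, computed by numerical integration.
bayes_error = trapezoid(np.minimum(eta, 1 - eta) * px, x)
rho = trapezoid(np.sqrt(eta * (1 - eta)) * px, x)

lower = 0.5 - 0.5 * np.sqrt(1 - 4 * rho**2)
print(lower <= bayes_error <= rho)   # True: the bounds hold
```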

Applications

The Bhattacharyya coefficient quantifies the "closeness" of two random statistical samples.

Given two sequences of samples drawn from distributions $p$ and $q$, bin them into $n$ buckets, and let the relative frequency of samples from $p$ in bucket $i$ be $p_i$, and similarly $q_i$ for $q$; then the sample Bhattacharyya coefficient is

$$BC(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} \sqrt{p_i\,q_i},$$

which is an estimator of $BC(p, q)$. The quality of the estimate depends on the choice of buckets: too few buckets overestimate $BC(p, q)$, while too many underestimate it.
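
A minimal sketch of this histogram estimator (the bin count and function name are illustrative; both samples must be binned on a common set of edges):

```python
import numpy as np

def sample_bhattacharyya_coefficient(x, y, n_bins=20):
    """Histogram estimate of the Bhattacharyya coefficient from two samples."""
    # Use a common set of bin edges covering both samples.
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p = p / p.sum()   # relative frequencies of the first sample
    q = q / q.sum()   # relative frequencies of the second sample
    return np.sum(np.sqrt(p * q))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)
y = rng.normal(0.5, 1.0, size=10_000)
print(sample_bhattacharyya_coefficient(x, y))
```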

A common task in classification is estimating the separability of classes. Up to a multiplicative factor, the squared Mahalanobis distance is a special case of the Bhattacharyya distance when the two classes are normally distributed with the same variances. When two classes have similar means but significantly different variances, the Mahalanobis distance would be close to zero, while the Bhattacharyya distance would not be.
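
A short numerical illustration of this point, using the univariate Gaussian formula from the Gaussian case above with a hypothetical pair of classes that share a mean but differ strongly in variance:

```python
import numpy as np

# Two classes with the same mean but very different variances.
mu1, var1 = 0.0, 1.0
mu2, var2 = 0.0, 16.0

# Mahalanobis-type term (the first term of the Gaussian Bhattacharyya distance):
mahalanobis_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)

# Full univariate Gaussian Bhattacharyya distance:
d_b = mahalanobis_term + 0.5 * np.log((var1 + var2) / (2 * np.sqrt(var1 * var2)))

print(mahalanobis_term)  # 0.0 -- the means coincide
print(d_b)               # clearly positive -- the variances differ
```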

The Bhattacharyya coefficient is used in the construction of polar codes. [8]

The Bhattacharyya distance is used in feature extraction and selection, [9] image processing, [10] speaker recognition, [11] phone clustering, [12] and in genetics. [13]

References

  1. Dodge, Yadolah (2003). The Oxford Dictionary of Statistical Terms. Oxford University Press. ISBN   978-0-19-920613-1.
  2. Sen, Pranab Kumar (1996). "Anil Kumar Bhattacharyya (1915–1996): A Reverent Remembrance". Calcutta Statistical Association Bulletin. 46 (3–4): 151–158. doi:10.1177/0008068319960301. S2CID 164326977.
  3. Bhattacharyya, A. (1946). "On a Measure of Divergence between Two Multinomial Populations". Sankhyā. 7 (4): 401–406. JSTOR 25047882.
  4. Bhattacharyya, A. (1942). "On discrimination and divergence". Proceedings of the Indian Science Congress. Asiatic Society of Bengal.
  5. Bhattacharyya, A. (March 1943). "On a measure of divergence between two statistical populations defined by their probability distributions". Bulletin of the Calcutta Mathematical Society. 35: 99–109. MR 0010358.
  6. Kashyap, Ravi (2019). "The Perfect Marriage and Much More: Combining Dimension Reduction, Distance Measures and Covariance". Physica A: Statistical Mechanics and its Applications. 536: 120938. arXiv: 1603.09060 . doi:10.1016/j.physa.2019.04.174.
  7. Devroye, L., Gyorfi, L. & Lugosi, G. A Probabilistic Theory of Pattern Recognition. Discrete Appl Math 73, 192–194 (1997).
  8. Arıkan, Erdal (July 2009). "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels". IEEE Transactions on Information Theory. 55 (7): 3051–3073. arXiv: 0807.3917 . doi:10.1109/TIT.2009.2021379. S2CID   889822.
  9. Euisun Choi, Chulhee Lee, "Feature extraction based on the Bhattacharyya distance", Pattern Recognition, Volume 36, Issue 8, August 2003, Pages 1703–1709
  10. François Goudail, Philippe Réfrégier, Guillaume Delyon, "Bhattacharyya distance as a contrast parameter for statistical processing of noisy optical images", JOSA A, Vol. 21, Issue 7, pp. 1231−1240 (2004)
  11. Chang Huai You, "An SVM Kernel With GMM-Supervector Based on the Bhattacharyya Distance for Speaker Recognition", Signal Processing Letters, IEEE, Vol 16, Is 1, pp. 49-52
  12. Mak, B., "Phone clustering using the Bhattacharyya distance", Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, Vol 4, pp. 2005–2008 vol.4, 3−6 Oct 1996
  13. Chattopadhyay, Aparna; Chattopadhyay, Asis Kumar; B-Rao, Chandrika (2004-06-01). "Bhattacharyya's distance measure as a precursor of genetic distance measures". Journal of Biosciences. 29 (2): 135–138. doi:10.1007/BF02703410. ISSN   0973-7138.