Bhattacharyya distance

In statistics, the Bhattacharyya distance is a quantity which represents a notion of similarity between two probability distributions. [1] It is closely related to the Bhattacharyya coefficient, which is a measure of the amount of overlap between two statistical samples or populations.

It is not a metric, despite being named a "distance", since it does not obey the triangle inequality.

History

Both the Bhattacharyya distance and the Bhattacharyya coefficient are named after Anil Kumar Bhattacharyya, a statistician who worked in the 1930s at the Indian Statistical Institute. [2] He developed the measure through a series of papers. [3] [4] [5] He first devised it to measure the distance between two non-normal distributions, illustrating the idea with the classical multinomial populations; [3] although this work was submitted for publication in 1941, it appeared almost five years later in Sankhyā. [3] [2] Bhattacharyya subsequently worked toward a distance measure for probability distributions that are absolutely continuous with respect to the Lebesgue measure, presenting his progress in 1942 at the Proceedings of the Indian Science Congress, [4] with the final work appearing in 1943 in the Bulletin of the Calcutta Mathematical Society. [5]

Definition

For probability distributions $P$ and $Q$ on the same domain $\mathcal{X}$, the Bhattacharyya distance is defined as

$$D_B(P, Q) = -\ln\left(BC(P, Q)\right),$$

where

$$BC(P, Q) = \sum_{x \in \mathcal{X}} \sqrt{P(x)\,Q(x)}$$

is the Bhattacharyya coefficient for discrete probability distributions.

For continuous probability distributions, with $P(dx) = p(x)\,dx$ and $Q(dx) = q(x)\,dx$ where $p$ and $q$ are the probability density functions, the Bhattacharyya coefficient is defined as

$$BC(P, Q) = \int_{\mathcal{X}} \sqrt{p(x)\,q(x)}\,dx.$$
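As a minimal illustration of the discrete definition, the following Python sketch computes the coefficient and the distance directly from the formulas above (the function names and the two example distributions are illustrative choices, not part of any standard library):

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Bhattacharyya coefficient of two discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.sqrt(p * q))

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance D_B = -ln(BC)."""
    return -np.log(bhattacharyya_coefficient(p, q))

# Two distributions over the same three-point domain.
p = [0.2, 0.5, 0.3]
q = [0.1, 0.4, 0.5]
print(bhattacharyya_coefficient(p, q))  # about 0.976: heavy overlap
print(bhattacharyya_distance(p, q))     # about 0.024: small distance
```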

More generally, given two probability measures $P, Q$ on a measurable space $(\mathcal{X}, \mathcal{B})$, let $\lambda$ be a ($\sigma$-finite) measure such that $P$ and $Q$ are absolutely continuous with respect to $\lambda$, i.e. such that $P(dx) = p(x)\,\lambda(dx)$ and $Q(dx) = q(x)\,\lambda(dx)$ for probability density functions $p, q$ with respect to $\lambda$ defined $\lambda$-almost everywhere. Such a measure, even such a probability measure, always exists, e.g. $\lambda = \tfrac{1}{2}(P + Q)$. Then define the Bhattacharyya measure on $(\mathcal{X}, \mathcal{B})$ by

$$bc(dx \mid P, Q) = \sqrt{p(x)\,q(x)}\,\lambda(dx).$$

It does not depend on the measure $\lambda$, for if we choose a measure $\mu$ such that $\lambda$ and another measure choice $\lambda'$ are absolutely continuous, i.e. $\lambda = l(x)\,\mu$ and $\lambda' = l'(x)\,\mu$, then

$$P(dx) = p(x)\,\lambda(dx) = p'(x)\,\lambda'(dx) = p(x)\,l(x)\,\mu(dx) = p'(x)\,l'(x)\,\mu(dx),$$

and similarly for $Q$. We then have

$$bc(dx \mid P, Q) = \sqrt{p(x)\,q(x)}\,\lambda(dx) = \sqrt{p(x)\,l(x)\,q(x)\,l(x)}\,\mu(dx) = \sqrt{p'(x)\,l'(x)\,q'(x)\,l'(x)}\,\mu(dx) = \sqrt{p'(x)\,q'(x)}\,\lambda'(dx).$$

We finally define the Bhattacharyya coefficient as

$$BC(P, Q) = \int_{\mathcal{X}} bc(dx \mid P, Q) = \int_{\mathcal{X}} \sqrt{p(x)\,q(x)}\,\lambda(dx).$$

By the above, the quantity $BC(P, Q)$ does not depend on $\lambda$, and by the Cauchy inequality $0 \le BC(P, Q) \le 1$. In particular, if $P$ is absolutely continuous with respect to $Q$ with Radon–Nikodym derivative $p(x) = \frac{dP}{dQ}(x)$, then

$$BC(P, Q) = \int_{\mathcal{X}} \sqrt{p(x)}\,Q(dx) = \int_{\mathcal{X}} \sqrt{\frac{dP}{dQ}}\,dQ = E_Q\!\left[\sqrt{\frac{dP}{dQ}}\,\right].$$
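For instance, taking $\lambda$ to be the counting measure on a discrete domain, or the Lebesgue measure on $\mathbb{R}^n$, recovers the two special cases stated at the start of this section:

$$BC(P, Q) = \sum_{x \in \mathcal{X}} \sqrt{P(x)\,Q(x)} \qquad (\lambda = \text{counting measure}),$$

$$BC(P, Q) = \int_{\mathcal{X}} \sqrt{p(x)\,q(x)}\,dx \qquad (\lambda = \text{Lebesgue measure}).$$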

Gaussian case

Let $p \sim \mathcal{N}(\mu_p, \sigma_p^2)$ and $q \sim \mathcal{N}(\mu_q, \sigma_q^2)$, where $\mathcal{N}(\mu, \sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$; then

$$D_B(p, q) = \frac{1}{4} \frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2} + \frac{1}{2} \ln\!\left(\frac{\sigma_p^2 + \sigma_q^2}{2\,\sigma_p \sigma_q}\right).$$

And in general, given two multivariate normal distributions $p_i = \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$,

$$D_B(p_1, p_2) = \frac{1}{8} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\mathsf T}\, \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \frac{1}{2} \ln\!\left(\frac{\det \boldsymbol{\Sigma}}{\sqrt{\det \boldsymbol{\Sigma}_1 \det \boldsymbol{\Sigma}_2}}\right),$$

where $\boldsymbol{\Sigma} = \frac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}$. [6] Note that the first term is proportional to the squared Mahalanobis distance between the means, computed with the average covariance $\boldsymbol{\Sigma}$.
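The closed form above is straightforward to evaluate numerically. The following Python sketch (the function name and the example parameters are illustrative) implements the multivariate formula and checks it on a one-dimensional pair of Gaussians:

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2)."""
    mu1, mu2 = np.atleast_1d(mu1).astype(float), np.atleast_1d(mu2).astype(float)
    cov1, cov2 = np.atleast_2d(cov1).astype(float), np.atleast_2d(cov2).astype(float)
    sigma = 0.5 * (cov1 + cov2)                       # average covariance
    diff = mu1 - mu2
    mahalanobis_term = 0.125 * diff @ np.linalg.solve(sigma, diff)
    log_det_term = 0.5 * np.log(
        np.linalg.det(sigma) / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))
    )
    return mahalanobis_term + log_det_term

# One-dimensional check: N(0, 1) vs N(1, 4), i.e. variances 1 and 4.
# Univariate formula: 1/4 * 1/(1+4) + 1/2 * ln(5 / (2*1*2)) ≈ 0.1616
print(bhattacharyya_gaussian([0.0], [[1.0]], [1.0], [[4.0]]))
```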

Properties

$0 \le BC(P, Q) \le 1$ and $0 \le D_B(P, Q) \le \infty$.

$D_B(P, Q)$ does not obey the triangle inequality, though the Hellinger distance $\sqrt{1 - BC(P, Q)}$ does.

Bounds on Bayes Error

The Bhattacharyya distance can be used to upper and lower bound the Bayes error rate:

$$\frac{1}{2} - \frac{1}{2} \sqrt{1 - 4\rho^2} \;\le\; L^{*} \;\le\; \rho,$$

where $\rho = E_x\!\left[\sqrt{\eta(x)\,(1 - \eta(x))}\,\right]$ and $\eta(x)$ is the posterior probability of one of the two classes given $x$. [7]
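As a sanity check under assumed equal class priors and Gaussian class-conditional densities (all of these modelling choices are illustrative), the bound can be evaluated numerically; with equal priors, $\rho$ reduces to half the Bhattacharyya coefficient of the two class-conditional densities:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Two equally likely classes with Gaussian class-conditional densities.
p0 = norm(loc=0.0, scale=1.0).pdf
p1 = norm(loc=2.0, scale=1.0).pdf

# With equal priors, rho = E_x[sqrt(eta(x)(1 - eta(x)))] simplifies to
# (1/2) * integral of sqrt(p0(x) p1(x)) dx, i.e. half the Bhattacharyya coefficient.
rho, _ = quad(lambda x: 0.5 * np.sqrt(p0(x) * p1(x)), -np.inf, np.inf)

lower = 0.5 - 0.5 * np.sqrt(1.0 - 4.0 * rho**2)
upper = rho
bayes_error = norm.cdf(-1.0)  # exact Bayes error here: Phi(-|mu1 - mu0| / (2 * sigma))

print(f"{lower:.4f} <= {bayes_error:.4f} <= {upper:.4f}")  # 0.1025 <= 0.1587 <= 0.3033
```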

Applications

The Bhattacharyya coefficient quantifies the "closeness" of two random statistical samples.

Given two sequences of samples drawn from distributions $P$ and $Q$, bin them into $n$ buckets, and let the relative frequency of samples from $P$ in bucket $i$ be $p_i$, and similarly $q_i$ for $Q$; the sample Bhattacharyya coefficient is then

$$BC(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} \sqrt{p_i\, q_i},$$

which is an estimator of $BC(P, Q)$. The quality of the estimate depends on the choice of buckets: too few buckets would overestimate $BC(P, Q)$, while too many would underestimate it.
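A minimal Python sketch of this histogram estimator (the function name, bin count, and sample distributions are illustrative choices):

```python
import numpy as np

def sample_bhattacharyya_coefficient(x, y, n_bins=20):
    """Histogram-based estimate of the Bhattacharyya coefficient from two samples."""
    lo = min(np.min(x), np.min(y))
    hi = max(np.max(x), np.max(y))
    edges = np.linspace(lo, hi, n_bins + 1)        # shared buckets for both samples
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p = p / p.sum()                                # relative frequencies p_i
    q = q / q.sum()                                # relative frequencies q_i
    return np.sum(np.sqrt(p * q))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)
y = rng.normal(0.5, 1.0, size=5000)
# True value for these Gaussians is exp(-1/32) ≈ 0.969; the estimate should be close.
print(sample_bhattacharyya_coefficient(x, y))
```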

A common task in classification is estimating the separability of classes. Up to a multiplicative factor, the squared Mahalanobis distance is a special case of the Bhattacharyya distance when the two classes are normally distributed with the same variances. When two classes have similar means but significantly different variances, the Mahalanobis distance would be close to zero, while the Bhattacharyya distance would not be.
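The following sketch (with illustrative numbers) makes this contrast concrete for two one-dimensional classes that share a mean but differ strongly in variance:

```python
import numpy as np

# Two classes with identical means but very different variances.
mu, var_a, var_b = 0.0, 1.0, 25.0

# The Mahalanobis distance between the class means (using the average variance)
# vanishes, because the means coincide.
pooled_var = 0.5 * (var_a + var_b)
mahalanobis = abs(mu - mu) / np.sqrt(pooled_var)
print(mahalanobis)  # 0.0

# The Bhattacharyya distance still separates the classes through its variance term.
d_b = 0.5 * np.log((var_a + var_b) / (2.0 * np.sqrt(var_a * var_b)))
print(d_b)  # 0.5 * ln(2.6) ≈ 0.478
```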

The Bhattacharyya coefficient is used in the construction of polar codes. [8]

The Bhattacharyya distance is used in feature extraction and selection, [9] image processing, [10] speaker recognition, [11] phone clustering, [12] and in genetics. [13]

References

  1. Dodge, Yadolah (2003). The Oxford Dictionary of Statistical Terms. Oxford University Press. ISBN 978-0-19-920613-1.
  2. Sen, Pranab Kumar (1996). "Anil Kumar Bhattacharyya (1915–1996): A Reverent Remembrance". Calcutta Statistical Association Bulletin. 46 (3–4): 151–158. doi:10.1177/0008068319960301. S2CID 164326977.
  3. Bhattacharyya, A. (1946). "On a Measure of Divergence between Two Multinomial Populations". Sankhyā. 7 (4): 401–406. JSTOR 25047882.
  4. Bhattacharyya, A. (1942). "On discrimination and divergence". Proceedings of the Indian Science Congress. Asiatic Society of Bengal.
  5. Bhattacharyya, A. (March 1943). "On a measure of divergence between two statistical populations defined by their probability distributions". Bulletin of the Calcutta Mathematical Society. 35: 99–109. MR 0010358.
  6. Kashyap, Ravi (2019). "The Perfect Marriage and Much More: Combining Dimension Reduction, Distance Measures and Covariance". Physica A: Statistical Mechanics and its Applications. 536: 120938. arXiv:1603.09060. doi:10.1016/j.physa.2019.04.174.
  7. Devroye, L.; Gyorfi, L.; Lugosi, G. (1997). A Probabilistic Theory of Pattern Recognition. Discrete Appl Math. 73: 192–194.
  8. Arıkan, Erdal (July 2009). "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels". IEEE Transactions on Information Theory. 55 (7): 3051–3073. arXiv:0807.3917. doi:10.1109/TIT.2009.2021379. S2CID 889822.
  9. Choi, Euisun; Lee, Chulhee (August 2003). "Feature extraction based on the Bhattacharyya distance". Pattern Recognition. 36 (8): 1703–1709.
  10. Goudail, François; Réfrégier, Philippe; Delyon, Guillaume (2004). "Bhattacharyya distance as a contrast parameter for statistical processing of noisy optical images". JOSA A. 21 (7): 1231–1240.
  11. You, Chang Huai. "An SVM Kernel With GMM-Supervector Based on the Bhattacharyya Distance for Speaker Recognition". IEEE Signal Processing Letters. 16 (1): 49–52.
  12. Mak, B. (October 1996). "Phone clustering using the Bhattacharyya distance". Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96). 4: 2005–2008.
  13. Chattopadhyay, Aparna; Chattopadhyay, Asis Kumar; B-Rao, Chandrika (2004-06-01). "Bhattacharyya's distance measure as a precursor of genetic distance measures". Journal of Biosciences. 29 (2): 135–138. doi:10.1007/BF02703410. ISSN 0973-7138.