Energy distance

Energy distance is a statistical distance between probability distributions. If X and Y are independent random vectors in $\mathbb{R}^d$ with cumulative distribution functions (cdf) F and G respectively, then the energy distance between the distributions F and G is defined to be the square root of

$$D^2(F,G) = 2\operatorname{E}\|X - Y\| - \operatorname{E}\|X - X'\| - \operatorname{E}\|Y - Y'\| \geq 0,$$

where (X, X', Y, Y') are independent, the cdf of X and X' is F, the cdf of Y and Y' is G, $\operatorname{E}$ is the expected value, and $\|\cdot\|$ denotes the length of a vector. Energy distance satisfies all axioms of a metric; in particular, it characterizes the equality of distributions: D(F,G) = 0 if and only if F = G. Energy distance for statistical applications was introduced in 1985 by Gábor J. Székely, who proved that for real-valued random variables $D^2(F,G)$ is exactly twice Harald Cramér's distance: [1]

$$D^2(F,G) = 2\int_{-\infty}^{\infty} \bigl(F(x) - G(x)\bigr)^2 \, dx.$$

For a simple proof of this equivalence, see Székely (2002). [2]
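The one-dimensional identity can be checked numerically. The following is a minimal sketch, not taken from the cited sources: it applies both formulas to the empirical distributions of two small samples, for which the identity also holds exactly. The function names `energy_distance_sq` and `cramer_distance` and the sample parameters are illustrative assumptions.

```python
import random

def energy_distance_sq(xs, ys):
    """2 E|X-Y| - E|X-X'| - E|Y-Y'| for the empirical distributions of xs and ys."""
    def mean_abs(a, b):
        return sum(abs(u - v) for u in a for v in b) / (len(a) * len(b))
    return 2 * mean_abs(xs, ys) - mean_abs(xs, xs) - mean_abs(ys, ys)

def cramer_distance(xs, ys):
    """Integral of (F_n(t) - G_m(t))^2 dt for the two empirical cdfs.

    Both cdfs are step functions that only change at sample points, so the
    integral is an exact finite sum over the gaps between pooled points.
    """
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        f = sum(x <= left for x in xs) / len(xs)
        g = sum(y <= left for y in ys) / len(ys)
        total += (f - g) ** 2 * (right - left)
    return total

rng = random.Random(0)  # fixed seed so the run is reproducible
xs = [rng.gauss(0.0, 1.0) for _ in range(40)]
ys = [rng.gauss(0.5, 1.5) for _ in range(30)]

d2 = energy_distance_sq(xs, ys)
cramer = cramer_distance(xs, ys)
print(d2, 2 * cramer)  # the two numbers agree up to rounding error
```

The within-sample averages here include the zero diagonal terms, so the computed quantity is the energy distance between the two empirical distributions rather than an unbiased estimate of the population value.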

In higher dimensions, however, the two distances are different because the energy distance is rotation invariant while Cramér's distance is not. (Notice that Cramér's distance is not the same as the distribution-free Cramér–von Mises criterion.)

Generalization to metric spaces

One can generalize the notion of energy distance to probability distributions on metric spaces. Let $(M, d)$ be a metric space with its Borel sigma algebra $\mathcal{B}(M)$. Let $\mathcal{P}(M)$ denote the collection of all probability measures on the measurable space $(M, \mathcal{B}(M))$. If μ and ν are probability measures in $\mathcal{P}(M)$, then the energy distance $D$ of μ and ν can be defined as the square root of

$$D^2(\mu,\nu) = 2\operatorname{E}[d(X,Y)] - \operatorname{E}[d(X,X')] - \operatorname{E}[d(Y,Y')],$$

where X and X' have distribution μ, Y and Y' have distribution ν, and all four are independent.

This is not necessarily non-negative, however. If $d$ is a strongly negative definite kernel, then $D$ is a metric, and conversely. [3] Non-negativity of $D^2$ for all μ and ν is expressed by saying that $(M, d)$ has negative type. Negative type is not sufficient for $D$ to be a metric; the latter condition is expressed by saying that $(M, d)$ has strong negative type. In this situation, the energy distance is zero if and only if X and Y are identically distributed. An example of a metric space of negative type but not of strong negative type is the plane with the taxicab metric. All Euclidean spaces and even separable Hilbert spaces have strong negative type. [4]

In the literature on kernel methods for machine learning, these generalized notions of energy distance are studied under the name of maximum mean discrepancy. The equivalence of distance-based and kernel methods for hypothesis testing has been established by several authors. [5] [6]
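The link to maximum mean discrepancy (MMD) can be illustrated on empirical distributions. The sketch below assumes the distance-induced kernel k(u, v) = ½[d(u, z0) + d(v, z0) − d(u, v)] with an arbitrary base point z0, a construction used in the distance/kernel equivalence literature. Expanding the definitions gives D² = 2·MMD² for this kernel. All function names and sample parameters are illustrative.

```python
import math
import random

def mean_over_pairs(func, a, b):
    """Average of func(u, v) over all pairs u in a, v in b."""
    return sum(func(u, v) for u in a for v in b) / (len(a) * len(b))

def energy_distance_sq(xs, ys, d):
    """Generalized squared energy distance of two empirical measures under metric d."""
    return (2 * mean_over_pairs(d, xs, ys)
            - mean_over_pairs(d, xs, xs)
            - mean_over_pairs(d, ys, ys))

def distance_kernel(d, z0):
    """Kernel induced by the metric d and a base point z0 (assumed construction)."""
    def k(u, v):
        return 0.5 * (d(u, z0) + d(v, z0) - d(u, v))
    return k

def mmd_sq(xs, ys, k):
    """Squared maximum mean discrepancy of two empirical measures under kernel k."""
    return (mean_over_pairs(k, xs, xs)
            + mean_over_pairs(k, ys, ys)
            - 2 * mean_over_pairs(k, xs, ys))

rng = random.Random(1)  # fixed seed so the run is reproducible
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(25)]
ys = [(rng.gauss(1, 1), rng.gauss(0, 2)) for _ in range(35)]

d2 = energy_distance_sq(xs, ys, math.dist)
mmd2 = mmd_sq(xs, ys, distance_kernel(math.dist, (0.0, 0.0)))
print(d2, 2 * mmd2)  # equal up to rounding, whatever base point z0 is used
```

Because `energy_distance_sq` takes the metric as an argument, another metric can be passed in place of `math.dist`. Whether the result is then a genuine distance between distributions depends on the negative-type conditions described above.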

Energy statistics

A related statistical concept, the notion of E-statistic or energy statistic, [7] was introduced by Gábor J. Székely in the 1980s when he was giving colloquium lectures in Budapest, Hungary and at MIT, Yale, and Columbia. This concept is based on the notion of Newton's potential energy. [8] The idea is to consider statistical observations as heavenly bodies governed by a statistical potential energy which is zero only when an underlying statistical null hypothesis is true. Energy statistics are functions of distances between statistical observations.

Energy distance and the E-statistic were considered as N-distances and N-statistics in: Zinger A. A., Kakosyan A. V., Klebanov L. B., "Characterization of distributions by means of mean values of some statistics in connection with some probability metrics", Stability Problems for Stochastic Models, Moscow, VNIISI, 1989, 47–55 (in Russian); English translation: "A characterization of distributions by mean values of statistics and certain probabilistic metrics", Journal of Soviet Mathematics (1992). The same paper defined strongly negative definite kernels and provided the generalization to metric spaces discussed above. The book [3] also gives these results and their applications to statistical testing, together with some applications to recovering a measure from its potential.

Testing for equal distributions

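Consider the null hypothesis that two random variables, X and Y, have the same probability distribution: $\mu = \nu$, where $\mu$ and $\nu$ denote the distributions of X and Y. For statistical samples from X and Y:

$$x_1, \dots, x_n \quad \text{and} \quad y_1, \dots, y_m,$$

the following arithmetic averages of distances are computed between the X and the Y samples:

$$A := \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\|x_i - y_j\|, \quad B := \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|x_i - x_j\|, \quad C := \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m}\|y_i - y_j\|.$$

The E-statistic of the underlying null hypothesis is defined as follows:

$$E_{n,m}(X,Y) := 2A - B - C.$$

One can prove [8] [9] that $E_{n,m}(X,Y) \geq 0$ and that the corresponding population value is zero if and only if X and Y have the same distribution ($\mu = \nu$). Under this null hypothesis the test statistic

$$T = \frac{nm}{n+m}\,E_{n,m}(X,Y)$$

converges in distribution to a quadratic form of independent standard normal random variables. Under the alternative hypothesis T tends to infinity. This makes it possible to construct a consistent statistical test, the energy test for equal distributions. [10]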
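A two-sample energy test can be sketched as follows. The statistic is twice the mean between-sample distance minus the two mean within-sample distances, scaled by nm/(n+m). Since the limiting null distribution is not available in closed form, this sketch calibrates the test by permuting the pooled sample; the permutation calibration is an assumption of this example rather than something prescribed by the text. The function names, the number of permutations and the simulated data are all illustrative.

```python
import math
import random

def pairwise_distances(points):
    """Full Euclidean distance matrix of a list of points."""
    return [[math.dist(p, q) for q in points] for p in points]

def e_statistic(dist, idx_x, idx_y):
    """2A - B - C computed from a distance matrix and two index lists."""
    n, m = len(idx_x), len(idx_y)
    a = sum(dist[i][j] for i in idx_x for j in idx_y) / (n * m)
    b = sum(dist[i][j] for i in idx_x for j in idx_x) / (n * n)
    c = sum(dist[i][j] for i in idx_y for j in idx_y) / (m * m)
    return 2 * a - b - c

def energy_test(xs, ys, n_perm=199, seed=0):
    """Return (T, permutation p-value) for the hypothesis of equal distributions."""
    rng = random.Random(seed)
    n, m = len(xs), len(ys)
    dist = pairwise_distances(list(xs) + list(ys))
    scale = n * m / (n + m)
    indices = list(range(n + m))
    t_obs = scale * e_statistic(dist, indices[:n], indices[n:])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(indices)  # random relabelling of the pooled sample
        t_perm = scale * e_statistic(dist, indices[:n], indices[n:])
        if t_perm >= t_obs:
            exceed += 1
    # Counting the observed statistic itself keeps the p-value strictly positive.
    return t_obs, (exceed + 1) / (n_perm + 1)

rng = random.Random(42)  # fixed seed so the run is reproducible
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
ys = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(30)]

t_obs, p_value = energy_test(xs, ys)
print(t_obs, p_value)
```

Precomputing the distance matrix once means each permutation only re-sums existing entries, which keeps the cost at O((n+m)²) per permutation.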

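The E-coefficient of inhomogeneity can also be introduced. This is always between 0 and 1 and is defined as

$$H = \frac{D^2(F_X, F_Y)}{2\operatorname{E}\|X - Y\|} = \frac{2\operatorname{E}\|X - Y\| - \operatorname{E}\|X - X'\| - \operatorname{E}\|Y - Y'\|}{2\operatorname{E}\|X - Y\|},$$

where $\operatorname{E}$ denotes the expected value. H = 0 exactly when X and Y have the same distribution.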
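A plug-in sample version of the coefficient replaces each expectation by the corresponding average of pairwise distances. The sketch below is one such estimate under that assumption. The names `mean_distance` and `inhomogeneity` and the simulated data are illustrative.

```python
import math
import random

def mean_distance(a, b):
    """Average Euclidean distance over all pairs u in a, v in b."""
    return sum(math.dist(u, v) for u in a for v in b) / (len(a) * len(b))

def inhomogeneity(xs, ys):
    """Plug-in estimate of H = D^2 / (2 E||X - Y||) from two samples."""
    a = mean_distance(xs, ys)
    if a == 0.0:
        # All points coincide, so the two samples are indistinguishable.
        return 0.0
    b = mean_distance(xs, xs)
    c = mean_distance(ys, ys)
    return (2 * a - b - c) / (2 * a)

rng = random.Random(2)  # fixed seed so the run is reproducible
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
ys = [(rng.gauss(1, 1), rng.gauss(1, 1)) for _ in range(30)]

h_diff = inhomogeneity(xs, ys)  # samples from different distributions
h_same = inhomogeneity(xs, xs)  # a sample compared with itself
print(h_diff, h_same)
```

The estimate stays within [0, 1] because the numerator is non-negative and never exceeds the denominator when the within-sample averages are non-negative.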

Goodness-of-fit

A multivariate goodness-of-fit measure is defined for distributions in arbitrary dimension (not restricted by sample size). The energy goodness-of-fit statistic is

$$Q_n = n\left(\frac{2}{n}\sum_{i=1}^{n}\operatorname{E}\|x_i - X\|^{\alpha} - \operatorname{E}\|X - X'\|^{\alpha} - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|x_i - x_j\|^{\alpha}\right),$$

where X and X' are independent and identically distributed according to the hypothesized distribution, and $\alpha \in (0, 2)$. The only required condition is that X has finite $\alpha$ moment under the null hypothesis. Under the null hypothesis $\operatorname{E} Q_n = \operatorname{E}\|X - X'\|^{\alpha}$, and the asymptotic distribution of $Q_n$ is a quadratic form of centered Gaussian random variables. Under an alternative hypothesis, $Q_n$ tends to infinity stochastically, and thus determines a statistically consistent test. For most applications the exponent $\alpha = 1$ (Euclidean distance) can be applied. The important special case of testing multivariate normality [9] is implemented in the energy package for R. Tests are also developed for heavy-tailed distributions such as Pareto (power law) or stable distributions by application of exponents in (0,1).
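For a concrete case, the statistic can be written out for the simple hypothesis that a univariate sample comes from the standard normal distribution, with exponent 1. The sketch below uses the closed forms E|a − Z| = 2φ(a) + a(2Φ(a) − 1) and E|Z − Z'| = 2/√π for independent standard normal Z and Z'. Restricting to one dimension and to a fully specified null is an assumption of this example, and the function names are illustrative.

```python
import math
import random

def expected_abs_dev_std_normal(a):
    """E|a - Z| for Z ~ N(0, 1), using the normal pdf and cdf."""
    phi = math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)
    big_phi = 0.5 * (1 + math.erf(a / math.sqrt(2)))
    return 2 * phi + a * (2 * big_phi - 1)

def energy_gof_std_normal(sample):
    """Energy goodness-of-fit statistic Q_n against N(0, 1) with exponent 1."""
    n = len(sample)
    term1 = 2 / n * sum(expected_abs_dev_std_normal(x) for x in sample)
    term2 = 2 / math.sqrt(math.pi)  # E|Z - Z'| for independent standard normals
    term3 = sum(abs(u - v) for u in sample for v in sample) / (n * n)
    return n * (term1 - term2 - term3)

rng = random.Random(3)  # fixed seed so the run is reproducible
null_sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
shifted_sample = [rng.gauss(1.5, 1.0) for _ in range(100)]

q_null = energy_gof_std_normal(null_sample)
q_shifted = energy_gof_std_normal(shifted_sample)
print(q_null, q_shifted)  # the shifted sample yields a much larger statistic
```

Turning the statistic into a test still requires critical values, which in a sketch like this could be approximated by simulating samples from the hypothesized distribution.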

Applications

Applications include:

Gneiting and Raftery [19] apply energy distance to develop a new and very general type of proper scoring rule for probabilistic predictions, the energy score.
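As an illustration of the energy score, the sketch below evaluates an ensemble forecast against an observed vector. It assumes the negatively oriented form with exponent 1, ES(P, y) = E‖X − y‖ − ½E‖X − X'‖, in which lower scores are better, and it replaces the expectations by ensemble averages. The function name and the simulated ensembles are illustrative.

```python
import math
import random

def energy_score(ensemble, observation):
    """Sample energy score of an ensemble forecast (lower is better)."""
    m = len(ensemble)
    to_obs = sum(math.dist(x, observation) for x in ensemble) / m
    spread = sum(math.dist(x, x2) for x in ensemble for x2 in ensemble) / (m * m)
    return to_obs - 0.5 * spread

rng = random.Random(4)  # fixed seed so the run is reproducible
observation = (0.1, -0.2)
good = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]  # centred near the observation
bad = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(50)]   # centred far from it

score_good = energy_score(good, observation)
score_bad = energy_score(bad, observation)
print(score_good, score_bad)
```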

Applications of energy statistics are implemented in the open source energy package [26] for R.

References

  1. Cramér, H. (1928) On the composition of elementary errors, Skandinavisk Aktuarietidskrift, 11, 141–180.
  2. Székely, G. J. (2002) E-Statistics: The energy of statistical samples.
  3. Klebanov, L. B. (2005) N-distances and their Applications, Karolinum Press, Charles University, Prague.
  4. Lyons, R. (2013). "Distance Covariance in Metric Spaces". The Annals of Probability. 41 (5): 3284–3305. arXiv:1106.5758. doi:10.1214/12-aop803. S2CID 73677891.
  5. Sejdinovic, D.; Sriperumbudur, B.; Gretton, A. & Fukumizu, K. (2013). "Equivalence of distance-based and RKHS-based statistics in hypothesis testing". The Annals of Statistics. 41 (5): 2263–2291. arXiv:1207.6076. doi:10.1214/13-aos1140. S2CID 8308769.
  6. Shen, Cencheng; Vogelstein, Joshua T. (2021). "The exact equivalence of distance and kernel methods in hypothesis testing". AStA Advances in Statistical Analysis. 105 (3): 385–403. arXiv:1806.05514. doi:10.1007/s10182-020-00378-1. S2CID 49210956.
  7. G. J. Szekely and M. L. Rizzo (2013). Energy statistics: statistics based on distances. Journal of Statistical Planning and Inference, Volume 143, Issue 8, August 2013, pp. 1249–1272.
  8. Székely, G. J. (2002) E-statistics: The Energy of Statistical Samples, Technical Report BGSU No 02-16.
  9. Székely, G. J.; Rizzo, M. L. (2005). "A new test for multivariate normality". Journal of Multivariate Analysis. 93 (1): 58–80. doi:10.1016/j.jmva.2003.12.002.
  10. G. J. Szekely and M. L. Rizzo (2004). Testing for Equal Distributions in High Dimension, InterStat, Nov. (5).
  11. Székely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method, Journal of Classification, 22(2) 151–183.
  12. Varin, T., Bureau, R., Mueller, C. and Willett, P. (2009). "Clustering files of chemical structures using the Szekely-Rizzo generalization of Ward's method". Journal of Molecular Graphics and Modelling. 28 (2): 187–195. doi:10.1016/j.jmgm.2009.06.006. PMID 19640752.
  13. M. L. Rizzo and G. J. Székely (2010). DISCO Analysis: A Nonparametric Extension of Analysis of Variance, Annals of Applied Statistics Vol. 4, No. 2, 1034–1055. arXiv:1011.2288
  14. Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, Nov. (5).
  15. Ledlie, Jonathan and Pietzuch, Peter and Seltzer, Margo (2006). "Stable and Accurate Network Coordinates". 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06). Washington, DC, USA: IEEE Computer Society. pp. 74–83. CiteSeerX 10.1.1.68.4006. doi:10.1109/ICDCS.2006.79. ISBN 978-0-7695-2540-2. PMID 1154085. S2CID 6770731.
  16. Albert Y. Kim; Caren Marzban; Donald B. Percival; Werner Stuetzle (2009). "Using labeled data to evaluate change detectors in a multivariate streaming environment". Signal Processing. 89 (12): 2529–2536. Bibcode:2009SigPr..89.2529K. CiteSeerX 10.1.1.143.6576. doi:10.1016/j.sigpro.2009.04.011. ISSN 0165-1684. Preprint: TR534.
  17. Székely, G. J., Rizzo M. L. and Bakirov, N. K. (2007). "Measuring and testing independence by correlation of distances", The Annals of Statistics, 35, 2769–2794. arXiv:0803.4101
  18. Székely, G. J. and Rizzo, M. L. (2009). "Brownian distance covariance", The Annals of Applied Statistics, 3/4, 1233–1308. arXiv:1010.0297
  19. T. Gneiting; A. E. Raftery (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation". Journal of the American Statistical Association. 102 (477): 359–378. doi:10.1198/016214506000001437. S2CID 1878582.
  20. Klebanov L. B. A class of Probability Metrics and its Statistical Applications, Statistics in Industry and Technology: Statistical Data Analysis, Yadolah Dodge, Ed. Birkhauser, Basel, Boston, Berlin, 2002, 241–252.
  21. F. Ziel (2021). "The energy distance for ensemble and scenario reduction". Philosophical Transactions of the Royal Society A. 379 (2202): 20190431. arXiv:2005.14670. Bibcode:2021RSPTA.37990431Z. doi:10.1098/rsta.2019.0431. ISSN 1364-503X. PMID 34092100. S2CID 219124032.
  22. Statistics and Data Analysis, 2006, 50, 12, 3619–3628; Rui Hu, Xing Qiu, Galina Glazko, Lev Klebanov, Andrei Yakovlev, Detecting intergene correlation changes in microarray analysis: a new approach to gene selection, BMC Bioinformatics, Vol. 10, 20 (2009), 1–15.
  23. Yuanhui Xiao, Robert Frisina, Alexander Gordon, Lev Klebanov, Andrei Yakovlev, Multivariate Search for Differentially Expressed Gene Combinations, BMC Bioinformatics, 2004, 5:164; Antoni Almudevar, Lev Klebanov, Xing Qiu, Andrei Yakovlev, Utility of correlation measures in analysis of gene expression, NeuroRX, 2006, 3, 3, 384–395; Klebanov Lev, Gordon Alexander, Land Hartmut, Yakovlev Andrei, A permutation test motivated by microarray data analysis.
  24. Viktor Benes, Radka Lechnerova, Lev Klebanov, Margarita Slamova, Peter Slama, Statistical comparison of the geometry of second-phase particles, Materials Characterization, Vol. 60 (2009), 1076–1081.
  25. E. Vaiciukynas, A. Verikas, A. Gelzinis, M. Bacauskiene, and I. Olenina (2015) Exploiting statistical energy test for comparison of multiple groups in morphometric and chemometric data, Chemometrics and Intelligent Laboratory Systems, 146, 10–23.
  26. "energy: R package version 1.6.2". Retrieved 30 January 2015.