The name "integral probability metric" was given by German statistician Alfred Müller;[1] the distances had also previously been called "metrics with a ζ-structure."[2]
Definition
Integral probability metrics (IPMs) are distances on the space of distributions over a set $\mathcal{X}$, defined by a class $\mathcal{F}$ of real-valued functions on $\mathcal{X}$ as
$$D_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} |Pf - Qf| = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \right|;$$
here the notation $Pf$ refers to the expectation of $f$ under the distribution $P$. The absolute value in the definition is unnecessary, and often omitted, for the usual case where for every $f \in \mathcal{F}$ its negation $-f$ is also in $\mathcal{F}$.
The functions $f$ being optimized over are sometimes called "critic" functions;[3] if a particular $f \in \mathcal{F}$ achieves the supremum, it is often termed a "witness function"[4] (it "witnesses" the difference in the distributions). These functions try to have large values for samples from $P$ and small (likely negative) values for samples from $Q$; this can be thought of as a weak form of a classifier, and indeed IPMs can be interpreted as the optimal risk of a particular classifier.[5]:sec. 4
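To make the definition concrete, the following minimal sketch (not part of the original text; the toy distributions and the three-function class are purely illustrative) computes $D_{\mathcal{F}}$ for two small discrete distributions over a finite critic class, and reports which $f$ attains the supremum, i.e. acts as the witness:

```python
import numpy as np

# Two toy distributions on the points {0, 1, 2}.
support = np.array([0.0, 1.0, 2.0])
P = np.array([0.5, 0.3, 0.2])   # P(0), P(1), P(2)
Q = np.array([0.2, 0.3, 0.5])   # Q(0), Q(1), Q(2)

# A tiny, hand-picked critic class F; each f is listed by its values
# on the support points.
F = {
    "identity":            support,
    "indicator of x >= 1": (support >= 1).astype(float),
    "constant 1":          np.ones_like(support),   # constants never separate P and Q
}

# Pf is the expectation of f under P; the IPM is the largest gap |Pf - Qf|.
gaps = {name: abs(P @ f - Q @ f) for name, f in F.items()}
witness = max(gaps, key=gaps.get)
print(gaps)
print("D_F(P, Q) =", gaps[witness], "attained by the witness:", witness)
```

In realistic cases $\mathcal{F}$ is infinite (for example, all 1-Lipschitz functions), and the supremum is evaluated analytically or approximated by optimizing a parametrized critic.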
The choice of $\mathcal{F}$ determines the particular distance; more than one $\mathcal{F}$ can generate the same distance.[1]
For any choice of $\mathcal{F}$, $D_{\mathcal{F}}$ satisfies all the definitions of a metric except that we may have $D_{\mathcal{F}}(P, Q) = 0$ for some $P \neq Q$; this is variously termed a "pseudometric" or a "semimetric" depending on the community. For instance, using the class $\mathcal{F} = \{0\}$ which only contains the zero function, $D_{\mathcal{F}}$ is identically zero. $D_{\mathcal{F}}$ is a metric if and only if $\mathcal{F}$ separates points on the space of probability distributions, i.e. for any $P \neq Q$ there is some $f \in \mathcal{F}$ such that $Pf \neq Qf$;[1] most, but not all, common particular cases satisfy this property.
Examples
All of these examples are metrics except when noted otherwise.
The energy distance, as a special case of the maximum mean discrepancy,[7] is generated by the unit ball in a particular reproducing kernel Hilbert space.
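As an illustrative sketch (not taken from the source), the plug-in estimate of the energy distance can be computed directly from pairwise Euclidean distances; note that conventions differ on whether the name refers to the quantity below or to its square root:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, size=(500, 2))   # samples from P
Y = rng.normal(loc=1.0, size=(400, 2))   # samples from Q

def pairwise_dists(A, B):
    """Euclidean distances between all rows of A and all rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

# Plug-in (V-statistic) estimate of the squared energy distance
#   2 E||X - Y|| - E||X - X'|| - E||Y - Y'||
energy_sq = (2 * pairwise_dists(X, Y).mean()
             - pairwise_dists(X, X).mean()
             - pairwise_dists(Y, Y).mean())
print(energy_sq)
```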
The f-divergences are probably the best-known way to measure dissimilarity of probability distributions. It has been shown[5]:sec. 2 that the only functions which are both IPMs and f-divergences are of the form $c \, D_{TV}(P, Q)$, where $c \in [0, \infty]$ and $D_{TV}$ is the total variation distance between distributions.
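For context, the total variation distance itself can be written in both forms (a standard fact, stated here with one common normalization). As an IPM it is generated by the functions taking values in $[0, 1]$,
$$D_{TV}(P, Q) = \sup_{0 \le f \le 1} |Pf - Qf| = \sup_{A} |P(A) - Q(A)|,$$
and as an f-divergence it corresponds to $f(t) = \tfrac{1}{2}|t - 1|$,
$$D_{TV}(P, Q) = \frac{1}{2} \int \left| \frac{dP}{d\mu} - \frac{dQ}{d\mu} \right| \, d\mu,$$
where $\mu$ is any measure dominating both $P$ and $Q$.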
One major difference between f-divergences and most IPMs is that when $P$ and $Q$ have disjoint support, all f-divergences take on a constant value;[17] by contrast, IPMs where functions in $\mathcal{F}$ are "smooth" can give "partial credit." For instance, consider the sequence $\delta_{1/n}$ of Dirac measures at $1/n$; this sequence converges in distribution to $\delta_0$, and many IPMs satisfy $D(\delta_{1/n}, \delta_0) \to 0$, but no nonzero f-divergence can satisfy this. That is, many IPMs are continuous in weaker topologies than f-divergences. This property is sometimes of substantial importance,[18] although other options also exist, such as considering f-divergences between distributions convolved with continuous noise[18][19] or infimal convolutions between f-divergences and integral probability metrics.[20][21]
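A small numerical sketch (illustrative, not from the source) makes the contrast concrete: the 1-Wasserstein distance, which is the IPM generated by the 1-Lipschitz functions, assigns $\delta_{1/n}$ and $\delta_0$ the distance $1/n$, which vanishes as $n$ grows, whereas any nonzero f-divergence between them is stuck at its disjoint-support value:

```python
from scipy.stats import wasserstein_distance

# W1 between the point mass at 1/n and the point mass at 0 equals 1/n,
# so the IPM gives "partial credit" that improves as the supports approach
# each other; an f-divergence such as KL would be +infinity for every n.
for n in (1, 10, 100, 1000):
    d = wasserstein_distance([1.0 / n], [0.0])  # each list is one Dirac's support
    print(f"n = {n:5d}   W1(delta_1/n, delta_0) = {d:.4f}")
```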
Estimation from samples
Because IPM values between discrete distributions are often sensible, it is often reasonable to estimate $D_{\mathcal{F}}(P, Q)$ using a simple "plug-in" estimator, $D_{\mathcal{F}}(\hat{P}, \hat{Q})$, where $\hat{P}$ and $\hat{Q}$ are empirical measures of sample sets. These empirical distances can be computed exactly for some classes $\mathcal{F}$;[5] estimation quality varies depending on the distance, but can be minimax-optimal in certain settings.[14][22][23]
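For instance (an illustrative sketch, not the article's code), two IPMs whose plug-in estimates can be computed exactly from the empirical measures are the 1-Wasserstein distance (generated by the 1-Lipschitz functions) and the Kolmogorov distance (generated by indicators of half-lines, so its plug-in estimate is the two-sample Kolmogorov–Smirnov statistic):

```python
import numpy as np
from scipy.stats import wasserstein_distance, ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1000)   # samples from P
y = rng.normal(loc=0.5, scale=1.0, size=1200)   # samples from Q

# Both estimators below equal D_F(P_hat, Q_hat) exactly for their class F.
print("plug-in 1-Wasserstein:", wasserstein_distance(x, y))
print("plug-in Kolmogorov   :", ks_2samp(x, y).statistic)
```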
When exact maximization is not available or too expensive, another commonly used scheme is to divide the samples into "training" sets (with empirical measures $\hat{P}_{\mathrm{train}}$ and $\hat{Q}_{\mathrm{train}}$) and "test" sets ($\hat{P}_{\mathrm{test}}$ and $\hat{Q}_{\mathrm{test}}$), find $\hat{f}$ approximately maximizing $|\hat{P}_{\mathrm{train}} f - \hat{Q}_{\mathrm{train}} f|$, then use $|\hat{P}_{\mathrm{test}} \hat{f} - \hat{Q}_{\mathrm{test}} \hat{f}|$ as an estimate.[24][12][25][26] This estimator can possibly be consistent, but has a negative bias.[24]:thm. 2 In fact, no unbiased estimator can exist for any IPM,[24]:thm. 3 although there is, for instance, an unbiased estimator of the squared maximum mean discrepancy.[4]
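By contrast with the generally biased schemes above, the squared maximum mean discrepancy has a simple unbiased estimator built from pairwise kernel evaluations.[4] The sketch below is a hedged illustration: the Gaussian kernel and the fixed bandwidth are assumptions made for the example, not choices prescribed by the article.

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between the distributions
    behind the sample arrays X and Y, using a Gaussian kernel."""
    def gaussian_kernel(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * bandwidth ** 2))

    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = gaussian_kernel(X, X), gaussian_kernel(Y, Y), gaussian_kernel(X, Y)
    # Excluding the diagonal of the within-sample kernel sums is what
    # removes the bias of the naive plug-in (V-statistic) estimator.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))
Y = rng.normal(0.3, 1.0, size=(400, 2))
print(mmd2_unbiased(X, Y))   # may be slightly negative when P and Q are very close
```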
References
Müller, Alfred (June 1997). "Integral Probability Metrics and Their Generating Classes of Functions". Advances in Applied Probability. 29 (2): 429–443. doi:10.2307/1428011. JSTOR 1428011. S2CID 124648603.
Zolotarev, V. M. (January 1984). "Probability Metrics". Theory of Probability & Its Applications. 28 (2): 278–302. doi:10.1137/1128025.
Gretton, Arthur; Borgwardt, Karsten M.; Rasch, Malte J.; Schölkopf, Bernhard; Smola, Alexander (2012). "A Kernel Two-Sample Test" (PDF). Journal of Machine Learning Research. 13: 723–773.
Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. (2009). "On integral probability metrics, φ-divergences and binary classification". arXiv:0901.2698 [cs.IT].
Stanczuk, Jan; Etmann, Christian; Kreusser, Lisa Maria; Schönlieb, Carola-Bibiane (2021). "Wasserstein GANs Work Because They Fail (To Approximate the Wasserstein Distance)". arXiv:2103.01678 [stat.ML].
Mallasto, Anton; Montúfar, Guido; Gerolin, Augusto (2019). "How Well do WGANs Estimate the Wasserstein Metric?". arXiv:1910.03875 [cs.LG].