Integral probability metric

Last updated May 04, 2024

In probability theory, integral probability metrics are types of distance functions between probability distributions, defined by how well a class of functions can distinguish the two distributions. Many important statistical distances are integral probability metrics, including the Wasserstein-1 distance and the total variation distance. In addition to theoretical importance, integral probability metrics are widely used in areas of statistics and machine learning.

Definition

Integral probability metrics (IPMs) are distances on the space of distributions over a set ${\mathcal {X}}$ , defined by a class ${\mathcal {F}}$ of real-valued functions on ${\mathcal {X}}$ as

D_{\mathcal {F}}(P,Q)=\sup _{f\in {\mathcal {F}}}{\big |}\mathbb {E} _{X\sim P}f(X)-\mathbb {E} _{Y\sim Q}f(Y){\big |}=\sup _{f\in {\mathcal {F}}}{\big |}Pf-Qf{\big |};

here the notation $P f$ refers to the expectation of $f$ under the distribution $P$ . The absolute value in the definition is unnecessary, and often omitted, for the usual case where for every $f\in {\mathcal {F}}$ its negation $-f$ is also in ${\mathcal {F}}$ .

The functions $f$ being optimized over are sometimes called "critic" functions;^[3] if a particular $f^{*}\in {\mathcal {F}}$ achieves the supremum, it is often termed a "witness function"^[4] (it "witnesses" the difference in the distributions). These functions try to have large values for samples from $P$ and small (likely negative) values for samples from $Q$ ; this can be thought of as a weaker version of classifers, and indeed IPMs can be interpreted as the optimal risk of a particular classifier.^[5]^{: sec. 4}

The choice of ${\mathcal {F}}$ determines the particular distance; more than one ${\mathcal {F}}$ can generate the same distance.^[1]

For any choice of ${\mathcal {F}}$ , $D_{\mathcal {F}}$ satisfies all the definitions of a metric except that we may have we may have $D_{\mathcal {F}}(P,Q)=0$ for some $P \neq Q$ ; this is variously termed a "pseudometric" or a "semimetric" depending on the community. For instance, using the class ${\mathcal {F}}=\{x\mapsto 0\}$ which only contains the zero function, $D_{\mathcal {F}}(P,Q)$ is identically zero. $D_{\mathcal {F}}$ is a metric if and only if ${\mathcal {F}}$ separates points on the space of probability distributions, i.e. for any $P \neq Q$ there is some $f\in {\mathcal {F}}$ such that $Pf\neq Qf$ ;^[1] most, but not all, common particular cases satisfy this property.

Examples

All of these examples are metrics except when noted otherwise.

The Wasserstein-1 distance (also called earth mover's distance), via its dual representation, has ${\mathcal {F}}$ the set of 1-Lipschitz functions.
The related Dudley metric is generated by the set of bounded 1-Lipschitz functions.
The total variation distance can be generated by ${\mathcal {F}}=\{f:{\mathcal {X}}\to \{0,1\}\}$ , so that ${\mathcal {F}}$ is a set of indicator functions for any event, or by the larger class ${\mathcal {F}}=\{f:{\mathcal {X}}\to [0,1]\}$ .
The closely related Radon metric is generated by continuous functions bounded in $[-1, 1]$ .
The Kolmogorov metric used in the Kolmogorov-Smirnov test has a function class of indicator functions, ${\mathcal {F}}=\{1_{(-\infty ,t]}:t\in \mathbb {R} \}$ .
The kernel maximum mean discrepancy (MMD) has ${\mathcal {F}}$ the unit ball in a reproducing kernel Hilbert space. This distance is particularly easy to estimate from samples, requiring no optimization; it is a proper metric exactly when the underlying kernel is characteristic.^[6]
The energy distance, as a special case of the maximum mean discrepancy,^[7] is generated by the unit ball in a particular reproducing kernel Hilbert space.
Defining ${\mathcal {F}}$ by functions with a bounded Sobolev norm gives a useful distance for generative modeling, among other applications.^[8]
Functions with bounded Besov norm generalize many other forms of IPM and are amenable to theoretical analysis.^[9]^[10]
Many variants of generative adversarial networks and classifer-based two-sample tests ^[11]^[12] use a "neural net distance"^[13]^[14] where ${\mathcal {F}}$ is a class of neural networks; these are not metrics for typical fixed-size networks, but could be for other classifiers. For Wasserstein GANs in particular, it has been argued that analysis in terms of this distance and not the Wasserstein they approximate is very important to the behavior of these models.^[13]^[15]^[16]

Relationship to $f$ -divergences

The $f$ -divergences are probably the best-known way to measure dissimilarity of probability distributions. It has been shown^[5]^{: sec. 2} that the only functions which are both IPMs and $f$ -divergences are of the form $c\,\operatorname {TV} (P,Q)$ , where $c\in [0,\infty ]$ and $\operatorname {TV}$ is the total variation distance between distributions.

One major difference between $f$ -divergences and most IPMs is that when $P$ and $Q$ have disjoint support, all $f$ -divergences take on a constant value;^[17] by contrast, IPMs where functions in ${\mathcal {F}}$ are "smooth" can give "partial credit." For instance, consider the sequence $\delta _{1/n}$ of Dirac measures at $1/ n$ ; this sequence converges in distribution to $\delta _{0}$ , and many IPMs satisfy $D_{\mathcal {F}}(\delta _{1/n},\delta _{0})\to 0$ , but no nonzero $f$ -divergence can satisfy this. That is, many IPMs are continuous in weaker topologies than $f$ -divergences. This property is sometimes of substantial importance,^[18] although other options also exist, such as considering $f$ -divergences between distributions convolved with continuous noise.^[18]^[19]

Estimation from samples

Because IPM values between discrete distributions are often sensible, it is often reasonable to estimate $D_{\mathcal {F}}(P,Q)$ using a simple "plug-in" estimator: $D_{\mathcal {F}}({\hat {P}},{\hat {Q}})$ where ${\hat {P}}$ and ${\hat {Q}}$ are empirical measures of sample sets. These empirical distances can be computed exactly for some classes ${\mathcal {F}}$ ;^[5] estimation quality varies depending on the distance, but can be minimax-optimal in certain settings.^[14]^[20]^[21]

When exact maximization is not available or too expensive, another commonly used scheme is to divide the samples into "training" sets (with empirical measures ${\hat {P}}_{\mathit {train}}$ and ${\hat {Q}}_{\mathit {train}}$ ) and "test" sets ( ${\hat {P}}_{\mathit {test}}$ and ${\hat {Q}}_{\mathit {test}}$ ), find ${\hat {f}}$ approximately maximizing ${\big |}{\hat {P}}_{\mathit {train}}f-{\hat {Q}}_{\mathit {train}}f{\big |}$ , then use ${\big |}{\hat {P}}_{\mathit {test}}{\hat {f}}-{\hat {Q}}_{\mathit {test}}{\hat {f}}{\big |}$ as an estimate.^[22]^[12]^[23]^[24] This estimator can possibly be consistent, but has a negative bias^[22]^{: thm. 2}. In fact, no unbiased estimator can exist for any IPM^[22]^{: thm. 3}, although there is for instance an unbiased estimator of the squared maximum mean discrepancy.^[4]

Related Research Articles

Nonlinear dimensionality reduction, also known as manifold learning, is any of various related techniques that aim to project high-dimensional data onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.

<span class="mw-page-title-main">Mutual information</span> Measure of dependence between two variables

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In probability theory, the total variation distance is a distance measure for probability distributions. It is an example of a statistical distance metric, and is sometimes called the statistical distance, statistical difference or variational distance.

In statistics, the Bhattacharyya distance is a quantity which represents a notion of similarity between two probability distributions. It is closely related to the Bhattacharyya coefficient, which is a measure of the amount of overlap between two statistical samples or populations.

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted $, is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q . A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model instead of P when the actual distribution is P . While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.$

Bernhard Schölkopf is a German computer scientist known for his work in machine learning, especially on kernel methods and causality. He is a director at the Max Planck Institute for Intelligent Systems in Tübingen, Germany, where he heads the Department of Empirical Inference. He is also an affiliated professor at ETH Zürich, honorary professor at the University of Tübingen and the Technical University Berlin, and chairman of the European Laboratory for Learning and Intelligent Systems (ELLIS).

In information theory, the cross-entropy between two probability distributions $and, over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution, rather than the true distribution .$

In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points computed using inner products. The feature map in kernel machines is infinite dimensional but only requires a finite dimensional matrix from user-input according to the Representer theorem. Kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing.

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.

In mathematics, the Wasserstein distance or Kantorovich–Rubinstein metric is a distance function defined between probability distributions on a given metric space $. It is named after Leonid Vaseršteĭn.$

Energy distance is a statistical distance between probability distributions. If X and Y are independent random vectors in R^d with cumulative distribution functions (cdf) F and G respectively, then the energy distance between the distributions F and G is defined to be the square root of

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space $on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in, discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song, Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.$

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative AI. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.

In variational Bayesian methods, the evidence lower bound is a useful lower bound on the log-likelihood of some observed data.

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian methods.

The Fréchet inception distance (FID) is a metric used to assess the quality of images created by a generative model, like a generative adversarial network (GAN). Unlike the earlier inception score (IS), which evaluates only the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images. The FID metric does not completely replace the IS metric. Classifiers that achieve the best (lowest) FID score tend to have greater sample variety while classifiers achieving the best (highest) IS score tend to have better quality within individual images.

A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, but has since been used in diverse settings in statistics, machine learning and computer science.

The Wasserstein Generative Adversarial Network (WGAN) is a variant of generative adversarial network (GAN) proposed in 2017 that aims to "improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches".

Distributional data analysis is a branch of nonparametric statistics that is related to functional data analysis. It is concerned with random objects that are probability distributions, i.e., the statistical analysis of samples of random distributions where each atom of a sample is a distribution. One of the main challenges in distributional data analysis is that the space of probability distributions is, while a convex space, is not a vector space.

References

1 2 3 Müller, Alfred (June 1997). "Integral Probability Metrics and Their Generating Classes of Functions". Advances in Applied Probability. 29 (2): 429–443. doi:10.2307/1428011. JSTOR 1428011. S2CID 124648603.
↑ Zolotarev, V. M. (January 1984). "Probability Metrics". Theory of Probability & Its Applications. 28 (2): 278–302. doi:10.1137/1128025.
↑ Arjovsky, Martin; Chintala, Soumith; Bottou, Léon (2017-07-17). "Wasserstein Generative Adversarial Networks". International Conference on Machine Learning. PMLR: 214–223.
1 2 Gretton, Arthur; Borgwardt, Karsten M.; Rasche, Malte J.; Schölkopf, Bernhard; Smola, Alexander (2012). "A Kernel Two-Sample Test" (PDF). Journal of Machine Learning Research. 13: 723–773.
1 2 3 Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. (2009). "On integral probability metrics, φ-divergences and binary classification". arXiv: 0901.2698 [cs.IT].
↑ Fukumizu, Kenji; Gretton, Arthur; Sun, Xiaohui; Schölkopf, Bernhard (2007). "Kernel Measures of Conditional Dependence". Advances in Neural Information Processing Systems .
↑ Sejdinovic, Dino; Sriperumbudur, Bharath; Gretton, Arthur; Fukumizu, Kenji (2013). "Equivalence of distance-based and RKHS-based statistics in hypothesis testing". The Annals of Statistics. 41 (5): 2263–2291. arXiv: 1207.6076 . doi:10.1214/13-aos1140. S2CID 8308769.
↑ Mroueh, Youssef; Li, Chun-Liang; Sercu, Tom; Raj, Anant; Cheng, Yu (2018). "Sobolev GAN". International Conference on Learning Representations . arXiv: 1711.04894 .
↑ Uppal, Ananya; Singh, Shashank; Póczos, Barnabás (2019). "Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses". Advances in Neural Information Processing Systems . arXiv: 1902.03511 .
↑ Uppal, Ananya; Singh, Shashank; Póczos, Barnabás (2020). "Robust Density Estimation under Besov IPM Losses". Advances in Neural Information Processing Systems . arXiv: 2004.08597 .
↑ Kim, Ilmun; Ramdas, Aaditya; Singh, Aarti; Wasserman, Larry (February 2021). "Classification accuracy as a proxy for two-sample testing". The Annals of Statistics . 49 (1). arXiv: 1703.00573 . doi:10.1214/20-AOS1962. S2CID 17668083.
1 2 Lopez-Paz, David; Oquab, Maxime (2017). "Revisiting Classifier Two-Sample Tests". International Conference on Learning Representations . arXiv: 1610.06545 .
1 2 Arora, Sanjeev; Ge, Rong; Liang, Yingyu; Ma, Tengyu; Zhang, Yi (2017). "Generalization and Equilibrium in Generative Adversarial Nets (GANs)". International Conference on Machine Learning . arXiv: 1703.00573 .
1 2 Ji, Kaiyi; Liang, Yingbin (2018). "Minimax Estimation of Neural Net Distance". Advances in Neural Information Processing Systems . arXiv: 1811.01054 .
↑ Stanczuk, Jan; Etmann, Christian; Lisa Maria Kreusser; Schönlieb, Carola-Bibiane (2021). "Wasserstein GANs Work Because They Fail (To Approximate the Wasserstein Distance)". arXiv: 2103.01678 [stat.ML].
↑ Mallasto, Anton; Montúfar, Guido; Gerolin, Augusto (2019). "How Well do WGANs Estimate the Wasserstein Metric?". arXiv: 1910.03875 [cs.LG].
↑ Sutherland, Danica J. "Computing Jensen-Shannon Divergence between discrete and continuous distribution". Stack Exchange Network. Retrieved 18 July 2023.
1 2 Arjovsky, Martin; Bettou, Léon (2017). "Towards Principled Methods for Training Generative Adversarial Networks". International Conference on Learning Representations . arXiv: 1701.04862 .
↑ Sønderby, Casper Kaae; Caballero, Jose; Theis, Lucas; Shi, Wenzhe; Huszár, Ferenc (2017). "Amortised MAP Inference for Image Super-resolution". International Conference on Learning Representations . Appendix C. arXiv: 1610.04490 .
↑ Tolstikhin, Ilya O.; Sriperumbudur, Bharath K.; Schölkopf, Bernhard (2016). "Minimax Estimation of Maximum Mean Discrepancy with Radial Kernels". Advances in Neural Information Processing Systems .
↑ Singh, Shashank; Póczos, Barnabás (2018). "Minimax Distribution Estimation in Wasserstein Distance". arXiv: 1802.08855 [math.ST].
1 2 3 Bińkowski, Mikołaj; Sutherland, Danica J.; Arbel, Michael; Gretton, Arthur (2018). "Demystifying MMD GANs". International Conference on Learning Representations . arXiv: 1801.01401 .
↑ Liu, Feng; Xu, Wenkai; Lu, Jie; Zhang, Guangquan; Gretton, Arthur; Sutherland, Danica J. (2020). "Learning Deep Kernels for Non-Parametric Two-Sample Tests". International Conference on Machine Learning . arXiv: 2002.09116 .
↑ Kübler, Jonas M.; Jitkrittum, Wittawat; Schölkopf, Bernhard; Muandent, Krikamol (2021). "A Witness Two-Sample Test". International Conference on Artificial Intelligence and Statistics. arXiv: 2102.05573 .

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[mueller-1] 1 2 3 Müller, Alfred (June 1997). "Integral Probability Metrics and Their Generating Classes of Functions". Advances in Applied Probability. 29 (2): 429–443. doi:10.2307/1428011. JSTOR 1428011. S2CID 124648603.

[2] Zolotarev, V. M. (January 1984). "Probability Metrics". Theory of Probability & Its Applications. 28 (2): 278–302. doi:10.1137/1128025.

[wgan-3] Arjovsky, Martin; Chintala, Soumith; Bottou, Léon (2017-07-17). "Wasserstein Generative Adversarial Networks". International Conference on Machine Learning. PMLR: 214–223.

[mmd-4] 1 2 Gretton, Arthur; Borgwardt, Karsten M.; Rasche, Malte J.; Schölkopf, Bernhard; Smola, Alexander (2012). "A Kernel Two-Sample Test" (PDF). Journal of Machine Learning Research. 13: 723–773.

[on-ipms-5] 1 2 3 Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. (2009). "On integral probability metrics, φ-divergences and binary classification". arXiv: 0901.2698 [cs.IT].

[6] Fukumizu, Kenji; Gretton, Arthur; Sun, Xiaohui; Schölkopf, Bernhard (2007). "Kernel Measures of Conditional Dependence". Advances in Neural Information Processing Systems .

[7] Sejdinovic, Dino; Sriperumbudur, Bharath; Gretton, Arthur; Fukumizu, Kenji (2013). "Equivalence of distance-based and RKHS-based statistics in hypothesis testing". The Annals of Statistics. 41 (5): 2263–2291. arXiv: 1207.6076 . doi:10.1214/13-aos1140. S2CID 8308769.

[8] Mroueh, Youssef; Li, Chun-Liang; Sercu, Tom; Raj, Anant; Cheng, Yu (2018). "Sobolev GAN". International Conference on Learning Representations . arXiv: 1711.04894 .

[9] Uppal, Ananya; Singh, Shashank; Póczos, Barnabás (2019). "Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses". Advances in Neural Information Processing Systems . arXiv: 1902.03511 .

[10] Uppal, Ananya; Singh, Shashank; Póczos, Barnabás (2020). "Robust Density Estimation under Besov IPM Losses". Advances in Neural Information Processing Systems . arXiv: 2004.08597 .

[11] Kim, Ilmun; Ramdas, Aaditya; Singh, Aarti; Wasserman, Larry (February 2021). "Classification accuracy as a proxy for two-sample testing". The Annals of Statistics . 49 (1). arXiv: 1703.00573 . doi:10.1214/20-AOS1962. S2CID 17668083.

[c2st-12] 1 2 Lopez-Paz, David; Oquab, Maxime (2017). "Revisiting Classifier Two-Sample Tests". International Conference on Learning Representations . arXiv: 1610.06545 .

[gen-and-eq-13] 1 2 Arora, Sanjeev; Ge, Rong; Liang, Yingyu; Ma, Tengyu; Zhang, Yi (2017). "Generalization and Equilibrium in Generative Adversarial Nets (GANs)". International Conference on Machine Learning . arXiv: 1703.00573 .

[minimax-nns-14] 1 2 Ji, Kaiyi; Liang, Yingbin (2018). "Minimax Estimation of Neural Net Distance". Advances in Neural Information Processing Systems . arXiv: 1811.01054 .

[15] Stanczuk, Jan; Etmann, Christian; Lisa Maria Kreusser; Schönlieb, Carola-Bibiane (2021). "Wasserstein GANs Work Because They Fail (To Approximate the Wasserstein Distance)". arXiv: 2103.01678 [stat.ML].

[16] Mallasto, Anton; Montúfar, Guido; Gerolin, Augusto (2019). "How Well do WGANs Estimate the Wasserstein Metric?". arXiv: 1910.03875 [cs.LG].

[17] Sutherland, Danica J. "Computing Jensen-Shannon Divergence between discrete and continuous distribution". Stack Exchange Network. Retrieved 18 July 2023.

[towards-principled-gans-18] 1 2 Arjovsky, Martin; Bettou, Léon (2017). "Towards Principled Methods for Training Generative Adversarial Networks". International Conference on Learning Representations . arXiv: 1701.04862 .

[19] Sønderby, Casper Kaae; Caballero, Jose; Theis, Lucas; Shi, Wenzhe; Huszár, Ferenc (2017). "Amortised MAP Inference for Image Super-resolution". International Conference on Learning Representations . Appendix C. arXiv: 1610.04490 .

[20] Tolstikhin, Ilya O.; Sriperumbudur, Bharath K.; Schölkopf, Bernhard (2016). "Minimax Estimation of Maximum Mean Discrepancy with Radial Kernels". Advances in Neural Information Processing Systems .

[21] Singh, Shashank; Póczos, Barnabás (2018). "Minimax Distribution Estimation in Wasserstein Distance". arXiv: 1802.08855 [math.ST].

[demystifying-22] 1 2 3 Bińkowski, Mikołaj; Sutherland, Danica J.; Arbel, Michael; Gretton, Arthur (2018). "Demystifying MMD GANs". International Conference on Learning Representations . arXiv: 1801.01401 .

[23] Liu, Feng; Xu, Wenkai; Lu, Jie; Zhang, Guangquan; Gretton, Arthur; Sutherland, Danica J. (2020). "Learning Deep Kernels for Non-Parametric Two-Sample Tests". International Conference on Machine Learning . arXiv: 2002.09116 .

[24] Kübler, Jonas M.; Jitkrittum, Wittawat; Schölkopf, Bernhard; Muandent, Krikamol (2021). "A Witness Two-Sample Test". International Conference on Artificial Intelligence and Statistics. arXiv: 2102.05573 .

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]