Wasserstein GAN


The Wasserstein Generative Adversarial Network (WGAN) is a variant of the generative adversarial network (GAN), proposed in 2017, that aims to "improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches". [1] [2]


Compared with the original GAN discriminator, the Wasserstein GAN discriminator provides a better learning signal to the generator. This allows training to remain stable when the generator is learning distributions in very high-dimensional spaces.

Motivation

The GAN game

The original GAN method is based on the GAN game, a zero-sum game with two players: generator and discriminator. The game is defined over a probability space $(\Omega, \mathcal{B}, \mu_{\text{ref}})$. The generator's strategy set is the set of all probability measures $\mu_G$ on $(\Omega, \mathcal{B})$, and the discriminator's strategy set is the set of measurable functions $D : \Omega \to [0, 1]$.

The objective of the game is
$$L(\mu_G, D) := \mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] + \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))].$$

The generator aims to minimize it, and the discriminator aims to maximize it.

A basic theorem of the GAN game states that

Theorem (the optimal discriminator computes the Jensen–Shannon divergence). For any fixed generator strategy $\mu_G$, let the optimal reply be $D^* = \arg\max_D L(\mu_G, D)$; then
$$D^* = \frac{d\mu_{\text{ref}}}{d(\mu_{\text{ref}} + \mu_G)}, \qquad L(\mu_G, D^*) = 2\, D_{JS}(\mu_{\text{ref}} \,\|\, \mu_G) - 2\ln 2,$$

where the derivative is the Radon–Nikodym derivative, and $D_{JS}$ is the Jensen–Shannon divergence.

Repeat the GAN game many times, each time with the generator moving first, and the discriminator moving second. Each time the generator changes, the discriminator must adapt by approaching the ideal
$$D^*(x) = \frac{d\mu_{\text{ref}}}{d(\mu_{\text{ref}} + \mu_G)}(x).$$

Since we are really interested in $D_{JS}(\mu_{\text{ref}} \,\|\, \mu_G)$, the discriminator function is by itself rather uninteresting. It merely keeps track of the likelihood ratio between the generator distribution and the reference distribution. At equilibrium, the discriminator is just outputting $\frac{1}{2}$ constantly, having given up trying to perceive any difference. [note 1]

Concretely, in the GAN game, let us fix a generator $\mu_G$, and improve the discriminator step-by-step, with $D_t$ being the discriminator at step $t$. Then we (ideally) have
$$L(\mu_G, D_1) \le L(\mu_G, D_2) \le \cdots \to \max_D L(\mu_G, D) = 2\, D_{JS}(\mu_{\text{ref}} \,\|\, \mu_G) - 2\ln 2,$$

so we see that the discriminator is actually lower-bounding $2\, D_{JS}(\mu_{\text{ref}} \,\|\, \mu_G) - 2\ln 2$, and thus (up to an affine shift) the Jensen–Shannon divergence itself.

Wasserstein distance

Thus, we see that the point of the discriminator is mainly as a critic to provide feedback for the generator about "how far it is from perfection", where "far" is defined as the Jensen–Shannon divergence.

Naturally, this brings up the possibility of using a different criterion of farness. There are many possible divergences to choose from, such as the f-divergence family, which would give the f-GAN. [3]

The Wasserstein GAN is obtained by using the Wasserstein metric, which satisfies a "dual representation theorem" that renders it highly efficient to compute:

Theorem (Kantorovich–Rubinstein duality). When the probability space $\Omega$ is a metric space, then for any fixed $K > 0$,
$$K \cdot W_1(\mu, \nu) = \sup_{\|f\|_L \le K} \; \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)],$$

where $\|\cdot\|_L$ is the Lipschitz norm.

A proof can be found in the main page on Wasserstein metric.
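To make the duality concrete, here is a small numerical check (my illustration, not from the article) in one dimension, where $W_1$ between equal-size empirical samples reduces to the mean absolute difference of the sorted samples, and any 1-Lipschitz test function gives a lower bound on $W_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)  # samples from mu
y = rng.normal(2.0, 1.0, size=10_000)  # samples from nu (mu shifted by 2)

# For equal-size 1-D empirical distributions, W1 equals the mean
# absolute difference of the sorted samples.
w1 = np.mean(np.abs(np.sort(x) - np.sort(y)))

# Kantorovich-Rubinstein with K = 1: any 1-Lipschitz f satisfies
# E_mu[f] - E_nu[f] <= W1.  f(t) = -t is 1-Lipschitz and (near-)optimal
# here, since nu is just a translate of mu.
dual_value = np.mean(-x) - np.mean(-y)

print(w1, dual_value)  # both close to the true distance, 2.0
```

The dual value never exceeds the primal $W_1$, and for a pure translation the two coincide up to sampling noise.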

Definition

By the Kantorovich–Rubinstein duality, the definition of the Wasserstein GAN is clear:

A Wasserstein GAN game is defined by a probability space $(\Omega, \mathcal{B}, \mu_{\text{ref}})$, where $\Omega$ is a metric space, and a constant $K > 0$.

There are two players: the generator and the discriminator (also called the "critic").

The generator's strategy set is the set of all probability measures $\mu_G$ on $(\Omega, \mathcal{B})$.

The discriminator's strategy set is the set of measurable functions of type $D : \Omega \to \mathbb{R}$ with bounded Lipschitz norm: $\|D\|_L \le K$.

The Wasserstein GAN game is a zero-sum game, with objective function
$$L_{WGAN}(\mu_G, D) := \mathbb{E}_{x \sim \mu_G}[D(x)] - \mathbb{E}_{x \sim \mu_{\text{ref}}}[D(x)].$$

The generator goes first, and the discriminator goes second. The generator aims to minimize the objective, and the discriminator aims to maximize the objective:
$$\min_{\mu_G} \; \max_{\|D\|_L \le K} \; L_{WGAN}(\mu_G, D).$$

By the Kantorovich–Rubinstein duality, for any generator strategy $\mu_G$, the optimal reply by the discriminator is $D^*$, such that
$$L_{WGAN}(\mu_G, D^*) = K \cdot W_1(\mu_G, \mu_{\text{ref}}).$$

Consequently, if the discriminator is good, the generator would be constantly pushed to minimize $W_1(\mu_G, \mu_{\text{ref}})$, and the optimal strategy for the generator is just $\mu_G = \mu_{\text{ref}}$, as it should be.

Comparison with GAN

In the Wasserstein GAN game, the discriminator provides a better gradient than in the GAN game.

Consider for example a game on the real line where both $\mu_{\text{ref}}$ and $\mu_G$ are Gaussian. Then the optimal Wasserstein critic $D_{WGAN}$ and the optimal GAN discriminator $D$ are plotted as below:

[Figure: The optimal Wasserstein critic $D_{WGAN}$ and the optimal GAN discriminator $D$ for a fixed reference distribution $\mu_{\text{ref}}$ and generator distribution $\mu_G$. Both the Wasserstein critic $D_{WGAN}$ and the GAN discriminator $D$ are scaled down to fit the plot.]

For a fixed discriminator, the generator needs to minimize the following objectives:

For GAN, $\mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))]$.
For Wasserstein GAN, $\mathbb{E}_{x \sim \mu_G}[D(x)]$.

Let $\mu_G$ be parametrized by $\theta$; then we can perform stochastic gradient descent by using two unbiased estimators of the gradient:
$$\nabla_\theta \, \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))] = \mathbb{E}_{z \sim \mathcal{N}}\!\left[\nabla_\theta \ln(1 - D(G_\theta(z)))\right],$$
$$\nabla_\theta \, \mathbb{E}_{x \sim \mu_G}[D(x)] = \mathbb{E}_{z \sim \mathcal{N}}\!\left[\nabla_\theta D(G_\theta(z))\right],$$
where we used the reparametrization trick $x = G_\theta(z)$. [note 2]
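As a toy sketch of the reparametrized estimator (my illustration; the location-shift generator $G_\theta(z) = \theta + z$ and the tanh critic are hypothetical), the WGAN generator gradient $\nabla_\theta\, \mathbb{E}_z[D(G_\theta(z))] = \mathbb{E}_z[D'(\theta + z)]$ can be estimated by a sample average:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy critic and its derivative, assumed known analytically here.
D = lambda x: np.tanh(x)
dD = lambda x: 1.0 - np.tanh(x) ** 2

def wgan_generator_grad(theta, n=100_000):
    """Unbiased estimate of d/dtheta E_z[D(theta + z)], z ~ N(0, 1),
    via the reparametrization x = G_theta(z) = theta + z."""
    z = rng.normal(size=n)
    # Chain rule: d/dtheta D(theta + z) = D'(theta + z) * 1.
    return np.mean(dD(theta + z))

g = wgan_generator_grad(0.0)
```

In practice $D'$ would be obtained by automatic differentiation rather than written by hand.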

[Figure: The same plot, but with the GAN discriminator $D$ replaced by $\ln(1 - D)$ (and scaled down to fit the plot).]

As shown, the generator in GAN is motivated to let its $\mu_G$ "slide down the peak" of $\ln(1 - D)$, and similarly for the generator in Wasserstein GAN with $D_{WGAN}$.

For Wasserstein GAN, $D_{WGAN}$ has gradient 1 almost everywhere, while for GAN, $\ln(1 - D)$ has flat gradient in the middle, and steep gradient elsewhere. As a result, the variance for the estimator in GAN is usually much larger than that in Wasserstein GAN. See also Figure 3 of [1].

The problem with $\ln(1 - D)$ is much more severe in actual machine learning situations. Consider training a GAN to generate ImageNet, a collection of photos of size 256-by-256. The space of all such photos is $\mathbb{R}^{3 \times 256 \times 256}$, and the distribution of ImageNet pictures, $\mu_{\text{ref}}$, concentrates on a manifold of much lower dimension in it. Consequently, any generator strategy $\mu_G$ would almost surely be entirely disjoint from $\mu_{\text{ref}}$, making $D_{JS}(\mu_{\text{ref}} \,\|\, \mu_G) = \ln 2$. Thus, a good discriminator can almost perfectly distinguish $\mu_{\text{ref}}$ from $\mu_G$, as well as from any $\mu_G'$ close to $\mu_G$. Thus, the gradient $\nabla_\theta L(\mu_G, D) \approx 0$, creating no learning signal for the generator.
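The vanishing gradient of $\ln(1 - D)$ under a confident discriminator can be seen in a one-dimensional toy computation (my illustration, not from the article):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

k, b = 20.0, 0.0   # a confident GAN discriminator D(x) = sigmoid(k * (x - b))
x_fake = -3.0      # a generated sample far from the real data

# d/dx ln(1 - D(x)) = -k * D(x), which vanishes wherever D(x) ~ 0.
gan_grad = -k * sigmoid(k * (x_fake - b))

# A 1-Lipschitz WGAN critic such as f(x) = x has |gradient| = 1 everywhere.
wgan_grad = 1.0

print(abs(gan_grad), abs(wgan_grad))  # ~1.8e-25 vs 1.0
```

The generator receives essentially zero gradient from the confident GAN discriminator, but a constant-magnitude gradient from the Wasserstein critic.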

Detailed theorems can be found in [4].

Training Wasserstein GANs

Training the generator in Wasserstein GAN is just gradient descent, the same as in GAN (or most deep learning methods), but training the discriminator is different, as the discriminator is now restricted to have bounded Lipschitz norm. There are several methods for this.

Upper-bounding the Lipschitz norm

Let the discriminator function $D$ be implemented by a multilayer perceptron:
$$D = D_n \circ D_{n-1} \circ \cdots \circ D_1,$$
where $D_i(x) = h(W_i x)$, and $h$ is a fixed activation function with $\sup_x |h'(x)| \le 1$. For example, the hyperbolic tangent function $h = \tanh$ satisfies the requirement. Then, for any $x$, let $x_i = (D_i \circ \cdots \circ D_1)(x)$; we have by the chain rule:
$$\nabla_x D(x) = \operatorname{diag}(h'(W_n x_{n-1})) \, W_n \cdots \operatorname{diag}(h'(W_1 x)) \, W_1.$$

Thus, the Lipschitz norm of $D$ is upper-bounded by
$$\|D\|_L \le \sup_x \left\| \operatorname{diag}(h'(W_n x_{n-1})) \, W_n \cdots \operatorname{diag}(h'(W_1 x)) \, W_1 \right\|_s,$$

where $\|W\|_s$ is the operator norm of the matrix, that is, its largest singular value (for symmetric matrices this coincides with the spectral radius, but not for general matrices or operators). Since $\sup_x |h'(x)| \le 1$, we have $\|\operatorname{diag}(h'(W_i x_{i-1}))\|_s \le 1$, and consequently the upper bound
$$\|D\|_L \le \prod_{i=1}^{n} \|W_i\|_s.$$

Thus, if we can upper-bound the operator norm of each matrix $W_i$, we can upper-bound the Lipschitz norm of $D$.
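This bound can be checked numerically; below is a minimal sketch (hypothetical random weights, not from the article) comparing empirical difference quotients of a small tanh MLP against the product of the largest singular values of its weight matrices:

```python
import numpy as np

rng = np.random.default_rng(2)

# A small tanh MLP: D = D3 . D2 . D1 with D_i(x) = tanh(W_i x).
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(1, 8))]

def D(x):
    for W in Ws:
        x = np.tanh(W @ x)
    return x.item()

# Upper bound: ||D||_L <= prod_i ||W_i||_s (product of largest singular values).
bound = np.prod([np.linalg.svd(W, compute_uv=False)[0] for W in Ws])

# Empirical Lipschitz ratios never exceed the bound.
for _ in range(100):
    x1, x2 = rng.normal(size=4), rng.normal(size=4)
    ratio = abs(D(x1) - D(x2)) / np.linalg.norm(x1 - x2)
    assert ratio <= bound
```

The bound is usually loose, but it holds for every pair of inputs.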

Weight clipping

Since for any $m \times n$ matrix $W$, letting $c = \max_{i,j} |W_{ij}|$, we have
$$\|W\|_s \le \|W\|_F = \sqrt{\textstyle\sum_{i,j} W_{ij}^2} \le \sqrt{mn}\, c,$$

by clipping all entries of $W$ to within some interval $[-c, c]$ we can bound $\|W\|_s$.

This is the weight clipping method, proposed by the original paper. [1]
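A minimal sketch of weight clipping (the clip value $c = 0.01$ follows the default in [1]; the matrix here is a hypothetical example):

```python
import numpy as np

def clip_weights(Ws, c=0.01):
    """After each discriminator update, clamp every entry to [-c, c]."""
    return [np.clip(W, -c, c) for W in Ws]

rng = np.random.default_rng(3)
W = rng.normal(size=(64, 32))
(Wc,) = clip_weights([W])

# Once clipped, ||W||_s <= ||W||_F <= sqrt(m * n) * c.
m, n = Wc.shape
spectral = np.linalg.svd(Wc, compute_uv=False)[0]
print(spectral, np.sqrt(m * n) * 0.01)
```

The clipped spectral norm stays below the $\sqrt{mn}\, c$ bound, at the cost of distorting all weights whose magnitude exceeds $c$.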

Spectral normalization

The largest singular value $\|W\|_s$ of a matrix $W$ can be efficiently computed by the following power iteration:

INPUT matrix $W$ and initial guess $x$

Iterate $x \mapsto \frac{W^T W x}{\|W^T W x\|}$ to convergence $x^*$. This is the eigenvector of $W^T W$ with the largest eigenvalue, $\|W\|_s^2$.

RETURN $x^*, \|W x^*\|$

By reassigning $W_i \leftarrow \frac{W_i}{\|W_i\|_s}$ after each update of the discriminator, we can upper-bound $\|W_i\|_s \le 1$, and thus upper-bound $\|D\|_L$.

The algorithm can be further accelerated by memoization: at step $t$, store $x_i^*(t)$. Then at step $t+1$, use $x_i^*(t)$ as the initial guess for the algorithm. Since $W_i(t+1)$ is very close to $W_i(t)$, $x_i^*(t)$ is close to $x_i^*(t+1)$, so this allows rapid convergence.

This is the spectral normalization method. [5]
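The power iteration above can be sketched as follows (my illustration; in real training the returned $x^*$ would be memoized across updates as described, so far fewer iterations are needed):

```python
import numpy as np

def power_iteration(W, x, n_iters=500):
    """Estimate the largest singular value ||W||_s of W by iterating
    x <- W^T W x / ||W^T W x||; returns (x*, ||W x*||)."""
    for _ in range(n_iters):
        x = W.T @ (W @ x)
        x = x / np.linalg.norm(x)
    return x, np.linalg.norm(W @ x)

rng = np.random.default_rng(4)
W = rng.normal(size=(16, 8))
x_star, sigma = power_iteration(W, rng.normal(size=8))

# Spectral normalization: rescale W so that ||W / sigma||_s ~ 1.
# During training, x_star is reused as the next initial guess, so one
# or two iterations per update suffice [5].
W_sn = W / sigma
```

After normalization, the largest singular value of `W_sn` is approximately 1, so each layer contributes a factor of at most about 1 to the Lipschitz bound.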

Gradient penalty

Instead of strictly bounding $\|D\|_L$, we can simply add a "gradient penalty" term to the discriminator loss, of form
$$\mathbb{E}_{x \sim \hat\mu}\!\left[\left(\|\nabla D(x)\|_2 - a\right)^2\right],$$

where $\hat\mu$ is a fixed distribution used to estimate how much the discriminator has violated the Lipschitz norm requirement, and $a > 0$ is the target gradient norm.

The discriminator, in attempting to minimize the new loss function, would naturally bring $\|\nabla D(x)\|_2$ close to $a$ everywhere, thus making $\|D\|_L \approx a$.

This is the gradient penalty method. [6]
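As a toy sketch of the penalty term (sampling $\hat\mu$ along segments between real and fake samples follows [6], but the linear critic and specific numbers are my illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy critic D(x) = w . x, so grad_x D(x) = w everywhere and ||D||_L = ||w|| = 5.
w = np.array([3.0, 4.0])
grad_D = lambda x: w

a = 1.0  # target gradient norm; the gradient penalty paper uses a = 1 [6]

# hat-mu: points on segments between "real" and "fake" samples, as in [6].
real = rng.normal(size=(256, 2))
fake = rng.normal(size=(256, 2)) + 3.0
eps = rng.uniform(size=(256, 1))
x_hat = eps * real + (1 - eps) * fake

penalty = np.mean([(np.linalg.norm(grad_D(x)) - a) ** 2 for x in x_hat])
print(penalty)  # (5 - 1)^2 = 16 for this critic: the penalty pushes ||w|| toward 1
```

In a real implementation the gradient $\nabla D(x)$ would come from automatic differentiation, and the penalty would be weighted and added to the critic loss.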



References

  1. 1 2 3 Arjovsky, Martin; Chintala, Soumith; Bottou, Léon (2017-07-17). "Wasserstein Generative Adversarial Networks". International Conference on Machine Learning. PMLR: 214–223.
  2. Weng, Lilian (2019-04-18). "From GAN to WGAN". arXiv: 1904.08994 [cs.LG].
  3. Nowozin, Sebastian; Cseke, Botond; Tomioka, Ryota (2016). "f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization". Advances in Neural Information Processing Systems. Curran Associates, Inc. 29. arXiv: 1606.00709 .
  4. Arjovsky, Martin; Bottou, Léon (2017-01-01). "Towards Principled Methods for Training Generative Adversarial Networks". arXiv: 1701.04862.
  5. Miyato, Takeru; Kataoka, Toshiki; Koyama, Masanori; Yoshida, Yuichi (2018-02-16). "Spectral Normalization for Generative Adversarial Networks". arXiv: 1802.05957 [cs.LG].
  6. Gulrajani, Ishaan; Ahmed, Faruk; Arjovsky, Martin; Dumoulin, Vincent; Courville, Aaron C (2017). "Improved Training of Wasserstein GANs". Advances in Neural Information Processing Systems. Curran Associates, Inc. 30.

Notes

  1. In practice, the generator would never be able to reach perfect imitation, and so the discriminator would have motivation for perceiving the difference, which allows it to be used for other tasks, such as performing ImageNet classification without supervision.
  2. This is not how it is really done in practice, since the expectation is in general intractable and must be estimated from minibatches, but it is theoretically illuminating.