Flow-based generative model

A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow, [1] [2] [3] which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.

The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution, and applying the flow transformation.

In contrast, many alternative generative modeling methods, such as variational autoencoders (VAE) and generative adversarial networks (GAN), do not explicitly represent the likelihood function.

Method

[Figure: scheme for normalizing flows]

Let $z_0$ be a (possibly multivariate) random variable with distribution $p_0(z_0)$.

For $i = 1, \ldots, K$, let $z_i = f_i(z_{i-1})$ be a sequence of random variables transformed from $z_0$. The functions $f_1, \ldots, f_K$ should be invertible, i.e. the inverse function $f_i^{-1}$ exists. The final output $z_K$ models the target distribution.

The log likelihood of $z_K$ is (see derivation):

$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(z_{i-1})}{\partial z_{i-1}} \right|$$

To efficiently compute the log likelihood, the functions $f_1, \ldots, f_K$ should be (1) easy to invert and (2) have Jacobian determinants that are easy to compute. In practice, the functions $f_i$ are modeled using deep neural networks and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE, [4] RealNVP, [5] and Glow. [6]
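
As an illustration of the change-of-variables bookkeeping above, the following is a minimal NumPy sketch (not taken from the cited papers) using element-wise affine layers, whose inverses and Jacobian determinants are trivially available; all names and parameters are illustrative:

    import numpy as np

    # A stack of invertible element-wise affine layers x = exp(s) * z + b,
    # whose log|det Jacobian| is simply sum(s).  Real architectures (NICE,
    # RealNVP, Glow) use coupling layers so that the same quantities stay
    # cheap for learned transforms.

    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(4)]  # (s, b) per layer

    def forward(z0):
        """Push a base sample through the flow, accumulating log|det J|."""
        z, log_det = z0, 0.0
        for s, b in layers:
            z = np.exp(s) * z + b          # invertible element-wise affine map
            log_det += np.sum(s)           # log|det Jacobian| of this layer
        return z, log_det

    def log_likelihood(x):
        """log p_K(x) = log p_0(z_0) - sum_i log|det J_i|, inverting the flow."""
        z, log_det = x, 0.0
        for s, b in reversed(layers):
            z = (z - b) * np.exp(-s)       # inverse of the affine map
            log_det += np.sum(s)
        log_p0 = -0.5 * np.sum(z**2) - 0.5 * len(z) * np.log(2 * np.pi)  # standard normal base
        return log_p0 - log_det

    x, _ = forward(rng.normal(size=3))
    print(log_likelihood(x))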

Derivation of log likelihood

Consider $z_1$ and $z_0$. Note that $z_0 = f_1^{-1}(z_1)$.

By the change of variable formula, the distribution of $z_1$ is:

$$p_1(z_1) = p_0(z_0) \left| \det \frac{\partial f_1^{-1}(z_1)}{\partial z_1} \right|$$

where $\det \frac{\partial f_1^{-1}(z_1)}{\partial z_1}$ is the determinant of the Jacobian matrix of $f_1^{-1}$.

By the inverse function theorem:

$$p_1(z_1) = p_0(z_0) \left| \det \left( \frac{\partial f_1(z_0)}{\partial z_0} \right)^{-1} \right|$$

By the identity $\det(A^{-1}) = \det(A)^{-1}$ (where $A$ is an invertible matrix), we have:

$$p_1(z_1) = p_0(z_0) \left| \det \frac{\partial f_1(z_0)}{\partial z_0} \right|^{-1}$$

The log likelihood is thus:

$$\log p_1(z_1) = \log p_0(z_0) - \log \left| \det \frac{\partial f_1(z_0)}{\partial z_0} \right|$$

In general, the above applies to any $z_i$ and $z_{i-1}$. Since $\log p_i(z_i)$ equals $\log p_{i-1}(z_{i-1})$ minus a non-recursive term, we can infer by induction that:

$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(z_{i-1})}{\partial z_{i-1}} \right|$$

Training method

As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denoting $p_\theta(x)$ the model's likelihood and $p^*(x)$ the target distribution to learn, the (forward) KL-divergence is:

$$D_{\text{KL}}\left[p^*(x) \,\|\, p_\theta(x)\right] = -\mathbb{E}_{p^*(x)}\left[\log p_\theta(x)\right] + \mathbb{E}_{p^*(x)}\left[\log p^*(x)\right]$$

The second term on the right-hand side corresponds to the entropy of the target distribution and is independent of the parameter $\theta$ we want the model to learn, which leaves only the expectation of the negative log-likelihood under the target distribution to minimize. This intractable term can be approximated with a Monte Carlo method by importance sampling. Indeed, if we have a dataset $\{x_i\}_{i=1}^{N}$ of samples each independently drawn from the target distribution $p^*(x)$, then this term can be estimated as:

$$-\hat{\mathbb{E}}_{p^*(x)}\left[\log p_\theta(x)\right] = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$$

Therefore, the learning objective

$$\arg\min_\theta \, D_{\text{KL}}\left[p^*(x) \,\|\, p_\theta(x)\right]$$

is replaced by

$$\arg\max_\theta \, \sum_{i=1}^{N} \log p_\theta(x_i)$$
In other words, minimizing the Kullback–Leibler divergence between the model's likelihood and the target distribution is equivalent to maximizing the model likelihood under observed samples of the target distribution. [7]

Pseudocode for training normalizing flows is as follows: [8]

    INPUT: dataset $x_{1:n}$, normalizing flow model $f_\theta(\cdot)$ with base distribution $p_0$
    SOLVE: $\hat\theta = \arg\max_\theta \sum_{j} \ln p_\theta(x_j)$ by gradient descent
    RETURN: $\hat\theta$
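
A minimal PyTorch-style sketch of this maximum-likelihood loop follows. It assumes (as an interface choice made here, not part of the reference) that the model exposes an inverse(x) method returning the latent $z$ together with the accumulated log-determinant of the forward map:

    import torch

    # Sketch of maximum-likelihood training for a normalizing flow.
    # Assumption: flow.inverse(x) returns (z, log_det), where log_det is the
    # accumulated log|det Jacobian| of the forward map z -> x.  Any coupling-based
    # model (NICE, RealNVP, Glow) can supply this interface cheaply.

    def train(flow, data_loader, epochs=10, lr=1e-3):
        base = torch.distributions.Normal(0.0, 1.0)              # base distribution p_0
        optimizer = torch.optim.Adam(flow.parameters(), lr=lr)
        for _ in range(epochs):
            for x in data_loader:
                z, log_det = flow.inverse(x)                      # z_0 and sum_i log|det J_i|
                log_px = base.log_prob(z).sum(dim=1) - log_det    # change-of-variables formula
                loss = -log_px.mean()                             # negative log-likelihood
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return flow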

Variants

Planar Flow

The earliest example. [9] Fix some activation function $h$, and let $\theta = (u, w, b)$ with the appropriate dimensions; then

$$x = f_\theta(z) = z + u \, h(\langle w, z \rangle + b)$$

The inverse $f_\theta^{-1}$ has no closed-form solution in general.

The Jacobian determinant is $\left| \det\!\left( I + h'(\langle w, z \rangle + b) \, u w^T \right) \right| = \left| 1 + h'(\langle w, z \rangle + b) \, \langle u, w \rangle \right|$.

For $f_\theta$ to be invertible everywhere, this determinant must be nonzero everywhere. For example, $h = \tanh$ and $\langle u, w \rangle > -1$ satisfy the requirement.
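
A small NumPy sketch of a planar flow layer with fixed, purely illustrative parameters (the name planar_forward and the specific values are chosen here, not taken from the reference):

    import numpy as np

    # Planar flow sketch: f(z) = z + u * tanh(<w, z> + b).
    # log|det J| = log|1 + h'(<w, z> + b) <u, w>| costs O(n) rather than O(n^3).
    # Invertibility requires <u, w> > -1 (e.g. by reparametrizing u); that
    # constraint is not enforced in this sketch.

    rng = np.random.default_rng(1)
    n = 4
    u, w, b = rng.normal(size=n), rng.normal(size=n), 0.3

    def planar_forward(z):
        a = w @ z + b
        x = z + u * np.tanh(a)
        log_det = np.log(np.abs(1.0 + (1.0 - np.tanh(a) ** 2) * (u @ w)))  # h' = 1 - tanh^2
        return x, log_det

    x, log_det = planar_forward(rng.normal(size=n))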

Nonlinear Independent Components Estimation (NICE)

Let $x, z \in \mathbb{R}^{2n}$ be even-dimensional, and split them in the middle. [4] Then the normalizing flow functions are

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}$$

where $m_\theta$ is any neural network with weights $\theta$.

$f_\theta^{-1}$ is just $z_1 = x_1, \; z_2 = x_2 - m_\theta(x_1)$, and the Jacobian determinant is just 1; that is, the flow is volume-preserving.

When $n = 1$ (so that $x, z \in \mathbb{R}^2$), this can be seen as a curvy shearing along the $x_2$ direction.
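
A NumPy sketch of such an additive coupling layer; the conditioner m below is a placeholder standing in for the learned network $m_\theta$:

    import numpy as np

    # Additive coupling layer in the style of NICE (sketch): the first half of
    # the vector passes through unchanged and shifts the second half.

    def m(z1):
        return np.tanh(z1)          # placeholder for a learned network

    def nice_forward(z):
        z1, z2 = np.split(z, 2)
        return np.concatenate([z1, z2 + m(z1)])   # log|det J| = 0 (volume-preserving)

    def nice_inverse(x):
        x1, x2 = np.split(x, 2)
        return np.concatenate([x1, x2 - m(x1)])   # exact inverse, no network inversion needed

    z = np.arange(4.0)
    assert np.allclose(nice_inverse(nice_forward(z)), z)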

Real Non-Volume Preserving (Real NVP)

The Real Non-Volume Preserving model generalizes NICE by: [5]

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ e^{s_\theta(z_1)} \odot z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}$$

Its inverse is $z_1 = x_1, \; z_2 = e^{-s_\theta(x_1)} \odot (x_2 - m_\theta(x_1))$, and its Jacobian determinant is $\prod_i e^{s_\theta(z_1)_i}$. The NICE model is recovered by setting $s_\theta = 0$. Since the Real NVP map keeps the first and second halves of the vector separate, a permutation is usually added after every Real NVP layer.
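
A NumPy sketch of an affine coupling layer in this style; s and m are placeholders standing in for the learned scale and translation networks $s_\theta$ and $m_\theta$:

    import numpy as np

    # Affine coupling layer in the style of Real NVP (sketch).

    def s(z1):
        return 0.5 * np.tanh(z1)    # placeholder scale network
    def m(z1):
        return np.tanh(z1)          # placeholder translation network

    def realnvp_forward(z):
        z1, z2 = np.split(z, 2)
        x2 = np.exp(s(z1)) * z2 + m(z1)
        log_det = np.sum(s(z1))                  # log|det J| = sum_i s_theta(z1)_i
        return np.concatenate([z1, x2]), log_det

    def realnvp_inverse(x):
        x1, x2 = np.split(x, 2)
        z2 = np.exp(-s(x1)) * (x2 - m(x1))
        return np.concatenate([x1, z2])

    z = np.array([0.2, -1.0, 0.5, 1.5])
    x, _ = realnvp_forward(z)
    assert np.allclose(realnvp_inverse(x), z)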

Generative Flow (Glow)

In the generative flow model, [6] each layer has 3 parts:

  - a channel-wise affine transform $y_{cij} = s_c (x_{cij} + b_c)$, with Jacobian determinant $\prod_c s_c^{HW}$;
  - an invertible 1x1 convolution $z_{cij} = \sum_{c'} K_{cc'} y_{c'ij}$, with Jacobian determinant $\det(K)^{HW}$, where $K$ is any invertible matrix;
  - a Real NVP coupling layer, with Jacobian determinant as described for Real NVP.

The idea of using the invertible 1x1 convolution is to permute (mix) all channels in general, instead of merely permuting the first and second half, as in Real NVP.
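
A NumPy sketch of the invertible 1x1 convolution on a single $C \times H \times W$ activation; the matrix K here is an arbitrary illustrative choice (Glow initializes it as a random rotation):

    import numpy as np

    # Invertible 1x1 convolution (sketch): every pixel's channel vector is
    # multiplied by the same invertible matrix K, mixing all channels.
    # For a C x H x W activation the log|det Jacobian| is H * W * log|det K|.

    rng = np.random.default_rng(2)
    C, H, W = 3, 4, 4
    K = rng.normal(size=(C, C)) + 3 * np.eye(C)      # illustrative invertible matrix

    def conv1x1_forward(y):
        z = np.einsum('cd,dhw->chw', K, y)           # apply K along the channel dimension
        log_det = H * W * np.log(np.abs(np.linalg.det(K)))
        return z, log_det

    def conv1x1_inverse(z):
        return np.einsum('cd,dhw->chw', np.linalg.inv(K), z)

    y = rng.normal(size=(C, H, W))
    z, _ = conv1x1_forward(y)
    assert np.allclose(conv1x1_inverse(z), y)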

Masked autoregressive flow (MAF)

An autoregressive model of a distribution on $\mathbb{R}^n$ is defined as the following stochastic process: [10]

$$\begin{aligned}
x_1 &\sim N(\mu_1, \sigma_1^2) \\
x_2 &\sim N(\mu_2(x_1), \sigma_2(x_1)^2) \\
&\;\;\vdots \\
x_n &\sim N(\mu_n(x_{1:n-1}), \sigma_n(x_{1:n-1})^2)
\end{aligned}$$

where $\mu_i$ and $\sigma_i > 0$ are fixed functions that define the autoregressive model. By the reparametrization trick, the autoregressive model is generalized to a normalizing flow:

$$\begin{aligned}
x_1 &= \mu_1 + \sigma_1 z_1 \\
x_2 &= \mu_2(x_1) + \sigma_2(x_1) z_2 \\
&\;\;\vdots \\
x_n &= \mu_n(x_{1:n-1}) + \sigma_n(x_{1:n-1}) z_n
\end{aligned}$$

The autoregressive model is recovered by setting $z \sim N(0, I_n)$.

The forward mapping (from $z$ to $x$) is slow, because it is sequential, but the backward mapping (from $x$ to $z$) is fast, because it is parallel.

The Jacobian matrix is lower triangular, so the Jacobian determinant is $\sigma_1 \, \sigma_2(x_1) \cdots \sigma_n(x_{1:n-1})$.

Reversing the two maps and of MAF results in Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping. [11]
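
A NumPy sketch of the two directions of a masked autoregressive flow; the conditioners mu and sigma below are hand-written placeholders standing in for the masked networks of [10]:

    import numpy as np

    # MAF sketch: sampling (z -> x) is sequential, density evaluation (x -> z)
    # is parallelizable, since every mu_i, sigma_i depends only on x_{1:i-1}.
    # (IAF reverses the roles of the two directions.)

    def mu(prev):                                    # prev = x_{1:i-1}
        return 0.5 * np.sum(np.tanh(prev))
    def sigma(prev):
        return np.exp(0.1 * np.sum(np.tanh(prev)))   # kept positive

    def maf_sample(z):
        x = np.zeros_like(z)
        for i in range(len(z)):                      # sequential: x_i needs x_{1:i-1}
            x[i] = mu(x[:i]) + sigma(x[:i]) * z[i]
        return x

    def maf_inverse(x):
        z = np.empty_like(x)
        log_det = 0.0
        for i in range(len(x)):                      # each term uses only x_{1:i-1}
            z[i] = (x[i] - mu(x[:i])) / sigma(x[:i])
            log_det += np.log(sigma(x[:i]))
        return z, log_det

    z = np.array([0.3, -1.2, 0.8])
    assert np.allclose(maf_inverse(maf_sample(z))[0], z)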

Continuous Normalizing Flow (CNF)

Instead of constructing a flow by function composition, another approach is to formulate the flow as a continuous-time dynamic. [12] [13] Let $z_0$ be the latent variable with distribution $p(z_0)$. Map this latent variable to data space with the following flow function:

$$x = F(z_0) = z_T = z_0 + \int_0^T f(z_t, t) \, dt$$

where $f$ is an arbitrary function that can be modeled with e.g. neural networks.

The inverse function is then naturally: [12]

$$z_0 = F^{-1}(x) = z_T + \int_T^0 f(z_t, t) \, dt = z_T - \int_0^T f(z_t, t) \, dt$$

And the log-likelihood of $x$ can be found as: [12]

$$\log p(x) = \log p(z_0) - \int_0^T \operatorname{Tr}\!\left[\frac{\partial f}{\partial z_t}\right] dt$$

Since the trace depends only on the diagonal of the Jacobian $\frac{\partial f}{\partial z_t}$, this allows a "free-form" Jacobian. [14] Here, "free-form" means that there is no restriction on the Jacobian's form. It is contrasted with previous discrete models of normalizing flow, where the Jacobian is carefully designed to be upper or lower triangular, so that its determinant can be evaluated efficiently.

The trace can be estimated by "Hutchinson's trick": [15] [16]

Given any matrix $W \in \mathbb{R}^{n \times n}$ and any random vector $u \in \mathbb{R}^n$ with $\mathbb{E}[u u^T] = I$, we have $\mathbb{E}[u^T W u] = \operatorname{tr}(W)$. (Proof: expand the expectation directly.)

Usually, the random vector is sampled from $N(0, I)$ (normal distribution) or $\{\pm 1\}^n$ (Rademacher distribution).
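
A NumPy sketch of the estimator, checked against the exact trace; the matrix and sample count are arbitrary illustrative choices:

    import numpy as np

    # Hutchinson's trace estimator: E[u^T W u] = tr(W) whenever E[u u^T] = I.
    # In CNF training, u^T W is obtained with a single vector-Jacobian product,
    # avoiding materializing the full Jacobian.

    rng = np.random.default_rng(3)
    n = 5
    W = rng.normal(size=(n, n))

    samples = 200_000
    u = rng.choice([-1.0, 1.0], size=(samples, n))           # Rademacher vectors, E[u u^T] = I
    estimate = np.mean(np.einsum('si,ij,sj->s', u, W, u))    # average of u^T W u

    print(estimate, np.trace(W))                             # close for large sample counts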

When $f$ is implemented as a neural network, neural ODE methods [17] are needed. Indeed, CNF was first proposed in the same paper that proposed neural ODEs.

There are two main deficiencies of CNF. One is that a continuous flow must be a homeomorphism, and thus preserve orientation and ambient isotopy (for example, it is impossible to flip a left hand into a right hand by continuously deforming space, to turn a sphere inside out, or to undo a knot). The other is that the learned flow $f$ might be ill-behaved, due to degeneracy: there are an infinite number of possible $f$ that all solve the same problem.

By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just like how one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE". [18]

Any homeomorphism of $\mathbb{R}^n$ can be approximated by a neural ODE operating on $\mathbb{R}^{2n+1}$, as can be proved by combining the Whitney embedding theorem for manifolds and the universal approximation theorem for neural networks. [19]

To regularize the flow $f$, one can impose regularization losses. The paper [15] proposed the following regularization loss based on optimal transport theory:

$$\int_0^T \left[ \lambda_K \| f(z_t, t) \|^2 + \lambda_J \| \nabla_z f(z_t, t) \|_F^2 \right] dt$$

where $\lambda_K, \lambda_J > 0$ are hyperparameters. The first term punishes the model for oscillating the flow field over time, and the second term punishes it for oscillating the flow field over space. Both terms together guide the model into a flow that is smooth (not "bumpy") over space and time.

Downsides

Despite normalizing flows' success in estimating high-dimensional densities, some downsides still exist in their designs. First of all, their latent space, onto which input data is projected, is not lower-dimensional; therefore, flow-based models do not allow for compression of data by default and require a lot of computation. However, it is still possible to perform image compression with them. [20]

Flow-based models are also notorious for failing in estimating the likelihood of out-of-distribution samples (i.e.: samples that were not drawn from the same distribution as the training set). [21] Some hypotheses were formulated to explain this phenomenon, among which the typical set hypothesis, [22] estimation issues when training models, [23] or fundamental issues due to the entropy of the data distributions. [24]

One of the most interesting properties of normalizing flows is the invertibility of their learned bijective map. This property is enforced by constraints in the design of the models (cf. RealNVP, Glow) which guarantee theoretical invertibility. The integrity of the inverse is important for the applicability of the change-of-variables theorem, the computation of the Jacobian of the map, and sampling with the model. However, in practice this invertibility can be violated, and the inverse map can explode because of numerical imprecision. [25]

Applications

Flow-based generative models have been applied on a variety of modeling tasks, including:

  - audio generation [26]
  - image generation [6]
  - molecular graph generation [27]
  - point-cloud modeling [28]
  - video generation [29]
  - lossy image compression [20]
  - anomaly detection [30]

References

  1. Tabak, Esteban G.; Vanden-Eijnden, Eric (2010). "Density estimation by dual ascent of the log-likelihood". Communications in Mathematical Sciences. 8 (1): 217–233. doi:10.4310/CMS.2010.v8.n1.a11.
  2. Tabak, Esteban G.; Turner, Cristina V. (2012). "A family of nonparametric density estimation algorithms". Communications on Pure and Applied Mathematics. 66 (2): 145–164. doi:10.1002/cpa.21423. hdl:11336/8930. S2CID   17820269.
  3. Papamakarios, George; Nalisnick, Eric; Jimenez Rezende, Danilo; Mohamed, Shakir; Lakshminarayanan, Balaji (2021). "Normalizing flows for probabilistic modeling and inference". Journal of Machine Learning Research. 22 (1): 2617–2680. arXiv: 1912.02762 .
  4. Dinh, Laurent; Krueger, David; Bengio, Yoshua (2014). "NICE: Non-linear Independent Components Estimation". arXiv: 1410.8516 [cs.LG].
  5. Dinh, Laurent; Sohl-Dickstein, Jascha; Bengio, Samy (2016). "Density estimation using Real NVP". arXiv: 1605.08803 [cs.LG].
  6. Kingma, Diederik P.; Dhariwal, Prafulla (2018). "Glow: Generative Flow with Invertible 1x1 Convolutions". arXiv: 1807.03039 [stat.ML].
  7. Papamakarios, George; Nalisnick, Eric; Rezende, Danilo Jimenez; Mohamed, Shakir; Lakshminarayanan, Balaji (March 2021). "Normalizing Flows for Probabilistic Modeling and Inference". Journal of Machine Learning Research. 22 (57): 1–64. arXiv: 1912.02762 .
  8. Kobyzev, Ivan; Prince, Simon J.D.; Brubaker, Marcus A. (November 2021). "Normalizing Flows: An Introduction and Review of Current Methods". IEEE Transactions on Pattern Analysis and Machine Intelligence. 43 (11): 3964–3979. arXiv: 1908.09257 . doi:10.1109/TPAMI.2020.2992934. ISSN   1939-3539. PMID   32396070. S2CID   208910764.
  9. Danilo Jimenez Rezende; Mohamed, Shakir (2015). "Variational Inference with Normalizing Flows". arXiv: 1505.05770 [stat.ML].
  10. Papamakarios, George; Pavlakou, Theo; Murray, Iain (2017). "Masked Autoregressive Flow for Density Estimation". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. arXiv: 1705.07057 .
  11. Kingma, Durk P; Salimans, Tim; Jozefowicz, Rafal; Chen, Xi; Sutskever, Ilya; Welling, Max (2016). "Improved Variational Inference with Inverse Autoregressive Flow". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc. arXiv: 1606.04934 .
  12. Grathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models". arXiv: 1810.01367 [cs.LG].
  13. Lipman, Yaron; Chen, Ricky T. Q.; Ben-Hamu, Heli; Nickel, Maximilian; Le, Matt (2022-10-01). "Flow Matching for Generative Modeling". arXiv: 2210.02747 [cs.LG].
  14. Grathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018-10-22). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models". arXiv: 1810.01367 [cs.LG].
  15. Finlay, Chris; Jacobsen, Joern-Henrik; Nurbekyan, Levon; Oberman, Adam (2020-11-21). "How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization". International Conference on Machine Learning. PMLR: 3154–3164. arXiv: 2002.02798 .
  16. Hutchinson, M.F. (January 1989). "A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines". Communications in Statistics - Simulation and Computation. 18 (3): 1059–1076. doi:10.1080/03610918908812806. ISSN   0361-0918.
  17. Chen, Ricky T. Q.; Rubanova, Yulia; Bettencourt, Jesse; Duvenaud, David (2018). "Neural Ordinary Differential Equations". arXiv: 1806.07366 [cs.LG].
  18. Dupont, Emilien; Doucet, Arnaud; Teh, Yee Whye (2019). "Augmented Neural ODEs". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
  19. Zhang, Han; Gao, Xi; Unterman, Jacob; Arodz, Tom (2019-07-30). "Approximation Capabilities of Neural ODEs and Invertible Residual Networks". arXiv: 1907.12998 [cs.LG].
  20. Helminger, Leonhard; Djelouah, Abdelaziz; Gross, Markus; Schroers, Christopher (2020). "Lossy Image Compression with Normalizing Flows". arXiv: 2008.10486 [cs.CV].
  21. Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Gorur, Dilan; Lakshminarayanan, Balaji (2018). "Do Deep Generative Models Know What They Don't Know?". arXiv: 1810.09136v3 [stat.ML].
  22. Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Lakshminarayanan, Balaji (2019). "Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality". arXiv: 1906.02994 [stat.ML].
  23. Zhang, Lily; Goldstein, Mark; Ranganath, Rajesh (2021). "Understanding Failures in Out-of-Distribution Detection with Deep Generative Models". Proceedings of Machine Learning Research. 139: 12427–12436. PMC   9295254 . PMID   35860036.
  24. Caterini, Anthony L.; Loaiza-Ganem, Gabriel (2022). "Entropic Issues in Likelihood-Based OOD Detection". pp. 21–26. arXiv: 2109.10794 [stat.ML].
  25. Behrmann, Jens; Vicol, Paul; Wang, Kuan-Chieh; Grosse, Roger; Jacobsen, Jörn-Henrik (2020). "Understanding and Mitigating Exploding Inverses in Invertible Neural Networks". arXiv: 2006.09347 [cs.LG].
  26. Ping, Wei; Peng, Kainan; Gorur, Dilan; Lakshminarayanan, Balaji (2019). "WaveFlow: A Compact Flow-based Model for Raw Audio". arXiv: 1912.01219 [cs.SD].
  27. Shi, Chence; Xu, Minkai; Zhu, Zhaocheng; Zhang, Weinan; Zhang, Ming; Tang, Jian (2020). "GraphAF: A Flow-based Autoregressive Model for Molecular Graph Generation". arXiv: 2001.09382 [cs.LG].
  28. Yang, Guandao; Huang, Xun; Hao, Zekun; Liu, Ming-Yu; Belongie, Serge; Hariharan, Bharath (2019). "PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows". arXiv: 1906.12320 [cs.CV].
  29. Kumar, Manoj; Babaeizadeh, Mohammad; Erhan, Dumitru; Finn, Chelsea; Levine, Sergey; Dinh, Laurent; Kingma, Durk (2019). "VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation". arXiv: 1903.01434 [cs.CV].
  30. Rudolph, Marco; Wandt, Bastian; Rosenhahn, Bodo (2021). "Same Same But DifferNet: Semi-Supervised Defect Detection with Normalizing Flows". arXiv: 2008.12577 [cs.CV].