Variational autoencoder

Figure: The basic scheme of a variational autoencoder. The model receives $x$ as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces $x'$ as similar as possible to $x$.

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian methods. [1]

In addition to being seen as an autoencoder neural network architecture, variational autoencoders can also be studied within the mathematical formulation of variational Bayesian methods, connecting a neural encoder network to its decoder through a probabilistic latent space (for example, a multivariate Gaussian distribution) that corresponds to the parameters of a variational distribution.

Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. [2] Both networks are typically trained together with the usage of the reparameterization trick, although the variance of the noise model can be learned separately.

Although this type of model was initially designed for unsupervised learning, [3] [4] its effectiveness has been proven for semi-supervised learning [5] [6] and supervised learning. [7]

Overview of architecture and operation

A variational autoencoder is a generative model with a prior and noise distribution, $p_\theta(z)$ and $p_\theta(x|z)$ respectively. Usually such models are trained using the expectation-maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. This neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder.
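For illustration, a minimal encoder of this kind can be sketched in PyTorch; the layer sizes, names, and the diagonal-Gaussian parameterization below are assumptions made for the sketch, not details fixed by the method itself.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Amortized inference network: maps each data point x to the parameters
    (mean, log-variance) of a diagonal Gaussian variational posterior q(z|x)."""
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.to_mean = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.to_mean(h), self.to_logvar(h)  # parameters of q(z|x)
```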

The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance, however this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent.
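A matching decoder sketch (again with assumed, illustrative dimensions) maps a latent code to the mean of the noise distribution; a single learnable scalar stands in for the optional variance mentioned above.

```python
import torch
from torch import nn

class Decoder(nn.Module):
    """Maps a latent code z to the mean of the noise distribution p(x|z)."""
    def __init__(self, latent_dim=20, hidden_dim=400, output_dim=784):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden_dim)
        self.to_mean = nn.Linear(hidden_dim, output_dim)
        # Optional scalar log-variance of the noise model; it can be optimized
        # by gradient descent if included in the loss.
        self.noise_logvar = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return self.to_mean(h)
```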

To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data. For example, a standard VAE task such as ImageNet is typically assumed to have Gaussian-distributed noise; however, tasks such as binarized MNIST require a Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value. [8]
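As a concrete sketch of the two terms (the function names and the diagonal-Gaussian posterior are illustrative assumptions), the reconstruction term changes with the assumed noise model, while the KL term is computed against the prior:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(x_hat, x, noise_model="gaussian"):
    """Reconstruction term of the free energy for one batch (summed)."""
    if noise_model == "gaussian":
        # Gaussian noise model (e.g. natural images): squared error, up to constants.
        return 0.5 * F.mse_loss(x_hat, x, reduction="sum")
    if noise_model == "bernoulli":
        # Bernoulli noise model (e.g. binarized MNIST): cross entropy, x_hat in (0, 1).
        return F.binary_cross_entropy(x_hat, x, reduction="sum")
    raise ValueError(f"unknown noise model: {noise_model}")

def kl_to_standard_normal(mean, logvar):
    """KL divergence from a diagonal Gaussian q(z|x) to the standard normal prior."""
    return 0.5 * torch.sum(logvar.exp() + mean.pow(2) - 1.0 - logvar)
```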

Formulation

From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data $x$ by their chosen parameterized probability distribution $p_\theta(x) = p(x|\theta)$. This distribution is usually chosen to be a Gaussian $\mathcal{N}(x|\mu,\sigma)$ which is parameterized by $\mu$ and $\sigma$ respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize; however, distributions where a prior is assumed over the latents $z$ result in intractable integrals. Let us find $p_\theta(x)$ via marginalizing over $z$.

$$p_\theta(x) = \int_{z} p_\theta(x,z) \, dz,$$

where $p_\theta(x,z)$ represents the joint distribution under $p_\theta$ of the observable data $x$ and its latent representation or encoding $z$. According to the chain rule, the equation can be rewritten as

$$p_\theta(x) = \int_{z} p_\theta(x|z)\, p_\theta(z) \, dz.$$

In the vanilla variational autoencoder, $z$ is usually taken to be a finite-dimensional vector of real numbers, and $p_\theta(x|z)$ to be a Gaussian distribution. Then $p_\theta(x)$ is a mixture of Gaussian distributions.
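Read as an expectation over the prior, $p_\theta(x) = \mathbb{E}_{z \sim p_\theta(z)}[p_\theta(x|z)]$, the marginal can in principle be approximated by naive Monte Carlo sampling; the toy one-dimensional model below (with a made-up decoder) only illustrates this mixture-of-Gaussians reading, not a practical procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

decoder = lambda z: np.tanh(2.0 * z)   # toy stand-in for the decoder mean D_theta(z)
noise_std = 0.5                        # toy noise-model standard deviation
x_observed = 0.3

# p(x) = E_{z ~ N(0,1)}[ N(x | decoder(z), noise_std^2) ], estimated by sampling the prior.
z_samples = rng.standard_normal(100_000)
p_x = gaussian_pdf(x_observed, decoder(z_samples), noise_std).mean()
print(f"Monte Carlo estimate of p(x = {x_observed}): {p_x:.4f}")
```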

It is now possible to define the set of the relationships between the input data and its latent representation as

  - Prior: $p_\theta(z)$
  - Likelihood: $p_\theta(x|z)$
  - Posterior: $p_\theta(z|x)$

Unfortunately, the computation of $p_\theta(z|x)$ is expensive and in most cases intractable. To make the calculation feasible, it is necessary to introduce a further function to approximate the posterior distribution as

$$q_\phi(z|x) \approx p_\theta(z|x),$$

with $\phi$ defined as the set of real values that parametrize $q$. This is sometimes called amortized inference, since by "investing" in finding a good $q_\phi$, one can later infer $z$ from $x$ quickly without doing any integrals.

In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution $p_\theta(x|z)$ is computed by the probabilistic decoder, and the approximated posterior distribution $q_\phi(z|x)$ is computed by the probabilistic encoder.

Parametrize the encoder as $E_\phi$, and the decoder as $D_\theta$.

Evidence lower bound (ELBO)

As in every deep learning problem, it is necessary to define a differentiable loss function in order to update the network weights through backpropagation.

For variational autoencoders, the idea is to jointly optimize the generative model parameters $\theta$ to reduce the reconstruction error between the input and the output, and $\phi$ to make $q_\phi(z|x)$ as close as possible to $p_\theta(z|x)$. As reconstruction loss, mean squared error and cross entropy are often used.

As distance loss between the two distributions the Kullback–Leibler divergence $D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x))$ is a good choice to squeeze $q_\phi(z|x)$ under $p_\theta(z|x)$. [8] [9]

The distance loss just defined is expanded as

$$
\begin{aligned}
D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x)) &= \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\
&= \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)\, p_\theta(x)}{p_\theta(x,z)}\right] \\
&= \ln p_\theta(x) + \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)}{p_\theta(x,z)}\right].
\end{aligned}
$$

Now define the evidence lower bound (ELBO):

$$L_{\theta,\phi}(x) := \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot|x)).$$

Maximizing the ELBO

$$\theta^{*}, \phi^{*} = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x)$$

is equivalent to simultaneously maximizing $\ln p_\theta(x)$ and minimizing $D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x))$. That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior $q_\phi(\cdot|x)$ from the exact posterior $p_\theta(\cdot|x)$. The form given is not very convenient for maximization, but the following, equivalent form, is:

$$L_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln p_\theta(x|z)\right] - D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot)),$$

where $\ln p_\theta(x|z)$ is implemented as $-\frac{1}{2}\left\|x - D_\theta(z)\right\|_2^2$, since that is, up to an additive constant, what $x|z \sim \mathcal{N}(D_\theta(z), I)$ yields. That is, we model the distribution of $x$ conditional on $z$ to be a Gaussian distribution centered on $D_\theta(z)$. The distributions of $q_\phi(z|x)$ and $p_\theta(z)$ are often also chosen to be Gaussians as $z|x \sim \mathcal{N}(E_\phi(x), \sigma_\phi(x)^2 I)$ and $z \sim \mathcal{N}(0, I)$, with which we obtain, by the formula for the KL divergence of Gaussians:

$$L_{\theta,\phi}(x) = -\frac{1}{2}\,\mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\left\|x - D_\theta(z)\right\|_2^2\right] - \frac{1}{2}\left(N\sigma_\phi(x)^2 + \left\|E_\phi(x)\right\|_2^2 - 2N\ln\sigma_\phi(x)\right) + \mathrm{Const}.$$

Here $N$ is the dimension of $z$. For a more detailed derivation and more interpretations of ELBO and its maximization, see its main page.
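Putting the pieces together under these Gaussian choices, a single-sample estimate of the negative ELBO can be sketched as follows (the Encoder/Decoder modules and the unit noise variance are the illustrative assumptions used earlier, not part of the formula itself):

```python
import torch

def negative_elbo(x, encoder, decoder):
    """Single-sample estimate of -L(x) with q(z|x) = N(mean, diag(exp(logvar)))
    and a unit-variance Gaussian noise model centered on the decoder output."""
    mean, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    z = mean + std * torch.randn_like(std)   # reparameterized sample from q(z|x)
    x_hat = decoder(z)

    # -E[ln p(x|z)], approximated with the single sample z (up to additive constants).
    reconstruction = 0.5 * torch.sum((x - x_hat) ** 2)
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior.
    kl = 0.5 * torch.sum(logvar.exp() + mean.pow(2) - 1.0 - logvar)
    return reconstruction + kl
```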

Reparameterization

Figure: The scheme of the reparameterization trick. The randomness variable $\varepsilon$ is injected into the latent space $z$ as external input. In this way, it is possible to backpropagate the gradient without involving the stochastic variable during the update.

To efficiently search for

$$\theta^{*}, \phi^{*} = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x),$$

the typical method is gradient descent. It is straightforward to find

$$\nabla_\theta\, \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\nabla_\theta \ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right].$$

However,

$$\nabla_\phi\, \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$$

does not allow one to put the $\nabla_\phi$ inside the expectation, since $\phi$ appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation [10] ) bypasses this difficulty. [8] [11] [12]

The most important example is when $z \sim q_\phi(\cdot|x)$ is normally distributed, as $\mathcal{N}(\mu_\phi(x), \Sigma_\phi(x))$.

Figure: The scheme of a variational autoencoder after the reparameterization trick.

This can be reparametrized by letting $\varepsilon \sim \mathcal{N}(0, I)$ be a "standard random number generator", and constructing $z$ as $z = \mu_\phi(x) + L_\phi(x)\varepsilon$. Here, $L_\phi(x)$ is obtained by the Cholesky decomposition:

$$\Sigma_\phi(x) = L_\phi(x) L_\phi(x)^{T}.$$

Then we have

$$\nabla_\phi\, \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}\left[\nabla_\phi \ln \frac{p_\theta\left(x, \mu_\phi(x) + L_\phi(x)\varepsilon\right)}{q_\phi\left(\mu_\phi(x) + L_\phi(x)\varepsilon \mid x\right)}\right],$$

and so we obtain an unbiased estimator of the gradient, allowing stochastic gradient descent. Since we reparametrized $z$, we need to find $q_\phi(z|x)$. Let $q_0$ be the probability density function for $\varepsilon$, then

$$\ln q_\phi(z|x) = \ln q_0(\varepsilon) - \ln\left|\det\left(\frac{\partial z}{\partial \varepsilon}\right)\right|,$$

where $\frac{\partial z}{\partial \varepsilon}$ is the Jacobian matrix of $z$ with respect to $\varepsilon$. Since $z = \mu_\phi(x) + L_\phi(x)\varepsilon$, this is

$$\ln q_\phi(z|x) = -\frac{1}{2}\|\varepsilon\|^2 - \ln\left|\det L_\phi(x)\right| - \frac{N}{2}\ln(2\pi).$$
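A minimal sketch of the trick for a full-covariance Gaussian (the variable names and the toy loss are assumptions): the sample is a deterministic, differentiable function of the variational parameters given the external noise, so gradients reach them by ordinary backpropagation.

```python
import torch

torch.manual_seed(0)
latent_dim = 3

# Illustrative variational parameters for one data point: a mean vector and a
# matrix whose lower triangle serves as the Cholesky factor L, with Sigma = L L^T.
mu = torch.zeros(latent_dim, requires_grad=True)
L_raw = torch.eye(latent_dim, requires_grad=True)

eps = torch.randn(latent_dim)   # external "standard random number generator"
L = torch.tril(L_raw)           # Cholesky factor
z = mu + L @ eps                # reparameterized sample z = mu + L * eps

loss = (z ** 2).sum()           # toy stand-in for the loss terms that depend on z
loss.backward()                 # gradients flow to mu and L_raw, not through eps
print(mu.grad)
print(L_raw.grad)
```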

Variations

Many variational autoencoder applications and extensions have been used to adapt the architecture to other domains and to improve its performance.

β-VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for $\beta$ values greater than one. This architecture can discover disentangled latent factors without supervision. [13] [14]
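In the notation of the ELBO above, the reweighted objective can be written as (this restates the standard β-weighted form, not any additional refinements from the cited papers):

$$L_{\beta}(x) = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln p_\theta(x|z)\right] - \beta\, D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot)),$$

which reduces to the standard VAE objective for $\beta = 1$.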

The conditional VAE (CVAE) inserts label information in the latent space to force a deterministic constrained representation of the learned data. [15]
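One common way to realize this conditioning (a sketch under that assumption, not necessarily the exact construction of the cited paper) is to concatenate a one-hot label to the inputs of both the encoder and the decoder:

```python
import torch
from torch import nn

class ConditionalEncoder(nn.Module):
    """Encoder conditioned on a class label: q(z | x, y)."""
    def __init__(self, input_dim=784, num_classes=10, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.hidden = nn.Linear(input_dim + num_classes, hidden_dim)
        self.to_mean = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x, y_onehot):
        h = torch.relu(self.hidden(torch.cat([x, y_onehot], dim=-1)))
        return self.to_mean(h), self.to_logvar(h)

# The decoder is conditioned the same way: its input is torch.cat([z, y_onehot], dim=-1).
```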

Some structures directly deal with the quality of the generated samples [16] [17] or implement more than one latent space to further improve the representation learning.

Some architectures mix VAE and generative adversarial networks to obtain hybrid models. [18] [19] [20]

See also


References

  1. Pinheiro Cinelli, Lucas; et al. (2021). "Variational Autoencoder". Variational Methods for Machine Learning with Applications to Deep Networks. Springer. pp. 111–149. doi:10.1007/978-3-030-70679-1_5. ISBN   978-3-030-70681-4. S2CID   240802776.
  2. Rocca, Joseph (2021-03-21). "Understanding Variational Autoencoders (VAEs)". Medium.
  3. Dilokthanakul, Nat; Mediano, Pedro A. M.; Garnelo, Marta; Lee, Matthew C. H.; Salimbeni, Hugh; Arulkumaran, Kai; Shanahan, Murray (2017-01-13). "Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders". arXiv: 1611.02648 [cs.LG].
  4. Hsu, Wei-Ning; Zhang, Yu; Glass, James (December 2017). "Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation". 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). pp. 16–23. arXiv: 1707.06265 . doi:10.1109/ASRU.2017.8268911. ISBN   978-1-5090-4788-8. S2CID   22681625.
  5. Ehsan Abbasnejad, M.; Dick, Anthony; van den Hengel, Anton (2017). Infinite Variational Autoencoder for Semi-Supervised Learning. pp. 5888–5897.
  6. Xu, Weidi; Sun, Haoze; Deng, Chao; Tan, Ying (2017-02-12). "Variational Autoencoder for Semi-Supervised Text Classification". Proceedings of the AAAI Conference on Artificial Intelligence. 31 (1). doi: 10.1609/aaai.v31i1.10966 . S2CID   2060721.
  7. Kameoka, Hirokazu; Li, Li; Inoue, Shota; Makino, Shoji (2019-09-01). "Supervised Determined Source Separation with Multichannel Variational Autoencoder". Neural Computation. 31 (9): 1891–1914. doi:10.1162/neco_a_01217. PMID   31335290. S2CID   198168155.
  8. Kingma, Diederik P.; Welling, Max (2013-12-20). "Auto-Encoding Variational Bayes". arXiv: 1312.6114 [stat.ML].
  9. "From Autoencoder to Beta-VAE". Lil'Log. 2018-08-12.
  10. Rezende, Danilo Jimenez; Mohamed, Shakir; Wierstra, Daan (2014-06-18). "Stochastic Backpropagation and Approximate Inference in Deep Generative Models". International Conference on Machine Learning. PMLR: 1278–1286. arXiv: 1401.4082 .
  11. Bengio, Yoshua; Courville, Aaron; Vincent, Pascal (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (8): 1798–1828. arXiv: 1206.5538 . doi:10.1109/TPAMI.2013.50. ISSN   1939-3539. PMID   23787338. S2CID   393948.
  12. Kingma, Diederik P.; Rezende, Danilo J.; Mohamed, Shakir; Welling, Max (2014-10-31). "Semi-Supervised Learning with Deep Generative Models". arXiv: 1406.5298 [cs.LG].
  13. Higgins, Irina; Matthey, Loic; Pal, Arka; Burgess, Christopher; Glorot, Xavier; Botvinick, Matthew; Mohamed, Shakir; Lerchner, Alexander (2016-11-04). "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework".
  14. Burgess, Christopher P.; Higgins, Irina; Pal, Arka; Matthey, Loic; Watters, Nick; Desjardins, Guillaume; Lerchner, Alexander (2018-04-10). "Understanding disentangling in β-VAE". arXiv: 1804.03599 [stat.ML].
  15. Sohn, Kihyuk; Lee, Honglak; Yan, Xinchen (2015-01-01). "Learning Structured Output Representation using Deep Conditional Generative Models" (PDF).
  16. Dai, Bin; Wipf, David (2019-10-30). "Diagnosing and Enhancing VAE Models". arXiv: 1903.05789 [cs.LG].
  17. Dorta, Garoe; Vicente, Sara; Agapito, Lourdes; Campbell, Neill D. F.; Simpson, Ivor (2018-07-31). "Training VAEs Under Structured Residuals". arXiv: 1804.01050 [stat.ML].
  18. Larsen, Anders Boesen Lindbo; Sønderby, Søren Kaae; Larochelle, Hugo; Winther, Ole (2016-06-11). "Autoencoding beyond pixels using a learned similarity metric". International Conference on Machine Learning. PMLR: 1558–1566. arXiv: 1512.09300 .
  19. Bao, Jianmin; Chen, Dong; Wen, Fang; Li, Houqiang; Hua, Gang (2017). "CVAE-GAN: Fine-Grained Image Generation Through Asymmetric Training". pp. 2745–2754. arXiv: 1703.10155 [cs.CV].
  20. Gao, Rui; Hou, Xingsong; Qin, Jie; Chen, Jiaxin; Liu, Li; Zhu, Fan; Zhang, Zhao; Shao, Ling (2020). "Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning". IEEE Transactions on Image Processing. 29: 3665–3680. Bibcode:2020ITIP...29.3665G. doi:10.1109/TIP.2020.2964429. ISSN   1941-0042. PMID   31940538. S2CID   210334032.