The reparameterization trick (aka "reparameterization gradient estimator") is a technique used in statistical machine learning, particularly in variational inference, variational autoencoders, and stochastic optimization. It allows for the efficient computation of gradients through random variables, enabling the optimization of parametric probability models using stochastic gradient descent and reducing the variance of gradient estimators.
It was developed in the 1980s in operations research, under the name of "pathwise gradients" or "stochastic gradients".[1][2] Its use in variational inference was proposed in 2013.[3]
Let $z$ be a random variable with distribution $q_\phi(z)$, where $\phi$ is a vector containing the parameters of the distribution.
Consider an objective function of the form:
$$L(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[f(z)]$$
Without the reparameterization trick, estimating the gradient $\nabla_\phi L(\phi)$ can be challenging, because the parameter $\phi$ appears in the random variable itself. In more detail, we have to statistically estimate:
$$\nabla_\phi L(\phi) = \nabla_\phi \int q_\phi(z)\, f(z)\, dz$$
The REINFORCE estimator, widely used in reinforcement learning and especially policy gradient,[4] uses the following equality:
$$\nabla_\phi L(\phi) = \mathbb{E}_{z \sim q_\phi(z)}\left[f(z)\, \nabla_\phi \ln q_\phi(z)\right]$$
This allows the gradient to be estimated:
$$\nabla_\phi L(\phi) \approx \frac{1}{N} \sum_{i=1}^{N} f(z_i)\, \nabla_\phi \ln q_\phi(z_i), \qquad z_i \sim q_\phi(z)$$
The REINFORCE estimator has high variance, and many methods were developed to reduce its variance.[5]
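The contrast between the two estimators can be made concrete on a toy problem. The following is a minimal PyTorch sketch (illustrative only, not taken from the references) of the REINFORCE estimator for $\nabla_\mu \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}[z^2]$, whose true value is $2\mu$:

```python
import torch

# Score-function (REINFORCE) estimate of d/dmu E_{z ~ N(mu,1)}[z^2] (true value: 2*mu).
torch.manual_seed(0)
mu = torch.tensor(1.5, requires_grad=True)
N = 100_000

dist = torch.distributions.Normal(mu, 1.0)
z = dist.sample((N,))                      # sample() is detached: no gradient path to mu
f = z ** 2
# Surrogate loss whose gradient is the REINFORCE estimate:
# the mean over samples of f(z) * grad_mu log q(z; mu).
surrogate = (f * dist.log_prob(z)).mean()
surrogate.backward()
print(mu.grad)                             # close to 3.0 (= 2 * mu), but high-variance
```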
The reparameterization trick expresses $z$ as:
$$z = g_\phi(\epsilon)$$
Here, $g_\phi$ is a deterministic function parameterized by $\phi$, and $\epsilon$ is a noise variable drawn from a fixed distribution $p(\epsilon)$. This gives:
$$L(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[f(g_\phi(\epsilon))\right]$$
Now, the gradient can be estimated as:
$$\nabla_\phi L(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_\phi f(g_\phi(\epsilon))\right] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\phi f(g_\phi(\epsilon_i)), \qquad \epsilon_i \sim p(\epsilon)$$
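For comparison, a matching pathwise (reparameterized) estimate of the same toy gradient, again as an illustrative PyTorch sketch rather than canonical code, writes $z = \mu + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ and differentiates through the sample:

```python
import torch

# Reparameterized (pathwise) estimate of d/dmu E_{z ~ N(mu,1)}[z^2] (true value: 2*mu).
torch.manual_seed(0)
mu = torch.tensor(1.5, requires_grad=True)
N = 100_000

eps = torch.randn(N)          # noise from a fixed distribution, independent of mu
z = mu + eps                  # deterministic, differentiable function of mu and eps
loss = (z ** 2).mean()
loss.backward()
print(mu.grad)                # close to 3.0, typically with much lower variance
```

In PyTorch, this pathwise sampler is exposed as the `rsample()` method of distribution objects, as opposed to the detached `sample()`.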
For some common distributions, the reparameterization trick takes specific forms:
Normal distribution: For $z \sim \mathcal{N}(\mu, \sigma^2)$, we can use:
$$z = \mu + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$$
Exponential distribution: For $z \sim \mathrm{Exp}(\lambda)$, we can use:
$$z = -\frac{1}{\lambda} \ln(\epsilon), \qquad \epsilon \sim \mathrm{Uniform}(0, 1)$$
Discrete distributions can be reparameterized via the Gumbel distribution (the Gumbel-softmax trick, or "concrete distribution").[6]
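The sampling forms above translate directly into code. The following PyTorch sketch is illustrative only; the parameter tensors (`mu`, `log_sigma`, `rate`, `logits`) and the temperature are hypothetical placeholders standing in for quantities that a model would normally produce:

```python
import torch
import torch.nn.functional as F

mu = torch.zeros(3, requires_grad=True)
log_sigma = torch.zeros(3, requires_grad=True)
rate = torch.ones(3, requires_grad=True)
logits = torch.zeros(4, requires_grad=True)

# Normal(mu, sigma^2):  z = mu + sigma * eps,  eps ~ N(0, 1)
z_normal = mu + log_sigma.exp() * torch.randn(3)

# Exponential(rate):  z = -ln(u) / rate,  u ~ Uniform(0, 1)  (inverse-CDF form)
z_exp = -torch.log(torch.rand(3)) / rate

# Gumbel-softmax ("concrete") relaxation of a categorical distribution:
#   y = softmax((logits + g) / tau),  g ~ Gumbel(0, 1),  temperature tau > 0
tau = 0.5
g = -torch.log(-torch.log(torch.rand(4)))
z_categorical = F.softmax((logits + g) / tau, dim=-1)
```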
In general, any distribution that is differentiable with respect to its parameters can be reparameterized by inverting its multivariable CDF and then applying implicit differentiation. See [1] for an exposition and applications to the Gamma, Beta, Dirichlet, and von Mises distributions.
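As a brief sketch of the one-dimensional case of this implicit approach (assuming $q_\phi$ has a differentiable CDF $F_\phi$): a sample can be characterized implicitly by $F_\phi(z) = \epsilon$ with $\epsilon \sim \mathrm{Uniform}(0, 1)$, and differentiating both sides with respect to $\phi$ gives
$$\nabla_\phi F_\phi(z) + q_\phi(z)\, \nabla_\phi z = 0 \quad\Longrightarrow\quad \nabla_\phi z = -\frac{\nabla_\phi F_\phi(z)}{q_\phi(z)},$$
so pathwise gradients can be obtained without explicitly inverting the CDF.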
In Variational Autoencoders (VAEs), the VAE objective function, known as the Evidence Lower Bound (ELBO), is given by:
$$\mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\ln p_\theta(x|z) + \ln p(z) - \ln q_\phi(z|x)\right]$$
where $q_\phi(z|x)$ is the encoder (recognition model), $p_\theta(x|z)$ is the decoder (generative model), and $p(z)$ is the prior distribution over latent variables. The gradient of ELBO with respect to $\theta$ is simply
$$\nabla_\theta \mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\nabla_\theta \ln p_\theta(x|z)\right]$$
but the gradient with respect to $\phi$ requires the trick. Express the sampling operation $z \sim q_\phi(z|x)$ as:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the outputs of the encoder network, and $\odot$ denotes element-wise multiplication. Then we have
$$\nabla_\phi \mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\nabla_\phi \left(\ln p_\theta(x|z) + \ln p(z) - \ln q_\phi(z|x)\right)\right]$$
where $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$. This allows us to estimate the gradient using Monte Carlo sampling:
$$\nabla_\phi \mathrm{ELBO}(\phi, \theta) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi \left(\ln p_\theta(x|z_l) + \ln p(z_l) - \ln q_\phi(z_l|x)\right)$$
where $z_l = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_l$ and $\epsilon_l \sim \mathcal{N}(0, I)$ for $l = 1, \dots, L$.
This formulation enables backpropagation through the sampling process, allowing for end-to-end training of the VAE model using stochastic gradient descent or its variants.
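The sampling step and a diagonal-Gaussian ELBO can be sketched as follows in PyTorch; the function names, tensor shapes, and the closed-form KL term for a standard normal prior are illustrative assumptions rather than the article's notation:

```python
import torch

def sample_latent(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Reparameterized sample z = mu + sigma * eps, eps ~ N(0, I); gradients reach mu and log_var."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def elbo(recon_log_lik: torch.Tensor, mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Per-example ELBO = E_q[ln p_theta(x|z)] - KL(q_phi(z|x) || N(0, I)),
    using the closed-form KL divergence between diagonal Gaussians."""
    kl = 0.5 * torch.sum(mu ** 2 + log_var.exp() - 1.0 - log_var, dim=-1)
    return recon_log_lik - kl
```

Maximizing a Monte Carlo estimate of this objective with a stochastic optimizer trains encoder and decoder jointly, since backpropagation passes through `sample_latent`.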
More generally, the trick allows using stochastic gradient descent for variational inference. Let the variational objective (ELBO) be of the form:
$$\mathrm{ELBO}(\phi) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\ln p(x, z) - \ln q_\phi(z|x)\right]$$
Using the reparameterization trick, we can estimate the gradient of this objective with respect to $\phi$:
$$\nabla_\phi \mathrm{ELBO}(\phi) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi \left[\ln p(x, g_\phi(\epsilon_l, x)) - \ln q_\phi(g_\phi(\epsilon_l, x) \mid x)\right], \qquad \epsilon_l \sim p(\epsilon)$$
The reparameterization trick has been applied to reduce the variance in dropout, a regularization technique in neural networks. The original dropout can be reparameterized with Bernoulli distributions:
$$y = (W \odot \epsilon)\, x, \qquad \epsilon_{ij} \sim \mathrm{Bernoulli}(1 - p_i)$$
where $W$ is the weight matrix, $x$ is the input, and $p_i$ are the (fixed) dropout rates.
More generally, distributions other than the Bernoulli distribution can be used, such as Gaussian noise:
$$y_i = \mu_i + \sigma_i \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1)$$
with $\mu_i$ and $\sigma_i^2$ being the mean and variance of the $i$-th output neuron. The reparameterization trick can be applied to all such cases, resulting in the variational dropout method.[7]
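A small illustrative PyTorch sketch of the two noise models for a single linear layer (the tensor shapes and the Gaussian-noise parameterization $\alpha = p/(1-p)$ are assumptions made for the example, not prescribed by the article):

```python
import torch

torch.manual_seed(0)
W = torch.randn(32, 64, requires_grad=True)   # weight matrix
x = torch.randn(64)                           # input
p = 0.5                                       # dropout rate

# Bernoulli (inverted) dropout: each weight is kept with probability 1 - p and rescaled,
# so the multiplicative noise has mean 1 and variance p / (1 - p).
mask = torch.bernoulli(torch.full_like(W, 1.0 - p)) / (1.0 - p)
y_bernoulli = (W * mask) @ x

# Gaussian dropout: continuous multiplicative noise with the same mean and variance,
# reparameterized as 1 + sqrt(alpha) * eps, with eps ~ N(0, 1) and alpha = p / (1 - p).
alpha = p / (1.0 - p)
eps = torch.randn_like(W)
y_gaussian = (W * (1.0 + alpha ** 0.5 * eps)) @ x
```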