Evidence lower bound

In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound [1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.

The ELBO is useful because it provides a guarantee on the worst-case for the log-likelihood of some distribution (e.g. $p_\theta(X)$) which models a set of data. The actual log-likelihood may be higher (indicating an even better fit to the distribution) because the ELBO includes a Kullback-Leibler divergence (KL divergence) term which decreases the ELBO due to an internal part of the model being inaccurate despite good fit of the model overall. Thus improving the ELBO score indicates either improving the likelihood of the model $p_\theta(X)$ or the fit of a component internal to the model, or both, and the ELBO score makes a good loss function, e.g., for training a deep neural network to improve both the model overall and the internal component. (The internal component is $q_\phi(Z\mid X)$, defined in detail later in this article.)

Definition

Let $X$ and $Z$ be random variables, jointly distributed with distribution $p_\theta$. For example, $p_\theta(X)$ is the marginal distribution of $X$, and $p_\theta(Z\mid X)$ is the conditional distribution of $Z$ given $X$. Then, for a sample $x \sim p_\theta$ and any distribution $q_\phi$, the ELBO is defined as
$$\operatorname{ELBO}(\phi, \theta; x) := \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right].$$

The ELBO can equivalently be written as [2]
$$\operatorname{ELBO}(\phi, \theta; x) = \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}[\ln p_\theta(x, z)] + H[q_\phi(\cdot\mid x)]$$
$$\phantom{\operatorname{ELBO}(\phi, \theta; x)} = \ln p_\theta(x) - D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p_\theta(\cdot\mid x)\big).$$

In the first line, $H[q_\phi(\cdot\mid x)]$ is the entropy of $q_\phi(\cdot\mid x)$, which relates the ELBO to the Helmholtz free energy. [3] In the second line, $\ln p_\theta(x)$ is called the evidence for $x$, and $D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p_\theta(\cdot\mid x)\big)$ is the Kullback-Leibler divergence between $q_\phi$ and $p_\theta$. Since the Kullback-Leibler divergence is non-negative, the ELBO forms a lower bound on the evidence (ELBO inequality):
$$\ln p_\theta(x) \ge \operatorname{ELBO}(\phi, \theta; x).$$
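A small numerical check can make the inequality concrete. The Python sketch below is not part of the article; the toy model ($z \sim \mathcal{N}(0,1)$, $x\mid z \sim \mathcal{N}(z,1)$), the variational family, and all names are assumptions chosen so that the exact evidence and posterior are available in closed form.

```python
# Toy check of the ELBO inequality (illustrative model, not from the article):
# prior z ~ N(0,1), likelihood x|z ~ N(z,1), so the evidence is p(x) = N(x; 0, 2)
# and the true posterior is p(z|x) = N(x/2, 1/2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3  # an observed data point

def elbo(q_mean, q_std, n_samples=100_000):
    """Monte Carlo estimate of E_{z~q}[ln p(x, z) - ln q(z|x)]."""
    z = rng.normal(q_mean, q_std, size=n_samples)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)  # ln p(z) + ln p(x|z)
    log_q = norm.logpdf(z, q_mean, q_std)
    return np.mean(log_joint - log_q)

log_evidence = norm.logpdf(x, 0.0, np.sqrt(2.0))      # exact ln p(x)
print(elbo(0.0, 1.0), "<=", log_evidence)             # a loose q gives a strict bound
print(elbo(x / 2, np.sqrt(0.5)), "~=", log_evidence)  # q = true posterior closes the gap
```

Choosing $q$ equal to the true posterior makes the bound tight, in line with the KL-divergence term above.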

Motivation

Variational Bayesian inference

Suppose we have an observable random variable $X$, and we want to find its true distribution $p^*$. This would allow us to generate data by sampling, and estimate probabilities of future events. In general, it is impossible to find $p^*$ exactly, forcing us to search for a good approximation.

That is, we define a sufficiently large parametric family $\{p_\theta\}_{\theta\in\Theta}$ of distributions, then solve for $\min_\theta L(p_\theta, p^*)$ for some loss function $L$. One possible way to solve this is by considering a small variation from $p_\theta$ to $p_{\theta+\delta\theta}$, and solving for $L(p_\theta, p^*) - L(p_{\theta+\delta\theta}, p^*) = 0$. This is a problem in the calculus of variations, thus it is called the variational method.

Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider implicitly parametrized probability distributions:

  1. First, define a simple distribution $p(z)$ over a latent random variable $Z$; usually a normal or uniform distribution suffices.
  2. Next, define a family of complicated functions $f_\theta$ (such as a deep neural network) parametrized by $\theta$.
  3. Finally, define a way to convert any $f_\theta(z)$ into a simple distribution over the observable random variable $X$ (for example, by letting the outputs of $f_\theta(z)$ be the mean and log-variance of a normal distribution over $X$).

This defines a family of joint distributions $p_\theta$ over $(X, Z)$. It is very easy to sample $(x, z) \sim p_\theta$: simply sample $z \sim p$, then compute $f_\theta(z)$, and finally sample $x \sim p_\theta(\cdot\mid z)$ using $f_\theta(z)$.

In other words, we have a generative model for both the observable and the latent.
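A minimal Python sketch of this sampling procedure is given below. The prior, the map $f_\theta$ (a small hand-written nonlinearity standing in for a neural network), and the Gaussian observation model are all illustrative assumptions.

```python
# Sketch of ancestral sampling from the implicitly parametrized model described above.
# Assumed choices for illustration: p(z) is a standard normal prior, and f_theta returns
# the mean and log-variance of a normal distribution over x.
import numpy as np

rng = np.random.default_rng(0)
theta = {"w": 1.5, "b": 0.2, "v": -0.7, "c": -1.0}  # hypothetical parameters

def f_theta(z):
    """Map a latent value to (mean, log_variance) of p_theta(x | z)."""
    h = np.tanh(z)
    return theta["w"] * h + theta["b"], theta["v"] * h + theta["c"]

def sample_joint():
    z = rng.standard_normal()                    # 1. sample z ~ p(z)
    mean, log_var = f_theta(z)                   # 2. compute f_theta(z)
    x = rng.normal(mean, np.exp(0.5 * log_var))  # 3. sample x ~ p_theta(. | z)
    return x, z

samples = [sample_joint() for _ in range(5)]
print(samples)
```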

Now, we consider a distribution $p_\theta$ good if it is a close approximation of $p^*$:
$$p_\theta(X) \approx p^*(X),$$

since the distribution $p^*$ on the right side is over $X$ only, the distribution $p_\theta(X)$ on the left side must marginalize the latent variable $Z$ away.

In general, it's impossible to perform the integral $p_\theta(x) = \int p_\theta(x\mid z)\, p(z)\, dz$, forcing us to perform another approximation.

Since $p_\theta(x) = \frac{p_\theta(x\mid z)\, p(z)}{p_\theta(z\mid x)}$ (Bayes' rule), it suffices to find a good approximation of $p_\theta(z\mid x)$. So define another distribution family $q_\phi(z\mid x)$ and use it to approximate $p_\theta(z\mid x)$. This is a discriminative model for the latent.

The entire situation is summarized in the following table:

|  | $x$: observable | $z$: latent |
|---|---|---|
| easy to evaluate | $p_\theta(x\mid z)$ | $p(z)$, $q_\phi(z\mid x)$ |
| only approximable | $p_\theta(x)$ | $p_\theta(z\mid x)$ |

In Bayesian language, $x$ is the observed evidence, and $z$ is the latent/unobserved variable. The distribution $p$ over $z$ is the prior distribution over $z$, $p_\theta(x\mid z)$ is the likelihood function, and $p_\theta(z\mid x)$ is the posterior distribution over $z$.

Given an observation $x$, we can infer what $z$ likely gave rise to $x$ by computing $p_\theta(z\mid x)$. The usual Bayesian method is to estimate the integral $p_\theta(x) = \int p_\theta(x\mid z)\, p(z)\, dz$, then compute $p_\theta(z\mid x) = \frac{p_\theta(x\mid z)\, p(z)}{p_\theta(x)}$ by Bayes' rule. This is expensive to perform in general, but if we can simply find a good approximation $q_\phi(z\mid x) \approx p_\theta(z\mid x)$ for most $x, z$, then we can infer $z$ from $x$ cheaply. Thus, the search for a good $q_\phi$ is also called amortized inference.
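In code, "amortized" means that a single shared function, rather than a per-observation integral or optimization, produces the parameters of $q_\phi(\cdot\mid x)$. The sketch below uses a hypothetical linear-Gaussian inference model; the names and parameter values are purely illustrative.

```python
# Amortized inference sketch: one shared map from x to the parameters of q_phi(z | x),
# so inferring the latent for a new observation costs a single function evaluation
# (illustrative parameters only).
import numpy as np

phi = {"a": 0.5, "b": 0.0, "log_var": -1.0}  # parameters of the inference model

def q_params(x):
    """Return (mean, std) of the Gaussian q_phi(z | x)."""
    return phi["a"] * x + phi["b"], np.exp(0.5 * phi["log_var"])

for x in [0.3, -1.2, 4.0]:  # inference for many observations is now cheap
    print(x, q_params(x))
```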

All in all, we have found a problem of variational Bayesian inference.

Deriving the ELBO

A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:
$$\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{\mathit{KL}}\big(p^*(x)\,\|\,p_\theta(x)\big),$$

where $H(p^*) = -\mathbb{E}_{x\sim p^*}[\ln p^*(x)]$ is the entropy of the true distribution. So if we can maximize $\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)]$, we can minimize $D_{\mathit{KL}}\big(p^*(x)\,\|\,p_\theta(x)\big)$, and consequently find an accurate approximation $p_\theta \approx p^*$. To maximize $\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)]$, we simply sample many $x_i \sim p^*(x)$, i.e. use importance sampling
$$N \max_\theta \mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)] \approx \max_\theta \sum_i \ln p_\theta(x_i),$$

where $N$ is the number of samples drawn from the true distribution. This approximation can be seen as overfitting. [note 1] In order to maximize $\sum_i \ln p_\theta(x_i)$, it's necessary to find $\ln p_\theta(x)$:
$$\ln p_\theta(x) = \ln \int p_\theta(x\mid z)\, p(z)\, dz.$$

This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:
$$\int p_\theta(x\mid z)\, p(z)\, dz = \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right],$$

where $q_\phi(z\mid x)$ is a sampling distribution over $z$ that we use to perform the Monte Carlo integration. So we see that if we sample $z\sim q_\phi(\cdot\mid x)$, then $\frac{p_\theta(x,z)}{q_\phi(z\mid x)}$ is an unbiased estimator of $p_\theta(x)$. Unfortunately, this does not give us an unbiased estimator of $\ln p_\theta(x)$, because $\ln$ is nonlinear. Indeed, we have by Jensen's inequality,
$$\ln p_\theta(x) = \ln \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right] \ge \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right].$$
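The bias is easy to see numerically. The sketch below reuses the conjugate Gaussian toy model from the Definition section (an assumption for illustration): the importance weight averages to $p_\theta(x)$, but the average of its logarithm falls below $\ln p_\theta(x)$.

```python
# Importance-weight estimator: unbiased for p_theta(x), biased downwards for ln p_theta(x).
# Toy model (assumed for illustration): z ~ N(0,1), x|z ~ N(z,1), q_phi(z|x) = N(0,1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = 1.3
q_mean, q_std = 0.0, 1.0

def importance_weights(n):
    z = rng.normal(q_mean, q_std, size=n)
    log_w = (norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)  # ln p(x, z)
             - norm.logpdf(z, q_mean, q_std))                   # - ln q(z|x)
    return np.exp(log_w)

p_x = norm.pdf(x, 0.0, np.sqrt(2.0))       # exact evidence
w = importance_weights(200_000)
print(w.mean(), "~=", p_x)                 # unbiased estimate of p_theta(x)
print(np.log(w).mean(), "<", np.log(p_x))  # the log-estimator is biased downwards
```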

In fact, all the obvious estimators of $\ln p_\theta(x)$ are biased downwards, because no matter how many samples of $z_i\sim q_\phi(\cdot\mid x)$ we take, we have by Jensen's inequality:
$$\mathbb{E}_{z_i\sim q_\phi(\cdot\mid x)}\!\left[\ln\!\left(\frac{1}{N}\sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i\mid x)}\right)\right] \le \ln\!\left(\mathbb{E}_{z_i\sim q_\phi(\cdot\mid x)}\!\left[\frac{1}{N}\sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i\mid x)}\right]\right) = \ln p_\theta(x).$$

Subtracting the right side, we see that the problem comes down to a biased estimator of zero:
$$\mathbb{E}_{z_i\sim q_\phi(\cdot\mid x)}\!\left[\ln\!\left(\frac{1}{N}\sum_i \frac{p_\theta(z_i\mid x)}{q_\phi(z_i\mid x)}\right)\right] \le 0.$$

At this point, we could branch off towards the development of an importance-weighted autoencoder, [note 2] but we will instead continue with the simplest case, $N = 1$:
$$\ln p_\theta(x) \ge \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right].$$

The tightness of the inequality has a closed form:
$$\ln p_\theta(x) - \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right] = D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p_\theta(\cdot\mid x)\big) \ge 0.$$
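Spelled out (an added intermediate step), the identity follows directly from the factorization $p_\theta(x, z) = p_\theta(z\mid x)\, p_\theta(x)$:
$$\ln p_\theta(x) - \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right] = \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{p_\theta(x)\, q_\phi(z\mid x)}{p_\theta(x, z)}\right] = \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{q_\phi(z\mid x)}{p_\theta(z\mid x)}\right] = D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p_\theta(\cdot\mid x)\big).$$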

We have thus obtained the ELBO function:
$$\operatorname{ELBO}(\phi, \theta; x) := \ln p_\theta(x) - D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p_\theta(\cdot\mid x)\big).$$

Maximizing the ELBO

For fixed $x$, the optimization $\max_{\phi,\theta} \operatorname{ELBO}(\phi, \theta; x)$ simultaneously attempts to maximize $\ln p_\theta(x)$ and minimize $D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p_\theta(\cdot\mid x)\big)$. If the parametrizations for $p_\theta$ and $q_\phi$ are flexible enough, we would obtain some $\hat\phi, \hat\theta$ such that we have simultaneously
$$\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \quad q_{\hat\phi}(\cdot\mid x) \approx p_{\hat\theta}(\cdot\mid x).$$

Since
$$\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{\mathit{KL}}\big(p^*(x)\,\|\,p_\theta(x)\big),$$

we have
$$\max_\theta \mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - \min_\theta D_{\mathit{KL}}\big(p^*(x)\,\|\,p_\theta(x)\big),$$

and so
$$\hat\theta \approx \arg\min_\theta D_{\mathit{KL}}\big(p^*(x)\,\|\,p_\theta(x)\big).$$

In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model $p_{\hat\theta} \approx p^*$ and an accurate discriminative model $q_{\hat\phi}(\cdot\mid x) \approx p_{\hat\theta}(\cdot\mid x)$. [5]

Main forms

The ELBO has many possible expressions, each with some different emphasis.

$$\operatorname{ELBO}(\phi, \theta; x) = \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right]$$

This form shows that if we sample $z\sim q_\phi(\cdot\mid x)$, then $\ln\frac{p_\theta(x, z)}{q_\phi(z\mid x)}$ is an unbiased estimator of the ELBO.

$$\operatorname{ELBO}(\phi, \theta; x) = \ln p_\theta(x) - D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p_\theta(\cdot\mid x)\big)$$

This form shows that the ELBO is a lower bound on the evidence $\ln p_\theta(x)$, and that maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL-divergence from $p_\theta(\cdot\mid x)$ to $q_\phi(\cdot\mid x)$.

$$\operatorname{ELBO}(\phi, \theta; x) = \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}[\ln p_\theta(x\mid z)] - D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p(\cdot)\big)$$

This form shows that maximizing the ELBO simultaneously attempts to keep $q_\phi(\cdot\mid x)$ close to the prior $p(\cdot)$ and to concentrate $q_\phi(\cdot\mid x)$ on those $z$ that maximize $\ln p_\theta(x\mid z)$. That is, the approximate posterior $q_\phi(\cdot\mid x)$ balances between staying close to the prior $p(\cdot)$ and moving towards the maximum likelihood $\arg\max_z \ln p_\theta(x\mid z)$.

$$\operatorname{ELBO}(\phi, \theta; x) = \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}[\ln p_\theta(x, z)] + H[q_\phi(\cdot\mid x)]$$

This form shows that maximizing the ELBO simultaneously attempts to keep the entropy of $q_\phi(\cdot\mid x)$ high and to concentrate $q_\phi(\cdot\mid x)$ on those $z$ that maximize $\ln p_\theta(x, z)$. That is, the approximate posterior $q_\phi(\cdot\mid x)$ balances between being a uniform distribution and moving towards the maximum a posteriori $\arg\max_z \ln p_\theta(x, z)$.
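These forms are algebraic rearrangements of one another. Written out as one added chain, using the factorizations $p_\theta(x, z) = p_\theta(x\mid z)\, p(z) = p_\theta(z\mid x)\, p_\theta(x)$:
$$\mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right] = \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}[\ln p_\theta(x, z)] + H[q_\phi(\cdot\mid x)] = \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}[\ln p_\theta(x\mid z)] - D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p(\cdot)\big) = \ln p_\theta(x) - D_{\mathit{KL}}\big(q_\phi(\cdot\mid x)\,\|\,p_\theta(\cdot\mid x)\big).$$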

Data-processing inequality

Suppose we take $N$ independent samples from $p^*$, and collect them in the dataset $D = \{x_1, \ldots, x_N\}$; then we have the empirical distribution $q_D(x) = \frac{1}{N}\sum_i \delta_{x_i}$.


Fitting $p_\theta(x)$ to $q_D(x)$ can be done, as usual, by maximizing the loglikelihood $\ln p_\theta(D)$:
$$D_{\mathit{KL}}\big(q_D(x)\,\|\,p_\theta(x)\big) = -\frac{1}{N}\sum_i \ln p_\theta(x_i) - H(q_D) = -\frac{1}{N}\ln p_\theta(D) - H(q_D).$$

Now, by the ELBO inequality, we can bound $\ln p_\theta(D)$, and thus
$$D_{\mathit{KL}}\big(q_D(x)\,\|\,p_\theta(x)\big) \le -\frac{1}{N}\sum_i \operatorname{ELBO}(\phi, \theta; x_i) - H(q_D).$$

The right-hand side simplifies to a KL-divergence, and so we get
$$D_{\mathit{KL}}\big(q_D(x)\,\|\,p_\theta(x)\big) \le -\frac{1}{N}\sum_i \operatorname{ELBO}(\phi, \theta; x_i) - H(q_D) = D_{\mathit{KL}}\big(q_{D,\phi}(x, z)\,\|\,p_\theta(x, z)\big),$$
where $q_{D,\phi}(x, z) := q_D(x)\, q_\phi(z\mid x)$ is the joint distribution defined by the data and the inference model.
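The simplification can be written out as one added chain, using $H(q_D) = -\mathbb{E}_{x\sim q_D}[\ln q_D(x)]$:
$$-\frac{1}{N}\sum_i \operatorname{ELBO}(\phi, \theta; x_i) - H(q_D) = \mathbb{E}_{x\sim q_D}\!\left[\mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\!\left[\ln\frac{q_\phi(z\mid x)}{p_\theta(x, z)}\right] + \ln q_D(x)\right] = \mathbb{E}_{(x,z)\sim q_{D,\phi}}\!\left[\ln\frac{q_{D,\phi}(x, z)}{p_\theta(x, z)}\right] = D_{\mathit{KL}}\big(q_{D,\phi}(x, z)\,\|\,p_\theta(x, z)\big).$$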

This result can be interpreted as a special case of the data processing inequality.

In this interpretation, maximizing $\frac{1}{N}\sum_i \operatorname{ELBO}(\phi, \theta; x_i)$ is minimizing $D_{\mathit{KL}}\big(q_{D,\phi}(x, z)\,\|\,p_\theta(x, z)\big)$, which upper-bounds the real quantity of interest $D_{\mathit{KL}}\big(q_D(x)\,\|\,p_\theta(x)\big)$ via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence. [6]


References

  1. Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
  2. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "Chapter 19". Deep Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press. ISBN 978-0-262-03561-3.
  3. Hinton, Geoffrey E.; Zemel, Richard (1993). "Autoencoders, Minimum Description Length and Helmholtz Free Energy". Advances in Neural Information Processing Systems. 6. Morgan-Kaufmann.
  4. Burda, Yuri; Grosse, Roger; Salakhutdinov, Ruslan (2015-09-01). "Importance Weighted Autoencoders". arXiv:1509.00519 [stat.ML].
  5. Neal, Radford M.; Hinton, Geoffrey E. (1998). "A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants". Learning in Graphical Models. Dordrecht: Springer Netherlands. pp. 355–368. doi:10.1007/978-94-011-5014-9_12. ISBN 978-94-010-6104-9. S2CID 17947141.
  6. Kingma, Diederik P.; Welling, Max (2019-11-27). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4). Section 2.7. arXiv:1906.02691. doi:10.1561/2200000056. ISSN 1935-8237. S2CID 174802445.

Notes

  1. In fact, by Jensen's inequality,
    $$\mathbb{E}_{x_i\sim p^*(x)}\!\left[\max_\theta \frac{1}{N}\sum_i \ln p_\theta(x_i)\right] \ge \max_\theta \mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)],$$
    so the estimator is biased upwards. This can be seen as overfitting: for some finite set of sampled data $x_i$, there is usually some $\theta$ that fits them better than the entire distribution $p^*$.
  2. By the delta method, we have
    $$\mathbb{E}_{z_i\sim q_\phi(\cdot\mid x)}\!\left[\ln\!\left(\frac{1}{N}\sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i\mid x)}\right)\right] \approx \ln p_\theta(x) - \frac{1}{2N}\,\frac{\operatorname{Var}_{z\sim q_\phi(\cdot\mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right]}{p_\theta(x)^2}.$$
    If we continue with this, we would obtain the importance-weighted autoencoder. [4]