Inception score

Last updated May 10, 2023

The Inception Score (IS) is a metric used to assess the quality of images created by a generative image model such as a generative adversarial network (GAN).^[1] The score is calculated from the output of a separate, pretrained Inception v3 image classification model applied to a sample of images (typically around 30,000) produced by the generative model. The Inception Score is maximized when the following conditions are true:

For each generated image, the entropy of the label distribution predicted by the Inception v3 model is minimized. In other words, the classification model confidently predicts a single label for each image. Intuitively, this corresponds to the desideratum of generated images being "sharp" or "distinct".
Averaged over all generated images, the predictions of the classification model are evenly distributed across all possible labels. This corresponds to the desideratum that the output of the generative model is "diverse".^[2]

The Inception Score has been somewhat superseded by the related Fréchet inception distance (FID).^[3] While the Inception Score only evaluates the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images ("ground truth").

Definition

Let there be two spaces, the space of images $\Omega _{X}$ and the space of labels $\Omega _{Y}$. The space of labels is finite.

Let $p_{gen}$ be a probability distribution over $\Omega _{X}$ that we wish to judge.

Let a discriminator be a function of type

$$p_{dis}:\Omega _{X}\to M(\Omega _{Y})$$

where $M(\Omega _{Y})$ is the set of all probability distributions on $\Omega _{Y}$. For any image $x$ and any label $y$, let $p_{dis}(y|x)$ be the probability that image $x$ has label $y$, according to the discriminator. The discriminator is usually implemented as an Inception v3 network trained on ImageNet. The Inception Score of $p_{gen}$ relative to $p_{dis}$ is

$$IS(p_{gen},p_{dis}):=\exp \left(\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\int p_{dis}(\cdot |x)p_{gen}(x)dx\right)\right]\right)$$

Equivalent rewrites include

$$\ln IS(p_{gen},p_{dis})=\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]\right)\right]$$

$$\ln IS(p_{gen},p_{dis})=H[\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]]-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]$$

where $H$ denotes the Shannon entropy of a distribution over labels.
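The step from the KL form to the entropy form can be spelled out as follows. This is a sketch of the standard argument; the shorthand $\bar{p}$ for the marginal label distribution is notation introduced here for convenience, and sums over $y$ run over the finite label space $\Omega _{Y}$.

$$\bar{p}(y):=\mathbb {E} _{x\sim p_{gen}}[p_{dis}(y|x)]$$

$$\begin{aligned}
\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\bar{p}\right)\right]
&=\mathbb {E} _{x\sim p_{gen}}\left[\sum _{y}p_{dis}(y|x)\ln p_{dis}(y|x)\right]-\mathbb {E} _{x\sim p_{gen}}\left[\sum _{y}p_{dis}(y|x)\ln \bar{p}(y)\right]\\
&=-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]-\sum _{y}\bar{p}(y)\ln \bar{p}(y)\\
&=H[\bar{p}]-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]
\end{aligned}$$

The second line uses the fact that $\ln \bar{p}(y)$ does not depend on $x$, so the expectation over $x$ acts only on $p_{dis}(y|x)$ and yields $\bar{p}(y)$. The resulting quantity is the mutual information between an image drawn from $p_{gen}$ and the label assigned to it by the discriminator.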

$\ln IS$ is nonnegative by Jensen's inequality.

Pseudocode:

INPUT discriminator $p_{dis}$.
INPUT generator $g$.
Sample images $x_{i}$ from the generator.
Compute $p_{dis}(\cdot |x_{i})$, the probability distribution over labels conditional on image $x_{i}$.
Average the results to obtain ${\hat {p}}$, an empirical estimate of $\int p_{dis}(\cdot |x)p_{gen}(x)dx$.
Sample more images $x_{i}$ from the generator, and for each, compute $D_{KL}\left(p_{dis}(\cdot |x_{i})\|{\hat {p}}\right)$.
Average the results, and take the exponential.
RETURN the result.
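The pseudocode can be sketched in NumPy as shown below. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `inception_score` and the `eps` parameter are hypothetical, the input is an array of label probabilities that has already been computed by some classifier (the Inception v3 network itself is not run here), and the same sample is reused for both the marginal estimate and the KL terms instead of drawing a second sample.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Estimate the Inception Score from an (n_images, n_labels) array.

    Row i of `probs` plays the role of p_dis(. | x_i) for generated image x_i.
    Assumption: the same sample is used for p_hat and for the KL average.
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Empirical marginal label distribution p_hat: the mean over images.
    p_hat = probs.mean(axis=0, keepdims=True)
    # Clip before taking logs so that entries with probability 0 contribute 0
    # to the sum instead of producing NaN.
    log_ratio = np.log(np.clip(probs, eps, None)) - np.log(np.clip(p_hat, eps, None))
    # KL(p_dis(. | x_i) || p_hat) for each image.
    kl = np.sum(probs * log_ratio, axis=1)
    # Average the KL terms and exponentiate.
    return float(np.exp(kl.mean()))

# Usage with synthetic predictions: softmax of random logits over 10 labels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
score = inception_score(probs)  # lies between 1 and the number of labels
```

In a real evaluation the rows of `probs` would come from the softmax output of the pretrained classifier applied to generated images.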

Interpretation

A higher inception score is interpreted as "better", as it means that $p_{gen}$ is a "sharp and distinct" collection of pictures.

$\ln IS(p_{gen},p_{dis})\in [0,\ln N]$, where $N$ is the total number of possible labels.
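One way to see both bounds is through the entropy form of $\ln IS$. The chain below is a sketch of that argument, writing $\bar{p}:=\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]$ as shorthand (notation assumed here for brevity):

$$0\leq \ln IS(p_{gen},p_{dis})=H[\bar{p}]-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]\leq H[\bar{p}]\leq \ln N$$

The first inequality holds because $\ln IS$ is an average of KL divergences, each of which is nonnegative. The second holds because entropy is nonnegative, and the third because the entropy of a distribution over $N$ labels is at most $\ln N$, with equality only for the uniform distribution.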

$\ln IS(p_{gen},p_{dis})=0$ if and only if, for almost all $x\sim p_{gen}$,

$$p_{dis}(\cdot |x)=\int p_{dis}(\cdot |x)p_{gen}(x)dx$$

That means $p_{gen}$ is completely "indistinct". That is, for any image $x$ sampled from $p_{gen}$, the discriminator returns exactly the same label predictions $p_{dis}(\cdot |x)$.

The highest inception score $N$ is achieved if and only if both of the following conditions hold:

For almost all $x\sim p_{gen}$, the distribution $p_{dis}(y|x)$ is concentrated on one label. That is, $H_{y}[p_{dis}(y|x)]=0$. In other words, every image sampled from $p_{gen}$ is assigned a single label with complete confidence by the discriminator.
For every label $y$, the proportion of generated images labelled as $y$ is exactly $\mathbb {E} _{x\sim p_{gen}}[p_{dis}(y|x)]={\frac {1}{N}}$. That is, the generated images are equally distributed over all labels.
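The extreme cases described in this section can be checked numerically. The sketch below is an illustration under assumptions: the helper names `entropy` and `log_inception_score` are hypothetical, the prediction matrices are built by hand instead of coming from a real classifier, and the score is computed through the entropy form of $\ln IS$.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    p = np.asarray(p, dtype=np.float64)
    # Replace zeros by 1 inside the log so those terms contribute exactly 0.
    logp = np.log(np.where(p > 0, p, 1.0))
    return -np.sum(p * logp, axis=axis)

def log_inception_score(probs):
    """ln IS = H[mean prediction] - mean of the per-image entropies."""
    probs = np.asarray(probs, dtype=np.float64)
    return float(entropy(probs.mean(axis=0)) - entropy(probs, axis=1).mean())

N = 5
# Every image receives the same prediction: completely "indistinct".
indistinct = np.full((10, N), 1.0 / N)
# One confidently classified image per label: both conditions hold.
ideal = np.eye(N)
# Confident predictions, but all for the same label: no diversity.
collapsed = np.tile(np.eye(N)[0], (10, 1))

scores = {
    name: float(np.exp(log_inception_score(p)))
    for name, p in [("indistinct", indistinct), ("ideal", ideal), ("collapsed", collapsed)]
}
```

Up to floating-point rounding, the indistinct and collapsed cases give a score of 1 and the ideal case gives a score of $N$, which shows that confident predictions alone are not enough: the labels must also be spread evenly.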


References

[1] Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi (2016). "Improved Techniques for Training GANs". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc. arXiv:1606.03498.
[2] Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209. doi:10.1016/j.neunet.2021.07.019. PMID 34500257. S2CID 231698782.
[3] Borji, Ali (2022). "Pros and cons of GAN evaluation measures: New developments". Computer Vision and Image Understanding. 215: 103329. arXiv:2103.09396. doi:10.1016/j.cviu.2021.103329. S2CID 232257836.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.


