Importance sampling

Importance sampling is a Monte Carlo method for evaluating properties of a particular distribution while only having samples generated from a distribution different from the distribution of interest. Its introduction in statistics is generally attributed to a paper by Teun Kloek and Herman K. van Dijk in 1978, [1] but its precursors can be found in statistical physics as early as 1949. [2] [3] Importance sampling is also related to umbrella sampling in computational physics. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.

Basic theory

Let X: Ω → ℝ be a random variable in some probability space (Ω, F, P). We wish to estimate the expected value of X under P, denoted E[X;P]. If we have statistically independent random samples x_1, …, x_n, generated according to P, then an empirical estimate of E[X;P] is

$$\widehat{\mathbf{E}}_{n}[X;P] = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{where } x_i \sim P,$$

and the precision of this estimate depends on the variance of X:

$$\operatorname{var}\bigl[\widehat{\mathbf{E}}_{n};P\bigr] = \frac{\operatorname{var}[X;P]}{n}.$$
The basic idea of importance sampling is to sample the states from a different distribution to lower the variance of the estimation of E[X;P], or when sampling from P is difficult. This is accomplished by first choosing a random variable L ≥ 0 such that E[L;P] = 1 and such that L(ω) ≠ 0 P-almost everywhere. With the variable L we define a probability P^(L) that satisfies

$$\mathbf{E}[X;P] = \mathbf{E}\!\left[\frac{X}{L};P^{(L)}\right].$$

The variable X/L will thus be sampled under P^(L) to estimate E[X;P] as above, and this estimation is improved when var[X/L; P^(L)] < var[X; P].

When X is of constant sign over Ω, the best variable L would clearly be L* = X/E[X;P] ≥ 0, so that X/L* is the searched constant E[X;P] and a single sample under P^(L*) suffices to give its value. Unfortunately we cannot take that choice, because E[X;P] is precisely the value we are looking for! However, this theoretical best case L* gives us an insight into what importance sampling does:

$$\forall a\in\mathbb{R},\quad P^{(L^*)}\bigl(X\in[a;a+da]\bigr) = \int_{\omega\in\{X\in[a;a+da]\}} \frac{X(\omega)}{\mathbf{E}[X;P]}\,P(d\omega) = \frac{1}{\mathbf{E}[X;P]}\,a\,P\bigl(X\in[a;a+da]\bigr);$$

to the right, a P(X ∈ [a; a+da]) is one of the infinitesimal elements that sum up to E[X;P]:

$$\mathbf{E}[X;P] = \int_{a=-\infty}^{+\infty} a\,P\bigl(X\in[a;a+da]\bigr);$$

therefore, a good probability change P^(L) in importance sampling will redistribute the law of X so that its samples' frequencies are sorted directly according to their weights in E[X;P]. Hence the name "importance sampling."

Importance sampling is often used as a Monte Carlo integrator. When P is the uniform distribution on an interval Ω and X: Ω → ℝ, E[X;P] is proportional to the integral of the real function X over Ω.
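As a concrete illustration of this use (a sketch, not from the article; the integrand x^4 and the Beta(4, 1) proposal are arbitrary choices made for the example), the following Python snippet estimates the integral of x^4 over [0, 1], which equals 0.2, first by plain Monte Carlo under the uniform distribution and then by importance sampling with a proposal whose density 4x^3 concentrates samples where the integrand is large.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Plain Monte Carlo: sample under P (uniform on [0, 1]) and average the integrand.
u = rng.uniform(0.0, 1.0, n)
plain = u**4

# Importance sampling: sample from a Beta(4, 1) proposal with density q(x) = 4 x^3,
# and weight each sample by p(x) / q(x) = 1 / (4 x^3).
x = rng.beta(4.0, 1.0, n)
weighted = x**4 / (4.0 * x**3)          # equals x / 4, which varies far less than x^4

print("plain MC:", plain.mean(), "+/-", plain.std(ddof=1) / np.sqrt(n))
print("IS      :", weighted.mean(), "+/-", weighted.std(ddof=1) / np.sqrt(n))
```

Both estimates converge to 0.2, but the weighted samples x/4 have a much smaller variance than the plain samples x^4, which shows up as a smaller standard error.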

Application to probabilistic inference

Such methods are frequently used to estimate posterior densities or expectations in state and/or parameter estimation problems in probabilistic models that are too hard to treat analytically. Examples include Bayesian networks and importance weighted variational autoencoders. [4]

Application to simulation

Importance sampling is a variance reduction technique that can be used in the Monte Carlo method. The idea behind importance sampling is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. If these "important" values are emphasized by sampling more frequently, then the estimator variance can be reduced. Hence, the basic methodology in importance sampling is to choose a distribution which "encourages" the important values. This use of "biased" distributions will result in a biased estimator if it is applied directly in the simulation. However, the simulation outputs are weighted to correct for the use of the biased distribution, and this ensures that the new importance sampling estimator is unbiased. The weight is given by the likelihood ratio, that is, the Radon–Nikodym derivative of the true underlying distribution with respect to the biased simulation distribution.

The fundamental issue in implementing importance sampling simulation is the choice of the biased distribution which encourages the important regions of the input variables. Choosing or designing a good biased distribution is the "art" of importance sampling. The rewards for a good distribution can be huge run-time savings; the penalty for a bad distribution can be longer run times than for a general Monte Carlo simulation without importance sampling.

Consider X to be the sample and f(X)/g(X) to be the likelihood ratio, where f is the probability density (mass) function of the desired distribution and g is the probability density (mass) function of the biased/proposal/sample distribution. Then the problem can be characterized by choosing the sample distribution g that minimizes the variance of the scaled sample:

$$g^{*} = \min_{g}\,\operatorname{var}_{g}\!\left(X\,\frac{f(X)}{g(X)}\right).$$

It can be shown that the following distribution minimizes the above variance: [5]

$$g^{*}(X) = \frac{|X|\,f(X)}{\int |x|\,f(x)\,dx}.$$

Notice that when X ≥ 0, this variance becomes 0.
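The zero-variance property can be made concrete with a small sketch (the choice of f as the Exponential(1) density and X as the identity is illustrative, not from the article). For X ~ Exponential(1) with f(x) = exp(-x) and E[X] = 1, the optimal proposal g*(x) = x exp(-x) is the Gamma(shape = 2, scale = 1) density, and every weighted sample X f(X)/g*(X) collapses to the constant E[X].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

# Samples from the optimal proposal g*(x) = x * exp(-x) = Gamma(shape=2, scale=1).
x = rng.gamma(shape=2.0, scale=1.0, size=n)

f = np.exp(-x)              # desired density f(x) of Exponential(1)
g = x * np.exp(-x)          # optimal proposal density g*(x)
scaled = x * f / g          # X * f(X) / g*(X)

print(scaled)               # up to floating point, every entry equals E[X] = 1
```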

Mathematical approach

Consider estimating by simulation the probability p_t of an event {X ≥ t}, where X is a random variable with cumulative distribution function F and probability density function f(x) = F′(x), where prime denotes derivative. A K-length independent and identically distributed (i.i.d.) sequence X_i is generated from the distribution F, and the number k_t of random variables that lie above the threshold t are counted. The random variable k_t is characterized by the binomial distribution

$$P(k_t = k) = \binom{K}{k}\,p_t^{\,k}\,(1-p_t)^{K-k},\qquad k = 0, 1, \dots, K.$$

One can show that E[k_t/K] = p_t and var[k_t/K] = p_t(1 − p_t)/K, so in the limit K → ∞ we are able to obtain p_t. Note that the variance is low only if p_t ≈ 1. Importance sampling is concerned with the determination and use of an alternate density function f_* (for X), usually referred to as a biasing density, for the simulation experiment. This density allows the event {X ≥ t} to occur more frequently, so the sequence length K gets smaller for a given estimator variance. Alternatively, for a given K, use of the biasing density results in a variance smaller than that of the conventional Monte Carlo estimate. From the definition of p_t, we can introduce f_* as below:

$$p_t = \mathbf{E}\bigl[1(X \ge t)\bigr] = \int 1(x \ge t)\,f(x)\,dx = \int 1(x \ge t)\,\frac{f(x)}{f_*(x)}\,f_*(x)\,dx = \mathbf{E}_{*}\bigl[1(X \ge t)\,W(X)\bigr],$$

where

$$W(\cdot) \equiv \frac{f(\cdot)}{f_*(\cdot)}$$

is a likelihood ratio and is referred to as the weighting function. The last equality in the above equation motivates the estimator

$$\hat{p}_t = \frac{1}{K}\sum_{i=1}^{K} 1(X_i \ge t)\,W(X_i),\qquad X_i \sim f_*.$$

This is the importance sampling estimator of p_t and is unbiased. That is, the estimation procedure is to generate i.i.d. samples from f_*, and for each sample which exceeds t, the estimate is incremented by the weight W evaluated at the sample value. The results are averaged over K trials. The variance of the importance sampling estimator is easily shown to be

$$\operatorname{var}_{*}\hat{p}_t = \frac{1}{K}\operatorname{var}_{*}\bigl[1(X \ge t)\,W(X)\bigr] = \frac{1}{K}\Bigl(\mathbf{E}_{*}\bigl[1(X \ge t)\,W^{2}(X)\bigr] - p_t^{2}\Bigr) = \frac{1}{K}\Bigl(\mathbf{E}\bigl[1(X \ge t)\,W(X)\bigr] - p_t^{2}\Bigr).$$
The importance sampling problem then focuses on finding a biasing density f_* such that the variance of the importance sampling estimator is less than the variance of the conventional Monte Carlo estimate. A biasing density function that minimizes the variance, and under certain conditions reduces it to zero, is called an optimal biasing density function.
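The estimator above is easy to put into code. The following sketch uses an assumed setup that does not come from the article: X ~ N(0, 1), threshold t = 4 (so p_t ≈ 3.17e-5), and a biasing density obtained by translating the normal density so that its mean sits at t, which is one instance of the "translation" biasing described in the next section.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
t, K = 4.0, 100_000

def f(x):        # density of the desired distribution: standard normal
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def f_star(x):   # biasing density: the same normal translated to mean t
    return np.exp(-0.5 * (x - t)**2) / np.sqrt(2.0 * np.pi)

# Conventional Monte Carlo: with K = 1e5 samples we expect only ~3 hits.
x_mc = rng.normal(0.0, 1.0, K)
p_mc = np.mean(x_mc >= t)

# Importance sampling: draw from f_* and weight each hit by W(x) = f(x)/f_*(x).
x_is = rng.normal(t, 1.0, K)
w = f(x_is) / f_star(x_is)
p_is = np.mean((x_is >= t) * w)

print("exact      :", 0.5 * erfc(t / sqrt(2.0)))
print("plain MC   :", p_mc)
print("importance :", p_is)
```

Under the biasing density roughly half of the samples exceed t, so the importance sampling estimate is stable at a sample size where the conventional estimate is essentially noise.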

Conventional biasing methods

Although there are many kinds of biasing methods, the following two methods are most widely used in the applications of importance sampling.

Scaling

Shifting probability mass into the event region by positive scaling of the random variable with a number greater than unity has the effect of increasing the variance (and also the mean) of the density function. This results in a heavier tail of the density, leading to an increase in the event probability. Scaling is probably one of the earliest biasing methods known and has been used extensively in practice. It is simple to implement and usually provides conservative simulation gains compared to other methods.

In importance sampling by scaling, the simulation density is chosen as the density function of the scaled random variable aX, where usually a > 1 for tail probability estimation. By transformation,

$$f_*(x) = \frac{1}{a}\,f\!\left(\frac{x}{a}\right)$$

and the weighting function is

$$W(x) = a\,\frac{f(x)}{f(x/a)}.$$
While scaling shifts probability mass into the desired event region, it also pushes mass into the complementary region, which is undesirable. If X is a sum of n random variables, the spreading of mass takes place in an n-dimensional space. The consequence is a decreasing importance sampling gain for increasing n; this is called the dimensionality effect. A modern version of importance sampling by scaling is so-called sigma-scaled sampling (SSS), which runs multiple Monte Carlo (MC) analyses with different scaling factors. In contrast to many other high-yield estimation methods (such as worst-case distances, WCD), SSS does not suffer much from the dimensionality problem. Addressing multiple MC outputs also causes no degradation in efficiency. On the other hand, like WCD, SSS is only designed for Gaussian statistical variables, and unlike WCD, it is not designed to provide accurate statistical corners. Another SSS disadvantage is that MC runs with large scale factors may become difficult, e.g. due to model and simulator convergence problems. In addition, SSS faces a strong bias–variance trade-off: using large scale factors gives quite stable yield results, but the larger the scale factors, the larger the bias error. If the advantages of SSS do not matter much in the application of interest, other methods are often more efficient.
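The basic scaling recipe given above (before the SSS refinements) can be sketched as follows, under an assumed Exponential(1) model that is not taken from the article: f(x) = exp(-x), so P(X ≥ t) = exp(-t), and the scaled proposal is simply an exponential with mean a.

```python
import numpy as np

rng = np.random.default_rng(0)
t, a, K = 15.0, 15.0, 100_000     # illustrative threshold and scale factor

# Samples distributed as a*X, i.e. from f_*(x) = (1/a) f(x/a).
x = a * rng.exponential(1.0, K)

# Weighting function W(x) = f(x) / f_*(x) = a f(x) / f(x/a) = a exp(-x (1 - 1/a)).
w = a * np.exp(-x) / np.exp(-x / a)

p_is = np.mean((x >= t) * w)
print("exact      :", np.exp(-t))   # about 3.1e-7
print("importance :", p_is)
```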

Translation

Another simple and effective biasing technique employs translation of the density function (and hence the random variable) to place much of its probability mass in the rare event region. Translation does not suffer from a dimensionality effect and has been successfully used in several applications relating to simulation of digital communication systems. It often provides better simulation gains than scaling. In biasing by translation, the simulation density is given by

$$f_*(x) = f(x - c),\qquad c > 0,$$

where c is the amount of shift and is chosen to minimize the variance of the importance sampling estimator.

Effects of system complexity

The fundamental problem with importance sampling is that designing good biased distributions becomes more complicated as the system complexity increases. Complex systems are typically systems with long memory, since complex processing of only a few inputs is much easier to handle. This dimensionality or memory can cause problems in three ways:

In principle, the importance sampling ideas remain the same in these situations, but the design becomes much harder. A successful approach to combat this problem is essentially breaking down a simulation into several smaller, more sharply defined subproblems. Then importance sampling strategies are used to target each of the simpler subproblems. Examples of techniques to break the simulation down are conditioning and error-event simulation (EES) and regenerative simulation.

Evaluation of importance sampling

In order to identify successful importance sampling techniques, it is useful to be able to quantify the run-time savings due to the use of the importance sampling approach. The performance measure commonly used is the ratio σ²_MC / σ²_IS of the conventional Monte Carlo estimator variance to the importance sampling estimator variance, and this can be interpreted as the speed-up factor by which the importance sampling estimator achieves the same precision as the MC estimator. It has to be computed empirically, since the estimator variances are unlikely to be obtainable analytically when the mean itself is intractable. Other useful concepts in quantifying an importance sampling estimator are the variance bounds and the notion of asymptotic efficiency. One related measure is the so-called effective sample size (ESS). [6]
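Reference [6] studies discrepancy-based definitions of the ESS; one widely used rule of thumb (an assumption stated here for illustration, not a formula quoted from this article) computes it directly from the importance weights as ESS = (Σ w_i)² / Σ w_i², which equals the sample size when all weights are equal and approaches 1 when a single weight dominates.

```python
import numpy as np

def effective_sample_size(w):
    """Common ESS diagnostic for importance weights: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Illustration: weights from a well-matched proposal vs. a badly mismatched one.
rng = np.random.default_rng(0)
good = np.exp(rng.normal(0.0, 0.1, 1000))   # nearly uniform weights, ESS close to 1000
bad = np.exp(rng.normal(0.0, 3.0, 1000))    # a few huge weights dominate, ESS far smaller
print(effective_sample_size(good), effective_sample_size(bad))
```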

Variance cost function

Variance is not the only possible cost function for a simulation, and other cost functions, such as the mean absolute deviation, are used in various statistical applications. Nevertheless, the variance is the primary cost function addressed in the literature, probably due to the use of variances in confidence intervals and in the performance measure σ²_MC / σ²_IS.

An associated issue is that the ratio σ²_MC / σ²_IS overestimates the run-time savings due to importance sampling, since it does not include the extra computing time required to compute the weight function. Hence, some practitioners evaluate the net run-time improvement by various means. Perhaps a more serious overhead of importance sampling is the time taken to devise and program the technique and to analytically derive the desired weight function.

Multiple and adaptive importance sampling

When different proposal distributions, , are jointly used for drawing the samples different proper weighting functions can be employed (e.g., see [7] [8] [9] [10] ). In an adaptive setting, the proposal distributions, , and are updated each iteration of the adaptive importance sampling algorithm. Hence, since a population of proposal densities is used, several suitable combinations of sampling and weighting schemes can be employed. [11] [12] [13] [14] [15] [16] [17]
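One proper weighting choice with multiple proposals is the "deterministic mixture" (balance heuristic) scheme studied in [7] and [9], in which each sample is weighted against the whole mixture of proposals rather than only the proposal that generated it. The sketch below uses an illustrative Gaussian target and two Gaussian proposals chosen purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Target f = N(0, 1); proposals q1 = N(-2, 1) and q2 = N(2, 1); goal: E[X^2] = 1.
M = 50_000
x1 = rng.normal(-2.0, 1.0, M)               # M samples from q1
x2 = rng.normal(2.0, 1.0, M)                # M samples from q2
x = np.concatenate([x1, x2])

f = normal_pdf(x, 0.0, 1.0)
mixture = 0.5 * (normal_pdf(x, -2.0, 1.0) + normal_pdf(x, 2.0, 1.0))
w = f / mixture                              # deterministic-mixture weights

print("E[X^2] estimate:", np.mean(w * x**2))  # close to 1
```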

Notes

  1. Kloek, T.; van Dijk, H. K. (1978). "Bayesian Estimates of Equation System Parameters: An Application of Integration by Monte Carlo" (PDF). Econometrica . 46 (1): 1–19. doi:10.2307/1913641. JSTOR   1913641.
  2. Goertzel, G. (1949). "Quota Sampling and Importance Functions in Stochastic Solution of Particle Problems". Technical Report ORNL-434, Oak Ridge National Laboratory. AECD-2793. hdl:2027/mdp.39015086443671.
  3. Kahn, H.; Harris, T. E. (1949). "Estimation of Particle Transmission by Random Sampling". Monte Carlo Method. Applied Mathematics Series. 12. National Bureau of Standards: 27–30.
  4. Burda, Yuri; Grosse, Roger; Salakhutdinov, Ruslan (2016). "Importance Weighted Autoencoders". Proceedings of the 4th International Conference on Learning Representations (ICLR). arXiv: 1509.00519 .
  5. Rubinstein, R. Y.; Kroese, D. P. (2011). Simulation and the Monte Carlo Method (Vol. 707). John Wiley & Sons.
  6. Martino, Luca; Elvira, Víctor; Louzada, Francisco (2017). "Effective sample size for importance sampling based on discrepancy measures". Signal Processing. 131: 386–401. arXiv: 1602.03572 . doi:10.1016/j.sigpro.2016.08.025. S2CID   26317735.
  7. Veach, Eric; Guibas, Leonidas J. (1995-01-01). "Optimally combining sampling techniques for Monte Carlo rendering". Proceedings of the 22nd annual conference on Computer graphics and interactive techniques - SIGGRAPH '95. New York, NY, USA: ACM. pp.  419–428. CiteSeerX   10.1.1.127.8105 . doi:10.1145/218380.218498. ISBN   978-0-89791-701-8. S2CID   207194026.
  8. Owen, Art; Associate, Yi Zhou (2000-03-01). "Safe and Effective Importance Sampling". Journal of the American Statistical Association. 95 (449): 135–143. CiteSeerX   10.1.1.36.4536 . doi:10.1080/01621459.2000.10473909. ISSN   0162-1459. S2CID   119761472.
  9. Elvira, V.; Martino, L.; Luengo, D.; Bugallo, M.F. (2015-10-01). "Efficient Multiple Importance Sampling Estimators". IEEE Signal Processing Letters. 22 (10): 1757–1761. arXiv: 1505.05391 . Bibcode:2015ISPL...22.1757E. doi:10.1109/LSP.2015.2432078. ISSN   1070-9908. S2CID   14504598.
  10. Elvira, Víctor; Martino, Luca; Luengo, David; Bugallo, Mónica F. (2017). "Improving population Monte Carlo: Alternative weighting and resampling schemes". Signal Processing. 131: 77–91. arXiv: 1607.02758 . doi:10.1016/j.sigpro.2016.07.012. S2CID   205171823.
  11. Cappé, O.; Guillin, A.; Marin, J. M.; Robert, C. P. (2004-12-01). "Population Monte Carlo". Journal of Computational and Graphical Statistics. 13 (4): 907–929. doi:10.1198/106186004X12803. ISSN   1061-8600. S2CID   119690181.
  12. Martino, L.; Elvira, V.; Luengo, D.; Corander, J. (2017-05-01). "Layered adaptive importance sampling". Statistics and Computing. 27 (3): 599–623. arXiv: 1505.04732 . doi:10.1007/s11222-016-9642-5. ISSN   0960-3174. S2CID   2508031.
  13. Cappé, Olivier; Douc, Randal; Guillin, Arnaud; Marin, Jean-Michel; Robert, Christian P. (2008-04-25). "Adaptive importance sampling in general mixture classes". Statistics and Computing. 18 (4): 447–459. arXiv: 0710.4242 . doi:10.1007/s11222-008-9059-x. ISSN   0960-3174. S2CID   483916.
  14. Cornuet, Jean-Marie; Marin, Jean-Michel; Mira, Antonietta; Robert, Christian P. (2012-12-01). "Adaptive Multiple Importance Sampling". Scandinavian Journal of Statistics. 39 (4): 798–812. arXiv: 0907.1254 . doi:10.1111/j.1467-9469.2011.00756.x. ISSN   1467-9469. S2CID   17191248.
  15. Martino, L.; Elvira, V.; Luengo, D.; Corander, J. (2015-08-01). "An Adaptive Population Importance Sampler: Learning From Uncertainty". IEEE Transactions on Signal Processing. 63 (16): 4422–4437. Bibcode:2015ITSP...63.4422M. CiteSeerX   10.1.1.464.9395 . doi:10.1109/TSP.2015.2440215. ISSN   1053-587X. S2CID   17017431.
  16. Bugallo, Mónica F.; Martino, Luca; Corander, Jukka (2015-12-01). "Adaptive importance sampling in signal processing". Digital Signal Processing. Special Issue in Honour of William J. (Bill) Fitzgerald. 47: 36–49. doi: 10.1016/j.dsp.2015.05.014 .
  17. Bugallo, M. F.; Elvira, V.; Martino, L.; Luengo, D.; Miguez, J.; Djuric, P. M. (July 2017). "Adaptive Importance Sampling: The past, the present, and the future". IEEE Signal Processing Magazine. 34 (4): 60–79. Bibcode:2017ISPM...34...60B. doi:10.1109/msp.2017.2699226. ISSN   1053-5888. S2CID   5619054.
