Monte Carlo integration

Last updated
An illustration of Monte Carlo integration. In this example, the domain D is the inner circle and the domain E is the square. Because the square's area (4) can be easily calculated, the area of the circle (p*1.0 ) can be estimated by the ratio (0.8) of the points inside the circle (40) to the total number of points (50), yielding an approximation for the circle's area of 4*0.8 = 3.2 [?] p. MonteCarloIntegrationCircle.svg
An illustration of Monte Carlo integration. In this example, the domain D is the inner circle and the domain E is the square. Because the square's area (4) can be easily calculated, the area of the circle (π*1.0 ) can be estimated by the ratio (0.8) of the points inside the circle (40) to the total number of points (50), yielding an approximation for the circle's area of 4*0.8 = 3.2 ≈ π.

In mathematics, Monte Carlo integration is a technique for numerical integration using random numbers. It is a particular Monte Carlo method that numerically computes a definite integral. While other algorithms usually evaluate the integrand at a regular grid, [1] Monte Carlo randomly chooses points at which the integrand is evaluated. [2] This method is particularly useful for higher-dimensional integrals. [3]

Contents

There are different methods to perform a Monte Carlo integration, such as uniform sampling, stratified sampling, importance sampling, sequential Monte Carlo (also known as a particle filter), and mean-field particle methods.

Overview

In numerical integration, methods such as the trapezoidal rule use a deterministic approach. Monte Carlo integration, on the other hand, employs a non-deterministic approach: each realization provides a different outcome. In Monte Carlo, the final outcome is an approximation of the correct value with respective error bars, and the correct value is likely to be within those error bars.

The problem Monte Carlo integration addresses is the computation of a multidimensional definite integral

where Ω, a subset of , has volume

The naive Monte Carlo approach is to sample points uniformly on Ω: [4] given N uniform samples,

I can be approximated by

This is because the law of large numbers ensures that

Given the estimation of I from QN, the error bars of QN can be estimated by the sample variance using the unbiased estimate of the variance.

which leads to

As long as the sequence

is bounded, this variance decreases asymptotically to zero as 1/N. The estimation of the error of QN is thus

which decreases as . This is standard error of the mean multiplied with . This result does not depend on the number of dimensions of the integral, which is the promised advantage of Monte Carlo integration against most deterministic methods that depend exponentially on the dimension. [5] It is important to notice that, unlike in deterministic methods, the estimate of the error is not a strict error bound; random sampling may not uncover all the important features of the integrand that can result in an underestimate of the error.

While the naive Monte Carlo works for simple examples, an improvement over deterministic algorithms can only be accomplished with algorithms that use problem-specific sampling distributions. With an appropriate sample distribution it is possible to exploit the fact that almost all higher-dimensional integrands are very localized and only small subspace notably contributes to the integral. [6] A large part of the Monte Carlo literature is dedicated in developing strategies to improve the error estimates. In particular, stratified sampling—dividing the region in sub-domains—and importance sampling—sampling from non-uniform distributions—are two examples of such techniques.

Example

Relative error as a function of the number of samples, showing the scaling
1
N
{\displaystyle {\tfrac {1}{\sqrt {N}}}} Relative error of a Monte Carlo integration to calculate pi.svg
Relative error as a function of the number of samples, showing the scaling

A paradigmatic example of a Monte Carlo integration is the estimation of π. Consider the function

and the set Ω = [−1,1] × [−1,1] with V = 4. Notice that

Thus, a crude way of calculating the value of π with Monte Carlo integration is to pick N random numbers on Ω and compute

In the figure on the right, the relative error is measured as a function of N, confirming the .

C example

Keep in mind that a true random number generator should be used.

inti,throws=99999,insideCircle=0;doublerandX,randY,pi;srand(time(NULL));for(i=0;i<throws;++i){randX=rand()/(double)RAND_MAX;randY=rand()/(double)RAND_MAX;if(randX*randX+randY*randY<1)++insideCircle;}pi=4.0*insideCircle/throws;

Python example

Made in Python.

fromnumpyimportrandomimportnumpyasnpthrows=2000inside_circle=0i=0radius=1whilei<throws:# Choose random X and Y centered around 0,0x=random.uniform(-radius,radius)y=random.uniform(-radius,radius)# If the point is inside circle, increase variableifx**2+y**2<=radius**2:inside_circle+=1i+=1# Calculate area and print; should be closer to Pi with increasing number of throwsarea=(((2*radius)**2)*inside_circle)/throwsprint(area)

Wolfram Mathematica example

The code below describes a process of integrating the function

from using the Monte-Carlo method in Mathematica:

func[x_]:=1/(1+Sinh[2*x]*(Log[x])^2);(*Sample from truncated normal distribution to speed up convergence*)Distrib[x_,average_,var_]:=PDF[NormalDistribution[average,var],1.1*x-0.1];n=10;RV=RandomVariate[TruncatedDistribution[{0.8,3},NormalDistribution[1,0.399]],n];Int=1/nTotal[func[RV]/Distrib[RV,1,0.399]]*Integrate[Distrib[x,1,0.399],{x,0.8,3}]NIntegrate[func[x],{x,0.8,3}](*Compare with real answer*)

Recursive stratified sampling

An illustration of Recursive Stratified Sampling. In this example, the function:
f
(
x
,
y
)
=
{
1
x
2
+
y
2
<
1
0
x
2
+
y
2
>=
1
{\displaystyle f(x,y)={\begin{cases}1&x^{2}+y^{2}<1\\0&x^{2}+y^{2}\geq 1\end{cases}}}

from the above illustration was integrated within a unit square using the suggested algorithm. The sampled points were recorded and plotted. Clearly stratified sampling algorithm concentrates the points in the regions where the variation of the function is largest. Strata.png
An illustration of Recursive Stratified Sampling. In this example, the function:
from the above illustration was integrated within a unit square using the suggested algorithm. The sampled points were recorded and plotted. Clearly stratified sampling algorithm concentrates the points in the regions where the variation of the function is largest.

Recursive stratified sampling is a generalization of one-dimensional adaptive quadratures to multi-dimensional integrals. On each recursion step the integral and the error are estimated using a plain Monte Carlo algorithm. If the error estimate is larger than the required accuracy the integration volume is divided into sub-volumes and the procedure is recursively applied to sub-volumes.

The ordinary 'dividing by two' strategy does not work for multi-dimensions as the number of sub-volumes grows far too quickly to keep track. Instead one estimates along which dimension a subdivision should bring the most dividends and only subdivides the volume along this dimension.

The stratified sampling algorithm concentrates the sampling points in the regions where the variance of the function is largest thus reducing the grand variance and making the sampling more effective, as shown on the illustration.

The popular MISER routine implements a similar algorithm.

MISER Monte Carlo

The MISER algorithm is based on recursive stratified sampling. This technique aims to reduce the overall integration error by concentrating integration points in the regions of highest variance. [7]

The idea of stratified sampling begins with the observation that for two disjoint regions a and b with Monte Carlo estimates of the integral and and variances and , the variance Var(f) of the combined estimate

is given by,

It can be shown that this variance is minimized by distributing the points such that,

Hence the smallest error estimate is obtained by allocating sample points in proportion to the standard deviation of the function in each sub-region.

The MISER algorithm proceeds by bisecting the integration region along one coordinate axis to give two sub-regions at each step. The direction is chosen by examining all d possible bisections and selecting the one which will minimize the combined variance of the two sub-regions. The variance in the sub-regions is estimated by sampling with a fraction of the total number of points available to the current step. The same procedure is then repeated recursively for each of the two half-spaces from the best bisection. The remaining sample points are allocated to the sub-regions using the formula for Na and Nb. This recursive allocation of integration points continues down to a user-specified depth where each sub-region is integrated using a plain Monte Carlo estimate. These individual values and their error estimates are then combined upwards to give an overall result and an estimate of its error.

Importance sampling

There are a variety of importance sampling algorithms, such as

Importance sampling algorithm

Importance sampling provides a very important tool to perform Monte-Carlo integration. [3] [8] The main result of importance sampling to this method is that the uniform sampling of is a particular case of a more generic choice, on which the samples are drawn from any distribution . The idea is that can be chosen to decrease the variance of the measurement QN.

Consider the following example where one would like to numerically integrate a gaussian function, centered at 0, with σ = 1, from −1000 to 1000. Naturally, if the samples are drawn uniformly on the interval [−1000, 1000], only a very small part of them would be significant to the integral. This can be improved by choosing a different distribution from where the samples are chosen, for instance by sampling according to a gaussian distribution centered at 0, with σ = 1. Of course the "right" choice strongly depends on the integrand.

Formally, given a set of samples chosen from a distribution

the estimator for I is given by [3]

Intuitively, this says that if we pick a particular sample twice as much as other samples, we weight it half as much as the other samples. This estimator is naturally valid for uniform sampling, the case where is constant.

The Metropolis–Hastings algorithm is one of the most used algorithms to generate from , [3] thus providing an efficient way of computing integrals.

VEGAS Monte Carlo

The VEGAS algorithm approximates the exact distribution by making a number of passes over the integration region which creates the histogram of the function f. Each histogram is used to define a sampling distribution for the next pass. Asymptotically this procedure converges to the desired distribution. [9] In order to avoid the number of histogram bins growing like Kd, the probability distribution is approximated by a separable function:

so that the number of bins required is only Kd. This is equivalent to locating the peaks of the function from the projections of the integrand onto the coordinate axes. The efficiency of VEGAS depends on the validity of this assumption. It is most efficient when the peaks of the integrand are well-localized. If an integrand can be rewritten in a form which is approximately separable this will increase the efficiency of integration with VEGAS. VEGAS incorporates a number of additional features, and combines both stratified sampling and importance sampling. [9]

See also

Notes

  1. Press et al. 2007 , Chap. 4
  2. Press et al. 2007 , Chap. 7
  3. 1 2 3 4 Newman & Barkema 1999 , Chap. 2
  4. Newman & Barkema 1999
  5. Press et al. 2007
  6. MacKay 2003 , pp. 284–292
  7. Press & Farrar 1990 , pp. 190–195
  8. Kroese, Taimre & Botev 2011
  9. 1 2 Lepage 1978

Related Research Articles

<span class="mw-page-title-main">Autocorrelation</span> Correlation of a signal with a time-shifted copy of itself, as a function of shift

Autocorrelation, sometimes known as serial correlation in the discrete time case, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations of a random variable as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.

<span class="mw-page-title-main">Normal distribution</span> Probability distribution

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

<span class="mw-page-title-main">Standard deviation</span> In statistics, a measure of variation

In statistics, the standard deviation is a measure of the amount of variation of a random variable expected about its mean. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range. The standard deviation is commonly used in the determination of what constitutes an outlier and what does not.

<span class="mw-page-title-main">Variance</span> Statistical measure of how far values spread from their average

In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable. The standard deviation (SD) is obtained as the square root of the variance. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value. It is the second central moment of a distribution, and the covariance of the random variable with itself, and it is often represented by , , , , or .

The weighted arithmetic mean is similar to an ordinary arithmetic mean, except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. The notion of weighted mean plays a role in descriptive statistics and also occurs in a more general form in several other areas of mathematics.

<span class="mw-page-title-main">Law of large numbers</span> Averages of repeated trials converge to the expected value

In probability theory, the law of large numbers (LLN) is a mathematical theorem that states that the average of the results obtained from a large number of independent and identical random samples converges to the true value, if it exists. More formally, the LLN states that given a sample of independent and identically distributed values, the sample mean converges to the true mean.

<span class="mw-page-title-main">Numerical integration</span> Methods of calculating definite integrals

In analysis, numerical integration comprises a broad family of algorithms for calculating the numerical value of a definite integral. The term numerical quadrature is more or less a synonym for "numerical integration", especially as applied to one-dimensional integrals. Some authors refer to numerical integration over more than one dimension as cubature; others take "quadrature" to include higher-dimensional integration.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk, as an estimate of the true MSE.

The VEGAS algorithm, due to G. Peter Lepage, is a method for reducing error in Monte Carlo simulations by using a known or approximate probability distribution function to concentrate the search in those areas of the integrand that make the greatest contribution to the final integral.

<span class="mw-page-title-main">Quasi-Monte Carlo method</span> Numerical integration process

In numerical analysis, the quasi-Monte Carlo method is a method for numerical integration and solving some other problems using low-discrepancy sequences to achieve variance reduction. This is in contrast to the regular Monte Carlo method or Monte Carlo integration, which are based on sequences of pseudorandom numbers.

In statistics, propagation of uncertainty is the effect of variables' uncertainties on the uncertainty of a function based on them. When the variables are the values of experimental measurements they have uncertainties due to measurement limitations which propagate due to the combination of variables in the function.

Importance sampling is a Monte Carlo method for evaluating properties of a particular distribution, while only having samples generated from a different distribution than the distribution of interest. Its introduction in statistics is generally attributed to a paper by Teun Kloek and Herman K. van Dijk in 1978, but its precursors can be found in statistical physics as early as 1949. Importance sampling is also related to umbrella sampling in computational physics. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.

Hotellings <i>T</i>-squared distribution Type of probability distribution

In statistics, particularly in hypothesis testing, the Hotelling's T-squared distribution (T2), proposed by Harold Hotelling, is a multivariate probability distribution that is tightly related to the F-distribution and is most notable for arising as the distribution of a set of sample statistics that are natural generalizations of the statistics underlying the Student's t-distribution. The Hotelling's t-squared statistic (t2) is a generalization of Student's t-statistic that is used in multivariate hypothesis testing.

In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis of a sample from the multivariate distribution. Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix. The sample covariance matrix (SCM) is an unbiased and efficient estimator of the covariance matrix if the space of covariance matrices is viewed as an extrinsic convex cone in Rp×p; however, measured using the intrinsic geometry of positive-definite matrices, the SCM is a biased and inefficient estimator. In addition, if the random variable has a normal distribution, the sample covariance matrix has a Wishart distribution and a slightly differently scaled version of it is the maximum likelihood estimate. Cases involving missing data, heteroscedasticity, or autocorrelated residuals require deeper considerations. Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive.

Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value affects the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements. In estimation theory, two approaches are generally considered:

<span class="mw-page-title-main">Variance reduction</span>

In mathematics, more specifically in the theory of Monte Carlo methods, variance reduction is a procedure used to increase the precision of the estimates obtained for a given simulation or computational effort. Every output random variable from the simulation is associated with a variance which limits the precision of the simulation results. In order to make a simulation statistically efficient, i.e., to obtain a greater precision and smaller confidence intervals for the output random variable of interest, variance reduction techniques can be used. The main variance reduction methods are

The cross-entropy (CE) method is a Monte Carlo method for importance sampling and optimization. It is applicable to both combinatorial and continuous problems, with either a static or noisy objective.

Multiple-try Metropolis (MTM) is a sampling method that is a modified form of the Metropolis–Hastings method, first presented by Liu, Liang, and Wong in 2000. It is designed to help the sampling trajectory converge faster, by increasing both the step size and the acceptance rate.

In statistics, inverse-variance weighting is a method of aggregating two or more random variables to minimize the variance of the weighted average. Each random variable is weighted in inverse proportion to its variance, i.e., proportional to its precision.

Mean-field particle methods are a broad class of interacting type Monte Carlo algorithms for simulating from a sequence of probability distributions satisfying a nonlinear evolution equation. These flows of probability measures can always be interpreted as the distributions of the random states of a Markov process whose transition probabilities depends on the distributions of the current random states. A natural way to simulate these sophisticated nonlinear Markov processes is to sample a large number of copies of the process, replacing in the evolution equation the unknown distributions of the random states by the sampled empirical measures. In contrast with traditional Monte Carlo and Markov chain Monte Carlo methods these mean-field particle techniques rely on sequential interacting samples. The terminology mean-field reflects the fact that each of the samples interacts with the empirical measures of the process. When the size of the system tends to infinity, these random empirical measures converge to the deterministic distribution of the random states of the nonlinear Markov chain, so that the statistical interaction between particles vanishes. In other words, starting with a chaotic configuration based on independent copies of initial state of the nonlinear Markov chain model, the chaos propagates at any time horizon as the size the system tends to infinity; that is, finite blocks of particles reduces to independent copies of the nonlinear Markov process. This result is called the propagation of chaos property. The terminology "propagation of chaos" originated with the work of Mark Kac in 1976 on a colliding mean-field kinetic gas model.

References