Algorithms for calculating variance

Algorithms for calculating variance play a major role in computational statistics. A key difficulty in the design of good algorithms for this problem is that formulas for the variance may involve sums of squares, which can lead to numerical instability as well as to arithmetic overflow when dealing with large values.

Naïve algorithm

A formula for calculating the variance of an entire population of size $N$ is:

$\sigma^2 = \overline{x^2} - \bar{x}^2 = \frac{\sum_{i=1}^{N} x_i^2 - \bigl(\sum_{i=1}^{N} x_i\bigr)^2 / N}{N}$

Using Bessel's correction to calculate an unbiased estimate of the population variance from a finite sample of $n$ observations, the formula is:

$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \bigl(\sum_{i=1}^{n} x_i\bigr)^2 / n}{n - 1}$

Therefore, a naïve algorithm to calculate the estimated variance is given by the following:

  • Let n ← 0, Sum ← 0, SumSq ← 0
  • For each datum x:
    • n ← n + 1
    • Sum ← Sum + x
    • SumSq ← SumSq + x × x
  • Var = (SumSq − (Sum × Sum) / n) / (n − 1)

This algorithm can easily be adapted to compute the variance of a finite population: simply divide by n instead of n − 1 on the last line.
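
A direct transcription of this pseudocode into Python might look like the following (a minimal sketch; the function name naive_variance is ours, and the variable names mirror the pseudocode above):

def naive_variance(data):
    n = 0
    Sum = 0.0
    SumSq = 0.0
    for x in data:
        n += 1
        Sum += x
        SumSq += x * x
    return (SumSq - Sum * Sum / n) / (n - 1)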

Because SumSq and (Sum×Sum)/n can be very similar numbers, cancellation can cause the precision of the result to be much less than the inherent precision of the floating-point arithmetic used to perform the computation. Thus this algorithm should not be used in practice, [1] [2] and several alternate, numerically stable algorithms have been proposed. [3] This is particularly bad if the standard deviation is small relative to the mean.

Computing shifted data

The variance is invariant with respect to changes in a location parameter, a property which can be used to avoid the catastrophic cancellation in this formula,

$\operatorname{Var}(X - K) = \operatorname{Var}(X)$

with $K$ any constant, which leads to the new formula

$s^2 = \frac{\sum_{i=1}^{n} (x_i - K)^2 - \bigl(\sum_{i=1}^{n} (x_i - K)\bigr)^2 / n}{n - 1}$

The closer $K$ is to the mean value, the more accurate the result will be, but just choosing a value inside the range of samples will guarantee the desired stability. If the values $(x_i - K)$ are small then there are no problems with the sum of their squares; on the contrary, if they are large it necessarily means that the variance is large as well. In any case the second term in the formula is always smaller than the first one, therefore no cancellation can occur. [2]

If just the first sample is taken as $K$, the algorithm can be written in the Python programming language as

def shifted_data_variance(data):
    if len(data) < 2:
        return 0.0
    K = data[0]
    n = Ex = Ex2 = 0.0
    for x in data:
        n += 1
        Ex += x - K
        Ex2 += (x - K) ** 2
    variance = (Ex2 - Ex**2 / n) / (n - 1)
    # use n instead of (n-1) if you want to compute the exact variance of the given data
    # use (n-1) if data are samples of a larger population
    return variance

This formula also facilitates incremental computation, which can be expressed as

K = Ex = Ex2 = 0.0
n = 0

def add_variable(x):
    global K, n, Ex, Ex2
    if n == 0:
        K = x
    n += 1
    Ex += x - K
    Ex2 += (x - K) ** 2

def remove_variable(x):
    global K, n, Ex, Ex2
    n -= 1
    Ex -= x - K
    Ex2 -= (x - K) ** 2

def get_mean():
    global K, n, Ex
    return K + Ex / n

def get_variance():
    global n, Ex, Ex2
    return (Ex2 - Ex**2 / n) / (n - 1)
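
A brief usage sketch of the functions above (the sample values are illustrative):

add_variable(4)
add_variable(7)
add_variable(13)
add_variable(16)
print(get_mean())      # 10.0
print(get_variance())  # 30.0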

Two-pass algorithm

An alternative approach, using a different formula for the variance, first computes the sample mean,

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},$

and then computes the sum of the squares of the differences from the mean,

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},$

where s is the standard deviation. This is given by the following code:

def two_pass_variance(data):
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    return variance

This algorithm is numerically stable if n is small. [1] [4] However, the results of both of these simple algorithms ("naïve" and "two-pass") can depend inordinately on the ordering of the data and can give poor results for very large data sets due to repeated roundoff error in the accumulation of the sums. Techniques such as compensated summation can be used to combat this error to a degree.
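
As an illustration (a minimal sketch, not taken from the cited references; the function name compensated_variance is ours), one simple compensated variant corrects the second pass with the sum of the residuals, which would be exactly zero in exact arithmetic:

def compensated_variance(data):
    n = len(data)
    mean = sum(data) / n
    # In exact arithmetic sum3 would be zero; in floating point it captures
    # part of the rounding error of the second pass and corrects for it.
    sum2 = sum((x - mean) ** 2 for x in data)
    sum3 = sum(x - mean for x in data)
    return (sum2 - sum3**2 / n) / (n - 1)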

Welford's online algorithm

It is often useful to be able to compute the variance in a single pass, inspecting each value only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation. For such an online algorithm, a recurrence relation is required between quantities from which the required statistics can be calculated in a numerically stable fashion.

The following formulas can be used to update the mean and (estimated) variance of the sequence, for an additional element $x_n$. Here, $\bar{x}_n$ denotes the sample mean of the first $n$ samples $(x_1, \dots, x_n)$, $\sigma^2_n$ their biased sample variance, and $s^2_n$ their unbiased sample variance.

$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$

$\sigma^2_n = \frac{(n-1)\,\sigma^2_{n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)}{n}$

$s^2_n = \frac{n-2}{n-1}\, s^2_{n-1} + \frac{(x_n - \bar{x}_{n-1})^2}{n}, \qquad n > 1$

These formulas suffer from numerical instability [ citation needed ], as they repeatedly subtract a small number from a big number which scales with $n$. A better quantity for updating is the sum of squares of differences from the current mean, $\sum_{i=1}^{n} (x_i - \bar{x}_n)^2$, here denoted $M_{2,n}$:

$M_{2,n} = M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)$

$\sigma^2_n = \frac{M_{2,n}}{n}$

$s^2_n = \frac{M_{2,n}}{n - 1}$

This algorithm was found by Welford, [5] [6] and it has been thoroughly analyzed. [2] [7] It is also common to denote $M_k = \bar{x}_k$ and $S_k = M_{2,k}$. [8]

An example Python implementation for Welford's algorithm is given below.

# For a new value new_value, compute the new count, new mean, the new M2.
# mean accumulates the mean of the entire dataset
# M2 aggregates the squared distance from the mean
# count aggregates the number of samples seen so far
def update(existing_aggregate, new_value):
    (count, mean, M2) = existing_aggregate
    count += 1
    delta = new_value - mean
    mean += delta / count
    delta2 = new_value - mean
    M2 += delta * delta2
    return (count, mean, M2)

# Retrieve the mean, variance and sample variance from an aggregate
def finalize(existing_aggregate):
    (count, mean, M2) = existing_aggregate
    if count < 2:
        return float("nan")
    else:
        (mean, variance, sample_variance) = (mean, M2 / count, M2 / (count - 1))
        return (mean, variance, sample_variance)

This algorithm is much less prone to loss of precision due to catastrophic cancellation, but might not be as efficient because of the division operation inside the loop. For a particularly robust two-pass algorithm for computing the variance, one can first compute and subtract an estimate of the mean, and then use this algorithm on the residuals.
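
A brief usage sketch of update and finalize (the initial all-zero aggregate is an assumption of this example):

aggregate = (0, 0.0, 0.0)
for x in [4, 7, 13, 16]:
    aggregate = update(aggregate, x)

mean, variance, sample_variance = finalize(aggregate)
# mean = 10.0, variance = 22.5, sample_variance = 30.0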

The parallel algorithm below illustrates how to merge multiple sets of statistics calculated online.

Weighted incremental algorithm

The algorithm can be extended to handle unequal sample weights, replacing the simple counter n with the sum of weights seen so far. West (1979) [9] suggests this incremental algorithm:

def weighted_incremental_variance(data_weight_pairs):
    w_sum = w_sum2 = mean = S = 0
    for x, w in data_weight_pairs:
        w_sum = w_sum + w
        w_sum2 = w_sum2 + w**2
        mean_old = mean
        mean = mean_old + (w / w_sum) * (x - mean_old)
        S = S + w * (x - mean_old) * (x - mean)

    population_variance = S / w_sum
    # Bessel's correction for weighted samples
    # for integer frequency weights
    sample_frequency_variance = S / (w_sum - 1)

Parallel algorithm

Chan et al. [10] note that Welford's online algorithm detailed above is a special case of an algorithm that works for combining arbitrary sets $A$ and $B$:

$n_{AB} = n_A + n_B$

$\delta = \bar{x}_B - \bar{x}_A$

$\bar{x}_{AB} = \bar{x}_A + \delta\,\frac{n_B}{n_{AB}}$

$M_{2,AB} = M_{2,A} + M_{2,B} + \delta^2\,\frac{n_A n_B}{n_{AB}}.$

This may be useful when, for example, multiple processing units may be assigned to discrete parts of the input.

Chan's method for estimating the mean is numerically unstable when $n_A \approx n_B$ and both are large, because the numerical error in $\delta = \bar{x}_B - \bar{x}_A$ is not scaled down in the way that it is in the $n_B = 1$ case. In such cases, prefer $\bar{x}_{AB} = \frac{n_A \bar{x}_A + n_B \bar{x}_B}{n_{AB}}$.

def parallel_variance(n_a, avg_a, M2_a, n_b, avg_b, M2_b):
    n = n_a + n_b
    delta = avg_b - avg_a
    M2 = M2_a + M2_b + delta**2 * n_a * n_b / n
    var_ab = M2 / (n - 1)
    return var_ab

This can be generalized to allow parallelization with AVX, with GPUs, and computer clusters, and to covariance. [3]
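
As an illustration of how such a merge might be applied across many chunks (a minimal sketch; the helper names merge and chunk_aggregate are ours, not from [3]):

from functools import reduce

def chunk_aggregate(chunk):
    # (count, mean, M2) for one chunk, computed locally.
    n = len(chunk)
    mean = sum(chunk) / n
    M2 = sum((x - mean) ** 2 for x in chunk)
    return (n, mean, M2)

def merge(a, b):
    # Combine two (count, mean, M2) aggregates with the formulas above.
    n_a, avg_a, M2_a = a
    n_b, avg_b, M2_b = b
    n = n_a + n_b
    delta = avg_b - avg_a
    mean = avg_a + delta * n_b / n
    M2 = M2_a + M2_b + delta**2 * n_a * n_b / n
    return (n, mean, M2)

chunks = [[4, 7], [13, 16]]
n, mean, M2 = reduce(merge, map(chunk_aggregate, chunks))
sample_variance = M2 / (n - 1)  # 30.0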

Example

Assume that all floating point operations use standard IEEE 754 double-precision arithmetic. Consider the sample (4, 7, 13, 16) from an infinite population. Based on this sample, the estimated population mean is 10, and the unbiased estimate of population variance is 30. Both the naïve algorithm and two-pass algorithm compute these values correctly.

Next consider the sample ($10^8 + 4$, $10^8 + 7$, $10^8 + 13$, $10^8 + 16$), which gives rise to the same estimated variance as the first sample. The two-pass algorithm computes this variance estimate correctly, but the naïve algorithm returns 29.333333333333332 instead of 30.

While this loss of precision may be tolerable and viewed as a minor flaw of the naïve algorithm, further increasing the offset makes the error catastrophic. Consider the sample ($10^9 + 4$, $10^9 + 7$, $10^9 + 13$, $10^9 + 16$). Again the estimated population variance of 30 is computed correctly by the two-pass algorithm, but the naïve algorithm now computes it as −170.66666666666666. This is a serious problem with the naïve algorithm and is due to catastrophic cancellation in the subtraction of two similar numbers at the final stage of the algorithm.
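
This behaviour is easy to reproduce (a sketch reusing the two_pass_variance function defined above, and assuming a naive_variance helper that transcribes the naïve pseudocode, as sketched earlier):

offset = 1e9
sample = [offset + x for x in (4.0, 7.0, 13.0, 16.0)]
print(two_pass_variance(sample))  # 30.0
print(naive_variance(sample))     # a negative value near -170.67, from catastrophic cancellation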

Higher-order statistics

Terriberry [11] extends Chan's formulae to calculating the third and fourth central moments, needed for example when estimating skewness and kurtosis:

$M_{3,AB} = M_{3,A} + M_{3,B} + \delta^3\,\frac{n_A n_B (n_A - n_B)}{n_{AB}^2} + 3\,\delta\,\frac{n_A M_{2,B} - n_B M_{2,A}}{n_{AB}}$

$M_{4,AB} = M_{4,A} + M_{4,B} + \delta^4\,\frac{n_A n_B (n_A^2 - n_A n_B + n_B^2)}{n_{AB}^3} + 6\,\delta^2\,\frac{n_A^2 M_{2,B} + n_B^2 M_{2,A}}{n_{AB}^2} + 4\,\delta\,\frac{n_A M_{3,B} - n_B M_{3,A}}{n_{AB}}$

Here the $M_k$ are again the sums of $k$th powers of differences from the mean, $M_k = \sum_{i=1}^{n} (x_i - \bar{x})^k$, giving the skewness and kurtosis

$g_1 = \frac{\sqrt{n}\, M_3}{M_2^{3/2}}, \qquad g_2 = \frac{n\, M_4}{M_2^2} - 3.$

For the incremental case (i.e., $B = \{x\}$), this simplifies to:

$\delta = x - m$

$m' = m + \frac{\delta}{n}$

$M_2' = M_2 + \delta^2\,\frac{n - 1}{n}$

$M_3' = M_3 + \delta^3\,\frac{(n - 1)(n - 2)}{n^2} - \frac{3\,\delta\, M_2}{n}$

$M_4' = M_4 + \frac{\delta^4 (n - 1)(n^2 - 3n + 3)}{n^3} + \frac{6\,\delta^2 M_2}{n^2} - \frac{4\,\delta\, M_3}{n}$

By preserving the value $\delta / n$, only one division operation is needed and the higher-order statistics can thus be calculated for little incremental cost.

An example of the online algorithm for kurtosis implemented as described is:

def online_kurtosis(data):
    n = mean = M2 = M3 = M4 = 0

    for x in data:
        n1 = n
        n = n + 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n**2
        term1 = delta * delta_n * n1
        mean = mean + delta_n
        M4 = M4 + term1 * delta_n2 * (n**2 - 3 * n + 3) + 6 * delta_n2 * M2 - 4 * delta_n * M3
        M3 = M3 + term1 * delta_n * (n - 2) - 3 * delta_n * M2
        M2 = M2 + term1

    # Note, you may also calculate variance using M2, and skewness using M3
    # Caution: If all the inputs are the same, M2 will be 0, resulting in a division by 0.
    kurtosis = (n * M4) / (M2**2) - 3
    return kurtosis

Pébaÿ [12] further extends these results to arbitrary-order central moments, for the incremental and the pairwise cases, and subsequently Pébaÿ et al. [13] for weighted and compound moments. One can also find there similar formulas for covariance.

Choi and Sweetman [14] offer two alternative methods to compute the skewness and kurtosis, each of which can save substantial computer memory and CPU time in certain applications. The first approach is to compute the statistical moments by separating the data into bins and then computing the moments from the geometry of the resulting histogram, which effectively becomes a one-pass algorithm for higher moments. One benefit is that the statistical moment calculations can be carried out to arbitrary accuracy such that the computations can be tuned to the precision of, e.g., the data storage format or the original measurement hardware. A relative histogram of a random variable can be constructed in the conventional way: the range of potential values is divided into bins and the number of occurrences within each bin is counted and plotted such that the area of each rectangle equals the portion of the sample values within that bin:

where $h(x^{(k)})$ and $H(x^{(k)}) = h(x^{(k)})/A$ represent the frequency and the relative frequency at bin $x^{(k)}$, and $A = \sum_{k=1}^{K} h(x^{(k)})\,\Delta x^{(k)}$ is the total area of the histogram. After this normalization, the raw moments and central moments of $x(t)$ can be calculated from the relative histogram:

$m_n^{(h)} = \sum_{k=1}^{K} \bigl(x^{(k)}\bigr)^n\, H(x^{(k)})\, \Delta x^{(k)}$

$\theta_n^{(h)} = \sum_{k=1}^{K} \bigl(x^{(k)} - m_1^{(h)}\bigr)^n\, H(x^{(k)})\, \Delta x^{(k)}$

where the superscript $(h)$ indicates that the moments are calculated from the histogram. For constant bin width $\Delta x^{(k)} = \Delta x$ these two expressions can be simplified using $I = A/\Delta x$:

$m_n^{(h)} = \frac{1}{I} \sum_{k=1}^{K} \bigl(x^{(k)}\bigr)^n\, h(x^{(k)})$

$\theta_n^{(h)} = \frac{1}{I} \sum_{k=1}^{K} \bigl(x^{(k)} - m_1^{(h)}\bigr)^n\, h(x^{(k)})$
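
A minimal sketch of the histogram approach (the function name, the number of bins, and the use of bin centres are assumptions of this example, not taken from [14]):

import numpy as np

def histogram_moments(data, bins=100, order=4):
    # Build the histogram in one pass, then compute moments from the bin
    # centres weighted by the relative frequency of each bin.
    counts, edges = np.histogram(data, bins=bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    rel_freq = counts / counts.sum()
    raw = [float(np.sum(centres**k * rel_freq)) for k in range(1, order + 1)]
    mean = raw[0]
    central = [float(np.sum((centres - mean) ** k * rel_freq)) for k in range(1, order + 1)]
    return raw, central

The accuracy is limited by the bin width, which is the trade-off that allows a single pass and bounded memory.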

The second approach from Choi and Sweetman [14] is an analytical methodology to combine statistical moments from individual segments of a time-history such that the resulting overall moments are those of the complete time-history. This methodology could be used for parallel computation of statistical moments with subsequent combination of those moments, or for combination of statistical moments computed at sequential times.

If $Q$ sets of statistical moments are known: $(\gamma_{0,q}, \mu_q, \sigma^2_q, \alpha_{3,q}, \alpha_{4,q})$ for $q = 1, \dots, Q$, then each $\gamma_n$ can be expressed in terms of the equivalent $n$ raw moments:

$\gamma_{n,q} = m_{n,q}\, \gamma_{0,q}, \qquad n = 1, 2, 3, 4$

where $\gamma_{0,q}$ is generally taken to be the duration of the $q$th time-history, or the number of points if $\Delta t$ is constant.

The benefit of expressing the statistical moments in terms of $\gamma$ is that the $Q$ sets can be combined by addition, and there is no upper limit on the value of $Q$.

$\gamma_{n,c} = \sum_{q=1}^{Q} \gamma_{n,q}, \qquad n = 0, 1, 2, 3, 4$

where the subscript $c$ represents the concatenated time-history or combined $\gamma$. These combined values of $\gamma$ can then be inversely transformed into raw moments representing the complete concatenated time-history

$m_{n,c} = \frac{\gamma_{n,c}}{\gamma_{0,c}}, \qquad n = 1, 2, 3, 4.$

Known relationships between the raw moments ($m_n$) and the central moments ($\theta_n$) are then used to compute the central moments of the concatenated time-history. Finally, the statistical moments of the concatenated history are computed from the central moments:

$\mu_c = m_{1,c}, \qquad \sigma^2_c = \theta_{2,c}, \qquad \alpha_{3,c} = \frac{\theta_{3,c}}{\sigma_c^3}, \qquad \alpha_{4,c} = \frac{\theta_{4,c}}{\sigma_c^4} - 3$
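
A sketch of the combination step under these definitions (the function name is ours; segments are assumed to have constant Δt, so each is weighted by its number of points):

def combine_segments(segments):
    # Each segment is (n, m1, m2, m3, m4): its length and its raw moments
    # (averages of powers), which combine as length-weighted averages.
    total_n = sum(seg[0] for seg in segments)
    m1, m2, m3, m4 = (
        sum(seg[0] * seg[k] for seg in segments) / total_n for k in range(1, 5)
    )
    # Convert the combined raw moments to central moments.
    theta2 = m2 - m1**2
    theta3 = m3 - 3 * m1 * m2 + 2 * m1**3
    theta4 = m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4
    skewness = theta3 / theta2**1.5
    kurtosis = theta4 / theta2**2 - 3
    return m1, theta2, skewness, kurtosis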

Covariance

Very similar algorithms can be used to compute the covariance.

Naïve algorithm

The naïve algorithm is

$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i - \bigl(\sum_{i=1}^{n} x_i\bigr)\bigl(\sum_{i=1}^{n} y_i\bigr)/n}{n}$

For the algorithm above, one could use the following Python code:

def naive_covariance(data1, data2):
    n = len(data1)
    sum1 = sum(data1)
    sum2 = sum(data2)
    sum12 = sum([i1 * i2 for i1, i2 in zip(data1, data2)])
    covariance = (sum12 - sum1 * sum2 / n) / n
    return covariance

With estimate of the mean

As for the variance, the covariance of two random variables is also shift-invariant, so given any two constant values $k_x$ and $k_y$ it can be written:

$\operatorname{Cov}(X, Y) = \operatorname{Cov}(X - k_x, Y - k_y) = \frac{\sum_{i=1}^{n} (x_i - k_x)(y_i - k_y) - \bigl(\sum_{i=1}^{n} (x_i - k_x)\bigr)\bigl(\sum_{i=1}^{n} (y_i - k_y)\bigr)/n}{n}$

and again choosing a value inside the range of values will stabilize the formula against catastrophic cancellation as well as make it more robust against big sums. Taking the first value of each data set, the algorithm can be written as:

def shifted_data_covariance(data_x, data_y):
    n = len(data_x)
    if n < 2:
        return 0
    kx = data_x[0]
    ky = data_y[0]
    Ex = Ey = Exy = 0
    for ix, iy in zip(data_x, data_y):
        Ex += ix - kx
        Ey += iy - ky
        Exy += (ix - kx) * (iy - ky)
    return (Exy - Ex * Ey / n) / n

Two-pass

The two-pass algorithm first computes the sample means, and then the covariance:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$

The two-pass algorithm may be written as:

def two_pass_covariance(data1, data2):
    n = len(data1)
    mean1 = sum(data1) / n
    mean2 = sum(data2) / n

    covariance = 0
    for i1, i2 in zip(data1, data2):
        a = i1 - mean1
        b = i2 - mean2
        covariance += a * b / n
    return covariance

A slightly more accurate compensated version performs the full naive algorithm on the residuals. The final sums $\sum_i (x_i - \bar{x})$ and $\sum_i (y_i - \bar{y})$ should be zero, but the second pass compensates for any small error.
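
A minimal sketch of that idea (the function name is ours):

def compensated_covariance(data1, data2):
    n = len(data1)
    mean1 = sum(data1) / n
    mean2 = sum(data2) / n
    # Run the naive algorithm on the residuals; the residual sums below would
    # be zero in exact arithmetic, so the final correction removes most of the
    # remaining rounding error.
    res1 = [x - mean1 for x in data1]
    res2 = [y - mean2 for y in data2]
    sum1 = sum(res1)
    sum2 = sum(res2)
    sum12 = sum(a * b for a, b in zip(res1, res2))
    return (sum12 - sum1 * sum2 / n) / n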

Online

A stable one-pass algorithm exists, similar to the online algorithm for computing the variance, that computes the co-moment $C_n = \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)$:

$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$

$\bar{y}_n = \bar{y}_{n-1} + \frac{y_n - \bar{y}_{n-1}}{n}$

$C_n = C_{n-1} + (x_n - \bar{x}_n)(y_n - \bar{y}_{n-1}) = C_{n-1} + (x_n - \bar{x}_{n-1})(y_n - \bar{y}_n)$

The apparent asymmetry in that last equation is due to the fact that $(x_n - \bar{x}_n) = \frac{n-1}{n}(x_n - \bar{x}_{n-1})$, so both update terms are equal to $\frac{n-1}{n}(x_n - \bar{x}_{n-1})(y_n - \bar{y}_{n-1})$. Even greater accuracy can be achieved by first computing the means, then using the stable one-pass algorithm on the residuals.

Thus the covariance can be computed as

$\operatorname{Cov}_N(X, Y) = \frac{C_N}{N}$

def online_covariance(data1, data2):
    meanx = meany = C = n = 0
    for x, y in zip(data1, data2):
        n += 1
        dx = x - meanx
        meanx += dx / n
        meany += (y - meany) / n
        C += dx * (y - meany)

    population_covar = C / n
    # Bessel's correction for sample variance
    sample_covar = C / (n - 1)

A small modification can also be made to compute the weighted covariance:

def online_weighted_covariance(data1, data2, data3):
    meanx = meany = 0
    wsum = wsum2 = 0
    C = 0
    for x, y, w in zip(data1, data2, data3):
        wsum += w
        wsum2 += w * w
        dx = x - meanx
        meanx += (w / wsum) * dx
        meany += (w / wsum) * (y - meany)
        C += w * dx * (y - meany)

    population_covar = C / wsum
    # Bessel's correction for sample variance
    # Frequency weights
    sample_frequency_covar = C / (wsum - 1)
    # Reliability weights
    sample_reliability_covar = C / (wsum - wsum2 / wsum)

Likewise, there is a formula for combining the covariances of two sets that can be used to parallelize the computation: [3]

$C_{AB} = C_A + C_B + (\bar{x}_B - \bar{x}_A)(\bar{y}_B - \bar{y}_A)\,\frac{n_A n_B}{n_{AB}}$
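
Mirroring parallel_variance above, such a merge might look like the following (a sketch; the function and parameter names are ours):

def parallel_covariance(n_a, meanx_a, meany_a, C_a, n_b, meanx_b, meany_b, C_b):
    # Merge co-moment statistics computed independently on two data sets.
    n = n_a + n_b
    delta_x = meanx_b - meanx_a
    delta_y = meany_b - meany_a
    meanx = meanx_a + delta_x * n_b / n
    meany = meany_a + delta_y * n_b / n
    C = C_a + C_b + delta_x * delta_y * n_a * n_b / n
    population_covar = C / n
    return n, meanx, meany, C, population_covar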

Weighted batched version

A version of the weighted online algorithm that does batched updates also exists: let $w_1, \dots, w_N$ denote the weights, and write

$\bar{x}_{n+k} = \bar{x}_n + \frac{\sum_{i=n+1}^{n+k} w_i (x_i - \bar{x}_n)}{\sum_{i=1}^{n+k} w_i}$

$\bar{y}_{n+k} = \bar{y}_n + \frac{\sum_{i=n+1}^{n+k} w_i (y_i - \bar{y}_n)}{\sum_{i=1}^{n+k} w_i}$

$C_{n+k} = C_n + \sum_{i=n+1}^{n+k} w_i (x_i - \bar{x}_{n+k})(y_i - \bar{y}_n)$

The covariance can then be computed as

$\operatorname{Cov}_N(X, Y) = \frac{C_N}{\sum_{i=1}^{N} w_i}$
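
A sketch of how such batched updates can be organized (an assumption of this example: the batch is folded into the running aggregate with the per-observation weighted update shown earlier, which gives the same result as processing the observations one at a time):

def batched_weighted_covariance_update(aggregate, batch):
    # aggregate = (wsum, meanx, meany, C); batch = iterable of (x, y, w) triples.
    wsum, meanx, meany, C = aggregate
    for x, y, w in batch:
        wsum += w
        dx = x - meanx
        meanx += (w / wsum) * dx
        meany += (w / wsum) * (y - meany)
        C += w * dx * (y - meany)
    return (wsum, meanx, meany, C)

# The weighted covariance of everything seen so far is then C / wsum.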


References

  1. Einarsson, Bo (2005). Accuracy and Reliability in Scientific Computing. SIAM. p. 47. ISBN 978-0-89871-584-2.
  2. Chan, Tony F.; Golub, Gene H.; LeVeque, Randall J. (1983). "Algorithms for computing the sample variance: Analysis and recommendations" (PDF). The American Statistician. 37 (3): 242–247. doi:10.1080/00031305.1983.10483115. JSTOR 2683386. Archived (PDF) from the original on 9 October 2022.
  3. Schubert, Erich; Gertz, Michael (9 July 2018). Numerically stable parallel computation of (co-)variance. ACM. p. 10. doi:10.1145/3221269.3223036. ISBN 9781450365055. S2CID 49665540.
  4. Higham, Nicholas J. (2002). "Problem 1.10". Accuracy and Stability of Numerical Algorithms (2nd ed.). Philadelphia, PA: Society for Industrial and Applied Mathematics. doi:10.1137/1.9780898718027. ISBN 978-0-89871-521-7.
  5. Welford, B. P. (1962). "Note on a method for calculating corrected sums of squares and products". Technometrics. 4 (3): 419–420. doi:10.2307/1266577. JSTOR 1266577.
  6. Knuth, Donald E. (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms (3rd ed.). Boston: Addison-Wesley. p. 232.
  7. Ling, Robert F. (1974). "Comparison of Several Algorithms for Computing Sample Means and Variances". Journal of the American Statistical Association. 69 (348): 859–866. doi:10.2307/2286154. JSTOR 2286154.
  8. Cook, John D. (30 September 2022) [1 November 2014]. "Accurately computing sample variance". John D. Cook Consulting.
  9. West, D. H. D. (1979). "Updating Mean and Variance Estimates: An Improved Method". Communications of the ACM. 22 (9): 532–535. doi:10.1145/359146.359153. S2CID 30671293.
  10. Chan, Tony F.; Golub, Gene H.; LeVeque, Randall J. (November 1979). "Updating Formulae and a Pairwise Algorithm for Computing Sample Variances" (PDF). Technical Report STAN-CS-79-773, Department of Computer Science, Stanford University.
  11. Terriberry, Timothy B. (15 October 2008) [9 December 2007]. "Computing Higher-Order Moments Online". Archived from the original on 23 April 2014. Retrieved 5 May 2008.
  12. Pébay, Philippe Pierre (September 2008). "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments". Technical Report SAND2008-6212. Albuquerque, NM and Livermore, CA: Sandia National Laboratories. doi:10.2172/1028931. OSTI 1028931.
  13. Pébaÿ, Philippe; Terriberry, Timothy; Kolla, Hemanth; Bennett, Janine (2016). "Numerically Stable, Scalable Formulas for Parallel and Online Computation of Higher-Order Multivariate Central Moments with Arbitrary Weights". Computational Statistics. 31 (4). Springer: 1305–1325. doi:10.1007/s00180-015-0637-z. S2CID 124570169.
  14. Choi, Myoungkeun; Sweetman, Bert (2010). "Efficient Calculation of Statistical Moments for Structural Health Monitoring". Journal of Structural Health Monitoring. 9 (1): 13–24. doi:10.1177/1475921709341014. S2CID 17534100.