Bessel's correction

In statistics, Bessel's correction is the use of n − 1 instead of n in the formula for the sample variance and sample standard deviation,[1] where n is the number of observations in a sample. This method corrects the bias in the estimation of the population variance. It also partially corrects the bias in the estimation of the population standard deviation. However, the correction often increases the mean squared error in these estimations. This technique is named after Friedrich Bessel.

In estimating the population variance from a sample when the population mean is unknown, the uncorrected sample variance is the mean of the squares of deviations of sample values from the sample mean (i.e. using a multiplicative factor 1/n). In this case, the sample variance is a biased estimator of the population variance.

Multiplying the uncorrected sample variance by the factor

$$\frac{n}{n-1}$$

gives an unbiased estimator of the population variance. In some literature,[2][3] the above factor is called Bessel's correction.

One can understand Bessel's correction as the degrees of freedom in the residuals vector (residuals, not errors, because the population mean is unknown):

$$(x_1 - \overline{x},\ \dots,\ x_n - \overline{x}),$$

where $\overline{x}$ is the sample mean. While there are n independent observations in the sample, there are only n − 1 independent residuals, as they sum to 0. For a more intuitive explanation of the need for Bessel's correction, see § Source of bias.
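The degrees-of-freedom argument can be illustrated with a minimal Python sketch. The sample values and the helper name `residuals` are illustrative assumptions, not part of the original text. The residuals always sum to zero, so the last one is fixed by the other n − 1.

```python
from fractions import Fraction

def residuals(sample):
    """Return the deviation of each observation from the sample mean."""
    n = len(sample)
    mean = Fraction(sum(sample), n)  # exact arithmetic avoids rounding noise
    return [x - mean for x in sample]

sample = [4, 7, 13, 16]  # hypothetical integer observations
r = residuals(sample)

# The residuals sum to zero, so any one of them is determined by the others.
total = sum(r)
last_from_others = -sum(r[:-1])

print(total)                      # 0
print(last_from_others == r[-1])  # True
```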

Generally, Bessel's correction is an approach to reducing the bias due to finite sample size. Such finite-sample bias correction is also needed for other estimates such as skewness and kurtosis, but for these the inaccuracies are often significantly larger. Fully removing such bias requires a more complex multi-parameter estimation. For instance, the correct correction for the standard deviation depends on the kurtosis (the normalized fourth central moment), but the kurtosis estimate itself has a finite-sample bias that depends on the standard deviation, so the two estimations have to be carried out jointly.

Caveats

There are three caveats to consider regarding Bessel's correction:

  1. It does not yield an unbiased estimator of standard deviation.
  2. The corrected estimator often has a higher mean squared error (MSE) than the uncorrected estimator. [4] Furthermore, there is no population distribution for which it has the minimum MSE because a different scale factor can always be chosen to minimize MSE.
  3. It is only necessary when the population mean is unknown (and estimated as the sample mean). In practice, this generally happens.

Firstly, while the sample variance (using Bessel's correction) is an unbiased estimator of the population variance, its square root, the sample standard deviation, is a biased estimator of the population standard deviation; because the square root is a concave function, the bias is downward, by Jensen's inequality. There is no general formula for an unbiased estimator of the population standard deviation, though there are correction factors for particular distributions, such as the normal; see unbiased estimation of standard deviation for details. An approximation for the exact correction factor for the normal distribution is given by using n − 1.5 in the formula: the bias then decays quadratically (rather than linearly, as in the uncorrected form and Bessel's corrected form).
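For normally distributed data, the expected value of the corrected sample standard deviation is a known multiple of σ, commonly written c4(n). The sketch below evaluates that factor and the resulting relative bias for three divisors. It assumes independent normal observations; the function names `c4` and `relative_bias` are illustrative choices.

```python
import math

def c4(n):
    """E[s] / sigma for n independent normal observations (s uses n - 1)."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def relative_bias(n, divisor):
    """Relative bias of sqrt(SS / divisor) as an estimator of sigma,
    where SS is the sum of squared deviations from the sample mean."""
    return c4(n) * math.sqrt((n - 1) / divisor) - 1.0

for n in (5, 10, 20):
    print(n,
          relative_bias(n, n),        # uncorrected
          relative_bias(n, n - 1),    # Bessel's correction
          relative_bias(n, n - 1.5))  # n - 1.5 approximation
```

Under these assumptions the first two biases are negative, and the n − 1.5 divisor leaves a much smaller residual bias.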

Secondly, the unbiased estimator does not minimize mean squared error (MSE), and generally has worse MSE than the uncorrected estimator (this varies with excess kurtosis). MSE can be minimized by using a different factor. The optimal value depends on excess kurtosis, as discussed in mean squared error: variance; for the normal distribution this is optimized by dividing by n + 1 (instead of n − 1 or n).
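The comparison for the normal case can be made concrete with a short sketch. It assumes normally distributed data, so that the sum of squared residuals divided by σ² follows a chi-squared distribution with n − 1 degrees of freedom. The function name `mse_normal` is an illustrative choice.

```python
from fractions import Fraction

def mse_normal(n, divisor):
    """MSE, in units of sigma**4, of SS / divisor for normal data.

    SS / sigma**2 is chi-squared with n - 1 degrees of freedom,
    which has mean n - 1 and variance 2 * (n - 1).
    """
    d = Fraction(divisor)
    variance = Fraction(2 * (n - 1)) / d**2
    bias = Fraction(n - 1) / d - 1
    return variance + bias**2

n = 10
for divisor in (n - 1, n, n + 1):
    print(divisor, mse_normal(n, divisor))
```

For n = 10 this gives 2/9 for the divisor n − 1, 19/100 for n, and 2/11 for n + 1, so the unbiased estimator has the largest MSE of the three.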

Thirdly, Bessel's correction is only necessary when the population mean is unknown, and one is estimating both population mean and population variance from a given sample, using the sample mean to estimate the population mean. In that case there are n degrees of freedom in a sample of n points, and simultaneous estimation of mean and variance means one degree of freedom goes to the sample mean and the remaining n − 1 degrees of freedom (the residuals) go to the sample variance. However, if the population mean is known, then the deviations of the observations from the population mean have n degrees of freedom (because the mean is not being estimated – the deviations are not residuals but errors) and Bessel's correction is not applicable.
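The difference between the known-mean and unknown-mean cases can be checked by exhaustive enumeration. The sketch below uses a small made-up population and considers every ordered sample of size 2 drawn with replacement; the population values and function name are assumptions for illustration only.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 6]  # hypothetical population
N = len(population)
mu = Fraction(sum(population), N)                    # known mean, 3
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance

def mean_sq_dev_about(center, sample):
    """Average squared deviation of the sample about `center` (factor 1/n)."""
    return sum((x - center) ** 2 for x in sample) / len(sample)

samples = list(product(population, repeat=2))  # all ordered samples, n = 2

# Known mean: deviations are errors, and dividing by n is already unbiased.
avg_known = sum(mean_sq_dev_about(mu, s) for s in samples) / len(samples)

# Unknown mean: deviations are residuals about the sample mean,
# and dividing by n underestimates the variance on average.
avg_unknown = sum(
    mean_sq_dev_about(Fraction(sum(s), len(s)), s) for s in samples
) / len(samples)

print(sigma2, avg_known, avg_unknown)  # 14/3 14/3 7/3
```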

Source of bias

Most simply, to understand the bias that needs correcting, think of an extreme case. Suppose the population is (0, 0, 0, 1, 2, 9), which has a population mean of 2 and a population variance of 10 1/3. A sample of n = 1 is drawn, and it turns out to be $x_1 = 0$. The best estimate of the population mean is $\overline{x} = x_1/n = 0/1 = 0$. But what if we use the formula $(x_1 - \overline{x})^2/n = (0 - 0)^2/1 = 0$ to estimate the variance? The estimate of the variance would be zero, and it would be zero for any population and any sample of n = 1. The problem is that in estimating the sample mean, the process has already made our estimate of the mean close to the value we sampled: identical, for n = 1. In the case of n = 1, the variance cannot be estimated at all, because there is no variability in the sample.

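But consider n = 2. Suppose the sample were (0, 2). Then $\overline{x} = 1$ and $[(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2]/n = (1 + 1)/2 = 1$, but with Bessel's correction, $[(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2]/(n - 1) = (1 + 1)/1 = 2$, which is an unbiased estimate (if all possible samples of n = 2, drawn with replacement, are taken and this method is used, the average estimate will be 10 1/3).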
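The averaging claim for n = 2 can be checked by brute force. The sketch below enumerates all 36 ordered samples from the population (0, 0, 0, 1, 2, 9), assuming sampling with replacement (the sampling scheme is an assumption made here; the text does not state it). The helper name `sum_sq_residuals` is illustrative.

```python
from fractions import Fraction
from itertools import product

population = [0, 0, 0, 1, 2, 9]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N

def sum_sq_residuals(sample):
    """Sum of squared deviations from the sample's own mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))  # 36 ordered samples
avg_uncorrected = sum(sum_sq_residuals(s) / n for s in samples) / len(samples)
avg_corrected = sum(sum_sq_residuals(s) / (n - 1) for s in samples) / len(samples)

print(mu, sigma2)       # 2 31/3
print(avg_uncorrected)  # 31/6
print(avg_corrected)    # 31/3
```

Here 31/3 is the 10 1/3 quoted above, and the uncorrected average is exactly (n − 1)/n times it.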

To see this in more detail, consider the following example. Suppose the mean of the whole population is 2050, but the statistician does not know that, and must estimate it based on this small sample chosen randomly from the population:

2051, 2053, 2055, 2050, 2051

One may compute the sample average:

$$\frac{1}{5}\left(2051 + 2053 + 2055 + 2050 + 2051\right) = 2052$$

This may serve as an observable estimate of the unobservable population average, which is 2050. Now we face the problem of estimating the population variance. That is the average of the squares of the deviations from 2050. If we knew that the population average is 2050, we could proceed as follows:

$$\frac{1}{5}\left[(2051 - 2050)^2 + (2053 - 2050)^2 + (2055 - 2050)^2 + (2050 - 2050)^2 + (2051 - 2050)^2\right] = \frac{36}{5} = 7.2$$

But our estimate of the population average is the sample average, 2052. The actual average, 2050, is unknown. So the sample average, 2052, must be used:

$$\frac{1}{5}\left[(2051 - 2052)^2 + (2053 - 2052)^2 + (2055 - 2052)^2 + (2050 - 2052)^2 + (2051 - 2052)^2\right] = \frac{16}{5} = 3.2$$

The variance estimate is now a lot smaller. As shown below, the estimate will almost always be smaller when calculated using the sum of squared distances to the sample mean than when using the sum of squared distances to the population mean. The one exception is when the sample mean happens to equal the population mean, in which case the two estimates are equal.
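The two calculations can be reproduced in a few lines of Python. The five observations below are assumed values, chosen to be consistent with the sample mean of 2052 in the example; the function name is illustrative.

```python
from fractions import Fraction

sample = [2051, 2053, 2055, 2050, 2051]  # assumed observations
population_mean = 2050                   # unknown to the statistician in practice
n = len(sample)

sample_mean = Fraction(sum(sample), n)

def mean_squared_deviation(center):
    """Average squared deviation of the sample about `center` (factor 1/n)."""
    return sum((x - center) ** 2 for x in sample) / Fraction(n)

about_population_mean = mean_squared_deviation(population_mean)
about_sample_mean = mean_squared_deviation(sample_mean)

print(sample_mean)                   # 2052
print(float(about_population_mean))  # 7.2
print(float(about_sample_mean))      # 3.2
```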

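To see why this happens, we use a simple identity in algebra:

$$(a + b)^2 = a^2 + 2ab + b^2$$

with $a$ representing the deviation of an individual sample from the sample mean, and $b$ representing the deviation of the sample mean from the population mean. Note that we have simply decomposed the actual deviation of an individual sample from the (unknown) population mean into two components: the deviation of the single sample from the sample mean, which we can compute, and the additional deviation of the sample mean from the population mean, which we cannot. Now, we apply this identity to the squares of deviations from the population mean, writing $x_i$ for an individual observation:

$$(x_i - 2050)^2 = [\,\underbrace{(x_i - 2052)}_{a} + \underbrace{(2052 - 2050)}_{b}\,]^2 = (x_i - 2052)^2 + 2(x_i - 2052)(2052 - 2050) + (2052 - 2050)^2$$

Now apply this to all five observations and observe certain patterns:

$$\begin{array}{ccc}
a^2 & 2ab & b^2 \\
(2051 - 2052)^2 & 2(2051 - 2052)(2052 - 2050) & (2052 - 2050)^2 \\
(2053 - 2052)^2 & 2(2053 - 2052)(2052 - 2050) & (2052 - 2050)^2 \\
(2055 - 2052)^2 & 2(2055 - 2052)(2052 - 2050) & (2052 - 2050)^2 \\
(2050 - 2052)^2 & 2(2050 - 2052)(2052 - 2050) & (2052 - 2050)^2 \\
(2051 - 2052)^2 & 2(2051 - 2052)(2052 - 2050) & (2052 - 2050)^2
\end{array}$$

The sum of the entries in the middle column must be zero, because the deviations $a$ sum to zero across all five rows: the five individual observations (the left-hand numbers within the parentheses) have the same total as five times their sample mean (2052), so subtracting one total from the other gives zero. The factor 2 and the term $b$ are the same in every row, so they can be factored out, and the middle column sums to $2b$ times the sum of the $a$ terms, which is zero. The remaining columns have the following meaning:

The sum of the entries in the first column ($a^2$) is the sum of squared deviations of the observations from the sample mean.
The sum of the entries in the last column ($b^2$) is the sum of squared deviations of the sample mean from the population mean. Every entry is non-negative, and all entries are zero only when the sample mean equals the population mean.

Therefore, the sum of squared deviations from the population mean equals the first-column sum plus the last-column sum. It is never smaller than the sum of squared deviations from the sample mean, and the two are equal only when the sample mean happens to equal the population mean.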

That is why the sum of squares of the deviations from the sample mean is too small to give an unbiased estimate of the population variance when the average of those squares is found. The smaller the sample size, the larger is the difference between the sample variance and the population variance.
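The column argument above can be verified numerically. The sketch below assumes the five observations 2051, 2053, 2055, 2050, 2051 (illustrative values consistent with the sample mean of 2052) and sums the three columns a², 2ab and b².

```python
sample = [2051, 2053, 2055, 2050, 2051]  # assumed observations
sample_mean = 2052                       # mean of the five values above
population_mean = 2050

b = sample_mean - population_mean  # the same in every row
rows = []
for x in sample:
    a = x - sample_mean                     # residual of this observation
    rows.append((a * a, 2 * a * b, b * b))  # columns a^2, 2ab, b^2

col_a2 = sum(r[0] for r in rows)
col_2ab = sum(r[1] for r in rows)
col_b2 = sum(r[2] for r in rows)
ss_about_population_mean = sum((x - population_mean) ** 2 for x in sample)

print(col_a2, col_2ab, col_b2)   # 16 0 20
print(ss_about_population_mean)  # 36
```

The middle column vanishes, and the first and last columns add up to the sum of squares about the population mean (16 + 20 = 36).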

Terminology

This correction is so common that the terms "sample variance" and "sample standard deviation" are frequently used to mean the corrected estimators (unbiased sample variance, less biased sample standard deviation), using n − 1. However, caution is needed: some calculators and software packages may provide both formulations, or only the less common one. This article uses the following symbols and definitions:

$\mu$ is the population mean
$\overline{x}$ is the sample mean
$\sigma^2$ is the population variance
$s_n^2$ is the biased sample variance (i.e. without Bessel's correction)
$s^2$ is the unbiased sample variance (i.e. with Bessel's correction)

The standard deviations will then be the square roots of the respective variances. Since the square root introduces bias, the terminology "uncorrected" and "corrected" is preferred for the standard deviation estimators:

$s_n$ is the uncorrected sample standard deviation (i.e. without Bessel's correction)
$s$ is the corrected sample standard deviation (i.e. with Bessel's correction), which is less biased, but still biased
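As an illustration of why the caution about software is warranted, Python's standard `statistics` module exposes both conventions under different names. The data values below are hypothetical; the variable names follow the symbols defined above.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(data)

s_n2 = statistics.pvariance(data)  # divides by n     (uncorrected)
s2 = statistics.variance(data)     # divides by n - 1 (Bessel's correction)
s_n = statistics.pstdev(data)      # square root of s_n2
s = statistics.stdev(data)         # square root of s2

print(s_n2, s2)
print(s_n, s)
```

Other libraries choose different defaults: NumPy's `var` and `std` divide by n unless `ddof=1` is passed, while pandas divides by n − 1 by default. It is worth checking the documentation of whichever tool is in use.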

Formula

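The sample mean is given by

$$\overline{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

The biased sample variance is then written:

$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \overline{x}\right)^2 = \frac{\sum_{i=1}^{n} x_i^2}{n} - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n^2}$$

and the unbiased sample variance is written:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \overline{x}\right)^2 = \frac{\sum_{i=1}^{n} x_i^2}{n-1} - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{(n-1)\,n} = \frac{n}{n-1}\,s_n^2.$$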
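The definitions of the sample mean, the biased sample variance and the unbiased sample variance translate directly into code. The following is a minimal sketch with illustrative function names, assuming integer inputs so that exact fractions can be used. It also checks the algebraic shortcut for the biased variance, sum(x²)/n − (sum(x))²/n².

```python
from fractions import Fraction

def sample_mean(xs):
    """Arithmetic mean of integer observations, as an exact fraction."""
    return Fraction(sum(xs), len(xs))

def biased_variance(xs):
    """s_n^2: mean squared deviation from the sample mean (divide by n)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def biased_variance_shortcut(xs):
    """s_n^2 computed as sum(x^2)/n - (sum(x))^2/n^2."""
    n = len(xs)
    return Fraction(sum(x * x for x in xs), n) - Fraction(sum(xs) ** 2, n * n)

def unbiased_variance(xs):
    """s^2: s_n^2 rescaled by n / (n - 1); requires n >= 2."""
    n = len(xs)
    return biased_variance(xs) * Fraction(n, n - 1)

xs = [3, 5, 10]  # hypothetical integer sample
print(sample_mean(xs), biased_variance(xs), unbiased_variance(xs))  # 6 26/3 13
```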

Proof of correctness
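
One standard derivation is sketched below. It assumes that the observations are independent draws from a population with mean $\mu$ and finite variance $\sigma^2$, and it is not necessarily one of the proofs originally given in this section. It uses the fact that the variance of the sample mean is $\sigma^2/n$.

```latex
\begin{align}
\sum_{i=1}^{n} (x_i - \overline{x})^2
  &= \sum_{i=1}^{n} (x_i - \mu)^2 - n(\overline{x} - \mu)^2, \\
\operatorname{E}\!\left[\sum_{i=1}^{n} (x_i - \overline{x})^2\right]
  &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2, \\
\operatorname{E}\!\left[s_n^2\right] &= \frac{n-1}{n}\,\sigma^2, \qquad
\operatorname{E}\!\left[s^2\right] = \frac{n}{n-1}\operatorname{E}\!\left[s_n^2\right] = \sigma^2 .
\end{align}
```

The first line is the decomposition of squared deviations about $\mu$ into deviations about $\overline{x}$ plus the deviation of $\overline{x}$ from $\mu$; the cross term vanishes because the residuals sum to zero.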

Intuition

In the biased estimator, by using the sample mean instead of the true mean, you are underestimating each $x_i - \mu$ by $\overline{x} - \mu$. We know that the variance of a sum is the sum of the variances (for uncorrelated variables). So, to find the discrepancy between the biased estimator and the true variance, we just need to find the expected value of $(\overline{x} - \mu)^2$.

This is just the variance of the sample mean, which is $\sigma^2/n$. So, we expect that the biased estimator underestimates $\sigma^2$ by $\sigma^2/n$, and so the biased estimator = (1 − 1/n) × the unbiased estimator = (n − 1)/n × the unbiased estimator.

Notes

  1. Radziwill, Nicole M. (2017). Statistics (the easier way) with R. ISBN 9780996916059. OCLC 1030532622.
  2. Reichmann, W. J. (1961). Use and abuse of statistics. Methuen. Reprinted 1964–1970 by Pelican. Appendix 8.
  3. Upton, G.; Cook, I. (2008). Oxford Dictionary of Statistics. OUP. ISBN 978-0-19-954145-4 (entry for "Variance (data)").
  4. Rosenthal, Jeffrey S. (2015). "The Kids are Alright: Divide by n when estimating variance". Bulletin of the Institute of Mathematical Statistics. December 2015: 9.
