# Bias of an estimator

In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased. In statistics, "bias" is an objective property of an estimator, and while not a desired property, it is not pejorative, unlike the ordinary English use of the term "bias".

Statistics is a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation. In applying statistics to, for example, a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model of the process to be studied. Populations can be diverse topics such as "all people living in a country" or "every atom composing a crystal". Statistics deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments.

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished.

In probability theory, the expected value of a random variable, intuitively, is the long-run average value of repetitions of the same experiment it represents. For example, the expected value in rolling a six-sided die is 3.5, because the average of all the numbers that come up is 3.5 as the number of rolls approaches infinity. In other words, the law of large numbers states that the arithmetic mean of the values almost surely converges to the expected value as the number of repetitions approaches infinity. The expected value is also known as the expectation, mathematical expectation, EV, average, mean value, mean, or first moment.

Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes median-unbiased from the usual mean-unbiasedness property. Bias is related to consistency in that consistent estimators are convergent and asymptotically unbiased (hence converge to the correct value as the number of data points grows arbitrarily large), though individual estimators in a consistent sequence may be biased (so long as the bias converges to zero); see bias versus consistency.

The median is the value separating the higher half from the lower half of a data sample. For a data set, it may be thought of as the "middle" value. For example, in the data set {1, 3, 3, 6, 7, 8, 9}, the median is 6, the fourth largest, and also the fourth smallest, number in the sample. For a continuous probability distribution, the median is the value such that a number is equally likely to fall above or below it.

In statistics, a consistent estimator or asymptotically consistent estimator is an estimator—a rule for computing estimates of a parameter θ0—having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to θ0. This means that the distributions of the estimates become more and more concentrated near the true value of the parameter being estimated, so that the probability of the estimator being arbitrarily close to θ0 converges to one.
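
To make the bias-versus-consistency distinction concrete, here is a minimal Monte Carlo sketch (the estimator X̄ + 1/n and all parameter values are illustrative constructions, not taken from the text above): its bias is exactly 1/n at every sample size, yet it converges to the true mean as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0  # true mean, assumed known here only so the bias can be measured

# A deliberately biased but consistent estimator of mu: the sample mean
# plus 1/n. Its bias is exactly 1/n, which vanishes as n grows.
for n in [10, 100, 1000]:
    estimates = rng.normal(mu, 1.0, size=(50_000, n)).mean(axis=1) + 1.0 / n
    print(n, estimates.mean() - mu)  # empirical bias, approximately 1/n
```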

All else being equal, an unbiased estimator is preferable to a biased estimator, but in practice all else is not equal, and biased estimators are frequently used, generally with small bias. When a biased estimator is used, bounds of the bias are calculated. A biased estimator may be used for various reasons: because an unbiased estimator does not exist without further assumptions about a population or is difficult to compute (as in unbiased estimation of standard deviation); because an estimator is median-unbiased but not mean-unbiased (or the reverse); because a biased estimator gives a lower value of some loss function (particularly mean squared error) compared with unbiased estimators (notably in shrinkage estimators); or because in some cases being unbiased is too strong a condition, and the only unbiased estimators are not useful. Further, mean-unbiasedness is not preserved under non-linear transformations, though median-unbiasedness is (see effect of transformations); for example, the sample variance is an unbiased estimator for the population variance, but its square root, the sample standard deviation, is a biased estimator for the population standard deviation. These are all illustrated below.

In statistics and in particular statistical theory, unbiased estimation of a standard deviation is the calculation from a statistical sample of an estimated value of the standard deviation of a population of values, in such a way that the expected value of the calculation equals the true value. Except in some important situations, outlined later, the task has little relevance to applications of statistics since its need is avoided by standard procedures, such as the use of significance tests and confidence intervals, or by using Bayesian analysis.

In mathematical optimization, statistics, econometrics, decision theory, machine learning and computational neuroscience, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative, in which case it is to be maximized.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.

## Definition

Suppose we have a statistical model, parameterized by a real number θ, giving rise to a probability distribution for observed data, ${\displaystyle P_{\theta }(x)=P(x\mid \theta )}$, and a statistic ${\displaystyle {\hat {\theta }}}$ which serves as an estimator of θ based on any observed data ${\displaystyle x}$. That is, we assume that our data follow some unknown distribution ${\displaystyle P(x\mid \theta )}$ (where θ is a fixed constant that is part of this distribution, but is unknown), and then we construct some estimator ${\displaystyle {\hat {\theta }}}$ that maps observed data to values that we hope are close to θ. The bias of ${\displaystyle {\hat {\theta }}}$ relative to ${\displaystyle \theta }$ is defined as

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of some sample data. A statistical model represents, often in considerably idealized form, the data-generating process.

${\displaystyle \operatorname {Bias} _{\theta }[\,{\hat {\theta }}\,]=\operatorname {E} _{x\mid \theta }[\,{\hat {\theta }}\,]-\theta =\operatorname {E} _{x\mid \theta }[\,{\hat {\theta }}-\theta \,],}$

where ${\displaystyle \operatorname {E} _{x\mid \theta }}$ denotes expected value over the distribution ${\displaystyle P(x\mid \theta )}$, i.e. averaging over all possible observations ${\displaystyle x}$. The second equation follows since θ is measurable with respect to the conditional distribution ${\displaystyle P(x\mid \theta )}$.

An estimator is said to be unbiased if its bias is equal to zero for all values of parameter θ.

In a simulation experiment concerning the properties of an estimator, the bias of the estimator may be assessed using the mean signed difference.
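
A minimal sketch of such a simulation experiment (the normal model, sample size and replication count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
theta, n, n_rep = 1.5, 30, 100_000  # true mean, sample size, replications

# Draw n_rep data sets, apply the estimator (here: the sample mean) to
# each, and average the signed differences from the true value. This
# mean signed difference is a Monte Carlo estimate of the bias.
estimates = rng.normal(theta, 1.0, size=(n_rep, n)).mean(axis=1)
print("estimated bias:", (estimates - theta).mean())  # ~ 0 for the sample mean
```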

## Examples

### Sample variance

The sample variance of a random variable demonstrates two aspects of estimator bias: first, the naive estimator is biased, which can be corrected by a scale factor; second, the unbiased estimator is not optimal in terms of mean squared error (MSE), which can be minimized by using a different scale factor, resulting in a biased estimator with lower MSE than the unbiased estimator. Concretely, the naive estimator sums the squared deviations and divides by n, which is biased. Dividing instead by n − 1 yields an unbiased estimator. Further, MSE can be minimized by dividing by a different number (depending on distribution), but this results in a biased estimator. This number is always larger than n − 1, so this is known as a shrinkage estimator, as it "shrinks" the unbiased estimator towards zero; for the normal distribution the optimal value is n + 1.

Suppose X1, ..., Xn are independent and identically distributed (i.i.d.) random variables with expectation μ and variance σ2. If the sample mean and uncorrected sample variance are defined as

${\displaystyle {\overline {X}}\,={\frac {1}{n}}\sum _{i=1}^{n}X_{i}\qquad S^{2}={\frac {1}{n}}\sum _{i=1}^{n}{\big (}X_{i}-{\overline {X}}\,{\big )}^{2}\qquad }$

then S2 is a biased estimator of σ2, because

{\displaystyle {\begin{aligned}\operatorname {E} [S^{2}]&=\operatorname {E} \left[{\frac {1}{n}}\sum _{i=1}^{n}{\big (}X_{i}-{\overline {X}}{\big )}^{2}\right]=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}{\bigg (}(X_{i}-\mu )-({\overline {X}}-\mu ){\bigg )}^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}{\bigg (}(X_{i}-\mu )^{2}-2({\overline {X}}-\mu )(X_{i}-\mu )+({\overline {X}}-\mu )^{2}{\bigg )}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n}}({\overline {X}}-\mu )\sum _{i=1}^{n}(X_{i}-\mu )+{\frac {1}{n}}({\overline {X}}-\mu )^{2}\sum _{i=1}^{n}1{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n}}({\overline {X}}-\mu )\sum _{i=1}^{n}(X_{i}-\mu )+{\frac {1}{n}}({\overline {X}}-\mu )^{2}\cdot n{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n}}({\overline {X}}-\mu )\sum _{i=1}^{n}(X_{i}-\mu )+({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]\end{aligned}}}

To continue, we note that by subtracting ${\displaystyle \mu }$ from both sides of ${\displaystyle {\overline {X}}={\frac {1}{n}}\sum _{i=1}^{n}X_{i}}$, we get

{\displaystyle {\begin{aligned}{\overline {X}}-\mu ={\frac {1}{n}}\sum _{i=1}^{n}X_{i}-\mu ={\frac {1}{n}}\sum _{i=1}^{n}X_{i}-{\frac {1}{n}}\sum _{i=1}^{n}\mu \ ={\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu ).\\[8pt]\end{aligned}}}

That is, multiplying both sides by ${\displaystyle n}$ gives ${\displaystyle n\cdot ({\overline {X}}-\mu )=\sum _{i=1}^{n}(X_{i}-\mu )}$. Then, the previous expression becomes:

{\displaystyle {\begin{aligned}\operatorname {E} [S^{2}]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n}}({\overline {X}}-\mu )\sum _{i=1}^{n}(X_{i}-\mu )+({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n}}({\overline {X}}-\mu )\cdot n\cdot ({\overline {X}}-\mu )+({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-2({\overline {X}}-\mu )^{2}+({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}{\bigg ]}-\operatorname {E} {\bigg [}({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\sigma ^{2}-\operatorname {E} \left[({\overline {X}}-\mu )^{2}\right]=\left(1-{\frac {1}{n}}\right)\sigma ^{2}<\sigma ^{2}.\end{aligned}}}

In other words, the expected value of the uncorrected sample variance does not equal the population variance σ2, unless multiplied by a normalization factor. The sample mean, on the other hand, is an unbiased[1] estimator of the population mean μ.

Note that the usual definition of sample variance is ${\displaystyle S^{2}={\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-{\overline {X}}\,)^{2}}$, and this is an unbiased estimator of the population variance.

This can be seen by noting the following formula, which follows from the Bienaymé formula, for the term in the inequality for the expectation of the uncorrected sample variance above: ${\displaystyle \operatorname {E} {\big [}({\overline {X}}-\mu )^{2}{\big ]}={\frac {1}{n}}\sigma ^{2}}$

Algebraically speaking, ${\displaystyle S^{2}}$ (with the n − 1 denominator) is unbiased because:

{\displaystyle {\begin{aligned}\operatorname {E} [S^{2}]&=\operatorname {E} \left[{\frac {1}{n-1}}\sum _{i=1}^{n}{\big (}X_{i}-{\overline {X}}{\big )}^{2}\right]=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}{\bigg (}(X_{i}-\mu )-({\overline {X}}-\mu ){\bigg )}^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}{\bigg (}(X_{i}-\mu )^{2}-2({\overline {X}}-\mu )(X_{i}-\mu )+({\overline {X}}-\mu )^{2}{\bigg )}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n-1}}({\overline {X}}-\mu )\sum _{i=1}^{n}(X_{i}-\mu )+{\frac {1}{n-1}}({\overline {X}}-\mu )^{2}\sum _{i=1}^{n}1{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n-1}}({\overline {X}}-\mu )\sum _{i=1}^{n}(X_{i}-\mu )+{\frac {1}{n-1}}({\overline {X}}-\mu )^{2}\cdot n{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n-1}}({\overline {X}}-\mu )\sum _{i=1}^{n}(X_{i}-\mu )+{\frac {n}{n-1}}({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]\end{aligned}}}

To continue, as above, we note that when ${\displaystyle \mu }$ is subtracted from both sides of ${\displaystyle {\overline {X}}={\frac {1}{n}}\sum _{i=1}^{n}X_{i}}$, we get

{\displaystyle {\begin{aligned}{\overline {X}}-\mu ={\frac {1}{n}}\sum _{i=1}^{n}X_{i}-\mu ={\frac {1}{n}}\sum _{i=1}^{n}X_{i}-{\frac {1}{n}}\sum _{i=1}^{n}\mu \ ={\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )\\[8pt]\end{aligned}}}

and multiplying both sides by ${\displaystyle n}$ yields ${\displaystyle n\cdot ({\overline {X}}-\mu )=\sum _{i=1}^{n}(X_{i}-\mu )}$. Then we have:

{\displaystyle {\begin{aligned}\operatorname {E} [S^{2}]&=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n-1}}({\overline {X}}-\mu )\sum _{i=1}^{n}(X_{i}-\mu )+{\frac {n}{n-1}}({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2}{n-1}}({\overline {X}}-\mu )\cdot n\cdot ({\overline {X}}-\mu )+{\frac {n}{n-1}}({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {2n}{n-1}}({\overline {X}}-\mu )^{2}+{\frac {n}{n-1}}({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}-{\frac {n}{n-1}}({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}{\bigg ]}-\operatorname {E} {\bigg [}{\frac {n}{n-1}}({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&=\operatorname {E} {\bigg [}{\frac {n}{n-1}}\cdot {\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}{\bigg ]}-\operatorname {E} {\bigg [}{\frac {n}{n-1}}({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&={\frac {n}{n-1}}\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}{\bigg ]}-{\frac {n}{n-1}}\operatorname {E} {\bigg [}({\overline {X}}-\mu )^{2}{\bigg ]}\\[8pt]&={\frac {n}{n-1}}\cdot \sigma ^{2}-{\frac {n}{n-1}}\cdot {\frac {1}{n}}\sigma ^{2}\\[8pt]&={\frac {n}{n-1}}\cdot \sigma ^{2}-{\frac {1}{n-1}}\cdot \sigma ^{2}\\[8pt]&={\frac {n-1}{n-1}}\cdot \sigma ^{2}\\[8pt]&=\sigma ^{2}\end{aligned}}}

Thus ${\displaystyle \operatorname {E} [S^{2}]=\sigma ^{2}}$, and therefore ${\displaystyle S^{2}={\frac {1}{n-1}}\sum _{i=1}^{n}(X_{i}-{\overline {X}}\,)^{2}}$ is an unbiased estimator of the population variance, σ2. The ratio between the biased (uncorrected) and unbiased estimates of the variance is known as Bessel's correction.
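
A numerical check of this result (a simulation sketch; the parameter values are arbitrary): dividing by n underestimates σ2 by the factor (n − 1)/n, while dividing by n − 1 is unbiased on average.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n = 0.0, 4.0, 10

x = rng.normal(mu, np.sqrt(sigma2), size=(200_000, n))
var_n = x.var(axis=1, ddof=0)    # uncorrected: divide by n
var_n1 = x.var(axis=1, ddof=1)   # Bessel's correction: divide by n - 1

print(var_n.mean())   # ~ (1 - 1/n) * sigma2 = 3.6
print(var_n1.mean())  # ~ sigma2 = 4.0
```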

The reason that an uncorrected sample variance, S2, is biased stems from the fact that the sample mean is an ordinary least squares (OLS) estimator for μ: ${\displaystyle {\overline {X}}}$ is the number that makes the sum ${\displaystyle \sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}}$ as small as possible. That is, when any other number is plugged into this sum, the sum can only increase. In particular, the choice ${\displaystyle \mu \neq {\overline {X}}}$ gives,

${\displaystyle {\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}<{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2},}$

and then

{\displaystyle {\begin{aligned}\operatorname {E} [S^{2}]&=\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}{\bigg ]}<\operatorname {E} {\bigg [}{\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-\mu )^{2}{\bigg ]}=\sigma ^{2}.\end{aligned}}}

The above discussion can be understood in geometric terms: the vector ${\displaystyle {\vec {C}}=(X_{1}-\mu ,\ldots ,X_{n}-\mu )}$ can be decomposed into the "mean part" and "variance part" by projecting onto the direction of ${\displaystyle {\vec {u}}=(1,\ldots ,1)}$ and onto that direction's orthogonal complement hyperplane. One gets ${\displaystyle {\vec {A}}=({\overline {X}}-\mu ,\ldots ,{\overline {X}}-\mu )}$ for the part along ${\displaystyle {\vec {u}}}$ and ${\displaystyle {\vec {B}}=(X_{1}-{\overline {X}},\ldots ,X_{n}-{\overline {X}})}$ for the complementary part. Since this is an orthogonal decomposition, the Pythagorean theorem gives ${\displaystyle |{\vec {C}}|^{2}=|{\vec {A}}|^{2}+|{\vec {B}}|^{2}}$, and taking expectations we get ${\displaystyle n\sigma ^{2}=n\operatorname {E} \left[({\overline {X}}-\mu )^{2}\right]+n\operatorname {E} [S^{2}]}$, as above (but times ${\displaystyle n}$). If the distribution of ${\displaystyle {\vec {C}}}$ is rotationally symmetric, as in the case when the ${\displaystyle X_{i}}$ are sampled from a Gaussian, then on average the dimension along ${\displaystyle {\vec {u}}}$ contributes to ${\displaystyle |{\vec {C}}|^{2}}$ as much as each of the ${\displaystyle n-1}$ directions perpendicular to ${\displaystyle {\vec {u}}}$, so that ${\displaystyle \operatorname {E} \left[({\overline {X}}-\mu )^{2}\right]={\frac {\sigma ^{2}}{n}}}$ and ${\displaystyle \operatorname {E} [S^{2}]={\frac {(n-1)\sigma ^{2}}{n}}}$. This is in fact true in general, as explained above.
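
The decomposition can be checked directly on any sample (a small sketch with arbitrary numbers): the identity |C|2 = |A|2 + |B|2 holds exactly, sample by sample, before any expectations are taken.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, n = 1.0, 8
x = rng.normal(mu, 1.0, n)
xbar = x.mean()

C = x - mu                  # total deviation from the true mean
A = np.full(n, xbar - mu)   # projection onto u = (1, ..., 1): the "mean part"
B = x - xbar                # orthogonal complement: the "variance part"

print(np.allclose(C @ C, A @ A + B @ B))  # True: exact orthogonal decomposition
```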

### Estimating a Poisson probability

A far more extreme case of a biased estimator being better than any unbiased estimator arises from the Poisson distribution. [2] [3] Suppose that X has a Poisson distribution with expectation λ. Suppose it is desired to estimate

${\displaystyle \operatorname {P} (X=0)^{2}=e^{-2\lambda }\quad }$

with a sample of size 1. (For example, when incoming calls at a telephone switchboard are modeled as a Poisson process, and λ is the average number of calls per minute, then e−2λ is the probability that no calls arrive in the next two minutes.)

Since the expectation of an unbiased estimator δ(X) is equal to the estimand, i.e.

${\displaystyle \operatorname {E} (\delta (X))=\sum _{x=0}^{\infty }\delta (x){\frac {\lambda ^{x}e^{-\lambda }}{x!}}=e^{-2\lambda },}$

the only function of the data constituting an unbiased estimator is

${\displaystyle \delta (x)=(-1)^{x}.\,}$

To see this, note that factoring ${\displaystyle e^{-\lambda }}$ out of the above expression for the expectation leaves a sum that is itself a Taylor series expansion of ${\displaystyle e^{-\lambda }}$, yielding ${\displaystyle e^{-\lambda }\cdot e^{-\lambda }=e^{-2\lambda }}$ (see Characterizations of the exponential function).
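
Spelled out with ${\displaystyle \delta (x)=(-1)^{x}}$, the verification is just the exponential series:

${\displaystyle \operatorname {E} (\delta (X))=\sum _{x=0}^{\infty }(-1)^{x}{\frac {\lambda ^{x}e^{-\lambda }}{x!}}=e^{-\lambda }\sum _{x=0}^{\infty }{\frac {(-\lambda )^{x}}{x!}}=e^{-\lambda }\,e^{-\lambda }=e^{-2\lambda }.}$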

If the observed value of X is 100, then the estimate is 1, although the true value of the quantity being estimated is very likely to be near 0, which is the opposite extreme. And, if X is observed to be 101, then the estimate is even more absurd: It is −1, although the quantity being estimated must be positive.

The (biased) maximum likelihood estimator

${\displaystyle e^{-2{X}}\quad }$

is far better than this unbiased estimator. Not only is its value always positive but it is also more accurate in the sense that its mean squared error

${\displaystyle e^{-4\lambda }-2e^{\lambda (1/e^{2}-3)}+e^{\lambda (1/e^{4}-1)}\,}$

is smaller; compare the unbiased estimator's MSE of

${\displaystyle 1-e^{-4\lambda }.\,}$

The MSEs are functions of the true value λ. The bias of the maximum-likelihood estimator is:

${\displaystyle e^{-2\lambda }-e^{\lambda (1/e^{2}-1)}.\,}$
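
A simulation comparing the two estimators (a sketch; λ = 2 is an arbitrary choice) makes the trade-off vivid: the unbiased estimator has the right mean but oscillates between +1 and −1, while the biased MLE has a far smaller mean squared error.

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0
target = np.exp(-2 * lam)  # the quantity P(X = 0)^2 being estimated

x = rng.poisson(lam, 1_000_000)
unbiased = (-1.0) ** x     # the only unbiased estimator, delta(x) = (-1)^x
mle = np.exp(-2.0 * x)     # the biased maximum likelihood estimator

for name, est in [("unbiased", unbiased), ("MLE", mle)]:
    print(name, "mean:", est.mean(), "MSE:", ((est - target) ** 2).mean())
# The unbiased estimator's MSE is ~ 1 - exp(-4*lam); the MLE's is tiny.
```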

### Maximum of a discrete uniform distribution

The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random, giving a value X. If n is unknown, then the maximum-likelihood estimator of n is X, even though the expectation of X given n is only (n + 1)/2; we can be certain only that n is at least X and is probably more. In this case, the natural unbiased estimator is 2X − 1.
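
A quick simulation of the ticket example (a sketch; n = 50 and the replication count are arbitrary) shows the maximum-likelihood estimate centred near (n + 1)/2 while 2X − 1 is centred on n:

```python
import numpy as np

rng = np.random.default_rng(4)
n_true = 50                                     # true number of tickets
x = rng.integers(1, n_true + 1, size=500_000)   # one ticket drawn per replication

print("E[X]      ~", x.mean())            # ~ (n + 1)/2 = 25.5: MLE badly biased
print("E[2X - 1] ~", (2 * x - 1).mean())  # ~ n = 50: unbiased
```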

## Median-unbiased estimators

The theory of median-unbiased estimators was revived by George W. Brown in 1947: [4]

> An estimate of a one-dimensional parameter θ will be said to be median-unbiased, if, for fixed θ, the median of the distribution of the estimate is at the value θ; i.e., the estimate underestimates just as often as it overestimates. This requirement seems for most purposes to accomplish as much as the mean-unbiased requirement and has the additional property that it is invariant under one-to-one transformation.

Further properties of median-unbiased estimators have been noted by Lehmann, Birnbaum, van der Vaart and Pfanzagl. In particular, median-unbiased estimators exist in cases where mean-unbiased and maximum-likelihood estimators do not exist. They are invariant under one-to-one transformations.

There are methods of constructing median-unbiased estimators for probability distributions that have monotone likelihood functions, such as one-parameter exponential families, to ensure that they are optimal (in a sense analogous to the minimum-variance property considered for mean-unbiased estimators). [5] [6] One such procedure is an analogue of the Rao–Blackwell procedure for mean-unbiased estimators: the procedure holds for a smaller class of probability distributions than does the Rao–Blackwell procedure for mean-unbiased estimation but for a larger class of loss functions. [7]

## Bias with respect to other loss functions

Any minimum-variance mean-unbiased estimator minimizes the risk (expected loss) with respect to the squared-error loss function (among mean-unbiased estimators), as observed by Gauss. [8] A minimum-average absolute deviation median-unbiased estimator minimizes the risk with respect to the absolute loss function (among median-unbiased estimators), as observed by Laplace. [9] [10] Other loss functions are used in statistics, particularly in robust statistics. [11] [12]

## Effect of transformations

As stated above, for univariate parameters, median-unbiased estimators remain median-unbiased under transformations that preserve order (or reverse order).

Note that, when a transformation is applied to a mean-unbiased estimator, the result need not be a mean-unbiased estimator of its corresponding population statistic. By Jensen's inequality, a convex function as transformation will introduce positive bias, while a concave function will introduce negative bias, and a function of mixed convexity may introduce bias in either direction, depending on the specific function and distribution. That is, for a non-linear function f and a mean-unbiased estimator U of a parameter p, the composite estimator f(U) need not be a mean-unbiased estimator of f(p). For example, the square root of the unbiased estimator of the population variance is not a mean-unbiased estimator of the population standard deviation: the square root of the unbiased sample variance, the corrected sample standard deviation, is biased. The bias depends both on the sampling distribution of the estimator and on the transform, and can be quite involved to calculate – see unbiased estimation of standard deviation for a discussion in this case.
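
A simulation sketch of that last example (normal data with arbitrary parameters) shows the corrected sample standard deviation systematically underestimating σ, as Jensen's inequality predicts for the concave square root:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, n = 2.0, 5

x = rng.normal(0.0, sigma, size=(400_000, n))
s = np.sqrt(x.var(axis=1, ddof=1))  # square root of the unbiased sample variance

# Although E[s^2] = sigma^2 exactly, E[s] < sigma (negative bias).
print("E[s] ~", s.mean(), "vs sigma =", sigma)
```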

## Bias, variance and mean squared error

While bias quantifies the average difference to be expected between an estimator and an underlying parameter, an estimator based on a finite sample can additionally be expected to differ from the parameter due to the randomness in the sample.

One measure which is used to try to reflect both types of difference is the mean square error,

${\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {E} {\big [}({\hat {\theta }}-\theta )^{2}{\big ]}.}$

This can be shown to be equal to the square of the bias, plus the variance:

{\displaystyle {\begin{aligned}\operatorname {MSE} ({\hat {\theta }})=&(\operatorname {E} [{\hat {\theta }}]-\theta )^{2}+\operatorname {E} [\,({\hat {\theta }}-\operatorname {E} [\,{\hat {\theta }}\,])^{2}\,]\\=&(\operatorname {Bias} ({\hat {\theta }},\theta ))^{2}+\operatorname {Var} ({\hat {\theta }})\end{aligned}}}

When the parameter is a vector, an analogous decomposition applies: [13]

${\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {trace} (\operatorname {Var} ({\hat {\theta }}))+\left\Vert \operatorname {Bias} ({\hat {\theta }},\theta )\right\Vert ^{2}}$

where

${\displaystyle \operatorname {trace} (\operatorname {Var} ({\hat {\theta }}))}$

is the trace of the covariance matrix of the estimator.

An estimator that minimises the bias will not necessarily minimise the mean square error.
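
The decomposition is easy to verify numerically. In this sketch, the shrunken sample mean 0.9·X̄ is a made-up biased estimator used only for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n = 3.0, 20

# A deliberately biased estimator of the mean: shrink the sample mean by 10%.
est = 0.9 * rng.normal(theta, 1.0, size=(300_000, n)).mean(axis=1)

mse = ((est - theta) ** 2).mean()
bias_sq_plus_var = (est.mean() - theta) ** 2 + est.var()
print(mse, bias_sq_plus_var)  # the two agree up to Monte Carlo error
```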

### Example: Estimation of population variance

For example,[14] suppose an estimator of the form

${\displaystyle T^{2}=c\sum _{i=1}^{n}\left(X_{i}-{\overline {X}}\,\right)^{2}=cnS^{2}}$

is sought for the population variance as above, but this time to minimise the MSE:

{\displaystyle {\begin{aligned}\operatorname {MSE} =&\operatorname {E} \left[(T^{2}-\sigma ^{2})^{2}\right]\\=&\left(\operatorname {E} \left[T^{2}-\sigma ^{2}\right]\right)^{2}+\operatorname {Var} (T^{2})\end{aligned}}}

If the variables X1 ... Xn follow a normal distribution, then nS2/σ2 has a chi-squared distribution with n − 1 degrees of freedom, giving:

${\displaystyle \operatorname {E} [nS^{2}]=(n-1)\sigma ^{2}{\text{ and }}\operatorname {Var} (nS^{2})=2(n-1)\sigma ^{4}.}$

and so

${\displaystyle \operatorname {MSE} =(c(n-1)-1)^{2}\sigma ^{4}+2c^{2}(n-1)\sigma ^{4}}$

With a little algebra it can be confirmed that it is c = 1/(n + 1) which minimises this combined loss function, rather than c = 1/(n − 1) which minimises just the bias term.
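
The algebra, spelled out: setting the derivative of the MSE with respect to c to zero,

${\displaystyle {\frac {d}{dc}}\operatorname {MSE} =2{\big (}c(n-1)-1{\big )}(n-1)\sigma ^{4}+4c(n-1)\sigma ^{4}=0\quad \Rightarrow \quad c(n-1)-1+2c=0\quad \Rightarrow \quad c={\frac {1}{n+1}}.}$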

More generally it is only in restricted classes of problems that there will be an estimator that minimises the MSE independently of the parameter values.

However, it is very common in practice for there to be a bias–variance tradeoff, such that a small increase in bias can be traded for a larger decrease in variance, resulting in a more desirable estimator overall.

## Bayesian view

Most Bayesians are rather unconcerned about unbiasedness (at least in the formal sampling-theory sense above) of their estimates. For example, Gelman et al. (1995) write: "From a Bayesian perspective, the principle of unbiasedness is reasonable in the limit of large samples, but otherwise it is potentially misleading." [15]

Fundamentally, the difference between the Bayesian approach and the sampling-theory approach above is that in the sampling-theory approach the parameter is taken as fixed, and then probability distributions of a statistic are considered, based on the predicted sampling distribution of the data. For a Bayesian, however, it is the data which are known, and fixed, and it is the unknown parameter for which an attempt is made to construct a probability distribution, using Bayes' theorem:

${\displaystyle p(\theta \mid D,I)\propto p(\theta \mid I)p(D\mid \theta ,I)}$

Here the second term, the likelihood of the data given the unknown parameter value θ, depends just on the data obtained and the modelling of the data generation process. However, a Bayesian calculation also includes the first term, the prior probability for θ, which takes account of everything the analyst may know or suspect about θ before the data comes in. This information plays no part in the sampling-theory approach; indeed any attempt to include it would be considered "bias" away from what was pointed to purely by the data. To the extent that Bayesian calculations include prior information, it is therefore essentially inevitable that their results will not be "unbiased" in sampling theory terms.

But the results of a Bayesian approach can differ from the sampling theory approach even if the Bayesian tries to adopt an "uninformative" prior.

For example, consider again the estimation of an unknown population variance σ2 of a Normal distribution with unknown mean, where it is desired to optimise c in the expected loss function

${\displaystyle \operatorname {ExpectedLoss} =\operatorname {E} \left[\left(cnS^{2}-\sigma ^{2}\right)^{2}\right]=\operatorname {E} \left[\sigma ^{4}\left(cn{\tfrac {S^{2}}{\sigma ^{2}}}-1\right)^{2}\right]}$

A standard choice of uninformative prior for this problem is the Jeffreys prior, ${\displaystyle \scriptstyle {p(\sigma ^{2})\;\propto \;1/\sigma ^{2}}}$, which is equivalent to adopting a rescaling-invariant flat prior for ln(σ2).

One consequence of adopting this prior is that S2/σ2 remains a pivotal quantity, i.e. the probability distribution of S2/σ2 does not depend on the value of S2 or σ2:

${\displaystyle p\left({\tfrac {S^{2}}{\sigma ^{2}}}\mid S^{2}\right)=p\left({\tfrac {S^{2}}{\sigma ^{2}}}\mid \sigma ^{2}\right)=g\left({\tfrac {S^{2}}{\sigma ^{2}}}\right)}$

However, while

${\displaystyle \operatorname {E} _{p(S^{2}\mid \sigma ^{2})}\left[\sigma ^{4}\left(cn{\tfrac {S^{2}}{\sigma ^{2}}}-1\right)^{2}\right]=\sigma ^{4}\operatorname {E} _{p(S^{2}\mid \sigma ^{2})}\left[\left(cn{\tfrac {S^{2}}{\sigma ^{2}}}-1\right)^{2}\right]}$

in contrast

${\displaystyle \operatorname {E} _{p(\sigma ^{2}\mid S^{2})}\left[\sigma ^{4}\left(cn{\tfrac {S^{2}}{\sigma ^{2}}}-1\right)^{2}\right]\neq \sigma ^{4}\operatorname {E} _{p(\sigma ^{2}\mid S^{2})}\left[\left(cn{\tfrac {S^{2}}{\sigma ^{2}}}-1\right)^{2}\right]}$

— when the expectation is taken over the probability distribution of σ2 given S2, as it is in the Bayesian case, rather than S2 given σ2, one can no longer take σ4 as a constant and factor it out. The consequence of this is that, compared to the sampling-theory calculation, the Bayesian calculation puts more weight on larger values of σ2, properly taking into account (as the sampling-theory calculation cannot) that under this squared-loss function the consequence of underestimating large values of σ2 is more costly in squared-loss terms than that of overestimating small values of σ2.

The worked-out Bayesian calculation gives a scaled inverse chi-squared distribution with n − 1 degrees of freedom for the posterior probability distribution of σ2. The expected loss is minimised when cnS2 equals the posterior expected value of σ2; this occurs when c = 1/(n − 3).
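
A sketch of that minimisation: the posterior mean of σ2 under this scaled inverse chi-squared posterior is nS2/(n − 3) (a standard moment result, quoted here without derivation), and the quadratic expected loss is minimised by setting cnS2 equal to that posterior mean:

${\displaystyle \operatorname {E} [\sigma ^{2}\mid S^{2}]={\frac {nS^{2}}{n-3}},\qquad cnS^{2}=\operatorname {E} [\sigma ^{2}\mid S^{2}]\quad \Rightarrow \quad c={\frac {1}{n-3}}.}$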

Even with an uninformative prior, therefore, a Bayesian calculation may not give the same expected-loss minimising result as the corresponding sampling-theory calculation.

## Notes

1. Richard Arnold Johnson; Dean W. Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall. ISBN 978-0-13-187715-3. Retrieved 10 August 2012.
2. J. P. Romano and A. F. Siegel (1986) Counterexamples in Probability and Statistics, Wadsworth & Brooks/Cole, Monterey, California, USA, p. 168
3. Hardy, M. (1 March 2003). "An Illuminating Counterexample". American Mathematical Monthly. 110 (3): 234–238. doi:10.2307/3647938. ISSN 0002-9890. JSTOR 3647938.
4. Brown (1947), page 583
5. Pfanzagl, Johann. "On optimal median unbiased estimators in the presence of nuisance parameters." The Annals of Statistics (1979): 187–193.
6. Brown, L. D.; Cohen, Arthur; Strawderman, W. E. "A Complete Class Theorem for Strict Monotone Likelihood Ratio With Applications". Ann. Statist. 4 (1976), no. 4, 712–722. doi:10.1214/aos/1176343543. http://projecteuclid.org/euclid.aos/1176343543.
7. Page 713: Brown, L. D.; Cohen, Arthur; Strawderman, W. E. "A Complete Class Theorem for Strict Monotone Likelihood Ratio With Applications". Ann. Statist. 4 (1976), no. 4, 712–722. doi:10.1214/aos/1176343543. http://projecteuclid.org/euclid.aos/1176343543.
8. Dodge, Yadolah, ed. (1987). Statistical data analysis based on the L1-norm and related methods: Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987. Amsterdam: North-Holland Publishing Co.
9. Dodge, Yadolah, ed. (1987). Statistical data analysis based on the L1-norm and related methods: Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987. Amsterdam: North-Holland Publishing Co.
10. Jaynes, E. T. (2007). Probability Theory: The Logic of Science (5th printing). Cambridge: Cambridge University Press. p. 172. ISBN 978-0-521-59271-0.
11. Dodge, Yadolah, ed. (1987). Statistical data analysis based on the L1-norm and related methods: Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987. Amsterdam: North-Holland Publishing Co.
12. Klebanov, Lev B.; Rachev, Svetlozar T.; Fabozzi, Frank J. (2009). Robust and Non-Robust Models in Statistics, Chapter 3. New York: Nova Scientific Publishers, Inc.
13. Taboga, Marco (2010). "Lectures on probability theory and mathematical statistics".
14. Morris H. DeGroot (1986), Probability and Statistics (2nd edition), Addison-Wesley. ISBN 0-201-11366-X. Pp. 414–415. But compare it with, for example, the discussion in Casella and Berger (2001), Statistical Inference (2nd edition), Duxbury. ISBN 0534243126. P. 332.
15. A. Gelman et al. (1995), Bayesian Data Analysis, Chapman and Hall. ISBN 0-412-03991-5. p. 108.

## References

• Brown, George W. "On Small-Sample Estimation." The Annals of Mathematical Statistics, vol. 18, no. 4 (Dec., 1947), pp. 582–585. JSTOR   2236236.
• Lehmann, E. L. "A General Concept of Unbiasedness" The Annals of Mathematical Statistics, vol. 22, no. 4 (Dec., 1951), pp. 587–592. JSTOR   2236928.
• Allan Birnbaum, 1961. "A Unified Theory of Estimation, I", The Annals of Mathematical Statistics, vol. 32, no. 1 (Mar., 1961), pp. 112–135.
• Van der Vaart, H. R., 1961. "Some Extensions of the Idea of Bias" The Annals of Mathematical Statistics, vol. 32, no. 2 (June 1961), pp. 436–447.
• Pfanzagl, Johann. 1994. Parametric Statistical Theory. Walter de Gruyter.
• Stuart, Alan; Ord, Keith; Arnold, Steven [F.] (2010). Classical Inference and the Linear Model. Kendall's Advanced Theory of Statistics. 2A. Wiley. ISBN 0-4706-8924-2.
• Voinov, Vassily [G.]; Nikulin, Mikhail [S.] (1993). Unbiased estimators and their applications. 1: Univariate case. Dordrecht: Kluwer Academic Publishers. ISBN 0-7923-2382-3.
• Voinov, Vassily [G.]; Nikulin, Mikhail [S.] (1996). Unbiased estimators and their applications. 2: Multivariate case. Dordrecht: Kluwer Academic Publishers. ISBN 0-7923-3939-8.
• Klebanov, Lev [B.]; Rachev, Svetlozar [T.]; Fabozzi, Frank [J.] (2009). Robust and Non-Robust Models in Statistics. New York: Nova Scientific Publishers. ISBN   978-1-60741-768-2.
• Hazewinkel, Michiel, ed. (2001) [1994], "Unbiased estimator", Encyclopedia of Mathematics , Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN   978-1-55608-010-4