Inverse-variance weighting

Last updated May 15, 2024

In statistics, inverse-variance weighting is a method of aggregating two or more random variables to minimize the variance of the weighted average. Each random variable is weighted in inverse proportion to its variance (i.e., proportional to its precision).

Context

Suppose an experimenter wishes to measure the value of a quantity, say the acceleration due to gravity of Earth, whose true value happens to be $\mu$. A careful experimenter makes multiple measurements, which we denote with $n$ random variables $X_{1},X_{2},...,X_{n}$. If they are all noisy but unbiased, i.e., the measuring device does not systematically overestimate or underestimate the true value and the errors are scattered symmetrically, then the expectation value is $E[X_{i}]=\mu$ for all $i$. The scatter in the measurement is then characterised by the variance of the random variables, $Var(X_{i}):=\sigma _{i}^{2}$, and if the measurements are performed under identical conditions, then all the $\sigma _{i}$ are the same, a common value we denote by $\sigma$. Given the $n$ measurements, a typical estimator for $\mu$, denoted as ${\hat {\mu }}$, is the simple average ${\overline {X}}={\frac {1}{n}}\sum _{i}X_{i}$. Note that this empirical average is also a random variable: its expectation value $E[{\overline {X}}]$ is $\mu$, but it also has a scatter. If the individual measurements are uncorrelated, the variance of the estimate is $Var({\overline {X}})={\frac {1}{n^{2}}}\sum _{i}\sigma _{i}^{2}$, which equals $\left({\frac {\sigma }{\sqrt {n}}}\right)^{2}$ when all the $\sigma _{i}$ are equal. In that case the error in the estimate decreases with increasing $n$ as $1/{\sqrt {n}}$, so additional observations improve the estimate.
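
The $\sigma ^{2}/n$ scaling of the simple average can be checked with a small simulation. The following is a minimal sketch, not part of the original argument: it assumes normally distributed errors (the result only needs uncorrelated errors), and the function names and numerical values are illustrative choices.

```python
import random


def variance_of_simple_average(sigma, n):
    """Theoretical variance of the mean of n uncorrelated measurements."""
    return sigma ** 2 / n


def simulated_variance_of_simple_average(mu, sigma, n, trials, seed=0):
    """Monte Carlo estimate of Var(X-bar) from repeated simulated experiments."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # One simulated experiment: n noisy, unbiased measurements of mu.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    grand_mean = sum(means) / trials
    # Unbiased sample variance of the simulated averages.
    return sum((m - grand_mean) ** 2 for m in means) / (trials - 1)


# Assumed illustrative values: mu = 9.81, sigma = 0.5, n = 25 measurements.
theory = variance_of_simple_average(0.5, 25)  # sigma^2 / n
simulated = simulated_variance_of_simple_average(9.81, 0.5, 25, trials=5000)
print(theory, simulated)
```

The two printed numbers should agree to within the sampling noise of the simulation, which shrinks as `trials` grows.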

If, instead of $n$ repeated measurements with one instrument, the experimenter makes $n$ measurements of the same quantity with $n$ different instruments of varying quality, then there is no reason to expect the different $\sigma _{i}$ to be the same. Some instruments could be noisier than others. In the example of measuring the acceleration due to gravity, the different "instruments" could be measuring $g$ from a simple pendulum, from analysing a projectile motion, etc. The simple average is no longer an optimal estimator, since the error in ${\overline {X}}$ might actually exceed the error in the least noisy measurement if different measurements have very different errors. Instead of discarding the noisy measurements that increase the final error, the experimenter can combine all the measurements with appropriate weights so as to give more importance to the least noisy measurements and less to the noisiest. Given knowledge of $\sigma _{1}^{2},\sigma _{2}^{2},...,\sigma _{n}^{2}$, an optimal estimator of $\mu$ is the weighted mean of the measurements ${\hat {\mu }}={\frac {\sum _{i}w_{i}X_{i}}{\sum _{i}w_{i}}}$ with the particular choice of weights $w_{i}=1/\sigma _{i}^{2}$. For uncorrelated measurements, the variance of the estimator is $Var({\hat {\mu }})={\frac {\sum _{i}w_{i}^{2}\sigma _{i}^{2}}{\left(\sum _{i}w_{i}\right)^{2}}}$, which for the optimal choice of the weights becomes $Var({\hat {\mu }}_{\text{opt}})=\left(\sum _{i}\sigma _{i}^{-2}\right)^{-1}.$
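
As a minimal sketch of the estimator described above, the function below computes the weighted mean with $w_{i}=1/\sigma _{i}^{2}$ and its variance $\left(\sum _{i}\sigma _{i}^{-2}\right)^{-1}$. The function name and the measurement values are hypothetical, and the variances are assumed to be known exactly.

```python
def inverse_variance_weighted_mean(values, variances):
    """Return (estimate, variance) of the inverse-variance weighted mean."""
    if len(values) != len(variances) or not values:
        raise ValueError("values and variances must be non-empty and of equal length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be strictly positive")
    weights = [1.0 / v for v in variances]  # w_i = 1 / sigma_i^2
    total = sum(weights)
    estimate = sum(w * x for w, x in zip(weights, values)) / total
    return estimate, 1.0 / total  # Var = (sum_i sigma_i^-2)^-1


# Hypothetical measurements of g (m/s^2) with assumed known variances.
g_values = [9.79, 9.83, 9.70]
g_variances = [0.01, 0.04, 0.25]
g_hat, g_var = inverse_variance_weighted_mean(g_values, g_variances)
print(g_hat, g_var)
```

Here the first measurement dominates the estimate because its weight (100) is much larger than the others (25 and 4), yet the noisier measurements still contribute.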

Note that for $n\geq 2$ the sum of the precisions exceeds the largest individual precision, so $Var({\hat {\mu }}_{\text{opt}})<\min _{j}\sigma _{j}^{2}$: the estimator has a smaller scatter than any individual measurement. Furthermore, the scatter in ${\hat {\mu }}_{\text{opt}}$ decreases with every additional measurement, however noisy that measurement may be.
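
These claims can be illustrated numerically. The sketch below compares the variance of the simple average, of the best single measurement, and of the inverse-variance weighted estimator, assuming uncorrelated measurements; the variances are made-up illustrative values.

```python
def simple_average_variance(variances):
    """Variance of the unweighted mean of uncorrelated measurements."""
    n = len(variances)
    return sum(variances) / n ** 2


def inverse_variance_weighted_variance(variances):
    """Variance of the inverse-variance weighted mean."""
    return 1.0 / sum(1.0 / v for v in variances)


# Assumed variances of three instruments of very different quality.
variances = [0.01, 0.04, 1.0]

var_simple = simple_average_variance(variances)
var_best_single = min(variances)
var_ivw = inverse_variance_weighted_variance(variances)
# Adding a fourth, extremely noisy measurement still lowers the variance.
var_ivw_extra = inverse_variance_weighted_variance(variances + [100.0])

print(var_simple, var_best_single, var_ivw, var_ivw_extra)
```

With these values the simple average is worse than the best single instrument, while the weighted estimator is better than both and improves slightly again when the very noisy fourth measurement is included.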

Derivation

Consider a generic weighted sum $Y=\sum _{i}w_{i}X_{i}$, where the weights $w_{i}$ are normalised such that $\sum _{i}w_{i}=1$. If the $X_{i}$ are all independent, the variance of $Y$ is given by (see Bienaymé's identity)

$$Var(Y)=\sum _{i}w_{i}^{2}\sigma _{i}^{2}.$$

For optimality, we wish to minimise $Var(Y)$, which can be done by setting its gradient with respect to the weights to zero while maintaining the constraint that $\sum _{i}w_{i}=1$. Using a Lagrange multiplier $w_{0}$ to enforce the constraint, we form the Lagrangian:

$$\mathcal{L}=\sum _{i}w_{i}^{2}\sigma _{i}^{2}-w_{0}\left(\sum _{i}w_{i}-1\right).$$

For each $k=1,\dots ,n$,

$$0={\frac {\partial \mathcal{L}}{\partial w_{k}}}=2w_{k}\sigma _{k}^{2}-w_{0},$$

which implies that:

$$w_{k}={\frac {w_{0}/2}{\sigma _{k}^{2}}}.$$

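The main takeaway here is that $w_{k}\propto 1/\sigma _{k}^{2}$. Since $\sum _{i}w_{i}=1$, summing the previous expression over $k$ gives ${\frac {w_{0}}{2}}\sum _{i}{\frac {1}{\sigma _{i}^{2}}}=1$, so that

$${\frac {2}{w_{0}}}=\sum _{i}{\frac {1}{\sigma _{i}^{2}}}:={\frac {1}{\sigma _{0}^{2}}}.$$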

The individual normalised weights are:

$$w_{k}={\frac {1}{\sigma _{k}^{2}}}\left(\sum _{i}{\frac {1}{\sigma _{i}^{2}}}\right)^{-1}.$$

This stationary point is a minimum: the variance is a quadratic function of the weights whose Hessian is diagonal with entries $2\sigma _{i}^{2}>0$, so the second partial derivative test is satisfied. The minimum variance of the estimator is then given by:

$$Var(Y)=\sum _{i}{\frac {\sigma _{0}^{4}}{\sigma _{i}^{4}}}\sigma _{i}^{2}=\sigma _{0}^{4}\sum _{i}{\frac {1}{\sigma _{i}^{2}}}=\sigma _{0}^{4}{\frac {1}{\sigma _{0}^{2}}}=\sigma _{0}^{2}={\frac {1}{\sum _{i}1/\sigma _{i}^{2}}}.$$
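
The result of the derivation can be sanity-checked numerically. The sketch below computes the normalised weights, confirms that $2w_{k}\sigma _{k}^{2}$ takes the same value for every $k$ (the stationarity condition), and verifies that shifting a small amount of weight between any two measurements, which preserves $\sum _{i}w_{i}=1$, only increases the variance. The function names and the variances are illustrative assumptions.

```python
from itertools import permutations


def optimal_weights(variances):
    """Normalised inverse-variance weights, summing to 1."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    return [p / total for p in precisions]


def weighted_variance(weights, variances):
    """Var(Y) = sum_i w_i^2 sigma_i^2 for independent X_i."""
    return sum(w * w * v for w, v in zip(weights, variances))


variances = [0.5, 2.0, 4.0]  # assumed sigma_i^2
w_opt = optimal_weights(variances)
var_opt = weighted_variance(w_opt, variances)

# Stationarity: 2 * w_k * sigma_k^2 equals the same multiplier w_0 for every k.
multipliers = [2.0 * w * v for w, v in zip(w_opt, variances)]

# Move a small amount of weight from index j to index i, keeping the sum at 1.
eps = 1e-3
perturbed_variances = []
for i, j in permutations(range(len(variances)), 2):
    w = list(w_opt)
    w[i] += eps
    w[j] -= eps
    perturbed_variances.append(weighted_variance(w, variances))

print(var_opt, min(perturbed_variances))
```

Because the linear term vanishes at the optimum, each perturbation raises the variance by exactly `eps**2 * (variances[i] + variances[j])`, up to rounding.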

Normal distributions

For normally distributed random variables, inverse-variance weighted averages can also be derived as the maximum likelihood estimate for the true value. Furthermore, from a Bayesian perspective, the posterior distribution for the true value given normally distributed observations $y_{i}$ and a flat prior is a normal distribution with the inverse-variance weighted average as its mean and the minimum variance $Var(Y)=\sigma _{0}^{2}$ as its variance.
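
Both statements can be checked with a brute-force calculation. The sketch below evaluates the Gaussian likelihood on a grid of candidate values of the true mean, treats it as an unnormalised posterior under a flat prior, and compares the resulting mean, variance and mode with the inverse-variance weighted formulas. The observations, variances and grid bounds are hypothetical, and the variances are assumed known.

```python
import math


def grid_posterior_summary(observations, variances, lo, hi, step):
    """Posterior mean, variance and mode of mu on a grid, assuming a flat prior."""
    count = int(round((hi - lo) / step)) + 1
    grid = [lo + k * step for k in range(count)]
    # Gaussian log-likelihood, up to an additive constant.
    log_like = [
        -0.5 * sum((y - mu) ** 2 / v for y, v in zip(observations, variances))
        for mu in grid
    ]
    peak = max(log_like)
    density = [math.exp(ll - peak) for ll in log_like]  # unnormalised posterior
    norm = sum(density)
    mean = sum(mu * d for mu, d in zip(grid, density)) / norm
    var = sum((mu - mean) ** 2 * d for mu, d in zip(grid, density)) / norm
    mode = grid[log_like.index(peak)]  # grid maximum likelihood estimate
    return mean, var, mode


# Hypothetical observations with assumed known variances.
observations = [1.0, 2.0, 4.0]
variances = [1.0, 4.0, 0.25]
post_mean, post_var, mle = grid_posterior_summary(
    observations, variances, lo=-5.0, hi=10.0, step=0.001
)

# Closed-form inverse-variance weighted mean and variance for comparison.
precisions = [1.0 / v for v in variances]
ivw_mean = sum(p * y for p, y in zip(precisions, observations)) / sum(precisions)
ivw_var = 1.0 / sum(precisions)

print(post_mean, ivw_mean, post_var, ivw_var, mle)
```

The grid mode can differ from the closed-form value by at most the grid spacing, while the posterior mean and variance agree with it far more closely.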

Multivariate case

For multivariate distributions, an equivalent argument leads to an optimal weighting based on the covariance matrices $\mathbf {C} _{i}$ of the individual vector-valued estimates $\mathbf {x} _{i}$:

$$\mathbf {\hat {x}} =\left(\sum _{i}\mathbf {C} _{i}^{-1}\right)^{-1}\sum _{i}\mathbf {C} _{i}^{-1}\mathbf {x} _{i}$$

$$\mathbf {\hat {C}} =\left(\sum _{i}\mathbf {C} _{i}^{-1}\right)^{-1}$$

For multivariate distributions the term "precision-weighted" average is more commonly used.
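
A minimal sketch of the precision-weighted average using NumPy is shown below. It assumes every covariance matrix is symmetric positive definite (hence invertible); the function name and the example estimates are hypothetical.

```python
import numpy as np


def precision_weighted_mean(estimates, covariances):
    """Combine vector estimates x_i with covariances C_i.

    Returns (x_hat, C_hat), where C_hat = (sum_i C_i^-1)^-1 and
    x_hat = C_hat @ sum_i C_i^-1 x_i.
    """
    precisions = [np.linalg.inv(np.asarray(c, dtype=float)) for c in covariances]
    total_precision = np.sum(precisions, axis=0)
    weighted_sum = np.sum(
        [p @ np.asarray(x, dtype=float) for p, x in zip(precisions, estimates)],
        axis=0,
    )
    # Solve the linear system rather than multiplying by an explicit inverse.
    x_hat = np.linalg.solve(total_precision, weighted_sum)
    c_hat = np.linalg.inv(total_precision)
    return x_hat, c_hat


# Two hypothetical 2-D estimates, each precise along a different axis.
x1, c1 = [1.0, 2.0], [[1.0, 0.0], [0.0, 4.0]]
x2, c2 = [3.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]
x_hat, c_hat = precision_weighted_mean([x1, x2], [c1, c2])
print(x_hat)
print(c_hat)
```

With diagonal covariances, as in this example, each component reduces to the scalar inverse-variance weighted mean; off-diagonal terms additionally let correlated components of one estimate inform each other.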

References

Joachim Hartung; Guido Knapp; Bimal K. Sinha (2008). Statistical meta-analysis with applications. John Wiley & Sons. ISBN 978-0-470-29089-7.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
