Inverse probability weighting

Inverse probability weighting is a statistical technique for estimating quantities related to a population other than the one from which the data were collected. Study designs in which the sampled population differs from the target population of inference (the target population) are common in application. [1] There may be prohibitive factors barring researchers from directly sampling from the target population, such as cost, time, or ethical concerns. [2] A solution to this problem is to use an alternate design strategy, e.g. stratified sampling. Weighting, when correctly applied, can potentially improve the efficiency and reduce the bias of unweighted estimators.

One very early weighted estimator is the Horvitz–Thompson estimator of the mean. [3] When the probability with which each observation was sampled from the target population is known, the inverse of this probability is used to weight the observation. This approach has been generalized to many aspects of statistics under various frameworks. In particular, there are weighted likelihoods, weighted estimating equations, and weighted probability densities from which a majority of statistics are derived. These applications codified the theory of other statistics and estimators such as marginal structural models, the standardized mortality ratio, and the EM algorithm for coarsened or aggregate data.
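
For illustration, the following is a minimal Python sketch of such weighting when the inclusion probabilities are known by design; the data and probabilities are hypothetical, and a Hájek-type normalization is used for the mean because the population size is not assumed known.

```python
import numpy as np

# Hypothetical sample: observed values together with the known probability
# with which each unit was included in the sample (unequal by design,
# e.g. units from a rare stratum sampled with lower probability).
y = np.array([12.0, 15.0, 9.0, 22.0, 18.0])
inclusion_prob = np.array([0.8, 0.8, 0.2, 0.2, 0.2])

# Inverse probability weights: under-sampled units count for more.
weights = 1.0 / inclusion_prob

# Horvitz-Thompson estimate of the population total, and a Hajek-type
# weighted estimate of the population mean.
ht_total = np.sum(weights * y)
weighted_mean = np.sum(weights * y) / np.sum(weights)

print(ht_total, weighted_mean)
```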

Inverse probability weighting is also used to account for missing data when subjects with missing data cannot be included in the primary analysis. [4] With an estimate of the sampling probability, or the probability that the factor would be measured in another measurement, inverse probability weighting can be used to inflate the weight for subjects who are under-represented due to a large degree of missing data.
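
As a concrete illustration of this use, the sketch below weights complete cases by the inverse of an estimated probability of being observed; the simulated data, the logistic model for missingness, and all variable names are assumptions made purely for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Simulated covariate and outcome; the outcome is more likely to be missing
# when x is large, so the naive complete-case mean is biased.
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
p_observed = 1.0 / (1.0 + np.exp(-(0.5 - x)))   # true missingness mechanism
observed = rng.random(n) < p_observed

# Estimate the probability that the outcome is observed, given the covariate.
obs_model = LogisticRegression().fit(x.reshape(-1, 1), observed)
p_hat = obs_model.predict_proba(x.reshape(-1, 1))[:, 1]

# Inflate the weight of under-represented (often-missing) subjects.
w = 1.0 / p_hat[observed]
ipw_mean = np.sum(w * y[observed]) / np.sum(w)

print("complete-case mean:", y[observed].mean())
print("IPW-adjusted mean:", ipw_mean, "(true mean is 2.0)")
```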

Inverse Probability Weighted Estimator (IPWE)

The inverse probability weighting estimator can be used to demonstrate causality when the researcher cannot conduct a controlled experiment but has observed data to model. Because the treatment is assumed not to be randomly assigned, the goal is to estimate the counterfactual or potential outcome if all subjects in the population were assigned either treatment.

Suppose observed data are $\{(X_{i}, A_{i}, Y_{i})\}_{i=1}^{n}$ drawn i.i.d. (independent and identically distributed) from the unknown distribution P, where

  • $X \in \mathbb{R}^{p}$ are the covariates,
  • $A \in \{0, 1\}$ indicates which of the two possible treatments was received, and
  • $Y \in \mathbb{R}$ is the response.

The goal is to estimate the potential outcome, $Y^{*}(a)$, that would be observed if the subject were assigned treatment $a$. Then compare the mean outcome if all patients in the population were assigned either treatment: $\mu_{a} = E[Y^{*}(a)]$. We want to estimate $\mu_{a}$ using the observed data $\{(X_{i}, A_{i}, Y_{i})\}_{i=1}^{n}$.

Estimator Formula

$$\hat{\mu}^{IPWE}_{a,n} = \frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}\,\mathbf{1}_{A_{i}=a}}{\hat{p}_{n}(A_{i}\mid X_{i})}$$

Constructing the IPWE

  1. $\hat{\mu}^{IPWE}_{a,n} = \frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}\,\mathbf{1}_{A_{i}=a}}{\hat{p}_{n}(A_{i}\mid X_{i})}$, where $\hat{p}_{n}(a\mid x) = \frac{\hat{P}_{n}(A=a, X=x)}{\hat{P}_{n}(X=x)}$
  2. construct $\hat{p}_{n}(a\mid x)$ or $p_{n}(a\mid x)$ using any propensity model (often a logistic regression model)

With the mean of each treatment group computed, a statistical t-test or ANOVA test can be used to judge the difference between group means and determine the statistical significance of the treatment effect.
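
A minimal Python sketch of this construction on simulated data might look as follows; the logistic propensity model and the data-generating process (using the names X, A, Y as above) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 10000

# Simulated observational data: treatment A depends on covariate X (confounding),
# and the true effect of treatment on Y is 1.0.
X = rng.normal(size=(n, 1))
p_true = 1.0 / (1.0 + np.exp(-X[:, 0]))
A = (rng.random(n) < p_true).astype(int)
Y = 0.5 + 1.0 * A + 2.0 * X[:, 0] + rng.normal(size=n)

# Step 2: fit a propensity model for p(a | x) (here, logistic regression).
prop_model = LogisticRegression().fit(X, A)
p1 = prop_model.predict_proba(X)[:, 1]        # estimated P(A = 1 | X)
p_of_Ai = np.where(A == 1, p1, 1.0 - p1)      # estimated p(A_i | X_i)

# Step 1: inverse-probability-weighted mean outcome for each treatment arm.
mu_1 = np.mean((A == 1) * Y / p_of_Ai)
mu_0 = np.mean((A == 0) * Y / p_of_Ai)

print("IPWE mean under a=1:", mu_1)
print("IPWE mean under a=0:", mu_0)
print("IPWE effect estimate:", mu_1 - mu_0)   # should be near the true value 1.0
```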

Assumptions

Recall the joint probability model $(X, A, Y) \sim P$ for the covariate $X$, action $A$, and response $Y$. If $X$ and $A$ are known as $x$ and $a$, respectively, then the response $Y(X = x, A = a) = Y(x, a)$ has the distribution

$$Y(x, a) \sim \frac{P(x, a, \cdot)}{\int_{\mathcal{Y}} P(x, a, y)\, dy}.$$

We make the following assumptions.

  • (A1) Consistency: $Y = Y^{*}(A)$
  • (A2) No unmeasured confounders: $\{Y^{*}(0), Y^{*}(1)\} \perp A \mid X$. More formally, for each bounded and measurable functions $f$ and $g$,
    $$E[\,f(Y^{*}(a))\,g(A) \mid X\,] = E[\,f(Y^{*}(a)) \mid X\,]\; E[\,g(A) \mid X\,].$$
    This means that treatment assignment is based solely on covariate data and independent of potential outcomes.
  • (A3) Positivity: $P(A = a \mid X = x) = E[\mathbf{1}_{A=a} \mid X = x] > 0$ for all $a$ and $x$ (a simple empirical check of this condition is sketched below).
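
Assumption (A3) can fail in practice when some covariate patterns are almost never, or almost always, treated. The following sketch of a common empirical diagnostic inspects estimated propensity scores in each arm and flags values too close to 0 or 1; the threshold, the logistic model, and the function name are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def check_positivity(X, A, eps=0.01):
    """Flag estimated propensity scores outside [eps, 1 - eps].

    X: (n, p) covariate array; A: (n,) binary treatment indicator array.
    Returns the estimated scores and a boolean mask of problematic units.
    """
    p_hat = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
    violations = (p_hat < eps) | (p_hat > 1.0 - eps)
    for a in (0, 1):
        scores = p_hat[A == a]
        print(f"arm {a}: estimated propensities in [{scores.min():.3f}, {scores.max():.3f}]")
    print(f"{violations.sum()} of {len(A)} units fall outside [{eps}, {1.0 - eps}]")
    return p_hat, violations
```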

Formal derivation

Under the assumptions (A1)-(A3), we will derive the following identities:

$$\mu_{a} = E[Y^{*}(a)] = E\!\left[\frac{\mathbf{1}_{A=a}\, Y}{p(A \mid X)}\right]. \qquad (*)$$

The first equality follows from the definition of $\mu_{a}$ and (A1). For the second equality, first use the iterated expectation to write

$$E\!\left[\frac{\mathbf{1}_{A=a}\, Y^{*}(a)}{p(a \mid X)}\right] = E\left[\, E\left\{\frac{\mathbf{1}_{A=a}\, Y^{*}(a)}{p(a \mid X)} \,\Big|\, X \right\} \right].$$

By (A3), $p(a \mid X) > 0$ almost surely. Then using (A2), and noting that $E\{\mathbf{1}_{A=a} \mid X\} = p(a \mid X)$, we have

$$E\left\{\frac{\mathbf{1}_{A=a}\, Y^{*}(a)}{p(a \mid X)} \,\Big|\, X \right\} = \frac{E\{\mathbf{1}_{A=a} \mid X\}\, E\{Y^{*}(a) \mid X\}}{p(a \mid X)} = E\{Y^{*}(a) \mid X\}.$$

Hence, integrating out the last expression with respect to $X$ and noting that $\mathbf{1}_{A=a}\, Y^{*}(a) = \mathbf{1}_{A=a}\, Y$ almost surely by (A1), the second equality in $(*)$ follows.

Variance reduction

The inverse probability weighted estimator (IPWE) is known to be unstable if some estimated propensities are too close to 0 or 1. In such instances, the IPWE is dominated by a small number of subjects with large weights. Recently developed smoothed IPW estimators employing Rao–Blackwellization, however, reduce the variance of the IPWE by up to 7-fold and can also protect the augmented inverse probability weighted estimator from model misspecification. [5]

Augmented Inverse Probability Weighted Estimator (AIPWE)

An alternative estimator is the augmented inverse probability weighted estimator (AIPWE), which combines the properties of both the regression-based estimator and the inverse probability weighted estimator. It is therefore a 'doubly robust' method in that it only requires either the propensity or outcome model to be correctly specified, but not both. This method augments the IPWE to reduce variability and improve estimate efficiency. This model holds the same assumptions as the inverse probability weighted estimator (IPWE). [6]

Estimator Formula

With the following notations:

  1. $\mathbf{1}_{A_{i}=a}$ is an indicator function of whether subject i is part of treatment group a (or not).
  2. Construct a regression estimator $\hat{Q}_{n}(x, a)$ to predict the outcome $Y$ based on covariates $X$ and treatment $A$ for each subject i, for example using ordinary least squares regression.
  3. Construct a propensity (probability) estimate $\hat{p}_{n}(A_{i} \mid X_{i})$, for example using logistic regression.
  4. Combine them into the AIPWE to obtain (a code sketch follows this list)

$$\hat{\mu}^{AIPWE}_{a,n} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{Y_{i}\,\mathbf{1}_{A_{i}=a}}{\hat{p}_{n}(A_{i}\mid X_{i})} - \frac{\mathbf{1}_{A_{i}=a} - \hat{p}_{n}(A_{i}\mid X_{i})}{\hat{p}_{n}(A_{i}\mid X_{i})}\,\hat{Q}_{n}(X_{i}, a)\right),$$

which can be rearranged as

$$\hat{\mu}^{AIPWE}_{a,n} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{Q}_{n}(X_{i}, a) + \frac{\mathbf{1}_{A_{i}=a}}{\hat{p}_{n}(A_{i}\mid X_{i})}\,\big(Y_{i} - \hat{Q}_{n}(X_{i}, a)\big)\right).$$
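
Putting these steps together, the following Python sketch computes the AIPWE for each treatment arm on simulated data; the OLS outcome model, the logistic propensity model, and the data-generating process are illustrative assumptions mirroring steps 1-4 above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)
n = 10000

# Simulated observational data with confounding by X; the true effect of A is 1.0.
X = rng.normal(size=(n, 1))
p_true = 1.0 / (1.0 + np.exp(-X[:, 0]))
A = (rng.random(n) < p_true).astype(int)
Y = 0.5 + 1.0 * A + 2.0 * X[:, 0] + rng.normal(size=n)

def aipwe(X, A, Y, a):
    """AIPWE estimate of the mean outcome had everyone received treatment a."""
    # Step 3: propensity estimate p_hat(A_i | X_i) via logistic regression.
    p1 = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
    p_a = p1 if a == 1 else 1.0 - p1

    # Step 2: outcome regression Q_hat(x, a), fit by OLS on arm a and
    # evaluated for every subject.
    Q_a = LinearRegression().fit(X[A == a], Y[A == a]).predict(X)

    # Step 4: combine, using the rearranged form of the estimator.
    ind = (A == a).astype(float)
    return np.mean(Q_a + ind / p_a * (Y - Q_a))

mu1, mu0 = aipwe(X, A, Y, 1), aipwe(X, A, Y, 0)
print("AIPWE effect estimate:", mu1 - mu0)   # should be near the true value 1.0
```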

Interpretation and "double robustness"

The latter rearrangement of the formula helps reveal the underlying idea: our estimator is based on the average predicted outcome using the model $\hat{Q}$ (i.e.: $\frac{1}{n}\sum_{i=1}^{n}\hat{Q}_{n}(X_{i},a)$). However, if the $\hat{Q}$ model is biased, then its residuals will not be centered around 0 (in the full treatment group a). We can correct this potential bias by adding the extra term of the average residuals of the model ($\hat{Q}$) from the true value of the outcome ($Y$) (i.e.: $\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1}_{A_{i}=a}}{\hat{p}_{n}(A_{i}\mid X_{i})}\big(Y_{i} - \hat{Q}_{n}(X_{i},a)\big)$). Because values of $Y$ are unobserved outside treatment group a, we give weights that inflate the relative importance of each observed residual; these weights are based on the inverse propensity, i.e. the probability of observing each subject's outcome under treatment a (see page 10 in [7]).

The "doubly robust" benefit of such an estimator comes from the fact that it's sufficient for one of the two models to be correctly specified, for the estimator to be unbiased (either or , or both). This is because if the outcome model is well specified then its residuals will be around 0 (regardless of the weights each residual will get). While if the model is biased, but the weighting model is well specified, then the bias will be well estimated (And corrected for) by the weighted average residuals. [7] [8] [9]

The bias of the doubly robust estimators is called a second-order bias, and it depends on the product of the difference $\frac{1}{\hat{p}_{n}(a\mid x)} - \frac{1}{p(a\mid x)}$ and the difference $\hat{Q}_{n}(x,a) - Q(x,a)$. This property allows us, when the sample size is "large enough", to lower the overall bias of doubly robust estimators by using machine learning estimators (instead of parametric models). [10]

References

  1. Robins, JM; Rotnitzky, A; Zhao, LP (1994). "Estimation of regression coefficients when some regressors are not always observed". Journal of the American Statistical Association . 89 (427): 846–866. doi:10.1080/01621459.1994.10476818.
  2. Breslow, NE; Lumley, T; et al. (2009). "Using the Whole Cohort in the Analysis of Case-Cohort Data". Am J Epidemiol. 169 (11): 1398–1405. doi:10.1093/aje/kwp055. PMC   2768499 . PMID   19357328.
  3. Horvitz, D. G.; Thompson, D. J. (1952). "A generalization of sampling without replacement from a finite universe". Journal of the American Statistical Association . 47 (260): 663–685. doi:10.1080/01621459.1952.10483446.
  4. Hernan, MA; Robins, JM (2006). "Estimating Causal Effects From Epidemiological Data". J Epidemiol Community Health. 60 (7): 578–596. CiteSeerX   10.1.1.157.9366 . doi:10.1136/jech.2004.029496. PMC   2652882 . PMID   16790829.
  5. Liao, JG; Rohde, C (2022). "Variance reduction in the inverse probability weighted estimators for the average treatment effect using the propensity score". Biometrics. 78 (2): 660–667. doi:10.1111/biom.13454. PMID   33715153. S2CID   232232367.
  6. Cao, Weihua; Tsiatis, Anastasios A.; Davidian, Marie (2009). "Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data". Biometrika. 96 (3): 723–734. doi:10.1093/biomet/asp033. ISSN   0006-3444. PMC   2798744 . PMID   20161511.
  7. Kang, Joseph D. Y.; Schafer, Joseph L. (2007). "Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data". Statistical Science. 22 (4): 523–539.
  8. Kim, Jae Kwang; Haziza, David (2014). "Doubly robust inference with missing data in survey sampling". Statistica Sinica. 24 (1): 375–394.
  9. Seaman, Shaun R.; Vansteelandt, Stijn (2018). "Introduction to double robust methods for incomplete data". Statistical Science. 33 (2): 184.
  10. Hernán, Miguel A.; Robins, James M. Causal Inference: What If. Chapman & Hall/CRC. p. 170.