Inverse probability weighting

Inverse probability weighting is a statistical technique for estimating quantities related to a population other than the one from which the data was collected. Study designs in which the sampled population differs from the population of target inference (the target population) are common in applications.[1] There may be prohibitive factors barring researchers from directly sampling from the target population, such as cost, time, or ethical concerns.[2] A solution to this problem is to use an alternate design strategy, e.g. stratified sampling. Weighting, when correctly applied, can potentially improve the efficiency and reduce the bias of unweighted estimators.

One very early weighted estimator is the Horvitz–Thompson estimator of the mean.[3] When the probability with which each observation was sampled from the target population is known, the inverse of this probability is used to weight the observation. This approach has been generalized to many aspects of statistics under various frameworks. In particular, there are weighted likelihoods, weighted estimating equations, and weighted probability densities from which a majority of statistics are derived. These applications codified the theory of other statistics and estimators such as marginal structural models, the standardized mortality ratio, and the EM algorithm for coarsened or aggregate data.
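
In symbols (a sketch using illustrative notation that is not part of the cited paper): if unit $i$ is sampled with known inclusion probability $\pi_i$ from a target population of size $N$, the Horvitz–Thompson estimator of the population mean weights each sampled value $y_i$ by $1/\pi_i$:

$$\hat{\mu}_{HT} = \frac{1}{N} \sum_{i \in \text{sample}} \frac{y_i}{\pi_i}.$$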

Inverse probability weighting is also used to account for missing data when subjects with missing data cannot be included in the primary analysis. [4] With an estimate of the sampling probability, or the probability that the factor would be measured in another measurement, inverse probability weighting can be used to inflate the weight for subjects who are under-represented due to a large degree of missing data.
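
A minimal Python sketch of this use of inverse probability weighting (the DataFrame `df`, its column names, and the choice of a scikit-learn logistic regression for the probability of being observed are illustrative assumptions, not part of the method's definition):

```python
# Minimal sketch: inverse probability weighting for a partly missing outcome.
# Assumes a pandas DataFrame `df` with covariate columns x1, x2 and an
# outcome column y that is missing (NaN) for some subjects.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def ipw_mean_with_missing_outcome(df, covariates=("x1", "x2"), outcome="y"):
    observed = df[outcome].notna().astype(int)        # R_i = 1 if y_i was measured
    X = df[list(covariates)].to_numpy()

    # Model the probability that the outcome is observed, given covariates.
    obs_model = LogisticRegression().fit(X, observed)
    p_obs = obs_model.predict_proba(X)[:, 1]          # estimated P(R = 1 | X)

    # Weight each complete case by 1 / P(observed | X): subjects of a kind
    # that is often missing receive inflated weights.
    mask = observed.to_numpy() == 1
    weights = 1.0 / p_obs[mask]
    y_obs = df.loc[mask, outcome].to_numpy()
    return np.sum(weights * y_obs) / np.sum(weights)  # normalized weighted mean
```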

Inverse Probability Weighted Estimator (IPWE)

The inverse probability weighting estimator can be used to demonstrate causality when the researcher cannot conduct a controlled experiment but has observational data to model. Because the treatment is not assumed to be randomly assigned, the goal is to estimate the counterfactual or potential outcome if all subjects in the population were assigned either treatment.

Suppose observed data are $\{(X_i, A_i, Y_i)\}_{i=1}^{n}$ drawn i.i.d. (independent and identically distributed) from an unknown distribution P, where

  • $X \in \mathbb{R}^p$ are the covariates,
  • $A \in \{0, 1\}$ are the two possible treatments,
  • $Y \in \mathbb{R}$ is the response, and
  • treatment is not assumed to be randomly assigned.

The goal is to estimate the potential outcome, $Y^{*}(a)$, that would be observed if the subject were assigned treatment $a$, and then to compare the mean outcome if all patients in the population were assigned either treatment: $\mu_a = \operatorname{E}[Y^{*}(a)]$. We want to estimate $\mu_a$ using the observed data $\{(X_i, A_i, Y_i)\}_{i=1}^{n}$.

Estimator Formula

$$\hat{\mu}^{IPWE}_{a,n} = \frac{1}{n} \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{A_i = a}}{\hat{p}_n(A_i \mid X_i)}$$

Constructing the IPWE

  1. $\hat{\mu}^{IPWE}_{a,n} = \frac{1}{n} \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{A_i = a}}{\hat{p}_n(A_i \mid X_i)}$, where $\hat{p}_n(a \mid x)$ is an estimate of the propensity score $p(a \mid x) = P(A = a \mid X = x)$
  2. construct $\hat{p}_n(a \mid x)$ or $p(a \mid x)$ using any propensity model (often a logistic regression model)

With the mean of each treatment group computed, a statistical t-test or ANOVA test can be used to test for a difference between the group means and determine the statistical significance of the treatment effect.
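
A minimal Python sketch of this construction (the function name, the array layout, and the use of a scikit-learn logistic regression as the propensity model are illustrative assumptions):

```python
# Minimal sketch of the IPWE for both treatment arms.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_estimates(X, A, Y):
    """Return IPW estimates of E[Y*(0)] and E[Y*(1)].

    X : (n, p) array of covariates
    A : (n,) array of 0/1 treatment indicators
    Y : (n,) array of observed outcomes
    """
    # Step 2 of the construction: fit a propensity model p_hat(A = 1 | X).
    propensity_1 = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]

    estimates = {}
    for a in (0, 1):
        p_a = propensity_1 if a == 1 else 1.0 - propensity_1  # p_hat(A_i = a | X_i)
        indicator = (A == a).astype(float)                    # 1_{A_i = a}
        estimates[a] = np.mean(indicator * Y / p_a)           # step 1: weighted average
    return estimates[0], estimates[1]
```

The two returned group means can then be compared as described above.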

Assumptions

Recall the joint probability model $(X, A, Y) \sim P$ for the covariate $X$, action $A$, and response $Y$. If $X$ and $A$ are known as $x$ and $a$, respectively, then the response $Y(X = x, A = a) = Y(x, a)$ has the distribution

$$Y(x, a) \sim \frac{P(x, a, \cdot)}{\int P(x, a, y) \, dy}.$$

We make the following assumptions.

  • (A1) Consistency: $Y = Y^{*}(A)$
  • (A2) No unmeasured confounders: $\{Y^{*}(0), Y^{*}(1)\} \perp A \mid X$. More formally, for each bounded and measurable functions $f$ and $g$, $\operatorname{E}[f(Y^{*}(a))\, g(A) \mid X] = \operatorname{E}[f(Y^{*}(a)) \mid X]\, \operatorname{E}[g(A) \mid X]$. This means that treatment assignment is based solely on covariate data and is independent of potential outcomes.
  • (A3) Positivity: $P(A = a \mid X = x) > 0$ for all $a$ and $x$.

Formal derivation

Under the assumptions (A1)-(A3), we will derive the following identities:

$$\operatorname{E}\left[\frac{\mathbf{1}_{A = a}\, Y}{p(a \mid X)}\right] = \operatorname{E}\left[\frac{\mathbf{1}_{A = a}\, Y^{*}(a)}{p(a \mid X)}\right] = \operatorname{E}\left[Y^{*}(a)\right]. \qquad (*)$$

The first equality follows from the definition and (A1). For the second equality, first use the iterated expectation to write

$$\operatorname{E}\left[\frac{\mathbf{1}_{A = a}\, Y^{*}(a)}{p(a \mid X)}\right] = \operatorname{E}\left[\operatorname{E}\left[\frac{\mathbf{1}_{A = a}\, Y^{*}(a)}{p(a \mid X)} \,\middle|\, X\right]\right].$$

By (A3), $p(a \mid X) > 0$ almost surely. Then using (A2), note that

$$\operatorname{E}\left[\frac{\mathbf{1}_{A = a}\, Y^{*}(a)}{p(a \mid X)} \,\middle|\, X\right] = \frac{\operatorname{E}\left[\mathbf{1}_{A = a} \mid X\right] \operatorname{E}\left[Y^{*}(a) \mid X\right]}{p(a \mid X)}.$$

Hence, integrating out the last expression with respect to $X$ and noting that $\operatorname{E}\left[\mathbf{1}_{A = a} \mid X\right] = p(a \mid X)$ almost surely, the second equality in $(*)$ follows.

Variance reduction

The Inverse Probability Weighted Estimator (IPWE) is known to be unstable if some estimated propensities are too close to 0 or 1. In such instances, the IPWE can be dominated by a small number of subjects with large weights. To address this issue, a smoothed IPW estimator using Rao-Blackwellization has been proposed, which reduces the variance of IPWE by up to 7-fold and helps protect the estimator from model misspecification. [5]
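
A small numerical illustration of this instability (the numbers are invented purely for illustration):

```python
# One subject with an estimated propensity near zero receives a huge weight
# and can dominate the IPW sum.
import numpy as np

Y = np.array([1.0, 1.2, 0.9, 1.1, 5.0])    # observed outcomes in group a
p = np.array([0.5, 0.4, 0.6, 0.5, 0.01])   # estimated propensities p_hat(a | X_i)

weights = 1.0 / p                          # IPW weights
print(weights)                             # the last subject gets weight 100
print(np.sum(weights * Y) / len(Y))        # estimate dominated by that one subject
```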

Augmented Inverse Probability Weighted Estimator (AIPWE)

An alternative estimator is the augmented inverse probability weighted estimator (AIPWE), which combines the properties of the regression-based estimator and the inverse probability weighted estimator. It is therefore a 'doubly robust' method, in that it requires only that either the propensity model or the outcome model be correctly specified, not both. This method augments the IPWE to reduce variability and improve the efficiency of the estimate. It relies on the same assumptions as the Inverse Probability Weighted Estimator (IPWE).[6]

Estimator Formula

With the following notations:

  1. $\mathbf{1}_{A_i = a}$ is an indicator function equal to 1 if subject i is part of treatment group a (and 0 otherwise).
  2. Construct a regression estimator $\hat{Q}_n(x, a)$ to predict the outcome $Y$ based on covariates $X$ and treatment $A$ for subject i, for example using ordinary least squares regression.
  3. Construct a propensity (probability) estimate $\hat{p}_n(A_i \mid X_i)$, for example using logistic regression.
  4. Combine them into the AIPWE (a code sketch follows below) to obtain

$$\hat{\mu}^{AIPWE}_{a,n} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i \mathbf{1}_{A_i = a}}{\hat{p}_n(A_i \mid X_i)} - \frac{\mathbf{1}_{A_i = a} - \hat{p}_n(A_i \mid X_i)}{\hat{p}_n(A_i \mid X_i)}\, \hat{Q}_n(X_i, a) \right),$$

which can be rearranged as

$$\hat{\mu}^{AIPWE}_{a,n} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{Q}_n(X_i, a) + \frac{\mathbf{1}_{A_i = a}}{\hat{p}_n(A_i \mid X_i)} \left( Y_i - \hat{Q}_n(X_i, a) \right) \right).$$
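
A minimal Python sketch of this construction (the function name, the choice of ordinary least squares and logistic regression fits, and the array layout are illustrative assumptions rather than the article's prescription):

```python
# Minimal sketch of the AIPWE: an outcome regression augmented by an
# inverse-probability-weighted residual correction.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_estimate(X, A, Y, a=1):
    """AIPW estimate of E[Y*(a)] from covariates X, treatments A, outcomes Y."""
    indicator = (A == a).astype(float)                 # 1_{A_i = a}

    # Step 2: outcome regression Q_hat(x, a), here fit within arm a via OLS.
    outcome_model = LinearRegression().fit(X[A == a], Y[A == a])
    Q_hat = outcome_model.predict(X)                   # predicted outcome under treatment a

    # Step 3: propensity estimate p_hat(A = a | X) via logistic regression.
    p1 = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
    p_a = p1 if a == 1 else 1.0 - p1

    # Step 4: combine, using the rearranged form
    # (model prediction + inverse-probability-weighted residual).
    return np.mean(Q_hat + indicator / p_a * (Y - Q_hat))
```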

Interpretation and "double robustness"

The latter rearrangement of the formula helps reveal the underlying idea: our estimator is based on the average predicted outcome using the model (i.e., $\frac{1}{n} \sum_{i=1}^{n} \hat{Q}_n(X_i, a)$). However, if the outcome model is biased, then its residuals will not be centered around 0 in the full treatment group a. We can correct this potential bias by adding the average residual of the model predictions (Q) from the true values of the outcome (Y) (i.e., $\frac{1}{n} \sum_{i=1}^{n} \frac{\mathbf{1}_{A_i = a}}{\hat{p}_n(A_i \mid X_i)} \left( Y_i - \hat{Q}_n(X_i, a) \right)$). Because we have missing values of Y, we give weights that inflate the relative importance of each residual; these weights are based on the inverse propensity, i.e. the probability, of observing each subject (see page 10 in [7]).

The "doubly robust" benefit of such an estimator comes from the fact that it's sufficient for one of the two models to be correctly specified, for the estimator to be unbiased (either or , or both). This is because if the outcome model is well specified then its residuals will be around 0 (regardless of the weights each residual will get). While if the model is biased, but the weighting model is well specified, then the bias will be well estimated (And corrected for) by the weighted average residuals. [7] [8] [9]

The bias of the doubly robust estimators is called a second-order bias, and it depends on the product of the difference $\hat{Q}_n(X, a) - Q(X, a)$ and the difference $\hat{p}_n(a \mid X) - p(a \mid X)$, where $Q(X, a)$ and $p(a \mid X)$ denote the true outcome regression and the true propensity. This property allows us, when the sample size is "large enough", to lower the overall bias of doubly robust estimators by using machine learning estimators (instead of parametric models).[10]
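
As a sketch of this property (written here in terms of the population limits $\bar{Q}$ and $\bar{p}$ of the fitted models; this is a standard argument and not a formula quoted from the cited reference), under (A1)-(A3) the bias of the population-level AIPW functional can be written as

$$\operatorname{E}\left[\frac{p(a \mid X) - \bar{p}(a \mid X)}{\bar{p}(a \mid X)} \left( Q(X, a) - \bar{Q}(X, a) \right)\right],$$

which is exactly zero if either $\bar{p} = p$ or $\bar{Q} = Q$, and which remains small whenever the product of the two estimation errors is small.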

References

  1. Robins, JM; Rotnitzky, A; Zhao, LP (1994). "Estimation of regression coefficients when some regressors are not always observed". Journal of the American Statistical Association. 89 (427): 846–866. doi:10.1080/01621459.1994.10476818.
  2. Breslow, NE; Lumley, T; et al. (2009). "Using the Whole Cohort in the Analysis of Case-Cohort Data". Am J Epidemiol. 169 (11): 1398–1405. doi:10.1093/aje/kwp055. PMC 2768499. PMID 19357328.
  3. Horvitz, D. G.; Thompson, D. J. (1952). "A generalization of sampling without replacement from a finite universe". Journal of the American Statistical Association. 47 (260): 663–685. doi:10.1080/01621459.1952.10483446.
  4. Hernan, MA; Robins, JM (2006). "Estimating Causal Effects From Epidemiological Data". J Epidemiol Community Health. 60 (7): 578–596. CiteSeerX 10.1.1.157.9366. doi:10.1136/jech.2004.029496. PMC 2652882. PMID 16790829.
  5. Liao, JG; Rohde, C (2022). "Variance reduction in the inverse probability weighted estimators for the average treatment effect using the propensity score". Biometrics. 78 (2): 660–667. doi:10.1111/biom.13454. PMID 33715153. S2CID 232232367.
  6. Cao, Weihua; Tsiatis, Anastasios A.; Davidian, Marie (2009). "Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data". Biometrika. 96 (3): 723–734. doi:10.1093/biomet/asp033. ISSN 0006-3444. PMC 2798744. PMID 20161511.
  7. Kang, Joseph D. Y.; Schafer, Joseph L. (2007). "Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data". Statistical Science. 22 (4): 523–539.
  8. Kim, Jae Kwang; Haziza, David (2014). "Doubly robust inference with missing data in survey sampling". Statistica Sinica. 24 (1): 375–394.
  9. Seaman, Shaun R.; Vansteelandt, Stijn (2018). "Introduction to double robust methods for incomplete data". Statistical Science. 33 (2): 184.
  10. Hernán, Miguel A.; Robins, James M. (2010). Causal Inference. p. 170.