Imputation (statistics)

In statistics, imputation is the process of replacing missing data with substituted values. Substituting for an entire data point is known as "unit imputation"; substituting for a component of a data point is known as "item imputation". Missing data cause three main problems: they can introduce a substantial amount of bias, make the handling and analysis of the data more arduous, and reduce efficiency. [1] Imputation is therefore seen as a way to avoid the pitfalls of listwise deletion of cases that have missing values. When one or more values are missing for a case, most statistical packages default to discarding the whole case, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can be analysed using standard techniques for complete data. [2]

Many approaches have been proposed to account for missing data, but the majority of them introduce bias. Well-known approaches include hot-deck and cold-deck imputation; listwise and pairwise deletion; mean imputation; non-negative matrix factorization; regression imputation; last observation carried forward; stochastic imputation; and multiple imputation.

Listwise (complete case) deletion

By far the most common means of dealing with missing data is listwise deletion (also known as complete-case analysis), in which all cases with a missing value are deleted. If the data are missing completely at random, listwise deletion does not add any bias, but it does decrease the power of the analysis by decreasing the effective sample size. For example, if 1000 cases are collected but 80 have missing values, the effective sample size after listwise deletion is 920. If the data are not missing completely at random, listwise deletion will introduce bias because the sub-sample of complete cases is not representative of the original sample (and if the original sample was itself a representative sample of a population, the complete cases are not representative of that population either). [3] In practice, data are rarely missing completely at random. [4]
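As a minimal sketch of listwise deletion, the following uses only the Python standard library. The records and field names are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 29, "income": None, "score": 6.4},
    {"age": 41, "income": 61000, "score": None},
    {"age": 55, "income": 48000, "score": 8.0},
    {"age": None, "income": 39000, "score": 5.9},
]


def listwise_delete(rows):
    """Keep only the rows in which every field is observed."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete_cases = listwise_delete(records)
n_original = len(records)
n_effective = len(complete_cases)  # the effective sample size
print(f"{n_effective} of {n_original} cases retained")
```

Every case with at least one gap is dropped, so the observed values in those cases are lost along with the missing ones.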

Pairwise deletion (or "available case analysis") involves deleting a case when it is missing a variable required for a particular analysis, but including that case in analyses for which all required variables are present. When pairwise deletion is used, the total N for analysis will not be consistent across parameter estimations. Because different parameters are estimated from different subsets of cases, pairwise deletion can produce mathematically impossible results, such as estimated correlations greater than 1. [5]
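The varying N under pairwise deletion can be illustrated with a short sketch. The data are hypothetical, `None` marks a missing value, and each correlation is computed only from the cases in which both variables of the pair are observed.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data; None marks a missing value.
data = {
    "x": [1.0, 2.0, None, 4.0, 5.0, 6.0],
    "y": [2.1, None, 3.2, 4.8, 5.1, None],
    "z": [0.5, 1.9, 2.2, None, 4.1, 5.3],
}


def pearson(a, b):
    """Pearson correlation of two equally long, fully observed sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def pairwise_correlations(columns):
    """Map each variable pair to (correlation, number of cases used)."""
    results = {}
    for p, q in combinations(columns, 2):
        pairs = [(u, v) for u, v in zip(columns[p], columns[q])
                 if u is not None and v is not None]
        a, b = zip(*pairs)
        results[(p, q)] = (pearson(a, b), len(pairs))
    return results


results = pairwise_correlations(data)
for pair, (r, n) in results.items():
    print(pair, f"r={r:.3f}", f"N={n}")
```

Each pair is based on a different number of cases, which is the inconsistency described above.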

The one advantage complete-case deletion has over other methods is that it is straightforward and easy to implement. This is a large reason why complete-case analysis remains the most popular method of handling missing data in spite of its many disadvantages.

Single imputation

Hot-deck

A once-common method of imputation was hot-deck imputation where a missing value was imputed from a randomly selected similar record. The term "hot deck" dates back to the storage of data on punched cards, and indicates that the information donors come from the same dataset as the recipients. The stack of cards was "hot" because it was currently being processed.
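A minimal sketch of random hot-deck imputation follows. It assumes that "similar" means "in the same class"; the records, the `region` class variable and the function name are hypothetical.

```python
import random

# Hypothetical survey records; None marks a missing income.
survey = [
    {"region": "north", "income": 52000},
    {"region": "north", "income": None},
    {"region": "south", "income": 39000},
    {"region": "north", "income": 48000},
    {"region": "south", "income": None},
    {"region": "south", "income": 41000},
]


def hot_deck_impute(rows, target, class_key, rng):
    """Fill each missing `target` with the value of a random donor in the same class."""
    donors = {}
    for row in rows:
        if row[target] is not None:
            donors.setdefault(row[class_key], []).append(row[target])
    completed = []
    for row in rows:
        new_row = dict(row)  # copy, so the input is left untouched
        if new_row[target] is None:
            pool = donors.get(new_row[class_key])
            if pool:  # leave the gap if the class has no donor
                new_row[target] = rng.choice(pool)
        completed.append(new_row)
    return completed


completed = hot_deck_impute(survey, "income", "region", random.Random(42))
```

Donors and recipients come from the same dataset, which is what makes this a "hot" deck.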

One form of hot-deck imputation is called "last observation carried forward" (or LOCF for short), which involves sorting a dataset according to any of a number of variables, thus creating an ordered dataset. The technique then finds the first missing value and imputes it with the cell value immediately before it. The process is repeated for the next cell with a missing value until all missing values have been imputed. In the common scenario in which the cases are repeated measurements of a variable for a person or other entity, this represents the belief that if a measurement is missing, the best guess is that it has not changed since the last time it was measured. This method is known to increase the risk of bias and of potentially false conclusions. For this reason, LOCF is not recommended. [6]
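LOCF on an already ordered series can be sketched as follows. The measurements are hypothetical, `None` marks a missing value, and a gap at the very start is left unfilled because there is nothing to carry forward.

```python
def locf(values):
    """Replace each missing value with the most recent observed value."""
    filled = []
    last_seen = None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled


# Hypothetical repeated measurements for one subject, in time order.
visits = [7.2, None, None, 6.8, None, 6.1]
print(locf(visits))  # [7.2, 7.2, 7.2, 6.8, 6.8, 6.1]
```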

Cold-deck

Cold-deck imputation, by contrast, selects donors from another dataset. It replaces a missing value with the response value of a similar item in a past survey, and is available in surveys that measure time intervals. Due to advances in computer power, more sophisticated methods of imputation have generally superseded the original random and sorted hot-deck imputation techniques.

Mean substitution

Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates any correlations involving the variable(s) that are imputed. This is because, in cases with imputation, there is guaranteed to be no relationship between the imputed variable and any other measured variables. Thus, mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis.
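The univariate behaviour described above can be checked with a small sketch using hypothetical data. The sample mean is unchanged by mean substitution, while the variance shrinks because the imputed values add no spread.

```python
from statistics import mean, pvariance


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


# Hypothetical measurements; None marks a missing value.
heights = [160.0, None, 172.0, 181.0, None, 167.0]
observed = [v for v in heights if v is not None]
filled = mean_impute(heights)

print(mean(observed), mean(filled))            # the mean is preserved
print(pvariance(observed), pvariance(filled))  # the variance is reduced
```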

Mean imputation can be carried out within classes (i.e. categories such as gender), and can be expressed as $\hat{y}_i = \bar{y}_h$, where $\hat{y}_i$ is the imputed value for record $i$ and $\bar{y}_h$ is the sample mean of respondent data within some class $h$. This is a special case of generalized regression imputation:

$$\hat{y}_{mi} = b_{r0} + \sum_j b_{rj} z_{mij} + \hat{e}_{mi}$$

Here the values $b_{r0}, b_{rj}$ are estimated from regressing $y$ on $z$ in non-imputed data, $z$ is a dummy variable for class membership, and data are split into respondent ($r$) and missing ($m$). [7] [8]
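Mean imputation within classes, in which each missing value is replaced by the mean of the respondents in the same class, can be sketched as follows. The records, the `group` class variable and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"group": "a", "y": 10.0},
    {"group": "a", "y": 14.0},
    {"group": "a", "y": None},
    {"group": "b", "y": 20.0},
    {"group": "b", "y": None},
    {"group": "b", "y": 30.0},
]


def class_mean_impute(records, target, class_key):
    """Replace each missing `target` with the respondent mean of its class."""
    respondents = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            respondents[rec[class_key]].append(rec[target])
    class_means = {h: mean(vals) for h, vals in respondents.items()}

    completed = []
    for rec in records:
        new_rec = dict(rec)
        # Leave the gap in place if the class has no respondents at all.
        if new_rec[target] is None and new_rec[class_key] in class_means:
            new_rec[target] = class_means[new_rec[class_key]]
        completed.append(new_rec)
    return completed


completed_rows = class_mean_impute(rows, "y", "group")
```

In regression terms, this corresponds to regressing the variable on class-membership dummies and imputing the fitted value with no residual added.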

Non-negative matrix factorization

Non-negative matrix factorization (NMF) can accommodate missing data by excluding the missing entries from its cost function, rather than treating them as zeros, which could introduce bias. [9] This gives NMF-based imputation a mathematical justification: because the missing entries are ignored in the cost function, their impact can be as small as a second-order effect.
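One way to exclude missing entries from the NMF cost function is to weight the squared error by a 0/1 mask. The sketch below uses masked multiplicative updates with NumPy; it is an illustrative variant and not necessarily the algorithm of [9]. The function name, the rank, the iteration count and the toy matrix are all assumptions.

```python
import numpy as np


def masked_nmf(X, mask, rank, n_iter=500, eps=1e-9, seed=0):
    """Factor X ~ W @ H using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.uniform(0.1, 1.0, size=(n, rank))
    H = rng.uniform(0.1, 1.0, size=(rank, m))
    # Placeholder zeros at missing entries never enter the cost,
    # because the same mask is applied to the reconstruction.
    X_obs = np.where(mask, X, 0.0)
    for _ in range(n_iter):
        W *= (X_obs @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ X_obs) / (W.T @ (mask * (W @ H)) + eps)
    return W, H


# Hypothetical rank-1 non-negative matrix with one entry removed.
X_true = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
mask = np.ones(X_true.shape, dtype=bool)
mask[1, 1] = False
X_missing = X_true.copy()
X_missing[1, 1] = np.nan

W, H = masked_nmf(X_missing, mask, rank=1)
X_imputed = np.where(mask, X_missing, W @ H)  # fill only the gap
print(X_imputed[1, 1], "vs true value", X_true[1, 1])
```

Observed entries are kept as they are; only the masked entry is replaced by the low-rank reconstruction.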

Regression

Regression imputation has the opposite problem to mean imputation. A regression model is estimated to predict observed values of a variable based on other variables, and the fitted values from that model are then used to impute values in cases where the value of that variable is missing. In other words, available information for complete and incomplete cases is used to predict the value of a specific variable. The problem is that the imputed data do not have an error term included in their estimation, so the estimates fit perfectly along the regression line without any residual variance. This causes relationships to be overstated and suggests greater precision in the imputed values than is warranted. The regression model predicts the most likely value of missing data but does not supply uncertainty about that value.
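A minimal sketch of deterministic regression imputation with a single predictor follows. The data and function names are hypothetical, and simple least squares stands in for whatever regression model is actually used.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def regression_impute(xs, ys):
    """Fill each missing y (None) with the fitted value at its x."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    filled = [intercept + slope * x if y is None else y
              for x, y in zip(xs, ys)]
    return filled, intercept, slope


# Hypothetical data; x is fully observed, y has two gaps.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 3.9, None, 8.1, None, 12.0]
y_filled, intercept, slope = regression_impute(x, y)
```

Every imputed value lies exactly on the fitted line, so the imputed cases contribute no residual variance.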

Stochastic regression is a fairly successful attempt to correct the lack of an error term in regression imputation: a random error, drawn from a distribution whose variance is the residual variance of the regression, is added to each imputed value. Stochastic regression shows much less bias than the techniques above, but it still misses one thing: because the data are imputed, one would intuitively expect more noise to be introduced than the residual variance alone. [5]
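Stochastic regression imputation can be sketched by adding a random residual to each fitted value. The sketch assumes normally distributed errors with the estimated residual standard deviation; the data and function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def stochastic_regression_impute(xs, ys, rng):
    """Fill each missing y with its fitted value plus a random residual."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    # Residual standard deviation with n - 2 degrees of freedom
    # (needs at least three observed cases).
    sse = sum((y - (a + b * x)) ** 2 for x, y in observed)
    sigma = sqrt(sse / (len(observed) - 2))
    return [a + b * x + rng.gauss(0.0, sigma) if y is None else y
            for x, y in zip(xs, ys)]


# Hypothetical data; x is fully observed, y has two gaps.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 3.9, None, 8.1, None, 12.0]
draw_1 = stochastic_regression_impute(x, y, random.Random(1))
draw_2 = stochastic_regression_impute(x, y, random.Random(2))
```

Different random seeds give different imputed values, while the observed values are always left unchanged.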

Multiple imputation

To deal with the problem of increased noise due to imputation, Rubin (1987) [10] developed a method for averaging the outcomes across multiple imputed data sets. All multiple imputation methods follow three steps. [3]

  1. Imputation – Similar to single imputation, missing values are imputed. However, the imputed values are drawn m times from a distribution rather than just once. At the end of this step, there should be m completed datasets.
  2. Analysis – Each of the m datasets is analyzed. At the end of this step there should be m analyses.
  3. Pooling – The m results are consolidated into one result by calculating the mean, variance, and confidence interval of the variable of concern [11] [12] or by combining simulations from each separate model. [13]
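The pooling step is commonly carried out with Rubin's rules, which combine the m point estimates and their variances. The sketch below assumes the m analyses have already been run; the estimates and variances are hypothetical numbers, and the confidence interval is omitted because it requires a t-distribution quantile that the standard library does not provide.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)       # pooled point estimate
    u_bar = mean(variances)       # within-imputation variance
    b = variance(estimates)       # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b
    # Rubin's (1987) degrees of freedom for a t-based interval.
    if b > 0:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    else:
        df = float("inf")
    return q_bar, total, df


# Hypothetical results from m = 5 analyses of imputed datasets.
estimates = [2.0, 2.2, 1.9, 2.1, 2.3]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled_estimate, total_variance, df = pool_rubin(estimates, variances)
print(pooled_estimate, total_variance, df)
```

The total variance exceeds the average within-imputation variance whenever the m estimates disagree, which is how the extra uncertainty due to imputation enters the final result.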

Multiple imputation can be used in cases where the data are missing completely at random, missing at random, or missing not at random, though it can be biased in the last case. [14] One approach is multiple imputation by chained equations (MICE), also known as "fully conditional specification" and "sequential regression multiple imputation." [15] MICE is designed for missing at random data, though there is simulation evidence to suggest that with a sufficient number of auxiliary variables it can also work on data that are missing not at random. However, MICE can suffer from performance problems when the number of observations is large and the data have complex features, such as nonlinearities and high dimensionality.
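The idea of chained equations can be illustrated with a heavily simplified sketch: each incomplete variable is regressed in turn on the other, and its missing entries are redrawn from the fitted model. This is not the full MICE algorithm. It assumes exactly two variables, linear models with normal errors, and it does not draw the regression parameters from their posterior distribution. All names and data are hypothetical.

```python
import random
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b


def chained_imputation(columns, rng, n_cycles=10):
    """Return one completed dataset for exactly two variables."""
    first, second = columns  # the two variable names
    missing = {k: [v is None for v in col] for k, col in columns.items()}
    # Start by filling every gap with the observed mean of its variable.
    filled = {}
    for name, col in columns.items():
        m = mean([v for v in col if v is not None])
        filled[name] = [m if v is None else v for v in col]
    for _ in range(n_cycles):
        for target, predictor in ((first, second), (second, first)):
            # Fit target ~ predictor on rows where the target was observed.
            rows = [(x, y) for x, y, miss in
                    zip(filled[predictor], filled[target], missing[target])
                    if not miss]
            a, b = fit_line([x for x, _ in rows], [y for _, y in rows])
            sigma = sqrt(sum((y - a - b * x) ** 2 for x, y in rows)
                         / (len(rows) - 2))
            # Redraw only the originally missing entries.
            filled[target] = [
                a + b * x + rng.gauss(0.0, sigma) if miss else y
                for x, y, miss in
                zip(filled[predictor], filled[target], missing[target])
            ]
    return filled


# Hypothetical data with gaps in both variables.
data = {
    "x": [1.0, 2.0, None, 4.0, 5.0, None, 7.0, 8.0],
    "y": [1.2, None, 3.1, 3.9, None, 6.2, 7.1, 7.8],
}
m = 5
completed = [chained_imputation(data, random.Random(seed)) for seed in range(m)]
```

Each of the m completed datasets would then be analysed separately and the results pooled, as in the three steps listed above.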

More recent approaches to multiple imputation use machine learning techniques to improve its performance. MIDAS (Multiple Imputation with Denoising Autoencoders), for instance, uses denoising autoencoders, a type of unsupervised neural network, to learn fine-grained latent representations of the observed data. [16] MIDAS has been shown to provide accuracy and efficiency advantages over traditional multiple imputation strategies.

As noted in the previous section, single imputation does not take into account the uncertainty in the imputations. After imputation, the imputed data are treated as if they were real observed values. Neglecting the uncertainty in the imputation can lead to overly precise results and errors in any conclusions drawn. [17] By imputing multiple times, multiple imputation accounts for the uncertainty and the range of values that the true value could have taken. Combining uncertainty estimation with deep learning for imputation has been reported to be among the best-performing strategies and has been used to model heterogeneous drug discovery data. [18] [19]

Additionally, while single imputation and complete-case analysis are easier to implement, multiple imputation is not very difficult to implement. A wide range of packages in different statistical software readily perform multiple imputation. For example, the mice package allows R users to perform multiple imputation using the MICE method. [20] MIDAS can be implemented in R with the rMIDAS package and in Python with the MIDASpy package. [16]

References

  1. Barnard, J.; Meng, X. L. (1999-03-01). "Applications of multiple imputation in medical studies: from AIDS to NHANES". Statistical Methods in Medical Research. 8 (1): 17–36. doi:10.1177/096228029900800103. ISSN 0962-2802. PMID 10347858. S2CID 11453137.
  2. Gelman, Andrew; Hill, Jennifer (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. Ch. 25.
  3. Lall, Ranjit (2016). "How Multiple Imputation Makes a Difference". Political Analysis. 24 (4): 414–433. doi:10.1093/pan/mpw020.
  4. Kenward, Michael G (2013-02-26). "The handling of missing data in clinical trials". Clinical Investigation. 3 (3): 241–250. doi:10.4155/cli.13.7. ISSN 2041-6792.
  5. Enders, C. K. (2010). Applied Missing Data Analysis. New York: Guilford Press. ISBN 978-1-60623-639-0.
  6. Molnar, Frank J.; Hutton, Brian; Fergusson, Dean (2008-10-07). "Does analysis using "last observation carried forward" introduce bias in dementia research?". Canadian Medical Association Journal. 179 (8): 751–753. doi:10.1503/cmaj.080820. ISSN 0820-3946. PMC 2553855. PMID 18838445.
  7. Kalton, Graham (1986). "The treatment of missing survey data". Survey Methodology. 12: 1–16.
  8. Kalton, Graham; Kasprzyk, Daniel (1982). "Imputing for missing survey responses" (PDF). Proceedings of the Section on Survey Research Methods. American Statistical Association. 22. S2CID 195855359. Archived from the original (PDF) on 2020-02-12.
  9. Ren, Bin; Pueyo, Laurent; Chen, Christine; Choquet, Elodie; Debes, John H; Duchene, Gaspard; Menard, Francois; Perrin, Marshall D. (2020). "Using Data Imputation for Signal Separation in High Contrast Imaging". The Astrophysical Journal. 892 (2): 74. arXiv:2001.00563. Bibcode:2020ApJ...892...74R. doi:10.3847/1538-4357/ab7024. S2CID 209531731.
  10. Rubin, Donald (9 June 1987). Multiple imputation for nonresponse in surveys. Wiley Series in Probability and Statistics. Wiley. doi:10.1002/9780470316696. ISBN 9780471087052.
  11. Yuan, Yang C. (2010). "Multiple imputation for missing data: Concepts and new development" (PDF). SAS Institute Inc., Rockville, MD. 49: 1–11.
  12. Van Buuren, Stef (2012-03-29). "2. Multiple Imputation". Flexible Imputation of Missing Data. Chapman & Hall/CRC Interdisciplinary Statistics Series. Vol. 20125245. Chapman and Hall/CRC. doi:10.1201/b11826. ISBN 9781439868249. S2CID 60316970.
  13. King, Gary; Honaker, James; Joseph, Anne; Scheve, Kenneth (March 2001). "Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation". American Political Science Review. 95 (1): 49–69. doi:10.1017/S0003055401000235. ISSN 1537-5943. S2CID 15484116.
  14. Pepinsky, Thomas B. (2018-08-03). "A Note on Listwise Deletion versus Multiple Imputation". Political Analysis. Cambridge University Press (CUP). 26 (4): 480–488. doi:10.1017/pan.2018.18. ISSN 1047-1987.
  15. Azur, Melissa J.; Stuart, Elizabeth A.; Frangakis, Constantine; Leaf, Philip J. (2011-03-01). "Multiple imputation by chained equations: what is it and how does it work?". International Journal of Methods in Psychiatric Research. 20 (1): 40–49. doi:10.1002/mpr.329. ISSN 1557-0657. PMC 3074241. PMID 21499542.
  16. Lall, Ranjit; Robinson, Thomas (2021). "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning". Political Analysis. 30 (2): 179–196. doi:10.1017/pan.2020.49.
  17. Graham, John W. (2009-01-01). "Missing data analysis: making it work in the real world". Annual Review of Psychology. 60: 549–576. doi:10.1146/annurev.psych.58.110405.085530. ISSN 0066-4308. PMID 18652544.
  18. Irwin, Benedict (2020-06-01). "Practical Applications of Deep Learning to Impute Heterogeneous Drug Discovery Data". Journal of Chemical Information and Modeling. 60 (6): 2848–2857. doi:10.1021/acs.jcim.0c00443. PMID 32478517. S2CID 219171721.
  19. Whitehead, Thomas (2019-02-12). "Imputation of Assay Bioactivity Data Using Deep Learning". Journal of Chemical Information and Modeling. 59 (3): 1197–1204. doi:10.1021/acs.jcim.8b00768. PMID 30753070. S2CID 73429643.
  20. Horton, Nicholas J.; Kleinman, Ken P. (2007-02-01). "Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models". The American Statistician. 61 (1): 79–90. doi:10.1198/000313007X172556. ISSN 0003-1305. PMC 1839993. PMID 17401454.