Missing data

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit ("subject"). Some items are more likely to generate a nonresponse than others: for example, items about private subjects such as income. Attrition is a type of missingness that can occur in longitudinal studies (for instance, studies of development in which a measurement is repeated after a certain period of time). Missingness occurs when participants drop out before the test ends and one or more measurements are missing.

Data often are missing in research in economics, sociology, and political science because governments or private entities choose not to, or fail to, report critical statistics, [1] or because the information is not available. Sometimes missing values are caused by the researcher—for example, when data collection is done improperly or mistakes are made in data entry. [2]

Missingness falls into different types, with different impacts on the validity of conclusions from research: missing completely at random, missing at random, and missing not at random. Missing data can be handled similarly to censored data.

Types

Understanding the reasons why data are missing is important for handling the remaining data correctly. If values are missing completely at random, the data sample is likely still representative of the population. But if the values are missing systematically, analysis may be biased. For example, in a study of the relation between IQ and income, if participants with an above-average IQ tend to skip the question "What is your salary?", analyses that do not take this missingness pattern into account (missing at random, MAR; see below) may falsely fail to find a positive association between IQ and salary. Because of these problems, methodologists routinely advise researchers to design studies to minimize the occurrence of missing values. [2] Graphical models can be used to describe the missing data mechanism in detail. [3] [4]

Figure: probability distributions of estimates of the expected intensity of depression in the population, based on 60 cases. The true population is a standardised normal distribution and the non-response probability is a logistic function of the intensity of depression. The more data are missing (MNAR), the more biased the estimates: the intensity of depression in the population is underestimated.
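The figure's setup can be reproduced with a short simulation. The following Python sketch (the logistic-curve parameters are invented for illustration) draws 60 depression scores from a standard normal distribution, deletes each value with a probability that grows with the score, and shows that the observed-case mean underestimates the true mean of zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60  # number of cases, as in the figure

# True depression scores: standardised normal distribution.
depression = rng.standard_normal(n)

# MNAR mechanism: nonresponse probability is a logistic function of
# the depression score itself (slope and intercept chosen for illustration).
p_missing = 1 / (1 + np.exp(-(2.0 * depression - 0.5)))
observed = depression[rng.random(n) > p_missing]

# The naive estimate from observed cases underestimates the true mean (0),
# because high-depression cases are the ones most likely to be missing.
print(f"true mean: 0.0, observed-case mean: {observed.mean():.2f}")
```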

Missing completely at random

Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. [5] When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR.

In the case of MCAR, the missingness of data is unrelated to any study variable: thus, the participants with completely observed data are in effect a random sample of all the participants assigned a particular intervention. With MCAR, the random assignment of treatments is assumed to be preserved, but that is usually an unrealistically strong assumption in practice. [6]

Missing at random

Missing at random (MAR) occurs when the missingness is not completely random, but can be fully accounted for by variables for which there is complete information. [7] Since MAR is an assumption that is impossible to verify statistically, we must rely on its substantive reasonableness. [8] An example is that males are less likely to fill in a depression survey, but this has nothing to do with their level of depression once sex has been accounted for. Depending on the analysis method, these data can still induce parameter bias due to the contingent emptiness of cells (the cell for males with very high depression, for example, may have zero entries). However, if the parameters are estimated with full information maximum likelihood, MAR will provide asymptotically unbiased estimates.
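A hypothetical numeric illustration of this MAR mechanism (all parameters invented): missingness of the depression score depends only on the fully observed variable sex, so the naive mean over observed cases is biased, while re-weighting the sex groups to their population shares recovers the true mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
male = rng.random(n) < 0.5
# Suppose men score slightly higher on average (illustrative numbers only).
depression = rng.normal(loc=np.where(male, 0.3, 0.0), scale=1.0)

# MAR: men skip the item more often, but given sex, missingness
# is unrelated to the depression score itself.
p_missing = np.where(male, 0.6, 0.1)
seen = rng.random(n) > p_missing

naive = depression[seen].mean()  # overweights women, hence biased
# Reweight the sex groups back to their population shares.
adjusted = 0.5 * depression[seen & male].mean() + 0.5 * depression[seen & ~male].mean()
print(f"true {depression.mean():.3f}, naive {naive:.3f}, adjusted {adjusted:.3f}")
```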

Missing not at random

Missing not at random (MNAR) (also known as nonignorable nonresponse) occurs when data are neither MAR nor MCAR (i.e., the value of the missing variable is related to the reason it is missing). [5] To extend the previous example, this would occur if men failed to fill in a depression survey because of their level of depression.

Samuelson and Spirer (1992) discussed how missing and/or distorted data about demographics, law enforcement, and health could be indicators of patterns of human rights violations. They gave several fairly well documented examples. [9]

Techniques of dealing with missing data

Missing data reduce the representativeness of the sample and can therefore distort inferences about the population. Generally speaking, there are three main approaches to handling missing data: (1) imputation, where values are filled in in place of missing data; (2) omission, where samples with invalid data are discarded from further analysis; and (3) analysis, where methods unaffected by the missing values are applied directly. One systematic review addressing the prevention and handling of missing data for patient-centered outcomes research identified 10 standards as necessary for the prevention and handling of missing data. These include standards for study design, study conduct, analysis, and reporting. [10]

In some practical applications, experimenters can control the level of missingness and prevent missing values before gathering the data. For example, in computer questionnaires it is often not possible to skip a question: a question has to be answered before one can continue to the next. Missing values due to the participant are thus eliminated by this type of questionnaire, though this method may not be permitted by an ethics board overseeing the research. In survey research, it is common to make multiple efforts to contact each individual in the sample, often sending letters to attempt to persuade those who have decided not to participate to change their minds. [11] :161–187 However, such techniques can either help or hurt in terms of reducing the negative inferential effects of missing data, because the kind of people who are willing to be persuaded to participate after initially refusing or not being home are likely to be significantly different from the kinds of people who will still refuse or remain unreachable after additional effort. [11] :188–198

In situations where missing values are likely to occur, the researcher is often advised to plan to use methods of data analysis that are robust to missingness. An analysis is robust when we are confident that mild to moderate violations of the technique's key assumptions will produce little or no bias or distortion in the conclusions drawn about the population.

Imputation

Some data analysis techniques are not robust to missingness and require the missing data to be "filled in", or imputed. Rubin (1987) argued that repeating imputation even a few times (five or fewer) enormously improves the quality of estimation. [2] For many practical purposes, two or three imputations capture most of the relative efficiency that could be captured with a larger number of imputations. However, too small a number of imputations can lead to a substantial loss of statistical power, and some scholars now recommend 20 to 100 or more. [12] Any multiply-imputed data analysis must be repeated for each of the imputed data sets and, in some cases, the relevant statistics must be combined in a relatively complicated way. [2] Multiple imputation is nevertheless not routine in some disciplines, owing to a lack of training and to misconceptions about the method. [13] Methods such as listwise deletion have been used instead of imputation, but listwise deletion has been found to introduce additional bias. [14] A beginner's guide provides step-by-step instructions for imputing data. [15]
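As a concrete illustration, here is a minimal multiple-imputation sketch in Python, assuming scikit-learn's IterativeImputer as the imputation engine: generate m completed datasets under different seeds, compute the statistic of interest on each, and pool the results with Rubin's rules (within- plus between-imputation variance).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def pooled_mean(X, column=0, m=20, seed=0):
    """Multiply impute X, estimate a column mean on each completed
    dataset, and combine the m estimates with Rubin's rules."""
    estimates, variances = [], []
    for i in range(m):
        imputer = IterativeImputer(random_state=seed + i, sample_posterior=True)
        completed = imputer.fit_transform(X)
        col = completed[:, column]
        estimates.append(col.mean())
        variances.append(col.var(ddof=1) / len(col))  # within-imputation variance
    q = np.mean(estimates)           # pooled point estimate
    w = np.mean(variances)           # average within-imputation variance
    b = np.var(estimates, ddof=1)    # between-imputation variance
    total_var = w + (1 + 1 / m) * b  # Rubin's total variance
    return q, np.sqrt(total_var)

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=500)
X[rng.random(500) < 0.3, 0] = np.nan  # MCAR holes in the first column
print(pooled_mean(X))
```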

The expectation-maximization algorithm is an approach in which values of the statistics which would be computed if a complete dataset were available are estimated (imputed), taking into account the pattern of missing data. In this approach, values for individual missing data-items are not usually imputed.
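A hedged sketch of this idea for a bivariate normal sample in which only the second coordinate has missing entries: the E-step computes the conditional expectations (and second moments) of the missing values under the current parameters, and the M-step re-estimates the mean and covariance from those expected sufficient statistics; no individual imputed values are reported.

```python
import numpy as np

def em_bivariate_normal(x, y, n_iter=50):
    """EM estimates of mean/covariance when some y are NaN (x complete)."""
    obs = ~np.isnan(y)
    # Initialise from complete cases.
    mu = np.array([x.mean(), y[obs].mean()])
    cov = np.cov(x[obs], y[obs])
    for _ in range(n_iter):
        # E-step: expected y and y^2 for rows with y missing, from the
        # conditional normal y | x under the current parameters.
        beta = cov[0, 1] / cov[0, 0]
        resid_var = cov[1, 1] - cov[0, 1] ** 2 / cov[0, 0]
        ey = np.where(obs, y, mu[1] + beta * (x - mu[0]))
        ey2 = np.where(obs, y**2, ey**2 + resid_var)
        # M-step: update the parameters from the expected statistics.
        mu = np.array([x.mean(), ey.mean()])
        cxy = (x * ey).mean() - mu[0] * mu[1]
        cov = np.array([[x.var(), cxy],
                        [cxy, ey2.mean() - mu[1] ** 2]])
    return mu, cov

rng = np.random.default_rng(3)
x, y = rng.multivariate_normal([1, 2], [[1, 0.8], [0.8, 2]], size=1000).T
y = np.where(rng.random(1000) < 0.4, np.nan, y)  # MCAR missingness in y
print(em_bivariate_normal(x, y))
```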

Interpolation

In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points.
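For ordered data such as time series, a short pandas sketch (dates and values invented) shows linear interpolation filling interior gaps from the neighbouring observations:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0],
              index=pd.date_range("2024-01-01", periods=5))
# Linear interpolation fills 2024-01-02 with 2.0 and 2024-01-03 with 3.0,
# drawing a straight line between the known neighbours.
print(s.interpolate(method="linear"))
```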

In the comparison of two paired samples with missing data, a test statistic that uses all available data without the need for imputation is the partially overlapping samples t-test. [16] This is valid under normality and assuming MCAR.

Partial deletion

Methods which involve reducing the data available to a dataset having no missing values include:

- Listwise (complete-case) deletion, in which an entire record is excluded from analysis if any single value is missing.

- Pairwise (available-case) deletion, in which each statistic is computed from all cases that have complete data on the variables that statistic involves.
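A brief pandas illustration of the two deletion strategies (column names invented): dropna performs listwise deletion, while DataFrame.corr computes each correlation from the rows available for that particular pair of variables.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "iq":     [110, 95, np.nan, 120, 100, 105],
    "income": [50, np.nan, 40, 70, 55, 60],
    "age":    [30, 41, 35, np.nan, 28, 33],
})

listwise = df.dropna()  # keeps only rows with no missing value
print(len(df), "->", len(listwise), "rows after listwise deletion")

# Pairwise deletion: each correlation uses all rows where *both*
# variables in the pair are present, so different cells of the
# matrix may be based on different subsets of the data.
print(df.corr())
```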

Full analysis

Methods which take full account of all information available, without the distortion resulting from using imputed values as if they were actually observed, include:

- Full information maximum likelihood estimation.

- Max-margin classification of data with absent features. [17] [18]

Partial identification methods may also be used. [19]

Model-based techniques

Model-based techniques, often using graphs, offer additional tools for testing missing data types (MCAR, MAR, MNAR) and for estimating parameters under missing data conditions. For example, a test for refuting MAR/MCAR reads as follows:

For any three variables X, Y, and Z, where Z is fully observed and X and Y partially observed, the data should satisfy: $X \perp\!\!\!\perp R_y \mid (R_x, Z)$, where $R_x$ and $R_y$ denote the missingness indicators of X and Y ($R_x = 0$ when X is observed).

In words, the observed portion of X should be independent of the missingness status of Y, conditional on every value of Z. Failure to satisfy this condition indicates that the problem belongs to the MNAR category. [20]

(Remark: these tests are necessary for variable-based MAR, which is a slight variation of event-based MAR. [21] [22] [23] )
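One hedged sketch of how such a test might be run on discrete data, using a chi-squared test from SciPy (the function and column names are illustrative, not from the cited papers): restrict to rows where X is observed, build the missingness indicator of Y, and compare the distribution of X across that indicator within each stratum of Z.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def refute_mar(df, x, y, z):
    """Test X _||_ R_y | Z on rows where X is observed.
    Small p-values in some stratum of Z speak against MAR/MCAR."""
    data = df[df[x].notna()].copy()
    data["r_y"] = data[y].isna().astype(int)  # missingness indicator of Y
    results = {}
    for z_val, stratum in data.groupby(z):
        table = pd.crosstab(stratum[x], stratum["r_y"])
        if table.shape[1] == 2:  # need both observed and missing cases
            _, p, _, _ = chi2_contingency(table)
            results[z_val] = p
    return results

# Hypothetical usage:
# p_values = refute_mar(df, x="income", y="depression", z="sex")
```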

When data fall into the MNAR category, techniques are available for consistently estimating parameters when certain conditions hold in the model. [3] For example, if Y explains the reason for missingness in X and Y itself has missing values, the joint probability distribution of X and Y can still be estimated if the missingness of Y is random. The estimand in this case is:

$P(X, Y) = P(X \mid Y, R_x = 0, R_y = 0) \, P(Y \mid R_y = 0)$

where $R_x = 0$ and $R_y = 0$ denote the observed portions of their respective variables.

Different model structures may yield different estimands and different procedures of estimation whenever consistent estimation is possible. The preceding estimand calls for first estimating $P(X \mid Y)$ from complete data and multiplying it by $P(Y)$ estimated from cases in which Y is observed regardless of the status of X. Moreover, in order to obtain a consistent estimate it is crucial that the first term be $P(X \mid Y)$ as opposed to $P(X \mid Y, R_y = 0)$.
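For discrete variables this estimand can be computed directly from the observed data. A minimal sketch in Python with pandas, assuming hypothetical column names X and Y where NaN marks a missing entry:

```python
import pandas as pd

def recover_joint(df, x="X", y="Y"):
    """Estimate P(X, Y) = P(X | Y, complete cases) * P(Y | Y observed)
    for discrete variables, under the model described in the text."""
    complete = df.dropna(subset=[x, y])
    # P(X | Y) from cases where both X and Y are observed.
    p_x_given_y = pd.crosstab(complete[x], complete[y], normalize="columns")
    # P(Y) from all cases where Y is observed, regardless of X.
    p_y = df[y].dropna().value_counts(normalize=True)
    return p_x_given_y * p_y  # joint distribution P(X, Y)

# Hypothetical usage:
# joint = recover_joint(df, x="smoker", y="diagnosis")
```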

In many cases, model-based techniques permit the model structure to undergo refutation tests. [23] Any model which implies the independence between a partially observed variable X and the missingness indicator of another variable Y (i.e., $X \perp\!\!\!\perp R_y$), conditional on $R_x$, can be submitted to the following refutation test: $X \perp\!\!\!\perp R_y \mid R_x = 0$.

Finally, the estimands that emerge from these techniques are derived in closed form and do not require iterative procedures such as Expectation Maximization that are susceptible to local optima. [24]

A special class of problems appears when the probability of missingness depends on time. For example, in trauma databases the probability of losing data about the trauma outcome depends on the day after trauma. In these cases various non-stationary Markov chain models are applied. [25]
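A minimal sketch of the idea, with an invented time-varying hazard: in a non-stationary (time-inhomogeneous) Markov chain the transition matrix depends on the day, here moving records from an "observed" state into an absorbing "outcome lost" state.

```python
import numpy as np

# States: 0 = record still observed, 1 = outcome lost (absorbing).
def p_loss(day):
    # Hypothetical hazard of losing the outcome record, highest early on.
    return 0.02 + 0.03 * np.exp(-day / 5)

dist = np.array([1.0, 0.0])  # everyone observed at day 0
for day in range(1, 31):
    hazard = p_loss(day)
    transition = np.array([[1 - hazard, hazard],
                           [0.0, 1.0]])  # non-stationary: depends on the day
    dist = dist @ transition

print(f"P(outcome missing by day 30) = {dist[1]:.3f}")
```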


References

  1. Messner SF (1992). "Exploring the Consequences of Erratic Data Reporting for Cross-National Research on Homicide". Journal of Quantitative Criminology. 8 (2): 155–173. doi:10.1007/bf01066742. S2CID 133325281.
  2. Hand, David J.; Adèr, Herman J.; Mellenbergh, Gideon J. (2008). Advising on Research Methods: A Consultant's Companion. Huizen, Netherlands: Johannes van Kessel. pp. 305–332. ISBN 978-90-79418-01-5.
  3. Mohan, Karthika; Pearl, Judea; Tian, Jin (2013). "Graphical Models for Inference with Missing Data". Advances in Neural Information Processing Systems 26. pp. 1277–1285.
  4. Karvanen, Juha (2015). "Study design in causal models". Scandinavian Journal of Statistics. 42 (2): 361–377. arXiv:1211.2958. doi:10.1111/sjos.12110. S2CID 53642701.
  5. Polit, D.F.; Beck, C.T. (2012). Nursing Research: Generating and Assessing Evidence for Nursing Practice, 9th ed. Philadelphia, USA: Wolters Kluwer Health, Lippincott Williams & Wilkins.
  6. Deng (2012-10-05). "On Biostatistics and Clinical Trials". Archived from the original on 15 March 2016. Retrieved 13 May 2016.
  7. "Home". Archived from the original on 2015-09-10. Retrieved 2015-08-01.
  8. Little, Roderick J. A.; Rubin, Donald B. (2002). Statistical Analysis with Missing Data (2nd ed.). Wiley.
  9. Samuelson, Douglas A.; Spirer, Herbert F. (1992-12-31). "Chapter 3. Use of Incomplete and Distorted Data in Inference About Human Rights Violations". Human Rights and Statistics. University of Pennsylvania Press. pp. 62–78. doi:10.9783/9781512802863-006. ISBN 9781512802863. Retrieved 2022-08-18.
  10. Li, Tianjing; Hutfless, Susan; Scharfstein, Daniel O.; Daniels, Michael J.; Hogan, Joseph W.; Little, Roderick J.A.; Roy, Jason A.; Law, Andrew H.; Dickersin, Kay (2014). "Standards should be applied in the prevention and handling of missing data for patient-centered outcomes research: a systematic review and expert consensus". Journal of Clinical Epidemiology. 67 (1): 15–32. doi:10.1016/j.jclinepi.2013.08.013. PMC 4631258. PMID 24262770.
  11. Stoop, I.; Billiet, J.; Koch, A.; Fitzgerald, R. (2010). Reducing Survey Nonresponse: Lessons Learned from the European Social Survey. Oxford: Wiley-Blackwell. ISBN 978-0-470-51669-0.
  12. Graham, J.W.; Olchowski, A.E.; Gilreath, T.D. (2007). "How Many Imputations Are Really Needed? Some Practical Clarifications of Multiple Imputation Theory". Prevention Science. 8 (3): 208–213. CiteSeerX 10.1.1.595.7125. doi:10.1007/s11121-007-0070-9. PMID 17549635. S2CID 24566076.
  13. van Ginkel, Joost R.; Linting, Marielle; Rippe, Ralph C. A.; van der Voort, Anja (2020-05-03). "Rebutting Existing Misconceptions About Multiple Imputation as a Method for Handling Missing Data". Journal of Personality Assessment. 102 (3): 297–308. doi:10.1080/00223891.2018.1530680. hdl:1887/138825. ISSN 0022-3891. PMID 30657714. S2CID 58580667.
  14. van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). CRC Press.
  15. Woods, Adrienne D.; Gerasimova, Daria; Van Dusen, Ben; Nissen, Jayson; Bainter, Sierra; Uzdavines, Alex; Davis-Kean, Pamela E.; Halvorson, Max; King, Kevin M.; Logan, Jessica A. R.; Xu, Menglin; Vasilev, Martin R.; Clay, James M.; Moreau, David; Joyal-Desmarais, Keven (2023-02-23). "Best practices for addressing missing data through multiple imputation". Infant and Child Development. doi:10.1002/icd.2407. ISSN 1522-7227.
  16. Derrick, B.; Russ, B.; Toher, D.; White, P. (2017). "Test Statistics for the Comparison of Means for Two Samples That Include Both Paired and Independent Observations". Journal of Modern Applied Statistical Methods. 16 (1): 137–157. doi:10.22237/jmasm/1493597280.
  17. Chechik, Gal; Heitz, Geremy; Elidan, Gal; Abbeel, Pieter; Koller, Daphne (2008-06-01). "Max-margin Classification of Incomplete Data" (PDF). Neural Information Processing Systems: 233–240.
  18. Chechik, Gal; Heitz, Geremy; Elidan, Gal; Abbeel, Pieter; Koller, Daphne (2008-06-01). "Max-margin Classification of Data with Absent Features". The Journal of Machine Learning Research. 9: 1–21. ISSN 1532-4435.
  19. Tamer, Elie (2010). "Partial Identification in Econometrics". Annual Review of Economics. 2 (1): 167–195. doi:10.1146/annurev.economics.050708.143401.
  20. Mohan, Karthika; Pearl, Judea (2014). "On the testability of models with missing data". Proceedings of AISTAT-2014, Forthcoming.
  21. Darwiche, Adnan (2009). Modeling and Reasoning with Bayesian Networks. Cambridge University Press.
  22. Potthoff, R.F.; Tudor, G.E.; Pieper, K.S.; Hasselblad, V. (2006). "Can one assess whether missing data are missing at random in medical studies?". Statistical Methods in Medical Research. 15 (3): 213–234. doi:10.1191/0962280206sm448oa. PMID 16768297. S2CID 12882831.
  23. Pearl, Judea; Mohan, Karthika (2013). Recoverability and Testability of Missing data: Introduction and Summary of Results (PDF) (Technical report). UCLA Computer Science Department, R-417.
  24. Mohan, K.; Van den Broeck, G.; Choi, A.; Pearl, J. (2014). "An Efficient Method for Bayesian Network Parameter Learning from Incomplete Data". Presented at Causal Modeling and Machine Learning Workshop, ICML-2014.
  25. Mirkes, E.M.; Coats, T.J.; Levesley, J.; Gorban, A.N. (2016). "Handling missing data in large healthcare dataset: A case study of unknown trauma outcomes". Computers in Biology and Medicine. 75: 203–216. arXiv:1604.00627. Bibcode:2016arXiv160400627M. doi:10.1016/j.compbiomed.2016.06.004. PMID 27318570. S2CID 5874067. Archived from the original on 2016-08-05.
