Statistical model validation

Last updated September 21, 2024

In statistics, model validation is the task of evaluating whether a chosen statistical model is appropriate. In statistical inference, a model that appears to fit its data may do so by chance, leading researchers to misjudge how relevant the model actually is. Model validation addresses this by testing whether a statistical model holds up under permutations of the data. Model validation should not be confused with the closely related task of model selection, the process of discriminating between multiple candidate models: validation is concerned less with the conceptual design of models than with the consistency between a chosen model and its stated outputs.

There are many ways to validate a model. Residual plots show the difference between the actual data and the model's predictions; correlations in the residual plots may indicate a flaw in the model. Cross validation iteratively refits the model, each time leaving out a small sample and checking whether the left-out samples are predicted by the model; there are many kinds of cross validation. Predictive simulation compares simulated data to actual data. External validation involves fitting the model to new data. The Akaike information criterion estimates the quality of a model.

Overview

Model validation comes in many forms, and the specific method a researcher uses is often constrained by their research design. In other words, there is no one-size-fits-all method for validating a model. For example, a researcher working with a very limited set of data, about which they have strong prior assumptions, may consider validating the fit of their model within a Bayesian framework, testing the fit under various prior distributions. By contrast, a researcher who has a lot of data and is testing multiple nested models may find that these conditions lend themselves to cross validation, possibly with a leave-one-out test. These are two abstract examples, and any actual model validation will have to consider far more intricacies than are described here, but they illustrate that the choice of validation method always depends on the circumstances.

In general, models can be validated using existing data or using new data. Both approaches are discussed in the following subsections, followed by a note of caution.

Validation with existing data

Validation based on existing data involves analyzing the goodness of fit of the model or analyzing whether the residuals seem to be random (i.e. residual diagnostics). This method analyzes the model's closeness to the data in order to understand how well the model predicts its own data. One example is shown in Figure 1, in which a polynomial function is fit to some data. The polynomial function does not conform well to the data, which appear linear, and this might invalidate the polynomial model.

Commonly, statistical models on existing data are validated using a validation set, which may also be referred to as a holdout set. A validation set is a set of data points that the user leaves out when fitting a statistical model. After the statistical model is fitted, the validation set is used as a measure of the model's error. If the model fits well on the initial data but has a large error on the validation set, this is a sign of overfitting.
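The holdout idea can be sketched in a few lines of Python. This is an illustrative sketch only: the synthetic data, the 150/50 split, and the two models (a least-squares line and a deliberately overfitted nearest-neighbour "memorizer") are assumptions made for the example and are not taken from the cited sources.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predicts the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: a noisy straight line on 200 distinct inputs in [0, 10].
rng = random.Random(0)
data = [(10 * i / 199, 2.0 + 0.5 * (10 * i / 199) + rng.gauss(0, 1))
        for i in range(200)]
rng.shuffle(data)

# Leave 50 points out of the fitting process as the validation (holdout) set.
train, holdout = data[:150], data[150:]
train_x, train_y = zip(*train)
hold_x, hold_y = zip(*holdout)

results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, hold_x, hold_y))
    print(f"{name:>9}: train MSE = {results[name][0]:.3f}, "
          f"holdout MSE = {results[name][1]:.3f}")
```

Under these assumptions the memorizer reproduces its training data exactly, so its training error is zero, yet its error on the holdout set is not. A gap of this kind between training and holdout error is the overfitting signal described above.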

Validation with new data

If new data becomes available, an existing model can be validated by assessing whether the new data is predicted by the old model. If the new data is not predicted by the old model, then the model might not be valid for the researcher's goals.

With this in mind, a modern approach to validating a neural network is to test its performance on domain-shifted data. This ascertains whether the model learned domain-invariant features.^[1]

A note of caution

A model can be validated only relative to some application area.^[2]^[3] A model that is valid for one application might be invalid for some other applications. As an example, consider the curve in Figure 1: if the application only used inputs from the interval [0, 2], then the curve might well be an acceptable model.
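The dependence of validity on the application area can be illustrated with a small sketch. The data-generating function below is hypothetical (it is not the curve of Figure 1), and the intervals are chosen only for illustration: a straight line is fitted on inputs from [0, 2] and then compared with the true process both inside and far outside that interval.

```python
def true_process(x):
    # Hypothetical ground truth: nearly linear for small x, strongly curved later.
    return 1.0 + x + 0.05 * x ** 3

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

# Fit using only inputs from the application interval [0, 2].
train_x = [i / 10 for i in range(21)]  # 0.0, 0.1, ..., 2.0
model = fit_line(train_x, [true_process(x) for x in train_x])

def max_abs_error(xs):
    """Largest absolute gap between the fitted line and the true process."""
    return max(abs(model(x) - true_process(x)) for x in xs)

inside = max_abs_error(train_x)            # application area used for fitting
outside = max_abs_error([4.0, 5.0, 6.0])   # substantial extrapolation

print(f"max error on [0, 2]: {inside:.3f}")
print(f"max error on [4, 6]: {outside:.3f}")
```

In this sketch the line stays close to the true process on [0, 2] but misses it by a wide margin under extrapolation, so whether the line counts as a valid model depends on which inputs the application uses.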

Methods for validating

When doing a validation, there are three notable causes of potential difficulty, according to the Encyclopedia of Statistical Sciences.^[4] The three causes are: lack of data; lack of control of the input variables; and uncertainty about the underlying probability distributions and correlations. The usual methods for dealing with difficulties in validation include the following: checking the assumptions made in constructing the model; examining the available data and related model outputs; and applying expert judgment.^[2] Note that expert judgment commonly requires expertise in the application area.^[2]

Expert judgment can sometimes be used to assess the validity of a prediction without obtaining real data: e.g. for the curve in Figure 1, an expert might well be able to assess that a substantial extrapolation will be invalid. Additionally, expert judgment can be used in Turing-type tests, where experts are presented with both real data and related model outputs and then asked to distinguish between the two.^[5]

For some classes of statistical models, specialized methods of performing validation are available. As an example, if the statistical model was obtained via a regression, then specialized analyses for regression model validation exist and are generally employed.

Residual diagnostics

Residual diagnostics comprise analyses of the residuals to determine whether the residuals seem to be effectively random. Such analyses typically require estimates of the probability distributions for the residuals. Estimates of the residuals' distributions can often be obtained by repeatedly running the model, i.e. by using repeated stochastic simulations (employing a pseudorandom number generator for random variables in the model).

If the statistical model was obtained via a regression, then regression-residual diagnostics exist and may be used; such diagnostics have been well studied.
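One simple check of whether residuals look random is their lag-1 autocorrelation when ordered by the input variable. The sketch below is only one of many possible diagnostics, and its data are hypothetical: a straight line is fitted once to data that really are linear and once to data that are actually quadratic.

```python
import random

def ols_residuals(xs, ys):
    """Residuals of an ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def lag1_autocorrelation(values):
    """Sample correlation between each value and its successor."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    numerator = sum(a * b for a, b in zip(centered, centered[1:]))
    return numerator / sum(c * c for c in centered)

rng = random.Random(1)
xs = [i / 10 for i in range(100)]  # sorted inputs 0.0, 0.1, ..., 9.9

# Hypothetical data sets: one truly linear, one quadratic, both with unit noise.
linear_y = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
curved_y = [x ** 2 + rng.gauss(0, 1) for x in xs]

linear_residuals = ols_residuals(xs, linear_y)
curved_residuals = ols_residuals(xs, curved_y)
lag1_linear = lag1_autocorrelation(linear_residuals)
lag1_curved = lag1_autocorrelation(curved_residuals)

print(f"line fitted to linear data:    lag-1 autocorrelation = {lag1_linear:.2f}")
print(f"line fitted to quadratic data: lag-1 autocorrelation = {lag1_curved:.2f}")
```

When the line is the right model, neighbouring residuals are nearly uncorrelated. When it is fitted to curved data, the residuals follow a systematic pattern and the autocorrelation is close to 1, which indicates a flaw in the model.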

Cross validation

Cross validation is a method of sampling that involves leaving some parts of the data out of the fitting process and then seeing whether the left-out data are close to or far from where the model predicts they would be. In practice, cross validation techniques fit the model many times, each time with a portion of the data, and compare each model fit to the portion it did not use. If the fitted models very rarely describe the data that they were not trained on, then the model is probably wrong.
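A minimal sketch of k-fold cross validation follows. The choice of five folds, the synthetic data, and the two candidate models (a constant and a straight line) are assumptions made for illustration; the helper names are hypothetical and do not belong to any library.

```python
import random

def fit_constant(xs, ys):
    """Model that ignores x and always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def k_fold_mse(fit, data, k):
    """Average held-out mean squared error over k folds."""
    folds = [data[i::k] for i in range(k)]  # every point lands in exactly one fold
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Refit on all folds except the one being held out.
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit([x for x, _ in training], [y for _, y in training])
        fold_errors.append(
            sum((model(x) - y) ** 2 for x, y in held_out) / len(held_out))
    return sum(fold_errors) / k

# Hypothetical data: a noisy straight line, shuffled before splitting into folds.
rng = random.Random(2)
data = [(10 * i / 99, 1.0 + 0.5 * (10 * i / 99) + rng.gauss(0, 0.5))
        for i in range(100)]
rng.shuffle(data)

constant_cv = k_fold_mse(fit_constant, data, k=5)
linear_cv = k_fold_mse(fit_line, data, k=5)
print(f"constant model: cross-validated MSE = {constant_cv:.3f}")
print(f"linear model:   cross-validated MSE = {linear_cv:.3f}")
```

Here the constant model predicts the left-out points much worse than the linear model does, which is the kind of evidence that would lead a researcher to reject it.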

References

[1] Feng, Cheng; Zhong, Chaoliang; Wang, Jie; Zhang, Ying; Sun, Jun; Yokota, Yasuto (July 2022). "Learning Unforgotten Domain-Invariant Representations for Online Unsupervised Domain Adaptation". Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization. pp. 2958–2965. doi:10.24963/ijcai.2022/410. ISBN 978-1-956792-00-3.
[2] National Research Council (2012), "Chapter 5: Model validation and prediction", Assessing the Reliability of Complex Models: Mathematical and statistical foundations of verification, validation, and uncertainty quantification, Washington, DC: National Academies Press, pp. 52–85, doi:10.17226/13395, ISBN 978-0-309-25634-6.
[3] Batzel, J. J.; Bachar, M.; Karemaker, J. M.; Kappel, F. (2013), "Chapter 1: Merging mathematical and physiological knowledge", in Batzel, J. J.; Bachar, M.; Kappel, F. (eds.), Mathematical Modeling and Validation in Physiology, Springer, pp. 3–19, doi:10.1007/978-3-642-32882-4_1.
[4] Deaton, M. L. (2006), "Simulation models, validation of", in Kotz, S.; et al. (eds.), Encyclopedia of Statistical Sciences, Wiley.
[5] Mayer, D. G.; Butler, D. G. (1993), "Statistical validation", Ecological Modelling, 68 (1–2): 21–32, doi:10.1016/0304-3800(93)90105-2.

External links

How can I tell if a model fits my data? —Handbook of Statistical Methods (NIST)
Hicks, Dan (July 14, 2017). "What are core statistical model validation techniques?". Stack Exchange.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
