Model selection

Last updated

Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well-suited to the problem of model selection. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice (Occam's razor).


Konishi & Kitagawa (2008 , p. 75) state, "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling". Relatedly, Cox (2006 , p. 197) has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".

Model selection may also refer to the problem of selecting a few representative models from a large set of computational models for the purpose of decision making or optimization under uncertainty. [1]


The scientific observation cycle. ObservationCycle.png
The scientific observation cycle.

In its most basic forms, model selection is one of the fundamental tasks of scientific inquiry. Determining the principle that explains a series of observations is often linked directly to a mathematical model predicting those observations. For example, when Galileo performed his inclined plane experiments, he demonstrated that the motion of the balls fitted the parabola predicted by his model [ citation needed ].

Of the countless number of possible mechanisms and processes that could have produced the data, how can one even begin to choose the best model? The mathematical approach commonly taken decides among a set of candidate models; this set must be chosen by the researcher. Often simple models such as polynomials are used, at least initially [ citation needed ]. Burnham & Anderson (2002) emphasize throughout their book the importance of choosing models based on sound scientific principles, such as understanding of the phenomenological processes or mechanisms (e.g., chemical reactions) underlying the data.

Once the set of candidate models has been chosen, the statistical analysis allows us to select the best of these models. What is meant by best is controversial. A good model selection technique will balance goodness of fit with simplicity [ citation needed ]. More complex models will be better able to adapt their shape to fit the data (for example, a fifth-order polynomial can exactly fit six points), but the additional parameters may not represent anything useful. (Perhaps those six points are really just randomly distributed about a straight line.) Goodness of fit is generally determined using a likelihood ratio approach, or an approximation of this, leading to a chi-squared test. The complexity is generally measured by counting the number of parameters in the model.

Model selection techniques can be considered as estimators of some physical quantity, such as the probability of the model producing the given data. The bias and variance are both important measures of the quality of this estimator; efficiency is also often considered.

A standard example of model selection is that of curve fitting, where, given a set of points and other background knowledge (e.g. points are a result of i.i.d. samples), we must select a curve that describes the function that generated the points.

Two directions of model selection

There are two main objectives in inference and learning from data. One is for scientific discovery, understanding of the underlying data-generating mechanism, and interpretation of the nature of the data. Another objective of learning from data is for predicting future or unseen observations. In the second objective, the data scientist does not necessarily concern an accurate probabilistic description of the data. Of course, one may also be interested in both directions.

In line with the two different objectives, model selection can also have two directions: model selection for inference and model selection for prediction. [2] The first direction is to identify the best model for the data, which will preferably provide a reliable characterization of the sources of uncertainty for scientific interpretation. For this goal, it is significantly important that the selected model is not too sensitive to the sample size. Accordingly, an appropriate notion for evaluating model selection is the selection consistency, meaning that the most robust candidate will be consistently selected given sufficiently many data samples.

The second direction is to choose a model as machinery to offer excellent predictive performance. For the latter, however, the selected model may simply be the lucky winner among a few close competitors, yet the predictive performance can still be the best possible. If so, the model selection is fine for the second goal (prediction), but the use of the selected model for insight and interpretation may be severely unreliable and misleading. [2] Moreover, for very complex models selected this way, even predictions may be unreasonable for data only slightly different from those on which the selection was made. [3]

Methods to assist in choosing the set of candidate models


Below is a list of criteria for model selection. The most commonly used criteria are (i) the Akaike information criterion and (ii) the Bayes factor and/or the Bayesian information criterion (which to some extent approximates the Bayes factor), see Stoica & Selen (2004) for a review.

Among these criteria, cross-validation is typically the most accurate, and computationally the most expensive, for supervised learning problems.[ citation needed ]

Burnham & Anderson (2002 , §6.3) say the following:

There is a variety of model selection methods. However, from the point of view of statistical performance of a method, and intended context of its use, there are only two distinct classes of methods: These have been labeled efficient and consistent . (...) Under the frequentist paradigm for model selection one generally has three main approaches: (I) optimization of some selection criteria, (II) tests of hypotheses, and (III) ad hoc methods.

See also


  1. Shirangi, Mehrdad G.; Durlofsky, Louis J. (2016). "A general method to select representative models for decision making and optimization under uncertainty". Computers & Geosciences. 96: 109–123. Bibcode:2016CG.....96..109S. doi:10.1016/j.cageo.2016.08.002.
  2. 1 2 Ding, Jie; Tarokh, Vahid; Yang, Yuhong (2018). "Model Selection Techniques: An Overview". IEEE Signal Processing Magazine. 35 (6): 16–34. arXiv: 1810.09583 . doi:10.1109/MSP.2018.2867638. ISSN   1053-5888. S2CID   53035396.
  3. Su, J.; Vargas, D.V.; Sakurai, K. (2019). "One Pixel Attack for Fooling Deep Neural Networks". IEEE Transactions on Evolutionary Computation. 23 (5): 828–841. arXiv: 1710.08864 . doi:10.1109/TEVC.2019.2890858.
  4. Ding, J.; Tarokh, V.; Yang, Y. (June 2018). "Bridging AIC and BIC: A New Criterion for Autoregression". IEEE Transactions on Information Theory. 64 (6): 4024–4043. arXiv: 1508.02473 . doi:10.1109/TIT.2017.2717599. ISSN   1557-9654. S2CID   5189440.

Related Research Articles

Statistical inference Process of using data analysis

Statistical inference is the process of using data analysis to infer properties of an underlying distribution of probability. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.

Overfitting Analysis that corresponds too closely to a particular set of data and may fail to fit additional data

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation as if that variation represented underlying model structure.

Minimum description length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam's razor. The MDL principle can be extended to other forms of inductive inference and learning, for example to estimation and sequential prediction, without explicitly identifying a single model of the data.

The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

In statistics, the use of Bayes factors is a Bayesian alternative to classical hypothesis testing. Bayesian model comparison is a method of model selection based on Bayes factors. The models under consideration are statistical models. The aim of the Bayes factor is to quantify the support for a model over another, regardless of whether these models are correct. The technical definition of "support" in the context of Bayesian inference is described below.

Feature selection Procedure in machine learning and statistics

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features for use in model construction. Feature selection techniques are used for several reasons:

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

The deviance information criterion (DIC) is a hierarchical modeling generalization of the Akaike information criterion (AIC). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. DIC is an asymptotic approximation as the sample size becomes large, like AIC. It is only valid when the posterior distribution is approximately multivariate normal.

In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear response variable depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.

Hirotugu Akaike Japanese statistician

Hirotugu Akaike was a Japanese statistician. In the early 1970s, he formulated the Akaike information criterion (AIC). AIC is now widely used for model selection, which is commonly the most difficult aspect of statistical inference; additionally, AIC is the basis of a paradigm for the foundations of statistics. Akaike also made major contributions to the study of time series. As well, he had a large role in the general development of statistics in Japan.

Stepwise regression

In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion. Usually, this takes the form of a forward, backward, or combined sequence of F-tests or t-tests.

In statistics, Mallows’s Cp, named for Colin Lingwood Mallows, is used to assess the fit of a regression model that has been estimated using ordinary least squares. It is applied in the context of model selection, where a number of predictor variables are available for predicting some outcome, and the goal is to find the best model involving a subset of these predictors. A small value of Cp means that the model is relatively precise.

Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics that can be used to estimate the posterior distributions of model parameters.

In statistics, the Hannan–Quinn information criterion (HQC) is a criterion for model selection. It is an alternative to Akaike information criterion (AIC) and Bayesian information criterion (BIC). It is given as

In statistics, a generalized linear mixed model (GLMM) is an extension to the generalized linear model (GLM) in which the linear predictor contains random effects in addition to the usual fixed effects. They also inherit from GLMs the idea of extending linear mixed models to non-normal data.

Ensemble learning Statistics and machine learning technique

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

The following outline is provided as an overview of and topical guide to regression analysis:

In statistics, the focused information criterion (FIC) is a method for selecting the most appropriate model among a set of competitors for a given data set. Unlike most other model selection strategies, like the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the deviance information criterion (DIC), the FIC does not attempt to assess the overall fit of candidate models but focuses attention directly on the parameter of primary interest with the statistical analysis, say , for which competing models lead to different estimates, say for model . The FIC method consists in first developing an exact or approximate expression for the precision or quality of each estimator, say for , and then use data to estimate these precision measures, say . In the end the model with best estimated precision is selected. The FIC methodology was developed by Gerda Claeskens and Nils Lid Hjort, first in two 2003 discussion articles in Journal of the American Statistical Association and later on in other papers and in their 2008 book.

In statistics, suppose that we have been given some data, and we are selecting a statistical model for that data. The relative likelihood compares the relative plausibilities of different candidate models or of different values of a parameter of a single model.