Deviance information criterion

Last updated January 30, 2023

The deviance information criterion (DIC) is a hierarchical modeling generalization of the Akaike information criterion (AIC). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. DIC is an asymptotic approximation as the sample size becomes large, like AIC. It is only valid when the posterior distribution is approximately multivariate normal.

Definition

Define the deviance as $D(\theta )=-2\log(p(y|\theta ))+C\,$ , where $y$ are the data, $\theta$ are the unknown parameters of the model and $p(y|\theta )$ is the likelihood function. $C$ is a constant that cancels out in all calculations that compare different models, and which therefore does not need to be known.

There are two calculations in common usage for the effective number of parameters of the model. The first, as described in Spiegelhalter et al. (2002 , p. 587), is $p_{D}={\overline {D(\theta )}}-D({\bar {\theta }})$ , where ${\bar {\theta }}$ is the expectation of $\theta$ . The second, as described in Gelman et al. (2004 , p. 182), is $p_{D}=p_{V}={\frac {1}{2}}{\overline {\operatorname {var} \left(D(\theta )\right)}}$ . The larger the effective number of parameters is, the easier it is for the model to fit the data, and so the deviance needs to be penalized.

The deviance information criterion is calculated as

\mathrm {DIC} =p_{D}+{\overline {D(\theta )}},

or equivalently as

\mathrm {DIC} =D({\bar {\theta }})+2p_{D}.

From this latter form, the connection with AIC is more evident.

Motivation

The idea is that models with smaller DIC should be preferred to models with larger DIC. Models are penalized both by the value of ${\bar {D}}$ , which favors a good fit, but also (similar to AIC) by the effective number of parameters $p_{D}$ . Since ${\bar {D}}$ will decrease as the number of parameters in a model increases, the $p_{D}$ term compensates for this effect by favoring models with a smaller number of parameters.

An advantage of DIC over other criteria in the case of Bayesian model selection is that the DIC is easily calculated from the samples generated by a Markov chain Monte Carlo simulation. AIC requires calculating the likelihood at its maximum over $\theta$ , which is not readily available from the MCMC simulation. But to calculate DIC, simply compute ${\bar {D}}$ as the average of $D(\theta )$ over the samples of $\theta$ , and $D({\bar {\theta }})$ as the value of $D$ evaluated at the average of the samples of $\theta$ . Then the DIC follows directly from these approximations. Claeskens and Hjort (2008, Ch. 3.5) show that the DIC is large-sample equivalent to the natural model-robust version of the AIC.

Assumptions

In the derivation of DIC, it is assumed that the specified parametric family of probability distributions that generate future observations encompasses the true model. This assumption does not always hold, and it is desirable to consider model assessment procedures in that scenario.

Also, the observed data are used both to construct the posterior distribution and to evaluate the estimated models. Therefore, DIC tends to select over-fitted models.

Extensions

A resolution to the issues above was suggested by Ando (2007), with the proposal of the Bayesian predictive information criterion (BPIC). Ando (2010, Ch. 8) provided a discussion of various Bayesian model selection criteria. To avoid the over-fitting problems of DIC, Ando (2011) developed Bayesian model selection criteria from a predictive view point. The criterion is calculated as

{\mathit {IC}}={\bar {D}}+2p_{D}=-2\mathbf {E} ^{\theta }[\log(p(y|\theta ))]+2p_{D}.

The first term is a measure of how well the model fits the data, while the second term is a penalty on the model complexity. Note that the $p$ in this expression is the predictive distribution rather than the likelihood above.

Related Research Articles

The likelihood function represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood function indicates which parameter values are more likely than others, in the sense that they would have made the observed data more probable. Consequently, the likelihood is often written as $instead of, to emphasize that it is to be understood as a function of the parameters instead of the random variable .$

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

In statistics, a statistic is sufficient with respect to a statistical model and its associated unknown parameter if "no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter". In particular, a statistic is sufficient for a family of probability distributions if the sample from which it is calculated gives no additional information than the statistic, as to which of those probability distributions is the sampling distribution.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk, as an estimate of the true MSE.

In statistics, deviance is a goodness-of-fit statistic for a statistical model; it is often used for statistical hypothesis testing. It is a generalization of the idea of using the sum of squares of residuals (SSR) in ordinary least squares to cases where model-fitting is achieved by maximum likelihood. It plays an important role in exponential dispersion models and generalized linear models.

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult. This sequence can be used to approximate the joint distribution ; to approximate the marginal distribution of one of the variables, or some subset of the variables ; or to compute an integral. Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

Empirical Bayes methods are procedures for statistical inference in which the prior probability distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood, represents a convenient approach for setting hyperparameters, but has been mostly supplanted by fully Bayesian hierarchical analyses since the 2000s with the increasing availability of well-performing computation techniques.

The Bayes factor is a ratio of two competing statistical models represented by their marginal likelihood, and is used to quantify the support for one model over the other. The models in questions can have a common set of parameters, such as a null hypothesis and an alternative, but this is not necessary; for instance, it could also be a non-linear model compared to its linear approximation. The Bayes factor can be thought of as a Bayesian analog to the likelihood-ratio test, but since it uses the (integrated) marginal likelihood instead of the maximized likelihood, both tests only coincide under simple hypotheses. Also, in contrast with null hypothesis significance testing, Bayes factors support evaluation of evidence in favor of a null hypothesis, rather than only allowing the null to be rejected or not rejected.

A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample from a prior and is therefore often referred to as model evidence or simply evidence.

In statistics and econometrics, and in particular in time series analysis, an autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). ARIMA models are applied in some cases where data show evidence of non-stationarity in the sense of mean, where an initial differencing step can be applied one or more times to eliminate the non-stationarity of the mean function. When the seasonality shows in a time series, the seasonal-differencing could be applied to eliminate the seasonal component. Since the ARMA model, according to the Wold's decomposition theorem, is theoretically sufficient to describe a regular wide-sense stationary time series, we are motivated to make stationary a non-stationary time series, e.g., by using differencing, before we can use the ARMA model. Note that if the time series contains a predictable sub-process, the predictable component is treated as a non-zero-mean but periodic component in the ARIMA framework so that it is eliminated by the seasonal differencing.

Bayesian experimental design provides a general probability-theoretical framework from which other theories on experimental design can be derived. It is based on Bayesian inference to interpret the observations/data acquired during the experiment. This allows accounting for both any prior knowledge on the parameters to be determined as well as uncertainties in observations.

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

Model selection is the task of selecting a model from among various candidates on the basis of performance criterion to choose the best one. In the context of learning, this may be the selection a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well-suited to the problem of model selection. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice.

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased; see bias versus consistency for more.

Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics that can be used to estimate the posterior distributions of model parameters.

In statistics, the Hannan–Quinn information criterion (HQC) is a criterion for model selection. It is an alternative to Akaike information criterion (AIC) and Bayesian information criterion (BIC). It is given as

Bayesian econometrics is a branch of econometrics which applies Bayesian principles to economic modelling. Bayesianism is based on a degree-of-belief interpretation of probability, as opposed to a relative-frequency interpretation.

In statistics, the focused information criterion (FIC) is a method for selecting the most appropriate model among a set of competitors for a given data set. Unlike most other model selection strategies, like the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the deviance information criterion (DIC), the FIC does not attempt to assess the overall fit of candidate models but focuses attention directly on the parameter of primary interest with the statistical analysis, say $, for which competing models lead to different estimates, say for model . The FIC method consists in first developing an exact or approximate expression for the precision or quality of each estimator, say for, and then use data to estimate these precision measures, say . In the end the model with best estimated precision is selected. The FIC methodology was developed by Gerda Claeskens and Nils Lid Hjort, first in two 2003 discussion articles in Journal of the American Statistical Association and later on in other papers and in their 2008 book.$

In statistics, suppose that we have been given some data, and we are selecting a statistical model for that data. The relative likelihood compares the relative plausibilities of different candidate models or of different values of a parameter of a single model.

References

Ando, Tomohiro (2007). "Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models". Biometrika . 94 (2): 443–458. doi:10.1093/biomet/asm017.
Ando, T. (2010). Bayesian Model Selection and Statistical Modeling, CRC Press. Chapter 7.
Ando, Tomohiro (2011). "Predictive Bayesian Model Selection". American Journal of Mathematical and Management Sciences . 31 (1–2): 13–38. doi:10.1080/01966324.2011.10737798. S2CID 123680697.
Claeskens, G, and Hjort, N.L. (2008). Model Selection and Model Averaging, Cambridge. Section 3.5.
Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Rubin, Donald B. (2004). Bayesian Data Analysis: Second Edition. Texts in Statistical Science. CRC Press. ISBN 978-1-58488-388-3. LCCN 2003051474. MR 2027492.
van der Linde, A. (2005). "DIC in variable selection", Statistica Neerlandica, 59: 45-56. doi:10.1111/j.1467-9574.2005.00278.x
Spiegelhalter, David J.; Best, Nicola G.; Carlin, Bradley P.; van der Linde, Angelika (2002). "Bayesian measures of model complexity and fit (with discussion)". Journal of the Royal Statistical Society, Series B . 64 (4): 583–639. doi: 10.1111/1467-9868.00353 . JSTOR 3088806. MR 1979380.
Spiegelhalter, David J.; Best, Nicola G.; Carlin, Bradley P.; van der Linde, Angelika (2014). "The deviance information criterion: 12 years on (with discussion)". Journal of the Royal Statistical Society, Series B . 76 (3): 485–493. doi:10.1111/rssb.12062. S2CID 119742633.

External links

McElreath, Richard (January 29, 2015). "Statistical Rethinking Lecture 8 (on DIC and other information criteria)". Archived from the original on 2021-12-21 – via YouTube.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.