# Bayesian information criterion

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.

The BIC was developed by Gideon E. Schwarz and published in a 1978 paper,[1] where he gave a Bayesian argument for adopting it.

## Definition

The BIC is formally defined as[2][note 1]

${\displaystyle \mathrm {BIC} =k\ln(n)-2\ln({\widehat {L}}),\ }$

where

• ${\displaystyle {\hat {L}}}$ = the maximized value of the likelihood function of the model ${\displaystyle M}$, i.e. ${\displaystyle {\hat {L}}=p(x\mid {\widehat {\theta }},M)}$, where ${\displaystyle {\widehat {\theta }}}$ are the parameter values that maximize the likelihood function;
• ${\displaystyle x}$ = the observed data;
• ${\displaystyle n}$ = the number of data points in ${\displaystyle x}$, the number of observations, or equivalently, the sample size;
• ${\displaystyle k}$ = the number of parameters estimated by the model. For example, in multiple linear regression, the estimated parameters are the intercept, the ${\displaystyle q}$ slope parameters, and the constant variance of the errors; thus, ${\displaystyle k=q+2}$.
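As a concrete illustration of the definition above, the following Python sketch computes the BIC from a model's maximized log-likelihood. The function name and the toy data are invented for illustration; the model is i.i.d. normal with both the mean and the variance estimated by maximum likelihood, so ${\displaystyle k=2}$:

```python
import math

def bic(log_likelihood, k, n):
    # BIC = k * ln(n) - 2 * ln(L-hat), where log_likelihood = ln(L-hat)
    return k * math.log(n) - 2.0 * log_likelihood

# Toy data; for an i.i.d. normal model the MLEs are the sample mean
# and the (biased, divide-by-n) sample variance.
data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0]
n = len(data)
mu = sum(data) / n
var = sum((x - mu) ** 2 for x in data) / n
loglik = sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
             for x in data)
print(bic(loglik, k=2, n=n))  # both mean and variance count toward k
```

The same two-argument penalty structure applies to any model fitted by maximum likelihood; only the log-likelihood computation changes.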

Konishi and Kitagawa[4] (p. 217) derive the BIC to approximate the distribution of the data, integrating out the parameters using Laplace's method, starting with the following model evidence:

${\displaystyle p(x\mid M)=\int p(x\mid \theta ,M)\pi (\theta \mid M)\,d\theta }$

where ${\displaystyle \pi (\theta \mid M)}$ is the prior for ${\displaystyle \theta }$ under model ${\displaystyle M}$.

The log-likelihood, ${\displaystyle \ln(p(x\mid \theta ,M))}$, is then expanded in a second-order Taylor series about the MLE, ${\displaystyle {\widehat {\theta }}}$, assuming it is twice differentiable, as follows:

${\displaystyle \ln(p(x\mid \theta ,M))=\ln({\widehat {L}})-0.5(\theta -{\widehat {\theta }})'n{\mathcal {I}}(\theta )(\theta -{\widehat {\theta }})+R(x,\theta ),}$

where ${\displaystyle {\mathcal {I}}(\theta )}$ is the average observed information per observation, and the prime (${\displaystyle '}$) denotes the transpose of the vector ${\displaystyle (\theta -{\widehat {\theta }})}$. To the extent that ${\displaystyle R(x,\theta )}$ is negligible and ${\displaystyle \pi (\theta \mid M)}$ is relatively linear near ${\displaystyle {\widehat {\theta }}}$, we can integrate out ${\displaystyle \theta }$ to obtain the following:

${\displaystyle p(x\mid M)\approx {\hat {L}}(2\pi /n)^{k/2}|{\mathcal {I}}({\widehat {\theta }})|^{-1/2}\pi ({\widehat {\theta }})}$

As ${\displaystyle n}$ increases, we can ignore ${\displaystyle |{\mathcal {I}}({\widehat {\theta }})|}$ and ${\displaystyle \pi ({\widehat {\theta }})}$ as they are ${\displaystyle O(1)}$. Thus,

${\displaystyle p(x\mid M)=\exp\{\ln {\widehat {L}}-(k/2)\ln(n)+O(1)\}=\exp(-\mathrm {BIC} /2+O(1)),}$

where BIC is defined as above, and ${\displaystyle {\widehat {L}}}$ either (a) is the Bayesian posterior mode or (b) uses the MLE and the prior ${\displaystyle \pi (\theta \mid M)}$ has nonzero slope at the MLE. Then the posterior

${\displaystyle p(M\mid x)\propto p(x\mid M)p(M)\approx \exp(-\mathrm {BIC} /2)p(M)}$
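Dropping the ${\displaystyle O(1)}$ error term and taking equal prior probabilities ${\displaystyle p(M)}$ for all candidate models, this approximation can be normalized into rough posterior model weights. A minimal Python sketch, with an invented function name and hypothetical BIC values:

```python
import math

def bic_weights(bics):
    # Approximate p(M | x) from exp(-BIC/2) under equal model priors.
    # Subtracting the minimum BIC first avoids underflow and leaves
    # the normalized weights unchanged.
    b_min = min(bics)
    raw = [math.exp(-(b - b_min) / 2.0) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]

weights = bic_weights([100.0, 102.0, 110.0])
print(weights)  # the lowest-BIC model receives the largest weight
```

Because the derivation discards an ${\displaystyle O(1)}$ term, these weights give only a heuristic ranking, not exact posterior probabilities.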

## Usage

When picking from several models, those with lower BIC values are generally preferred. The BIC is an increasing function of the error variance ${\displaystyle \sigma _{e}^{2}}$ and an increasing function of ${\displaystyle k}$. That is, unexplained variation in the dependent variable and the number of explanatory variables both increase the value of the BIC. However, a lower BIC does not necessarily indicate that one model is better than another. Because it involves approximations, the BIC is merely a heuristic. In particular, differences in BIC should never be treated like transformed Bayes factors.

It is important to keep in mind that the BIC can be used to compare estimated models only when the numerical values of the dependent variable[note 2] are identical for all models being compared. The models being compared need not be nested, unlike the case when models are compared using an F-test or a likelihood-ratio test.

## Properties

• The BIC generally penalizes free parameters more strongly than the Akaike information criterion does, though the strength of the penalty depends on the size of ${\displaystyle n}$ and the relative magnitudes of ${\displaystyle n}$ and ${\displaystyle k}$.
• It is independent of the prior.
• It can measure the efficiency of the parameterized model in terms of predicting the data.
• It penalizes the complexity of the model where complexity refers to the number of parameters in the model.
• It is approximately equal to the minimum description length criterion, but with the opposite sign.
• It can be used to choose the number of clusters according to the intrinsic complexity present in a particular dataset.
• It is closely related to other penalized likelihood criteria such as Deviance information criterion and the Akaike information criterion.

## Limitations

The BIC suffers from two main limitations:[5]

1. The above approximation is valid only for sample size ${\displaystyle n}$ much larger than the number ${\displaystyle k}$ of parameters in the model.
2. The BIC cannot handle complex collections of models, as in the variable-selection (or feature-selection) problem in high dimensions.[5]

## Gaussian special case

Under the assumption that the model errors or disturbances are independent and identically distributed according to a normal distribution, and under the boundary condition that the derivative of the log-likelihood with respect to the true variance is zero, the BIC becomes (up to an additive constant, which depends only on n and not on the model):[6]

${\displaystyle \mathrm {BIC} =n\ln({\widehat {\sigma _{e}^{2}}})+k\ln(n)\ }$

where ${\displaystyle {\widehat {\sigma _{e}^{2}}}}$ is the error variance. The error variance in this case is defined as

${\displaystyle {\widehat {\sigma _{e}^{2}}}={\frac {1}{n}}\sum _{i=1}^{n}(x_{i}-{\widehat {x_{i}}})^{2}.}$

In terms of the residual sum of squares (RSS), the BIC is

${\displaystyle \mathrm {BIC} =n\ln(RSS/n)+k\ln(n)\ }$
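In this Gaussian setting the BIC needs only the residual sum of squares, the sample size, and the parameter count. A small Python sketch; the function name and the RSS figures are hypothetical:

```python
import math

def bic_from_rss(rss, n, k):
    # BIC = n * ln(RSS / n) + k * ln(n), valid up to an additive
    # constant that depends only on n (Gaussian errors assumed).
    return n * math.log(rss / n) + k * math.log(n)

# Adding a parameter must reduce the RSS enough to offset the ln(n) penalty.
print(bic_from_rss(rss=12.0, n=50, k=3))
print(bic_from_rss(rss=11.5, n=50, k=4))
```

In this hypothetical comparison the extra parameter lowers the RSS only slightly, so the penalized criterion still favors the smaller model.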

When testing multiple linear models against a saturated model, the BIC can be rewritten in terms of the deviance ${\displaystyle \chi ^{2}}$ as:[7]

${\displaystyle \mathrm {BIC} =\chi ^{2}+k\ln(n)}$

where ${\displaystyle k}$ is the number of model parameters in the test.

## Notes

1. The AIC, AICc and BIC defined by Claeskens and Hjort[3] are the negatives of those defined in this article and in most other standard references.
2. A dependent variable is also called a response variable or an outcome variable. See Regression analysis.

## References

1. Schwarz, Gideon E. (1978), "Estimating the dimension of a model", Annals of Statistics, 6 (2): 461–464, MR 0468014.
2. Wit, Ernst; van den Heuvel, Edwin; Romeyn, Jan-Willem (2012). "'All models are wrong...': an introduction to model uncertainty" (PDF). Statistica Neerlandica. 66 (3): 217–236. doi:10.1111/j.1467-9574.2012.00530.x.
3. Claeskens, G.; Hjort, N. L. (2008), Model Selection and Model Averaging, Cambridge University Press.
4. Konishi, Sadanori; Kitagawa, Genshiro (2008). Information Criteria and Statistical Modeling. Springer. ISBN 978-0-387-71886-6.
5. Giraud, C. (2015). Introduction to High-Dimensional Statistics. Chapman & Hall/CRC. ISBN 9781482237948.
6. Priestley, M. B. (1981). Spectral Analysis and Time Series. Academic Press. ISBN 978-0-12-564922-3. (p. 375).
7. Kass, Robert E.; Raftery, Adrian E. (1995), "Bayes Factors", Journal of the American Statistical Association, 90 (430): 773–795, doi:10.2307/2291091, ISSN 0162-1459, JSTOR 2291091.