Akaike information criterion

The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. [1] [2] [3] Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

AIC is founded on information theory. When a statistical model is used to represent the process that generated the data, the representation will almost never be exact; so some information will be lost by using the model to represent the process. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model.

In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.

The Akaike information criterion is named after the Japanese statistician Hirotugu Akaike, who formulated it. It now forms the basis of a paradigm for the foundations of statistics and is also widely used for statistical inference.

Definition

Suppose that we have a statistical model of some data. Let k be the number of estimated parameters in the model. Let $\hat{L}$ be the maximized value of the likelihood function for the model. Then the AIC value of the model is the following. [4] [5]

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Thus, AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, which is desired because increasing the number of parameters in the model almost always improves the goodness of the fit.
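The definition can be sketched in a few lines of Python, using the standard form AIC = 2k − 2 ln(L̂). The function name `aic` and the two candidate models (with their log-likelihood values) are hypothetical, chosen only to illustrate how the penalty can outweigh a small gain in fit.

```python
def aic(log_likelihood: float, k: int) -> float:
    """Return AIC = 2k - 2 ln(L_hat), given the maximized log-likelihood."""
    return 2 * k - 2 * log_likelihood

# Hypothetical candidates: name -> (maximized log-likelihood, number of parameters).
candidates = {"simple": (-120.0, 2), "complex": (-118.5, 5)}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # the preferred model has the minimum AIC

print(scores)  # {'simple': 244.0, 'complex': 247.0}
print(best)    # simple
```

Here the more complex model fits slightly better, but its three extra parameters cost more than the improvement in log-likelihood gains.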

AIC is founded in information theory. Suppose that the data is generated by some unknown process f. We consider two candidate models to represent f: $g_1$ and $g_2$. If we knew f, then we could find the information lost from using $g_1$ to represent f by calculating the Kullback–Leibler divergence, $D_{\mathrm{KL}}(f \parallel g_1)$; similarly, the information lost from using $g_2$ to represent f could be found by calculating $D_{\mathrm{KL}}(f \parallel g_2)$. We would then, generally, choose the candidate model that minimized the information loss.
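As a minimal illustration of the information-loss idea, the following sketch computes the Kullback–Leibler divergence for discrete distributions. The distributions `f`, `g1`, and `g2` are made-up numbers; in real applications f is unknown, which is exactly why an estimate such as AIC is needed.

```python
import math

def kl_divergence(f, g):
    """D_KL(f || g) for discrete distributions given as lists of probabilities.

    Assumes g assigns positive probability wherever f does.
    """
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

f = [0.5, 0.3, 0.2]   # the generating process (known here only for illustration)
g1 = [0.4, 0.4, 0.2]  # first candidate model
g2 = [0.8, 0.1, 0.1]  # second candidate model

loss1 = kl_divergence(f, g1)
loss2 = kl_divergence(f, g2)
preferred = "g1" if loss1 < loss2 else "g2"
print(loss1, loss2, preferred)
```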

We cannot choose with certainty, because we do not know f. Akaike (1974) showed, however, that we can estimate, via AIC, how much more (or less) information is lost by g1 than by g2. The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary (see AICc, below).

Note that AIC tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that. Hence, after selecting a model via AIC, it is usually good practice to validate the absolute quality of the model. Such validation commonly includes checks of the model's residuals (to determine whether the residuals appear random) and tests of the model's predictions. For more on this topic, see statistical model validation.

How to use AIC in practice

To apply AIC in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using a candidate model to represent the "true model," i.e. the process that generated the data. We wish to select, from among the candidate models, the model that minimizes the information loss. We cannot choose with certainty, but we can minimize the estimated information loss.

Suppose that there are R candidate models. Denote the AIC values of those models by AIC1, AIC2, AIC3, ..., AICR. Let AICmin be the minimum of those values. Then the quantity exp((AICmin − AICi)/2) can be interpreted as being proportional to the probability that the ith model minimizes the (estimated) information loss. [6]

As an example, suppose that there are three candidate models, whose AIC values are 100, 102, and 110. Then the second model is exp((100 − 102)/2) = 0.368 times as probable as the first model to minimize the information loss. Similarly, the third model is exp((100 − 110)/2) = 0.007 times as probable as the first model to minimize the information loss.

In this example, we would omit the third model from further consideration. We then have three options: (1) gather more data, in the hope that this will allow clearly distinguishing between the first two models; (2) simply conclude that the data is insufficient to support selecting one model from among the first two; (3) take a weighted average of the first two models, with weights proportional to 1 and 0.368, respectively, and then do statistical inference based on the weighted multimodel. [7]
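The arithmetic of this example can be reproduced with a short sketch. The function name `relative_likelihoods` is illustrative; the AIC values 100, 102, and 110 are those used in the text, and the weights are the normalized values for the two models that are kept.

```python
import math

def relative_likelihoods(aic_values):
    """Return exp((AIC_min - AIC_i) / 2) for each candidate model."""
    aic_min = min(aic_values)
    return [math.exp((aic_min - a) / 2) for a in aic_values]

rel = relative_likelihoods([100, 102, 110])

# Option (3): average the first two models with weights proportional to 1 and 0.368.
kept = rel[:2]
weights = [r / sum(kept) for r in kept]

print([round(r, 3) for r in rel])      # [1.0, 0.368, 0.007]
print([round(w, 3) for w in weights])  # [0.731, 0.269]
```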

The quantity exp((AICmin − AICi)/2) is known as the relative likelihood of model i. It is closely related to the likelihood ratio used in the likelihood-ratio test. Indeed, if all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test. There are, however, important distinctions. In particular, the likelihood-ratio test is valid only for nested models, whereas AIC (and AICc) has no such restriction. [8] [9]

Hypothesis testing

Every statistical hypothesis test can be formulated as a comparison of statistical models. Hence, every statistical hypothesis test can be replicated via AIC. Two examples are briefly described in the subsections below. Details for those examples, and many more examples, are given by Sakamoto, Ishiguro & Kitagawa (1986, Part II) and Konishi & Kitagawa (2008, ch. 4).

Replicating Student's t-test

As an example of a hypothesis test, consider the t-test to compare the means of two normally-distributed populations. The input to the t-test comprises a random sample from each of the two populations.

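To formulate the test as a comparison of models, we construct two different models. The first model treats the two populations as having potentially different means and standard deviations. The likelihood function for the first model is thus the product of the likelihoods for two distinct normal distributions; so it has four parameters: μ1, σ1, μ2, σ2. To be explicit, the likelihood function is as follows (denoting the sample sizes by n1 and n2, and the observations by $x_1, \ldots, x_{n_1+n_2}$, with the first n1 drawn from the first population).

$$\mathcal{L}(\mu_1, \sigma_1, \mu_2, \sigma_2) = \prod_{i=1}^{n_1} \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\!\left(-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}\right) \cdot \prod_{i=n_1+1}^{n_1+n_2} \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\!\left(-\frac{(x_i - \mu_2)^2}{2\sigma_2^2}\right)$$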

The second model treats the two populations as having the same mean but potentially different standard deviations. The likelihood function for the second model thus sets μ1 = μ2 in the above equation; so it has three parameters.

We then maximize the likelihood functions for the two models (in practice, we maximize the log-likelihood functions); after that, it is easy to calculate the AIC values of the models. We next calculate the relative likelihood. For instance, if the second model was only 0.01 times as likely as the first model, then we would omit the second model from further consideration: so we would conclude that the two populations have different means.

The t-test assumes that the two populations have identical standard deviations; the test tends to be unreliable if the assumption is false and the sizes of the two samples are very different (Welch's t-test would be better). Comparing the means of the populations via AIC, as in the example above, has an advantage by not making such assumptions.
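The comparison of the two models can be sketched as follows. The data in `a` and `b` are invented for illustration, and all function names are hypothetical. The four-parameter model has closed-form maximum likelihood estimates; for the common-mean model this sketch assumes a simple alternating update (variances given the mean, then a precision-weighted mean given the variances) from several starting points, which may find only a local maximum in general.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def mle_var(xs, mu):
    """Maximum likelihood variance estimate around a given mean."""
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def normal_loglik(xs, mu, var):
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * var) - sum((x - mu) ** 2 for x in xs) / (2 * var)

def loglik_separate_means(a, b):
    """Model 1: parameters mu1, sigma1, mu2, sigma2 (closed-form MLE)."""
    mu1, mu2 = mean(a), mean(b)
    return normal_loglik(a, mu1, mle_var(a, mu1)) + normal_loglik(b, mu2, mle_var(b, mu2))

def loglik_common_mean(a, b, iters=500):
    """Model 2: parameters mu, sigma1, sigma2 (alternating maximization)."""
    best = -math.inf
    for mu in (mean(a), mean(b), mean(a + b)):  # several starting points
        for _ in range(iters):
            v1, v2 = mle_var(a, mu), mle_var(b, mu)
            w1, w2 = len(a) / v1, len(b) / v2   # precision weights
            mu = (w1 * mean(a) + w2 * mean(b)) / (w1 + w2)
        ll = normal_loglik(a, mu, mle_var(a, mu)) + normal_loglik(b, mu, mle_var(b, mu))
        best = max(best, ll)
    return best

# Invented samples from two populations.
a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 4.7]
b = [6.9, 7.4, 6.6, 7.1, 7.8, 6.5, 7.2, 7.0]

ll1 = loglik_separate_means(a, b)
ll2 = loglik_common_mean(a, b)
aic1 = 2 * 4 - 2 * ll1
aic2 = 2 * 3 - 2 * ll2

# Relative likelihood of the common-mean model.
rel = math.exp((min(aic1, aic2) - aic2) / 2)
print(aic1, aic2, rel)
```

With these samples the common-mean model has a relative likelihood far below 0.01, so it would be dropped and the means judged different.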

Comparing categorical data sets

For another example of a hypothesis test, suppose that we have two populations, and each member of each population is in one of two categories—category #1 or category #2. Each population is binomially distributed. We want to know whether the distributions of the two populations are the same. We are given a random sample from each of the two populations.

Let m be the size of the sample from the first population. Let m1 be the number of observations (in the sample) in category #1; so the number of observations in category #2 is m − m1. Similarly, let n be the size of the sample from the second population. Let n1 be the number of observations (in the sample) in category #1.

Let p be the probability that a randomly-chosen member of the first population is in category #1. Hence, the probability that a randomly-chosen member of the first population is in category #2 is 1 − p. Note that the distribution of the first population has one parameter. Let q be the probability that a randomly-chosen member of the second population is in category #1. Note that the distribution of the second population also has one parameter.

To compare the distributions of the two populations, we construct two different models. The first model treats the two populations as having potentially different distributions. The likelihood function for the first model is thus the product of the likelihoods for two distinct binomial distributions; so it has two parameters: p, q. To be explicit, the likelihood function is as follows.

$$\mathcal{L}(p, q) = \binom{m}{m_1} p^{m_1} (1 - p)^{m - m_1} \cdot \binom{n}{n_1} q^{n_1} (1 - q)^{n - n_1}$$

The second model models the two populations as having the same distribution. The likelihood function for the second model thus sets p = q in the above equation; so the second model has one parameter.

We then maximize the likelihood functions for the two models (in practice, we maximize the log-likelihood functions); after that, it is easy to calculate the AIC values of the models. We next calculate the relative likelihood. For instance, if the second model was only 0.01 times as likely as the first model, then we would omit the second model from further consideration: so we would conclude that the two populations have different distributions.
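For the binomial comparison, both models have closed-form maximum likelihood estimates (the observed proportions), so the whole procedure fits in a short sketch. The counts below are invented, and the function name `binom_loglik` is hypothetical.

```python
import math

def binom_loglik(successes, trials, p):
    """Log-likelihood of a binomial count; terms with a zero count are skipped."""
    ll = math.log(math.comb(trials, successes))
    if successes > 0:
        ll += successes * math.log(p)
    if trials - successes > 0:
        ll += (trials - successes) * math.log(1 - p)
    return ll

# Invented samples: (observations in category #1, sample size).
m1, m = 30, 50
n1, n = 15, 60

# Model 1: separate probabilities p and q (two parameters).
ll_diff = binom_loglik(m1, m, m1 / m) + binom_loglik(n1, n, n1 / n)
aic_diff = 2 * 2 - 2 * ll_diff

# Model 2: a single shared probability p = q (one parameter).
pooled = (m1 + n1) / (m + n)
ll_same = binom_loglik(m1, m, pooled) + binom_loglik(n1, n, pooled)
aic_same = 2 * 1 - 2 * ll_same

# Relative likelihood of the shared-probability model.
rel = math.exp((min(aic_diff, aic_same) - aic_same) / 2)
print(aic_diff, aic_same, rel)
```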

Foundations of statistics

Statistical inference is generally regarded as comprising hypothesis testing and estimation. Hypothesis testing can be done via AIC, as discussed above. Regarding estimation, there are two types: point estimation and interval estimation. Point estimation can be done within the AIC paradigm: it is provided by maximum likelihood estimation. Interval estimation can also be done within the AIC paradigm: it is provided by likelihood intervals. Hence, statistical inference generally can be done within the AIC paradigm.

The most commonly used paradigms for statistical inference are frequentist inference and Bayesian inference. AIC, though, can be used to do statistical inference without relying on either the frequentist paradigm or the Bayesian paradigm, because AIC can be interpreted without the aid of significance levels or Bayesian priors. [10] In other words, AIC can be used to form a foundation of statistics that is distinct from both frequentism and Bayesianism. [11] [12]

Modification for small sample size

When the sample size is small, there is a substantial probability that AIC will select models that have too many parameters, i.e. that AIC will overfit. [13] [14] [15] To address such potential overfitting, AICc was developed: AICc is AIC with a correction for small sample sizes.

The formula for AICc depends upon the statistical model. Assuming that the model is univariate, is linear in its parameters, and has normally-distributed residuals (conditional upon regressors), then the formula for AICc is as follows. [16] [17] [18] [19]

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k^2 + 2k}{n - k - 1}$$

Here n denotes the sample size and k denotes the number of parameters. Thus, AICc is essentially AIC with an extra penalty term for the number of parameters. Note that as n → ∞, the extra penalty term converges to 0, and thus AICc converges to AIC. [20]
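A minimal sketch of the correction, assuming the univariate linear-normal form AICc = AIC + (2k² + 2k)/(n − k − 1). The function name `aicc` and the example values are illustrative only.

```python
def aicc(aic_value: float, n: int, k: int) -> float:
    """Small-sample corrected AIC for the univariate linear-normal case."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic_value + (2 * k * k + 2 * k) / (n - k - 1)

small = aicc(100.0, n=20, k=3)         # extra penalty 24 / 16 = 1.5
large = aicc(100.0, n=1_000_000, k=3)  # extra penalty is nearly zero
print(small, large)
```

The second call shows the convergence of AICc to AIC as the sample size grows.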

If the assumption that the model is univariate and linear with normal residuals does not hold, then the formula for AICc will generally be different from the formula above. For some models, the formula can be difficult to determine. For every model that has AICc available, though, the formula for AICc is given by AIC plus terms that include both k and k². In comparison, the formula for AIC includes k but not k². In other words, AIC is a first-order estimate (of the information loss), whereas AICc is a second-order estimate. [21]

Further discussion of the formula, with examples of other assumptions, is given by Burnham & Anderson (2002, ch. 7) and by Konishi & Kitagawa (2008, ch. 7–8). In particular, with other assumptions, bootstrap estimation of the formula is often feasible.

To summarize, AICc has the advantage of tending to be more accurate than AIC (especially for small samples), but AICc also has the disadvantage of sometimes being much more difficult to compute than AIC. Note that if all the candidate models have the same k and the same formula for AICc, then AICc and AIC will give identical (relative) valuations; hence, there will be no disadvantage in using AIC, instead of AICc. Furthermore, if n is many times larger than k², then the extra penalty term will be negligible; hence, the disadvantage in using AIC, instead of AICc, will be negligible.

History

[Image: Hirotugu Akaike]

The Akaike information criterion was formulated by the statistician Hirotugu Akaike. It was originally named "an information criterion". [22] It was first announced in English by Akaike at a 1971 symposium; the proceedings of the symposium were published in 1973. [22] [23] The 1973 publication, though, was only an informal presentation of the concepts. [24] The first formal publication was a 1974 paper by Akaike, [5] which, as of November 2023, had received more than 35,100 citations in the Web of Science database and more than 64,000 according to Google Scholar. AIC has become common enough that it is often used without citing Akaike's 1974 paper; indeed, as of November 2023, there are around 289,000 scholarly articles and books that use AIC (as estimated by Google Scholar). [25]

The initial derivation of AIC relied upon some strong assumptions. Takeuchi (1976) showed that the assumptions could be made much weaker. Takeuchi's work, however, was in Japanese and was not widely known outside Japan for many years.

AICc was originally proposed for linear regression (only) by Sugiura (1978). That instigated the work of Hurvich & Tsai (1989), and several further papers by the same authors, which extended the situations in which AICc could be applied.

The first general exposition of the information-theoretic approach was the volume by Burnham & Anderson (2002). It includes an English presentation of the work of Takeuchi. The volume led to far greater use of AIC, and it now has more than 64,000 citations on Google Scholar.

Akaike called his approach an "entropy maximization principle", because the approach is founded on the concept of entropy in information theory. Indeed, minimizing AIC in a statistical model is effectively equivalent to maximizing entropy in a thermodynamic system; in other words, the information-theoretic approach in statistics is essentially applying the Second Law of Thermodynamics. As such, AIC has roots in the work of Ludwig Boltzmann on entropy. For more on these issues, see Akaike (1985) and Burnham & Anderson (2002, ch. 2).

Usage tips

Counting parameters

A statistical model must account for random errors. A straight line model might be formally described as $y_i = b_0 + b_1 x_i + \varepsilon_i$. Here, the $\varepsilon_i$ are the residuals from the straight line fit. If the $\varepsilon_i$ are assumed to be i.i.d. Gaussian (with zero mean), then the model has three parameters: $b_0$, $b_1$, and the variance of the Gaussian distributions. Thus, when calculating the AIC value of this model, we should use k = 3. More generally, for any least squares model with i.i.d. Gaussian residuals, the variance of the residuals' distributions should be counted as one of the parameters. [26]

As another example, consider a first-order autoregressive model, defined by $x_i = c + \varphi x_{i-1} + \varepsilon_i$, with the $\varepsilon_i$ being i.i.d. Gaussian (with zero mean). For this model, there are three parameters: c, φ, and the variance of the $\varepsilon_i$. More generally, a pth-order autoregressive model has p + 2 parameters. (If, however, c is not estimated from the data, but instead given in advance, then there are only p + 1 parameters.)

Transforming data

The AIC values of the candidate models must all be computed with the same data set. Sometimes, though, we might want to compare a model of the response variable, y, with a model of the logarithm of the response variable, log(y). More generally, we might want to compare a model of the data with a model of transformed data. Following is an illustration of how to deal with data transforms (adapted from Burnham & Anderson (2002, §2.11.3): "Investigators should be sure that all hypotheses are modeled using the same response variable").

Suppose that we want to compare two models: one with a normal distribution of y and one with a normal distribution of log(y). We should not directly compare the AIC values of the two models. Instead, we should transform the normal cumulative distribution function to first take the logarithm of y. To do that, we need to perform the relevant integration by substitution: thus, we need to multiply by the derivative of the (natural) logarithm function, which is 1/y. Hence, the transformed distribution has the following probability density function (with μ and σ² denoting the mean and variance of log(y)):

$$y \mapsto \frac{1}{y}\,\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right)$$

which is the probability density function for the log-normal distribution. We then compare the AIC value of the normal model against the AIC value of the log-normal model.
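In terms of log-likelihoods, the 1/y factor means that the log-normal log-likelihood of y equals the normal log-likelihood of log(y) minus the sum of the log(y) values. The sketch below applies this to invented, right-skewed data; the function names are hypothetical, and both models count k = 2 parameters (a mean and a variance).

```python
import math

def normal_max_loglik(xs):
    """Maximized log-likelihood of a normal model (MLE mean and variance)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * math.log(2 * math.pi * var) - 0.5 * n

def lognormal_max_loglik(ys):
    """Maximized log-likelihood of a log-normal model on the original scale of y."""
    logs = [math.log(y) for y in ys]
    # Jacobian correction: each density picks up a factor 1/y, i.e. -log(y).
    return normal_max_loglik(logs) - sum(logs)

# Invented positive, right-skewed data.
ys = [1.2, 0.8, 2.5, 1.1, 9.7, 0.6, 3.3, 1.9, 0.9, 15.2, 2.2, 1.4]

k = 2
aic_normal = 2 * k - 2 * normal_max_loglik(ys)
aic_lognormal = 2 * k - 2 * lognormal_max_loglik(ys)
print(aic_normal, aic_lognormal)
```

Omitting the correction term would compare likelihoods of different response variables, which is the mistake the text warns against.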

Comparisons with other model selection methods

The critical difference between AIC and BIC (and their variants) is the asymptotic property under well-specified and misspecified model classes. [27] Their fundamental differences have been well-studied in regression variable selection and autoregression order selection [28] problems. In general, if the goal is prediction, AIC and leave-one-out cross-validation are preferred. If the goal is selection, inference, or interpretation, BIC or leave-many-out cross-validation is preferred. A comprehensive overview of AIC and other popular model selection methods is given by Ding et al. (2018). [29]

Comparison with BIC

The formula for the Bayesian information criterion (BIC) is similar to the formula for AIC, but with a different penalty for the number of parameters. With AIC the penalty is 2k, whereas with BIC the penalty is ln(n)k.
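The difference in penalties can be seen in a short sketch, assuming the standard forms AIC = 2k − 2 ln(L̂) and BIC = k ln(n) − 2 ln(L̂). The log-likelihood value and sample sizes are invented for illustration.

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    return k * math.log(n) - 2 * log_likelihood

ll, k = -50.0, 3  # hypothetical maximized log-likelihood and parameter count

# BIC penalizes each parameter more heavily than AIC once ln(n) > 2, i.e. n >= 8.
gap_small_n = bic(ll, k, n=7) - aic(ll, k)
gap_large_n = bic(ll, k, n=100) - aic(ll, k)
print(gap_small_n, gap_large_n)
```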

A comparison of AIC/AICc and BIC is given by Burnham & Anderson (2002, §6.3-6.4), with follow-up remarks by Burnham & Anderson (2004). The authors show that AIC/AICc can be derived in the same Bayesian framework as BIC, just by using different prior probabilities. In the Bayesian derivation of BIC, though, each candidate model has a prior probability of 1/R (where R is the number of candidate models). Additionally, the authors present a few simulation studies that suggest AICc tends to have practical/performance advantages over BIC.

A point made by several researchers is that AIC and BIC are appropriate for different tasks. In particular, BIC is argued to be appropriate for selecting the "true model" (i.e. the process that generated the data) from the set of candidate models, whereas AIC is not appropriate. To be specific, if the "true model" is in the set of candidates, then BIC will select the "true model" with probability 1, as n → ∞; in contrast, when selection is done via AIC, the probability can be less than 1. [30] [31] [32] Proponents of AIC argue that this issue is negligible, because the "true model" is virtually never in the candidate set. Indeed, it is a common aphorism in statistics that "all models are wrong"; hence the "true model" (i.e. reality) cannot be in the candidate set.

Another comparison of AIC and BIC is given by Vrieze (2012). Vrieze presents a simulation study which allows the "true model" to be in the candidate set (unlike with virtually all real data). The simulation study demonstrates, in particular, that AIC sometimes selects a much better model than BIC even when the "true model" is in the candidate set. The reason is that, for finite n, BIC can have a substantial risk of selecting a very bad model from the candidate set. This can happen even when n is much larger than k². With AIC, the risk of selecting a very bad model is minimized.

If the "true model" is not in the candidate set, then the most that we can hope to do is select the model that best approximates the "true model". AIC is appropriate for finding the best approximating model, under certain assumptions. [30] [31] [32] (Those assumptions include, in particular, that the approximating is done with regard to information loss.)

A comparison of AIC and BIC in the context of regression is given by Yang (2005). In regression, AIC is asymptotically optimal for selecting the model with the least mean squared error, under the assumption that the "true model" is not in the candidate set. BIC is not asymptotically optimal under that assumption. Yang additionally shows that the rate at which AIC converges to the optimum is, in a certain sense, the best possible.

Comparison with least squares

Sometimes, each candidate model assumes that the residuals are distributed according to independent identical normal distributions (with zero mean). That gives rise to least squares model fitting.

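With least squares fitting, the maximum likelihood estimate for the variance of a model's residuals distributions is

$$\hat{\sigma}^2 = \mathrm{RSS}/n,$$

where the residual sum of squares is

$$\mathrm{RSS} = \sum_{i=1}^{n} \bigl(y_i - f(x_i; \hat{\theta})\bigr)^2,$$

with $f(x_i; \hat{\theta})$ denoting the fitted value for the ith observation.

Then, the maximum value of a model's log-likelihood function is (see Normal distribution#Log-likelihood):

$$\ln(\hat{L}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\hat{\sigma}^2) - \frac{1}{2\hat{\sigma}^2}\mathrm{RSS} = -\frac{n}{2}\ln(\hat{\sigma}^2) + C,$$

where C is a constant independent of the model, and dependent only on the particular data points, i.e. it does not change if the data does not change.

That gives: [33]

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}) = 2k + n\ln(\hat{\sigma}^2) - 2C.$$

Because only differences in AIC are meaningful, the constant C can be ignored, which allows us to conveniently take the following for model comparisons:

$$\mathrm{AIC} = 2k + n\ln(\hat{\sigma}^2)$$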

Note that if all the models have the same k, then selecting the model with minimum AIC is equivalent to selecting the model with minimum RSS—which is the usual objective of model selection based on least squares.
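For least squares models, the comparison can therefore be carried out from the residual sum of squares alone, using 2k + n ln(RSS/n) with the model-independent constant dropped. The sketch below compares a constant-only model with a straight line on invented data; the function names are hypothetical, and k counts the residual variance as a parameter.

```python
import math

def aic_least_squares(rss: float, n: int, k: int) -> float:
    """AIC up to a model-independent constant: 2k + n ln(RSS / n)."""
    return 2 * k + n * math.log(rss / n)

def fit_line_rss(xs, ys):
    """Ordinary least squares for y = b0 + b1 x; returns the RSS."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def fit_constant_rss(ys):
    """Least squares for y = b0; returns the RSS."""
    my = sum(ys) / len(ys)
    return sum((y - my) ** 2 for y in ys)

# Invented data: roughly y = 1 + 2x with small deterministic perturbations.
xs = list(range(10))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1]
ys = [1 + 2 * x + e for x, e in zip(xs, noise)]

n = len(xs)
aic_const = aic_least_squares(fit_constant_rss(ys), n, k=2)  # b0, variance
aic_line = aic_least_squares(fit_line_rss(xs, ys), n, k=3)   # b0, b1, variance
print(aic_const, aic_line)
```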

Comparison with cross-validation

Leave-one-out cross-validation is asymptotically equivalent to AIC, for ordinary linear regression models. [34] Asymptotic equivalence to AIC also holds for mixed-effects models. [35]

Comparison with Mallows's Cp

Mallows's Cp is equivalent to AIC in the case of (Gaussian) linear regression. [36]

Notes

  1. Stoica, P.; Selen, Y. (2004), "Model-order selection: a review of information criterion rules", IEEE Signal Processing Magazine (July): 36–47, doi:10.1109/MSP.2004.1311138, S2CID 17338979
  2. McElreath, Richard (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press. p. 189. ISBN 978-1-4822-5344-3. AIC provides a surprisingly simple estimate of the average out-of-sample deviance.
  3. Taddy, Matt (2019). Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions. New York: McGraw-Hill. p. 90. ISBN 978-1-260-45277-8. The AIC is an estimate for OOS deviance.
  4. Burnham & Anderson 2002, §2.2
  5. Akaike 1974
  6. Burnham & Anderson 2002, §2.9.1, §6.4.5
  7. Burnham & Anderson 2002
  8. Burnham & Anderson 2002, §2.12.4
  9. Murtaugh 2014
  10. Burnham & Anderson 2002, p. 99
  11. Bandyopadhyay & Forster 2011
  12. Sakamoto, Ishiguro & Kitagawa 1986
  13. McQuarrie & Tsai 1998
  14. Claeskens & Hjort 2008, §8.3
  15. Giraud 2015, §2.9.1
  16. Sugiura (1978)
  17. Hurvich & Tsai (1989)
  18. Cavanaugh 1997
  19. Burnham & Anderson 2002, §2.4
  20. Burnham & Anderson 2004
  21. Burnham & Anderson 2002, §7.4
  22. Findley & Parzen 1995
  23. Akaike 1973
  24. deLeeuw 1992
  25. Sources containing both "Akaike" and "AIC", at Google Scholar.
  26. Burnham & Anderson 2002, p. 63
  27. Ding, Jie; Tarokh, Vahid; Yang, Yuhong (November 2018). "Model Selection Techniques: An Overview". IEEE Signal Processing Magazine. 35 (6): 16–34. arXiv:1810.09583. Bibcode:2018ISPM...35...16D. doi:10.1109/MSP.2018.2867638. ISSN 1053-5888. S2CID 53035396.
  28. Ding, J.; Tarokh, V.; Yang, Y. (June 2018). "Bridging AIC and BIC: A New Criterion for Autoregression". IEEE Transactions on Information Theory. 64 (6): 4024–4043. arXiv:1508.02473. doi:10.1109/TIT.2017.2717599. ISSN 1557-9654. S2CID 5189440.
  29. Ding, Jie; Tarokh, Vahid; Yang, Yuhong (2018-11-14). "Model Selection Techniques: An Overview". IEEE Signal Processing Magazine. 35 (6): 16–34. arXiv:1810.09583. Bibcode:2018ISPM...35f..16D. doi:10.1109/MSP.2018.2867638. S2CID 53035396. Retrieved 2023-02-18.
  30. Burnham & Anderson 2002, §6.3-6.4
  31. Vrieze 2012
  32. Aho, Derryberry & Peterson 2014
  33. Burnham & Anderson 2002, p. 63
  34. Stone 1977
  35. Fang 2011
  36. Boisbunon et al. 2014
