Deviance (statistics)

Last updated January 02, 2025

In statistics, deviance is a goodness-of-fit statistic for a statistical model; it is often used for statistical hypothesis testing. It is a generalization of the idea of using the sum of squares of residuals (SSR) in ordinary least squares to cases where model-fitting is achieved by maximum likelihood. It plays an important role in exponential dispersion models and generalized linear models.

Definition

The unit deviance^[2]^[3] $d(y,\mu )$ is a bivariate function that satisfies the following conditions:

$d(y,y)=0$
$d(y,\mu )>0\quad \forall y\neq \mu$

The total deviance $D(\mathbf {y} ,{\hat {\boldsymbol {\mu }}})$ of a model with predictions ${\hat {\boldsymbol {\mu }}}$ of the observation $\mathbf {y}$ is the sum of its unit deviances: ${\textstyle D(\mathbf {y} ,{\hat {\boldsymbol {\mu }}})=\sum _{i}d(y_{i},{\hat {\mu }}_{i})}$ .

The (total) deviance for a model M₀ with estimates ${\hat {\mu }}=E[Y|{\hat {\theta }}_{0}]$ , based on a dataset y, may be constructed by its likelihood as:^[4]^[5] $D(y,{\hat {\mu }})=2\left(\log \left[p(y\mid {\hat {\theta }}_{s})\right]-\log \left[p(y\mid {\hat {\theta }}_{0})\right]\right).$

Here ${\hat {\theta }}_{0}$ denotes the fitted values of the parameters in the model M₀, while ${\hat {\theta }}_{s}$ denotes the fitted parameters for the saturated model: both sets of fitted values are implicitly functions of the observations y. Here, the saturated model is a model with a parameter for every observation so that the data are fitted exactly. This expression is simply 2 times the log-likelihood ratio of the full model compared to the reduced model. The deviance is used to compare two models – in particular in the case of generalized linear models (GLM) where it has a similar role to residual sum of squares from ANOVA in linear models (RSS).

Suppose in the framework of the GLM, we have two nested models, M₁ and M₂. In particular, suppose that M₁ contains the parameters in M₂, and k additional parameters. Then, under the null hypothesis that M₂ is the true model, the difference between the deviances for the two models follows, based on Wilks' theorem, an approximate chi-squared distribution with k-degrees of freedom.^[5] This can be used for hypothesis testing on the deviance.

Some usage of the term "deviance" can be confusing. According to Collett:^[6]

"the quantity

-2\log {\big [}p(y\mid {\hat {\theta }}_{0}){\big ]}

is sometimes referred to as a deviance. This is [...] inappropriate, since unlike the deviance used in the context of generalized linear modelling,

-2\log {\big [}p(y\mid {\hat {\theta }}_{0}){\big ]}

does not measure deviation from a model that is a perfect fit to the data."

However, since the principal use is in the form of the difference of the deviances of two models, this confusion in definition is unimportant.

Examples

The unit deviance for the Poisson distribution is $d(y,\mu )=2\left(y\log {\frac {y}{\mu }}-y+\mu \right)$ , the unit deviance for the normal distribution with unit variance is given by $d(y,\mu )=\left(y-\mu \right)^{2}$ .

Notes

↑ Hastie, Trevor. "A closer look at the deviance." The American Statistician 41.1 (1987): 16-20.
↑ Jørgensen, B. (1997). The Theory of Dispersion Models. Chapman & Hall.
↑ Song, Peter X.-K. (2007). Correlated Data Analysis: Modeling, Analytics, and Applications. Springer Series in Statistics. Springer Series in Statistics. doi:10.1007/978-0-387-71393-9. ISBN 978-0-387-71392-2.
↑ Nelder, J.A.; Wedderburn, R.W.M. (1972). "Generalized Linear Models". Journal of the Royal Statistical Society. Series A (General). 135 (3): 370–384. doi:10.2307/2344614. JSTOR 2344614. S2CID 14154576.
1 2 McCullagh and Nelder (1989): page 17
↑ Collett (2003): page 76

Related Research Articles

A likelihood function measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the joint probability distribution of the random variable that (presumably) generated the observations. When evaluated on the actual data points, it becomes a function solely of the model parameters.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In statistics, the logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression estimates the parameters of a logistic model. In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinct because they all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step. It can be used, for example, to estimate a mixture of gaussians, or to solve the multiple linear regression problem.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

A tolerance interval (TI) is a statistical interval within which, with some confidence level, a specified sampled proportion of a population falls. "More specifically, a $100\times p %/100\times(1-α)$ tolerance interval provides limits within which at least a certain proportion (p) of the population falls with a given level of confidence (1−α)." "A (p, 1−α) tolerance interval (TI) based on a sample is constructed so that it would include at least a proportion p of the sampled population with confidence 1−α; such a TI is usually referred to as p-content − (1−α) coverage TI." "A (p, 1−α) upper tolerance limit (TL) is simply a 1−α upper confidence limit for the 100 p percentile of the population."

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable. Some sources consider OLS to be linear regression.

In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

In statistics, binomial regression is a regression analysis technique in which the response has a binomial distribution: it is the number of successes in a series of ⁠ $⁠$ independent Bernoulli trials, where each trial has probability of success ⁠ $⁠$ . In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.

In directional statistics, the von Mises–Fisher distribution, is a probability distribution on the $-sphere in . If the distribution reduces to the von Mises distribution on the circle.$

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressandconditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which $given is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.$

In statistics, the generalized linear array model (GLAM) is used for analyzing data sets with array structures. It based on the generalized linear model with the design matrix written as a Kronecker product.

In probability and statistics, a natural exponential family (NEF) is a class of probability distributions that is a special case of an exponential family (EF).

In probability and statistics, the class of exponential dispersion models (EDM), also called exponential dispersion family (EDF), is a set of probability distributions that represents a generalisation of the natural exponential family. Exponential dispersion models play an important role in statistical theory, in particular in generalized linear models because they have a special structure which enables deductions to be made about appropriate statistical inference.

In probability theory and statistics, the Hermite distribution, named after Charles Hermite, is a discrete probability distribution used to model count data with more than one parameter. This distribution is flexible in terms of its ability to allow a moderate over-dispersion in the data.

In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.

The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by a square-integrable stochastic processes. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in case of GFLM is obtained by forming the scalar product of the random predictor function $with a smooth parameter function . Functional Linear Regression, Functional Poisson Regression and Functional Binomial Regression, with the important Functional Logistic Regression included, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.$

In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.

References

McCullagh, Peter; Nelder, John (1989). Generalized Linear Models, Second Edition. Chapman & Hall/CRC. ISBN 0-412-31760-5.

Collett, David (2003). Modelling Survival Data in Medical Research, Second Edition. Chapman & Hall/CRC. ISBN 1-58488-325-1.

External links

Generalized Linear Models - Edward F. Connor
Lectures notes on Deviance

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Hastie, Trevor. "A closer look at the deviance." The American Statistician 41.1 (1987): 16-20.

[J1997-2] Jørgensen, B. (1997). The Theory of Dispersion Models. Chapman & Hall.

[3] Song, Peter X.-K. (2007). Correlated Data Analysis: Modeling, Analytics, and Applications. Springer Series in Statistics. Springer Series in Statistics. doi:10.1007/978-0-387-71393-9. ISBN 978-0-387-71392-2.

[4] Nelder, J.A.; Wedderburn, R.W.M. (1972). "Generalized Linear Models". Journal of the Royal Statistical Society. Series A (General). 135 (3): 370–384. doi:10.2307/2344614. JSTOR 2344614. S2CID 14154576.

[McN-5] 1 2 McCullagh and Nelder (1989): page 17

[6] Collett (2003): page 76

[1]

[2]

[3]

[4]

[5]

[6]