# Binomial regression

In statistics, binomial regression is a regression analysis technique in which the response (often referred to as Y) has a binomial distribution: it is the number of successes in a series of ${\displaystyle n}$ independent Bernoulli trials, where each trial has probability of success ${\displaystyle p}$. [1] In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.
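As a minimal illustration of the response's distribution, the following sketch draws binomial counts as sums of independent Bernoulli trials; the values n = 20 and p = 0.3 are hypothetical, chosen only for the example:

```python
import random

random.seed(0)

def binomial_draw(n, p):
    """Number of successes in n independent Bernoulli trials,
    each succeeding with probability p."""
    return sum(1 for _ in range(n) if random.random() < p)

draws = [binomial_draw(20, 0.3) for _ in range(50_000)]
print(sum(draws) / len(draws))  # sample mean, close to n * p = 6
```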

Binomial regression is closely related to binary regression: if the response is a binary variable (two possible outcomes), then it can be considered as a binomial distribution with ${\displaystyle n=1}$ trial by considering one of the outcomes as "success" and the other as "failure", counting the outcomes as either 1 or 0: counting a success as 1 success out of 1 trial, and counting a failure as 0 successes out of 1 trial. Binomial regression models are essentially the same as binary choice models, one type of discrete choice model. The primary difference is in the theoretical motivation.

In machine learning, binomial regression is considered a special case of probabilistic classification, and thus a generalization of binary classification.

## Example application

In one published example of an application of binomial regression, [2] the details were as follows. The observed outcome variable was whether or not a fault occurred in an industrial process. There were two explanatory variables: the first was a simple two-case factor representing whether or not a modified version of the process was used and the second was an ordinary quantitative variable measuring the purity of the material being supplied for the process.

## Discrete choice model

Discrete choice models are motivated using utility theory so as to handle various types of correlated and uncorrelated choices, while binomial regression models are generally described in terms of the generalized linear model, an attempt to generalize various types of linear regression models. As a result, discrete choice models are usually described primarily with a latent variable indicating the "utility" of making a choice, and with randomness introduced through an error variable distributed according to a specific probability distribution. Note that the latent variable itself is not observed, only the actual choice, which is assumed to have been made if the net utility was greater than 0.

Binary regression models, however, dispense with both the latent variable and the error variable and assume that the choice itself is a random variable, with a link function that transforms the expected value of the choice variable into a value that is then predicted by the linear predictor. It can be shown that the two are equivalent, at least in the case of binary choice models: the link function corresponds to the quantile function of the distribution of the error variable, and the inverse link function to the cumulative distribution function (CDF) of the error variable.

The latent variable has an equivalent if one imagines generating a uniformly distributed number between 0 and 1, subtracting from it the mean (in the form of the linear predictor transformed by the inverse link function), and inverting the sign. One then has a number whose probability of being greater than 0 is the same as the probability of success in the choice variable, and it can be thought of as a latent variable indicating whether a 0 or 1 was chosen.

## Specification of model

The results are assumed to be binomially distributed. [1] They are often fitted as a generalised linear model where the predicted values μ are the probabilities that any individual event will result in a success. The likelihood of the predictions is then given by

${\displaystyle L({\boldsymbol {\mu }}\mid Y)=\prod _{i=1}^{n}\left(1_{y_{i}=1}(\mu _{i})+1_{y_{i}=0}(1-\mu _{i})\right),}$

where 1A is the indicator function which takes on the value one when the event A occurs, and zero otherwise: in this formulation, for any given observation yi, only one of the two terms inside the product contributes, according to whether yi=0 or 1. The likelihood function is more fully specified by defining the formal parameters μi as parameterised functions of the explanatory variables: this defines the likelihood in terms of a much reduced number of parameters. Fitting of the model is usually achieved by employing the method of maximum likelihood to determine these parameters. In practice, the use of a formulation as a generalised linear model allows advantage to be taken of certain algorithmic ideas which are applicable across the whole class of more general models but which do not apply to all maximum likelihood problems.
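The indicator form of the likelihood, and maximum-likelihood fitting of a logistic parameterisation of the μi, can be sketched as follows. The data, learning rate, and step count are hypothetical, and plain gradient ascent stands in for the iteratively reweighted least squares used by actual GLM software:

```python
import math

def log_likelihood(y, mu):
    """Log of the product above: observation i contributes log(mu_i)
    when y_i = 1 and log(1 - mu_i) when y_i = 0."""
    return sum(math.log(m if yi == 1 else 1.0 - m) for yi, m in zip(y, mu))

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit mu_i = 1 / (1 + exp(-(a + b * x_i))) by gradient ascent
    on the log-likelihood."""
    a = b = 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            mu = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - mu        # d log-likelihood / d a
            gb += (y - mu) * x  # d log-likelihood / d b
        a += lr * ga / len(xs)
        b += lr * gb / len(xs)
    return a, b

xs = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_logistic(xs, ys)
print(a, b)  # b > 0: success becomes more likely as x grows
```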

Models used in binomial regression can often be extended to multinomial data.

There are many methods of generating the values of μ in systematic ways that allow for interpretation of the model; they are discussed below.

The model linking the probabilities μ to the explanatory variables must take a form that only produces values in the range 0 to 1. Many models can be put in the form

${\displaystyle {\boldsymbol {\mu }}=g({\boldsymbol {\eta }})\,.}$

Here η is an intermediate variable representing a linear combination, containing the regression parameters, of the explanatory variables. The function g is the cumulative distribution function (cdf) of some probability distribution. Usually this probability distribution has support from minus infinity to plus infinity, so that any finite value of η is transformed by g to a value inside the range 0 to 1.
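The two most common choices of g, both cumulative distribution functions supported on the whole real line, can be written directly. A sketch using only the standard library:

```python
import math

def logistic_cdf(eta):
    # CDF of the standard logistic distribution: the inverse logit link.
    return 1.0 / (1.0 + math.exp(-eta))

def normal_cdf(eta):
    # CDF of the standard normal distribution: the inverse probit link.
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

# Any finite eta maps into (0, 1) under either function.
for eta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(eta, round(logistic_cdf(eta), 4), round(normal_cdf(eta), 4))
```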

In the case of logistic regression, the link function is the logit (log-odds) function, whose inverse is the logistic function. In the case of probit, the inverse link g is the cdf of the standard normal distribution. The linear probability model is not a proper binomial regression specification, because predictions need not lie in the range zero to one; it is nevertheless sometimes used for this type of data when interpretation is to be carried out directly on the probability scale, or for its simplicity of fitting and interpretation.

## Comparison between binomial regression and binary choice models

A binary choice model assumes a latent variable Un, the utility (or net benefit) that person n obtains from taking an action (as opposed to not taking the action). The utility the person obtains from taking the action depends on the characteristics of the person, some of which are observed by the researcher and some are not:

${\displaystyle U_{n}={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} +\varepsilon _{n}}$

where ${\displaystyle {\boldsymbol {\beta }}}$ is a set of regression coefficients and ${\displaystyle \mathbf {s_{n}} }$ is a set of independent variables (also known as "features") describing person n, which may be either discrete "dummy variables" or regular continuous variables. ${\displaystyle \varepsilon _{n}}$ is a random variable specifying "noise" or "error" in the prediction, assumed to be distributed according to some distribution. Normally, if there is a mean or variance parameter in the distribution, it cannot be identified, so the parameters are set to convenient values — by convention usually mean 0, variance 1.

The person takes the action, yn = 1, if Un > 0. The unobserved error term εn is assumed to follow some specified distribution: a logistic distribution yields the logit model, and a standard normal distribution yields the probit model.

The specification is written succinctly as:

• Un = βsn + εn
• ${\displaystyle Y_{n}={\begin{cases}1,&{\text{if }}U_{n}>0,\\0,&{\text{if }}U_{n}\leq 0\end{cases}}}$
• ε logistic, standard normal, etc.

Let us write it slightly differently:

• Un = βsn − en
• ${\displaystyle Y_{n}={\begin{cases}1,&{\text{if }}U_{n}>0,\\0,&{\text{if }}U_{n}\leq 0\end{cases}}}$
• e logistic, standard normal, etc.

Here we have made the substitution en = −εn. This changes the random variable into a slightly different one, defined over a negated domain. As it happens, the error distributions we usually consider (e.g. logistic distribution, standard normal distribution, standard Student's t-distribution, etc.) are symmetric about 0, and hence the distribution of en is identical to the distribution of εn.
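That symmetry claim is easy to verify numerically for the logistic case: a distribution symmetric about 0 satisfies F(x) = 1 − F(−x), so negating the error leaves its distribution unchanged. A quick check:

```python
import math

def logistic_cdf(x):
    # CDF of the standard logistic distribution.
    return 1.0 / (1.0 + math.exp(-x))

# Symmetry about 0: Pr(-eps <= x) = 1 - F(-x) equals F(x) = Pr(eps <= x).
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(logistic_cdf(x) - (1.0 - logistic_cdf(-x))) < 1e-12
print("symmetric about 0")
```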

Denote the cumulative distribution function (CDF) of ${\displaystyle e}$ as ${\displaystyle F_{e},}$ and the quantile function (inverse CDF) of ${\displaystyle e}$ as ${\displaystyle F_{e}^{-1}.}$

Note that

{\displaystyle {\begin{aligned}\Pr(Y_{n}=1)&=\Pr(U_{n}>0)\\[6pt]&=\Pr({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} -e_{n}>0)\\[6pt]&=\Pr(-e_{n}>-{\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\\[6pt]&=\Pr(e_{n}\leq {\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\\[6pt]&=F_{e}({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\end{aligned}}}

Since ${\displaystyle Y_{n}}$ is a Bernoulli trial, where ${\displaystyle \mathbb {E} [Y_{n}]=\Pr(Y_{n}=1),}$ we have

${\displaystyle \mathbb {E} [Y_{n}]=F_{e}({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )}$

or equivalently

${\displaystyle F_{e}^{-1}(\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} .}$

Note that this is exactly equivalent to the binomial regression model expressed in the formalism of the generalized linear model.
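The equivalence can be checked by simulation: drawing choices from the latent-utility model with logistic errors reproduces the probability given by the inverse link applied to the linear predictor. The value of β·sn below is hypothetical:

```python
import math
import random

random.seed(0)

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

beta_dot_s = 1.0  # hypothetical linear predictor for one person

def draw_choice():
    """Discrete-choice view: U = beta.s - e with logistic error e;
    the action is taken (Y = 1) when U > 0."""
    u = random.random()
    e = math.log(u / (1.0 - u))  # inverse-CDF sample from the logistic
    return 1 if beta_dot_s - e > 0 else 0

empirical = sum(draw_choice() for _ in range(200_000)) / 200_000
glm_view = logistic_cdf(beta_dot_s)  # GLM view: inverse link of beta.s
print(empirical, glm_view)  # the two agree up to Monte Carlo noise
```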

If ${\displaystyle e_{n}\sim {\mathcal {N}}(0,1),}$ i.e. distributed as a standard normal distribution, then

${\displaystyle \Phi ^{-1}(\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} }$

which is exactly a probit model.

If ${\displaystyle e_{n}\sim \operatorname {Logistic} (0,1),}$ i.e. distributed as a standard logistic distribution with mean 0 and scale parameter 1, then the corresponding quantile function is the logit function, and

${\displaystyle \operatorname {logit} (\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} }$

which is exactly a logit model.

Note that the two different formalisms, generalized linear models (GLMs) and discrete choice models, are equivalent in the case of simple binary choice models, but can be extended in differing ways.

## Latent variable interpretation / derivation

A latent variable model involving a binomial observed variable Y can be constructed such that Y is related to the latent variable Y* via

${\displaystyle Y={\begin{cases}1,&{\mbox{if }}Y^{*}>0\\0,&{\mbox{otherwise.}}\end{cases}}}$

The latent variable Y* is then related to a set of regression variables X by the model

${\displaystyle Y^{*}=X\beta +\epsilon \ .}$

This results in a binomial regression model.

The variance of ϵ cannot be identified, and when it is not of interest it is often assumed to be equal to one. If ϵ is normally distributed, then a probit model is appropriate; if ϵ follows a log-Weibull (Gumbel) distribution, then a logit model is appropriate; and if ϵ is uniformly distributed, then a linear probability model is appropriate.
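For instance, simulating the latent-variable model with standard normal ϵ recovers probit probabilities; the linear-predictor value below is hypothetical:

```python
import math
import random

random.seed(1)

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

x_beta = 0.4  # hypothetical value of X . beta for one observation

# Latent-variable view: Y* = X.beta + eps, observe Y = 1 when Y* > 0.
draws = [1 if x_beta + random.gauss(0.0, 1.0) > 0 else 0
         for _ in range(200_000)]
print(sum(draws) / len(draws), normal_cdf(x_beta))  # both near 0.655
```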

## Notes

1. Sanford Weisberg (2005). "Binomial Regression". Wiley-IEEE. pp. 253–254. ISBN 0-471-66379-4.
2. Cox & Snell (1981), Example H, p. 91
