# Binomial regression

In statistics, binomial regression is a regression analysis technique in which the response (often referred to as Y) has a binomial distribution: it is the number of successes in a series of ${\displaystyle n}$ independent Bernoulli trials, where each trial has probability of success ${\displaystyle p}$. [1] In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.
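As a minimal illustration of the response's distribution, the following sketch draws binomial counts as sums of independent Bernoulli trials; the values n = 20 and p = 0.3 are hypothetical, chosen only for the example:

```python
import random

random.seed(0)

def binomial_draw(n, p):
    """Number of successes in n independent Bernoulli trials,
    each succeeding with probability p."""
    return sum(1 for _ in range(n) if random.random() < p)

draws = [binomial_draw(20, 0.3) for _ in range(50_000)]
print(sum(draws) / len(draws))  # sample mean, close to n * p = 6
```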

Binomial regression is closely related to binary regression: if the response is a binary variable (two possible outcomes), then it can be considered as a binomial distribution with ${\displaystyle n=1}$ trial by considering one of the outcomes as "success" and the other as "failure", counting the outcomes as either 1 or 0: counting a success as 1 success out of 1 trial, and counting a failure as 0 successes out of 1 trial. Binomial regression models are essentially the same as binary choice models, one type of discrete choice model. The primary difference is in the theoretical motivation.

In machine learning, binomial regression is considered a special case of probabilistic classification, and thus a generalization of binary classification.

## Example application

In one published example of an application of binomial regression, [2] the details were as follows. The observed outcome variable was whether or not a fault occurred in an industrial process. There were two explanatory variables: the first was a simple two-case factor representing whether or not a modified version of the process was used and the second was an ordinary quantitative variable measuring the purity of the material being supplied for the process.

## Discrete choice model

Discrete choice models are motivated using utility theory so as to handle various types of correlated and uncorrelated choices, while binomial regression models are generally described in terms of the generalized linear model, an attempt to generalize various types of linear regression models. As a result, discrete choice models are usually described primarily with a latent variable indicating the "utility" of making a choice, and with randomness introduced through an error variable distributed according to a specific probability distribution. Note that the latent variable itself is not observed, only the actual choice, which is assumed to have been made if the net utility was greater than 0.

Binary regression models, however, dispense with both the latent variable and the error variable and assume that the choice itself is a random variable, with a link function that transforms the expected value of the choice variable into a value that is then predicted by the linear predictor. It can be shown that the two are equivalent, at least in the case of binary choice models: the link function corresponds to the quantile function of the distribution of the error variable, and the inverse link function to the cumulative distribution function (CDF) of the error variable.

The latent variable has an equivalent if one imagines generating a uniformly distributed number between 0 and 1, subtracting from it the mean (in the form of the linear predictor transformed by the inverse link function), and inverting the sign. One then has a number whose probability of being greater than 0 is the same as the probability of success in the choice variable, and it can be thought of as a latent variable indicating whether a 0 or 1 was chosen.

## Specification of model

The results are assumed to be binomially distributed. [1] They are often fitted as a generalised linear model where the predicted values μ are the probabilities that any individual event will result in a success. The likelihood of the predictions is then given by

${\displaystyle L({\boldsymbol {\mu }}\mid Y)=\prod _{i=1}^{n}\left(1_{y_{i}=1}(\mu _{i})+1_{y_{i}=0}(1-\mu _{i})\right),}$

where 1A is the indicator function which takes on the value one when the event A occurs, and zero otherwise: in this formulation, for any given observation yi, only one of the two terms inside the product contributes, according to whether yi=0 or 1. The likelihood function is more fully specified by defining the formal parameters μi as parameterised functions of the explanatory variables: this defines the likelihood in terms of a much reduced number of parameters. Fitting of the model is usually achieved by employing the method of maximum likelihood to determine these parameters. In practice, the use of a formulation as a generalised linear model allows advantage to be taken of certain algorithmic ideas which are applicable across the whole class of more general models but which do not apply to all maximum likelihood problems.
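The indicator form of the likelihood, and maximum-likelihood fitting of a logistic parameterisation of the μi, can be sketched as follows. The data, learning rate, and step count are hypothetical, and plain gradient ascent stands in for the iteratively reweighted least squares used by actual GLM software:

```python
import math

def log_likelihood(y, mu):
    """Log of the product above: observation i contributes log(mu_i)
    when y_i = 1 and log(1 - mu_i) when y_i = 0."""
    return sum(math.log(m if yi == 1 else 1.0 - m) for yi, m in zip(y, mu))

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit mu_i = 1 / (1 + exp(-(a + b * x_i))) by gradient ascent
    on the log-likelihood."""
    a = b = 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            mu = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - mu        # d log-likelihood / d a
            gb += (y - mu) * x  # d log-likelihood / d b
        a += lr * ga / len(xs)
        b += lr * gb / len(xs)
    return a, b

xs = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_logistic(xs, ys)
print(a, b)  # b > 0: success becomes more likely as x grows
```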

Models used in binomial regression can often be extended to multinomial data.

There are many methods of generating the values of μ in systematic ways that allow for interpretation of the model; they are discussed below.

The model linking the probabilities μ to the explanatory variables must take a form that only produces values in the range 0 to 1. Many models can be put in the form

${\displaystyle {\boldsymbol {\mu }}=g({\boldsymbol {\eta }})\,.}$

Here η is an intermediate variable representing a linear combination, containing the regression parameters, of the explanatory variables. The function g is the cumulative distribution function (cdf) of some probability distribution. Usually this probability distribution has support from minus infinity to plus infinity, so that any finite value of η is transformed by g to a value inside the range 0 to 1.
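The two most common choices of g, both cumulative distribution functions supported on the whole real line, can be written directly. A sketch using only the standard library:

```python
import math

def logistic_cdf(eta):
    # CDF of the standard logistic distribution: the inverse logit link.
    return 1.0 / (1.0 + math.exp(-eta))

def normal_cdf(eta):
    # CDF of the standard normal distribution: the inverse probit link.
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

# Any finite eta maps into (0, 1) under either function.
for eta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(eta, round(logistic_cdf(eta), 4), round(normal_cdf(eta), 4))
```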

In the case of logistic regression, the link function is the logit (log-odds) function, whose inverse is the logistic function. In the case of probit, the inverse link g is the cdf of the standard normal distribution. The linear probability model is not a proper binomial regression specification, because predictions need not lie in the range zero to one; it is nevertheless sometimes used for this type of data when interpretation is to be carried out directly on the probability scale, or for its simplicity of fitting and interpretation.

## Comparison between binomial regression and binary choice models

A binary choice model assumes a latent variable Un, the utility (or net benefit) that person n obtains from taking an action (as opposed to not taking the action). The utility the person obtains from taking the action depends on the characteristics of the person, some of which are observed by the researcher and some are not:

${\displaystyle U_{n}={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} +\varepsilon _{n}}$

where ${\displaystyle {\boldsymbol {\beta }}}$ is a set of regression coefficients and ${\displaystyle \mathbf {s_{n}} }$ is a set of independent variables (also known as "features") describing person n, which may be either discrete "dummy variables" or regular continuous variables. ${\displaystyle \varepsilon _{n}}$ is a random variable specifying "noise" or "error" in the prediction, assumed to be distributed according to some distribution. Normally, if there is a mean or variance parameter in the distribution, it cannot be identified, so the parameters are set to convenient values — by convention usually mean 0, variance 1.

The person takes the action, yn = 1, if Un > 0. The unobserved error term εn is assumed to follow some specified distribution: a logistic distribution yields the logit model, and a standard normal distribution yields the probit model.

The specification is written succinctly as:

• Un = βsn + εn
• ${\displaystyle Y_{n}={\begin{cases}1,&{\text{if }}U_{n}>0,\\0,&{\text{if }}U_{n}\leq 0\end{cases}}}$
• ε logistic, standard normal, etc.

Let us write it slightly differently:

• Un = βsn − en
• ${\displaystyle Y_{n}={\begin{cases}1,&{\text{if }}U_{n}>0,\\0,&{\text{if }}U_{n}\leq 0\end{cases}}}$
• e logistic, standard normal, etc.

Here we have made the substitution en = −εn. This changes the random variable into a slightly different one, defined over a negated domain. As it happens, the error distributions we usually consider (e.g. logistic distribution, standard normal distribution, standard Student's t-distribution, etc.) are symmetric about 0, and hence the distribution of en is identical to the distribution of εn.
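That symmetry claim is easy to verify numerically for the logistic case: a distribution symmetric about 0 satisfies F(x) = 1 − F(−x), so negating the error leaves its distribution unchanged. A quick check:

```python
import math

def logistic_cdf(x):
    # CDF of the standard logistic distribution.
    return 1.0 / (1.0 + math.exp(-x))

# Symmetry about 0: Pr(-eps <= x) = 1 - F(-x) equals F(x) = Pr(eps <= x).
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(logistic_cdf(x) - (1.0 - logistic_cdf(-x))) < 1e-12
print("symmetric about 0")
```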

Denote the cumulative distribution function (CDF) of ${\displaystyle e}$ as ${\displaystyle F_{e},}$ and the quantile function (inverse CDF) of ${\displaystyle e}$ as ${\displaystyle F_{e}^{-1}.}$

Note that

{\displaystyle {\begin{aligned}\Pr(Y_{n}=1)&=\Pr(U_{n}>0)\\[6pt]&=\Pr({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} -e_{n}>0)\\[6pt]&=\Pr(-e_{n}>-{\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\\[6pt]&=\Pr(e_{n}\leq {\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\\[6pt]&=F_{e}({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\end{aligned}}}

Since ${\displaystyle Y_{n}}$ is a Bernoulli trial, where ${\displaystyle \mathbb {E} [Y_{n}]=\Pr(Y_{n}=1),}$ we have

${\displaystyle \mathbb {E} [Y_{n}]=F_{e}({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )}$

or equivalently

${\displaystyle F_{e}^{-1}(\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} .}$

Note that this is exactly equivalent to the binomial regression model expressed in the formalism of the generalized linear model.
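The equivalence can be checked by simulation: drawing choices from the latent-utility model with logistic errors reproduces the probability given by the inverse link applied to the linear predictor. The value of β·sn below is hypothetical:

```python
import math
import random

random.seed(0)

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

beta_dot_s = 1.0  # hypothetical linear predictor for one person

def draw_choice():
    """Discrete-choice view: U = beta.s - e with logistic error e;
    the action is taken (Y = 1) when U > 0."""
    u = random.random()
    e = math.log(u / (1.0 - u))  # inverse-CDF sample from the logistic
    return 1 if beta_dot_s - e > 0 else 0

empirical = sum(draw_choice() for _ in range(200_000)) / 200_000
glm_view = logistic_cdf(beta_dot_s)  # GLM view: inverse link of beta.s
print(empirical, glm_view)  # the two agree up to Monte Carlo noise
```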

If ${\displaystyle e_{n}\sim {\mathcal {N}}(0,1),}$ i.e. distributed as a standard normal distribution, then

${\displaystyle \Phi ^{-1}(\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} }$

which is exactly a probit model.

If ${\displaystyle e_{n}\sim \operatorname {Logistic} (0,1),}$ i.e. distributed as a standard logistic distribution with mean 0 and scale parameter 1, then the corresponding quantile function is the logit function, and

${\displaystyle \operatorname {logit} (\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} }$

which is exactly a logit model.

Note that the two different formalisms, generalized linear models (GLMs) and discrete choice models, are equivalent in the case of simple binary choice models, but can be extended in differing ways.

## Latent variable interpretation / derivation

A latent variable model involving a binomial observed variable Y can be constructed such that Y is related to the latent variable Y* via

${\displaystyle Y={\begin{cases}1,&{\mbox{if }}Y^{*}>0\\0,&{\mbox{otherwise.}}\end{cases}}}$

The latent variable Y* is then related to a set of regression variables X by the model

${\displaystyle Y^{*}=X\beta +\epsilon \ .}$

This results in a binomial regression model.

The variance of ϵ cannot be identified, and when it is not of interest it is often assumed to be equal to one. If ϵ is normally distributed, then a probit model is appropriate; if ϵ follows a log-Weibull (Gumbel) distribution, then a logit model is appropriate; and if ϵ is uniformly distributed, then a linear probability model is appropriate.
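For instance, simulating the latent-variable model with standard normal ϵ recovers probit probabilities; the linear-predictor value below is hypothetical:

```python
import math
import random

random.seed(1)

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

x_beta = 0.4  # hypothetical value of X . beta for one observation

# Latent-variable view: Y* = X.beta + eps, observe Y = 1 when Y* > 0.
draws = [1 if x_beta + random.gauss(0.0, 1.0) > 0 else 0
         for _ in range(200_000)]
print(sum(draws) / len(draws), normal_cdf(x_beta))  # both near 0.655
```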

## Notes

1. Sanford Weisberg (2005). "Binomial Regression". Wiley-IEEE. pp. 253–254. ISBN 0-471-66379-4.
2. Cox & Snell (1981), Example H, p. 91
