Probit model

In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau of probability and unit. [1] The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; classifying observations by their predicted probabilities then yields a binary classification model.

A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function. [2] It is most often estimated using the maximum likelihood procedure, [3] such an estimation being called a probit regression.

Conceptual framework

Suppose a response variable Y is binary, that is, it can have only two possible outcomes, which we will denote as 1 and 0. For example, Y may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors X, which are assumed to influence the outcome Y. Specifically, we assume that the model takes the form

$$P(Y = 1 \mid X) = \Phi(X^T \beta),$$

where P is the probability and $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. The parameters β are typically estimated by maximum likelihood.
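As a minimal sketch of this formula, the snippet below evaluates the standard normal CDF at the linear index for one observation. The function name and the coefficient values are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def probit_probability(x, beta):
    """Return P(Y = 1 | X = x) = Phi(x' beta) for one observation."""
    index = sum(xj * bj for xj, bj in zip(x, beta))  # linear index x' beta
    return NormalDist().cdf(index)                   # standard normal CDF

# Hypothetical coefficients: intercept -1.0 and slope 0.5.
beta = [-1.0, 0.5]
p_at_zero_index = probit_probability([1.0, 2.0], beta)  # index = 0
p_high = probit_probability([1.0, 6.0], beta)           # index = 2
```

An index of zero gives a probability of one half, and larger indices move the probability towards 1.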

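It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable

$$Y^* = X^T \beta + \varepsilon,$$

where ε ~ N(0, 1). Then Y can be viewed as an indicator for whether this latent variable is positive:

$$Y = \begin{cases} 1 & \text{if } Y^* > 0 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1 & \text{if } X^T \beta + \varepsilon > 0 \\ 0 & \text{otherwise.} \end{cases}$$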

The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount.

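To see that the two models are equivalent, note that

$$\begin{aligned} P(Y = 1 \mid X) &= P(Y^* > 0) \\ &= P(X^T \beta + \varepsilon > 0) \\ &= P(\varepsilon > -X^T \beta) \\ &= P(\varepsilon < X^T \beta) && \text{(by symmetry of the normal distribution)} \\ &= \Phi(X^T \beta). \end{aligned}$$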
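The equivalence can also be checked numerically. The following sketch, with a hypothetical index value and a fixed random seed, draws the latent variable many times and compares the share of positive draws with the normal CDF evaluated at the index.

```python
import random
from statistics import NormalDist

def simulate_share_positive(index, draws=100_000, seed=0):
    """Share of draws with Y* = index + eps > 0, where eps ~ N(0, 1)."""
    rng = random.Random(seed)
    positives = sum(1 for _ in range(draws) if index + rng.gauss(0.0, 1.0) > 0)
    return positives / draws

index = 0.7                                   # hypothetical value of x' beta
simulated = simulate_share_positive(index)    # Monte Carlo estimate of P(Y = 1)
exact = NormalDist().cdf(index)               # Phi(x' beta)
```

Up to simulation noise, the two numbers should agree.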

Model estimation

Maximum likelihood estimation

Suppose the data set $\{y_i, x_i\}_{i=1}^n$ contains n independent statistical units corresponding to the model above.

For a single observation, conditional on the vector of inputs of that observation, we have:

$$P(y_i = 1 \mid x_i) = \Phi(x_i^T \beta)$$
$$P(y_i = 0 \mid x_i) = 1 - \Phi(x_i^T \beta)$$

where $x_i$ is a $K \times 1$ vector of inputs, and $\beta$ is a $K \times 1$ vector of coefficients.

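The likelihood of a single observation $(y_i, x_i)$ is then

$$\mathcal{L}(\beta; y_i, x_i) = \Phi(x_i^T \beta)^{y_i} \, [1 - \Phi(x_i^T \beta)]^{(1 - y_i)}.$$

In fact, if $y_i = 1$, then $\mathcal{L}(\beta; y_i, x_i) = \Phi(x_i^T \beta)$, and if $y_i = 0$, then $\mathcal{L}(\beta; y_i, x_i) = 1 - \Phi(x_i^T \beta)$.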

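Since the observations are independent and identically distributed, the likelihood of the entire sample, or the joint likelihood, is equal to the product of the likelihoods of the single observations:

$$\mathcal{L}(\beta; Y, X) = \prod_{i=1}^{n} \Phi(x_i^T \beta)^{y_i} \, [1 - \Phi(x_i^T \beta)]^{(1 - y_i)}.$$

The joint log-likelihood function is thus

$$\ln \mathcal{L}(\beta; Y, X) = \sum_{i=1}^{n} \Big( y_i \ln \Phi(x_i^T \beta) + (1 - y_i) \ln \big( 1 - \Phi(x_i^T \beta) \big) \Big).$$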

The estimator $\hat{\beta}$ which maximizes this function will be consistent, asymptotically normal and efficient provided that $E[XX^T]$ exists and is not singular. It can be shown that this log-likelihood function is globally concave in $\beta$, and therefore standard numerical algorithms for optimization will converge rapidly to the unique maximum.
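A minimal sketch of such a numerical maximization is given below, using Newton-Raphson for a model with an intercept and a single slope. This is one standard choice of algorithm, not necessarily what any particular statistics package does. The data are simulated from hypothetical true coefficients, and all function names are illustrative.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def log_likelihood(beta, xs, ys):
    """Joint probit log-likelihood for rows x_i = (1, x) and outcomes y_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = STD_NORMAL.cdf(beta[0] + beta[1] * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_probit(xs, ys, iterations=25):
    """Newton-Raphson for an intercept b0 and one slope b1, starting at zero."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            z = b0 + b1 * x
            q = 1.0 if y == 1 else -1.0
            # Score contribution: lam = q * phi(z) / Phi(q * z)
            lam = q * STD_NORMAL.pdf(z) / STD_NORMAL.cdf(q * z)
            w = lam * (lam + z)          # weight in minus the Hessian (positive)
            g0 += lam
            g1 += lam * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step beta + (-H)^{-1} g, with -H = [[h00, h01], [h01, h11]]
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated data from hypothetical true coefficients (-0.5, 1.0).
rng = random.Random(1)
xs = [rng.uniform(-2.0, 2.0) for _ in range(2000)]
ys = [1 if -0.5 + 1.0 * x + rng.gauss(0.0, 1.0) > 0 else 0 for x in xs]
b0_hat, b1_hat = fit_probit(xs, ys)
```

Because the log-likelihood is globally concave, the iteration does not need a carefully chosen starting point in this well-behaved example. The sketch omits safeguards for perfectly separated data, where the maximum does not exist.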

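The asymptotic distribution for $\hat{\beta}$ is given by

$$\sqrt{n} (\hat{\beta} - \beta) \xrightarrow{d} \mathcal{N}(0, \Omega^{-1}),$$

where

$$\Omega = E\left[ \frac{\varphi^2(X^T \beta)}{\Phi(X^T \beta) (1 - \Phi(X^T \beta))} XX^T \right], \qquad \hat{\Omega} = \frac{1}{n} \sum_{i=1}^{n} \frac{\varphi^2(x_i^T \hat{\beta})}{\Phi(x_i^T \hat{\beta}) (1 - \Phi(x_i^T \hat{\beta}))} x_i x_i^T,$$

[citation needed]

and $\varphi = \Phi'$ is the probability density function (PDF) of the standard normal distribution.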
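The sample analogue of the information matrix can be turned into standard errors by inverting it and dividing by n. The sketch below does this for an intercept-plus-slope model. The coefficient values and the regressor grid are hypothetical, and the helper name is illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def probit_standard_errors(beta_hat, xs):
    """Standard errors from Omega-hat for rows x_i = (1, x)."""
    b0, b1 = beta_hat
    n = len(xs)
    o00 = o01 = o11 = 0.0
    for x in xs:
        z = b0 + b1 * x
        cdf = STD_NORMAL.cdf(z)
        weight = STD_NORMAL.pdf(z) ** 2 / (cdf * (1.0 - cdf))
        o00 += weight / n                 # entries of Omega-hat
        o01 += weight * x / n
        o11 += weight * x * x / n
    det = o00 * o11 - o01 * o01
    var_b0 = (o11 / det) / n              # diagonal of Omega-hat^{-1}, over n
    var_b1 = (o00 / det) / n
    return math.sqrt(var_b0), math.sqrt(var_b1)

xs = [-2.0 + 4.0 * i / 399 for i in range(400)]   # hypothetical regressor grid
se_b0, se_b1 = probit_standard_errors((-0.5, 1.0), xs)
```

Repeating the same design four times halves both standard errors, which reflects the root-n rate in the asymptotic distribution.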

Semi-parametric and non-parametric maximum likelihood methods for probit-type and other related models are also available. [4]

Berkson's minimum chi-square method

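This method can be applied only when there are many observations of the response variable $y_i$ having the same value of the vector of regressors $x_i$ (such a situation may be referred to as "many observations per cell"). More specifically, the model can be formulated as follows.

Suppose among n observations $\{y_i, x_i\}_{i=1}^n$ there are only T distinct values of the regressors, which can be denoted as $\{x_{(1)}, \ldots, x_{(T)}\}$. Let $n_t$ be the number of observations with $x_i = x_{(t)}$, and $r_t$ the number of such observations with $y_i = 1$. We assume that there are indeed "many" observations per each "cell": for each $t$, $\lim_{n \to \infty} n_t / n = c_t > 0$.

Denote

$$\hat{p}_t = r_t / n_t, \qquad \hat{\sigma}_t^2 = \frac{1}{n_t} \frac{\hat{p}_t (1 - \hat{p}_t)}{\varphi^2(\Phi^{-1}(\hat{p}_t))},$$

where $\varphi$ is the standard normal density and $\Phi^{-1}$ is the inverse of the standard normal CDF.

Then Berkson's minimum chi-square estimator is a generalized least squares estimator in a regression of $\Phi^{-1}(\hat{p}_t)$ on $x_{(t)}$ with weights $\hat{\sigma}_t^{-2}$:

$$\hat{\beta} = \left( \sum_{t=1}^{T} \hat{\sigma}_t^{-2} x_{(t)} x_{(t)}^T \right)^{-1} \sum_{t=1}^{T} \hat{\sigma}_t^{-2} x_{(t)} \Phi^{-1}(\hat{p}_t).$$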

It can be shown that this estimator is consistent (as n→∞ and T fixed), asymptotically normal and efficient. [citation needed] Its advantage is the presence of a closed-form formula for the estimator. However, it is only meaningful to carry out this analysis when individual observations are not available, only their aggregated counts $r_t$, $n_t$, and $x_{(t)}$ (for example in the analysis of voting behavior).
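The closed form can be sketched for grouped data with an intercept and one regressor. The cell counts below are hypothetical, chosen to lie close to the probabilities implied by an intercept of -0.5 and a slope of 1, and the function name is illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def berkson_min_chi_square(cells):
    """cells: list of (x, n_t, r_t) for a model with intercept and one slope.

    Every cell needs 0 < r_t < n_t so that the inverse CDF is finite.
    """
    s00 = s01 = s11 = t0 = t1 = 0.0
    for x, n_t, r_t in cells:
        p_hat = r_t / n_t
        z_hat = STD_NORMAL.inv_cdf(p_hat)             # Phi^{-1}(p_hat)
        var = p_hat * (1.0 - p_hat) / (n_t * STD_NORMAL.pdf(z_hat) ** 2)
        w = 1.0 / var                                 # GLS weight sigma_t^{-2}
        s00 += w                                      # sum of w * x x'
        s01 += w * x
        s11 += w * x * x
        t0 += w * z_hat                               # sum of w * x * z_hat
        t1 += w * x * z_hat
    det = s00 * s11 - s01 * s01
    return (s11 * t0 - s01 * t1) / det, (s00 * t1 - s01 * t0) / det

# Hypothetical aggregated counts: (regressor value, n_t, r_t).
cells = [(-1.0, 1000, 67), (0.0, 1000, 309), (1.0, 1000, 691), (2.0, 1000, 933)]
b0_hat, b1_hat = berkson_min_chi_square(cells)
```

No iteration is needed here, which is the advantage noted above. Cells in which all or none of the outcomes equal 1 would have to be dropped or adjusted before applying the formula.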

Gibbs sampling

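Gibbs sampling of a probit model is possible because regression models typically use normal prior distributions over the weights, and this distribution is conjugate with the normal distribution of the errors (and hence of the latent variables Y*). The model can be described as

$$\begin{aligned} \beta &\sim \mathcal{N}(b_0, B_0) \\ y_i^* \mid x_i, \beta &\sim \mathcal{N}(x_i^T \beta, 1) \\ y_i &= \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \end{aligned}$$

From this, we can determine the full conditional densities needed:

$$\begin{aligned} B &= (B_0^{-1} + X^T X)^{-1} \\ \beta \mid y^* &\sim \mathcal{N}(B (B_0^{-1} b_0 + X^T y^*), B) \\ y_i^* \mid y_i = 0, x_i, \beta &\sim \mathcal{N}(x_i^T \beta, 1) [y_i^* \le 0] \\ y_i^* \mid y_i = 1, x_i, \beta &\sim \mathcal{N}(x_i^T \beta, 1) [y_i^* > 0] \end{aligned}$$

The result for $\beta$ is given in the article on Bayesian linear regression, although specified with different notation.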

The only subtlety lies in the last two equations. The notation $[y_i^* \le 0]$ is the Iverson bracket, sometimes written $\mathcal{I}(y_i^* \le 0)$ or similar. It indicates that the distribution must be truncated within the given range, and rescaled appropriately. In this particular case, a truncated normal distribution arises. Sampling from this distribution depends on how much is truncated. If a large fraction of the original mass remains, sampling can be easily done with rejection sampling: sample a number from the non-truncated distribution, and reject it if it falls outside the restriction imposed by the truncation. If sampling from only a small fraction of the original mass, however (e.g. if sampling from one of the tails of the normal distribution, for example if $x_i^T \beta$ is around 3 or more and a negative sample is desired), then this will be inefficient and it becomes necessary to fall back on other sampling algorithms. General sampling from the truncated normal can be achieved using approximations to the normal CDF and the probit function, and R packages such as msm provide a function rtnorm() for generating truncated-normal samples.
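The sampler can be sketched for the simplest case of a single regressor without an intercept, so that the coefficient, the prior mean and the prior variance are all scalars. The truncated normal draws use the inverse-CDF approach mentioned above instead of rejection sampling. The prior settings, the simulated data and all names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def draw_latent(mean, y, rng):
    """Draw y* ~ N(mean, 1) truncated to y* > 0 if y == 1, else to y* < 0."""
    q = 1.0 if y == 1 else -1.0
    # Inverse-CDF draw of e ~ N(0, 1) restricted to e < q * mean;
    # the floor keeps the argument of inv_cdf strictly positive.
    u = max(STD_NORMAL.cdf(q * mean) * rng.random(), 1e-300)
    return mean - q * STD_NORMAL.inv_cdf(u)

def gibbs_probit(xs, ys, prior_mean=0.0, prior_var=100.0,
                 sweeps=400, burn_in=100, seed=0):
    """Gibbs sampler for a scalar coefficient beta with a N(b0, B0) prior."""
    rng = random.Random(seed)
    beta = 0.0
    post_var = 1.0 / (1.0 / prior_var + sum(x * x for x in xs))   # B
    draws = []
    for sweep in range(sweeps):
        # Step 1: latent variables given beta (truncated normals).
        latent = [draw_latent(x * beta, y, rng) for x, y in zip(xs, ys)]
        # Step 2: beta given latent variables (conjugate normal update).
        post_mean = post_var * (prior_mean / prior_var
                                + sum(x * s for x, s in zip(xs, latent)))
        beta = rng.gauss(post_mean, post_var ** 0.5)
        if sweep >= burn_in:
            draws.append(beta)
    return draws

# Simulated data with a hypothetical true coefficient of 1.0.
data_rng = random.Random(42)
xs = [data_rng.uniform(-2.0, 2.0) for _ in range(500)]
ys = [1 if 1.0 * x + data_rng.gauss(0.0, 1.0) > 0 else 0 for x in xs]
draws = gibbs_probit(xs, ys)
posterior_mean = sum(draws) / len(draws)
```

With a vector of coefficients, the second step would draw from a multivariate normal with the matrix B as covariance. The scalar case is used here only to keep the sketch short.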

Model evaluation

The suitability of an estimated binary model can be evaluated by counting the observations with outcome 1, and those with outcome 0, that the model classifies correctly, where an estimated probability above 1/2 is treated as a prediction of 1 and an estimated probability below 1/2 as a prediction of 0. See Logistic regression § Model for details.
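This count can be sketched as follows. The fitted probabilities and observed outcomes are hypothetical, and the choice to predict 0 when the probability is exactly 1/2 is an assumption of the sketch, since the text leaves that case open.

```python
def classification_counts(probabilities, outcomes, threshold=0.5):
    """Count correctly predicted ones and zeros using the given cutoff."""
    correct_ones = correct_zeros = 0
    for p, y in zip(probabilities, outcomes):
        predicted = 1 if p > threshold else 0   # ties at the cutoff predict 0
        if predicted == 1 and y == 1:
            correct_ones += 1
        elif predicted == 0 and y == 0:
            correct_zeros += 1
    return correct_ones, correct_zeros

probabilities = [0.9, 0.7, 0.4, 0.2, 0.6]   # hypothetical fitted probabilities
outcomes = [1, 0, 1, 0, 1]                  # hypothetical observed outcomes
correct_ones, correct_zeros = classification_counts(probabilities, outcomes)
share_correct = (correct_ones + correct_zeros) / len(outcomes)
```

In this example two of the three ones and one of the two zeros are classified correctly, giving a share of 0.6.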

Performance under misspecification

Consider the latent variable model formulation of the probit model. When the variance of $\varepsilon$ conditional on $x$ is not constant but dependent on $x$, the heteroskedasticity issue arises. For example, suppose $y^* = \beta_0 + \beta_1 x_1 + \varepsilon$ and $\varepsilon \mid x \sim N(0, x_1^2)$, where $x_1$ is a continuous positive explanatory variable. Under heteroskedasticity, the probit estimator for $\beta$ is usually inconsistent, and most of the tests about the coefficients are invalid. More importantly, the estimator for $P(y = 1 \mid x)$ becomes inconsistent, too. To deal with this problem, the original model needs to be transformed to be homoskedastic. For instance, in the same example, $1[\beta_0 + \beta_1 x_1 + \varepsilon > 0]$ can be rewritten as $1[\beta_0 / x_1 + \beta_1 + \varepsilon / x_1 > 0]$, where $\varepsilon / x_1 \mid x \sim N(0, 1)$. Therefore, $P(y = 1 \mid x) = \Phi(\beta_1 + \beta_0 / x_1)$, and running probit on $(1, 1/x_1)$ generates a consistent estimator for the conditional probability $P(y = 1 \mid x)$.
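The transformation in this example can be checked by simulation. The sketch below fixes hypothetical values for the two coefficients and for the regressor, draws errors whose standard deviation equals the regressor, and compares the simulated frequency with the transformed and the untransformed probit formulas.

```python
import random
from statistics import NormalDist

def simulated_probability(beta0, beta1, x1, draws=100_000, seed=0):
    """Share of draws with beta0 + beta1 * x1 + eps > 0, eps ~ N(0, x1^2)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(draws)
               if beta0 + beta1 * x1 + rng.gauss(0.0, x1) > 0)
    return hits / draws

beta0, beta1, x1 = 0.5, -0.3, 4.0                  # hypothetical values
simulated = simulated_probability(beta0, beta1, x1)
transformed = NormalDist().cdf(beta1 + beta0 / x1)  # Phi(beta1 + beta0 / x1)
naive = NormalDist().cdf(beta0 + beta1 * x1)        # ignores the changing variance
```

The simulated frequency should match the transformed expression and differ clearly from the naive one.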

When the assumption that $\varepsilon$ is normally distributed fails to hold, a functional form misspecification issue arises: if the model is still estimated as a probit model, the estimators of the coefficients $\beta$ are inconsistent. For instance, if $\varepsilon$ follows a logistic distribution in the true model, but the model is estimated by probit, the estimates will be generally smaller than the true value. However, the inconsistency of the coefficient estimates is practically irrelevant because the estimates for the partial effects, $\partial P(y = 1 \mid x) / \partial x_j$, will be close to the estimates given by the true logit model. [5]
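For a probit model the partial effect has the closed form $\varphi(x^T \beta) \beta_j$, which follows from differentiating $\Phi(x^T \beta)$ with respect to $x_j$. The sketch below evaluates it and checks it against a finite-difference derivative. The coefficient and regressor values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def probit_partial_effect(x, beta, j):
    """dP(y = 1 | x) / dx_j = phi(x' beta) * beta_j for a probit model."""
    index = sum(xk * bk for xk, bk in zip(x, beta))
    return STD_NORMAL.pdf(index) * beta[j]

def numerical_partial_effect(x, beta, j, step=1e-6):
    """Central finite-difference check of the same derivative."""
    def prob(point):
        return STD_NORMAL.cdf(sum(xk * bk for xk, bk in zip(point, beta)))
    up = list(x)
    up[j] += step
    down = list(x)
    down[j] -= step
    return (prob(up) - prob(down)) / (2.0 * step)

beta = [-0.5, 1.0]    # hypothetical intercept and slope
x = [1.0, 0.8]        # constant term and one regressor value
analytic = probit_partial_effect(x, beta, 1)
numeric = numerical_partial_effect(x, beta, 1)
```

Unlike the coefficient itself, the partial effect depends on where it is evaluated, which is why it is the quantity usually compared across probit and logit fits.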

To avoid the issue of distribution misspecification, one may adopt a general distribution assumption for the error term, such that many different types of distribution can be included in the model. The cost is heavier computation and lower accuracy as the number of parameters increases. [6] In most practical cases where the distributional form is misspecified, the estimators for the coefficients are inconsistent, but the estimators for the conditional probability and the partial effects are still very good. [citation needed]

One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions on a parametric form for the index function and are robust to the choice of the link function (e.g., probit or logit). [4]

History

The probit model is usually credited to Chester Bliss, who coined the term "probit" in 1934, [7] and to John Gaddum (1933), who systematized earlier work. [8] However, the basic model dates to the Weber–Fechner law by Gustav Fechner, published in Fechner (1860), and was repeatedly rediscovered until the 1930s; see Finney (1971, Chapter 3.6) and Aitchison & Brown (1957, Chapter 1.2). [8]

A fast method for computing maximum likelihood estimates for the probit model was proposed by Ronald Fisher as an appendix to Bliss' work in 1935. [9]

References

  1. Oxford English Dictionary, 3rd ed., s.v. probit (article dated June 2007): Bliss, C. I. (1934). "The Method of Probits". Science. 79 (2037): 38–39. Bibcode:1934Sci....79...38B. doi:10.1126/science.79.2037.38. PMID 17813446. "These arbitrary probability units have been called 'probits'."
  2. Agresti, Alan (2015). Foundations of Linear and Generalized Linear Models. New York: Wiley. pp. 183–186. ISBN 978-1-118-73003-4.
  3. Aldrich, John H.; Nelson, Forrest D.; Adler, E. Scott (1984). Linear Probability, Logit, and Probit Models. Sage. pp. 48–65. ISBN 0-8039-2133-0.
  4. Park, Byeong U.; Simar, Léopold; Zelenyuk, Valentin (2017). "Nonparametric estimation of dynamic discrete choice models for time series data". Computational Statistics & Data Analysis. 108: 97–120. doi:10.1016/j.csda.2016.10.024.
  5. Greene, W. H. (2003). Econometric Analysis. Upper Saddle River, NJ: Prentice Hall.
  6. For more details, refer to: Cappé, O.; Moulines, E.; Ryden, T. (2005). Inference in Hidden Markov Models. New York: Springer-Verlag. Chapter 2.
  7. Bliss, C. I. (1934). "The Method of Probits". Science. 79 (2037): 38–39. Bibcode:1934Sci....79...38B. doi:10.1126/science.79.2037.38. PMID 17813446.
  8. Cramer 2002, p. 7.
  9. Fisher, R. A. (1935). "The Case of Zero Survivors in Probit Assays". Annals of Applied Biology. 22: 164–165. doi:10.1111/j.1744-7348.1935.tb07713.x.

Further reading