This article may be too technical for most readers to understand.(January 2013) |
In probability theory and statistics, the probit function is the quantile function associated with the standard normal distribution. It has applications in data analysis and machine learning, in particular exploratory statistical graphics and specialized regression modeling of binary response variables.
Mathematically, the probit is the inverse of the cumulative distribution function of the standard normal distribution, which is denoted as , so the probit is defined as
Largely because of the central limit theorem, the standard normal distribution plays a fundamental role in probability theory and statistics. If we consider the familiar fact that the standard normal distribution places 95% of probability between −1.96 and 1.96 and is symmetric around zero, it follows that
The probit function gives the 'inverse' computation, generating a value of a standard normal random variable, associated with specified cumulative probability. Continuing the example,
In general,
The idea of the probit function was published by Chester Ittner Bliss in a 1934 article in Science on how to treat data such as the percentage of a pest killed by a pesticide. [1] Bliss proposed transforming the percentage killed into a "probability unit" (or "probit") which was linearly related to the modern definition (he defined it arbitrarily as equal to 0 for 0.0001 and 1 for 0.9999): [2]
These arbitrary probability units have been termed "probits" ...
He included a table to aid other researchers to convert their kill percentages to his probit, which they could then plot against the logarithm of the dose and thereby, it was hoped, obtain a more or less straight line. Such a so-called probit model is still important in toxicology, as well as other fields. The approach is justified in particular if response variation can be rationalized as a lognormal distribution of tolerances among subjects on test, where the tolerance of a particular subject is the dose just sufficient for the response of interest.
The method introduced by Bliss was carried forward in Probit Analysis, an important text on toxicological applications by D. J. Finney. [3] [4] Values tabled by Finney can be derived from probits as defined here by adding a value of 5. This distinction is summarized by Collett (p. 55): [5] "The original definition of a probit [with 5 added] was primarily to avoid having to work with negative probits; ... This definition is still used in some quarters, but in the major statistical software packages for what is referred to as probit analysis, probits are defined without the addition of 5." Probit methodology, including numerical optimization for fitting of probit functions, was introduced before widespread availability of electronic computing. When using tables, it was convenient to have probits uniformly positive. Common areas of application do not require positive probits.
In addition to providing a basis for important types of regression, the probit function is useful in statistical analysis for diagnosing deviation from normality, according to the method of Q–Q plotting. If a set of data is actually a sample of a normal distribution, a plot of the values against their probit scores will be approximately linear. Specific deviations from normality such as asymmetry, heavy tails, or bimodality can be diagnosed based on detection of specific deviations from linearity. While the Q–Q plot can be used for comparison to any distribution family (not only the normal), the normal Q–Q plot is a relatively standard exploratory data analysis procedure because the assumption of normality is often a starting point for analysis.
The normal distribution CDF and its inverse are not available in closed form, and computation requires careful use of numerical procedures. However, the functions are widely available in software for statistics and probability modeling, and in spreadsheets. In Microsoft Excel, for example, the probit function is available as norm.s.inv(p). In computing environments where numerical implementations of the inverse error function are available, the probit function may be obtained as
An example is MATLAB, where an 'erfinv' function is available. The language Mathematica implements 'InverseErf'. Other environments directly implement the probit function as is shown in the following session in the R programming language.
> qnorm(0.025)[1] -1.959964> pnorm(-1.96)[1] 0.02499790
Details for computing the inverse error function can be found at . Wichura gives a fast algorithm for computing the probit function to 16 decimal places; this is used in R to generate random variates for the normal distribution. [6]
Another means of computation is based on forming a non-linear ordinary differential equation (ODE) for probit, as per the Steinbrecher and Shaw method. [7] Abbreviating the probit function as , the ODE is
where is the probability density function of w.
In the case of the Gaussian:
Differentiating again:
with the centre (initial) conditions
This equation may be solved by several methods, including the classical power series approach. From this, solutions of arbitrarily high accuracy may be developed based on Steinbrecher's approach to the series for the inverse error function. The power series solution is given by
where the coefficients satisfy the non-linear recurrence
with . In this form the ratio as .
Closely related to the probit function (and probit model) are the logit function and logit model. The inverse of the logistic function is given by
Analogously to the probit model, we may assume that such a quantity is related linearly to a set of predictors, resulting in the logit model, the basis in particular of logistic regression model, the most prevalent form of regression analysis for categorical response data. In current statistical practice, probit and logit regression models are often handled as cases of the generalized linear model.
In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is The parameter is the mean or expectation of the distribution, while the parameter is the variance. The standard deviation of the distribution is (sigma). A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate.
The stimulus–response model is a conceptual framework in psychology that describes how individuals respond to external stimuli. According to this model, an external stimulus triggers a reaction in an organism, often without the need for conscious thought. This model emphasizes the mechanistic aspects of behavior, suggesting that behavior can often be predicted and controlled by understanding and manipulating the stimuli that trigger responses.
In statistics, the logit function is the quantile function associated with the standard logistic distribution. It has many uses in data analysis and machine learning, especially in data transformations.
In statistics, the logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression estimates the parameters of a logistic model. In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.
In probability theory and statistics, the logistic distribution is a continuous probability distribution. Its cumulative distribution function is the logistic function, which appears in logistic regression and feedforward neural networks. It resembles the normal distribution in shape but has heavier tails. The logistic distribution is a special case of the Tukey lambda distribution.
In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
In probability theory and statistics, the Lévy distribution, named after Paul Lévy, is a continuous probability distribution for a non-negative random variable. In spectroscopy, this distribution, with frequency as the dependent variable, is known as a van der Waals profile. It is a special case of the inverse-gamma distribution. It is a stable distribution.
In statistics, a linear probability model (LPM) is a special case of a binary regression model. Here the dependent variable for each observation takes values which are either 0 or 1. The probability of observing a 0 or 1 in any one case is treated as depending on one or more explanatory variables. For the "linear probability model", this relationship is a particularly simple one, and allows the model to be fitted by linear regression.
In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model.
In economics, discrete choice models, or qualitative choice models, describe, explain, and predict choices between two or more discrete alternatives, such as entering or not entering the labor market, or choosing between modes of transport. Such choices contrast with standard consumption models in which the quantity of each good consumed is assumed to be a continuous variable. In the continuous case, calculus methods can be used to determine the optimum amount chosen, and demand can be modeled empirically using regression analysis. On the other hand, discrete choice analysis examines situations in which the potential outcomes are discrete, such that the optimum is not characterized by standard first-order conditions. Thus, instead of examining "how much" as in problems with continuous choice variables, discrete choice analysis examines "which one". However, discrete choice analysis can also be used to examine the chosen quantity when only a few distinct quantities must be chosen from, such as the number of vehicles a household chooses to own and the number of minutes of telecommunications service a customer decides to purchase. Techniques such as logistic regression and probit regression can be used for empirical analysis of discrete choice.
In statistics, binomial regression is a regression analysis technique in which the response has a binomial distribution: it is the number of successes in a series of independent Bernoulli trials, where each trial has probability of success . In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.
In probability theory, the inverse Gaussian distribution is a two-parameter family of continuous probability distributions with support on (0,∞).
In probability theory, the Mills ratio of a continuous random variable is the function
In probability theory, a logit-normal distribution is a probability distribution of a random variable whose logit has a normal distribution. If Y is a random variable with a normal distribution, and t is the standard logistic function, then X = t(Y) has a logit-normal distribution; likewise, if X is logit-normally distributed, then Y = logit(X)= log (X/(1-X)) is normally distributed. It is also known as the logistic normal distribution, which often refers to a multinomial logit version (e.g.).
In statistics, ordinal regression, also called ordinal classification, is a type of regression analysis used for predicting an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. It can be considered an intermediate problem between regression and classification. Examples of ordinal regression are ordered logit and ordered probit. Ordinal regression turns up often in the social sciences, for example in the modeling of human levels of preference, as well as in information retrieval. In machine learning, ordinal regression may also be called ranking learning.
In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.
The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by a square-integrable stochastic processes. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in case of GFLM is obtained by forming the scalar product of the random predictor function with a smooth parameter function . Functional Linear Regression, Functional Poisson Regression and Functional Binomial Regression, with the important Functional Logistic Regression included, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.
In statistics and econometrics, the maximum score estimator is a nonparametric estimator for discrete choice models developed by Charles Manski in 1975. Unlike the multinomial probit and multinomial logit estimators, it makes no assumptions about the distribution of the unobservable part of utility. However, its statistical properties are more complicated than the multinomial probit and logit models, making statistical inference difficult. To address these issues, Joel Horowitz proposed a variant, called the smoothed maximum score estimator.
Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known. These data exist on an ordinal scale, one of four levels of measurement described by S. S. Stevens in 1946. The ordinal scale is distinguished from the nominal scale by having a ranking. It also differs from the interval scale and ratio scale by not having category widths that represent equal increments of the underlying attribute.
In statistics, specifically regression analysis, a binary regression estimates a relationship between one or more explanatory variables and a single output binary variable. Generally the probability of the two alternatives is modeled, instead of simply outputting a single value, as in linear regression.
{{cite journal}}
: CS1 maint: multiple names: authors list (link)