Maximum score estimator

Last updated June 30, 2021

In statistics and econometrics, the maximum score estimator is a nonparametric estimator for discrete choice models developed by Charles Manski in 1975. Unlike the multinomial probit and multinomial logit estimators, it makes no assumptions about the distribution of the unobservable part of utility. However, its statistical properties (particularly its asymptotic distribution) are more complicated than those of the multinomial probit and logit estimators, which makes statistical inference difficult. To address these issues, Joel Horowitz proposed a variant called the smoothed maximum score estimator.

Setting

When modelling discrete choice problems, it is assumed that the choice is determined by comparing the latent utilities of the available alternatives.^[1] Denote the population of agents as T and the common choice set for each agent as C. For agent $t\in T$ , denote her choice as $y_{t,i}$ , which is equal to 1 if alternative i is chosen and 0 otherwise. Assume that latent utility is linear in the explanatory variables and that there is an additive response error. Then for an agent $t\in T$ ,

y_{t,i}=1\leftrightarrow x_{t,i}\beta +\epsilon _{t,i}>x_{t,j}\beta +\epsilon _{t,j},\quad \forall j\neq i{\text{ and }}j\in C

where $x_{t,i}$ and $x_{t,j}$ are the q-dimensional observable covariates about the agent and the choice, and $\epsilon _{t,i}$ and $\epsilon _{t,j}$ are the factors entering the agent's decision that are not observed by the econometrician. The construction of the observable covariates is very general. For instance, if C is a set of different brands of coffee, then $x_{t,i}$ includes the characteristics both of the agent t, such as age, gender, income and ethnicity, and of the coffee i, such as price, taste and whether it is local or imported. The error terms are assumed to be i.i.d., and the goal is to estimate $\beta$ , which characterizes the effect of the different factors on the agent's choice.
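The choice rule above can be illustrated with a minimal Python sketch. The covariate values, the coefficient vector and the normal errors below are hypothetical and chosen only for illustration; the model itself does not require any particular error distribution.

```python
import random

def utilities(x_t, beta, eps_t):
    """Latent utility x_{t,i} . beta + eps_{t,i} for each alternative i."""
    return [sum(xk * bk for xk, bk in zip(x_i, beta)) + e
            for x_i, e in zip(x_t, eps_t)]

def choice_indicators(x_t, beta, eps_t):
    """Return the list y_{t,i}: 1 for the alternative with the highest utility.

    Ties are broken in favour of the first alternative; with continuous
    errors a tie has probability zero.
    """
    u = utilities(x_t, beta, eps_t)
    best = max(range(len(u)), key=lambda i: u[i])
    return [1 if i == best else 0 for i in range(len(u))]

# Hypothetical example: three coffee brands, covariates = (price, imported flag)
x_t = [[3.0, 0.0], [4.5, 1.0], [2.5, 0.0]]
beta = [-1.0, 0.8]                      # assumed coefficients
rng = random.Random(0)
eps_t = [rng.gauss(0.0, 1.0) for _ in x_t]  # normal errors, for illustration only
y_t = choice_indicators(x_t, beta, eps_t)
```

The econometrician observes only `x_t` and `y_t`; the errors `eps_t` and the utilities themselves remain latent.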

Parametric estimators

Usually a specific distributional assumption is imposed on the error term, so that the parameter $\beta$ can be estimated parametrically. For instance, if the error term is assumed to be normally distributed, the model is a multinomial probit model;^[2] if it is assumed to follow a Gumbel distribution, the model becomes a multinomial logit model. Parametric models^[3] are computationally convenient, but the resulting estimator may be inconsistent if the distribution of the error term is misspecified.^[4]

Binary response

For example, suppose that C contains only two items. This is the latent utility representation^[5] of a binary choice model. In this model, the choice is $Y_{t}=1[X_{1,t}\beta +\varepsilon _{1}>X_{2,t}\beta +\varepsilon _{2}]$ , where $X_{1,t}$ and $X_{2,t}$ are two vectors of explanatory covariates, $\varepsilon _{1}$ and $\varepsilon _{2}$ are i.i.d. response errors, and $X_{1,t}\beta +\varepsilon _{1}$ and $X_{2,t}\beta +\varepsilon _{2}$ are the latent utilities of choosing alternatives 1 and 2. The log-likelihood function is then:

Q=\sum _{t=1}^{N}Y_{t}\log(P[X_{1,t}\beta -X_{2,t}\beta >\varepsilon _{2}-\varepsilon _{1}])+(1-Y_{t})\log(1-P[X_{1,t}\beta -X_{2,t}\beta >\varepsilon _{2}-\varepsilon _{1}])

If a distributional assumption about the response error is imposed, the log-likelihood function takes an explicit form.^[2] For instance, if the response errors are assumed to be distributed as $N(0,\sigma ^{2})$ , then the log-likelihood function can be rewritten as:

Q=\sum _{t=1}^{N}Y_{t}\log \left(\Phi \left[{\frac {X_{1,t}\beta -X_{2,t}\beta }{{\sqrt {2}}\sigma }}\right]\right)+(1-Y_{t})\log \left(\Phi \left[{\frac {X_{2,t}\beta -X_{1,t}\beta }{{\sqrt {2}}\sigma }}\right]\right)

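where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. The factor ${\sqrt {2}}\sigma$ is the standard deviation of $\varepsilon _{2}-\varepsilon _{1}$ , the difference of two independent $N(0,\sigma ^{2})$ errors. Although $\Phi$ has no closed-form expression, its derivative (the normal density) does. This is the probit model.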
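As a rough numerical illustration, the probit log-likelihood can be evaluated with the Python standard library, using `math.erf` to compute $\Phi$ . The function names and the small data set are hypothetical, and the sketch only evaluates Q for a given $\beta$ ; it does not maximize it.

```python
import math

def std_normal_cdf(z):
    """Phi(z), the standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dot(x, b):
    """Inner product of a covariate vector and a coefficient vector."""
    return sum(xk * bk for xk, bk in zip(x, b))

def probit_loglik(beta, X1, X2, Y, sigma=1.0):
    """Q = sum_t Y_t log Phi(d_t) + (1 - Y_t) log Phi(-d_t),
    where d_t = (X1_t beta - X2_t beta) / (sqrt(2) * sigma)."""
    total = 0.0
    for x1, x2, y in zip(X1, X2, Y):
        d = (dot(x1, beta) - dot(x2, beta)) / (math.sqrt(2.0) * sigma)
        total += y * math.log(std_normal_cdf(d))
        total += (1 - y) * math.log(std_normal_cdf(-d))
    return total

# Hypothetical data: N = 3 agents, one covariate per alternative
X1 = [[1.0], [0.0], [2.0]]
X2 = [[0.0], [1.0], [0.5]]
Y = [1, 0, 1]

q_zero = probit_loglik([0.0], X1, X2, Y)  # every probability equals 1/2
q_one = probit_loglik([1.0], X1, X2, Y)   # beta aligned with the observed choices
```

In this toy data set the choices agree with the sign of $X_{1,t}-X_{2,t}$ , so `q_one` exceeds `q_zero`.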

This model is based on a distributional assumption about the response error term. Adding a specific distribution assumption into the model can make the model computationally tractable due to the existence of the closed-form representation. But if the distribution of the error term is misspecified, the estimates based on the distribution assumption will be inconsistent.

The basic idea of the distribution-free model is to replace the two probability terms in the log-likelihood function with other weights. The general form of the log-likelihood function can be written as:

Q=\sum _{t=1}^{N}Y_{t}\cdot \log(W_{1}(X_{1,t}\beta ,X_{2,t}\beta ))+(1-Y_{t})\log(W_{0}(X_{1,t}\beta ,X_{2,t}\beta ))

Maximum score estimator

To make the estimator robust to the distributional assumption, Manski (1975) proposed a nonparametric method for estimating the parameters. Denote the number of elements of the choice set as J and the total number of agents as N, and let $W(J-1)>W(J-2)>\dots >W(1)>W(0)$ be a sequence of real numbers. The maximum score estimator^[6] is defined as:

{\hat {b}}={\operatorname {arg\max } }_{b}{\frac {1}{N}}\sum _{t=1}^{N}\sum _{i=1}^{J}y_{t,i}W(\sum \nolimits _{j\in C,j\neq i}1[x_{t,i}b>x_{t,j}b])

Here, $\textstyle \sum \nolimits _{j\in C,j\neq i}1[x_{t,i}b>x_{t,j}b]$ is the rank of the deterministic part of the utility of choosing i, that is, the number of alternatives whose deterministic utility is lower than that of i. The intuition is that the higher the chosen alternative ranks, the more weight is assigned to that choice.
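The objective function can be sketched in a few lines of Python. This is only an evaluation of the score for a candidate b, not a full estimator; the data, the weight sequence `W` and the helper names are hypothetical.

```python
def dot(x, b):
    """Inner product of a covariate vector and a coefficient vector."""
    return sum(xk * bk for xk, bk in zip(x, b))

def rank_of(i, x_t, b):
    """Number of alternatives j != i with x_{t,i} b > x_{t,j} b."""
    v = [dot(x, b) for x in x_t]
    return sum(1 for j in range(len(v)) if j != i and v[i] > v[j])

def max_score_objective(b, X, Y, W):
    """(1/N) * sum_t sum_i y_{t,i} * W(rank of alternative i for agent t)."""
    total = 0.0
    for x_t, y_t in zip(X, Y):
        for i, y in enumerate(y_t):
            if y:  # only the chosen alternative contributes
                total += W[rank_of(i, x_t, b)]
    return total / len(X)

# Hypothetical data: N = 2 agents, J = 3 alternatives, q = 1 covariate
X = [[[1.0], [2.0], [3.0]],
     [[0.5], [0.1], [0.3]]]
Y = [[0, 0, 1],
     [1, 0, 0]]
W = [0.0, 1.0, 2.0]  # W(0) < W(1) < W(2); any increasing sequence would do

score_pos = max_score_objective([1.0], X, Y, W)   # both choices ranked highest
score_neg = max_score_objective([-1.0], X, Y, W)  # both choices ranked lowest
```

Because the score depends on b only through indicator functions, it is a step function of b, so gradient-based optimizers are not directly applicable.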

Under certain conditions, the maximum score estimator is weakly consistent, but its asymptotic properties are very complicated.^[7] The difficulty comes mainly from the non-smoothness of the objective function.

Binary example

In the binary context, the weights of the maximum score estimator can be written as:

W_{1}(X_{1,t}\beta ,X_{2,t}\beta )=w_{1}1[X_{1,t}\beta -X_{2,t}\beta >0]+w_{0}1[X_{1,t}\beta -X_{2,t}\beta <0],

where

W_{0}(X_{1,t}\beta ,X_{2,t}\beta )=1-W_{1}(X_{1,t}\beta ,X_{2,t}\beta )

and $w_{1}$ and $w_{0}$ are two constants in (0,1). The intuition behind this weighting scheme is that the probability of a choice depends on the relative order of the deterministic parts of the utilities.
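A minimal Python sketch of the binary case follows. Several details are assumptions made for illustration and are not part of the description above: the values $w_{1}=0.75$ and $w_{0}=0.25$ , the treatment of ties (assigned $w_{0}$ ), the normalization of b to the unit circle (the score depends only on the direction of b, not its scale), and the use of a simple grid search to cope with the step-shaped objective.

```python
import math

def dot(x, b):
    """Inner product of a covariate vector and a coefficient vector."""
    return sum(xk * bk for xk, bk in zip(x, b))

def binary_score(b, X1, X2, Y, w1=0.75, w0=0.25):
    """Q(b) = sum_t Y_t log W1_t + (1 - Y_t) log W0_t, with W0_t = 1 - W1_t."""
    total = 0.0
    for x1, x2, y in zip(X1, X2, Y):
        diff = dot(x1, b) - dot(x2, b)
        W1 = w1 if diff > 0 else w0  # ties (diff == 0) get w0: an arbitrary choice
        total += math.log(W1) if y == 1 else math.log(1.0 - W1)
    return total

def grid_search(X1, X2, Y, n_grid=360):
    """Search over b = (cos a, sin a); the unit-norm normalization is assumed."""
    best_b, best_q = None, -math.inf
    for k in range(n_grid):
        a = 2.0 * math.pi * k / n_grid
        b = (math.cos(a), math.sin(a))
        q = binary_score(b, X1, X2, Y)
        if q > best_q:  # keeps the first maximizer found on the grid
            best_b, best_q = b, q
    return best_b, best_q

# Hypothetical data: the choice follows the sign of the first covariate difference
X1 = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.0, 0.0]]
X2 = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
Y = [1, 0, 1, 0]

b_hat, q_hat = grid_search(X1, X2, Y)
```

With $w_{1}=1-w_{0}>1/2$ , maximizing Q is the same as maximizing the number of observations whose choice agrees with the sign of $X_{1,t}b-X_{2,t}b$ . The maximizer is generally a set of directions rather than a single point, and the grid search returns only the first one it finds.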

Smoothed maximum score estimator

Horowitz (1992) proposed a smoothed maximum score (SMS) estimator, which has much better asymptotic properties.^[8] The basic idea is to replace the non-smooth weight function $\textstyle W(\sum \nolimits _{j\in C,j\neq i}1[x_{t,i}b>x_{t,j}b])$ with a smooth one. Define a smooth kernel function K satisfying the following conditions:

$|K(\cdot )|$ is bounded over the real numbers
$\lim _{u\to -\infty }K(u)=0$ and $\lim _{u\to +\infty }K(u)=1$
${\dot {K}}(u)={\dot {K}}(-u)$

Here, the kernel function is analogous to a CDF whose PDF is symmetric around 0. Then, the SMS estimator is defined as:

{\hat {b}}_{SMS}={\operatorname {arg\max } }_{b}{\frac {1}{N}}\sum _{t=1}^{N}\sum _{i=1}^{J}y_{t,i}\sum \nolimits _{j\in C,j\neq i}K\left({\frac {x_{t,i}b-x_{t,j}b}{h_{N}}}\right)

where $(h_{N},N=1,2,...)$ is a sequence of strictly positive numbers with $\lim _{N\to +\infty }h_{N}=0$ . The intuition is the same as in the construction of the traditional maximum score estimator: the agent is more likely to choose the alternative with the higher observed part of latent utility. Under certain conditions, the smoothed maximum score estimator is consistent and, more importantly, asymptotically normal. Therefore, the usual statistical tests and inference based on asymptotic normality can be applied.^[9]
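The smoothed objective can be sketched in Python as follows. The standard normal CDF is used as the kernel K purely as an illustrative choice that meets the three conditions; the data, the bandwidth values and the function names are hypothetical.

```python
import math

def K(u):
    """Standard normal CDF: bounded, tends to 0 and 1, symmetric derivative."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def dot(x, b):
    """Inner product of a covariate vector and a coefficient vector."""
    return sum(xk * bk for xk, bk in zip(x, b))

def sms_objective(b, X, Y, h):
    """(1/N) sum_t sum_i y_{t,i} sum_{j != i} K((x_{t,i} b - x_{t,j} b) / h)."""
    total = 0.0
    for x_t, y_t in zip(X, Y):
        v = [dot(x, b) for x in x_t]  # deterministic utilities of agent t
        for i, y in enumerate(y_t):
            if y:  # only the chosen alternative contributes
                total += sum(K((v[i] - v[j]) / h)
                             for j in range(len(v)) if j != i)
    return total / len(X)

# Hypothetical data: N = 2 agents, J = 3 alternatives, q = 1 covariate
X = [[[1.0], [2.0], [3.0]],
     [[0.5], [0.1], [0.3]]]
Y = [[0, 0, 1],
     [1, 0, 0]]

smooth = sms_objective([1.0], X, Y, h=1.0)   # visibly smoothed
sharp = sms_objective([1.0], X, Y, h=0.01)   # close to the indicator count
```

As the bandwidth shrinks, each kernel term approaches the indicator $1[x_{t,i}b>x_{t,j}b]$ , so the smoothed objective approaches the unsmoothed count of alternatives ranked below the chosen one. For a fixed bandwidth the objective is differentiable in b, which is what makes standard optimization and asymptotic theory applicable.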

References

1. For more examples, refer to: Smith, Michael D.; Brynjolfsson, Erik (October 2001). "Consumer Decision-Making at an Internet Shopbot". MIT Sloan School of Management Working Paper No. 4206-01.
2. Wooldridge, J. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, Mass: MIT Press. pp. 457–460. ISBN 978-0-262-23219-7.
3. For a concrete example, refer to: Yai, Tetsuo; Iwakura, Seiji; Morichi, Shigeru (June 1997). "Multinomial probit with structured covariance for route choice behavior". Transportation Research Part B: Methodological. 31 (3): 195–207. ISSN 0191-2615.
4. Yan, Jin (2012). "A Smoothed Maximum Score Estimator for Multinomial Discrete Choice Models". Working Paper.
5. Walker, Joan; Ben-Akiva, Moshe (2002). "Generalized random utility model". Mathematical Social Sciences. 43 (3): 303–343. doi:10.1016/S0165-4896(02)00023-9.
6. Manski, Charles F. (1975). "Maximum Score Estimation of the Stochastic Utility Model of Choice". Journal of Econometrics. 3 (3): 205–228. CiteSeerX 10.1.1.587.6474. doi:10.1016/0304-4076(75)90032-9.
7. Kim, Jeankyung; Pollard, David (1990). "Cube Root Asymptotics". Annals of Statistics. 18 (1): 191–219. doi:10.1214/aos/1176347498. JSTOR 2241541.
8. Horowitz, Joel L. (1992). "A Smoothed Maximum Score Estimator for the Binary Response Model". Econometrica. 60 (3): 505–531. doi:10.2307/2951582. JSTOR 2951582.
9. For a survey study, refer to: Yan, Jin (2012). "A Smoothed Maximum Score Estimator for Multinomial Discrete Choice Models". Working Paper.
