Fractional model

In applied statistics, fractional models are, to some extent, related to binary response models. However, instead of estimating the probability of being in one bin of a dichotomous variable, the fractional model typically deals with variables that take any value in the unit interval. The model can be generalized to variables on any other bounded interval by an appropriate transformation.[1] Examples range from participation rates in 401(k) plans[2] to television ratings of NBA games.[3]

Description

There have been two approaches to modeling this problem. Even though both rely on an index that is linear in $x_i$ combined with a link function,[4] this is not strictly necessary. The first approach models a log-odds transformation of $y$ as a linear function of $x_i$, i.e.,

$$\log\left(\frac{y_i}{1-y_i}\right) = x_i\beta + \varepsilon_i.$$

This approach is problematic for two distinct reasons: the $y$ variable cannot take on the boundary values 0 and 1, and the interpretation of the coefficients is not straightforward. The second approach circumvents these issues by modeling the conditional mean of $y$ directly through a logistic link function. More specifically,

$$E(y_i \mid x_i) = G(x_i\beta) = \frac{\exp(x_i\beta)}{1+\exp(x_i\beta)}.$$
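
A minimal sketch of this quasi-maximum likelihood approach in Python follows; the simulated data, true coefficients, and noise scale are illustrative assumptions, not taken from the sources above. The key point it demonstrates is that the Bernoulli log-likelihood remains a valid objective function even though $y$ is fractional rather than binary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic link G(z) = exp(z) / (1 + exp(z))

rng = np.random.default_rng(0)

# Simulated fractional response with E[y | x] = G(b0 + b1*x); y may hit 0 and 1
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta_true = np.array([-0.5, 1.0])
y = np.clip(expit(X @ beta_true) + 0.1 * rng.normal(size=n), 0.0, 1.0)

def neg_bernoulli_qll(beta):
    """Negative Bernoulli quasi-log-likelihood; valid even though y is not binary."""
    p = expit(X @ beta)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log1p(-p))

beta_hat = minimize(neg_bernoulli_qll, x0=np.zeros(2), method="BFGS").x
print("fractional logit QMLE:", beta_hat)     # should be close to beta_true
```

Because the Bernoulli variance is generally misspecified for fractional data, robust (heteroskedasticity-consistent) standard errors should accompany the point estimates.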

It immediately becomes clear that this setup is very similar to the binary logit model, with the difference that the $y$ variable can actually take on values in the unit interval. Many of the estimation techniques for the binary logit model, such as nonlinear least squares and quasi-maximum likelihood, carry over in a natural way, as do heteroskedasticity adjustments and partial-effects calculations.[5]
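
For reference, the partial effect implied by the logistic conditional mean, and the average partial effect (APE) usually reported in applied work, are

$$\frac{\partial E(y \mid x)}{\partial x_j} = G(x\beta)\bigl[1 - G(x\beta)\bigr]\beta_j, \qquad \widehat{\mathrm{APE}}_j = \frac{1}{N}\sum_{i=1}^{N} G(x_i\hat\beta)\bigl[1 - G(x_i\hat\beta)\bigr]\hat\beta_j.$$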

Extensions to this cross-sectional model allow important econometric issues to be taken into account, such as endogenous explanatory variables and unobserved heterogeneous effects. Under strict exogeneity assumptions, it is possible to difference out these unobserved effects using panel data techniques, although weaker exogeneity assumptions can also result in consistent estimators.[6] Control function techniques to deal with endogeneity concerns have also been proposed.[7]
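
A sketch in the spirit of the panel-data extension of Papke and Wooldridge (2008) is given below, using a Mundlak-style correlated random effects device in which the time average of the regressor proxies for the unobserved effect. The probit link, panel dimensions, and data-generating values are assumptions made for illustration, not a definitive implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical balanced panel: N units observed for T periods, one regressor x
N, T = 500, 4
x = rng.normal(size=(N, T))
c = 0.5 * x.mean(axis=1) + 0.3 * rng.normal(size=N)  # unobserved effect, correlated with x
y = np.clip(norm.cdf(-0.2 + 0.8 * x + c[:, None]) + 0.05 * rng.normal(size=(N, T)), 0.0, 1.0)

# Mundlak device: include the unit-level time average of x as a control for c_i
xbar = np.repeat(x.mean(axis=1), T)                  # matches the row-major ravel below
Z = np.column_stack([np.ones(N * T), x.ravel(), xbar])
yv = y.ravel()

def neg_qll(theta):
    """Pooled Bernoulli quasi-log-likelihood with a probit conditional mean."""
    p = norm.cdf(Z @ theta).clip(1e-12, 1.0 - 1e-12)
    return -np.sum(yv * np.log(p) + (1.0 - yv) * np.log(1.0 - p))

theta_hat = minimize(neg_qll, np.zeros(3), method="BFGS").x
print("pooled fractional probit (CRE):", theta_hat)
```

In practice, standard errors should be clustered at the unit level to account for serial dependence within each $i$.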

Related Research Articles

In statistics, the logistic model is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.
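
In symbols, the binary logistic model takes the familiar form

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}.$$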

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for the response variable to have an error distribution other than the normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
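
Compactly, the two ingredients described here are

$$g\bigl(E(y \mid x)\bigr) = x\beta, \qquad \operatorname{Var}(y \mid x) = \phi\, V\bigl(E(y \mid x)\bigr),$$

where $g$ is the link function, $V$ the variance function, and $\phi$ a dispersion parameter.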

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term, in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.
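
In the just-identified linear case, the IV estimator can be written in closed form as

$$\hat\beta_{IV} = (Z^\top X)^{-1} Z^\top y,$$

where $Z$ collects the instruments and $X$ the explanatory variables.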

In statistics, a linear probability model is a special case of a binary regression model. Here the dependent variable for each observation takes values which are either 0 or 1. The probability of observing a 0 or 1 in any one case is treated as depending on one or more explanatory variables. For the "linear probability model", this relationship is a particularly simple one, and allows the model to be fitted by linear regression.
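
Written out, the linear probability model is

$$P(y = 1 \mid x) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,$$

which can be fitted by OLS, at the cost that fitted probabilities may fall outside the unit interval.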

In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification.
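
In symbols, the probit model specifies

$$P(y = 1 \mid x) = \Phi(x\beta),$$

where $\Phi$ is the standard normal cumulative distribution function.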

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed values of the dependent variable in the given dataset and those predicted by the linear function of the independent variables.
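
The least squares principle and its closed-form solution are

$$\hat\beta = \arg\min_{b} \sum_{i=1}^{n} (y_i - x_i b)^2 = (X^\top X)^{-1} X^\top y.$$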

In econometrics, endogeneity broadly refers to situations in which an explanatory variable is correlated with the error term. The distinction between endogenous and exogenous variables originated in simultaneous equations models, where one separates variables whose values are determined by the model from variables which are predetermined; ignoring simultaneity in the estimation leads to biased estimates, as it violates the exogeneity assumption of the Gauss–Markov theorem. The problem of endogeneity is often ignored by researchers conducting non-experimental research; doing so precludes making policy recommendations. Instrumental variable techniques are commonly used to address this problem.

In statistics, the ordered logit model is an ordinal regression model—that is, a regression model for ordinal dependent variables—first considered by Peter McCullagh. For example, if one question on a survey is to be answered by a choice among "poor", "fair", "good", and "excellent", and the purpose of the analysis is to see how well that response can be predicted by the responses to other questions, some of which may be quantitative, then ordered logistic regression may be used. It can be thought of as an extension of the logistic regression model that applies to dichotomous dependent variables, allowing for more than two (ordered) response categories.
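
With cut points $\alpha_1 < \dots < \alpha_{J-1}$, the ordered logit specifies the cumulative probabilities

$$P(y \le j \mid x) = \frac{1}{1 + e^{-(\alpha_j - x\beta)}}, \qquad j = 1, \dots, J-1.$$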

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables.
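
In symbols, with outcome categories $j = 1, \dots, J$,

$$P(y = j \mid x) = \frac{e^{x\beta_j}}{\sum_{k=1}^{J} e^{x\beta_k}},$$

where one $\beta_k$ is normalized to zero for identification.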

In economics, discrete choice models, or qualitative choice models, describe, explain, and predict choices between two or more discrete alternatives, such as entering or not entering the labor market, or choosing between modes of transport. Such choices contrast with standard consumption models in which the quantity of each good consumed is assumed to be a continuous variable. In the continuous case, calculus methods can be used to determine the optimum amount chosen, and demand can be modeled empirically using regression analysis. On the other hand, discrete choice analysis examines situations in which the potential outcomes are discrete, such that the optimum is not characterized by standard first-order conditions. Thus, instead of examining “how much” as in problems with continuous choice variables, discrete choice analysis examines “which one.” However, discrete choice analysis can also be used to examine the chosen quantity when only a few distinct quantities must be chosen from, such as the number of vehicles a household chooses to own and the number of minutes of telecommunications service a customer decides to purchase. Techniques such as logistic regression and probit regression can be used for empirical analysis of discrete choice.

In statistics, binomial regression is a regression analysis technique in which the response has a binomial distribution: it is the number of successes in a series of independent Bernoulli trials, where each trial has probability of success $p$. In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.
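
In formulas: with $y_i \sim \mathrm{Binomial}(n_i, p_i)$, a link function $g$ ties the success probability to the covariates,

$$g(p_i) = x_i\beta,$$

with the logit and probit links being the most common choices.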

In statistics, errors-in-variables models or measurement error models are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.
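
The classical one-regressor case illustrates the consequence: if $y_i = \beta x_i^* + \varepsilon_i$ but only $x_i = x_i^* + \eta_i$ is observed, the OLS slope converges to

$$\operatorname{plim}\hat\beta = \beta\,\frac{\sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_{\eta}},$$

an attenuation toward zero.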

In statistics, ordinal regression is a type of regression analysis used for predicting an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. It can be considered an intermediate problem between regression and classification. Examples of ordinal regression are ordered logit and ordered probit. Ordinal regression turns up often in the social sciences, for example in the modeling of human levels of preference, as well as in information retrieval. In machine learning, ordinal regression may also be called ranking learning.

Partial (pooled) likelihood estimation for panel data is a quasi-maximum likelihood method for panel analysis that assumes that the density of $y_{it}$ given $x_{it}$ is correctly specified for each time period but allows for misspecification in the conditional density of $y_i \coloneqq (y_{i1},\dots,y_{iT})$ given $x_i \coloneqq (x_{i1},\dots,x_{iT})$.
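
The resulting estimator maximizes the sum of period-by-period log densities,

$$\hat\theta = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{t=1}^{T} \log f_t(y_{it} \mid x_{it}; \theta),$$

which is consistent when each period's conditional density is correctly specified, even if the dependence across $t$ is ignored.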

In linear panel analysis, it can be desirable to estimate the magnitude of the fixed effects, as they provide measures of the unobserved components. For instance, in wage equation regressions, fixed effects capture ability measures that are constant over time, such as motivation. Chamberlain's approach to unobserved effects models is a way of estimating the linear unobserved effects, under fixed effects assumptions, in the following unobserved effects model:

$$y_{it} = x_{it}\beta + c_i + u_{it},$$

where $c_i$ is the unobserved, time-constant individual effect and $u_{it}$ is an idiosyncratic error.

Control functions are statistical methods to correct for endogeneity problems by modelling the endogeneity in the error term. The approach thereby differs in important ways from other models that try to account for the same econometric problem. Instrumental variables, for example, attempt to model the endogenous variable X as a function of a relevant and exogenous instrument Z. Panel analysis uses special data properties to difference out unobserved heterogeneity that is assumed to be fixed over time.
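
A minimal two-equation sketch of the idea: with outcome $y = x\beta + u$ and first stage $x = z\pi + v$, where $u$ and $v$ are correlated, assuming $E(u \mid v) = \rho v$ leads to the augmented regression

$$y = x\beta + \rho\hat{v} + e,$$

in which the first-stage residual $\hat{v}$ absorbs, i.e. controls for, the endogeneity in $u$.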

Issues of heterogeneity in duration models can take on different forms. On the one hand, unobserved heterogeneity can play a crucial role when it comes to different sampling methods, such as stock or flow sampling. On the other hand, duration models have also been extended to allow for different subpopulations, with a strong link to mixture models. Many of these models impose the assumptions that the heterogeneity is independent of the observed covariates, that it has a distribution depending on a finite number of parameters only, and that it enters the hazard function multiplicatively.
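
The multiplicative structure referred to in the last sentence is typically the mixed proportional hazard form

$$\lambda(t \mid x, v) = v\,\lambda_0(t)\exp(x\beta),$$

where $v$ is the unobserved heterogeneity term, assumed independent of $x$.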

In statistics and econometrics, the maximum score estimator is a nonparametric estimator for discrete choice models developed by Charles Manski in 1975. Unlike the multinomial probit and multinomial logit estimators, it makes no assumptions about the distribution of the unobservable part of utility. However, its statistical properties are more complicated than those of the multinomial probit and logit models, making statistical inference difficult. To address these issues, Joel Horowitz proposed a variant, called the smoothed maximum score estimator.
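
In the binary choice case, the maximum score estimator maximizes the fraction of correctly predicted signs,

$$\hat\beta = \arg\max_{b}\frac{1}{N}\sum_{i=1}^{N}(2y_i - 1)\,\mathbf{1}\{x_i b \ge 0\},$$

subject to a scale normalization on $b$.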

Grouped duration data are widespread in many applications. Unemployment durations are typically measured over weeks or months, and these time intervals may be considered too large for continuous approximations to hold. In this case, we will typically have grouping points $a_1 < a_2 < \dots < a_M$, where $a_0 = 0$. Models allow for time-invariant and time-varying covariates, but the latter require stronger assumptions in terms of exogeneity. The discrete-time hazard function can be written as:

$$\lambda_m(x) = P\bigl(a_{m-1} \le T < a_m \mid T \ge a_{m-1}, x\bigr), \qquad m = 1, \dots, M.$$

In statistics, specifically regression analysis, a binary regression estimates a relationship between one or more explanatory variables and a single output binary variable. Generally the probability of the two alternatives is modeled, instead of simply outputting a single value, as in linear regression.

References

  1. Wooldridge, J. (2002): Econometric Analysis of Cross Section and Panel Data, MIT Press, Cambridge, Mass.
  2. Papke, L. E. and J. M. Wooldridge (1996): "Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates." Journal of Applied Econometrics (11), pp. 619–632
  3. Hausman, J. A. and G. K. Leonard (1997): "Superstars in the National Basketball Association: Economic Value and Policy." Journal of Labor Economics (15), pp. 586–624
  4. McCullagh, P. and J. A. Nelder (1989): Generalized Linear Models, CRC Monographs on Statistics and Applied Probability (Book 37), 2nd Edition, Chapman and Hall, London.
  5. Wooldridge, J. (2002): Econometric Analysis of Cross Section and Panel Data, MIT Press, Cambridge, Mass.
  6. Papke, L. E. and J. M. Wooldridge (2008): "Panel Data Methods for Fractional Response Variables with an Application to Test Pass Rates." Journal of Econometrics (145), pp. 121–133
  7. Wooldridge, J.M. (2005): "Unobserved heterogeneity and estimation of average partial effects." Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, ed. by Andrews, D.W.K. and J.H. Stock, Cambridge University Press, Cambridge, pp. 27–55