Stochastic frontier analysis

Stochastic frontier analysis (SFA) is a method of economic modeling. It has its starting point in the stochastic production frontier models simultaneously introduced by Aigner, Lovell and Schmidt (1977) and Meeusen and Van den Broeck (1977). [1]

The production frontier model without random component can be written as:

yi = f(xi, β) · TEi

where yi is the observed scalar output of producer i, i = 1, …, I; xi is a vector of N inputs used by producer i; β is a vector of technology parameters to be estimated; and f(xi, β) is the production frontier function.

TEi denotes the technical efficiency defined as the ratio of observed output to maximum feasible output. TEi = 1 shows that the i-th firm obtains the maximum feasible output, while TEi < 1 provides a measure of the shortfall of the observed output from maximum feasible output.
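Written out, the ratio definition amounts to the following (the numeric values in the comment are purely illustrative):

```latex
% Technical efficiency as the ratio of observed output to maximum feasible
% (frontier) output:
TE_i \;=\; \frac{y_i}{f(x_i, \beta)} \;\le\; 1
% Purely illustrative numbers: if f(x_i, \beta) = 10 and y_i = 8,
% then TE_i = 0.8, i.e. the producer attains 80 percent of its maximum
% feasible output.
```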

A stochastic component that describes random shocks affecting the production process is added. These shocks are not directly attributable to the producer or the underlying technology; they may come from weather changes, economic adversity or plain luck. We denote these effects with exp(vi). Each producer faces a different shock, but we assume the shocks are random and described by a common distribution.

The stochastic production frontier then becomes:

yi = f(xi, β) · TEi · exp(vi)
We assume that TEi is also a stochastic variable, with a specific distribution function, common to all producers.

We can also write technical efficiency in exponential form, TEi = exp(−ui), where ui ≥ 0, since we require TEi ≤ 1. Thus, we obtain the following equation:

yi = f(xi, β) · exp(−ui) · exp(vi)
Now, if we also assume that f(xi, β) takes the log-linear Cobb–Douglas form, the model can be written as:

ln yi = β0 + Σn βn ln xni + vi − ui
where vi is the “noise” component, which we will almost always consider to be a two-sided, normally distributed variable, and ui is the non-negative technical inefficiency component. Together they constitute a compound error term with a specific distribution to be determined; hence the model is often referred to as a “composed error model”.
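For concreteness, the following is a minimal sketch of how such a composed-error model can be estimated by maximum likelihood, assuming a half-normal distribution for ui and a normal distribution for vi (the standard Aigner–Lovell–Schmidt specification, whose composed-error density is (2/σ)φ(ε/σ)Φ(−ελ/σ) with σ² = σv² + σu² and λ = σu/σv). The sample size, parameter values and variable names are invented for illustration; in practice dedicated econometric software would normally be used.

```python
# Minimal sketch (not from the article): maximum-likelihood estimation of a
# normal / half-normal stochastic production frontier
#     ln y_i = b0 + b1 * ln x_i + v_i - u_i,
# with v_i ~ N(0, sigma_v^2) and u_i ~ |N(0, sigma_u^2)|.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
ln_x = rng.uniform(0.0, 2.0, n)              # log of a single input
v = rng.normal(0.0, 0.2, n)                  # symmetric noise
u = np.abs(rng.normal(0.0, 0.4, n))          # non-negative inefficiency
ln_y = 1.0 + 0.6 * ln_x + v - u              # "true" frontier: b0 = 1.0, b1 = 0.6

def neg_loglik(theta):
    """Negative log-likelihood of the composed error eps_i = v_i - u_i."""
    b0, b1, log_sv, log_su = theta
    sv, su = np.exp(log_sv), np.exp(log_su)  # enforce positive scale parameters
    sigma = np.sqrt(sv ** 2 + su ** 2)
    lam = su / sv
    eps = ln_y - b0 - b1 * ln_x
    # Density of eps: (2 / sigma) * phi(eps / sigma) * Phi(-eps * lam / sigma)
    ll = (np.log(2.0 / sigma)
          + norm.logpdf(eps / sigma)
          + norm.logcdf(-eps * lam / sigma))
    return -np.sum(ll)

start = np.array([0.0, 0.0, np.log(0.1), np.log(0.1)])
fit = minimize(neg_loglik, start, method="BFGS")
b0_hat, b1_hat = fit.x[0], fit.x[1]
sv_hat, su_hat = np.exp(fit.x[2]), np.exp(fit.x[3])
print("frontier coefficients:", b0_hat, b1_hat)
print("sigma_v, sigma_u:", sv_hat, su_hat)
```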

Stochastic frontier analysis has also examined "cost" and "profit" efficiency. [2] The "cost frontier" approach attempts to measure how far a firm is from full cost minimization (i.e., cost efficiency). In this case, the non-negative cost-inefficiency component is added rather than subtracted in the stochastic specification. "Profit frontier analysis" examines the case where producers are treated as profit maximizers (both output and inputs are chosen by the firm) rather than as cost minimizers (where the level of output is treated as exogenously given). The specification here is similar to that of the production frontier.
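Schematically, and assuming a log cost function (the symbols Ci for observed cost, wi for input prices and c(·) for the deterministic cost frontier are notation introduced here for illustration), the sign change can be sketched as:

```latex
% Stochastic cost frontier: observed cost lies on or above the minimum-cost
% frontier, so the non-negative inefficiency term enters with a plus sign.
\ln C_i = c(y_i, w_i; \beta) + v_i + u_i , \qquad u_i \ge 0
```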

Stochastic frontier analysis has also been applied to micro data on consumer demand in an attempt to benchmark consumption and segment consumers. In a two-stage approach, a stochastic frontier model is first estimated, and deviations from the frontier are subsequently regressed on consumer characteristics. [3]
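A rough sketch of such a two-stage exercise, in the same simulated style as above, might look as follows. It recovers each observation's expected inefficiency with the Jondrow–Lovell–Materov–Schmidt (JLMS) conditional mean for the normal/half-normal case, a standard device not taken from the cited study, and then regresses it on characteristics; all inputs here (eps_hat, Z, the parameter values) are synthetic placeholders.

```python
# Sketch of a two-stage procedure: stage 1 is assumed to have produced
# composed residuals and variance estimates; stage 2 computes E[u_i | eps_i]
# (JLMS) and regresses it on observable characteristics.
import numpy as np
from scipy.stats import norm

def jlms_inefficiency(eps, sigma_v, sigma_u):
    """E[u_i | eps_i] for a normal / half-normal frontier with eps_i = v_i - u_i."""
    sigma2 = sigma_v ** 2 + sigma_u ** 2
    mu_star = -eps * sigma_u ** 2 / sigma2
    sigma_star = sigma_u * sigma_v / np.sqrt(sigma2)
    z = mu_star / sigma_star
    return mu_star + sigma_star * norm.pdf(z) / norm.cdf(z)

rng = np.random.default_rng(1)
eps_hat = rng.normal(-0.3, 0.4, 200)   # stand-in for first-stage residuals
Z = rng.normal(size=(200, 2))          # stand-in for consumer characteristics
u_hat = jlms_inefficiency(eps_hat, sigma_v=0.2, sigma_u=0.4)

# Second stage: least-squares regression of estimated inefficiency on Z.
X = np.column_stack([np.ones(len(Z)), Z])
coefs, *_ = np.linalg.lstsq(X, u_hat, rcond=None)
print("second-stage coefficients:", coefs)
```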

Extensions: The two-tier stochastic frontier model

Polachek & Yoon (1987) introduced a three-component error structure, where one non-negative error term is added to, while the other is subtracted from, the zero-mean symmetric random disturbance. [4] This modeling approach attempts to measure the impact of informational inefficiencies (incomplete and imperfect information) on the prices of realized transactions, inefficiencies that in most cases characterize both parties in a transaction (hence the two inefficiency components, which disentangle the two effects).
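In equation form, and using notation analogous to the production frontier above (the linear form xi′β and the symbols wi and ui are illustrative, not the original paper's notation), the two-tier specification can be sketched as:

```latex
% Two-tier stochastic frontier: symmetric noise plus two one-sided error
% terms of opposite sign, one for each side of the transaction.
y_i = x_i'\beta + v_i + w_i - u_i , \qquad w_i \ge 0 ,\; u_i \ge 0
% v_i is the zero-mean symmetric disturbance; u_i and w_i capture the
% informational shortfalls of the two transacting parties.
```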

In the 2010s, various non-parametric and semi-parametric approaches were proposed in the literature, in which no parametric assumption on the functional form of the production relationship is made. [5] [6]


References

  1. Aigner, D. J.; Lovell, C. A. K.; Schmidt, P. (1977). "Formulation and estimation of stochastic frontier production functions". Journal of Econometrics, 6: 21–37.
  2. Kumbhakar, S. C.; Lovell, C. A. K. (2003). Stochastic Frontier Analysis. Cambridge University Press.
  3. Baltas, G. (2005). "Exploring Consumer Differences in Food Demand: A Stochastic Frontier Approach". British Food Journal, 107(9): 685–692.
  4. Polachek, S. W.; Yoon, B. J. (1987). "A two-tiered earnings frontier estimation of employer and employee information in the labor market". Review of Economics and Statistics, 69(2): 296–302.
  5. Parmeter, C. F.; Kumbhakar, S. C. (2014). "Efficiency Analysis: A Primer on Recent Advances". Foundations and Trends in Econometrics, 7(3–4): 191–385.
  6. Park, Byeong; Simar, Léopold; Zelenyuk, Valentin (2015). "Categorical data in local maximum likelihood: theory and applications to productivity analysis". Journal of Productivity Analysis, 43(2): 199–214. doi:10.1007/s11123-014-0394-y.
