Stochastic frontier analysis

Stochastic frontier analysis (SFA) is a method of economic modeling. It has its starting point in the stochastic production frontier models simultaneously introduced by Aigner, Lovell and Schmidt (1977) and Meeusen and Van den Broeck (1977).

The production frontier model without a random component can be written as:

$$ y_i = f(x_i; \beta) \, TE_i $$

where y_i is the observed scalar output of producer i, i = 1, ..., I; x_i is a vector of N inputs used by producer i; β is a vector of technology parameters to be estimated; and f(x_i; β) is the production frontier function.

TE_i denotes the technical efficiency, defined as the ratio of observed output to maximum feasible output. TE_i = 1 shows that the i-th firm obtains the maximum feasible output, while TE_i < 1 provides a measure of the shortfall of the observed output from the maximum feasible output.
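
Written out, this definition of technical efficiency is simply the ratio of observed output to the frontier:

$$ TE_i = \frac{y_i}{f(x_i; \beta)} \le 1 \,. $$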

A stochastic component that describes random shocks affecting the production process is added. These shocks are not directly attributable to the producer or the underlying technology; they may come from weather changes, economic adversities or plain luck. We denote these effects with exp{v_i}. Each producer faces a different shock, but we assume the shocks are random and described by a common distribution.

The stochastic production frontier will become:

$$ y_i = f(x_i; \beta) \, TE_i \, \exp\{v_i\} \,. $$

We assume that TE_i is also a stochastic variable, with a specific distribution function common to all producers.

We can also write it in exponential form as TE_i = exp{-u_i}, where u_i ≥ 0, since we required TE_i ≤ 1. Thus, we obtain the following equation:

$$ y_i = f(x_i; \beta) \, \exp\{-u_i\} \, \exp\{v_i\} \,. $$

Now, if we also assume that f(x_i, β) takes the log-linear Cobb–Douglas form, the model can be written as:

$$ \ln y_i = \beta_0 + \sum_{n=1}^{N} \beta_n \ln x_{ni} + v_i - u_i \,, $$

where v_i is the "noise" component, which we will almost always consider to be a two-sided normally distributed variable, and u_i is the non-negative technical inefficiency component. Together they constitute a compound error term with a specific distribution to be determined; hence the name "composed error model", as this specification is often referred to.
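
A minimal sketch of how such a composed-error model is commonly estimated is given below: the normal/half-normal specification (normal noise v_i, half-normal inefficiency u_i) is fitted by maximum likelihood, and firm-level technical efficiency is then recovered from the residuals with the Jondrow–Lovell–Materov–Schmidt (JLMS) conditional-mean formula. The distributional choice, the simulated data, and all variable names are illustrative assumptions, not taken from this article; in practice dedicated routines (e.g. the R package frontier or Stata's frontier command) would normally be used.

```python
# Minimal sketch: normal / half-normal stochastic production frontier,
#   ln y_i = x_i' beta + v_i - u_i,   v_i ~ N(0, s_v^2),   u_i ~ |N(0, s_u^2)|,
# estimated by maximum likelihood, with technical efficiency recovered as
# exp(-E[u_i | eps_i]) using the JLMS conditional mean.  X and lny are placeholders.

import numpy as np
from scipy import optimize, stats

def neg_loglik(theta, X, lny):
    """Negative log-likelihood of the normal/half-normal composed error model."""
    k = X.shape[1]
    beta = theta[:k]
    sigma_v = np.exp(theta[k])       # log-parameterised to keep both sigmas positive
    sigma_u = np.exp(theta[k + 1])
    eps = lny - X @ beta             # composed error eps_i = v_i - u_i
    sigma = np.hypot(sigma_u, sigma_v)   # sqrt(s_u^2 + s_v^2)
    lam = sigma_u / sigma_v              # lambda = s_u / s_v
    ll = (np.log(2.0) - np.log(sigma)
          + stats.norm.logpdf(eps / sigma)
          + stats.norm.logcdf(-lam * eps / sigma))
    return -ll.sum()

def jlms_efficiency(theta, X, lny):
    """Technical efficiency TE_i = exp(-E[u_i | eps_i])."""
    k = X.shape[1]
    beta, sigma_v, sigma_u = theta[:k], np.exp(theta[k]), np.exp(theta[k + 1])
    eps = lny - X @ beta
    sigma2 = sigma_u**2 + sigma_v**2
    mu_star = -eps * sigma_u**2 / sigma2
    sigma_star = sigma_u * sigma_v / np.sqrt(sigma2)
    z = mu_star / sigma_star
    u_hat = mu_star + sigma_star * stats.norm.pdf(z) / stats.norm.cdf(z)
    return np.exp(-u_hat)

# --- illustrative use on simulated data (placeholder, not real data) ---
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # constant + two log-inputs
beta_true = np.array([1.0, 0.4, 0.5])
v = rng.normal(0.0, 0.2, n)
u = np.abs(rng.normal(0.0, 0.3, n))
lny = X @ beta_true + v - u

theta0 = np.r_[np.linalg.lstsq(X, lny, rcond=None)[0], np.log(0.1), np.log(0.1)]
res = optimize.minimize(neg_loglik, theta0, args=(X, lny), method="BFGS")
te = jlms_efficiency(res.x, X, lny)
print("beta_hat:", res.x[:3], " mean TE:", te.mean())
```

The log-parameterisation of the two standard deviations is only a convenience to keep the optimiser in the admissible region; parameterisations in terms of σ and λ, or σ² and γ = σ_u²/σ², are equally common.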

Stochastic frontier analysis has also been used to examine "cost" and "profit" efficiency (see Kumbhakar & Lovell 2003). The "cost frontier" approach attempts to measure how far the firm is from full cost minimization (i.e. cost efficiency). Modeling-wise, the non-negative cost-inefficiency component is added to, rather than subtracted from, the stochastic specification. "Profit frontier analysis" examines the case where producers are treated as profit maximizers (both output and inputs are to be decided by the firm) rather than as cost minimizers (where the level of output is taken as exogenously given). The specification here is similar to the "production frontier" one.
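
As an illustration, a standard textbook form of the log cost frontier (in the spirit of the models surveyed in Kumbhakar & Lovell 2003; the cost function c(·), observed cost C_i and input price vector w_i are notation introduced here for the sketch) adds the non-negative inefficiency term:

$$ \ln C_i = c(y_i, w_i; \beta) + v_i + u_i, \qquad u_i \ge 0 \,, $$

so that u_i measures how far observed cost sits above minimum cost.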

Stochastic frontier analysis has also been applied to micro data on consumer demand in an attempt to benchmark consumption and segment consumers. In a two-stage approach, a stochastic frontier model is first estimated and the deviations from the frontier are subsequently regressed on consumer characteristics (Baltas 2005).
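
A sketch of the second stage only is given below: estimated frontier deviations (here a placeholder array u_hat standing in for first-stage SFA estimates) are regressed on consumer characteristics by ordinary least squares. All data and names are illustrative assumptions.

```python
# Second stage of the two-stage approach: regress first-stage frontier deviations
# on consumer characteristics Z.  u_hat and Z below are placeholders.
import numpy as np

rng = np.random.default_rng(1)
n = 500
u_hat = rng.gamma(shape=2.0, scale=0.1, size=n)             # placeholder first-stage deviations
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # constant + 3 characteristics

gamma_hat, *_ = np.linalg.lstsq(Z, u_hat, rcond=None)       # second-stage OLS coefficients
print("second-stage coefficients:", gamma_hat)
```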

Extensions: The two-tier stochastic frontier model

Polachek & Yoon (1987) introduced a three-component error structure, where one non-negative error term is added to, while the other is subtracted from, the zero-mean symmetric random disturbance. This modeling approach attempts to measure the impact of informational inefficiencies (incomplete and imperfect information) on the prices of realized transactions, inefficiencies that in most cases characterize both parties in a transaction (hence the two inefficiency components, which disentangle the two effects).
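
Concretely, in a linear regression setting the two-tier error structure can be sketched as (illustrative notation, not Polachek & Yoon's own):

$$ y_i = x_i'\beta + v_i - u_i + w_i, \qquad u_i \ge 0, \; w_i \ge 0 \,, $$

where v_i is the zero-mean symmetric disturbance and u_i, w_i are the two non-negative informational-inefficiency components, one for each side of the transaction.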

Recently, various non-parametric and semi-parametric approaches have been proposed in the literature, in which no parametric assumption on the functional form of the production relationship is made; see for example Parmeter and Kumbhakar (2014) and Park, Simar and Zelenyuk (2015) [1] and the references cited therein.

References