# M-estimator

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. [1] Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation. A recent review study lists 48 examples of robust M-estimators. [2]

More generally, an M-estimator may be defined to be a zero of an estimating function. [3] [4] [5] [6] [7] [8] This estimating function is often the derivative of another statistical function. For example, a maximum-likelihood estimate is the point where the derivative of the likelihood function with respect to the parameter is zero; thus, a maximum-likelihood estimator is a critical point of the likelihood function, that is, a zero of the score function. [9] In many applications, such M-estimators can be thought of as estimating characteristics of the population.

## Historical motivation

The method of least squares is a prototypical M-estimator, since the estimator is defined as a minimum of the sum of squares of the residuals.

Another popular M-estimator is maximum-likelihood estimation. For a family of probability density functions f parameterized by θ, a maximum likelihood estimator of θ is computed for each set of data by maximizing the likelihood function over the parameter space {θ}. When the observations are independent and identically distributed, an ML estimate ${\displaystyle {\hat {\theta }}}$ satisfies

${\displaystyle {\widehat {\theta }}=\arg \max _{\displaystyle \theta }{\left(\prod _{i=1}^{n}f(x_{i},\theta )\right)}\,\!}$

or, equivalently,

${\displaystyle {\widehat {\theta }}=\arg \min _{\displaystyle \theta }{\left(\sum _{i=1}^{n}-\log {(f(x_{i},\theta ))}\right)}.\,\!}$

Maximum-likelihood estimators have optimal properties in the limit of infinitely many observations under rather general conditions, but may be biased and not the most efficient estimators for finite samples.
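As a minimal sketch of the equivalence between the two formulations above, the following Python snippet minimizes the summed negative log-density for an exponential model f(x, θ) = θ·exp(−θx). The model, the helper `golden_section_minimize`, and the sample values are illustrative assumptions, not taken from the text. For this model the numerical minimizer should agree with the known closed-form maximum likelihood estimate, the reciprocal of the sample mean.

```python
import math

def golden_section_minimize(func, lo, hi, tol=1e-10):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) < func(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2

def neg_log_likelihood(theta, data):
    """Sum of -log f(x_i, theta) for the assumed exponential density."""
    return sum(-(math.log(theta) - theta * x) for x in data)

# Hypothetical sample, used only for illustration.
sample = [0.5, 1.2, 0.3, 2.0, 0.9]

theta_hat = golden_section_minimize(
    lambda t: neg_log_likelihood(t, sample), 1e-6, 50.0
)
closed_form = len(sample) / sum(sample)  # 1 / sample mean
print(theta_hat, closed_form)
```

The search interval is an arbitrary bracket chosen wide enough to contain the minimizer; a general-purpose optimizer could be substituted for the hand-written search.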

## Definition

In 1964, Peter J. Huber proposed generalizing maximum likelihood estimation to the minimization of

${\displaystyle \sum _{i=1}^{n}\rho (x_{i},\theta ),\,\!}$

where ρ is a function with certain properties (see below). The solutions

${\displaystyle {\hat {\theta }}=\arg \min _{\displaystyle \theta }\left(\sum _{i=1}^{n}\rho (x_{i},\theta )\right)\,\!}$

are called M-estimators ("M" for "maximum likelihood-type" (Huber, 1981, page 43)); other types of robust estimators include L-estimators, R-estimators and S-estimators. Maximum likelihood estimators (MLE) are thus a special case of M-estimators. With suitable rescaling, M-estimators are special cases of extremum estimators (in which more general functions of the observations can be used).

The function ρ, or its derivative, ψ, can be chosen so as to give the estimator desirable properties (in terms of bias and efficiency) when the data are truly from the assumed distribution, and 'not bad' behaviour when the data are generated from a model that is, in some sense, close to the assumed distribution.

## Types

M-estimators are solutions, θ, which minimize

${\displaystyle \sum _{i=1}^{n}\rho (x_{i},\theta ).\,\!}$

This minimization can always be done directly. Often it is simpler to differentiate with respect to θ and solve for the root of the derivative. When this differentiation is possible, the M-estimator is said to be of ψ-type. Otherwise, the M-estimator is said to be of ρ-type.

In most practical cases, the M-estimators are of ψ-type.

### ρ-type

For positive integer r, let ${\displaystyle ({\mathcal {X}},\Sigma )}$ and ${\displaystyle (\Theta \subset \mathbb {R} ^{r},S)}$ be measure spaces. ${\displaystyle \theta \in \Theta }$ is a vector of parameters. An M-estimator of ρ-type ${\displaystyle T}$ is defined through a measurable function ${\displaystyle \rho :{\mathcal {X}}\times \Theta \rightarrow \mathbb {R} }$. It maps a probability distribution ${\displaystyle F}$ on ${\displaystyle {\mathcal {X}}}$ to the value ${\displaystyle T(F)\in \Theta }$ (if it exists) that minimizes ${\displaystyle \int _{\mathcal {X}}\rho (x,\theta )dF(x)}$:

${\displaystyle T(F):=\arg \min _{\theta \in \Theta }\int _{\mathcal {X}}\rho (x,\theta )dF(x)}$

For example, for the maximum likelihood estimator, ${\displaystyle \rho (x,\theta )=-\log(f(x,\theta ))}$, where ${\displaystyle f(x,\theta )={\frac {\partial F(x,\theta )}{\partial x}}}$.

### ψ-type

If ${\displaystyle \rho }$ is differentiable with respect to ${\displaystyle \theta }$, the computation of ${\displaystyle {\widehat {\theta }}}$ is usually much easier. An M-estimator of ψ-type T is defined through a measurable function ${\displaystyle \psi :{\mathcal {X}}\times \Theta \rightarrow \mathbb {R} ^{r}}$. It maps a probability distribution F on ${\displaystyle {\mathcal {X}}}$ to the value ${\displaystyle T(F)\in \Theta }$ (if it exists) that solves the vector equation:

${\displaystyle \int _{\mathcal {X}}\psi (x,\theta )\,dF(x)=0}$
${\displaystyle \int _{\mathcal {X}}\psi (x,T(F))\,dF(x)=0}$

For example, for the maximum likelihood estimator, ${\displaystyle \psi (x,\theta )=\left({\frac {\partial \log(f(x,\theta ))}{\partial \theta ^{1}}},\dots ,{\frac {\partial \log(f(x,\theta ))}{\partial \theta ^{r}}}\right)^{\mathrm {T} }}$, where ${\displaystyle u^{\mathrm {T} }}$ denotes the transpose of vector u and ${\displaystyle f(x,\theta )={\frac {\partial F(x,\theta )}{\partial x}}}$.

Such an estimator is not necessarily an M-estimator of ρ-type, but if ρ has a continuous first derivative with respect to ${\displaystyle \theta }$, then a necessary condition for an M-estimator of ψ-type to be an M-estimator of ρ-type is ${\displaystyle \psi (x,\theta )=\nabla _{\theta }\rho (x,\theta )}$. The previous definitions can easily be extended to finite samples.
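As a worked illustration of the finite-sample extension (a sketch, writing ${\displaystyle F_{n}}$ for the empirical distribution of the observations ${\displaystyle x_{1},\dots ,x_{n}}$, a symbol introduced here and not used above), integrating against ${\displaystyle F_{n}}$ turns the integral into a sample average:

${\displaystyle \int _{\mathcal {X}}\rho (x,\theta )\,dF_{n}(x)={\frac {1}{n}}\sum _{i=1}^{n}\rho (x_{i},\theta ),\qquad T(F_{n})=\arg \min _{\theta \in \Theta }{\frac {1}{n}}\sum _{i=1}^{n}\rho (x_{i},\theta ).}$

Since the factor 1/n does not change the minimizer, ${\displaystyle T(F_{n})}$ coincides with the estimator ${\displaystyle {\hat {\theta }}}$ of the Definition section. In the same way, the ψ-type equation becomes

${\displaystyle {\frac {1}{n}}\sum _{i=1}^{n}\psi (x_{i},T(F_{n}))=0.}$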

If the function ψ decreases to zero as ${\displaystyle x\rightarrow \pm \infty }$, the estimator is called redescending. Such estimators have some additional desirable properties, such as complete rejection of gross outliers.
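To illustrate the distinction, the sketch below compares two commonly used ψ functions for a location residual r = x − θ, neither of which is defined in the text above: Huber's ψ, which is bounded but does not redescend, and Tukey's biweight ψ, which is exactly zero beyond a cutoff. The function names and the tuning constants 1.345 and 4.685 (conventional defaults) are assumptions made for illustration.

```python
def huber_psi(r, k=1.345):
    """Huber's psi: linear near zero, clipped to [-k, k]; it does not redescend."""
    return max(-k, min(k, r))

def tukey_biweight_psi(r, c=4.685):
    """Tukey's biweight psi: returns to zero for |r| > c, so it is redescending."""
    if abs(r) > c:
        return 0.0
    return r * (1 - (r / c) ** 2) ** 2

# Small, moderate, and gross residuals (hypothetical values).
for r in (0.5, 3.0, 100.0):
    print(r, huber_psi(r), tukey_biweight_psi(r))
```

For the gross residual, the Huber ψ stays at its bound k, so the observation keeps a limited influence, whereas the biweight ψ is zero and the observation is rejected entirely.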

## Computation

For many choices of ρ or ψ, no closed form solution exists and an iterative approach to computation is required. It is possible to use standard function optimization algorithms, such as Newton–Raphson. However, in most cases an iteratively re-weighted least squares fitting algorithm can be performed; this is typically the preferred method.

For some choices of ψ, specifically, redescending functions, the solution may not be unique. The issue is particularly relevant in multivariate and regression problems. Thus, some care is needed to ensure that good starting points are chosen. Robust starting points, such as the median as an estimate of location and the median absolute deviation as a univariate estimate of scale, are common.
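The iteratively re-weighted scheme and the robust starting values described above can be sketched as follows for a univariate location estimate. This is a minimal illustration under stated assumptions: Huber's ψ with tuning constant 1.345 is an assumed choice, the scale is fixed at the median absolute deviation multiplied by the normal-consistency factor 1.4826, and the function name `huber_location` is hypothetical.

```python
import statistics

def huber_location(data, k=1.345, tol=1e-10, max_iter=200):
    """Huber-type location estimate by iteratively re-weighted averaging."""
    mu = statistics.median(data)  # robust starting point for location
    mad = statistics.median([abs(x - mu) for x in data])
    scale = 1.4826 * mad  # robust scale, held fixed during the iteration
    if scale == 0:
        return mu  # degenerate case: at least half the data equal the median
    for _ in range(max_iter):
        weights = []
        for x in data:
            r = abs(x - mu) / scale
            # Weight psi(r) / r for Huber's psi: 1 inside [-k, k], k / |r| outside.
            weights.append(1.0 if r <= k else k / r)
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical sample with one gross outlier.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
print(huber_location(sample), statistics.mean(sample))
```

Each pass solves a weighted least-squares problem, which for a location parameter is simply a weighted mean. Because Huber's ψ is monotone, the starting point matters less here than it would for a redescending ψ.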

### Concentrating parameters

In computation of M-estimators, it is sometimes useful to rewrite the objective function so that the dimension of parameters is reduced. The procedure is called “concentrating” or “profiling”. Examples in which concentrating parameters increases computation speed include seemingly unrelated regressions (SUR) models. [10] Consider the following M-estimation problem:

${\displaystyle ({\hat {\beta }}_{n},{\hat {\gamma }}_{n}):=\arg \max _{\beta ,\gamma }\textstyle \sum _{i=1}^{N}\displaystyle q(w_{i},\beta ,\gamma )}$

Assuming differentiability of the function q, the M-estimator solves the first order conditions:

${\displaystyle \sum _{i=1}^{N}\triangledown _{\beta }\,q(w_{i},\beta ,\gamma )=0}$
${\displaystyle \sum _{i=1}^{N}\triangledown _{\gamma }\,q(w_{i},\beta ,\gamma )=0}$

Now, if we can solve the second equation for γ in terms of ${\displaystyle W:=(w_{1},w_{2},..,w_{N})}$ and ${\displaystyle \beta }$, the second equation becomes:

${\displaystyle \sum _{i=1}^{N}\triangledown _{\gamma }\,q(w_{i},\beta ,g(W,\beta ))=0}$

where g is some function to be found. Now, we can rewrite the original objective function solely in terms of β by inserting the function g into the place of ${\displaystyle \gamma }$. As a result, there is a reduction in the number of parameters.

Whether this procedure can be carried out depends on the particular problem at hand. However, when it is possible, concentrating parameters can facilitate computation to a great degree. For example, in estimating a SUR model of 6 equations with 5 explanatory variables in each equation by maximum likelihood, the number of parameters declines from 51 to 30. [10]

Despite its appealing computational features, concentrating parameters is of limited use in deriving asymptotic properties of the M-estimator. [11] The presence of W in each summand of the objective function makes it difficult to apply the law of large numbers and the central limit theorem.
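As a worked illustration of the concentration step (a sketch under assumptions not stated in the text: scalar observations ${\displaystyle w_{i}}$, a location parameter β, a variance parameter γ > 0, and a Gaussian-type objective), take

${\displaystyle q(w_{i},\beta ,\gamma )=-{\tfrac {1}{2}}\log \gamma -{\frac {(w_{i}-\beta )^{2}}{2\gamma }}.}$

The first order condition with respect to γ is

${\displaystyle \sum _{i=1}^{N}\left(-{\frac {1}{2\gamma }}+{\frac {(w_{i}-\beta )^{2}}{2\gamma ^{2}}}\right)=0,}$

which can be solved explicitly:

${\displaystyle \gamma =g(W,\beta )={\frac {1}{N}}\sum _{i=1}^{N}(w_{i}-\beta )^{2}.}$

Substituting g for γ gives the concentrated objective

${\displaystyle \sum _{i=1}^{N}q(w_{i},\beta ,g(W,\beta ))=-{\frac {N}{2}}\log g(W,\beta )-{\frac {N}{2}},}$

which depends on β alone and is maximized by minimizing ${\displaystyle \textstyle \sum _{i=1}^{N}(w_{i}-\beta )^{2}}$, that is, by the sample mean. The two-parameter problem has been reduced to a one-parameter problem, but each summand now involves the whole sample W through g, which is the difficulty for asymptotic arguments noted above.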

## Properties

### Distribution

It can be shown that M-estimators are asymptotically normally distributed. As such, Wald-type approaches to constructing confidence intervals and hypothesis tests can be used. However, since the theory is asymptotic, it will frequently be sensible to check the distribution, perhaps by examining the permutation or bootstrap distribution.
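One way to carry out such a check is to resample the data with replacement and recompute the estimator on each resample. The sketch below is a minimal illustration using only the Python standard library; the function names, the number of resamples, the fixed seed, and the sample values are assumptions chosen for the example.

```python
import random
import statistics

def bootstrap_estimates(data, estimator, n_boot=2000, seed=0):
    """Recompute the estimator on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [
        estimator([rng.choice(data) for _ in range(n)])
        for _ in range(n_boot)
    ]

def percentile_interval(estimates, level=0.95):
    """Simple percentile interval read off the sorted bootstrap estimates."""
    ordered = sorted(estimates)
    alpha = (1 - level) / 2
    lo = ordered[int(alpha * (len(ordered) - 1))]
    hi = ordered[int((1 - alpha) * (len(ordered) - 1))]
    return lo, hi

# Hypothetical sample; the median is used as an example of an M-estimator.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 40.0, 3.1, 4.5]
reps = bootstrap_estimates(sample, statistics.median)
print(percentile_interval(reps))
```

A histogram or normal quantile plot of `reps` can then be compared with the normal approximation before relying on Wald-type intervals.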

### Influence function

The influence function of an M-estimator of ${\displaystyle \psi }$-type is proportional to its defining ${\displaystyle \psi }$ function.

Let T be an M-estimator of ψ-type, and G be a probability distribution for which ${\displaystyle T(G)}$ is defined. Its influence function IF is

${\displaystyle \operatorname {IF} (x;T,G)=-{\frac {\psi (x,T(G))}{\int \left[{\frac {\partial \psi (y,\theta )}{\partial \theta }}\right]f(y)\mathrm {d} y}}}$

assuming the density function ${\displaystyle f(y)}$ exists. A proof of this property of M-estimators can be found in Huber (1981, Section 3.2).
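As a worked illustration of the formula (a sketch for the scalar case), take ${\displaystyle \rho (x,\theta )=(x-\theta )^{2}/2}$, so that ${\displaystyle \psi (x,\theta )=\partial \rho /\partial \theta =\theta -x}$ and T(G) is the mean of G. Then ${\displaystyle \partial \psi (y,\theta )/\partial \theta =1}$, the denominator equals ${\displaystyle \textstyle \int f(y)\,\mathrm {d} y=1}$, and

${\displaystyle \operatorname {IF} (x;T,G)=-{\frac {T(G)-x}{1}}=x-T(G).}$

This influence function is unbounded in x, so a single observation can move the mean arbitrarily far. A bounded ψ gives a bounded influence function, because the influence function is proportional to ψ.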

## Applications

M-estimators can be constructed for location parameters and scale parameters in univariate and multivariate settings, as well as being used in robust regression.

## Examples

### Mean

Let (X1, ..., Xn) be a set of independent, identically distributed random variables, with distribution F.

If we define

${\displaystyle \rho (x,\theta )={\frac {(x-\theta )^{2}}{2}},\,\!}$

we note that the sum ${\displaystyle \textstyle \sum _{i=1}^{n}\rho (X_{i},\theta )}$ is minimized when θ is the mean of the Xs. Thus the mean is an M-estimator of ρ-type, with this ρ function.

As this ρ function is continuously differentiable in θ, the mean is thus also an M-estimator of ψ-type for ψ(x, θ) = θ − x.
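The ψ-type characterization can be checked numerically. The sketch below solves the estimating equation Σ ψ(x_i, θ) = 0 with ψ(x, θ) = θ − x by bisection and recovers the sample mean. The solver `solve_psi_equation` and the sample values are hypothetical, and the solver assumes that the summed ψ is nondecreasing in θ and changes sign on the bracket.

```python
def solve_psi_equation(data, psi, lo, hi, tol=1e-12):
    """Find theta in [lo, hi] with sum_i psi(x_i, theta) = 0 by bisection.

    Assumes the sum is nondecreasing in theta and changes sign on [lo, hi].
    """
    def total(theta):
        return sum(psi(x, theta) for x in data)

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < 0:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies at or to the left of mid
    return (lo + hi) / 2

def psi_mean(x, theta):
    """Derivative with respect to theta of rho(x, theta) = (x - theta)^2 / 2."""
    return theta - x

# Hypothetical sample, used only for illustration.
sample = [2.0, 9.0, 4.0, 7.0, 3.0]
theta_hat = solve_psi_equation(sample, psi_mean, min(sample), max(sample))
print(theta_hat, sum(sample) / len(sample))
```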

### Median

For the median estimation of (X1, ..., Xn), instead we can define the ρ function as

${\displaystyle \rho (x,\theta )=|x-\theta |}$

and similarly, the sum ${\displaystyle \textstyle \sum _{i=1}^{n}\rho (X_{i},\theta )}$ is minimized when θ is the median of the Xs.

While this ρ function is not differentiable in θ at θ = x, a ψ-type M-estimator can still be defined through a subgradient of the ρ function. Up to an overall sign, which does not change the roots of the estimating equation, it can be expressed as

${\displaystyle \psi (x,\theta )=\operatorname {sgn}(x-\theta )}$

or, as a set-valued function,

${\displaystyle \psi (x,\theta )={\begin{cases}\{-1\},&{\mbox{if }}x-\theta <0\\\{1\},&{\mbox{if }}x-\theta >0\\\left[-1,1\right],&{\mbox{if }}x-\theta =0\end{cases}}}$
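Because this ρ is not smooth, a direct ρ-type minimization is a natural check. The summed objective is piecewise linear and convex in θ with kinks at the observations, so a minimum is attained at one of the data points; the sketch below therefore evaluates the objective only at those candidates. The helper `rho_type_estimate` and the sample values are hypothetical.

```python
import statistics

def rho_type_estimate(data, rho, candidates):
    """Return the candidate theta that minimizes sum_i rho(x_i, theta)."""
    return min(candidates, key=lambda theta: sum(rho(x, theta) for x in data))

def rho_abs(x, theta):
    """Absolute-deviation rho from the median example."""
    return abs(x - theta)

# Hypothetical sample with an odd number of points, so the median is unique.
sample = [2.0, 9.0, 4.0, 7.0, 100.0]
theta_hat = rho_type_estimate(sample, rho_abs, sample)
print(theta_hat, statistics.median(sample))
```

With an even number of observations every point between the two middle values minimizes the objective, so the search returns one of the two middle observations rather than their average.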


## References

1. Hayashi, Fumio (2000). "Extremum Estimators". Econometrics. Princeton University Press. ISBN   0-691-01018-8.
2. De Menezes, Diego Q.F. (2021). "A review on robust M-estimators for regression analysis". Computers & Chemical Engineering. 147 (1): 1–30. doi:10.1016/j.compchemeng.2021.107254.
3. Vidyadhar P. Godambe, editor. Estimating functions, volume 7 of Oxford Statistical Science Series. The Clarendon Press Oxford University Press, New York, 1991.
4. Christopher C. Heyde. Quasi-likelihood and its application: A general approach to optimal parameter estimation. Springer Series in Statistics. Springer-Verlag, New York, 1997.
5. D. L. McLeish and Christopher G. Small. The theory and applications of statistical inference functions, volume 44 of Lecture Notes in Statistics. Springer-Verlag, New York, 1988.
6. Parimal Mukhopadhyay. An Introduction to Estimating Functions. Alpha Science International, Ltd, 2004.
7. Christopher G. Small and Jinfang Wang. Numerical methods for nonlinear estimating equations, volume 29 of Oxford Statistical Science Series. The Clarendon Press Oxford University Press, New York, 2003.
8. Sara A. van de Geer. Empirical Processes in M-estimation: Applications of empirical process theory, volume 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000.
9. Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the American Statistical Association. 77 (380): 831–834. doi:10.1080/01621459.1982.10477894. JSTOR   2287314.
10. Giles, D. E. (July 10, 2012). "Concentrating, or Profiling, the Likelihood Function".
11. Wooldridge, J. M. (2001). . Cambridge, Mass.: MIT Press. ISBN   0-262-23219-7.
• Andersen, Robert (2008). Modern Methods for Robust Regression. Quantitative Applications in the Social Sciences. Vol. 152. Los Angeles, CA: Sage Publications. ISBN   978-1-4129-4072-6.
• Godambe, V. P. (1991). Estimating functions. Oxford Statistical Science Series. Vol. 7. New York: Clarendon Press. ISBN   978-0-19-852228-7.
• Heyde, Christopher C. (1997). Quasi-likelihood and its application: A general approach to optimal parameter estimation. Springer Series in Statistics. New York: Springer. doi:10.1007/b98823. ISBN   978-0-387-98225-0.
• Huber, Peter J. (2009). Robust Statistics (2nd ed.). Hoboken, NJ: John Wiley & Sons Inc. ISBN   978-0-470-12990-6.
• Hoaglin, David C.; Frederick Mosteller; John W. Tukey (1983). Understanding Robust and Exploratory Data Analysis. Hoboken, NJ: John Wiley & Sons Inc. ISBN   0-471-09777-2.
• McLeish, D.L.; Christopher G. Small (1989). The theory and applications of statistical inference functions. Lecture Notes in Statistics. Vol. 44. New York: Springer. ISBN   978-0-387-96720-2.
• Mukhopadhyay, Parimal (2004). An Introduction to Estimating Functions. Harrow, UK: Alpha Science International, Ltd. ISBN   978-1-84265-163-6.
• Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), "Section 15.7. Robust Estimation", Numerical Recipes: The Art of Scientific Computing (3rd ed.), New York: Cambridge University Press, ISBN   978-0-521-88068-8
• Serfling, Robert J. (2002). Approximation theorems of mathematical statistics. Wiley Series in Probability and Mathematical Statistics. Hoboken, NJ: John Wiley & Sons Inc. ISBN   978-0-471-21927-9.
• Shapiro, Alexander (2000). "On the asymptotics of constrained local M-estimators". Annals of Statistics. 28 (3): 948–960. CiteSeerX  . doi:10.1214/aos/1015952006. JSTOR   2674061. MR   1792795.
• Small, Christopher G.; Jinfang Wang (2003). Numerical methods for nonlinear estimating equations. Oxford Statistical Science Series. Vol. 29. New York: Oxford University Press. ISBN   978-0-19-850688-1.
• van de Geer, Sara A. (2000). Empirical Processes in M-estimation: Applications of empirical process theory. Cambridge Series in Statistical and Probabilistic Mathematics. Vol. 6. Cambridge, UK: Cambridge University Press. doi:10.2277/052165002X. ISBN   978-0-521-65002-1.
• Wilcox, R. R. (2003). Applying contemporary statistical techniques. San Diego, CA: Academic Press. pp. 55–79.
• Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing, 3rd Ed. San Diego, CA: Academic Press.
• M-estimators — an introduction to the subject by Zhengyou Zhang