# Method of moments (statistics)

In statistics, the method of moments is a method of estimating population parameters.

It starts by expressing the population moments (i.e., the expected values of powers of the random variable under consideration) as functions of the parameters of interest. Those expressions are then set equal to the sample moments. The number of such equations is the same as the number of parameters to be estimated. Those equations are then solved for the parameters of interest. The solutions are estimates of those parameters.

The method of moments was introduced by Pafnuty Chebyshev in 1887 in the proof of the central limit theorem. The idea of matching empirical moments of a distribution to the population moments dates back at least to Karl Pearson.

## Method

Suppose that the problem is to estimate ${\displaystyle k}$ unknown parameters ${\displaystyle \theta _{1},\theta _{2},\dots ,\theta _{k}}$ characterizing the distribution ${\displaystyle f_{W}(w;\theta )}$ of the random variable ${\displaystyle W}$.[1] Suppose the first ${\displaystyle k}$ moments of the true distribution (the "population moments") can be expressed as functions of the ${\displaystyle \theta }$s:

${\displaystyle {\begin{aligned}\mu _{1}&\equiv \operatorname {E} [W]=g_{1}(\theta _{1},\theta _{2},\ldots ,\theta _{k}),\\[4pt]\mu _{2}&\equiv \operatorname {E} [W^{2}]=g_{2}(\theta _{1},\theta _{2},\ldots ,\theta _{k}),\\&\,\,\,\vdots \\\mu _{k}&\equiv \operatorname {E} [W^{k}]=g_{k}(\theta _{1},\theta _{2},\ldots ,\theta _{k}).\end{aligned}}}$

Suppose a sample of size ${\displaystyle n}$ is drawn, resulting in the values ${\displaystyle w_{1},\dots ,w_{n}}$. For ${\displaystyle j=1,\dots ,k}$, let

${\displaystyle {\widehat {\mu }}_{j}={\frac {1}{n}}\sum _{i=1}^{n}w_{i}^{j}}$

be the ${\displaystyle j}$-th sample moment, an estimate of ${\displaystyle \mu _{j}}$. The method-of-moments estimator for ${\displaystyle \theta _{1},\theta _{2},\ldots ,\theta _{k}}$, denoted by ${\displaystyle {\widehat {\theta }}_{1},{\widehat {\theta }}_{2},\dots ,{\widehat {\theta }}_{k}}$, is defined as the solution (if one exists) to the equations:

${\displaystyle {\begin{aligned}{\widehat {\mu }}_{1}&=g_{1}({\widehat {\theta }}_{1},{\widehat {\theta }}_{2},\ldots ,{\widehat {\theta }}_{k}),\\[4pt]{\widehat {\mu }}_{2}&=g_{2}({\widehat {\theta }}_{1},{\widehat {\theta }}_{2},\ldots ,{\widehat {\theta }}_{k}),\\&\,\,\,\vdots \\{\widehat {\mu }}_{k}&=g_{k}({\widehat {\theta }}_{1},{\widehat {\theta }}_{2},\ldots ,{\widehat {\theta }}_{k}).\end{aligned}}}$
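As a concrete illustration of solving the moment equations, here is a minimal Python sketch for the normal model ${\displaystyle N(\mu ,\sigma ^{2})}$ (my own choice of example, not one singled out in this article), for which ${\displaystyle g_{1}(\mu ,\sigma )=\mu }$ and ${\displaystyle g_{2}(\mu ,\sigma )=\sigma ^{2}+\mu ^{2}}$; the helper name `mom_normal` is hypothetical.

```python
import math

def mom_normal(samples):
    """Method-of-moments estimates for a normal model N(mu, sigma^2).

    Population moments: E[W] = mu and E[W^2] = sigma^2 + mu^2, so the
    two moment equations invert in closed form:
        mu_hat = m1,  sigma_hat = sqrt(m2 - m1^2).
    """
    n = len(samples)
    m1 = sum(samples) / n                # first sample moment
    m2 = sum(w * w for w in samples) / n # second sample moment
    mu_hat = m1
    sigma_hat = math.sqrt(m2 - m1 ** 2)  # biased: divides by n, not n - 1
    return mu_hat, sigma_hat
```

For the two-point sample `[2.0, 4.0]` the moments are ${\displaystyle m_{1}=3}$ and ${\displaystyle m_{2}=10}$, giving ${\displaystyle {\widehat {\mu }}=3}$ and ${\displaystyle {\widehat {\sigma }}=1}$, which illustrates the bias of the variance estimate for small samples.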

The method of moments is fairly simple and yields consistent estimators (under very weak assumptions), though these estimators are often biased.

It is an alternative to the method of maximum likelihood. In some cases, however, the likelihood equations may be intractable without computers, whereas method-of-moments estimators can be computed much more quickly and easily. Because of this easy computability, method-of-moments estimates may be used as a first approximation to the solutions of the likelihood equations, and successive improved approximations may then be found by the Newton–Raphson method. In this way the method of moments can assist in finding maximum likelihood estimates.

In some cases, infrequent with large samples but not so infrequent with small samples, the estimates given by the method of moments fall outside the parameter space (as shown in the example below); it then does not make sense to rely on them. That problem never arises with the method of maximum likelihood. Also, method-of-moments estimates are not necessarily sufficient statistics; that is, they sometimes fail to take into account all relevant information in the sample.

When estimating other structural parameters (e.g., parameters of a utility function, instead of parameters of a known probability distribution), appropriate probability distributions may not be known, and moment-based estimates may be preferred to maximum likelihood estimation.

## Examples

An example application of the method of moments is estimating polynomial probability density functions. In this case, an approximating polynomial of order ${\displaystyle N}$ is defined on an interval ${\displaystyle [a,b]}$. The method of moments then yields a system of equations whose solution involves the inversion of a Hankel matrix.[2]
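To sketch how such a Hankel system arises (my own illustrative reconstruction, not the exact formulation of [2]): matching the moments ${\displaystyle \int _{a}^{b}x^{i}p(x)\,dx=m_{i}}$ of a polynomial ${\displaystyle p(x)=\sum _{j}c_{j}x^{j}}$ gives a linear system whose matrix ${\displaystyle H_{ij}=(b^{i+j+1}-a^{i+j+1})/(i+j+1)}$ depends only on ${\displaystyle i+j}$, i.e. is a Hankel matrix.

```python
def hankel_poly_density(moments, a=0.0, b=1.0):
    """Fit p(x) = sum_j c_j x^j on [a, b] so that its moments
    int x^i p(x) dx match the given values m_0, ..., m_N.

    Matching gives the Hankel system H c = m with
    H[i][j] = (b**(i+j+1) - a**(i+j+1)) / (i + j + 1).
    """
    n = len(moments)
    H = [[(b ** (i + j + 1) - a ** (i + j + 1)) / (i + j + 1)
          for j in range(n)] for i in range(n)]
    m = list(moments)
    # solve H c = m by Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(H[r][col]))
        H[col], H[piv] = H[piv], H[col]
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = H[r][col] / H[col][col]
            for c in range(col, n):
                H[r][c] -= factor * H[col][c]
            m[r] -= factor * m[col]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(H[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r] - tail) / H[r][r]
    return coeffs
```

For example, the density ${\displaystyle p(x)=2x}$ on ${\displaystyle [0,1]}$ has moments ${\displaystyle (1,2/3,1/2)}$, and the solver recovers the coefficients ${\displaystyle (0,2,0)}$. Note that on ${\displaystyle [0,1]}$ the matrix is the notoriously ill-conditioned Hilbert matrix, so high orders require care.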

### Uniform distribution

Consider the uniform distribution on the interval ${\displaystyle [a,b]}$, ${\displaystyle U(a,b)}$. If ${\displaystyle W\sim U(a,b)}$ then we have

${\displaystyle \mu _{1}=\operatorname {E} [W]={\frac {1}{2}}(a+b)}$
${\displaystyle \mu _{2}=\operatorname {E} [W^{2}]={\frac {1}{3}}(a^{2}+ab+b^{2})}$

Solving these equations gives

${\displaystyle {\widehat {a}}=\mu _{1}-{\sqrt {3\left(\mu _{2}-\mu _{1}^{2}\right)}}}$
${\displaystyle {\widehat {b}}=\mu _{1}+{\sqrt {3\left(\mu _{2}-\mu _{1}^{2}\right)}}}$

Given a set of samples ${\displaystyle \{w_{i}\}}$ we can use the sample moments ${\displaystyle {\widehat {\mu }}_{1}}$ and ${\displaystyle {\widehat {\mu }}_{2}}$ in these formulae in order to estimate ${\displaystyle a}$ and ${\displaystyle b}$.

Note, however, that this method can produce estimates that are inconsistent with the observed data. For example, the set of samples ${\displaystyle \{0,0,0,0,1\}}$ results in the estimates ${\displaystyle {\widehat {a}}={\frac {1}{5}}-{\frac {2{\sqrt {3}}}{5}},{\widehat {b}}={\frac {1}{5}}+{\frac {2{\sqrt {3}}}{5}}}$. Since ${\displaystyle {\widehat {b}}\approx 0.893<1}$, the observed value ${\displaystyle 1}$ lies outside the estimated interval, so it is impossible for the set ${\displaystyle \{0,0,0,0,1\}}$ to have been drawn from ${\displaystyle U({\widehat {a}},{\widehat {b}})}$.
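The closed-form estimators above translate directly into code. The sketch below (the helper name `mom_uniform` is my own) reproduces the estimates for the sample set ${\displaystyle \{0,0,0,0,1\}}$ discussed above.

```python
import math

def mom_uniform(samples):
    """Method-of-moments estimates for U(a, b), using the closed forms
    a_hat = m1 - sqrt(3 (m2 - m1^2)),  b_hat = m1 + sqrt(3 (m2 - m1^2))."""
    n = len(samples)
    m1 = sum(samples) / n                 # first sample moment
    m2 = sum(w * w for w in samples) / n  # second sample moment
    half_width = math.sqrt(3 * (m2 - m1 ** 2))
    return m1 - half_width, m1 + half_width
```

Running `mom_uniform([0, 0, 0, 0, 1])` returns an upper endpoint below the largest observation, exhibiting exactly the failure described above.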

## References

1. Kimiko O. Bowman and L. R. Shenton, "Estimator: Method of Moments", pp. 2092–2098 in Encyclopedia of Statistical Sciences, Wiley (1998).
2. J. Munkhammar, L. Mattsson, J. Rydén (2017). "Polynomial probability distribution estimation using the method of moments". PLoS ONE 12(4): e0174573. https://doi.org/10.1371/journal.pone.0174573