Method of moments (statistics)

In statistics, the method of moments is a method of estimation of population parameters. The same principle is used to derive higher moments like skewness and kurtosis.

It starts by expressing the population moments (i.e., the expected values of powers of the random variable under consideration) as functions of the parameters of interest. Those expressions are then set equal to the sample moments. The number of such equations is the same as the number of parameters to be estimated. Those equations are then solved for the parameters of interest. The solutions are estimates of those parameters.

The method of moments was introduced by Pafnuty Chebyshev in 1887 in the proof of the central limit theorem. The idea of matching empirical moments of a distribution to the population moments dates back at least to Pearson.

Method

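Suppose that the parameter $\theta = (\theta_1, \theta_2, \dots, \theta_k)$ characterizes the distribution $f_W(w; \theta)$ of the random variable $W$. [1] Suppose the first $k$ moments of the true distribution (the "population moments") can be expressed as functions of the $\theta$s:

$$
\begin{aligned}
\mu_1 &\equiv \operatorname{E}[W] = g_1(\theta_1, \theta_2, \ldots, \theta_k), \\
\mu_2 &\equiv \operatorname{E}[W^2] = g_2(\theta_1, \theta_2, \ldots, \theta_k), \\
&\;\;\vdots \\
\mu_k &\equiv \operatorname{E}[W^k] = g_k(\theta_1, \theta_2, \ldots, \theta_k).
\end{aligned}
$$

Suppose a sample of size $n$ is drawn, resulting in the values $w_1, \dots, w_n$. For $j = 1, \dots, k$, let

$$
\hat{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} w_i^{\,j}
$$

be the j-th sample moment, an estimate of $\mu_j$. The method of moments estimator for $\theta_1, \theta_2, \ldots, \theta_k$, denoted by $\hat{\theta}_1, \hat{\theta}_2, \dots, \hat{\theta}_k$, is defined to be the solution (if one exists) to the equations:

$$
\begin{aligned}
\hat{\mu}_1 &= g_1(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k), \\
\hat{\mu}_2 &= g_2(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k), \\
&\;\;\vdots \\
\hat{\mu}_k &= g_k(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k).
\end{aligned}
$$
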
The method described here for single random variables generalizes in an obvious manner to multiple random variables leading to multiple choices for moments to be used. Different choices generally lead to different solutions [5], [6].
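
As a concrete illustration of the procedure, the following minimal sketch applies it to a two-parameter gamma distribution. The choice of the gamma family and the function names `sample_moments` and `gamma_method_of_moments` are assumptions made for this example only. It relies on the standard facts that a gamma distribution with shape $\alpha$ and scale $\beta$ has $\mu_1 = \alpha\beta$ and $\mu_2 = \alpha(\alpha+1)\beta^2$, so the two moment equations can be solved in closed form.

```python
import random

def sample_moments(data, k):
    """Return the first k raw sample moments (1/n) * sum(w_i ** j), j = 1..k."""
    n = len(data)
    return [sum(w ** j for w in data) / n for j in range(1, k + 1)]

def gamma_method_of_moments(data):
    """Method-of-moments estimates (shape, scale) for a gamma distribution.

    Population moments: mu_1 = shape * scale,
                        mu_2 = shape * (shape + 1) * scale**2.
    Setting them equal to the sample moments and solving gives the
    closed-form expressions below.
    """
    m1, m2 = sample_moments(data, 2)
    variance = m2 - m1 ** 2      # divide-by-n sample variance
    shape = m1 ** 2 / variance
    scale = variance / m1
    return shape, scale

# Simulated data with true shape 3.0 and true scale 2.0 (illustrative values).
random.seed(0)
data = [random.gammavariate(3.0, 2.0) for _ in range(20000)]
shape_hat, scale_hat = gamma_method_of_moments(data)
print(shape_hat, scale_hat)  # should land near 3.0 and 2.0
```

Two parameters are estimated, so exactly two moment equations are used, matching the rule that the number of equations equals the number of parameters.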

Advantages and disadvantages

The method of moments is fairly simple and yields consistent estimators (under very weak assumptions), though these estimators are often biased.

It is an alternative to the method of maximum likelihood. In some cases the likelihood equations may be intractable without computers, whereas the method-of-moments estimators can be computed much more quickly and easily. Because they are easy to compute, method-of-moments estimates may be used as the first approximation to the solutions of the likelihood equations, and successively improved approximations may then be found by the Newton–Raphson method. In this way the method of moments can assist in finding maximum likelihood estimates.
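
The idea of starting Newton–Raphson from a moment estimate can be sketched as follows. The model here is an assumption chosen for illustration: a logistic distribution with unknown location $\mu$ and scale fixed at 1, for which the likelihood equation $\sum_i \tanh((x_i - \mu)/2) = 0$ has no closed-form solution, while the method-of-moments estimate is simply the sample mean. The function names and the data values are hypothetical.

```python
import math

def logistic_score(mu, data):
    """First derivative of the log-likelihood (logistic location, scale = 1)."""
    return sum(math.tanh((x - mu) / 2.0) for x in data)

def logistic_score_derivative(mu, data):
    """Second derivative of the log-likelihood; negative for every mu."""
    return -0.5 * sum(1.0 / math.cosh((x - mu) / 2.0) ** 2 for x in data)

def mle_from_moment_start(data, tol=1e-12, max_iter=50):
    # Method-of-moments starting value: the mean of a logistic
    # distribution equals its location parameter.
    mu = sum(data) / len(data)
    for _ in range(max_iter):
        step = logistic_score(mu, data) / logistic_score_derivative(mu, data)
        mu -= step               # Newton-Raphson update
        if abs(step) < tol:
            break
    return mu

data = [-1.2, 0.3, 0.8, 1.1, 2.5, 0.1, -0.4, 4.0]   # made-up observations
mu_start = sum(data) / len(data)
mu_hat = mle_from_moment_start(data)
print(mu_start, mu_hat)
```

Because the log-likelihood in this particular model is concave, the iteration is well behaved from the moment-based start; for other models convergence is not guaranteed and would need to be checked.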

In some cases, infrequent with large samples but less infrequent with small samples, the estimates given by the method of moments are outside of the parameter space (as shown in the example below); it does not make sense to rely on them then. That problem never arises in the method of maximum likelihood. Also, estimates by the method of moments are not necessarily sufficient statistics, i.e., they sometimes fail to take into account all relevant information in the sample.

When estimating other structural parameters (e.g., parameters of a utility function, instead of parameters of a known probability distribution), appropriate probability distributions may not be known, and moment-based estimates may be preferred to maximum likelihood estimation.

Alternative method of moments

The equations to be solved in the method of moments (MoM) are in general nonlinear, and there are no generally applicable guarantees that tractable solutions exist. There is, however, an alternative approach to using sample moments to estimate data model parameters in terms of the known dependence of model moments on these parameters, and this alternative requires the solution of only linear equations or, more generally, tensor equations. This alternative is referred to as the Bayesian-Like MoM (BL-MoM), and it differs from the classical MoM in that it uses optimally weighted sample moments. Considering that the MoM is typically motivated by a lack of sufficient knowledge about the data model to determine likelihood functions and associated a posteriori probabilities of unknown or random parameters, it may seem odd that there exists a type of MoM that is Bayesian-Like. However, the particular meaning of Bayesian-Like leads to a problem formulation in which required knowledge of a posteriori probabilities is replaced with required knowledge of only the dependence of model moments on unknown model parameters, which is exactly the knowledge required by the traditional MoM [1],[2],[5]–[9]. The BL-MoM also uses knowledge of a priori probabilities of the parameters to be estimated, when available, but otherwise uses uniform priors.

The BL-MoM has been reported only in the applied statistics literature, in connection with parameter estimation and hypothesis testing using observations of stochastic processes for problems in information and communications theory and, in particular, communications receiver design in the absence of knowledge of likelihood functions or associated a posteriori probabilities ([10] and references therein). In addition, a restatement of this receiver design approach for stochastic process models as an alternative to the classical MoM for any type of multivariate data is available in tutorial form at the university website [11, page 11.4]. The applications in [10] and the references therein demonstrate some important characteristics of this alternative to the classical MoM, and a detailed list of relative advantages and disadvantages is given in [11, page 11.4], but the literature is missing direct comparisons of the classical MoM and the BL-MoM in specific applications.

Examples

An example application of the method of moments is to estimate polynomial probability density distributions. In this case, an approximating polynomial of order $N$ is defined on an interval $[a, b]$. The method of moments then yields a system of equations, whose solution involves the inversion of a Hankel matrix. [2]
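
To see where the Hankel matrix comes from, the sketch below works through one plausible formulation; it is an illustration under stated assumptions, not the implementation from [2]. Write the density as $p(x) = \sum_{j=0}^{N} c_j x^j$ on $[a, b]$ and match moments $0$ through $N$ (with the zeroth moment equal to 1). The coefficient matrix then has entries $H_{ij} = \int_a^b x^{i+j}\,dx$, which depend only on $i + j$ and so form a Hankel matrix. The function names and the small Gaussian-elimination helper are hypothetical.

```python
def hankel_moment_matrix(order, a, b):
    """H[i][j] = integral of x**(i + j) over [a, b]; constant along anti-diagonals."""
    size = order + 1
    return [[(b ** (i + j + 1) - a ** (i + j + 1)) / (i + j + 1)
             for j in range(size)] for i in range(size)]

def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting (adequate for small systems)."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def fit_polynomial_density(moments, a, b):
    """Coefficients c_0..c_N of p(x) whose moments 0..N equal `moments`."""
    order = len(moments) - 1
    return solve_linear_system(hankel_moment_matrix(order, a, b), moments)

# Exact moments 0, 1, 2 of the density p(x) = 2x on [0, 1]: 1, 2/3, 1/2.
coefficients = fit_polynomial_density([1.0, 2.0 / 3.0, 0.5], 0.0, 1.0)
print(coefficients)  # approximately [0, 2, 0]
```

In an actual estimation problem the exact moments in the last call would be replaced by sample moments computed from data.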

Proving the central limit theorem

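Let $X_1, X_2, \dots$ be independent random variables with mean 0 and variance 1, then let $S_n := \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$. We can compute the low-order moments of $S_n$ directly as

$$E[S_n^0] = 1, \quad E[S_n^1] = 0, \quad E[S_n^2] = 1.$$

Explicit expansion shows that, provided the moments of the $X_i$ are finite and uniformly bounded,

$$E[S_n^{2k+1}] = O(n^{-1/2}), \qquad E[S_n^{2k}] = \frac{\binom{n}{k}\frac{(2k)!}{2^k}}{n^k} + O(n^{-1}),$$

where the numerator is the number of ways to select $k$ distinct pairs of balls by picking one each from $2k$ buckets, each containing balls numbered from $1$ to $n$. In the limit $n \to \infty$, all moments converge to those of a standard normal distribution, whose odd moments vanish and whose $2k$-th moment is $(2k-1)!!$. Further analysis then shows that this convergence in moments implies convergence in distribution.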
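
The convergence of the even moments can be checked numerically. The sketch below evaluates the pair-counting term $\binom{n}{k}\frac{(2k)!}{2^k} / n^k$ for growing $n$ and compares it with $(2k-1)!!$, the $2k$-th moment of a standard normal distribution. It covers only this leading term, which is an assumption of the illustration; contributions from index patterns other than $k$ distinct pairs are ignored, and the function names are hypothetical.

```python
from math import comb, factorial

def leading_even_moment(n, k):
    """Pair-counting term: C(n, k) * (2k)! / 2**k pairings, divided by n**k."""
    return comb(n, k) * factorial(2 * k) / (2 ** k * n ** k)

def double_factorial_odd(k):
    """(2k - 1)!! = 1 * 3 * ... * (2k - 1), the 2k-th moment of a standard normal."""
    result = 1
    for odd in range(1, 2 * k, 2):
        result *= odd
    return result

for n in (10, 100, 10000):
    print(n, [round(leading_even_moment(n, k), 4) for k in (1, 2, 3)])
print("limit", [double_factorial_odd(k) for k in (1, 2, 3)])  # [1, 3, 15]
```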

Essentially this argument was published by Chebyshev in 1887. [3]

Uniform distribution

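Consider the uniform distribution on the interval $[a, b]$, $U(a, b)$. If $W \sim U(a, b)$ then we have

$$\mu_1 = \operatorname{E}[W] = \tfrac{1}{2}(a + b),$$
$$\mu_2 = \operatorname{E}[W^2] = \tfrac{1}{3}(a^2 + ab + b^2).$$

Solving these equations gives

$$\hat{a} = \mu_1 - \sqrt{3\left(\mu_2 - \mu_1^2\right)},$$
$$\hat{b} = \mu_1 + \sqrt{3\left(\mu_2 - \mu_1^2\right)}.$$

Given a set of samples $\{w_i\}$ we can use the sample moments $\hat{\mu}_1$ and $\hat{\mu}_2$ in these formulae in order to estimate $a$ and $b$.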

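Note, however, that this method can produce inconsistent results in some cases. For example, the set of samples $\{0, 0, 0, 0, 1\}$ results in the estimate $\hat{a} = \frac{1}{5} - \frac{2\sqrt{3}}{5}$, $\hat{b} = \frac{1}{5} + \frac{2\sqrt{3}}{5}$ even though $\hat{b} < 1$, and so it is impossible for the set $\{0, 0, 0, 0, 1\}$ to have been drawn from $U(\hat{a}, \hat{b})$ in this case.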
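
The uniform case is easy to reproduce in code. The following minimal sketch computes the two estimates from the first two raw sample moments, using the closed form $\mu_1 \mp \sqrt{3(\mu_2 - \mu_1^2)}$, and applies it to the five-point sample 0, 0, 0, 0, 1. The function name is an assumption made for this example.

```python
import math

def uniform_method_of_moments(samples):
    """Estimate (a, b) for U(a, b) from the first two raw sample moments."""
    n = len(samples)
    m1 = sum(samples) / n
    m2 = sum(w * w for w in samples) / n
    half_width = math.sqrt(3.0 * (m2 - m1 ** 2))
    return m1 - half_width, m1 + half_width

samples = [0, 0, 0, 0, 1]
a_hat, b_hat = uniform_method_of_moments(samples)
print(a_hat, b_hat)
print(b_hat < max(samples))  # True: the largest observation lies above b_hat
```

The final line makes the problem explicit: the estimated upper endpoint is smaller than an observed value, so the fitted interval cannot have produced the sample.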


References

  1. Kimiko O. Bowman and L. R. Shenton, "Estimator: Method of Moments", pp 2092–2098, Encyclopedia of statistical sciences, Wiley (1998).
  2. J. Munkhammar, L. Mattsson, J. Rydén (2017) "Polynomial probability distribution estimation using the method of moments". PLoS ONE 12(4): e0174573. https://doi.org/10.1371/journal.pone.0174573
  3. Fischer, Hans (2011). "4. Chebyshev's and Markov's Contributions". History of the central limit theorem: from classical to modern probability theory. New York: Springer. ISBN 978-0-387-87857-7. OCLC 682910965.

[4] Pearson, K. (1936), "Method of Moments and Method of Maximum Likelihood", Biometrika 28(1/2), 35–59.

[5] Lindsay, B.G. & Basak, P. (1993). "Multivariate normal mixtures: a fast consistent method of moments", Journal of the American Statistical Association 88, 468–476.

[6] Quandt, R.E. & Ramsey, J.B. (1978). "Estimating mixtures of normal distributions and switching regressions", Journal of the American Statistical Association 73, 730–752.

[7] https://real-statistics.com/distribution-fitting/method-of-moments/

[8] Hansen, L. (1982). "Large sample properties of generalized method of moments estimators", Econometrica 50, 1029–1054.

[9] Lindsay, B.G. (1982). "Conditional score functions: some optimality results", Biometrika 69, 503–512.

[10] Gardner, W.A. (1981). "Design of nearest prototype signal classifiers", IEEE Transactions on Information Theory 27 (3), 368–372.

[11] Cyclostationarity