Scoring algorithm

Last updated February 02, 2024

Scoring algorithm, also known as Fisher's scoring,^[1] is a form of Newton's method used in statistics to solve maximum likelihood equations numerically, named after Ronald Fisher.

Sketch of derivation

Let $Y_{1},\ldots ,Y_{n}$ be random variables, independent and identically distributed with twice differentiable p.d.f. $f(y;\theta )$ , and we wish to calculate the maximum likelihood estimator (M.L.E.) $\theta ^{*}$ of $\theta$ . First, suppose we have a starting point for our algorithm $\theta _{0}$ , and consider a Taylor expansion of the score function, $V(\theta )$ , about $\theta _{0}$ :

V(\theta )\approx V(\theta _{0})-{\mathcal {J}}(\theta _{0})(\theta -\theta _{0}),\,

where

{\mathcal {J}}(\theta _{0})=-\sum _{i=1}^{n}\left.\nabla \nabla ^{\top }\right|_{\theta =\theta _{0}}\log f(Y_{i};\theta )

is the observed information matrix at $\theta _{0}$ . Now, setting $\theta =\theta ^{*}$ , using that $V(\theta ^{*})=0$ and rearranging gives us:

\theta ^{*}\approx \theta _{0}+{\mathcal {J}}^{-1}(\theta _{0})V(\theta _{0}).\,

We therefore use the algorithm

\theta _{m+1}=\theta _{m}+{\mathcal {J}}^{-1}(\theta _{m})V(\theta _{m}),\,

and under certain regularity conditions, it can be shown that $\theta _{m}\rightarrow \theta ^{*}$ .

Fisher scoring

In practice, ${\mathcal {J}}(\theta )$ is usually replaced by ${\mathcal {I}}(\theta )=\mathrm {E} [{\mathcal {J}}(\theta )]$ , the Fisher information, thus giving us the Fisher Scoring Algorithm:

\theta _{m+1}=\theta _{m}+{\mathcal {I}}^{-1}(\theta _{m})V(\theta _{m})

..

Under some regularity conditions, if $\theta _{m}$ is a consistent estimator, then $\theta _{m+1}$ (the correction after a single step) is 'optimal' in the sense that its error distribution is asymptotically identical to that of the true max-likelihood estimate.^[2]

Related Research Articles

The likelihood function is the joint probability of observed data viewed as a function of the parameters of a statistical model.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In estimation theory and statistics, the Cramér–Rao bound (CRB) relates to estimation of a deterministic parameter. The result is named in honor of Harald Cramér and C. R. Rao, but has also been derived independently by Maurice Fréchet, Georges Darmois, and by Alexander Aitken and Harold Silverstone. It states that the precision of any unbiased estimator is at most the Fisher information; or (equivalently) the reciprocal of the Fisher information is a lower bound on its variance.

In statistics, the informant is the gradient of the log-likelihood function with respect to the parameter vector. Evaluated at a particular point of the parameter vector, the score indicates the steepness of the log-likelihood function and thereby the sensitivity to infinitesimal changes to the parameter values. If the log-likelihood function is continuous over the parameter space, the score will vanish at a local maximum or minimum; this fact is used in maximum likelihood estimation to find the parameter values that maximize the likelihood function.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value affects the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements. In estimation theory, two approaches are generally considered:

In statistics, the score test assesses constraints on statistical parameters based on the gradient of the likelihood function—known as the score—evaluated at the hypothesized parameter value under the null hypothesis. Intuitively, if the restricted estimator is near the maximum of the likelihood function, the score should not differ from zero by more than sampling error. While the finite sample distributions of score tests are generally unknown, they have an asymptotic χ²-distribution under the null hypothesis as first proved by C. R. Rao in 1948, a fact that can be used to determine statistical significance.

In statistics, the Wald test assesses constraints on statistical parameters based on the weighted distance between the unrestricted estimate and its hypothesized value under the null hypothesis, where the weight is the precision of the estimate. Intuitively, the larger this weighted distance, the less likely it is that the constraint is true. While the finite sample distributions of Wald tests are generally unknown, it has an asymptotic χ²-distribution under the null hypothesis, a fact that can be used to determine statistical significance.

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.

In econometrics and statistics, the generalized method of moments (GMM) is a generic method for estimating parameters in statistical models. Usually it is applied in the context of semiparametric models, where the parameter of interest is finite-dimensional, whereas the full shape of the data's distribution function may not be known, and therefore maximum likelihood estimation is not applicable.

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. However, M-estimators are not inherently robust, as is clear from the fact that they include maximum likelihood estimators, which are in general not robust. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation.

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value $. Like this, over the generation sequence, individuals with better and better -values are generated.$

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

In statistics, the observed information, or observed Fisher information, is the negative of the second derivative of the "log-likelihood". It is a sample-based version of the Fisher information.

In statistical decision theory, where we are faced with the problem of estimating a deterministic parameter (vector) $from observations an estimator is called minimax if its maximal risk is minimal among all estimators of . In a sense this means that is an estimator which performs best in the worst possible case allowed in the problem.$

<span class="mw-page-title-main">Maximum spacing estimation</span> Method of estimating a statistical models parameters

In statistics, maximum spacing estimation (MSE or MSP), or maximum product of spacing estimation (MPS), is a method for estimating the parameters of a univariate statistical model. The method requires maximization of the geometric mean of spacings in the data, which are the differences between the values of the cumulative distribution function at neighbouring data points.

Two-step M-estimators deals with M-estimation problems that require preliminary estimation to obtain the parameter of interest. Two-step M-estimation is different from usual M-estimation problem because asymptotic distribution of the second-step estimator generally depends on the first-step estimator. Accounting for this change in asymptotic distribution is important for valid inference.

Stochastic gradient Langevin dynamics (SGLD) is an optimization and sampling technique composed of characteristics from Stochastic gradient descent, a Robbins–Monro optimization algorithm, and Langevin dynamics, a mathematical extension of molecular dynamics models. Like stochastic gradient descent, SGLD is an iterative optimization algorithm which uses minibatching to create a stochastic gradient estimator, as used in SGD to optimize a differentiable objective function. Unlike traditional SGD, SGLD can be used for Bayesian learning as a sampling method. SGLD may be viewed as Langevin dynamics applied to posterior distributions, but the key difference is that the likelihood gradient terms are minibatched, like in SGD. SGLD, like Langevin dynamics, produces samples from a posterior distribution of parameters based on available data. First described by Welling and Teh in 2011, the method has applications in many contexts which require optimization, and is most notably applied in machine learning problems.

A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, but has since been used in diverse settings in statistics, machine learning and computer science.

References

↑ Longford, Nicholas T. (1987). "A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects". Biometrika. 74 (4): 817–827. doi:10.1093/biomet/74.4.817.
↑ Li, Bing; Babu, G. Jogesh (2019), "Bayesian Inference", Springer Texts in Statistics, New York, NY: Springer New York, Theorem 9.4, doi:10.1007/978-1-4939-9761-9_6, ISBN 978-1-4939-9759-6, S2CID 239322258 , retrieved 2023-01-03

Scoring algorithm

Contents

Sketch of derivation

Fisher scoring

See also

Related Research Articles

References

Further reading