Score test

Last updated November 19, 2023 • 4 min readFrom Wikipedia, The Free Encyclopedia

In statistics, the score test assesses constraints on statistical parameters based on the gradient of the likelihood function—known as the score —evaluated at the hypothesized parameter value under the null hypothesis. Intuitively, if the restricted estimator is near the maximum of the likelihood function, the score should not differ from zero by more than sampling error. While the finite sample distributions of score tests are generally unknown, they have an asymptotic χ²-distribution under the null hypothesis as first proved by C. R. Rao in 1948,^[1] a fact that can be used to determine statistical significance.

Since function maximization subject to equality constraints is most conveniently done using a Lagrangean expression of the problem, the score test can be equivalently understood as a test of the magnitude of the Lagrange multipliers associated with the constraints where, again, if the constraints are non-binding at the maximum likelihood, the vector of Lagrange multipliers should not differ from zero by more than sampling error. The equivalence of these two approaches was first shown by S. D. Silvey in 1959,^[2] which led to the name Lagrange multiplier test that has become more commonly used, particularly in econometrics, since Breusch and Pagan's much-cited 1980 paper.^[3]

The main advantage of the score test over the Wald test and likelihood-ratio test is that the score test only requires the computation of the restricted estimator.^[4] This makes testing feasible when the unconstrained maximum likelihood estimate is a boundary point in the parameter space.^{[ citation needed ]} Further, because the score test only requires the estimation of the likelihood function under the null hypothesis, it is less specific than the likelihood ratio test about the alternative hypothesis.^[5]

Single-parameter test

The statistic

Let $L$ be the likelihood function which depends on a univariate parameter $\theta$ and let $x$ be the data. The score $U(\theta )$ is defined as

U(\theta )={\frac {\partial \log L(\theta \mid x)}{\partial \theta }}.

The Fisher information is^[6]

I(\theta )=-\operatorname {E} \left[\left.{\frac {\partial ^{2}}{\partial \theta ^{2}}}\log f(X;\theta )\,\right|\,\theta \right]\,,

where ƒ is the probability density.

The statistic to test ${\mathcal {H}}_{0}:\theta =\theta _{0}$ is $S(\theta _{0})={\frac {U(\theta _{0})^{2}}{I(\theta _{0})}}$

which has an asymptotic distribution of $\chi _{1}^{2}$ , when ${\mathcal {H}}_{0}$ is true. While asymptotically identical, calculating the LM statistic using the outer-gradient-product estimator of the Fisher information matrix can lead to bias in small samples.^[7]

Note on notation

Note that some texts use an alternative notation, in which the statistic $S^{*}(\theta )={\sqrt {S(\theta )}}$ is tested against a normal distribution. This approach is equivalent and gives identical results.

As most powerful test for small deviations

\left({\frac {\partial \log L(\theta \mid x)}{\partial \theta }}\right)_{\theta =\theta _{0}}\geq C

where $L$ is the likelihood function, $\theta _{0}$ is the value of the parameter of interest under the null hypothesis, and $C$ is a constant set depending on the size of the test desired (i.e. the probability of rejecting $H_{0}$ if $H_{0}$ is true; see Type I error).

The score test is the most powerful test for small deviations from $H_{0}$ . To see this, consider testing $\theta =\theta _{0}$ versus $\theta =\theta _{0}+h$ . By the Neyman–Pearson lemma, the most powerful test has the form

{\frac {L(\theta _{0}+h\mid x)}{L(\theta _{0}\mid x)}}\geq K;

Taking the log of both sides yields

\log L(\theta _{0}+h\mid x)-\log L(\theta _{0}\mid x)\geq \log K.

The score test follows making the substitution (by Taylor series expansion)

\log L(\theta _{0}+h\mid x)\approx \log L(\theta _{0}\mid x)+h\times \left({\frac {\partial \log L(\theta \mid x)}{\partial \theta }}\right)_{\theta =\theta _{0}}

and identifying the $C$ above with $\log(K)$ .

Relationship with other hypothesis tests

If the null hypothesis is true, the likelihood ratio test, the Wald test, and the Score test are asymptotically equivalent tests of hypotheses.^[8]^[9] When testing nested models, the statistics for each test then converge to a Chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom in the two models. If the null hypothesis is not true, however, the statistics converge to a noncentral chi-squared distribution with possibly different noncentrality parameters.

Multiple parameters

A more general score test can be derived when there is more than one parameter. Suppose that ${\widehat {\theta }}_{0}$ is the maximum likelihood estimate of $\theta$ under the null hypothesis $H_{0}$ while $U$ and $I$ are respectively, the score vector and the Fisher information matrix. Then

U^{T}({\widehat {\theta }}_{0})I^{-1}({\widehat {\theta }}_{0})U({\widehat {\theta }}_{0})\sim \chi _{k}^{2}

asymptotically under $H_{0}$ , where $k$ is the number of constraints imposed by the null hypothesis and

U({\widehat {\theta }}_{0})={\frac {\partial \log L({\widehat {\theta }}_{0}\mid x)}{\partial \theta }}

and

I({\widehat {\theta }}_{0})=-\operatorname {E} \left({\frac {\partial ^{2}\log L({\widehat {\theta }}_{0}\mid x)}{\partial \theta \,\partial \theta '}}\right).

This can be used to test $H_{0}$ .

The actual formula for the test statistic depends on which estimator of the Fisher information matrix is being used.^[10]

Special cases

In many situations, the score statistic reduces to another commonly used statistic.^[11]

In linear regression, the Lagrange multiplier test can be expressed as a function of the F-test.^[12]

When the data follows a normal distribution, the score statistic is the same as the t statistic.^{[ clarification needed ]}

When the data consists of binary observations, the score statistic is the same as the chi-squared statistic in the Pearson's chi-squared test.

Related Research Articles

The likelihood function is the joint probability of observed data viewed as a function of the parameters of a statistical model.

In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models, specifically one found by maximization over the entire parameter space and another found after imposing some constraint, based on the ratio of their likelihoods. If the constraint is supported by the observed data, the two likelihoods should not differ by more than sampling error. Thus the likelihood-ratio test tests whether this ratio is significantly different from one, or equivalently whether its natural logarithm is significantly different from zero.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Z-test tests the mean of a distribution. For each significance level in the confidence interval, the Z-test has a single critical value which makes it more convenient than the Student's t-test whose critical values are defined by the sample size. Both the Z-test and Student's t-test have similarities in that they both help determine the significance of a set of data. However, the z-test is rarely used in practice because the population deviation is difficult to determine.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. The terms "distribution" and "family" are often used loosely: specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter; however, a parametric family of distributions is often referred to as "a distribution", and the set of all exponential families is sometimes loosely referred to as "the" exponential family. They are distinct because they possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

In statistics, the informant is the gradient of the log-likelihood function with respect to the parameter vector. Evaluated at a particular point of the parameter vector, the score indicates the steepness of the log-likelihood function and thereby the sensitivity to infinitesimal changes to the parameter values. If the log-likelihood function is continuous over the parameter space, the score will vanish at a local maximum or minimum; this fact is used in maximum likelihood estimation to find the parameter values that maximize the likelihood function.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In statistics, the Wald test assesses constraints on statistical parameters based on the weighted distance between the unrestricted estimate and its hypothesized value under the null hypothesis, where the weight is the precision of the estimate. Intuitively, the larger this weighted distance, the less likely it is that the constraint is true. While the finite sample distributions of Wald tests are generally unknown, it has an asymptotic χ²-distribution under the null hypothesis, a fact that can be used to determine statistical significance.

In econometrics and statistics, the generalized method of moments (GMM) is a generic method for estimating parameters in statistical models. Usually it is applied in the context of semiparametric models, where the parameter of interest is finite-dimensional, whereas the full shape of the data's distribution function may not be known, and therefore maximum likelihood estimation is not applicable.

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

In statistics, the delta method is a result concerning the approximate probability distribution for a function of an asymptotically normal statistical estimator from knowledge of the limiting variance of that estimator.

In statistics, the Breusch–Pagan test, developed in 1979 by Trevor Breusch and Adrian Pagan, is used to test for heteroskedasticity in a linear regression model. It was independently suggested with some extension by R. Dennis Cook and Sanford Weisberg in 1983. Derived from the Lagrange multiplier test principle, it tests whether the variance of the errors from a regression is dependent on the values of the independent variables. In that case, heteroskedasticity is present.

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. However, M-estimators are not inherently robust, as is clear from the fact that they include maximum likelihood estimators, which are in general not robust. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation.

The sequential probability ratio test (SPRT) is a specific sequential hypothesis test, developed by Abraham Wald and later proven to be optimal by Wald and Jacob Wolfowitz. Neyman and Pearson's 1933 result inspired Wald to reformulate it as a sequential analysis problem. The Neyman-Pearson lemma, by contrast, offers a rule of thumb for when all the data is collected.

In statistics and econometrics, extremum estimators are a wide class of estimators for parametric models that are calculated through maximization of a certain objective function, which depends on the data. The general theory of extremum estimators was developed by Amemiya (1985).

In statistics Wilks' theorem offers an asymptotic distribution of the log-likelihood ratio statistic, which can be used to produce confidence intervals for maximum-likelihood estimates or as a test statistic for performing the likelihood-ratio test.

Denote a binary response index model as: $, where .$

Two-step M-estimators deals with M-estimation problems that require preliminary estimation to obtain the parameter of interest. Two-step M-estimation is different from usual M-estimation problem because asymptotic distribution of the second-step estimator generally depends on the first-step estimator. Accounting for this change in asymptotic distribution is important for valid inference.

In econometrics, the information matrix test is used to determine whether a regression model is misspecified. The test was developed by Halbert White, who observed that in a correctly specified model and under standard regularity assumptions, the Fisher information matrix can be expressed in either of two ways: as the outer product of the gradient, or as a function of the Hessian matrix of the log-likelihood function.

References

↑ Rao, C. Radhakrishna (1948). "Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation". Mathematical Proceedings of the Cambridge Philosophical Society . 44 (1): 50–57. doi:10.1017/S0305004100023987.
↑ Silvey, S. D. (1959). "The Lagrangian Multiplier Test". Annals of Mathematical Statistics . 30 (2): 389–407. doi: 10.1214/aoms/1177706259 . JSTOR 2237089.
↑ Breusch, T. S.; Pagan, A. R. (1980). "The Lagrange Multiplier Test and its Applications to Model Specification in Econometrics". Review of Economic Studies . 47 (1): 239–253. JSTOR 2297111.
↑ Fahrmeir, Ludwig; Kneib, Thomas; Lang, Stefan; Marx, Brian (2013). Regression : Models, Methods and Applications . Berlin: Springer. pp. 663–664. ISBN 978-3-642-34332-2.
↑ Kennedy, Peter (1998). A Guide to Econometrics (Fourth ed.). Cambridge: MIT Press. p. 68. ISBN 0-262-11235-3.
↑ Lehmann and Casella, eq. (2.5.16).
↑ Davidson, Russel; MacKinnon, James G. (1983). "Small sample properties of alternative forms of the Lagrange Multiplier test". Economics Letters . 12 (3–4): 269–275. doi:10.1016/0165-1765(83)90048-4.
↑ Engle, Robert F. (1983). "Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics". In Intriligator, M. D.; Griliches, Z. (eds.). Handbook of Econometrics. Vol. II. Elsevier. pp. 796–801. ISBN 978-0-444-86185-6.
↑ Burzykowski, Andrzej Gałecki, Tomasz (2013). Linear mixed-effects models using R : a step-by-step approach. New York, NY: Springer. ISBN 1461438993.{{cite book}}: CS1 maint: multiple names: authors list (link)
↑ Taboga, Marco. "Lectures on Probability Theory and Mathematical Statistics". statlect.com. Retrieved 31 May 2022.
↑ Cook, T. D.; DeMets, D. L., eds. (2007). Introduction to Statistical Methods for Clinical Trials. Chapman and Hall. pp. 296–297. ISBN 1-58488-027-9.
↑ Vandaele, Walter (1981). "Wald, likelihood ratio, and Lagrange multiplier tests as an F test". Economics Letters . 8 (4): 361–365. doi:10.1016/0165-1765(81)90026-4.