# Score test

In statistics, the score test assesses constraints on statistical parameters based on the gradient of the likelihood function—known as the score—evaluated at the hypothesized parameter value under the null hypothesis. Intuitively, if the restricted estimator is near the maximum of the likelihood function, the score should not differ from zero by more than sampling error. While the finite sample distributions of score tests are generally unknown, they have an asymptotic χ2-distribution under the null hypothesis as first proved by C. R. Rao in 1948, [1] a fact that can be used to determine statistical significance.

Since function maximization subject to equality constraints is most conveniently done using a Lagrangean expression of the problem, the score test can be equivalently understood as a test of the magnitude of the Lagrange multipliers associated with the constraints where, again, if the constraints are non-binding at the maximum likelihood, the vector of Lagrange multipliers should not differ from zero by more than sampling error. The equivalence of these two approaches was first shown by S. D. Silvey in 1959, [2] which led to the name Lagrange multiplier test that has become more commonly used, particularly in econometrics, since Breusch and Pagan's much-cited 1980 paper. [3]

The main advantage of the score test over the Wald test and likelihood-ratio test is that the score test only requires the computation of the restricted estimator. [4] This makes testing feasible when the unconstrained maximum likelihood estimate is a boundary point in the parameter space. Further, because the score test only requires the estimation of the likelihood function under the null hypothesis, it is less specific than the other two tests about the precise nature of the alternative hypothesis. [5]

## Single-parameter test

### The statistic

Let ${\displaystyle L}$ be the likelihood function which depends on a univariate parameter ${\displaystyle \theta }$ and let ${\displaystyle x}$ be the data. The score ${\displaystyle U(\theta )}$ is defined as

${\displaystyle U(\theta )={\frac {\partial \log L(\theta \mid x)}{\partial \theta }}.}$

The Fisher information is [6]

${\displaystyle I(\theta )=-\operatorname {E} \left[\left.{\frac {\partial ^{2}}{\partial \theta ^{2}}}\log f(X;\theta )\,\right|\,\theta \right]\,,}$

where ${\displaystyle f}$ is the probability density function of ${\displaystyle X}$.

The statistic to test ${\displaystyle {\mathcal {H}}_{0}:\theta =\theta _{0}}$ is ${\displaystyle S(\theta _{0})={\frac {U(\theta _{0})^{2}}{I(\theta _{0})}}}$

which has an asymptotic distribution of ${\displaystyle \chi _{1}^{2}}$, when ${\displaystyle {\mathcal {H}}_{0}}$ is true. While asymptotically identical, calculating the LM statistic using the outer-gradient-product estimator of the Fisher information matrix can lead to bias in small samples. [7]
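As a concrete illustration, the statistic can be computed directly from the score and the Fisher information; the sketch below uses a Bernoulli model (the model and the numbers are chosen here purely for exposition):

```python
def bernoulli_score_test(k, n, theta0):
    """Score test of H0: theta = theta0, given k successes in n Bernoulli trials.

    The log-likelihood is k*log(theta) + (n - k)*log(1 - theta), so the score
    is U(theta) = k/theta - (n - k)/(1 - theta), and the Fisher information
    for n observations is I(theta) = n / (theta * (1 - theta)).
    """
    u = k / theta0 - (n - k) / (1 - theta0)
    info = n / (theta0 * (1 - theta0))
    return u ** 2 / info  # compare against a chi-squared(1) critical value

# 60 successes in 100 trials, testing H0: theta = 0.5
print(bernoulli_score_test(60, 100, 0.5))  # -> 4.0
```

Since the 5% critical value of a chi-squared distribution with one degree of freedom is about 3.84, this sample would lead to rejection of the null at that level.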

#### Note on notation

Note that some texts use an alternative notation, in which the statistic ${\displaystyle S^{*}(\theta )={\sqrt {S(\theta )}}}$ is tested against a normal distribution. This approach is equivalent and gives identical results.

### As most powerful test for small deviations

If the null hypothesis is ${\displaystyle H_{0}:\theta =\theta _{0}}$, the score test rejects ${\displaystyle H_{0}}$ when

${\displaystyle \left({\frac {\partial \log L(\theta \mid x)}{\partial \theta }}\right)_{\theta =\theta _{0}}\geq C}$

where ${\displaystyle L}$ is the likelihood function, ${\displaystyle \theta _{0}}$ is the value of the parameter of interest under the null hypothesis, and ${\displaystyle C}$ is a constant set depending on the size of the test desired (i.e. the probability of rejecting ${\displaystyle H_{0}}$ if ${\displaystyle H_{0}}$ is true; see Type I error).

The score test is the most powerful test for small deviations from ${\displaystyle H_{0}}$. To see this, consider testing ${\displaystyle \theta =\theta _{0}}$ versus ${\displaystyle \theta =\theta _{0}+h}$. By the Neyman–Pearson lemma, the most powerful test has the form

${\displaystyle {\frac {L(\theta _{0}+h\mid x)}{L(\theta _{0}\mid x)}}\geq K;}$

Taking the log of both sides yields

${\displaystyle \log L(\theta _{0}+h\mid x)-\log L(\theta _{0}\mid x)\geq \log K.}$

The score test follows making the substitution (by Taylor series expansion)

${\displaystyle \log L(\theta _{0}+h\mid x)\approx \log L(\theta _{0}\mid x)+h\times \left({\frac {\partial \log L(\theta \mid x)}{\partial \theta }}\right)_{\theta =\theta _{0}}}$

and identifying the ${\displaystyle C}$ above with ${\displaystyle \log(K)}$.

### Relationship with other hypothesis tests

If the null hypothesis is true, the likelihood-ratio test, the Wald test, and the score test are asymptotically equivalent tests of hypotheses. [8] [9] When testing nested models, the statistics for each test converge to a chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom between the two models. If the null hypothesis is not true, however, the statistics converge to a noncentral chi-squared distribution with possibly different noncentrality parameters.
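The asymptotic equivalence can be checked numerically. As a sketch (the binomial model is only an illustrative choice), the three statistics for the same data are close but not identical in a finite sample:

```python
import math

def binomial_test_trio(k, n, theta0):
    """LR, Wald, and score statistics for H0: theta = theta0 in a binomial model."""
    theta_hat = k / n  # unrestricted maximum likelihood estimate

    def loglik(t):
        return k * math.log(t) + (n - k) * math.log(1 - t)

    lr = 2 * (loglik(theta_hat) - loglik(theta0))
    # Wald: weights the squared distance by the information at the unrestricted MLE
    wald = (theta_hat - theta0) ** 2 * n / (theta_hat * (1 - theta_hat))
    # Score: uses only quantities evaluated at the restricted value theta0
    score = (k - n * theta0) ** 2 / (n * theta0 * (1 - theta0))
    return lr, wald, score

lr, wald, score = binomial_test_trio(60, 100, 0.5)
print(round(lr, 3), round(wald, 3), round(score, 3))
```

All three values lie near 4 here and share the same chi-squared(1) reference distribution, illustrating the asymptotic equivalence.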

## Multiple parameters

A more general score test can be derived when there is more than one parameter. Suppose that ${\displaystyle {\widehat {\theta }}_{0}}$ is the maximum likelihood estimate of ${\displaystyle \theta }$ under the null hypothesis ${\displaystyle H_{0}}$, and let ${\displaystyle U}$ and ${\displaystyle I}$ denote, respectively, the score vector and the Fisher information matrix under the alternative hypothesis. Then

${\displaystyle U^{T}({\widehat {\theta }}_{0})I^{-1}({\widehat {\theta }}_{0})U({\widehat {\theta }}_{0})\sim \chi _{k}^{2}}$

asymptotically under ${\displaystyle H_{0}}$, where ${\displaystyle k}$ is the number of constraints imposed by the null hypothesis and

${\displaystyle U({\widehat {\theta }}_{0})={\frac {\partial \log L({\widehat {\theta }}_{0}\mid x)}{\partial \theta }}}$

and

${\displaystyle I({\widehat {\theta }}_{0})=-\operatorname {E} \left({\frac {\partial ^{2}\log L({\widehat {\theta }}_{0}\mid x)}{\partial \theta \,\partial \theta '}}\right).}$

This can be used to test ${\displaystyle H_{0}}$.
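As a sketch of the multiparameter case, consider a null hypothesis that fully specifies both the mean and variance of an i.i.d. normal sample, so that the restricted estimate is the hypothesized value itself (this choice of model is illustrative, not from the text above):

```python
import numpy as np

def normal_score_test(x, mu0, var0):
    """Score test of H0: (mu, sigma^2) = (mu0, var0) for an i.i.d. normal sample.

    The score vector collects the partial derivatives of the log-likelihood
    with respect to mu and sigma^2; the Fisher information matrix for n
    observations is diag(n/var, n/(2*var^2)).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    u = np.array([
        np.sum(x - mu0) / var0,
        -n / (2 * var0) + np.sum((x - mu0) ** 2) / (2 * var0 ** 2),
    ])
    info = np.diag([n / var0, n / (2 * var0 ** 2)])
    # U^T I^{-1} U, computed via a linear solve rather than explicit inversion;
    # asymptotically chi-squared with k = 2 degrees of freedom under H0.
    return float(u @ np.linalg.solve(info, u))
```

A sample whose mean and variance match the hypothesized values exactly yields a statistic of zero, since both components of the score vanish there.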

## Special cases

In many situations, the score statistic reduces to another commonly used statistic. [10]

In linear regression, the Lagrange multiplier test can be expressed as a function of the F-test. [11]

When the data follow a normal distribution, the score statistic is the same as the t statistic.

When the data consists of binary observations, the score statistic is the same as the chi-squared statistic in the Pearson's chi-squared test.
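For binary observations this identity is easy to verify numerically; in the single-proportion case below (the counts are illustrative) the two statistics agree exactly, not just asymptotically:

```python
def score_statistic(k, n, theta0):
    """Score statistic for H0: theta = theta0 with k successes in n trials."""
    return (k - n * theta0) ** 2 / (n * theta0 * (1 - theta0))

def pearson_chi2(k, n, theta0):
    """Pearson's chi-squared over the two cells (successes, failures)."""
    observed = [k, n - k]
    expected = [n * theta0, n * (1 - theta0)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(score_statistic(60, 100, 0.5), pearson_chi2(60, 100, 0.5))  # -> 4.0 4.0
```

The equality follows algebraically: summing ${(O-E)^{2}/E}$ over the two cells gives ${(k-n\theta _{0})^{2}\left[{\tfrac {1}{n\theta _{0}}}+{\tfrac {1}{n(1-\theta _{0})}}\right]}$, which simplifies to the score statistic.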

When the data consists of failure time data in two groups, the score statistic for the Cox partial likelihood is the same as the log-rank statistic in the log-rank test. Hence the log-rank test for difference in survival between two groups is most powerful when the proportional hazards assumption holds.

## References

1. Rao, C. Radhakrishna (1948). "Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation". Mathematical Proceedings of the Cambridge Philosophical Society . 44 (1): 50–57. doi:10.1017/S0305004100023987.
2. Silvey, S. D. (1959). "The Lagrangian Multiplier Test". Annals of Mathematical Statistics. 30 (2): 389–407. JSTOR 2237089.
3. Breusch, T. S.; Pagan, A. R. (1980). "The Lagrange Multiplier Test and its Applications to Model Specification in Econometrics". Review of Economic Studies . 47 (1): 239–253. JSTOR   2297111.
4. Fahrmeir, Ludwig; Kneib, Thomas; Lang, Stefan; Marx, Brian (2013). Regression: Models, Methods and Applications. Berlin: Springer. pp. 663–664. ISBN 978-3-642-34332-2.
5. Kennedy, Peter (1998). A Guide to Econometrics (Fourth ed.). Cambridge: MIT Press. p. 68. ISBN   0-262-11235-3.
6. Lehmann, E. L.; Casella, G. Theory of Point Estimation, eq. (2.5.16).
7. Davidson, Russell; MacKinnon, James G. (1983). "Small sample properties of alternative forms of the Lagrange Multiplier test". Economics Letters. 12 (3–4): 269–275. doi:10.1016/0165-1765(83)90048-4.
8. Engle, Robert F. (1983). "Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics". In Intriligator, M. D.; Griliches, Z. (eds.). Handbook of Econometrics. II. Elsevier. pp. 796–801. ISBN   978-0-444-86185-6.
9. Gałecki, Andrzej; Burzykowski, Tomasz (2013). Linear Mixed-Effects Models Using R: A Step-by-Step Approach. New York: Springer. ISBN 1461438993.
10. Cook, T. D.; DeMets, D. L., eds. (2007). Introduction to Statistical Methods for Clinical Trials. Chapman and Hall. pp. 296–297. ISBN   1-58488-027-9.
11. Vandaele, Walter (1981). "Wald, likelihood ratio, and Lagrange multiplier tests as an F test". Economics Letters . 8 (4): 361–365. doi:10.1016/0165-1765(81)90026-4.
• Buse, A. (1982). "The Likelihood Ratio, Wald, and Lagrange Multiplier Tests: An Expository Note". The American Statistician . 36 (3a): 153–157. doi:10.1080/00031305.1982.10482817.
• Godfrey, L. G. (1988). "The Lagrange Multiplier Test and Testing for Misspecification : An Extended Analysis". Misspecification Tests in Econometrics. New York: Cambridge University Press. pp. 69–99. ISBN   0-521-26616-5.
• Rao, C. R. (2005). "Score Test: Historical Review and Recent Developments". Advances in Ranking and Selection, Multiple Comparisons, and Reliability. Boston: Birkhäuser. pp. 3–20. ISBN   978-0-8176-3232-8.