Wald test

Last updated March 23, 2024

In statistics, the Wald test (named after Abraham Wald) assesses constraints on statistical parameters based on the weighted distance between the unrestricted estimate and its hypothesized value under the null hypothesis, where the weight is the precision of the estimate.^[1]^[2] Intuitively, the larger this weighted distance, the less likely it is that the constraint is true. While the finite sample distributions of Wald tests are generally unknown,^[3]^: 138 they have an asymptotic χ²-distribution under the null hypothesis, a fact that can be used to determine statistical significance.^[4]

Together with the Lagrange multiplier test and the likelihood-ratio test, the Wald test is one of three classical approaches to hypothesis testing. An advantage of the Wald test over the other two is that it only requires the estimation of the unrestricted model, which lowers the computational burden as compared to the likelihood-ratio test. However, a major disadvantage is that (in finite samples) it is not invariant to changes in the representation of the null hypothesis; in other words, algebraically equivalent expressions of a non-linear parameter restriction can lead to different values of the test statistic.^[5]^[6] That is because the Wald statistic is derived from a Taylor expansion,^[7] and different ways of writing equivalent nonlinear expressions lead to nontrivial differences in the corresponding Taylor coefficients.^[8] Another aberration, known as the Hauck–Donner effect,^[9] can occur in binomial models when the estimated (unconstrained) parameter is close to the boundary of the parameter space (for instance, a fitted probability extremely close to zero or one), with the result that the Wald statistic no longer increases monotonically in the distance between the unconstrained and constrained parameter.^[10]^[11]
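The Hauck–Donner effect described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a binomial model with n = 100 trials parametrised by the log-odds β = logit(p) and the null hypothesis β = 0; the function name `wald_logit` and the chosen values of the fitted probability are illustrative assumptions, not taken from the cited sources.

```python
import math

def wald_logit(p_hat: float, n: int, beta0: float = 0.0) -> float:
    """Wald statistic for H0: logit(p) = beta0 in a binomial model with n trials."""
    beta_hat = math.log(p_hat / (1.0 - p_hat))      # MLE of the log-odds
    var_beta = 1.0 / (n * p_hat * (1.0 - p_hat))    # inverse Fisher information at the MLE
    return (beta_hat - beta0) ** 2 / var_beta

n = 100  # assumed sample size
stats = {p: wald_logit(p, n) for p in (0.6, 0.9, 0.99, 0.999)}
for p, w in stats.items():
    print(f"p_hat = {p}: W = {w:.2f}")
```

In this sketch the statistic rises as the fitted probability moves from 0.6 to 0.9, then falls again as it approaches one, even though the estimate is moving further from the null value: the estimated variance of the log-odds grows faster than the squared distance.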

Mathematical details

Under the Wald test, the estimate ${\hat {\theta }}$ that was found as the maximizing argument of the unconstrained likelihood function is compared with a hypothesized value $\theta _{0}$ . In particular, the squared difference $({\hat {\theta }}-\theta _{0})^{2}$ is weighted by the curvature of the log-likelihood function.

Test on a single parameter

If the hypothesis involves only a single parameter restriction, then the Wald statistic takes the following form:

W={\frac {{({\widehat {\theta }}-\theta _{0})}^{2}}{\operatorname {var} ({\hat {\theta }})}}

which under the null hypothesis follows an asymptotic χ²-distribution with one degree of freedom. The square root of the single-restriction Wald statistic can be understood as a (pseudo) t-ratio that is, however, not actually t-distributed except for the special case of linear regression with normally distributed errors.^[12] In general, it follows an asymptotic standard normal (z) distribution.^[13]

{\sqrt {W}}={\frac {{\widehat {\theta }}-\theta _{0}}{\operatorname {se} ({\hat {\theta }})}}

where $\operatorname {se} ({\widehat {\theta }})$ is the standard error (SE) of the maximum likelihood estimate (MLE), the square root of the variance. There are several ways to consistently estimate the variance matrix, which in finite samples leads to alternative estimates of standard errors and associated test statistics and p-values.^[3]^: 129 The validity of still obtaining an asymptotically normal distribution after plugging the MLE ${\hat {\theta }}$ into the SE relies on Slutsky's theorem.
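As an illustration of the single-parameter formulas, the sketch below computes W, its signed square root, and the asymptotic p-value for a Bernoulli proportion. The data (62 successes in 100 trials, null value 0.5) and the function name `wald_single` are hypothetical; the standard error is the usual plug-in estimate evaluated at the MLE.

```python
import math

def wald_single(theta_hat: float, theta0: float, se: float):
    """Return (W, sqrt-W ratio, asymptotic p-value) for H0: theta = theta0."""
    z = (theta_hat - theta0) / se
    w = z * z
    # For one degree of freedom, P(chi2_1 > w) = P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w, z, p_value

# Hypothetical data: 62 successes in 100 Bernoulli trials, H0: p = 0.5.
n, successes = 100, 62
p_hat = successes / n
se = math.sqrt(p_hat * (1.0 - p_hat) / n)   # plug-in standard error at the MLE
w, z, p_value = wald_single(p_hat, 0.5, se)
print(f"W = {w:.3f}, z = {z:.3f}, p = {p_value:.4f}")
```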

Test(s) on multiple parameters

The Wald test can be used to test a single hypothesis on multiple parameters, as well as to jointly test multiple hypotheses on single or multiple parameters. Let ${\hat {\theta }}_{n}$ be our sample estimator of P parameters (i.e., ${\hat {\theta }}_{n}$ is a $P\times 1$ vector), which is assumed to be asymptotically normally distributed with covariance matrix V, ${\sqrt {n}}({\hat {\theta }}_{n}-\theta )\,\xrightarrow {\mathcal {D}} \,N(0,V)$ . The test of Q hypotheses on the P parameters is expressed with a $Q\times P$ matrix R:

H_{0}:R\theta =r

H_{1}:R\theta \neq r

The distribution of the test statistic under the null hypothesis is

(R{\hat {\theta }}_{n}-r)'[R({\hat {V}}_{n}/n)R']^{-1}(R{\hat {\theta }}_{n}-r)/Q\quad \xrightarrow {\mathcal {D}} \quad F(Q,n-P)\quad {\xrightarrow[{n\rightarrow \infty }]{\mathcal {D}}}\quad \chi _{Q}^{2}/Q,

which in turn implies

(R{\hat {\theta }}_{n}-r)'[R({\hat {V}}_{n}/n)R']^{-1}(R{\hat {\theta }}_{n}-r)\quad {\xrightarrow[{n\rightarrow \infty }]{\mathcal {D}}}\quad \chi _{Q}^{2},

where ${\hat {V}}_{n}$ is an estimator of the covariance matrix.^[14]
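The quadratic form above can be evaluated directly. The following is a minimal sketch in plain Python, assuming hypothetical values for the estimate, for its estimated covariance matrix (that is, for ${\hat {V}}_{n}/n$), and for R and r; the helper names `matmul`, `transpose`, `solve` and `wald_linear` are illustrative. Because Q = 2 here, the χ² survival function has the closed form exp(−W/2), which is used for the p-value.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda k: abs(m[k][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for k in range(i + 1, n):
            f = m[k][i] / m[i][i]
            for j in range(i, n + 1):
                m[k][j] -= f * m[i][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def wald_linear(theta_hat, cov_theta, R, r):
    """W = (R theta - r)' [R cov R']^{-1} (R theta - r), where cov stands for V_hat / n."""
    d = [sum(rij * t for rij, t in zip(row, theta_hat)) - ri for row, ri in zip(R, r)]
    middle = matmul(matmul(R, cov_theta), transpose(R))
    x = solve(middle, d)   # avoids forming the inverse explicitly
    return sum(di * xi for di, xi in zip(d, x))

# Hypothetical estimate and covariance (P = 3), with Q = 2 restrictions.
theta_hat = [1.2, 0.9, 0.1]
cov_theta = [[0.04, 0.01, 0.00],
             [0.01, 0.09, 0.00],
             [0.00, 0.00, 0.01]]
R = [[1.0, -1.0, 0.0],   # theta_1 - theta_2 = 0
     [0.0, 0.0, 1.0]]    # theta_3 = 0
r = [0.0, 0.0]

W = wald_linear(theta_hat, cov_theta, R, r)
p_value = math.exp(-W / 2.0)   # chi-squared survival function, valid for Q = 2 only
print(f"W = {W:.4f}, p = {p_value:.4f}")
```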

Proof

Suppose ${\sqrt {n}}({\hat {\theta }}_{n}-\theta )\,\xrightarrow {\mathcal {D}} \,N(0,V)$ . Then, by Slutsky's theorem and the properties of the normal distribution, premultiplying by R gives, under the null hypothesis $R\theta =r$ :

R{\sqrt {n}}({\hat {\theta }}_{n}-\theta )={\sqrt {n}}(R{\hat {\theta }}_{n}-r)\,\xrightarrow {\mathcal {D}} \,N(0,RVR')

Recalling that a quadratic form in a normally distributed vector, weighted by the inverse of its covariance matrix, has a chi-squared distribution:

{\sqrt {n}}(R{\hat {\theta }}_{n}-r)'[RVR']^{-1}{\sqrt {n}}(R{\hat {\theta }}_{n}-r)\,\xrightarrow {\mathcal {D}} \,\chi _{Q}^{2}

Rearranging n finally gives:

(R{\hat {\theta }}_{n}-r)'[R(V/n)R']^{-1}(R{\hat {\theta }}_{n}-r)\quad \xrightarrow {\mathcal {D}} \quad \chi _{Q}^{2}

What if the covariance matrix is not known a priori and needs to be estimated from the data? If we have a consistent estimator ${\hat {V}}_{n}$ of $V$ such that $V^{-1}{\hat {V}}_{n}$ has a determinant that is distributed $\chi _{n-P}^{2}$ , then by the independence of the covariance estimator and the equation above, we have:

(R{\hat {\theta }}_{n}-r)'[R({\hat {V}}_{n}/n)R']^{-1}(R{\hat {\theta }}_{n}-r)/Q\quad \xrightarrow {\mathcal {D}} \quad F(Q,n-P)
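The limiting χ² result can be checked empirically in the simplest case of a single restriction. The sketch below is a small Monte Carlo experiment, assuming Bernoulli data with true parameter 0.3, samples of size 200, 2000 replications and a fixed seed, all of which are arbitrary illustrative choices. It counts how often the Wald statistic exceeds 3.841, the approximate 95% quantile of the χ² distribution with one degree of freedom; if the asymptotic approximation is adequate, the rejection rate under the null should be close to 5%.

```python
import random

def wald_stat(successes: int, n: int, p0: float) -> float:
    """Single-parameter Wald statistic for H0: p = p0 with a plug-in variance."""
    p_hat = successes / n
    var_hat = p_hat * (1.0 - p_hat) / n
    if var_hat == 0.0:            # degenerate sample (all zeros or all ones)
        return float("inf")
    return (p_hat - p0) ** 2 / var_hat

def rejection_rate(p0=0.3, n=200, reps=2000, crit=3.841, seed=0) -> float:
    """Share of simulated samples, drawn under H0, in which W exceeds crit."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        successes = sum(rng.random() < p0 for _ in range(n))
        if wald_stat(successes, n, p0) > crit:
            rejections += 1
    return rejections / reps

rate = rejection_rate()
print(f"empirical size at nominal 5%: {rate:.3f}")
```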

Nonlinear hypothesis

In the standard form, the Wald test is used to test linear hypotheses that can be represented by a single matrix R. A non-linear hypothesis consisting of Q restrictions instead takes the form:

H_{0}:c(\theta )=0

H_{1}:c(\theta )\neq 0

The test statistic becomes:

c\left({\hat {\theta }}_{n}\right)'\left[c'\left({\hat {\theta }}_{n}\right)\left({\hat {V}}_{n}/n\right)c'\left({\hat {\theta }}_{n}\right)'\right]^{-1}c\left({\hat {\theta }}_{n}\right)\quad {\xrightarrow {\mathcal {D}}}\quad \chi _{Q}^{2}

where $c'({\hat {\theta }}_{n})$ is the derivative of c (the $Q\times P$ Jacobian matrix) evaluated at the sample estimator. This result is obtained using the delta method, which uses a first-order approximation of the variance.
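For a single nonlinear restriction (Q = 1) the statistic reduces to the squared restriction value divided by its delta-method variance. The sketch below assumes a hypothetical restriction c(θ) = θ₁θ₂ − 1 = 0, with made-up values for the estimate and for its covariance matrix (standing in for ${\hat {V}}_{n}/n$); the function name `wald_nonlinear` is illustrative.

```python
def wald_nonlinear(c_value, grad, cov_theta):
    """Scalar-restriction Wald statistic: c^2 / (grad' cov grad)."""
    k = len(grad)
    # Delta-method (first-order) variance of c(theta_hat).
    var_c = sum(grad[i] * cov_theta[i][j] * grad[j] for i in range(k) for j in range(k))
    return c_value ** 2 / var_c

# Hypothetical estimate and covariance of the estimate.
theta_hat = [2.0, 0.6]
cov_theta = [[0.05, 0.00],
             [0.00, 0.01]]

# Restriction c(theta) = theta_1 * theta_2 - 1 and its gradient at theta_hat.
c_value = theta_hat[0] * theta_hat[1] - 1.0
grad = [theta_hat[1], theta_hat[0]]

W = wald_nonlinear(c_value, grad, cov_theta)
print(f"W = {W:.4f}")
```

The value is compared with the χ² distribution with one degree of freedom, as in the linear case.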

Non-invariance to re-parameterisations

The fact that one uses an approximation of the variance has the drawback that the Wald statistic is not invariant to a non-linear transformation or reparametrisation of the hypothesis: it can give different answers to the same question, depending on how the question is phrased.^[15]^[5] For example, asking whether R = 1 is the same as asking whether log R = 0; but the Wald statistic for R = 1 is not the same as the Wald statistic for log R = 0 (because there is in general no neat relationship between the standard errors of R and log R, so it needs to be approximated).^[16]
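The R versus log R example can be made concrete with a short calculation. The sketch below assumes a hypothetical estimate of 1.5 with standard error 0.3, and obtains the standard error of log R by the delta method (dividing by the estimate, since the derivative of log R is 1/R). The critical value 3.841 is the approximate 95% quantile of the χ² distribution with one degree of freedom.

```python
import math

# Hypothetical estimate of R and its standard error.
r_hat, se_r = 1.5, 0.3

# Formulation 1: H0: R = 1.
w_level = ((r_hat - 1.0) / se_r) ** 2

# Formulation 2: H0: log R = 0, with the delta-method standard error se_r / r_hat.
se_log = se_r / r_hat
w_log = (math.log(r_hat) / se_log) ** 2

crit = 3.841  # approximate 5% critical value of chi-squared with 1 degree of freedom
print(f"W for R = 1:     {w_level:.3f}  reject: {w_level > crit}")
print(f"W for log R = 0: {w_log:.3f}  reject: {w_log > crit}")
```

With these assumed numbers the two algebraically equivalent hypotheses lead to opposite decisions at the 5% level.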

Alternatives to the Wald test

There exist several alternatives to the Wald test, namely the likelihood-ratio test and the Lagrange multiplier test (also known as the score test). Robert F. Engle showed that these three tests (the Wald test, the likelihood-ratio test and the Lagrange multiplier test) are asymptotically equivalent.^[17] Despite this asymptotic equivalence, in finite samples they can disagree enough to lead to different conclusions.

There are several reasons to prefer the likelihood ratio test or the Lagrange multiplier to the Wald test:^[18]^[19]^[20]

Non-invariance: As argued above, the Wald test is not invariant under reparametrization, while the likelihood-ratio test gives exactly the same answer whether we work with R, log R or any other monotonic transformation of R.^[5]
Approximations: The Wald test uses two approximations (that we know the standard error or Fisher information and the maximum likelihood estimate), whereas the likelihood-ratio test depends only on the ratio of likelihood functions under the null hypothesis and alternative hypothesis.
Estimation burden: The Wald test requires an estimate using the maximizing argument, corresponding to the "full" model. In some cases, the model is simpler under the null hypothesis, so that one might prefer to use the score test (also called Lagrange multiplier test), which has the advantage that it can be formulated in situations where the variability of the maximizing element is difficult to estimate or computing the estimate according to the maximum likelihood estimator is difficult; e.g. the Cochran–Mantel–Haenszel test is a score test.^[21]
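The finite-sample disagreement among the three tests can be seen in a one-parameter Bernoulli model, where all three statistics have closed forms. The sketch below assumes hypothetical data (62 successes in 100 trials) and the null value 0.5; the function name `three_tests` is illustrative. The Wald statistic uses the variance evaluated at the MLE, the score statistic uses the variance evaluated at the null value, and the likelihood-ratio statistic is twice the difference in log-likelihoods.

```python
import math

def three_tests(successes: int, n: int, p0: float):
    """Return (Wald, score, likelihood-ratio) statistics for H0: p = p0."""
    p_hat = successes / n

    def loglik(p: float) -> float:
        return successes * math.log(p) + (n - successes) * math.log(1.0 - p)

    wald = (p_hat - p0) ** 2 / (p_hat * (1.0 - p_hat) / n)   # variance at the MLE
    score = (p_hat - p0) ** 2 / (p0 * (1.0 - p0) / n)        # variance at the null value
    lr = 2.0 * (loglik(p_hat) - loglik(p0))
    return wald, score, lr

# Hypothetical data: 62 successes in 100 trials, H0: p = 0.5.
wald, score, lr = three_tests(62, 100, 0.5)
print(f"Wald = {wald:.3f}, score = {score:.3f}, LR = {lr:.3f}")
```

All three values are referred to the same χ² distribution with one degree of freedom, yet they differ in this sample.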

References

[1] Fahrmeir, Ludwig; Kneib, Thomas; Lang, Stefan; Marx, Brian (2013). Regression: Models, Methods and Applications. Berlin: Springer. p. 663. ISBN 978-3-642-34332-2.
[2] Ward, Michael D.; Ahlquist, John S. (2018). Maximum Likelihood for Social Science: Strategies for Analysis. Cambridge University Press. p. 36. ISBN 978-1-316-63682-4.
[3] Martin, Vance; Hurn, Stan; Harris, David (2013). Econometric Modelling with Time Series: Specification, Estimation and Testing. Cambridge University Press. ISBN 978-0-521-13981-6.
[4] Davidson, Russell; MacKinnon, James G. (1993). "The Method of Maximum Likelihood: Fundamental Concepts and Notation". Estimation and Inference in Econometrics. New York: Oxford University Press. p. 89. ISBN 0-19-506011-3.
[5] Gregory, Allan W.; Veall, Michael R. (1985). "Formulating Wald Tests of Nonlinear Restrictions". Econometrica. 53 (6): 1465–1468. doi:10.2307/1913221. JSTOR 1913221.
[6] Phillips, P. C. B.; Park, Joon Y. (1988). "On the Formulation of Wald Tests of Nonlinear Restrictions". Econometrica. 56 (5): 1065–1083. doi:10.2307/1911359. JSTOR 1911359.
[7] Hayashi, Fumio (2000). Econometrics. Princeton: Princeton University Press. pp. 489–491. ISBN 1-4008-2383-8.
[8] Lafontaine, Francine; White, Kenneth J. (1986). "Obtaining Any Wald Statistic You Want". Economics Letters. 21 (1): 35–40. doi:10.1016/0165-1765(86)90117-5.
[9] Hauck, Walter W. Jr.; Donner, Allan (1977). "Wald's Test as Applied to Hypotheses in Logit Analysis". Journal of the American Statistical Association. 72 (360a): 851–853. doi:10.1080/01621459.1977.10479969.
[10] King, Maxwell L.; Goh, Kim-Leng (2002). "Improvements to the Wald Test". Handbook of Applied Econometrics and Statistical Inference. New York: Marcel Dekker. pp. 251–276. ISBN 0-8247-0652-8.
[11] Yee, Thomas William (2022). "On the Hauck–Donner Effect in Wald Tests: Detection, Tipping Points, and Parameter Space Characterization". Journal of the American Statistical Association. 117 (540): 1763–1774. arXiv:2001.08431. doi:10.1080/01621459.2021.1886936.
[12] Cameron, A. Colin; Trivedi, Pravin K. (2005). Microeconometrics: Methods and Applications. New York: Cambridge University Press. p. 137. ISBN 0-521-84805-9.
[13] Davidson, Russell; MacKinnon, James G. (1993). "The Method of Maximum Likelihood: Fundamental Concepts and Notation". Estimation and Inference in Econometrics. New York: Oxford University Press. p. 89. ISBN 0-19-506011-3.
[14] Harrell, Frank E. Jr. (2001). "Section 9.3.1". Regression Modeling Strategies. New York: Springer-Verlag. ISBN 0387952322.
[15] Fears, Thomas R.; Benichou, Jacques; Gail, Mitchell H. (1996). "A reminder of the fallibility of the Wald statistic". The American Statistician. 50 (3): 226–227. doi:10.1080/00031305.1996.10474384.
[16] Critchley, Frank; Marriott, Paul; Salmon, Mark (1996). "On the Differential Geometry of the Wald Test with Nonlinear Restrictions". Econometrica. 64 (5): 1213–1222. doi:10.2307/2171963. hdl:1814/524. JSTOR 2171963.
[17] Engle, Robert F. (1983). "Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics". In Intriligator, M. D.; Griliches, Z. (eds.). Handbook of Econometrics. Vol. II. Elsevier. pp. 796–801. ISBN 978-0-444-86185-6.
[18] Harrell, Frank E. Jr. (2001). "Section 9.3.3". Regression Modeling Strategies. New York: Springer-Verlag. ISBN 0387952322.
[19] Collett, David (1994). Modelling Survival Data in Medical Research. London: Chapman & Hall. ISBN 0412448807.
[20] Pawitan, Yudi (2001). In All Likelihood. New York: Oxford University Press. ISBN 0198507658.
[21] Agresti, Alan (2002). Categorical Data Analysis (2nd ed.). Wiley. p. 232. ISBN 0471360937.

External links

Wald test on the Earliest known uses of some of the words of mathematics

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
