D'Agostino's K-squared test

In statistics, D'Agostino's K2 test, named for Ralph D'Agostino, is a goodness-of-fit measure of departure from normality; that is, the test aims to gauge the compatibility of given data with the null hypothesis that the data are a realization of independent, identically distributed Gaussian random variables. The test is based on transformations of the sample kurtosis and skewness, and has power only against alternatives under which the distribution is skewed and/or kurtic.

Skewness and kurtosis

In the following, { xi } denotes a sample of n observations, g1 and g2 are the sample skewness and kurtosis, the mj are the j-th sample central moments, and x̄ is the sample mean. Frequently in the literature related to normality testing, the skewness and kurtosis are denoted as √β1 and β2 respectively. Such notation can be inconvenient since, for example, √β1 can be a negative quantity.

The sample skewness and kurtosis are defined as

$$ g_1 = \frac{m_3}{m_2^{3/2}} = \frac{\tfrac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^3}{\left(\tfrac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2\right)^{3/2}}, \qquad g_2 = \frac{m_4}{m_2^2} - 3 = \frac{\tfrac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^4}{\left(\tfrac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2\right)^{2}} - 3. $$

These quantities consistently estimate the theoretical skewness and kurtosis of the distribution, respectively. Moreover, if the sample indeed comes from a normal population, then the exact finite-sample distributions of the skewness and kurtosis can themselves be analysed in terms of their means μ1, variances μ2, skewnesses γ1, and kurtoses γ2. This was done by Pearson (1931), who derived the following expressions:

$$ \mu_1(g_1) = 0, \qquad \mu_2(g_1) = \frac{6(n-2)}{(n+1)(n+3)}, \qquad \gamma_1(g_1) = 0, \qquad \gamma_2(g_1) = \frac{36(n-7)(n^2+2n-5)}{(n-2)(n+5)(n+7)(n+9)}, $$

and

$$ \mu_1(g_2) = -\frac{6}{n+1}, \qquad \mu_2(g_2) = \frac{24\,n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)}, $$
$$ \gamma_1(g_2) = \frac{6(n^2-5n+2)}{(n+7)(n+9)} \sqrt{\frac{6(n+3)(n+5)}{n(n-2)(n-3)}}, \qquad \gamma_2(g_2) = \frac{36\,(15n^6-36n^5-628n^4+982n^3+5777n^2-6402n+900)}{n(n-3)(n-2)(n+7)(n+9)(n+11)(n+13)}. $$

For example, a sample of size n = 1000 drawn from a normally distributed population can be expected to have a skewness of 0 with standard deviation 0.08, and a kurtosis of 0 with standard deviation 0.15.
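Since Pearson's expressions are simple rational functions of n, the figures above are easy to check numerically. Below is a minimal Python sketch (the function names are illustrative, not from any of the cited sources) that computes the sample statistics and the exact null moments, and reproduces the n = 1000 standard deviations:

```python
import numpy as np

def sample_skew_kurt(x):
    """Sample skewness g1 and (excess) kurtosis g2 via central moments m_j."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
    return m3 / m2**1.5, m4 / m2**2 - 3.0

def pearson_null_moments(n):
    """Exact moments of g1 and g2 under the normal null (Pearson, 1931)."""
    mu2_g1 = 6.0 * (n - 2) / ((n + 1) * (n + 3))
    gamma2_g1 = (36.0 * (n - 7) * (n**2 + 2*n - 5)
                 / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    mu1_g2 = -6.0 / (n + 1)
    mu2_g2 = 24.0 * n * (n - 2) * (n - 3) / ((n + 1)**2 * (n + 3) * (n + 5))
    gamma1_g2 = (6.0 * (n**2 - 5*n + 2) / ((n + 7) * (n + 9))
                 * np.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    return mu2_g1, gamma2_g1, mu1_g2, mu2_g2, gamma1_g2

mu2_g1, _, _, mu2_g2, _ = pearson_null_moments(1000)
print(np.sqrt(mu2_g1))  # ~0.077: the "SD 0.08" for skewness quoted above
print(np.sqrt(mu2_g2))  # ~0.154: the "SD 0.15" for kurtosis quoted above
```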

Transformed sample skewness and kurtosis

The sample skewness g1 and kurtosis g2 are both asymptotically normal. However, the rate of their convergence to the limiting distribution is frustratingly slow, especially for g2. For example, even with n = 5000 observations the sample kurtosis g2 has both skewness and kurtosis of approximately 0.3, which is not negligible. To remedy this situation, it has been suggested to transform the quantities g1 and g2 in a way that makes their distributions as close to standard normal as possible.

In particular, D'Agostino & Pearson (1973) suggested the following transformation for the sample skewness:

$$ Z_1(g_1) = \delta\, \operatorname{asinh}\!\left(\frac{g_1}{\alpha\sqrt{\mu_2}}\right), $$

where the constants α and δ are computed as

$$ W^2 = \sqrt{2\gamma_2 + 4} - 1, \qquad \delta = \frac{1}{\sqrt{\ln W}}, \qquad \alpha^2 = \frac{2}{W^2 - 1}, $$

and where μ2 = μ2(g1) is the variance of g1 and γ2 = γ2(g1) is the kurtosis of g1, as given in the previous section.
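As a sketch of this transformation in code, assuming the moment expressions from the previous section (the helper name z1 is illustrative):

```python
import numpy as np

def z1(g1, n):
    """Transformed sample skewness, as in D'Agostino & Pearson (1973)."""
    mu2 = 6.0 * (n - 2) / ((n + 1) * (n + 3))            # variance of g1
    gamma2 = (36.0 * (n - 7) * (n**2 + 2*n - 5)
              / ((n - 2) * (n + 5) * (n + 7) * (n + 9))) # kurtosis of g1
    W2 = np.sqrt(2.0 * gamma2 + 4.0) - 1.0
    delta = 1.0 / np.sqrt(0.5 * np.log(W2))              # = 1 / sqrt(ln W)
    alpha = np.sqrt(2.0 / (W2 - 1.0))
    return delta * np.arcsinh(g1 / (alpha * np.sqrt(mu2)))
```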

Similarly, Anscombe & Glynn (1983) suggested a transformation for g2, which works reasonably well for sample sizes of 20 or greater:

$$ Z_2(g_2) = \sqrt{\frac{9A}{2}} \left\{ 1 - \frac{2}{9A} - \left( \frac{1 - 2/A}{1 + \dfrac{g_2 - \mu_1}{\sqrt{\mu_2}} \sqrt{\dfrac{2}{A-4}}} \right)^{\!1/3} \right\}, $$

where

$$ A = 6 + \frac{8}{\gamma_1} \left( \frac{2}{\gamma_1} + \sqrt{1 + \frac{4}{\gamma_1^2}} \right), $$

and μ1 = μ1(g2), μ2 = μ2(g2), γ1 = γ1(g2) are the quantities computed by Pearson, as given above.
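A corresponding sketch, again assuming Pearson's moment expressions (z2 is an illustrative name):

```python
import numpy as np

def z2(g2, n):
    """Transformed sample kurtosis, as in Anscombe & Glynn (1983)."""
    mu1 = -6.0 / (n + 1)                                 # mean of g2
    mu2 = (24.0 * n * (n - 2) * (n - 3)
           / ((n + 1)**2 * (n + 3) * (n + 5)))           # variance of g2
    gamma1 = (6.0 * (n**2 - 5*n + 2) / ((n + 7) * (n + 9))
              * np.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    A = 6.0 + 8.0 / gamma1 * (2.0 / gamma1 + np.sqrt(1.0 + 4.0 / gamma1**2))
    t = ((1.0 - 2.0 / A)
         / (1.0 + (g2 - mu1) / np.sqrt(mu2) * np.sqrt(2.0 / (A - 4.0))))
    return np.sqrt(4.5 * A) * (1.0 - 2.0 / (9.0 * A) - np.cbrt(t))
```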

Omnibus K2 statistic

The statistics Z1 and Z2 can be combined to produce an omnibus test, able to detect deviations from normality due to either skewness or kurtosis ( D'Agostino, Belanger & D'Agostino 1990 ):

$$ K^2 = Z_1(g_1)^2 + Z_2(g_2)^2. $$

If the null hypothesis of normality is true, then K2 is approximately χ2-distributed with 2 degrees of freedom.
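The p-value is therefore the upper tail of the χ2(2) distribution evaluated at K2. SciPy implements this omnibus test as scipy.stats.normaltest, so a sketch that reuses the z1 and z2 helpers above can be cross-checked against it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
n = len(x)

g1, g2 = sample_skew_kurt(x)          # helper sketched earlier
K2 = z1(g1, n)**2 + z2(g2, n)**2      # omnibus statistic
p_value = stats.chi2.sf(K2, df=2)     # upper tail of chi2, 2 degrees of freedom

# Cross-check: SciPy ships this same omnibus test.
K2_scipy, p_scipy = stats.normaltest(x)
print(K2, p_value, K2_scipy, p_scipy)
```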

Note that the statistics g1 and g2 are not independent, only uncorrelated. Therefore, their transforms Z1 and Z2 will be dependent as well ( Shenton & Bowman 1977 ), rendering the validity of the χ2 approximation questionable. Simulations show that under the null hypothesis the K2 test statistic is characterized by:

                       expected value   standard deviation   95% quantile
n = 20                      1.971             2.339              6.373
n = 50                      2.017             2.308              6.339
n = 100                     2.026             2.267              6.271
n = 250                     2.012             2.174              6.129
n = 500                     2.009             2.113              6.063
n = 1000                    2.000             2.062              6.038
χ2(2) distribution          2.000             2.000              5.991
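Figures like those in the table can be approximated by simulation. A minimal Monte Carlo sketch (the replication count and seed are arbitrary choices; a large count is needed for quantile accuracy):

```python
import numpy as np
from scipy import stats

def k2_null_summary(n, reps=50_000, seed=0):
    """Monte Carlo mean, SD and 95% quantile of K2 for normal samples of size n."""
    rng = np.random.default_rng(seed)
    k2 = np.array([stats.normaltest(rng.normal(size=n)).statistic
                   for _ in range(reps)])
    return k2.mean(), k2.std(ddof=1), np.quantile(k2, 0.95)

print(k2_null_summary(100))  # compare with the n = 100 row above
```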

See also

Skewness
Kurtosis
Jarque–Bera test

References