Test statistic

Last updated July 22, 2024

A test statistic is a quantity derived from the sample for statistical hypothesis testing.^[1] A hypothesis test is typically specified in terms of a test statistic, considered as a numerical summary of a data set that reduces the data to one value that can be used to perform the hypothesis test. In general, a test statistic is selected or defined in such a way as to quantify, within observed data, behaviours that would distinguish the null from the alternative hypothesis, where such an alternative is prescribed, or that would characterize the null hypothesis if there is no explicitly stated alternative hypothesis.

An important property of a test statistic is that its sampling distribution under the null hypothesis must be calculable, either exactly or approximately, which allows p-values to be calculated. A test statistic shares some qualities with a descriptive statistic, and many statistics can be used as both test statistics and descriptive statistics. However, a test statistic is specifically intended for use in statistical testing, whereas the main quality of a descriptive statistic is that it is easily interpretable. Some informative descriptive statistics, such as the sample range, do not make good test statistics since it is difficult to determine their sampling distribution.

Two widely used test statistics are the t-statistic and the F-statistic.

Example

Suppose the task is to test whether a coin is fair (i.e. has equal probabilities of producing a head or a tail). If the coin is flipped 100 times and the results are recorded, the raw data can be represented as a sequence of 100 heads and tails. If there is interest in the marginal probability of obtaining a tail, only the number T out of the 100 flips that produced a tail needs to be recorded. But T can also be used as a test statistic in one of two ways:

The exact sampling distribution of T under the null hypothesis is the binomial distribution with parameters 0.5 and 100.
The value of T can be compared with its expected value of 50 under the null hypothesis, and since the sample size is large, a normal distribution can be used as an approximation to the sampling distribution either for T or for the revised test statistic T−50.

Using one of these sampling distributions, it is possible to compute either a one-tailed or two-tailed p-value for the null hypothesis that the coin is fair. The test statistic in this case reduces a set of 100 numbers to a single numerical summary that can be used for testing.
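The two approaches to the coin example can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and the exact two-tailed p-value is defined here (one common convention, assumed for this sketch) as the total probability of all outcomes that are no more likely than the observed T.

```python
import math
from statistics import NormalDist


def binomial_two_tailed_p(t, n=100, p=0.5):
    """Exact two-tailed p-value for T = t tails in n flips under H0: P(tail) = p."""
    # Binomial probability of every possible outcome 0..n under H0.
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    # Sum the probabilities of outcomes at most as likely as the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(q for q in pmf if q <= pmf[t] * (1 + 1e-9))


def normal_two_tailed_p(t, n=100, p=0.5):
    """Approximate two-tailed p-value using the normal approximation to T."""
    mean = n * p                       # expected value of T under H0 (50 here)
    sd = math.sqrt(n * p * (1 - p))    # standard deviation of T under H0 (5 here)
    z = (t - mean) / sd                # standardized version of T - 50
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    print(binomial_two_tailed_p(60))   # exact binomial p-value
    print(normal_two_tailed_p(60))     # normal approximation (z = 2)
```

For T = 60 the standardized statistic is z = (60 − 50)/5 = 2; the exact and approximate p-values differ somewhat because no continuity correction is applied in this sketch.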

Common test statistics

One-sample tests are appropriate when a sample is being compared to the population from a hypothesis. The population characteristics are known from theory or are calculated from the population.

Two-sample tests are appropriate for comparing two samples, typically experimental and control samples from a scientifically controlled experiment.

Paired tests are appropriate for comparing two samples where it is impossible to control important variables. Rather than comparing two sets, members are paired between samples so the difference between the members becomes the sample. Typically the mean of the differences is then compared to zero. The common example scenario for when a paired difference test is appropriate is when a single set of test subjects has something applied to them and the test is intended to check for an effect.
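As a rough sketch of the paired approach, the differences between paired members are treated as a single sample and their mean is compared to a hypothesized value (zero by default). The function name `paired_t` and the sample numbers are illustrative assumptions, not taken from the article.

```python
import math
from statistics import mean, stdev


def paired_t(before, after, d0=0.0):
    """Return (t, df) for a paired t-test on matched measurements."""
    # The sample is the set of within-pair differences.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # t = (mean difference - d0) / (s_d / sqrt(n)), with n - 1 degrees of freedom.
    t = (mean(diffs) - d0) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Hypothetical measurements on the same four subjects.
    print(paired_t([10, 12, 9, 11], [12, 13, 11, 14]))
```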

Z-tests are appropriate for comparing means under stringent conditions regarding normality and a known standard deviation.

A t-test is appropriate for comparing means under relaxed conditions (less is assumed).
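The difference between the two can be illustrated with a short sketch: the z statistic uses a known population standard deviation, while the t statistic replaces it with the sample standard deviation and carries n − 1 degrees of freedom. The function names and data below are hypothetical.

```python
import math
from statistics import mean, stdev


def one_sample_z(xs, mu0, sigma):
    """z statistic: requires the population standard deviation sigma to be known."""
    return (mean(xs) - mu0) / (sigma / math.sqrt(len(xs)))


def one_sample_t(xs, mu0):
    """Return (t, df): sigma is unknown and estimated by the sample value s."""
    n = len(xs)
    return (mean(xs) - mu0) / (stdev(xs) / math.sqrt(n)), n - 1


if __name__ == "__main__":
    data = [5.1, 4.9, 5.3, 5.2, 5.0]      # illustrative sample
    print(one_sample_z(data, mu0=5.0, sigma=0.2))
    print(one_sample_t(data, mu0=5.0))
```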

Tests of proportions are analogous to tests of means (the 50% proportion).

Chi-squared tests use the same calculations and the same probability distribution for different applications:

Chi-squared tests for variance are used to determine whether a normal population has a specified variance. The null hypothesis is that it does.
Chi-squared tests of independence are used for deciding whether two variables are associated or are independent. The variables are categorical rather than numeric. Such a test can be used to decide whether left-handedness is correlated with height (or not). The null hypothesis is that the variables are independent. The numbers used in the calculation are the observed and expected frequencies of occurrence (from contingency tables).
Chi-squared goodness of fit tests are used to determine the adequacy of curves fit to data. The null hypothesis is that the curve fit is adequate. It is common to determine curve shapes to minimize the mean square error, so it is appropriate that the goodness-of-fit calculation sums the squared errors.
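For the test of independence, the calculation from observed and expected frequencies can be sketched as follows. The expected count for each cell is taken as (row total × column total) / grand total, and the degrees of freedom as (rows − 1)(columns − 1); both are the standard conventions, stated here as assumptions since the article does not spell them out. The function name and the table are hypothetical.

```python
def chi_squared_independence(table):
    """Return (chi-squared statistic, df) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


if __name__ == "__main__":
    # Illustrative 2x2 table of counts.
    print(chi_squared_independence([[10, 20], [20, 10]]))
```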

F-tests (analysis of variance, ANOVA) are commonly used when deciding whether groupings of data by category are meaningful. If the variance of test scores of the left-handed in a class is much smaller than the variance of the whole class, then it may be useful to study lefties as a group. The null hypothesis is that two variances are the same – so the proposed grouping is not meaningful.
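The variance-ratio statistic behind the two-variance null hypothesis can be sketched as below, assuming normal populations. Placing the larger sample variance in the numerator is a convention adopted here; the function name and data are hypothetical, and this sketch covers only the two-sample ratio, not a full ANOVA.

```python
from statistics import variance


def variance_ratio_f(a, b):
    """Return (F, df_numerator, df_denominator) with the larger variance on top."""
    va, vb = variance(a), variance(b)
    if va >= vb:
        return va / vb, len(a) - 1, len(b) - 1
    return vb / va, len(b) - 1, len(a) - 1


if __name__ == "__main__":
    print(variance_ratio_f([1, 2, 3, 4, 5], [2, 4, 6, 8]))
```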

In the table below, the symbols used are defined at the bottom of the table. Many other tests can be found in other articles. Proofs exist that the test statistics are appropriate.^[2]

| Name | Formula | Assumptions or notes |
|---|---|---|
| One-sample z-test | $z = \frac{\overline{x} - \mu_0}{\sigma / \sqrt{n}}$ | (Normal population or n large) and σ known. (z is the distance from the mean in relation to the standard deviation of the mean.) For non-normal distributions it is possible to calculate a minimum proportion of a population that falls within k standard deviations for any k (see: Chebyshev's inequality). |
| Two-sample z-test | $z = \frac{(\overline{x}_1 - \overline{x}_2) - d_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ | Normal population and independent observations and σ₁ and σ₂ are known, where $d_0$ is the value of $\mu_1 - \mu_2$ under the null hypothesis. |
| One-sample t-test | $t = \frac{\overline{x} - \mu_0}{s / \sqrt{n}}$, $df = n - 1$ | (Normal population or n large) and $\sigma$ unknown. |
| Paired t-test | $t = \frac{\overline{d} - d_0}{s_d / \sqrt{n}}$, $df = n - 1$ | (Normal population of differences or n large) and $\sigma$ unknown. |
| Two-sample pooled t-test, equal variances | $t = \frac{(\overline{x}_1 - \overline{x}_2) - d_0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$, $df = n_1 + n_2 - 2$ ^[3] | (Normal populations or n₁ + n₂ > 40) and independent observations and σ₁ = σ₂ unknown. |
| Two-sample unpooled t-test, unequal variances (Welch's t-test) | $t = \frac{(\overline{x}_1 - \overline{x}_2) - d_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, $df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}$ ^[3] | (Normal populations or n₁ + n₂ > 40) and independent observations and σ₁ ≠ σ₂ both unknown. |
| One-proportion z-test | $z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0)}} \sqrt{n}$ | n·p₀ > 10 and n(1 − p₀) > 10 and it is a SRS (Simple Random Sample), see notes. |
| Two-proportion z-test, pooled for $H_0\colon p_1 = p_2$ | $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ | n₁p₁ > 5 and n₁(1 − p₁) > 5 and n₂p₂ > 5 and n₂(1 − p₂) > 5 and independent observations, see notes. |
| Two-proportion z-test, unpooled for $\lvert d_0 \rvert > 0$ | $z = \frac{(\hat{p}_1 - \hat{p}_2) - d_0}{\sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}}$ | n₁p₁ > 5 and n₁(1 − p₁) > 5 and n₂p₂ > 5 and n₂(1 − p₂) > 5 and independent observations, see notes. |
| Chi-squared test for variance | $\chi^2 = (n - 1) \frac{s^2}{\sigma_0^2}$ | df = n − 1. Normal population. |
| Chi-squared test for goodness of fit | $\chi^2 = \sum_k \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ | df = k − 1 − # parameters estimated, and one of these must hold: all expected counts are at least 5;^[4] or all expected counts are > 1 and no more than 20% of expected counts are less than 5.^[5] |
| Two-sample F test for equality of variances | $F = \frac{s_1^2}{s_2^2}$ | Normal populations. Arrange so $s_1^2 \geq s_2^2$ and reject H₀ for $F > F(\alpha/2, n_1 - 1, n_2 - 1)$.^[6] |
| Regression t-test of $H_0\colon R^2 = 0$ | $t = \sqrt{\frac{R^2 (n - k - 1^{*})}{1 - R^2}}$ | Reject H₀ for $t > t(\alpha/2, n - k - 1^{*})$.^[7] |

*Subtract 1 for intercept; k terms contain independent variables.
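The two-sample t statistics in the table differ only in how the standard error and the degrees of freedom are formed. A minimal sketch, assuming independent samples and using hypothetical function names, is:

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2, d0=0.0):
    """Return (t, df) for the pooled two-sample t-test (equal variances assumed)."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance s_p^2 weights each sample variance by its degrees of freedom.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2) - d0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(x1, x2, d0=0.0):
    """Return (t, df) for Welch's unpooled t-test (unequal variances allowed)."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1) / n1, variance(x2) / n2   # s_i^2 / n_i
    t = (mean(x1) - mean(x2) - d0) / math.sqrt(v1 + v2)
    # Approximate degrees of freedom from the table's formula; usually non-integer.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    a, b = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]   # illustrative samples
    print(pooled_t(a, b))
    print(welch_t(a, b))
```

With equal sample sizes the two t values coincide and only the degrees of freedom differ, which is why the choice between the tests matters most when the sample sizes or variances are unbalanced.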

In general, the subscript 0 indicates a value taken from the null hypothesis, H₀, which should be used as much as possible in constructing its test statistic. ... Definitions of other symbols:

$\alpha$ = probability of Type I error (rejecting a null hypothesis when it is in fact true)
$n$ = sample size
$n_{1}$ = sample 1 size
$n_{2}$ = sample 2 size
${\overline {x}}$ = sample mean
$\mu _{0}$ = hypothesized population mean
$\mu _{1}$ = population 1 mean
$\mu _{2}$ = population 2 mean
$\sigma$ = population standard deviation
$\sigma ^{2}$ = population variance
$s$ = sample standard deviation
$\sum ^{k}$ = sum (of ${\textstyle k}$ numbers)

$s^{2}$ = sample variance
$s_{1}$ = sample 1 standard deviation
$s_{2}$ = sample 2 standard deviation
$t$ = t statistic
$df$ = degrees of freedom
${\overline {d}}$ = sample mean of differences
$d_{0}$ = hypothesized population mean difference
$s_{d}$ = standard deviation of differences
$\chi ^{2}$ = Chi-squared statistic

${\hat {p}}={\frac {x}{n}}$ = sample proportion, unless specified otherwise
$p_{0}$ = hypothesized population proportion
$p_{1}$ = proportion 1
$p_{2}$ = proportion 2
$d_{p}$ = hypothesized difference in proportion
$\min\{n_{1},n_{2}\}$ = minimum of ${\textstyle n_{1}}$ and ${\textstyle n_{2}}$
$x_{1}=n_{1}p_{1}$
$x_{2}=n_{2}p_{2}$
$F$ = F statistic
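Using the symbols defined above, the pooled two-proportion z statistic can be sketched as follows. The function name is hypothetical, the counts are illustrative, and the two-tailed p-value relies on the normal approximation, so the sample-size conditions listed in the table are assumed to hold.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-tailed p-value) for H0: p1 = p2 using the pooled estimate."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)               # pooled sample proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Illustrative counts: 60 of 100 successes versus 40 of 100.
    print(two_proportion_z(60, 100, 40, 100))
```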


References

[1] Berger, R. L.; Casella, G. (2001). Statistical Inference, Duxbury Press, Second Edition (p. 374).
[2] Loveland, Jennifer L. (2011). Mathematical Justification of Introductory Hypothesis Tests and Development of Reference Materials (M.Sc. (Mathematics)). Utah State University. Retrieved April 30, 2013. Abstract: "The focus was on the Neyman–Pearson approach to hypothesis testing. A brief historical development of the Neyman–Pearson approach is followed by mathematical proofs of each of the hypothesis tests covered in the reference material." The proofs do not reference the concepts introduced by Neyman and Pearson; instead they show that traditional test statistics have the probability distributions ascribed to them, so that significance calculations assuming those distributions are correct. The thesis information is also posted at mathnstats.com as of April 2013.
[3] NIST handbook: Two-Sample t-test for Equal Means.
[4] Steel, R. G. D., and Torrie, J. H., Principles and Procedures of Statistics with Special Reference to the Biological Sciences, McGraw Hill, 1960, page 350.
[5] Weiss, Neil A. (1999). Introductory Statistics (5th ed.). p. 802. ISBN 0-201-59877-9.
[6] NIST handbook: F-Test for Equality of Two Standard Deviations (testing standard deviations is the same as testing variances).
[7] Steel, R. G. D., and Torrie, J. H., Principles and Procedures of Statistics with Special Reference to the Biological Sciences, McGraw Hill, 1960, page 288.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
