This article may be too technical for most readers to understand.(August 2021) |
In scientific research, the null hypothesis (often denoted H0) [1] is the claim that the effect being studied does not exist. [note 1] The null hypothesis can also be described as the hypothesis in which no relationship exists between two sets of data or variables being analyzed. If the null hypothesis is true, any experimentally observed effect is due to chance alone, hence the term "null". In contrast with the null hypothesis, an alternative hypothesis is developed, which claims that a relationship does exist between two variables.
The null hypothesis and the alternative hypothesis are types of conjectures used in statistical tests to make statistical inferences, which are formal methods of reaching conclusions and separating scientific claims from statistical noise.
The statement being tested in a test of statistical significance is called the null hypothesis. The test of significance is designed to assess the strength of the evidence against the null hypothesis, or a statement of 'no effect' or 'no difference'. [2] It is often symbolized as H0.
The statement that is being tested against the null hypothesis is the alternative hypothesis. [2] Symbols may include H1 and Ha.
A statistical significance test starts with a random sample from a population. If the sample data are consistent with the null hypothesis, then you do not reject the null hypothesis; if the sample data are inconsistent with the null hypothesis, then you reject the null hypothesis and conclude that the alternative hypothesis is true. [3]
Consider the following example. Given the test scores of two random samples, one of men and one of women, does one group score better than the other? A possible null hypothesis is that the mean male score is the same as the mean female score:
where
A stronger null hypothesis is that the two samples have equal variances and shapes of their respective distributions.
The simple/composite distinction was made by Neyman and Pearson. [5]
Fisher required an exact null hypothesis for testing (see the quotations below).
A one-tailed hypothesis (tested using a one-sided test) [2] is an inexact hypothesis in which the value of a parameter is specified as being either:
A one-tailed hypothesis is said to have directionality.
Fisher's original (lady tasting tea) example was a one-tailed test. The null hypothesis was asymmetric. The probability of guessing all cups correctly was the same as guessing all cups incorrectly, but Fisher noted that only guessing correctly was compatible with the lady's claim.
The null hypothesis is a default hypothesis that a quantity to be measured is zero (null). Typically, the quantity to be measured is the difference between two situations. For instance, trying to determine if there is a positive proof that an effect has occurred or that samples derive from different batches. [7] [8]
The null hypothesis is generally assumed to remain possibly true. Multiple analyses can be performed to show how the hypothesis should either be rejected or excluded e.g. having a high confidence level, thus demonstrating a statistically significant difference. This is demonstrated by showing that zero is outside of the specified confidence interval of the measurement on either side, typically within the real numbers. [8] Failure to exclude the null hypothesis (with any confidence) does not logically confirm or support the (unprovable) null hypothesis. (When it is proven that something is e.g. bigger than x, it does not necessarily imply it is plausible that it is smaller or equal than x; it may instead be a poor quality measurement with low accuracy. Confirming the null hypothesis two-sided would amount to positively proving it is bigger or equal than 0 and to positively proving it is smaller or equal than 0; this is something for which infinite accuracy is needed as well as exactly zero effect, neither of which normally are realistic. Also measurements will never indicate a non-zero probability of exactly zero difference.) So failure of an exclusion of a null hypothesis amounts to a "don't know" at the specified confidence level; it does not immediately imply null somehow, as the data may already show a (less strong) indication for a non-null. The used confidence level does absolutely certainly not correspond to the likelihood of null at failing to exclude; in fact in this case a high used confidence level expands the still plausible range.
A non-null hypothesis can have the following meanings, depending on the author a) a value other than zero is used, b) some margin other than zero is used and c) the "alternative" hypothesis. [9] [10]
Testing (excluding or failing to exclude) the null hypothesis provides evidence that there are (or are not) statistically sufficient grounds to believe there is a relationship between two phenomena (e.g., that a potential treatment has a non-zero effect, either way). Testing the null hypothesis is a central task in statistical hypothesis testing in the modern practice of science. There are precise criteria for excluding or not excluding a null hypothesis at a certain confidence level. The confidence level should indicate the likelihood that much more and better data would still be able to exclude the null hypothesis on the same side. [8]
The concept of a null hypothesis is used differently in two approaches to statistical inference. In the significance testing approach of Ronald Fisher, a null hypothesis is rejected if the observed data are significantly unlikely to have occurred if the null hypothesis were true. In this case, the null hypothesis is rejected and an alternative hypothesis is accepted in its place. If the data are consistent with the null hypothesis statistically possibly true, then the null hypothesis is not rejected. In neither case is the null hypothesis or its alternative proven; with better or more data, the null may still be rejected. This is analogous to the legal principle of presumption of innocence, in which a suspect or defendant is assumed to be innocent (null is not rejected) until proven guilty (null is rejected) beyond a reasonable doubt (to a statistically significant degree). [8]
In the hypothesis testing approach of Jerzy Neyman and Egon Pearson, a null hypothesis is contrasted with an alternative hypothesis, and the two hypotheses are distinguished on the basis of data, with certain error rates. It is used in formulating answers in research.
Statistical inference can be done without a null hypothesis, by specifying a statistical model corresponding to each candidate hypothesis, and by using model selection techniques to choose the most appropriate model. [11] (The most common selection techniques are based on either Akaike information criterion or Bayes factor).
Hypothesis testing requires constructing a statistical model of what the data would look like if chance or random processes alone were responsible for the results. The hypothesis that chance alone is responsible for the results is called the null hypothesis. The model of the result of the random process is called the distribution under the null hypothesis. The obtained results are compared with the distribution under the null hypothesis, and the likelihood of finding the obtained results is thereby determined. [12]
Hypothesis testing works by collecting data and measuring how likely the particular set of data is (assuming the null hypothesis is true), when the study is on a randomly selected representative sample. The null hypothesis assumes no relationship between variables in the population from which the sample is selected. [13]
If the data-set of a randomly selected representative sample is very unlikely relative to the null hypothesis (defined as being part of a class of sets of data that only rarely will be observed), the experimenter rejects the null hypothesis, concluding it (probably) is false. This class of data-sets is usually specified via a test statistic, which is designed to measure the extent of apparent departure from the null hypothesis. The procedure works by assessing whether the observed departure, measured by the test statistic, is larger than a value defined, so that the probability of occurrence of a more extreme value is small under the null hypothesis (usually in less than either 5% or 1% of similar data-sets in which the null hypothesis does hold).
If the data do not contradict the null hypothesis, then only a weak conclusion can be made: namely, that the observed data set provides insufficient evidence against the null hypothesis. In this case, because the null hypothesis could be true or false, in some contexts this is interpreted as meaning that the data give insufficient evidence to make any conclusion, while in other contexts, it is interpreted as meaning that there is not sufficient evidence to support changing from a currently useful regime to a different one. Nevertheless, if at this point the effect appears likely and/or large enough, there may be an incentive to further investigate, such as running a bigger sample.
For instance, a certain drug may reduce the risk of having a heart attack. Possible null hypotheses are "this drug does not reduce the risk of having a heart attack" or "this drug has no effect on the risk of having a heart attack". The test of the hypothesis consists of administering the drug to half of the people in a study group as a controlled experiment. If the data show a statistically significant change in the people receiving the drug, the null hypothesis is rejected.
There are many types of significance tests for one, two or more samples, for means, variances and proportions, paired or unpaired data, for different distributions, for large and small samples; all have null hypotheses. There are also at least four goals of null hypotheses for significance tests: [14]
Rejection of the null hypothesis is not necessarily the real goal of a significance tester. An adequate statistical model may be associated with a failure to reject the null; the model is adjusted until the null is not rejected. The numerous uses of significance testing were well known to Fisher who discussed many in his book written a decade before defining the null hypothesis. [15]
A statistical significance test shares much mathematics with a confidence interval. They are mutually illuminating. A result is often significant when there is confidence in the sign of a relationship (the interval does not include 0). Whenever the sign of a relationship is important, statistical significance is a worthy goal. This also reveals weaknesses of significance testing: A result can be significant without a good estimate of the strength of a relationship; significance can be a modest goal. A weak relationship can also achieve significance with enough data. Reporting both significance and confidence intervals is commonly recommended.
The varied uses of significance tests reduce the number of generalizations that can be made about all applications.
The choice of the null hypothesis is associated with sparse and inconsistent advice. Fisher mentioned few constraints on the choice and stated that many null hypotheses should be considered and that many tests are possible for each. The variety of applications and the diversity of goals suggests that the choice can be complicated. In many applications the formulation of the test is traditional. A familiarity with the range of tests available may suggest a particular null hypothesis and test. Formulating the null hypothesis is not automated (though the calculations of significance testing usually are). David Cox said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis". [16]
A statistical significance test is intended to test a hypothesis. If the hypothesis summarizes a set of data, there is no value in testing the hypothesis on that set of data. Example: If a study of last year's weather reports indicates that rain in a region falls primarily on weekends, it is only valid to test that null hypothesis on weather reports from any other year. Testing hypotheses suggested by the data is circular reasoning that proves nothing; It is a special limitation on the choice of the null hypothesis.
A routine procedure is as follows: Start from the scientific hypothesis. Translate this to a statistical alternative hypothesis and proceed: "Because Ha expresses the effect that we wish to find evidence for, we often begin with Ha and then set up H0 as the statement that the hoped-for effect is not present." [2] This advice is reversed for modeling applications where we hope not to find evidence against the null.
A complex case example is as follows: [17] The gold standard in clinical research is the randomized placebo-controlled double-blind clinical trial. But testing a new drug against a (medically ineffective) placebo may be unethical for a serious illness. Testing a new drug against an older medically effective drug raises fundamental philosophical issues regarding the goal of the test and the motivation of the experimenters. The standard "no difference" null hypothesis may reward the pharmaceutical company for gathering inadequate data. "Difference" is a better null hypothesis in this case, but statistical significance is not an adequate criterion for reaching a nuanced conclusion which requires a good numeric estimate of the drug's effectiveness. A "minor" or "simple" proposed change in the null hypothesis ((new vs old) rather than (new vs placebo)) can have a dramatic effect on the utility of a test for complex non-statistical reasons.
The choice of null hypothesis (H0) and consideration of directionality (see "one-tailed test") is critical.
Consider the question of whether a tossed coin is fair (i.e. that on average it lands heads up 50% of the time) and an experiment where you toss the coin 5 times. A possible result of the experiment that we consider here is 5 heads. Let outcomes be considered unlikely with respect to an assumed distribution if their probability is lower than a significance threshold of 0.05.
A potential null hypothesis implying a one-tailed test is "this coin is not biased toward heads". Beware that, in this context, the term "one-tailed" does not refer to the outcome of a single coin toss (i.e., whether or not the coin comes up "tails" instead of "heads"); the term "one-tailed" refers to a specific way of testing the null hypothesis in which the critical region (also known as "region of rejection") ends up in on only one side of the probability distribution.
Indeed, with a fair coin the probability of this experiment outcome is 1/25 = 0.031, which would be even lower if the coin were biased in favour of tails. Therefore, the observations are not likely enough for the null hypothesis to hold, and the test refutes it. Since the coin is ostensibly neither fair nor biased toward tails, the conclusion of the experiment is that the coin is biased towards heads.
Alternatively, a null hypothesis implying a two-tailed test is "this coin is fair". This one null hypothesis could be examined by looking out for either too many tails or too many heads in the experiments. The outcomes that would tend to refute this null hypothesis are those with a large number of heads or a large number of tails, and our experiment with 5 heads would seem to belong to this class.
However, the probability of 5 tosses of the same kind, irrespective of whether these are head or tails, is twice as much as that of the 5-head occurrence singly considered. Hence, under this two-tailed null hypothesis, the observation receives a probability value of 0.063. Hence again, with the same significance threshold used for the one-tailed test (0.05), the same outcome is not statistically significant. Therefore, the two-tailed null hypothesis will be preserved in this case, not supporting the conclusion reached with the single-tailed null hypothesis, that the coin is biased towards heads.
This example illustrates that the conclusion reached from a statistical test may depend on the precise formulation of the null and alternative hypotheses.
Fisher said, "the null hypothesis must be exact, that is free of vagueness and ambiguity, because it must supply the basis of the 'problem of distribution,' of which the test of significance is the solution", implying a more restrictive domain for H0. [18] According to this view, the null hypothesis must be numerically exact—it must state that a particular quantity or difference is equal to a particular number. In classical science, it is most typically the statement that there is no effect of a particular treatment; in observations, it is typically that there is no difference between the value of a particular measured variable and that of a prediction.
Most statisticians believe that it is valid to state direction as a part of null hypothesis, or as part of a null hypothesis/alternative hypothesis pair. [19] However, the results are not a full description of all the results of an experiment, merely a single result tailored to one particular purpose. For example, consider an H0 that claims the population mean for a new treatment is an improvement on a well-established treatment with population mean = 10 (known from long experience), with the one-tailed alternative being that the new treatment's mean > 10. If the sample evidence obtained through x-bar equals −200 and the corresponding t-test statistic equals −50, the conclusion from the test would be that there is no evidence that the new treatment is better than the existing one: it would not report that it is markedly worse, but that is not what this particular test is looking for. To overcome any possible ambiguity in reporting the result of the test of a null hypothesis, it is best to indicate whether the test was two-sided and, if one-sided, to include the direction of the effect being tested.
The statistical theory required to deal with the simple cases of directionality dealt with here, and more complicated ones, makes use of the concept of an unbiased test.
The directionality of hypotheses is not always obvious. The explicit null hypothesis of Fisher's Lady tasting tea example was that the Lady had no such ability, which led to a symmetric probability distribution. The one-tailed nature of the test resulted from the one-tailed alternate hypothesis (a term not used by Fisher). The null hypothesis became implicitly one-tailed. The logical negation of the Lady's one-tailed claim was also one-tailed. (Claim: Ability > 0; Stated null: Ability = 0; Implicit null: Ability ≤ 0).
Pure arguments over the use of one-tailed tests are complicated by the variety of tests. Some tests (for instance the χ2 goodness of fit test) are inherently one-tailed. Some probability distributions are asymmetric. The traditional tests of 3 or more groups are two-tailed.
Advice concerning the use of one-tailed hypotheses has been inconsistent and accepted practice varies among fields. [20] The greatest objection to one-tailed hypotheses is their potential subjectivity. A non-significant result can sometimes be converted to a significant result by the use of a one-tailed hypothesis (as the fair coin test, at the whim of the analyst). The flip side of the argument: One-sided tests are less likely to ignore a real effect. One-tailed tests can suppress the publication of data that differs in sign from predictions. Objectivity was a goal of the developers of statistical tests.
It is a common practice to use a one-tailed hypothesis by default. However, "If you do not have a specific direction firmly in mind in advance, use a two-sided alternative. Moreover, some users of statistics argue that we should always work with the two-sided alternative." [2] [21]
One alternative to this advice is to use three-outcome tests. It eliminates the issues surrounding directionality of hypotheses by testing twice, once in each direction and combining the results to produce three possible outcomes. [22] Variations on this approach have a history, being suggested perhaps 10 times since 1950. [23]
Disagreements over one-tailed tests flow from the philosophy of science. While Fisher was willing to ignore the unlikely case of the Lady guessing all cups of tea incorrectly (which may have been appropriate for the circumstances), medicine believes that a proposed treatment that kills patients is significant in every sense and should be reported and perhaps explained. Poor statistical reporting practices have contributed to disagreements over one-tailed tests. Statistical significance resulting from two-tailed tests is insensitive to the sign of the relationship; Reporting significance alone is inadequate. "The treatment has an effect" is the uninformative result of a two-tailed test. "The treatment has a beneficial effect" is the more informative result of a one-tailed test. "The treatment has an effect, reducing the average length of hospitalization by 1.5 days" is the most informative report, combining a two-tailed significance test result with a numeric estimate of the relationship between treatment and effect. Explicitly reporting a numeric result eliminates a philosophical advantage of a one-tailed test. An underlying issue is the appropriate form of an experimental science without numeric predictive theories: A model of numeric results is more informative than a model of effect signs (positive, negative or unknown) which is more informative than a model of simple significance (non-zero or unknown); in the absence of numeric theory signs may suffice.
The history of the null and alternative hypotheses has much to do with the history of statistical tests. [24] [25]
Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures used to analyze the differences among means. ANOVA was developed by the statistician Ronald Fisher. ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means. In other words, the ANOVA is used to test the difference between two or more means.
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.
A statistical hypothesis test is a method of statistical inference used to decide whether the data sufficiently supports a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. Then a decision is made, either by comparing the test statistic to a critical value or equivalently by evaluating a p-value computed from the test statistic. Roughly 100 specialized statistical tests have been defined.
In statistics, the likelihood-ratio test is a hypothesis test that involves comparing the goodness of fit of two competing statistical models, typically one found by maximization over the entire parameter space and another found after imposing some constraint, based on the ratio of their likelihoods. If the more constrained model is supported by the observed data, the two likelihoods should not differ by more than sampling error. Thus the likelihood-ratio test tests whether this ratio is significantly different from one, or equivalently whether its natural logarithm is significantly different from zero.
In statistical hypothesis testing, a result has statistical significance when a result at least as "extreme" would be very infrequent if the null hypothesis were true. More precisely, a study's defined significance level, denoted by , is the probability of the study rejecting the null hypothesis, given that the null hypothesis is true; and the p-value of a result, , is the probability of obtaining a result at least as extreme, given that the null hypothesis is true. The result is statistically significant, by the standards of the study, when . The significance level for a study is chosen before data collection, and is typically set to 5% or much lower—depending on the field of study.
In frequentist statistics, power is a measure of the ability of an experimental design and hypothesis testing setup to detect a particular effect if it is truly present. In typical use, it is a function of the test used, the assumed distribution of the test, and the effect size of interest. High statistical power is related to low variability, large sample sizes, large effects being looked for, and less stringent requirements for statistical significance.
A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Z-test tests the mean of a distribution. For each significance level in the confidence interval, the Z-test has a single critical value which makes it more convenient than the Student's t-test whose critical values are defined by the sample size. Both the Z-test and Student's t-test have similarities in that they both help determine the significance of a set of data. However, the z-test is rarely used in practice because the population deviation is difficult to determine.
In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. Even though reporting p-values of statistical tests is common practice in academic publications of many quantitative fields, misinterpretation and misuse of p-values is widespread and has been a major topic in mathematics and metascience.
In statistical hypothesis testing, the alternative hypothesis is one of the proposed proposition in the hypothesis test. In general the goal of hypothesis test is to demonstrate that in the given condition, there is sufficient evidence supporting the credibility of alternative hypothesis instead of the exclusive proposition in the test. It is usually consistent with the research hypothesis because it is constructed from literature review, previous studies, etc. However, the research hypothesis is sometimes consistent with the null hypothesis.
In statistical significance testing, a one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic. A two-tailed test is appropriate if the estimated value is greater or less than a certain range of values, for example, whether a test taker may score above or below a specific range of scores. This method is used for null hypothesis testing and if the estimated value exists in the critical areas, the alternative hypothesis is accepted over the null hypothesis. A one-tailed test is appropriate if the estimated value may depart from the reference value in only one direction, left or right, but not both. An example can be whether a machine produces more than one-percent defective products. In this situation, if the estimated value exists in one of the one-sided critical areas, depending on the direction of interest, the alternative hypothesis is accepted over the null hypothesis. Alternative names are one-sided and two-sided tests; the terminology "tail" is used because the extreme portions of distributions, where observations lead to rejection of the null hypothesis, are small and often "tail off" toward zero as in the normal distribution, colored in yellow, or "bell curve", pictured on the right and colored in green.
Test statistic is a quantity derived from the sample for statistical hypothesis testing. A hypothesis test is typically specified in terms of a test statistic, considered as a numerical summary of a data-set that reduces the data to one value that can be used to perform the hypothesis test. In general, a test statistic is selected or defined in such a way as to quantify, within observed data, behaviours that would distinguish the null from the alternative hypothesis, where such an alternative is prescribed, or that would characterize the null hypothesis if there is no explicitly stated alternative hypothesis.
A permutation test is an exact statistical hypothesis test making use of the proof by contradiction. A permutation test involves two or more samples. The null hypothesis is that all samples come from the same distribution . Under the null hypothesis, the distribution of the test statistic is obtained by calculating all possible values of the test statistic under possible rearrangements of the observed data. Permutation tests are, therefore, a form of resampling.
The sign test is a statistical test for consistent differences between pairs of observations, such as the weight of subjects before and after treatment. Given pairs of observations for each subject, the sign test determines if one member of the pair tends to be greater than the other member of the pair.
In statistical hypothesis testing, a type I error, or a false positive, is the rejection of the null hypothesis when it is actually true. For example, an innocent person may be convicted.
Omnibus tests are a kind of statistical test. They test whether the explained variance in a set of data is significantly greater than the unexplained variance, overall. One example is the F-test in the analysis of variance. There can be legitimate significant effects within a model even if the omnibus test is not significant. For instance, in a model with two independent variables, if only one variable exerts a significant effect on the dependent variable and the other does not, then the omnibus test may be non-significant. This fact does not affect the conclusions that may be drawn from the one significant variable. In order to test effects within an omnibus test, researchers often use contrasts.
In statistics, Fisher's method, also known as Fisher's combined probability test, is a technique for data fusion or "meta-analysis" (analysis of analyses). It was developed by and named for Ronald Fisher. In its basic form, it is used to combine the results from several independence tests bearing upon the same overall hypothesis (H0).
The Foundations of Statistics are the mathematical and philosophical bases for statistical methods. These bases are the theoretical frameworks that ground and justify methods of statistical inference, estimation, hypothesis testing, uncertainty quantification, and the interpretation of statistical conclusions. Further, a foundation can be used to explain statistical paradoxes, provide descriptions of statistical laws, and guide the application of statistics to real-world problems.
Frequentist inference is a type of statistical inference based in frequentist probability, which treats “probability” in equivalent terms to “frequency” and draws conclusions from sample-data by means of emphasizing the frequency or proportion of findings in the data. Frequentist inference underlies frequentist statistics, in which the well-established methodologies of statistical hypothesis testing and confidence intervals are founded.
Estimation statistics, or simply estimation, is a data analysis framework that uses a combination of effect sizes, confidence intervals, precision planning, and meta-analysis to plan experiments, analyze data and interpret results. It complements hypothesis testing approaches such as null hypothesis significance testing (NHST), by going beyond the question is an effect present or not, and provides information about how large an effect is. Estimation statistics is sometimes referred to as the new statistics.
Misuse of p-values is common in scientific research and scientific education. p-values are often used or interpreted incorrectly; the American Statistical Association states that p-values can indicate how incompatible the data are with a specified statistical model. From a Neyman–Pearson hypothesis testing approach to statistical inferences, the data obtained by comparing the p-value to a significance level will yield one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level. From a Fisherian statistical testing approach to statistical inferences, a low p-value means either that the null hypothesis is true and a highly improbable event has occurred or that the null hypothesis is false.