Equivalence test

Equivalence tests are a variety of hypothesis tests used to draw statistical inferences from observed data. In these tests, the null hypothesis is defined as an effect large enough to be deemed interesting, specified by an equivalence bound. The alternative hypothesis is any effect that is less extreme than said equivalence bound. The observed data are statistically compared against the equivalence bounds. If the statistical test indicates that the observed data are surprising under the assumption that true effects are at least as extreme as the equivalence bounds, a Neyman-Pearson approach to statistical inference can be used to reject effect sizes larger than the equivalence bounds with a pre-specified Type 1 error rate.

Equivalence testing originates from the field of clinical trials. [1] One application, known as a non-inferiority trial, is used to show that a new drug that is cheaper than available alternatives works as well as an existing drug. In essence, equivalence tests consist of calculating a confidence interval around an observed effect size and rejecting effects more extreme than the equivalence bound when the confidence interval does not overlap with the equivalence bound. In two-sided tests, both upper and lower equivalence bounds are specified. In non-inferiority trials, where the goal is to test the hypothesis that a new treatment is not worse than existing treatments, only a lower equivalence bound is specified.   
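The confidence-interval rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ci_within_bounds`, its parameters, and the example numbers are hypothetical, and a normal critical value is assumed in place of the Student t quantile that would normally be used for small samples. A one-sided test at level alpha corresponds to a (1 − 2·alpha) confidence interval, which is why 90% intervals appear alongside alpha = 0.05.

```python
from statistics import NormalDist


def ci_within_bounds(estimate, std_error, lower, upper, alpha=0.05):
    """Return the (1 - 2*alpha) confidence interval around `estimate` and
    whether it lies strictly inside the equivalence bounds (lower, upper)."""
    # Assumption: normal critical value (large-sample approximation).
    z = NormalDist().inv_cdf(1 - alpha)
    ci_low = estimate - z * std_error
    ci_high = estimate + z * std_error
    # Effects beyond the bounds are rejected only if the interval
    # does not reach either bound.
    equivalent = lower < ci_low and ci_high < upper
    return (ci_low, ci_high), equivalent


# Hypothetical numbers: observed mean difference 0.1, standard error 0.15,
# equivalence bounds of -0.5 and 0.5.
interval, equivalent = ci_within_bounds(0.1, 0.15, -0.5, 0.5)

# Non-inferiority: only a lower bound is specified, so the upper bound is
# set to infinity.
_, non_inferior = ci_within_bounds(0.4, 0.15, -0.5, float("inf"))
```

With these hypothetical inputs the 90% interval lies inside the bounds, so effects more extreme than ±0.5 would be rejected. Passing `float("inf")` as the upper bound reduces the check to the one-sided non-inferiority case.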

Mean differences (black squares) and 90% confidence intervals (horizontal lines) with equivalence bounds ΔL = −0.5 and ΔU = 0.5 for four combinations of test results that are statistically equivalent or not and statistically different from zero or not. Pattern A is statistically equivalent, pattern B is statistically different from 0, pattern C is practically insignificant, and pattern D is inconclusive (neither statistically different from 0 nor equivalent).

Equivalence tests can be performed in addition to null-hypothesis significance tests. [2] [3] [4] [5] This might prevent the common misinterpretation of p-values larger than the alpha level as support for the absence of a true effect. Furthermore, equivalence tests can identify effects that are statistically significant but practically insignificant, that is, effects that are statistically different from zero but also statistically smaller than any effect size deemed worthwhile (see the first figure). [6]

Equivalence tests were originally used in areas such as pharmaceutics, frequently in bioequivalence trials. However, these tests can be applied whenever the research question asks whether the means of two sets of scores are practically or theoretically equivalent. Accordingly, equivalence analyses have seen increased usage in almost all medical research fields, and the field of psychology has been adopting equivalence testing, particularly in clinical trials. Equivalence analyses need not be limited to clinical trials, however: they have recently been introduced in the evaluation of measurement devices, [7] [8] artificial intelligence, [9] and exercise physiology and sports science. [10] Several tests exist for equivalence analyses; more recently, the two one-sided t-tests (TOST) procedure has been garnering considerable attention. As outlined below, this approach is an adaptation of the widely known t-test.
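Combining a significance test with an equivalence test yields four possible outcomes, labelled A to D in the first figure. The following Python sketch maps two p-values onto those labels. The function name `classify_result` and its parameters are hypothetical, and using a single alpha level for both tests is a simplifying assumption.

```python
def classify_result(p_nhst, p_equivalence, alpha=0.05):
    """Map the p-value of a null-hypothesis significance test and the
    p-value of an equivalence test onto the four patterns A to D."""
    different_from_zero = p_nhst < alpha
    equivalent = p_equivalence < alpha
    if equivalent and not different_from_zero:
        return "A: statistically equivalent"
    if different_from_zero and not equivalent:
        return "B: statistically different from zero"
    if different_from_zero and equivalent:
        return "C: statistically different from zero but practically insignificant"
    return "D: inconclusive"


# Hypothetical p-values: a non-significant difference test together with a
# significant equivalence test corresponds to pattern A.
label = classify_result(p_nhst=0.40, p_equivalence=0.01)
```

Pattern D illustrates why a p-value above the alpha level in the significance test does not, by itself, support the absence of an effect: the equivalence test may be non-significant as well.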

TOST procedure

A very simple equivalence testing approach is the "two one-sided t-tests" (TOST) procedure. [11] In the TOST procedure, an upper (ΔU) and a lower (–ΔL) equivalence bound are specified based on the smallest effect size of interest (e.g., a positive or negative difference of d = 0.3). Two composite null hypotheses are tested: H01: Δ ≤ –ΔL and H02: Δ ≥ ΔU. When both of these one-sided tests can be statistically rejected, we can conclude that –ΔL < Δ < ΔU, or that the observed effect falls within the equivalence bounds, is statistically smaller than any effect deemed worthwhile, and is considered practically equivalent. [12] Alternatives to the TOST procedure have been developed as well. [13] A recent modification to TOST makes the approach feasible in cases of repeated measures and assessing multiple variables. [14]
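The two one-sided tests can be sketched in Python for the difference between two independent group means. This is a minimal illustration under stated assumptions: the function name `tost_two_sample` and the example data are hypothetical, the bounds are given in raw units (with `lower` playing the role of –ΔL and `upper` that of ΔU), and a normal approximation replaces the Student t distribution that the t-test version would use, which is only reasonable for fairly large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def tost_two_sample(x, y, lower, upper, alpha=0.05):
    """Two one-sided tests for the difference mean(x) - mean(y).

    Returns the observed difference, the two one-sided p-values, and
    whether both null hypotheses are rejected at level alpha."""
    diff = mean(x) - mean(y)
    # Standard error of the difference without assuming equal variances.
    std_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    norm = NormalDist()  # assumption: normal approximation, not Student t
    # H01: true difference <= lower bound (rejected for large statistics).
    p_lower = 1 - norm.cdf((diff - lower) / std_error)
    # H02: true difference >= upper bound (rejected for small statistics).
    p_upper = norm.cdf((diff - upper) / std_error)
    # Equivalence is declared only if both one-sided tests reject,
    # i.e. if the larger of the two p-values is below alpha.
    equivalent = max(p_lower, p_upper) < alpha
    return diff, p_lower, p_upper, equivalent


# Hypothetical measurements, repeated to obtain 50 values per group.
x = [5.1, 4.9, 5.0, 5.2, 4.8] * 10
y = [5.0, 5.1, 4.9, 5.3, 4.8] * 10

diff, p_lower, p_upper, equivalent = tost_two_sample(x, y, -0.3, 0.3)
```

For these made-up data the observed difference is about −0.02, and both one-sided tests reject with bounds of ±0.3. With much narrower bounds such as ±0.05 the same data no longer support equivalence, which shows how strongly the conclusion depends on the choice of the smallest effect size of interest.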

Comparison between t-test and equivalence test

The equivalence test can be induced from the t-test. [7] Consider a t-test at the significance level α_t-test with a power of 1 − β_t-test for a relevant effect size d_r. If Δ = d_r, and if α_equiv-test = β_t-test and β_equiv-test = α_t-test, i.e. the error types (type I and type II) are interchanged between the t-test and the equivalence test, then the t-test will obtain the same results as the equivalence test. To achieve this for the t-test, either the sample size calculation needs to be carried out correctly, or the t-test significance level α_t-test needs to be adjusted, which is referred to as the revised t-test. [7] Both approaches have difficulties in practice, since sample size planning relies on unverifiable assumptions about the standard deviation, and the revised t-test yields numerical problems. [7] These limitations can be removed, while preserving the test behavior, by using an equivalence test.

The figure below allows a visual comparison of the equivalence test and the t-test when the sample size calculation is affected by differences between the a priori standard deviation σ and the sample's standard deviation σ̂, which is a common problem. Using an equivalence test instead of a t-test additionally ensures that α_equiv-test is bounded, which the t-test does not do in the case that σ̂ > σ, where the type II error can grow arbitrarily large. On the other hand, σ̂ < σ results in the t-test being stricter than the d_r specified in the planning, which may randomly penalize the sample source (e.g., a device manufacturer). This makes the equivalence test safer to use.

Chances to pass (a) the t-test and (b) the equivalence test, depending on the actual error 𝜇. For more details, see

References

  1. Snapinn, Steven M. (2000). "Noninferiority trials". Current Controlled Trials in Cardiovascular Medicine. 1 (1): 19–21. doi:10.1186/CVM-1-1-019. PMC 59590. PMID 11714400.
  2. Rogers, James L.; Howard, Kenneth I.; Vessey, John T. (1993). "Using significance tests to evaluate equivalence between two experimental groups". Psychological Bulletin. 113 (3): 553–565. doi:10.1037/0033-2909.113.3.553. PMID 8316613.
  3. Statistics applied to clinical trials (4th ed.). Springer. 2009. ISBN 978-1402095221.
  4. Piaggio, Gilda; Elbourne, Diana R.; Altman, Douglas G.; Pocock, Stuart J.; Evans, Stephen J. W.; for the CONSORT Group (8 March 2006). "Reporting of Noninferiority and Equivalence Randomized Trials". JAMA. 295 (10): 1152–60. doi:10.1001/jama.295.10.1152. PMID 16522836.
  5. Piantadosi, Steven (28 August 2017). Clinical trials: a methodologic perspective (Third ed.). John Wiley & Sons. p. 8.6.2. ISBN 978-1-118-95920-6.
  6. Lakens, Daniël (2017-05-05). "Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses". Social Psychological and Personality Science. 8 (4): 355–362. doi:10.1177/1948550617697177. PMC 5502906. PMID 28736600.
  7. Siebert, Michael; Ellenberger, David (2019-04-10). "Validation of automatic passenger counting: introducing the t-test-induced equivalence test". Transportation. 47 (6): 3031–3045. arXiv:1802.03341. doi:10.1007/s11116-019-09991-9. ISSN 0049-4488.
  8. Schnellbach, Teresa (2022). Hydraulic Data Analysis Using Python. doi:10.26083/tuprints-00022026.
  9. Jahn, Nico; Siebert, Michael (2022). "Engineering the Neural Automatic Passenger Counter". Engineering Applications of Artificial Intelligence. 114. arXiv:2203.01156. doi:10.1016/j.engappai.2022.105148.
  10. Mazzolari, Raffaele; Porcelli, Simone; Bishop, David J.; Lakens, Daniël (March 2022). "Myths and methodologies: The use of equivalence and non‐inferiority tests for interventional studies in exercise physiology and sport science". Experimental Physiology. 107 (3): 201–212. doi:10.1113/EP090171. ISSN 0958-0670. PMID 35041233. S2CID 246051376.
  11. Schuirmann, Donald J. (1987-12-01). "A comparison of the Two One-Sided Tests Procedure and the Power Approach for assessing the equivalence of average bioavailability". Journal of Pharmacokinetics and Biopharmaceutics. 15 (6): 657–680. doi:10.1007/BF01068419. ISSN 0090-466X. PMID 3450848. S2CID 206788664.
  12. Lakens, Daniël (May 2017). "Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses". Social Psychological and Personality Science. 8 (4): 355–362. doi:10.1177/1948550617697177. ISSN 1948-5506. PMC 5502906. PMID 28736600.
  13. Wellek, Stefan (2010). Testing statistical hypotheses of equivalence and noninferiority. Chapman and Hall/CRC. ISBN 978-1439808184.
  14. Rose, Evangeline M.; Mathew, Thomas; Coss, Derek A.; Lohr, Bernard; Omland, Kevin E. (2018). "A new statistical method to test equivalence: an application in male and female eastern bluebird song". Animal Behaviour. 145: 77–85. doi:10.1016/j.anbehav.2018.09.004. ISSN 0003-3472. S2CID 53152801.