Sequential analysis

In statistics, sequential analysis or sequential hypothesis testing is statistical analysis where the sample size is not fixed in advance. Instead, data are evaluated as they are collected, and further sampling is stopped in accordance with a pre-defined stopping rule as soon as significant results are observed. Thus a conclusion may sometimes be reached at a much earlier stage than would be possible with more classical hypothesis testing or estimation, at a correspondingly lower financial and/or human cost.
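
A classical example of such a procedure is Wald's sequential probability ratio test (SPRT). The following is a minimal sketch in Python for a stream of Bernoulli observations, using Wald's threshold approximations A ≈ (1 − β)/α and B ≈ β/(1 − α); the parameters and data stream are hypothetical.

```python
import math
import random

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for H0: p = p0 versus H1: p = p1 on a Bernoulli stream."""
    upper = math.log((1 - beta) / alpha)  # crossing above accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing below accepts H0
    llr, n = 0.0, 0                       # running log-likelihood ratio
    for n, x in enumerate(observations, start=1):
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "inconclusive", n              # data exhausted before a boundary

# Hypothetical stream: a coin whose true success probability is 0.7
random.seed(1)
stream = (random.random() < 0.7 for _ in range(1000))
print(sprt_bernoulli(stream, p0=0.5, p1=0.7))
```

On average such a test stops well before a fixed-sample test with the same error rates would have collected all of its data, which is the efficiency gain described above.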

History

The method of sequential analysis is first attributed to Abraham Wald [1] with Jacob Wolfowitz, W. Allen Wallis, and Milton Friedman [2] while at Columbia University's Statistical Research Group as a tool for more efficient industrial quality control during World War II. Its value to the war effort was immediately recognised, and led to its receiving a "restricted" classification. [3] At the same time, George Barnard led a group working on optimal stopping in Great Britain. Another early contribution to the method was made by K.J. Arrow with D. Blackwell and M.A. Girshick. [4]

A similar approach was independently developed from first principles at about the same time by Alan Turing, as part of the Banburismus technique used at Bletchley Park, to test hypotheses about whether different messages coded by German Enigma machines should be connected and analysed together. This work remained secret until the early 1980s. [5]

Peter Armitage introduced the use of sequential analysis in medical research, especially in the area of clinical trials. Sequential methods became increasingly popular in medicine following Stuart Pocock's work that provided clear recommendations on how to control Type 1 error rates in sequential designs. [6]

Alpha spending functions

When researchers repeatedly analyze data as more observations are added, the probability of a Type 1 error increases. Therefore, it is important to adjust the alpha level at each interim analysis, such that the overall Type 1 error rate remains at the desired level. This is conceptually similar to using the Bonferroni correction, but because the repeated looks at the data are dependent, more efficient corrections for the alpha level can be used. Among the earliest proposals is the Pocock boundary. Alternative ways to control the Type 1 error rate exist, such as the Haybittle–Peto bounds, and additional work on determining the boundaries for interim analyses has been done by O’Brien & Fleming [7] and Wang & Tsiatis. [8]
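
The size of this inflation is easy to check by simulation. The sketch below (hypothetical numbers of looks and subjects; numpy and scipy assumed available) applies an uncorrected two-sided z-test to accumulating null data at each of five looks and estimates the overall Type 1 error rate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, n_per_look, n_looks, n_sims = 0.05, 50, 5, 20_000
crit = norm.ppf(1 - alpha / 2)  # unadjusted two-sided critical value

false_rejections = 0
for _ in range(n_sims):
    data = rng.standard_normal(n_per_look * n_looks)  # H0 is true: mean 0, sd 1
    for k in range(1, n_looks + 1):
        sample = data[: k * n_per_look]
        z = sample.mean() * np.sqrt(len(sample))  # z-test with known sd = 1
        if abs(z) > crit:                         # naive, uncorrected look
            false_rejections += 1
            break

print(false_rejections / n_sims)  # roughly 0.14, not the nominal 0.05
```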

A limitation of corrections such as the Pocock boundary is that the number of looks at the data must be determined before the data are collected, and that the looks should be equally spaced (e.g., after 50, 100, 150, and 200 patients). The alpha spending function approach developed by Demets & Lan [9] does not have these restrictions and, depending on the parameters chosen for the spending function, can closely resemble Pocock boundaries or the corrections proposed by O'Brien and Fleming. Another approach, based on e-values and e-processes, has no such restrictions at all.
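
For illustration, the two spending functions most often associated with this approach are easy to evaluate directly. A minimal sketch using the Lan–DeMets forms commonly quoted for O'Brien–Fleming-type and Pocock-type spending; the information fractions are arbitrary and deliberately unequally spaced:

```python
import numpy as np
from scipy.stats import norm

def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: spends almost no alpha at early looks."""
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))

def pocock_spending(t, alpha=0.05):
    """Pocock-type spending: spends alpha roughly evenly over the trial."""
    return alpha * np.log(1 + (np.e - 1) * t)

# Cumulative alpha spent at unequally spaced information fractions,
# a freedom the spending-function approach allows
for t in [0.2, 0.5, 0.75, 1.0]:
    print(f"t={t:.2f}  OBF={obf_spending(t):.4f}  Pocock={pocock_spending(t):.4f}")
```

Both functions spend the full alpha of 0.05 by the final analysis (t = 1); they differ in how much of it is available at early looks.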

Applications of sequential analysis

Clinical trials

In a randomized trial with two treatment groups, group sequential testing may, for example, be conducted in the following manner: after n subjects in each group are available, an interim analysis is conducted. A statistical test is performed to compare the two groups; if the null hypothesis is rejected, the trial is terminated. Otherwise, the trial continues, another n subjects per group are recruited, and the statistical test is performed again, including all subjects. If the null is rejected, the trial is terminated; otherwise it continues with periodic evaluations until a maximum number of interim analyses have been performed, at which point the last statistical test is conducted and the trial is discontinued. [10]
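
A minimal simulation of this procedure, using the Pocock critical value of about 2.413 for five equally spaced looks at a two-sided alpha of 0.05; the effect size and group sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_stage, max_looks = 50, 5
pocock_crit = 2.413          # two-sided Pocock boundary for 5 looks, alpha = 0.05
true_effect = 0.4            # hypothetical standardized difference between groups

treat, control = np.empty(0), np.empty(0)
for look in range(1, max_looks + 1):
    treat = np.append(treat, rng.normal(true_effect, 1, n_per_stage))
    control = np.append(control, rng.normal(0, 1, n_per_stage))
    n = len(treat)
    # z-statistic for the difference in means, known unit variances
    z = (treat.mean() - control.mean()) / np.sqrt(2 / n)
    if abs(z) > pocock_crit:
        print(f"stop at look {look}: n={n}/group, z={z:.2f}, reject H0")
        break
else:
    print(f"no rejection after {max_looks} looks (n={n}/group)")
```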

Other applications

Sequential analysis also has a connection to the problem of gambler's ruin that has been studied by, among others, Huygens in 1657. [11]

Step detection is the process of finding abrupt changes in the mean level of a time series or signal. It is usually considered a special case of the statistical method known as change point detection. Often the step is small and the time series is corrupted by some kind of noise, which makes the problem challenging because the step may be hidden by the noise; statistical and/or signal processing algorithms are therefore often required. When such an algorithm is run online as the data arrive, especially with the aim of producing an alert, this is an application of sequential analysis.
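
One classical sequential scheme for this task is the CUSUM detector. The sketch below (hypothetical drift and threshold parameters) accumulates the excess of each observation above a target mean and raises an alert when the running sum crosses a threshold:

```python
import random

def cusum_alert(stream, target_mean=0.0, drift=0.5, threshold=5.0):
    """One-sided CUSUM: alert when the mean appears to have stepped upward.

    drift: slack subtracted per observation so noise alone rarely accumulates.
    threshold: alarm level; larger values trade detection delay for fewer
    false alarms.
    """
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - target_mean) - drift)  # excess above target
        if s > threshold:
            return i  # index at which the alert is raised
    return None

# Hypothetical signal: mean-0 noise with an upward step of 2 at index 100
random.seed(0)
signal = [random.gauss(0, 1) for _ in range(100)] + \
         [random.gauss(2, 1) for _ in range(100)]
print(cusum_alert(signal))  # alerts shortly after the step at index 100
```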

Bias

Trials that are terminated early because they reject the null hypothesis typically overestimate the true effect size. [12] This is because in small samples only large effect size estimates will lead to a significant result, and hence to the early termination of a trial. Methods to correct effect size estimates in single trials have been proposed. [13] Note that this bias is mainly problematic when interpreting single studies. In meta-analyses, overestimated effect sizes due to early stopping are balanced by underestimation in trials that stop late, leading Schou & Marschner to conclude that "early stopping of clinical trials is not a substantive source of bias in meta-analyses". [14]
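
A small simulation makes the overestimation visible (a sketch with hypothetical parameters): among simulated trials that stop at an interim look because the effect is significant, the average estimated effect clearly exceeds the true value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
true_effect, n_interim = 0.3, 50
crit = norm.ppf(0.975)  # uncorrected one-look boundary, for illustration only

early_estimates = []
for _ in range(50_000):
    interim = rng.normal(true_effect, 1, n_interim)
    z = interim.mean() * np.sqrt(n_interim)  # one-sample z-test, known sd = 1
    if z > crit:  # trial stops early for efficacy
        early_estimates.append(interim.mean())

print(np.mean(early_estimates))  # about 0.4, well above the true effect of 0.3
```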

The meaning of p-values in sequential analyses also changes: because more than one analysis is performed, the usual definition of a p-value, as the probability of observing data at least as extreme as the data actually observed, needs to be redefined. One solution is to order the p-values of a series of sequential tests by the time of stopping and the height of the test statistic at a given look; this is known as stagewise ordering, [12] first proposed by Armitage.

Notes

  1. Wald, Abraham (June 1945). "Sequential Tests of Statistical Hypotheses". The Annals of Mathematical Statistics. 16 (2): 117–186. doi:10.1214/aoms/1177731118. JSTOR 2235829.
  2. Berger, James (2008). "Sequential Analysis". The New Palgrave Dictionary of Economics (2nd ed.). pp. 438–439. doi:10.1057/9780230226203.1513. ISBN 978-0-333-78676-5.
  3. Weigl, Hans Günter (2013). Abraham Wald: a statistician as a key figure for modern econometrics (PDF) (Doctoral thesis). University of Hamburg.
  4. Arrow, Kenneth J.; Blackwell, David; Girshick, M. A. (1949). "Bayes and minimax solutions of sequential decision problems". Econometrica. 17 (3/4): 213–244. doi:10.2307/1905525. JSTOR 1905525.
  5. Randell, Brian (1980), "The Colossus", A History of Computing in the Twentieth Century, p. 30.
  6. Jennison, Christopher; Turnbull, Bruce W. (2000). Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall. ISBN 9780849303166. OCLC 900071609.
  7. O'Brien, Peter C.; Fleming, Thomas R. (1979). "A Multiple Testing Procedure for Clinical Trials". Biometrics. 35 (3): 549–556. doi:10.2307/2530245. JSTOR 2530245. PMID 497341.
  8. Wang, Samuel K.; Tsiatis, Anastasios A. (1987). "Approximately Optimal One-Parameter Boundaries for Group Sequential Trials". Biometrics. 43 (1): 193–199. doi:10.2307/2531959. JSTOR 2531959. PMID 3567304.
  9. Demets, David L.; Lan, K. K. Gordon (1994). "Interim analysis: The alpha spending function approach". Statistics in Medicine. 13 (13–14): 1341–1352. doi:10.1002/sim.4780131308. ISSN 1097-0258. PMID 7973215.
  10. Korosteleva, Olga (2008). Clinical Statistics: Introducing Clinical Trials, Survival Analysis, and Longitudinal Data Analysis (First ed.). Jones and Bartlett Publishers. ISBN 978-0-7637-5850-9.
  11. Ghosh, B. K.; Sen, P. K. (1991). Handbook of Sequential Analysis. New York: Marcel Dekker. ISBN 9780824784089.
  12. Proschan, Michael A.; Lan, K. K. Gordon; Wittes, Janet Turk (2006). Statistical Monitoring of Clinical Trials: A Unified Approach. Springer. ISBN 9780387300597. OCLC 553888945.
  13. Liu, A.; Hall, W. J. (1999). "Unbiased estimation following a group sequential test". Biometrika. 86 (1): 71–78. doi:10.1093/biomet/86.1.71. ISSN 0006-3444.
  14. Schou, I. Manjula; Marschner, Ian C. (2013). "Meta-analysis of clinical trials with early stopping: an investigation of potential bias". Statistics in Medicine. 32 (28): 4859–4874. doi:10.1002/sim.5893. ISSN 1097-0258. PMID 23824994. S2CID 22428591.
