Sequential probability ratio test

Last updated October 02, 2024

The sequential probability ratio test (SPRT) is a specific sequential hypothesis test, developed by Abraham Wald^[1] and later proven to be optimal by Wald and Jacob Wolfowitz.^[2] Neyman and Pearson's 1933 result inspired Wald to reformulate it as a sequential analysis problem. The Neyman-Pearson lemma, by contrast, gives a decision rule for the case where all the data have already been collected (and the likelihood ratio is known).

Theory

As in classical hypothesis testing, SPRT starts with a pair of hypotheses, say $H_{0}$ and $H_{1}$ for the null hypothesis and alternative hypothesis respectively. They must be specified as follows:

H_{0}:p=p_{0}

H_{1}:p=p_{1}

The next step is to calculate the cumulative sum of the log-likelihood ratio, $\log \Lambda _{i}$, as new data arrive: with $S_{0}=0$, then, for $i=1,2,\ldots$,

S_{i}=S_{i-1}+\log \Lambda _{i}

The stopping rule is a simple thresholding scheme:

$a<S_{i}<b$ : continue monitoring (critical inequality)
$S_{i}\geq b$ : Accept $H_{1}$
$S_{i}\leq a$ : Accept $H_{0}$

where $a$ and $b$ ( $a<0<b<\infty$ ) depend on the desired type I and type II errors, $\alpha$ and $\beta$ . They may be chosen as follows:

$a\approx \log {\frac {\beta }{1-\alpha }}$ and $b\approx \log {\frac {1-\beta }{\alpha }}$

In other words, $\alpha$ and $\beta$ must be decided beforehand in order to set the thresholds appropriately. Their numerical values will depend on the application. The thresholds are only approximate because, in the discrete case, the signal may cross a threshold between samples and overshoot it. Thus, depending on the penalty of making an error and the sampling frequency, one might set the thresholds more aggressively. The bounds are exact in the continuous case.
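The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `wald_thresholds` and `sprt` are hypothetical, and the caller is assumed to supply the per-observation log-likelihood ratios $\log \Lambda _{i}$.

```python
import math
from typing import Iterable, Optional, Tuple


def wald_thresholds(alpha: float, beta: float) -> Tuple[float, float]:
    """Return Wald's approximate thresholds (a, b) on the log scale."""
    a = math.log(beta / (1.0 - alpha))
    b = math.log((1.0 - beta) / alpha)
    return a, b


def sprt(
    log_lr_increments: Iterable[float], alpha: float, beta: float
) -> Tuple[Optional[str], int, float]:
    """Run the SPRT on a stream of per-observation log-likelihood ratios.

    Returns (decision, observations used, final cumulative sum), where
    decision is "H0", "H1", or None if the data ran out before a boundary
    was reached.
    """
    a, b = wald_thresholds(alpha, beta)
    s = 0.0  # S_0 = 0
    n = 0
    for n, increment in enumerate(log_lr_increments, start=1):
        s += increment  # S_i = S_{i-1} + log(Lambda_i)
        if s >= b:
            return "H1", n, s
        if s <= a:
            return "H0", n, s
    return None, n, s


# Illustrative error rates only; the right values depend on the application.
a, b = wald_thresholds(alpha=0.05, beta=0.10)
print(round(a, 3), round(b, 3))
```

Returning `None` when the stream ends inside the continue region keeps "no decision yet" distinct from either acceptance.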

Example

A textbook example is testing the value of a parameter of a probability distribution. Consider the exponential distribution:

f_{\theta }(x)=\theta ^{-1}e^{-{\frac {x}{\theta }}},\qquad x,\theta >0

The hypotheses are

{\begin{cases}H_{0}:\theta =\theta _{0}\\H_{1}:\theta =\theta _{1}\end{cases}}\qquad \theta _{1}>\theta _{0}.

Then the log-likelihood ratio for one sample is

{\begin{aligned}\log \Lambda (x)&=\log \left({\frac {\theta _{1}^{-1}e^{-{\frac {x}{\theta _{1}}}}}{\theta _{0}^{-1}e^{-{\frac {x}{\theta _{0}}}}}}\right)\\&=\log \left({\frac {\theta _{0}}{\theta _{1}}}e^{{\frac {x}{\theta _{0}}}-{\frac {x}{\theta _{1}}}}\right)\\&=\log \left({\frac {\theta _{0}}{\theta _{1}}}\right)+\log \left(e^{{\frac {x}{\theta _{0}}}-{\frac {x}{\theta _{1}}}}\right)\\&=-\log \left({\frac {\theta _{1}}{\theta _{0}}}\right)+\left({\frac {x}{\theta _{0}}}-{\frac {x}{\theta _{1}}}\right)\\&=-\log \left({\frac {\theta _{1}}{\theta _{0}}}\right)+\left({\frac {\theta _{1}-\theta _{0}}{\theta _{0}\theta _{1}}}\right)x\end{aligned}}

The cumulative sum of the log-likelihood ratios over the samples $x_{1},\ldots ,x_{n}$ is

S_{n}=\sum _{i=1}^{n}\log \Lambda (x_{i})=-n\log \left({\frac {\theta _{1}}{\theta _{0}}}\right)+\left({\frac {\theta _{1}-\theta _{0}}{\theta _{0}\theta _{1}}}\right)\sum _{i=1}^{n}x_{i}

Accordingly, the stopping rule is:

a<-n\log \left({\frac {\theta _{1}}{\theta _{0}}}\right)+\left({\frac {\theta _{1}-\theta _{0}}{\theta _{0}\theta _{1}}}\right)\sum _{i=1}^{n}x_{i}<b

After re-arranging we finally find

a+n\log \left({\frac {\theta _{1}}{\theta _{0}}}\right)<\left({\frac {\theta _{1}-\theta _{0}}{\theta _{0}\theta _{1}}}\right)\sum _{i=1}^{n}x_{i}<b+n\log \left({\frac {\theta _{1}}{\theta _{0}}}\right)

Plotted against $n$, the thresholds are simply two parallel lines with slope $\log(\theta _{1}/\theta _{0})$. Sampling should stop when the scaled sum of the samples makes an excursion outside the continue-sampling region.
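The exponential example can be tried numerically with the short sketch below. The function names, the simulated data, and the parameter values ($\theta _{0}=1$, $\theta _{1}=2$, $\alpha =\beta =0.05$) are assumptions chosen only for illustration.

```python
import math
import random


def exponential_log_lr(x, theta0, theta1):
    """Per-sample log-likelihood ratio for H1: theta1 versus H0: theta0."""
    return -math.log(theta1 / theta0) + (theta1 - theta0) / (theta0 * theta1) * x


def sprt_exponential(samples, theta0, theta1, alpha, beta):
    """Return (decision, n); decision is None if the samples run out."""
    a = math.log(beta / (1 - alpha))
    b = math.log((1 - beta) / alpha)
    s = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        s += exponential_log_lr(x, theta0, theta1)
        if s >= b:
            return "H1", n
        if s <= a:
            return "H0", n
    return None, n


# Simulated data with true mean 2.0 (so H1 holds in this made-up run).
# random.expovariate takes the rate parameter, i.e. 1 / mean.
rng = random.Random(0)
data = (rng.expovariate(1.0 / 2.0) for _ in range(10_000))
print(sprt_exponential(data, theta0=1.0, theta1=2.0, alpha=0.05, beta=0.05))
```

Because each increment is linear in $x$, only the running sum of the samples and the sample count need to be stored.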

Applications

Manufacturing

The test is done on the proportion metric, and tests whether a variable p is equal to one of two specified points, p₁ or p₂. The region between these two points is known as the indifference region (IR). For example, suppose you are performing a quality control study on a factory lot of widgets. Management would like the lot to have 3% or fewer defective widgets, but a lot with 1% or fewer is ideal and would pass with flying colors. In this example, p₁ = 0.01 and p₂ = 0.03, and the region between them is the IR because management considers these lots to be marginal and accepts them being classified either way. Widgets would be sampled one at a time from the lot (sequential analysis) until the test determines, within an acceptable error level, that the lot is ideal or should be rejected.
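A minimal sketch of the widget example follows, assuming each inspected widget is an independent defective/non-defective outcome. The function name `inspect_lot`, the decision labels, and the error rates $\alpha =\beta =0.05$ are illustrative assumptions, not part of the description above.

```python
import math


def inspect_lot(outcomes, p1, p2, alpha, beta):
    """SPRT on a stream of 0/1 outcomes (1 = defective widget).

    Tests H0: p = p1 (ideal lot) against H1: p = p2 (lot to reject).
    Returns (decision, number of widgets inspected).
    """
    a = math.log(beta / (1 - alpha))
    b = math.log((1 - beta) / alpha)
    step_defective = math.log(p2 / p1)        # positive: evidence for p2
    step_good = math.log((1 - p2) / (1 - p1))  # negative: evidence for p1
    s = 0.0
    n = 0
    for n, defective in enumerate(outcomes, start=1):
        s += step_defective if defective else step_good
        if s >= b:
            return "reject lot", n
        if s <= a:
            return "accept lot", n
    return "undecided", n


print(inspect_lot([0] * 1000, p1=0.01, p2=0.03, alpha=0.05, beta=0.05))
print(inspect_lot([1, 1, 1], p1=0.01, p2=0.03, alpha=0.05, beta=0.05))
```

With these assumed settings a single defective widget moves the sum by about 1.10 and a good widget by only about -0.02. Rejection can therefore come after a handful of defects, while acceptance needs a long run of good widgets.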

Testing of human examinees

The SPRT is currently the predominant method of classifying examinees in a variable-length computerized classification test (CCT).^[citation needed] The two parameters p₁ and p₂ are specified by determining a cutscore (threshold) for examinees on the proportion correct metric, and selecting a point above and below that cutscore. For instance, suppose the cutscore is set at 70% for a test. We could select p₁ = 0.65 and p₂ = 0.75. The test then evaluates the likelihood that an examinee's true score on that metric is equal to one of those two points. If the examinee is determined to be at 75%, they pass, and they fail if they are determined to be at 65%.

These points are not specified completely arbitrarily. A cutscore should always be set with a legally defensible method, such as a modified Angoff procedure. Again, the indifference region represents the region of scores that the test designer accepts being classified either way (pass or fail). The upper parameter p₂ is conceptually the highest level that the test designer is willing to accept for a fail (because everyone below it has a good chance of failing), and the lower parameter p₁ is the lowest level that the test designer is willing to accept for a pass (because everyone above it has a decent chance of passing). While this definition may seem to be a relatively small burden, consider the high-stakes case of a licensing test for medical doctors: at just what point should we consider somebody to be at one of these two levels?
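As a worked illustration with the values p₁ = 0.65 and p₂ = 0.75 used above, and assuming every item is scored simply as correct or incorrect with the same success probability, a single response changes the cumulative log-likelihood ratio by one of two fixed amounts, so the sum after $n$ items with $k$ correct responses has a closed form:

```latex
\log \Lambda _{\text{correct}}=\log {\frac {0.75}{0.65}}\approx 0.143,\qquad \log \Lambda _{\text{incorrect}}=\log {\frac {0.25}{0.35}}\approx -0.336

S_{n}=k\log {\frac {p_{2}}{p_{1}}}+(n-k)\log {\frac {1-p_{2}}{1-p_{1}}}
```

Under this assumption an incorrect response moves the sum more than twice as far as a correct one, because an error is relatively rarer under either hypothesis.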

While the SPRT was first applied to testing in the days of classical test theory, as is applied in the previous paragraph, Reckase (1983) suggested that item response theory be used to determine the p₁ and p₂ parameters. The cutscore and indifference region are defined on the latent ability (theta) metric, and translated onto the proportion metric for computation. Research on CCT since then has applied this methodology for several reasons:

Large item banks tend to be calibrated with IRT
This allows more accurate specification of the parameters
By using the item response function for each item, the parameters are easily allowed to vary between items.
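The item-by-item variation can be sketched as follows. The two-parameter logistic item response function, the symmetric half-width `delta` of the indifference region, and all names and parameter values here are assumptions made for illustration; an operational CCT would use its own calibrated model.

```python
import math


def irf_2pl(theta, discrimination, difficulty):
    """Two-parameter logistic item response function (an assumed model)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def classify_examinee(responses, items, theta_cut, delta, alpha, beta):
    """SPRT classification with item-specific success probabilities.

    responses: 0/1 scores; items: (discrimination, difficulty) pairs.
    The indifference region on the theta metric is theta_cut +/- delta.
    """
    lower = math.log(beta / (1 - alpha))
    upper = math.log((1 - beta) / alpha)
    theta_lo, theta_hi = theta_cut - delta, theta_cut + delta
    s = 0.0
    n = 0
    for n, (score, (disc, diff)) in enumerate(zip(responses, items), start=1):
        # Translate the two theta points to this item's proportion metric.
        p_lo = irf_2pl(theta_lo, disc, diff)
        p_hi = irf_2pl(theta_hi, disc, diff)
        if score:
            s += math.log(p_hi / p_lo)
        else:
            s += math.log((1 - p_hi) / (1 - p_lo))
        if s >= upper:
            return "pass", n
        if s <= lower:
            return "fail", n
    return "undecided", n


bank = [(1.0, 0.0)] * 20  # hypothetical item parameters
print(classify_examinee([1] * 20, bank, theta_cut=0.0, delta=0.5,
                        alpha=0.05, beta=0.05))
```

The only change from the fixed-proportion test is that the two success probabilities are recomputed for each item from its response function.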

Detection of anomalous medical outcomes

Spiegelhalter et al.^[6] have shown that SPRT can be used to monitor the performance of doctors, surgeons and other medical practitioners in such a way as to give early warning of potentially anomalous results. In their 2003 paper, they showed how it could have helped identify Harold Shipman as a murderer well before he was actually identified.

Extensions

MaxSPRT

More recently, in 2011, an extension of the SPRT method called Maximized Sequential Probability Ratio Test (MaxSPRT)^[7] was introduced. The salient feature of MaxSPRT is the allowance of a composite, one-sided alternative hypothesis, and the introduction of an upper stopping boundary. The method has been used in several medical research studies.^[8]

References

[1] Wald, Abraham (June 1945). "Sequential Tests of Statistical Hypotheses". Annals of Mathematical Statistics. 16 (2): 117–186. doi:10.1214/aoms/1177731118. JSTOR 2235829.
[2] Wald, A.; Wolfowitz, J. (1948). "Optimum Character of the Sequential Probability Ratio Test". The Annals of Mathematical Statistics. 19 (3): 326–339. doi:10.1214/aoms/1177730197. JSTOR 2235638.
[3] Ferguson, Richard L. (1969). The development, implementation, and evaluation of a computer-assisted branched test for a program of individually prescribed instruction. Unpublished doctoral dissertation, University of Pittsburgh.
[4] Reckase, M. D. (1983). A procedure for decision making using tailored testing. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237–254). New York: Academic Press.
[5] Eggen, T. J. H. M. (1999). "Item Selection in Adaptive Testing with the Sequential Probability Ratio Test". Applied Psychological Measurement. 23 (3): 249–261. doi:10.1177/01466219922031365. S2CID 120780131.
[6] Spiegelhalter, D.; et al. (2003). "Risk-adjusted sequential probability ratio tests: application to Bristol, Shipman and adult cardiac surgery". Int J Qual Health Care. 15: 7–13.
[7] Kulldorff, Martin; Davis, Robert L.; Kolczak†, Margarette; Lewis, Edwin; Lieu, Tracy; Platt, Richard (2011). "A Maximized Sequential Probability Ratio Test for Drug and Vaccine Safety Surveillance". Sequential Analysis. 30: 58–78. doi:10.1080/07474946.2011.539924.
[8] Kulldorff, M.; et al. "A Maximized Sequential Probability Ratio Test for Drug and Vaccine Safety Surveillance". Sequential Analysis: Design Methods and Applications. 30 (1). Second-to-last paragraph of section 1. http://www.tandfonline.com/doi/full/10.1080/07474946.2011.539924

External links

Wald's Sequential Probability Ratio Test for R by Stéphane Bottine
Wald's Sequential Probability Ratio Test for Python by Zhenning Yu
