Paired difference test

A paired difference test is a type of location test used to compare two sets of paired measurements and assess whether their population means differ. It uses additional information about the sample that is not present in an ordinary unpaired testing situation, either to increase statistical power or to reduce the effects of confounders.

Specific methods for carrying out paired difference tests include, for normally distributed differences, the paired t-test (where the population standard deviation of the difference is not known) and the paired Z-test (where the population standard deviation of the difference is known), and, for differences that may not be normally distributed, the Wilcoxon signed-rank test [1] and the paired permutation test.
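As an illustration of the last of these methods, the following is a minimal sketch of an exact paired permutation test in its sign-flip form. It assumes that, under the null hypothesis, each within-pair difference is symmetric about zero, so its sign can be flipped freely. The function name and the data are hypothetical, and full enumeration is only practical for a small number of pairs.

```python
from itertools import product

def paired_permutation_pvalue(before, after):
    """Exact two-sided sign-flip permutation p-value for the sum of paired differences."""
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every assignment of signs to the differences (2**n of them).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical measurements on six subjects before and after a treatment.
before = [220, 245, 198, 260, 231, 210]
after = [205, 240, 190, 238, 229, 200]
p_value = paired_permutation_pvalue(before, after)
```

Because all six hypothetical differences have the same sign, only the two sign assignments that keep them aligned are as extreme as the observed data, so the p-value here is 2/64.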

The most familiar example of a paired difference test occurs when subjects are measured before and after a treatment. Such a "repeated measures" test compares these measurements within subjects, rather than across subjects, and will generally have greater power than an unpaired test. Another example comes from matching cases of a disease with comparable controls.

Use in reducing variance

Paired difference tests for reducing variance are a specific type of blocking. To illustrate the idea, suppose we are assessing the performance of a drug for treating high cholesterol. Under the design of our study, we enroll 100 subjects, and measure each subject's cholesterol level. Then all the subjects are treated with the drug for six months, after which their cholesterol levels are measured again. Our interest is in whether the drug has any effect on mean cholesterol levels, which can be inferred through a comparison of the post-treatment to pre-treatment measurements.

The key issue that motivates the paired difference test is that unless the study has very strict entry criteria, it is likely that the subjects will differ substantially from each other before the treatment begins. Important baseline differences among the subjects may be due to their gender, age, smoking status, activity level, and diet.

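There are two natural approaches to analyzing these data: an unpaired analysis, which treats the pre-treatment and post-treatment measurements as two independent groups and compares their means, and a paired analysis, which first subtracts the pre-treatment value from the post-treatment value for each subject and then compares these differences to zero.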

If we only consider the means, the paired and unpaired approaches give the same result. To see this, let $Y_{i1}$, $Y_{i2}$ be the observed data for the $i$th pair, and let $D_i = Y_{i2} - Y_{i1}$. Also let $\bar{D}$, $\bar{Y}_1$, and $\bar{Y}_2$ denote, respectively, the sample means of the $D_i$, the $Y_{i1}$, and the $Y_{i2}$. By rearranging terms we can see that

$$\bar{D} = \frac{1}{n}\sum_{i=1}^{n}(Y_{i2} - Y_{i1}) = \frac{1}{n}\sum_{i=1}^{n}Y_{i2} - \frac{1}{n}\sum_{i=1}^{n}Y_{i1} = \bar{Y}_2 - \bar{Y}_1,$$

where $n$ is the number of pairs. Thus the mean difference between the groups does not depend on whether we organize the data as pairs.
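The equality of the mean of the differences and the difference of the means can be checked numerically. The sketch below uses hypothetical pre-treatment and post-treatment values for five subjects.

```python
from statistics import mean

# Hypothetical cholesterol measurements for five subjects.
pre = [220, 245, 198, 260, 231]
post = [205, 240, 190, 238, 229]

# Paired view: average the within-subject differences.
differences = [after - before for before, after in zip(pre, post)]
mean_of_differences = mean(differences)

# Unpaired view: difference between the two group means.
difference_of_means = mean(post) - mean(pre)
```

Both quantities agree up to floating-point rounding, whatever the data.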

Although the mean difference is the same for the paired and unpaired statistics, their statistical significance levels can be very different, because it is easy to overstate the variance of the unpaired statistic. The variance of $\bar{D}$ is

$$\operatorname{Var}(\bar{D}) = \operatorname{Var}(\bar{Y}_2 - \bar{Y}_1) = \operatorname{Var}(\bar{Y}_2) + \operatorname{Var}(\bar{Y}_1) - 2\operatorname{Cov}(\bar{Y}_1, \bar{Y}_2) = \frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{n} - \frac{2\sigma_1\sigma_2\operatorname{corr}(Y_{i1}, Y_{i2})}{n},$$

where $\sigma_1$ and $\sigma_2$ are the population standard deviations of the $Y_{i1}$ and $Y_{i2}$ data, respectively. Thus the variance of $\bar{D}$ is lower if there is positive correlation within each pair. Such correlation is very common in the repeated measures setting, since many factors influencing the value being compared are unaffected by the treatment. For example, if cholesterol levels are associated with age, the effect of age will lead to positive correlations between the cholesterol levels measured within subjects, as long as the duration of the study is small relative to the variation in ages in the sample.
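To give a sense of scale, the variance of the mean difference can be computed as (σ1² + σ2² − 2ρσ1σ2)/n, where ρ is the within-pair correlation. The sketch below compares the value implied by assuming independence (ρ = 0) with the value under a positive correlation. The function name and all numbers are hypothetical.

```python
def variance_of_mean_difference(sigma1, sigma2, rho, n):
    """Variance of the mean paired difference for within-pair correlation rho."""
    return (sigma1**2 + sigma2**2 - 2 * rho * sigma1 * sigma2) / n

# Hypothetical setting: 100 pairs, both standard deviations equal to 30.
unpaired_assumption = variance_of_mean_difference(30, 30, rho=0.0, n=100)
with_correlation = variance_of_mean_difference(30, 30, rho=0.8, n=100)

# Factor by which the independence assumption overstates the variance.
overstatement = unpaired_assumption / with_correlation
```

With these assumed values the independence assumption overstates the variance by a factor of about five.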

Power of the paired Z-test

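Suppose we are using a Z-test to analyze the data, where the variances of the pre-treatment and post-treatment data $\sigma_1^2$ and $\sigma_2^2$ are known (the situation with a t-test is similar). The unpaired Z-test statistic is

$$\frac{\bar{Y}_2 - \bar{Y}_1}{\sqrt{\sigma_1^2/n + \sigma_2^2/n}}.$$

The power of the unpaired, one-sided test carried out at level $\alpha = 0.05$ can be calculated as follows:

$$
\begin{aligned}
P\left(\frac{\bar{Y}_2 - \bar{Y}_1}{\sqrt{\sigma_1^2/n + \sigma_2^2/n}} > 1.645\right)
&= P\left(\frac{\bar{Y}_2 - \bar{Y}_1}{S} > \frac{1.645\sqrt{\sigma_1^2/n + \sigma_2^2/n}}{S}\right) \\
&= P\left(\frac{\bar{Y}_2 - \bar{Y}_1 - \delta}{S} > \frac{1.645\sqrt{\sigma_1^2/n + \sigma_2^2/n} - \delta}{S}\right) \\
&= 1 - \Phi\left(\frac{1.645\sqrt{\sigma_1^2/n + \sigma_2^2/n} - \delta}{S}\right),
\end{aligned}
$$

where $S$ is the standard deviation of $\bar{D}$, $\Phi$ is the standard normal cumulative distribution function, and $\delta = EY_2 - EY_1$ is the true effect of the treatment. The constant 1.645 is the 95th percentile of the standard normal distribution, which defines the rejection region of the test.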

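By a similar calculation, the power of the paired Z-test is

$$1 - \Phi\left(1.645 - \frac{\delta}{S}\right).$$

By comparing the expressions for power of the paired and unpaired tests, one can see that the paired test has more power as long as

$$\frac{\sqrt{\sigma_1^2/n + \sigma_2^2/n}}{S} > 1, \quad \text{where } S = \sqrt{\sigma_1^2/n + \sigma_2^2/n - 2\sigma_1\sigma_2\rho/n}.$$

This condition is met whenever $\rho$, the within-pairs correlation, is positive.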
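The two power expressions can be evaluated numerically. In this sketch the unpaired power is taken as 1 − Φ((z·sqrt(σ1²/n + σ2²/n) − δ)/S) and the paired power as 1 − Φ(z − δ/S), where z is the upper-α quantile of the standard normal distribution and S is the true standard deviation of the mean difference. The function name and the input values are hypothetical, and the code assumes S > 0.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(delta, sigma1, sigma2, rho, n, alpha=0.05):
    """Return (unpaired_power, paired_power) for the one-sided Z-tests."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    # Standard error assumed by the unpaired test (independence).
    se_unpaired = sqrt((sigma1**2 + sigma2**2) / n)
    # True standard deviation S of the mean difference (must be positive).
    s = sqrt((sigma1**2 + sigma2**2 - 2 * rho * sigma1 * sigma2) / n)
    unpaired = 1 - std_normal.cdf((z_crit * se_unpaired - delta) / s)
    paired = 1 - std_normal.cdf(z_crit - delta / s)
    return unpaired, paired

# Hypothetical values: true effect 5, standard deviations 30, 100 pairs.
unpaired_power, paired_power = z_test_power(delta=5, sigma1=30, sigma2=30, rho=0.8, n=100)
no_corr_unpaired, no_corr_paired = z_test_power(delta=5, sigma1=30, sigma2=30, rho=0.0, n=100)
```

With a positive within-pair correlation the paired power exceeds the unpaired power, and with zero correlation the two expressions coincide.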

A random effects model for paired testing

The following statistical model is useful for understanding the paired difference test:

$$Y_{ij} = \mu_j + \alpha_i + \varepsilon_{ij},$$

where $\alpha_i$ is a random effect that is shared between the two values in the pair, and $\varepsilon_{ij}$ is a random noise term that is independent across all data points. The constant values $\mu_1$, $\mu_2$ are the expected values of the two measurements being compared, and our interest is in $\delta = \mu_2 - \mu_1$.

In this model, the $\alpha_i$ capture "stable confounders" that have the same effect on the pre-treatment and post-treatment measurements. When we subtract to form $D_i$, the $\alpha_i$ cancel out, so they do not contribute to the variance. The within-pairs covariance is

$$\operatorname{cov}(Y_{i1}, Y_{i2}) = \operatorname{var}(\alpha_i).$$

This is non-negative, so it leads to better performance for the paired difference test compared to the unpaired test, unless the $\alpha_i$ are constant over $i$, in which case the paired and unpaired tests are equivalent.
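A small simulation can make the cancellation of the shared effect concrete. The sketch below draws each observation as its group mean plus a shared per-pair effect plus independent noise, with both random terms taken to be normal. The normality, the function name, and all parameter values are assumptions made for illustration.

```python
import random
from statistics import mean, variance

def simulate_pairs(n, mu1, mu2, sd_alpha, sd_eps, seed=0):
    """Draw n pairs from the random effects model with normal alpha and epsilon."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        alpha_i = rng.gauss(0, sd_alpha)            # effect shared within the pair
        y1 = mu1 + alpha_i + rng.gauss(0, sd_eps)   # first measurement
        y2 = mu2 + alpha_i + rng.gauss(0, sd_eps)   # second measurement
        pairs.append((y1, y2))
    return pairs

pairs = simulate_pairs(n=5000, mu1=200, mu2=195, sd_alpha=30, sd_eps=5)
y1 = [p[0] for p in pairs]
y2 = [p[1] for p in pairs]
diffs = [b - a for a, b in pairs]

# Variance of a single difference: the shared effect has cancelled,
# so this should be near 2 * sd_eps**2.
var_diff = variance(diffs)
# Variance an unpaired analysis would attribute to a single difference.
var_unpaired = variance(y1) + variance(y2)
# Sample within-pair covariance, which should be near sd_alpha**2.
m1, m2 = mean(y1), mean(y2)
cov_within = sum((a - m1) * (b - m2) for a, b in pairs) / (len(pairs) - 1)
```

The variance of the differences is far smaller than the sum of the two marginal variances because the shared effect, which dominates each measurement here, drops out of the subtraction.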

In less mathematical terms, the unpaired test assumes that the data in the two groups being compared are independent. This assumption determines the form for the variance of $\bar{D}$. However, when two measurements are made for each subject, it is unlikely that the two measurements are independent. If the two measurements within a subject are positively correlated, the unpaired test overstates the variance of $\bar{D}$, making it a conservative test in the sense that its actual type I error probability will be lower than the nominal level, with a corresponding loss of statistical power. In rare cases, the data may be negatively correlated within subjects, in which case the unpaired test becomes anti-conservative. The paired test is generally used when repeated measurements are made on the same subjects, since it has the correct level regardless of the correlation of the measurements within pairs.
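The conservativeness of the unpaired test under positive correlation can be illustrated by simulating data for which the null hypothesis is true and counting how often each one-sided Z-test rejects at the nominal 5% level. This is a rough sketch: the data-generating model (a shared normal per-subject effect plus independent normal noise), the function name, and the parameter values are all assumptions.

```python
import random
from math import sqrt

def rejection_rates(n=20, sd_alpha=2.0, sd_eps=1.0, reps=2000, seed=1):
    """Estimate type I error rates of the unpaired and paired one-sided Z-tests."""
    rng = random.Random(seed)
    z_crit = 1.645
    sigma_sq = sd_alpha**2 + sd_eps**2       # variance of each single measurement
    se_unpaired = sqrt(2 * sigma_sq / n)     # standard error assuming independence
    se_paired = sqrt(2 * sd_eps**2 / n)      # true standard deviation of the mean difference
    unpaired_hits = 0
    paired_hits = 0
    for _rep in range(reps):
        total = 0.0
        for _subject in range(n):
            alpha_i = rng.gauss(0, sd_alpha)
            y1 = alpha_i + rng.gauss(0, sd_eps)
            y2 = alpha_i + rng.gauss(0, sd_eps)  # same mean, so the null holds
            total += y2 - y1
        d_bar = total / n
        unpaired_hits += d_bar / se_unpaired > z_crit
        paired_hits += d_bar / se_paired > z_crit
    return unpaired_hits / reps, paired_hits / reps

unpaired_rate, paired_rate = rejection_rates()
```

In this setting the paired test rejects in roughly 5% of replications, while the unpaired test rejects far less often than its nominal level.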

Use in reducing confounding

Another application of paired difference testing arises when comparing two groups in a set of observational data, with the goal being to isolate the effect of one factor of interest from the effects of other factors that may play a role. For example, suppose teachers adopt one of two different approaches, denoted "A" and "B", to teaching a particular mathematical topic. We may be interested in whether the performances of the students on a standardized mathematics test differ according to the teaching approach. If the teachers are free to adopt approach A or approach B, it is possible that teachers whose students are already performing well in mathematics will preferentially choose method A (or vice versa). In this situation, a simple comparison between the mean performances of students taught with approach A and approach B will likely show a difference, but this difference is partially or entirely due to the pre-existing differences between the two groups of students. In this situation, the baseline abilities of the students serve as a confounding variable, in that they are related to both the outcome (performance on the standardized test), and to the treatment assignment to approach A or approach B.

It is possible to reduce, but not necessarily eliminate, the effects of confounding variables by forming "artificial pairs" and performing a pairwise difference test. These artificial pairs are constructed based on additional variables that are thought to serve as confounders. By pairing students whose values on the confounding variables are similar, a greater fraction of the difference in the value of interest (e.g. the standardized test score in the example discussed above), is due to the factor of interest, and a lesser fraction is due to the confounder. Forming artificial pairs for paired difference testing is an example of a general approach for reducing the effects of confounding when making comparisons using observational data called matching. [2] [3] [4]
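One simple way to form artificial pairs is greedy nearest-neighbour matching on a single measured confounder, discarding candidate pairs that differ by more than a chosen tolerance (a "caliper"). The sketch below is only one of many possible matching strategies and is not taken from the cited references. The function name, the caliper value, and the student records are hypothetical.

```python
def match_pairs(group_a, group_b, caliper):
    """Greedily pair each A record with the closest unused B record on baseline.

    Each record is a (baseline, outcome) tuple. Pairs whose baselines differ
    by more than `caliper` are not formed.
    """
    unused_b = list(group_b)
    pairs = []
    for a in sorted(group_a, key=lambda rec: rec[0]):
        if not unused_b:
            break
        best = min(unused_b, key=lambda rec: abs(rec[0] - a[0]))
        if abs(best[0] - a[0]) <= caliper:
            pairs.append((a, best))
            unused_b.remove(best)
    return pairs

# Hypothetical (baseline score, test score) records for each teaching approach.
group_a = [(80, 88), (75, 82), (60, 70), (90, 95)]
group_b = [(62, 66), (78, 80), (55, 58), (74, 77)]

pairs = match_pairs(group_a, group_b, caliper=5)
paired_diffs = [a[1] - b[1] for a, b in pairs]
mean_paired_diff = sum(paired_diffs) / len(paired_diffs)

# Unmatched comparison of raw group means, for contrast.
raw_diff = (sum(s for _, s in group_a) / len(group_a)
            - sum(s for _, s in group_b) / len(group_b))
```

In this made-up data the A group has higher baseline scores, and the matched mean difference is smaller than the raw difference in group means, since part of the raw difference reflects the baseline gap. Greedy matching depends on processing order and may discard unmatched records, which is a design trade-off rather than a requirement of paired testing.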

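As a concrete example, suppose we observe student test scores $X$ under teaching strategies A and B, and each student has either a "high" or "low" level of mathematical knowledge before the two teaching strategies are implemented. However, we do not know which students are in the "high" category and which are in the "low" category. The population mean test scores in the four possible groups are

$$\begin{array}{c|cc} & A & B \\ \hline \text{High} & \mu_{HA} & \mu_{HB} \\ \text{Low} & \mu_{LA} & \mu_{LB} \end{array}$$

and the proportions of students in the groups are

$$\begin{array}{c|cc} & A & B \\ \hline \text{High} & p_{HA} & p_{HB} \\ \text{Low} & p_{LA} & p_{LB} \end{array}$$

where $p_{HA} + p_{HB} + p_{LA} + p_{LB} = 1$.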

The "treatment difference" among students in the "high" group is $\mu_{HA} - \mu_{HB}$ and the treatment difference among students in the "low" group is $\mu_{LA} - \mu_{LB}$. In general, it is possible that the two teaching strategies could differ in either direction, or show no difference, and the effects could differ in magnitude or even in sign between the "high" and "low" groups. For example, if strategy B were superior to strategy A for well-prepared students, but strategy A were superior to strategy B for poorly prepared students, the two treatment differences would have opposite signs.

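Since we do not know the baseline levels of the students, the expected value of the average test score $\bar{X}_A$ among students in the A group is an average of those in the two baseline levels:

$$E\bar{X}_A = \mu_{HA}\frac{p_{HA}}{p_{HA} + p_{LA}} + \mu_{LA}\frac{p_{LA}}{p_{HA} + p_{LA}},$$

and similarly the expected value of the average test score $\bar{X}_B$ among students in the B group is

$$E\bar{X}_B = \mu_{HB}\frac{p_{HB}}{p_{HB} + p_{LB}} + \mu_{LB}\frac{p_{LB}}{p_{HB} + p_{LB}}.$$

Thus the expected value of the observed treatment difference $D = \bar{X}_A - \bar{X}_B$ is

$$\mu_{HA}\frac{p_{HA}}{p_{HA} + p_{LA}} - \mu_{HB}\frac{p_{HB}}{p_{HB} + p_{LB}} + \mu_{LA}\frac{p_{LA}}{p_{HA} + p_{LA}} - \mu_{LB}\frac{p_{LB}}{p_{HB} + p_{LB}}.$$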

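A reasonable null hypothesis is that there is no effect of the treatment within either the "high" or "low" student groups, so that $\mu_{HA} = \mu_{HB}$ and $\mu_{LA} = \mu_{LB}$. Under this null hypothesis, the expected value of $D$ will be zero if

$$p_{HA} = (p_{HA} + p_{LA})(p_{HA} + p_{HB})$$

and

$$p_{HB} = (p_{HB} + p_{LB})(p_{HA} + p_{HB}).$$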

This condition asserts that the assignment of students to the A and B teaching strategy groups is independent of their mathematical knowledge before the teaching strategies are implemented. If this holds, baseline mathematical knowledge is not a confounder, and conversely, if baseline mathematical knowledge is a confounder, the expected value of D will generally differ from zero. If the expected value of D under the null hypothesis is not equal to zero, then a situation where we reject the null hypothesis could either be due to an actual differential effect between teaching strategies A and B, or it could be due to non-independence in the assignment of students to the A and B groups (even in the complete absence of an effect due to the teaching strategy).
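The effect of non-independent assignment can be seen with a short calculation. The sketch below computes the expected observed difference as the gap between two weighted averages: within each teaching group, the "high" and "low" means are weighted by the share of that group's students at each baseline level. The function name, the means, and the proportions are hypothetical, and the null hypothesis (no treatment effect at either baseline level) is built into the chosen means.

```python
def expected_difference(mu, p):
    """Expected value of D for means `mu` and proportions `p` keyed by 'HA', 'HB', 'LA', 'LB'."""
    share_a = p["HA"] + p["LA"]  # fraction of all students taught with A
    share_b = p["HB"] + p["LB"]  # fraction of all students taught with B
    mean_a = (mu["HA"] * p["HA"] + mu["LA"] * p["LA"]) / share_a
    mean_b = (mu["HB"] * p["HB"] + mu["LB"] * p["LB"]) / share_b
    return mean_a - mean_b

# Null hypothesis: the strategy makes no difference at either baseline level.
null_means = {"HA": 80, "HB": 80, "LA": 60, "LB": 60}

# Assignment independent of baseline level.
independent = {"HA": 0.25, "HB": 0.25, "LA": 0.25, "LB": 0.25}
# Well-prepared students are concentrated in group A.
confounded = {"HA": 0.4, "HB": 0.1, "LA": 0.1, "LB": 0.4}

d_independent = expected_difference(null_means, independent)
d_confounded = expected_difference(null_means, confounded)
```

With independent assignment the expected difference is zero, while the confounded proportions produce an expected difference of about 12 points even though neither strategy has any effect.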

This example illustrates that if we make a direct comparison between two groups when confounders are present, we do not know whether any difference that is observed is due to the grouping itself, or is due to some other factor. If we are able to pair students by an exact or estimated measure of their baseline mathematical ability, then we are only comparing students "within rows" of the table of means given above. Consequently, if the null hypothesis holds, the expected value of D will equal zero, and statistical significance levels have their intended interpretation.

References

  1. Derrick, B; Broad, A; Toher, D; White, P (2017). "The impact of an extreme observation in a paired samples design". Metodološki Zvezki - Advances in Methodology and Statistics. 14 (2): 1–17.
  2. Rubin, Donald B. (1973). "Matching to Remove Bias in Observational Studies". Biometrics. 29 (1): 159–183. doi:10.2307/2529684. JSTOR 2529684.
  3. Anderson, Dallas W.; Kish, Leslie; Cornell, Richard G. (1980). "On Stratification, Grouping and Matching". Scandinavian Journal of Statistics. Blackwell Publishing. 7 (2): 61–66. JSTOR 4615774.
  4. Kupper, Lawrence L.; Karon, John M.; Kleinbaum, David G.; Morgenstern, Hal; Lewis, Donald K. (1981). "Matching in Epidemiologic Studies: Validity and Efficiency Considerations". Biometrics. 37 (2): 271–291. CiteSeerX 10.1.1.154.1197. doi:10.2307/2530417. JSTOR 2530417. PMID 7272415.