Power (statistics)

In frequentist statistics, power is a measure of the ability of an experimental design and hypothesis testing setup to detect a particular effect if it is truly present. In typical use, it is a function of the test used (including the desired level of statistical significance), the assumed distribution of the test statistic (determined, for example, by the degree of variability and the sample size), and the effect size of interest. High statistical power is associated with low variability, large sample sizes, large effects being looked for, and less stringent requirements for statistical significance.

More formally, in the case of a simple hypothesis test with two hypotheses, the power of the test is the probability that the test correctly rejects the null hypothesis (H₀) when the alternative hypothesis (H₁) is true. It is commonly denoted by 1 − β, where β is the probability of making a type II error (a false negative) conditional on there being a true effect or association.

Background

Statistical testing uses data from samples to assess, or make inferences about, a statistical population. For example, we may measure the yields of samples of two varieties of a crop, and use a two-sample test to assess whether the mean value of this yield differs between the varieties.

Under a frequentist hypothesis testing framework, this is done by calculating a test statistic (such as a t-statistic) for the dataset, which has a known theoretical probability distribution if there is no difference (the so called null hypothesis). If the actual value calculated on the sample is sufficiently unlikely to arise under the null hypothesis, we say we identified a statistically significant effect.

The threshold for significance can be set small to ensure there is little chance of falsely detecting a non-existent effect. However, failing to identify a significant effect does not imply there was none. If we insist on being careful to avoid false positives, we may create false negatives instead. It may simply be too much to expect that we will be able to find satisfactorily strong evidence of a very subtle difference even if it exists. Statistical power is an attempt to quantify this issue.

In the case of the comparison of the two crop varieties, it enables us to answer questions such as how large a sample is needed to have a good chance of detecting a yield difference of a given size, or how likely a given experiment is to detect such a difference at the chosen significance level.

Description

Illustration of the power of a statistical test, for a two sided test, through the probability distribution of the test statistic under the null and alternative hypothesis. α is shown as the blue area, the probability of rejection under null, while the red area shows power, 1 − β, the probability of correctly rejecting under the alternative.

Suppose we are conducting a hypothesis test. We define two hypotheses: the null hypothesis H₀ and the alternative hypothesis H₁. If we design the test such that α is the significance level, being the probability of rejecting H₀ when H₀ is in fact true, then the power of the test is 1 − β, where β is the probability of failing to reject H₀ when the alternative H₁ is true.

                 Probability to reject H₀    Probability to not reject H₀
If H₀ is true    α                           1 − α
If H₁ is true    1 − β (power)               β

To make this more concrete, a typical statistical test would be based on a test statistic t calculated from the sampled data, which has a particular probability distribution under H₀. A desired significance level α would then define a corresponding "rejection region" (bounded by certain "critical values"), a set of values t is unlikely to take if H₀ were correct. If we reject H₀ in favor of H₁ only when the sample t takes those values, we would be able to keep the probability of falsely rejecting H₀ within our desired significance level. At the same time, if H₁ defines its own probability distribution for t (the difference between the two distributions being a function of the effect size), the power of the test would be the probability, under H₁, that the sample t falls into our defined rejection region and causes H₀ to be correctly rejected.
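As a minimal numerical sketch of this picture, suppose the test statistic is standard normal under H₀ and, under H₁, its mean is shifted upward by some amount delta (a one-sided test; the value of delta below is chosen arbitrarily for illustration):

```python
from scipy.stats import norm

alpha = 0.05   # significance level
delta = 2.0    # assumed shift of the test statistic's mean under H1

# Critical value bounding the one-sided rejection region under H0.
z_crit = norm.ppf(1 - alpha)

# Power: probability that the statistic falls in the rejection region under H1.
power = 1 - norm.cdf(z_crit - delta)
print(f"critical value {z_crit:.2f}, power {power:.2f}")
```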

Statistical power is one minus the type II error probability and is also the sensitivity of the hypothesis testing procedure to detect a true effect. There is usually a trade-off between demanding more stringent tests (and so, smaller rejection regions) and trying to have a high probability of rejecting the null under the alternative hypothesis. Statistical power may also be extended to the case where multiple hypotheses are being tested based on an experiment or survey. It is thus also common to refer to the power of a study, evaluating a scientific project in terms of its ability to answer the research questions it is seeking to answer.

Applications

The main application of statistical power is "power analysis", a calculation of power usually done before an experiment is conducted using data from pilot studies or a literature review. Power analyses can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size (in other words, producing an acceptable level of power). For example: "How many times do I need to toss a coin to conclude it is rigged by a certain amount?" [1] If resources and thus sample sizes are fixed, power analyses can also be used to calculate the minimum effect size that is likely to be detected.
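For instance, the coin question can be answered by simulation; the sketch below assumes the coin is "rigged" so that it lands heads 60% of the time and estimates the power of a two-sided binomial test for a given number of tosses:

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
alpha, p_true, n_tosses, n_sims = 0.05, 0.6, 100, 2000  # assumed illustrative values

# Simulate experiments with the biased coin and count how often the
# binomial test against a fair coin (p = 0.5) reaches significance.
rejections = sum(
    binomtest(rng.binomial(n_tosses, p_true), n_tosses, p=0.5).pvalue < alpha
    for _ in range(n_sims)
)
print(f"estimated power with {n_tosses} tosses: {rejections / n_sims:.2f}")
```

Increasing n_tosses raises the estimated power, which is exactly how a minimum sample size would be chosen in practice.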

Funding agencies, ethics boards and research review panels frequently request that a researcher perform a power analysis. An underpowered study is likely to be inconclusive, failing to allow one to choose between hypotheses at the desired significance level, while an overpowered study will spend great expense on being able to report significant effects even if they are tiny and so practically meaningless. If a large number of underpowered studies are done and statistically significant results published, published findings are more likely to be false positives than true results, contributing to a replication crisis. However, excessive demands for power could also lead to wasted resources and ethical problems, for example the use of a large number of animal test subjects when a smaller number would have been sufficient. It could also induce researchers seeking funding to overstate their expected effect sizes, or to avoid looking for more subtle interaction effects that cannot be easily detected. [2]

Power analysis is primarily a frequentist statistics tool. In Bayesian statistics, hypothesis testing of the type used in classical power analysis is not done. In the Bayesian framework, one updates one's prior beliefs using the data obtained in a given study. In principle, a study that would be deemed underpowered from the perspective of hypothesis testing could still be used in such an updating process. However, power remains a useful measure of how much a given experiment size can be expected to refine one's beliefs. A study with low power is unlikely to lead to a large change in beliefs.

In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric test and a nonparametric test of the same hypothesis. Tests may have the same size, and hence the same false positive rate, but different ability to detect true effects. Consideration of their theoretical power properties is a key reason for the common use of likelihood ratio tests.

Rule of thumb for t-test

Lehr's [3] [4] (rough) rule of thumb says that the sample size n (for each group), for the common case of a two-sided two-sample t-test with power 80% (β = 0.2) and significance level α = 0.05, should be

n = 16 s² / d²,

where s² is an estimate of the population variance and d the to-be-detected difference in the mean values of both samples. This expression can be rearranged, implying for example that 80% power is obtained when looking for a difference in means that exceeds about 4 times the group-wise standard error of the mean.

For a one-sample t-test, 16 is to be replaced with 8. Other values provide an appropriate approximation when the desired power or significance level is different. [5]

However, a full power analysis should always be performed to confirm and refine this estimate.
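As an illustration, the sketch below applies the rule for assumed values of s and d and compares it with a fuller calculation using the statsmodels package (the specific numbers and the use of statsmodels are assumptions of this example):

```python
from statsmodels.stats.power import TTestIndPower

s, d = 3.0, 2.0            # assumed standard deviation and difference in means
n_lehr = 16 * s**2 / d**2  # Lehr's rule: per-group sample size for 80% power, two-sided alpha = 0.05

# Fuller calculation: effect_size is the standardized difference d/s.
n_full = TTestIndPower().solve_power(effect_size=d / s, alpha=0.05, power=0.8,
                                     ratio=1.0, alternative='two-sided')
print(f"Lehr's rule: n = {n_lehr:.0f} per group; full calculation: n = {n_full:.1f} per group")
```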

Factors influencing power

An example of the relationship between sample size and power levels. Higher power requires larger sample sizes

Statistical power may depend on a number of factors. Some factors may be particular to a specific testing situation, but in normal use, power depends on the following three aspects that can be potentially controlled by the practitioner:

For a given test, the significance criterion determines the desired degree of rigor, specifying how unlikely it is for the null hypothesis of no effect to be rejected if it is in fact true. The most commonly used threshold is a probability of rejection of 0.05, though smaller values like 0.01 or 0.001 are sometimes used. This threshold then implies that the observation must be at least that unlikely (perhaps by suggesting a sufficiently large estimate of difference) to be considered strong enough evidence against the null. Picking a smaller value to tighten the threshold, so as to reduce the chance of a false positive, would also reduce power and increase the chance of a false negative. Some statistical tests will inherently produce better power, albeit often at the cost of requiring stronger assumptions.

The magnitude of the effect of interest defines what is being looked for by the test. It can be the expected effect size if it exists, as a scientific hypothesis that the researcher has arrived at and wishes to test. Alternatively, in a more practical context it could be determined by the size the effect must be to be useful, for example that which is required to be clinically significant. An effect size can be a direct value of the quantity of interest (for example, a difference in mean of a particular size), or it can be a standardized measure that also accounts for the variability in the population (such as a difference in means expressed as a multiple of the standard deviation). If the researcher is looking for a larger effect, then it should be easier to find with a given experimental or analytic setup, and so power is higher.

The nature of the sample underlies the information being used in the test. This will usually involve the sample size, and the sample variability, if that is not implicit in the definition of the effect size. More broadly, the precision with which the data are measured can also be an important factor (such as the statistical reliability), as well as the design of an experiment or observational study. Ultimately, these factors lead to an expected amount of sampling error. A smaller sampling error could be obtained from larger sample sizes, from a less variable population, from more accurate measurements, or from more efficient experimental designs (for example, with the appropriate use of blocking), and such smaller errors would lead to improved power, albeit usually at a cost in resources. How increased sample size translates to higher power is a measure of the efficiency of the test – for example, the sample size required for a given power. [6]
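The joint influence of these three factors can be sketched with a simple normal approximation for a one-sided one-sample test, where the effect size is expressed in standard-deviation units (the specific values below are purely illustrative):

```python
from scipy.stats import norm

def approx_power(effect_size, n, alpha):
    """Normal-approximation power of a one-sided one-sample test,
    with effect_size in standard-deviation units."""
    return 1 - norm.cdf(norm.ppf(1 - alpha) - effect_size * n**0.5)

print(approx_power(0.5, 30, 0.05))   # baseline
print(approx_power(0.5, 30, 0.01))   # stricter significance criterion: power drops
print(approx_power(0.8, 30, 0.05))   # larger effect of interest: power rises
print(approx_power(0.5, 60, 0.05))   # larger sample: power rises
```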

Discussion

The statistical power of a hypothesis test has an impact on the interpretation of its results. Not finding a result with a more powerful study is stronger evidence against the effect existing than the same finding with a less powerful study. However, this is not completely conclusive. The effect may exist, but be smaller than what was looked for, meaning the study is in fact underpowered and the sample is thus unable to distinguish it from random chance. [7] Many clinical trials, for instance, have low statistical power to detect differences in adverse effects of treatments, since such effects may only affect a few patients, even if this difference can be important. [8] Conclusions about the probability that an effect is actually present should also rest on more than a single test, especially as real-world power is rarely close to 1.

Indeed, although there are no formal standards for power, many researchers and funding bodies assess power using 0.80 (or 80%) as a standard for adequacy. This convention implies a four-to-one trade off between β-risk and α-risk, as the probability of a type II error β is set as 1 - 0.8 = 0.2, while α, the probability of a type I error, is commonly set at 0.05. Some applications require much higher levels of power. Medical tests may be designed to minimise the number of false negatives (type II errors) produced by loosening the threshold of significance, raising the risk of obtaining a false positive (a type I error). The rationale is that it is better to tell a healthy patient "we may have found something—let's test further," than to tell a diseased patient "all is well." [9]

Power analysis focuses on the correct rejection of a null hypothesis. Alternative concerns may however motivate an experiment, and so lead to different needs for sample size. In many contexts, the issue is less about deciding between hypotheses than about obtaining an estimate of the population effect size of sufficient accuracy. For example, a careful power analysis can tell you that 55 pairs of normally distributed samples with a correlation of 0.5 will be sufficient to grant 80% power in rejecting a null that the correlation is no more than 0.2 (using a one-sided test, α = 0.05). But the typical 95% confidence interval with this sample would be around [0.27, 0.67]. An alternative, albeit related, analysis would be required if we wish to be able to measure correlation to an accuracy of ±0.1, implying a different (in this case, larger) sample size. Alternatively, multiple under-powered studies can still be useful, if appropriately combined through a meta-analysis.
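These figures can be checked approximately with the Fisher z-transformation of the correlation coefficient; the sketch below reproduces both the power of the one-sided test and the width of the confidence interval (the Fisher z approximation is an assumption of this illustration):

```python
import numpy as np
from scipy.stats import norm

n, r_alt, r_null, alpha = 55, 0.5, 0.2, 0.05
se = 1 / np.sqrt(n - 3)   # standard error on the Fisher z scale

# Power of the one-sided test of rho <= 0.2 against the alternative rho = 0.5.
delta = np.arctanh(r_alt) - np.arctanh(r_null)
power = 1 - norm.cdf(norm.ppf(1 - alpha) - delta / se)

# Approximate 95% confidence interval for the correlation when r = 0.5 is observed.
lo, hi = np.tanh(np.arctanh(r_alt) + np.array([-1, 1]) * norm.ppf(0.975) * se)
print(f"power = {power:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```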

Many statistical analyses involve the estimation of several unknown quantities. In simple cases, all but one of these quantities are nuisance parameters. In this setting, the only relevant power pertains to the single quantity that will undergo formal statistical inference. In some settings, particularly if the goals are more "exploratory", there may be a number of quantities of interest in the analysis. For example, in a multiple regression analysis we may include several covariates of potential interest. In situations such as this where several hypotheses are under consideration, it is common that the powers associated with the different hypotheses differ. For instance, in multiple regression analysis, the power for detecting an effect of a given size is related to the variance of the covariate. Since different covariates will have different variances, their powers will differ as well.

Additional complications arise when we consider these multiple hypotheses together. For example, if we consider a false positive to be making an erroneous null rejection on any one of these hypotheses, our likelihood of this "family-wise error" will be inflated if appropriate measures are not taken. Such measures typically involve applying a higher threshold of stringency to reject a hypothesis (such as with the Bonferroni method), and so would reduce power. Alternatively, there may be different notions of power connected with how the different hypotheses are considered. "Complete power" demands that all true effects are detected across all of the hypotheses, which is a much stronger requirement than the "minimal power" of being able to find at least one true effect, a type of power that might increase with an increasing number of hypotheses. [10]
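The trade-offs described here can be made concrete with a small calculation for m independent one-sided tests, each of which would have 80% power at the unadjusted significance level; the effect of a Bonferroni correction, and the "complete" and "minimal" notions of power, then follow directly (the normal approximation and independence are simplifying assumptions):

```python
from scipy.stats import norm

m, alpha, per_test_power = 5, 0.05, 0.8

# Mean shift of each test statistic implied by 80% power at the unadjusted alpha.
shift = norm.ppf(1 - alpha) - norm.ppf(1 - per_test_power)

# Bonferroni: each hypothesis is tested at alpha / m, lowering every per-test power.
power_bonf = 1 - norm.cdf(norm.ppf(1 - alpha / m) - shift)

complete_power = power_bonf**m             # all m true effects detected
minimal_power = 1 - (1 - power_bonf)**m    # at least one true effect detected
print(f"per-test {power_bonf:.2f}, complete {complete_power:.2f}, minimal {minimal_power:.2f}")
```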

A priori vs. post hoc analysis

Power analysis can either be done before (a priori or prospective power analysis) or after (post hoc or retrospective power analysis) data are collected. A priori power analysis is conducted prior to the research study, and is typically used in estimating sufficient sample sizes to achieve adequate power. Post-hoc analysis of "observed power" is conducted after a study has been completed, and uses the obtained sample size and effect size to determine what the power was in the study, assuming the effect size in the sample is equal to the effect size in the population. Whereas the utility of prospective power analysis in experimental design is universally accepted, post hoc power analysis is fundamentally flawed. [11] [12] Falling for the temptation to use the statistical analysis of the collected data to estimate the power will result in uninformative and misleading values. In particular, it has been shown that post-hoc "observed power" is a one-to-one function of the p-value attained. [11] This has been extended to show that all post-hoc power analyses suffer from what is called the "power approach paradox" (PAP), in which a study with a null result is thought to show more evidence that the null hypothesis is actually true when the p-value is smaller, since the apparent power to detect an actual effect would be higher. [11] In fact, a smaller p-value is properly understood to make the null hypothesis relatively less likely to be true.[ citation needed ]
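The one-to-one relationship between the attained p-value and post hoc "observed power" can be seen directly for a one-sided z-test, where the observed statistic is plugged back in as if it were the true effect (a sketch under that simplifying assumption):

```python
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    """Post hoc 'observed power' for a one-sided z-test: the observed statistic
    is treated as if it were the true mean of the test statistic."""
    z_obs = norm.ppf(1 - p_value)
    return 1 - norm.cdf(norm.ppf(1 - alpha) - z_obs)

# The mapping depends only on the p-value; p = alpha always gives observed power 0.5.
for p in (0.5, 0.2, 0.05, 0.01):
    print(p, round(observed_power(p), 3))
```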

Example

The following is an example that shows how to compute power for a randomized experiment: Suppose the goal of an experiment is to study the effect of a treatment on some quantity, and so we shall compare research subjects by measuring the quantity before and after the treatment, analyzing the data using a one-sided paired t-test, with a significance level threshold of 0.05. We are interested in being able to detect a positive change of size θ > 0.

We first set up the problem according to our test. Let A_i and B_i denote the pre-treatment and post-treatment measures on subject i, respectively. The possible effect of the treatment should be visible in the differences D_i = B_i − A_i, which are assumed to be independent and identically normally distributed, with unknown mean value μ_D and variance σ_D².

Here, it is natural to choose our null hypothesis to be that the expected mean difference is zero, i.e. H₀: μ_D = 0. For our one-sided test, the alternative hypothesis would be that there is a positive effect, corresponding to H₁: μ_D = θ > 0. The test statistic in this case is defined as:

T_n = (D̄_n − 0) / (σ̂_D / √n),

where 0 is the mean under the null (so we substitute in 0), n is the sample size (number of subjects), D̄_n = (1/n) Σ_i D_i is the sample mean of the differences, and σ̂_D is the sample standard deviation of the differences.
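As an illustration, this statistic can be computed from made-up pre/post measurements and checked against scipy's paired t-test (the data values here are purely hypothetical):

```python
import numpy as np
from scipy.stats import ttest_rel

pre = np.array([10.1, 9.8, 10.4, 10.0, 9.7])     # hypothetical pre-treatment measurements
post = np.array([10.6, 10.1, 10.9, 10.3, 10.2])  # hypothetical post-treatment measurements

d = post - pre
t_n = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # the statistic T_n defined above

# scipy's paired, one-sided t-test should reproduce the same statistic.
res = ttest_rel(post, pre, alternative='greater')
print(t_n, res.statistic, res.pvalue)
```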

Analytic solution

We can proceed according to our knowledge of statistical theory, though in practice for a standard case like this software will exist to compute more accurate answers.

Thanks to t-test theory, we know this test statistic under the null hypothesis follows a Student t-distribution with n − 1 degrees of freedom. If we wish to reject the null at significance level α = 0.05, we must find the critical value t_α such that the probability of T_n > t_α under the null is equal to α. If n is large, the t-distribution converges to the standard normal distribution (thus no longer involving n) and so, through use of the corresponding quantile function Φ⁻¹, we obtain that the null should be rejected if

T_n > z_α, where z_α = Φ⁻¹(1 − α) ≈ 1.64.

Now suppose that the alternative hypothesis H₁ is true, so μ_D = θ. Then, writing the power as a function of the effect size, B(θ), we find the probability of T_n being above z_α under H₁.

Under H₁, the quantity T_n − θ√n/σ̂_D again follows a Student t-distribution, converging to a standard normal distribution for large n. The estimated σ̂_D will also converge to its population value σ_D. Thus power can be approximated as

B(θ) ≈ 1 − Φ(z_α − θ√n/σ_D).

According to this formula, the power increases with the value of the effect size θ and the sample size n, and decreases with increasing variability σ_D. In the trivial case of zero effect size, power is at a minimum (infimum) and equal to the significance level of the test α, in this example 0.05. For finite sample sizes and non-zero variability, it is the case here, as is typical, that power cannot be made equal to 1 except in the trivial case where α = 1, so that the null is always rejected.

We can invert this expression to obtain the required sample size:

n ≥ (σ_D/θ)² (z_α + z_B)²,

where z_B = Φ⁻¹(B) is the standard normal quantile corresponding to the desired power B.

Suppose θ = 1 and we believe σ_D is around 2, say; then for a power of B = 0.8, we require a sample size of

n ≥ (2/1)² × (1.64 + 0.84)² ≈ 24.6,

that is, about 25 subjects.
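This calculation is easily reproduced, and compared against the exact power based on the noncentral t-distribution, with a few lines of Python (a sketch using the assumed values of the example above):

```python
import numpy as np
from scipy.stats import norm, nct, t

alpha, theta, sigma = 0.05, 1.0, 2.0   # values from the example

def approx_power(n):
    # Normal-approximation power B(theta) for sample size n.
    return 1 - norm.cdf(norm.ppf(1 - alpha) - theta * np.sqrt(n) / sigma)

def required_n(power):
    # Inversion of the approximation to obtain the required sample size.
    return (sigma / theta) ** 2 * (norm.ppf(1 - alpha) + norm.ppf(power)) ** 2

def exact_power(n):
    # Exact power of the one-sided paired t-test via the noncentral t-distribution.
    t_crit = t.ppf(1 - alpha, df=n - 1)
    return 1 - nct.cdf(t_crit, df=n - 1, nc=theta * np.sqrt(n) / sigma)

print(required_n(0.8))                  # about 25 under the normal approximation
print(approx_power(25), exact_power(25))
```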

Simulation solution

Alternatively, we can use a Monte Carlo simulation method that works more generally. [13] Once again, we return to the assumption of the distribution of D_i and the definition of T_n. Suppose we have fixed values of the sample size, variability and effect size, and wish to compute power. We can adopt this process:

1. Generate a large number of sets of D_i according to the null hypothesis, N(0, σ_D²).

2. Compute the resulting test statistic T_n for each set.

3. Compute the (1 − α)th quantile of the simulated T_n and use that as an estimate of the critical value t_α.

4. Now generate a large number of sets of D_i according to the alternative hypothesis, N(θ, σ_D²), and compute the corresponding test statistics again.

5. Look at the proportion of these simulated alternative T_n that are above the critical value t_α calculated in step 3 and so are rejected. This is the power.

This can be done with a variety of software packages. Using this methodology with the values from before, setting the sample size to 25 leads to an estimated power of around 0.78. The small discrepancy with the previous section is due mainly to inaccuracies in the normal approximation.
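A short Python sketch of this procedure, using the same assumed values as above (θ = 1, σ_D = 2, n = 25), is:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, theta, sigma, n, n_sims = 0.05, 1.0, 2.0, 25, 100_000  # values used above

def t_stat(d):
    # Test statistic T_n computed row-wise on simulated sets of differences.
    return d.mean(axis=1) / (d.std(axis=1, ddof=1) / np.sqrt(d.shape[1]))

# Steps 1-3: simulate under the null and estimate the critical value.
t_null = t_stat(rng.normal(0.0, sigma, size=(n_sims, n)))
t_crit = np.quantile(t_null, 1 - alpha)

# Steps 4-5: simulate under the alternative and count rejections.
t_alt = t_stat(rng.normal(theta, sigma, size=(n_sims, n)))
print("estimated power:", np.mean(t_alt > t_crit))   # comes out near 0.78
```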

Extension

Bayesian power

In the frequentist setting, power is calculated under the assumption that the parameter takes one specific value, an assumption that is unlikely to hold exactly. This issue can be addressed by instead assuming the parameter follows a distribution. The resulting power is sometimes referred to as Bayesian power, which is commonly used in clinical trial design.

Predictive probability of success

Both frequentist power and Bayesian power use statistical significance as the success criterion. However, statistical significance is often not enough to define success. To address this issue, the power concept can be extended to the concept of predictive probability of success (PPOS). The success criterion for PPOS is not restricted to statistical significance and is commonly used in clinical trial designs.

Software for power and sample size calculations

Numerous free and/or open source programs are available for performing power and sample size calculations.

Related Research Articles

In statistics, the likelihood-ratio test is a hypothesis test that involves comparing the goodness of fit of two competing statistical models, typically one found by maximization over the entire parameter space and another found after imposing some constraint, based on the ratio of their likelihoods. If the more constrained model is supported by the observed data, the two likelihoods should not differ by more than sampling error. Thus the likelihood-ratio test tests whether this ratio is significantly different from one, or equivalently whether its natural logarithm is significantly different from zero.

Chi-squared distribution: Probability distribution and special case of gamma distribution

In probability theory and statistics, the chi-squared distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

Confidence interval: Range to estimate an unknown parameter

Informally, in frequentist statistics, a confidence interval (CI) is an interval which is expected to typically contain the parameter being estimated. More specifically, given a confidence level γ, a CI is a random interval which contains the parameter being estimated γ% of the time. The confidence level, degree of confidence or confidence coefficient represents the long-run proportion of CIs that theoretically contain the true value of the parameter; this is tantamount to the nominal coverage probability. For example, out of all intervals computed at the 95% level, 95% of them should contain the parameter's true value.

Z-test: Statistical test

A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. A Z-test tests the mean of a distribution. For each significance level in the confidence interval, the Z-test has a single critical value, which makes it more convenient than the Student's t-test, whose critical values are defined by the sample size. Both the Z-test and Student's t-test have similarities in that they both help determine the significance of a set of data. However, the Z-test is rarely used in practice because the population standard deviation is difficult to determine.

In statistics, an effect size is a value measuring the strength of the relationship between two variables in a population, or a sample-based estimate of that quantity. It can refer to the value of a statistic calculated from a sample of data, the value of a parameter for a hypothetical population, or to the equation that operationalizes how statistics or parameters lead to the effect size value. Examples of effect sizes include the correlation between two variables, the regression coefficient in a regression, the mean difference, or the risk of a particular event happening. Effect sizes are a complementary tool for statistical hypothesis testing, and play an important role in power analyses to assess the sample size required for new experiments. Effect sizes are fundamental in meta-analyses, which aim to provide the combined effect size based on data from multiple studies. The cluster of data-analysis methods concerning effect sizes is referred to as estimation statistics.

Student's t-test is a statistical test used to test whether the difference between the response of two groups is statistically significant or not. It is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis. It is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is estimated based on the data, the test statistic—under certain conditions—follows a Student's t distribution. The t-test's most common application is to test whether the means of two populations are significantly different. In many cases, a Z-test will yield very similar results to a t-test because the latter converges to the former as the size of the dataset increases.

In statistics, the Neyman–Pearson lemma describes the existence and uniqueness of the likelihood ratio as a uniformly most powerful test in certain contexts. It was introduced by Jerzy Neyman and Egon Pearson in a paper in 1933. The Neyman–Pearson lemma is part of the Neyman–Pearson theory of statistical testing, which introduced concepts like errors of the second kind, power function, and inductive behavior. The previous Fisherian theory of significance testing postulated only one hypothesis. By introducing a competing hypothesis, the Neyman–Pearsonian flavor of statistical testing allows investigating the two types of errors. The trivial cases where one always rejects or accepts the null hypothesis are of little interest, but it does prove that one must not relinquish control over one type of error while calibrating the other. Neyman and Pearson accordingly proceeded to restrict their attention to the class of all level α tests while subsequently minimizing type II error, traditionally denoted by β. Their seminal paper of 1933, including the Neyman–Pearson lemma, comes at the end of this endeavor, not only showing the existence of tests with the most power that retain a prespecified level of type I error, but also providing a way to construct such tests. The Karlin–Rubin theorem extends the Neyman–Pearson lemma to settings involving composite hypotheses with monotone likelihood ratios.

Consistent estimator: Statistical estimator converging in probability to a true parameter as sample size increases

In statistics, a consistent estimator or asymptotically consistent estimator is an estimator—a rule for computing estimates of a parameter θ0—having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to θ0. This means that the distributions of the estimates become more and more concentrated near the true value of the parameter being estimated, so that the probability of the estimator being arbitrarily close to θ0 converges to one.

Sample size determination or estimation is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In practice, the sample size used in a study is usually determined based on the cost, time, or convenience of collecting the data, and the need for it to offer sufficient statistical power. In complex studies, different sample sizes may be allocated, such as in stratified surveys or experimental designs with multiple treatment groups. In a census, data is sought for an entire population, hence the intended sample size is equal to the population. In experimental design, where a study may be divided into different treatment groups, there may be different sample sizes for each group.

Noncentral t-distribution: Probability distribution

The noncentral t-distribution generalizes Student's t-distribution using a noncentrality parameter. Whereas the central probability distribution describes how a test statistic t is distributed when the difference tested is null, the noncentral distribution describes how t is distributed when the null is false. This leads to its use in statistics, especially calculating statistical power. The noncentral t-distribution is also known as the singly noncentral t-distribution, and in addition to its primary use in statistical inference, is also used in robust modeling for data.

Lindley's paradox is a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give different results for certain choices of the prior distribution. The problem of the disagreement between the two approaches was discussed in Harold Jeffreys' 1939 textbook; it became known as Lindley's paradox after Dennis Lindley called the disagreement a paradox in a 1957 paper.

Bootstrapping is a procedure for estimating the distribution of an estimator by resampling one's data or a model estimated from the data. Bootstrapping assigns measures of accuracy to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.

A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio Z = X/Y is a ratio distribution.

Tukey's range test, also known as Tukey's test, Tukey method, Tukey's honest significance test, or Tukey's HSD test, is a single-step multiple comparison procedure and statistical test. It can be used to correctly interpret the statistical significance of the difference between means that have been selected for comparison because of their extreme values.

In probability and statistics, the Tweedie distributions are a family of probability distributions which include the purely continuous normal, gamma and inverse Gaussian distributions, the purely discrete scaled Poisson distribution, and the class of compound Poisson–gamma distributions which have positive mass at zero, but are otherwise continuous. Tweedie distributions are a special case of exponential dispersion models and are often used as distributions for generalized linear models.

A paired difference test, better known as a paired comparison, is a type of location test that is used when comparing two sets of paired measurements to assess whether their population means differ. A paired difference test is designed for situations where there is dependence between pairs of measurements. That applies in a within-subjects study design, i.e., in a study where the same set of subjects undergo both of the conditions being compared.

In particle physics, CLs represents a statistical method for setting upper limits on model parameters, a particular form of interval estimation used for parameters that can take only non-negative values. Although CLs are said to refer to Confidence Levels, "The method's name is ... misleading, as the CLs exclusion region is not a confidence interval." It was first introduced by physicists working at the LEP experiment at CERN and has since been used by many high energy physics experiments. It is a frequentist method in the sense that the properties of the limit are defined by means of error probabilities, however it differs from standard confidence intervals in that the stated confidence level of the interval is not equal to its coverage probability. The reason for this deviation is that standard upper limits based on a most powerful test necessarily produce empty intervals with some fixed probability when the parameter value is zero, and this property is considered undesirable by most physicists and statisticians.

In statistics and probability theory, the nonparametric skew is a statistic occasionally used with random variables that take real values. It is a measure of the skewness of a random variable's distribution—that is, the distribution's tendency to "lean" to one side or the other of the mean. Its calculation does not require any knowledge of the form of the underlying distribution—hence the name nonparametric. It has some desirable properties: it is zero for any symmetric distribution; it is unaffected by a scale shift; and it reveals either left- or right-skewness equally well. In some statistical samples it has been shown to be less powerful than the usual measures of skewness in detecting departures of the population from normality.

In statistical hypothesis testing, e-values quantify the evidence in the data against a null hypothesis. They serve as a more robust alternative to p-values, addressing some shortcomings of the latter.

References

  1. "Statistical power and underpowered statistics — Statistics Done Wrong". www.statisticsdonewrong.com. Retrieved 30 September 2019.
  2. Nakagawa, Shinichi; Lagisz, Malgorzata; Yang, Yefeng; Drobniak, Szymon M. (2024). "Finding the right power balance: Better study design and collaboration can reduce dependence on statistical power". PLOS Biology. 22 (1): e3002423. doi: 10.1371/journal.pbio.3002423 . PMC   10773938 . PMID   38190355.
  3. Lehr, Robert (1992), "Sixteen S-squared over D-squared: A relation for crude sample size estimates", Statistics in Medicine, vol. 11, no. 8, pp. 1099–1102, doi:10.1002/sim.4780110811, ISSN   0277-6715, PMID   1496197
  4. van Belle, Gerald (2008-08-18). Statistical Rules of Thumb, Second Edition. Wiley Series in Probability and Statistics. Hoboken, NJ, USA: John Wiley & Sons, Inc. doi:10.1002/9780470377963. ISBN   978-0-470-37796-3.
  5. Wang, Xiaofeng; Ji, Xinge (2020). "Sample Size Estimation in Clinical Research: From Randomized Controlled Trials to Observational Studies". Chest. doi:10.1016/j.chest.2020.03.010.
  6. Everitt, Brian S. (2002). The Cambridge Dictionary of Statistics. Cambridge University Press. p. 321. ISBN   0-521-81099-X.
  7. Ellis, Paul (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge University Press. p. 52. ISBN   978-0521142465.
  8. Tsang, R.; Colley, L.; Lynd, L.D. (2009). "Inadequate statistical power to detect clinically significant differences in adverse event rates in randomized controlled trials". Journal of Clinical Epidemiology. 62 (6): 609–616. doi:10.1016/j.jclinepi.2008.08.005. PMID   19013761.
  9. Ellis, Paul D. (2010). The Essential Guide to Effect Sizes: An Introduction to Statistical Power, Meta-Analysis and the Interpretation of Research Results. United Kingdom: Cambridge University Press. p. 56.
  10. "Estimating Statistical Power When Using Multiple Testing Procedures". mdrc.org. November 2017.
  11. Hoenig; Heisey (2001). "The Abuse of Power". The American Statistician. 55 (1): 19–24. doi:10.1198/000313001300339897.
  12. Thomas, L. (1997). "Retrospective power analysis" (PDF). Conservation Biology . 11 (1): 276–280. Bibcode:1997ConBi..11..276T. doi:10.1046/j.1523-1739.1997.96102.x. hdl:10023/679.
  13. Graebner, Robert W. (1999). Study design with SAS: Estimating power with Monte Carlo methods (PDF). SUGI 24.
