Lindley's paradox

Lindley's paradox is a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give different results for certain choices of the prior distribution. The problem of the disagreement between the two approaches was discussed in Harold Jeffreys' 1939 textbook; [1] it became known as Lindley's paradox after Dennis Lindley called the disagreement a paradox in a 1957 paper. [2]

Although referred to as a paradox, the differing results from the Bayesian and frequentist approaches can be explained as the two methods answering fundamentally different questions, rather than an actual disagreement between them.

Nevertheless, for a large class of priors the differences between the frequentist and Bayesian approaches are caused by keeping the significance level fixed: as even Lindley recognized, "the theory does not justify the practice of keeping the significance level fixed", and "some computations by Prof. Pearson in the discussion to that paper emphasized how the significance level would have to change with the sample size, if the losses and prior probabilities were kept fixed". [2] In fact, if the critical value increases with the sample size suitably fast, then the disagreement between the frequentist and Bayesian approaches becomes negligible as the sample size increases. [3]

Description of the paradox

The result x of some experiment has two possible explanations, hypotheses H0 and H1, and some prior distribution representing uncertainty as to which hypothesis is more accurate before taking x into account.

Lindley's paradox occurs when

  1. The result x is "significant" by a frequentist test of H0, indicating sufficient evidence to reject H0, say, at the 5% level, and
  2. The posterior probability of H0 given x is high, indicating strong evidence that H0 is in better agreement with x than H1.

These results can occur at the same time when H0 is very specific, H1 more diffuse, and the prior distribution does not strongly favor one or the other, as seen below.

Numerical example

The following numerical example illustrates Lindley's paradox. In a certain city, 49,581 boys and 48,870 girls have been born over a certain time period. The observed proportion x of male births is thus x = 49581/98451 ≈ 0.5036. We assume the fraction of male births is a binomial variable with parameter θ. We are interested in testing whether θ is 0.5 or some other value. That is, our null hypothesis is H0: θ = 0.5, and the alternative is H1: θ ≠ 0.5.

Frequentist approach

The frequentist approach to testing H0 is to compute a p-value, the probability of observing a fraction of boys at least as large as x, assuming H0 is true. Because the number of births is very large, we can use a normal approximation for the fraction of male births X ~ N(μ, σ²), with μ = 0.5 and σ² = θ(1 − θ)/n = 0.25/98451, to compute

P(X ≥ x | H0) = P(X ≥ 0.5036) ≈ 0.0117.

We would have been equally surprised if we had seen 49,581 female births, i.e. x ≈ 0.4964, so a frequentist would usually perform a two-sided test, for which the p-value would be p ≈ 2 × 0.0117 = 0.0235. In both cases the p-value is lower than the significance level α = 5%, so the frequentist approach rejects H0, as it disagrees with the observed data.
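The frequentist calculation above can be sketched in Python; this is a minimal illustration using only the standard library and the normal approximation (an exact binomial tail, e.g. via scipy.stats.binomtest, would give very similar numbers):

```python
# Two-sided frequentist test of H0: theta = 0.5 via the normal
# approximation to the binomial fraction of male births.
import math

boys, girls = 49581, 48870
n = boys + girls                    # 98451 births
x = boys / n                        # observed fraction of boys, ~0.5036

mu = 0.5                            # mean of X under H0
sigma = math.sqrt(0.5 * 0.5 / n)    # sd of the fraction under H0

z = (x - mu) / sigma                # ~2.27 sd above 0.5
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # P(X >= x | H0) ~ 0.0117
p_two_sided = 2 * p_one_sided                    # ~0.0235

print(f"z = {z:.3f}, one-sided p = {p_one_sided:.4f}, "
      f"two-sided p = {p_two_sided:.4f}")
# Both p-values fall below alpha = 0.05, so the frequentist rejects H0.
```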

Bayesian approach

Assuming no reason to favor one hypothesis over the other, the Bayesian approach would be to assign prior probabilities P(H0) = P(H1) = 0.5 and a uniform distribution to θ under H1, and then to compute the posterior probability of H0 using Bayes' theorem:

P(H0 | k) = P(k | H0) P(H0) / [P(k | H0) P(H0) + P(k | H1) P(H1)].

After observing k = 49581 boys out of n = 98451 births, we can compute the posterior probability of each hypothesis using the probability mass function for a binomial variable:

P(k | H0) = C(n, k) (1/2)^n ≈ 1.95 × 10^−4,

P(k | H1) = ∫0^1 C(n, k) θ^k (1 − θ)^(n−k) dθ = C(n, k) B(k + 1, n − k + 1) = 1/(n + 1) ≈ 1.02 × 10^−5,

where B(a, b) is the Beta function.

From these values, we find the posterior probability P(H0 | k) ≈ 0.95, which strongly favors H0 over H1.
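The Bayesian calculation can be sketched the same way, using the log-gamma function to evaluate the binomial coefficient without overflow (standard library only):

```python
# Posterior probability of H0 with equal prior weight on H0 and H1
# and a uniform prior on theta under H1.
import math

k, n = 49581, 98451
log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

# P(k | H0): binomial pmf at theta = 0.5
p_k_h0 = math.exp(log_choose + n * math.log(0.5))   # ~1.95e-4

# P(k | H1): binomial pmf integrated against a uniform prior on theta,
# which collapses to 1 / (n + 1)
p_k_h1 = 1.0 / (n + 1)                              # ~1.02e-5

# Equal priors, so the prior factors cancel in Bayes' theorem
posterior_h0 = p_k_h0 / (p_k_h0 + p_k_h1)
print(f"P(k|H0) = {p_k_h0:.3e}, P(k|H1) = {p_k_h1:.3e}, "
      f"P(H0|k) = {posterior_h0:.3f}")
# The posterior probability of H0 comes out near 0.95.
```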

The two approaches—the Bayesian and the frequentist—appear to be in conflict, and this is the "paradox".

Reconciling the Bayesian and frequentist approaches

Almost sure hypothesis testing

Naaman [3] proposed an adaptation of the significance level to the sample size in order to control false positives: α_n = n^(−r) with r > 1/2. At least in the numerical example, taking r = 1/2 results in a significance level of 0.00318, so the frequentist would not reject the null hypothesis, which is in agreement with the Bayesian approach.
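A quick sketch of this shrinking threshold at the boundary value r = 1/2, compared against the two-sided p-value from the example above:

```python
# Sample-size-dependent significance level alpha_n = n**(-r), r = 1/2.
n = 98451
alpha_n = n ** -0.5      # ~0.00318

p_two_sided = 0.0235     # two-sided p-value from the frequentist section
print(f"alpha_n = {alpha_n:.5f}, reject H0: {p_two_sided < alpha_n}")
# p = 0.0235 exceeds alpha_n, so H0 is no longer rejected, matching
# the Bayesian conclusion.
```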

Uninformative priors

[Figure: distribution of the fraction of male births under the null hypothesis, and its posterior distribution]

If we use an uninformative prior and test a hypothesis more similar to that in the frequentist approach, the paradox disappears.

For example, if we calculate the posterior distribution of θ, using a uniform prior distribution on θ (i.e. π(θ) = 1 on [0, 1]), we find that the posterior is a Beta distribution:

p(θ | k, n) = θ^k (1 − θ)^(n−k) / B(k + 1, n − k + 1).

If we use this to check the probability that a newborn is more likely to be a boy than a girl, i.e. P(θ > 0.5 | k, n), we find

P(θ > 0.5 | k, n) = ∫ from 0.5 to 1 of p(θ | k, n) dθ ≈ 0.988.

In other words, it is very likely that the proportion of male births is above 0.5.
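This tail probability can be sketched as follows; at this sample size the Beta(k + 1, n − k + 1) posterior is very nearly normal, so a normal approximation to its tail is used here (scipy.stats.beta.sf would give the exact value):

```python
# P(theta > 0.5 | k, n) under a uniform prior, i.e. the upper tail of
# the Beta(k + 1, n - k + 1) posterior, via a normal approximation.
import math

k, n = 49581, 98451
a, b = k + 1, n - k + 1                 # posterior Beta parameters

mean = a / (a + b)                      # posterior mean, ~0.5036
var = a * b / ((a + b) ** 2 * (a + b + 1))
z = (0.5 - mean) / math.sqrt(var)       # standardized distance to 0.5

prob_gt_half = 0.5 * math.erfc(z / math.sqrt(2))  # P(theta > 0.5 | k, n)
print(f"P(theta > 0.5 | k, n) ~ {prob_gt_half:.3f}")
# The posterior puts almost all of its mass above theta = 0.5.
```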

Neither analysis directly gives an estimate of the effect size, but both could be used to determine, for instance, whether the fraction of boy births is likely to be above some particular threshold.

The lack of an actual paradox

The apparent disagreement between the two approaches is caused by a combination of factors. First, the frequentist approach above tests H0 without reference to H1. The Bayesian approach evaluates H0 as an alternative to H1, and finds the first to be in better agreement with the observations. This is because the latter hypothesis is much more diffuse, as θ can be anywhere in [0, 1], which results in it having a very low posterior probability. To understand why, it is helpful to consider the two hypotheses as generators of the observations: under H0, we choose θ = 0.5 and ask how likely it is to see 49,581 boys in 98,451 births; under H1, we choose θ randomly from anywhere within 0 to 1 and ask the same question.

Most of the possible values for θ under H1 are very poorly supported by the observations. In essence, the apparent disagreement between the methods is not a disagreement at all, but rather two different statements about how the hypotheses relate to the data: the frequentist finds that H0 is a poor explanation for the observation, while the Bayesian finds that H0 is a far better explanation for the observation than H1.

The ratio of the sexes of newborns is improbably 50/50 male/female, according to the frequentist test. Yet 50/50 is a better approximation than most, but not all, other ratios. The hypothesis θ ≈ 0.504 would have fit the observation much better than almost all other ratios, including θ = 0.5.

For example, this choice of hypotheses and prior probabilities implies the statement "if θ > 0.49 and θ < 0.51, then the prior probability of θ being exactly 0.5 is 0.50/0.51 ≈ 98%". Given such a strong preference for θ = 0.5, it is easy to see why the Bayesian approach favors H0 in the face of x ≈ 0.5036, even though the observed value of x lies more than 2σ away from 0.5. That deviation is considered significant in the frequentist approach, but its significance is overruled by the prior in the Bayesian approach.

Looking at it another way, we can see that the prior distribution is essentially flat with a delta function at θ = 0.5. Clearly, this is dubious. In fact, picturing real numbers as being continuous, it would be more logical to assume that it is impossible for any given number to be exactly the parameter value, i.e., we should assume P(θ = 0.5) = 0.

A more realistic distribution for θ under the alternative hypothesis produces a less surprising result for the posterior of H0. For example, if we replace H1 with H2: θ = x, i.e., the maximum likelihood estimate for θ, the posterior probability of H0 would be only 0.07, compared to 0.93 for H2 (of course, one cannot actually use the MLE as part of a prior distribution).
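This comparison can be sketched by reusing the binomial likelihoods from the earlier calculation, now with a point alternative at the observed fraction and equal prior weight:

```python
# Posterior of H0: theta = 0.5 against a point alternative H2: theta = x,
# where x = k/n is the maximum likelihood estimate.
import math

k, n = 49581, 98451
x = k / n
log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

p_k_h0 = math.exp(log_choose + n * math.log(0.5))                  # theta = 0.5
p_k_h2 = math.exp(log_choose + k * math.log(x)
                  + (n - k) * math.log(1 - x))                     # theta = x

# Equal prior weight on H0 and H2
posterior_h0 = p_k_h0 / (p_k_h0 + p_k_h2)
print(f"P(H0|k) = {posterior_h0:.2f}, P(H2|k) = {1 - posterior_h0:.2f}")
# Roughly 0.07 vs 0.93: against a sharp alternative, the point null
# is no longer favored.
```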

Recent discussion

The paradox continues to be a source of active discussion. [3] [4] [5] [6]

Notes

  1. Jeffreys, Harold (1939). Theory of Probability. Oxford University Press. MR 0000924.
  2. Lindley, D. V. (1957). "A statistical paradox". Biometrika. 44 (1–2): 187–192. doi:10.1093/biomet/44.1-2.187. JSTOR 2333251.
  3. Naaman, Michael (2016). "Almost sure hypothesis testing and a resolution of the Jeffreys–Lindley paradox". Electronic Journal of Statistics. 10 (1): 1526–1550. doi:10.1214/16-EJS1146. ISSN 1935-7524.
  4. Spanos, Aris (2013). "Who should be afraid of the Jeffreys–Lindley paradox?". Philosophy of Science. 80 (1): 73–93. doi:10.1086/668875. S2CID 85558267.
  5. Sprenger, Jan (2013). "Testing a precise null hypothesis: The case of Lindley's paradox". Philosophy of Science. 80 (5): 733–744. doi:10.1086/673730. hdl:2318/1657960. S2CID 27444939.
  6. Robert, Christian P. (2014). "On the Jeffreys–Lindley paradox". Philosophy of Science. 81 (2): 216–232. arXiv:1303.5973. doi:10.1086/675729. S2CID 120002033.
