Likelihood principle

In statistics, the likelihood principle is the proposition that, given a statistical model, all the evidence in a sample relevant to model parameters is contained in the likelihood function.

A likelihood function arises from a probability density function considered as a function of its distributional parameterization argument. For example, consider a model which gives the probability density function $f_X(x \mid \theta)$ of an observable random variable $X$ as a function of a parameter $\theta$. Then for a specific value $x$ of $X$, the function $\mathcal{L}(\theta \mid x) = f_X(x \mid \theta)$ is a likelihood function of $\theta$: it gives a measure of how "likely" any particular value of $\theta$ is, if we know that $X$ has the value $x$. The density function may be a density with respect to counting measure, i.e. a probability mass function.

Two likelihood functions are equivalent if one is a scalar multiple of the other. [lower-alpha 1] The likelihood principle is this: All information from the data that is relevant to inferences about the value of the model parameters is in the equivalence class to which the likelihood function belongs. The strong likelihood principle applies this same criterion to cases such as sequential experiments where the sample of data that is available results from applying a stopping rule to the observations earlier in the experiment. [1]

Example

Suppose $X$ is the number of successes in twelve independent Bernoulli trials, each with probability $\theta$ of success, and $Y$ is the number of independent Bernoulli trials needed to get three successes, again with probability $\theta$ of success on each trial.

Then the observation that $X = 3$ induces the likelihood function

$$\mathcal{L}(\theta \mid X = 3) = \binom{12}{3}\,\theta^{3}(1-\theta)^{9} = 220\,\theta^{3}(1-\theta)^{9},$$

while the observation that $Y = 12$ induces the likelihood function

$$\mathcal{L}(\theta \mid Y = 12) = \binom{11}{2}\,\theta^{3}(1-\theta)^{9} = 55\,\theta^{3}(1-\theta)^{9}.$$

The likelihood principle says that, as the data are the same in both cases, the inferences drawn about the value of $\theta$ should also be the same. In addition, all the inferential content in the data about the value of $\theta$ is contained in the two likelihoods, and is the same if they are proportional to one another. This is the case in the above example, reflecting the fact that the difference between observing $X = 3$ and observing $Y = 12$ lies not in the actual data collected, nor in the conduct of the experimenter, but in the two different designs of the experiment.

Specifically, in one case, the decision in advance was to try twelve times, regardless of the outcome; in the other case, the advance decision was to keep trying until three successes were observed. If one accepts the likelihood principle, then inference about $\theta$ should be the same in both cases, because the two likelihoods are proportional to each other: except for the constant leading factor of 220 versus 55, the two likelihood functions are the same.
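The proportionality of the two likelihoods can be checked numerically. The following is a minimal sketch in Python, assuming a binomial model for the design with twelve fixed trials and a negative binomial model for the design that stops at the third success; the function names and the chosen parameter values are illustrative only.

```python
from math import comb

def binomial_likelihood(theta, n=12, k=3):
    """P(k successes in n fixed trials | theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def negative_binomial_likelihood(theta, r=3, n=12):
    """P(the r-th success occurs on trial n | theta)."""
    return comb(n - 1, r - 1) * theta**r * (1 - theta)**(n - r)

# Arbitrary parameter values chosen for illustration.
thetas = [0.1, 0.25, 0.5, 0.9]
ratios = [binomial_likelihood(t) / negative_binomial_likelihood(t) for t in thetas]

for t, ratio in zip(thetas, ratios):
    print(f"theta={t:.2f}  ratio={ratio:.6f}")
```

Up to floating-point rounding, every ratio equals 220/55 = 4, whatever parameter value is used, which is what it means for the two likelihood functions to be scalar multiples of each other.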

Not every method of inference respects this equivalence, however. The use of frequentist methods involving p-values leads to different inferences for the two cases above, [2] showing that the outcome of frequentist methods depends on the experimental procedure, and thus violates the likelihood principle.
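The design dependence of p-values can be checked directly. The following is a minimal sketch, assuming a one-sided test of a success probability of 1/2 against a smaller value, applied to the data of the example (3 successes, the third arriving on the 12th trial). The choice of null hypothesis and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_p_value(n=12, k=3):
    """P(X <= k) for X ~ Binomial(n, 1/2): the number of trials is fixed."""
    return sum(Fraction(comb(n, j), 2**n) for j in range(k + 1))

def negative_binomial_p_value(r=3, n=12):
    """P(Y >= n), where Y is the trial on which the r-th success occurs.

    Needing at least n trials means that the first n - 1 trials
    contain at most r - 1 successes.
    """
    m = n - 1
    return sum(Fraction(comb(m, j), 2**m) for j in range(r))

p_fixed = binomial_p_value()
p_stop = negative_binomial_p_value()
print(p_fixed, float(p_fixed))
print(p_stop, float(p_stop))
```

Under these assumptions the fixed-sample-size design gives 299/4096 (about 7.3%) and the stop-at-three-successes design gives 67/2048 (about 3.3%), so a 5% threshold separates the two even though the likelihood functions are proportional.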

The law of likelihood

A related concept is the law of likelihood, the notion that the extent to which the evidence supports one parameter value or hypothesis against another is indicated by the ratio of their likelihoods, their likelihood ratio. That is,

$$\Lambda = \frac{\mathcal{L}(a \mid X = x)}{\mathcal{L}(b \mid X = x)} = \frac{P(X = x \mid a)}{P(X = x \mid b)}$$

is the degree to which the observation $x$ supports parameter value or hypothesis $a$ against $b$. If this ratio is 1, the evidence is indifferent; if greater than 1, the evidence supports the value $a$ against $b$; or if less, then vice versa.

In Bayesian statistics, this ratio is known as the Bayes factor, and Bayes' rule can be seen as the application of the law of likelihood to inference.
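To make the likelihood ratio concrete, here is a minimal sketch that evaluates it for data consisting of 3 successes and 9 failures. The two parameter values compared (1/4 and 1/2) and the helper name `likelihood_ratio` are hypothetical choices for illustration; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from math import comb

def likelihood_ratio(a, b, successes=3, failures=9, constant=1):
    """Likelihood ratio L(a | x) / L(b | x) for Bernoulli-trial data.

    `constant` is the design-dependent leading factor of the likelihood;
    it appears in numerator and denominator and therefore cancels.
    """
    def likelihood(theta):
        return constant * theta**successes * (1 - theta)**failures
    return likelihood(a) / likelihood(b)

a, b = Fraction(1, 4), Fraction(1, 2)
lr_fixed_n = likelihood_ratio(a, b, constant=comb(12, 3))    # twelve fixed trials
lr_stop_at_3 = likelihood_ratio(a, b, constant=comb(11, 2))  # stop at third success
print(lr_fixed_n, float(lr_fixed_n))
print(lr_fixed_n == lr_stop_at_3)
```

The ratio is 19683/4096, roughly 4.8, so these data support the value 1/4 over 1/2, and the support is identical under both sampling designs because the leading constants cancel.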

In frequentist inference, the likelihood ratio is used in the likelihood-ratio test, but other non-likelihood tests are used as well. The Neyman–Pearson lemma states that the likelihood-ratio test is the most powerful test for comparing two simple hypotheses at a given significance level, which gives a frequentist justification for the law of likelihood.

Combining the likelihood principle with the law of likelihood yields the consequence that the parameter value which maximizes the likelihood function is the value which is most strongly supported by the evidence. This is the basis for the widely used method of maximum likelihood.
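As an illustration of maximum likelihood, the following minimal sketch locates the maximizing parameter value by a simple grid search for data with 3 successes and 9 failures. The grid resolution and the function name `grid_mle` are assumptions for illustration; in practice a closed-form solution or a numerical optimizer would normally be used.

```python
from math import comb

def grid_mle(constant, successes=3, failures=9, steps=1000):
    """Return the grid point in (0, 1) that maximizes the likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda t: constant * t**successes * (1 - t)**failures)

mle_fixed_n = grid_mle(comb(12, 3))    # twelve fixed trials
mle_stop_at_3 = grid_mle(comb(11, 2))  # stop at third success
print(mle_fixed_n, mle_stop_at_3)
```

Both searches return 0.25, the observed proportion of successes: multiplying a likelihood function by a positive constant does not move its maximum.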

History

The likelihood principle was first identified by that name in print in 1962 (Barnard et al., Birnbaum, and Savage et al.), but arguments for the same principle, unnamed, and the use of the principle in applications go back to the works of R.A. Fisher in the 1920s. The law of likelihood was identified by that name by I. Hacking (1965). More recently the likelihood principle as a general principle of inference has been championed by A.W.F. Edwards. The likelihood principle has been applied to the philosophy of science by R. Royall. [3]

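Birnbaum (1962) initially argued that the likelihood principle follows from two more primitive and seemingly reasonable principles, the conditionality principle and the sufficiency principle. The conditionality principle says that if an experiment is chosen by a random process independent of the states of nature $\theta$, then only the experiment actually performed is relevant to inferences about $\theta$. The sufficiency principle says that if $T(X)$ is a sufficient statistic for $\theta$, and if in two experiments with data $x_1$ and $x_2$ we have $T(x_1) = T(x_2)$, then the evidence about $\theta$ given by the two experiments is the same.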

However, upon further consideration Birnbaum rejected both his conditionality principle and the likelihood principle. [4] The adequacy of Birnbaum's original argument has also been contested by others (see below for details).

Arguments for and against

Some widely used methods of conventional statistics, for example many significance tests, are not consistent with the likelihood principle.

Let us briefly consider some of the arguments for and against the likelihood principle.

The original Birnbaum argument

According to Giere (1977), [5] Birnbaum rejected [4] both his own conditionality principle and the likelihood principle because they were both incompatible with what he called the “confidence concept of statistical evidence”, which Birnbaum (1970) describes as taking “from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” ( [4] p. 1033). The confidence concept incorporates only limited aspects of the likelihood concept and only some applications of the conditionality concept. Birnbaum later notes that it was the unqualified equivalence formulation of his 1962 version of the conditionality principle that led “to the monster of the likelihood axiom” ( [6] p. 263).

Birnbaum's original argument for the likelihood principle has also been disputed by other statisticians including Akaike, [7] Evans [8] and philosophers of science, including Deborah Mayo. [9] [10] Dawid points out fundamental differences between Mayo's and Birnbaum's definitions of the conditionality principle, arguing Birnbaum's argument cannot be so readily dismissed. [11] A new proof of the likelihood principle has been provided by Gandenberger that addresses some of the counterarguments to the original proof. [12]

Experimental design arguments on the likelihood principle

Unrealized events play a role in some common statistical methods. For example, the result of a significance test depends on the p-value, the probability of a result as extreme or more extreme than the observation, and that probability may depend on the design of the experiment. To the extent that the likelihood principle is accepted, such methods are therefore rejected.

Some classical significance tests are not based on the likelihood. The following are a simple and a more complicated example of such tests, both built on a commonly cited example called the optional stopping problem.

Example 1 – simple version

Suppose I tell you that I tossed a coin 12 times and in the process observed 3 heads. You might make some inference about the probability of heads and whether the coin was fair.

Suppose now I tell you that I tossed the coin until I observed 3 heads, and that this took 12 tosses. Will you now make some different inference?

The likelihood function is the same in both cases: it is proportional to

$$p^{3}(1-p)^{9},$$

where $p$ is the probability of heads.

So according to the likelihood principle, in either case the inference should be the same.

Example 2 – a more elaborated version of the same statistics

Suppose a number of scientists are assessing the probability of a certain outcome (which we shall call 'success') in experimental trials. Conventional wisdom suggests that if there is no bias towards success or failure then the success probability would be one half. Adam, a scientist, conducted 12 trials and obtained 3 successes and 9 failures. One of those successes was the 12th and last observation. Then Adam left the lab.

Bill, a colleague in the same lab, continued Adam's work and published Adam's results, along with a significance test. He tested the null hypothesis $H_0$ that $p$, the success probability, is equal to a half, versus $p < 0.5$. If we ignore the information that the third success was the 12th and last observation, the probability under $H_0$ of the observed result or a more extreme one, that is, of 3 or fewer successes in 12 trials, is

$$\left[\binom{12}{3} + \binom{12}{2} + \binom{12}{1} + \binom{12}{0}\right]\left(\tfrac{1}{2}\right)^{12},$$

which is 299/4096 = 7.3%. Thus the null hypothesis is not rejected at the 5% significance level if we ignore the knowledge that the third success was the 12th result.

However, observe that this first calculation also counts sequences of 12 outcomes that end in tails, contrary to the problem statement, in which the 12th trial was a success.

If we redo this calculation, we realize that the probability under the null hypothesis must be the probability of a fair coin landing 2 or fewer heads in 11 trials, multiplied by the probability of the fair coin landing heads on the 12th trial:

$$\left[\binom{11}{2} + \binom{11}{1} + \binom{11}{0}\right]\left(\tfrac{1}{2}\right)^{11} \times \tfrac{1}{2},$$

which is 67/2048 × 1/2 = 67/4096 = 1.64%. Now the result is statistically significant at the 5% level.
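The counts behind 299/4096 and 67/4096 can be verified by brute force. The following minimal sketch enumerates all sequences of 12 fair coin tosses, each of which has probability 1/4096 under the null hypothesis; the encoding of heads as 1 and tails as 0 and the variable names are choices made for illustration.

```python
from itertools import product

# 1 = success (heads), 0 = failure (tails).
sequences = list(product((0, 1), repeat=12))

# Sequences with 3 or fewer successes, split by the outcome of the 12th trial.
at_most_3 = [s for s in sequences if sum(s) <= 3]
ending_in_success = [s for s in at_most_3 if s[-1] == 1]
ending_in_failure = [s for s in at_most_3 if s[-1] == 0]

total = len(sequences)
print(len(at_most_3), total)
print(len(ending_in_success), len(ending_in_failure))
```

Of the 299 sequences with at most 3 successes, 232 end in a failure and only 67 end in a success, which is where the two figures 299/4096 and 67/4096 come from.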

Charlotte, another scientist, reads Bill's paper and writes a letter, saying that it is possible that Adam kept trying until he obtained 3 successes, in which case the probability of needing to conduct 12 or more experiments is the probability of at most 2 successes in the first 11 trials,

$$\left[\binom{11}{2} + \binom{11}{1} + \binom{11}{0}\right]\left(\tfrac{1}{2}\right)^{11},$$

which is 134/4096 = 3.27%. Now the result is statistically significant at the 5% level, whereas Bill's fixed-sample-size calculation of 7.3% was not. Note that there is no contradiction between these analyses; each computation is correct for the design it assumes.

To these scientists, whether a result is significant or not depends on the design of the experiment, and not only on the likelihood (in the sense of the likelihood function) of the parameter value being 1/2.

Summary of the illustrated issues

Results of this kind are considered by some as arguments against the likelihood principle. For others they exemplify the value of the likelihood principle and are an argument against significance tests.

Similar themes appear when comparing Fisher's exact test with Pearson's chi-squared test.

The voltmeter story

An argument in favor of the likelihood principle is given by Edwards in his book Likelihood. He cites the following story from J.W. Pratt, slightly condensed here. Note that the likelihood function depends only on what actually happened, and not on what could have happened.

An engineer draws a random sample of electron tubes and measures their voltages. The measurements range from 75 to 99 Volts. A statistician computes the sample mean and a confidence interval for the true mean. Later the statistician discovers that the voltmeter reads only as far as 100 Volts, so technically, the population appears to be “censored”. If the statistician is orthodox this necessitates a new analysis.

However, the engineer says he has another meter reading to 1000 Volts, which he would have used if any voltage had been over 100. This is a relief to the statistician, because it means the population was effectively uncensored after all. But later, the statistician discovers that the second meter had not been working when the measurements were taken. The engineer informs the statistician that he would not have held up the original measurements until the second meter was fixed, and the statistician informs him that new measurements are required. The engineer is astounded. “Next you'll be asking about my oscilloscope!”

Throwback to Example 2 in the prior section

This story can be translated to Adam's stopping rule above, as follows: Adam stopped immediately after 3 successes, because his boss Bill had instructed him to do so. After the publication of the statistical analysis by Bill, Adam realizes that he has missed a later instruction from Bill to instead conduct 12 trials, and that Bill's paper is based on this second instruction. Adam is very glad that he got his 3 successes after exactly 12 trials, and explains to his friend Charlotte that by coincidence he executed the second instruction. Later, Adam is astonished to hear about Charlotte's letter, explaining that now the result is significant.

Notes

  1. Geometrically, if they occupy the same point in projective space.

References

  1. Dodge, Y. (2003). The Oxford Dictionary of Statistical Terms. Oxford University Press. ISBN 0-19-920613-9.
  2. Vidakovic, Brani. "The Likelihood Principle" (PDF). H. Milton Stewart School of Industrial & Systems Engineering. Georgia Tech. Retrieved 21 October 2017.
  3. Royall, Richard (1997). Statistical Evidence: A likelihood paradigm. Boca Raton, FL: Chapman and Hall. ISBN 0-412-04411-0.
  4. Birnbaum, A. (14 March 1970). "Statistical methods in scientific inference". Nature. 225: 1033.
  5. Giere, R. (1977). "Allan Birnbaum's Conception of Statistical Evidence". Synthese. 36: 5–13.
  6. Birnbaum, A. (1975). "Discussion of J. D. Kalbfleisch's paper 'Sufficiency and Conditionality'". Biometrika. 62: 262–264.
  7. Akaike, H. (1982). "On the fallacy of the likelihood principle". Statistics & Probability Letters. 1 (2): 75–78.
  8. Evans, Michael (2013). "What does the proof of Birnbaum's theorem prove?". arXiv:1302.5468 [math.ST].
  9. Mayo, D. (2010). "An error in the argument from Conditionality and Sufficiency to the Likelihood Principle". In Mayo, D.; Spanos, A. (eds.). Error and Inference: Recent exchanges on experimental reasoning, reliability and the objectivity and rationality of science (PDF). Cambridge, GB: Cambridge University Press. pp. 305–314.
  10. Mayo, D. (2014). "On the Birnbaum argument for the Strong Likelihood Principle". Statistical Science. 29: 227–266 (with discussion).
  11. Dawid, A.P. (2014). "Discussion of "On the Birnbaum argument for the Strong Likelihood Principle"". Statistical Science. 29 (2): 240–241. arXiv:1411.0807. doi:10.1214/14-STS470. S2CID 55068072.
  12. Gandenberger, Greg (2014). "A new proof of the likelihood principle". British Journal for the Philosophy of Science. 66 (3): 475–503. doi:10.1093/bjps/axt039.
