Bayesian inference

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.

Bayes theorem Probability based on prior knowledge

In probability theory and statistics, Bayes' theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if cancer is related to age, then, using Bayes' theorem, a person's age can be used to assess the probability that they have cancer more accurately than is possible without knowledge of the person's age.

Evidence Material supporting an assertion

Evidence, broadly construed, is anything presented in support of an assertion. This support may be strong or weak. The strongest type of evidence is that which provides direct proof of the truth of an assertion. At the other extreme is evidence that is merely consistent with an assertion but does not rule out other, contradictory assertions, as in circumstantial evidence.

Introduction to Bayes' rule

A geometric visualisation of Bayes' theorem. In the table, the values 2, 3, 6 and 9 give the relative weights of each corresponding condition and case. The figures denote the cells of the table involved in each metric, the probability being the fraction of each figure that is shaded. This shows that P(A|B) P(B) = P(B|A) P(A) i.e. P(A|B) = P(B|A) P(A)/P(B) . Similar reasoning can be used to show that P(Ā|B) = P(B|Ā) P(Ā)/P(B) etc.

Formal explanation

Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:

P(H | E) = P(E | H) · P(H) / P(E)

where

H stands for any hypothesis whose probability may be affected by data (called evidence below).
P(H), the prior probability, is the probability assigned to H before the evidence E is taken into account. It expresses one's beliefs about H before any data are observed.
E, the evidence, corresponds to new data that were not used in computing the prior probability.
P(H | E), the posterior probability, is the conditional probability of H after E is taken into account. For instance, there is a prior probability of finding buried treasure when digging in a random spot, and a posterior probability of finding it when digging in a spot where a metal detector rings.
P(E | H), the likelihood, is the probability of observing E given H. It expresses how probable the observations are under each hypothesis.
P(E), the marginal likelihood, also called the evidence or model evidence, is the probability of E with the hypothesis marginalized out. It is the same for every hypothesis being considered.

For different values of H, only the factors P(H) and P(E | H), both in the numerator, affect the value of P(H | E): the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).

Bayes' rule can also be written as follows:

P(H | E) = [P(E | H) / P(E)] · P(H)

where the factor P(E | H) / P(E) can be interpreted as the impact of the evidence E on the probability of the hypothesis H.
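As a minimal sketch of the rule in this form, the following Python snippet computes a posterior from a prior and the ratio P(E | H) / P(E). The function name `bayes_update` and all numbers are hypothetical, chosen only for illustration.

```python
def bayes_update(prior, likelihood, marginal):
    """Return P(H | E) = (P(E | H) / P(E)) * P(H)."""
    if marginal <= 0:
        raise ValueError("P(E) must be positive")
    impact = likelihood / marginal  # factor by which E rescales the belief in H
    return impact * prior


# Hypothetical numbers: P(H) = 0.2, P(E | H) = 0.9, P(E | not H) = 0.3.
prior = 0.2
p_e = 0.9 * prior + 0.3 * (1 - prior)  # law of total probability
posterior = bayes_update(prior, 0.9, p_e)
print(round(posterior, 4))
```

Because the ratio here is greater than 1, the evidence raises the probability of the hypothesis above its prior value.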

Alternatives to Bayesian updating

Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.

Ian Hacking noted that traditional "Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote [1] [2] "And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour."

In gambling, a Dutch book or lock is a set of odds and bets which guarantees a profit, regardless of the outcome of the gamble. It is associated with probabilities implied by the odds not being coherent.

Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "probability kinematics") following the publication of Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability. [3] The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory. [4]

Formal description of Bayesian inference

Definitions

A parameter, generally, is any characteristic that can help in defining or classifying a particular system. That is, a parameter is an element of a system that is useful, or critical, when identifying the system, or when evaluating its performance, status, condition, etc.

In Bayesian statistics, a hyperparameter is a parameter of a prior distribution; the term is used to distinguish them from parameters of the model for the underlying system under analysis.

Bayesian inference

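With θ denoting the parameter of the model, X the observed data and α the hyperparameter of the prior, the posterior distribution is p(θ | X, α) = p(X | θ) p(θ | α) / p(X | α) ∝ p(X | θ) p(θ | α). This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".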

Bayesian prediction

Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. Only this way is the entire posterior distribution of the parameter(s) used. By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s)—e.g., by maximum likelihood or maximum a posteriori estimation (MAP)—and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance of the predictive distribution.

(In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics, when constructed from a normal distribution with unknown mean and variance, are constructed using a Student's t-distribution. This correctly estimates the variance, due to the fact that (1) the average of normally distributed random variables is also normally distributed, and (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly, or at least to an arbitrary level of precision when numerical methods are used.)

Both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, and hence the prior and posterior distributions come from the same family, it can easily be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
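The difference between a plug-in prediction and a posterior predictive distribution can be illustrated with a small sketch. It assumes a binomial model with a conjugate Beta(1, 1) prior, and the data (7 successes in 10 trials, then 5 future trials) are hypothetical. The compound distribution in this case is the beta-binomial, whose variance has a closed form.

```python
def plug_in_variance(m, p_hat):
    """Variance of the success count in m trials under Binomial(m, p_hat)."""
    return m * p_hat * (1 - p_hat)


def beta_binomial_variance(m, a, b):
    """Variance of the success count in m trials when p ~ Beta(a, b)."""
    return m * a * b * (a + b + m) / ((a + b) ** 2 * (a + b + 1))


n, k = 10, 7          # hypothetical observed data: 7 successes in 10 trials
m = 5                 # number of future trials to predict
a_post = 1 + k        # Beta(1, 1) prior updated with the successes
b_post = 1 + (n - k)  # ... and with the failures

var_plug_in = plug_in_variance(m, k / n)  # uses the maximum likelihood estimate
var_predictive = beta_binomial_variance(m, a_post, b_post)
print(var_plug_in, var_predictive)
```

In this example the plug-in variance is smaller than the posterior predictive variance, because the plug-in approach treats the estimated parameter as if it were known exactly.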

Inference over exclusive and exhaustive possibilities

If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.

General formulation

Diagram illustrating event space Ω in general formulation of Bayesian inference. Although this diagram shows discrete models and events, the continuous case may be visualized similarly using probability densities.

Suppose a process is generating independent and identically distributed events E_n, n = 1, 2, 3, …, but the probability distribution is unknown. Let the event space Ω represent the current state of belief for this process. Each model is represented by event M_m. The conditional probabilities P(E_n | M_m) are specified to define the models. P(M_m) is the degree of belief in M_m. Before the first inference step, {P(M_m)} is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate E ∈ {E_n}. For each M ∈ {M_m}, the prior P(M) is updated to the posterior P(M | E). From Bayes' theorem: [5]

P(M | E) = P(E | M) P(M) / Σ_m P(E | M_m) P(M_m)

Upon observation of further evidence, this procedure may be repeated.

Multiple observations

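For a sequence of independent and identically distributed observations E = (e_1, …, e_n), it can be shown by induction that repeated application of the above is equivalent to the following single update for each model M, with the sum running over all models M_m:

P(M | E) = P(E | M) P(M) / Σ_m P(E | M_m) P(M_m)

where

P(E | M) = Π_k P(e_k | M).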
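The update over a discrete set of models, and its equivalence to a single batch update, can be sketched in Python as follows. The two coin models, their head probabilities and the observation sequence are hypothetical values chosen for illustration; `update` and `likelihoods_for` are illustrative names, not part of any library.

```python
from math import prod


def update(priors, likelihoods):
    """One Bayesian step: P(M | E) = P(E | M) P(M) / sum_m P(E | M_m) P(M_m)."""
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


# Two hypothetical coin models: P(heads | M_1) = 0.5, P(heads | M_2) = 0.8.
p_heads = [0.5, 0.8]
observations = ["H", "H", "T", "H"]


def likelihoods_for(e):
    """P(e | M_m) for every model, for a single observation e."""
    return [p if e == "H" else 1 - p for p in p_heads]


# Sequential updating: the posterior of one step is the prior of the next.
sequential = [0.5, 0.5]
for e in observations:
    sequential = update(sequential, likelihoods_for(e))

# Batch updating: P(E | M) is the product of the per-observation likelihoods.
per_model = zip(*(likelihoods_for(e) for e in observations))
batch = update([0.5, 0.5], [prod(column) for column in per_model])

print(sequential, batch)
```

Both routes give the same posterior up to floating-point rounding, which is what the induction argument predicts.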


Parametric formulation

By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is however equally applicable to discrete distributions.

Let the vector θ span the parameter space. Let the initial prior distribution over θ be p(θ | α), where α is a set of parameters to the prior itself, or hyperparameters. Let E = (e_1, …, e_n) be a sequence of independent and identically distributed event observations, where all e_i are distributed as p(e | θ) for some θ. Bayes' theorem is applied to find the posterior distribution over θ:

p(θ | E, α) = p(E | θ, α) p(θ | α) / p(E | α) = p(E | θ, α) p(θ | α) / ∫ p(E | θ, α) p(θ | α) dθ

where

p(E | θ, α) = Π_k p(e_k | θ).
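A minimal sketch of the parametric update, assuming a Bernoulli model with a single parameter θ and a uniform prior (both are assumptions made for illustration). The normalizing integral is approximated with the trapezoidal rule on a grid, and the observations are hypothetical.

```python
def trapezoid(ys, xs):
    """Trapezoidal approximation of the integral of y over x."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))


def grid_posterior(observations, prior_pdf, grid):
    """Posterior density of theta on a grid for i.i.d. Bernoulli observations."""
    unnormalized = []
    for theta in grid:
        likelihood = 1.0
        for e in observations:  # product of the per-observation likelihoods
            likelihood *= theta if e == 1 else 1 - theta
        unnormalized.append(likelihood * prior_pdf(theta))
    evidence = trapezoid(unnormalized, grid)  # approximates the denominator
    return [u / evidence for u in unnormalized]


grid = [i / 1000 for i in range(1001)]
observations = [1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical: 6 successes in 8 trials
posterior = grid_posterior(observations, lambda theta: 1.0, grid)
theta_mode = grid[max(range(len(grid)), key=posterior.__getitem__)]
print(theta_mode)
```

A grid is only practical for low-dimensional parameter spaces. It is used here because it mirrors the formula directly.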

Mathematical properties

Interpretation of factor

If the factor P(E | M) / P(E) exceeds 1, then P(E | M) > P(E). That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, P(E | M) / P(E) = 1, so P(E | M) = P(E). That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.

Cromwell's rule

If P(M) = 0 then P(M | E) = 0. If P(M) = 1, then P(M | E) = 1. This can be interpreted to mean that hard convictions are insensitive to counter-evidence.

The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not M" in place of "M", yielding "if 1 − P(M) = 0, then 1 − P(M | E) = 0", from which the result immediately follows.
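Cromwell's rule can be checked numerically with a short sketch. The models and likelihood values are hypothetical, and `update` is an illustrative helper that applies Bayes' theorem over a finite set of models.

```python
def update(priors, likelihoods):
    """Apply Bayes' theorem over a finite, exhaustive set of models."""
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


# A model with prior 0 stays at 0, however strongly the evidence favours it.
posterior_zero = update([0.0, 0.3, 0.7], [0.99, 0.01, 0.02])

# A model with prior 1 stays at 1, however poorly it explains the evidence.
posterior_one = update([1.0, 0.0], [0.001, 0.999])

print(posterior_zero[0], posterior_one[0])
```

In practice this is a reason to avoid assigning prior probabilities of exactly 0 or 1 to anything that is not logically impossible or certain.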

Asymptotic behaviour of posterior

Consider the behaviour of a belief distribution as it is updated a large number of times with independent and identically distributed trials. For sufficiently nice prior probabilities, the Bernstein-von Mises theorem gives that in the limit of infinite trials, the posterior converges to a Gaussian distribution independent of the initial prior, under some conditions first outlined and rigorously proven by Joseph L. Doob in 1948, namely if the random variable under consideration has a finite probability space. More general results were obtained later by the statistician David A. Freedman, who established in two seminal research papers in 1963 [6] and 1965 [7] when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the Bernstein-von Mises theorem is not applicable. In this case there is almost surely no asymptotic convergence. Later, in the 1980s and 1990s, Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces. [8] To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.

Conjugate priors

In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.
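As an illustration of a closed-form update, the sketch below assumes a Bernoulli likelihood with a Beta(α, β) prior, a standard conjugate pair. The hyperparameter values and observations are hypothetical.

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Return the hyperparameters of the Beta posterior after 0/1 observations."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


data = [1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical: 6 successes, 2 failures

# Updating with all data at once ...
alpha_post, beta_post = beta_bernoulli_update(2.0, 2.0, data)

# ... gives the same result as updating in two stages.
a_mid, b_mid = beta_bernoulli_update(2.0, 2.0, data[:3])
a_two_stage, b_two_stage = beta_bernoulli_update(a_mid, b_mid, data[3:])

posterior_mean = alpha_post / (alpha_post + beta_post)
print(alpha_post, beta_post, posterior_mean)
```

No integration is needed: the posterior stays in the Beta family and only the hyperparameters change.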

Estimates of parameters and predictions

It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution.

For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a robust estimator. [9]

If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation. [10] Writing θ for the parameter, X for the observed data and α for the hyperparameters of the prior:

θ̃ = E[θ] = ∫ θ p(θ | X, α) dθ

Taking a value with the greatest probability defines maximum a posteriori (MAP) estimates: [11]

{θ_MAP} ⊂ arg max_θ p(θ | X, α)

There are examples where no maximum is attained, in which case the set of MAP estimates is empty.

There are other methods of estimation that minimize the posterior risk (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ("frequentist statistics"). [12]

The posterior predictive distribution of a new observation x̃ (that is independent of previous observations) is determined by [13]

p(x̃ | X, α) = ∫ p(x̃ | θ) p(θ | X, α) dθ
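The point estimates and the predictive probability above can be computed numerically from any one-dimensional posterior density. The sketch below assumes a Beta(8, 4) posterior for a Bernoulli success probability, evaluated on a grid. The choice of posterior and the helper names are illustrative assumptions.

```python
def trapezoid(ys, xs):
    """Trapezoidal approximation of the integral of y over x."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))


def summarize(grid, density):
    """Return (posterior mean, posterior median, MAP estimate) on a grid."""
    mean = trapezoid([t * d for t, d in zip(grid, density)], grid)
    map_estimate = grid[max(range(len(grid)), key=density.__getitem__)]
    cumulative, median = 0.0, grid[-1]
    for i in range(len(grid) - 1):
        cumulative += (grid[i + 1] - grid[i]) * (density[i] + density[i + 1]) / 2
        if cumulative >= 0.5:  # first grid point where the CDF reaches one half
            median = grid[i + 1]
            break
    return mean, median, map_estimate


grid = [i / 2000 for i in range(2001)]
kernel = [t ** 7 * (1 - t) ** 3 for t in grid]  # unnormalized Beta(8, 4) density
z = trapezoid(kernel, grid)
density = [k / z for k in kernel]

mean, median, map_estimate = summarize(grid, density)
# For a Bernoulli model, the predictive probability of a success on the next
# observation is the integral of theta times the posterior, i.e. the mean.
predictive_success = mean
print(mean, median, map_estimate)
```

The three estimates differ because the posterior is skewed. Which one is appropriate depends on the loss function being minimized.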

Examples

Probability of a hypothesis

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?

Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let H_1 correspond to bowl #1, and H_2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H_1) = P(H_2), and the two must add up to 1, so both are equal to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls, we know that P(E | H_1) = 30/40 = 0.75 and P(E | H_2) = 20/40 = 0.5. Bayes' formula then yields

P(H_1 | E) = P(E | H_1) P(H_1) / (P(E | H_1) P(H_1) + P(E | H_2) P(H_2)) = (0.75 × 0.5) / (0.75 × 0.5 + 0.5 × 0.5) = 0.6

Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, P(H_1), which was 0.5. After observing the cookie, we must revise the probability to P(H_1 | E), which is 0.6.
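The cookie calculation can be reproduced exactly with rational arithmetic. This is a small sketch using Python's standard `fractions` module. The dictionary keys are illustrative names.

```python
from fractions import Fraction

# Prior: Fred is equally likely to pick either bowl.
priors = {"bowl1": Fraction(1, 2), "bowl2": Fraction(1, 2)}

# Likelihood of drawing a plain cookie from each bowl.
p_plain = {"bowl1": Fraction(30, 40), "bowl2": Fraction(20, 40)}

# Marginal probability of a plain cookie (the denominator of Bayes' formula).
evidence = sum(p_plain[b] * priors[b] for b in priors)

posterior = {b: p_plain[b] * priors[b] / evidence for b in priors}
print(posterior["bowl1"])  # prints 3/5
```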

Making a prediction

Example results for archaeology example. This simulation was generated using c=15.2.

An archaeologist is working at a site thought to be from the medieval period, from the 11th century to the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?

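The degree of belief in the continuous variable C (century) is to be calculated, with the discrete set of events {GD, GD̄, ḠD, ḠD̄} as evidence, where G denotes a glazed fragment, D a decorated fragment, and an overbar the absence of that property. Assuming linear variation of glaze and decoration with time, and that these variables are independent,

P(E = GD | C = c) = (0.01 + 0.16 (c − 11)) (0.5 − 0.09 (c − 11))
P(E = GD̄ | C = c) = (0.01 + 0.16 (c − 11)) (0.5 + 0.09 (c − 11))
P(E = ḠD | C = c) = (0.99 − 0.16 (c − 11)) (0.5 − 0.09 (c − 11))
P(E = ḠD̄ | C = c) = (0.99 − 0.16 (c − 11)) (0.5 + 0.09 (c − 11))

where 0.16 = (0.81 − 0.01)/(16 − 11) and 0.09 = (0.5 − 0.05)/(16 − 11) are the slopes of the linear variation.

Assume a uniform prior of f_C(c) = 0.2, and that trials are independent and identically distributed. When a new fragment of type e is discovered, Bayes' theorem is applied to update the degree of belief for each c:

f_C(c | E = e) = P(E = e | C = c) f_C(c) / P(E = e) = P(E = e | C = c) f_C(c) / ∫ P(E = e | C = c) f_C(c) dc, with the integral taken from c = 11 to c = 16.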

A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or c = 15.2. By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. The Bernstein-von Mises theorem asserts here the asymptotic convergence to the "true" distribution because the probability space corresponding to the discrete set of events is finite (see above section on asymptotic behaviour of the posterior).
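A simulation of this kind can be sketched in a few lines of Python. This is not the code that produced the figure. It is a minimal reconstruction under the stated assumptions (linear variation, independence of glaze and decoration, a uniform prior on 11 ≤ c ≤ 16), with a hypothetical grid resolution and random seed, and it treats the decoration percentage as the probability that a fragment is decorated.

```python
import random


def p_glazed(c):
    """P(glazed | century c): linear from 0.01 at c = 11 to 0.81 at c = 16."""
    return 0.01 + (0.81 - 0.01) / (16 - 11) * (c - 11)


def p_decorated(c):
    """P(decorated | century c): linear from 0.5 at c = 11 to 0.05 at c = 16."""
    return 0.5 - (0.5 - 0.05) / (16 - 11) * (c - 11)


def likelihood(glazed, decorated, c):
    """P(fragment type | c), with glaze and decoration assumed independent."""
    pg = p_glazed(c) if glazed else 1 - p_glazed(c)
    pd = p_decorated(c) if decorated else 1 - p_decorated(c)
    return pg * pd


def simulate(true_c=15.2, n_fragments=50, n_grid=501, seed=0):
    """Sequentially update a gridded density over c as fragments are found."""
    rng = random.Random(seed)
    grid = [11 + 5 * i / (n_grid - 1) for i in range(n_grid)]
    step = grid[1] - grid[0]
    density = [0.2] * n_grid  # uniform prior f_C(c) = 0.2 on [11, 16]
    for _ in range(n_fragments):
        glazed = rng.random() < p_glazed(true_c)
        decorated = rng.random() < p_decorated(true_c)
        unnorm = [likelihood(glazed, decorated, c) * f
                  for c, f in zip(grid, density)]
        # Trapezoidal approximation of the normalizing integral P(E = e).
        evidence = step * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
        density = [u / evidence for u in unnorm]
    return grid, density


grid, density = simulate()
mode = grid[max(range(len(grid)), key=density.__getitem__)]
print(f"posterior mode after 50 fragments: c = {mode:.2f}")
```

With only 50 fragments the posterior is still fairly wide, so the mode varies from run to run. Increasing `n_fragments` concentrates the density around the value of `true_c` used to generate the data.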

In frequentist statistics and decision theory

A decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every unique Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures. [14]

Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals. [15] [16] [17] For example:

Model selection

Applications

Computer applications

Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s. There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms such as Gibbs sampling and other Metropolis–Hastings schemes. [22] Bayesian inference has also gained popularity among the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.

As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying e-mail spam. Applications which make use of Bayesian inference for spam filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes, Mozilla, XEAMS, and others. Spam classification is treated in more detail in the article on the naive Bayes classifier.

Solomonoff's inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and Occam's razor. [23] Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion. [24] [25]

Bioinformatic and healthcare applications

Bayesian inference has been applied in various bioinformatics applications, including differential gene expression analysis, [26] [27] single-cell classification, [28] and cancer subtyping, [29] among others. Bayesian inference is also used in a general cancer risk model, called CIRI (Continuous Individualized Risk Index), in which serial measurements are incorporated to update a Bayesian model that is primarily built from prior knowledge. [30] [31]

In the courtroom

Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'. [32] [33] [34] Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach, replacing multiplication with addition, might be easier for a jury to handle.

Adding up evidence.

If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population. [35] For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.
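The odds-form and logarithmic approaches described above can be sketched as follows. The likelihood ratios are hypothetical numbers chosen for illustration, the pieces of evidence are assumed to be conditionally independent, and base-10 logarithms are an arbitrary choice of unit.

```python
import math


def log_odds(p):
    """Base-10 logarithm of the odds corresponding to probability p."""
    return math.log10(p / (1 - p))


def probability(lo):
    """Inverse of log_odds: convert base-10 log-odds back to a probability."""
    odds = 10 ** lo
    return odds / (1 + odds)


prior = 1 / 1000  # uniform prior over 1,000 possible culprits

# Hypothetical likelihood ratios P(evidence | guilty) / P(evidence | innocent)
# for four pieces of evidence; a ratio below 1 favours the defendant.
likelihood_ratios = [50.0, 4.0, 0.5, 200.0]

# In log-odds form, each piece of evidence simply adds its own weight.
total = log_odds(prior) + sum(math.log10(lr) for lr in likelihood_ratios)
posterior = probability(total)
print(round(posterior, 4))
```

Each term in the sum can be read as the weight of one piece of evidence, which is the sense in which evidence is "added up".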

The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence expert witness explained Bayes' theorem to the jury in R v Adams. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."

Gardner-Medwin [36] argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:

A The known facts and testimony could have arisen if the defendant is guilty
B The known facts and testimony could have arisen if the defendant is innocent
C The defendant is guilty.

Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.

Bayesian epistemology

Bayesian epistemology is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.

Karl Popper and David Miller have rejected the idea of Bayesian rationalism, i.e. using Bayes' rule to make epistemological inferences: [37] it is prone to the same vicious circle as any other justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.

Other

Bayes and Bayesian inference

The problem considered by Bayes in Proposition 9 of his essay, "An Essay towards solving a Problem in the Doctrine of Chances", is the posterior distribution for the parameter a (the success rate) of the binomial distribution.

History

The term Bayesian refers to Thomas Bayes (1702–1761), who proved a special case of what is now called Bayes' theorem. However, it was Pierre-Simon Laplace (1749–1827) who introduced a general version of the theorem and used it to approach problems in celestial mechanics, medical statistics, reliability, and jurisprudence. [43] Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes [44] ). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics. [44]

In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objective or "non-informative" current, the statistical analysis depends on only the model assumed, the data analyzed, [45] and the method assigning the prior, which differs from one objective Bayesian to another objective Bayesian. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications. [46] Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics. [47] Nonetheless, Bayesian methods are widely accepted and used, for example in the field of machine learning. [48]

See also

Related Research Articles

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In statistics, a statistic is sufficient with respect to a statistical model and its associated unknown parameter if "no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter". In particular, a statistic is sufficient for a family of probability distributions if the sample from which it is calculated gives no additional information than does the statistic, as to which of those probability distributions is that of the population from which the sample was taken.

Gamma distribution probability distribution

In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are three different parametrizations in common use:

  1. With a shape parameter k and a scale parameter θ.
  2. With a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter.
  3. With a shape parameter k and a mean parameter μ = kθ = α/β.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman-Darmois family. The terms "distribution" and "family" are often used loosely: properly, an exponential family is a set of distributions, where the specific distribution varies with the parameter; however, a parametric family of distributions is often referred to as "a distribution", and the set of all exponential families is sometimes loosely referred to as "the" exponential family.

Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood, represents one approach for setting hyperparameters.

In Bayesian probability theory, if the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. For example, the Gaussian family is conjugate to itself with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian. The concept, as well as the term "conjugate prior", was introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory. A similar concept had been discovered independently by George Alfred Barnard.
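The Gaussian case can be illustrated with a short Python sketch. It assumes the noise variance of the likelihood is known and only the mean is uncertain; the function name `normal_posterior` and the example numbers are illustrative, not part of the article.

```python
def normal_posterior(prior_mean, prior_var, data, noise_var):
    """Posterior over the mean of a Gaussian likelihood with known variance.

    Assumes a Gaussian prior N(prior_mean, prior_var) on the mean and
    i.i.d. observations with known variance noise_var.
    """
    n = len(data)
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    # Precisions (inverse variances) add under conjugate updating.
    post_var = 1.0 / (prior_precision + data_precision)
    # The posterior mean is a precision-weighted average of the prior
    # mean and the observations.
    post_mean = post_var * (prior_precision * prior_mean + sum(data) / noise_var)
    return post_mean, post_var


post_mean, post_var = normal_posterior(0.0, 1.0, [1.0, 2.0, 3.0], 1.0)
print(post_mean, post_var)
```

The posterior is again Gaussian, so it can serve as the prior for the next batch of observations.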

Dirichlet distribution probability distribution

In probability and statistics, the Dirichlet distribution, often denoted Dir(α), is a family of continuous multivariate probability distributions parameterized by a vector α of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD). Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.
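Conjugacy with the categorical distribution means the update reduces to adding observed counts to the concentration parameters. The following sketch assumes categories are labelled 0 to K-1; the helper names `dirichlet_posterior` and `dirichlet_mean` are hypothetical.

```python
from collections import Counter


def dirichlet_posterior(alpha, observations):
    """Update Dirichlet concentration parameters with categorical draws.

    alpha: prior concentrations, one per category (all positive).
    observations: category labels in range(len(alpha)).
    """
    counts = Counter(observations)
    # Conjugacy: posterior concentration = prior concentration + count.
    return [a + counts.get(k, 0) for k, a in enumerate(alpha)]


def dirichlet_mean(alpha):
    """Mean of Dir(alpha): each concentration divided by their sum."""
    total = sum(alpha)
    return [a / total for a in alpha]


posterior = dirichlet_posterior([1.0, 1.0, 1.0], [0, 2, 2, 1, 2])
print(posterior)                  # [2.0, 2.0, 4.0]
print(dirichlet_mean(posterior))  # [0.25, 0.25, 0.5]
```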

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.
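The regularizing effect can be seen in a small example. The sketch below assumes Bernoulli trials with a Beta(a, b) prior on the success probability, a setup chosen here for illustration because the posterior mode has a closed form; the function names are hypothetical.

```python
def bernoulli_mle(successes, trials):
    """Maximum likelihood estimate of a success probability."""
    return successes / trials


def bernoulli_map(successes, trials, a, b):
    """MAP estimate under an assumed Beta(a, b) prior.

    The posterior is Beta(a + successes, b + failures); its mode is
    (post_a - 1) / (post_a + post_b - 2) when both parameters exceed 1.
    """
    post_a = a + successes
    post_b = b + (trials - successes)
    if post_a <= 1 or post_b <= 1:
        raise ValueError("posterior mode is not interior for these parameters")
    return (post_a - 1) / (post_a + post_b - 2)


# Three successes in three trials: the ML estimate sits at the boundary,
# while the prior pulls the MAP estimate toward 0.5.
print(bernoulli_mle(3, 3))            # 1.0
print(bernoulli_map(3, 3, 2.0, 2.0))  # 0.8
```

With a uniform Beta(1, 1) prior the two estimates coincide whenever the posterior mode is interior, which is one way to read MAP as ML plus a prior-dependent penalty.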

Lindley's paradox is a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give different results for certain choices of the prior distribution. The problem of the disagreement between the two approaches was discussed in Harold Jeffreys' 1939 textbook; it became known as Lindley's paradox after Dennis Lindley called the disagreement a paradox in a 1957 paper.

In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function. Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.
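As a rough numerical illustration, the sketch below searches a grid of candidate actions for the one with the smallest posterior expected loss. The discrete posterior, the grid, and the name `bayes_action` are all made up for this example. Under squared-error loss the minimizer is the posterior mean, and under absolute-error loss it is the posterior median.

```python
def bayes_action(values, probs, loss, actions):
    """Return the candidate action with minimal posterior expected loss.

    values, probs: support and probabilities of a discrete posterior.
    loss: function loss(action, value).
    actions: iterable of candidate actions to search over.
    """
    def expected_loss(action):
        return sum(p * loss(action, v) for v, p in zip(values, probs))

    return min(actions, key=expected_loss)


# Hypothetical discrete posterior over a parameter.
values = [0.0, 1.0, 2.0, 10.0]
probs = [0.1, 0.3, 0.5, 0.1]
actions = [i / 100 for i in range(0, 1001)]  # grid on [0, 10]

squared = bayes_action(values, probs, lambda a, v: (a - v) ** 2, actions)
absolute = bayes_action(values, probs, lambda a, v: abs(a - v), actions)
print(squared)   # posterior mean, 2.3
print(absolute)  # posterior median, 2.0
```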

Dirichlet process

In probability theory, Dirichlet processes are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables—how likely it is that the random variables are distributed according to one or another particular distribution.

In probability theory and statistics, a categorical distribution is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution. The K-dimensional categorical distribution is the most general distribution over a K-way event; any other discrete distribution over a size-K sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.

In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth categorical data. Given an observation x = (x_1, ..., x_d) from a multinomial distribution with N trials, a "smoothed" version of the data gives the estimator θ̂_i = (x_i + α)/(N + αd) for i = 1, ..., d, where α > 0 is the smoothing parameter.
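A minimal Python sketch of additive smoothing follows. It assumes the usual form of the estimator, in which a pseudocount α is added to each of the d category counts and the total N is increased by αd; the function name and the example counts are illustrative.

```python
def additive_smoothing(counts, alpha=1.0):
    """Smoothed category probabilities from raw counts.

    counts: observed count for each of the d categories.
    alpha: pseudocount added to every category (alpha=1 is Laplace smoothing).
    """
    n = sum(counts)
    d = len(counts)
    return [(x + alpha) / (n + alpha * d) for x in counts]


# A category with zero observations still receives nonzero probability.
print(additive_smoothing([3, 0, 1], alpha=1.0))
```

The result equals the posterior mean under a symmetric Dirichlet prior with all concentration parameters set to α, which is why the technique fits naturally in a Bayesian setting.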

Bayesian econometrics is a branch of econometrics which applies Bayesian principles to economic modelling. Bayesianism is based on a degree-of-belief interpretation of probability, as opposed to a relative-frequency interpretation.

In statistical inference, the concept of a confidence distribution (CD) has often been loosely referred to as a distribution function on the parameter space that can represent confidence intervals of all levels for a parameter of interest. Historically, it has typically been constructed by inverting the upper limits of lower-sided confidence intervals of all levels, and it was also commonly associated with a fiducial interpretation, although it is a purely frequentist concept. A confidence distribution is not a probability distribution function of the parameter of interest, but may still be a function useful for making inferences.

In Bayesian statistics, the posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values.
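For a concrete case, the sketch below computes the posterior predictive distribution of the number of successes in m future Bernoulli trials, assuming a Beta(a, b) prior on the success probability. Under that assumption the predictive is a beta-binomial distribution; the function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(y, m, a, b):
    """P(y successes in m trials) with the success probability ~ Beta(a, b)."""
    return math.comb(m, y) * math.exp(log_beta(a + y, b + m - y) - log_beta(a, b))


def posterior_predictive(successes, trials, m, a=1.0, b=1.0):
    """Predictive pmf over 0..m future successes given the observed data."""
    # Conjugate update: the posterior is Beta(a + successes, b + failures).
    post_a = a + successes
    post_b = b + (trials - successes)
    # Integrating the binomial likelihood over the posterior gives a
    # beta-binomial distribution.
    return [beta_binomial_pmf(y, m, post_a, post_b) for y in range(m + 1)]


# After 7 successes in 10 trials with a uniform prior, the probability that
# the next single trial succeeds is (1 + 7) / (2 + 10).
print(posterior_predictive(7, 10, m=1)[1])
```

Unlike plugging a point estimate into the likelihood, this predictive averages over the remaining uncertainty in the parameter.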

References

Citations

  1. Hacking, Ian (December 1967). "Slightly More Realistic Personal Probability". Philosophy of Science. 34 (4): 316. doi:10.1086/288169.
  2. Hacking (1988, p. 124)[ full citation needed ]
  3. "Bayes' Theorem (Stanford Encyclopedia of Philosophy)". Plato.stanford.edu. Retrieved 2014-01-05.
  4. van Fraassen, B. (1989) Laws and Symmetry, Oxford University Press. ISBN   0-19-824860-1
  5. Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.;Vehtari, Aki; Rubin, Donald B. (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC. ISBN   978-1-4398-4095-5.
  6. Freedman, DA (1963). "On the asymptotic behavior of Bayes' estimates in the discrete case". The Annals of Mathematical Statistics. 34 (4): 1386–1403. doi:10.1214/aoms/1177703871. JSTOR   2238346.
  7. Freedman, DA (1965). "On the asymptotic behavior of Bayes estimates in the discrete case II". The Annals of Mathematical Statistics. 36 (2): 454–456. doi:10.1214/aoms/1177700155. JSTOR   2238150.
  8. Robins, James; Wasserman, Larry (2000). "Conditioning, likelihood, and coherence: A review of some foundational concepts". JASA. 95 (452): 1340–1346. doi:10.1080/01621459.2000.10474344.
  9. Sen, Pranab K.; Keating, J. P.; Mason, R. L. (1993). Pitman's measure of closeness: A comparison of statistical estimators. Philadelphia: SIAM.
  10. Choudhuri, Nidhan; Ghosal, Subhashis; Roy, Anindya (2005-01-01). Bayesian Methods for Function Estimation. Handbook of Statistics. Bayesian Thinking. 25. pp. 373–414. CiteSeerX   10.1.1.324.3052 . doi:10.1016/s0169-7161(05)25013-7. ISBN   9780444515391.
  11. "Maximum A Posteriori (MAP) Estimation". www.probabilitycourse.com. Retrieved 2017-06-02.
  12. Yu, Angela. "Introduction to Bayesian Decision Theory" (PDF). cogsci.ucsd.edu/. Archived from the original (PDF) on 2013-02-28.
  13. Hitchcock, David. "Posterior Predictive Distribution Stat Slide" (PDF). stat.sc.edu.
  14. 1 2 Bickel & Doksum (2001, p. 32)
  15. Kiefer, J.; Schwartz R. (1965). "Admissible Bayes Character of T2-, R2-, and Other Fully Invariant Tests for Multivariate Normal Problems". Annals of Mathematical Statistics. 36 (3): 747–770. doi:10.1214/aoms/1177700051.
  16. Schwartz, R. (1969). "Invariant Proper Bayes Tests for Exponential Families". Annals of Mathematical Statistics. 40: 270–283. doi:10.1214/aoms/1177697822.
  17. Hwang, J. T. & Casella, George (1982). "Minimax Confidence Sets for the Mean of a Multivariate Normal Distribution" (PDF). Annals of Statistics. 10 (3): 868–881. doi:10.1214/aos/1176345877.
  18. Lehmann, Erich (1986). Testing Statistical Hypotheses (Second ed.). (see p. 309 of Chapter 6.7 "Admissibilty", and pp. 17–18 of Chapter 1.8 "Complete Classes"
  19. Le Cam, Lucien (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag. ISBN   978-0-387-96307-5. (From "Chapter 12 Posterior Distributions and Bayes Solutions", p. 324)
  20. Cox, D. R.; Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall. p. 432. ISBN   978-0-04-121537-3.
  21. Cox, D. R.; Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall. p. 433. ISBN   978-0-04-121537-3.)
  22. Jim Albert (2009). Bayesian Computation with R, Second edition. New York, Dordrecht, etc.: Springer. ISBN   978-0-387-92297-3.
  23. Rathmanner, Samuel; Hutter, Marcus; Ormerod, Thomas C (2011). "A Philosophical Treatise of Universal Induction". Entropy. 13 (6): 1076–1136. arXiv: 1105.5721 . doi:10.3390/e13061076.
  24. Hutter, Marcus; He, Yang-Hui; Ormerod, Thomas C (2007). "On Universal Prediction and Bayesian Confirmation". Theoretical Computer Science. 384 (2007): 33–48. arXiv: 0709.1516 . Bibcode:2007arXiv0709.1516H. doi:10.1016/j.tcs.2007.05.016.
  25. Gács, Peter; Vitányi, Paul M. B. (2 December 2010). "Raymond J. Solomonoff 1926-2009". CiteSeerX.Cite journal requires |journal= (help)
  26. Robinson, Mark D & McCarthy, Davis J & Smyth, Gordon K edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics.
  27. Hajiramezanali, E. & Dadaneh, S. Z. & Figueiredo, P. d. & Sze, S. & Zhou, Z. & Qian, X. Differential Expression Analysis of Dynamical Sequencing Count Data with a Gamma Markov Chain. https://arxiv.org/pdf/1803.02527.pdf
  28. Hajiramezanali, E.; Imani, M.; Braga-Neto, U.; Qian, X.; Dougherty, E. R. "Scalable optimal Bayesian classification of single-cell trajectories under regulatory model uncertainty". Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. arXiv: 1902.03188 . doi:10.1145/3233547.3233689.
  29. 1 2 Hajiramezanali, E.; Dadaneh, S. Z.; Karbalayghareh, A.; Zhou, Z.; Qian, X. Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. 32nd Conference on Neural Information Processing Systems (NIPS 2018). Montréal, Canada. arXiv: 1810.09433 .
  30. "CIRI". ciri.stanford.edu. Retrieved 2019-08-11.
  31. Kurtz, David M.; Esfahani, Mohammad S.; Scherer, Florian; Soo, Joanne; Jin, Michael C.; Liu, Chih Long; Newman, Aaron M.; Dührsen, Ulrich; Hüttmann, Andreas (2019-07-25). "Dynamic Risk Profiling Using Serial Tumor Biomarkers for Personalized Outcome Prediction". Cell. 178 (3): 699–713.e19. doi:10.1016/j.cell.2019.06.011. ISSN   1097-4172. PMID   31280963.
  32. Dawid, A. P. and Mortera, J. (1996) "Coherent Analysis of Forensic Identification Evidence". Journal of the Royal Statistical Society , Series B, 58, 425–443.
  33. Foreman, L. A.; Smith, A. F. M., and Evett, I. W. (1997). "Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion)". Journal of the Royal Statistical Society, Series A, 160, 429–469.
  34. Robertson, B. and Vignaux, G. A. (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. John Wiley and Sons. Chichester. ISBN   978-0-471-96026-3
  35. Dawid, A. P. (2001) Bayes' Theorem and Weighing Evidence by Juries Archived 2015-07-01 at the Wayback Machine
  36. Gardner-Medwin, A. (2005) "What Probability Should the Jury Address?". Significance , 2 (1), March 2005
  37. Miller, David (1994). Critical Rationalism. Chicago: Open Court. ISBN   978-0-8126-9197-9.
  38. Howson & Urbach (2005), Jaynes (2003)
  39. Cai, X.Q.; Wu, X.Y.; Zhou, X. (2009). "Stochastic scheduling subject to breakdown-repeat breakdowns with incomplete information". Operations Research. 57 (5): 1236–1249. doi:10.1287/opre.1080.0660.
  40. Ogle, Kiona; Tucker, Colin; Cable, Jessica M. (2014-01-01). "Beyond simple linear mixing models: process-based isotope partitioning of ecological processes". Ecological Applications. 24 (1): 181–195. doi:10.1890/1051-0761-24.1.181. ISSN   1939-5582.
  41. Evaristo, Jaivime; McDonnell, Jeffrey J.; Scholl, Martha A.; Bruijnzeel, L. Adrian; Chun, Kwok P. (2016-01-01). "Insights into plant water uptake from xylem-water isotope measurements in two tropical catchments with contrasting moisture conditions". Hydrological Processes. 30 (18): 3210–3227. Bibcode:2016HyPr...30.3210E. doi:10.1002/hyp.10841. ISSN   1099-1085.
  42. Gupta, Ankur; Rawlings, James B. (April 2014). "Comparison of Parameter Estimation Methods in Stochastic Chemical Kinetic Models: Examples in Systems Biology". AIChE Journal. 60 (4): 1253–1268. doi:10.1002/aic.14409. ISSN   0001-1541. PMC   4946376 . PMID   27429455.
  43. Stigler, Stephen M. (1986). "Chapter 3". The History of Statistics. Harvard University Press.
  44. 1 2 Fienberg, Stephen E. (2006). "When did Bayesian Inference Become 'Bayesian'?" (PDF). Bayesian Analysis. 1 (1): 1–40 [p. 5]. Bibcode:2007BayAn...2..665S. doi:10.1214/06-ba101. Archived from the original (PDF) on 2014-09-10.
  45. Bernardo, José-Miguel (2005). "Reference analysis". Handbook of statistics. 25. pp. 17–90.
  46. Wolpert, R. L. (2004). "A Conversation with James O. Berger". Statistical Science. 19 (1): 205–218. CiteSeerX   10.1.1.71.6112 . doi:10.1214/088342304000000053. MR   2082155.
  47. Bernardo, José M. (2006). "A Bayesian mathematical statistics primer" (PDF). Icots-7.
  48. Bishop, C. M. (2007). Pattern Recognition and Machine Learning. New York: Springer. ISBN   978-0387310732.

Sources

Further reading

Elementary

The following books are listed in ascending order of probabilistic sophistication:

Intermediate or advanced