Posterior probability

Last updated

The posterior probability is a type of conditional probability that results from updating the prior probability with information summarized by the likelihood via an application of Bayes' rule. [1] From an epistemological perspective, the posterior probability contains everything there is to know about an uncertain proposition (such as a scientific hypothesis, or parameter values), given prior knowledge and a mathematical model describing the observations available at a particular time. [2] After the arrival of new information, the current posterior probability may serve as the prior in another round of Bayesian updating. [3]

Contents

In the context of Bayesian statistics, the posterior probability distribution usually describes the epistemic uncertainty about statistical parameters conditional on a collection of observed data. From a given posterior distribution, various point and interval estimates can be derived, such as the maximum a posteriori (MAP) or the highest posterior density interval (HPDI). [4] But while conceptually simple, the posterior distribution is generally not tractable and therefore needs to be either analytically or numerically approximated. [5]

Definition in the distributional case

In variational Bayesian methods, the posterior probability is the probability of the parameters given the evidence , and is denoted .

It contrasts with the likelihood function, which is the probability of the evidence given the parameters: .

The two are related as follows:

Given a prior belief that a probability distribution function is and that the observations have a likelihood , then the posterior probability is defined as

, [6]

where is the normalizing constant and is calculated as

for continuous , or by summing over all possible values of for discrete . [7]

The posterior probability is therefore proportional to the product Likelihood · Prior probability. [8]

Example

Suppose there is a school with 60% boys and 40% girls as students. The girls wear trousers or skirts in equal numbers; all boys wear trousers. An observer sees a (random) student from a distance; all the observer can see is that this student is wearing trousers. What is the probability this student is a girl? The correct answer can be computed using Bayes' theorem.

The event is that the student observed is a girl, and the event is that the student observed is wearing trousers. To compute the posterior probability , we first need to know:

Given all this information, the posterior probability of the observer having spotted a girl given that the observed student is wearing trousers can be computed by substituting these values in the formula:

An intuitive way to solve this is to assume the school has N students. Number of boys = 0.6N and number of girls = 0.4N. If N is sufficiently large, total number of trouser wearers = 0.6N+ 50% of 0.4N. And number of girl trouser wearers = 50% of 0.4N. Therefore, in the population of trousers, girls are (50% of 0.4N)/(0.6N+ 50% of 0.4N) = 25%. In other words, if you separated out the group of trouser wearers, a quarter of that group will be girls. Therefore, if you see trousers, the most you can deduce is that you are looking at a single sample from a subset of students where 25% are girls. And by definition, chance of this random student being a girl is 25%. Every Bayes-theorem problem can be solved in this way. [9]

Calculation

The posterior probability distribution of one random variable given the value of another can be calculated with Bayes' theorem by multiplying the prior probability distribution by the likelihood function, and then dividing by the normalizing constant, as follows:

gives the posterior probability density function for a random variable given the data , where

Credible interval

Posterior probability is a conditional probability conditioned on randomly observed data. Hence it is a random variable. For a random variable, it is important to summarize its amount of uncertainty. One way to achieve this goal is to provide a credible interval of the posterior probability. [11]

Classification

In classification, posterior probabilities reflect the uncertainty of assessing an observation to particular class, see also class-membership probabilities. While statistical classification methods by definition generate posterior probabilities, Machine Learners usually supply membership values which do not induce any probabilistic confidence. It is desirable to transform or rescale membership values to class-membership probabilities, since they are comparable and additionally more easily applicable for post-processing. [12]

See also

Related Research Articles

The likelihood function is the joint probability of the observed data viewed as a function of the parameters of a statistical model.

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

A prior probability distribution of an uncertain quantity, often simply called the prior, is its assumed probability distribution before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

In Bayesian probability theory, if the posterior distribution is in the same probability distribution family as the prior probability distribution , the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function .

A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample from a prior and is therefore often referred to as model evidence or simply evidence.

In statistical decision theory, an admissible decision rule is a rule for making a decision such that there is no other rule that is always "better" than it, in the precise sense of "better" defined below. This concept is analogous to Pareto efficiency.

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.

In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function. Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.

Bayesian econometrics is a branch of econometrics which applies Bayesian principles to economic modelling. Bayesianism is based on a degree-of-belief interpretation of probability, as opposed to a relative-frequency interpretation.

In probability and statistics, a compound probability distribution is the probability distribution that results from assuming that a random variable is distributed according to some parametrized distribution, with the parameters of that distribution themselves being random variables. If the parameter is a scale parameter, the resulting mixture is also called a scale mixture.

In Bayesian inference, the Bernstein–von Mises theorem provides the basis for using Bayesian credible sets for confidence statements in parametric models. It states that under some conditions, a posterior distribution converges in the limit of infinite data to a multivariate normal distribution centered at the maximum likelihood estimator with covariance matrix given by , where is the true population parameter and is the Fisher information matrix at the true population parameter value:

Thompson sampling, named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.

In Bayesian statistics, the posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values.

Bayesian hierarchical modelling is a statistical model written in multiple levels that estimates the parameters of the posterior distribution using the Bayesian method. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as the updated probability estimate, as additional evidence on the prior distribution is acquired.

In statistics, suppose that we have been given some data, and we are selecting a statistical model for that data. The relative likelihood compares the relative plausibilities of different candidate models or of different values of a parameter of a single model.

References

  1. Lambert, Ben (2018). "The posterior – the goal of Bayesian inference". A Student's Guide to Bayesian Statistics. Sage. pp. 121–140. ISBN   978-1-4739-1636-4.
  2. Grossman, Jason (2005). Inferences from observations to simple statistical hypotheses (PhD thesis). University of Sydney. hdl:2123/9107.
  3. Etz, Alex (2015-07-25). "Understanding Bayes: Updating priors via the likelihood". The Etz-Files. Retrieved 2022-08-18.
  4. Gill, Jeff (2014). "Summarizing Posterior Distributions with Intervals". Bayesian Methods: A Social and Behavioral Sciences Approach (Third ed.). Chapman & Hall. pp. 42–48. ISBN   978-1-4398-6248-3.
  5. Press, S. James (1989). "Approximations, Numerical Methods, and Computer Programs". Bayesian Statistics : Principles, Models, and Applications. New York: John Wiley & Sons. pp. 69–102. ISBN   0-471-63729-7.
  6. Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer. pp. 21–24. ISBN   978-0-387-31073-2.
  7. Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari and Donald B. Rubin (2014). Bayesian Data Analysis. CRC Press. p. 7. ISBN   978-1-4398-4095-5.{{cite book}}: CS1 maint: multiple names: authors list (link)
  8. Ross, Kevin. Chapter 8 Introduction to Continuous Prior and Posterior Distributions | An Introduction to Bayesian Reasoning and Methods.
  9. "Bayes' theorem - C o r T e x T". sites.google.com. Retrieved 2022-08-18.
  10. "Posterior probability - formulasearchengine". formulasearchengine.com. Retrieved 2022-08-19.
  11. Clyde, Merlise; Çetinkaya-Rundel, Mine; Rundel, Colin; Banks, David; Chai, Christine; Huang, Lizzy. Chapter 1 The Basics of Bayesian Statistics | An Introduction to Bayesian Thinking.
  12. Boedeker, Peter; Kearns, Nathan T. (2019-07-09). "Linear Discriminant Analysis for Prediction of Group Membership: A User-Friendly Primer". Advances in Methods and Practices in Psychological Science. 2 (3): 250–263. doi:10.1177/2515245919849378. ISSN   2515-2459. S2CID   199007973.

Further reading