# Posterior probability

In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account. "Posterior", in this context, means after taking into account the relevant evidence related to the particular case being examined.

The posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey.

## Definition

The posterior probability is the probability of the parameters ${\displaystyle \theta }$ given the evidence ${\displaystyle X}$: ${\displaystyle p(\theta |X)}$.

It contrasts with the likelihood function, which is the probability of the evidence given the parameters: ${\displaystyle p(X|\theta )}$.

The two are related as follows:

Given a prior belief that the parameters follow a probability distribution ${\displaystyle p(\theta )}$ and that the observations ${\displaystyle x}$ have likelihood ${\displaystyle p(x|\theta )}$, the posterior probability is defined as

${\displaystyle p(\theta |x)={\frac {p(x|\theta )}{p(x)}}p(\theta )}$ [1]

where ${\displaystyle p(x)}$ is the normalizing constant and is calculated as

${\displaystyle p(x)=\int p(x|\theta )p(\theta )d\theta }$

for continuous ${\displaystyle \theta }$, or by summing ${\displaystyle p(x|\theta )p(\theta )}$ over all possible values of ${\displaystyle \theta }$ for discrete ${\displaystyle \theta }$. [2]

The posterior probability is therefore proportional to the product likelihood · prior probability.
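This proportionality can be sketched numerically on a discrete grid of parameter values. The following is a minimal illustration, under an assumed binomial coin-flip model (the grid, prior, and data here are all illustrative, not from the text above): the unnormalized product likelihood · prior is divided by its sum, which plays the role of ${\displaystyle p(x)}$.

```python
import numpy as np

# Illustrative setup: infer a coin's heads-probability theta from
# 7 heads observed in 10 flips, over a discrete grid of candidates.
theta = np.linspace(0.01, 0.99, 99)       # candidate parameter values
prior = np.ones_like(theta) / theta.size  # uniform prior p(theta)

heads, flips = 7, 10
# Binomial likelihood p(x | theta), up to a constant factor
likelihood = theta**heads * (1.0 - theta)**(flips - heads)

unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()  # divide by p(x)

mode = theta[np.argmax(posterior)]  # posterior mode, close to 0.7
```

Note that the constant factor dropped from the likelihood cancels in the normalization, which is why posterior ∝ likelihood · prior suffices in practice.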

## Example

Suppose there is a school having 60% boys and 40% girls as students. The girls wear trousers or skirts in equal numbers; all boys wear trousers. An observer sees a (random) student from a distance; all the observer can see is that this student is wearing trousers. What is the probability this student is a girl? The correct answer can be computed using Bayes' theorem.

The event ${\displaystyle G}$ is that the student observed is a girl, and the event ${\displaystyle T}$ is that the student observed is wearing trousers. To compute the posterior probability ${\displaystyle P(G|T)}$, we first need to know:

• ${\displaystyle P(G)}$, or the probability that the student is a girl regardless of any other information. Since the observer sees a random student, meaning that all students have the same probability of being observed, and the percentage of girls among the students is 40%, this probability equals 0.4.
• ${\displaystyle P(B)}$, or the probability that the student is not a girl (i.e. a boy) regardless of any other information (${\displaystyle B}$ is the complementary event to ${\displaystyle G}$). This is 60%, or 0.6.
• ${\displaystyle P(T|G)}$, or the probability of the student wearing trousers given that the student is a girl. As they are as likely to wear skirts as trousers, this is 0.5.
• ${\displaystyle P(T|B)}$, or the probability of the student wearing trousers given that the student is a boy. This is given as 1.
• ${\displaystyle P(T)}$, or the probability of a (randomly selected) student wearing trousers regardless of any other information. Since ${\displaystyle P(T)=P(T|G)P(G)+P(T|B)P(B)}$ (via the law of total probability), this is ${\displaystyle P(T)=0.5\times 0.4+1\times 0.6=0.8}$.

Given all this information, the posterior probability of the observer having spotted a girl given that the observed student is wearing trousers can be computed by substituting these values in the formula:

${\displaystyle P(G|T)={\frac {P(T|G)P(G)}{P(T)}}={\frac {0.5\times 0.4}{0.8}}=0.25.}$

An intuitive way to solve this is to assume the school has ${\displaystyle N}$ students. The number of boys is ${\displaystyle 0.6N}$ and the number of girls is ${\displaystyle 0.4N}$. If ${\displaystyle N}$ is sufficiently large, the total number of trouser wearers is ${\displaystyle 0.6N+0.5\times 0.4N=0.8N}$, of whom ${\displaystyle 0.5\times 0.4N=0.2N}$ are girls. Therefore, among the trouser wearers, the proportion of girls is ${\displaystyle 0.2N/0.8N=25\%}$. In other words, after separating out the group of trouser wearers, a quarter of that group are girls; seeing trousers tells the observer only that the student is a single sample from a subset of students of which 25% are girls, so by definition the chance of this random student being a girl is 25%. Many simple Bayes' theorem problems can be solved by this kind of frequency-counting argument.
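The school example can be checked directly in code. This sketch simply restates the probabilities given above and applies the law of total probability and Bayes' theorem:

```python
# Probabilities given in the school example
p_girl, p_boy = 0.4, 0.6
p_trousers_given_girl = 0.5   # girls wear trousers or skirts equally
p_trousers_given_boy = 1.0    # all boys wear trousers

# Law of total probability: P(T) = P(T|G)P(G) + P(T|B)P(B)
p_trousers = (p_trousers_given_girl * p_girl
              + p_trousers_given_boy * p_boy)   # = 0.8

# Bayes' theorem: P(G|T) = P(T|G)P(G) / P(T)
p_girl_given_trousers = p_trousers_given_girl * p_girl / p_trousers  # = 0.25
```

The same 0.25 falls out of the counting argument: of ${\displaystyle 0.8N}$ trouser wearers, ${\displaystyle 0.2N}$ are girls.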

## Calculation

The posterior probability distribution of one random variable given the value of another can be calculated with Bayes' theorem by multiplying the prior probability distribution by the likelihood function, and then dividing by the normalizing constant, as follows:

${\displaystyle f_{X\mid Y=y}(x)={f_{X}(x){\mathcal {L}}_{X\mid Y=y}(x) \over {\int _{-\infty }^{\infty }f_{X}(u){\mathcal {L}}_{X\mid Y=y}(u)\,du}}}$

gives the posterior probability density function for a random variable ${\displaystyle X}$ given the data ${\displaystyle Y=y}$, where

• ${\displaystyle f_{X}(x)}$ is the prior density of ${\displaystyle X}$,
• ${\displaystyle {\mathcal {L}}_{X\mid Y=y}(x)=f_{Y\mid X=x}(y)}$ is the likelihood function as a function of ${\displaystyle x}$,
• ${\displaystyle \int _{-\infty }^{\infty }f_{X}(u){\mathcal {L}}_{X\mid Y=y}(u)\,du}$ is the normalizing constant, and
• ${\displaystyle f_{X\mid Y=y}(x)}$ is the posterior density of ${\displaystyle X}$ given the data ${\displaystyle Y=y}$.
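The continuous formula can be evaluated numerically on a grid. The following sketch uses illustrative assumptions not taken from the text: a standard normal prior ${\displaystyle X\sim N(0,1)}$, an observation model ${\displaystyle Y\mid X=x\sim N(x,1)}$, and observed data ${\displaystyle y=1.5}$; the integral in the denominator is approximated by a Riemann sum.

```python
import numpy as np

# Grid over the support of X (truncated to [-6, 6] for the sketch)
x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

y = 1.5
prior_density = normal_pdf(x, 0.0, 1.0)  # f_X(x)
likelihood = normal_pdf(y, x, 1.0)       # L_{X|Y=y}(x) = f_{Y|X=x}(y)

# Normalizing constant: integral of f_X(u) L_{X|Y=y}(u) du (Riemann sum)
evidence = np.sum(prior_density * likelihood) * dx
posterior_density = prior_density * likelihood / evidence

posterior_mean = np.sum(x * posterior_density) * dx
```

In this conjugate normal-normal case the posterior is known in closed form to be ${\displaystyle N(y/2,1/2)}$, so the numerically computed posterior mean should come out close to 0.75, which provides a check on the grid approximation.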

## Credible interval

The posterior probability is a conditional probability conditioned on randomly observed data, and hence is itself a random variable. It is therefore important to summarize its amount of uncertainty; one way to achieve this is to provide a credible interval of the posterior distribution.
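An equal-tailed credible interval can be read off the cumulative distribution of a grid posterior. The sketch below reuses an assumed coin-flip setting (uniform prior, 7 heads in 10 flips, so the posterior is the Beta(8, 4) distribution); the 95% interval runs from the 2.5th to the 97.5th percentile.

```python
import numpy as np

# Grid posterior for the coin's heads-probability theta under a
# uniform prior and 7 heads in 10 flips (posterior is Beta(8, 4))
theta = np.linspace(0.001, 0.999, 999)
posterior = theta**7 * (1.0 - theta)**3  # unnormalized posterior
posterior /= posterior.sum()

# Equal-tailed 95% credible interval from the cumulative distribution
cdf = np.cumsum(posterior)
lower = theta[np.searchsorted(cdf, 0.025)]  # 2.5th percentile
upper = theta[np.searchsorted(cdf, 0.975)]  # 97.5th percentile
```

For this posterior the interval comes out roughly (0.39, 0.89): values of theta inside it remain credible after seeing the data, which is the uncertainty summary the interval provides.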

## Classification

In classification, posterior probabilities reflect the uncertainty of assigning an observation to a particular class; see also class-membership probabilities. While statistical classification methods by definition generate posterior probabilities, machine learning models often supply only membership values that carry no probabilistic confidence. It is desirable to transform or rescale such membership values into class-membership probabilities, since these are comparable and additionally more easily applicable for post-processing.

## References

1. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. pp. 21–24. ISBN 978-0-387-31073-2.
2. Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B. (2014). Bayesian Data Analysis. CRC Press. p. 7. ISBN 978-1-4398-4095-5.
• Lancaster, Tony (2004). An Introduction to Modern Bayesian Econometrics. Oxford: Blackwell. ISBN 1-4051-1720-6.
• Lee, Peter M. (2004). Bayesian Statistics: An Introduction (3rd ed.). Wiley. ISBN 0-340-81405-5.