Law of total probability

In probability theory, the law (or formula) of total probability is a fundamental rule relating marginal probabilities to conditional probabilities. It expresses the total probability of an outcome which can be realized via several distinct events, hence the name.

Statement

The law of total probability is [1] a theorem that states, in its discrete case, that if $\{B_n : n = 1, 2, 3, \ldots\}$ is a finite or countably infinite partition of a sample space (in other words, a set of pairwise disjoint events whose union is the entire sample space) and each event $B_n$ is measurable, then for any event $A$ of the same sample space:

$$P(A) = \sum_n P(A \cap B_n)$$

or, alternatively, [1]

$$P(A) = \sum_n P(A \mid B_n)\, P(B_n),$$

where, for any $n$ with $P(B_n) = 0$, the corresponding terms are simply omitted from the summation: in that case $P(A \cap B_n) = 0$, so the term contributes nothing, while $P(A \mid B_n)$ is undefined.

The summation can be interpreted as a weighted average, and consequently the marginal probability, $P(A)$, is sometimes called "average probability"; [2] "overall probability" is sometimes used in less formal writings. [3]
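As a quick illustration of the discrete form, the following sketch checks the identity $P(A) = \sum_n P(A \mid B_n)\, P(B_n)$ numerically for a small hypothetical partition; all numbers below are made up for the example.

    # Hypothetical three-event partition B_1, B_2, B_3 and an event A.
    # All probabilities are illustrative only.
    p_B = [0.2, 0.5, 0.3]          # P(B_n), summing to 1
    p_A_given_B = [0.1, 0.4, 0.8]  # P(A | B_n)

    # Law of total probability: P(A) = sum_n P(A | B_n) * P(B_n)
    p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
    print(p_A)  # 0.1*0.2 + 0.4*0.5 + 0.8*0.3 ≈ 0.46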

The law of total probability can also be stated for conditional probabilities:

$$P(A \mid C) = \sum_n P(A \mid C \cap B_n)\, P(B_n \mid C).$$

Taking the $B_n$ as above, and assuming $C$ is an event independent of each of the $B_n$:

$$P(A \mid C) = \sum_n P(A \mid C \cap B_n)\, P(B_n).$$
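The conditional form follows from the same partition argument applied under conditioning on $C$ (assuming $P(C) > 0$, and omitting terms with $P(B_n \cap C) = 0$ as above):

$$P(A \mid C) = \frac{P(A \cap C)}{P(C)} = \sum_n \frac{P(A \cap B_n \cap C)}{P(C)} = \sum_n P(A \mid C \cap B_n)\,\frac{P(B_n \cap C)}{P(C)} = \sum_n P(A \mid C \cap B_n)\, P(B_n \mid C),$$

and when $C$ is independent of each $B_n$, $P(B_n \mid C) = P(B_n)$, which gives the second formula.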

Continuous case

The law of total probability extends to the case of conditioning on events generated by continuous random variables. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Suppose $X$ is a random variable with distribution function $F_X$, and $A$ an event on $(\Omega, \mathcal{F}, P)$. Then the law of total probability states

$$P(A) = \int_{-\infty}^{\infty} P(A \mid X = x)\, dF_X(x).$$

If $X$ admits a density function $f_X$, then the result is

$$P(A) = \int_{-\infty}^{\infty} P(A \mid X = x)\, f_X(x)\, dx.$$

Moreover, for the specific case where $A = \{Y \in B\}$ for some random variable $Y$, where $B$ is a Borel set, this yields

$$P(Y \in B) = \int_{-\infty}^{\infty} P(Y \in B \mid X = x)\, f_X(x)\, dx.$$
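As a concrete numerical check of the density form, suppose (purely for illustration) that $X \sim N(0,1)$ and $A = \{X + Z \le 0\}$ with $Z \sim N(0,1)$ independent of $X$, so that $P(A \mid X = x) = \Phi(-x)$; by symmetry $P(A) = 1/2$, and integrating $P(A \mid X = x)\, f_X(x)$ with SciPy's quadrature recovers this value.

    # Illustrative check of P(A) = ∫ P(A | X = x) f_X(x) dx with
    # X ~ N(0, 1), A = {X + Z <= 0}, Z ~ N(0, 1) independent of X,
    # so that P(A | X = x) = Phi(-x); by symmetry P(A) = 0.5.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    p_A, abs_err = quad(lambda x: norm.cdf(-x) * norm.pdf(x), -np.inf, np.inf)
    print(p_A)  # ≈ 0.5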

Example

Suppose that two factories supply light bulbs to the market. Factory X's bulbs work for over 5000 hours in 99% of cases, whereas factory Y's bulbs work for over 5000 hours in 95% of cases. It is known that factory X supplies 60% of the total bulbs available and Y supplies 40% of the total bulbs available. What is the chance that a purchased bulb will work for longer than 5000 hours?

Applying the law of total probability, we have

$$P(A) = P(A \mid B_X)\, P(B_X) + P(A \mid B_Y)\, P(B_Y) = \frac{99}{100} \cdot \frac{6}{10} + \frac{95}{100} \cdot \frac{4}{10} = \frac{594 + 380}{1000} = \frac{974}{1000},$$

where $P(B_X) = \frac{6}{10}$ and $P(B_Y) = \frac{4}{10}$ are the probabilities that the purchased bulb was manufactured by factory X or factory Y, respectively, and $P(A \mid B_X) = \frac{99}{100}$ and $P(A \mid B_Y) = \frac{95}{100}$ are the probabilities that a bulb from the corresponding factory works for over 5000 hours.

Thus each purchased light bulb has a 97.4% chance to work for more than 5000 hours.
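The arithmetic can be confirmed with a one-line computation (restating the figures from the example):

    # Light-bulb example: P(A) = P(A | B_X) P(B_X) + P(A | B_Y) P(B_Y)
    p_work_given_X, p_work_given_Y = 0.99, 0.95  # P(A | B_X), P(A | B_Y)
    p_X, p_Y = 0.60, 0.40                        # P(B_X), P(B_Y)
    print(p_work_given_X * p_X + p_work_given_Y * p_Y)  # ≈ 0.974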

Other names

The term law of total probability is sometimes taken to mean the law of alternatives, which is a special case of the law of total probability applying to discrete random variables.[citation needed] One author uses the terminology of the "Rule of Average Conditional Probabilities", [4] while another refers to it as the "continuous law of alternatives" in the continuous case. [5] This result is given by Grimmett and Welsh [6] as the partition theorem, a name that they also give to the related law of total expectation.

Notes

  1. Zwillinger, D.; Kokoska, S. (2000). CRC Standard Probability and Statistics Tables and Formulae. CRC Press. p. 31. ISBN 1-58488-059-7.
  2. Paul E. Pfeiffer (1978). Concepts of Probability Theory. Courier Dover Publications. pp. 47–48. ISBN 978-0-486-63677-1.
  3. Deborah Rumsey (2006). Probability for Dummies. For Dummies. p. 58. ISBN 978-0-471-75141-0.
  4. Jim Pitman (1993). Probability. Springer. p. 41. ISBN 0-387-97974-3.
  5. Kenneth Baclawski (2008). Introduction to Probability with R. CRC Press. p. 179. ISBN 978-1-4200-6521-3.
  6. Geoffrey Grimmett and Dominic Welsh (1986). Probability: An Introduction. Oxford Science Publications. Theorem 1B.
