Chain rule (probability)

Last updated October 18, 2024

In probability theory, the chain rule^[1] (also called the general product rule^[2]^[3]) describes how to calculate the probability of the intersection of events that are not necessarily independent, or the joint distribution of random variables, using conditional probabilities. The rule expresses a joint probability purely in terms of conditional probabilities.^[4] It is used notably in the context of discrete stochastic processes and in applications such as the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.

Chain rule for events

Two events

For two events $A$ and $B$ , the chain rule states that

\mathbb {P} (A\cap B)=\mathbb {P} (B\mid A)\mathbb {P} (A),

where $\mathbb {P} (B\mid A)$ denotes the conditional probability of $B$ given $A$ .

Example

A first urn has 1 black ball and 2 white balls, and a second urn has 1 black ball and 3 white balls. Suppose we pick an urn at random and then select a ball from that urn. Let event $A$ be choosing the first urn, i.e. $\mathbb {P} (A)=\mathbb {P} ({\overline {A}})=1/2$ , where ${\overline {A}}$ is the complementary event of $A$ . Let event $B$ be choosing a white ball. The probability of choosing a white ball, given that we have chosen the first urn, is $\mathbb {P} (B\mid A)=2/3$ . The intersection $A\cap B$ then describes choosing the first urn and a white ball from it. Its probability can be calculated by the chain rule as follows:

\mathbb {P} (A\cap B)=\mathbb {P} (B\mid A)\mathbb {P} (A)={\frac {2}{3}}\cdot {\frac {1}{2}}={\frac {1}{3}}.
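
The same calculation can be reproduced with exact fractions. The following is a minimal sketch, not part of the original article; the dictionary layout and the function names are illustrative assumptions.

```python
from fractions import Fraction

# Urn contents as stated in the example (the dictionary layout is illustrative).
urns = {
    "first": {"black": 1, "white": 2},
    "second": {"black": 1, "white": 3},
}

p_urn = Fraction(1, 2)  # each urn is picked with probability 1/2


def p_colour_given_urn(colour, urn):
    """Conditional probability of drawing `colour` from the chosen urn."""
    counts = urns[urn]
    return Fraction(counts[colour], sum(counts.values()))


def p_urn_and_colour(urn, colour):
    """Chain rule: P(urn and colour) = P(colour | urn) * P(urn)."""
    return p_colour_given_urn(colour, urn) * p_urn


p_first_and_white = p_urn_and_colour("first", "white")
print(p_first_and_white)  # 1/3
```

Summing `p_urn_and_colour` over both urns and both colours gives 1, which is a quick sanity check that the four joint probabilities form a distribution.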

Finitely many events

For events $A_{1},\ldots ,A_{n}$ whose intersection has nonzero probability, the chain rule states

{\begin{aligned}\mathbb {P} \left(A_{1}\cap A_{2}\cap \ldots \cap A_{n}\right)&=\mathbb {P} \left(A_{n}\mid A_{1}\cap \ldots \cap A_{n-1}\right)\mathbb {P} \left(A_{1}\cap \ldots \cap A_{n-1}\right)\\&=\mathbb {P} \left(A_{n}\mid A_{1}\cap \ldots \cap A_{n-1}\right)\mathbb {P} \left(A_{n-1}\mid A_{1}\cap \ldots \cap A_{n-2}\right)\mathbb {P} \left(A_{1}\cap \ldots \cap A_{n-2}\right)\\&=\mathbb {P} \left(A_{n}\mid A_{1}\cap \ldots \cap A_{n-1}\right)\mathbb {P} \left(A_{n-1}\mid A_{1}\cap \ldots \cap A_{n-2}\right)\cdot \ldots \cdot \mathbb {P} (A_{3}\mid A_{1}\cap A_{2})\mathbb {P} (A_{2}\mid A_{1})\mathbb {P} (A_{1})\\&=\mathbb {P} (A_{1})\mathbb {P} (A_{2}\mid A_{1})\mathbb {P} (A_{3}\mid A_{1}\cap A_{2})\cdot \ldots \cdot \mathbb {P} (A_{n}\mid A_{1}\cap \dots \cap A_{n-1})\\&=\prod _{k=1}^{n}\mathbb {P} (A_{k}\mid A_{1}\cap \dots \cap A_{k-1})\\&=\prod _{k=1}^{n}\mathbb {P} \left(A_{k}\,{\Bigg |}\,\bigcap _{j=1}^{k-1}A_{j}\right).\end{aligned}}

Example 1

For $n=4$ , i.e. four events, the chain rule reads

{\begin{aligned}\mathbb {P} (A_{1}\cap A_{2}\cap A_{3}\cap A_{4})&=\mathbb {P} (A_{4}\mid A_{3}\cap A_{2}\cap A_{1})\mathbb {P} (A_{3}\cap A_{2}\cap A_{1})\\&=\mathbb {P} (A_{4}\mid A_{3}\cap A_{2}\cap A_{1})\mathbb {P} (A_{3}\mid A_{2}\cap A_{1})\mathbb {P} (A_{2}\cap A_{1})\\&=\mathbb {P} (A_{4}\mid A_{3}\cap A_{2}\cap A_{1})\mathbb {P} (A_{3}\mid A_{2}\cap A_{1})\mathbb {P} (A_{2}\mid A_{1})\mathbb {P} (A_{1}).\end{aligned}}
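
The four-event identity can be checked numerically on a small finite sample space. The sketch below is only an illustration: the choice of two fair dice, the four events, and the helper names `prob`, `cond_prob` and `chain_rule` are all assumptions made for this example. It also assumes the convention that a conditional probability is 0 when the conditioning event has probability zero.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair six-sided dice (illustrative choice).
omega = set(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event), len(omega))


def cond_prob(a, b):
    """P(a | b), set to 0 when P(b) = 0 (assumed convention)."""
    return prob(a & b) / prob(b) if prob(b) > 0 else Fraction(0)


def chain_rule(events):
    """Product of P(A_k | A_1 ∩ ... ∩ A_{k-1}) for k = 1, ..., n."""
    result = Fraction(1)
    seen = set(omega)  # the empty intersection is the whole sample space
    for a in events:
        result *= cond_prob(a, seen)
        seen &= a
    return result


a1 = {w for w in omega if w[0] % 2 == 0}     # first die is even
a2 = {w for w in omega if w[0] + w[1] >= 8}  # sum is at least 8
a3 = {w for w in omega if w[1] > w[0]}       # second die exceeds the first
a4 = {w for w in omega if w[1] == 6}         # second die shows 6

direct = prob(a1 & a2 & a3 & a4)
via_chain = chain_rule([a1, a2, a3, a4])
print(direct, via_chain)  # 1/18 1/18
```

Using `Fraction` rather than floating point keeps the comparison exact, so the two values can be tested for equality directly.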

Example 2

We randomly draw 4 cards without replacement from a deck of 52 cards. What is the probability that we have drawn 4 aces?

First, we set ${\textstyle A_{n}:=\left\{{\text{draw an ace in the }}n^{\text{th}}{\text{ try}}\right\}}$ . Since each ace drawn leaves one fewer ace and one fewer card in the deck, we get the following probabilities:

\mathbb {P} (A_{1})={\frac {4}{52}},\qquad \mathbb {P} (A_{2}\mid A_{1})={\frac {3}{51}},\qquad \mathbb {P} (A_{3}\mid A_{1}\cap A_{2})={\frac {2}{50}},\qquad \mathbb {P} (A_{4}\mid A_{1}\cap A_{2}\cap A_{3})={\frac {1}{49}}.

Applying the chain rule,

\mathbb {P} (A_{1}\cap A_{2}\cap A_{3}\cap A_{4})={\frac {4}{52}}\cdot {\frac {3}{51}}\cdot {\frac {2}{50}}\cdot {\frac {1}{49}}={\frac {1}{270725}}.
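
As a rough check of this product, the conditional probabilities can be multiplied with exact fractions and compared against a direct count of 4-card hands. This is a sketch under the example's assumptions (4 aces, 52 cards, 4 draws); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

aces, deck, draws = 4, 52, 4

# Multiply the conditional probabilities P(A_k | A_1, ..., A_{k-1}) in turn:
# after k aces have been drawn, (aces - k) aces remain among (deck - k) cards.
p_four_aces = Fraction(1)
for k in range(draws):
    p_four_aces *= Fraction(aces - k, deck - k)

print(p_four_aces)  # 1/270725

# Cross-check by counting: exactly one of the C(52, 4) possible hands is all aces.
assert p_four_aces == Fraction(1, comb(deck, draws))
```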

Statement of the theorem and proof

Let $(\Omega ,{\mathcal {A}},\mathbb {P} )$ be a probability space. Recall that the conditional probability of an event $A\in {\mathcal {A}}$ given an event $B\in {\mathcal {A}}$ is defined as

{\begin{aligned}\mathbb {P} (A\mid B):={\begin{cases}{\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (B)}},&\mathbb {P} (B)>0,\\0,&\mathbb {P} (B)=0.\end{cases}}\end{aligned}}

Then we have the following theorem.

Chain rule — Let $(\Omega ,{\mathcal {A}},\mathbb {P} )$ be a probability space. Let $A_{1},...,A_{n}\in {\mathcal {A}}$ . Then

{\begin{aligned}\mathbb {P} \left(A_{1}\cap A_{2}\cap \ldots \cap A_{n}\right)&=\mathbb {P} (A_{1})\mathbb {P} (A_{2}\mid A_{1})\mathbb {P} (A_{3}\mid A_{1}\cap A_{2})\cdot \ldots \cdot \mathbb {P} (A_{n}\mid A_{1}\cap \dots \cap A_{n-1})\\&=\mathbb {P} (A_{1})\prod _{j=2}^{n}\mathbb {P} (A_{j}\mid A_{1}\cap \dots \cap A_{j-1}).\end{aligned}}

Proof

The formula follows immediately by recursion

{\begin{aligned}(1)&&&\mathbb {P} (A_{1})\mathbb {P} (A_{2}\mid A_{1})&=&\qquad \mathbb {P} (A_{1}\cap A_{2})\\(2)&&&\mathbb {P} (A_{1})\mathbb {P} (A_{2}\mid A_{1})\mathbb {P} (A_{3}\mid A_{1}\cap A_{2})&=&\qquad \mathbb {P} (A_{1}\cap A_{2})\mathbb {P} (A_{3}\mid A_{1}\cap A_{2})\\&&&&=&\qquad \mathbb {P} (A_{1}\cap A_{2}\cap A_{3}),\end{aligned}}

where we used the definition of the conditional probability in the first step.
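
For larger $n$ the recursion continues in the same way. As a sketch of the general step, stated in the notation of the theorem,

\mathbb {P} (A_{1}\cap \dots \cap A_{n})=\mathbb {P} (A_{1}\cap \dots \cap A_{n-1})\,\mathbb {P} (A_{n}\mid A_{1}\cap \dots \cap A_{n-1}),

which is the definition of the conditional probability when $\mathbb {P} (A_{1}\cap \dots \cap A_{n-1})>0$ . When $\mathbb {P} (A_{1}\cap \dots \cap A_{n-1})=0$ , both sides are zero, because $A_{1}\cap \dots \cap A_{n}$ is a subset of $A_{1}\cap \dots \cap A_{n-1}$ and the conditional probability is defined to be $0$ in that case.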

Chain rule for discrete random variables

Two random variables

For two discrete random variables $X,Y$ , we use the events $A:=\{X=x\}$ and $B:=\{Y=y\}$ in the definition above, and find the joint distribution as

\mathbb {P} (X=x,Y=y)=\mathbb {P} (X=x\mid Y=y)\mathbb {P} (Y=y),

or

\mathbb {P} _{(X,Y)}(x,y)=\mathbb {P} _{X\mid Y}(x\mid y)\mathbb {P} _{Y}(y),

where $\mathbb {P} _{X}(x):=\mathbb {P} (X=x)$ is the probability distribution of $X$ and $\mathbb {P} _{X\mid Y}(x\mid y)$ is the conditional probability distribution of $X$ given $Y$ .
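
To make the two-variable factorization concrete, the sketch below starts from a small joint table, derives the marginal of $Y$ and the conditional distribution of $X$ given $Y$ , and confirms that their product recovers the joint. The table values and the function names are made up for illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y); the numbers are made up for illustration.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}


def marginal_y(y):
    """P(Y = y), obtained by summing the joint pmf over x."""
    return sum(p for (_, yy), p in joint.items() if yy == y)


def cond_x_given_y(x, y):
    """P(X = x | Y = y), set to 0 when P(Y = y) = 0."""
    p_y = marginal_y(y)
    return joint.get((x, y), Fraction(0)) / p_y if p_y > 0 else Fraction(0)


for (x, y), p in joint.items():
    # Chain rule: P(X = x, Y = y) = P(X = x | Y = y) * P(Y = y)
    assert cond_x_given_y(x, y) * marginal_y(y) == p

print(cond_x_given_y(0, 1), marginal_y(1))  # 3/5 5/8
```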

Finitely many random variables

Let $X_{1},\ldots ,X_{n}$ be discrete random variables and $x_{1},\dots ,x_{n}\in \mathbb {R}$ . By the definition of the conditional probability,

\mathbb {P} \left(X_{n}=x_{n},\ldots ,X_{1}=x_{1}\right)=\mathbb {P} \left(X_{n}=x_{n}\mid X_{n-1}=x_{n-1},\ldots ,X_{1}=x_{1}\right)\mathbb {P} \left(X_{n-1}=x_{n-1},\ldots ,X_{1}=x_{1}\right)

and using the chain rule, where we set $A_{k}:=\{X_{k}=x_{k}\}$ , we can find the joint distribution as

{\begin{aligned}\mathbb {P} \left(X_{1}=x_{1},\ldots ,X_{n}=x_{n}\right)&=\mathbb {P} \left(X_{n}=x_{n}\mid X_{1}=x_{1},\ldots ,X_{n-1}=x_{n-1}\right)\mathbb {P} \left(X_{1}=x_{1},\ldots ,X_{n-1}=x_{n-1}\right)\\&=\mathbb {P} (X_{1}=x_{1})\mathbb {P} (X_{2}=x_{2}\mid X_{1}=x_{1})\mathbb {P} (X_{3}=x_{3}\mid X_{1}=x_{1},X_{2}=x_{2})\cdot \ldots \\&\qquad \cdot \mathbb {P} (X_{n}=x_{n}\mid X_{1}=x_{1},\dots ,X_{n-1}=x_{n-1}).\end{aligned}}

Example

For $n=3$ , i.e. three random variables, the chain rule reads

{\begin{aligned}\mathbb {P} _{(X_{1},X_{2},X_{3})}(x_{1},x_{2},x_{3})&=\mathbb {P} (X_{1}=x_{1},X_{2}=x_{2},X_{3}=x_{3})\\&=\mathbb {P} (X_{3}=x_{3}\mid X_{2}=x_{2},X_{1}=x_{1})\mathbb {P} (X_{2}=x_{2},X_{1}=x_{1})\\&=\mathbb {P} (X_{3}=x_{3}\mid X_{2}=x_{2},X_{1}=x_{1})\mathbb {P} (X_{2}=x_{2}\mid X_{1}=x_{1})\mathbb {P} (X_{1}=x_{1})\\&=\mathbb {P} _{X_{3}\mid X_{2},X_{1}}(x_{3}\mid x_{2},x_{1})\mathbb {P} _{X_{2}\mid X_{1}}(x_{2}\mid x_{1})\mathbb {P} _{X_{1}}(x_{1}).\end{aligned}}
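
The factorization can also be read in the other direction: given a distribution for $X_{1}$ and conditional distributions for $X_{2}$ and $X_{3}$ , the product defines a joint distribution. The following sketch assumes three binary variables with made-up conditional tables; the table names and the `joint` helper are hypothetical.

```python
from fractions import Fraction
from itertools import product

F = Fraction

# Hypothetical conditional tables for three binary variables (made-up numbers).
p_x1 = {0: F(1, 3), 1: F(2, 3)}
p_x2_given_x1 = {0: {0: F(1, 2), 1: F(1, 2)},
                 1: {0: F(1, 4), 1: F(3, 4)}}            # indexed [x1][x2]
p_x3_given_x1_x2 = {(0, 0): {0: F(1, 5), 1: F(4, 5)},
                    (0, 1): {0: F(1, 2), 1: F(1, 2)},
                    (1, 0): {0: F(2, 3), 1: F(1, 3)},
                    (1, 1): {0: F(1, 10), 1: F(9, 10)}}  # indexed [(x1, x2)][x3]


def joint(x1, x2, x3):
    """Chain rule: P(x1, x2, x3) = P(x3 | x1, x2) * P(x2 | x1) * P(x1)."""
    return p_x3_given_x1_x2[(x1, x2)][x3] * p_x2_given_x1[x1][x2] * p_x1[x1]


table = {xs: joint(*xs) for xs in product((0, 1), repeat=3)}
total = sum(table.values())
print(total)             # 1
print(table[(1, 1, 1)])  # 9/20
```

Because each conditional table sums to 1 over its last variable, the eight joint probabilities sum to 1 without any extra normalization.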

Bibliography

René L. Schilling (2021), Measure, Integral, Probability & Processes - Probab(ilistical)ly the Theoretical Minimum (1 ed.), Technische Universität Dresden, Germany, ISBN 979-8-5991-0488-9
William Feller (1968), An Introduction to Probability Theory and Its Applications, vol. I (3 ed.), New York / London / Sydney: Wiley, ISBN 978-0-471-25708-0
Russell, Stuart J.; Norvig, Peter (2003), Artificial Intelligence: A Modern Approach (2nd ed.), Upper Saddle River, New Jersey: Prentice Hall, ISBN 0-13-790395-2, p. 496.

References

[1] Schilling, René L. (2021). Measure, Integral, Probability & Processes - Probab(ilistical)ly the Theoretical Minimum. Technische Universität Dresden, Germany. p. 136ff. ISBN 979-8-5991-0488-9.
[2] Schum, David A. (1994). The Evidential Foundations of Probabilistic Reasoning. Northwestern University Press. p. 49. ISBN 978-0-8101-1821-8.
[3] Klugh, Henry E. (2013). Statistics: The Essentials for Research (3rd ed.). Psychology Press. p. 149. ISBN 978-1-134-92862-0.
[4] Virtue, Pat. "10-606: Mathematical Foundations for Machine Learning" (PDF).

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
