Conditional probability distribution


In probability theory and statistics, the conditional probability distribution is a probability distribution that describes the probability of an outcome given the occurrence of a particular event. Given two jointly distributed random variables $X$ and $Y$, the conditional probability distribution of $Y$ given $X$ is the probability distribution of $Y$ when $X$ is known to be a particular value; in some cases the conditional probabilities may be expressed as functions containing the unspecified value $x$ of $X$ as a parameter. When both $X$ and $Y$ are categorical variables, a conditional probability table is typically used to represent the conditional probability. The conditional distribution contrasts with the marginal distribution of a random variable, which is its distribution without reference to the value of the other variable.


If the conditional distribution of $Y$ given $X$ is a continuous distribution, then its probability density function is known as the conditional density function. [1] The properties of a conditional distribution, such as the moments, are often referred to by corresponding names such as the conditional mean and conditional variance.

More generally, one can refer to the conditional distribution of a subset of a set of more than two variables; this conditional distribution is contingent on the values of all the remaining variables, and if more than one variable is included in the subset then this conditional distribution is the conditional joint distribution of the included variables.

Conditional discrete distributions

For discrete random variables, the conditional probability mass function of $Y$ given $X = x$ can be written according to its definition as:

    p_{Y|X}(y \mid x) = P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)}

Due to the occurrence of $P(X = x)$ in the denominator, this is defined only for non-zero (hence strictly positive) $P(X = x)$.

The relation with the probability distribution of $X$ given $Y$ is:

    P(Y = y \mid X = x) \, P(X = x) = P(X = x, Y = y) = P(X = x \mid Y = y) \, P(Y = y).
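
The definition above can be sketched numerically. The joint pmf below is a hypothetical toy example (its values are illustrative, not from the article):

```python
# Hypothetical joint pmf of two discrete variables X and Y,
# stored as {(x, y): probability}. The values are illustrative.
joint = {
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

def conditional_pmf(joint, x):
    """p_{Y|X}(y | x) = P(X=x, Y=y) / P(X=x); defined only if P(X=x) > 0."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x == 0:
        raise ValueError("P(X = x) must be strictly positive")
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

print(conditional_pmf(joint, 1))  # {0: 0.3/0.7, 1: 0.4/0.7}
```

Note how the guard on `p_x` mirrors the requirement that the conditioning event have strictly positive probability.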

Example

Consider the roll of a fair die and let $X = 1$ if the number is even (i.e., 2, 4, or 6) and $X = 0$ otherwise. Furthermore, let $Y = 1$ if the number is prime (i.e., 2, 3, or 5) and $Y = 0$ otherwise.

    D | 1 | 2 | 3 | 4 | 5 | 6
    X | 0 | 1 | 0 | 1 | 0 | 1
    Y | 0 | 1 | 1 | 0 | 1 | 0

Then the unconditional probability that $X = 1$ is 3/6 = 1/2 (since there are six possible rolls of the die, of which three are even), whereas the probability that $X = 1$ conditional on $Y = 1$ is 1/3 (since there are three possible prime number rolls, namely 2, 3, and 5, of which one is even).
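
Both probabilities can be verified by enumerating the six equally likely outcomes:

```python
# Verify the die example by enumeration.
outcomes = range(1, 7)
X = {d: 1 if d % 2 == 0 else 0 for d in outcomes}      # even indicator
Y = {d: 1 if d in (2, 3, 5) else 0 for d in outcomes}  # prime indicator

p_x1 = sum(1 for d in outcomes if X[d] == 1) / 6       # P(X = 1)
p_x1_given_y1 = (sum(1 for d in outcomes if X[d] == 1 and Y[d] == 1)
                 / sum(1 for d in outcomes if Y[d] == 1))  # P(X = 1 | Y = 1)

print(p_x1, p_x1_given_y1)  # 0.5 and 1/3
```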

Conditional continuous distributions

Similarly for continuous random variables, the conditional probability density function of $Y$ given the occurrence of the value $x$ of $X$ can be written as [2] :p. 99

    f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}

where $f_{X,Y}(x, y)$ gives the joint density of $X$ and $Y$, while $f_X(x)$ gives the marginal density for $X$. Also in this case it is necessary that $f_X(x) > 0$.

The relation with the probability distribution of $X$ given $Y$ is given by:

    f_{Y|X}(y \mid x) \, f_X(x) = f_{X,Y}(x, y) = f_{X|Y}(x \mid y) \, f_Y(y).
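
A numerical sketch, using the hypothetical joint density $f(x, y) = x + y$ on the unit square (a valid density: nonnegative and integrating to 1), shows that dividing by the marginal produces a function that integrates to 1 in $y$:

```python
# Midpoint Riemann sums on f(x, y) = x + y over [0, 1]^2 (illustrative density).
n = 2000  # grid resolution

def f_joint(x, y):
    return x + y

def f_marginal_x(x):
    # f_X(x) = integral of f(x, y) over y in [0, 1]
    return sum(f_joint(x, (j + 0.5) / n) for j in range(n)) / n

x0 = 0.3
fx = f_marginal_x(x0)  # analytically x0 + 1/2 = 0.8

def f_cond(y):
    # f_{Y|X}(y | x0) = f(x0, y) / f_X(x0), valid since f_X(x0) > 0
    return f_joint(x0, y) / fx

# A conditional density must integrate to 1 over its argument.
total = sum(f_cond((j + 0.5) / n) for j in range(n)) / n
print(fx, total)  # approximately 0.8 and 1.0
```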

The concept of the conditional distribution of a continuous random variable is not as intuitive as it might seem: Borel's paradox shows that conditional probability density functions need not be invariant under coordinate transformations.

Example

[Figure: Bivariate normal joint density]

The graph shows a bivariate normal joint density for random variables $X$ and $Y$. To see the distribution of $Y$ conditional on $X = x$, one can first visualize the line $X = x$ in the $X, Y$ plane, and then visualize the plane containing that line and perpendicular to the $X, Y$ plane. The intersection of that plane with the joint normal density, once rescaled to give unit area under the intersection, is the relevant conditional density of $Y$.
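
The slice-and-rescale construction can be checked against the known closed form: for a standard bivariate normal with correlation $\rho$, the conditional law is $Y \mid X = x_0 \sim N(\rho x_0, 1 - \rho^2)$. The parameters below ($\rho = 0.6$, $x_0 = 1$) are illustrative choices, not from the article:

```python
import math

rho, x0 = 0.6, 1.0  # illustrative correlation and conditioning value

def joint_pdf(x, y):
    # Standard bivariate normal density with correlation rho
    q = (x * x - 2 * rho * x * y + y * y) / (1 - rho ** 2)
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(1 - rho ** 2))

# Slice the joint density along x = x0 and rescale to unit area.
m = 4000
dy = 12 / m
ys = [-6 + (j + 0.5) * dy for j in range(m)]
slice_vals = [joint_pdf(x0, y) for y in ys]
area = sum(slice_vals) * dy            # numerically, the marginal f_X(x0)
cond = [v / area for v in slice_vals]  # rescaled slice: the conditional density

def cond_pdf(y):
    # Known conditional: N(rho * x0, 1 - rho^2)
    var = 1 - rho ** 2
    return math.exp(-(y - rho * x0) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

err = max(abs(c - cond_pdf(y)) for c, y in zip(cond, ys))
print(err)
```

The maximum pointwise discrepancy is tiny, confirming that rescaling the slice by its area reproduces the conditional density.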

Relation to independence

Random variables $X$, $Y$ are independent if and only if the conditional distribution of $Y$ given $X$ is, for all possible realizations of $X$, equal to the unconditional distribution of $Y$. For discrete random variables this means $P(Y = y \mid X = x) = P(Y = y)$ for all possible $x$ and $y$ with $P(X = x) > 0$. For continuous random variables $X$ and $Y$, having a joint density function, it means $f_{Y|X}(y \mid x) = f_Y(y)$ for all possible $x$ and $y$ with $f_X(x) > 0$.
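
Equivalently, independence means the joint pmf factors into the product of the marginals, which is easy to test directly (the joint pmfs below are illustrative toys):

```python
# Independence check: p(x, y) == p_X(x) * p_Y(y) for every cell.
def is_independent(joint, tol=1e-12):
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
               for x in xs for y in ys)

# A product pmf (marginals 0.4/0.6 and 0.3/0.7) and a perfectly dependent one.
indep = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}
dep = {(0, 0): 0.5, (1, 1): 0.5}
print(is_independent(indep), is_independent(dep))  # True False
```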

Properties

Seen as a function of $y$ for given $x$, $P(Y = y \mid X = x)$ is a probability mass function and so the sum over all $y$ (or integral if it is a conditional probability density) is 1. Seen as a function of $x$ for given $y$, it is a likelihood function, so that the sum (or integral) over all $x$ need not be 1.
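
The asymmetry is visible on a toy joint pmf (illustrative values): summing the conditional over its pmf argument gives 1, while summing over the conditioning argument generally does not:

```python
# Toy joint pmf, stored as {(x, y): probability}.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
cond = {(y, x): joint[(x, y)] / p_x[x] for x in (0, 1) for y in (0, 1)}

# As a function of y with x = 0 fixed: a pmf, sums to 1.
sum_over_y = sum(cond[(y, 0)] for y in (0, 1))
# As a function of x with y = 0 fixed: a likelihood, need not sum to 1.
sum_over_x = sum(cond[(0, x)] for x in (0, 1))
print(sum_over_y, sum_over_x)
```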

Additionally, a marginal of a joint distribution can be expressed as the expectation of the corresponding conditional distribution. For instance, $f_X(x) = E_Y[\, f_{X|Y}(x \mid Y) \,]$.
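
In the discrete case this is the law of total probability, $p_X(x) = \sum_y p_{X|Y}(x \mid y)\, p_Y(y)$, which a toy joint pmf (illustrative values) confirms:

```python
# Recover the marginal of X as the expectation of the conditional over Y.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

def p_x_given_y(x, y):
    return joint[(x, y)] / p_y[y]

p_x0 = sum(p_x_given_y(0, y) * p_y[y] for y in (0, 1))
print(p_x0)  # equals the direct marginal 0.1 + 0.2 = 0.3 (up to rounding)
```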

Measure-theoretic formulation

Let $(\Omega, \mathcal{F}, P)$ be a probability space, and $\mathcal{G} \subseteq \mathcal{F}$ a $\sigma$-field in $\mathcal{F}$. Given $A \in \mathcal{F}$, the Radon–Nikodym theorem implies that there is [3] a $\mathcal{G}$-measurable random variable $P(A \mid \mathcal{G}) : \Omega \to \mathbb{R}$, called the conditional probability, such that

    \int_G P(A \mid \mathcal{G})(\omega) \, dP(\omega) = P(A \cap G)

for every $G \in \mathcal{G}$, and such a random variable is uniquely defined up to sets of probability zero. A conditional probability is called regular if $P(\cdot \mid \mathcal{G})(\omega)$ is a probability measure on $(\Omega, \mathcal{F})$ for almost every $\omega \in \Omega$.

Special cases:

Let $X : \Omega \to E$ be an $(E, \mathcal{E})$-valued random variable. For each $B \in \mathcal{E}$, define

    \mu_{X \mid \mathcal{G}}(B, \omega) = P(X^{-1}(B) \mid \mathcal{G})(\omega).

For any $\omega \in \Omega$, the function $\mu_{X \mid \mathcal{G}}(\cdot, \omega) : \mathcal{E} \to [0, 1]$ is called the conditional probability distribution of $X$ given $\mathcal{G}$. If it is a probability measure on $(E, \mathcal{E})$, then it is called regular.

For a real-valued random variable (with respect to the Borel $\sigma$-field on $\mathbb{R}$), every conditional probability distribution is regular. [4] In this case, $E[X \mid \mathcal{G}] = \int x \, \mu_{X \mid \mathcal{G}}(dx, \cdot)$ almost surely.

Relation to conditional expectation

For any event $A \in \mathcal{F}$, define the indicator function:

    \mathbf{1}_A(\omega) = 1 \text{ if } \omega \in A, \text{ and } 0 \text{ otherwise},

which is a random variable. Note that the expectation of this random variable is equal to the probability of $A$ itself:

    E[\mathbf{1}_A] = P(A).

Given a $\sigma$-field $\mathcal{G} \subseteq \mathcal{F}$, the conditional probability $P(A \mid \mathcal{G})$ is a version of the conditional expectation of the indicator function for $A$:

    P(A \mid \mathcal{G}) = E[\mathbf{1}_A \mid \mathcal{G}]

An expectation of a random variable with respect to a regular conditional probability is equal to its conditional expectation.
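
The identity $E[\mathbf{1}_A] = P(A)$ is immediate on a finite probability space, here the fair die:

```python
# E[1_A] = P(A) checked on a finite probability space (a fair die).
omega = [1, 2, 3, 4, 5, 6]
prob = {w: 1 / 6 for w in omega}
A = {2, 4, 6}  # the event "the roll is even"

def indicator(A, w):
    return 1 if w in A else 0

expectation = sum(indicator(A, w) * prob[w] for w in omega)
p_A = sum(prob[w] for w in A)
print(expectation, p_A)  # both approximately 0.5
```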

Interpretation of conditioning on a sigma field

Consider the probability space $(\Omega, \mathcal{F}, P)$ and a sub-sigma field $\mathcal{G} \subseteq \mathcal{F}$. The sub-sigma field $\mathcal{G}$ can be loosely interpreted as containing a subset of the information in $\mathcal{F}$. For example, we might think of $P(B \mid \mathcal{G})$ as the probability of the event $B$ given the information in $\mathcal{G}$.

Also recall that an event $B$ is independent of a sub-sigma field $\mathcal{G}$ if $P(B \cap G) = P(B)\,P(G)$ for all $G \in \mathcal{G}$. It is incorrect to conclude in general that the information in $\mathcal{G}$ does not tell us anything about the probability of event $B$ occurring. This can be shown with a counter-example:

Consider a probability space on the unit interval, $\Omega = [0, 1]$, with Lebesgue measure. Let $\mathcal{G}$ be the sigma-field of all countable sets and sets whose complement is countable. So each set in $\mathcal{G}$ has measure $0$ or $1$ and so is independent of each event in $\mathcal{F}$. However, notice that $\mathcal{G}$ also contains all the singleton events in $\Omega$ (those sets which contain only a single $\omega \in \Omega$). So knowing which of the events in $\mathcal{G}$ occurred is equivalent to knowing exactly which $\omega \in \Omega$ occurred! So in one sense, $\mathcal{G}$ contains no information about $\mathcal{F}$ (it is independent of it), and in another sense it contains all the information in $\mathcal{F}$. [5]


References

Citations

  1. Ross, Sheldon M. (1993). Introduction to Probability Models (Fifth ed.). San Diego: Academic Press. pp. 88–91. ISBN 0-12-598455-3.
  2. Park, Kun Il (2018). Fundamentals of Probability and Stochastic Processes with Applications to Communications. Springer. ISBN 978-3-319-68074-3.
  3. Billingsley (1995), p. 430.
  4. Billingsley (1995), p. 439.
  5. Billingsley, Patrick (2012). Probability and Measure. Hoboken, New Jersey: Wiley. ISBN 978-1-118-12237-2.

Sources

  * Billingsley, Patrick (1995). Probability and Measure (Third ed.). New York: John Wiley & Sons.