Regular conditional probability

In probability theory, regular conditional probability is a concept that formalizes the notion of conditioning on the outcome of a random variable. The resulting conditional probability distribution is a parametrized family of probability measures called a Markov kernel.

Definition

Conditional probability distribution

Consider two random variables $X$ and $Y$. The conditional probability distribution of $Y$ given $X$ is a two-variable function

$$\kappa_{Y\mid X}(x, A) := P(Y \in A \mid X = x).$$

If the random variable $X$ is discrete,

$$\kappa_{Y\mid X}(x, A) = P(Y \in A \mid X = x) =
\begin{cases}
\dfrac{P(Y \in A,\, X = x)}{P(X = x)} & \text{if } P(X = x) > 0,\\[4pt]
\text{arbitrary value} & \text{otherwise.}
\end{cases}$$

If the random variables $X, Y$ are continuous with joint density $f_{X,Y}(x, y)$,

$$\kappa_{Y\mid X}(x, A) =
\begin{cases}
\dfrac{\int_A f_{X,Y}(x, y)\,\mathrm{d}y}{f_X(x)} & \text{if } f_X(x) > 0,\\[4pt]
\text{arbitrary value} & \text{otherwise,}
\end{cases}$$

where $f_X$ is the marginal density of $X$.
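
As a simple illustration (a toy example, not part of the original text): toss a fair coin twice, independently, let $X$ be the outcome of the first toss ($1$ for heads, $0$ for tails) and $Y$ the total number of heads. Then

$$\kappa_{Y\mid X}(1, \{2\}) = P(Y = 2 \mid X = 1) = \tfrac12, \qquad \kappa_{Y\mid X}(0, \{2\}) = 0,$$

and for each fixed $x$ the map $A \mapsto \kappa_{Y\mid X}(x, A)$ is a probability measure on $\{0, 1, 2\}$ (for $x = 1$ it puts mass $\tfrac12$ on each of $\{1\}$ and $\{2\}$).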

A more general definition can be given in terms of conditional expectation. Consider a measurable function $e_{Y \in A}$ on the state space of $X$, taking values in $[0, 1]$, satisfying

$$e_{Y \in A}\big(X(\omega)\big) = \operatorname{E}\!\left[1_{\{Y \in A\}} \mid X\right](\omega)$$

for almost all $\omega$. Then the conditional probability distribution is given by

$$\kappa_{Y\mid X}(x, A) := e_{Y \in A}(x).$$

As with conditional expectation, this can be further generalized to conditioning on a sigma algebra $\mathcal{F}$. In that case the conditional distribution is a function $(\omega, A) \mapsto \kappa_{Y\mid\mathcal{F}}(\omega, A)$ of the sample point and the measurable set of values of $Y$:

$$\kappa_{Y\mid\mathcal{F}}(\omega, A) := \operatorname{E}\!\left[1_{\{Y \in A\}} \mid \mathcal{F}\right](\omega).$$
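
Two sanity checks (illustrations, not from the source): taking $\mathcal{F} = \sigma(X)$ recovers the previous definition, since $\kappa_{Y\mid\sigma(X)}(\omega, A) = \kappa_{Y\mid X}\big(X(\omega), A\big)$ for almost all $\omega$; taking the trivial sigma algebra $\mathcal{F} = \{\emptyset, \Omega\}$ gives $\kappa_{Y\mid\mathcal{F}}(\omega, A) = P(Y \in A)$ for every $\omega$.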

Regularity

For working with $\kappa_{Y\mid X}$, it is important that it be regular, that is:

  1. For almost all $x$, $A \mapsto \kappa_{Y\mid X}(x, A)$ is a probability measure
  2. For all $A$, $x \mapsto \kappa_{Y\mid X}(x, A)$ is a measurable function

In other words, $\kappa_{Y\mid X}$ is a Markov kernel.
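
The following Python sketch (an illustration under assumptions: the model $Y \mid X = x \sim \mathcal{N}(x, 1)$ and the function name `kernel` are hypothetical, and SciPy is assumed to be available) represents such a kernel on intervals $A = [a, b]$ and checks numerically that the two regularity conditions are plausible for it:

```python
from scipy.stats import norm

def kernel(x, a, b):
    """kappa_{Y|X}(x, [a, b]) for the toy model Y | X = x ~ Normal(x, 1)."""
    return norm.cdf(b, loc=x) - norm.cdf(a, loc=x)

# Condition 1: for each fixed x, A -> kernel(x, A) behaves like a probability
# measure: total mass 1 and additivity over disjoint intervals.
x = 0.7
assert abs(kernel(x, -1e6, 1e6) - 1.0) < 1e-9
assert abs(kernel(x, -2, 3) - (kernel(x, -2, 0) + kernel(x, 0, 3))) < 1e-12

# Condition 2: for each fixed A, x -> kernel(x, A) is a function of x
# (here it is even continuous, hence measurable).
print([round(kernel(xi, 0, 1), 4) for xi in (-1.0, 0.0, 1.0)])
```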

The second condition holds trivially, but the proof of the first is more involved. It can be shown that if $Y$ is a random element in a Radon space $S$, there exists a $\kappa_{Y\mid X}$ that satisfies the first condition.[1] It is possible to construct more general spaces where a regular conditional probability distribution does not exist.[2]

Relation to conditional expectation

For discrete and continuous random variables, the conditional expectation can be expressed as

$$\operatorname{E}[Y \mid X = x] = \sum_y y\, P(Y = y \mid X = x) \qquad \text{(discrete case)},$$
$$\operatorname{E}[Y \mid X = x] = \int y\, f_{Y\mid X}(x, y)\,\mathrm{d}y \qquad \text{(continuous case)},$$

where $f_{Y\mid X}(x, y)$ is the conditional density of $Y$ given $X$.

This result can be extended to the measure-theoretic conditional expectation using the regular conditional probability distribution:

$$\operatorname{E}[Y \mid X](\omega) = \int y\, \kappa_{Y\mid X}\big(X(\omega), \mathrm{d}y\big).$$
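
A minimal numerical consistency check (an illustration under assumptions: NumPy is assumed, and the linear-Gaussian model $Y = X + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$ independent of $X$ is hypothetical). In that model $\int y\, \kappa_{Y\mid X}(x, \mathrm{d}y) = x$, so binned sample averages of $Y$ should track the conditioning value:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)       # X ~ N(0, 1)
y = x + rng.normal(size=n)   # Y | X = x ~ N(x, 1), so the kernel mean is x

# Empirical E[Y | X close to x0], compared with the kernel mean x0.
for x0 in (-1.0, 0.0, 1.5):
    mask = np.abs(x - x0) < 0.05
    print(x0, round(y[mask].mean(), 3))   # each value should be close to x0
```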

Formal definition

Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $T \colon \Omega \to E$ be a random variable, defined as a Borel-measurable function from $\Omega$ to its state space $(E, \mathcal{E})$. One should think of $T$ as a way to "disintegrate" the sample space $\Omega$ into the fibers $\{T^{-1}(x)\}_{x \in E}$. Using the disintegration theorem from measure theory, this allows us to "disintegrate" the measure $P$ into a collection of measures, one for each $x \in E$. Formally, a regular conditional probability is defined as a function $\nu \colon E \times \mathcal{F} \to [0, 1]$, called a "transition probability", where:

  1. For every $x \in E$, $A \mapsto \nu(x, A)$ is a probability measure on $\mathcal{F}$
  2. For every $A \in \mathcal{F}$, $x \mapsto \nu(x, A)$ is a measurable function on $E$

and, for all $A \in \mathcal{F}$ and all $B \in \mathcal{E}$,

$$P\big(A \cap T^{-1}(B)\big) = \int_B \nu(x, A)\,(P \circ T^{-1})(\mathrm{d}x),$$

where $P \circ T^{-1}$ is the pushforward measure $T_{*}P$ of the distribution of the random element $T$, and $x$ ranges over the topological support of $P \circ T^{-1}$. Specifically, if we take $B = E$, then $A \cap T^{-1}(E) = A$, and so

$$P(A) = \int_E \nu(x, A)\,(P \circ T^{-1})(\mathrm{d}x),$$

where $\nu(x, A)$ can be denoted, using more familiar terms, $P(A \mid T = x)$.
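
As a concrete illustration (not part of the source text): take $\Omega = [0, 1]^2$ with two-dimensional Lebesgue measure $P$, and let $T(\omega_1, \omega_2) = \omega_1$, so that $E = [0, 1]$ and $P \circ T^{-1}$ is Lebesgue measure on $[0, 1]$. Setting $\nu(x, A) := \operatorname{Leb}\{\omega_2 \in [0, 1] : (x, \omega_2) \in A\}$ gives a transition probability: each $\nu(x, \cdot)$ is a probability measure, $x \mapsto \nu(x, A)$ is measurable, and Fubini's theorem yields

$$P\big(A \cap T^{-1}(B)\big) = \int_B \operatorname{Leb}\{\omega_2 : (x, \omega_2) \in A\}\,\mathrm{d}x = \int_B \nu(x, A)\,(P \circ T^{-1})(\mathrm{d}x).$$

Here $\nu(x, \cdot)$ is the conditional law of the point given that its first coordinate equals $x$.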

Alternate definition

Consider a Radon space $\Omega$ (that is, a probability measure defined on a Radon space endowed with the Borel sigma-algebra) and a real-valued random variable $T$. As discussed above, in this case there exists a regular conditional probability with respect to $T$. Moreover, we can alternatively define the regular conditional probability for an event $A$ given a particular value $t$ of the random variable $T$ in the following manner:

$$P(A \mid T = t) = \lim_{U \supseteq \{T = t\}} \frac{P(A \cap U)}{P(U)},$$

where the limit is taken over the net of open neighborhoods $U$ of the event $\{T = t\}$ as they become smaller with respect to set inclusion. This limit is defined if and only if the probability space is Radon, and only in the support of $T$, as described in the article. This is the restriction of the transition probability to the support of $T$. To describe this limiting process rigorously:

For every $\varepsilon > 0$ there exists an open neighborhood $U$ of the event $\{T = t\}$, such that for every open $V$ with $\{T = t\} \subset V \subset U$,

$$\left|\frac{P(A \cap V)}{P(V)} - L\right| < \varepsilon,$$

where $L = P(A \mid T = t)$ is the limit.
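
For a concrete illustration of this limit (not part of the source text): let $\Omega = [0, 1]$ with Lebesgue measure, $T(\omega) = \omega$, and $A = [0, \tfrac12]$. Taking the shrinking neighborhoods $U = (t - \varepsilon, t + \varepsilon)$ of the event $\{T = t\} = \{t\}$,

$$\frac{P(A \cap U)}{P(U)} \longrightarrow \begin{cases} 1 & \text{if } t < \tfrac12,\\ \tfrac12 & \text{if } t = \tfrac12,\\ 0 & \text{if } t > \tfrac12, \end{cases} \qquad \varepsilon \to 0,$$

which agrees with the elementary computation of $P(A \mid T = t)$ from the conditional density.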


References

  1. Klenke, Achim (2014). Probability Theory: A Comprehensive Course (2nd ed.). London: Springer. ISBN 978-1-4471-5361-0.
  2. Faden, A. M. (1985). "The existence of regular conditional probabilities: necessary and sufficient conditions". The Annals of Probability, 13(1), 288–298.
  3. Leão Jr., D., et al. (2004). "Regular conditional probability, disintegration of probability and Radon spaces". Proyecciones, 23(1), 15–29. Universidad Católica del Norte, Antofagasta, Chile.