Conditioning (probability)

Beliefs depend on the available information. This idea is formalized in probability theory by conditioning. Conditional probabilities, conditional expectations, and conditional probability distributions are treated on three levels: discrete probabilities, probability density functions, and measure theory. Conditioning leads to a non-random result if the condition is completely specified; otherwise, if the condition is left random, the result of conditioning is also random.

Conditioning on the discrete level

Example: A fair coin is tossed 10 times; the random variable X is the number of heads in these 10 tosses, and Y is the number of heads in the first 3 tosses. In spite of the fact that Y emerges before X, it may happen that someone knows X but not Y.

Conditional probability

Given that X = 1, the conditional probability of the event Y = 0 is

P ( Y = 0 | X = 1 ) = 0.7,

since, given a single head among the 10 tosses, that head is equally likely to be any of them and must fall among the last 7. More generally,

P ( Y = 0 | X = x ) = C(7, x) / C(10, x) = (10 − x)(9 − x)(8 − x) / 720 for x = 0, 1, …, 7

(here C(n, k) denotes the binomial coefficient), and P ( Y = 0 | X = x ) = 0 for x = 8, 9, 10. One may also treat the conditional probability as a random variable, a function of the random variable X, namely

P ( Y = 0 | X ) = (10 − X)(9 − X)(8 − X) / 720.

The expectation of this random variable is equal to the (unconditional) probability, E ( P ( Y = 0 | X ) ) = P ( Y = 0 ), namely

∑_x P ( Y = 0 | X = x ) P ( X = x ) = ∑_x [ C(7, x) / C(10, x) ] · [ C(10, x) / 2¹⁰ ] = 2⁷ / 2¹⁰ = 0.125,

which is an instance of the law of total probability E ( P ( A | X ) ) = P ( A ).

Thus, 0.7 = P ( Y = 0 | X = 1 ) may be treated as the value of the random variable P ( Y = 0 | X ) corresponding to X = 1. On the other hand, P ( Y = 0 | X = 1 ) = 0.7 is well-defined irrespective of other possible values of X.
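
These values are easy to check numerically. The following is a minimal simulation sketch, assuming Python with NumPy is available (sample size and seed are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    tosses = rng.integers(0, 2, size=(1_000_000, 10))   # 1 = heads, 0 = tails
    X = tosses.sum(axis=1)           # number of heads in all 10 tosses
    Y = tosses[:, :3].sum(axis=1)    # number of heads in the first 3 tosses

    # Conditional probability P(Y = 0 | X = 1); should be close to 0.7
    print((Y[X == 1] == 0).mean())

    # Law of total probability: E(P(Y = 0 | X)) = P(Y = 0) = 0.125
    cond = (10 - X) * (9 - X) * (8 - X) / 720.0    # P(Y = 0 | X) as a random variable
    print(cond.mean(), (Y == 0).mean())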

Conditional expectation

Given that X = 1, the conditional expectation of the random variable Y is E ( Y | X = 1 ) = 0.3. More generally,

E ( Y | X = x ) = (3/10) x = 0.3 x for x = 0, …, 10,

since, given x heads among the 10 tosses, each toss is heads with conditional probability x/10. (In this example it appears to be a linear function, but in general it is nonlinear.) One may also treat the conditional expectation as a random variable, a function of the random variable X, namely

E ( Y | X ) = 0.3 X.

The expectation of this random variable is equal to the (unconditional) expectation of Y, E ( E ( Y | X ) ) = E ( Y ), namely

∑_x E ( Y | X = x ) P ( X = x ) = ∑_x 0.3 x · C(10, x) / 2¹⁰ = 1.5,

or simply

E ( 0.3 X ) = 0.3 E ( X ) = 0.3 · 5 = 1.5,

which is an instance of the law of total expectation E ( E ( Y | X ) ) = E ( Y ).

The random variable E ( Y | X ) is the best predictor of Y given X. That is, it minimizes the mean square error E ( Y − f(X) )² on the class of all random variables of the form f(X). This class of random variables remains intact if X is replaced, say, with 2X. Thus, E ( Y | 2X ) = E ( Y | X ). It does not mean that E ( Y | 2X ) = 0.3 · 2X; rather, E ( Y | 2X ) = 0.15 · 2X = 0.3 X. In particular, E ( Y | 2X = 2 ) = 0.3. More generally, E ( Y | g(X) ) = E ( Y | X ) for every function g that is one-to-one on the set of all possible values of X. The values of X are irrelevant; what matters is the partition (denote it α_X)

Ω = {X = x₁} ∪ {X = x₂} ∪ …

of the sample space Ω into disjoint sets {X = xₙ}. (Here x₁, x₂, … are all possible values of X.) Given an arbitrary partition α of Ω, one may define the random variable E ( Y | α ). Still, E ( E ( Y | α ) ) = E ( Y ).

Conditional probability may be treated as a special case of conditional expectation. Namely, P ( A | X ) = E ( Y | X ) if Y is the indicator of A. Therefore the conditional probability also depends on the partition α_X generated by X rather than on X itself; P ( A | g(X) ) = P ( A | X ) = P ( A | α ), where α = α_X = α_g(X).

On the other hand, conditioning on an event B is well-defined, provided that P ( B ) ≠ 0, irrespective of any partition that may contain B as one of several parts.
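
The identities above can also be verified exactly by enumerating all 2¹⁰ equally likely toss sequences; a short Python sketch (the helper cond_exp is an illustrative name, not a library function):

    from itertools import product

    outcomes = list(product((0, 1), repeat=10))      # all 2**10 equally likely toss sequences
    X = [sum(o) for o in outcomes]
    Y = [sum(o[:3]) for o in outcomes]

    def cond_exp(values, condition):
        selected = [v for v, c in zip(values, condition) if c]
        return sum(selected) / len(selected)

    # E(Y | X = x) = 0.3 x, and conditioning on g(X) = 2X gives the same values
    for x in range(11):
        e1 = cond_exp(Y, [xi == x for xi in X])
        e2 = cond_exp(Y, [2 * xi == 2 * x for xi in X])
        assert abs(e1 - 0.3 * x) < 1e-12 and e1 == e2

    # Law of total expectation: E(E(Y | X)) = E(Y) = 1.5
    print(sum(Y) / len(Y))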

Conditional distribution

Given X = x, the conditional distribution of Y is

P ( Y = y | X = x ) = C(3, y) C(7, x − y) / C(10, x)

for 0 ≤ y ≤ min ( 3, x ). It is the hypergeometric distribution H ( x; 3, 7 ), or equivalently, H ( 3; x, 10 − x ). The corresponding expectation 0.3 x, obtained from the general formula

n R / ( R + W )

for H ( n; R, W ), is nothing but the conditional expectation E ( Y | X = x ) = 0.3 x.

Treating H ( X; 3, 7 ) as a random distribution (a random vector in the four-dimensional space of all measures on {0, 1, 2, 3}), one may take its expectation, getting the unconditional distribution of Y, the binomial distribution Bin ( 3, 0.5 ). This fact amounts to the equality

∑_x P ( Y = y | X = x ) P ( X = x ) = P ( Y = y ) = C(3, y) / 2³

for y = 0, 1, 2, 3; which is an instance of the law of total probability.
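
The equality can be checked term by term with a few lines of Python (standard library only; the inner range follows the support of the hypergeometric factor):

    from math import comb

    # sum over x of P(Y = y | X = x) P(X = x) should equal the Bin(3, 0.5) probability C(3, y) / 8
    for y in range(4):
        total = sum(
            comb(3, y) * comb(7, x - y) / comb(10, x)   # hypergeometric P(Y = y | X = x)
            * comb(10, x) / 2**10                       # binomial P(X = x)
            for x in range(y, y + 8)                    # the x with comb(7, x - y) > 0
        )
        print(y, total, comb(3, y) / 8)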

Conditioning on the level of densities

Example. A point of the sphere x² + y² + z² = 1 is chosen at random according to the uniform distribution on the sphere. [1] The random variables X, Y, Z are the coordinates of the random point. The joint density of X, Y, Z does not exist (since the sphere is of zero volume), but the joint density f_{X,Y} of X, Y exists,

f_{X,Y} (x, y) = 1 / ( 2π √(1 − x² − y²) ) if x² + y² < 1, and 0 otherwise.

(The density is non-constant because of a non-constant angle between the sphere and the plane.) The density of X may be calculated by integration,

f_X (x) = ∫ f_{X,Y} (x, y) dy, the integral running over −√(1 − x²) < y < √(1 − x²);

surprisingly, the result does not depend on x in (−1, 1),

f_X (x) = 0.5 for −1 < x < 1,

which means that X is distributed uniformly on (−1, 1). The same holds for Y and Z (and in fact, for aX + bY + cZ whenever a² + b² + c² = 1).
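
This projection property (essentially Archimedes' hat-box theorem) can be confirmed by simulation; a sketch assuming Python with NumPy, sampling the sphere by normalizing Gaussian vectors:

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.normal(size=(1_000_000, 3))
    p /= np.linalg.norm(p, axis=1, keepdims=True)   # uniform points on the unit sphere
    X = p[:, 0]

    # X should be uniform on (-1, 1): each quarter of the range should receive about 25% of the points
    print(np.histogram(X, bins=4, range=(-1, 1))[0] / len(X))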

Example. A different method of calculating the marginal distribution function is provided in the references. [2] [3]

Conditional probability

Calculation

Given that X = 0.5, the conditional probability of the event Y ≤ 0.75 is the integral of the conditional density,

f_{Y|X=0.5} (y) = f_{X,Y} (0.5, y) / f_X (0.5) = 1 / ( π √(0.75 − y²) ) for −√0.75 < y < √0.75,

P ( Y ≤ 0.75 | X = 0.5 ) = ∫ f_{Y|X=0.5} (y) dy over −√0.75 < y ≤ 0.75 = 1/2 + (1/π) arcsin ( 0.75 / √0.75 ) = 5/6 ≈ 0.83.

More generally,

P ( Y ≤ y | X = x ) = 1/2 + (1/π) arcsin ( y / √(1 − x²) )

for all x and y such that −1 < x < 1 (otherwise the denominator f_X (x) vanishes) and −√(1 − x²) < y < √(1 − x²) (otherwise the conditional probability degenerates to 0 or 1). One may also treat the conditional probability as a random variable, a function of the random variable X, namely

P ( Y ≤ 0.75 | X ) = 1/2 + (1/π) arcsin ( 0.75 / √(1 − X²) ) when 1 − X² > 0.75², and P ( Y ≤ 0.75 | X ) = 1 when 1 − X² ≤ 0.75².

The expectation of this random variable is equal to the (unconditional) probability,

E ( P ( Y ≤ 0.75 | X ) ) = ∫ P ( Y ≤ 0.75 | X = x ) f_X (x) dx = P ( Y ≤ 0.75 ) = 0.875,

which is an instance of the law of total probability E ( P ( A | X ) ) = P ( A ).

Interpretation

The conditional probability P ( Y ≤ 0.75 | X = 0.5 ) cannot be interpreted as P ( Y ≤ 0.75, X = 0.5 ) / P ( X = 0.5 ), since the latter gives 0/0. Accordingly, P ( Y ≤ 0.75 | X = 0.5 ) cannot be interpreted via empirical frequencies, since the exact value X = 0.5 has no chance to appear at random, not even once during an infinite sequence of independent trials.

The conditional probability can be interpreted as a limit,

P ( Y ≤ 0.75 | X = 0.5 ) = lim_{ε→0+} P ( Y ≤ 0.75 | 0.5 − ε < X < 0.5 + ε ) = lim_{ε→0+} [ P ( Y ≤ 0.75, 0.5 − ε < X < 0.5 + ε ) / P ( 0.5 − ε < X < 0.5 + ε ) ].
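
This limit is easy to approximate by simulation; a sketch assuming Python with NumPy (the shrinking tolerances ε are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.normal(size=(4_000_000, 3))
    p /= np.linalg.norm(p, axis=1, keepdims=True)   # uniform points on the unit sphere
    X, Y = p[:, 0], p[:, 1]

    # P(Y <= 0.75 | 0.5 - eps < X < 0.5 + eps) should approach 5/6 = 0.8333...
    for eps in (0.1, 0.03, 0.01):
        sel = np.abs(X - 0.5) < eps
        print(eps, (Y[sel] <= 0.75).mean())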

Conditional expectation

The conditional expectation E ( Y | X = 0.5 ) is of little interest; it vanishes just by symmetry. It is more interesting to calculate E ( |Z| | X = 0.5 ), treating |Z| as a function of X, Y, namely |Z| = √(1 − X² − Y²):

E ( |Z| | X = 0.5 ) = ∫ √(0.75 − y²) · f_{Y|X=0.5} (y) dy over −√0.75 < y < √0.75 = (2/π) √0.75 ≈ 0.55.

More generally,

E ( |Z| | X = x ) = (2/π) √(1 − x²)

for −1 < x < 1. One may also treat the conditional expectation as a random variable, a function of the random variable X, namely

E ( |Z| | X ) = (2/π) √(1 − X²).

The expectation of this random variable is equal to the (unconditional) expectation of |Z|, E ( E ( |Z| | X ) ) = E ( |Z| ), namely

∫ (2/π) √(1 − x²) · f_X (x) dx over −1 < x < 1 = (1/π) · (π/2) = 0.5,

which is an instance of the law of total expectation E ( E ( Y | X ) ) = E ( Y ).

The random variable E ( |Z| | X ) is the best predictor of |Z| given X. That is, it minimizes the mean square error E ( |Z| − f(X) )² on the class of all random variables of the form f(X). Similarly to the discrete case, E ( |Z| | g(X) ) = E ( |Z| | X ) for every measurable function g that is one-to-one on (−1, 1).
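
A similar windowed simulation checks the formula E ( |Z| | X = x ) = (2/π) √(1 − x²); again a sketch assuming Python with NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.normal(size=(4_000_000, 3))
    p /= np.linalg.norm(p, axis=1, keepdims=True)   # uniform points on the unit sphere
    X, Z = p[:, 0], p[:, 2]

    # Empirical mean of |Z| near X = x versus the closed form (2 / pi) * sqrt(1 - x**2)
    for x in (0.0, 0.5, 0.9):
        sel = np.abs(X - x) < 0.01
        print(x, np.abs(Z[sel]).mean(), 2 / np.pi * np.sqrt(1 - x ** 2))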

Conditional distribution

Given X = x, the conditional distribution of Y, given by the density f_{Y|X=x} (y), is the (rescaled) arcsin distribution; its cumulative distribution function is

F_{Y|X=x} (y) = P ( Y ≤ y | X = x ) = 1/2 + (1/π) arcsin ( y / √(1 − x²) )

for all x and y such that x² + y² < 1. The corresponding expectation of h(x, Y) is nothing but the conditional expectation E ( h(X, Y) | X = x ). The mixture of these conditional distributions, taken for all x (according to the distribution of X), is the unconditional distribution of Y. This fact amounts to the equalities

∫ f_{Y|X=x} (y) f_X (x) dx over −1 < x < 1 = f_Y (y) = 0.5 for −1 < y < 1,

∫ F_{Y|X=x} (y) f_X (x) dx over −1 < x < 1 = F_Y (y) = (1 + y) / 2 for −1 < y < 1,

the latter being the instance of the law of total probability mentioned above.
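
The mixture property can be checked by first sampling X from its marginal and then sampling Y from the arcsin conditional distribution by inverse-CDF sampling, Y = √(1 − X²) sin ( π (U − 1/2) ) with U uniform on (0, 1); a sketch assuming Python with NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    X = rng.uniform(-1, 1, n)                             # marginal distribution of X
    U = rng.uniform(0, 1, n)
    Y = np.sqrt(1 - X**2) * np.sin(np.pi * (U - 0.5))     # arcsin conditional given X

    # The mixture should be the uniform distribution of Y on (-1, 1)
    print(np.histogram(Y, bins=4, range=(-1, 1))[0] / n)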

What conditioning is not

On the discrete level, conditioning is possible only if the condition is of nonzero probability (one cannot divide by zero). On the level of densities, conditioning on X = x is possible even though P ( X = x ) = 0. This success may create the illusion that conditioning is always possible. Regrettably, it is not, for several reasons presented below.

Geometric intuition: caution

The result P ( Y ≤ 0.75 | X = 0.5 ) = 5/6, mentioned above, is geometrically evident in the following sense. The points (x, y, z) of the sphere x² + y² + z² = 1, satisfying the condition x = 0.5, are a circle y² + z² = 0.75 of radius √0.75 on the plane x = 0.5. The inequality y ≤ 0.75 holds on an arc. The length of the arc is 5/6 of the length of the circle, which is why the conditional probability is equal to 5/6.

This successful geometric explanation may create the illusion that the following question is trivial.

A point of a given sphere is chosen at random (uniformly). Given that the point lies on a given plane, what is its conditional distribution?

It may seem evident that the conditional distribution must be uniform on the given circle (the intersection of the given sphere and the given plane). Sometimes it really is, but in general it is not. In particular, Z is distributed uniformly on (−1, +1) and is independent of the ratio Y/X, thus, P ( Z ≤ 0.5 | Y/X ) = 0.75. On the other hand, the inequality z ≤ 0.5 holds on an arc of the circle x² + y² + z² = 1, y = cx (for any given c). The length of the arc is 2/3 of the length of the circle. However, the conditional probability is 3/4, not 2/3. This is a manifestation of the classical Borel paradox. [4] [5]
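
The discrepancy shows up in a simulation that conditions on the ratio Y/X rather than on arc length; a sketch assuming Python with NumPy (the values of c and the tolerance are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.normal(size=(4_000_000, 3))
    p /= np.linalg.norm(p, axis=1, keepdims=True)   # uniform points on the unit sphere
    X, Y, Z = p[:, 0], p[:, 1], p[:, 2]

    # Conditioning on Y/X close to c: the estimate approaches 0.75 for every c, not 2/3
    for c in (0.0, 1.0, 3.0):
        sel = np.abs(Y / X - c) < 0.01
        print(c, (Z[sel] <= 0.5).mean())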

Appeals to symmetry can be misleading if not formalized as invariance arguments.

Pollard [6]

Another example. A random rotation of the three-dimensional space is a rotation by a random angle around a random axis. Geometric intuition suggests that the angle is independent of the axis and distributed uniformly. However, the latter is wrong; small values of the angle are less probable.

The limiting procedure

Given an event B of zero probability, the formula P ( A | B ) = P ( A ∩ B ) / P ( B ) is useless; however, one can try P ( A | B ) = lim_{n→∞} P ( A | Bₙ ) for an appropriate sequence of events Bₙ of nonzero probability such that Bₙ ↓ B (that is, B₁ ⊇ B₂ ⊇ … and B₁ ∩ B₂ ∩ … = B). One example is given above. Two more examples are Brownian bridge and Brownian excursion.

In the latter two examples the law of total probability is irrelevant, since only a single event (the condition) is given. By contrast, in the example above the law of total probability applies, since the event X = 0.5 is included in a family of events X = x where x runs over (−1, 1), and these events form a partition of the probability space.

In order to avoid paradoxes (such as the Borel's paradox), the following important distinction should be taken into account. If a given event is of nonzero probability then conditioning on it is well-defined (irrespective of any other events), as was noted above. By contrast, if the given event is of zero probability then conditioning on it is ill-defined unless some additional input is provided. Wrong choice of this additional input leads to wrong conditional probabilities (expectations, distributions). In this sense, "the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible." (Kolmogorov [6] )

The additional input may be (a) a symmetry (invariance group); (b) a sequence of events Bₙ such that Bₙ ↓ B, P ( Bₙ ) > 0; (c) a partition containing the given event. Measure-theoretic conditioning (below) investigates Case (c), and discloses its relation to (b) in general and to (a) when applicable.

Some events of zero probability are beyond the reach of conditioning. An example: let X₁, X₂, … be independent random variables distributed uniformly on (0, 1), and B the event "Xₙ → 0 as n → ∞"; what about P ( Xₙ < 0.5 | B )? Does it tend to 1, or not? Another example: let X be a random variable distributed uniformly on (0, 1), and B the event "X is a rational number"; what about P ( X = 1/n | B )? The only answer is that, once again,

the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible.

Kolmogorov [6]

Conditioning on the level of measure theory

Example. Let Y be a random variable distributed uniformly on (0, 1), and X = f(Y) where f is a given function. Two cases are treated below: f = f1 and f = f2, where f1 is the continuous piecewise-linear function

f1 (y) = 3y for 0 ≤ y ≤ 1/3,   f1 (y) = 1.5 (1 − y) for 1/3 ≤ y ≤ 2/3,   f1 (y) = 0.5 for 2/3 ≤ y ≤ 1,

and f2 is the Weierstrass function.

Geometric intuition: caution

Given X = 0.75, two values of Y are possible, 0.25 and 0.5. It may seem evident that both values are of conditional probability 0.5 just because one point is congruent to another point. However, this is an illusion; see below.

Conditional probability

The conditional probability P ( Y ≤ 1/3 | X ) may be defined as the best predictor of the indicator

I = 1 if Y ≤ 1/3, and I = 0 otherwise,

given X. That is, it minimizes the mean square error E ( I − g(X) )² on the class of all random variables of the form g(X).

In the case f = f1 the corresponding function g = g1 may be calculated explicitly, [details 1]

g1 (x) = 1 for 0 < x < 0.5,   g1 (0.5) = 0,   g1 (x) = 1/3 for 0.5 < x < 1.

Alternatively, the limiting procedure may be used,

g1 (x) = lim_{ε→0+} P ( Y ≤ 1/3 | x − ε ≤ X ≤ x + ε ),

giving the same result.

Thus, P ( Y ≤ 1/3 | X ) = g1 (X). The expectation of this random variable is equal to the (unconditional) probability, E ( P ( Y ≤ 1/3 | X ) ) = P ( Y ≤ 1/3 ), namely

0 · P ( X = 0.5 ) + ∫ 1 · (1/3) dx over (0, 0.5) + ∫ (1/3) · 1 dx over (0.5, 1) = 1/6 + 1/6 = 1/3

(here P ( X = 0.5 ) = 1/3 is an atom of the distribution of X, and 1/3, 1 are the values of its density on (0, 0.5) and (0.5, 1) respectively),

which is an instance of the law of total probability E ( P ( A | X ) ) = P ( A ).

In the case f = f2 the corresponding function g = g2 probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically. Indeed, the space L² ( Ω ) of all square integrable random variables is a Hilbert space; the indicator I is a vector of this space; and random variables of the form g(X) are a (closed, linear) subspace. The orthogonal projection of this vector to this subspace is well-defined. It can be computed numerically, using finite-dimensional approximations to the infinite-dimensional Hilbert space.
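
A sketch of such a numerical computation, assuming Python with NumPy. A truncated Weierstrass-type sum stands in for f2 (its parameters, the number of terms, and the number of bins are illustrative choices), and the finite-dimensional subspace consists of functions that are constant on each bin of X values, so the orthogonal projection reduces to bin averages of the indicator:

    import numpy as np

    def f2(y, a=0.5, b=13, terms=16):
        # truncated Weierstrass-type sum, standing in for the Weierstrass function
        return sum(a ** k * np.cos(b ** k * np.pi * y) for k in range(terms))

    rng = np.random.default_rng(0)
    Y = rng.uniform(0, 1, 1_000_000)
    X = f2(Y)
    I = (Y <= 1 / 3).astype(float)        # indicator of the event Y <= 1/3

    # Projection onto piecewise-constant functions of X: bin averages of I approximate g2
    edges = np.linspace(X.min(), X.max(), 201)
    idx = np.clip(np.digitize(X, edges), 1, len(edges) - 1)
    counts = np.array([np.sum(idx == j) for j in range(1, len(edges))])
    g2_hat = np.array([I[idx == j].mean() if c > 0 else np.nan
                       for j, c in zip(range(1, len(edges)), counts)])

    # Sanity check of E(P(Y <= 1/3 | X)) = P(Y <= 1/3) = 1/3
    valid = counts > 0
    print(np.average(g2_hat[valid], weights=counts[valid]))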

Once again, the expectation of the random variable P ( Y ≤ 1/3 | X ) = g2 (X) is equal to the (unconditional) probability, E ( P ( Y ≤ 1/3 | X ) ) = P ( Y ≤ 1/3 ), namely

∫ g2 ( f2 (y) ) dy over (0, 1) = 1/3.

However, the Hilbert space approach treats g2 as an equivalence class of functions rather than an individual function. Measurability of g2 is ensured, but continuity (or even Riemann integrability) is not. The value g2 (0.5) is determined uniquely, since the point 0.5 is an atom of the distribution of X. Other values x are not atoms, thus, the corresponding values g2 (x) are not determined uniquely. Once again, "the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible." (Kolmogorov [6] )

Alternatively, the same function g (be it g1 or g2) may be defined as the Radon–Nikodym derivative

g = dν / dμ,

where the measures μ, ν are defined by

μ ( B ) = P ( X ∈ B ),   ν ( B ) = P ( X ∈ B, Y ≤ 1/3 )

for all Borel sets B ⊂ ℝ. That is, μ is the (unconditional) distribution of X, while ν is one third of its conditional distribution given the event Y ≤ 1/3,

ν ( B ) = P ( Y ≤ 1/3 ) · P ( X ∈ B | Y ≤ 1/3 ) = (1/3) P ( X ∈ B | Y ≤ 1/3 ).

Both approaches (via the Hilbert space, and via the Radon–Nikodym derivative) treat g as an equivalence class of functions; two functions g and g′ are treated as equivalent, if g (X) = g′ (X) almost surely. Accordingly, the conditional probability P ( Y ≤ 1/3 | X ) is treated as an equivalence class of random variables; as usual, two random variables are treated as equivalent if they are equal almost surely.

Conditional expectation

The conditional expectation E ( Y | X ) may be defined as the best predictor of Y given X. That is, it minimizes the mean square error E ( Y − h(X) )² on the class of all random variables of the form h(X).

In the case f = f1 the corresponding function h = h1 may be calculated explicitly, [details 2]

h1 (x) = x/3 for 0 < x < 0.5,   h1 (0.5) = 5/6,   h1 (x) = 2/3 − x/3 for 0.5 < x < 1.

Alternatively, the limiting procedure may be used,

h1 (x) = lim_{ε→0+} E ( Y | x − ε ≤ X ≤ x + ε ),

giving the same result.

Thus, E ( Y | X ) = h1 (X). The expectation of this random variable is equal to the (unconditional) expectation E ( Y ) = 0.5, namely

∫ (x/3) · (1/3) dx over (0, 0.5) + (5/6) · (1/3) + ∫ (2/3 − x/3) · 1 dx over (0.5, 1) = 1/72 + 5/18 + 5/24 = 1/2,

which is an instance of the law of total expectation E ( E ( Y | X ) ) = E ( Y ).

In the case f = f2 the corresponding function h = h2 probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically in the same way as g2 above, as the orthogonal projection in the Hilbert space. The law of total expectation holds, since the projection cannot change the scalar product with the constant 1, which belongs to the subspace.

Alternatively, the same function h (be it h1 or h2) may be defined as the Radon–Nikodym derivative

h = dν / dμ,

where the measures μ, ν are defined by

μ ( B ) = P ( X ∈ B ),   ν ( B ) = E ( Y; X ∈ B )

for all Borel sets B ⊂ ℝ. Here E ( Y; A ) = E ( Y · 1_A ) is the restricted expectation, not to be confused with the conditional expectation E ( Y | A ) = E ( Y; A ) / P ( A ).

Conditional distribution

In the case f = f1 the conditional cumulative distribution function may be calculated explicitly, similarly to g1. The limiting procedure gives

F_{Y|X=3/4} (y) = P ( Y ≤ y | X = 3/4 ) = lim_{ε→0+} P ( Y ≤ y | 3/4 − ε ≤ X ≤ 3/4 + ε ) = 0 for y < 1/4,   1/6 for y = 1/4,   1/3 for 1/4 < y < 1/2,   2/3 for y = 1/2,   1 for y > 1/2,

which cannot be correct, since a cumulative distribution function must be right-continuous!

This paradoxical result is explained by measure theory as follows. For a given y the corresponding F_{Y|X=x} (y) = P ( Y ≤ y | X = x ) is well-defined (via the Hilbert space or the Radon–Nikodym derivative) as an equivalence class of functions (of x). Treated as a function of y for a given x it is ill-defined unless some additional input is provided. Namely, a function (of x) must be chosen within every (or at least almost every) equivalence class. Wrong choice leads to wrong conditional cumulative distribution functions.

A right choice can be made as follows. First, F_{Y|X=x} (y) is considered for rational numbers y only. (Any other dense countable set may be used equally well.) Thus, only a countable set of equivalence classes is used; all choices of functions within these classes are mutually equivalent, and the corresponding function of rational y is well-defined (for almost every x). Second, the function is extended from rational numbers to real numbers by right continuity.

In general the conditional distribution is defined for almost all x (according to the distribution of X), but sometimes the result is continuous in x, in which case individual values are acceptable. In the considered example this is the case; the correct result for x = 0.75,

F_{Y|X=3/4} (y) = 0 for y < 1/4,   1/3 for 1/4 ≤ y < 1/2,   1 for y ≥ 1/2,

shows that the conditional distribution of Y given X = 0.75 consists of two atoms, at 0.25 and 0.5, of probabilities 1/3 and 2/3 respectively.
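
Both the limiting procedure and the corrected conditional distribution can be illustrated numerically with the piecewise-linear f1 defined above; a sketch assuming Python with NumPy (sample size and ε are illustrative):

    import numpy as np

    def f1(y):
        return np.where(y <= 1/3, 3 * y, np.where(y <= 2/3, 1.5 * (1 - y), 0.5))

    rng = np.random.default_rng(0)
    Y = rng.uniform(0, 1, 4_000_000)
    X = f1(Y)

    eps = 0.001
    sel = np.abs(X - 0.75) < eps
    # The limiting procedure near X = 0.75 gives approximately 1/3, 2/3, 1;
    # the correct conditional CDF jumps straight from 1/3 to 1 at y = 1/2 (atoms 1/3 at 0.25 and 2/3 at 0.5)
    for y in (0.49, 0.5, 0.51):
        print(y, (Y[sel] <= y).mean())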

Similarly, the conditional distribution may be calculated for all x in (0, 0.5) or (0.5, 1).

The value x = 0.5 is an atom of the distribution of X, thus, the corresponding conditional distribution is well-defined and may be calculated by elementary means (the denominator does not vanish); the conditional distribution of Y given X = 0.5 is uniform on (2/3, 1). Measure theory leads to the same result.

The mixture of all conditional distributions is the (unconditional) distribution of Y.

The conditional expectation is nothing but the expectation with respect to the conditional distribution.

In the case f = f2 the corresponding F_{Y|X=x} (y) probably cannot be calculated explicitly. For a given y it is well-defined (via the Hilbert space or the Radon–Nikodym derivative) as an equivalence class of functions (of x). The right choice of functions within these equivalence classes may be made as above; it leads to correct conditional cumulative distribution functions, thus, conditional distributions. In general, conditional distributions need not be atomic or absolutely continuous (nor mixtures of both types). Probably, in the considered example they are singular (like the Cantor distribution).

Once again, the mixture of all conditional distributions is the (unconditional) distribution, and the conditional expectation is the expectation with respect to the conditional distribution.

Technical details

  1. Proof: For 0 < x < 0.5 the only value of Y compatible with X = x is y = x/3 ≤ 1/3, hence g1 (x) = 1; for x = 0.5 the conditional distribution of Y is concentrated on (2/3, 1), hence g1 (0.5) = 0. For 0.5 < x < 1 the two preimages y = x/3 and y = 1 − 2x/3 carry weights proportional to 1 and 2 (inversely proportional to the slopes 3 and 1.5 of f1), so the mean square error contribution is proportional to (1 − a)² + 2a²;
    it remains to note that (1 − a)² + 2a² is minimal at a = 1/3.
  2. Proof: For 0 < x < 0.5 the value Y = x/3 is determined by X = x, hence h1 (x) = x/3. For the remaining cases
    it remains to note that
    ∫ (y − a)² dy over (2/3, 1) is minimal at a = 5/6, and (x/3 − a)² + 2 (1 − 2x/3 − a)² is minimal at a = 2/3 − x/3.

Notes

  1. "Mathematica/Uniform Spherical Distribution - Wikibooks, open books for an open world". en.wikibooks.org. Retrieved 2018-10-27.
  2. Buchanan, K.; Huff, G. H. (July 2011). "A comparison of geometrically bound random arrays in euclidean space". 2011 IEEE International Symposium on Antennas and Propagation (APSURSI). pp. 2008–2011. doi:10.1109/APS.2011.5996900. ISBN   978-1-4244-9563-4. S2CID   10446533.
  3. Buchanan, K.; Flores, C.; Wheeland, S.; Jensen, J.; Grayson, D.; Huff, G. (May 2017). "Transmit beamforming for radar applications using circularly tapered random arrays". 2017 IEEE Radar Conference (RadarConf). pp. 0112–0117. doi:10.1109/RADAR.2017.7944181. ISBN   978-1-4673-8823-8. S2CID   38429370.
  4. Pollard 2002, Sect. 5.5, Example 17 on page 122.
  5. Durrett 1996, Sect. 4.1(a), Example 1.6 on page 224.
  6. 1 2 3 4 Pollard 2002, Sect. 5.5, page 122.

References

  Durrett, Richard (1996). Probability: Theory and Examples (Second ed.).
  Pollard, David (2002). A User's Guide to Measure Theoretic Probability. Cambridge University Press.