Law of the unconscious statistician

Last updated December 27, 2024

In probability theory and statistics, the law of the unconscious statistician, or LOTUS, is a theorem which expresses the expected value of a function $g(X)$ of a random variable $X$ in terms of $g$ and the probability distribution of $X$.

The form of the law depends on the type of random variable $X$ in question. If the distribution of $X$ is discrete and one knows its probability mass function $p_X$, then the expected value of $g(X)$ is $\operatorname{E}[g(X)]=\sum_{x}g(x)p_{X}(x),$ where the sum is over all possible values $x$ of $X$. If instead the distribution of $X$ is continuous with probability density function $f_X$, then the expected value of $g(X)$ is $\operatorname{E}[g(X)]=\int_{-\infty}^{\infty}g(x)f_{X}(x)\,\mathrm{d}x.$
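As a minimal illustration of the discrete formula, the following sketch assumes a fair six-sided die and the function $g(x) = (x-3)^2$; both are chosen here for illustration and do not come from the text above. The helper name `lotus_discrete` is likewise hypothetical. The point is that the expected value is computed directly from the mass function of $X$, without first working out the distribution of $g(X)$.

```python
from fractions import Fraction


def lotus_discrete(g, pmf):
    """Return sum over x of g(x) * p_X(x), for a pmf given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


# Assumed example: a fair six-sided die.
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def g(x):
    # Assumed function of the outcome.
    return (x - 3) ** 2


expected = lotus_discrete(g, die_pmf)
print(expected)  # 19/6
```

Exact rational arithmetic is used so that the result can be compared with a hand computation: the values of $g$ on 1, ..., 6 are 4, 1, 0, 1, 4, 9, which sum to 19.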

Both of these special cases can be expressed in terms of the cumulative probability distribution function $F_X$ of $X$, with the expected value of $g(X)$ now given by the Lebesgue–Stieltjes integral $\operatorname{E}[g(X)]=\int_{-\infty}^{\infty}g(x)\,\mathrm{d}F_{X}(x).$
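The Stieltjes form is convenient for distributions that are neither discrete nor continuous. As a worked sketch, assume (purely for illustration) that $T$ is exponentially distributed with rate 1 and that $X = \min(T, 1)$, so that $F_X$ has density $e^{-x}$ on $(0,1)$ and a jump of size $e^{-1}$ at $x = 1$. The integral then splits into a density part and an atom:

```latex
\operatorname{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\,\mathrm{d}F_X(x)
  = \int_0^1 g(x)\,e^{-x}\,\mathrm{d}x + g(1)\,e^{-1}.
% With the assumed choice g(x) = x:
\operatorname{E}[X] = \bigl(1 - 2e^{-1}\bigr) + e^{-1} = 1 - e^{-1}.
```

This agrees with the direct computation $\operatorname{E}[X] = \int_0^1 P(T > t)\,\mathrm{d}t = 1 - e^{-1}$.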

In even greater generality, $X$ could be a random element in any measurable space, in which case the law is given in terms of measure theory and the Lebesgue integral. In this setting, there is no need to restrict the context to probability measures, and the law becomes a general theorem of mathematical analysis on Lebesgue integration relative to a pushforward measure.

Etymology

This proposition is (sometimes) known as the law of the unconscious statistician because of a purported tendency to think of the aforementioned law as the very definition of the expected value of a function $g(X)$ of a random variable $X$, rather than (more formally) as a consequence of the true definition of expected value.^[1] The naming is sometimes attributed to Sheldon Ross's textbook Introduction to Probability Models, although he removed the reference in later editions.^[2] Many statistics textbooks do present the result as the definition of expected value.^[3]

Joint distributions

A similar property holds for joint distributions, or equivalently, for random vectors. For discrete random variables $X$ and $Y$, a function of two variables $g$, and joint probability mass function $p_{X,Y}(x,y)$:^[4] $\operatorname{E}[g(X,Y)]=\sum_{y}\sum_{x}g(x,y)p_{X,Y}(x,y).$ In the absolutely continuous case, with $f_{X,Y}(x,y)$ being the joint probability density function, $\operatorname{E}[g(X,Y)]=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}g(x,y)f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y.$
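The discrete joint formula can be sketched in the same way. The example below assumes two independent fair dice and takes $g(x,y) = \max(x,y)$; the dice, the choice of $g$, and the helper name `lotus_joint` are all illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Assumed joint pmf: two independent fair dice, each pair has probability 1/36.
joint_pmf = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}


def lotus_joint(g, pmf):
    """Return the double sum of g(x, y) * p_{X,Y}(x, y) over all pairs (x, y)."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())


# E[max(X, Y)] computed directly from the joint pmf.
expected_max = lotus_joint(max, joint_pmf)
print(expected_max)  # 161/36
```

The same value follows from the distribution of the maximum itself, since $P(\max(X,Y)=k) = (2k-1)/36$ for two fair dice.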

Special cases

A number of special cases are given here. In the simplest case, where the random variable $X$ takes on countably many values (so that its distribution is discrete), the proof is particularly simple, and holds without modification if $X$ is a discrete random vector or even a discrete random element.

The case of a continuous random variable is more delicate, since the proof in full generality requires subtle forms of the change-of-variables formula for integration. However, in the framework of measure theory, the discrete case generalizes straightforwardly to general (not necessarily discrete) random elements, and the case of a continuous random variable is then a special case by making use of the Radon–Nikodym theorem.

Discrete case

Suppose that $X$ is a random variable which takes on only finitely or countably many different values $x_1, x_2, \ldots$, with probabilities $p_1, p_2, \ldots$. Then for any function $g$ of these values, the random variable $g(X)$ has values $g(x_1), g(x_2), \ldots$, although some of these may coincide with each other. For example, this is the case if $X$ can take on both values $1$ and $-1$ and $g(x) = x^2$.

Let $y_1, y_2, \ldots$ enumerate the possible distinct values of $g(X)$, and for each $i$ let $I_i$ denote the collection of all $j$ with $g(x_j) = y_i$. Then, according to the definition of expected value, one has $\operatorname{E}[g(X)]=\sum_{i}y_{i}p_{g(X)}(y_{i}).$

Since a $y_{i}$ can be the image of multiple, distinct $x_{j}$ , it holds that $p_{g(X)}(y_{i})=\sum _{j\in I_{i}}p_{X}(x_{j}).$

Then the expected value can be rewritten as $\sum_{i}y_{i}p_{g(X)}(y_{i})=\sum_{i}y_{i}\sum_{j\in I_{i}}p_{X}(x_{j})=\sum_{i}\sum_{j\in I_{i}}g(x_{j})p_{X}(x_{j})=\sum_{x}g(x)p_{X}(x).$ This equality relates the average of the outputs of $g(X)$ as weighted by the probabilities of the outputs themselves to the average of the outputs of $g(X)$ as weighted by the probabilities of the outputs of $X$.

If $X$ takes on only finitely many possible values, the above is fully rigorous. However, if $X$ takes on countably many values, the last equality given does not always hold, as seen by the Riemann series theorem. Because of this, it is necessary to assume the absolute convergence of the sums in question.^[5]
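The grouping step in the argument above can be checked on a small finite example. The sketch below assumes that $X$ is uniform on $\{-2,-1,0,1,2\}$ and that $g(x) = x^2$ (illustrative choices, not taken from the text). It first builds the mass function of $g(X)$ by summing $p_X$ over each index set, then compares the two sides of the identity.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed example: X uniform on five points, g(x) = x^2 (not injective).
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}


def g(x):
    return x * x


# p_{g(X)}(y) is the sum of p_X(x) over all x with g(x) = y.
pmf_gx = defaultdict(Fraction)
for x, p in pmf_x.items():
    pmf_gx[g(x)] += p

# Left side: definition of expected value applied to the random variable g(X).
lhs = sum(y * p for y, p in pmf_gx.items())
# Right side: the law of the unconscious statistician.
rhs = sum(g(x) * p for x, p in pmf_x.items())

print(dict(pmf_gx))  # {4: Fraction(2, 5), 1: Fraction(2, 5), 0: Fraction(1, 5)}
print(lhs, rhs)      # 2 2
```

Because the example is finite, no convergence question arises; the rearrangement of terms is exactly the regrouping of the sum over $x$ by the value of $g(x)$.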

Continuous case

Suppose that $X$ is a random variable whose distribution has a continuous density $f$. If $g$ is a general function, then the probability that $g(X)$ is valued in a set of real numbers $K$ equals the probability that $X$ is valued in $g^{-1}(K)$, which is given by $\int_{g^{-1}(K)}f(x)\,\mathrm{d}x.$ Under various conditions on $g$, the change-of-variables formula for integration can be applied to relate this to an integral over $K$, and hence to identify the density of $g(X)$ in terms of the density of $X$. In the simplest case, if $g$ is differentiable with nowhere-vanishing derivative, then for $K$ contained in the range of $g$ the above integral can be written as $\int_{K}f(g^{-1}(y))\,|(g^{-1})'(y)|\,\mathrm{d}y,$ thereby identifying $g(X)$ as possessing the density $f(g^{-1}(y))\,|(g^{-1})'(y)|$ on the range of $g$ and zero elsewhere; the absolute value is needed only when $g$ is decreasing. The expected value of $g(X)$ is then identified as $\int_{-\infty}^{\infty}yf(g^{-1}(y))\,|(g^{-1})'(y)|\,\mathrm{d}y=\int_{-\infty}^{\infty}g(x)f(x)\,\mathrm{d}x,$ where the equality follows by another use of the change-of-variables formula for integration. This shows that the expected value of $g(X)$ is encoded entirely by the function $g$ and the density $f$ of $X$.^[6]
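The two integrals in this argument can be compared numerically. The sketch below makes several assumptions that are not in the text: $X$ is taken to be normal with mean 1 and variance 1, $g$ is the arctangent (differentiable, with derivative $1/(1+x^2)$ that never vanishes), and the integrals are approximated by a plain midpoint rule on a truncated interval. The helper name `midpoint_integral`, the number of subintervals, and the truncation bounds are arbitrary illustrative choices.

```python
import math


def midpoint_integral(h, a, b, n=100_000):
    """Approximate the integral of h over [a, b] with the midpoint rule."""
    step = (b - a) / n
    return step * sum(h(a + (i + 0.5) * step) for i in range(n))


def f(x):
    """Assumed density of X: normal with mean 1 and variance 1."""
    return math.exp(-0.5 * (x - 1.0) ** 2) / math.sqrt(2.0 * math.pi)


g = math.atan      # increasing, so no absolute value is needed below
g_inv = math.tan   # inverse of g on the range (-pi/2, pi/2)


def g_inv_prime(y):
    """Derivative of tan, written as 1 / cos(y)^2."""
    return 1.0 / math.cos(y) ** 2


def density_of_gx(y):
    """Density of g(X) on the range of g, as given by the change of variables."""
    return f(g_inv(y)) * g_inv_prime(y)


half_pi = math.pi / 2
# Expected value computed from the density of g(X).
via_density_of_gx = midpoint_integral(lambda y: y * density_of_gx(y), -half_pi, half_pi)
# Expected value computed from g and the density of X (truncated at 10 standard deviations).
via_lotus = midpoint_integral(lambda x: g(x) * f(x), -9.0, 11.0)

print(abs(via_density_of_gx - via_lotus) < 1e-6)  # True
```

The agreement is only up to quadrature error, so the comparison uses a tolerance rather than exact equality.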

The assumption that $g$ is differentiable with nonvanishing derivative, which is necessary for applying the usual change-of-variables formula, excludes many typical cases, such as $g(x) = x^2$. The result still holds true in these broader settings, although the proof requires more sophisticated results from mathematical analysis such as Sard's theorem and the coarea formula. In even greater generality, using the Lebesgue theory as below, it can be found that the identity $\operatorname{E}[g(X)]=\int_{-\infty}^{\infty}g(x)f(x)\,\mathrm{d}x$ holds true whenever $X$ has a density $f$ (which does not have to be continuous) and whenever $g$ is a measurable function for which $g(X)$ has finite expected value. (Every continuous function is measurable.) Furthermore, without modification to the proof, this holds even if $X$ is a random vector (with density) and $g$ is a multivariable function; the integral is then taken over the multi-dimensional range of values of $X$.
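For the specific excluded case $g(x) = x^2$, the identity can still be verified by hand by treating the two branches of the square root separately. The following worked sketch assumes that $f$ is continuous and writes $Y = X^2$; the symbols $Y$ and $f_Y$ are introduced here for illustration.

```latex
% Density of Y = X^2 for y > 0, from P(Y \le y) = P(-\sqrt{y} \le X \le \sqrt{y}):
f_Y(y) = \frac{f(\sqrt{y}) + f(-\sqrt{y})}{2\sqrt{y}}, \qquad y > 0.
% Substitute y = x^2 on each branch (x = \sqrt{y} and x = -\sqrt{y}):
\int_0^\infty y\,f_Y(y)\,\mathrm{d}y
  = \int_0^\infty x^2 f(x)\,\mathrm{d}x + \int_{-\infty}^0 x^2 f(x)\,\mathrm{d}x
  = \int_{-\infty}^\infty x^2 f(x)\,\mathrm{d}x.
```

Each branch is a region on which $g$ is differentiable with nonvanishing derivative, so the simple change-of-variables argument applies there; the single point $x = 0$ where the derivative vanishes does not contribute to the integral.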

Measure-theoretic formulation

An abstract and general form of the result is available using the framework of measure theory and the Lebesgue integral. Here, the setting is that of a measure space $(\Omega, \mu)$ and a measurable map $X$ from $\Omega$ to a measurable space $\Omega'$. The theorem then says that for any measurable function $g$ on $\Omega'$ which is valued in real numbers (or even the extended real number line), one has $\int_{\Omega}g\circ X\,\mathrm{d}\mu=\int_{\Omega'}g\,\mathrm{d}(X_{\sharp}\mu)$ (interpreted as saying, in particular, that either side of the equality exists if the other side exists). Here $X_{\sharp}\mu$ denotes the pushforward measure on $\Omega'$. The 'discrete case' given above is the special case arising when $X$ takes on only countably many values and $\mu$ is a probability measure. In fact, the discrete case (although without the restriction to probability measures) is the first step in proving the general measure-theoretic formulation, as the general version follows therefrom by an application of the monotone convergence theorem.^[7] Without any major changes, the result can also be formulated in the setting of outer measures.^[8]
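On a finite set the pushforward identity reduces to a regrouping of a finite sum, which makes it easy to check directly. The sketch below assumes a four-point space with a measure of total mass 6, to emphasize that no normalization to a probability measure is needed; the point labels, the weights, the map `X`, the function `g`, and the helper name `pushforward` are all hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed finite measure space: four points with total mass 6 (not a probability measure).
mu = {"a": Fraction(1), "b": Fraction(2), "c": Fraction(1, 2), "d": Fraction(5, 2)}
# Assumed measurable map X from Omega = {a, b, c, d} to Omega' = {0, 1, 2}.
X = {"a": 0, "b": 1, "c": 1, "d": 2}


def g(t):
    # Assumed real-valued function on Omega'.
    return 3 * t - 1


def pushforward(X, mu):
    """Return the pushforward measure: the mass of t is mu of the preimage of {t}."""
    nu = defaultdict(Fraction)
    for omega, mass in mu.items():
        nu[X[omega]] += mass
    return dict(nu)


X_sharp_mu = pushforward(X, mu)

# Integral of g composed with X against mu, over Omega.
lhs = sum(g(X[omega]) * mass for omega, mass in mu.items())
# Integral of g against the pushforward measure, over Omega'.
rhs = sum(g(t) * mass for t, mass in X_sharp_mu.items())

print(lhs, rhs)  # 33/2 33/2
```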

If $\mu$ is a σ-finite measure, the theory of the Radon–Nikodym derivative is applicable. In the special case that the measure $X_{\sharp}\mu$ is absolutely continuous relative to some background σ-finite measure $\nu$ on $\Omega'$, there is a real-valued function $f_X$ on $\Omega'$ representing the Radon–Nikodym derivative of the two measures, and then $\int_{\Omega'}g\,\mathrm{d}(X_{\sharp}\mu)=\int_{\Omega'}gf_{X}\,\mathrm{d}\nu.$ In the further special case that $\Omega'$ is the real number line, as in the contexts discussed above, it is natural to take $\nu$ to be the Lebesgue measure, and this then recovers the 'continuous case' given above whenever $\mu$ is a probability measure. (In this special case, the condition of σ-finiteness is vacuous, since Lebesgue measure and every probability measure are trivially σ-finite.)^[9]
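The discrete formula can be read off from the same statement. As a sketch, assume that $\Omega'$ is countable with every subset measurable, that $\mu$ is a probability measure, and that $\nu$ is taken to be the counting measure on $\Omega'$ (a choice made here for illustration). Every measure on such a space is absolutely continuous relative to counting measure, and the Radon–Nikodym derivative is the probability mass function:

```latex
f_X(x) = \frac{\mathrm{d}(X_\sharp\mu)}{\mathrm{d}\nu}(x)
       = \mu\bigl(X^{-1}(\{x\})\bigr) = p_X(x),
\qquad
\int_{\Omega'} g\,\mathrm{d}(X_\sharp\mu)
  = \int_{\Omega'} g f_X\,\mathrm{d}\nu
  = \sum_{x \in \Omega'} g(x)\,p_X(x).
```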

References

[1] DeGroot & Schervish 2014, pp. 213–214.
[2] Casella & Berger 2001, Section 2.2; Ross 2019.
[3] Casella & Berger 2001, Section 2.2.
[4] Ross 2019.
[5] Feller 1968, Section IX.2.
[6] Papoulis & Pillai 2002, Chapter 5.
[7] Bogachev 2007, Section 3.6; Cohn 2013, Section 2.6; Halmos 1950, Section 39.
[8] Federer 1969, Section 2.4.
[9] Halmos 1950, Section 39.

Bogachev, V. I. (2007). Measure theory. Volume I. Berlin: Springer-Verlag. doi:10.1007/978-3-540-34514-5. ISBN 978-3-540-34513-8. MR 2267655. Zbl 1120.28001.
Casella, George; Berger, Roger L. (2001). Statistical inference. Duxbury Advanced Series (Second edition of 1990 original ed.). Pacific Grove, CA: Duxbury. ISBN 0-534-11958-1. Zbl 0699.62001.
Cohn, Donald L. (2013). Measure theory. Birkhäuser Advanced Texts: Basler Lehrbücher (Second edition of 1980 original ed.). New York: Birkhäuser/Springer. doi:10.1007/978-1-4614-6956-8. ISBN 978-1-4614-6955-1. MR 3098996. Zbl 1292.28002.
DeGroot, Morris H.; Schervish, Mark J. (2014). Probability and statistics (Fourth edition of 1975 original ed.). Pearson Education. ISBN 0-321-50046-6. MR 0373075. Zbl 0619.62001.
Federer, Herbert (1969). Geometric measure theory. Die Grundlehren der mathematischen Wissenschaften. Vol. 153. Berlin–Heidelberg–New York: Springer-Verlag. doi:10.1007/978-3-642-62010-2. ISBN 978-3-540-60656-7. MR 0257325. Zbl 0176.00801.
Feller, William (1968). An introduction to probability theory and its applications. Volume I (Third edition of 1950 original ed.). New York–London–Sydney: John Wiley & Sons, Inc. MR 0228020. Zbl 0155.23101.
Halmos, Paul R. (1950). Measure theory. New York: D. Van Nostrand Co., Inc. doi:10.1007/978-1-4684-9440-2. MR 0033869. Zbl 0040.16802.
Papoulis, Athanasios; Pillai, S. Unnikrishna (2002). Probability, random variables, and stochastic processes (Fourth edition of 1965 original ed.). New York: McGraw-Hill. ISBN 0-07-366011-6.
Ross, Sheldon M. (2019). Introduction to probability models (Twelfth edition of 1972 original ed.). London: Academic Press. doi:10.1016/C2017-0-01324-1. ISBN 978-0-12-814346-9. MR 3931305. Zbl 1408.60002.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
