Law of total variance

In probability theory, the law of total variance [1] or variance decomposition formula or conditional variance formulas or law of iterated variances, also known as Eve's law, [2] states that if X and Y are random variables on the same probability space, and the variance of Y is finite, then

Var(Y) = E[Var(Y | X)] + Var(E[Y | X]).

In language perhaps better known to statisticians than to probability theorists, the two terms are the "unexplained" and the "explained" components of the variance respectively (cf. fraction of variance unexplained, explained variation). In actuarial science, specifically credibility theory, the first component is called the expected value of the process variance (EVPV) and the second is called the variance of the hypothetical means (VHM). [3] These two components are also the source of the term "Eve's law", from the initials EV VE for "expectation of variance" and "variance of expectation".

Explanation

To understand the formula above, we need to comprehend the random variables E[Y | X] and Var(Y | X). Both depend on the value of X: for a given value x, E[Y | X = x] and Var(Y | X = x) are constant numbers. Essentially, we use the possible values of X to group the outcomes and then compute the expected values and variances for each group.

The "unexplained" component is simply the average of all the variances of within each group. The "explained" component is the variance of the expected values, i.e., it represents the part of the variance that is explained by the variation of the average value of for each group.

[Figure: Weight of dogs by breed]

For an illustration, consider the example of a dog show (a selected excerpt of the example in the article Analysis of variance). Let the random variable Y correspond to the dog's weight and X to the breed. In this situation, it is reasonable to expect that the breed explains a major portion of the variance in weight, since there is a big variance in the breeds' average weights. Of course, there is still some variance in weight within each breed, which is accounted for in the "unexplained" term.

Note that the "explained" term actually means "explained by the averages." If variances for each fixed (e.g., for each breed in the example above) are very distinct, those variances are still combined in the "unexplained" term.

Examples


Example 1

Five graduate students take an exam that is graded from 0 to 100. Let Y denote a student's grade and X indicate whether the student is international or domestic. The data are summarized as follows:

Student   Grade (Y)   Status (X)
1         20          International
2         30          International
3         100         International
4         40          Domestic
5         60          Domestic

Among international students, the mean is (20 + 30 + 100)/3 = 50 and the variance is ((20 − 50)² + (30 − 50)² + (100 − 50)²)/3 = 3800/3 ≈ 1266.67.

Among domestic students, the mean is (40 + 60)/2 = 50 and the variance is ((40 − 50)² + (60 − 50)²)/2 = 100.

Group           Pr(X = group)   Mean E[Y | X]   Variance Var(Y | X)
International   3/5             50              3800/3 ≈ 1266.67
Domestic        2/5             50              100

The part of the variance of Y "unexplained" by X is the mean of the conditional variances, weighted by the group probabilities. In this case, it is

E[Var(Y | X)] = (3/5)(3800/3) + (2/5)(100) = 760 + 40 = 800.

The part of the variance of Y "explained" by X is the variance of the means of Y inside each group defined by the values of X. In this case, it is zero, since the mean is 50 for each group, so Var(E[Y | X]) = 0. So the total variance is

Var(Y) = E[Var(Y | X)] + Var(E[Y | X]) = 800 + 0 = 800,

which agrees with the variance of the five grades computed directly: ((20 − 50)² + (30 − 50)² + (100 − 50)² + (40 − 50)² + (60 − 50)²)/5 = 4000/5 = 800.
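The same numbers can be checked mechanically; below is a minimal Python sketch using only the five grades above (the variable names are mine, not from the text):

    import numpy as np

    # The five grades (Y) and groups (X) from the table above.
    grades = np.array([20.0, 30.0, 100.0, 40.0, 60.0])
    groups = np.array(["intl", "intl", "intl", "dom", "dom"])

    labels = np.unique(groups)
    probs = np.array([np.mean(groups == g) for g in labels])           # Pr(X = g)
    means = np.array([grades[groups == g].mean() for g in labels])     # E[Y | X = g]
    variances = np.array([grades[groups == g].var() for g in labels])  # Var(Y | X = g)

    unexplained = np.sum(probs * variances)                   # E[Var(Y | X)] = 800
    explained = np.sum(probs * (means - grades.mean()) ** 2)  # Var(E[Y | X]) = 0

    print(grades.var(), unexplained + explained)   # both equal 800.0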

Example 2

Suppose X is a coin flip with the probability of heads being h. Suppose that when X = heads then Y is drawn from a normal distribution with mean μh and standard deviation σh, and that when X = tails then Y is drawn from a normal distribution with mean μt and standard deviation σt. Then the first, "unexplained" term on the right-hand side of the above formula is the weighted average of the variances, hσh² + (1 − h)σt², and the second, "explained" term is the variance of the distribution that gives μh with probability h and gives μt with probability 1 − h, namely h(1 − h)(μh − μt)².
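As a numerical sanity check, one can simulate this two-stage experiment and compare both sides of the decomposition; a minimal NumPy sketch, with arbitrary parameter values chosen for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    h, mu_h, sigma_h, mu_t, sigma_t = 0.3, 2.0, 1.0, -1.0, 0.5  # arbitrary example values
    n = 1_000_000

    # X: coin flip (True = heads); Y: normal draw whose parameters depend on X.
    heads = rng.random(n) < h
    y = np.where(heads,
                 rng.normal(mu_h, sigma_h, n),
                 rng.normal(mu_t, sigma_t, n))

    # Both terms of the law of total variance, in closed form.
    unexplained = h * sigma_h**2 + (1 - h) * sigma_t**2   # E[Var(Y | X)]
    explained = h * (1 - h) * (mu_h - mu_t)**2            # Var(E[Y | X])

    print(y.var())                  # simulated Var(Y)
    print(unexplained + explained)  # theoretical value; should be close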

Formulation

There is a general variance decomposition formula for c ≥ 2 components (see below). [4] For example, with two conditioning random variables:

Var(Y) = E[Var(Y | X1, X2)] + E[Var(E[Y | X1, X2] | X1)] + Var(E[Y | X1]),

which follows from the law of total conditional variance: [4]

Var(Y | X1) = E[Var(Y | X1, X2) | X1] + Var(E[Y | X1, X2] | X1).
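For a concrete check of the two-variable decomposition, the following sketch (synthetic data; the particular functional form of Y is an arbitrary choice) estimates each of the three terms by grouping on the discrete conditioning variables:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000

    x1 = rng.integers(0, 2, n)                       # X1 takes values in {0, 1}
    x2 = rng.integers(0, 3, n)                       # X2 takes values in {0, 1, 2}
    y = 2.0 * x1 + x2**2 + rng.normal(0.0, 1.0, n)   # Y depends on both, plus noise

    # Conditional mean and variance of Y given (X1, X2), stored per observation.
    e_y_12 = np.empty(n)
    v_y_12 = np.empty(n)
    for a in (0, 1):
        for b in (0, 1, 2):
            m = (x1 == a) & (x2 == b)
            e_y_12[m] = y[m].mean()
            v_y_12[m] = y[m].var()

    # E[Y | X1] and Var(E[Y | X1, X2] | X1), again stored per observation.
    e_y_1 = np.empty(n)
    v_e_1 = np.empty(n)
    for a in (0, 1):
        m = x1 == a
        e_y_1[m] = y[m].mean()
        v_e_1[m] = e_y_12[m].var()

    term1 = v_y_12.mean()   # E[Var(Y | X1, X2)]
    term2 = v_e_1.mean()    # E[Var(E[Y | X1, X2] | X1)]
    term3 = e_y_1.var()     # Var(E[Y | X1])

    print(y.var(), term1 + term2 + term3)   # equal up to floating-point error

Because the decomposition is an identity, the two printed numbers agree exactly for the empirical distribution, not merely up to sampling error.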

Note that the conditional expected value E[Y | X] is a random variable in its own right, whose value depends on the value of X. Notice that the conditional expected value of Y given the event X = x is a function of x (this is where adherence to the conventional and rigidly case-sensitive notation of probability theory becomes important!). If we write E[Y | X = x] = g(x), then the random variable E[Y | X] is just g(X). Similar comments apply to the conditional variance.

One special case (similar to the law of total expectation) states that if A1, A2, ..., An is a partition of the whole outcome space, that is, these events are mutually exclusive and exhaustive, then

Var(Y) = Σ_{i=1}^{n} Var(Y | Ai) Pr(Ai) + Σ_{i=1}^{n} (E[Y | Ai])² Pr(Ai) − ( Σ_{i=1}^{n} E[Y | Ai] Pr(Ai) )².

In this formula, the first component is the expectation of the conditional variance; the other two components together make up the variance of the conditional expectation.
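Plugging in the numbers from Example 1 (with A1 = {international} and A2 = {domestic}) gives a quick check of this special case; a small Python sketch:

    # Numbers taken from Example 1 above.
    pr   = [3/5, 2/5]          # Pr(Ai)
    mean = [50.0, 50.0]        # E[Y | Ai]
    var  = [3800/3, 100.0]     # Var(Y | Ai)

    first  = sum(v * p for v, p in zip(var, pr))        # expectation of the conditional variance
    second = sum(m**2 * p for m, p in zip(mean, pr))    # sum of E[Y | Ai]^2 Pr(Ai)
    third  = sum(m * p for m, p in zip(mean, pr)) ** 2  # squared overall mean

    print(first + second - third)   # ~800, matching the total variance found in Example 1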

Proof

Finite case

Let y_1, ..., y_n be the observed values of Y, with repetitions, grouped according to the corresponding value x of X, and let n_x be the number of observations in the group with X = x.

Set ȳ = (1/n) Σ_i y_i, the overall mean, and, for each possible value x of X, set ȳ_x = (1/n_x) Σ_{i: X_i = x} y_i, the mean within the group with X = x.

Note that, writing x for the group of observation i,

(y_i − ȳ)² = ((y_i − ȳ_x) + (ȳ_x − ȳ))² = (y_i − ȳ_x)² + 2 (y_i − ȳ_x)(ȳ_x − ȳ) + (ȳ_x − ȳ)².

Summing these for i = 1, ..., n, the cross terms cancel within each group (the deviations y_i − ȳ_x sum to zero there), and the last term becomes

Σ_x n_x (ȳ_x − ȳ)².

Hence,

(1/n) Σ_i (y_i − ȳ)² = Σ_x (n_x / n) [ (1/n_x) Σ_{i: X_i = x} (y_i − ȳ_x)² ] + Σ_x (n_x / n) (ȳ_x − ȳ)²,

which is exactly Var(Y) = E[Var(Y | X)] + Var(E[Y | X]) for the empirical distribution of the n observations.

General case

The law of total variance can be proved using the law of total expectation. [5] First,

Var(Y) = E[Y²] − (E[Y])²

from the definition of variance. Applying the law of total expectation to E[Y²] and to E[Y], we have

Var(Y) = E[ E[Y² | X] ] − ( E[ E[Y | X] ] )².

Now we rewrite the conditional second moment of Y in terms of its conditional variance and conditional mean, E[Y² | X] = Var(Y | X) + (E[Y | X])², and substitute on the right-hand side:

Var(Y) = E[ Var(Y | X) + (E[Y | X])² ] − ( E[ E[Y | X] ] )².

Since the expectation of a sum is the sum of expectations, the terms can now be regrouped:

Var(Y) = E[ Var(Y | X) ] + ( E[ (E[Y | X])² ] − ( E[ E[Y | X] ] )² ).

Finally, we recognize the terms in the second set of parentheses as the variance of the conditional expectation E[Y | X]:

Var(Y) = E[ Var(Y | X) ] + Var( E[Y | X] ).

General variance decomposition applicable to dynamic systems

The following formula shows how to apply the general, measure-theoretic variance decomposition formula [4] to stochastic dynamic systems. Let S_t be the value of a system variable at time t. Suppose we have the internal histories (natural filtrations) H_{1t}, H_{2t}, ..., H_{(c−1)t}, each one corresponding to the history (trajectory) of a different collection of system variables. The collections need not be disjoint. The variance of S_t can be decomposed, for all times t, into c ≥ 2 components as follows:

Var(S_t) = E[Var(S_t | H_{1t}, H_{2t}, ..., H_{(c−1)t})] + Σ_{j=2}^{c−1} E[ Var( E[S_t | H_{1t}, ..., H_{jt}] | H_{1t}, ..., H_{(j−1)t} ) ] + Var( E[S_t | H_{1t}] ).

The decomposition is not unique. It depends on the order of the conditioning in the sequential decomposition.

The square of the correlation and explained (or informational) variation

In cases where (Y, X) are such that the conditional expected value is linear; that is, in cases where

E[Y | X] = a X + b,

it follows from the bilinearity of covariance that

a = Cov(X, Y) / Var(X)

and

b = E[Y] − a E[X],

and the explained component of the variance divided by the total variance is just the square of the correlation between X and Y; that is, in such cases,

Var(E[Y | X]) / Var(Y) = Corr(X, Y)².

One example of this situation is when (X, Y) have a bivariate normal (Gaussian) distribution.
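A short simulation illustrates this for the standardized bivariate normal case, where E[Y | X] = ρX; the sketch below (NumPy, with an arbitrary ρ) compares the explained share of the variance with the squared correlation:

    import numpy as np

    rng = np.random.default_rng(2)
    rho, n = 0.6, 500_000   # arbitrary correlation and sample size

    # Standardized bivariate normal (zero means, unit variances, correlation rho).
    x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n).T

    # Here E[Y | X] = rho * X, so the explained share of Var(Y) should equal rho^2.
    explained_share = np.var(rho * x) / np.var(y)
    corr_squared = np.corrcoef(x, y)[0, 1] ** 2

    print(explained_share, corr_squared, rho ** 2)   # all approximately equal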

More generally, when the conditional expectation E[Y | X] is a non-linear function of X, [4]

ι = Var(E[Y | X]) / Var(Y) = Corr(E[Y | X], Y)²,

which can be estimated as the R² (coefficient of determination) from a non-linear regression of Y on X, using data drawn from the joint distribution of (X, Y). When E[Y | X] has a Gaussian distribution (and is an invertible function of X), or Y itself has a (marginal) Gaussian distribution, this explained component of variation sets a lower bound on the mutual information: [4]

I(Y; X) ≥ −(1/2) ln(1 − ι).
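As an illustration, the following sketch uses synthetic data with a known non-linear conditional mean and a polynomial fit standing in for the non-linear regression (both choices are mine, for illustration only); its R² approximately recovers the explained share of the variance:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200_000

    x = rng.normal(0.0, 1.0, n)
    y = np.sin(x) + rng.normal(0.0, 0.5, n)   # E[Y | X] = sin(X), a non-linear function of X

    # Explained share computed directly from the known conditional mean.
    iota = np.var(np.sin(x)) / np.var(y)

    # R^2 of a (here polynomial) non-linear regression of Y on X.
    coeffs = np.polyfit(x, y, deg=5)
    residuals = y - np.polyval(coeffs, x)
    r_squared = 1.0 - np.var(residuals) / np.var(y)

    print(iota, r_squared)   # the R^2 approximately recovers the explained share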

Higher moments

A similar law for the third central moment μ3 says

μ3(Y) = E[μ3(Y | X)] + μ3(E[Y | X]) + 3 Cov(E[Y | X], Var(Y | X)).
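A Monte Carlo check of this identity, using an arbitrary two-component mixture chosen purely for illustration (Y | X = 1 exponential, Y | X = 0 normal), might look like:

    import numpy as np

    rng = np.random.default_rng(4)
    p, n = 0.3, 2_000_000

    # X ~ Bernoulli(p); Y | X=1 ~ Exponential(1), Y | X=0 ~ Normal(0, 2).
    x = rng.random(n) < p
    y = np.where(x, rng.exponential(1.0, n), rng.normal(0.0, 2.0, n))

    # Closed-form conditional quantities:
    #   E[Y | X] = X (1 if X = 1, else 0),  Var(Y | X) = 1 if X = 1, else 4,
    #   mu3(Y | X) = 2 if X = 1, else 0 (third central moment of Exp(1) is 2).
    rhs = (p * 2.0                       # E[mu3(Y | X)]
           + p * (1 - p) * (1 - 2 * p)   # mu3(E[Y | X]) for a Bernoulli(p) variable
           - 9 * p * (1 - p))            # 3 Cov(E[Y | X], Var(Y | X)) = 3 Cov(X, 4 - 3X)

    lhs = np.mean((y - y.mean()) ** 3)   # sample third central moment of Y

    print(lhs, rhs)                      # should agree up to Monte Carlo error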

For higher cumulants, a generalization exists. See law of total cumulance.

See also

Law of total expectation
Law of total covariance
Law of total cumulance
Conditional variance
Explained variation

References

  1. Neil A. Weiss, A Course in Probability, Addison–Wesley, 2005, pages 385–386.
  2. Joseph K. Blitzstein and Jessica Hwang: "Introduction to Probability".
  3. Mahler, Howard C.; Dean, Curtis Gary (2001). "Chapter 8: Credibility" (PDF). In Casualty Actuarial Society (ed.). Foundations of Casualty Actuarial Science (4th ed.). Casualty Actuarial Society. pp. 525–526. ISBN 978-0-96247-622-8. Retrieved June 25, 2015.
  4. Bowsher, C. G. and P. S. Swain (2012). "Identifying sources of variation and the flow of information in biochemical networks". PNAS, May 15, 2012, 109 (20): E1320–E1328.
  5. Neil A. Weiss, A Course in Probability, Addison–Wesley, 2005, pages 380–383.