Stein's example

In decision theory and estimation theory, Stein's example (also known as Stein's phenomenon or Stein's paradox) is the observation that when three or more parameters are estimated simultaneously, there exist combined estimators more accurate on average (that is, having lower expected mean squared error) than any method that handles the parameters separately. It is named after Charles Stein of Stanford University, who discovered the phenomenon in 1955. [1]

An intuitive explanation is that optimizing for the mean-squared error of a combined estimator is not the same as optimizing for the errors of separate estimators of the individual parameters. In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent. If one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.

Formal statement

The following is the simplest form of the paradox, the special case in which the number of observations is equal to the number of parameters to be estimated. Let $\boldsymbol\theta$ be a vector consisting of $n \geq 3$ unknown parameters. To estimate these parameters, a single measurement $X_i$ is performed for each parameter $\theta_i$, resulting in a vector $\mathbf{X}$ of length $n$. Suppose the measurements are known to be independent, Gaussian random variables, with mean $\boldsymbol\theta$ and variance 1, i.e., $\mathbf{X} \sim \mathcal{N}(\boldsymbol\theta, \mathbf{I}_n)$. Thus, each parameter is estimated using a single noisy measurement, and each measurement is equally inaccurate.

Under these conditions, it is intuitive and common to use each measurement as an estimate of its corresponding parameter. This so-called "ordinary" decision rule can be written as $\hat{\boldsymbol\theta} = \mathbf{X}$, which is the maximum likelihood estimator (MLE). The quality of such an estimator is measured by its risk function. A commonly used risk function is the mean squared error, defined as $\mathbb{E}\left[\|\boldsymbol\theta - \hat{\boldsymbol\theta}\|^2\right]$. Surprisingly, it turns out that the "ordinary" decision rule is suboptimal (inadmissible) in terms of mean squared error when $n \geq 3$. In other words, in the setting discussed here, there exist alternative estimators which always achieve lower mean squared error, no matter what the value of $\boldsymbol\theta$ is. For a given $\boldsymbol\theta$ one could obviously define a perfect "estimator" which is always just $\boldsymbol\theta$, but this estimator would be bad for other values of $\boldsymbol\theta$.

The estimators of Stein's paradox are, for a given $\boldsymbol\theta$, better than the "ordinary" decision rule $\mathbf{X}$ for some $\mathbf{X}$ but necessarily worse for others. It is only on average that they are better. More accurately, an estimator $\hat{\boldsymbol\theta}_1$ is said to dominate another estimator $\hat{\boldsymbol\theta}_2$ if, for all values of $\boldsymbol\theta$, the risk of $\hat{\boldsymbol\theta}_1$ is lower than, or equal to, the risk of $\hat{\boldsymbol\theta}_2$, and if the inequality is strict for some $\boldsymbol\theta$. An estimator is said to be admissible if no other estimator dominates it, otherwise it is inadmissible. Thus, Stein's example can be simply stated as follows: The "ordinary" decision rule of the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk.

Many simple, practical estimators achieve better performance than the "ordinary" decision rule. The best-known example is the James–Stein estimator, which shrinks $\mathbf{X}$ towards a particular point (such as the origin) by an amount inversely proportional to the distance of $\mathbf{X}$ from that point. For a sketch of the proof of this result, see the section "Sketched proof". An alternative proof is due to Larry Brown: he proved that the ordinary estimator for an $n$-dimensional multivariate normal mean vector is admissible if and only if the $n$-dimensional Brownian motion is recurrent. [2] Since the Brownian motion is not recurrent for $n \geq 3$, the MLE is not admissible for $n \geq 3$.
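The dominance of the James–Stein estimator can be illustrated numerically. The following is a minimal Monte Carlo sketch, not a definitive implementation: the parameter vector, the number of trials and the seed are arbitrary assumptions, the function names `james_stein` and `compare_risks` are hypothetical, and shrinkage is towards the origin with unit-variance measurements.

```python
import random


def james_stein(x):
    """Shrink the observation vector x towards the origin (assumes len(x) >= 3)."""
    n = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (n - 2) / norm_sq
    return [factor * v for v in x]


def compare_risks(theta, trials=50_000, seed=0):
    """Monte Carlo estimate of the total squared-error risk of both rules."""
    rng = random.Random(seed)
    mle_loss = 0.0
    js_loss = 0.0
    for _ in range(trials):
        # One unit-variance Gaussian measurement per parameter.
        x = [rng.gauss(t, 1.0) for t in theta]
        est = james_stein(x)
        mle_loss += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        js_loss += sum((ei - t) ** 2 for ei, t in zip(est, theta))
    return mle_loss / trials, js_loss / trials


# Arbitrary illustrative parameter vector with n = 5 components.
mle_risk, js_risk = compare_risks([1.0, -0.5, 2.0, 0.0, 0.3])
print(f"ordinary rule risk ~ {mle_risk:.3f}, James-Stein risk ~ {js_risk:.3f}")
```

The estimated risk of the ordinary rule should be close to the number of parameters, while the estimated James–Stein risk should come out smaller; other choices of the parameter vector change the size of the gap but not its sign.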

An intuitive explanation

For any particular value of $\boldsymbol\theta$ the new estimator will improve at least one of the individual mean square errors $\mathbb{E}\left[(\theta_i - \hat\theta_i)^2\right]$. This is not hard: for instance, if $\theta$ is between −1 and 1, and $\sigma = 1$, then an estimator that linearly shrinks $X$ towards 0 by 0.5 (i.e., $\operatorname{sign}(X_i)\max(|X_i| - 0.5, 0)$, soft thresholding with threshold $0.5$) will have a lower mean square error than $X$ itself. But there are other values of $\theta$ for which this estimator is worse than $X$ itself. The trick of the Stein estimator, and others that yield the Stein paradox, is that they adjust the shift in such a way that there is always (for any $\boldsymbol\theta$ vector) at least one $X_i$ whose mean square error is improved, and its improvement more than compensates for any degradation in mean square error that might occur for another $\hat\theta_i$. The trouble is that, without knowing $\boldsymbol\theta$, you don't know which of the $n$ mean square errors are improved, so you can't use the Stein estimator only for those parameters.
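The one-dimensional shrinkage claim can be checked by simulation. This is a rough sketch under stated assumptions: a scalar measurement with unit variance, a soft threshold of 0.5, and arbitrarily chosen test values of the parameter; `soft_threshold` and `scalar_mse` are hypothetical helper names.

```python
import random


def soft_threshold(x, lam=0.5):
    """Move x towards 0 by lam, returning 0 once |x| <= lam."""
    magnitude = max(abs(x) - lam, 0.0)
    return magnitude if x >= 0 else -magnitude


def scalar_mse(theta, estimator, trials=100_000, seed=1):
    """Monte Carlo mean squared error of estimator for X ~ N(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = rng.gauss(theta, 1.0)
        total += (estimator(x) - theta) ** 2
    return total / trials


# Values inside [-1, 1] and one value well outside that interval.
for theta in (0.0, 0.5, 1.0, 3.0):
    shrunk = scalar_mse(theta, soft_threshold)
    plain = scalar_mse(theta, lambda x: x)
    print(f"theta={theta}: soft-threshold MSE ~ {shrunk:.3f}, plain MSE ~ {plain:.3f}")
```

For the values between −1 and 1 the shrunken estimate should show the smaller error, while for the value far from 0 it should show the larger one, which is why shrinking a single parameter in isolation cannot dominate the measurement itself.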

An example of the above setting occurs in channel estimation in telecommunications, for instance, because different factors affect overall channel performance.

Implications

Stein's example is surprising, since the "ordinary" decision rule is intuitive and commonly used. In fact, numerous methods for estimator construction, including maximum likelihood estimation, best linear unbiased estimation, least squares estimation and optimal equivariant estimation, all result in the "ordinary" estimator. Yet, as discussed above, this estimator is suboptimal.

Example

To demonstrate the unintuitive nature of Stein's example, consider the following real-world example. Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements.

At first sight it appears that somehow we get a better estimator for US wheat yield by measuring some other unrelated statistics such as the number of spectators at Wimbledon and the weight of a candy bar. However, we have not obtained a better estimator for US wheat yield by itself, but we have produced an estimator for the vector of the means of all three random variables, which has a reduced total risk. This occurs because the cost of a bad estimate in one component of the vector is compensated by a better estimate in another component. Also, a specific set of the three estimated mean values obtained with the new estimator will not necessarily be better than the ordinary set (the measured values). It is only on average that the new estimator is better.
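The trade-off between components can be made visible in a simulation. The sketch below is illustrative only: the parameter vector `[3.0, 0.0, 0.0]` is an arbitrary stand-in (not real wheat, attendance or candy-bar figures), measurements are assumed to have unit variance, the shrinkage is the James–Stein form towards the origin, and `componentwise_mse` is a hypothetical helper name.

```python
import random


def componentwise_mse(theta, trials=100_000, seed=2):
    """Per-component MSE of the ordinary rule and of James-Stein shrinkage towards 0."""
    n = len(theta)
    rng = random.Random(seed)
    plain = [0.0] * n
    shrunk = [0.0] * n
    for _ in range(trials):
        x = [rng.gauss(t, 1.0) for t in theta]
        # Common shrinkage factor computed from all n measurements together.
        factor = 1.0 - (n - 2) / sum(v * v for v in x)
        for i in range(n):
            plain[i] += (x[i] - theta[i]) ** 2
            shrunk[i] += (factor * x[i] - theta[i]) ** 2
    return [p / trials for p in plain], [s / trials for s in shrunk]


plain, shrunk = componentwise_mse([3.0, 0.0, 0.0])
for i, (p, s) in enumerate(zip(plain, shrunk)):
    print(f"component {i}: ordinary MSE ~ {p:.3f}, combined MSE ~ {s:.3f}")
print(f"total: ordinary ~ {sum(plain):.3f}, combined ~ {sum(shrunk):.3f}")
```

With this choice the component whose true value is far from the shrinkage point should come out worse under the combined estimator, the other two should come out better, and the total should still be lower than for the ordinary rule.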

Sketched proof

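The risk function of the decision rule $d(\mathbf{x}) = \mathbf{x}$ is

$$R(\boldsymbol\theta, d) = \mathbb{E}_{\boldsymbol\theta}\left[\|\boldsymbol\theta - \mathbf{X}\|^2\right] = \int (\boldsymbol\theta - \mathbf{x})^T (\boldsymbol\theta - \mathbf{x}) \left(\frac{1}{2\pi}\right)^{n/2} e^{-\frac{1}{2}(\boldsymbol\theta - \mathbf{x})^T(\boldsymbol\theta - \mathbf{x})} \, d\mathbf{x} = n.$$

Now consider the decision rule

$$d'(\mathbf{x}) = \mathbf{x} - \frac{\alpha}{\|\mathbf{x}\|^2}\,\mathbf{x},$$

where $\alpha = n - 2$. We will show that $d'$ is a better decision rule than $d$. The risk function is

$$R(\boldsymbol\theta, d') = \mathbb{E}_{\boldsymbol\theta}\left[\left\|\boldsymbol\theta - \mathbf{X} + \frac{\alpha}{\|\mathbf{X}\|^2}\mathbf{X}\right\|^2\right] = \mathbb{E}_{\boldsymbol\theta}\left[\|\boldsymbol\theta - \mathbf{X}\|^2\right] + 2\alpha\,\mathbb{E}_{\boldsymbol\theta}\left[\frac{(\boldsymbol\theta - \mathbf{X})^T \mathbf{X}}{\|\mathbf{X}\|^2}\right] + \alpha^2\,\mathbb{E}_{\boldsymbol\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right],$$

a quadratic in $\alpha$. We may simplify the middle term by considering a general "well-behaved" function $h : \mathbf{x} \mapsto h(\mathbf{x}) \in \mathbb{R}$ and using integration by parts. For $1 \leq i \leq n$, for any continuously differentiable $h$ growing sufficiently slowly for large $x_i$ we have:

$$\mathbb{E}_{\boldsymbol\theta}\left[(\theta_i - X_i)\,h(\mathbf{X}) \mid X_j = x_j\ (j \neq i)\right] = \int (\theta_i - x_i)\,h(\mathbf{x}) \left(\frac{1}{2\pi}\right)^{1/2} e^{-\frac{1}{2}(x_i - \theta_i)^2} \, dx_i$$

$$= \left[h(\mathbf{x}) \left(\frac{1}{2\pi}\right)^{1/2} e^{-\frac{1}{2}(x_i - \theta_i)^2}\right]_{x_i = -\infty}^{\infty} - \int \frac{\partial h}{\partial x_i}(\mathbf{x}) \left(\frac{1}{2\pi}\right)^{1/2} e^{-\frac{1}{2}(x_i - \theta_i)^2} \, dx_i = -\,\mathbb{E}_{\boldsymbol\theta}\left[\frac{\partial h}{\partial x_i}(\mathbf{X}) \mid X_j = x_j\ (j \neq i)\right].$$

Therefore,

$$\mathbb{E}_{\boldsymbol\theta}\left[(\theta_i - X_i)\,h(\mathbf{X})\right] = -\,\mathbb{E}_{\boldsymbol\theta}\left[\frac{\partial h}{\partial x_i}(\mathbf{X})\right].$$

(This result is known as Stein's lemma.) Now, we choose

$$h(\mathbf{x}) = \frac{x_i}{\|\mathbf{x}\|^2}.$$

If $h$ met the "well-behaved" condition (it doesn't, but this can be remedied; see below), we would have

$$\frac{\partial h}{\partial x_i} = \frac{1}{\|\mathbf{x}\|^2} - \frac{2 x_i^2}{\|\mathbf{x}\|^4}$$

and so

$$\mathbb{E}_{\boldsymbol\theta}\left[\frac{(\boldsymbol\theta - \mathbf{X})^T \mathbf{X}}{\|\mathbf{X}\|^2}\right] = \sum_{i=1}^{n} \mathbb{E}_{\boldsymbol\theta}\left[(\theta_i - X_i)\frac{X_i}{\|\mathbf{X}\|^2}\right] = -\sum_{i=1}^{n} \mathbb{E}_{\boldsymbol\theta}\left[\frac{1}{\|\mathbf{X}\|^2} - \frac{2 X_i^2}{\|\mathbf{X}\|^4}\right] = -(n - 2)\,\mathbb{E}_{\boldsymbol\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right].$$

Then returning to the risk function of $d'$:

$$R(\boldsymbol\theta, d') = n - 2\alpha(n - 2)\,\mathbb{E}_{\boldsymbol\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right] + \alpha^2\,\mathbb{E}_{\boldsymbol\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right].$$

This quadratic in $\alpha$ is minimized at $\alpha = n - 2$, giving

$$R(\boldsymbol\theta, d') = R(\boldsymbol\theta, d) - (n - 2)^2\,\mathbb{E}_{\boldsymbol\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right],$$

which of course satisfies $R(\boldsymbol\theta, d') < R(\boldsymbol\theta, d)$, making $d$ an inadmissible decision rule.

It remains to justify the use of

$$h(\mathbf{x}) = \frac{x_i}{\|\mathbf{x}\|^2}.$$

This function is not continuously differentiable, since it is singular at $\mathbf{x} = 0$. However, the function

$$h(\mathbf{x}) = \frac{x_i}{\epsilon + \|\mathbf{x}\|^2}$$

is continuously differentiable, and after following the algebra through and letting $\epsilon \to 0$, one obtains the same result.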
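The two key steps of the sketch, the Stein's-lemma evaluation of the middle term and the resulting risk formula, can be sanity-checked by simulation. The following is only a numerical illustration, not part of the proof: the dimension (six), the parameter vector, the trial count and the seed are arbitrary assumptions, and `check_identities` is a hypothetical helper name.

```python
import random


def check_identities(theta, trials=100_000, seed=3):
    """Compare Monte Carlo averages with the identities used in the sketched proof."""
    n = len(theta)
    rng = random.Random(seed)
    cross = 0.0      # accumulates (theta - X)^T X / ||X||^2
    inv_norm = 0.0   # accumulates 1 / ||X||^2
    js_loss = 0.0    # accumulates ||theta - d'(X)||^2 with alpha = n - 2
    for _ in range(trials):
        x = [rng.gauss(t, 1.0) for t in theta]
        norm_sq = sum(v * v for v in x)
        cross += sum((t - v) * v for t, v in zip(theta, x)) / norm_sq
        inv_norm += 1.0 / norm_sq
        factor = 1.0 - (n - 2) / norm_sq
        js_loss += sum((t - factor * v) ** 2 for t, v in zip(theta, x))
    cross /= trials
    inv_norm /= trials
    js_loss /= trials
    return {
        "cross_term": cross,
        "stein_prediction": -(n - 2) * inv_norm,
        "empirical_risk": js_loss,
        "formula_risk": n - (n - 2) ** 2 * inv_norm,
    }


result = check_identities([1.0] * 6)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Up to Monte Carlo error, `cross_term` should agree with `stein_prediction` and `empirical_risk` with `formula_risk`. A dimension above four is used here so that the sampled values of the inverse squared norm have finite variance, which keeps the averages stable.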

Notes

  1. Efron, B.; Morris, C. (1977). "Stein's paradox in statistics". Scientific American. 236 (5): 119–127. Bibcode:1977SciAm.236e.119E. doi:10.1038/scientificamerican0577-119.
  2. Brown, L. D. (1971). "Admissible Estimators, Recurrent Diffusions, and Insoluble Boundary Value Problems". The Annals of Mathematical Statistics. 42 (3): 855–903. doi:10.1214/aoms/1177693318. ISSN 0003-4851.
