Bayes estimator

Last updated

In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function (i.e., the posterior expected loss). Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.

Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value affects the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements.

Decision theory is the study of the reasoning underlying an agent's choices. Decision theory can be broken into two branches: normative decision theory, which gives advice on how to make the best decisions given a set of uncertain beliefs and a set of values, and descriptive decision theory which analyzes how existing, possibly irrational agents actually make decisions.

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule, the quantity of interest and its result are distinguished.



Suppose an unknown parameter is known to have a prior distribution . Let be an estimator of (based on some measurements x), and let be a loss function, such as squared error. The Bayes risk of is defined as , where the expectation is taken over the probability distribution of : this defines the risk function as a function of . An estimator is said to be a Bayes estimator if it minimizes the Bayes risk among all estimators. Equivalently, the estimator which minimizes the posterior expected loss for each also minimizes the Bayes risk and therefore is a Bayes estimator. [1]

In mathematical optimization, statistics, econometrics, decision theory, machine learning and computational neuroscience, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative, in which case it is to be maximized.

In probability theory, the expected value of a random variable, intuitively, is the long-run average value of repetitions of the same experiment it represents. For example, the expected value in rolling a six-sided die is 3.5, because the average of all the numbers that come up is 3.5 as the number of rolls approaches infinity. In other words, the law of large numbers states that the arithmetic mean of the values almost surely converges to the expected value as the number of repetitions approaches infinity. The expected value is also known as the expectation, mathematical expectation, EV, average, mean value, mean, or first moment.

If the prior is improper then an estimator which minimizes the posterior expected loss for each is called a generalized Bayes estimator. [2]


Minimum mean square error estimation

The most common risk function used for Bayesian estimation is the mean square error (MSE), also called squared error risk. The MSE is defined by

where the expectation is taken over the joint distribution of and .

Posterior mean

Using the MSE as risk, the Bayes estimate of the unknown parameter is simply the mean of the posterior distribution, [3]

This is known as the minimum mean square error (MMSE) estimator.

Bayes estimators for conjugate priors

If there is no inherent reason to prefer one prior probability distribution over another, a conjugate prior is sometimes chosen for simplicity. A conjugate prior is defined as a prior distribution belonging to some parametric family, for which the resulting posterior distribution also belongs to the same family. This is an important property, since the Bayes estimator, as well as its statistical properties (variance, confidence interval, etc.), can all be derived from the posterior distribution.

In Bayesian probability theory, if the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. For example, the Gaussian family is conjugate to itself with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian. This means that the Gaussian distribution is a conjugate prior for the likelihood that is also Gaussian. The concept, as well as the term "conjugate prior", were introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory. A similar concept had been discovered independently by George Alfred Barnard.

In mathematics and its applications, a parametric family or a parameterized family is a family of objects whose differences depend only on the chosen values for a set of parameters.

Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.

Following are some examples of conjugate priors.

Alternative risk functions

Risk functions are chosen depending on how one measures the distance between the estimate and the unknown parameter. The MSE is the most common risk function in use, primarily due to its simplicity. However, alternative risk functions are also occasionally used. The following are several examples of such alternatives. We denote the posterior generalized distribution function by .

Posterior median and other quantiles

  • A "linear" loss function, with , which yields the posterior median as the Bayes' estimate:
  • Another "linear" loss function, which assigns different "weights" to over or sub estimation. It yields a quantile from the posterior distribution, and is a generalization of the previous loss function:

Posterior mode

  • The following loss function is trickier: it yields either the posterior mode, or a point close to it depending on the curvature and properties of the posterior distribution. Small values of the parameter are recommended, in order to use the mode as an approximation ():

Other loss functions can be conceived, although the mean squared error is the most widely used and validated. Other loss functions are used in statistics, particularly in robust statistics.

Generalized Bayes estimators

The prior distribution has thus far been assumed to be a true probability distribution, in that

However, occasionally this can be a restrictive requirement. For example, there is no distribution (covering the set, R, of all real numbers) for which every real number is equally likely. Yet, in some sense, such a "distribution" seems like a natural choice for a non-informative prior, i.e., a prior distribution which does not imply a preference for any particular value of the unknown parameter. One can still define a function , but this would not be a proper probability distribution since it has infinite mass,

Such measures , which are not probability distributions, are referred to as improper priors.

The use of an improper prior means that the Bayes risk is undefined (since the prior is not a probability distribution and we cannot take an expectation under it). As a consequence, it is no longer meaningful to speak of a Bayes estimator that minimizes the Bayes risk. Nevertheless, in many cases, one can define the posterior distribution

This is a definition, and not an application of Bayes' theorem, since Bayes' theorem can only be applied when all distributions are proper. However, it is not uncommon for the resulting "posterior" to be a valid probability distribution. In this case, the posterior expected loss

is typically well-defined and finite. Recall that, for a proper prior, the Bayes estimator minimizes the posterior expected loss. When the prior is improper, an estimator which minimizes the posterior expected loss is referred to as a generalized Bayes estimator. [2]


A typical example is estimation of a location parameter with a loss function of the type . Here is a location parameter, i.e., .

It is common to use the improper prior in this case, especially when no other more subjective information is available. This yields

so the posterior expected loss

The generalized Bayes estimator is the value that minimizes this expression for a given . This is equivalent to minimizing

for a given      (1)

In this case it can be shown that the generalized Bayes estimator has the form , for some constant . To see this, let be the value minimizing (1) when . Then, given a different value , we must minimize


This is identical to (1), except that has been replaced by . Thus, the expression minimizing is given by , so that the optimal estimator has the form

Empirical Bayes estimators

A Bayes estimator derived through the empirical Bayes method is called an empirical Bayes estimator. Empirical Bayes methods enable the use of auxiliary empirical data, from observations of related parameters, in the development of a Bayes estimator. This is done under the assumption that the estimated parameters are obtained from a common prior. For example, if independent observations of different parameters are performed, then the estimation performance of a particular parameter can sometimes be improved by using data from other observations.

There are parametric and non-parametric approaches to empirical Bayes estimation. Parametric empirical Bayes is usually preferable since it is more applicable and more accurate on small amounts of data. [4]


The following is a simple example of parametric empirical Bayes estimation. Given past observations having conditional distribution , one is interested in estimating based on . Assume that the 's have a common prior which depends on unknown parameters. For example, suppose that is normal with unknown mean and variance We can then use the past observations to determine the mean and variance of in the following way.

First, we estimate the mean and variance of the marginal distribution of using the maximum likelihood approach:

Next, we use the relation

where and are the moments of the conditional distribution , which are assumed to be known. In particular, suppose that and that ; we then have

Finally, we obtain the estimated moments of the prior,

For example, if , and if we assume a normal prior (which is a conjugate prior in this case), we conclude that , from which the Bayes estimator of based on can be calculated.



Bayes rules having finite Bayes risk are typically admissible. The following are some specific examples of admissibility theorems.

By contrast, generalized Bayes rules often have undefined Bayes risk in the case of improper priors. These rules are often inadmissible and the verification of their admissibility can be difficult. For example, the generalized Bayes estimator of a location parameter θ based on Gaussian samples (described in the "Generalized Bayes estimator" section above) is inadmissible for ; this is known as Stein's phenomenon.

Asymptotic efficiency

Let θ be an unknown random variable, and suppose that are iid samples with density . Let be a sequence of Bayes estimators of θ based on an increasing number of measurements. We are interested in analyzing the asymptotic performance of this sequence of estimators, i.e., the performance of for large n.

To this end, it is customary to regard θ as a deterministic parameter whose true value is . Under specific conditions, [6] for large samples (large values of n), the posterior density of θ is approximately normal. In other words, for large n, the effect of the prior probability on the posterior is negligible. Moreover, if δ is the Bayes estimator under MSE risk, then it is asymptotically unbiased and it converges in distribution to the normal distribution:

where I0) is the fisher information of θ0. It follows that the Bayes estimator δn under MSE is asymptotically efficient.

Another estimator which is asymptotically normal and efficient is the maximum likelihood estimator (MLE). The relations between the maximum likelihood and Bayes estimators can be shown in the following simple example.

Consider the estimator of θ based on binomial sample x~b(θ,n) where θ denotes the probability for success. Assuming θ is distributed according to the conjugate prior, which in this case is the Beta distribution B(a,b), the posterior distribution is known to be B(a+x,b+n-x). Thus, the Bayes estimator under MSE is

The MLE in this case is x/n and so we get,

The last equation implies that, for n → ∞, the Bayes estimator (in the described problem) is close to the MLE.

On the other hand, when n is small, the prior information is still relevant to the decision problem and affects the estimate. To see the relative weight of the prior information, assume that a=b; in this case each measurement brings in 1 new bit of information; the formula above shows that the prior information has the same weight as a+b bits of the new information. In applications, one often knows very little about fine details of the prior distribution; in particular, there is no reason to assume that it coincides with B(a,b) exactly. In such a case, one possible interpretation of this calculation is: "there is a non-pathological prior distribution with the mean value 0.5 and the standard deviation d which gives the weight of prior information equal to 1/(4d2)-1 bits of new information."

Another example of the same phenomena is the case when the prior estimate and a measurement are normally distributed. If the prior is centered at B with deviation Σ, and the measurement is centered at b with deviation σ, then the posterior is centered at , with weights in this weighted average being α=σ², β=Σ². Moreover, the squared posterior deviation is Σ²+σ². In other words, the prior is combined with the measurement in exactly the same way as if it were an extra measurement to take into account.

For example, if Σ=σ/2, then the deviation of 4 measurements combined together matches the deviation of the prior (assuming that errors of measurements are independent). And the weights α,β in the formula for posterior match this: the weight of the prior is 4 times the weight of the measurement. Combining this prior with n measurements with average v results in the posterior centered at ; in particular, the prior plays the same role as 4 measurements made in advance. In general, the prior has the weight of (σ/Σ)² measurements.

Compare to the example of binomial distribution: there the prior has the weight of (σ/Σ)²−1 measurements. One can see that the exact weight does depend on the details of the distribution, but when σ≫Σ, the difference becomes small.

Practical example of Bayes estimators

The Internet Movie Database uses a formula for calculating and comparing the ratings of films by its users, including their Top Rated 250 Titles which is claimed to give "a true Bayesian estimate". [7] The following Bayesian formula was initially used to calculate a weighted average score for the Top 250, though the formula has since changed:


= weighted rating
= average rating for the movie as a number from 1 to 10 (mean) = (Rating)
= number of votes/ratings for the movie = (votes)
= weight given to the prior estimate (in this case, the number of votes IMDB deemed necessary for average rating to approach statistical validity)
= the mean vote across the whole pool (currently 7.0)

Note that W is just the weighted arithmetic mean of R and C with weight vector (v, m). As the number of ratings surpasses m, the confidence of the average rating surpasses the confidence of the prior knowledge, and the weighted bayesian rating (W) approaches a straight average (R). The closer v (the number of ratings for the film) is to zero, the closer W gets to C, where W is the weighted rating and C is the average rating of all films. So, in simpler terms, the fewer ratings/votes cast for a film, the more that film's Weighted Rating will skew towards the average across all films, while films with many ratings/votes will have a rating approaching its pure arithmetic average rating.

IMDb's approach ensures that a film with only a few ratings, all at 10, would not rank above "the Godfather", for example, with a 9.2 average from over 500,000 ratings.

See also


  1. Lehmann and Casella, Theorem 4.1.1
  2. 1 2 Lehmann and Casella, Definition 4.2.9
  3. Jaynes, E.T. (2007). Probability Theory: The Logic of Science (5. print. ed.). Cambridge [u.a.]: Cambridge Univ. Press. p. 172. ISBN   978-0-521-59271-0.
  4. Berger (1980), section 4.5.
  5. Lehmann and Casella (1998), Theorem 5.2.4.
  6. Lehmann and Casella (1998), section 6.8
  7. IMDb Top 250

Related Research Articles

Normal distribution probability distribution

In probability theory, the normaldistribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. The method obtains the parameter estimates by finding the parameter values that maximize the likelihood function. The estimates are called maximum likelihood estimates, which is also abbreviated as MLE.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.

Rayleigh distribution

In probability theory and statistics, the Rayleigh distribution is a continuous probability distribution for nonnegative-valued random variables. It is essentially a chi distribution with two degrees of freedom.

In estimation theory and statistics, the Cramér–Rao bound (CRB), Cramér–Rao lower bound (CRLB), Cramér–Rao inequality, Frechet–Darmois–Cramér–Rao inequality, or information inequality expresses a lower bound on the variance of unbiased estimators of a deterministic parameter. This term is named in honor of Harald Cramér, Calyampudi Radhakrishna Rao, Maurice Fréchet and Georges Darmois all of whom independently derived this limit to statistical precision in the 1940s.

Directional statistics

Directional statistics is the subdiscipline of statistics that deals with directions, axes or rotations in Rn. More generally, directional statistics deals with observations on compact Riemannian manifolds.

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

von Mises distribution

In probability theory and directional statistics, the von Mises distribution is a continuous probability distribution on the circle. It is a close approximation to the wrapped normal distribution, which is the circular analogue of the normal distribution. A freely diffusing angle on a circle is a wrapped normally distributed random variable with an unwrapped variance that grows linearly in time. On the other hand, the von Mises distribution is the stationary distribution of a drift and diffusion process on the circle in a harmonic potential, i.e. with a preferred orientation. The von Mises distribution is the maximum entropy distribution for circular data when the real and imaginary parts of the first circular moment are specified. The von Mises distribution is a special case of the von Mises–Fisher distribution on the N-dimensional sphere.

Rice distribution probability distribution of the magnitude of a circular bivariate normal random variable

In probability theory, the Rice distribution, Rician distribution or Ricean distribution is the probability distribution of the magnitude of a circular bivariate normal random variable with potentially non-zero mean. It was named after Stephen O. Rice.

In Bayesian probability, the Jeffreys prior, named after Sir Harold Jeffreys, is a non-informative (objective) prior distribution for a parameter space; it is proportional to the square root of the determinant of the Fisher information matrix:

Beta-binomial distribution distribution

In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random. The beta-binomial distribution is the binomial distribution in which the probability of success at each trial is fixed but randomly drawn from a beta distribution prior to n Bernoulli trials. It is frequently used in Bayesian statistics, empirical Bayes methods and classical statistics to capture overdispersion in binomial type distributed data.

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased. In statistics, "bias" is an objective property of an estimator, and while not a desired property, it is not pejorative, unlike the ordinary English use of the term "bias".

A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio

In credibility theory, a branch of study in actuarial science, the Bühlmann model is a random effects model used in to determine the appropriate premium for a group of insurance contracts. The model is named after Hans Bühlmann who first published a description in 1967.

Half-normal distribution

In probability theory and statistics, the half-normal distribution is a special case of the folded normal distribution.

In statistical decision theory, where we are faced with the problem of estimating a deterministic parameter (vector) from observations an estimator is called minimax if its maximal risk is minimal among all estimators of . In a sense this means that is an estimator which performs best in the worst possible case allowed in the problem.

Flamant solution

The Flamant solution provides expressions for the stresses and displacements in a linear elastic wedge loaded by point forces at its sharp end. This solution was developed by A. Flamant in 1892 by modifying the three-dimensional solution of Boussinesq.

Wrapped normal distribution

In probability theory and directional statistics, a wrapped normal distribution is a wrapped probability distribution that results from the "wrapping" of the normal distribution around the unit circle. It finds application in the theory of Brownian motion and is a solution to the heat equation for periodic boundary conditions. It is closely approximated by the von Mises distribution, which, due to its mathematical simplicity and tractability, is the most commonly used distribution in directional statistics.

Maximum spacing estimation

In statistics, maximum spacing estimation, or maximum product of spacing estimation (MPS), is a method for estimating the parameters of a univariate statistical model. The method requires maximization of the geometric mean of spacings in the data, which are the differences between the values of the cumulative distribution function at neighbouring data points.

In the comparison of various statistical procedures, efficiency is a measure of quality of an estimator, of an experimental design, or of a hypothesis testing procedure. Essentially, a more efficient estimator, experiment, or test needs fewer observations than a less efficient one to achieve a given performance. This article primarily deals with efficiency of estimators.