# Maximum a posteriori estimation

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution. The MAP estimate can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (that quantifies the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.

## Description

Assume that we want to estimate an unobserved population parameter $\theta$ on the basis of observations $x$. Let $f$ be the sampling distribution of $x$, so that $f(x\mid \theta )$ is the probability of $x$ when the underlying population parameter is $\theta$. Then the function

$\theta \mapsto f(x\mid \theta )$

is known as the likelihood function, and the estimate

${\hat {\theta }}_{\mathrm {MLE} }(x)={\underset {\theta }{\operatorname {arg\,max} }}\ f(x\mid \theta )$

is the maximum likelihood estimate of $\theta$.

Now assume that a prior distribution $g$ over $\theta$ exists. This allows us to treat $\theta$ as a random variable as in Bayesian statistics. We can calculate the posterior distribution of $\theta$ using Bayes' theorem:

$\theta \mapsto f(\theta \mid x)={\frac {f(x\mid \theta )\,g(\theta )}{\int _{\Theta }f(x\mid \vartheta )\,g(\vartheta )\,d\vartheta }}$

where $g$ is the density function of $\theta$ and $\Theta$ is the domain of $g$.

The method of maximum a posteriori estimation then estimates $\theta$ as the mode of the posterior distribution of this random variable:

${\begin{aligned}{\hat {\theta }}_{\mathrm {MAP} }(x)&={\underset {\theta }{\operatorname {arg\,max} }}\ f(\theta \mid x)\\&={\underset {\theta }{\operatorname {arg\,max} }}\ {\frac {f(x\mid \theta )\,g(\theta )}{\int _{\Theta }f(x\mid \vartheta )\,g(\vartheta )\,d\vartheta }}\\&={\underset {\theta }{\operatorname {arg\,max} }}\ f(x\mid \theta )\,g(\theta ).\end{aligned}}$

The denominator of the posterior distribution (the so-called marginal likelihood) is always positive and does not depend on $\theta$, and therefore plays no role in the optimization. Observe that the MAP estimate of $\theta$ coincides with the ML estimate when the prior $g$ is uniform (i.e., $g$ is a constant function).
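The relationship between the two estimators can be checked numerically. The sketch below is illustrative only: the coin-flip model, the Beta prior, the data (7 heads, 3 tails) and all helper names are assumptions chosen for the example, not part of the description above. It maximizes $f(x\mid \theta )\,g(\theta )$ on a grid, working with logarithms for numerical stability.

```python
import math

def log_likelihood(theta, heads, tails):
    # Bernoulli log-likelihood of the observed counts (assumed example model).
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def log_beta_prior(theta, a, b):
    # Unnormalized log density of a Beta(a, b) prior; the normalizing
    # constant does not depend on theta, so it can be dropped.
    return (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)

def argmax_on_grid(objective, n_points=9999):
    # Brute-force maximization over an evenly spaced grid in (0, 1).
    grid = [(i + 1) / (n_points + 1) for i in range(n_points)]
    return max(grid, key=objective)

heads, tails = 7, 3  # hypothetical data

mle = argmax_on_grid(lambda t: log_likelihood(t, heads, tails))

# Uniform prior Beta(1, 1): the log prior is constant, so MAP equals MLE.
map_flat = argmax_on_grid(
    lambda t: log_likelihood(t, heads, tails) + log_beta_prior(t, 1.0, 1.0)
)

# Informative prior Beta(5, 5): the estimate is pulled toward 0.5.
map_informative = argmax_on_grid(
    lambda t: log_likelihood(t, heads, tails) + log_beta_prior(t, 5.0, 5.0)
)

print(mle, map_flat, map_informative)
```

With the Beta(5, 5) prior the posterior is Beta(12, 8), whose mode is 11/18, so the grid result can be compared against that closed form.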

When the loss function is of the form

$L(\theta ,a)={\begin{cases}0,&{\text{if }}|a-\theta |<c,\\1,&{\text{otherwise}},\end{cases}}$

as $c$ goes to 0, the Bayes estimator approaches the MAP estimator, provided that the distribution of $\theta$ is quasi-concave. But generally a MAP estimator is not a Bayes estimator unless $\theta$ is discrete.
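One way to see why this limit picks out the mode is to write down the posterior expected loss of an estimate $a$ under a loss that is 0 when $|a-\theta |<c$ and 1 otherwise. The following is a heuristic sketch rather than a rigorous proof; it assumes a scalar $\theta$ and a posterior density $f(\theta \mid x)$ that is continuous near $a$:

$\operatorname {E} [L(\theta ,a)\mid x]=1-\int _{a-c}^{a+c}f(\theta \mid x)\,d\theta \approx 1-2c\,f(a\mid x)\quad {\text{as }}c\to 0.$

For small $c$, minimizing the expected loss therefore amounts to maximizing $f(a\mid x)$ over $a$, which is the MAP objective.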

## Computation

MAP estimates can be computed in several ways:

1. Analytically, when the mode(s) of the posterior distribution can be given in closed form. This is the case when conjugate priors are used.
2. Via numerical optimization such as the conjugate gradient method or Newton's method. This usually requires first or second derivatives, which have to be evaluated analytically or numerically.
3. Via a modification of an expectation-maximization algorithm. This does not require derivatives of the posterior density.
4. Via a Monte Carlo method using simulated annealing.
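The first two approaches can be compared on a small problem. The following is a minimal sketch under assumed conditions: Poisson counts with a Gamma(shape `alpha`, rate `beta`) prior, made-up data, and a hand-written Newton iteration (the function `newton_maximize` is hypothetical, not a library routine). Because the Gamma prior is conjugate to the Poisson likelihood, the closed-form mode is available to check the numerical result.

```python
def newton_maximize(grad, hess, x0, tol=1e-12, max_iter=100):
    # Plain Newton iteration on a scalar objective, given its first and
    # second derivatives. No line search or safeguards: it assumes the
    # starting point is close enough to the maximum to converge.
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x -= step
        if abs(step) < tol:
            break
    return x

counts = [3, 1, 4, 1, 5]   # hypothetical Poisson observations
alpha, beta = 2.0, 1.0     # assumed Gamma prior (shape, rate)

n = len(counts)
k = alpha - 1.0 + sum(counts)  # coefficient of log(lam) in the log posterior
r = beta + n                   # coefficient of -lam in the log posterior

# Unnormalized log posterior: k * log(lam) - r * lam
grad = lambda lam: k / lam - r
hess = lambda lam: -k / lam ** 2

# Approach 1: closed-form mode of the Gamma(alpha + sum, beta + n) posterior.
lam_closed_form = k / r

# Approach 2: Newton's method, started from the sample mean.
lam_newton = newton_maximize(grad, hess, x0=sum(counts) / n)

print(lam_closed_form, lam_newton)
```

In this example both approaches give the same value. In models without a conjugate prior only the numerical route is available, and a more robust optimizer than this bare Newton step would normally be used.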

## Limitations

While only mild conditions are required for MAP estimation to be a limiting case of Bayes estimation (under the 0–1 loss function), it is not very representative of Bayesian methods in general. This is because MAP estimates are point estimates, whereas Bayesian methods are characterized by the use of distributions to summarize data and draw inferences: thus, Bayesian methods tend to report the posterior mean or median instead, together with credible intervals. This is both because these estimators are optimal under squared-error and linear-error loss respectively, which are more representative of typical loss functions, and because for a continuous posterior distribution there is no loss function which suggests the MAP is the optimal point estimator. In addition, the posterior distribution may often not have a simple analytic form: in this case, the distribution can be simulated using Markov chain Monte Carlo techniques, while optimization to find its mode(s) may be difficult or impossible.[citation needed]

Figure: An example of a density of a bimodal distribution in which the highest mode is uncharacteristic of the majority of the distribution.

In many types of models, such as mixture models, the posterior may be multi-modal. In such a case, the usual recommendation is that one should choose the highest mode: this is not always feasible (global optimization is a difficult problem), nor in some cases even possible (such as when identifiability issues arise). Furthermore, the highest mode may be uncharacteristic of the majority of the posterior.

Finally, unlike ML estimators, the MAP estimate is not invariant under reparameterization. Switching from one parameterization to another multiplies the density by a Jacobian factor, which shifts the location of the maximum.
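The effect of the Jacobian can be illustrated with a small numerical sketch. The setup is an assumption made for the example: a Beta(3, 2) posterior for a probability $\theta$, reparameterized by the log-odds $\phi =\log(\theta /(1-\theta ))$. The mode is located by grid search in each parameterization, and the log-odds mode is then mapped back to the $\theta$ scale.

```python
import math

A, B = 3.0, 2.0  # assumed Beta(A, B) posterior for theta

def log_density_theta(theta):
    # Unnormalized log posterior density in the theta parameterization.
    return (A - 1.0) * math.log(theta) + (B - 1.0) * math.log(1.0 - theta)

def sigmoid(phi):
    # Inverse of the log-odds transform: maps phi back to theta.
    return 1.0 / (1.0 + math.exp(-phi))

def log_density_phi(phi):
    # Density of phi = density of theta times |d theta / d phi|,
    # where d theta / d phi = theta * (1 - theta).
    theta = sigmoid(phi)
    log_jacobian = math.log(theta) + math.log(1.0 - theta)
    return log_density_theta(theta) + log_jacobian

theta_grid = [i / 10000 for i in range(1, 10000)]
phi_grid = [-8.0 + 0.001 * i for i in range(16001)]

mode_theta = max(theta_grid, key=log_density_theta)
mode_phi = max(phi_grid, key=log_density_phi)
mode_phi_as_theta = sigmoid(mode_phi)

print(mode_theta, mode_phi_as_theta)
```

Here the mode in $\theta$ is $(A-1)/(A+B-2)=2/3$, while the mode found in the log-odds scale corresponds to $\theta =A/(A+B)=0.6$: the same posterior yields two different MAP estimates depending on the parameterization.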

As an example of the difference between the Bayes estimators mentioned above (mean and median estimators) and a MAP estimate, consider the case where there is a need to classify inputs $x$ as either positive or negative (for example, loans as risky or safe). Suppose there are just three possible hypotheses about the correct method of classification, $h_{1}$, $h_{2}$ and $h_{3}$, with posteriors 0.4, 0.3 and 0.3 respectively. Suppose that, given a new instance $x$, $h_{1}$ classifies it as positive, whereas the other two classify it as negative. Using the MAP estimate for the correct classifier, $h_{1}$, $x$ is classified as positive, whereas the Bayes estimators would average over all hypotheses and classify $x$ as negative.
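The arithmetic of this example can be written out directly. The sketch below uses the posteriors and predictions stated above; the variable names are illustrative.

```python
# Posterior probability of each hypothesis and its prediction for x.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "positive", "h2": "negative", "h3": "negative"}

# MAP: pick the single most probable hypothesis and use its prediction.
map_hypothesis = max(posteriors, key=posteriors.get)
map_label = predictions[map_hypothesis]

# Averaging: sum the posterior mass behind each label, then pick the
# label with the largest total.
label_mass = {}
for hypothesis, prob in posteriors.items():
    label = predictions[hypothesis]
    label_mass[label] = label_mass.get(label, 0.0) + prob
averaged_label = max(label_mass, key=label_mass.get)

print(map_hypothesis, map_label)
print(label_mass, averaged_label)
```

The MAP hypothesis carries only 0.4 of the posterior mass, while the two hypotheses that disagree with it jointly carry 0.6, which is why the two rules give opposite labels.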

## Example

Suppose that we are given a sequence $(x_{1},\dots ,x_{n})$ of IID $N(\mu ,\sigma _{v}^{2})$ random variables and that the prior distribution of $\mu$ is given by $N(\mu _{0},\sigma _{m}^{2})$. We wish to find the MAP estimate of $\mu$. Note that the normal distribution is its own conjugate prior, so we will be able to find a closed-form solution analytically.

The function to be maximized is then given by

$f(\mu )f(x\mid \mu )=\pi (\mu )L(\mu )={\frac {1}{{\sqrt {2\pi }}\sigma _{m}}}\exp \left(-{\frac {1}{2}}\left({\frac {\mu -\mu _{0}}{\sigma _{m}}}\right)^{2}\right)\prod _{j=1}^{n}{\frac {1}{{\sqrt {2\pi }}\sigma _{v}}}\exp \left(-{\frac {1}{2}}\left({\frac {x_{j}-\mu }{\sigma _{v}}}\right)^{2}\right),$

which is equivalent to minimizing the following function of $\mu$:

$\sum _{j=1}^{n}\left({\frac {x_{j}-\mu }{\sigma _{v}}}\right)^{2}+\left({\frac {\mu -\mu _{0}}{\sigma _{m}}}\right)^{2}.$

Thus, we see that the MAP estimator for $\mu$ is given by

${\hat {\mu }}_{\mathrm {MAP} }={\frac {\sigma _{m}^{2}\,n}{\sigma _{m}^{2}\,n+\sigma _{v}^{2}}}\left({\frac {1}{n}}\sum _{j=1}^{n}x_{j}\right)+{\frac {\sigma _{v}^{2}}{\sigma _{m}^{2}\,n+\sigma _{v}^{2}}}\,\mu _{0}={\frac {\sigma _{m}^{2}\left(\sum _{j=1}^{n}x_{j}\right)+\sigma _{v}^{2}\,\mu _{0}}{\sigma _{m}^{2}\,n+\sigma _{v}^{2}}},$

which turns out to be a linear interpolation between the prior mean and the sample mean, each weighted in proportion to its precision (inverse variance): $1/\sigma _{m}^{2}$ for the prior mean and $n/\sigma _{v}^{2}$ for the sample mean.

The case of $\sigma _{m}\to \infty$ is called a non-informative prior and leads to an improper (non-normalizable) prior distribution; in this case ${\hat {\mu }}_{\mathrm {MAP} }\to {\hat {\mu }}_{\mathrm {ML} }.$
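The closed-form estimator can be evaluated on a small data set. The sketch below is illustrative: the observations, the prior settings and the function names are assumptions, and a very large $\sigma _{m}$ stands in for the limit $\sigma _{m}\to \infty$.

```python
def map_normal_mean(xs, mu0, sigma_m, sigma_v):
    # Closed-form MAP estimate of mu for N(mu, sigma_v^2) data with a
    # N(mu0, sigma_m^2) prior on mu.
    n = len(xs)
    numerator = sigma_m ** 2 * sum(xs) + sigma_v ** 2 * mu0
    return numerator / (sigma_m ** 2 * n + sigma_v ** 2)

def penalized_sum_of_squares(mu, xs, mu0, sigma_m, sigma_v):
    # The function of mu that the MAP estimate minimizes.
    data_term = sum(((x - mu) / sigma_v) ** 2 for x in xs)
    prior_term = ((mu - mu0) / sigma_m) ** 2
    return data_term + prior_term

xs = [2.1, 1.9, 2.4, 2.0]              # hypothetical observations
mu0, sigma_m, sigma_v = 0.0, 1.0, 1.0  # assumed prior mean and scales

sample_mean = sum(xs) / len(xs)
mu_map = map_normal_mean(xs, mu0, sigma_m, sigma_v)

# A very wide prior approximates the non-informative limit.
mu_map_vague = map_normal_mean(xs, mu0, 1e6, sigma_v)

print(sample_mean, mu_map, mu_map_vague)
```

With these settings the sample mean is 2.1 and the MAP estimate is 8.4 / 5 = 1.68, pulled toward the prior mean of 0; with the wide prior the MAP estimate is essentially the sample mean.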

