Empirical Bayes method

Last updated

Empirical Bayes methods are procedures for statistical inference in which the prior probability distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. [1] Empirical Bayes, also known as maximum marginal likelihood , [2] represents a convenient approach for setting hyperparameters, but has been mostly supplanted by fully Bayesian hierarchical analyses since the 2000s with the increasing availability of well-performing computation techniques. It is still commonly used, however, for variational methods in Deep Learning, such as variational autoencoders, where latent variable spaces are high-dimensional.

Contents

Introduction

Empirical Bayes methods can be seen as an approximation to a fully Bayesian treatment of a hierarchical Bayes model.

In, for example, a two-stage hierarchical Bayes model, observed data are assumed to be generated from an unobserved set of parameters according to a probability distribution . In turn, the parameters can be considered samples drawn from a population characterised by hyperparameters according to a probability distribution . In the hierarchical Bayes model, though not in the empirical Bayes approximation, the hyperparameters are considered to be drawn from an unparameterized distribution .

Information about a particular quantity of interest therefore comes not only from the properties of those data that directly depend on it, but also from the properties of the population of parameters as a whole, inferred from the data as a whole, summarised by the hyperparameters .

Using Bayes' theorem,

In general, this integral will not be tractable analytically or symbolically and must be evaluated by numerical methods. Stochastic (random) or deterministic approximations may be used. Example stochastic methods are Markov Chain Monte Carlo and Monte Carlo sampling. Deterministic approximations are discussed in quadrature.

Alternatively, the expression can be written as

and the final factor in the integral can in turn be expressed as

These suggest an iterative scheme, qualitatively similar in structure to a Gibbs sampler, to evolve successively improved approximations to and . First, calculate an initial approximation to ignoring the dependence completely; then calculate an approximation to based upon the initial approximate distribution of ; then use this to update the approximation for ; then update ; and so on.

When the true distribution is sharply peaked, the integral determining may be not much changed by replacing the probability distribution over with a point estimate representing the distribution's peak (or, alternatively, its mean),

With this approximation, the above iterative scheme becomes the EM algorithm.

The term "Empirical Bayes" can cover a wide variety of methods, but most can be regarded as an early truncation of either the above scheme or something quite like it. Point estimates, rather than the whole distribution, are typically used for the parameter(s) . The estimates for are typically made from the first approximation to without subsequent refinement. These estimates for are usually made without considering an appropriate prior distribution for .

Point estimation

Robbins' method: non-parametric empirical Bayes (NPEB)

Robbins [3] considered a case of sampling from a mixed distribution, where probability for each (conditional on ) is specified by a Poisson distribution,

while the prior on θ is unspecified except that it is also i.i.d. from an unknown distribution, with cumulative distribution function . Compound sampling arises in a variety of statistical estimation problems, such as accident rates and clinical trials.[ citation needed ] We simply seek a point prediction of given all the observed data. Because the prior is unspecified, we seek to do this without knowledge of G. [4]

Under squared error loss (SEL), the conditional expectation E(θi | Yi = yi) is a reasonable quantity to use for prediction. For the Poisson compound sampling model, this quantity is

This can be simplified by multiplying both the numerator and denominator by , yielding

where pG is the marginal probability mass function obtained by integrating out θ over G.

To take advantage of this, Robbins [3] suggested estimating the marginals with their empirical frequencies (), yielding the fully non-parametric estimate as:

where denotes "number of". (See also Good–Turing frequency estimation.)

Example – Accident rates

Suppose each customer of an insurance company has an "accident rate" Θ and is insured against accidents; the probability distribution of Θ is the underlying distribution, and is unknown. The number of accidents suffered by each customer in a specified time period has a Poisson distribution with expected value equal to the particular customer's accident rate. The actual number of accidents experienced by a customer is the observable quantity. A crude way to estimate the underlying probability distribution of the accident rate Θ is to estimate the proportion of members of the whole population suffering 0, 1, 2, 3, ... accidents during the specified time period as the corresponding proportion in the observed random sample. Having done so, it is then desired to predict the accident rate of each customer in the sample. As above, one may use the conditional expected value of the accident rate Θ given the observed number of accidents during the baseline period. Thus, if a customer suffers six accidents during the baseline period, that customer's estimated accident rate is 7 × [the proportion of the sample who suffered 7 accidents] / [the proportion of the sample who suffered 6 accidents]. Note that if the proportion of people suffering k accidents is a decreasing function of k, the customer's predicted accident rate will often be lower than their observed number of accidents.

This shrinkage effect is typical of empirical Bayes analyses.

Gaussian

Suppose are random variables, such that is observed, but is hidden. The problem is to find the expectation of , conditional on . Suppose further that , that is, , where is a multivariate gaussian with variance .

Then, we have the formula by direct calculation with the probability density function of multivariate gaussians. Integrating over , we obtainIn particular, this means that one can perform Bayesian estimation of without access to either the prior density of or the posterior density of . The only requirement is to have access to the score function of . This has applications in score-based generative modeling. [5]

Parametric empirical Bayes

If the likelihood and its prior take on simple parametric forms (such as 1- or 2-dimensional likelihood functions with simple conjugate priors), then the empirical Bayes problem is only to estimate the marginal and the hyperparameters using the complete set of empirical measurements. For example, one common approach, called parametric empirical Bayes point estimation, is to approximate the marginal using the maximum likelihood estimate (MLE), or a moments expansion, which allows one to express the hyperparameters in terms of the empirical mean and variance. This simplified marginal allows one to plug in the empirical averages into a point estimate for the prior . The resulting equation for the prior is greatly simplified, as shown below.

There are several common parametric empirical Bayes models, including the Poisson–gamma model (below), the Beta-binomial model, the Gaussian–Gaussian model, the Dirichlet-multinomial model, as well specific models for Bayesian linear regression (see below) and Bayesian multivariate linear regression. More advanced approaches include hierarchical Bayes models and Bayesian mixture models.

Gaussian–Gaussian model

For an example of empirical Bayes estimation using a Gaussian-Gaussian model, see Empirical Bayes estimators.

Poisson–gamma model

For example, in the example above, let the likelihood be a Poisson distribution, and let the prior now be specified by the conjugate prior, which is a gamma distribution () (where ):

It is straightforward to show the posterior is also a gamma distribution. Write

where the marginal distribution has been omitted since it does not depend explicitly on . Expanding terms which do depend on gives the posterior as:

So the posterior density is also a gamma distribution , where , and . Also notice that the marginal is simply the integral of the posterior over all , which turns out to be a negative binomial distribution.

To apply empirical Bayes, we will approximate the marginal using the maximum likelihood estimate (MLE). But since the posterior is a gamma distribution, the MLE of the marginal turns out to be just the mean of the posterior, which is the point estimate we need. Recalling that the mean of a gamma distribution is simply , we have

To obtain the values of and , empirical Bayes prescribes estimating mean and variance using the complete set of empirical data.

The resulting point estimate is therefore like a weighted average of the sample mean and the prior mean . This turns out to be a general feature of empirical Bayes; the point estimates for the prior (i.e. mean) will look like a weighted averages of the sample estimate and the prior estimate (likewise for estimates of the variance).

See also

Related Research Articles

A likelihood function measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the joint probability distribution of the random variable that (presumably) generated the observations. When evaluated on the actual data points, it becomes a function solely of the model parameters.

<span class="mw-page-title-main">Gamma distribution</span> Probability distribution

In probability theory and statistics, the gamma distribution is a versatile two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use:

  1. With a shape parameter k and a scale parameter θ
  2. With a shape parameter and a rate parameter

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinct because they all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

In statistics, the Neyman–Pearson lemma describes the existence and uniqueness of the likelihood ratio as a uniformly most powerful test in certain contexts. It was introduced by Jerzy Neyman and Egon Pearson in a paper in 1933. The Neyman–Pearson lemma is part of the Neyman–Pearson theory of statistical testing, which introduced concepts like errors of the second kind, power function, and inductive behavior. The previous Fisherian theory of significance testing postulated only one hypothesis. By introducing a competing hypothesis, the Neyman–Pearsonian flavor of statistical testing allows investigating the two types of errors. The trivial cases where one always rejects or accepts the null hypothesis are of little interest but it does prove that one must not relinquish control over one type of error while calibrating the other. Neyman and Pearson accordingly proceeded to restrict their attention to the class of all level tests while subsequently minimizing type II error, traditionally denoted by . Their seminal paper of 1933, including the Neyman–Pearson lemma, comes at the end of this endeavor, not only showing the existence of tests with the most power that retain a prespecified level of type I error, but also providing a way to construct such tests. The Karlin-Rubin theorem extends the Neyman–Pearson lemma to settings involving composite hypotheses with monotone likelihood ratios.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

In Bayesian probability theory, if, given a likelihood function , the posterior distribution is in the same probability distribution family as the prior probability distribution , the prior and posterior are then called conjugate distributions with respect to that likelihood function and the prior is called a conjugate prior for the likelihood function .

<span class="mw-page-title-main">Inverse-gamma distribution</span> Two-parameter family of continuous probability distributions

In probability theory and statistics, the inverse gamma distribution is a two-parameter family of continuous probability distributions on the positive real line, which is the distribution of the reciprocal of a variable distributed according to the gamma distribution.

<span class="mw-page-title-main">Generalized inverse Gaussian distribution</span> Family of continuous probability distributions

In probability theory and statistics, the generalized inverse Gaussian distribution (GIG) is a three-parameter family of continuous probability distributions with probability density function

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. However, M-estimators are not inherently robust, as is clear from the fact that they include maximum likelihood estimators, which are in general not robust. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation. The "M" initial stands for "maximum likelihood-type".

A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio Z = X/Y is a ratio distribution.

In mathematics, the spectral theory of ordinary differential equations is the part of spectral theory concerned with the determination of the spectrum and eigenfunction expansion associated with a linear ordinary differential equation. In his dissertation, Hermann Weyl generalized the classical Sturm–Liouville theory on a finite closed interval to second order differential operators with singularities at the endpoints of the interval, possibly semi-infinite or infinite. Unlike the classical case, the spectrum may no longer consist of just a countable set of eigenvalues, but may also contain a continuous part. In this case the eigenfunction expansion involves an integral over the continuous part with respect to a spectral measure, given by the Titchmarsh–Kodaira formula. The theory was put in its final simplified form for singular differential equations of even degree by Kodaira and others, using von Neumann's spectral theorem. It has had important applications in quantum mechanics, operator theory and harmonic analysis on semisimple Lie groups.

A product distribution is a probability distribution constructed as the distribution of the product of random variables having two other known distributions. Given two statistically independent random variables X and Y, the distribution of the random variable Z that is formed as the product is a product distribution.

The table of chords, created by the Greek astronomer, geometer, and geographer Ptolemy in Egypt during the 2nd century AD, is a trigonometric table in Book I, chapter 11 of Ptolemy's Almagest, a treatise on mathematical astronomy. It is essentially equivalent to a table of values of the sine function. It was the earliest trigonometric table extensive enough for many practical purposes, including those of astronomy. Since the 8th and 9th centuries, the sine and other trigonometric functions have been used in Islamic mathematics and astronomy, reforming the production of sine tables. Khwarizmi and Habash al-Hasib later produced a set of trigonometric tables.

In Bayesian statistics, the posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values.

Bayesian hierarchical modelling is a statistical model written in multiple levels that estimates the parameters of the posterior distribution using the Bayesian method. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as the updated probability estimate, as additional evidence on the prior distribution is acquired.

The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by a square-integrable stochastic processes. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in case of GFLM is obtained by forming the scalar product of the random predictor function with a smooth parameter function . Functional Linear Regression, Functional Poisson Regression and Functional Binomial Regression, with the important Functional Logistic Regression included, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.

Exponential Tilting (ET), Exponential Twisting, or Exponential Change of Measure (ECM) is a distribution shifting technique used in many parts of mathematics. The different exponential tiltings of a random variable is known as the natural exponential family of .

In fluid dynamics, a stagnation point flow refers to a fluid flow in the neighbourhood of a stagnation point or a stagnation line with which the stagnation point/line refers to a point/line where the velocity is zero in the inviscid approximation. The flow specifically considers a class of stagnation points known as saddle points wherein incoming streamlines gets deflected and directed outwards in a different direction; the streamline deflections are guided by separatrices. The flow in the neighborhood of the stagnation point or line can generally be described using potential flow theory, although viscous effects cannot be neglected if the stagnation point lies on a solid surface.

Nonlinear mixed-effects models constitute a class of statistical models generalizing linear mixed-effects models. Like linear mixed-effects models, they are particularly useful in settings where there are multiple measurements within the same statistical units or when there are dependencies between measurements on related statistical units. Nonlinear mixed-effects models are applied in many fields including medicine, public health, pharmacology, and ecology.

Integrated nested Laplace approximations (INLA) is a method for approximate Bayesian inference based on Laplace's method. It is designed for a class of models called latent Gaussian models (LGMs), for which it can be a fast and accurate alternative for Markov chain Monte Carlo methods to compute posterior marginal distributions. Due to its relative speed even with large data sets for certain problems and models, INLA has been a popular inference method in applied statistics, in particular spatial statistics, ecology, and epidemiology. It is also possible to combine INLA with a finite element method solution of a stochastic partial differential equation to study e.g. spatial point processes and species distribution models. The INLA method is implemented in the R-INLA R package.

References

  1. Carlin, Bradley P.; Louis, Thomas A. (2002). "Empirical Bayes: Past, Present, and Future". In Raftery, Adrian E.; Tanner, Martin A.; Wells, Martin T. (eds.). Statistics in the 21st Century. Chapman & Hall. pp. 312–318. ISBN   1-58488-272-7.
  2. C.M. Bishop (2005). Neural networks for pattern recognition. Oxford University Press ISBN   0-19-853864-2
  3. 1 2 Robbins, Herbert (1956). "An Empirical Bayes Approach to Statistics". Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. Springer Series in Statistics: 157–163. doi:10.1007/978-1-4612-0919-5_26. ISBN   978-0-387-94037-3. MR   0084919.
  4. Carlin, Bradley P.; Louis, Thomas A. (2000). Bayes and Empirical Bayes Methods for Data Analysis (2nd ed.). Chapman & Hall/CRC. pp. Sec. 3.2 and Appendix B. ISBN   978-1-58488-170-4.
  5. Saremi, Saeed; Hyvärinen, Aapo (2019). "Neural Empirical Bayes". Journal of Machine Learning Research. 20 (181): 1–23. ISSN   1533-7928.

Further reading