Jeffreys prior

Last updated January 05, 2025

In Bayesian statistics, the Jeffreys prior is a non-informative prior distribution for a parameter space. Named after Sir Harold Jeffreys,^[1] its density function is proportional to the square root of the determinant of the Fisher information matrix:

$p\left(\theta \right)\propto \left|I(\theta )\right|^{1/2}.\,$

It has the key feature that it is invariant under a change of coordinates for the parameter vector ${\textstyle \theta }$. That is, the relative probability assigned to a volume of the parameter space using a Jeffreys prior will be the same regardless of the parameterization used to define the Jeffreys prior. This makes it of special interest for use with scale parameters.^[2] As a concrete example, a Bernoulli distribution can be parameterized by the probability of occurrence $p$, or by the odds $r = p / (1 - p)$. A uniform prior on one of these is not the same as a uniform prior on the other, even accounting for reparameterization in the usual way, but the Jeffreys prior on one reparameterizes to the Jeffreys prior on the other.

In maximum likelihood estimation of exponential family models, penalty terms based on the Jeffreys prior were shown to reduce asymptotic bias in point estimates.^[3]^[4]

Reparameterization

One-parameter case

If ${\textstyle \theta }$ and ${\textstyle \varphi }$ are two possible parameterizations of a statistical model, and ${\textstyle \theta }$ is a continuously differentiable function of ${\textstyle \varphi }$, we say that the prior ${\textstyle p_{\theta }(\theta )}$ is "invariant" under a reparameterization if $p_{\varphi }(\varphi )=p_{\theta }(\theta )\left|{\frac {d\theta }{d\varphi }}\right|,$ that is, if the priors ${\textstyle p_{\theta }(\theta )}$ and ${\textstyle p_{\varphi }(\varphi )}$ are related by the usual change of variables theorem.

Since the Fisher information transforms under reparameterization as $I_{\varphi }(\varphi )=I_{\theta }(\theta )\left({\frac {d\theta }{d\varphi }}\right)^{2},$ defining the priors as ${\textstyle p_{\varphi }(\varphi )\propto {\sqrt {I_{\varphi }(\varphi )}}}$ and ${\textstyle p_{\theta }(\theta )\propto {\sqrt {I_{\theta }(\theta )}}}$ gives us the desired "invariance".^[5]
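The invariance can be checked numerically for the Bernoulli model parameterized either by the probability $p$ or by the odds $r = p/(1-p)$. The following is a minimal sketch, not a reference implementation: the function names are illustrative, and the score is approximated by central finite differences rather than derived analytically.

```python
import math

def fisher_information(success_prob, param, h=1e-6):
    """Fisher information of a Bernoulli model at `param`.

    `success_prob(param)` returns P(x = 1); the score d/dparam log f
    is approximated by a central finite difference (an assumption of
    this sketch, chosen to keep the code parameterization-agnostic).
    """
    def log_f(x, t):
        p = success_prob(t)
        return math.log(p) if x == 1 else math.log(1.0 - p)

    p_here = success_prob(param)
    info = 0.0
    for x, prob_x in ((1, p_here), (0, 1.0 - p_here)):
        score = (log_f(x, param + h) - log_f(x, param - h)) / (2.0 * h)
        info += prob_x * score ** 2
    return info

def p_from_p(p):
    return p

def p_from_odds(r):
    return r / (1.0 + r)

def jeffreys_p(p):
    """Unnormalized Jeffreys prior in the probability parameterization."""
    return math.sqrt(fisher_information(p_from_p, p))

def jeffreys_odds(r):
    """Unnormalized Jeffreys prior computed directly in the odds parameterization."""
    return math.sqrt(fisher_information(p_from_odds, r))

def transported_jeffreys(r):
    """Jeffreys prior in p, pushed to r by the change-of-variables rule."""
    dp_dr = 1.0 / (1.0 + r) ** 2
    return jeffreys_p(p_from_odds(r)) * abs(dp_dr)

def transported_uniform(r):
    """Uniform prior in p, pushed to r: no longer uniform."""
    return 1.0 / (1.0 + r) ** 2

for r in (0.2, 1.0, 3.5):
    print(r, jeffreys_odds(r), transported_jeffreys(r), transported_uniform(r))
```

The second and third printed columns agree up to finite-difference error, while the last column varies with $r$, which shows that a uniform prior on $p$ does not correspond to a uniform prior on the odds.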

Multiple-parameter case

Analogous to the one-parameter case, let ${\textstyle {\vec {\theta }}}$ and ${\textstyle {\vec {\varphi }}}$ be two possible parameterizations of a statistical model, with ${\textstyle {\vec {\theta }}}$ a continuously differentiable function of ${\textstyle {\vec {\varphi }}}$. We call the prior ${\textstyle p_{\theta }({\vec {\theta }})}$ "invariant" under reparameterization if $p_{\varphi }({\vec {\varphi }})=p_{\theta }({\vec {\theta }})~|\det J|\,,$ where ${\textstyle J}$ is the Jacobian matrix with entries $J_{ij}={\frac {\partial \theta _{i}}{\partial \varphi _{j}}}.$ Since the Fisher information matrix transforms under reparameterization as $I_{\varphi }({\vec {\varphi }})=J^{T}I_{\theta }({\vec {\theta }})J,$ we have that $\det I_{\varphi }({\vec {\varphi }})=\det I_{\theta }({\vec {\theta }})(\det J)^{2}$ and thus defining the priors as ${\textstyle p_{\varphi }({\vec {\varphi }})\propto {\sqrt {\det I_{\varphi }({\vec {\varphi }})}}}$ and ${\textstyle p_{\theta }({\vec {\theta }})\propto {\sqrt {\det I_{\theta }({\vec {\theta }})}}}$ gives us the desired "invariance".
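A small numerical sketch of the matrix identity, under stated assumptions: it uses the standard Fisher information matrix of a Gaussian in $(\mu ,\sigma )$, namely $\operatorname {diag} (1/\sigma ^{2},2/\sigma ^{2})$, and a hypothetical alternative parameterization $(m,\sigma )$ with $\mu =m\sigma $, chosen only because its Jacobian is not diagonal. The helper names are illustrative.

```python
import math

def fisher_mu_sigma(mu, sigma):
    """Fisher information matrix of N(mu, sigma^2) in theta = (mu, sigma)."""
    return [[1.0 / sigma ** 2, 0.0], [0.0, 2.0 / sigma ** 2]]

def jacobian(m, sigma):
    """J[i][j] = d theta_i / d phi_j for theta = (mu, sigma), phi = (m, sigma), mu = m * sigma."""
    return [[sigma, m], [0.0, 1.0]]

def transpose(a):
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def fisher_phi(m, sigma):
    """Fisher information in phi, obtained as J^T I_theta J."""
    J = jacobian(m, sigma)
    return matmul(matmul(transpose(J), fisher_mu_sigma(m * sigma, sigma)), J)

def jeffreys_theta(mu, sigma):
    return math.sqrt(det(fisher_mu_sigma(mu, sigma)))

def jeffreys_phi(m, sigma):
    return math.sqrt(det(fisher_phi(m, sigma)))

m, sigma = 4.0, 3.0
lhs = jeffreys_phi(m, sigma)
rhs = jeffreys_theta(m * sigma, sigma) * abs(det(jacobian(m, sigma)))
print(lhs, rhs)  # the two values agree up to rounding
```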

Attributes

From a practical and mathematical standpoint, a valid reason to use this non-informative prior instead of others, like the ones obtained through a limit in conjugate families of distributions, is that the relative probability of a volume of the parameter space does not depend on the set of parameter variables chosen to describe that space.

Sometimes the Jeffreys prior cannot be normalized, and is thus an improper prior. For example, the Jeffreys prior for the distribution mean is uniform over the entire real line in the case of a Gaussian distribution of known variance.

Use of the Jeffreys prior violates the strong version of the likelihood principle, which is accepted by many, but by no means all, statisticians. When using the Jeffreys prior, inferences about ${\textstyle {\vec {\theta }}}$ depend not just on the probability of the observed data as a function of ${\textstyle {\vec {\theta }}}$, but also on the universe of all possible experimental outcomes, as determined by the experimental design, because the Fisher information is computed from an expectation over the chosen universe. Accordingly, the Jeffreys prior, and hence the inferences made using it, may be different for two experiments involving the same ${\textstyle {\vec {\theta }}}$ parameter even when the likelihood functions for the two experiments are the same, which is a violation of the strong likelihood principle.
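A standard textbook illustration (not taken from the text above) compares two designs that both end with $s$ successes and $f$ failures: a binomial experiment with $n=s+f$ fixed in advance, and a negative binomial experiment that continues until the $s$-th success. Both likelihoods are proportional to $\gamma ^{s}(1-\gamma )^{f}$, yet the Fisher information, and hence the Jeffreys prior, differs between them:

```latex
\begin{aligned}
\text{binomial } (n \text{ fixed}):\quad
  & I(\gamma) = \frac{n}{\gamma(1-\gamma)}
  & \Rightarrow\quad p(\gamma) &\propto \gamma^{-1/2}(1-\gamma)^{-1/2},\\
\text{negative binomial } (s \text{ fixed}):\quad
  & I(\gamma) = \frac{s}{\gamma^{2}(1-\gamma)}
  & \Rightarrow\quad p(\gamma) &\propto \gamma^{-1}(1-\gamma)^{-1/2}.
\end{aligned}
```

The resulting posteriors are $\operatorname {Beta} (s+1/2,\,f+1/2)$ and $\operatorname {Beta} (s,\,f+1/2)$ respectively, so the same observed data lead to different inferences depending on the stopping rule.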

Minimum description length

In the minimum description length approach to statistics, the goal is to describe data as compactly as possible, where the length of a description is measured in bits of the code used. For a parametric family of distributions, one compares a code with the best code based on one of the distributions in the parameterized family. The main result is that in exponential families, asymptotically for large sample size, the code based on the distribution that is a mixture of the elements in the exponential family with the Jeffreys prior is optimal. This result holds if one restricts the parameter set to a compact subset in the interior of the full parameter space.^[citation needed] If the full parameter space is used, a modified version of the result should be used.

Examples

The Jeffreys prior for a parameter (or a set of parameters) depends upon the statistical model.

Gaussian distribution with mean parameter

For the Gaussian distribution of the real value ${\textstyle x}$ $f(x\mid \mu )={\frac {e^{-(x-\mu )^{2}/2\sigma ^{2}}}{\sqrt {2\pi \sigma ^{2}}}}$ with ${\textstyle \sigma }$ fixed, the Jeffreys prior for the mean ${\textstyle \mu }$ is ${\begin{aligned}p(\mu )&\propto {\sqrt {I(\mu )}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d}{d\mu }}\log f(x\mid \mu )\right)^{2}\right]}}={\sqrt {\operatorname {E} \!\left[\left({\frac {x-\mu }{\sigma ^{2}}}\right)^{2}\right]}}\\&={\sqrt {\int _{-\infty }^{+\infty }f(x\mid \mu )\left({\frac {x-\mu }{\sigma ^{2}}}\right)^{2}dx}}={\sqrt {\sigma ^{2}/\sigma ^{4}}}\propto 1.\end{aligned}}$ That is, the Jeffreys prior for ${\textstyle \mu }$ does not depend upon ${\textstyle \mu }$; it is the unnormalized uniform distribution on the real line, namely the distribution that is 1 (or some other fixed constant) for all points. This is an improper prior, and is, up to the choice of constant, the unique translation-invariant distribution on the reals (the Haar measure with respect to addition of reals), corresponding to the mean being a measure of location and translation-invariance corresponding to no information about location.
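The integral in the derivation can be approximated numerically as a sanity check. This is a rough sketch using a midpoint rule on a truncated range; the truncation width and step count are arbitrary choices of this example.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

def fisher_info_mean(mu, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of E[((x - mu) / sigma^2)^2] over mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        score = (x - mu) / sigma ** 2  # d/dmu log f(x | mu)
        total += gaussian_pdf(x, mu, sigma) * score ** 2 * dx
    return total

sigma = 2.0
for mu in (-3.0, 0.0, 5.0):
    # Each value is close to 1 / sigma^2 and does not vary with mu.
    print(mu, fisher_info_mean(mu, sigma))
```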

Gaussian distribution with standard deviation parameter

For the Gaussian distribution of the real value ${\textstyle x}$ $f(x\mid \sigma )={\frac {e^{-(x-\mu )^{2}/2\sigma ^{2}}}{\sqrt {2\pi \sigma ^{2}}}},$ with ${\textstyle \mu }$ fixed, the Jeffreys prior for the standard deviation ${\textstyle \sigma >0}$ is ${\begin{aligned}p(\sigma )&\propto {\sqrt {I(\sigma )}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d}{d\sigma }}\log f(x\mid \sigma )\right)^{2}\right]}}={\sqrt {\operatorname {E} \!\left[\left({\frac {(x-\mu )^{2}-\sigma ^{2}}{\sigma ^{3}}}\right)^{2}\right]}}\\&={\sqrt {\int _{-\infty }^{+\infty }f(x\mid \sigma )\left({\frac {(x-\mu )^{2}-\sigma ^{2}}{\sigma ^{3}}}\right)^{2}dx}}={\sqrt {\frac {2}{\sigma ^{2}}}}\propto {\frac {1}{\sigma }}.\end{aligned}}$ Equivalently, the Jeffreys prior for ${\textstyle \log \sigma =\int d\sigma /\sigma }$ is the unnormalized uniform distribution on the real line, and thus this distribution is also known as the logarithmic prior. Similarly, the Jeffreys prior for ${\textstyle \log \sigma ^{2}=2\log \sigma }$ is also uniform. It is the unique (up to a multiple) prior (on the positive reals) that is scale-invariant (the Haar measure with respect to multiplication of positive reals), corresponding to the standard deviation being a measure of scale and scale-invariance corresponding to no information about scale. As with the uniform distribution on the reals, it is an improper prior.
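The same kind of numerical check applies to the standard deviation. The sketch below (illustrative names, midpoint-rule quadrature on a truncated range) shows that $\sigma {\sqrt {I(\sigma )}}$ stays constant, which is the statement that the prior is uniform in $\log \sigma $.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

def fisher_info_sigma(sigma, mu=0.0, half_width=14.0, steps=40000):
    """Midpoint-rule approximation of E[(((x - mu)^2 - sigma^2) / sigma^3)^2]."""
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        score = ((x - mu) ** 2 - sigma ** 2) / sigma ** 3  # d/dsigma log f(x | sigma)
        total += gaussian_pdf(x, mu, sigma) * score ** 2 * dx
    return total

def prior_in_log_sigma(sigma):
    """Jeffreys prior pushed to tau = log(sigma): sqrt(I(sigma)) * |d sigma / d tau|."""
    return math.sqrt(fisher_info_sigma(sigma)) * sigma

for sigma in (0.5, 1.0, 4.0):
    # The last column stays near sqrt(2) for every sigma.
    print(sigma, fisher_info_sigma(sigma), prior_in_log_sigma(sigma))
```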

Poisson distribution with rate parameter

For the Poisson distribution of the non-negative integer ${\textstyle n}$, $f(n\mid \lambda )=e^{-\lambda }{\frac {\lambda ^{n}}{n!}},$ the Jeffreys prior for the rate parameter ${\textstyle \lambda \geq 0}$ is ${\begin{aligned}p(\lambda )&\propto {\sqrt {I(\lambda )}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d}{d\lambda }}\log f(n\mid \lambda )\right)^{2}\right]}}={\sqrt {\operatorname {E} \!\left[\left({\frac {n-\lambda }{\lambda }}\right)^{2}\right]}}\\&={\sqrt {\sum _{n=0}^{+\infty }f(n\mid \lambda )\left({\frac {n-\lambda }{\lambda }}\right)^{2}}}={\sqrt {\frac {1}{\lambda }}}.\end{aligned}}$ Equivalently, the Jeffreys prior for ${\textstyle {\sqrt {\lambda }}=\int d\lambda /{\sqrt {\lambda }}}$ is the unnormalized uniform distribution on the non-negative real line.
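The infinite sum can be approximated by truncating it far in the upper tail. A minimal sketch, with the truncation point chosen heuristically for this example:

```python
import math

def fisher_info_poisson(lam, tail_sds=40.0):
    """Truncated sum of f(n | lam) * ((n - lam) / lam)^2 over n = 0, 1, 2, ..."""
    n_max = int(lam + tail_sds * math.sqrt(lam) + 50)
    pmf = math.exp(-lam)  # f(0 | lam)
    total = 0.0
    for n in range(n_max + 1):
        score = (n - lam) / lam  # d/dlam log f(n | lam)
        total += pmf * score ** 2
        pmf *= lam / (n + 1)  # recurrence: f(n + 1) = f(n) * lam / (n + 1)
    return total

def prior_in_sqrt_rate(s):
    """Jeffreys prior pushed to s = sqrt(lam): sqrt(I(lam)) * |d lam / d s|."""
    lam = s * s
    return math.sqrt(fisher_info_poisson(lam)) * 2.0 * s

for lam in (0.5, 3.0, 20.0):
    # lam * I(lam) is close to 1, and the prior in sqrt(lam) is close to the constant 2.
    print(lam, lam * fisher_info_poisson(lam), prior_in_sqrt_rate(math.sqrt(lam)))
```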

Bernoulli trial

For a coin that is "heads" with probability ${\textstyle \gamma \in [0,1]}$ and is "tails" with probability ${\textstyle 1-\gamma }$, for a given ${\textstyle (H,T)\in \{(0,1),(1,0)\}}$ the probability is ${\textstyle \gamma ^{H}(1-\gamma )^{T}}$. The Jeffreys prior for the parameter ${\textstyle \gamma }$ is

${\begin{aligned}p(\gamma )&\propto {\sqrt {I(\gamma )}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d}{d\gamma }}\log f(x\mid \gamma )\right)^{2}\right]}}={\sqrt {\operatorname {E} \!\left[\left({\frac {H}{\gamma }}-{\frac {T}{1-\gamma }}\right)^{2}\right]}}\\&={\sqrt {\gamma \left({\frac {1}{\gamma }}-{\frac {0}{1-\gamma }}\right)^{2}+(1-\gamma )\left({\frac {0}{\gamma }}-{\frac {1}{1-\gamma }}\right)^{2}}}={\frac {1}{\sqrt {\gamma (1-\gamma )}}}\,.\end{aligned}}$

This is the arcsine distribution and is a beta distribution with ${\textstyle \alpha =\beta =1/2}$. Furthermore, if ${\textstyle \gamma =\sin ^{2}(\theta )}$ then $\Pr[\theta ]=\Pr[\gamma ]{\frac {d\gamma }{d\theta }}\propto {\frac {1}{\sqrt {(\sin ^{2}\theta )(1-\sin ^{2}\theta )}}}~2\sin \theta \cos \theta =2\,.$ That is, the Jeffreys prior for ${\textstyle \theta }$ is uniform in the interval ${\textstyle [0,\pi /2]}$. Equivalently, ${\textstyle \theta }$ is uniform on the whole circle ${\textstyle [0,2\pi ]}$.
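The change of variables to ${\textstyle \theta }$ can be checked numerically. The sketch below assumes the normalized form of the prior, $1/(\pi {\sqrt {\gamma (1-\gamma )}})$, and the arcsine CDF $(2/\pi )\arcsin {\sqrt {\gamma }}$; the function names are illustrative.

```python
import math

def jeffreys_density(gamma):
    """Normalized Jeffreys prior for a Bernoulli parameter: Beta(1/2, 1/2)."""
    return 1.0 / (math.pi * math.sqrt(gamma * (1.0 - gamma)))

def density_in_theta(theta):
    """Density of theta when gamma = sin^2(theta), by change of variables."""
    gamma = math.sin(theta) ** 2
    dgamma_dtheta = 2.0 * math.sin(theta) * math.cos(theta)
    return jeffreys_density(gamma) * dgamma_dtheta

def arcsine_cdf(gamma):
    return (2.0 / math.pi) * math.asin(math.sqrt(gamma))

def integrate_density(a, b, steps=20000):
    """Midpoint-rule integral of the Jeffreys density over [a, b], with 0 < a < b < 1."""
    dx = (b - a) / steps
    return sum(jeffreys_density(a + (k + 0.5) * dx) for k in range(steps)) * dx

for theta in (0.1, 0.7, 1.4):
    print(theta, density_in_theta(theta))  # constant, equal to 2 / pi

print(integrate_density(0.2, 0.9), arcsine_cdf(0.9) - arcsine_cdf(0.2))
```

With this prior, observing $H$ heads and $T$ tails gives a $\operatorname {Beta} (H+1/2,\,T+1/2)$ posterior by conjugacy.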

N-sided die with biased probabilities

Similarly, for a throw of an ${\textstyle N}$-sided die with outcome probabilities ${\textstyle {\vec {\gamma }}=(\gamma _{1},\ldots ,\gamma _{N})}$, each non-negative and satisfying ${\textstyle \sum _{i=1}^{N}\gamma _{i}=1}$, the Jeffreys prior for ${\textstyle {\vec {\gamma }}}$ is the Dirichlet distribution with all (alpha) parameters set to one half. This amounts to using a pseudocount of one half for each possible outcome.

Equivalently, if we write ${\textstyle \gamma _{i}=\varphi _{i}^{2}}$ for each ${\textstyle i}$, then the Jeffreys prior for ${\textstyle {\vec {\varphi }}}$ is uniform on the ${\textstyle (N-1)}$-dimensional unit sphere (i.e., it is uniform on the surface of an ${\textstyle N}$-dimensional unit ball).
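The pseudocount reading of the Dirichlet prior can be made concrete. Given observed counts $n_{i}$ with total $n$, the posterior is Dirichlet with parameters $n_{i}+1/2$, so the posterior mean of ${\textstyle \gamma _{i}}$ is $(n_{i}+1/2)/(n+N/2)$. A minimal sketch with exact rational arithmetic follows; the function name and the example counts are made up for illustration.

```python
from fractions import Fraction

def jeffreys_posterior_mean(counts):
    """Posterior mean of each outcome probability under a Dirichlet(1/2, ..., 1/2) prior."""
    half = Fraction(1, 2)
    total = sum(counts) + half * len(counts)  # n + N/2
    return [(c + half) / total for c in counts]

# Hypothetical counts from ten throws of a four-sided die.
means = jeffreys_posterior_mean([3, 0, 1, 6])
print(means)  # an outcome never observed still gets nonzero probability
```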

Generalizations

Probability-matching prior

In 1963, Welch and Peers showed that for a scalar parameter ${\textstyle \theta }$ the Jeffreys prior is "probability-matching" in the sense that posterior predictive probabilities agree asymptotically with frequentist probabilities and credible intervals of a chosen width coincide asymptotically with frequentist confidence intervals.^[6] In a follow-up, Peers showed that this was not true for the multi-parameter case,^[7] instead leading to the notion of probability-matching priors, which are only implicitly defined as the probability distribution solving a certain partial differential equation involving the Fisher information.^[8]

α-parallel prior

Using tools from information geometry, the Jeffreys prior can be generalized to obtain priors that encode geometric information of the statistical model, so as to be invariant under a change of parameter coordinates.^[9] A special case, the so-called Weyl prior, is defined as a volume form on a Weyl manifold.^[10]

References

1. Jeffreys H (1946). "An invariant form for the prior probability in estimation problems". Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences. 186 (1007): 453–461. Bibcode:1946RSPSA.186..453J. doi:10.1098/rspa.1946.0056. JSTOR 97883. PMID 20998741.
2. Jaynes ET (September 1968). "Prior probabilities" (PDF). IEEE Transactions on Systems Science and Cybernetics. 4 (3): 227–241. doi:10.1109/TSSC.1968.300117.
3. Firth, David (1992). "Bias reduction, the Jeffreys prior and GLIM". In Fahrmeir, Ludwig; Francis, Brian; Gilchrist, Robert; Tutz, Gerhard (eds.). Advances in GLIM and Statistical Modelling. New York: Springer. pp. 91–100. doi:10.1007/978-1-4612-2952-0_15. ISBN 0-387-97873-9.
4. Magis, David (2015). "A Note on Weighted Likelihood and Jeffreys Modal Estimation of Proficiency Levels in Polytomous Item Response Models". Psychometrika. 80: 200–204. doi:10.1007/s11336-013-9378-5.
5. Robert CP, Chopin N, Rousseau J (2009). "Harold Jeffreys's Theory of Probability Revisited". Statistical Science. 24 (2). arXiv:0804.3173. doi:10.1214/09-STS284.
6. Welch, B. L.; Peers, H. W. (1963). "On Formulae for Confidence Points Based on Integrals of Weighted Likelihoods". Journal of the Royal Statistical Society. Series B (Methodological). 25 (2): 318–329. doi:10.1111/j.2517-6161.1963.tb00512.x.
7. Peers, H. W. (1965). "On Confidence Points and Bayesian Probability Points in the Case of Several Parameters". Journal of the Royal Statistical Society. Series B (Methodological). 27 (1): 9–16. doi:10.1111/j.2517-6161.1965.tb00581.x.
8. Scricciolo, Catia (1999). "Probability matching priors: a review". Journal of the Italian Statistical Society. 8. 83. doi:10.1007/BF03178943.
9. Takeuchi, J.; Amari, S. (2005). "α-parallel prior and its properties". IEEE Transactions on Information Theory. 51 (3): 1011–1023. doi:10.1109/TIT.2004.842703.
10. Jiang, Ruichao; Tavakoli, Javad; Zhao, Yiqiang (2020). "Weyl Prior and Bayesian Statistics". Entropy. 22 (4). 467. doi:10.3390/e22040467. PMC 7516948.