Normal-gamma distribution

Last updated December 09, 2023

In probability theory and statistics, the normal-gamma distribution (or Gaussian-gamma distribution) is a bivariate four-parameter family of continuous probability distributions. It is the conjugate prior of a normal distribution with unknown mean and precision.^[2]

Definition

For a pair of random variables, (X,T), suppose that the conditional distribution of X given T is given by

X\mid T\sim N(\mu ,1/(\lambda T))\,\!,

meaning that the conditional distribution is a normal distribution with mean $\mu$ and precision $\lambda T$ — equivalently, with variance $1/(\lambda T).$

Suppose also that the marginal distribution of T is given by

T\mid \alpha ,\beta \sim \operatorname {Gamma} (\alpha ,\beta ),

meaning that T has a gamma distribution with shape α and rate β. Here μ, λ, α and β are parameters of the joint distribution.

Then (X,T) has a normal-gamma distribution, and this is denoted by

(X,T)\sim \operatorname {NormalGamma} (\mu ,\lambda ,\alpha ,\beta ).

Properties

Probability density function

The joint probability density function of (X,T) is

f(x,\tau \mid \mu ,\lambda ,\alpha ,\beta )={\frac {\beta ^{\alpha }{\sqrt {\lambda }}}{\Gamma (\alpha ){\sqrt {2\pi }}}}\,\tau ^{\alpha -{\frac {1}{2}}}\,e^{-\beta \tau }\exp \left(-{\frac {\lambda \tau (x-\mu )^{2}}{2}}\right)
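The density above can be evaluated numerically as in the following minimal sketch. The function names `normal_gamma_logpdf` and `normal_gamma_pdf` are illustrative and not taken from any library; working in log space is a design choice made here to avoid overflow in $\beta ^{\alpha }$ and $\Gamma (\alpha )$ for large $\alpha$.

```python
import math


def normal_gamma_logpdf(x, tau, mu, lam, alpha, beta):
    """Log of the joint density f(x, tau | mu, lam, alpha, beta).

    beta is the rate of the gamma factor, as in the formula above.
    """
    if tau <= 0 or lam <= 0 or alpha <= 0 or beta <= 0:
        raise ValueError("tau, lam, alpha and beta must be positive")
    return (
        alpha * math.log(beta)             # beta^alpha
        + 0.5 * math.log(lam)              # sqrt(lambda)
        - math.lgamma(alpha)               # 1 / Gamma(alpha)
        - 0.5 * math.log(2.0 * math.pi)    # 1 / sqrt(2 pi)
        + (alpha - 0.5) * math.log(tau)    # tau^(alpha - 1/2)
        - beta * tau                       # exp(-beta tau)
        - 0.5 * lam * tau * (x - mu) ** 2  # Gaussian exponent
    )


def normal_gamma_pdf(x, tau, mu, lam, alpha, beta):
    """Joint density, obtained by exponentiating the log density."""
    return math.exp(normal_gamma_logpdf(x, tau, mu, lam, alpha, beta))
```

The value returned should agree with the product of a Gamma(α, β) density in $\tau$ and a normal density in $x$ with mean $\mu$ and precision $\lambda \tau$, which is how the distribution was defined.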

Marginal distributions

By construction, the marginal distribution of $\tau$ is a gamma distribution, and the conditional distribution of $x$ given $\tau$ is a Gaussian distribution. The marginal distribution of $x$ is a three-parameter non-standardized Student's t-distribution with parameters $(\nu ,\mu ,\sigma ^{2})=(2\alpha ,\mu ,\beta /(\lambda \alpha ))$.
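The Student's t marginal can be checked by integrating $\tau$ out of the joint density. The following is a sketch of that calculation; the only tool assumed beyond the density itself is the gamma integral $\int _{0}^{\infty }\tau ^{a-1}e^{-b\tau }\,d\tau =\Gamma (a)/b^{a}$ with $a=\alpha +{\frac {1}{2}}$ and $b=\beta +\lambda (x-\mu )^{2}/2$:

{\begin{aligned}f(x)&=\int _{0}^{\infty }f(x,\tau \mid \mu ,\lambda ,\alpha ,\beta )\,d\tau \\[5pt]&={\frac {\beta ^{\alpha }{\sqrt {\lambda }}}{\Gamma (\alpha ){\sqrt {2\pi }}}}\int _{0}^{\infty }\tau ^{\alpha +{\frac {1}{2}}-1}\exp \left[-\tau \left(\beta +{\frac {\lambda (x-\mu )^{2}}{2}}\right)\right]d\tau \\[5pt]&={\frac {\beta ^{\alpha }{\sqrt {\lambda }}}{\Gamma (\alpha ){\sqrt {2\pi }}}}\,{\frac {\Gamma (\alpha +{\frac {1}{2}})}{\left(\beta +{\frac {\lambda (x-\mu )^{2}}{2}}\right)^{\alpha +{\frac {1}{2}}}}}\\[5pt]&={\frac {\Gamma (\alpha +{\frac {1}{2}})}{\Gamma (\alpha )}}{\sqrt {\frac {\lambda }{2\pi \beta }}}\left(1+{\frac {\lambda (x-\mu )^{2}}{2\beta }}\right)^{-(\alpha +{\frac {1}{2}})}.\end{aligned}}

Setting $\nu =2\alpha$ and $\sigma ^{2}=\beta /(\lambda \alpha )$ gives $\nu \sigma ^{2}=2\beta /\lambda$, so the last line equals ${\frac {\Gamma ((\nu +1)/2)}{\Gamma (\nu /2){\sqrt {\pi \nu \sigma ^{2}}}}}\left(1+{\frac {(x-\mu )^{2}}{\nu \sigma ^{2}}}\right)^{-(\nu +1)/2}$, which is the non-standardized Student's t density.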

Exponential family

The normal-gamma distribution is a four-parameter exponential family with natural parameters $\alpha -1/2,-\beta -\lambda \mu ^{2}/2,\lambda \mu ,-\lambda /2$ and natural statistics $\ln \tau ,\tau ,\tau x,\tau x^{2}$.
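As an illustration, the sketch below pairs each natural parameter with its natural statistic and evaluates their inner product, which should reproduce the unnormalized log density $(\alpha -{\tfrac {1}{2}})\ln \tau -\beta \tau -\lambda \tau (x-\mu )^{2}/2$. The function names are hypothetical and chosen only for this example.

```python
import math


def natural_parameters(mu, lam, alpha, beta):
    """Natural parameters, in the order listed in the text."""
    return (alpha - 0.5, -beta - 0.5 * lam * mu ** 2, lam * mu, -0.5 * lam)


def natural_statistics(x, tau):
    """Natural statistics (ln tau, tau, tau x, tau x^2)."""
    return (math.log(tau), tau, tau * x, tau * x ** 2)


def log_kernel(x, tau, mu, lam, alpha, beta):
    """Inner product of natural parameters and statistics.

    This is the log density up to the additive normalizing constant.
    """
    eta = natural_parameters(mu, lam, alpha, beta)
    stats = natural_statistics(x, tau)
    return sum(e * s for e, s in zip(eta, stats))
```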

Moments of the natural statistics

The following moments can be easily computed using the moment generating function of the sufficient statistic:^[3]

\operatorname {E} (\ln T)=\psi \left(\alpha \right)-\ln \beta ,

where $\psi \left(\alpha \right)$ is the digamma function,

{\begin{aligned}\operatorname {E} (T)&={\frac {\alpha }{\beta }},\\[5pt]\operatorname {E} (TX)&=\mu {\frac {\alpha }{\beta }},\\[5pt]\operatorname {E} (TX^{2})&={\frac {1}{\lambda }}+\mu ^{2}{\frac {\alpha }{\beta }}.\end{aligned}}
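The four moments above can be computed directly, as in this sketch. The Python standard library has no digamma function, so the helper `digamma` below is an assumption of this example: it uses the recurrence $\psi (x)=\psi (x+1)-1/x$ followed by a truncated asymptotic series, which is accurate to roughly $10^{-8}$ and is not meant as a production implementation.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0 (recurrence plus asymptotic series)."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    # Shift the argument upward until the asymptotic series is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x
    result -= inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def natural_statistic_moments(mu, lam, alpha, beta):
    """Expected values of ln T, T, TX and TX^2 under NormalGamma."""
    e_t = alpha / beta
    return {
        "E[ln T]": digamma(alpha) - math.log(beta),
        "E[T]": e_t,
        "E[TX]": mu * e_t,
        "E[TX^2]": 1.0 / lam + mu ** 2 * e_t,
    }
```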

Scaling

If $(X,T)\sim \mathrm {NormalGamma} (\mu ,\lambda ,\alpha ,\beta )$, then for any $b>0$, $(bX,bT)$ is distributed as ${\rm {NormalGamma}}(b\mu ,\lambda /b^{3},\alpha ,\beta /b).$
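The scaling rule can be checked with a change of variables: the density of $(bX,bT)$ at $(bx,b\tau )$ equals the original density at $(x,\tau )$ divided by the Jacobian $b^{2}$. The sketch below compares that with the density under the scaled parameters; `scaled_parameters` and `normal_gamma_pdf` are names introduced here for illustration.

```python
import math


def normal_gamma_pdf(x, tau, mu, lam, alpha, beta):
    """Joint normal-gamma density with beta as the gamma rate."""
    log_f = (
        alpha * math.log(beta) + 0.5 * math.log(lam)
        - math.lgamma(alpha) - 0.5 * math.log(2.0 * math.pi)
        + (alpha - 0.5) * math.log(tau)
        - beta * tau - 0.5 * lam * tau * (x - mu) ** 2
    )
    return math.exp(log_f)


def scaled_parameters(b, mu, lam, alpha, beta):
    """Parameters of (bX, bT) when (X, T) ~ NormalGamma(mu, lam, alpha, beta)."""
    if b <= 0:
        raise ValueError("b must be positive")
    return (b * mu, lam / b ** 3, alpha, beta / b)


def scaling_gap(b, x, tau, mu, lam, alpha, beta):
    """Difference between the two ways of computing the scaled density."""
    via_rule = normal_gamma_pdf(b * x, b * tau, *scaled_parameters(b, mu, lam, alpha, beta))
    via_jacobian = normal_gamma_pdf(x, tau, mu, lam, alpha, beta) / b ** 2
    return via_rule - via_jacobian
```

A gap near zero at several test points is consistent with the stated rule, although it is a numerical check rather than a proof.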

Posterior distribution of the parameters

Assume that $x$ is distributed according to a normal distribution with unknown mean $\mu$ and precision $\tau$,

x\sim {\mathcal {N}}(\mu ,\tau ^{-1})

and that the prior distribution on $(\mu ,\tau )$ is a normal-gamma distribution

(\mu ,\tau )\sim {\text{NormalGamma}}(\mu _{0},\lambda _{0},\alpha _{0},\beta _{0}),

for which the density $\pi$ satisfies

\pi (\mu ,\tau )\propto \tau ^{\alpha _{0}-{\frac {1}{2}}}\,\exp[-\beta _{0}\tau ]\,\exp \left[-{\frac {\lambda _{0}\tau (\mu -\mu _{0})^{2}}{2}}\right].

Suppose

x_{1},\ldots ,x_{n}\mid \mu ,\tau \sim \operatorname {{i.}{i.}{d.}} \operatorname {N} \left(\mu ,\tau ^{-1}\right),

i.e. the components of $\mathbf {X} =(x_{1},\ldots ,x_{n})$ are conditionally independent given $\mu ,\tau$ and the conditional distribution of each of them given $\mu ,\tau$ is normal with expected value $\mu$ and variance $1/\tau .$ The posterior distribution of $\mu$ and $\tau$ given this dataset $\mathbf {X}$ can be analytically determined by Bayes' theorem;^[4] explicitly,

\mathbf {P} (\tau ,\mu \mid \mathbf {X} )\propto \mathbf {L} (\mathbf {X} \mid \tau ,\mu )\pi (\tau ,\mu ),

where $\mathbf {L}$ is the likelihood of the parameters given the data.

Since the data are i.i.d., the likelihood of the entire dataset is equal to the product of the likelihoods of the individual data samples:

\mathbf {L} (\mathbf {X} \mid \tau ,\mu )=\prod _{i=1}^{n}\mathbf {L} (x_{i}\mid \tau ,\mu ).

This expression can be simplified as follows:

{\begin{aligned}\mathbf {L} (\mathbf {X} \mid \tau ,\mu )&\propto \prod _{i=1}^{n}\tau ^{1/2}\exp \left[{\frac {-\tau }{2}}(x_{i}-\mu )^{2}\right]\\[5pt]&\propto \tau ^{n/2}\exp \left[{\frac {-\tau }{2}}\sum _{i=1}^{n}(x_{i}-\mu )^{2}\right]\\[5pt]&\propto \tau ^{n/2}\exp \left[{\frac {-\tau }{2}}\sum _{i=1}^{n}(x_{i}-{\bar {x}}+{\bar {x}}-\mu )^{2}\right]\\[5pt]&\propto \tau ^{n/2}\exp \left[{\frac {-\tau }{2}}\sum _{i=1}^{n}\left((x_{i}-{\bar {x}})^{2}+({\bar {x}}-\mu )^{2}\right)\right]\\[5pt]&\propto \tau ^{n/2}\exp \left[{\frac {-\tau }{2}}\left(ns+n({\bar {x}}-\mu )^{2}\right)\right],\end{aligned}}

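where ${\bar {x}}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}$ is the mean of the data samples and $s={\frac {1}{n}}\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}$ is the (biased) sample variance. The cross term $2({\bar {x}}-\mu )\sum _{i=1}^{n}(x_{i}-{\bar {x}})$ that arises when the square is expanded vanishes because $\sum _{i=1}^{n}(x_{i}-{\bar {x}})=0$.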

The posterior distribution of the parameters is proportional to the prior times the likelihood.

{\begin{aligned}\mathbf {P} (\tau ,\mu \mid \mathbf {X} )&\propto \mathbf {L} (\mathbf {X} \mid \tau ,\mu )\pi (\tau ,\mu )\\&\propto \tau ^{n/2}\exp \left[{\frac {-\tau }{2}}\left(ns+n({\bar {x}}-\mu )^{2}\right)\right]\tau ^{\alpha _{0}-{\frac {1}{2}}}\,\exp[{-\beta _{0}\tau }]\,\exp \left[-{\frac {\lambda _{0}\tau (\mu -\mu _{0})^{2}}{2}}\right]\\&\propto \tau ^{{\frac {n}{2}}+\alpha _{0}-{\frac {1}{2}}}\exp \left[-\tau \left({\frac {1}{2}}ns+\beta _{0}\right)\right]\exp \left[-{\frac {\tau }{2}}\left(\lambda _{0}(\mu -\mu _{0})^{2}+n({\bar {x}}-\mu )^{2}\right)\right]\end{aligned}}

The final exponential term is simplified by completing the square.

{\begin{aligned}\lambda _{0}(\mu -\mu _{0})^{2}+n({\bar {x}}-\mu )^{2}&=\lambda _{0}\mu ^{2}-2\lambda _{0}\mu \mu _{0}+\lambda _{0}\mu _{0}^{2}+n\mu ^{2}-2n{\bar {x}}\mu +n{\bar {x}}^{2}\\&=(\lambda _{0}+n)\mu ^{2}-2(\lambda _{0}\mu _{0}+n{\bar {x}})\mu +\lambda _{0}\mu _{0}^{2}+n{\bar {x}}^{2}\\&=(\lambda _{0}+n)(\mu ^{2}-2{\frac {\lambda _{0}\mu _{0}+n{\bar {x}}}{\lambda _{0}+n}}\mu )+\lambda _{0}\mu _{0}^{2}+n{\bar {x}}^{2}\\&=(\lambda _{0}+n)\left(\mu -{\frac {\lambda _{0}\mu _{0}+n{\bar {x}}}{\lambda _{0}+n}}\right)^{2}+\lambda _{0}\mu _{0}^{2}+n{\bar {x}}^{2}-{\frac {\left(\lambda _{0}\mu _{0}+n{\bar {x}}\right)^{2}}{\lambda _{0}+n}}\\&=(\lambda _{0}+n)\left(\mu -{\frac {\lambda _{0}\mu _{0}+n{\bar {x}}}{\lambda _{0}+n}}\right)^{2}+{\frac {\lambda _{0}n({\bar {x}}-\mu _{0})^{2}}{\lambda _{0}+n}}\end{aligned}}

On inserting this back into the expression above,

{\begin{aligned}\mathbf {P} (\tau ,\mu \mid \mathbf {X} )&\propto \tau ^{{\frac {n}{2}}+\alpha _{0}-{\frac {1}{2}}}\exp \left[-\tau \left({\frac {1}{2}}ns+\beta _{0}\right)\right]\exp \left[-{\frac {\tau }{2}}\left(\left(\lambda _{0}+n\right)\left(\mu -{\frac {\lambda _{0}\mu _{0}+n{\bar {x}}}{\lambda _{0}+n}}\right)^{2}+{\frac {\lambda _{0}n({\bar {x}}-\mu _{0})^{2}}{\lambda _{0}+n}}\right)\right]\\&\propto \tau ^{{\frac {n}{2}}+\alpha _{0}-{\frac {1}{2}}}\exp \left[-\tau \left({\frac {1}{2}}ns+\beta _{0}+{\frac {\lambda _{0}n({\bar {x}}-\mu _{0})^{2}}{2(\lambda _{0}+n)}}\right)\right]\exp \left[-{\frac {\tau }{2}}\left(\lambda _{0}+n\right)\left(\mu -{\frac {\lambda _{0}\mu _{0}+n{\bar {x}}}{\lambda _{0}+n}}\right)^{2}\right]\end{aligned}}

This final expression is in exactly the same form as a normal-gamma distribution, i.e.,

\mathbf {P} (\tau ,\mu \mid \mathbf {X} )={\text{NormalGamma}}\left({\frac {\lambda _{0}\mu _{0}+n{\bar {x}}}{\lambda _{0}+n}},\lambda _{0}+n,\alpha _{0}+{\frac {n}{2}},\beta _{0}+{\frac {1}{2}}\left(ns+{\frac {\lambda _{0}n({\bar {x}}-\mu _{0})^{2}}{\lambda _{0}+n}}\right)\right)
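The update rule above translates into a few lines of code. This is a minimal sketch, assuming the data are given as a sequence of floats; the function name `posterior_parameters` is hypothetical.

```python
def posterior_parameters(data, mu0, lam0, alpha0, beta0):
    """Normal-gamma posterior parameters after observing i.i.d. normal data.

    Returns (mu_n, lam_n, alpha_n, beta_n), with beta as a rate parameter.
    """
    n = len(data)
    if n == 0:
        # No observations: the posterior is the prior.
        return (mu0, lam0, alpha0, beta0)
    xbar = sum(data) / n
    # n * s, where s is the biased sample variance used in the derivation.
    ns = sum((x - xbar) ** 2 for x in data)
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * xbar) / lam_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * (ns + lam0 * n * (xbar - mu0) ** 2 / lam_n)
    return (mu_n, lam_n, alpha_n, beta_n)
```

Because the prior is conjugate, feeding the data in one batch or in several successive batches (using each posterior as the next prior) should give the same parameters up to rounding.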

Interpretation of parameters

The interpretation of parameters in terms of pseudo-observations is as follows:

The new mean takes a weighted average of the old pseudo-mean and the observed mean, weighted by the number of associated (pseudo-)observations.
The precision was estimated from $2\alpha$ pseudo-observations (i.e. possibly a different number of pseudo-observations, to allow the variance of the mean and precision to be controlled separately) with sample mean $\mu$ and sample variance ${\frac {\beta }{\alpha }}$ (i.e. with sum of squared deviations $2\beta$).
The posterior updates the number of pseudo-observations ($\lambda _{0}$) simply by adding the corresponding number of new observations ($n$).
The new sum of squared deviations is computed by adding the previous respective sums of squared deviations. However, a third "interaction term" is needed because the two sets of squared deviations were computed with respect to different means, and hence the sum of the two underestimates the actual total squared deviation.

As a consequence, if one has a prior mean of $\mu _{0}$ from $n_{\mu }$ samples and a prior precision of $\tau _{0}$ from $n_{\tau }$ samples, the prior distribution over $\mu$ and $\tau$ is

\mathbf {P} (\tau ,\mu )=\operatorname {NormalGamma} \left(\mu _{0},n_{\mu },{\frac {n_{\tau }}{2}},{\frac {n_{\tau }}{2\tau _{0}}}\right)

and after observing $n$ samples with mean ${\bar {x}}$ and variance $s$, the posterior probability is

\mathbf {P} (\tau ,\mu \mid \mathbf {X} )={\text{NormalGamma}}\left({\frac {n_{\mu }\mu _{0}+n{\bar {x}}}{n_{\mu }+n}},n_{\mu }+n,{\frac {1}{2}}(n_{\tau }+n),{\frac {1}{2}}\left({\frac {n_{\tau }}{\tau _{0}}}+ns+{\frac {n_{\mu }n({\bar {x}}-\mu _{0})^{2}}{n_{\mu }+n}}\right)\right)

Note that in some programming languages, such as MATLAB, the gamma distribution is parameterized by the scale $1/\beta$ rather than the rate $\beta$, so in that convention the fourth argument of the normal-gamma distribution is $2\tau _{0}/n_{\tau }$.
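The pseudo-observation reading of the prior can be sketched as a small helper that turns a prior mean, a prior precision and their pseudo-counts into normal-gamma parameters. The name `prior_from_pseudo_observations` and the `scale` flag are inventions of this example; the flag returns the fourth argument as a scale for software that uses that convention.

```python
def prior_from_pseudo_observations(mu0, n_mu, tau0, n_tau, scale=False):
    """Normal-gamma prior parameters from pseudo-observation counts.

    mu0, n_mu : prior mean and the number of pseudo-observations behind it.
    tau0, n_tau : prior precision and the number of pseudo-observations behind it.
    scale : if True, return the fourth argument as the scale 1/beta.
    """
    if n_mu <= 0 or n_tau <= 0 or tau0 <= 0:
        raise ValueError("n_mu, n_tau and tau0 must be positive")
    alpha = n_tau / 2.0
    beta = n_tau / (2.0 * tau0)  # rate, so that E[tau] = alpha / beta = tau0
    fourth = 1.0 / beta if scale else beta
    return (mu0, n_mu, alpha, fourth)
```

With the rate convention, the prior mean of the precision is $\alpha /\beta =\tau _{0}$, which is a quick sanity check on the construction.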

Generating normal-gamma random variates

Generation of random variates is straightforward:

Sample $\tau$ from a gamma distribution with shape $\alpha$ and rate $\beta$.
Sample $x$ from a normal distribution with mean $\mu$ and variance $1/(\lambda \tau )$.
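The two steps above can be sketched with Python's standard `random` module. Two conventions need care: `gammavariate` takes a shape and a scale, so the rate $\beta$ is passed as `1.0 / beta`, and `gauss` takes a standard deviation rather than a variance. The function name `sample_normal_gamma` is chosen for this example.

```python
import random


def sample_normal_gamma(mu, lam, alpha, beta, rng=None):
    """Draw one (x, tau) pair from NormalGamma(mu, lam, alpha, beta)."""
    rng = rng if rng is not None else random.Random()
    # Step 1: tau ~ Gamma(shape=alpha, rate=beta); the stdlib expects a scale.
    tau = rng.gammavariate(alpha, 1.0 / beta)
    # Step 2: x | tau ~ Normal(mu, variance 1 / (lam * tau)).
    x = rng.gauss(mu, (lam * tau) ** -0.5)
    return x, tau
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible. Sample averages of $\tau$ and $x$ over many draws should approach $\alpha /\beta$ and $\mu$ respectively.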

Related distributions

The normal-inverse-gamma distribution is essentially the same distribution parameterized by variance rather than precision
The normal-exponential-gamma distribution

Notes

[1] Bernardo & Smith (1993, p. 434)
[2] Bernardo & Smith (1993, pages 136, 268, 434)
[3] Wasserman, Larry (2004), "Parametric Inference", Springer Texts in Statistics, New York, NY: Springer New York, pp. 119–148, ISBN 978-1-4419-2322-6, retrieved 2023-12-08
[4] "Bayes' Theorem: Introduction". Archived from the original on 2014-08-07. Retrieved 2014-08-05.

References

Bernardo, J.M.; Smith, A.F.M. (1993) Bayesian Theory, Wiley. ISBN 0-471-49464-X
Dearden et al. "Bayesian Q-learning", Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), July 26–30, 1998, Madison, Wisconsin, USA.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

Parameters	$\mu$ location (real); $\lambda >0$ (real); $\alpha >0$ (real); $\beta >0$ (real)
Support	$x\in (-\infty ,\infty ),\;\tau \in (0,\infty )$
PDF	$f(x,\tau \mid \mu ,\lambda ,\alpha ,\beta )={\frac {\beta ^{\alpha }{\sqrt {\lambda }}}{\Gamma (\alpha ){\sqrt {2\pi }}}}\,\tau ^{\alpha -{\frac {1}{2}}}\,e^{-\beta \tau }\,e^{-{\frac {\lambda \tau (x-\mu )^{2}}{2}}}$
Mean	$\operatorname {E} (X)=\mu ,\quad \operatorname {E} (T)=\alpha \beta ^{-1}$ ^[1]
Mode	$\left(\mu ,{\frac {\alpha -{\frac {1}{2}}}{\beta }}\right)$
Variance	$\operatorname {var} (X)={\frac {\beta }{\lambda (\alpha -1)}},\quad \operatorname {var} (T)=\alpha \beta ^{-2}$ ^[1]