Bayesian model reduction

Bayesian model reduction is a method for computing the evidence and posterior over the parameters of Bayesian models that differ in their priors. [1] [2] A full model is fitted to data using standard approaches. Hypotheses are then tested by defining one or more 'reduced' models with alternative (and usually more restrictive) priors, which usually – in the limit – switch off certain parameters. The evidence and parameters of the reduced models can then be computed from the evidence and estimated (posterior) parameters of the full model using Bayesian model reduction. If the priors and posteriors are normally distributed, then there is an analytic solution which can be computed rapidly. This has multiple scientific and engineering applications, including rapidly scoring the evidence for large numbers of models and facilitating the estimation of hierarchical models (parametric empirical Bayes).

Theory

Consider some model with parameters $\theta$ and a prior probability density on those parameters, $p(\theta)$. The posterior belief about $\theta$ after seeing the data $y$ is given by Bayes' rule:

$$\begin{aligned}
p(\theta \mid y) &= \frac{p(y \mid \theta)\,p(\theta)}{p(y)} \\
p(y) &= \int p(y \mid \theta)\,p(\theta)\,d\theta
\end{aligned} \tag{1}$$

The second line of Equation 1 is the model evidence, which is the probability of observing the data given the model. In practice, the posterior cannot usually be computed analytically, due to the difficulty of the integral over the parameters. Therefore, the posterior is estimated using approaches such as MCMC sampling or variational Bayes. A reduced model can then be defined with an alternative set of priors $\tilde{p}(\theta)$, sharing the same likelihood $p(y \mid \theta)$:

$$\begin{aligned}
\tilde{p}(\theta \mid y) &= \frac{p(y \mid \theta)\,\tilde{p}(\theta)}{\tilde{p}(y)} \\
\tilde{p}(y) &= \int p(y \mid \theta)\,\tilde{p}(\theta)\,d\theta
\end{aligned} \tag{2}$$

The objective of Bayesian model reduction is to compute the posterior and evidence of the reduced model from the posterior and evidence of the full model. Combining Equation 1 and Equation 2 and re-arranging, the reduced posterior can be expressed as the product of the full posterior, the ratio of priors and the ratio of evidences:

$$\tilde{p}(\theta \mid y) = p(\theta \mid y)\,\frac{\tilde{p}(\theta)}{p(\theta)}\,\frac{p(y)}{\tilde{p}(y)} \tag{3}$$

The evidence for the reduced model is obtained by integrating each side of this equation over the parameters. Because the left-hand side is a probability density, it integrates to one:

$$1 = \frac{p(y)}{\tilde{p}(y)} \int p(\theta \mid y)\,\frac{\tilde{p}(\theta)}{p(\theta)}\,d\theta \tag{4}$$

And by re-arrangement:

$$\tilde{p}(y) = p(y) \int p(\theta \mid y)\,\frac{\tilde{p}(\theta)}{p(\theta)}\,d\theta \tag{5}$$
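Equation 5 also suggests a simple (if potentially noisy) way to approximate the reduced evidence when only posterior samples of the full model are available, for example from MCMC: average the prior ratio over those samples. The sketch below is illustrative and not part of the original article; the distributions, parameter values and function names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def log_reduced_evidence_ratio(samples, log_prior_full, log_prior_reduced):
    """Monte Carlo estimate of ln[ p~(y) / p(y) ] via Equation 5: the log of
    the posterior average of the prior ratio p~(theta) / p(theta)."""
    log_ratio = log_prior_reduced(samples) - log_prior_full(samples)
    m = np.max(log_ratio)                      # log-sum-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_ratio - m)))

# Illustrative use with hypothetical posterior samples of a single parameter.
# Note: the estimator can have high variance when the reduced prior is much
# narrower than the posterior, which is exactly the 'switched off' setting.
rng = np.random.default_rng(0)
posterior_samples = rng.normal(0.6, 0.2, size=100_000)   # stand-in for MCMC output
full_prior    = lambda t: norm.logpdf(t, 0.0, 0.5)       # N(0, 0.5^2)
reduced_prior = lambda t: norm.logpdf(t, 0.0, 0.001)     # N(0, 0.001^2)
print(log_reduced_evidence_ratio(posterior_samples, full_prior, reduced_prior))
```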

Gaussian priors and posteriors

Under Gaussian prior and posterior densities, as are used in the context of variational Bayes, Bayesian model reduction has a simple analytical solution. [1] First define normal densities for the priors and posteriors:

$$\begin{aligned}
p(\theta) &= \mathcal{N}(\mu_0, \Sigma_0) & p(\theta \mid y) &= \mathcal{N}(\mu, \Sigma) \\
\tilde{p}(\theta) &= \mathcal{N}(\tilde{\mu}_0, \tilde{\Sigma}_0) & \tilde{p}(\theta \mid y) &= \mathcal{N}(\tilde{\mu}, \tilde{\Sigma})
\end{aligned} \tag{6}$$

where the tilde symbol (~) indicates quantities relating to the reduced model and subscript zero – such as $\mu_0$ – indicates parameters of the priors. For convenience we also define precision matrices, which are the inverses of the covariance matrices:

$$\Pi = \Sigma^{-1}, \qquad \Pi_0 = \Sigma_0^{-1}, \qquad \tilde{\Pi} = \tilde{\Sigma}^{-1}, \qquad \tilde{\Pi}_0 = \tilde{\Sigma}_0^{-1} \tag{7}$$

The free energy of the full model, $F \approx \ln p(y)$, is an approximation (lower bound) on the log model evidence that is optimised explicitly in variational Bayes (or can be recovered from sampling approximations). The reduced model's free energy $\tilde{F}$ and posterior parameters $(\tilde{\mu}, \tilde{\Sigma})$ are then given by the expressions:

$$\begin{aligned}
\tilde{F} &= F + \frac{1}{2}\ln\frac{|\Pi|\,|\tilde{\Pi}_0|}{|\tilde{\Pi}|\,|\Pi_0|} + \frac{1}{2}\left(\tilde{\mu}^{T}\tilde{\Pi}\tilde{\mu} + \mu_0^{T}\Pi_0\mu_0 - \mu^{T}\Pi\mu - \tilde{\mu}_0^{T}\tilde{\Pi}_0\tilde{\mu}_0\right) \\
\tilde{\Pi} &= \Pi + \tilde{\Pi}_0 - \Pi_0 \\
\tilde{\Sigma} &= \tilde{\Pi}^{-1} \\
\tilde{\mu} &= \tilde{\Sigma}\left(\Pi\mu + \tilde{\Pi}_0\tilde{\mu}_0 - \Pi_0\mu_0\right)
\end{aligned} \tag{8}$$
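These expressions translate directly into a few lines of code. The following is a minimal sketch in Python/NumPy (not the SPM MATLAB implementation; the function name and interface are illustrative):

```python
import numpy as np

def logdet(A):
    """Log-determinant of a positive-definite matrix."""
    _, val = np.linalg.slogdet(A)
    return val

def bayesian_model_reduction(mu, Sigma, mu0, Sigma0, mu0_r, Sigma0_r):
    """Gaussian Bayesian model reduction (Equation 8).

    Takes the full model's posterior (mu, Sigma) and prior (mu0, Sigma0),
    and the reduced model's prior (mu0_r, Sigma0_r), as NumPy arrays.
    Returns the reduced posterior mean and covariance, and the change in
    free energy dF = F_reduced - F_full (the log evidence in favour of
    the reduced model relative to the full model)."""
    P, P0, P0_r = (np.linalg.inv(S) for S in (Sigma, Sigma0, Sigma0_r))

    P_r = P + P0_r - P0                                  # reduced posterior precision
    Sigma_r = np.linalg.inv(P_r)                         # reduced posterior covariance
    mu_r = Sigma_r @ (P @ mu + P0_r @ mu0_r - P0 @ mu0)  # reduced posterior mean

    dF = 0.5 * (logdet(P) + logdet(P0_r) - logdet(P_r) - logdet(P0)) \
       + 0.5 * (mu_r @ P_r @ mu_r + mu0 @ P0 @ mu0
                - mu @ P @ mu - mu0_r @ P0_r @ mu0_r)
    return mu_r, Sigma_r, dF
```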

Example

[Figure: Example priors. In a 'full' model (left), a parameter has a Gaussian prior with mean 0 and standard deviation 0.5. In a 'reduced' model (right), the same parameter has prior mean zero and standard deviation 1/1000. Bayesian model reduction enables the evidence and parameter(s) of the reduced model to be derived from the evidence and parameter(s) of the full model.]

Consider a model with a parameter $\theta$ and Gaussian prior $p(\theta) = \mathcal{N}(0, 0.5^2)$, i.e. the normal distribution with mean zero and standard deviation 0.5 (illustrated in the Figure, left). This prior says that without any data, the parameter is expected to have value zero, but we are willing to entertain positive or negative values (with roughly 98% prior probability of lying in the interval [−1.16, 1.16]). The model with this prior is fitted to the data, providing an estimate of the parameter posterior $p(\theta \mid y)$ and the model evidence $p(y)$.

To assess whether the parameter contributed to the model evidence, i.e. whether we learnt anything about this parameter, an alternative 'reduced' model is specified in which the parameter has a prior with a much smaller variance, e.g. $\tilde{p}(\theta) = \mathcal{N}(0, 0.001^2)$. This is illustrated in the Figure (right). This prior effectively 'switches off' the parameter, saying that we are almost certain it has value zero. The parameter and evidence for this reduced model are rapidly computed from the full model using Bayesian model reduction.

The hypothesis that the parameter contributed to the model is then tested by comparing the full and reduced models via the Bayes factor, which is the ratio of model evidences (or, equivalently under the free-energy approximation, the exponential of the difference in free energies):

$$\mathrm{BF} = \frac{p(y \mid m_{\text{full}})}{p(y \mid m_{\text{reduced}})} \approx \exp\!\left(F - \tilde{F}\right)$$
The larger this ratio, the greater the evidence for the full model, which included the parameter as a free parameter. Conversely, the stronger the evidence for the reduced model, the more confident we can be that the parameter did not contribute. Note this method is not specific to comparing 'switched on' or 'switched off' parameters, and any intermediate setting of the priors could also be evaluated.
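Purely as an illustration (reusing the sketch function given under Equation 8, with hypothetical full-model estimates that are not from the article), this comparison could be carried out as follows:

```python
import numpy as np

# Hypothetical full-model results: posterior mean 0.6, posterior s.d. 0.2.
mu,    Sigma    = np.array([0.6]), np.array([[0.2**2]])
mu0,   Sigma0   = np.array([0.0]), np.array([[0.5**2]])    # full prior: s.d. 0.5
mu0_r, Sigma0_r = np.array([0.0]), np.array([[0.001**2]])  # reduced prior: s.d. 1/1000

mu_r, Sigma_r, dF = bayesian_model_reduction(mu, Sigma, mu0, Sigma0, mu0_r, Sigma0_r)
print(f"log Bayes factor in favour of the full model: {-dF:.2f}")
```

Here a strongly positive log Bayes factor would indicate that the data support retaining the parameter, whereas a negative value would favour switching it off.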

Applications

Neuroimaging

Bayesian model reduction was initially developed for use in neuroimaging analysis, [1] [3] in the context of modelling brain connectivity, as part of the dynamic causal modelling framework (where it was originally referred to as post-hoc Bayesian model selection). [1] Dynamic causal models (DCMs) are differential equation models of brain dynamics. [4] The experimenter specifies multiple competing models which differ in their priors – e.g. in the choice of parameters which are fixed at their prior expectation of zero. Having fitted a single 'full' model with all parameters of interest informed by the data, Bayesian model reduction enables the evidence and parameters for competing models to be rapidly computed, in order to test hypotheses. These models can be specified manually by the experimenter, or searched over automatically, in order to 'prune' any redundant parameters which do not contribute to the evidence.
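To make the idea of pruning concrete, the following hypothetical sketch (again reusing the function given under Equation 8, and assuming the full model's prior covariance remains invertible after reduction) scores each parameter in isolation; the automatic search implemented in SPM instead searches over combinations of parameters.

```python
import numpy as np

def score_single_parameter_reductions(mu, Sigma, mu0, Sigma0, shrink_var=1e-6):
    """For each parameter, the change in free energy (reduced minus full) when
    that parameter alone is 'switched off' by a near-zero prior variance.
    Positive scores favour the reduced model, i.e. pruning the parameter."""
    scores = np.empty(len(mu))
    for i in range(len(mu)):
        mu0_r, Sigma0_r = mu0.copy(), Sigma0.copy()
        mu0_r[i] = 0.0                       # reduced prior mean of zero ...
        Sigma0_r[i, :] = 0.0
        Sigma0_r[:, i] = 0.0
        Sigma0_r[i, i] = shrink_var          # ... with (almost) no prior variance
        _, _, dF = bayesian_model_reduction(mu, Sigma, mu0, Sigma0, mu0_r, Sigma0_r)
        scores[i] = dF
    return scores
```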

Bayesian model reduction was subsequently generalised and applied to other forms of Bayesian models, for example parametric empirical Bayes (PEB) models of group effects. [2] Here, it is used to compute the evidence and parameters for any given level of a hierarchical model under constraints (empirical priors) imposed by the level above.
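Schematically, and following the general form of the PEB models described in [2] (with many details omitted), the second level of such a hierarchy places empirical priors on the first-level parameters:

$$\theta^{(1)} = X\,\theta^{(2)} + \epsilon^{(1)}, \qquad \theta^{(2)} = \eta + \epsilon^{(2)}$$

where $\theta^{(1)}$ collects the first-level (e.g. per-subject) parameters, $X$ is a between-subject design matrix, $\theta^{(2)}$ are group-level parameters and the $\epsilon$ terms are Gaussian random effects. Bayesian model reduction then allows reduced forms of these empirical priors (e.g. particular group effects switched off) to be scored without refitting the first level.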

Neurobiology

Bayesian model reduction has been used to explain functions of the brain. By analogy to its use in eliminating redundant parameters from models of experimental data, it has been proposed that the brain eliminates redundant parameters of internal models of the world while offline (e.g. during sleep). [5] [6]

Software implementations

Bayesian model reduction is implemented in the Statistical Parametric Mapping toolbox, in the Matlab function spm_log_evidence_reduce.m .

References

  1. Friston, Karl; Penny, Will (June 2011). "Post hoc Bayesian model selection". NeuroImage. 56 (4): 2089–2099. doi:10.1016/j.neuroimage.2011.03.062. ISSN 1053-8119. PMC 3112494. PMID 21459150.
  2. Friston, Karl J.; Litvak, Vladimir; Oswal, Ashwini; Razi, Adeel; Stephan, Klaas E.; van Wijk, Bernadette C.M.; Ziegler, Gabriel; Zeidman, Peter (March 2016). "Bayesian model reduction and empirical Bayes for group (DCM) studies". NeuroImage. 128: 413–431. doi:10.1016/j.neuroimage.2015.11.015. ISSN 1053-8119. PMC 4767224. PMID 26569570.
  3. Rosa, M.J.; Friston, K.; Penny, W. (June 2012). "Post-hoc selection of dynamic causal models". Journal of Neuroscience Methods. 208 (1): 66–78. doi:10.1016/j.jneumeth.2012.04.013. ISSN 0165-0270. PMC 3401996. PMID 22561579.
  4. Friston, K.J.; Harrison, L.; Penny, W. (August 2003). "Dynamic causal modelling". NeuroImage. 19 (4): 1273–1302. doi:10.1016/s1053-8119(03)00202-7. ISSN 1053-8119. PMID 12948688.
  5. Friston, Karl J.; Lin, Marco; Frith, Christopher D.; Pezzulo, Giovanni; Hobson, J. Allan; Ondobaka, Sasha (October 2017). "Active Inference, Curiosity and Insight". Neural Computation. 29 (10): 2633–2683. doi:10.1162/neco_a_00999. ISSN 0899-7667. PMID 28777724.
  6. Tononi, Giulio; Cirelli, Chiara (February 2006). "Sleep function and synaptic homeostasis". Sleep Medicine Reviews. 10 (1): 49–62. doi:10.1016/j.smrv.2005.05.002. ISSN 1087-0792. PMID 16376591.