Bayesian hierarchical modeling

Bayesian hierarchical modeling is a statistical model written in multiple levels (hierarchical form) that estimates the parameters of the posterior distribution using the Bayesian method. [1] The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as the updated probability estimate, as additional evidence about the prior distribution is acquired.

Frequentist statistics may yield conclusions seemingly incompatible with those offered by Bayesian statistics due to the Bayesian treatment of the parameters as random variables and its use of subjective information in establishing assumptions about these parameters. [2] As the two approaches answer different questions, the formal results are not technically contradictory, but the approaches disagree over which answer is relevant to particular applications. Bayesians argue that relevant information regarding decision-making and updating beliefs cannot be ignored, and that hierarchical modeling has the potential to overrule classical methods in applications where respondents provide multiple observations. Moreover, hierarchical models have proven to be robust, with the posterior distribution less sensitive to the more flexible hierarchical priors.

Hierarchical modeling is used when information is available at several different levels of observational units. For example, in epidemiological modeling of infection trajectories for multiple countries, the observational units are countries, and each country has its own temporal profile of daily infected cases. [3] In decline curve analysis describing oil or gas production decline for multiple wells, the observational units are oil or gas wells in a reservoir region, and each well has its own temporal profile of oil or gas production rates (usually barrels per month). [4] The data for hierarchical modeling retain this nested structure. The hierarchical form of analysis and organization helps in the understanding of multiparameter problems and also plays an important role in developing computational strategies. [5]

Philosophy

Statistical methods and models commonly involve multiple parameters that can be regarded as related or connected in such a way that the problem implies a dependence of the joint probability model for these parameters. [6] Individual degrees of belief, expressed in the form of probabilities, come with uncertainty. [7] Added to this is the change of the degrees of belief over time. As was stated by Professor José M. Bernardo and Professor Adrian F. Smith, “The actuality of the learning process consists in the evolution of individual and subjective beliefs about the reality.” These subjective probabilities are more directly involved in the mind than are physical probabilities. [7] Hence, it is with this need to update beliefs that Bayesians have formulated an alternative statistical model which takes into account the prior occurrence of a particular event. [8]

Bayes' theorem

The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options. [9]

Suppose that, in a study of the effectiveness of cardiac treatments, the patients in hospital $j$ have survival probability $\theta_j$. The survival probability will be updated with the occurrence of $y$, the event in which a controversial serum is created which, as believed by some, increases survival in cardiac patients.

In order to make updated probability statements about $\theta_j$, given the occurrence of event $y$, we must begin with a model providing a joint probability distribution for $\theta_j$ and $y$. This can be written as a product of the two distributions that are often referred to as the prior distribution and the sampling distribution, respectively:

$$P(\theta_j, y) = P(y \mid \theta_j)\, P(\theta_j)$$

Using the basic property of conditional probability, the posterior distribution will yield:

$$P(\theta_j \mid y) = \frac{P(\theta_j, y)}{P(y)} = \frac{P(y \mid \theta_j)\, P(\theta_j)}{P(y)}$$

This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference, which aims to incorporate the updated belief, $P(\theta_j \mid y)$, in appropriate and solvable ways. [9]
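As a hedged illustration of this update, the sketch below computes a posterior for a single hospital's survival probability on a grid. The binomial likelihood, the Beta(2, 2) prior, and the observed counts are illustrative assumptions, not part of the article's example.

```python
import numpy as np
from scipy import stats

# Hypothetical setup (not from the article): survival probability theta_j for
# hospital j, a Beta(2, 2) prior P(theta_j), and evidence y of 14 survivors
# among 20 treated patients, modeled with a binomial likelihood.
theta = np.linspace(0.001, 0.999, 999)        # grid of candidate theta_j values
prior = stats.beta.pdf(theta, 2, 2)           # P(theta_j)
likelihood = stats.binom.pmf(14, 20, theta)   # P(y | theta_j)

# Bayes' theorem: P(theta_j | y) is proportional to P(y | theta_j) P(theta_j);
# normalizing over the grid approximates the division by P(y).
posterior = likelihood * prior
posterior /= posterior.sum()

print("posterior mean of theta_j:", (theta * posterior).sum())
```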

Exchangeability

The usual starting point of a statistical analysis is the assumption that the $n$ values $\theta_1, \theta_2, \ldots, \theta_n$ are exchangeable. If no information – other than the data $y$ – is available to distinguish any of the $\theta_j$'s from any of the others, and no ordering or grouping of the parameters can be made, one must assume symmetry among the parameters in their prior distribution. [10] This symmetry is represented probabilistically by exchangeability. Generally, it is useful and appropriate to model data from an exchangeable distribution as independently and identically distributed, given some unknown parameter vector $\theta$, with distribution $p(\theta)$.

Finite exchangeability

For a fixed number $n$, the set $x_1, x_2, \ldots, x_n$ is exchangeable if the joint probability $p(x_1, x_2, \ldots, x_n)$ is invariant under permutations of the indices. That is, for every permutation $(\tau_1, \tau_2, \ldots, \tau_n)$ of $(1, 2, \ldots, n)$, [11]

$$p(x_1, x_2, \ldots, x_n) = p(x_{\tau_1}, x_{\tau_2}, \ldots, x_{\tau_n}).$$

The following is an exchangeable, but not independent and identically distributed (iid), example: Consider an urn containing a red ball and a blue ball, each with probability $1/2$ of being drawn. Balls are drawn without replacement; that is, after one ball is drawn from the $n$ balls, there will be $n - 1$ remaining balls left for the next draw.

Since the probability of selecting a red ball in the first draw and a blue ball in the second draw is equal to the probability of selecting a blue ball on the first draw and a red ball on the second draw, both of which are equal to 1/2 (i.e. $P(x_1 = R, x_2 = B) = P(x_1 = B, x_2 = R) = 1/2$), then $x_1$ and $x_2$ are exchangeable.

But the probability of selecting a red ball on the second draw, given that the red ball has already been selected in the first draw, is 0, which is not equal to the probability that the red ball is selected in the second draw, namely 1/2 (i.e. $P(x_2 = R \mid x_1 = R) = 0 \neq P(x_2 = R) = 1/2$). Thus, $x_1$ and $x_2$ are not independent.

If $x_1, \ldots, x_n$ are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true. [12]
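A short enumeration makes the urn example concrete. The code below is a minimal sketch that lists both equally likely draw orders and checks that the joint probabilities are permutation-invariant while the two draws are not independent.

```python
from fractions import Fraction
from itertools import permutations

# The two equally likely orderings of drawing a red and a blue ball without replacement.
orderings = list(permutations(["red", "blue"]))   # [('red', 'blue'), ('blue', 'red')]
p_each = Fraction(1, len(orderings))

def prob(event):
    """Probability that a random ordering satisfies the predicate `event`."""
    return sum(p_each for o in orderings if event(o))

# Exchangeable: swapping the draw indices leaves the joint probability at 1/2.
print(prob(lambda o: o[0] == "red" and o[1] == "blue"))   # 1/2
print(prob(lambda o: o[0] == "blue" and o[1] == "red"))   # 1/2

# Not independent: P(x1 = red, x2 = red) = 0, yet the marginal P(x2 = red) = 1/2.
print(prob(lambda o: o[0] == "red" and o[1] == "red"))    # 0
print(prob(lambda o: o[1] == "red"))                      # 1/2
```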

Infinite exchangeability

Infinite exchangeability is the property that every finite subset of an infinite sequence $x_1, x_2, \ldots$ is exchangeable. That is, for any $n$, the sequence $x_1, x_2, \ldots, x_n$ is exchangeable. [12]

Hierarchical models

Components

Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, [1] namely:

  1. Hyperparameters: parameters of the prior distribution
  2. Hyperpriors: distributions of hyperparameters

Suppose a random variable $Y$ follows a normal distribution with parameter $\theta$ as the mean and 1 as the variance, that is $Y \mid \theta \sim N(\theta, 1)$. The tilde relation $\sim$ can be read as "has the distribution of" or "is distributed as". Suppose also that the parameter $\theta$ has a distribution given by a normal distribution with mean $\mu$ and variance 1, i.e. $\theta \mid \mu \sim N(\mu, 1)$. Furthermore, $\mu$ follows another distribution given, for example, by the standard normal distribution, $N(0, 1)$. The parameter $\mu$ is called the hyperparameter, while its distribution given by $N(0, 1)$ is an example of a hyperprior distribution. The notation of the distribution of $Y$ changes as another parameter is added, i.e. $Y \mid \theta, \mu \sim N(\theta, 1)$. If there is another stage, say that $\mu$ follows another normal distribution with mean $\beta$ and variance $\epsilon$, meaning $\mu \sim N(\beta, \epsilon)$, then $\beta$ and $\epsilon$ can also be called hyperparameters, and their distributions are hyperprior distributions as well. [6]
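The hierarchy above can be simulated directly. The following minimal sketch draws from the three stages $\mu \sim N(0, 1)$, $\theta \mid \mu \sim N(\mu, 1)$, and $Y \mid \theta \sim N(\theta, 1)$, and checks the marginal moments of $Y$ (mean 0 and variance 3, since the unit variances of the stages add).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Stage-by-stage simulation of the hierarchy described in the text.
mu = rng.normal(0.0, 1.0, size=n)    # hyperprior:  mu ~ N(0, 1)
theta = rng.normal(mu, 1.0)          # prior:       theta | mu ~ N(mu, 1)
y = rng.normal(theta, 1.0)           # likelihood:  Y | theta ~ N(theta, 1)

# Marginally Y ~ N(0, 3): the three unit variances accumulate across stages.
print(y.mean(), y.var())             # approximately 0.0 and 3.0
```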

Framework

Let $y_j$ be an observation and $\theta_j$ a parameter governing the data generating process for $y_j$. Assume further that the parameters $\theta_1, \theta_2, \ldots, \theta_j$ are generated exchangeably from a common population, with distribution governed by a hyperparameter $\phi$.
The Bayesian hierarchical model contains the following stages:

Stage I: $y_j \mid \theta_j, \phi \sim P(y_j \mid \theta_j, \phi)$
Stage II: $\theta_j \mid \phi \sim P(\theta_j \mid \phi)$
Stage III: $\phi \sim P(\phi)$

The likelihood, as seen in stage I, is $P(y_j \mid \theta_j, \phi)$, with $P(\theta_j, \phi)$ as its prior distribution. Note that the likelihood depends on $\phi$ only through $\theta_j$.

The prior distribution from stage I can be broken down into:

$$P(\theta_j, \phi) = P(\theta_j \mid \phi)\, P(\phi) \qquad \text{[from the definition of conditional probability]}$$

with $\phi$ as its hyperparameter with hyperprior distribution $P(\phi)$.

Thus, the posterior distribution is proportional to:

$$P(\phi, \theta_j \mid y) \propto P(y_j \mid \theta_j, \phi)\, P(\theta_j, \phi) \qquad \text{[using Bayes' theorem]}$$
$$\phantom{P(\phi, \theta_j \mid y)} \propto P(y_j \mid \theta_j)\, P(\theta_j \mid \phi)\, P(\phi)$$
[13]

Example

To further illustrate this, consider the example: A teacher wants to estimate how well a student did on the SAT. The teacher uses information on the student's high school grades and current grade point average (GPA) to come up with an estimate. The student's current GPA, denoted by $Y$, has a likelihood given by some probability function with parameter $\theta$, i.e. $Y \mid \theta \sim P(Y \mid \theta)$. This parameter $\theta$ is the SAT score of the student. The SAT score is viewed as a sample coming from a common population distribution indexed by another parameter $\phi$, which is the high school grade of the student (freshman, sophomore, junior or senior). [14] That is, $\theta \mid \phi \sim P(\theta \mid \phi)$. Moreover, the hyperparameter $\phi$ follows its own distribution given by $P(\phi)$, a hyperprior. To solve for the SAT score given information on the GPA,

$$P(\theta, \phi \mid Y) \propto P(Y \mid \theta, \phi)\, P(\theta, \phi) = P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi).$$

All the information in the problem will be used to solve for the posterior distribution. Instead of solving only with the prior distribution and the likelihood function, the use of hyperpriors provides additional information, yielding more accurate beliefs about the behavior of a parameter. [15]
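A numerical sketch of this SAT example is given below. All distributional choices (a uniform hyperprior over grade levels, a normal prior for the SAT score within each level, and a normal GPA likelihood with an assumed linear link) are invented for illustration; the article itself leaves them unspecified.

```python
import numpy as np
from scipy import stats

# Hypothetical choices, not from the article:
levels = ["freshman", "sophomore", "junior", "senior"]
prior_phi = np.full(4, 0.25)                            # hyperprior P(phi)
mean_sat = np.array([1000.0, 1050.0, 1100.0, 1150.0])   # E[theta | phi]
sd_sat = 150.0                                          # sd of P(theta | phi)

theta = np.linspace(600.0, 1600.0, 1001)                # grid of SAT scores
y_obs = 3.4                                             # observed GPA

# Assumed likelihood P(Y | theta): GPA normal around theta / 400 with sd 0.3.
lik = stats.norm.pdf(y_obs, loc=theta / 400.0, scale=0.3)

# Joint posterior on the grid: P(theta, phi | Y) ∝ P(Y | theta) P(theta | phi) P(phi).
joint = np.stack(
    [lik * stats.norm.pdf(theta, m, sd_sat) * p for m, p in zip(mean_sat, prior_phi)],
    axis=1,
)
joint /= joint.sum()

print("posterior mean SAT score:", (theta * joint.sum(axis=1)).sum())
print("posterior over grade level:", dict(zip(levels, joint.sum(axis=0).round(3))))
```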

2-stage hierarchical model

In general, the joint posterior distribution of interest in 2-stage hierarchical models is:

$$P(\theta, \phi \mid Y) = \frac{P(Y \mid \theta, \phi)\, P(\theta, \phi)}{P(Y)} = \frac{P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi)}{P(Y)}$$
[15]

3-stage hierarchical model

For 3-stage hierarchical models, the posterior distribution is given by:

$$P(\theta, \phi, X \mid Y) = \frac{P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi \mid X)\, P(X)}{P(Y)}$$
[15]
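As a hedged sketch of posterior computation in such a hierarchy, the following model (assuming the PyMC v5-style API) encodes a simple three-stage normal hierarchy — likelihood, group-level prior, hyperprior, and a top-level prior on $X$ — and samples the joint posterior with MCMC. The data and all distributional choices are simulated placeholders rather than anything specified in the article.

```python
import numpy as np
import pymc as pm

# Simulated data: 8 groups with 5 observations each (placeholder numbers).
rng = np.random.default_rng(1)
true_theta = rng.normal(2.0, 1.0, size=8)
y = rng.normal(np.repeat(true_theta, 5), 1.0)
group = np.repeat(np.arange(8), 5)

with pm.Model() as three_stage:
    X = pm.Normal("X", mu=0.0, sigma=10.0)                  # top-level prior P(X)
    phi = pm.Normal("phi", mu=X, sigma=5.0)                 # hyperprior P(phi | X)
    theta = pm.Normal("theta", mu=phi, sigma=1.0, shape=8)  # prior P(theta | phi)
    pm.Normal("y", mu=theta[group], sigma=1.0, observed=y)  # likelihood P(Y | theta)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

print(float(idata.posterior["phi"].mean()))
```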

Bayesian nonlinear mixed-effects model

Figure: Bayesian research cycle using a Bayesian nonlinear mixed-effects model: (a) standard research cycle and (b) Bayesian-specific workflow.

The framework of Bayesian hierarchical modeling is frequently used in diverse applications. In particular, Bayesian nonlinear mixed-effects models have recently received significant attention. A basic version of the Bayesian nonlinear mixed-effects model is represented as the following three-stage hierarchy:

Stage 1: Individual-Level Model

$$y_{ij} = f(t_{ij}; \theta_{1i}, \theta_{2i}, \ldots, \theta_{Ki}) + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2), \qquad i = 1, \ldots, N, \; j = 1, \ldots, M_i.$$

Stage 2: Population Model

$$\theta_{li} = \alpha_l + \sum_{b=1}^{P} \beta_{lb}\, x_{ib} + \eta_{li}, \qquad \eta_{li} \sim N(0, \omega_l^2), \qquad i = 1, \ldots, N, \; l = 1, \ldots, K.$$

Stage 3: Prior

$$\sigma^2 \sim \pi(\sigma^2), \quad \alpha_l \sim \pi(\alpha_l), \quad (\beta_{l1}, \ldots, \beta_{lP}) \sim \pi(\beta_{l1}, \ldots, \beta_{lP}), \quad \omega_l^2 \sim \pi(\omega_l^2), \qquad l = 1, \ldots, K.$$

Here, $y_{ij}$ denotes the continuous response of the $i$-th subject at the time point $t_{ij}$, and $x_{ib}$ is the $b$-th covariate of the $i$-th subject. Parameters involved in the model are written in Greek letters. $f(t; \theta_1, \ldots, \theta_K)$ is a known function parameterized by the $K$-dimensional vector $(\theta_1, \ldots, \theta_K)$. Typically, $f$ is a nonlinear function that describes the temporal trajectory of individuals. In the model, $\epsilon_{ij}$ and $\eta_{li}$ describe within-individual variability and between-individual variability, respectively. If Stage 3: Prior is not considered, the model reduces to a frequentist nonlinear mixed-effects model.


A central task in the application of Bayesian nonlinear mixed-effects models is to evaluate the posterior density:

$$\pi\bigl(\{\theta_{li}\}, \{\alpha_l\}, \{\beta_{lb}\}, \sigma^2, \{\omega_l^2\} \mid y_{1:N}\bigr) \propto \pi\bigl(y_{1:N} \mid \{\theta_{li}\}, \sigma^2\bigr)\; \pi\bigl(\{\theta_{li}\} \mid \{\alpha_l\}, \{\beta_{lb}\}, \{\omega_l^2\}\bigr)\; \pi\bigl(\{\alpha_l\}, \{\beta_{lb}\}, \sigma^2, \{\omega_l^2\}\bigr),$$

where the three factors correspond to Stage 1, Stage 2, and Stage 3, respectively.
The figure above outlines the Bayesian research cycle using the Bayesian nonlinear mixed-effects model. [16] A research cycle using the Bayesian nonlinear mixed-effects model comprises two steps: (a) a standard research cycle and (b) a Bayesian-specific workflow. The standard research cycle involves literature review, defining a problem, and specifying the research question and hypothesis. The Bayesian-specific workflow comprises three sub-steps: (b)–(i) formalizing prior distributions based on background knowledge and prior elicitation; (b)–(ii) determining the likelihood function based on a nonlinear function $f$; and (b)–(iii) making a posterior inference. The resulting posterior inference can be used to start a new research cycle.
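The three-stage structure above can be written down directly in a probabilistic programming language. The sketch below (again assuming the PyMC v5-style API) fits a Bayesian nonlinear mixed-effects model with an exponential-decline trajectory $f(t; \theta_{1i}, \theta_{2i}) = \exp(\theta_{1i} - \theta_{2i} t)$ to simulated well-production data; the nonlinear function, priors, and data are illustrative assumptions rather than the specification used in the cited studies.

```python
import numpy as np
import pymc as pm

# Simulated production data for N = 12 wells over M = 24 months (placeholder values).
rng = np.random.default_rng(2)
N, M = 12, 24
t = np.tile(np.arange(1.0, M + 1), (N, 1))
theta1_true = rng.normal(5.0, 0.3, size=N)
theta2_true = np.abs(rng.normal(0.08, 0.02, size=N))
y = np.exp(theta1_true[:, None] - theta2_true[:, None] * t) + rng.normal(0.0, 5.0, size=(N, M))

with pm.Model() as nlme:
    # Stage 3 (prior): population-level parameters and variance components.
    alpha1 = pm.Normal("alpha1", mu=0.0, sigma=10.0)
    alpha2 = pm.Normal("alpha2", mu=0.0, sigma=1.0)
    omega1 = pm.HalfNormal("omega1", sigma=1.0)
    omega2 = pm.HalfNormal("omega2", sigma=0.5)
    sigma = pm.HalfNormal("sigma", sigma=10.0)

    # Stage 2 (population model): well-specific parameters drawn around the
    # population means (no covariates x_ib in this simplified sketch).
    theta1 = pm.Normal("theta1", mu=alpha1, sigma=omega1, shape=N)
    theta2 = pm.Normal("theta2", mu=alpha2, sigma=omega2, shape=N)

    # Stage 1 (individual-level model): nonlinear trajectory f plus within-well noise.
    f = pm.math.exp(theta1[:, None] - theta2[:, None] * t)
    pm.Normal("y", mu=f, sigma=sigma, observed=y)

    idata = pm.sample(1000, tune=1000, chains=2, target_accept=0.9, random_seed=2)

print(float(idata.posterior["alpha1"].mean()), float(idata.posterior["alpha2"].mean()))
```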

References

  1. Allenby, Rossi, McCulloch (January 2005). "Hierarchical Bayes Model: A Practitioner's Guide". Journal of Bayesian Applications in Marketing, pp. 1–4. Retrieved 26 April 2014, p. 3.
  2. Gelman, Andrew; Carlin, John B.; Stern, Hal S. & Rubin, Donald B. (2004). Bayesian Data Analysis (second ed.). Boca Raton, Florida: CRC Press. pp. 4–5. ISBN 1-58488-388-X.
  3. Lee, Se Yoon; Lei, Bowen; Mallick, Bani (2020). "Estimation of COVID-19 spread curves integrating global data and borrowing information". PLOS ONE. 15 (7): e0236860. arXiv:2005.00662. doi:10.1371/journal.pone.0236860. PMC 7390340. PMID 32726361.
  4. Lee, Se Yoon; Mallick, Bani (2021). "Bayesian Hierarchical Modeling: Application Towards Production Results in the Eagle Ford Shale of South Texas". Sankhya B. 84: 1–43. doi:10.1007/s13571-020-00245-8.
  5. Gelman et al. 2004, p. 6.
  6. Gelman et al. 2004, p. 117.
  7. Good, I.J. (1980). "Some history of the hierarchical Bayesian methodology". Trabajos de Estadistica y de Investigacion Operativa. 31: 489–519. doi:10.1007/BF02888365. S2CID 121270218.
  8. Bernardo, Smith (1994). Bayesian Theory. Chichester, England: John Wiley & Sons, ISBN 0-471-92416-4, p. 23.
  9. Gelman et al. 2004, pp. 6–8.
  10. Bernardo, Degroot, Lindley (September 1983). "Proceedings of the Second Valencia International Meeting". Bayesian Statistics 2. Amsterdam: Elsevier Science Publishers B.V., ISBN 0-444-87746-0, pp. 167–168.
  11. Gelman et al. 2004, pp. 121–125.
  12. Diaconis, Freedman (1980). "Finite exchangeable sequences". Annals of Probability, pp. 745–747.
  13. Bernardo, Degroot, Lindley (September 1983). "Proceedings of the Second Valencia International Meeting". Bayesian Statistics 2. Amsterdam: Elsevier Science Publishers B.V., ISBN 0-444-87746-0, pp. 371–372.
  14. Gelman et al. 2004, pp. 120–121.
  15. Box, G. E. P.; Tiao, G. C. (1965). "Multiparameter problems from a Bayesian point of view". Volume 36, Number 5. New York City: John Wiley & Sons, ISBN 0-471-57428-7.
  16. Lee, Se Yoon (2022). "Bayesian Nonlinear Models for Repeated Measurement Data: An Overview, Implementation, and Applications". Mathematics. 10 (6): 898. arXiv:2201.12430. doi:10.3390/math10060898.