Hierarchical Dirichlet process

In statistics and machine learning, the hierarchical Dirichlet process (HDP) is a nonparametric Bayesian approach to clustering grouped data. [1] [2] It uses a Dirichlet process for each group of data, with the Dirichlet processes for all groups sharing a base distribution which is itself drawn from a Dirichlet process. This method allows groups to share statistical strength via sharing of clusters across groups. The base distribution being drawn from a Dirichlet process is important, because draws from a Dirichlet process are atomic probability measures, and the atoms will appear in all group-level Dirichlet processes. Since each atom corresponds to a cluster, clusters are shared across all groups. It was developed by Yee Whye Teh, Michael I. Jordan, Matthew J. Beal and David Blei and published in the Journal of the American Statistical Association in 2006, [1] as a formalization and generalization of the infinite hidden Markov model published in 2002. [3]

Model

This model description is sourced from [1]. The HDP is a model for grouped data, meaning that the data items come in multiple distinct groups. For example, in a topic model words are organized into documents, with each document formed by a bag (group) of words (data items). Indexing groups by $j = 1, \dots, J$, suppose each group consists of data items $x_{j1}, \dots, x_{jn_j}$.

The HDP is parameterized by a base distribution $H$ that governs the a priori distribution over data items, and a number of concentration parameters that govern the a priori number of clusters and amount of sharing across groups. The $j$th group is associated with a random probability measure $G_j$ which has distribution given by a Dirichlet process:

    G_j \mid \alpha_j, G_0 \sim \operatorname{DP}(\alpha_j, G_0)

where $\alpha_j$ is the concentration parameter associated with the group, and $G_0$ is the base distribution shared across all groups. In turn, the common base distribution is Dirichlet process distributed:

    G_0 \mid \alpha_0, H \sim \operatorname{DP}(\alpha_0, H)

with concentration parameter $\alpha_0$ and base distribution $H$. Finally, to relate the Dirichlet processes back with the observed data, each data item $x_{ji}$ is associated with a latent parameter $\theta_{ji}$:

    \theta_{ji} \mid G_j \sim G_j
    x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})

The first line states that each parameter $\theta_{ji}$ has a prior distribution given by $G_j$, while the second line states that each data item has a distribution $F(\theta_{ji})$ parameterized by its associated parameter. The resulting model above is called an HDP mixture model, with the HDP referring to the hierarchically linked set of Dirichlet processes, and the mixture model referring to the way the Dirichlet processes are related to the data items.
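
The generative process above can be simulated directly by marginalizing out the random measures, using the Pólya-urn (Chinese restaurant) characterization of the Dirichlet process: each successive draw either repeats an earlier draw or is a fresh draw from the base measure. The following is a minimal sketch in Python, assuming for illustration that $H$ is a wide normal distribution and that $F(\theta)$ is a unit-variance normal centered at $\theta$; the class name, group sizes and parameter values are hypothetical, not part of the original formulation.

    import numpy as np

    rng = np.random.default_rng(0)

    class DP:
        """Lazy sample stream from G ~ DP(alpha, base), with G marginalized out
        via the Polya-urn scheme: reuse an old value or draw a new one from base."""
        def __init__(self, alpha, base):
            self.alpha, self.base, self.values = alpha, base, []

        def sample(self):
            n = len(self.values)
            if rng.random() < self.alpha / (self.alpha + n):
                value = self.base()                      # new atom from the base measure
            else:
                value = self.values[rng.integers(n)]     # reuse an existing atom
            self.values.append(value)
            return value

    # Hierarchy: G0 ~ DP(alpha_0, H) with H = N(0, 10^2), and G_j ~ DP(alpha_j, G0).
    G0 = DP(alpha=1.0, base=lambda: rng.normal(0.0, 10.0))
    groups = [DP(alpha=2.0, base=G0.sample) for _ in range(3)]

    # theta_ji ~ G_j and x_ji ~ F(theta_ji) = N(theta_ji, 1) for each data item.
    data = [[rng.normal(group.sample(), 1.0) for _ in range(50)] for group in groups]

Because every new group-level atom is itself a draw from the urn of $G_0$, identical atom values recur across groups; this is the cluster-sharing mechanism made explicit in the next paragraphs.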

To understand how the HDP implements a clustering model, and how clusters become shared across groups, recall that draws from a Dirichlet process are atomic probability measures with probability one. This means that the common base distribution $G_0$ has a form which can be written as:

    G_0 = \sum_{k=1}^{\infty} \pi_{0k} \delta_{\theta^*_k}

where there are an infinite number of atoms, $\theta^*_k$, $k = 1, 2, \dots$, assuming that the overall base distribution $H$ has infinite support. Each atom is associated with a mass $\pi_{0k}$. The masses have to sum to one since $G_0$ is a probability measure. Since $G_0$ is itself the base distribution for the group-specific Dirichlet processes, each $G_j$ will have atoms given by the atoms of $G_0$, and can itself be written in the form:

    G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta^*_k}

Thus the set of atoms is shared across all groups, with each group having its own group-specific atom masses. Relating this representation back to the observed data, we see that each data item is described by a mixture model:

    x_{ji} \mid G_j \sim \sum_{k=1}^{\infty} \pi_{jk} F(\theta^*_k)

where the atoms $\theta^*_k$ play the role of the mixture component parameters, while the masses $\pi_{jk}$ play the role of the mixing proportions. In conclusion, each group of data is modeled using a mixture model, with mixture components shared across all groups but mixing proportions being group-specific. In clustering terms, we can interpret each mixture component as modeling a cluster of data items, with clusters shared across all groups; each group, having its own mixing proportions, is composed of a different combination of clusters.
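
This representation also suggests a simple truncated simulation of the HDP mixture model: draw a finite set of shared atoms and global weights once, then give each group its own reweighting of those atoms. Below is a minimal sketch, assuming a truncation at $K$ components, a normal base distribution and Gaussian mixture components; the group-level Dirichlet reweighting of the global weights is the usual finite approximation to the group-level Dirichlet processes, and all numerical values are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    K, J, N = 20, 3, 200               # truncation level, number of groups, items per group
    alpha0, alpha = 1.0, 2.0           # top-level and group-level concentration parameters

    # Global weights via truncated stick-breaking, shared atoms drawn from H = N(0, 10^2).
    sticks = rng.beta(1.0, alpha0, size=K)
    beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    beta /= beta.sum()                 # renormalize because of the truncation
    atoms = rng.normal(0.0, 10.0, size=K)

    data = []
    for j in range(J):
        pi_j = rng.dirichlet(alpha * beta)        # group-specific masses over the same atoms
        z = rng.choice(K, size=N, p=pi_j)         # cluster assignment of each data item
        data.append(rng.normal(atoms[z], 1.0))    # x_ji drawn from its assigned component

Every group mixes over the same atoms array, but with its own proportions pi_j, mirroring the shared-components, group-specific-weights structure described above.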

Applications

The HDP mixture model is a natural nonparametric generalization of Latent Dirichlet allocation, where the number of topics can be unbounded and learnt from data. [1] Here each group is a document consisting of a bag of words, each cluster is a topic, and each document is a mixture of topics. The HDP is also a core component of the infinite hidden Markov model, [3] which is a nonparametric generalization of the hidden Markov model allowing the number of states to be unbounded and learnt from data. [1] [4]
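
For instance, a nonparametric topic model of this kind can be fitted with the HdpModel class in the gensim Python library, which implements a variational approximation to the HDP; the toy corpus below is purely illustrative.

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    # Each document is one group of data items (a bag of words).
    texts = [
        ["dirichlet", "process", "cluster", "bayesian"],
        ["hidden", "markov", "model", "state", "sequence"],
        ["topic", "model", "document", "word", "bayesian"],
    ]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # The number of topics is not fixed in advance; it is inferred from the data.
    hdp = HdpModel(corpus=corpus, id2word=dictionary)
    for topic in hdp.print_topics(num_topics=5, num_words=4):
        print(topic)

Each document gets its own topic proportions, while the topic-word distributions are shared across the corpus, exactly as in the HDP mixture described above.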

Generalizations

The HDP can be generalized in a number of directions. The Dirichlet processes can be replaced by Pitman-Yor processes and Gamma processes, resulting in the Hierarchical Pitman-Yor process and Hierarchical Gamma process. The hierarchy can also be deeper, with multiple levels of groups arranged in a hierarchy. Such an arrangement has been exploited in the sequence memoizer, a Bayesian nonparametric model for sequences which has a multi-level hierarchy of Pitman-Yor processes. In addition, the Bayesian Multi-Domain Learning (BMDL) model derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping, even if the number of samples for a specific cancer type is small. [5]

See also

Related Research Articles

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data. A statistical model represents, often in considerably idealized form, the data-generating process.

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power. These activities can be viewed as two facets of the same field of application, and they have undergone substantial development over the past few decades.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult. This sequence can be used to approximate the joint distribution; to approximate the marginal distribution of one of the variables, or some subset of the variables; or to compute an integral. Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.
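
As a toy illustration of the alternating conditional draws, the following sketch runs a Gibbs sampler on a bivariate normal distribution with correlation rho, for which both conditional distributions are known in closed form; the specific target and parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    rho, n_samples = 0.8, 5000
    x, y = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        x = rng.normal(rho * y, np.sqrt(1.0 - rho ** 2))   # draw x | y ~ N(rho*y, 1 - rho^2)
        y = rng.normal(rho * x, np.sqrt(1.0 - rho ** 2))   # draw y | x ~ N(rho*x, 1 - rho^2)
        samples[t] = x, y

    print(np.corrcoef(samples.T)[0, 1])   # empirical correlation approaches rho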

Empirical Bayes methods are procedures for statistical inference in which the prior probability distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood, represents a convenient approach for setting hyperparameters, but has been mostly supplanted by fully Bayesian hierarchical analyses since the 2000s with the increasing availability of well-performing computation techniques.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.

In probability theory, the Chinese restaurant process is a discrete-time stochastic process, analogous to seating customers at tables in a Chinese restaurant. Imagine a Chinese restaurant with an infinite number of circular tables, each with infinite capacity. Customer 1 sits at the first table. The next customer either sits at the same table as customer 1, or the next table. This continues, with each customer choosing to either sit at an occupied table with a probability proportional to the number of customers already there, or an unoccupied table. At time n, the n customers have been partitioned among m ≤ n tables. The results of this process are exchangeable, meaning the order in which the customers sit does not affect the probability of the final distribution. This property greatly simplifies a number of problems in population genetics, linguistic analysis, and image recognition.
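
The seating rule above translates directly into a short simulation; the concentration parameter value below is arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)

    def chinese_restaurant_process(n_customers, alpha):
        """Return table occupancy counts after seating n_customers one by one."""
        tables = []                                   # tables[k] = customers at table k
        for n in range(n_customers):
            # Existing table k with probability tables[k] / (n + alpha),
            # a new table with probability alpha / (n + alpha).
            probs = np.array(tables + [alpha]) / (n + alpha)
            choice = rng.choice(len(tables) + 1, p=probs)
            if choice == len(tables):
                tables.append(1)
            else:
                tables[choice] += 1
        return tables

    print(chinese_restaurant_process(100, alpha=1.0))   # a few large tables, many small ones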

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. LDA is an example of a topic model. In this, observations are collected into documents, and each word's presence is attributable to one of the document's topics. Each document will contain a small number of topics.

Bootstrapping is any test or metric that uses random sampling with replacement, and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.

In probability theory, Dirichlet processes are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables—how likely it is that the random variables are distributed according to one or another particular distribution.

In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution. It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector α, and an observation drawn from a multinomial distribution with probability vector p and number of trials n. The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data is collected. The compounding corresponds to a Pólya urn scheme. It is frequently encountered in Bayesian statistics, machine learning, empirical Bayes methods and classical statistics as an overdispersed multinomial distribution.
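
The compound construction can be written as a two-stage sampler, sketched below with an arbitrary pseudocount vector and number of trials.

    import numpy as np

    rng = np.random.default_rng(4)

    alpha = np.array([2.0, 1.0, 0.5])      # Dirichlet pseudocounts, one per outcome
    n_trials = 30

    p = rng.dirichlet(alpha)               # stage 1: probability vector from the Dirichlet prior
    counts = rng.multinomial(n_trials, p)  # stage 2: counts from the multinomial likelihood
    print(p, counts)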

In probability and statistics, a compound probability distribution is the probability distribution that results from assuming that a random variable is distributed according to some parametrized distribution, with the parameters of that distribution themselves being random variables. If the parameter is a scale parameter, the resulting mixture is also called a scale mixture.

In Bayesian inference, the Bernstein-von Mises theorem provides the basis for using Bayesian credible sets for confidence statements in parametric models. It states that under some conditions, a posterior distribution converges in the limit of infinite data to a multivariate normal distribution centered at the maximum likelihood estimator with covariance matrix given by $n^{-1} I(\theta_0)^{-1}$, where $\theta_0$ is the true population parameter and $I(\theta_0)$ is the Fisher information matrix at the true population parameter value.

Bayesian hierarchical modelling is a statistical model written in multiple levels that estimates the parameters of the posterior distribution using the Bayesian method. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as the updated probability estimate, as additional evidence on the prior distribution is acquired.

In probability theory and statistics, the Dirichlet process (DP) is one of the most popular Bayesian nonparametric models. It was introduced by Thomas Ferguson as a prior over probability distributions.

In statistics, the inverted Dirichlet distribution is a multivariate generalization of the beta prime distribution, and is related to the Dirichlet distribution. It was first described by Tiao and Guttman in 1965.

In the mathematical theory of probability, the dependent Dirichlet process (DDP) provides a non-parametric prior over evolving mixture models. One construction of the DDP is built on a Poisson point process. The concept is named after Peter Gustav Lejeune Dirichlet.

Integrated nested Laplace approximations (INLA) is a method for approximate Bayesian inference based on Laplace's method. It is designed for a class of models called latent Gaussian models (LGMs), for which it can be a fast and accurate alternative for Markov chain Monte Carlo methods to compute posterior marginal distributions. Due to its relative speed even with large data sets for certain problems and models, INLA has been a popular inference method in applied statistics, in particular spatial statistics, ecology, and epidemiology. It is also possible to combine INLA with a finite element method solution of a stochastic partial differential equation to study e.g. spatial point processes and species distribution models. The INLA method is implemented in the R-INLA R package.

References

  1. Teh, Y. W.; Jordan, M. I.; Beal, M. J.; Blei, D. M. (2006). "Hierarchical Dirichlet Processes" (PDF). Journal of the American Statistical Association. 101 (476): 1566–1581. CiteSeerX 10.1.1.5.9094. doi:10.1198/016214506000000302. S2CID 7934949.
  2. Teh, Y. W.; Jordan, M. I. (2010). "Hierarchical Bayesian Nonparametric Models with Applications" (PDF). Bayesian Nonparametrics. Cambridge University Press. pp. 158–207. CiteSeerX 10.1.1.157.9451. doi:10.1017/CBO9780511802478.006. ISBN 9780511802478.
  3. Beal, M. J.; Ghahramani, Z.; Rasmussen, C. E. (2002). "The infinite hidden Markov model" (PDF). Advances in Neural Information Processing Systems. 14: 577–585. Cambridge, MA: MIT Press.
  4. Fox, Emily B.; et al. (2011). "A sticky HDP-HMM with application to speaker diarization". The Annals of Applied Statistics: 1020–1056.
  5. Hajiramezanali, E.; Dadaneh, S. Z.; Karbalayghareh, A.; Zhou, Z.; Qian, X. (2018). "Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data" (PDF). 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.