Chinese restaurant process

In probability theory, the Chinese restaurant process is a discrete-time stochastic process, analogous to seating customers at tables in a restaurant. Imagine a restaurant with an infinite number of circular tables, each with infinite capacity. Customer 1 sits at the first table. The next customer either sits at the same table as customer 1, or at the next table. This continues, with each customer choosing to sit either at an occupied table, with probability proportional to the number of customers already there (i.e., they are more likely to sit at a table with many customers than with few), or at an unoccupied table. At time n, the n customers have been partitioned among m ≤ n tables (or blocks of the partition). The results of this process are exchangeable, meaning the order in which the customers sit does not affect the probability of the final distribution. This property greatly simplifies a number of problems in population genetics, linguistic analysis, and image recognition.

The restaurant analogy first appeared in a 1985 write-up by David Aldous, [1] where it was attributed to Jim Pitman (who additionally credits Lester Dubins). [2]

Formal definition

For any positive integer $n$, let $\mathcal{P}_n$ denote the set of all partitions of the set $\{1, 2, 3, \ldots, n\}$. The Chinese restaurant process takes values in the infinite Cartesian product $\prod_{n \ge 1} \mathcal{P}_n$.

The value of the process at time $n$ is a partition $B_n$ of the set $\{1, 2, 3, \ldots, n\}$, whose probability distribution is determined as follows. At time $n = 1$, the trivial partition $\{\{1\}\}$ is obtained (with probability one). At time $n + 1$ the element $n + 1$ is either:

  1. added to one of the blocks of the partition $B_n$, where each block $b$ is chosen with probability $|b|/(n+1)$, where $|b|$ is the size of the block (i.e. its number of elements), or
  2. added to the partition $B_n$ as a new singleton block, with probability $1/(n+1)$.

The random partition so generated has some special properties. It is exchangeable in the sense that relabeling $\{1, \ldots, n\}$ does not change the distribution of the partition, and it is consistent in the sense that the law of the partition of $\{1, \ldots, n-1\}$ obtained by removing the element $n$ from the random partition $B_n$ is the same as the law of the random partition $B_{n-1}$.

The probability assigned to any particular partition $B$ of $\{1, \ldots, n\}$ (ignoring the order in which customers sit around any particular table) is

$$\Pr(B_n = B) = \frac{\prod_{b \in B} (|b| - 1)!}{n!},$$

where $b$ is a block in the partition $B$ and $|b|$ is the size of $b$.
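This formula is easy to check numerically. The following is a minimal Python sketch (the function name `crp_partition_prob` is ours, and the code assumes the vanilla process above, i.e. $\theta = 1$); a partition is represented as a list of blocks:

```python
from math import factorial

def crp_partition_prob(partition):
    """Probability that the vanilla (theta = 1) Chinese restaurant process
    produces the given partition of {1, ..., n}.
    Implements Pr(B_n = B) = prod_b (|b| - 1)! / n!."""
    n = sum(len(block) for block in partition)
    numerator = 1
    for block in partition:
        numerator *= factorial(len(block) - 1)
    return numerator / factorial(n)

# The partition {{1, 2}, {3}}: customer 2 joins customer 1's table
# (probability 1/2), then customer 3 opens a new table (probability 1/3).
print(crp_partition_prob([[1, 2], [3]]))  # 1/6
```

As a sanity check, summing this quantity over all five partitions of $\{1, 2, 3\}$ gives 1.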

The definition can be generalized by introducing a parameter $\theta > 0$ which modifies the probability of the new customer sitting at a new table to $\frac{\theta}{n + \theta}$ and correspondingly modifies the probability of them sitting at a table of size $|b|$ to $\frac{|b|}{n + \theta}$. The vanilla process introduced above can be recovered by setting $\theta = 1$. Intuitively, $\theta$ can be interpreted as the effective number of customers sitting at the first empty table.
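A direct simulation of this $\theta$-parameterized seating rule can be sketched as follows (`sample_crp` is a name we introduce; the rule itself is exactly the one stated above):

```python
import random

def sample_crp(n, theta, rng=None):
    """Seat n customers by the one-parameter CRP rule; return table sizes.
    New table with probability theta / (i + theta), existing table b with
    probability |b| / (i + theta), where i customers are already seated."""
    rng = rng or random.Random()
    tables = []  # tables[k] = number of customers at table k
    for i in range(n):
        r = rng.uniform(0, i + theta)
        if not tables or r < theta:
            tables.append(1)          # open a new table
        else:
            r -= theta
            for k, size in enumerate(tables):
                if r < size:          # join table k, chosen prop. to size
                    tables[k] += 1
                    break
                r -= size
    return tables
```

Larger $\theta$ makes new tables more likely, so the number of occupied tables grows with $\theta$; the first customer always opens the first table.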

Alternative definition

An equivalent, but subtly different, way to define the Chinese restaurant process is to let new customers choose companions rather than tables. [3] Customer $n + 1$ chooses to sit at the same table as any one of the $n$ seated customers with probability $\frac{1}{n + \theta}$, or chooses to sit at a new, unoccupied table with probability $\frac{\theta}{n + \theta}$. Notice that in this formulation, the customer chooses a table without having to count table occupancies: the number of customers at each table never needs to be tracked.
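The companion-based formulation translates almost directly into code. The sketch below (`sample_crp_companions` is our name) records only each customer's table label; joining a table of size $|b|$ then happens with probability $|b|/(n+\theta)$, because each of its $|b|$ occupants is an equally likely companion:

```python
import random

def sample_crp_companions(n, theta, rng=None):
    """Companion-based CRP: each new customer either opens a new table
    (probability theta / (i + theta)) or picks one of the i seated
    customers uniformly at random and joins that customer's table.
    Returns the table index of each customer."""
    rng = rng or random.Random()
    assignment = []   # assignment[i] = table index of customer i
    num_tables = 0
    for i in range(n):
        if i == 0 or rng.uniform(0, i + theta) < theta:
            assignment.append(num_tables)   # new table
            num_tables += 1
        else:
            companion = rng.randrange(i)    # uniform over seated customers
            assignment.append(assignment[companion])
    return assignment
```

No table-size bookkeeping appears anywhere in the loop, which is the point of this formulation.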

Distribution of the number of tables

Chinese restaurant table distribution

Parameters: $\theta > 0$ (concentration), $n \in \{1, 2, \ldots\}$ (number of customers)
Support: $k \in \{1, 2, \ldots, n\}$
PMF: $\dfrac{\Gamma(\theta)}{\Gamma(n + \theta)}\,|s(n, k)|\,\theta^k$
Mean: $\theta \bigl( \psi(n + \theta) - \psi(\theta) \bigr)$ (see digamma function)

The Chinese restaurant table distribution (CRT) is the probability distribution on the number of tables in the Chinese restaurant process. [4] It can be understood as the sum of $n$ independent Bernoulli random variables, each with a different parameter:

$$K = \sum_{i=1}^{n} b_i, \qquad b_i \sim \operatorname{Bernoulli}\!\left( \frac{\theta}{\theta + i - 1} \right).$$

The probability mass function of $K$ is given by [5]

$$f(k) = \frac{\Gamma(\theta)}{\Gamma(n + \theta)}\,|s(n, k)|\,\theta^k, \qquad k = 1, \ldots, n,$$

where $s(n, k)$ denotes Stirling numbers of the first kind.
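For small $n$ this mass function can be evaluated directly by computing the unsigned Stirling numbers from their standard recurrence. A minimal sketch (function names are ours; `math.gamma` overflows for large arguments, so this is only for modest $n$):

```python
from functools import lru_cache
from math import gamma

@lru_cache(maxsize=None)
def stirling1u(n, k):
    """Unsigned Stirling numbers of the first kind via the recurrence
    |s(n, k)| = |s(n-1, k-1)| + (n-1)|s(n-1, k)|."""
    if n == 0:
        return 1 if k == 0 else 0
    if k == 0 or k > n:
        return 0
    return stirling1u(n - 1, k - 1) + (n - 1) * stirling1u(n - 1, k)

def crt_pmf(k, n, theta):
    """P(K = k): probability that n customers occupy exactly k tables."""
    return gamma(theta) / gamma(n + theta) * stirling1u(n, k) * theta ** k
```

Two consistency checks follow from the text: the PMF sums to 1 over $k = 1, \ldots, n$, and its mean equals the sum of the Bernoulli parameters $\theta/(\theta + i - 1)$.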

Generalization

This construction can be generalized to a model with two parameters, $\theta$ and $\alpha$, [2] [6] commonly called the strength (or concentration) and discount parameters respectively. At time $n + 1$, the next customer to arrive finds $|B|$ occupied tables and decides to sit at an empty table with probability

$$\frac{\theta + |B| \alpha}{n + \theta},$$

or at an occupied table $b$ of size $|b|$ with probability

$$\frac{|b| - \alpha}{n + \theta}.$$

In order for the construction to define a valid probability measure it is necessary to suppose that either $\alpha < 0$ and $\theta = -L\alpha$ for some $L \in \{1, 2, \ldots\}$; or that $0 \le \alpha \le 1$ and $\theta > -\alpha$.
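The two seating probabilities above form a valid distribution over the next customer's choices, since $\sum_{b}(|b| - \alpha) + \theta + |B|\alpha = n + \theta$. A small sketch (function name ours) that computes the full vector of seating probabilities for one step:

```python
def two_parameter_seating_probs(table_sizes, theta, alpha):
    """Seating probabilities for the next customer in the two-parameter
    Chinese restaurant process: one entry per occupied table b, namely
    (|b| - alpha) / (n + theta), plus a final entry for a new table,
    (theta + |B| * alpha) / (n + theta)."""
    n = sum(table_sizes)
    probs = [(size - alpha) / (n + theta) for size in table_sizes]
    probs.append((theta + len(table_sizes) * alpha) / (n + theta))
    return probs
```

Setting `alpha = 0` recovers the one-parameter rule from the earlier section.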

Under this model the probability assigned to any particular partition $B$ of $\{1, \ldots, n\}$, in terms of the Pochhammer k-symbol, is

$$\Pr(B_n = B) = \frac{(\theta + \alpha)_{|B| - 1, \alpha}}{(\theta + 1)_{n - 1, 1}} \prod_{b \in B} (1 - \alpha)_{|b| - 1, 1},$$

where, by convention, $(a)_{0, c} = 1$, and for $m > 0$,

$$(a)_{m, c} = \prod_{i = 0}^{m - 1} (a + i c) = \begin{cases} a^m & \text{if } c = 0, \\[1ex] \dfrac{c^m \, \Gamma(a/c + m)}{\Gamma(a/c)} & \text{otherwise.} \end{cases}$$

Thus, for the case $\theta > 0$ the partition probability can be expressed in terms of the Gamma function as

$$\Pr(B_n = B) = \frac{\Gamma(\theta)}{\Gamma(\theta + n)} \, \frac{\alpha^{|B|} \, \Gamma(\theta/\alpha + |B|)}{\Gamma(\theta/\alpha)} \prod_{b \in B} \frac{\Gamma(|b| - \alpha)}{\Gamma(1 - \alpha)}.$$

In the one-parameter case, where $\alpha$ is zero, this simplifies to

$$\Pr(B_n = B) = \frac{\Gamma(\theta) \, \theta^{|B|}}{\Gamma(\theta + n)} \prod_{b \in B} \Gamma(|b|).$$

Or, when $\theta$ is zero,

$$\Pr(B_n = B) = \frac{\alpha^{|B| - 1} \, \Gamma(|B|)}{\Gamma(n)} \prod_{b \in B} \frac{\Gamma(|b| - \alpha)}{\Gamma(1 - \alpha)}.$$

As before, the probability assigned to any particular partition depends only on the block sizes, so as before the random partition is exchangeable in the sense described above. The consistency property still holds, as before, by construction.

If $\alpha = 0$, the probability distribution of the random partition of the integer $n$ thus generated is the Ewens distribution with parameter $\theta$, used in population genetics and the unified neutral theory of biodiversity.

[Animation of a Chinese restaurant process with a given scaling parameter. Tables are hidden once the customers of a table can no longer be displayed; however, every table has infinitely many seats. (Recording of an interactive animation. [7])]

Derivation

Here is one way to derive this partition probability. Let $C_i$ be the random block into which the number $i$ is added, for $i = 1, 2, \ldots, n$. Then

$$\Pr(C_i = c \mid C_1, \ldots, C_{i-1}) = \begin{cases} \dfrac{\theta + |B| \alpha}{\theta + i - 1} & \text{if } c \text{ is a new block, where } |B| \text{ blocks already exist}, \\[1ex] \dfrac{|c| - \alpha}{\theta + i - 1} & \text{if } c \text{ is an existing block of size } |c|. \end{cases}$$

The probability that $B_n$ is any particular partition of the set $\{1, \ldots, n\}$ is the product of these probabilities as $i$ runs from $1$ to $n$. Now consider the size of block $b$: it increases by one each time we add one element into it. When the last element in block $b$ is to be added in, the block size is $|b| - 1$. For example, consider this sequence of choices: (generate a new block $b$)(join $b$)(join $b$)(join $b$). In the end, block $b$ has 4 elements, and the product of the numerator factors it contributes is $(\theta + k\alpha)(1 - \alpha)(2 - \alpha)(3 - \alpha)$, where $k$ is the number of blocks already present when $b$ is created. Following this logic, we obtain $\Pr(B_n = B)$ as above.
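The derivation can be checked numerically: multiplying the conditional seating probabilities along one arrival sequence should reproduce the closed-form partition probability, since the arrival sequence determines the partition of $\{1, \ldots, n\}$. A sketch (names ours; the closed form below assumes $\theta > 0$ and $0 < \alpha < 1$):

```python
from math import gamma

def seq_prob(labels, theta, alpha):
    """Multiply the conditional seating probabilities for one arrival
    sequence; labels[i] is the block label chosen by customer i + 1
    (an unseen label opens a new block). Returns (probability, sizes)."""
    sizes = {}
    p = 1.0
    for i, label in enumerate(labels):
        denom = theta + i  # customer i + 1 sees theta + i in the denominator
        if label not in sizes:
            p *= (theta + len(sizes) * alpha) / denom
            sizes[label] = 1
        else:
            p *= (sizes[label] - alpha) / denom
            sizes[label] += 1
    return p, sorted(sizes.values())

def partition_prob(block_sizes, theta, alpha):
    """Closed-form Pr(B_n = B) for theta > 0, 0 < alpha < 1."""
    n, num_blocks = sum(block_sizes), len(block_sizes)
    out = (gamma(theta) / gamma(theta + n)
           * alpha ** num_blocks * gamma(theta / alpha + num_blocks)
           / gamma(theta / alpha))
    for s in block_sizes:
        out *= gamma(s - alpha) / gamma(1 - alpha)
    return out
```

For the arrival sequence (new block $a$)(join $a$)(new block $b$)(join $a$), both routes give the same number.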

Expected number of tables

For the one-parameter case, with $\alpha = 0$ and $0 < \theta < \infty$, the number of tables is distributed according to the Chinese restaurant table distribution. The expected value of this random variable, given that there are $n$ seated customers, is [8]

$$\mathbb{E}[K_n] = \sum_{k=1}^{n} \frac{\theta}{\theta + k - 1} = \theta \bigl( \psi(\theta + n) - \psi(\theta) \bigr),$$

where $\psi(\theta)$ is the digamma function. In the general case ($\alpha > 0$) the expected number of occupied tables is [6]

$$\mathbb{E}[K_n] = \frac{\Gamma(\theta + n + \alpha) \, \Gamma(\theta + 1)}{\alpha \, \Gamma(\theta + \alpha) \, \Gamma(\theta + n)} - \frac{\theta}{\alpha}.$$
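Both expectations are straightforward to evaluate; the sketch below (function name ours) uses the Bernoulli-sum form in the one-parameter case and the Gamma-function expression otherwise. Note that `math.gamma` overflows for arguments much above 170, so this is only for modest $n$:

```python
from math import gamma

def expected_tables(n, theta, alpha):
    """Expected number of occupied tables after n customers in the
    two-parameter CRP; alpha = 0 gives the one-parameter case."""
    if alpha == 0:
        # E[K_n] = sum_{k=1}^{n} theta / (theta + k - 1)
        return sum(theta / (theta + i) for i in range(n))
    return (gamma(theta + n + alpha) * gamma(theta + 1)
            / (alpha * gamma(theta + alpha) * gamma(theta + n))
            - theta / alpha)
```

A quick consistency check: for $n = 1$ the general expression reduces to $(\theta + \alpha)/\alpha - \theta/\alpha = 1$, as it must, since the first customer always opens one table.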

The Indian buffet process

It is possible to adapt the model such that each data point is no longer uniquely associated with a class (i.e., we are no longer constructing a partition), but may be associated with any combination of the classes. This strains the restaurant-tables analogy and so is instead likened to a process in which a series of diners samples from some subset of an infinite selection of dishes on offer at a buffet. The probability that a particular diner samples a particular dish is proportional to the popularity of the dish among diners so far, and in addition the diner may sample from the untested dishes. This has been named the Indian buffet process and can be used to infer latent features in data. [9]
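The buffet analogy can be simulated directly. The sketch below (names ours) follows the usual generative description of the Indian buffet process: diner $i$ resamples each previously tried dish $d$ with probability $m_d / i$, where $m_d$ is the number of earlier diners who took it, and then tries a $\operatorname{Poisson}(\alpha / i)$ number of brand-new dishes:

```python
import math
import random

def _poisson(lam, rng):
    """Sample Poisson(lam) by Knuth's multiplication method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_ibp(num_diners, alpha, rng=None):
    """Indian buffet process: returns a binary matrix whose (i, d) entry
    indicates whether diner i sampled dish d."""
    rng = rng or random.Random()
    dish_counts = []  # m_d: how many diners have taken each dish so far
    rows = []
    for i in range(1, num_diners + 1):
        row = [1 if rng.random() < m / i else 0 for m in dish_counts]
        row.extend([1] * _poisson(alpha / i, rng))  # brand-new dishes
        for d, taken in enumerate(row):
            if d < len(dish_counts):
                dish_counts[d] += taken
            else:
                dish_counts.append(1)
        rows.append(row)
    width = len(dish_counts)
    return [r + [0] * (width - len(r)) for r in rows]
```

Unlike the CRP, each row of the resulting matrix may contain several ones, reflecting that a data point can exhibit any combination of latent features.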

Applications

The Chinese restaurant process is closely connected to Dirichlet processes and Pólya's urn scheme, and is therefore useful in applications of Bayesian statistics, including nonparametric Bayesian methods. The generalized Chinese restaurant process is closely related to the Pitman–Yor process. These processes have been used in many applications, including modeling text, clustering biological microarray data, [10] biodiversity modelling, and image reconstruction. [11] [12]

References

  1. Aldous, D. J. (1985). "Exchangeability and related topics". École d'Été de Probabilités de Saint-Flour XIII — 1983. Lecture Notes in Mathematics. Vol. 1117. pp. 1–198. doi:10.1007/BFb0099421. ISBN   978-3-540-15203-3. The restaurant process is described on page 92.
  2. Pitman, Jim (1995). "Exchangeable and Partially Exchangeable Random Partitions". Probability Theory and Related Fields. 102 (2): 145–158. doi:10.1007/BF01213386. MR 1337249. S2CID 16849229.
  3. Blei, David M.; Frazier, Peter I. (2011). "Distance Dependent Chinese Restaurant Processes" (PDF). Journal of Machine Learning Research. 12: 2461–2488.
  4. Zhou, Mingyuan; Carin, Lawrence (2012). "Negative Binomial Process Count and Mixture Modeling". IEEE Transactions on Pattern Analysis and Machine Intelligence. 37 (2): 307–20. arXiv: 1209.3442 . Bibcode:2012arXiv1209.3442Z. doi:10.1109/TPAMI.2013.211. PMID   26353243. S2CID   1937045.
  5. Antoniak, Charles E (1974). "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems". The Annals of Statistics. 2 (6): 1152–1174. doi: 10.1214/aos/1176342871 .
  6. Pitman, Jim (2006). Combinatorial Stochastic Processes. Vol. 1875. Berlin: Springer-Verlag. ISBN 9783540309901. Archived from the original on 2012-09-25. Retrieved 2011-05-11.
  7. "Dirichlet Process and Dirichlet Distribution -- Polya Restaurant Scheme and Chinese Restaurant Process".
  8. Xinhua Zhang, "A Very Gentle Note on the Construction of Dirichlet Process", September 2008, The Australian National University, Canberra. Online: http://users.cecs.anu.edu.au/~xzhang/pubDoc/notes/dirichlet_process.pdf Archived April 11, 2011, at the Wayback Machine
  9. Griffiths, T.L. and Ghahramani, Z. (2005) Infinite Latent Feature Models and the Indian Buffet Process Archived 2008-10-31 at the Wayback Machine . Gatsby Unit Technical Report GCNU-TR-2005-001.
  10. Qin, Zhaohui S (2006). "Clustering microarray gene expression data using weighted Chinese restaurant process". Bioinformatics. 22 (16): 1988–1997. doi:10.1093/bioinformatics/btl284. PMID   16766561.
  11. White, J. T.; Ghosal, S. (2011). "Bayesian smoothing of photon-limited images with applications in astronomy" (PDF). Journal of the Royal Statistical Society, Series B (Statistical Methodology). 73 (4): 579–599. CiteSeerX   10.1.1.308.7922 . doi:10.1111/j.1467-9868.2011.00776.x. S2CID   2342134.
  12. Li, M.; Ghosal, S. (2014). "Bayesian multiscale smoothing of Gaussian noised images". Bayesian Analysis. 9 (3): 733–758. doi: 10.1214/14-ba871 .