Categorical distribution

Last updated
Categorical
Parameters number of categories (integer)
event probabilities
Support
PMF

(1)
(2)
(3)

Contents

where is the Iverson bracket
Mode

In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution [1] ) is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution, (e.g. 1 to K). The K-dimensional categorical distribution is the most general distribution over a K-way event; any other discrete distribution over a size-K sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.

The categorical distribution is the generalization of the Bernoulli distribution for a categorical random variable, i.e. for a discrete variable with more than two possible outcomes, such as the roll of a die. On the other hand, the categorical distribution is a special case of the multinomial distribution, in that it gives the probabilities of potential outcomes of a single drawing rather than multiple drawings.

Terminology

Occasionally, the categorical distribution is termed the "discrete distribution". However, this properly refers not to one particular family of distributions but to a general class of distributions.

In some fields, such as machine learning and natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a "categorical distribution" would be more precise. [2] This imprecise usage stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a "1-of-K" vector (a vector with one element containing a 1 and all other elements containing a 0) rather than as an integer in the range 1 to K; in this form, a categorical distribution is equivalent to a multinomial distribution for a single observation (see below).

However, conflating the categorical and multinomial distributions can lead to problems. For example, in a Dirichlet-multinomial distribution, which arises commonly in natural language processing models (although not usually with this name) as a result of collapsed Gibbs sampling where Dirichlet distributions are collapsed out of a hierarchical Bayesian model, it is very important to distinguish categorical from multinomial. The joint distribution of the same variables with the same Dirichlet-multinomial distribution has two different forms depending on whether it is characterized as a distribution whose domain is over individual categorical nodes or over multinomial-style counts of nodes in each particular category (similar to the distinction between a set of Bernoulli-distributed nodes and a single binomial-distributed node). Both forms have very similar-looking probability mass functions (PMFs), which both make reference to multinomial-style counts of nodes in a category. However, the multinomial-style PMF has an extra factor, a multinomial coefficient, that is a constant equal to 1 in the categorical-style PMF. Confusing the two can easily lead to incorrect results in settings where this extra factor is not constant with respect to the distributions of interest. The factor is frequently constant in the complete conditionals used in Gibbs sampling and the optimal distributions in variational methods.

Formulating distributions

A categorical distribution is a discrete probability distribution whose sample space is the set of k individually identified items. It is the generalization of the Bernoulli distribution for a categorical random variable.

In one formulation of the distribution, the sample space is taken to be a finite sequence of integers. The exact integers used as labels are unimportant; they might be {0, 1, ..., k  1} or {1, 2, ..., k} or any other arbitrary set of values. In the following descriptions, we use {1, 2, ..., k} for convenience, although this disagrees with the convention for the Bernoulli distribution, which uses {0, 1}. In this case, the probability mass function f is:

where , represents the probability of seeing element i and .

Another formulation that appears more complex but facilitates mathematical manipulations is as follows, using the Iverson bracket: [3]

where evaluates to 1 if , 0 otherwise. There are various advantages of this formulation, e.g.:

Yet another formulation makes explicit the connection between the categorical and multinomial distributions by treating the categorical distribution as a special case of the multinomial distribution in which the parameter n of the multinomial distribution (the number of sampled items) is fixed at 1. In this formulation, the sample space can be considered to be the set of 1-of-K encoded [4] random vectors x of dimension k having the property that exactly one element has the value 1 and the others have the value 0. The particular element having the value 1 indicates which category has been chosen. The probability mass function f in this formulation is:

where represents the probability of seeing element i and . This is the formulation adopted by Bishop. [4] [note 1]

Properties

The possible probabilities for the categorical distribution with
k
=
3
{\displaystyle k=3}
are the 2-simplex
p
1
+
p
2
+
p
3
=
1
{\displaystyle p_{1}+p_{2}+p_{3}=1}
, embedded in 3-space. 2D-simplex.svg
The possible probabilities for the categorical distribution with are the 2-simplex , embedded in 3-space.
where I is the indicator function. Then Y has a distribution which is a special case of the multinomial distribution with parameter . The sum of independent and identically distributed such random variables Y constructed from a categorical distribution with parameter is multinomially distributed with parameters and

Bayesian inference using conjugate prior

In Bayesian statistics, the Dirichlet distribution is the conjugate prior distribution of the categorical distribution (and also the multinomial distribution). This means that in a model consisting of a data point having a categorical distribution with unknown parameter vector p, and (in standard Bayesian style) we choose to treat this parameter as a random variable and give it a prior distribution defined using a Dirichlet distribution, then the posterior distribution of the parameter, after incorporating the knowledge gained from the observed data, is also a Dirichlet. Intuitively, in such a case, starting from what is known about the parameter prior to observing the data point, knowledge can then be updated based on the data point, yielding a new distribution of the same form as the old one. As such, knowledge of a parameter can be successively updated by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model

then the following holds: [2]

This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen. Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.

Further intuition comes from the expected value of the posterior distribution (see the article on the Dirichlet distribution):

This says that the expected probability of seeing a category i among the various discrete distributions generated by the posterior distribution is simply equal to the proportion of occurrences of that category actually seen in the data, including the pseudocounts in the prior distribution. This makes a great deal of intuitive sense: if, for example, there are three possible categories, and category 1 is seen in the observed data 40% of the time, one would expect on average to see category 1 40% of the time in the posterior distribution as well.

(This intuition is ignoring the effect of the prior distribution. Furthermore, the posterior is a distribution over distributions. The posterior distribution in general describes the parameter in question, and in this case the parameter itself is a discrete probability distribution, i.e. the actual categorical distribution that generated the data. For example, if 3 categories in the ratio 40:5:55 are in the observed data, then ignoring the effect of the prior distribution, the true parameter – i.e. the true, underlying distribution that generated our observed data – would be expected to have the average value of (0.40,0.05,0.55), which is indeed what the posterior reveals. However, the true distribution might actually be (0.35,0.07,0.58) or (0.42,0.04,0.54) or various other nearby possibilities. The amount of uncertainty involved here is specified by the variance of the posterior, which is controlled by the total number of observations – the more data observed, the less uncertainty about the true parameter.)

(Technically, the prior parameter should actually be seen as representing prior observations of category . Then, the updated posterior parameter represents posterior observations. This reflects the fact that a Dirichlet distribution with has a completely flat shape — essentially, a uniform distribution over the simplex of possible values of p. Logically, a flat distribution of this sort represents total ignorance, corresponding to no observations of any sort. However, the mathematical updating of the posterior works fine if we ignore the term and simply think of the α vector as directly representing a set of pseudocounts. Furthermore, doing this avoids the issue of interpreting values less than 1.)

MAP estimation

The maximum-a-posteriori estimate of the parameter p in the above model is simply the mode of the posterior Dirichlet distribution, i.e., [2]

In many practical applications, the only way to guarantee the condition that is to set for all i.

Marginal likelihood

In the above model, the marginal likelihood of the observations (i.e. the joint distribution of the observations, with the prior parameter marginalized out) is a Dirichlet-multinomial distribution: [2]

This distribution plays an important role in hierarchical Bayesian models, because when doing inference over such models using methods such as Gibbs sampling or variational Bayes, Dirichlet prior distributions are often marginalized out. See the article on this distribution for more details.

Posterior predictive distribution

The posterior predictive distribution of a new observation in the above model is the distribution that a new observation would take given the set of N categorical observations. As shown in the Dirichlet-multinomial distribution article, it has a very simple form: [2]

There are various relationships among this formula and the previous ones:

The reason for the equivalence between posterior predictive probability and the expected value of the posterior distribution of p is evident with re-examination of the above formula. As explained in the posterior predictive distribution article, the formula for the posterior predictive probability has the form of an expected value taken with respect to the posterior distribution:

The crucial line above is the third. The second follows directly from the definition of expected value. The third line is particular to the categorical distribution, and follows from the fact that, in the categorical distribution specifically, the expected value of seeing a particular value i is directly specified by the associated parameter pi. The fourth line is simply a rewriting of the third in a different notation, using the notation farther up for an expectation taken with respect to the posterior distribution of the parameters.

Observe data points one by one and each time consider their predictive probability before observing the data point and updating the posterior. For any given data point, the probability of that point assuming a given category depends on the number of data points already in that category. In this scenario, if a category has a high frequency of occurrence, then new data points are more likely to join that category — further enriching the same category. This type of scenario is often termed a preferential attachment (or "rich get richer") model. This models many real-world processes, and in such cases the choices made by the first few data points have an outsize influence on the rest of the data points.

Posterior conditional distribution

In Gibbs sampling, one typically needs to draw from conditional distributions in multi-variable Bayes networks where each variable is conditioned on all the others. In networks that include categorical variables with Dirichlet priors (e.g. mixture models and models including mixture components), the Dirichlet distributions are often "collapsed out" (marginalized out) of the network, which introduces dependencies among the various categorical nodes dependent on a given prior (specifically, their joint distribution is a Dirichlet-multinomial distribution). One of the reasons for doing this is that in such a case, the distribution of one categorical node given the others is exactly the posterior predictive distribution of the remaining nodes.

That is, for a set of nodes , if the node in question is denoted as and the remainder as , then

where is the number of nodes having category i among the nodes other than node n.

Sampling

There are a number of methods, but the most common way to sample from a categorical distribution uses a type of inverse transform sampling:

Assume a distribution is expressed as "proportional to" some expression, with unknown normalizing constant. Before taking any samples, one prepares some values as follows:

  1. Compute the unnormalized value of the distribution for each category.
  2. Sum them up and divide each value by this sum, in order to normalize them.
  3. Impose some sort of order on the categories (e.g. by an index that runs from 1 to k, where k is the number of categories).
  4. Convert the values to a cumulative distribution function (CDF) by replacing each value with the sum of all of the previous values. This can be done in time O(k). The resulting value for the first category will be 0.

Then, each time it is necessary to sample a value:

  1. Pick a uniformly distributed number between 0 and 1.
  2. Locate the greatest number in the CDF whose value is less than or equal to the number just chosen. This can be done in time O(log(k)), by binary search.
  3. Return the category corresponding to this CDF value.

If it is necessary to draw many values from the same categorical distribution, the following approach is more efficient. It draws n samples in O(n) time (assuming an O(1) approximation is used to draw values from the binomial distribution [6] ).

function draw_categorical(n) // where n is the number of samples to draw from the categorical distribution   r = 1   s = 0   for i from 1 to k // where k is the number of categories     v = draw from a binomial(n, p[i] / r) distribution // where p[i] is the probability of category i     for j from 1 to v       z[s++] = i // where z is an array in which the results are stored     n = n - v     r = r - p[i]   shuffle (randomly re-order) the elements in z   return z 

Sampling via the Gumbel distribution

In machine learning it is typical to parametrize the categorical distribution, via an unconstrained representation in , whose components are given by:

where is any real constant. Given this representation, can be recovered using the softmax function, which can then be sampled using the techniques described above. There is however a more direct sampling method that uses samples from the Gumbel distribution. [7] Let be k independent draws from the standard Gumbel distribution, then

will be a sample from the desired categorical distribution. (If is a sample from the standard uniform distribution, then is a sample from the standard Gumbel distribution.)

See also

Notes

  1. However, Bishop does not explicitly use the term categorical distribution.

Related Research Articles

The likelihood function is the joint probability of the observed data viewed as a function of the parameters of the chosen statistical model.

<span class="mw-page-title-main">Logistic regression</span> Statistical model for a binary dependent variable

In statistics, the logistic model is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression analysis, logistic regression is estimating the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. The terms "distribution" and "family" are often used loosely: specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter; however, a parametric family of distributions is often referred to as "a distribution", and the set of all exponential families is sometimes loosely referred to as "the" exponential family. They are distinct because they possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

<span class="mw-page-title-main">Expectation–maximization algorithm</span> Iterative method for finding maximum likelihood estimates in statistical models

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.

In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided dice rolled n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

<span class="mw-page-title-main">Dirichlet distribution</span> Probability distribution

In probability and statistics, the Dirichlet distribution, often denoted , is a family of continuous multivariate probability distributions parameterized by a vector of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD). Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.

Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:

  1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
  2. To derive a lower bound for the marginal likelihood of the observed data. This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data.

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables.

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. The LDA is an example of a topic model. In this, observations are collected into documents, and each word's presence is attributable to one of the document's topics. Each document will contain a small number of topics.

<span class="mw-page-title-main">Dirichlet process</span> Family of stochastic processes

In probability theory, Dirichlet processes are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables—how likely it is that the random variables are distributed according to one or another particular distribution.

In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution. It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector , and an observation drawn from a multinomial distribution with probability vector p and number of trials n. The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data is collected. The compounding corresponds to a Pólya urn scheme. It is frequently encountered in Bayesian statistics, machine learning, empirical Bayes methods and classical statistics as an overdispersed multinomial distribution.

In probability and statistics, a natural exponential family (NEF) is a class of probability distributions that is a special case of an exponential family (EF).

In statistics, the multinomial test is the test of the null hypothesis that the parameters of a multinomial distribution equal specified values; it is used for categorical data.

In probability theory and statistics, the negative multinomial distribution is a generalization of the negative binomial distribution to more than two outcomes.

<span class="mw-page-title-main">Logit-normal distribution</span>

In probability theory, a logit-normal distribution is a probability distribution of a random variable whose logit has a normal distribution. If Y is a random variable with a normal distribution, and P is the standard logistic function, then X = P(Y) has a logit-normal distribution; likewise, if X is logit-normally distributed, then Y = logit(X)= log is normally distributed. It is also known as the logistic normal distribution, which often refers to a multinomial logit version (e.g.).

In Bayesian statistics, the posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

In probability theory and statistics, the Dirichlet negative multinomial distribution is a multivariate distribution on the non-negative integers. It is a multivariate extension of the beta negative binomial distribution. It is also a generalization of the negative multinomial distribution allowing for heterogeneity or overdispersion to the probability vector. It is used in quantitative marketing research to flexibly model the number of household transactions across multiple brands.

References

  1. Murphy, K. P. (2012). Machine learning: a probabilistic perspective, p. 35. MIT press. ISBN   0262018020.
  2. 1 2 3 4 5 6 Minka, T. (2003) Bayesian inference, entropy and the multinomial distribution. Technical report Microsoft Research.
  3. Minka, T. (2003), op. cit. Minka uses the Kronecker delta function, similar to but less general than the Iverson bracket.
  4. 1 2 Bishop, C. (2006) Pattern Recognition and Machine Learning, Springer. ISBN   0-387-31073-8.
  5. Johnson, N.L., Kotz, S., Balakrishnan, N. (1997) Discrete Multivariate Distributions, Wiley. ISBN   0-471-12844-9 (p. 105)
  6. Agresti, A., An Introduction to Categorical Data Analysis, Wiley-Interscience, 2007, ISBN   978-0-471-22618-5, pp. 25
  7. Adams, Ryan. "The Gumbel–Max Trick for Discrete Distributions".