Imprecise Dirichlet process

In probability theory and statistics, the Dirichlet process (DP) is one of the most popular Bayesian nonparametric models. It was introduced by Thomas Ferguson [1] as a prior over probability distributions.

A Dirichlet process $\mathrm{DP}(s, G_0)$ is completely defined by its parameters: $G_0$ (the base distribution or base measure) is an arbitrary distribution and $s$ (the concentration parameter) is a positive real number (it is often denoted as $\alpha$). According to the Bayesian paradigm, these parameters should be chosen based on the available prior information on the domain.

The question is: how should we choose the prior parameters $(s, G_0)$ of the DP, in particular the infinite-dimensional one $G_0$, when prior information is lacking?

To address this issue, the only prior that has been proposed so far is the limiting DP obtained for $s \to 0$, which was introduced under the name of Bayesian bootstrap by Rubin; [2] in fact, it can be proven that the Bayesian bootstrap is asymptotically equivalent to the frequentist bootstrap introduced by Bradley Efron. [3] The limiting Dirichlet process has been criticized on diverse grounds. From an a-priori point of view, the main criticism is that taking $s \to 0$ is far from leading to a noninformative prior. [4] Moreover, a-posteriori, it assigns zero probability to any set that does not include the observations. [2]

The imprecise Dirichlet process [5] has been proposed to overcome these issues. The basic idea is to fix $s > 0$ but not to choose any precise base measure $G_0$.

More precisely, the imprecise Dirichlet process (IDP) is defined as follows:

$$\mathrm{IDP}: \ \left\{ \mathrm{DP}(s, G_0) : \ G_0 \in \mathbb{P} \right\},$$

where $\mathbb{P}$ is the set of all probability measures. In other words, the IDP is the set of all Dirichlet processes (with a fixed $s > 0$) obtained by letting the base measure $G_0$ span the set of all probability measures.

Inferences with the Imprecise Dirichlet Process

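Let $P$ be a probability distribution on $(\mathbb{X}, \mathcal{B})$ (here $\mathbb{X}$ is a standard Borel space with Borel $\sigma$-field $\mathcal{B}$) and assume that $P \sim \mathrm{DP}(s, G_0)$. Then consider a real-valued bounded function $f$ defined on $(\mathbb{X}, \mathcal{B})$. It is well known that the expectation of $E[f]$ with respect to the Dirichlet process is

$$\mathcal{E}[E(f)] = \mathcal{E}\left[\int f \, dP\right] = \int f \, d\mathcal{E}[P] = \int f \, dG_0.$$

One of the most remarkable properties of DP priors is that the posterior distribution of $P$ is again a DP. Let $X_1, \dots, X_n$ be an independent and identically distributed sample from $P$, with $P \sim \mathrm{DP}(s, G_0)$; then the posterior distribution of $P$ given the observations is

$$P \mid X_1, \dots, X_n \sim \mathrm{DP}(s + n, G_n), \quad \text{with} \quad G_n = \frac{s}{s+n} G_0 + \frac{1}{s+n} \sum_{i=1}^{n} \delta_{X_i},$$

where $\delta_{X_i}$ is an atomic probability measure (Dirac's delta) centered at $X_i$. Hence, it follows that $\mathcal{E}[E(f) \mid X_1, \dots, X_n] = \int f \, dG_n$. Therefore, for any fixed $G_0$, we can exploit the previous equations to derive prior and posterior expectations.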

In the IDP, $G_0$ can span the set of all distributions $\mathbb{P}$. This implies that we get a different prior and posterior expectation of $E(f)$ for each choice of $G_0$. A way to characterize inferences for the IDP is to compute lower and upper bounds for the expectation of $E(f)$ with respect to $P \sim \mathrm{DP}(s, G_0)$. A-priori these bounds are:

$$\underline{\mathcal{E}}[E(f)] = \inf_{G_0 \in \mathbb{P}} \int f \, dG_0 = \inf f, \qquad \overline{\mathcal{E}}[E(f)] = \sup_{G_0 \in \mathbb{P}} \int f \, dG_0 = \sup f.$$

The lower (upper) bound is obtained by a probability measure that puts all the mass on the infimum (supremum) of $f$, i.e., $G_0 = \delta_{X_0}$ with $X_0 = \arg\inf f$ (or, respectively, with $X_0 = \arg\sup f$). From the above expressions of the lower and upper bounds, it can be observed that the range of $\mathcal{E}[E(f)]$ under the IDP is the same as the original range of $f$. In other words, by specifying the IDP, we are not giving any prior information on the value of the expectation of $f$. A-priori, the IDP is therefore a model of prior (near-)ignorance for $E(f)$.

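A-posteriori, the IDP can learn from data. The posterior lower and upper bounds for the expectation of $E(f)$ are in fact given by:

$$\underline{\mathcal{E}}[E(f) \mid X_1, \dots, X_n] = \inf_{G_0 \in \mathbb{P}} \int f \, dG_n = \frac{s}{s+n} \inf f + \frac{n}{s+n} \frac{\sum_{i=1}^{n} f(X_i)}{n},$$

$$\overline{\mathcal{E}}[E(f) \mid X_1, \dots, X_n] = \sup_{G_0 \in \mathbb{P}} \int f \, dG_n = \frac{s}{s+n} \sup f + \frac{n}{s+n} \frac{\sum_{i=1}^{n} f(X_i)}{n}.$$

It can be observed that the posterior inferences do not depend on $G_0$. To define the IDP, the modeler only has to choose $s$ (the concentration parameter). This explains the meaning of the adjective "near" in prior near-ignorance: the IDP still requires the modeler to elicit one parameter. However, this is a simple elicitation problem for a nonparametric prior, since we only have to choose the value of a positive scalar (there are not infinitely many parameters left in the IDP model).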
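The posterior bounds can be evaluated directly from a sample. The following is a minimal Python sketch, not a reference implementation: the function name `idp_posterior_bounds` and the choice of `math.tanh` as an example of a bounded function (infimum -1, supremum 1) are assumptions made here for illustration, and the infimum and supremum of $f$ must be supplied by the caller.

```python
import math
from typing import Callable, Sequence, Tuple


def idp_posterior_bounds(
    f: Callable[[float], float],
    data: Sequence[float],
    s: float,
    f_inf: float,
    f_sup: float,
) -> Tuple[float, float]:
    """Lower and upper posterior expectation of E(f) under the IDP.

    f_inf and f_sup are the infimum and supremum of the bounded function f;
    they are passed explicitly because they cannot be found numerically
    in general.
    """
    n = len(data)
    total = sum(f(x) for x in data)        # sum_i f(X_i)
    lower = (s * f_inf + total) / (s + n)  # prior mass s placed on inf f
    upper = (s * f_sup + total) / (s + n)  # prior mass s placed on sup f
    return lower, upper


# Hypothetical usage: f = tanh is bounded in (-1, 1); s = 1 is an assumed value.
observations = [-1.17, 0.44, 1.17, 3.28, 1.44, 1.98]
lower, upper = idp_posterior_bounds(math.tanh, observations, 1.0, -1.0, 1.0)
print(f"lower = {lower:.4f}, upper = {upper:.4f}, width = {upper - lower:.4f}")
```

The width of the interval is $s(\sup f - \inf f)/(s+n)$ whatever the data are, so only the location of the interval is learned from the sample.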

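Finally, observe that for $n \to \infty$, the IDP satisfies

$$\underline{\mathcal{E}}[E(f) \mid X_1, \dots, X_n], \quad \overline{\mathcal{E}}[E(f) \mid X_1, \dots, X_n] \ \to \ S(f),$$

where $S(f) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f(X_i)$. In other words, the IDP is consistent.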

Lower (red) and Upper (blue) cumulative distribution for the observations {−1.17, 0.44, 1.17, 3.28, 1.44, 1.98}

Choice of the prior strength $s$

The IDP is completely specified by $s$, which is the only parameter left in the prior model. Since the value of $s$ determines how quickly the lower and upper posterior expectations converge as the number of observations increases, $s$ can be chosen so as to match a certain convergence rate. [5] The parameter $s$ can also be chosen to obtain some desirable frequentist properties (e.g., credible intervals that are calibrated frequentist intervals, hypothesis tests that are calibrated for the Type I error, etc.); see Example: median test.
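The link between $s$ and the convergence rate can be made explicit with a short worked step. It assumes the posterior bounds for a bounded function $f$, namely $(s \inf f + \sum_i f(X_i))/(s+n)$ for the lower expectation and the same expression with $\sup f$ for the upper one; the target width $\varepsilon > 0$ is a symbol introduced here purely for illustration.

```latex
\overline{\mathcal{E}}[E(f)\mid X_1,\dots,X_n]
-\underline{\mathcal{E}}[E(f)\mid X_1,\dots,X_n]
=\frac{s}{s+n}\left(\sup f-\inf f\right),
\qquad
\frac{s}{s+n}\left(\sup f-\inf f\right)\le\varepsilon
\iff
n\ge s\left(\frac{\sup f-\inf f}{\varepsilon}-1\right).
```

For example, for an indicator function ($\sup f - \inf f = 1$) and $s = 1$, a width of at most $0.1$ requires $n \ge 9$ observations; doubling $s$ doubles the required sample size.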

Example: estimate of the cumulative distribution

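Let $X_1, \dots, X_n$ be i.i.d. real random variables with cumulative distribution function $F(x)$.

Since $F(x) = E[\mathbb{I}_{(-\infty, x]}]$, where $\mathbb{I}_{(-\infty, x]}$ is the indicator function, we can use the IDP to derive inferences about $F(x)$. The lower and upper posterior means of $F(x)$ are

$$\underline{\mathcal{E}}[F(x) \mid X_1, \dots, X_n] = \frac{n}{s+n} \frac{\sum_{i=1}^{n} \mathbb{I}_{(-\infty, x]}(X_i)}{n} = \frac{n}{s+n} \hat{F}(x),$$

$$\overline{\mathcal{E}}[F(x) \mid X_1, \dots, X_n] = \frac{s}{s+n} + \frac{n}{s+n} \frac{\sum_{i=1}^{n} \mathbb{I}_{(-\infty, x]}(X_i)}{n} = \frac{s}{s+n} + \frac{n}{s+n} \hat{F}(x),$$

where $\hat{F}(x)$ is the empirical distribution function. Here, to obtain the lower bound we have exploited the fact that $\inf \mathbb{I}_{(-\infty, x]} = 0$, and for the upper bound that $\sup \mathbb{I}_{(-\infty, x]} = 1$.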

Beta distributions for the lower (red) and upper (blue) probability corresponding to the observations {-1.17, 0.44, 1.17, 3.28, 1.44, 1.98}. The area in [0,0.5] gives the lower (0.891) and the upper (0.9375) probability of the hypothesis "the median is greater than zero".

Note that, for any precise choice of $G_0$ (e.g., the normal distribution $\mathcal{N}(x; 0, 1)$), the posterior expectation of $F(x)$ lies between the lower and upper bounds.
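As an illustration of the cumulative distribution example, the sketch below evaluates the lower and upper posterior means of $F(x)$ for the six observations used in the figures, and checks that the posterior mean under one precise prior, a standard normal base measure, falls inside the interval. The value $s = 1$, the evaluation points and the function names are assumptions chosen here, not taken from the source.

```python
import math
from typing import Sequence, Tuple


def idp_cdf_bounds(x: float, data: Sequence[float], s: float) -> Tuple[float, float]:
    """Lower and upper posterior mean of F(x) under the IDP."""
    n = len(data)
    count = sum(1 for xi in data if xi <= x)  # n times the empirical CDF at x
    lower = count / (s + n)                   # inf of the indicator is 0
    upper = (s + count) / (s + n)             # sup of the indicator is 1
    return lower, upper


def dp_cdf_posterior_mean(x: float, data: Sequence[float], s: float) -> float:
    """Posterior mean of F(x) for the precise choice G_0 = N(0, 1)."""
    n = len(data)
    prior_cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    count = sum(1 for xi in data if xi <= x)
    return (s * prior_cdf + count) / (s + n)


observations = [-1.17, 0.44, 1.17, 3.28, 1.44, 1.98]
s = 1.0  # assumed prior strength
for x in (-2.0, 0.0, 1.5, 4.0):
    lower, upper = idp_cdf_bounds(x, observations, s)
    precise = dp_cdf_posterior_mean(x, observations, s)
    print(f"x={x:5.1f}  lower={lower:.3f}  N(0,1) prior={precise:.3f}  upper={upper:.3f}")
```

The gap between the two curves is $s/(s+n)$ at every $x$, which is the vertical distance between the lower and upper cumulative distributions.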

Example: median test

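The IDP can also be used for hypothesis testing, for instance to test the hypothesis $F(0) < 0.5$, i.e., that the median of $F$ is greater than zero. By considering the partition $(-\infty, 0], (0, \infty)$ and the properties of the Dirichlet process, it can be shown that the posterior distribution of $F(0)$ is

$$F(0) \sim \mathrm{Beta}(\alpha_0 + n_{<0}, \ \beta_0 + n - n_{<0}),$$

where $n_{<0}$ is the number of observations that are less than zero,

$$\alpha_0 = s \int_{-\infty}^{0} dG_0 \quad \text{and} \quad \beta_0 = s \int_{0}^{\infty} dG_0.$$

By exploiting this property, it follows that

$$\underline{\mathcal{P}}[F(0) < 0.5 \mid X_1, \dots, X_n] = \int_0^{0.5} \mathrm{Beta}(\theta; s + n_{<0}, n - n_{<0}) \, d\theta = I_{1/2}(s + n_{<0}, \ n - n_{<0}),$$

$$\overline{\mathcal{P}}[F(0) < 0.5 \mid X_1, \dots, X_n] = \int_0^{0.5} \mathrm{Beta}(\theta; n_{<0}, s + n - n_{<0}) \, d\theta = I_{1/2}(n_{<0}, \ s + n - n_{<0}),$$

where $I_x(\alpha, \beta)$ is the regularized incomplete beta function. We can thus perform the hypothesis test

$$\underline{\mathcal{P}}[F(0) < 0.5 \mid X_1, \dots, X_n] > 1 - \gamma, \qquad \overline{\mathcal{P}}[F(0) < 0.5 \mid X_1, \dots, X_n] > 1 - \gamma$$

(with $1 - \gamma = 0.95$, for instance) and then

  1. if both inequalities are satisfied, we can declare that $F(0) < 0.5$ with probability larger than $1 - \gamma$;
  2. if only one of the inequalities is satisfied (which necessarily has to be the one for the upper probability), we are in an indeterminate situation, i.e., we cannot decide;
  3. if neither is satisfied, we can declare that the probability that $F(0) < 0.5$ is lower than the desired probability of $1 - \gamma$.

The IDP returns an indeterminate decision when the decision is prior-dependent (that is, when it would depend on the choice of $G_0$).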
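The three-way decision rule above can be sketched in a few lines of Python. This is an illustrative sketch under a simplifying assumption: $s$ is restricted to a positive integer so that $I_{1/2}(a, b)$ can be computed exactly from binomial coefficients using only the standard library; for real-valued $s$ a general regularized incomplete beta routine would be needed. The function names and the decision labels are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def reg_inc_beta_half(a: int, b: int) -> float:
    """I_{1/2}(a, b) for non-negative integers a, b (not both zero).

    For positive integers, I_x(a, b) = P(Bin(a + b - 1, x) >= a).
    a == 0 and b == 0 are the degenerate limits (all mass at 0 or at 1).
    """
    if a == 0:
        return 1.0
    if b == 0:
        return 0.0
    m = a + b - 1
    return sum(comb(m, j) for j in range(a, m + 1)) / 2 ** m


def idp_median_test(
    data: Sequence[float], s: int = 1, gamma: float = 0.05
) -> Tuple[float, float, str]:
    """Lower/upper posterior probability of F(0) < 0.5 and the decision."""
    n = len(data)
    n_neg = sum(1 for x in data if x < 0)            # observations below zero
    lower = reg_inc_beta_half(s + n_neg, n - n_neg)  # prior mass s on (-inf, 0]
    upper = reg_inc_beta_half(n_neg, s + n - n_neg)  # prior mass s on (0, inf)
    if lower > 1 - gamma:
        decision = "median > 0"       # both inequalities hold
    elif upper > 1 - gamma:
        decision = "indeterminate"    # only the upper inequality holds
    else:
        decision = "not supported"    # neither inequality holds
    return lower, upper, decision


observations = [-1.17, 0.44, 1.17, 3.28, 1.44, 1.98]
lower, upper, decision = idp_median_test(observations, s=1, gamma=0.05)
print(f"lower = {lower:.4f}, upper = {upper:.4f}, decision = {decision}")
```

With $s = 1$ the lower probability for these six observations is $57/64 \approx 0.891$. Because the lower probability never exceeds the upper one, checking the lower inequality first is enough to distinguish the three cases.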

By exploiting the relationship between the cumulative distribution function of the Beta distribution and the cumulative distribution function of a random variable $Z$ from a binomial distribution, where the "probability of success" is $p$ and the sample size is $n$:

$$F(k; n, p) = \Pr(Z \le k) = I_{1-p}(n - k, k + 1) = 1 - I_p(k + 1, n - k),$$

we can show that the median test derived with the IDP for any choice of $s \ge 1$ encompasses the one-sided frequentist sign test as a test for the median. It can in fact be verified that for $s = 1$ the $p$-value of the sign test is equal to $1 - \underline{\mathcal{P}}[F(0) < 0.5 \mid X_1, \dots, X_n]$. Thus, if $\underline{\mathcal{P}}[F(0) < 0.5 \mid X_1, \dots, X_n] > 0.95$, then the $p$-value is less than $0.05$ and, thus, the two tests have the same power.
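The stated equality for $s = 1$ can be checked numerically. The sketch below is only a sanity check under stated assumptions: the Beta integral is approximated with a composite Simpson rule written for this example (valid here because both Beta parameters are at least 1, so the density is bounded), and the sign-test $p$-value is taken to be $\Pr(Z \le n_{<0})$ with $Z \sim \mathrm{Bin}(n, 1/2)$. The function names are hypothetical.

```python
import math


def beta_cdf(x: float, a: float, b: float, steps: int = 2000) -> float:
    """Approximate I_x(a, b) by composite Simpson's rule (assumes a, b >= 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(t: float) -> float:
        return math.exp(log_norm) * t ** (a - 1) * (1 - t) ** (b - 1)

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3


def sign_test_p_value(n_neg: int, n: int) -> float:
    """One-sided sign test: P(Z <= n_neg) with Z ~ Bin(n, 1/2)."""
    return sum(math.comb(n, j) for j in range(n_neg + 1)) / 2 ** n


n = 6
for n_neg in range(n):  # n_neg = n would give a Beta parameter equal to 0
    lower = beta_cdf(0.5, 1 + n_neg, n - n_neg)  # lower probability with s = 1
    p_value = sign_test_p_value(n_neg, n)
    print(f"n_neg={n_neg}  1 - lower = {1 - lower:.6f}  sign-test p = {p_value:.6f}")
```

For the six observations used in the figures ($n_{<0} = 1$), both quantities equal $7/64 \approx 0.109$.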

Applications of the Imprecise Dirichlet Process

Dirichlet processes are frequently used in Bayesian nonparametric statistics. The Imprecise Dirichlet Process can be employed instead of the Dirichlet processes in any application in which prior information is lacking (it is therefore important to model this state of prior ignorance).

In this respect, the Imprecise Dirichlet Process has been used for nonparametric hypothesis testing, see the Imprecise Dirichlet Process statistical package. Based on the Imprecise Dirichlet Process, Bayesian nonparametric near-ignorance versions of the following classical nonparametric estimators have been derived: the Wilcoxon rank sum test [5] and the Wilcoxon signed-rank test. [6]

A Bayesian nonparametric near-ignorance model presents several advantages with respect to a traditional approach to hypothesis testing.

  1. The Bayesian approach allows us to formulate the hypothesis test as a decision problem. This means that we can assess the evidence in favor of the null hypothesis, rather than only rejecting it, and take decisions that minimize the expected loss.
  2. Because of the nonparametric prior near-ignorance, IDP-based tests allow us to start the hypothesis test with very weak prior assumptions, much in the direction of letting the data speak for themselves.
  3. Although the IDP test shares several similarities with a standard Bayesian approach, it embodies a significant change of paradigm when it comes to taking decisions. In fact, IDP-based tests have the advantage of producing an indeterminate outcome when the decision is prior-dependent. In other words, the IDP test suspends judgment when the option that minimizes the expected loss changes depending on the Dirichlet process base measure we focus on.
  4. It has been empirically verified that when the IDP test is indeterminate, the frequentist tests behave virtually as random guessers. This surprising result has practical consequences in hypothesis testing. Assume that we are trying to compare the effects of two medical treatments (Y is better than X) and that, given the available data, the IDP test is indeterminate. In such a situation the frequentist test always issues a determinate response (for instance, "Y is better than X"), but it turns out that its response is completely random, as if we were tossing a coin. The IDP test, on the other hand, acknowledges the impossibility of making a decision in these cases. Thus, by saying "I do not know", the IDP test provides richer information to the analyst, who could for instance use it to decide to collect more data.

Categorical variables

For categorical variables, i.e., when $\mathbb{X}$ has a finite number of elements, it is known that the Dirichlet process reduces to a Dirichlet distribution. In this case, the Imprecise Dirichlet Process reduces to the Imprecise Dirichlet model proposed by Walley [7] as a model for prior (near-)ignorance for chances.
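In the categorical case the posterior bounds take a particularly simple form. Taking $f$ to be the indicator of a single category in the IDP posterior bounds gives a lower probability of $n_j/(s+n)$ and an upper probability of $(s+n_j)/(s+n)$, where $n_j$ is the number of observations in category $j$. The short sketch below computes these intervals; the function name, the sample and the value $s = 2$ are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def idm_bounds(
    sample: Sequence[Hashable], categories: Sequence[Hashable], s: float = 1.0
) -> Dict[Hashable, Tuple[float, float]]:
    """Lower and upper posterior probability of each category."""
    counts = Counter(sample)  # n_j for every observed category
    n = len(sample)
    return {
        c: (counts[c] / (s + n), (s + counts[c]) / (s + n))
        for c in categories
    }


# Hypothetical sample; category "d" is never observed.
sample = ["a", "b", "a", "c", "a", "b"]
bounds = idm_bounds(sample, ["a", "b", "c", "d"], s=2.0)
for category, (lower, upper) in bounds.items():
    print(f"{category}: [{lower:.3f}, {upper:.3f}]")
```

An unobserved category still receives a non-zero upper probability, $s/(s+n)$, which is how the model expresses that the data have not ruled it out.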

See also

Imprecise probability

Robust Bayesian analysis

References

  1. Ferguson, Thomas (1973). "Bayesian analysis of some nonparametric problems". Annals of Statistics. 1 (2): 209–230. doi:10.1214/aos/1176342360. MR 0350949.
  2. Rubin, D. (1981). "The Bayesian bootstrap". Annals of Statistics. 9: 130–134.
  3. Efron, B. (1979). "Bootstrap methods: Another look at the jackknife". Annals of Statistics. 7: 1–26.
  4. Sethuraman, J.; Tiwari, R. C. (1981). "Convergence of Dirichlet measures and the interpretation of their parameter". Defense Technical Information Center.
  5. Benavoli, Alessio; Mangili, Francesca; Ruggeri, Fabrizio; Zaffalon, Marco (2014). "Imprecise Dirichlet Process with application to the hypothesis test on the probability that X < Y". arXiv:1402.2755 [math.ST].
  6. Benavoli, Alessio; Mangili, Francesca; Corani, Giorgio; Ruggeri, Fabrizio; Zaffalon, Marco (2014). "A Bayesian Wilcoxon signed-rank test based on the Dirichlet process". Proceedings of the 30th International Conference on Machine Learning (ICML 2014).
  7. Walley, Peter (1991). Statistical Reasoning with Imprecise Probabilities. London: Chapman and Hall. ISBN 0-412-28660-2.