Fisher's noncentral hypergeometric distribution

[Figure: probability mass function of Fisher's noncentral hypergeometric distribution for different values of the odds ratio ω; m1 = 80, m2 = 60, n = 100, ω = 0.01, ..., 1000]
[Photo: biologist and statistician Ronald Fisher]

In probability theory and statistics, Fisher's noncentral hypergeometric distribution is a generalization of the hypergeometric distribution where sampling probabilities are modified by weight factors. It can also be defined as the conditional distribution of two or more binomially distributed variables dependent upon their fixed sum.


The distribution may be illustrated by the following urn model. Suppose, for example, that an urn contains m1 red balls and m2 white balls, totalling N = m1 + m2 balls. Each red ball has the weight ω1 and each white ball has the weight ω2, and we say that the odds ratio is ω = ω1 / ω2. Balls are now taken at random in such a way that the probability of taking a particular ball is proportional to its weight, but independent of what happens to the other balls. The number of balls taken of a particular color then follows a binomial distribution. If the total number n of balls taken is known, then the conditional distribution of the number of red balls taken, given n, is Fisher's noncentral hypergeometric distribution. To generate this distribution experimentally, we have to repeat the experiment until it happens to give n balls.
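This conditional construction is easy to check by simulation. The following sketch (hypothetical parameter values, plain Python standard library) repeats the two-binomial experiment and keeps only the runs that happen to give n balls; the surviving relative frequencies approximate Fisher's noncentral hypergeometric probabilities:

```python
import random
from collections import Counter

# Hypothetical small example: m1 red and m2 white balls, n balls taken in total.
m1, m2, n = 8, 6, 7
p1, p2 = 0.6, 0.3                          # per-ball taking probabilities
omega = (p1 / (1 - p1)) / (p2 / (1 - p2))  # odds ratio, here 3.5

def binomial(m, p):
    """Number of balls taken out of m, each taken independently with probability p."""
    return sum(random.random() < p for _ in range(m))

counts = Counter()
for _ in range(200_000):
    x1 = binomial(m1, p1)                  # red balls taken
    x2 = binomial(m2, p2)                  # white balls taken
    if x1 + x2 == n:                       # keep only runs that gave n balls in total
        counts[x1] += 1

total = sum(counts.values())
for x in sorted(counts):
    # Relative frequencies approximate Fisher's pmf with parameters m1, m2, n, omega.
    print(x, round(counts[x] / total, 4))
```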

If we want to fix the value of n prior to the experiment then we have to take the balls one by one until we have n balls. The balls are therefore no longer independent. This gives a slightly different distribution known as Wallenius' noncentral hypergeometric distribution. It is far from obvious why these two distributions are different. See the entry for noncentral hypergeometric distributions for an explanation of the difference between these two distributions and a discussion of which distribution to use in various situations.

The two distributions are both equal to the (central) hypergeometric distribution when the odds ratio is 1.

Unfortunately, both distributions are known in the literature as "the" noncentral hypergeometric distribution. It is important to be specific about which distribution is meant when using this name.

Fisher's noncentral hypergeometric distribution was first given the name extended hypergeometric distribution (Harkness, 1965), and some authors still use this name today.

Univariate distribution

Univariate Fisher's noncentral hypergeometric distribution

Parameters: $m_1, m_2 \in \mathbb{N}$; $N = m_1 + m_2$; $n \in [0, N)$; $\omega \in \mathbb{R}_{>0}$

Support: $x \in \{\max(0, n - m_2), \ldots, \min(n, m_1)\}$

PMF: $\Pr(X = x) = \dfrac{\binom{m_1}{x} \binom{m_2}{n-x} \omega^x}{P_0}$, where $P_0 = \sum_{y} \binom{m_1}{y} \binom{m_2}{n-y} \omega^y$ (the sum running over the support)

Mean: $\dfrac{P_1}{P_0}$, where $P_k = \sum_{y} \binom{m_1}{y} \binom{m_2}{n-y} \omega^y \, y^k$

Mode: $\left\lfloor \dfrac{-2c}{b - \sqrt{b^2 - 4ac}} \right\rfloor$, where $a = \omega - 1$, $b = -\left((m_1 + n + 2)\,\omega + m_2 - n\right)$, $c = (m_1 + 1)(n + 1)\,\omega$

Variance: $\dfrac{P_2}{P_0} - \left(\dfrac{P_1}{P_0}\right)^2$, where $P_k$ is given above

The probability function, mean and variance are given in the table above.

An alternative expression of the distribution has both the number of balls taken of each color and the number of balls not taken as random variables, whereby the expression for the probability becomes symmetric.

The calculation time for the probability function can be high when the sum in P0 has many terms. The calculation time can be reduced by calculating the terms in the sum recursively relative to the term for y = x and ignoring negligible terms in the tails (Liao and Rosen, 2001).
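A minimal sketch of that idea, assuming the pmf above (this is not the exact algorithm of Liao and Rosen; for illustration the recursion here starts at the mode rather than at the term for y = x, and the function name and tolerance are invented):

```python
def fisher_nchg_pmf(m1, m2, n, omega, tol=1e-16):
    """Build the unnormalized terms of the sum recursively outward from the
    largest one, ignore negligible tails, then normalize at the end."""
    lo, hi = max(0, n - m2), min(n, m1)

    def ratio(y):
        # Ratio of consecutive unnormalized terms f(y) / f(y - 1).
        return omega * (m1 - y + 1) * (n - y + 1) / (y * (m2 - n + y))

    mode = lo
    while mode < hi and ratio(mode + 1) > 1.0:   # walk up while terms grow
        mode += 1
    terms = {mode: 1.0}                          # scale relative to the modal term
    t = 1.0
    for y in range(mode + 1, hi + 1):            # right tail
        t *= ratio(y)
        if t < tol:
            break
        terms[y] = t
    t = 1.0
    for y in range(mode - 1, lo - 1, -1):        # left tail
        t /= ratio(y + 1)
        if t < tol:
            break
        terms[y] = t
    p0 = sum(terms.values())
    return {y: v / p0 for y, v in terms.items()}

pmf = fisher_nchg_pmf(80, 60, 100, 2.0)  # figure's m1, m2, n; omega chosen here
print(sum(pmf.values()))                 # -> 1.0 (up to rounding)
```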

The mean can be approximated by:

$$\mu \approx \frac{-2c}{b - \sqrt{b^2 - 4ac}},$$

where $a = \omega - 1$, $b = -\left((m_1 + n)\,\omega + m_2 - n\right)$, $c = m_1 n \,\omega$.

The variance can be approximated by:

$$\sigma^2 \approx \frac{N}{N-1} \left( \frac{1}{\mu} + \frac{1}{m_1 - \mu} + \frac{1}{n - \mu} + \frac{1}{m_2 - n + \mu} \right)^{-1},$$

where $\mu$ is the approximate mean above.
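In Python, the two approximations read as follows (a sketch; the function name is invented, and no attempt is made to guard against boundary cases such as n = 0):

```python
import math

def approx_mean_var(m1, m2, n, omega):
    """Approximate mean and variance from the closed-form expressions above."""
    N = m1 + m2
    a = omega - 1.0
    b = -((m1 + n) * omega + m2 - n)   # b < 0, so the root formula below is safe
    c = m1 * n * omega
    mu = -2.0 * c / (b - math.sqrt(b * b - 4.0 * a * c))
    var = (N / (N - 1)) / (1.0 / mu + 1.0 / (m1 - mu)
                           + 1.0 / (n - mu) + 1.0 / (m2 - n + mu))
    return mu, var

# Example with the figure's parameters and a hypothetical omega:
print(approx_mean_var(80, 60, 100, 2.0))
# For omega = 1 the formula reduces to the central hypergeometric mean n*m1/N.
```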

Better approximations to the mean and variance are given by Levin (1984, 1990), McCullagh and Nelder (1989), Liao (1992), and Eisinga and Pelzer (2011). The saddlepoint methods for approximating the mean and variance suggested by Eisinga and Pelzer (2011) give extremely accurate results.

Properties

The following symmetry relations apply, writing the probability function as $\operatorname{fnchypg}(x; n, m_1, N, \omega)$:

$$\operatorname{fnchypg}(x; n, m_1, N, \omega) = \operatorname{fnchypg}(n - x;\, n, m_2, N, 1/\omega) \quad \text{(swapping colors)}$$
$$= \operatorname{fnchypg}(m_1 - x;\, N - n, m_1, N, 1/\omega) \quad \text{(swapping taken and not-taken balls)}$$
$$= \operatorname{fnchypg}(x;\, m_1, n, N, \omega) \quad \text{(swapping } n \text{ and } m_1\text{)}$$

Recurrence relation:

$$\operatorname{fnchypg}(x; n, m_1, N, \omega) = \operatorname{fnchypg}(x - 1;\, n, m_1, N, \omega)\, \frac{(m_1 - x + 1)(n - x + 1)}{x\,(m_2 - n + x)}\, \omega$$

Based on the abbreviation fnchypg used above, the distribution has affectionately been called "finchy-pig".

Derivation

The univariate noncentral hypergeometric distribution may be derived alternatively as a conditional distribution in the context of two binomially distributed random variables, for example when considering the response to a particular treatment in two different groups of patients participating in a clinical trial. An important application of the noncentral hypergeometric distribution in this context is the computation of exact confidence intervals for the odds ratio comparing treatment response between the two groups.

Suppose X and Y are binomially distributed random variables counting the number of responders in two corresponding groups of size $m_X$ and $m_Y$ respectively,

$$X \sim \operatorname{Bin}(m_X, p_X), \qquad Y \sim \operatorname{Bin}(m_Y, p_Y).$$

Their odds ratio is given as

$$\omega = \frac{\omega_X}{\omega_Y} = \frac{p_X / (1 - p_X)}{p_Y / (1 - p_Y)}.$$

The responder prevalences $p_X, p_Y$ are fully defined in terms of the odds $\omega_X = \frac{p_X}{1 - p_X}$ and $\omega_Y = \frac{p_Y}{1 - p_Y}$, which correspond to the sampling bias in the urn scheme above, i.e.

$$p_X = \frac{\omega_X}{1 + \omega_X}, \qquad p_Y = \frac{\omega_Y}{1 + \omega_Y}.$$

The trial can be summarized and analyzed in terms of the following contingency table.

            responder    non-responder    Total
Group X     x            ·                m_X
Group Y     y            ·                m_Y
Total       n            ·                N

In the table, $n = x + y$ corresponds to the total number of responders across groups, and $N = m_X + m_Y$ to the total number of patients recruited into the trial. The dots denote the corresponding frequency counts of no further relevance.

The sampling distribution of the responders in group X conditional upon the trial outcome and the prevalences, $P(X = x \mid X + Y = n,\, \omega)$, is noncentral hypergeometric:

$$P(X = x \mid X + Y = n) = \frac{\binom{m_X}{x} \binom{m_Y}{n - x}\, \omega^x}{\sum_{u} \binom{m_X}{u} \binom{m_Y}{n - u}\, \omega^u}.$$

Note that the denominator is essentially just the numerator, summed over all events $(x', y')$ of the joint sample space for which $x' + y' = n$. Writing each binomial probability in terms of the odds, e.g. $p_X^x (1 - p_X)^{m_X - x} = (1 - p_X)^{m_X} \omega_X^{\,x}$, the factors independent of $x$, namely $(1 - p_X)^{m_X} (1 - p_Y)^{m_Y} \omega_Y^{\,n}$, can be factored out of the sum and cancel with the numerator, leaving only $\omega^x = (\omega_X / \omega_Y)^x$.
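The cancellation can be verified numerically; the sketch below (hypothetical group sizes and prevalences) computes the conditional distribution directly from the two binomial pmfs and checks it against the noncentral hypergeometric expression:

```python
from math import comb

# Hypothetical group sizes and response probabilities for a numeric check.
mX, mY, n = 8, 6, 7
pX, pY = 0.6, 0.3
omega = (pX / (1 - pX)) / (pY / (1 - pY))

def binom_pmf(k, m, p):
    return comb(m, k) * p**k * (1 - p)**(m - k)

support = range(max(0, n - mY), min(n, mX) + 1)

# P(X = x | X + Y = n), computed directly from the two binomials.
joint = {x: binom_pmf(x, mX, pX) * binom_pmf(n - x, mY, pY) for x in support}
z = sum(joint.values())
conditional = {x: v / z for x, v in joint.items()}

# Fisher's noncentral hypergeometric pmf with the same mX, mY, n, omega.
terms = {x: comb(mX, x) * comb(mY, n - x) * omega**x for x in support}
p0 = sum(terms.values())
fisher = {x: t / p0 for x, t in terms.items()}

for x in support:
    assert abs(conditional[x] - fisher[x]) < 1e-12
print("conditional binomials and Fisher's pmf agree on", list(support))
```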

Multivariate distribution

Multivariate Fisher's noncentral hypergeometric distribution

Parameters: $c \in \mathbb{N}$ (number of colors); $\mathbf{m} = (m_1, \ldots, m_c) \in \mathbb{N}^c$; $N = \sum_{i=1}^{c} m_i$; $n \in [0, N)$; $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_c) \in \mathbb{R}_{>0}^{c}$

Support: $S = \left\{ \mathbf{x} \in \mathbb{Z}_{\ge 0}^{c} : \sum_{i=1}^{c} x_i = n,\ x_i \le m_i \right\}$

PMF: $\Pr(\mathbf{X} = \mathbf{x}) = \dfrac{1}{P_0} \prod_{i=1}^{c} \binom{m_i}{x_i} \omega_i^{x_i}$, where $P_0 = \sum_{\mathbf{y} \in S} \prod_{i=1}^{c} \binom{m_i}{y_i} \omega_i^{y_i}$

Mean: The mean $\mu_i$ of $x_i$ can be approximated by $\mu_i = \dfrac{m_i r \omega_i}{1 + r \omega_i}$, where $r$ is the unique positive solution to $\sum_{i=1}^{c} \mu_i = n$.

The distribution can be extended to any number c of colors of balls in the urn. The multivariate distribution is used when there are more than two colors.

The probability function and a simple approximation to the mean are given in the table above. Better approximations to the mean and variance are given by McCullagh and Nelder (1989).
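The mean approximation in the table requires solving a one-dimensional equation for r. A sketch using simple bisection (function name and example values invented; this works because the left-hand side is increasing in r):

```python
def multivariate_mean_approx(m, omega, n):
    """Approximate means mu_i = m_i*r*w_i / (1 + r*w_i) for the multivariate
    Fisher distribution, solving sum_i mu_i = n for r by bisection."""
    def total(r):
        return sum(mi * r * wi / (1 + r * wi) for mi, wi in zip(m, omega))

    lo, hi = 0.0, 1.0
    while total(hi) < n:          # bracket the root; total(r) increases with r
        hi *= 2.0
    for _ in range(200):          # bisect until the bracket is tiny
        mid = (lo + hi) / 2.0
        if total(mid) < n:
            lo = mid
        else:
            hi = mid
    r = (lo + hi) / 2.0
    return [mi * r * wi / (1 + r * wi) for mi, wi in zip(m, omega)]

# Hypothetical example with three colors; the returned means sum to n = 100.
print(multivariate_mean_approx(m=[80, 60, 40], omega=[2.0, 1.0, 0.5], n=100))
```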

Properties

The order of the colors is arbitrary so that any colors can be swapped.

The weights can be arbitrarily scaled:

$$\operatorname{mfnchypg}(\mathbf{x}; n, \mathbf{m}, \boldsymbol{\omega}) = \operatorname{mfnchypg}(\mathbf{x}; n, \mathbf{m}, t\boldsymbol{\omega}) \quad \text{for all } t > 0.$$

Colors with zero number (mi = 0) or zero weight (ωi = 0) can be omitted from the equations.

Colors with the same weight can be joined:

$$\operatorname{mfnchypg}\left(\mathbf{x};\, n, \mathbf{m}, (\omega_1, \ldots, \omega_{c-1}, \omega_{c-1})\right) = \operatorname{mfnchypg}\left((x_1, \ldots, x_{c-2}, x_{c-1} + x_c);\, n, (m_1, \ldots, m_{c-2}, m_{c-1} + m_c), (\omega_1, \ldots, \omega_{c-1})\right) \cdot \operatorname{hypg}(x_c;\, x_{c-1} + x_c, m_c, m_{c-1} + m_c),$$

where $\operatorname{hypg}(x; n, m, N)$ is the (univariate, central) hypergeometric distribution probability.

Applications

Fisher's noncentral hypergeometric distribution is useful for models of biased sampling or biased selection where the individual items are sampled independently of each other with no competition. The bias or odds can be estimated from an experimental value of the mean. Use Wallenius' noncentral hypergeometric distribution instead if items are sampled one by one with competition.
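For example, inverting the approximate mean relation from the univariate section gives a simple moment-type estimate of ω from an observed mean (a sketch; the function name is invented, and the observed mean must lie strictly inside the support):

```python
def odds_from_mean(m1, m2, n, xbar):
    """Moment-type estimate of the odds ratio omega from an observed mean
    count xbar of the first color, by inverting the approximate mean relation
    omega = xbar*(m2 - n + xbar) / ((m1 - xbar)*(n - xbar))."""
    if not (max(0, n - m2) < xbar < min(n, m1)):
        raise ValueError("xbar must lie strictly inside the support")
    return xbar * (m2 - n + xbar) / ((m1 - xbar) * (n - xbar))

# Example: with m1 = 80, m2 = 60, n = 100 and observed mean n*m1/N = 8000/140,
# the estimate recovers omega = 1 (the central hypergeometric case).
print(odds_from_mean(80, 60, 100, 8000 / 140))
```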

Fisher's noncentral hypergeometric distribution is used mostly for tests in contingency tables where a conditional distribution for fixed margins is desired. This can be useful, for example, for testing or measuring the effect of a medicine. See McCullagh and Nelder (1989).



References

Breslow, N. E.; Day, N. E. (1980), Statistical Methods in Cancer Research, Lyon: International Agency for Research on Cancer.

Eisinga, R.; Pelzer, B. (2011), "Saddlepoint approximations to the mean and variance of the extended hypergeometric distribution" (PDF), Statistica Neerlandica, vol. 65, no. 1, pp. 22–31, doi:10.1111/j.1467-9574.2010.00468.x .

Fog, A. (2007), Random number theory.

Fog, A. (2008), "Sampling Methods for Wallenius' and Fisher's Noncentral Hypergeometric Distributions", Communications in Statistics – Simulation and Computation, vol. 37, no. 2, pp. 241–257, doi:10.1080/03610910701790236, S2CID 14904723.
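
Harkness, W. L. (1965), "Properties of the extended hypergeometric distribution", Annals of Mathematical Statistics, vol. 36, no. 3, pp. 938–945.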

Johnson, N. L.; Kemp, A. W.; Kotz, S. (2005), Univariate Discrete Distributions, Hoboken, New Jersey: Wiley and Sons.

Levin, B. (1984), "Simple Improvements on Cornfield's approximation to the mean of a noncentral Hypergeometric random variable", Biometrika, vol. 71, no. 3, pp. 630–632, doi:10.1093/biomet/71.3.630 .

Levin, B. (1990), "The saddlepoint correction in conditional logistic likelihood analysis", Biometrika, vol. 77, no. 2, pp. 275–285, doi:10.1093/biomet/77.2.275, JSTOR 2336805.

Liao, J. (1992), "An Algorithm for the Mean and Variance of the Noncentral Hypergeometric Distribution", Biometrics, vol. 48, no. 3, pp. 889–892, doi:10.2307/2532354, JSTOR 2532354.

Liao, J. G.; Rosen, O. (2001), "Fast and Stable Algorithms for Computing and Sampling from the Noncentral Hypergeometric Distribution", The American Statistician, vol. 55, no. 4, pp. 366–369, doi:10.1198/000313001753272547, S2CID   121279235 .

McCullagh, P.; Nelder, J. A. (1989), Generalized Linear Models, 2nd ed., London: Chapman and Hall.