Index of dispersion

In probability theory and statistics, the index of dispersion,[1] dispersion index, coefficient of dispersion, relative variance, or variance-to-mean ratio (VMR), like the coefficient of variation, is a normalized measure of the dispersion of a probability distribution: it is a measure used to quantify whether a set of observed occurrences is clustered or dispersed compared to a standard statistical model.

It is defined as the ratio of the variance ${\displaystyle \sigma ^{2}}$ to the mean ${\displaystyle \mu }$,

${\displaystyle D={\sigma ^{2} \over \mu }.}$

It is also known as the Fano factor, though this term is sometimes reserved for windowed data (where the mean and variance are computed over a subpopulation), the index of dispersion then being the special case in which the window is infinite. Windowing data is common in practice: the VMR is often computed over various intervals in time or small regions in space, which may be called "windows", and the resulting statistic is called the Fano factor.

It is only defined when the mean ${\displaystyle \mu }$ is non-zero, and is generally only used for positive statistics, such as count data or time between events, or where the underlying distribution is assumed to be the exponential distribution or Poisson distribution.
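As a minimal illustration, the ratio can be computed directly from a sample; a sketch using only the Python standard library (the function name is my own):

```python
import statistics

def vmr(data):
    """Variance-to-mean ratio D = sigma^2 / mu, using the population variance.

    Undefined when the mean is zero.
    """
    mu = statistics.fmean(data)
    if mu == 0:
        raise ValueError("VMR is undefined for zero-mean data")
    return statistics.pvariance(data, mu) / mu

# A constant sample has no dispersion at all:
print(vmr([3, 3, 3, 3]))  # -> 0.0
```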

Terminology

In this context, the observed dataset may consist of the times of occurrence of predefined events, such as earthquakes in a given region over a given magnitude, or of the locations in geographical space of plants of a given species. Details of such occurrences are first converted into counts of the numbers of events or occurrences in each of a set of equal-sized time- or space-regions.

The above defines a dispersion index for counts. [2] A different definition applies for a dispersion index for intervals, [3] where the quantities treated are the lengths of the time-intervals between the events. Common usage is that "index of dispersion" means the dispersion index for counts.

Interpretation

Some distributions, most notably the Poisson distribution, have equal variance and mean, giving them a VMR = 1. The geometric distribution and the negative binomial distribution have VMR > 1, while the binomial distribution has VMR < 1, and the constant random variable has VMR = 0. This yields the following table:

Distribution                     VMR
constant random variable         VMR = 0        not dispersed
binomial distribution            0 < VMR < 1    under-dispersed
Poisson distribution             VMR = 1
negative binomial distribution   VMR > 1        over-dispersed

This can be considered analogous to the classification of conic sections by eccentricity; see Cumulants of particular probability distributions for details.
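The rows of this table can be verified from the distributions' closed-form means and variances; a small illustrative sketch (the helper names are my own):

```python
def vmr_binomial(n, p):
    # mean = n*p, variance = n*p*(1-p)  ->  VMR = 1 - p < 1 (under-dispersed)
    return (n * p * (1 - p)) / (n * p)

def vmr_poisson(lam):
    # mean = variance = lam  ->  VMR = 1
    return lam / lam

def vmr_negative_binomial(r, p):
    # failures before the r-th success: mean = r*(1-p)/p, variance = r*(1-p)/p**2
    # ->  VMR = 1/p > 1 (over-dispersed)
    return (r * (1 - p) / p**2) / (r * (1 - p) / p)
```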

The relevance of the index of dispersion is that it has a value of one when the probability distribution of the number of occurrences in an interval is a Poisson distribution. Thus the measure can be used to assess whether observed data can be modeled using a Poisson process. When the coefficient of dispersion is less than 1, a dataset is said to be "under-dispersed": this condition can relate to patterns of occurrence that are more regular than the randomness associated with a Poisson process. For instance, points spread uniformly in space or regular, periodic events will be under-dispersed. If the index of dispersion is larger than 1, a dataset is said to be over-dispersed: this can correspond to the existence of clusters of occurrences. Clumped, concentrated data is over-dispersed.

A sample-based estimate of the dispersion index can be used to construct a formal statistical hypothesis test for the adequacy of the model that a series of counts follow a Poisson distribution. [4] [5] In terms of the interval-counts, over-dispersion corresponds to there being more intervals with low counts and more intervals with high counts, compared to a Poisson distribution: in contrast, under-dispersion is characterised by there being more intervals having counts close to the mean count, compared to a Poisson distribution.
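A minimal sketch of such a test, assuming SciPy is available (the function name and the two-sided p-value construction are my own choices): under the Poisson hypothesis, the statistic sum((x - mean)^2) / mean is approximately chi-squared with n - 1 degrees of freedom.

```python
import statistics
from scipy.stats import chi2

def poisson_dispersion_test(counts):
    """Chi-squared index-of-dispersion test for a series of counts.

    Returns (dispersion index D, two-sided p-value). Small p-values
    indicate either over- or under-dispersion relative to Poisson.
    """
    n = len(counts)
    mean = statistics.fmean(counts)
    stat = sum((x - mean) ** 2 for x in counts) / mean
    d = stat / (n - 1)                 # sample variance divided by mean
    p_upper = chi2.sf(stat, df=n - 1)  # upper tail: evidence of over-dispersion
    p = 2 * min(p_upper, 1 - p_upper)  # two-sided
    return d, min(p, 1.0)

# Strongly over-dispersed counts (two clusters) are flagged:
d, p = poisson_dispersion_test([0] * 10 + [10] * 10)
```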

The VMR is also a good measure of the degree of randomness of a given phenomenon. For example, this technique is commonly used in currency management.

Example

For randomly diffusing particles (Brownian motion), the distribution of the number of particles inside a given volume is Poissonian, i.e. VMR = 1. Therefore, to assess whether a given spatial pattern (assuming there is a way to measure it) is due purely to diffusion or whether some particle-particle interaction is involved: divide the space into patches, quadrats, or sample units (SU), count the number of individuals in each patch or SU, and compute the VMR. VMRs significantly higher than 1 denote a clustered distribution, where random walk alone is not enough to smother the attractive inter-particle potential.

History

The first to discuss the use of a test to detect deviations from a Poisson or binomial distribution appears to have been Lexis in 1877. One of the tests he developed was the Lexis ratio.

This index was first used in botany by Clapham in 1936.

If the variates are Poisson distributed then the index of dispersion is distributed as a χ2 statistic with n - 1 degrees of freedom when n is large and μ > 3. [6] For many cases of interest this approximation is accurate, and Fisher in 1950 derived an exact test for it.

Hoel studied the first four moments of its distribution. [7] He found that the approximation to the χ2 statistic is reasonable if μ > 5.

Skewed distributions

For highly skewed distributions, it may be more appropriate to use a linear loss function, as opposed to a quadratic one. The analogous coefficient of dispersion in this case is the ratio of the average absolute deviation from the median to the median of the data, [8] or, in symbols:

${\displaystyle CD={\frac {1}{n}}{\frac {\sum _{j}{|m-x_{j}|}}{m}}}$

where n is the sample size, m is the sample median, and the sum is taken over the whole sample. Iowa, New York and South Dakota use this linear coefficient of dispersion to estimate taxes due. [9] [10] [11]
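A direct computation of this linear coefficient of dispersion, using only the Python standard library (the function name is my own):

```python
import statistics

def linear_cd(data):
    """Ratio of the mean absolute deviation from the median to the median."""
    m = statistics.median(data)
    return statistics.fmean(abs(x - m) for x in data) / m

# Deviations from the median 3 are [2, 1, 0, 1, 2], so CD = 1.2 / 3 = 0.4:
print(linear_cd([1, 2, 3, 4, 5]))
```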

For a two-sample test in which the sample sizes are large, both samples have the same median, and the samples differ in the dispersion around it, a confidence interval for the linear coefficient of dispersion is bounded below by

${\displaystyle {\frac {t_{a}}{t_{b}}}\exp \left(-z_{\alpha }{\sqrt {\operatorname {var} \left[\log \left({\frac {t_{a}}{t_{b}}}\right)\right]}}\right)}$

where tj is the mean absolute deviation of the jth sample and zα is the standard normal quantile corresponding to confidence level α (e.g., for α = 0.05, zα = 1.96). [8]

Notes

1. Cox & Lewis (1966)
2. Cox & Lewis (1966), p. 72
3. Cox & Lewis (1966), p. 71
4. Cox & Lewis (1966), p. 158
5. Upton & Cook (2006), under "index of dispersion"
6. Frome, E. L. (1982). "Algorithm AS 171: Fisher's Exact Variance Test for the Poisson Distribution". Journal of the Royal Statistical Society, Series C. 31 (1): 67–71. JSTOR 2347079.
7. Hoel, P. G. (1943). "On Indices of Dispersion". Annals of Mathematical Statistics. 14 (2): 155–162. JSTOR 2235818.
8. Bonett, DG; Seier, E (2006). "Confidence interval for a coefficient of dispersion in non-normal distributions". Biometrical Journal. 48 (1): 144–148. doi:10.1002/bimj.200410148. PMID 16544819.
9. "Statistical Calculation Definitions for Mass Appraisal" (PDF). Iowa.gov. Archived from the original (PDF) on 11 November 2010. Median Ratio: The ratio located midway between the highest ratio and the lowest ratio when individual ratios for a class of realty are ranked in ascending or descending order. The median ratio is most frequently used to determine the level of assessment for a given class of real estate.
10. "Assessment equity in New York: Results from the 2010 market value survey". Archived from the original on 6 November 2012.
11. "Summary of the Assessment Process" (PDF). state.sd.us. South Dakota Department of Revenue - Property/Special Taxes Division. Archived from the original (PDF) on 10 May 2009.

References

• Cox, D. R.; Lewis, P. A. W. (1966). The Statistical Analysis of Series of Events. London: Methuen.
• Upton, G.; Cook, I. (2006). Oxford Dictionary of Statistics (2nd ed.). Oxford University Press. ISBN 978-0-19-954145-4.