Overdispersion

In statistics, overdispersion is the presence of greater variability (statistical dispersion) in a data set than would be expected based on a given statistical model.

A common task in applied statistics is choosing a parametric model to fit a given set of empirical observations. This necessitates an assessment of the fit of the chosen model. It is usually possible to choose the model parameters in such a way that the theoretical population mean of the model is approximately equal to the sample mean. However, especially for simple models with few parameters, theoretical predictions may not match empirical observations for higher moments. When the observed variance is higher than the variance of the theoretical model, overdispersion has occurred. Conversely, underdispersion means that there is less variation in the data than predicted. Overdispersion is a very common feature in applied data analysis because, in practice, populations are frequently heterogeneous (non-uniform), contrary to the assumptions implicit in widely used simple parametric models.

Examples

Poisson

Overdispersion is often encountered when fitting very simple parametric models, such as those based on the Poisson distribution. The Poisson distribution has one free parameter and does not allow the variance to be adjusted independently of the mean. The choice of a distribution from the Poisson family is often dictated by the nature of the empirical data; for example, Poisson regression is commonly used to model count data. If overdispersion is a feature, an alternative model with additional free parameters may provide a better fit. In the case of count data, a Poisson mixture model such as the negative binomial distribution can be proposed instead: the mean of the Poisson distribution is itself treated as a random variable, drawn in this case from a gamma distribution, which introduces an additional free parameter (the resulting negative binomial distribution is completely characterized by two parameters).
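The gamma–Poisson mixture can be illustrated with a short simulation. The sketch below uses only the Python standard library; the particular parameter values (a Poisson rate of 5, a gamma mixing distribution with shape 2.5 and scale 2) are illustrative choices, not prescribed by any source.

```python
import math
import random
import statistics

rng = random.Random(42)

def poisson_draw(lam: float) -> int:
    """One Poisson variate via Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        p *= rng.random()
        k += 1
    return k - 1

def dispersion_index(xs) -> float:
    """Variance-to-mean ratio: about 1 for Poisson data, above 1 when overdispersed."""
    return statistics.variance(xs) / statistics.mean(xs)

n = 20_000

# Pure Poisson with rate 5: variance approximately equals the mean, so VMR is near 1.
pure = [poisson_draw(5.0) for _ in range(n)]

# Gamma-Poisson mixture: each observation's rate is itself drawn from a
# Gamma(shape=2.5, scale=2.0) distribution (mean 5).  The marginal counts are
# negative binomial with variance mean * (1 + scale), i.e. VMR near 3.
mixed = [poisson_draw(rng.gammavariate(2.5, 2.0)) for _ in range(n)]

print(f"Poisson VMR       = {dispersion_index(pure):.2f}")
print(f"gamma-Poisson VMR = {dispersion_index(mixed):.2f}")
```

Because the sample variance exceeds the sample mean in the mixed data but not in the pure Poisson data, a variance-to-mean ratio well above 1 is the usual first diagnostic of overdispersed counts.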

Binomial

As a more concrete example, it has been observed that the number of boys born to families does not conform faithfully to a binomial distribution as might be expected. [1] Instead, the sex ratios of families seem to skew toward either boys or girls (see, for example, the Trivers–Willard hypothesis for one possible explanation): there are more all-boy families, more all-girl families, and fewer families close to the population's 51:49 boy-to-girl mean ratio than a binomial distribution would predict. As a result, the empirical variance is larger than a binomial model specifies.

In this case, the beta-binomial distribution is a popular and analytically tractable alternative to the binomial distribution, since it provides a better fit to the observed data. [2] To capture the heterogeneity of the families, one can think of the probability parameter of the binomial model (say, the probability of being a boy) as itself a random variable (i.e., a random effects model), drawn for each family from a beta distribution acting as the mixing distribution. The resulting compound distribution (beta-binomial) has an additional free parameter.
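A minimal simulation makes the beta-binomial mechanism concrete. The values below are assumptions for illustration only: families of 8 children and a Beta(5.1, 4.9) mixing distribution, whose mean a/(a+b) = 0.51 matches the population boy ratio.

```python
import random
import statistics

rng = random.Random(0)
n_children, n_families = 8, 20_000
a, b = 5.1, 4.9  # hypothetical beta mixing parameters; mean a/(a+b) = 0.51

def binomial_draw(n: int, p: float) -> int:
    """Number of 'boy' outcomes in n independent trials with probability p."""
    return sum(rng.random() < p for _ in range(n))

# Homogeneous binomial model: every family shares the same p = 0.51.
homogeneous = [binomial_draw(n_children, 0.51) for _ in range(n_families)]

# Beta-binomial model: each family first draws its own p from Beta(a, b).
heterogeneous = [binomial_draw(n_children, rng.betavariate(a, b))
                 for _ in range(n_families)]

# Binomial variance:      n*p*q                       ~ 2.0
# Beta-binomial variance: n*p*q * (1 + (n-1)/(a+b+1)) ~ 3.3
print(f"binomial variance      = {statistics.variance(homogeneous):.2f}")
print(f"beta-binomial variance = {statistics.variance(heterogeneous):.2f}")
```

The extra factor 1 + (n-1)/(a+b+1) is exactly the overdispersion introduced by the family-to-family variation in p; as a+b grows, the beta distribution concentrates and the model collapses back to the binomial.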

Another common model for overdispersion—when some of the observations are not Bernoulli—arises from introducing a normal random variable into a logistic model. Software is widely available for fitting this type of multilevel model. In this case, if the variance of the normal variable is zero, the model reduces to the standard (undispersed) logistic regression. This model has an additional free parameter, namely the variance of the normal variable.
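The logit-normal mechanism can be sketched in a few lines. This is a simulation under assumed parameter values (groups of 10 trials, a random effect with standard deviation 0 or 1 on the logit scale), not a fitting routine; it only demonstrates that a nonzero random-effect variance inflates the spread of the counts, and that zero variance recovers the plain binomial spread.

```python
import math
import random
import statistics

rng = random.Random(1)

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def simulate_counts(sigma: float, groups: int = 20_000, n: int = 10) -> list:
    """Success counts per group of n Bernoulli trials; each group's success
    probability is logistic(effect), with effect ~ Normal(0, sigma**2)."""
    counts = []
    for _ in range(groups):
        p = logistic(rng.gauss(0.0, sigma))
        counts.append(sum(rng.random() < p for _ in range(n)))
    return counts

# sigma = 0: the model reduces to plain binomial trials with p = 0.5,
# so the count variance is about n*p*q = 2.5.
plain = simulate_counts(0.0)

# sigma = 1: group-to-group variation in p inflates the count variance.
dispersed = simulate_counts(1.0)

print(f"sigma = 0 variance: {statistics.variance(plain):.2f}")
print(f"sigma = 1 variance: {statistics.variance(dispersed):.2f}")
```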

With respect to binomial random variables, the concept of overdispersion makes sense only if n>1 (i.e. overdispersion is nonsensical for Bernoulli random variables).

Normal distribution

As the normal distribution (Gaussian) has variance as a parameter, any data with finite variance (including any finite data) can be modeled with a normal distribution with the exact variance – the normal distribution is a two-parameter model, with mean and variance. Thus, in the absence of an underlying model, there is no notion of data being overdispersed relative to the normal model, though the fit may be poor in other respects (such as the higher moments of skew, kurtosis, etc.). However, in the case that the data is modeled by a normal distribution with an expected variation, it can be over- or under-dispersed relative to that prediction.

For example, in a statistical survey, the margin of error (determined by sample size) predicts the sampling error and hence the dispersion of results on repeated surveys. If one performs a meta-analysis of repeated surveys of a fixed population (say, with a given sample size, so the margin of error is the same), one expects the results to fall on a normal distribution with standard deviation equal to the margin of error. However, in the presence of study heterogeneity, where studies have different sampling biases, the distribution is instead a compound distribution and will be overdispersed relative to the predicted distribution. For example, given repeated opinion polls all with a margin of error of 3%, if they are conducted by different polling organizations, one expects the results to have a standard deviation greater than 3%, due to pollster bias from differing methodologies.
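The polling example can be checked by simulation. The sketch below assumes a hypothetical house-effect standard deviation of 2 percentage points and polls of 1,000 respondents; the house effect is modeled crudely as a shift in each respondent's probability of answering "yes".

```python
import random
import statistics

rng = random.Random(7)
true_share, n_respondents = 0.50, 1_000  # sampling sd ~ sqrt(p*q/n) ~ 0.016

def poll(house_bias: float) -> float:
    """Estimated share from one poll of n_respondents; house_bias shifts each
    respondent's probability of a 'yes' answer (a crude model of pollster bias)."""
    yes = sum(rng.random() < true_share + house_bias
              for _ in range(n_respondents))
    return yes / n_respondents

# No heterogeneity: the spread across polls reflects sampling error alone.
same_pollster = [poll(0.0) for _ in range(2_000)]

# Each poll run by a different pollster with its own house effect
# ~ Normal(0, 0.02): the results form a compound distribution with
# sd ~ sqrt(0.016**2 + 0.02**2) ~ 0.026, wider than sampling error predicts.
many_pollsters = [poll(rng.gauss(0.0, 0.02)) for _ in range(2_000)]

print(f"no heterogeneity sd: {statistics.stdev(same_pollster):.3f}")
print(f"heterogeneous sd:    {statistics.stdev(many_pollsters):.3f}")
```

The two variance components add: the observed spread across heterogeneous polls is the sampling variance plus the between-pollster variance, which is precisely the compound-distribution structure described above.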

Differences in terminology among disciplines

Over- and underdispersion are terms which have been adopted in branches of the biological sciences. In parasitology, the term 'overdispersion' is generally used as defined here – meaning a distribution with a higher than expected variance.

In some areas of ecology, however, meanings have been transposed, so that overdispersion is actually taken to mean more even (lower variance) than expected. This confusion has caused some ecologists to suggest that the terms 'aggregated', or 'contagious', would be better used in ecology for 'overdispersed'. [3] Such preferences are creeping into parasitology too. [4] Generally this suggestion has not been heeded, and confusion persists in the literature.

Furthermore, in demography, overdispersion is often evident in the analysis of death count data, but demographers prefer the term 'unobserved heterogeneity'.


References

  1. Stansfield, William D.; Carlton, Matthew A. (February 2009). "The most widely publicized gender problem in human genetics". Human Biology. 81 (1): 3–11. doi:10.3378/027.081.0101. ISSN 1534-6617. PMID 19589015.
  2. Lindsey, J. K.; Altham, P. M. E. (1998). "Analysis of the Human Sex Ratio by using Overdispersion Models". Journal of the Royal Statistical Society, Series C. 47 (1): 149–157. doi:10.1111/1467-9876.00103. PMID 12293397. S2CID 22354905.
  3. Greig-Smith, P. (1983). Quantitative Plant Ecology (3rd ed.). University of California Press. ISBN 0-632-00142-9.
  4. Poulin, R. (2006). Evolutionary Ecology of Parasites. Princeton University Press. ISBN 9780691120850.