Watterson estimator

Last updated

In population genetics, the Watterson estimator is a method for describing the genetic diversity in a population. It was developed by Margaret Wu and G. A. Watterson in the 1970s. [1] [2] It is estimated by counting the number of polymorphic sites. It is a measure of the "population mutation rate" (the product of the effective population size and the neutral mutation rate) from the observed nucleotide diversity of a population. , [3] where is the effective population size and is the per-generation mutation rate of the population of interest (Watterson (1975) ). The assumptions made are that there is a sample of haploid individuals from the population of interest, that there are infinitely many sites capable of varying (so that mutations never overlay or reverse one another), and that . Because the number of segregating sites counted will increase with the number of sequences looked at, the correction factor is used.

The estimate of , often denoted as , is

where is the number of segregating sites (an example of a segregating site would be a single-nucleotide polymorphism) in the sample and

is the th harmonic number.

This estimate is based on coalescent theory. Watterson's estimator is commonly used for its simplicity. When its assumptions are met, the estimator is unbiased and the variance of the estimator decreases with increasing sample size or recombination rate. However, the estimator can be biased by population structure. For example, is downwardly biased in an exponentially growing population. It can also be biased by violation of the infinite-sites mutational model; if multiple mutations can overwrite one another, Watterson's estimator will be biased downward.

Comparing the value of the Watterson's estimator, to nucleotide diversity is the basis of Tajima's D which allows inference of the evolutionary regime of a given locus.

See also

Related Research Articles

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule, the quantity of interest and its result are distinguished. For example, the sample mean is a commonly used estimator of the population mean.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk, as an estimate of the true MSE.

Nucleotide diversity is a concept in molecular genetics which is used to measure the degree of polymorphism within a population.

In population genetics, Ewens's sampling formula describes the probabilities associated with counts of how many different alleles are observed a given number of times in the sample.

<span class="mw-page-title-main">Consistent estimator</span> Statistical estimator converging in probability to a true parameter as sample size increases

In statistics, a consistent estimator or asymptotically consistent estimator is an estimator—a rule for computing estimates of a parameter θ0—having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to θ0. This means that the distributions of the estimates become more and more concentrated near the true value of the parameter being estimated, so that the probability of the estimator being arbitrarily close to θ0 converges to one.

The effective population size (Ne) is the size of an idealised population that would experience the same rate of genetic drift or increase in inbreeding as the real population. Idealised populations are based on unrealistic but convenient assumptions including random mating, simultaneous birth of each new generation, and constant population size. For most quantities of interest and most real populations, Ne is smaller than the census population size N of a real population. The same population may have multiple effective population sizes for different properties of interest, including genetic drift and inbreeding.

In statistics, the method of moments is a method of estimation of population parameters. The same principle is used to derive higher moments like skewness and kurtosis.

Coalescent theory is a model of how alleles sampled from a population may have originated from a common ancestor. In the simplest case, coalescent theory assumes no recombination, no natural selection, and no gene flow or population structure, meaning that each variant is equally likely to have been passed from one generation to the next. The model looks backward in time, merging alleles into a single ancestral copy according to a random process in coalescence events. Under this model, the expected time between successive coalescence events increases almost exponentially back in time. Variance in the model comes from both the random passing of alleles from one generation to the next, and the random occurrence of mutations in these alleles.

Bootstrapping is a procedure for estimating the distribution of an estimator by resampling one's data or a model estimated from the data. Bootstrapping assigns measures of accuracy to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.

In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function. Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased.

<span class="mw-page-title-main">Jackknife resampling</span> Statistical method for resampling

In statistics, the jackknife is a cross-validation technique and, therefore, a form of resampling. It is especially useful for bias and variance estimation. The jackknife pre-dates other common resampling methods such as the bootstrap. Given a sample of size , a jackknife estimator can be built by aggregating the parameter estimates from each subsample of size obtained by omitting one observation.

Tajima's D is a population genetic test statistic created by and named after the Japanese researcher Fumio Tajima. Tajima's D is computed as the difference between two measures of genetic diversity: the mean number of pairwise differences and the number of segregating sites, each scaled so that they are expected to be the same in a neutrally evolving population of constant size.

In probability and statistics, the Tweedie distributions are a family of probability distributions which include the purely continuous normal, gamma and inverse Gaussian distributions, the purely discrete scaled Poisson distribution, and the class of compound Poisson–gamma distributions which have positive mass at zero, but are otherwise continuous. Tweedie distributions are a special case of exponential dispersion models and are often used as distributions for generalized linear models.

In statistics, Fisher consistency, named after Ronald Fisher, is a desirable property of an estimator asserting that if the estimator were calculated using the entire population rather than a sample, the true value of the estimated parameter would be obtained.

In statistics, efficiency is a measure of quality of an estimator, of an experimental design, or of a hypothesis testing procedure. Essentially, a more efficient estimator needs fewer input data or observations than a less efficient one to achieve the Cramér–Rao bound. An efficient estimator is characterized by having the smallest possible variance, indicating that there is a small deviance between the estimated value and the "true" value in the L2 norm sense.

The HKA Test, named after Richard R. Hudson, Martin Kreitman, and Montserrat Aguadé, is a statistical test used in genetics to evaluate the predictions of the Neutral Theory of molecular evolution. By comparing the polymorphism within each species and the divergence observed between two species at two or more loci, the test can determine whether the observed difference is likely due to neutral evolution or rather due to adaptive evolution. Developed in 1987, the HKA test is a precursor to the McDonald-Kreitman test, which was derived in 1991. The HKA test is best used to look for balancing selection, recent selective sweeps or other variation-reducing forces.

In population genetics, the allele frequency spectrum, sometimes called the site frequency spectrum, is the distribution of the allele frequencies of a given set of loci in a population or sample. Because an allele frequency spectrum is often a summary of or compared to sequenced samples of the whole population, it is a histogram with size depending on the number of sequenced individual chromosomes. Each entry in the frequency spectrum records the total number of loci with the corresponding derived allele frequency. Loci contributing to the frequency spectrum are assumed to be independently changing in frequency. Furthermore, loci are assumed to be biallelic, although extensions for multiallelic frequency spectra exist.

The Infinite sites model (ISM) is a mathematical model of molecular evolution first proposed by Motoo Kimura in 1969. Like other mutation models, the ISM provides a basis for understanding how mutation develops new alleles in DNA sequences. Using allele frequencies, it allows for the calculation of heterozygosity, or genetic diversity, in a finite population and for the estimation of genetic distances between populations of interest.

References

  1. Yong, Ed (2019-02-11). "The Women Who Contributed to Science but Were Buried in Footnotes". The Atlantic. Retrieved 2019-02-13.
  2. Rohlfs, Rori V.; Huerta-Sánchez, Emilia; Catalan, Francisca; Castellanos, Edgar; Thu, Ricky; Reyes, Rochelle-Jan; Barragan, Ezequiel Lopez; López, Andrea; Dung, Samantha Kristin (2019-02-01). "Illuminating Women's Hidden Contribution to Historical Theoretical Population Genetics". Genetics. 211 (2): 363–366. doi:10.1534/genetics.118.301277. ISSN   0016-6731. PMC   6366915 . PMID   30733376.
  3. Luca Ferretti, Luca (2015). "A generalized Watterson estimator for next-generation sequencing: From trios to autopolyploids" (PDF). Theoretical Population Biology. 100: 79–87. doi:10.1016/j.tpb.2015.01.001. PMID   25595553.