

In population genetics, the **Watterson estimator** is a method for describing the genetic diversity in a population. It was developed by Margaret Wu and G. A. Watterson in the 1970s.^{ [1] } It is estimated by counting the number of polymorphic sites. It is a measure of the "population mutation rate" (the product of the effective population size and the neutral mutation rate) from the observed nucleotide diversity of a population, $\theta = 4N_e\mu$, where $N_e$ is the effective population size and $\mu$ is the per-generation mutation rate of the population of interest (Watterson (1975)). The assumptions made are that there is a sample of $n$ haploid individuals from the population of interest, that there are infinitely many sites capable of varying (so that mutations never overlay or reverse one another), and that $n \ll N_e$. Because the number of segregating sites counted will increase with the number of sequences looked at, the correction factor $a_n$ is used.

**Population genetics** is a subfield of genetics that deals with genetic differences within and between populations, and is a part of evolutionary biology. Studies in this branch of biology examine such phenomena as adaptation, speciation, and population structure.

**Genetic diversity** is the total number of genetic characteristics in the genetic makeup of a species. It is distinguished from genetic variability, which describes the tendency of genetic characteristics to vary.

**Margaret Wu** is an Australian statistician and psychometrician who specialises in educational measurement. Wu helped to design the Watterson estimator, which can be used to determine the genetic diversity of a population. She is an emeritus professor at the University of Melbourne.

The estimate of $\theta$, often denoted as $\hat{\theta}_w$, is

$$\hat{\theta}_w = \frac{K}{a_n},$$

where $K$ is the number of segregating sites (an example of a segregating site would be a single-nucleotide polymorphism) in the sample and

A **single-nucleotide polymorphism**, often abbreviated to **SNP**, is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population.

$a_n = \sum_{i=1}^{n-1} \frac{1}{i}$ is the $(n-1)$th harmonic number.

In mathematics, the *n*-th **harmonic number** is the sum of the reciprocals of the first *n* natural numbers:

$$H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} = \sum_{k=1}^{n} \frac{1}{k}.$$
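The estimator above can be sketched directly in code. This is a minimal illustration of the formula, not an implementation from any particular library (the function names are my own):

```python
def harmonic_number(m):
    """Sum of the reciprocals of the first m natural numbers, H_m."""
    return sum(1.0 / i for i in range(1, m + 1))

def watterson_theta(segregating_sites, num_sequences):
    """Watterson's estimate of theta: the segregating-site count K divided
    by the correction factor a_n = H_{n-1} for a sample of n sequences."""
    if num_sequences < 2:
        raise ValueError("need at least two sequences")
    a_n = harmonic_number(num_sequences - 1)
    return segregating_sites / a_n
```

For example, 5 segregating sites observed in a sample of 5 sequences give $a_n = 25/12$ and an estimate of exactly 2.4.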

This estimate is based on coalescent theory. Watterson's estimator is commonly used for its simplicity. When its assumptions are met, the estimator is unbiased and the variance of the estimator decreases with increasing sample size or recombination rate. However, the estimator can be biased by population structure. For example, $\hat{\theta}_w$ is downwardly biased in an exponentially growing population. It can also be biased by violation of the infinite-sites mutational model; if multiple mutations can overwrite one another, Watterson's estimator will be biased downward.

**Coalescent theory** is a model of how gene variants sampled from a population may have originated from a common ancestor. In the simplest case, coalescent theory assumes no recombination, no natural selection, and no gene flow or population structure, meaning that each variant is equally likely to have been passed from one generation to the next. The model looks backward in time, merging alleles into a single ancestral copy according to a random process in coalescence events. Under this model, the expected time between successive coalescence events increases almost exponentially back in time. Variance in the model comes from both the random passing of alleles from one generation to the next, and the random occurrence of mutations in these alleles.

In statistics, the **bias** of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called **unbiased**. Otherwise the estimator is said to be **biased**. In statistics, "bias" is an objective property of an estimator, and while not a desired property, it is not pejorative, unlike the ordinary English use of the term "bias".

In probability theory and statistics, **variance** is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value. Variance has a central role in statistics, where some ideas that use it include descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling. Variance is an important tool in the sciences, where statistical analysis of data is common. The variance is the square of the standard deviation, the second central moment of a distribution, and the covariance of the random variable with itself, and it is often represented by $\sigma^2$, $s^2$, or $\operatorname{Var}(X)$.

Comparing the value of Watterson's estimator, $\hat{\theta}_w$, to nucleotide diversity, $\hat{\theta}_\pi$, is the basis of Tajima's D, which allows inference of the evolutionary regime of a given locus.

**Tajima's D** is a population genetic test statistic created by and named after the Japanese researcher Fumio Tajima. Tajima's D is computed as the difference between two measures of genetic diversity: the mean number of pairwise differences and the number of segregating sites, each scaled so that they are expected to be the same in a neutrally evolving population of constant size.
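As a sketch of how the two diversity measures in Tajima's D compare in practice, the following computes their difference for a set of aligned sequences. The full statistic also divides by an estimate of this quantity's standard deviation, which is omitted here; function names are illustrative:

```python
from itertools import combinations

def tajima_d_numerator(sequences):
    """Difference between the mean number of pairwise differences (pi)
    and the scaled number of segregating sites (S / a_n).
    Sequences are equal-length strings of aligned bases."""
    n = len(sequences)
    pairs = list(combinations(sequences, 2))
    # mean number of pairwise nucleotide differences
    pi = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)
    # number of segregating (polymorphic) sites
    s = sum(len(set(column)) > 1 for column in zip(*sequences))
    a_n = sum(1.0 / i for i in range(1, n))
    return pi - s / a_n
```

Under neutral evolution at constant population size the two scaled measures have the same expectation, so this difference is expected to be near zero.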

In probability theory, the **coupon collector's problem** describes "collect all coupons and win" contests. It asks the following question: If each box of a brand of cereals contains a coupon, and there are *n* different types of coupons, what is the probability that more than *t* boxes need to be bought to collect all *n* coupons? An alternative statement is: Given *n* coupons, how many coupons do you expect you need to draw with replacement before having drawn each coupon at least once? The mathematical analysis of the problem reveals that the expected number of trials needed grows as $\Theta(n \log n)$. For example, when *n* = 50 it takes about 225 trials on average to collect all 50 coupons.

In statistics, an **estimator** is a rule for calculating an estimate of a given quantity based on observed data: thus the rule, the quantity of interest and its result are distinguished.

In statistics, **maximum likelihood estimation** (**MLE**) is a method of estimating the parameters of a statistical model so that the observed data is most probable. Specifically, this is done by finding the value of the parameter that maximizes the likelihood function $\mathcal{L}(\theta \mid y)$, which is the joint probability of the observed data $y$, over a parameter space $\Theta$. The point $\hat{\theta}$ that maximizes the likelihood function is called the **maximum likelihood estimate**. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of inference within much of the quantitative research of the social and medical sciences.
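As a concrete sketch, for Bernoulli (0/1) data the maximizer of the likelihood has a closed form: the sample mean. The names below are illustrative, not from any library:

```python
import math

def log_likelihood(p, data):
    """Log of the joint probability of 0/1 observations under Bernoulli(p)."""
    return sum(math.log(p if x == 1 else 1 - p) for x in data)

def bernoulli_mle(data):
    """For Bernoulli data the sample mean maximizes the likelihood."""
    return sum(data) / len(data)
```

Evaluating the log-likelihood at the MLE and at nearby parameter values confirms that the sample mean is the maximizer.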

In statistics, the **mean squared error** (**MSE**) or **mean squared deviation** (**MSD**) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.
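A useful identity implicit in this definition is that the MSE decomposes into the estimator's variance plus its squared bias. A small numerical sketch (illustrative names):

```python
def mse(estimates, truth):
    """Average squared difference between the estimates and the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def bias_and_variance(estimates, truth):
    """Bias of the average estimate, and the spread of estimates around it."""
    mean = sum(estimates) / len(estimates)
    variance = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean - truth, variance
```

For any collection of estimates of a known true value, `mse` equals `bias**2 + variance`, which is why MSE is strictly positive whenever the estimator is either biased or variable.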

In estimation theory and statistics, the **Cramér–Rao bound (CRB)**, **Cramér–Rao lower bound (CRLB)**, **Cramér–Rao inequality**, **Fréchet–Darmois–Cramér–Rao inequality**, or **information inequality** expresses a lower bound on the variance of unbiased estimators of a deterministic parameter. This term is named in honor of Harald Cramér, Calyampudi Radhakrishna Rao, Maurice Fréchet and Georges Darmois, all of whom independently derived this limit to statistical precision in the 1940s.

In statistics, a **consistent estimator** or **asymptotically consistent estimator** is an estimator—a rule for computing estimates of a parameter *θ*_{0}—having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to *θ*_{0}. This means that the distributions of the estimates become more and more concentrated near the true value of the parameter being estimated, so that the probability of the estimator being arbitrarily close to *θ*_{0} converges to one.

In Bayesian statistics, a **maximum a posteriori probability** (**MAP**) **estimate** is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

In statistics, the **method of moments** is a technique for estimating population parameters by equating sample moments with the corresponding theoretical moments of the assumed distribution.

In statistics, **M-estimators** are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called **M-estimation**.

In statistics, **bootstrapping** is any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods. Generally, it falls in the broader class of resampling methods.
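A minimal sketch of the bootstrap idea, here used to estimate the standard error of a statistic (the names are illustrative; seeding the generator keeps the sketch reproducible):

```python
import random

def bootstrap_se(data, statistic, n_resamples=1000, seed=0):
    """Standard error of a statistic, estimated by recomputing it on
    resamples drawn with replacement from the observed data."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]
        stats.append(statistic(resample))
    mean = sum(stats) / len(stats)
    return (sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)) ** 0.5
```

The spread of the recomputed statistics across resamples approximates the sampling distribution of the statistic itself.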

In estimation theory and decision theory, a **Bayes estimator** or a **Bayes action** is an estimator or decision rule that minimizes the posterior expected value of a loss function. Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.

In statistics, the **jackknife** is a resampling technique especially useful for variance and bias estimation. The jackknife pre-dates other common resampling methods such as the bootstrap. The jackknife estimator of a parameter is found by systematically leaving out each observation from a dataset and calculating the estimate and then finding the average of these calculations. Given a sample of size $n$, the jackknife estimate is found by aggregating the estimates of each $(n-1)$-sized sub-sample.
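The leave-one-out procedure can be sketched as follows (illustrative names):

```python
def jackknife(data, statistic):
    """Average of the statistic computed over every leave-one-out
    sub-sample of the data."""
    n = len(data)
    return sum(statistic(data[:i] + data[i + 1:]) for i in range(n)) / n
```

For the sample mean the jackknife simply reproduces the sample mean itself; its value lies in estimating the bias and variance of less well-behaved statistics.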

In probability and statistics, the **Tweedie distributions** are a family of probability distributions which include the purely continuous normal and gamma distributions, the purely discrete scaled Poisson distribution, and the class of mixed compound Poisson–gamma distributions which have positive mass at zero, but are otherwise continuous. For any random variable *Y* that obeys a Tweedie distribution, the variance var(*Y*) relates to the mean E(*Y*) by the power law $\operatorname{var}(Y) = a[\operatorname{E}(Y)]^p$, where $a$ and $p$ are positive constants.

In statistics, **Fisher consistency**, named after Ronald Fisher, is a desirable property of an estimator asserting that if the estimator were calculated using the entire population rather than a sample, the true value of the estimated parameter would be obtained.

The **HKA Test**, named after Richard R. Hudson, Martin Kreitman, and Montserrat Aguadé, is a statistical test used in genetics to evaluate the predictions of the Neutral Theory of molecular evolution. By comparing the polymorphism within each species and the divergence observed between two species at two or more loci, the test can determine whether the observed difference is likely due to neutral evolution or rather due to adaptive evolution. Developed in 1987, the HKA test is a precursor to the McDonald-Kreitman test, which was derived in 1991. The HKA test is best used to look for balancing selection, recent selective sweeps or other variation-reducing forces.

In population genetics, the **allele frequency spectrum**, sometimes called the **site frequency spectrum**, is the distribution of the allele frequencies of a given set of loci in a population or sample. Because an allele frequency spectrum is often a summary of or compared to sequenced samples of the whole population, it is a histogram with size depending on the number of sequenced individual chromosomes. Each entry in the frequency spectrum records the total number of loci with the corresponding derived allele frequency. Loci contributing to the frequency spectrum are assumed to be independently changing in frequency. Furthermore, loci are assumed to be biallelic, although extensions for multiallelic frequency spectra exist.
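Building such a histogram from 0/1-coded data can be sketched as below, assuming rows are sampled chromosomes, columns are biallelic loci, and 1 marks the derived allele (the function name is illustrative):

```python
def site_frequency_spectrum(chromosomes):
    """Count loci by derived-allele frequency: entry i is the number of
    loci at which exactly i of the sampled chromosomes carry the
    derived allele (0 = ancestral, 1 = derived)."""
    n = len(chromosomes)
    spectrum = [0] * (n + 1)
    for column in zip(*chromosomes):  # one column per locus
        spectrum[sum(column)] += 1
    return spectrum
```

The resulting list has one entry per possible derived-allele count, from 0 up to the number of sampled chromosomes.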

The **Infinite sites model** (ISM) is a mathematical model of molecular evolution first proposed by Motoo Kimura in 1969. Like other mutation models, the ISM provides a basis for understanding how mutation develops new alleles in DNA sequences. Using allele frequencies, it allows for the calculation of heterozygosity, or genetic diversity, in a finite population and for the estimation of genetic distances between populations of interest.

**Two-step M-estimators** deal with M-estimation problems that require a preliminary estimation step to obtain the parameter of interest. Two-step M-estimation differs from the usual M-estimation problem because the asymptotic distribution of the second-step estimator generally depends on the first-step estimator. Accounting for this change in asymptotic distribution is important for valid inference.

- ↑ Yong, Ed (2019-02-11). "The Women Who Contributed to Science but Were Buried in Footnotes". *The Atlantic*. Retrieved 2019-02-13.

- Watterson, G. A. (1975), "On the number of segregating sites in genetical models without recombination", *Theoretical Population Biology*, **7** (2): 256–276, doi:10.1016/0040-5809(75)90020-9, PMID 1145509.
- McVean, Gil; Awadalla, Philip; Fearnhead, Paul (2002), "A Coalescent-Based Method for Detecting and Estimating Recombination From Gene Sequences", *Genetics*, **160**: 1231–1241.

In computing, a **Digital Object Identifier** or **DOI** is a persistent identifier or handle used to identify objects uniquely, standardized by the International Organization for Standardization (ISO). An implementation of the Handle System, DOIs are in wide use mainly to identify academic, professional, and government information, such as journal articles, research reports and data sets, and official publications though they also have been used to identify other types of information resources, such as commercial videos.

This page is based on this Wikipedia article

Text is available under the CC BY-SA 4.0 license; additional terms may apply.

Images, videos and audio are available under their respective licenses.
