Tajima's D

Tajima's D is a population genetic test statistic created by and named after the Japanese researcher Fumio Tajima. [1] Tajima's D is computed as the difference between two measures of genetic diversity: the mean number of pairwise differences and the number of segregating sites, each scaled so that they are expected to be the same in a neutrally evolving population of constant size.

The purpose of Tajima's D test is to distinguish between a DNA sequence evolving randomly ("neutrally") and one evolving under a non-random process, including directional selection or balancing selection, demographic expansion or contraction, genetic hitchhiking, or introgression. A randomly evolving DNA sequence contains mutations with no effect on the fitness and survival of an organism. The randomly evolving mutations are called "neutral", while mutations under selection are "non-neutral". For example, a mutation that causes prenatal death or severe disease would be expected to be under selection. In the population as a whole, the frequency of a neutral mutation fluctuates randomly (i.e. the percentage of individuals in the population with the mutation changes from one generation to the next, and this percentage is equally likely to go up or down) through genetic drift.

The strength of genetic drift depends on population size. If a population is at a constant size with constant mutation rate, the population will reach an equilibrium of gene frequencies. This equilibrium has important properties, including the number of segregating sites $S$, and the number of nucleotide differences between pairs sampled (these are called pairwise differences). To standardize the pairwise differences, the mean or 'average' number of pairwise differences is used. This is simply the sum of the pairwise differences divided by the number of pairs, and is often symbolized by $\pi$.

The purpose of Tajima's test is to identify sequences which do not fit the neutral theory model at equilibrium between mutation and genetic drift. In order to perform the test on a DNA sequence or gene, you need to sequence homologous DNA for at least 3 individuals. Tajima's statistic computes a standardized measure of the total number of segregating sites (these are DNA sites that are polymorphic) in the sampled DNA and the average number of mutations between pairs in the sample. The two quantities whose values are compared are both method of moments estimates of the population genetic parameter theta, and so are expected to equal the same value. If these two numbers only differ by as much as one could reasonably expect by chance, then the null hypothesis of neutrality cannot be rejected. Otherwise, the null hypothesis of neutrality is rejected.

Scientific explanation

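Under the neutral theory model, for a population at constant size at equilibrium:

$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 4N\mu$$

for diploid DNA, and

$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 2N\mu$$

for haploid.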

In the above formulas, $S$ is the number of segregating sites, $n$ is the number of samples, $N$ is the effective population size, $\mu$ is the mutation rate at the examined genomic locus, and $i$ is the index of summation. But selection, demographic fluctuations and other violations of the neutral model (including rate heterogeneity and introgression) will change the expected values of $S$ and $\pi$, so that they are no longer expected to be equal. The difference in the expectations for these two variables (which can be positive or negative) is the crux of Tajima's D test statistic.

$D$ is calculated from the difference between the two estimates of the population genetics parameter $\theta$. This difference is called $d$, and $D$ is obtained by dividing $d$ by the square root of its variance, $\sqrt{\hat{V}(d)}$ (its standard deviation, by definition).

Fumio Tajima demonstrated by computer simulation that the $D$ statistic described above could be modeled using a beta distribution. If the $D$ value for a sample of sequences is outside the confidence interval, then one can reject the null hypothesis of neutral mutation for the sequence in question. However, in real-world uses, one must be careful, as past population changes (for instance, a population bottleneck) can bias the value of the $D$ statistic. [2]

Mathematical details

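$$D = \frac{d}{\sqrt{\hat{V}(d)}} = \frac{\hat{k} - \frac{S}{a_1}}{\sqrt{\hat{V}(d)}}$$

where $a_1 = \sum_{i=1}^{n-1} \frac{1}{i}$.

$\hat{k}$ and $\frac{S}{a_1}$ are two estimates of the expected number of single nucleotide polymorphisms (SNPs) between two DNA sequences under the neutral mutation model in a sample of size $n$ from an effective population size $N$.

The first estimate is the average number of SNPs found in $\binom{n}{2}$ pairwise comparisons of sequences $(i, j)$ in the sample,

$$\hat{k} = \frac{\sum_{i<j} k_{ij}}{\binom{n}{2}}.$$

The second estimate is derived from the expected value of $S$, the total number of polymorphisms in the sample:

$$E(S) = a_1 M.$$

Tajima defines $M = 4N\mu$, whereas Hartl & Clark use a different symbol to define the same parameter, $\theta = 4N\mu$.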
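The variance estimate in the denominator of D is not spelled out in this section. For reference, it is usually written with the constants below, as given in Tajima (1989) [1]. They are reproduced here from the standard statement of the test rather than derived from this text, so they should be checked against the original paper before being relied on.

```latex
\begin{aligned}
\hat{V}(d) &= e_1 S + e_2 S (S - 1), \\
e_1 &= \frac{c_1}{a_1}, \qquad e_2 = \frac{c_2}{a_1^2 + a_2}, \\
c_1 &= b_1 - \frac{1}{a_1}, \qquad c_2 = b_2 - \frac{n + 2}{a_1 n} + \frac{a_2}{a_1^2}, \\
b_1 &= \frac{n + 1}{3 (n - 1)}, \qquad b_2 = \frac{2 (n^2 + n + 3)}{9 n (n - 1)}, \\
a_1 &= \sum_{i=1}^{n-1} \frac{1}{i}, \qquad a_2 = \sum_{i=1}^{n-1} \frac{1}{i^2}.
\end{aligned}
```

All constants depend only on the sample size $n$, so the variance estimate is a function of $n$ and $S$ alone.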

Example

Suppose you are a geneticist studying an unknown gene. As part of your research you get DNA samples from four random people (plus yourself). For simplicity, you label your sequence as a string of zeroes, and for the other four people you put a zero when their DNA is the same as yours and a one when it is different. (For this example, the specific type of difference is not important.)

                    1           2
Position  12345 67890 12345 67890
Person Y  00000 00000 00000 00000
Person A  00100 00000 00100 00010
Person B  00000 00000 00100 00010
Person C  00000 01000 00000 00010
Person D  00000 01000 00100 00010

Notice the four polymorphic sites (positions where someone differs from you, at 3, 7, 13 and 19 above). Now compare each pair of sequences and get the average number of polymorphisms between two sequences. There are "five choose two" (ten) comparisons that need to be done.

Person Y is you!

You vs A: 3 polymorphisms

Person Y     00000 00000 00000 00000
Person A     00100 00000 00100 00010

You vs B: 2 polymorphisms

Person Y     00000 00000 00000 00000
Person B     00000 00000 00100 00010

You vs C: 2 polymorphisms

Person Y     00000 00000 00000 00000
Person C     00000 01000 00000 00010

You vs D: 3 polymorphisms

Person Y     00000 00000 00000 00000
Person D     00000 01000 00100 00010

A vs B: 1 polymorphism

Person A     00100 00000 00100 00010
Person B     00000 00000 00100 00010

A vs C: 3 polymorphisms

Person A     00100 00000 00100 00010
Person C     00000 01000 00000 00010

A vs D: 2 polymorphisms

Person A     00100 00000 00100 00010
Person D     00000 01000 00100 00010

B vs C: 2 polymorphisms

Person B     00000 00000 00100 00010
Person C     00000 01000 00000 00010

B vs D: 1 polymorphism

Person B     00000 00000 00100 00010
Person D     00000 01000 00100 00010

C vs D: 1 polymorphism

Person C     00000 01000 00000 00010
Person D     00000 01000 00100 00010


The average number of polymorphisms is $\frac{3+2+2+3+1+3+2+2+1+1}{10} = 2$.

The second estimate of $\theta$ at equilibrium is $M = S/a_1$.

Since there were $n = 5$ individuals and $S = 4$ segregating sites,

$$a_1 = \frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} \approx 2.08$$

$$M = \frac{4}{2.08} \approx 1.92$$

The lower-case $d$ described above is the difference between these two numbers: the average number of polymorphisms found in pairwise comparison (2) and $M$. Thus $d = 2 - 1.92 = 0.08$.

Since this is a statistical test, you need to assess the significance of this value. A discussion of how to do this is provided below.
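As a cross-check on the arithmetic in this example, the following minimal Python sketch recomputes the two estimates and then goes one step further to D itself. The function name `tajimas_d` and the variable `sample` are illustrative, and the variance constants are taken from the usual statement of Tajima (1989) [1] rather than from this section. The `n >= 4` guard is this sketch's own choice: with only three sequences the two estimates coincide at every two-state site, so D would be 0/0.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return (mean pairwise differences, S, S/a1, D) for aligned sequences."""
    n = len(seqs)
    if n < 4:
        raise ValueError("this sketch needs at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    # Segregating sites: alignment columns showing more than one state.
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)

    # Mean number of differences over all n-choose-2 pairs.
    pairs = list(combinations(seqs, 2))
    k_hat = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants that depend only on the sample size (Tajima 1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    m_hat = S / a1                    # second estimate, M = S / a1
    d = k_hat - m_hat                 # lower-case d
    variance = e1 * S + e2 * S * (S - 1)
    D = d / sqrt(variance) if variance > 0 else float("nan")
    return k_hat, S, m_hat, D


# The five sequences from the example (spaces removed before use).
sample = [s.replace(" ", "") for s in (
    "00000 00000 00000 00000",  # Person Y
    "00100 00000 00100 00010",  # Person A
    "00000 00000 00100 00010",  # Person B
    "00000 01000 00000 00010",  # Person C
    "00000 01000 00100 00010",  # Person D
)]

k_hat, S, m_hat, D = tajimas_d(sample)
print(f"k_hat={k_hat:.2f}  S={S}  M={m_hat:.2f}  D={D:.2f}")
```

For these five sequences the sketch reproduces the values worked out by hand (mean pairwise difference 2, S = 4, M = 1.92) and gives D of about 0.27.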

Interpreting Tajima's D

A negative Tajima's D signifies an excess of low-frequency polymorphisms relative to expectation, indicating population size expansion (e.g., after a bottleneck or a selective sweep). A positive Tajima's D signifies a deficit of both low- and high-frequency polymorphisms (that is, an excess of intermediate-frequency variants), indicating a decrease in population size and/or balancing selection. However, calculating a conventional "p-value" associated with any Tajima's D value that is obtained from a sample is impossible. Briefly, this is because there is no way to describe the distribution of the statistic that is independent of the true, and unknown, theta parameter (no pivotal quantity exists). To circumvent this issue, several options have been proposed.

| Value of Tajima's D | Mathematical reason | Biological interpretation 1 | Biological interpretation 2 |
| --- | --- | --- | --- |
| Tajima's D = 0 | Theta-Pi equivalent to Theta-k (Observed = Expected). Average heterozygosity = number of segregating sites. | Observed variation similar to expected variation | Population evolving as per mutation-drift equilibrium. No evidence of selection |
| Tajima's D < 0 | Theta-Pi less than Theta-k (Observed < Expected). Fewer haplotypes (lower average heterozygosity) than number of segregating sites. | Rare alleles abundant (excess of rare alleles) | Recent selective sweep, population expansion after a recent bottleneck, linkage to a swept gene |
| Tajima's D > 0 | Theta-Pi greater than Theta-k (Observed > Expected). More haplotypes (more average heterozygosity) than number of segregating sites. | Rare alleles scarce (lack of rare alleles) | Balancing selection, sudden population contraction |

However, this interpretation should be made only if the D-value is deemed statistically significant.

Determining significance

When performing a statistical test such as Tajima's D, the critical question is whether the value calculated for the statistic is unexpected under a null process. For Tajima's D, the magnitude of the statistic is expected to increase the more the data deviates from a pattern expected under a population evolving according to the standard coalescent model.

Tajima (1989) found an empirical similarity between the distribution of the test statistic and a beta distribution with mean zero and variance one. He estimated theta by taking Watterson's estimator and dividing it by the number of samples. Simulations have shown this distribution to be conservative, [3] and now that computing power is more readily available, this approximation is not frequently used.

A more nuanced approach was presented in a paper by Simonsen et al. [4] These authors advocated constructing a confidence interval for the true theta value, and then performing a grid search over this interval to obtain the critical values at which the statistic is significant below a particular alpha value. An alternative approach is for the investigator to perform the grid search over the values of theta which they believe to be plausible based on their knowledge of the organism under study. Bayesian approaches are a natural extension of this method.
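The grid-search idea can be illustrated with a small neutral coalescent simulation. The sketch below is not the procedure of Simonsen et al. itself: it assumes a constant-size population, no recombination and the infinite-sites model, takes the theta grid as given, and combines grid points by keeping the widest interval, which is one conservative choice among several. All function names are hypothetical, and the variance constants follow the usual statement of Tajima (1989) [1].

```python
import random
from math import sqrt


def d_statistic(k_hat, S, n):
    """Tajima's D from mean pairwise differences, segregating sites and n >= 4."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    variance = (c1 / a1) * S + (c2 / (a1**2 + a2)) * S * (S - 1)
    return (k_hat - S / a1) / sqrt(variance)


def poisson(rng, lam):
    """Poisson draw: count unit-rate exponential arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_d(n, theta, rng):
    """One neutral replicate; returns D, or None if no site segregates."""
    sizes = [1] * n          # sampled sequences descending from each lineage
    S, k_hat = 0, 0.0
    while len(sizes) > 1:
        k = len(sizes)
        # Time with k lineages, in units where each pair coalesces at rate 1.
        t = rng.expovariate(k * (k - 1) / 2)
        for j in sizes:
            # Mutations fall on each lineage at rate theta / 2 per time unit.
            m = poisson(rng, theta / 2 * t)
            S += m
            # A mutation carried by j of n sequences separates j * (n - j) pairs.
            k_hat += m * 2 * j * (n - j) / (n * (n - 1))
        a, b = sorted(rng.sample(range(k), 2))
        merged = sizes.pop(b)    # b > a, so index a is unaffected
        sizes[a] += merged
    return d_statistic(k_hat, S, n) if S > 0 else None


def critical_values(n, theta_grid, reps=2000, alpha=0.05, seed=1):
    """Widest two-sided (lower, upper) critical values of D over the grid."""
    rng = random.Random(seed)
    lowers, uppers = [], []
    for theta in theta_grid:
        draws = (simulate_d(n, theta, rng) for _ in range(reps))
        ds = sorted(d for d in draws if d is not None)
        if not ds:
            continue             # theta too small to produce any variation
        lowers.append(ds[int(len(ds) * alpha / 2)])
        uppers.append(ds[int(len(ds) * (1 - alpha / 2))])
    return min(lowers), max(uppers)
```

A call such as `critical_values(10, [2.0, 5.0, 10.0])` returns the pair of bounds outside which an observed D would be called significant under these assumptions. The grid values are placeholders for whatever range of theta the investigator considers plausible.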

A very rough rule of thumb for significance is that values greater than +2 or less than -2 are likely to be significant. This rule is based on an appeal to asymptotic properties of some statistics, and thus ±2 does not actually represent a critical value for a significance test.

Finally, genome-wide scans of Tajima's D in sliding windows along a chromosomal segment are often performed. With this approach, those regions that have a value of D that greatly deviates from the bulk of the empirical distribution of all such windows are reported as significant. This method does not assess significance in the traditional statistical sense, but is quite powerful given a large genomic region, and is unlikely to falsely identify interesting regions of a chromosome if only the greatest outliers are reported.
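A sliding-window scan of this kind can be sketched as follows. This is a minimal illustration, not a description of any particular tool: windows are defined over alignment columns rather than genomic coordinates, the names `sliding_d` and `most_extreme` are hypothetical, and ranking windows by their distance from the median D is just one simple way to pick out the greatest outliers. Per-site diversity is computed from allele counts instead of explicit pairwise comparisons, which gives the same mean pairwise difference.

```python
from collections import Counter
from math import sqrt
from statistics import median


def variance_constants(n):
    """Return (a1, e1, e2), which depend only on the sample size n."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    return a1, c1 / a1, c2 / (a1**2 + a2)


def window_d(columns, n, a1, e1, e2):
    """Tajima's D for one window, or None if nothing segregates in it."""
    S, k_hat = 0, 0.0
    for column in columns:
        counts = Counter(column).values()
        if len(counts) > 1:
            S += 1
            # Fraction of sequence pairs that differ at this site.
            k_hat += (n * n - sum(c * c for c in counts)) / (n * (n - 1))
    if S == 0:
        return None
    return (k_hat - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))


def sliding_d(seqs, window, step):
    """List of (window start column, D) along aligned sequences."""
    n = len(seqs)
    if n < 4:
        raise ValueError("this sketch needs at least 4 sequences")
    a1, e1, e2 = variance_constants(n)
    columns = list(zip(*seqs))
    return [
        (start, window_d(columns[start:start + window], n, a1, e1, e2))
        for start in range(0, len(columns) - window + 1, step)
    ]


def most_extreme(scan, k):
    """The k windows whose D lies furthest from the median of the scan."""
    scored = [(start, d) for start, d in scan if d is not None]
    if not scored:
        return []
    centre = median(d for _, d in scored)
    return sorted(scored, key=lambda w: abs(w[1] - centre), reverse=True)[:k]
```

In practice the window and step sizes, and how many outlying windows to report, are choices the investigator has to justify for the data at hand.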

References

  1. Tajima, F. (Nov 1989). "Statistical method for testing the neutral mutation hypothesis by DNA polymorphism". Genetics. 123 (3): 585–95. PMC 1203831. PMID 2513255.
  2. Elgvin, Tore O.; Trier, Cassandra N.; Tørresen, Ole K.; Hagen, Ingerid J.; Lien, Sigbjørn; Nederbragt, Alexander J.; Ravinet, Mark; Jensen, Henrik; Sætre, Glenn-Peter (2 June 2017). "The genomic mosaicism of hybrid speciation". Science Advances. 3 (6). doi:10.1126/sciadv.1602996. eISSN 2375-2548. PMC 5470830. PMID 28630911.
  3. Fu, YX.; Li, WH. (Mar 1993). "Statistical tests of neutrality of mutations". Genetics. 133 (3): 693–709. PMC 1205353. PMID 8454210.
  4. Simonsen, KL.; Churchill, GA.; Aquadro, CF. (Sep 1995). "Properties of statistical tests of neutrality for DNA polymorphism data". Genetics. 141 (1): 413–29. PMC 1206737. PMID 8536987.
