Allele frequency spectrum

In population genetics, the allele frequency spectrum, sometimes called the site frequency spectrum, is the distribution of the allele frequencies of a given set of loci (often SNPs) in a population or sample. [1] [2] [3] [4] Because an allele frequency spectrum usually summarizes, or is compared with, a sequenced sample from the population, it is a histogram whose size depends on the number of sequenced chromosomes. Each entry in the frequency spectrum records the total number of loci with the corresponding derived allele frequency. Loci contributing to the frequency spectrum are assumed to change in frequency independently of one another. Furthermore, loci are assumed to be biallelic (that is, with exactly two alleles present), although extensions for multiallelic frequency spectra exist. [5]

Many summary statistics of observed genetic variation are themselves summaries of the allele frequency spectrum, including estimates of $\theta$ such as Watterson's $\theta_W$ and Tajima's $\theta_\pi$, Tajima's D, Fay and Wu's H and the fixation index $F_{ST}$. [6]
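As an illustration of how such statistics reduce to functions of the spectrum, the following minimal sketch computes Watterson's estimator and Tajima's estimator (the mean number of pairwise differences) from an unfolded spectrum. The function names and the input spectrum are hypothetical and chosen only for illustration; both values are per-locus totals rather than per-site rates.

```python
from math import comb

def watterson_theta(sfs):
    """Watterson's estimator: number of segregating sites divided by a_n."""
    n = len(sfs) + 1                            # sfs holds counts for i = 1..n-1
    a_n = sum(1.0 / i for i in range(1, n))     # harmonic number 1 + 1/2 + ... + 1/(n-1)
    return sum(sfs) / a_n

def tajima_pi(sfs):
    """Tajima's estimator: mean number of pairwise differences."""
    n = len(sfs) + 1
    # A site with i derived copies differs in i * (n - i) of the C(n, 2) pairs.
    weighted = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1))
    return weighted / comb(n, 2)

sfs = [4, 2, 1, 0, 1]        # hypothetical unfolded spectrum, n = 6 chromosomes
theta_w = watterson_theta(sfs)
theta_pi = tajima_pi(sfs)
print(theta_w, theta_pi)
```

The numerator of Tajima's D is the difference `theta_pi - theta_w`, which is why that statistic is also a function of the spectrum alone.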

Example

The allele frequency spectrum from a sample of $n$ chromosomes is calculated by counting the number of sites with derived allele frequencies $1, 2, \ldots, n-1$. For example, consider a sample of $n = 6$ individuals with eight observed variable sites. In this table, a 1 indicates that the derived allele is observed at that site, while a 0 indicates the ancestral allele was observed.

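|          | SNP 1 | SNP 2 | SNP 3 | SNP 4 | SNP 5 | SNP 6 | SNP 7 | SNP 8 |
|----------|-------|-------|-------|-------|-------|-------|-------|-------|
| Sample 1 | 0     | 1     | 0     | 0     | 0     | 0     | 1     | 0     |
| Sample 2 | 1     | 0     | 1     | 0     | 0     | 0     | 1     | 0     |
| Sample 3 | 0     | 1     | 1     | 0     | 0     | 1     | 0     | 0     |
| Sample 4 | 0     | 0     | 0     | 0     | 1     | 0     | 1     | 1     |
| Sample 5 | 0     | 0     | 1     | 0     | 0     | 0     | 1     | 0     |
| Sample 6 | 0     | 0     | 0     | 1     | 0     | 1     | 1     | 0     |
| Total    | 1     | 2     | 3     | 1     | 1     | 2     | 5     | 1     |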

The allele frequency spectrum can be written as the vector $\mathbf{x} = (x_1, x_2, x_3, x_4, x_5)$, where $x_i$ is the number of observed sites with derived allele frequency $i$. In this example, the observed allele frequency spectrum is $(4, 2, 1, 0, 1)$: four SNP loci carry a single copy of the derived allele, two carry two copies, one carries three, none carries four, and one carries five.
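The counting in this example can be sketched in a few lines of Python. The function name `site_frequency_spectrum` is illustrative only, and the sketch assumes haploid samples coded 0 (ancestral) or 1 (derived), as in the table.

```python
# Rows are samples, columns are SNPs; 1 = derived allele, 0 = ancestral allele.
genotypes = [
    [0, 1, 0, 0, 0, 0, 1, 0],  # Sample 1
    [1, 0, 1, 0, 0, 0, 1, 0],  # Sample 2
    [0, 1, 1, 0, 0, 1, 0, 0],  # Sample 3
    [0, 0, 0, 0, 1, 0, 1, 1],  # Sample 4
    [0, 0, 1, 0, 0, 0, 1, 0],  # Sample 5
    [0, 0, 0, 1, 0, 1, 1, 0],  # Sample 6
]

def site_frequency_spectrum(rows):
    """Return per-site derived allele counts and the unfolded spectrum x_1..x_{n-1}."""
    n = len(rows)
    derived_counts = [sum(column) for column in zip(*rows)]
    sfs = [0] * (n - 1)
    for count in derived_counts:
        if 0 < count < n:          # only segregating sites enter the spectrum
            sfs[count - 1] += 1
    return derived_counts, sfs

counts, sfs = site_frequency_spectrum(genotypes)
print(counts)  # [1, 2, 3, 1, 1, 2, 5, 1]
print(sfs)     # [4, 2, 1, 0, 1]
```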

Calculation

The expected allele frequency spectrum may be calculated using either a coalescent or diffusion approach. [7] [8] The demographic history of a population and natural selection affect allele frequency dynamics, and these effects are reflected in the shape of the allele frequency spectrum. For the simple case of selectively neutral alleles segregating in a population that has reached demographic equilibrium (that is, without recent population size changes or gene flow), the expected allele frequency spectrum for a sample of size $n$ is given by

$$E[x_i] = \frac{\theta}{i}, \qquad i = 1, 2, \ldots, n-1,$$

where $\theta$ is the population scaled mutation rate. Deviations from demographic equilibrium or neutrality will change the shape of the expected frequency spectrum.
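A minimal sketch of the neutral equilibrium expectation, using the standard result that the expected number of sites with $i$ derived copies is $\theta / i$. The function name and the value of `theta` are assumptions made for illustration.

```python
def expected_neutral_sfs(n, theta):
    """Expected unfolded spectrum for n chromosomes under neutrality and constant size."""
    # Entry i (1-based) is theta / i, for i = 1 .. n-1.
    return [theta / i for i in range(1, n)]

expected = expected_neutral_sfs(n=6, theta=3.0)
print(expected)
```

Singletons are expected to be the most common class, with counts falling off as $1/i$; an excess of rare variants relative to this curve is a typical signature of recent population growth or purifying selection.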

Calculating the frequency spectrum from observed sequence data requires one to be able to distinguish the ancestral and derived (mutant) alleles, often by comparing to an outgroup sequence. For example, in human population genetic studies, the homologous chimpanzee reference sequence is typically used to estimate the ancestral allele. However, sometimes the ancestral allele cannot be determined, in which case the folded allele frequency spectrum may be calculated instead. The folded frequency spectrum stores the observed counts of the minor (less common) allele frequencies. The folded spectrum can be calculated by binning together the $i$th and $(n-i)$th entries from the unfolded spectrum, where $n$ is the number of sampled chromosomes.
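The folding step can be sketched as follows. The function name `fold_spectrum` is illustrative, and the input is assumed to be an unfolded spectrum listing counts for derived allele frequencies $1$ to $n-1$.

```python
def fold_spectrum(unfolded):
    """Fold an unfolded spectrum x_1..x_{n-1} into minor allele count classes."""
    n = len(unfolded) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        if i == j:
            # Middle class when n is even: it has no partner to merge with.
            folded.append(unfolded[i - 1])
        else:
            folded.append(unfolded[i - 1] + unfolded[j - 1])
    return folded

folded = fold_spectrum([4, 2, 1, 0, 1])
print(folded)  # [5, 2, 1]
```

The folded spectrum has $\lfloor n/2 \rfloor$ entries and the same total number of segregating sites as the unfolded one.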

Multi-population allele frequency spectrum

The joint allele frequency spectrum (JAFS) is the joint distribution of allele frequencies across two or more related populations. The JAFS for $d$ populations, with $n_j$ sampled chromosomes in the $j$th population, is a $d$-dimensional histogram, in which each entry stores the total number of segregating sites in which the derived allele is observed with the corresponding frequency in each population. Each axis of the histogram corresponds to a population, and indices run from $0 \leq i \leq n_j$ for the $j$th population. [9] [10]

Example

Suppose we sequence diploid individuals from two populations, 4 individuals from population 1 and 2 individuals from population 2. The JAFS would be a $9 \times 5$ matrix, indexed from zero. The $[3, 2]$ entry would record the number of observed polymorphic loci with derived allele frequency 3 in population 1 and frequency 2 in population 2. The $[1, 0]$ entry would record those loci with observed frequency 1 in population 1, and frequency 0 in population 2. The $[8, 3]$ entry would record those loci with the derived allele fixed in population 1 (seen in all chromosomes), and with frequency 3 in population 2.
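A minimal sketch of how such a matrix could be tallied from per-site derived allele counts. The function name and the five per-site counts are hypothetical; sites that are monomorphic across both populations are skipped here, which is one common convention rather than a requirement.

```python
def joint_sfs(counts_pop1, counts_pop2, n1, n2):
    """Tally a two-population JAFS as an (n1 + 1) x (n2 + 1) matrix of site counts."""
    jafs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for c1, c2 in zip(counts_pop1, counts_pop2):
        if (c1, c2) in ((0, 0), (n1, n2)):
            continue               # not segregating in the combined sample
        jafs[c1][c2] += 1
    return jafs

n1, n2 = 8, 4                      # 4 and 2 diploid individuals -> 8 and 4 chromosomes
counts_pop1 = [3, 1, 8, 3, 1]      # hypothetical derived allele counts, population 1
counts_pop2 = [2, 0, 3, 2, 0]      # hypothetical derived allele counts, population 2
jafs = joint_sfs(counts_pop1, counts_pop2, n1, n2)

print(len(jafs), len(jafs[0]))     # 9 5
print(jafs[3][2], jafs[1][0], jafs[8][3])
```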

Applications

The shape of the allele frequency spectrum is sensitive to demography, such as population size changes, migration, and substructure, as well as natural selection. By comparing observed data summarized in a frequency spectrum to the expected frequency spectrum calculated under a given demographic and selection model, one can assess the goodness of fit of that model to the data, and use likelihood theory to estimate the best fit parameters of the model.

For example, suppose a population experienced a recent period of exponential growth, sample sequences were obtained from the population at the end of the growth, and the observed (data) allele frequency spectrum was calculated using putatively neutral variation. The demographic model would have parameters for the exponential growth rate $R$, the time $T$ for which the growth occurred, and a reference population size $N_{ref}$, assuming that the population was at equilibrium when the growth began. The expected frequency spectrum for a given parameter set can be obtained using either diffusion or coalescent theory, and compared to the data frequency spectrum. The best fit parameters can be found using maximum likelihood.
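The fitting procedure can be illustrated with a deliberately simplified sketch. It does not implement the exponential growth model; instead it fits the single parameter $\theta$ of the constant-size neutral model, whose expected spectrum has the closed form $\theta / i$. It also assumes that each spectrum entry is an independent Poisson count (a common composite-likelihood assumption), and it uses a plain grid search in place of a numerical optimizer. All names and data are hypothetical.

```python
from math import lgamma, log

def expected_neutral_sfs(n, theta):
    """Expected unfolded spectrum under neutrality and constant size: theta / i."""
    return [theta / i for i in range(1, n)]

def poisson_log_likelihood(observed, expected):
    """Log-likelihood treating each spectrum entry as an independent Poisson count."""
    return sum(x * log(m) - m - lgamma(x + 1) for x, m in zip(observed, expected))

observed = [4, 2, 1, 0, 1]                   # hypothetical data spectrum, n = 6
n = len(observed) + 1

# Grid search over theta in steps of 0.01 (assumed range 0.01 to 10.0).
grid = [k / 100 for k in range(1, 1001)]
best_theta = max(
    grid,
    key=lambda t: poisson_log_likelihood(observed, expected_neutral_sfs(n, t)),
)
print(best_theta)
```

For this one-parameter model the maximum can also be found analytically: it equals the number of segregating sites divided by $1 + 1/2 + \cdots + 1/(n-1)$, which coincides with Watterson's estimator. A growth model would replace `expected_neutral_sfs` with an expectation computed from diffusion or coalescent theory and search over its parameters in the same way.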

This approach has been used to infer demographic and selection models for many species, including humans. For example, Marth et al. (2004) used the single population allele frequency spectra for a group of Africans, Europeans, and Asians to show that population bottlenecks have occurred in the Asian and European demographic histories, but not in the Africans. [11] More recently, Gutenkunst et al. (2009) used the joint allele frequency spectrum for these same three populations to infer the time at which the populations diverged and the amount of subsequent ongoing migration between them (see out of Africa hypothesis). [10] Additionally, these methods may be used to estimate patterns of selection from allele frequency data. For example, Boyko et al. (2008) inferred the distribution of fitness effects for newly arising mutations using human polymorphism data that controlled for the effects of non-equilibrium demography. [12]

References

  1. Fisher, Ronald A. (1930). "The distribution of gene ratios for rare mutations". Proceedings of the Royal Society of Edinburgh. 50: 205–220.
  2. Wright, Sewall (1938). "The distribution of gene frequencies under irreversible mutation". Proc. Natl. Acad. Sci. USA. 24 (7): 253–259. Bibcode:1938PNAS...24..253W. doi:10.1073/pnas.24.7.253. PMC 1077089. PMID 16577841.
  3. Kimura, Motoo (1964). "Diffusion models in population genetics". J. Appl. Probab. 1 (2): 177–232. doi:10.2307/3211856. JSTOR 3211856. S2CID 86705023.
  4. Evans, Steven N.; Shvets, Yelena; Slatkin, Montgomery (2007). "Non-equilibrium theory of the allele frequency spectrum". Theoretical Population Biology. 71 (1): 109–119. arXiv:q-bio/0604010. doi:10.1016/j.tpb.2006.06.005. PMID 16887160. S2CID 924340.
  5. Jenkins, Paul A.; Mueller, Jonas W.; Song, Yun S. (2014). "General triallelic frequency spectrum under demographic models with variable population size". Genetics. 196 (1): 295–311. arXiv:1310.3444. doi:10.1534/genetics.113.158584. PMC 3872192. PMID 24214345.
  6. Durrett, Rick (2008). Probability Models for DNA Sequence Evolution (2 ed.).
  7. Wakeley, John (22 April 2016). Coalescent Theory: An Introduction. Roberts & Company Publishers. ISBN 978-0974707754.
  8. Crow, James F.; Kimura, Motoo (1970). An introduction to population genetics theory ([Reprint] ed.). New Jersey: Blackburn Press. ISBN 9781932846126.
  9. Chen, H.; Green, R. E.; Paabo, S.; Slatkin, M. (29 July 2007). "The Joint Allele-Frequency Spectrum in Closely Related Species". Genetics. 177 (1): 387–398. doi:10.1534/genetics.107.070730. PMC 2013700. PMID 17603120.
  10. Gutenkunst, Ryan N.; Hernandez, Ryan D.; Williamson, Scott H.; Bustamante, Carlos D. (23 October 2009). "Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data". PLOS Genetics. 5 (10): e1000695. arXiv:0909.0925. doi:10.1371/journal.pgen.1000695. PMC 2760211. PMID 19851460.
  11. Marth, Gabor T.; Czabarka, Eva; Murvai, Janos; Sherry, Stephen T. (1 January 2004). "The Allele Frequency Spectrum in Genome-Wide Human Variation Data Reveals Signals of Differential Demographic History in Three Large World Populations". Genetics. 166 (1): 351–372. doi:10.1534/genetics.166.1.351. PMC 1470693. PMID 15020430.
  12. Boyko, Adam R.; Williamson, Scott H.; Indap, Amit R.; Degenhardt, Jeremiah D.; Hernandez, Ryan D.; Lohmueller, Kirk E.; Adams, Mark D.; Schmidt, Steffen; Sninsky, John J.; Sunyaev, Shamil R.; White, Thomas J.; Nielsen, Rasmus; Clark, Andrew G.; Bustamante, Carlos D. (30 May 2008). "Assessing the Evolutionary Impact of Amino Acid Mutations in the Human Genome". PLOS Genetics. 4 (5): e1000083. doi:10.1371/journal.pgen.1000083. PMC 2377339. PMID 18516229.