In population genetics, the allele frequency spectrum, sometimes called the site frequency spectrum, is the distribution of the allele frequencies of a given set of loci (often SNPs) in a population or sample. [1] [2] [3] [4] Because an allele frequency spectrum usually summarizes, or is compared to, sequenced samples from a population, it is a histogram whose size depends on the number of sequenced chromosomes. Each entry in the frequency spectrum records the total number of loci with the corresponding derived allele frequency. Loci contributing to the frequency spectrum are assumed to change in frequency independently of one another. Furthermore, loci are assumed to be biallelic (that is, with exactly two alleles present), although extensions to multiallelic frequency spectra exist. [5]
Many summary statistics of observed genetic variation are themselves summaries of the allele frequency spectrum, including estimates of $\theta$ such as Watterson's $\theta_W$ and Tajima's $\theta_\pi$, Tajima's $D$, Fay and Wu's $H$, and the fixation index $F_{ST}$. [6]
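As a minimal sketch of how such statistics reduce to weighted sums over the spectrum, the Python functions below compute three standard estimators of $\theta$ from an unfolded spectrum $(x_1, \ldots, x_{n-1})$ for $n$ sampled chromosomes: Watterson's $\theta_W = S / a_n$ with $a_n = \sum_{i=1}^{n-1} 1/i$, Tajima's $\theta_\pi = \binom{n}{2}^{-1} \sum_i i(n-i)\,x_i$, and Fay and Wu's $\theta_H = \binom{n}{2}^{-1} \sum_i i^2 x_i$. The function names and the example spectrum are illustrative choices, not taken from any particular library.

```python
from math import comb


def theta_watterson(sfs):
    """Watterson's estimator: segregating sites S divided by a_n."""
    n = len(sfs) + 1                      # spectrum has entries 1..n-1
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n


def theta_pi(sfs):
    """Tajima's estimator: mean number of pairwise differences."""
    n = len(sfs) + 1
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)


def theta_h(sfs):
    """Fay and Wu's estimator, weighting high-frequency derived alleles."""
    n = len(sfs) + 1
    return sum(i * i * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)


# Illustrative unfolded spectrum for n = 6 chromosomes (hypothetical input).
example_sfs = [4, 2, 1, 0, 1]
fay_wu_h = theta_pi(example_sfs) - theta_h(example_sfs)
```

Fay and Wu's $H$ is the difference $\theta_\pi - \theta_H$, and the numerator of Tajima's $D$ is $\theta_\pi - \theta_W$; the variance term that normalizes $D$ is omitted from this sketch.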
The allele frequency spectrum from a sample of $n$ chromosomes is calculated by counting the number of sites with each derived allele frequency $i$, where $1 \le i \le n-1$. For example, consider a sample of $n = 6$ individuals, each contributing one chromosome, with eight observed variable sites. In this table, a 1 indicates that the derived allele is observed at that site, while a 0 indicates that the ancestral allele was observed.
| | SNP 1 | SNP 2 | SNP 3 | SNP 4 | SNP 5 | SNP 6 | SNP 7 | SNP 8 |
|---|---|---|---|---|---|---|---|---|
| Sample 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| Sample 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| Sample 3 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| Sample 4 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| Sample 5 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| Sample 6 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| Total | 1 | 2 | 3 | 1 | 1 | 2 | 5 | 1 |
The allele frequency spectrum can be written as the vector $\mathbf{x} = (x_1, x_2, x_3, x_4, x_5)$, where $x_i$ is the number of observed sites with derived allele frequency $i$. In this example, the observed allele frequency spectrum is $(4, 2, 1, 0, 1)$: four SNP loci carry a single copy of the derived allele, two carry two copies, one carries three, none carries four, and one carries five.
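The counting procedure can be sketched in a few lines of Python. The function name and the list-of-rows input layout are assumptions made for illustration; the data are the six samples and eight SNPs from the table above.

```python
def allele_frequency_spectrum(genotypes):
    """Unfolded spectrum from a 0/1 matrix (rows: chromosomes, columns: sites).

    Entry i-1 of the result counts the sites at which exactly i sampled
    chromosomes carry the derived allele, for i = 1 .. n-1.
    """
    n = len(genotypes)
    # Column sums give the derived allele count at each site.
    derived_counts = [sum(column) for column in zip(*genotypes)]
    spectrum = [0] * (n - 1)
    for count in derived_counts:
        if 0 < count < n:  # keep only sites that are variable in the sample
            spectrum[count - 1] += 1
    return spectrum


genotypes = [
    [0, 1, 0, 0, 0, 0, 1, 0],  # Sample 1
    [1, 0, 1, 0, 0, 0, 1, 0],  # Sample 2
    [0, 1, 1, 0, 0, 1, 0, 0],  # Sample 3
    [0, 0, 0, 0, 1, 0, 1, 1],  # Sample 4
    [0, 0, 1, 0, 0, 0, 1, 0],  # Sample 5
    [0, 0, 0, 1, 0, 1, 1, 0],  # Sample 6
]

print(allele_frequency_spectrum(genotypes))  # [4, 2, 1, 0, 1]
```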
The expected allele frequency spectrum may be calculated using either a coalescent or a diffusion approach. [7] [8] The demographic history of a population and natural selection affect allele frequency dynamics, and these effects are reflected in the shape of the allele frequency spectrum. For the simple case of selectively neutral alleles segregating in a population that has reached demographic equilibrium (that is, without recent population size changes or gene flow), the expected allele frequency spectrum $\mathbf{x} = (x_1, \ldots, x_{n-1})$ for a sample of size $n$ is given by

$$E[x_i] = \frac{\theta}{i}, \qquad i = 1, \ldots, n-1,$$

where $\theta$ is the population-scaled mutation rate. Deviations from demographic equilibrium or neutrality will change the shape of the expected frequency spectrum.
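A short Python sketch of the neutral equilibrium expectation, using the standard result $E[x_i] = \theta / i$ for $i = 1, \ldots, n-1$. The function name and the value $\theta = 4$ are illustrative assumptions.

```python
def expected_neutral_spectrum(theta, n):
    """Expected unfolded spectrum under neutrality and demographic equilibrium.

    Returns [theta/1, theta/2, ..., theta/(n-1)] for a sample of n chromosomes.
    """
    return [theta / i for i in range(1, n)]


# Hypothetical parameter values, chosen only for illustration.
expected = expected_neutral_spectrum(theta=4.0, n=6)

# Dividing by the total gives the expected shape, which does not depend on theta.
total = sum(expected)
proportions = [value / total for value in expected]
```

Because every entry scales linearly with $\theta$, the proportions depend only on $n$. That is why demography and selection are detected through changes in the shape of the spectrum rather than its overall height.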
Calculating the frequency spectrum from observed sequence data requires one to be able to distinguish the ancestral and derived (mutant) alleles, often by comparing to an outgroup sequence. For example, in human population genetic studies, the homologous chimpanzee reference sequence is typically used to estimate the ancestral allele. However, sometimes the ancestral allele cannot be determined, in which case the folded allele frequency spectrum may be calculated instead. The folded frequency spectrum stores the observed counts of the minor (less common) allele frequencies. The folded spectrum can be calculated by binning together the $i$th and $(n-i)$th entries from the unfolded spectrum, where $n$ is the number of sampled chromosomes.
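Folding can be sketched as follows, assuming the unfolded spectrum is stored as a list whose entry $i-1$ holds the count for derived allele frequency $i$. The function name is an illustrative choice.

```python
def fold_spectrum(unfolded):
    """Fold an unfolded spectrum (entries for i = 1 .. n-1) by minor allele count.

    Entries i and n-i are binned together; when n is even, the middle
    entry i = n/2 has no partner and is kept as-is.
    """
    n = len(unfolded) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        if i == j:
            folded.append(unfolded[i - 1])
        else:
            folded.append(unfolded[i - 1] + unfolded[j - 1])
    return folded


print(fold_spectrum([4, 2, 1, 0, 1]))  # [5, 2, 1]
```

For an unfolded spectrum of $(4, 2, 1, 0, 1)$ with $n = 6$, entries 1 and 5 combine to 5, entries 2 and 4 combine to 2, and entry 3 is unchanged.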
The joint allele frequency spectrum (JAFS) is the joint distribution of allele frequencies across two or more related populations. The JAFS for $P$ populations, with $n_j$ sampled chromosomes in the $j$th population, is a $P$-dimensional histogram, in which each entry stores the total number of segregating sites in which the derived allele is observed with the corresponding frequency in each population. Each axis of the histogram corresponds to a population, and the index $i$ runs over $0 \le i \le n_j$ for the $j$th population. [9] [10]
Suppose we sequence diploid individuals from two populations, 4 individuals from population 1 and 2 individuals from population 2. The JAFS would be a $9 \times 5$ matrix, indexed from zero. The $[3, 2]$ entry would record the number of observed polymorphic loci with derived allele frequency 3 in population 1 and frequency 2 in population 2. The $[1, 0]$ entry would record those loci with observed frequency 1 in population 1 and frequency 0 in population 2. The $[8, 3]$ entry would record those loci with the derived allele fixed in population 1 (seen in all 8 chromosomes) and with frequency 3 in population 2.
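A minimal Python sketch of the two-population case, using 8 chromosomes from population 1 and 4 from population 2 as in the example. The function name and the per-locus derived allele counts are hypothetical, made up only to populate the three entries discussed.

```python
def joint_spectrum(counts_pop1, counts_pop2, n1, n2):
    """Two-population joint spectrum as an (n1+1) x (n2+1) nested list.

    counts_pop1[k] and counts_pop2[k] are the derived allele counts at
    locus k in each population; n1 and n2 are the numbers of sampled
    chromosomes.
    """
    jafs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for c1, c2 in zip(counts_pop1, counts_pop2):
        jafs[c1][c2] += 1
    return jafs


# Hypothetical derived allele counts at four loci (illustration only).
pop1_counts = [3, 1, 8, 3]
pop2_counts = [2, 0, 3, 2]

# 4 diploids -> 8 chromosomes; 2 diploids -> 4 chromosomes.
jafs = joint_spectrum(pop1_counts, pop2_counts, n1=8, n2=4)
print(len(jafs), len(jafs[0]))              # 9 5
print(jafs[3][2], jafs[1][0], jafs[8][3])   # 2 1 1
```

The corner entries `jafs[0][0]` and `jafs[8][4]` correspond to sites that do not vary in the combined sample, so in practice they are usually ignored or masked.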
The shape of the allele frequency spectrum is sensitive to demography, such as population size changes, migration, and substructure, as well as to natural selection. By comparing observed data summarized in a frequency spectrum to the expected frequency spectrum calculated under a given demographic and selection model, one can assess the goodness of fit of that model to the data and use likelihood theory to estimate the best-fit parameters of the model.
For example, suppose a population experienced a recent period of exponential growth. Sample sequences were obtained from the population at the end of the growth, and the observed (data) allele frequency spectrum was calculated using putatively neutral variation. The demographic model would have parameters for the exponential growth rate $r$, the time $T$ for which the growth occurred, and a reference population size $N_{ref}$, assuming that the population was at equilibrium when the growth began. The expected frequency spectrum for a given parameter set can be obtained using either diffusion or coalescent theory, and compared to the data frequency spectrum. The best-fit parameters can be found using maximum likelihood.
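The fitting step can be illustrated with a deliberately simplified Python sketch. Two assumptions are made here that the text does not specify: each entry of the spectrum is treated as an independent Poisson count around its expectation (one common choice of likelihood), and the model is the neutral equilibrium spectrum $E[x_i] = \theta / i$ with the single parameter $\theta$, substituted for the exponential growth model because its expected spectrum requires a diffusion or coalescent computation that is not shown. All function names are illustrative.

```python
from math import lgamma, log


def poisson_log_likelihood(observed, expected):
    """Log-likelihood of observed counts, each assumed Poisson with the given mean."""
    total = 0.0
    for x, mean in zip(observed, expected):
        total += x * log(mean) - mean - lgamma(x + 1)
    return total


def neutral_expected(theta, n):
    """Stand-in model: neutral equilibrium spectrum theta / i, i = 1 .. n-1."""
    return [theta / i for i in range(1, n)]


def fit_theta(observed, grid):
    """Return the grid value of theta with the highest log-likelihood."""
    n = len(observed) + 1
    return max(
        grid,
        key=lambda theta: poisson_log_likelihood(observed, neutral_expected(theta, n)),
    )


observed = [4, 2, 1, 0, 1]                     # example spectrum, n = 6
grid = [0.01 * k for k in range(1, 2001)]      # candidate theta values up to 20
theta_hat = fit_theta(observed, grid)
```

Under these assumptions the maximum has a closed form, $\hat{\theta} = S / \sum_{i=1}^{n-1} 1/i$, which coincides with Watterson's estimator, so the grid search should land within one grid step of that value. A model with several parameters, such as $r$, $T$, and $N_{ref}$, would replace `neutral_expected` with the model's expected spectrum and the grid search with a numerical optimizer.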
This approach has been used to infer demographic and selection models for many species, including humans. For example, Marth et al. (2004) used the single population allele frequency spectra for a group of Africans, Europeans, and Asians to show that population bottlenecks have occurred in the Asian and European demographic histories, but not in the Africans. [11] More recently, Gutenkunst et al. (2009) used the joint allele frequency spectrum for these same three populations to infer the time at which the populations diverged and the amount of subsequent ongoing migration between them (see out of Africa hypothesis). [10] Additionally, these methods may be used to estimate patterns of selection from allele frequency data. For example, Boyko et al. (2008) inferred the distribution of fitness effects for newly arising mutations using human polymorphism data that controlled for the effects of non-equilibrium demography. [12]