Coalescent theory

Coalescent theory is a model of how alleles sampled from a population may have originated from a common ancestor. In the simplest case, coalescent theory assumes no recombination, no natural selection, and no gene flow or population structure, meaning that each variant is equally likely to have been passed from one generation to the next. The model looks backward in time, merging alleles into a single ancestral copy through a random sequence of coalescence events. Under this model, the expected time between successive coalescence events increases almost exponentially back in time (with wide variance). Variance in the model comes from both the random passing of alleles from one generation to the next, and the random occurrence of mutations in these alleles.

The mathematical theory of the coalescent was developed independently by several groups in the early 1980s as a natural extension of classical population genetics theory and models, but can be primarily attributed to John Kingman. Later advances extend the coalescent to include recombination, selection, overlapping generations and virtually any arbitrarily complex evolutionary or demographic model in population genetic analysis.

The model can be used to produce many theoretical genealogies, and then compare observed data to these simulations to test assumptions about the demographic history of a population. Coalescent theory can be used to make inferences about population genetic parameters, such as migration, population size and recombination.
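The idea of producing many theoretical genealogies can be sketched in a few lines. The example below is a minimal illustration, not a replacement for established simulators. It assumes the standard neutral coalescent, in which the waiting time while k lineages remain is exponentially distributed with rate k(k − 1)/2 per 2Ne generations; that rate is a textbook result rather than something stated in this article. The function names and parameter values are hypothetical.

```python
import random


def simulate_coalescence_times(n, two_ne, rng):
    """Waiting times (in generations) while k = n, n-1, ..., 2 lineages remain.

    Assumes the standard neutral coalescent: with k lineages there are
    k*(k-1)/2 pairs, each coalescing at rate 1/(2Ne) per generation.
    """
    times = []
    for k in range(n, 1, -1):
        pairs = k * (k - 1) / 2
        rate = pairs / two_ne
        times.append(rng.expovariate(rate))
    return times


def mean_tmrca(n, two_ne, replicates, seed=1):
    """Average time to the most recent common ancestor over many genealogies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        total += sum(simulate_coalescence_times(n, two_ne, rng))
    return total / replicates


def expected_tmrca(n, two_ne):
    """Sum of the expected waiting times 2Ne / C(k, 2) for k = 2..n."""
    return 2 * two_ne * (1 - 1 / n)


if __name__ == "__main__":
    # Hypothetical values: 10 sampled alleles, 2Ne = 2000 gene copies.
    print("simulated mean TMRCA:", mean_tmrca(10, 2000, 20000))
    print("theoretical mean TMRCA:", expected_tmrca(10, 2000))
```

Comparing a summary of observed data against the distribution of the same summary across such simulated genealogies is one simple way to test a demographic assumption such as constant population size.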

Theory

Time to coalescence

Consider a single gene locus sampled from two haploid individuals in a population. The ancestry of this sample is traced backwards in time to the point where these two lineages coalesce in their most recent common ancestor (MRCA). Coalescent theory seeks to estimate the expectation of this time period and its variance.

The probability that two lineages coalesce in the immediately preceding generation is the probability that they share a parental DNA sequence. In a population with a constant effective population size with 2Ne copies of each locus, there are 2Ne "potential parents" in the previous generation. Under a random mating model, the probability that two alleles originate from the same parental copy is thus 1/(2Ne) and, correspondingly, the probability that they do not coalesce is 1 − 1/(2Ne).

At each successive preceding generation, the probability of coalescence is geometrically distributed; that is, it is the probability of noncoalescence at the t − 1 preceding generations multiplied by the probability of coalescence at the generation of interest:

$$P_c(t) = \left(1 - \frac{1}{2N_e}\right)^{t-1} \left(\frac{1}{2N_e}\right).$$

For sufficiently large values of Ne, this distribution is well approximated by the continuously defined exponential distribution

$$P_c(t) = \frac{1}{2N_e} e^{-\frac{t-1}{2N_e}}.$$

This is mathematically convenient, as the standard exponential distribution has both the expected value and the standard deviation equal to 2Ne. Therefore, although the expected time to coalescence is 2Ne, actual coalescence times have a wide range of variation. Note that coalescent time is the number of preceding generations at which the coalescence took place, not calendar time, though the latter can be estimated by multiplying 2Ne by the average time between generations. The above calculations apply equally to a diploid population of effective size Ne (in other words, for a non-recombining segment of DNA, each chromosome can be treated as equivalent to an independent haploid individual; in the absence of inbreeding, sister chromosomes in a single individual are no more closely related than two chromosomes randomly sampled from the population). Some effectively haploid DNA elements, such as mitochondrial DNA, are however only passed on by one sex, and therefore have one quarter the effective size of the equivalent diploid population (Ne/2).
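The relationship between the geometric distribution, its exponential approximation, and the mean and standard deviation of 2Ne can be checked numerically. The following is a rough sketch under the same assumptions as the text (constant size, random mating, two sampled lineages); the function names and the chosen value 2Ne = 100 are illustrative only.

```python
import math
import random
import statistics


def geometric_pmf(t, two_ne):
    """Probability that two lineages first coalesce exactly t generations back."""
    p = 1.0 / two_ne
    return (1.0 - p) ** (t - 1) * p


def exponential_approx(t, two_ne):
    """Continuous approximation to geometric_pmf for large 2Ne."""
    return math.exp(-(t - 1) / two_ne) / two_ne


def sample_pairwise_time(two_ne, rng):
    """Trace two lineages back until both pick the same parental copy."""
    t = 1
    while rng.randrange(two_ne) != rng.randrange(two_ne):
        t += 1
    return t


if __name__ == "__main__":
    two_ne = 100  # hypothetical number of gene copies
    rng = random.Random(42)
    draws = [sample_pairwise_time(two_ne, rng) for _ in range(10000)]
    print("mean coalescence time:", statistics.mean(draws))
    print("standard deviation:", statistics.stdev(draws))
    print("geometric vs exponential at t = 50:",
          geometric_pmf(50, two_ne), exponential_approx(50, two_ne))
```

Both the sample mean and the sample standard deviation should land near 2Ne, which is the wide variation the text describes.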

Neutral variation

Coalescent theory can also be used to model the amount of variation in DNA sequences expected from genetic drift and mutation. This value is termed the mean heterozygosity, represented as $\bar{H}$. Mean heterozygosity is calculated as the probability of a mutation occurring at a given generation divided by the probability of any "event" at that generation (either a mutation or a coalescence). The probability that the event is a mutation is the probability of a mutation in either of the two lineages: $2\mu$, where $\mu$ is the mutation rate per lineage per generation. Thus the mean heterozygosity is equal to

$$\bar{H} = \frac{2\mu}{2\mu + \frac{1}{2N_e}} = \frac{4N_e\mu}{1 + 4N_e\mu} = \frac{\theta}{1 + \theta}, \qquad \theta = 4N_e\mu.$$

For $4N_e\mu \gg 1$, the vast majority of allele pairs have at least one difference in nucleotide sequence.
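The ratio described above, the per-generation probability of a mutation divided by the probability of any event, can be computed directly. This is a small sketch that writes the per-lineage mutation rate as `mu` and the effective size as `ne`; both names and all numeric values are assumptions chosen for illustration, not figures from the article.

```python
def mean_heterozygosity(ne, mu):
    """Probability that the first event looking back is a mutation, not a coalescence."""
    mutation = 2.0 * mu               # mutation in either of the two lineages
    coalescence = 1.0 / (2.0 * ne)    # the two lineages share a parental copy
    return mutation / (mutation + coalescence)


def theta(ne, mu):
    """Population-scaled mutation rate 4 * Ne * mu."""
    return 4.0 * ne * mu


if __name__ == "__main__":
    # Hypothetical parameter pairs (Ne, mu).
    for ne, mu in [(10_000, 2.5e-5), (10_000, 1e-8), (1_000_000, 1e-4)]:
        print(ne, mu, theta(ne, mu), mean_heterozygosity(ne, mu))
```

When `theta` is much larger than 1 the result approaches 1, matching the statement that most allele pairs then differ in at least one nucleotide.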

Extensions

There are numerous extensions to the coalescent model, such as the Λ-coalescent, which allows for the possibility of multifurcations.

Graphical representation

Coalescents can be visualised using dendrograms which show the relationship of branches of the population to each other. The point where two branches meet indicates a coalescent event.

Applications

Disease gene mapping

The utility of coalescent theory in mapping disease genes is slowly gaining appreciation. Although the application of the theory is still in its infancy, a number of researchers are actively developing algorithms that use it to analyse human genetic data.

A considerable number of human diseases can be attributed to genetics, from simple Mendelian diseases like sickle-cell anemia and cystic fibrosis, to more complicated maladies like cancers and mental illnesses. The latter are polygenic diseases, controlled by multiple genes that may occur on different chromosomes, but diseases that are precipitated by a single abnormality are relatively simple to pinpoint and trace, although not so simple that this has been achieved for all diseases. In understanding these diseases and their processes, it is immensely useful to know where the responsible genes are located on chromosomes and how they have been inherited through generations of a family, as can be accomplished through coalescent analysis. [1]

Genetic diseases are passed from one generation to another just like other genes. While any gene may be shuffled from one chromosome to another during homologous recombination, it is unlikely that one gene alone will be shifted. Thus, other genes that are close enough to the disease gene to be linked to it can be used to trace it. [1]

Polygenic diseases have a genetic basis even though they do not follow Mendelian inheritance models. They may occur relatively frequently in populations and have severe health effects. Such diseases may also show incomplete penetrance, which, together with their polygenic nature, complicates their study. These traits may arise from many small mutations, which together have a severe and deleterious effect on the health of the individual. [2]

Linkage mapping methods, including coalescent theory, can be put to work on these diseases, since they use family pedigrees to figure out which markers accompany a disease and how it is inherited. At the very least, this method helps narrow down the portion, or portions, of the genome on which the deleterious mutations may occur. Complications in these approaches include epistatic effects, the polygenic nature of the mutations, and environmental factors. That said, genes whose effects are additive carry a fixed risk of developing the disease, and when they exist in a disease genotype, they can be used to predict risk and map the gene. [2] Both the regular coalescent and the shattered coalescent (which allows that multiple mutations may have occurred in the founding event, and that the disease may occasionally be triggered by environmental factors) have been put to work in understanding disease genes. [1]

Studies have been carried out correlating disease occurrence in fraternal and identical twins, and the results of these studies can be used to inform coalescent modeling. Since identical twins share all of their genome, but fraternal twins share on average only half of theirs, the difference in correlation between identical and fraternal twins can be used to work out whether a disease is heritable, and if so how strongly. [2]
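One classical way to turn the difference between twin correlations into a heritability estimate is Falconer's formula, h² = 2(r_MZ − r_DZ). The article does not name a specific estimator, so the choice of Falconer's formula here is an assumption, and the correlation values below are made up for illustration.

```python
def falconer_heritability(r_mz, r_dz):
    """Falconer's estimate of heritability from twin correlations.

    r_mz: trait correlation between identical (monozygotic) twins
    r_dz: trait correlation between fraternal (dizygotic) twins
    Identical twins share all of their genome and fraternal twins about
    half, so twice the difference estimates the genetic share of variance.
    """
    for r in (r_mz, r_dz):
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie between -1 and 1")
    return 2.0 * (r_mz - r_dz)


if __name__ == "__main__":
    # Hypothetical correlations, not taken from any study.
    print(falconer_heritability(0.8, 0.5))
```

Equal correlations in the two twin types give an estimate of zero, which is the "not heritable" case the paragraph alludes to.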

The genomic distribution of heterozygosity

The human single-nucleotide polymorphism (SNP) map has revealed large regional variations in heterozygosity, more so than can be explained on the basis of (Poisson-distributed) random chance. In part, these variations could be explained on the basis of assessment methods, the availability of genomic sequences, and possibly the standard coalescent population genetic model. Population genetic processes could have a major influence on this variation: some loci presumably would have comparatively recent common ancestors, others might have much older genealogies, and so the regional accumulation of SNPs over time could be quite different. The local density of SNPs along chromosomes appears to cluster in accordance with a variance-to-mean power law and to obey the Tweedie compound Poisson distribution. In this model the regional variations in the SNP map would be explained by the accumulation of multiple small genomic segments through recombination, where the mean number of SNPs per segment would be gamma distributed in proportion to a gamma distributed time to the most recent common ancestor for each segment.
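The effect of a gamma distributed time to the most recent common ancestor on SNP counts can be illustrated with a toy simulation. The sketch below is not a Tweedie fit; it only shows that when the expected SNP count of a segment scales with a gamma distributed time, the counts across segments are more variable than a plain Poisson model allows (variance well above the mean). The shape, scale and per-unit-time SNP rate are arbitrary assumptions.

```python
import math
import random
import statistics


def poisson_draw(lam, rng):
    """Draw a Poisson variate with Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_segment_counts(n_segments, shape, scale, snp_rate, seed=7):
    """SNP counts for segments whose TMRCA is gamma distributed."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_segments):
        tmrca = rng.gammavariate(shape, scale)   # older genealogy, more time for SNPs
        counts.append(poisson_draw(snp_rate * tmrca, rng))
    return counts


if __name__ == "__main__":
    counts = simulate_segment_counts(20000, shape=2.0, scale=1.0, snp_rate=3.0)
    mean = statistics.mean(counts)
    var = statistics.variance(counts)
    print("mean:", mean, "variance:", var, "variance/mean:", var / mean)
```

A pure Poisson process would give a variance-to-mean ratio close to 1, so a ratio well above 1 reflects the extra variation contributed by differing genealogical depths.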

History

Coalescent theory is a natural extension of the more classical population genetics concept of neutral evolution and is an approximation to the Fisher–Wright (or Wright–Fisher) model for large populations. It was discovered independently by several researchers in the 1980s.

Software

A large body of software exists both for simulating data sets under the coalescent process and for inferring parameters such as population size and migration rates from genetic data.

References

  1. Morris, A., Whittaker, J., & Balding, D. (2002). Fine-Scale Mapping of Disease Loci via Shattered Coalescent Modeling of Genealogies. The American Journal of Human Genetics, 70(3), 686–707. doi:10.1086/339271
  2. Rannala, B. (2001). Finding genes influencing susceptibility to complex diseases in the post-genome era. American Journal of Pharmacogenomics, 1(3), 203–221.

Sources

Articles

  • Arenas, M. and Posada, D. (2014) Simulation of Genome-Wide Evolution under Heterogeneous Substitution Models and Complex Multispecies Coalescent Histories. Molecular Biology and Evolution 31(5): 1295–1301
  • Arenas, M. and Posada, D. (2007) Recodon: Coalescent simulation of coding DNA sequences with recombination, migration and demography. BMC Bioinformatics 8: 458
  • Arenas, M. and Posada, D. (2010) Coalescent simulation of intracodon recombination. Genetics 184(2): 429–437
  • Browning, S.R. (2006) Multilocus association mapping using variable-length Markov chains. American Journal of Human Genetics 78: 903–913
  • Cornuet J.-M., Pudlo P., Veyssier J., Dehne-Garcia A., Gautier M., Leblois R., Marin J.-M., Estoup A. (2014) DIYABC v2.0: a software to make Approximate Bayesian Computation inferences about population history using Single Nucleotide Polymorphism, DNA sequence and microsatellite data. Bioinformatics 30: 1187–1189
  • Degnan, J.H. and Salter, L.A. (2005) Gene tree distributions under the coalescent process. Evolution 59(1): 24–37. pdf from coaltree.net/
  • Donnelly, P., Tavaré, S. (1995) Coalescents and genealogical structure under neutrality. Annual Review of Genetics 29: 401–421
  • Drummond A, Suchard MA, Xie D, Rambaut A (2012). "Bayesian phylogenetics with BEAUti and the BEAST 1.7". Molecular Biology and Evolution. 29 (8): 1969–1973. doi:10.1093/molbev/mss075. PMC 3408070. PMID 22367748.
  • Ewing, G. and Hermisson J. (2010), MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus, Bioinformatics 26:15
  • Hellenthal, G., Stephens M. (2006) msHOT: modifying Hudson's ms simulator to incorporate crossover and gene conversion hotspots. Bioinformatics AOP
  • Hudson, Richard R. (1983a). "Testing the Constant-Rate Neutral Allele Model with Protein Sequence Data". Evolution. 37 (1): 203–17. doi:10.2307/2408186. ISSN 1558-5646. JSTOR 2408186. PMID 28568026.
  • Hudson RR (1983b) Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology 23: 183–201.
  • Hudson RR (1991) Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology 7: 1–44
  • Hudson RR (2002) Generating samples under a Wright–Fisher neutral model. Bioinformatics 18: 337–338
  • Kendal WS (2003) An exponential dispersion model for the distribution of human single nucleotide polymorphisms. Mol Biol Evol 20: 579–590
  • Hein, J., Schierup, M., Wiuf C. (2004) Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press. ISBN 978-0-19-852996-5
  • Kaplan, N.L., Darden, T., Hudson, R.R. (1988) The coalescent process in models with selection. Genetics 120: 819–829
  • Kingman, J. F. C. (1982). "On the Genealogy of Large Populations". Journal of Applied Probability. 19: 27–43. CiteSeerX 10.1.1.552.1429. doi:10.2307/3213548. ISSN 0021-9002. JSTOR 3213548. S2CID 125055288.
  • Kingman, J.F.C. (2000) Origins of the coalescent 1974–1982. Genetics 156: 1461–1463
  • Leblois R., Estoup A. and Rousset F. (2009) IBDSim: a computer program to simulate genotypic data under isolation by distance. Molecular Ecology Resources 9: 107–109
  • Liang L., Zöllner S., Abecasis G.R. (2007) GENOME: a rapid coalescent-based whole genome simulator. Bioinformatics 23: 1565–1567
  • Mailund, T., Schierup, M.H., Pedersen, C.N.S., Mechlenborg, P. J. M., Madsen, J.N., Schauser, L. (2005) CoaSim: A Flexible Environment for Simulating Genetic Data under Coalescent Models. BMC Bioinformatics 6: 252
  • Möhle, M., Sagitov, S. (2001) A classification of coalescent processes for haploid exchangeable population models. The Annals of Probability 29: 1547–1562
  • Morris, A. P., Whittaker, J. C., Balding, D. J. (2002) Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. American Journal of Human Genetics 70: 686–707
  • Neuhauser, C., Krone, S.M. (1997) The genealogy of samples in models with selection. Genetics 145: 519–534
  • Pitman, J. (1999) Coalescents with multiple collisions. The Annals of Probability 27: 1870–1902
  • Harding, Rosalind M. (1998) New phylogenies: an introductory look at the coalescent. pp. 15–22, in Harvey, P. H., Brown, A. J. L., Smith, J. M., Nee, S. New uses for new phylogenies. Oxford University Press (ISBN 0198549849)
  • Rosenberg, N.A., Nordborg, M. (2002) Genealogical Trees, Coalescent Theory and the Analysis of Genetic Polymorphisms. Nature Reviews Genetics 3: 380–390
  • Sagitov, S. (1999) The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability 36: 1116–1125
  • Schweinsberg, J. (2000) Coalescents with simultaneous multiple collisions. Electronic Journal of Probability 5: 1–50
  • Slatkin, M. (2001) Simulating genealogies of selected alleles in populations of variable size. Genetic Research 145: 519–534
  • Tajima, F. (1983) Evolutionary Relationship of DNA Sequences in finite populations. Genetics 105: 437–460
  • Tavaré S, Balding DJ, Griffiths RC & Donnelly P. (1997) Inferring coalescent times from DNA sequence data. Genetics 145: 505–518.
  • The International SNP Map Working Group (2001) A map of human genome variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928–933.
  • Zöllner S. and Pritchard J.K. (2005) Coalescent-Based Association Mapping and Fine Mapping of Complex Trait Loci. Genetics 169: 1071–1092
  • Rousset F. and Leblois R. (2007) Likelihood and Approximate Likelihood Analyses of Genetic Structure in a Linear Habitat: Performance and Robustness to Model Mis-Specification. Molecular Biology and Evolution 24: 2730–2745

Books