Phylogenetic invariants

Last updated

Phylogenetic invariants [1] are polynomial relationships between the frequencies of various site patterns in an idealized DNA multiple sequence alignment. They have received substantial study in the field of biomathematics, and they can be used to choose among phylogenetic tree topologies in an empirical setting. The primary advantage of phylogenetic invariants relative to other methods of phylogenetic estimation like maximum likelihood or Bayesian MCMC analyses is that invariants can yield information about the tree without requiring the estimation of branch lengths of model parameters. The idea of using phylogenetic invariants was introduced independently by James Cavender and Joseph Felsenstein [2] and by James A. Lake [3] in 1987.

Contents

At this point the number of programs that allow empirical datasets to be analyzed using invariants is limited. However, phylogenetic invariants may provide solutions to other problems in phylogenetics and they represent an area of active research for that reason. Felsenstein [4] stated it best when he said, "invariants are worth attention, not for what they do for us now, but what they might lead to in the future." (p. 390)

If we consider a multiple sequence alignment with t taxa and no gaps or missing data (i.e., an idealized multiple sequence alignment), there are 4t possible site patterns. For example, there are 256 possible site patterns for four taxa (fAAAA, fAAAC, fAAAG, … fTTTT), which can be written as a vector. This site pattern frequency vector has 255 degrees of freedom because the frequencies must sum to one. However, any set of site pattern frequencies that resulted from some specific process of sequence evolution on a specific tree must obey many constraints. and therefore have many fewer degrees of freedom. Thus, there should be polynomials involving those frequencies that take on a value of zero if the DNA sequences were generated on a specific tree given a particular substitution model.

Invariants are formulas in the expected pattern frequencies, not the observed pattern frequencies. When they are computed using the observed pattern frequencies, we will usually find that they are not precisely zero even when the model and tree topology are correct. By testing whether such polynomials for various trees are 'nearly zero' when evaluated on the observed frequencies of patterns in real data sequences one should be able infer which tree best explains the data.

Some invariants are straightforward consequences of symmetries in the model of nucleotide substitution and they will take on a value of zero regardless of the underlying tree topology. For example, if we assume the Jukes-Cantor model of sequence evolution and a four-taxon tree we expect:

This is a simple outgrowth of the fact that base frequencies are constrained to be equal under the Jukes-Cantor model. Thus, they are called symmetry invariants. The equation shown above is only one of a large number of symmetry invariants for the Jukes-Cantor model; in fact, there are a total of 241 symmetry invariants for that model.

Symmetry invariants for the Jukes-Cantor model of DNA evolution (adapted from Felsenstein 2004 [4] )
Site pattern categoryExample of site patternNumber of pattern typesNumber of patternsTotal invariants that result
4xxxxx (e.g., AAAA, CCCC, ...)143
3x, 1yxxxy (e.g., AAAC, AACA, ...)41244
2x, 2yxxyy (e.g., AACC, ACCA, ...)31233
2x, 1y, 1zxxyz (e.g., AACG, ACGA, ...)624138
1x, 1y, 1z, 1wxyzw (e.g., ACGT, CGTA, ...)12423
Totals =15241

Symmetry invariants are non-phylogenetic in nature; they take on the expected value of zero regardless of the tree topology. However, it is possible to determine whether a particular multiple sequence alignment fits the Jukes-Cantor model of evolution (i.e., by testing whether the site patterns of the appropriate types are present in equal numbers). More general tests for the best-fitting model using invariants are also possible. For example Kedzierska et al. 2012 [5] used invariants to establish the best-fitting model out from a specific model set.

Models of DNA evolution that can be tested using the Kedzierska et al. (2012) [5] invariants method
Model abbreviationFull model name
JC69*Jukes-Cantor
K80*Kimura two-parameter
K81*Kimura three-parameter
SSM (CS05)Strand-specific model
GMMGeneral Markov model

The asterisk after the JC69, K80, and K81 models is used to emphasize the non-homogeneous nature of the models that can be examined using invariants. These non-homogeneous models include the commonly used continuous-time JC69, K80, and K81 models as submodels. The SSM (strand-specific model [6] ), also called the CS05 [7] model, is a generalized non-homogeneous version of the HKY (Hasegawa-Kishino-Yano) model [8] constrained to have equal distribution of the pairs of bases A,T and C,G at each node of the tree and no assumption regarding a stable base distribution. All models listed above are submodels of the general Markov model [9] (GMM). The ability to perform tests using non-homogeneous models represents a major benefit of the invariants methods relative to the more commonly used maximum likelihood methods for phylogenetic model testing.

Phylogenetic invariants, which are defined as the subset of invariants that take on a value of zero only when the sequences were (or were not) generated on a specific topology, are likely to be the most useful invariants for phylogenetic studies. .

Lake's linear invariants

Lake's invariants (which he called "evolutionary parsimony") provide an excellent example of phylogenetic invariants. Lake's invariants involve quartets, two of which (the incorrect topologies) yield values of zero and one of which yields a value greater than zero. This can be used to construct a test based on following invariant relationship, which holds for the two incorrect trees when sites evolve under the Kimura two-parameter model of sequence evolution:

The indices of these site pattern frequencies indicate the bases scored relative to the base in the first taxon (which we call taxon A). If base 1 is a purine, then base 2 is the other purine and bases 3 and 4 are the pyrimidines. If base 1 is a pyrimidine, then base 2 is the other pyrimidine and. bases 3 and 4 are the purines.  

We will call three possible quartet trees TX [TX is ((A,B),(C,D)); in newick format], TY [TY is ((A,C),(B,D)); in newick format],  and TZ [TZ is ((A,D),(B,C)); in newick format]. We can calculate three values from the data to identify the best topology given the data:

Lake broke these values up into a "parsimony-like term" ( for TX) the "background term" ( for TX) and suggests testing for deviation from zero by calculating and performing a χ2 test with one degree of freedom. Similar χ2 tests can be performed for Y and Z. If one of the three values is significantly different from zero the corresponding topology is the best estimate of phylogeny. The advantage of using Lake's invariants relative to maximum likelihood or neighbor joining of Kimura two-parameter distances is that the invariants should hold regardless of the model parameters, branch lengths, or patterns of among-sites rate heterogeneity.

A classic study by John Huelsenbeck and David Hillis [10] found that Lake's invariants converges on the true tree over all of the branch length space they examined when the underlying model of evolution is the Kimura two-parameter model. However, they also found that Lake's invariants are very inefficient (large amounts of data are necessary to converge on the correct tree). This inefficiency has caused most empiricists to abandon the use of Lake's invariants. Also, because Lake's invariants are based on the Kimura two-parameter model phylogenetic estimation using Lake's invariants may not yield the true tree when the model that generated the data strongly violates that model.

Modern approaches using phylogenetic invariants

The low efficiency of Lake's invariants reflects the fact that it used a limited set of generators for the phylogenetic invariants. Casanellas et al. [11] introduced methods to derive a much larger set of set of generators for DNA data and this has led to the development of invariants methods that are as efficient as maximum likelihood methods. [12] Several of these methods have implementations that are practical for analyses of empirical datasets.

Eriksson [13] proposed an invariants method for the general Markov model based on singular value decomposition (SVD) of matrices generated by "flattening" the nucleotides associated with each of the leaves (i.e., the site pattern frequency spectrum). Different flattening matrices are produced for each topology. However, comparisons of the original Eriksson SVD method (ErikSVD) to neighbor joining and the maximum likelihood approach implemented in the PHYLIP program dnaml were mixed; ErikSVD underperformed the other two methods when used with simulated data but it appeared to perform better than dnaml when applied to an empirical mammalian dataset based on an early release of data from the ENCODE project. The original ErikSVD method was improved by Fernández-Sánchez and Casanellas, [14] who proposed a normalization they called Erik+2. The original ErikSVD method is statistically consistent (it converges on. the true tree. as the empirical distribution approaches the theoretical distribution); the Erik+2 normalization improves the performance of the method given finite datasets. It has been implemented in the software package PAUP* as an option for the SVDquartets method.

"Squangles" (stochastic quartet tangles [15] ) represents another example of an invariants method [16] hat has been implemented in software package that is practical to be used with empirical datasets. Squangles permit the choice among the three possible quartets assuming that DNA sequences have evolved under the general Markov model; the quartets can then be assembled using a supertree method. There are three squangles that are useful for differentiating among quartets, which can be denoted as q1(f), q2(f), and q3(f) (f is a 256 element vector containing the site frequency spectrum). Each q has 66,744 terms and together they satisfy the linear relation q1 + q2 + q3 = 0 (i.e., up to linear dependence there are only two q values). Each possible quartet has different expected values for q1, q2, and q3:

Expected values for q1, q2, and q3 (adapted from Holland et al. 2013 [16] )
Tree topology

(newick format)

QuartetE(q1)E(q2)E(q3)
((A,B),(C,D));AB|CD (or 12|34)0-uu
((A,C),(B,D));AC|BD (or 13|24)v0-v
((A,D),(B,C));AD|BC (or 14|23)-ww0

The expected values q1, q2, and q3 are all zero on the star topology (a quartet with an internal branch length of zero). For practicality, Holland et al. [16] used least squares to solve for the q values. Empirical tests of the squangles method have been limited [16] [17] but they appear to be promising.

Related Research Articles

In biology, phylogenetics is the study of the evolutionary history and relationships among or within groups of organisms. These relationships are determined by phylogenetic inference methods that focus on observed heritable traits, such as DNA sequences, protein amino acid sequences, or morphology. The result of such an analysis is a phylogenetic tree—a diagram containing a hypothesis of relationships that reflects the evolutionary history of a group of organisms.

<span class="mw-page-title-main">Cladogram</span> Diagram used to show relations among groups of organisms with common origins

A cladogram is a diagram used in cladistics to show relations among organisms. A cladogram is not, however, an evolutionary tree because it does not show how ancestors are related to descendants, nor does it show how much they have changed, so many differing evolutionary trees can be consistent with the same cladogram. A cladogram uses lines that branch off in different directions ending at a clade, a group of organisms with a last common ancestor. There are many shapes of cladograms but they all have lines that branch off from other lines. The lines can be traced back to where they branch off. These branching off points represent a hypothetical ancestor which can be inferred to exhibit the traits shared among the terminal taxa above it. This hypothetical ancestor might then provide clues about the order of evolution of various features, adaptation, and other evolutionary narratives about ancestors. Although traditionally such cladograms were generated largely on the basis of morphological characters, DNA and RNA sequencing data and computational phylogenetics are now very commonly used in the generation of cladograms, either on their own or in combination with morphology.

<span class="mw-page-title-main">Singular value decomposition</span> Matrix decomposition

In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix into a rotation, followed by a rescaling followed by another rotation. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any matrix. It is related to the polar decomposition.

<span class="mw-page-title-main">Phylogenetic tree</span> Branching diagram of evolutionary relationships between organisms

A phylogenetic tree, phylogeny or evolutionary tree is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time. In other words, it is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. In evolutionary biology, all life on Earth is theoretically part of a single phylogenetic tree, indicating common ancestry. Phylogenetics is the study of phylogenetic trees. The main challenge is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of species or taxa. Computational phylogenetics focuses on the algorithms involved in finding optimal phylogenetic tree in the phylogenetic landscape.

Molecular phylogenetics is the branch of phylogeny that analyzes genetic, hereditary molecular differences, predominantly in DNA sequences, to gain information on an organism's evolutionary relationships. From these analyses, it is possible to determine the processes by which diversity among species has been achieved. The result of a molecular phylogenetic analysis is expressed in a phylogenetic tree. Molecular phylogenetics is one aspect of molecular systematics, a broader term that also includes the use of molecular data in taxonomy and biogeography.

In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa to create the phylogenetic tree.

<span class="mw-page-title-main">Hadamard transform</span> Involutive change of basis in linear algebra

The Hadamard transform is an example of a generalized class of Fourier transforms. It performs an orthogonal, symmetric, involutive, linear operation on 2m real numbers.

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

<span class="mw-page-title-main">Substitution model</span> Description of the process by which states in sequences change into each other and back

In biology, a substitution model, also called models of DNA sequence evolution, are Markov models that describe changes over evolutionary time. These models describe evolutionary changes in macromolecules represented as sequence of symbols. Substitution models are used to calculate the likelihood of phylogenetic trees using multiple sequence alignment data. Thus, substitution models are central to maximum likelihood estimation of phylogeny as well as Bayesian inference in phylogeny. Estimates of evolutionary distances are typically calculated using substitution models. Substitution models are also central to phylogenetic invariants because they are necessary to predict site pattern frequencies given a tree topology. Substitution models are also necessary to simulate sequence data for a group of organisms related by a specific tree.

Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. The goal is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of genes, species, or taxa. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data. Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms to search for optimal or the best phylogenetic tree. The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space.

The zero-order hold (ZOH) is a mathematical model of the practical signal reconstruction done by a conventional digital-to-analog converter (DAC). That is, it describes the effect of converting a discrete-time signal to a continuous-time signal by holding each sample value for one sample interval. It has several applications in electrical communication.

Ancestral reconstruction is the extrapolation back in time from measured characteristics of individuals to their common ancestors. It is an important application of phylogenetics, the reconstruction and study of the evolutionary relationships among individuals, populations or species to their ancestors. In the context of evolutionary biology, ancestral reconstruction can be used to recover different kinds of ancestral character states of organisms that lived millions of years ago. These states include the genetic sequence, the amino acid sequence of a protein, the composition of a genome, a measurable characteristic of an organism (phenotype), and the geographic range of an ancestral population or species. This is desirable because it allows us to examine parts of phylogenetic trees corresponding to the distant past, clarifying the evolutionary history of the species in the tree. Since modern genetic sequences are essentially a variation of ancient ones, access to ancient sequences may identify other variations and organisms which could have arisen from those sequences. In addition to genetic sequences, one might attempt to track the changing of one character trait to another, such as fins turning to legs.

Bayesian inference of phylogeny combines the information in the prior and in the data likelihood to create the so-called posterior probability of trees, which is the probability that the tree is correct given the data, the prior and the likelihood model. Bayesian inference was introduced into molecular phylogenetics in the 1990s by three independent groups: Bruce Rannala and Ziheng Yang in Berkeley, Bob Mau in Madison, and Shuying Li in University of Iowa, the last two being PhD students at the time. The approach has become very popular since the release of the MrBayes software in 2001, and is now one of the most popular methods in molecular phylogenetics.

A number of different Markov models of DNA sequence evolution have been proposed. These substitution models differ in terms of the parameters used to describe the rates at which one nucleotide replaces another during evolution. These models are frequently used in molecular phylogenetic analyses. In particular, they are used during the calculation of likelihood of a tree and they are used to estimate the evolutionary distance between sequences from the observed differences between the sequences.

Phylogenetic comparative methods (PCMs) use information on the historical relationships of lineages (phylogenies) to test evolutionary hypotheses. The comparative method has a long history in evolutionary biology; indeed, Charles Darwin used differences and similarities between species as a major source of evidence in The Origin of Species. However, the fact that closely related lineages share many traits and trait combinations as a result of the process of descent with modification means that lineages are not independent. This realization inspired the development of explicitly phylogenetic comparative methods. Initially, these methods were primarily developed to control for phylogenetic history when testing for adaptation; however, in recent years the use of the term has broadened to include any use of phylogenies in statistical tests. Although most studies that employ PCMs focus on extant organisms, many methods can also be applied to extinct taxa and can incorporate information from the fossil record.

Least squares inference in phylogeny generates a phylogenetic tree based on an observed matrix of pairwise genetic distances and optionally a weight matrix. The goal is to find a tree which satisfies the distance constraints as best as possible.

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

Horizontal or lateral gene transfer is the transmission of portions of genomic DNA between organisms through a process decoupled from vertical inheritance. In the presence of HGT events, different fragments of the genome are the result of different evolutionary histories. This can therefore complicate investigations of the evolutionary relatedness of lineages and species. Also, as HGT can bring into genomes radically different genotypes from distant lineages, or even new genes bearing new functions, it is a major source of phenotypic innovation and a mechanism of niche adaptation. For example, of particular relevance to human health is the lateral transfer of antibiotic resistance and pathogenicity determinants, leading to the emergence of pathogenic lineages.

Multispecies Coalescent Process is a stochastic process model that describes the genealogical relationships for a sample of DNA sequences taken from several species. It represents the application of coalescent theory to the case of multiple species. The multispecies coalescent results in cases where the relationships among species for an individual gene can differ from the broader history of the species. It has important implications for the theory and practice of phylogenetics and for understanding genome evolution.

<span class="mw-page-title-main">Phylogenetic reconciliation</span> Technique in evolutionary study

In phylogenetics, reconciliation is an approach to connect the history of two or more coevolving biological entities. The general idea of reconciliation is that a phylogenetic tree representing the evolution of an entity can be drawn within another phylogenetic tree representing an encompassing entity to reveal their interdependence and the evolutionary events that have marked their shared history. The development of reconciliation approaches started in the 1980s, mainly to depict the coevolution of a gene and a genome, and of a host and a symbiont, which can be mutualist, commensalist or parasitic. It has also been used for example to detect horizontal gene transfer, or understand the dynamics of genome evolution.

References

  1. Allman, E. S. and. Rhodes, J. A., "Phylogenetic invariants,'' in Reconstructing Evolution: New Mathematical and Computational Advances, ed. by O. Gascuel and M. Steel. Oxford University Press, 2007, 108--147
  2. Cavender, James A.; Felsenstein, Joseph (March 1987). "Invariants of phylogenies in a simple case with discrete states". Journal of Classification. 4 (1): 57–71. doi:10.1007/BF01890075. ISSN   0176-4268. S2CID   121832940.
  3. Lake, J. A. (March 1987). "A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony". Molecular Biology and Evolution. 4 (2): 167–191. doi: 10.1093/oxfordjournals.molbev.a040433 . ISSN   1537-1719. PMID   3447007.
  4. 1 2 Felsenstein, Joseph. (2004). Inferring phylogenies. Sunderland, Mass.: Sinauer Associates. ISBN   0-87893-177-5. OCLC   52127769.
  5. 1 2 Kedzierska, A. M.; Drton, M.; Guigo, R.; Casanellas, M. (2012-03-01). "SPIn: Model Selection for Phylogenetic Mixtures via Linear Invariants". Molecular Biology and Evolution. 29 (3): 929–937. doi:10.1093/molbev/msr259. hdl: 2117/14907 . ISSN   0737-4038. PMID   22009060.
  6. Casanellas M,  Sullivant S. (2005) "The strand symmetric model," in Algebraic statistics for computational biology, ed. Pachter L,  Sturmfels B., Cambridge University Press (Chapter 16, pp. 305-321)
  7. Pachter L,  Sturmfels B. (2005) "Biology," in Algebraic statistics for computational biology, ed. Pachter L,  Sturmfels B., Cambridge University Press (Chapter 4, pp. 125-159)
  8. Hasegawa, Masami; Kishino, Hirohisa; Yano, Taka-aki (October 1985). "Dating of the human-ape splitting by a molecular clock of mitochondrial DNA". Journal of Molecular Evolution. 22 (2): 160–174. doi:10.1007/BF02101694. ISSN   0022-2844. PMID   3934395. S2CID   25554168.
  9. Barry, D., & Hartigan, J. A. (1987). Statistical analysis of hominoid molecular evolution. Statistical Science, 2(2), 191-207.
  10. Huelsenbeck, J. P.; Hillis, D. M. (1993-09-01). "Success of Phylogenetic Methods in the Four-Taxon Case". Systematic Biology. 42 (3): 247–264. doi:10.1093/sysbio/42.3.247. ISSN   1063-5157.
  11. Casanellas M,  Sullivant S. Pachter L,  Sturmfels B. (2005) Catalog of small trees, Algebraic statistics for computational biology. Chapter 15, Cambridge (UK) Cambridge University Press
  12. Casanellas, M; Fernández-Sánchez, J (January 2007). "Performance of a New Invariants Method on Homogeneous and Nonhomogeneous Quartet Trees". Molecular Biology and Evolution. 24 (1): 288–293. arXiv: q-bio/0610030 . doi: 10.1093/molbev/msl153 . ISSN   1537-1719. PMID   17053050.
  13. Eriksson N. (2005) "Tree construction using singular value decomposition," in Algebraic statistics for computational biology, ed. Pachter L,  Sturmfels B., Cambridge University Press (Chapter 19, pp. 347-358)
  14. Fernández-Sánchez, Jesús; Casanellas, Marta (March 2016). "Invariant Versus Classical Quartet Inference When Evolution is Heterogeneous Across Sites and Lineages". Systematic Biology. 65 (2): 280–291. arXiv: 1405.6546 . doi: 10.1093/sysbio/syv086 . ISSN   1063-5157. PMID   26559009.
  15. Sumner J.G.. Entanglement, invariants, and phylogenetics, 2006 [Ph.D. thesis] University of Tasmania. Available from: URL http://eprints.utas.edu.au/709/
  16. 1 2 3 4 Holland, Barbara R.; Jarvis, Peter D.; Sumner, Jeremy G. (2013-01-01). "Low-Parameter Phylogenetic Inference Under the General Markov Model". Systematic Biology. 62 (1): 78–92. doi: 10.1093/sysbio/sys072 . ISSN   1076-836X. PMID   22914976.
  17. Reddy, Sushma; Kimball, Rebecca T.; Pandey, Akanksha; Hosner, Peter A.; Braun, Michael J.; Hackett, Shannon J.; Han, Kin-Lan; Harshman, John; Huddleston, Christopher J.; Kingston, Sarah; Marks, Ben D. (September 2017). "Why Do Phylogenomic Data Sets Yield Conflicting Trees? Data Type Influences the Avian Tree of Life more than Taxon Sampling". Systematic Biology. 66 (5): 857–879. doi: 10.1093/sysbio/syx041 . ISSN   1063-5157. PMID   28369655.