Minimum evolution

Last updated

Minimum evolution is a distance method employed in phylogenetics modeling. It shares with maximum parsimony the aspect of searching for the phylogeny that has the shortest total sum of branch lengths. [1] [2]

Contents

The theoretical foundations of the minimum evolution (ME) criterion lay in the seminal works of both Kidd and Sgaramella-Zonta (1971) [3] and Rzhetsky and Nei (1993). [4] In these frameworks, the molecular sequences from taxa are replaced by a set of measures of their dissimilarity (i.e., the so called "evolutionary distances") and a fundamental result states that if such distances were unbiased estimates of the true evolutionary distances from taxa (i.e., the distances that one would obtain if all the molecular data from taxa were available), then the true phylogeny of taxa would have an expected length shorter than any other possible phylogeny T compatible with those distances.

Relationships with and comparison with other methods

Maximum parsimony

It is worth noting here a subtle difference between the maximum-parsimony criterion and the ME criterion: while maximum-parsimony is based on an abductive heuristic, i.e., the plausibility of the simplest evolutionary hypothesis of taxa with respect to the more complex ones, the ME criterion is based on Kidd and Sgaramella-Zonta's conjectures that were proven true 22 years later by Rzhetsky and Nei. [4] These mathematical results set the ME criterion free from the Occam's razor principle and confer it a solid theoretical and quantitative basis.

Maximum-parsimony criterion, which uses Hamming distance branch lengths, was shown to be statistically inconsistent in 1978. This led to an interest in statistically consistent alternatives such as ME. [5]

Neighbor joining

Neighbor joining may be viewed as a greedy heuristic for the balanced minimum evolution (BME) criterion. Saito and Nei's 1987 NJ algorithm far predates the BME criterion of 2000. For two decades, researchers used NJ without a firm theoretical basis for why it works. [6]

Statistical consistency

The ME criterion is known to be statistically consistent whenever the branch lengths are estimated via the Ordinary Least-Squares (OLS) or via linear programming. [4] [7] [8] However, as observed in Rzhetsky & Nei's article, the phylogeny having the minimum length under the OLS branch length estimation model may be characterized, in some circumstance, by negative branch lengths, which unfortunately are empty of biological meaning. [4]

To solve this drawback, Pauplin [9] proposed to replace OLS with a new particular branch length estimation model, known as balanced basic evolution (BME). Richard Desper and Olivier Gascuel [10] showed that the BME branch length estimation model ensures the general statistical consistency of the minimum length phylogeny as well as the non-negativity of its branch lengths, whenever the estimated evolutionary distances from taxa satisfy the triangle inequality.

Le Sy Vinh and Arndt von Haeseler [11] have shown, by means of massive and systematic simulation experiments, that the accuracy of the ME criterion under the BME branch length estimation model is by far the highest in distance methods and not inferior to those of alternative criteria based e.g., on Maximum Likelihood or Bayesian Inference. Moreover, as shown by Daniele Catanzaro, Martin Frohn and Raffaele Pesenti, [12] the minimum length phylogeny under the BME branch length estimation model can be interpreted as the (Pareto optimal) consensus tree between concurrent minimum entropy processes encoded by a forest of n phylogenies rooted on the n analyzed taxa. This particular information theory-based interpretation is conjectured to be shared by all distance methods in phylogenetics.

Algorithmic aspects

The "minimum evolution problem" (MEP), in which a minimum-summed-length phylogeny is derived from a set of sequences under the ME criterion, is said to be NP-hard. [13] [14] The "balanced minimum evolution problem" (BMEP), which uses the newer BME criterion, is APX-hard. [5]

A number of exact algorithms solving BMEP have been described. [15] [16] [17] [18] The best known exact algorithm [19] remains impractical for more than a dozen taxa, even with multiprocessing. [5] There is only one approximation algorithm with proven error bounds, published in 2012. [5]

In practical use, BMEP is overwhemingly implemented by heuristic search. The basic, aforementioned neighbor-joining algorithm implements a greedy version of BMEP. [6] FastME, the "state-of-the-art", [5] starts with a rough tree then improves it using a set of topological moves such as Nearest Neighbor Interchanges (NNI). Compared to NJ, it is about as fast and more accurate. [20] Metaheuristics have also been used. [21]

See also

Related Research Articles

In biology, phylogenetics is the study of the evolutionary history and relationships among or within groups of organisms. These relationships are determined by phylogenetic inference, methods that focus on observed heritable traits, such as DNA sequences, protein amino acid sequences, or morphology. The result of such an analysis is a phylogenetic tree—a diagram containing a hypothesis of relationships that reflects the evolutionary history of a group of organisms.

<span class="mw-page-title-main">Cladogram</span> Diagram used to show relations among groups of organisms with common origins

A cladogram is a diagram used in cladistics to show relations among organisms. A cladogram is not, however, an evolutionary tree because it does not show how ancestors are related to descendants, nor does it show how much they have changed, so many differing evolutionary trees can be consistent with the same cladogram. A cladogram uses lines that branch off in different directions ending at a clade, a group of organisms with a last common ancestor. There are many shapes of cladograms but they all have lines that branch off from other lines. The lines can be traced back to where they branch off. These branching off points represent a hypothetical ancestor which can be inferred to exhibit the traits shared among the terminal taxa above it. This hypothetical ancestor might then provide clues about the order of evolution of various features, adaptation, and other evolutionary narratives about ancestors. Although traditionally such cladograms were generated largely on the basis of morphological characters, DNA and RNA sequencing data and computational phylogenetics are now very commonly used in the generation of cladograms, either on their own or in combination with morphology.

A phylogenetic tree, phylogeny or evolutionary tree is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time. In other words, it is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. In evolutionary biology, all life on Earth is theoretically part of a single phylogenetic tree, indicating common ancestry. Phylogenetics is the study of phylogenetic trees. The main challenge is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of species or taxa. Computational phylogenetics focuses on the algorithms involved in finding optimal phylogenetic tree in the phylogenetic landscape.

In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa to create the phylogenetic tree.

In phylogenetics and computational phylogenetics, maximum parsimony is an optimality criterion under which the phylogenetic tree that minimizes the total number of character-state changes. Under the maximum-parsimony criterion, the optimal tree will minimize the amount of homoplasy. In other words, under this criterion, the shortest possible tree that explains the data is considered best. Some of the basic ideas behind maximum parsimony were presented by James S. Farris in 1970 and Walter M. Fitch in 1971.

<span class="mw-page-title-main">Substitution model</span> Description of the process by which states in sequences change into each other and back

In biology, a substitution model, also called models of sequence evolution, are Markov models that describe changes over evolutionary time. These models describe evolutionary changes in macromolecules, such as DNA sequences or protein sequences, that can be represented as sequence of symbols. Substitution models are used to calculate the likelihood of phylogenetic trees using multiple sequence alignment data. Thus, substitution models are central to maximum likelihood estimation of phylogeny as well as Bayesian inference in phylogeny. Estimates of evolutionary distances are typically calculated using substitution models. Substitution models are also central to phylogenetic invariants because they are necessary to predict site pattern frequencies given a tree topology. Substitution models are also necessary to simulate sequence data for a group of organisms related by a specific tree.

In phylogenetics, long branch attraction (LBA) is a form of systematic error whereby distantly related lineages are incorrectly inferred to be closely related. LBA arises when the amount of molecular or morphological change accumulated within a lineage is sufficient to cause that lineage to appear similar to another long-branched lineage, solely because they have both undergone a large amount of change, rather than because they are related by descent. Such bias is more common when the overall divergence of some taxa results in long branches within a phylogeny. Long branches are often attracted to the base of a phylogenetic tree, because the lineage included to represent an outgroup is often also long-branched. The frequency of true LBA is unclear and often debated, and some authors view it as untestable and therefore irrelevant to empirical phylogenetic inference. Although often viewed as a failing of parsimony-based methodology, LBA could in principle result from a variety of scenarios and be inferred under multiple analytical paradigms.

Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. The goal is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of genes, species, or taxa. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data. Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms to search for optimal or the best phylogenetic tree. The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space.

Walter Monroe Fitch was a pioneering American researcher in molecular evolution.

Ancestral reconstruction is the extrapolation back in time from measured characteristics of individuals, populations, or specie to their common ancestors. It is an important application of phylogenetics, the reconstruction and study of the evolutionary relationships among individuals, populations or species to their ancestors. In the context of evolutionary biology, ancestral reconstruction can be used to recover different kinds of ancestral character states of organisms that lived millions of years ago. These states include the genetic sequence, the amino acid sequence of a protein, the composition of a genome, a measurable characteristic of an organism (phenotype), and the geographic range of an ancestral population or species. This is desirable because it allows us to examine parts of phylogenetic trees corresponding to the distant past, clarifying the evolutionary history of the species in the tree. Since modern genetic sequences are essentially a variation of ancient ones, access to ancient sequences may identify other variations and organisms which could have arisen from those sequences. In addition to genetic sequences, one might attempt to track the changing of one character trait to another, such as fins turning to legs.

Bayesian inference of phylogeny combines the information in the prior and in the data likelihood to create the so-called posterior probability of trees, which is the probability that the tree is correct given the data, the prior and the likelihood model. Bayesian inference was introduced into molecular phylogenetics in the 1990s by three independent groups: Bruce Rannala and Ziheng Yang in Berkeley, Bob Mau in Madison, and Shuying Li in University of Iowa, the last two being PhD students at the time. The approach has become very popular since the release of the MrBayes software in 2001, and is now one of the most popular methods in molecular phylogenetics.

Tree rearrangements are deterministic algorithms devoted to search for optimal phylogenetic tree structure. They can be applied to any set of data that are naturally arranged into a tree, but have most applications in computational phylogenetics, especially in maximum parsimony and maximum likelihood searches of phylogenetic trees, which seek to identify one among many possible trees that best explains the evolutionary history of a particular gene or species.

Distance matrices are used in phylogeny as non-parametric distance methods and were originally applied to phenetic data using a matrix of pairwise distances. These distances are then reconciled to produce a tree. The distance matrix can come from a number of different sources, including measured distance or morphometric analysis, various pairwise distance formulae applied to discrete morphological characters, or genetic distance from sequence, restriction fragment, or allozyme data. For phylogenetic character data, raw distance values can be calculated by simply counting the number of pairwise differences in character states.

Quantitative comparative linguistics is the use of quantitative analysis as applied to comparative linguistics. Examples include the statistical fields of lexicostatistics and glottochronology, and the borrowing of phylogenetics from biology.

<span class="mw-page-title-main">Unrooted binary tree</span>

In mathematics and computer science, an unrooted binary tree is an unrooted tree in which each vertex has either one or three neighbors.

<span class="mw-page-title-main">Three-taxon analysis</span> Cladistic based method of phylogenetic reconstruction

Three-taxon analysis is a cladistic based method of phylogenetic reconstruction. Introduced by Nelson and Platnick in 1991 to reconstruct organisms' phylogeny, this method can also be applied to biogeographic areas. It attempts to reconstruct complex phylogenetic trees by breaking the problem down into simpler chunks. Rather than try to resolve the relationships of all X taxa at once, it considers taxa 3 at a time. It is relatively easy to generate three-taxon statements (3is); that is, statements of the form "A and B are more closely related to one another than to C". Once each group of three taxa has been considered, the method constructs a tree that is consistent with as many three-item statements as possible.

A supertree is a single phylogenetic tree assembled from a combination of smaller phylogenetic trees, which may have been assembled using different datasets or a different selection of taxa. Supertree algorithms can highlight areas where additional data would most usefully resolve any ambiguities. The input trees of a supertree should behave as samples from the larger tree.

Ziheng Yang FRS is a Chinese biologist. He holds the R.A. Fisher Chair of Statistical Genetics at University College London, and is the Director of R.A. Fisher Centre for Computational Biology at UCL. He was elected a Fellow of the Royal Society in 2006.

T-REX is a freely available web server, developed at the department of Computer Science of the Université du Québec à Montréal, dedicated to the inference, validation and visualization of phylogenetic trees and phylogenetic networks. The T-REX web server allows the users to perform several popular methods of phylogenetic analysis as well as some new phylogenetic applications for inferring, drawing and validating phylogenetic trees and networks.

Genetic saturation is the result of multiple substitutions at the same site in a sequence, or identical substitutions in different sequences, such that the apparent sequence divergence rate is lower than the actual divergence that has occurred. When comparing two or more genetic sequences consisting of single nucleotides, differences in sequence observed are only differences in the final state of the nucleotide sequence. Single nucleotides that undergoing genetic saturation change multiple times, sometimes back to their original nucleotide or to a nucleotide common to the compared genetic sequence. Without genetic information from intermediate taxa, it is difficult to know how much, or if any saturation has occurred on an observed sequence. Genetic saturation occurs most rapidly on fast-evolving sequences, such as the hypervariable region of mitochondrial DNA, or in short tandem repeats such as on the Y-chromosome.

References

  1. Catanzaro, Daniele (2010). Estimating phylogenies from molecular data, in Mathematical approaches to polymer sequence analysis and related problems. Springer, New York.
  2. Catanzaro D (2009). "The minimum evolution problem: Overview and classification". Networks. 53 (2): 112–125. doi:10.1002/net.20280. S2CID   6018514.
  3. Kidd KK, Sgaramella-Zonta LA (1971). "Phylogenetic analysis: Concepts and methods". American Journal of Human Genetics. 23 (3): 235–252. PMC   1706731 . PMID   5089842.
  4. 1 2 3 4 Rzhetsky A, Nei M (1993). "Theoretical foundations of the minimum evolution method of phylogenetic inference". Molecular Biology and Evolution. 10: 21073–1095.
  5. 1 2 3 4 5 Catanzaro, Daniele; Frohn, Martin; Gascuel, Olivier; Pesenti, Raffaele (July 2022). "A tutorial on the balanced minimum evolution problem". European Journal of Operational Research. 300 (1): 1–19. doi: 10.1016/j.ejor.2021.08.004 .
  6. 1 2 Gascuel O, Steel M (2006). "Neighbor-joining revealed". Mol Biol Evol. 23 (11): 1997–2000. doi: 10.1093/molbev/msl072 . PMID   16877499.
  7. Desper R, Gascuel O (2005). The minimum evolution distance-based approach to phylogenetic inference in Mathematics of Evolution and Phylogeny. Oxford University Press, New York.
  8. Catanzaro D, Aringhieri R, Di Summa M, Pesenti R (2015). "A branch-price-and-cut algorithm for the minimum evolution problem". European Journal of Operational Research. 244 (3): 753–765. doi:10.1016/j.ejor.2015.02.019. S2CID   1549028.
  9. Pauplin Y (2000). "Direct calculation of a tree length using a distance matrix". Journal of Molecular Evolution. 51 (1): 41–47. Bibcode:2000JMolE..51...41P. doi:10.1007/s002390010065. PMID   10903371. S2CID   8619412.
  10. Desper R, Gascuel O (March 2004). "Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting". Molecular Biology and Evolution. 21 (3): 587–98. doi: 10.1093/molbev/msh049 . PMID   14694080.
  11. Vihn LS, von Haeseler A (2005). "Shortest triplet clustering: Reconstructing large phylogenies using representative sets". BMC Bioinformatics. 6: 1–14. doi: 10.1186/1471-2105-6-1 . PMC   545949 . PMID   15631638.
  12. Catanzaro D, Frohn M, Pesenti R (2020). "An information theory perspective on the Balanced Minimum Evolution Problem". Operations Research Letters. 48 (3): 362–367. doi:10.1016/j.orl.2020.04.010. S2CID   218998400.
  13. Catanzaro D, Labbé M, Pesenti R, Salazar-González JJ (2009). "Mathematical models to reconstruct phylogenetic trees under the minimum evolution criterion". Networks. 53 (2): 126–140. doi:10.1002/net.20281. S2CID   17792339.
  14. Catanzaro D, Aringhieri R, Di Summa M, Pesenti R (2015). "A branch-price-and-cut algorithm for the minimum evolution problem". European Journal of Operational Research. 244 (3): 753–765. doi:10.1016/j.ejor.2015.02.019. S2CID   1549028.
  15. Aringhieri R, Catanzaro D, Di Summa M (2011). "Optimal solutions for the balanced minimum evolution problem". Computers and Operations Research. 38 (12): 1845–1854. doi:10.1016/j.cor.2011.02.020. hdl: 2318/86826 . S2CID   9514013.
  16. Catanzaro D, Labbé M, Pesenti R, Salazar-González JJ (2012). "The balanced minimum evolution problem". INFORMS Journal on Computing. 24 (2): 276–294. doi:10.1287/ijoc.1110.0455.
  17. Catanzaro D, Labbé M, Pesenti R (2013). "The balanced minimum evolution problem under uncertain data". Discrete Applied Mathematics. 161 (13–14): 1789–1804. doi: 10.1016/j.dam.2013.03.012 .
  18. Catanzaro D, Pesenti R (2019). "Enumerating Vertices of the Balanced Minimum Evolution Polytope". Computers and Operations Research. 109: 209–217. doi:10.1016/j.cor.2019.05.001. S2CID   164835227.
  19. Catanzaro D, Pesenti R, Wolsey L (2020). "On the Balanced Minimum Evolution Polytope". Discrete Optimization. 36: 100570. doi:10.1016/j.disopt.2020.100570. S2CID   213389485.
  20. "ATGC: FastME". www.atgc-montpellier.fr.
  21. Catanzaro D, Pesenti R, Milinkovitch MC (2007). "An ant colony optimization algorithm for phylogenetic estimation under the minimum evolution principle". BMC Evolutionary Biology. 7: 228. doi: 10.1186/1471-2148-7-228 . PMC   2211314 . PMID   18005416.

Further reading