Supertree

A supertree is a single phylogenetic tree assembled from a combination of smaller phylogenetic trees, which may have been assembled using different datasets (e.g. morphological and molecular) or a different selection of taxa. [1] Supertree algorithms can highlight areas where additional data would most usefully resolve any ambiguities. [2] The input trees of a supertree should behave as samples from the larger tree. [3]

Construction methods

The number of candidate supertrees grows at least exponentially with the number of taxa included; for a tree of any reasonable size it is therefore not possible to examine every possible supertree and weigh its success at combining the input information. Heuristic methods are thus essential, although they may be unreliable: the result is often biased or affected by irrelevant characteristics of the input data. [1]
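To give a sense of scale, the short sketch below counts the rooted, binary, leaf-labelled trees on a given number of taxa using the standard double-factorial formula (2n − 3)!!. The formula is a general combinatorial result rather than something stated in the sources cited here, and the function name is purely illustrative.

```python
def count_rooted_binary_trees(n_taxa: int) -> int:
    """Count rooted, binary, leaf-labelled trees on n_taxa leaves.

    Uses the standard result (2n - 3)!! = 3 * 5 * ... * (2n - 3),
    with the count defined as 1 for one or two taxa.
    """
    if n_taxa < 1:
        raise ValueError("n_taxa must be at least 1")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count


# Illustrative use: the count explodes long before data sets become large.
for n in (3, 4, 10, 20):
    print(n, count_rooted_binary_trees(n))
```

Ten taxa already allow more than 34 million rooted binary trees, which is why exhaustive evaluation is only practical for very small problems.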

The best-known method for supertree construction is Matrix Representation with Parsimony (MRP), in which the input source trees are represented by matrices of 0s, 1s, and ?s: each edge in each source tree defines a bipartition of the leaf set into two disjoint parts; the leaves on one side get 0, the leaves on the other side get 1, and the leaves missing from that tree get ?. The matrices are concatenated and then analyzed using heuristics for maximum parsimony. [4] Another approach is a maximum likelihood version of MRP called "MRL" (matrix representation with likelihood), which analyzes the same MRP matrix but uses heuristics for maximum likelihood instead of maximum parsimony to construct the supertree.
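The matrix encoding can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: it assumes rooted source trees written as nested Python tuples of taxon names, and it assumes the common convention that the leaves inside a clade receive 1 and the remaining leaves of that tree receive 0. All function names are hypothetical, and the parsimony (or likelihood) search on the resulting matrix is not shown.

```python
def leaves(tree):
    """Return the set of taxon names at the tips of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def internal_clades(tree):
    """Leaf sets of every internal node below the root (one per internal edge)."""
    found = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def mrp_matrix(trees):
    """Build the concatenated MRP matrix as {taxon: string of 0/1/?}."""
    all_taxa = sorted(set().union(*(leaves(t) for t in trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in internal_clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")  # taxon missing from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")  # assumed convention: inside the clade
                else:
                    rows[taxon].append("0")  # in the tree, outside the clade
    return {taxon: "".join(cells) for taxon, cells in rows.items()}


# Two small, made-up source trees with overlapping taxon sets.
example_trees = [(("A", "B"), ("C", "D")), (("A", "C"), "E")]
for taxon, row in mrp_matrix(example_trees).items():
    print(taxon, row)
```

Each column corresponds to one internal edge of one source tree, so the two example trees contribute two columns and one column respectively.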

The Robinson-Foulds distance is the most popular of many ways of measuring how similar a supertree is to the input trees. It counts the clades that appear in one tree but not in the other, and so reflects how many clades from the input trees are retained in the supertree. Robinson-Foulds optimization methods search for a supertree that minimizes the total (summed) Robinson-Foulds distance between the (binary) supertree and each input tree. [1] The supertree can hence be viewed as the median of the input trees under the Robinson-Foulds distance. Alternative approaches have been developed to infer median supertrees based on different metrics, e.g. relying on triplet or quartet decompositions of the trees. [5]
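The summed score that such methods try to minimize can be illustrated as follows. This is a rough sketch under stated assumptions: trees are rooted and written as nested tuples of taxon names, every input taxon also occurs in the supertree, the supertree is pruned to each input tree's taxa before comparison, and the distance is taken as the size of the symmetric difference of the two clade sets. The function names are hypothetical, and the search over candidate supertrees is not shown.

```python
def leaf_set(tree):
    """Return the set of taxon names at the tips of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def restrict(tree, keep):
    """Prune a tree to the taxa in `keep`, suppressing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return tuple(kept)


def clade_set(tree):
    """Return the leaf sets of all internal nodes of a tree."""
    clades = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        clades.add(members)
        return members

    visit(tree)
    return clades


def rf_distance(tree_a, tree_b):
    """Clades found in exactly one of two trees on the same taxa."""
    return len(clade_set(tree_a) ^ clade_set(tree_b))


def total_rf(supertree, input_trees):
    """Sum of RF distances between the pruned supertree and each input tree."""
    return sum(
        rf_distance(restrict(supertree, leaf_set(tree)), tree)
        for tree in input_trees
    )


# Made-up example: compare two candidate supertrees against two input trees.
inputs = [(("A", "B"), ("C", "D")), (("A", "C"), "E")]
candidate_1 = ((("A", "B"), "C"), ("D", "E"))
candidate_2 = ((("A", "C"), "B"), ("D", "E"))
print(total_rf(candidate_1, inputs), total_rf(candidate_2, inputs))
```

Under this scoring the first candidate is preferred because its summed distance is lower; a median-supertree method would keep searching for candidates with a smaller total.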

A recent innovation has been the construction of maximum likelihood supertrees and the use of "input-tree-wise" likelihood scores to test between two supertrees. [6]

Additional methods include the Min Cut Supertree approach, [7] Most Similar Supertree Analysis (MSSA), Distance Fit (DFIT) and Quartet Fit (QFIT), implemented in the software CLANN. [8] [9]

Application

Supertrees have been applied to produce phylogenies of many groups, notably the angiosperms, [10] eukaryotes [11] and mammals. [12] They have also been applied to larger-scale problems such as the origins of diversity, vulnerability to extinction, [13] and evolutionary models of ecological structure. [14]

References

  1. Bansal, M.; Burleigh, J.; Eulenstein, O.; Fernández-Baca, D. (2010). "Robinson-Foulds supertrees". Algorithms for Molecular Biology. 5: 18. doi:10.1186/1748-7188-5-18. PMC 2846952. PMID 20181274.
  2. "Supertree: Introduction". genome.cs.iastate.edu.
  3. Gordon, A. (1986). "Consensus supertrees: the synthesis of rooted trees containing overlapping sets of labeled leaves". Journal of Classification. 3 (2): 335–348. doi:10.1007/BF01894195. S2CID 122146129.
  4. Ragan, Mark A. (1992). "Phylogenetic inference based on matrix representation of trees". Molecular Phylogenetics and Evolution. 1 (1): 53–58. doi:10.1016/1055-7903(92)90035-F. ISSN 1055-7903. PMID 1342924.
  5. Ranwez, Vincent; Criscuolo, Alexis; Douzery, Emmanuel J.P. (2010-06-15). "SuperTriplets: a triplet-based supertree approach to phylogenomics". Bioinformatics. 26 (12): i115–i123. doi:10.1093/bioinformatics/btq196. ISSN 1367-4811. PMC 2881381. PMID 20529895.
  6. Akanni, Wasiu A.; Creevey, Christopher J.; Wilkinson, Mark; Pisani, Davide (2014-06-12). "L.U.St: a tool for approximated maximum likelihood supertree reconstruction". BMC Bioinformatics. 15 (1): 183. doi:10.1186/1471-2105-15-183. ISSN 1471-2105. PMC 4073192. PMID 24925766.
  7. Semple, C. (2000). "A supertree method for rooted trees". Discrete Applied Mathematics. 105 (1–3): 147–158. CiteSeerX 10.1.1.24.6784. doi:10.1016/S0166-218X(00)00202-X.
  8. Creevey, C. J.; McInerney, J. O. (2005-02-01). "Clann: investigating phylogenetic information through supertree analyses". Bioinformatics. 21 (3): 390–392. doi:10.1093/bioinformatics/bti020. ISSN 1367-4803. PMID 15374874.
  9. Creevey, C. J.; McInerney, J. O. (2009-01-01). "Trees from Trees: Construction of Phylogenetic Supertrees Using Clann" (PDF). In Posada, David (ed.). Bioinformatics for DNA Sequence Analysis. Methods in Molecular Biology. Vol. 537. Humana Press. pp. 139–161. doi:10.1007/978-1-59745-251-9_7. ISBN 978-1-58829-910-9. PMID 19378143.
  10. Davies, T.; Barraclough, T.; Chase, M.; Soltis, P.; Soltis, D.; Savolainen, V. (2004). "Darwin's abominable mystery: Insights from a supertree of the angiosperms". Proceedings of the National Academy of Sciences of the United States of America. 101 (7): 1904–1909. Bibcode:2004PNAS..101.1904D. doi:10.1073/pnas.0308127100. PMC 357025. PMID 14766971.
  11. Pisani, D.; Cotton, J.; McInerney, J. (2007). "Supertrees disentangle the chimerical origin of eukaryotic genomes". Molecular Biology and Evolution. 24 (8): 1752–1760. doi:10.1093/molbev/msm095. PMID 17504772.
  12. Bininda-Emonds, O.; Cardillo, M.; Jones, K.; MacPhee, R.; Beck, R.; Grenyer, R.; Price, S.; Vos, R.; Gittleman, J.; Purvis, A. (2007). "The delayed rise of present-day mammals". Nature. 446 (7135): 507–512. Bibcode:2007Natur.446..507B. doi:10.1038/nature05634. PMID 17392779. S2CID 4314965.
  13. Davies, T.; Fritz, S.; Grenyer, R.; Orme, C.; Bielby, J.; Bininda-Emonds, O.; Cardillo, M.; Jones, K.; Gittleman, J.; Mace, G. M.; Purvis, A. (2008). "Phylogenetic trees and the future of mammalian biodiversity". Proceedings of the National Academy of Sciences of the United States of America. 105 (Supplement 1): 11556–11563. Bibcode:2008PNAS..10511556D. doi:10.1073/pnas.0801917105. PMC 2556418. PMID 18695230.
  14. Webb, C. O.; Ackerly, D. D.; McPeek, M. A.; Donoghue, M. J. (2002). "Phylogenies and Community Ecology". Annual Review of Ecology and Systematics. 33: 475–505. doi:10.1146/annurev.ecolsys.33.010802.150448. S2CID 535590.