Perfect phylogeny

Perfect phylogeny is a term used in computational phylogenetics to denote a phylogenetic tree in which all internal nodes may be labeled such that all characters evolve down the tree without homoplasy. That is, no character arises through convergent evolution, and no shared character state reflects an analogous structure. Formally, the ancestor at the root has state "0" for every character, where 0 represents the absence of that character. Each character changes from 0 to 1 exactly once and never reverts to state 0. Real data rarely adheres to the concept of perfect phylogeny. [1] [2]

Building

In general, two different types of data are used in the construction of a phylogenetic tree. Distance-based methods build a tree by relating the pairwise distances between species to the edge lengths of a corresponding tree. Character-based methods take the character states of each species as input and attempt to find the most "perfect" phylogenetic tree. [3] [4]

A perfect phylogenetic tree can be defined formally as follows: [3]

A perfect phylogeny for an n × m character state matrix M is a rooted tree T with n leaves satisfying:

i. Each row of M labels exactly one leaf of T
ii. Each column of M labels exactly one edge of T
iii. Every interior edge of T is labeled by at least one column of M
iv. The characters associated with the edges along the unique path from the root to a leaf v exactly specify the character vector of v, i.e. the character vector has a 1 entry in all columns corresponding to characters associated with path edges and a 0 entry otherwise.
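The conditions above suggest a simple constructive check, sketched here in Python under stated assumptions: the matrix is binary, the root has state 0 for every character, and no column is all zeros. The function name `build_perfect_phylogeny` and the dictionary-based tree representation are illustrative choices, not part of the cited definition. Each row is inserted as a root-to-leaf path, with its characters ordered from most to least frequent; the construction fails as soon as a character would have to label a second edge.

```python
def build_perfect_phylogeny(matrix):
    """Try to build a rooted tree in which every character labels one edge.

    matrix: list of rows, each a list of 0/1 character states.
    Returns (children, row_node) on success, or None if some character
    would need to label more than one edge (no perfect phylogeny).
    """
    n_chars = len(matrix[0]) if matrix else 0
    # Number of rows carrying each character (column sums).
    counts = [sum(row[j] for row in matrix) for j in range(n_chars)]
    # Visit frequent characters first; break ties by column index.
    order = sorted(range(n_chars), key=lambda j: (-counts[j], j))

    children = {0: {}}  # node id -> {character: child node id}; node 0 is the root
    edge_of = {}        # character -> node its single edge leads to
    row_node = []       # node at which each row's path ends

    for row in matrix:
        node = 0
        for j in order:
            if row[j] != 1:
                continue
            if j not in children[node]:
                if j in edge_of:
                    # Character j already labels an edge elsewhere in the tree.
                    return None
                new_id = len(children)
                children[node][j] = new_id
                children[new_id] = {}
                edge_of[j] = new_id
            node = children[node][j]
        row_node.append(node)
    return children, row_node
```

In this sketch a row may end at an internal node; to match condition (i) literally, such a row would be attached as a leaf through an extra unlabeled edge. Identical columns end up on consecutive edges rather than sharing one edge, which can be collapsed afterwards if a single multi-labeled edge is preferred.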

Actual phylogenetic data rarely satisfies these conditions. Researchers therefore often compromise by building trees that minimize homoplasy, by finding a maximum-cardinality set of compatible characters, or by constructing phylogenies that match the partitions implied by the characters as closely as possible.
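As a rough illustration of the maximum-cardinality option, the following Python sketch searches all subsets of characters by brute force, largest first. It assumes a non-empty binary matrix with an all-zero root, in which case two characters are compatible when the sets of rows carrying them are disjoint or nested. The names `compatible` and `largest_compatible_subset` are hypothetical, and the exhaustive search is only practical for a handful of characters.

```python
from itertools import combinations


def compatible(rows_a, rows_b):
    """Two characters are compatible if their row sets are disjoint or nested."""
    shared = rows_a & rows_b
    return not shared or shared == rows_a or shared == rows_b


def largest_compatible_subset(matrix):
    """Return column indices of one largest pairwise-compatible character set."""
    n_chars = len(matrix[0])
    # For each character, the set of rows in which it has state 1.
    ones = [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(n_chars)
    ]
    # Try the largest subsets first and stop at the first that works.
    for size in range(n_chars, 0, -1):
        for subset in combinations(range(n_chars), size):
            if all(compatible(ones[a], ones[b])
                   for a, b in combinations(subset, 2)):
                return list(subset)
    return []
```

For the matrix `[[1, 1, 0], [1, 0, 0], [0, 1, 1]]`, characters 0 and 1 conflict, so the sketch keeps characters 0 and 2.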

Example

The two data sets are examples of character state matrices. For matrix M'1, a phylogenetic tree can be constructed in which each character labels exactly one edge. For matrix M'2, in contrast, no phylogenetic tree exists in which each character labels only one edge. [3]

If the samples come from variant allelic frequency (VAF) data of a population of cells under study, the entries in the character matrix are frequencies of mutations and take values between 0 and 1. Namely, if i represents a position in the genome, then the entry corresponding to position i and sample p holds the frequency of genomes in sample p with a mutation at position i. [5] [6] [7] [8] [9]
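To make the contrast between a compatible and an incompatible binary matrix concrete, the sketch below uses two small hypothetical matrices. They are illustrative stand-ins, not the matrices M'1 and M'2 from [3]. Assuming a binary matrix with an all-zero root, a pair of characters conflicts when some row carries both, some row carries only the first, and some row carries only the second. The function name `conflicting_pairs` is an assumption made for this example.

```python
from itertools import combinations


def conflicting_pairs(matrix):
    """List the pairs of columns that cannot both label a single edge each."""
    n_chars = len(matrix[0])
    # For each character, the set of rows in which it has state 1.
    ones = [
        {i for i, row in enumerate(matrix) if row[j] == 1}
        for j in range(n_chars)
    ]
    conflicts = []
    for a, b in combinations(range(n_chars), 2):
        shared = ones[a] & ones[b]
        # Overlapping but not nested row sets cannot fit on one tree.
        if shared and shared != ones[a] and shared != ones[b]:
            conflicts.append((a, b))
    return conflicts


# Hypothetical example matrices (rows are species, columns are characters).
COMPATIBLE = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
INCOMPATIBLE = [
    [1, 1],
    [1, 0],
    [0, 1],
]

print(conflicting_pairs(COMPATIBLE))    # []
print(conflicting_pairs(INCOMPATIBLE))  # [(0, 1)]
```

An empty result means every character can be placed on exactly one edge; any reported pair rules out a perfect phylogeny for the matrix as given.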

Usage

Perfect phylogeny is a theoretical framework that also underlies more practical methods. One example is Incomplete Directed Perfect Phylogeny, which applies perfect phylogeny to real, and therefore incomplete and imperfect, datasets. The method uses short interspersed nuclear elements (SINEs) to determine evolutionary similarity. SINEs are present across many genomes and can be identified by their flanking sequences, and they provide information on the inheritance of certain traits across different species. However, when a SINE is missing, it is difficult to know whether it was present before a deletion. Algorithms derived from perfect phylogeny make it possible to attempt to reconstruct a phylogenetic tree in spite of these limitations. [10]

Perfect phylogeny is also used in the construction of haplotype maps. The concepts and algorithms of perfect phylogeny can be used to recover information about missing and unavailable haplotype data. [11] Assuming that the set of haplotypes resulting from genotype mapping adheres to the concept of perfect phylogeny (along with other assumptions, such as perfect Mendelian inheritance and only one mutation per SNP), one can infer the missing haplotype data. [12] [13] [14] [15]

Inferring a phylogeny from noisy VAF data under the PPM is a hard problem. [5] Most inference tools include some heuristic step to make inference computationally tractable. Examples of tools that infer phylogenies from noisy VAF data include AncesTree, Canopy, CITUP, EXACT, and PhyloWGS. [5] [6] [7] [8] [9] In particular, EXACT performs exact inference by using GPUs to compute a posterior probability over all possible trees for small problems. Extensions to the PPM have been made, with accompanying tools. [16] [17] For example, tools such as MEDICC, TuMult, and FISHtrees allow the number of copies of a given genetic element, or ploidy, to either increase or decrease, thus effectively allowing the removal of mutations. [18] [19] [20]

References

  1. Fernandez-Baca D. "The Perfect Phylogeny Problem" (PDF). Kluwer Academic Publishers. Retrieved 30 September 2012.
  2. Nakhleh L, Ringe D, Warnow T. "Perfect Phylogenetic Networks: A New Methodology for Reconstructing the Evolutionary History of Natural Languages" (PDF). Retrieved 1 October 2012.
  3. Uhler C. "Finding a Perfect Phylogeny" (PDF). Archived from the original (PDF) on 4 March 2016. Retrieved 29 September 2012.
  4. Nikaido M, Rooney AP, Okada N (August 1999). "Phylogenetic relationships among cetartiodactyls based on insertions of short and long interpersed elements: hippopotamuses are the closest extant relatives of whales". Proceedings of the National Academy of Sciences of the United States of America. 96 (18): 10261–6. Bibcode:1999PNAS...9610261N. doi:10.1073/pnas.96.18.10261. PMC 17876. PMID 10468596.
  5. El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ (June 2015). "Reconstruction of clonal trees and tumor composition from multi-sample sequencing data". Bioinformatics. 31 (12): i62–70. doi:10.1093/bioinformatics/btv261. PMC 4542783. PMID 26072510.
  6. Satas G, Raphael BJ (July 2017). "Tumor phylogeny inference using tree-constrained importance sampling". Bioinformatics. 33 (14): i152–i160. doi:10.1093/bioinformatics/btx270. PMC 5870673. PMID 28882002.
  7. Malikic S, McPherson AW, Donmez N, Sahinalp CS (May 2015). "Clonality inference in multiple tumor samples using phylogeny". Bioinformatics. 31 (9): 1349–56. doi:10.1093/bioinformatics/btv003. PMID 25568283.
  8. Ray S, Jia B, Safavi S, van Opijnen T, Isberg R, Rosch J, Bento J (2019-08-22). "Exact inference under the perfect phylogeny model". arXiv:1908.08623 [q-bio.QM].
  9. Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q (February 2015). "PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors". Genome Biology. 16 (1): 35. doi:10.1186/s13059-015-0602-8. PMC 4359439. PMID 25786235.
  10. Pe'er I, Pupko T, Shamir R, Sharan R. "Incomplete Directed Perfect Phylogeny". Tel-Aviv University. Archived from the original on 20 October 2013. Retrieved 30 October 2012.
  11. Eskin E, Halperin E, Karp RM (April 2003). "Efficient reconstruction of haplotype structure via perfect phylogeny" (PDF). Journal of Bioinformatics and Computational Biology. University of California, Berkeley. 1 (1): 1–20. doi:10.1142/S0219720003000174. PMID 15290779. Retrieved 30 October 2012.
  12. Gusfield D. "An Overview of Computational Methods for Haplotype Inference" (PDF). University of California, Davis. Retrieved 18 November 2012.
  13. Ding Z, Filkov V, Gusfield D. "A Linear Time Algorithm for the Perfect Phylogeny Haplotyping Problem". University of California, Davis. Retrieved 18 November 2012.
  14. Bafna V, Gusfield D, Lancia G, Yooseph S (2003). "Haplotyping as perfect phylogeny: a direct approach". Journal of Computational Biology. 10 (3–4): 323–40. doi:10.1089/10665270360688048. PMID 12935331.
  15. Seyalioglu H. "Haplotyping as Perfect Phylogeny" (PDF). Archived from the original (PDF) on 30 September 2011. Retrieved 30 October 2012.
  16. Bonizzoni P, Carrieri AP, Della Vedova G, Trucco G (October 2014). "Explaining evolution via constrained persistent perfect phylogeny". BMC Genomics. 15 Suppl 6 (S6): S10. doi:10.1186/1471-2164-15-S6-S10. PMC 4240218. PMID 25572381.
  17. Hajirasouliha I, Raphael BJ (2014), Brown D, Morgenstern B (eds.), "Reconstructing Mutational History in Multiply Sampled Tumors Using Perfect Phylogeny Mixtures", Algorithms in Bioinformatics, Lecture Notes in Computer Science, Springer Berlin Heidelberg, vol. 8701, pp. 354–367, doi:10.1007/978-3-662-44753-6_27, ISBN 9783662447529
  18. Schwarz RF, Trinh A, Sipos B, Brenton JD, Goldman N, Markowetz F (April 2014). Beerenwinkel N (ed.). "Phylogenetic quantification of intra-tumour heterogeneity". PLOS Computational Biology. 10 (4): e1003535. arXiv:1306.1685. Bibcode:2014PLSCB..10E3535S. doi:10.1371/journal.pcbi.1003535. PMC 3990475. PMID 24743184.
  19. Letouzé E, Allory Y, Bollet MA, Radvanyi F, Guyon F (2010). "Analysis of the copy number profiles of several tumor samples from the same patient reveals the successive steps in tumorigenesis". Genome Biology. 11 (7): R76. doi:10.1186/gb-2010-11-7-r76. PMC 2926787. PMID 20649963.
  20. Gertz EM, Chowdhury SA, Lee WJ, Wangsa D, Heselmeyer-Haddad K, Ried T, et al. (2016-06-30). "FISHtrees 3.0: Tumor Phylogenetics Using a Ploidy Probe". PLOS ONE. 11 (6): e0158569. Bibcode:2016PLoSO..1158569G. doi:10.1371/journal.pone.0158569. PMC 4928784. PMID 27362268.