PHYLIP

PHYLogeny Inference Package
Original author(s) Joseph Felsenstein
Developer(s) University of Washington
Initial release October 1980; 43 years ago (1980-10)
Stable release 3.697 / 2 November 2014; 8 years ago (2014-11-02)
Written in C
Operating system Windows, Mac OS X, Linux
Platform x86, x86-64
Available in English
Type Phylogenetics
License ≥ v3.696: open-source
≤ v3.695: proprietary freeware
Website evolution.genetics.washington.edu/phylip.html

PHYLogeny Inference Package (PHYLIP) is a free computational phylogenetics package of programs for inferring evolutionary trees (phylogenies). [1] It consists of 65 portable programs whose source code is written in the C programming language. As of version 3.696, it is licensed as open-source software; versions 3.695 and older were proprietary freeware. Releases are distributed as source code and as precompiled executables for many operating systems, including Windows (95, 98, ME, NT, 2000, XP, Vista), Mac OS 8, Mac OS 9, OS X, and Linux (Debian, Red Hat); a FreeBSD version is available from FreeBSD.org. [2] Full documentation for all the programs is included in the package. The programs in the PHYLIP package were written by Professor Joseph Felsenstein of the Department of Genome Sciences and the Department of Biology, University of Washington, Seattle. [3]

Methods available in the package (each implemented by one or more programs) include parsimony, distance matrix, and likelihood methods, as well as bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters. [2]

Each program is controlled through a menu, which asks users which options they want to set and allows them to start the computation. The data are read into the program from a text file, which the user can prepare using any word processor or text editor; the file must be saved in flat ASCII (text-only) format, not in the word processor's own format. Some sequence analysis programs, such as the ClustalW alignment program, can write data files in the PHYLIP format. Most of the programs look for the data in a file called infile. If the PHYLIP programs do not find this file, they ask the user to type in the file name of the data file. [2]
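As an illustration of preparing such an input file programmatically, the following is a minimal Python sketch that writes an alignment to a plain ASCII file named infile. The function name write_infile and the NAME_WIDTH constant are hypothetical and not part of PHYLIP; the layout assumes the default sequential format with names padded to 10 characters, as described in the File format section.

```python
from pathlib import Path

NAME_WIDTH = 10  # assumed default width of the name field in strict PHYLIP files


def write_infile(records, path="infile"):
    """Write (name, sequence) pairs as a sequential PHYLIP file in plain ASCII."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    # Header line: number of taxa, then number of characters.
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        if len(name) > NAME_WIDTH:
            raise ValueError(f"name too long for the strict format: {name!r}")
        # Pad the name with blanks so the data start in column 11.
        lines.append(name.ljust(NAME_WIDTH) + seq)
    # encoding="ascii" raises UnicodeEncodeError on any non-ASCII character.
    out = Path(path)
    out.write_text("\n".join(lines) + "\n", encoding="ascii")
    return out
```

Calling write_infile([("Turkey", "AAGCT"), ("Chimp", "AAACC")]) would create a three-line file in the current directory, which is where most of the programs look for their data by default.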

File format

The component programs of PHYLIP use several different formats, all of which are relatively simple. Programs for the analysis of DNA sequence alignments, protein sequence alignments, or discrete characters (e.g., morphological data) can accept those data in sequential or interleaved format, as shown below.

Sequential format:

5 42
Turkey    AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
Salmo schiAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
H. sapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA

Interleaved format:

5 42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo schiAAGCCTTGGC AGTGCAGGGT
H. sapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

The numbers are the number of taxa (different species in the example shown above) followed by the number of characters (aligned nucleotides or amino acids in the case of molecular sequences). Restriction site data must include the number of enzymes as well.

Names are limited to 10 characters by default; each name must be padded with blanks to that length and is followed immediately by the character data in one-letter codes. The 10-character limit can be changed by a minor modification of the code (changing nmlngth in phylip.h and recompiling). All printable ASCII/ISO characters are allowed in names except parentheses ("(" and ")"), square brackets ("[" and "]"), the colon (":"), the semicolon (";"), and the comma (","). Spaces embedded in the alignment are ignored.
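The rules above (a header with two counts, a fixed 10-character name field, and ignored whitespace in the data) can be sketched as a small Python reader. This is only an illustration under stated assumptions, not the parsing code used by PHYLIP itself: the function name parse_strict_phylip is hypothetical, option characters in the header are ignored, and interleaved blocks are assumed to list the taxa in the same order each time.

```python
NAME_WIDTH = 10  # assumed default name field width (nmlngth)


def parse_strict_phylip(text, interleaved=False):
    """Return {name: sequence} from strict sequential or interleaved PHYLIP text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # drop blank lines
    ntaxa, nchars = (int(tok) for tok in lines[0].split()[:2])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if interleaved:
            starts_taxon = i < ntaxa  # only the first block carries names
        else:
            # A new taxon starts once the previous sequence is complete.
            starts_taxon = not names or len(seqs[-1]) >= nchars
        if starts_taxon:
            names.append(line[:NAME_WIDTH].strip())
            seqs.append("".join(line[NAME_WIDTH:].split()))
        elif interleaved:
            seqs[i % ntaxa] += "".join(line.split())
        else:
            seqs[-1] += "".join(line.split())
    if len(names) != ntaxa or any(len(s) != nchars for s in seqs):
        raise ValueError("alignment does not match the counts in the header")
    return dict(zip(names, seqs))
```

Because the name is taken from the first 10 columns rather than split on whitespace, a name such as "Salmo schi" keeps its embedded space, and the sequential and interleaved versions of the same alignment yield identical results.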

Many programs for phylogenetic analyses, including the commonly used RAxML [4] [5] and IQ-TREE [6] programs, use the phylip format or a minor modification of that format called the relaxed phylip format.

Relaxed phylip format (sequential):

5 42
Turkey                  AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
Salmo_schiefermuelleri  AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
H_sapiens               ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
Chimp                   AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
Gorilla                 AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA

The primary difference in relaxed phylip format is the absence of the 10-character limit and the removal of the need to blank-fill names to reach that length (although padding names so that the character matrix starts at the same position can improve readability for the user). This example of the relaxed format uses underscores rather than spaces in the names and uses spaces between the names and the aligned character data; it is often good practice to avoid white space within taxon names and to separate the character data from the name when generating files. Like strict phylip format files, relaxed phylip format files can be in interleaved format and include spaces and line breaks within the sequence data.
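The conventions just described can be sketched in Python as a writer and a reader for the relaxed sequential layout. The names to_relaxed_phylip and parse_relaxed_phylip are hypothetical, and the choice to replace white space and the characters disallowed in strict names with underscores is an assumption made for this sketch rather than a requirement of any particular program. The reader assumes one taxon per line.

```python
import re

# White space plus the characters that strict PHYLIP disallows in names.
UNSAFE = re.compile(r"[\s()\[\]:;,]")


def to_relaxed_phylip(records):
    """Format {name: sequence} as relaxed sequential PHYLIP text."""
    safe = [(UNSAFE.sub("_", name), seq) for name, seq in records.items()]
    if len({name for name, _ in safe}) != len(safe):
        raise ValueError("names collide after replacing unsafe characters")
    # Pad names so the character matrix starts in the same column on every line.
    width = max(len(name) for name, _ in safe) + 2
    lines = [f"{len(safe)} {len(safe[0][1])}"]
    lines += [name.ljust(width) + seq for name, seq in safe]
    return "\n".join(lines) + "\n"


def parse_relaxed_phylip(text):
    """Read relaxed sequential PHYLIP text with one taxon per line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntaxa, nchars = (int(tok) for tok in lines[0].split()[:2])
    records = {}
    for line in lines[1:]:
        # The name is the first token; any spaces inside the data are ignored.
        name, *chunks = line.split()
        records[name] = "".join(chunks)
    if len(records) != ntaxa or any(len(s) != nchars for s in records.values()):
        raise ValueError("alignment does not match the counts in the header")
    return records
```

Splitting on white space is what makes names without embedded spaces convenient here: the boundary between name and data no longer depends on a fixed column.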

The programs that use distance data, like the neighbor program that implements the neighbor-joining method, also use a simple distance matrix format that includes only the number of taxa, their names, and numerical values for the distances:

Phylip distance matrix:

7
Bovine    0.0000 1.6866 1.7198 1.6606 1.5243 1.6043 1.5905
Mouse     1.6866 0.0000 1.5232 1.4841 1.4465 1.4389 1.4629
Gibbon    1.7198 1.5232 0.0000 0.7115 0.5958 0.6179 0.5583
Orang     1.6606 1.4841 0.7115 0.0000 0.4631 0.5061 0.4710
Gorilla   1.5243 1.4465 0.5958 0.4631 0.0000 0.3484 0.3083
Chimp     1.6043 1.4389 0.6179 0.5061 0.3484 0.0000 0.2692
Human     1.5905 1.4629 0.5583 0.4710 0.3083 0.2692 0.0000

The number on the first line is the number of taxa, and the same limitations on taxon names apply. Note that this matrix is symmetric and the diagonal has values of 0 (since the distance between a taxon and itself is zero by definition).
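The two properties noted above, symmetry and a zero diagonal, are easy to verify when reading such a file. The following Python sketch assumes a full square matrix with one row per line and the default 10-character name field; the function names parse_distance_matrix and check_distance_matrix and the tolerance value are hypothetical.

```python
NAME_WIDTH = 10  # assumed default name field width


def parse_distance_matrix(text):
    """Return (names, rows) from a square PHYLIP distance matrix."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntaxa = int(lines[0].split()[0])
    names, rows = [], []
    for line in lines[1:]:
        names.append(line[:NAME_WIDTH].strip())
        rows.append([float(tok) for tok in line[NAME_WIDTH:].split()])
    if len(names) != ntaxa or any(len(row) != ntaxa for row in rows):
        raise ValueError("expected a square matrix with one row per taxon")
    return names, rows


def check_distance_matrix(rows, tol=1e-9):
    """Raise ValueError unless the matrix is symmetric with a zero diagonal."""
    for i, row in enumerate(rows):
        if abs(row[i]) > tol:
            raise ValueError(f"non-zero diagonal entry in row {i}")
        for j in range(i):
            if abs(row[j] - rows[j][i]) > tol:
                raise ValueError(f"entries ({i}, {j}) and ({j}, {i}) differ")
```

A typical use would be to call parse_distance_matrix on the file contents and pass the returned rows to check_distance_matrix before handing the file to a distance-based program.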

Programs that use trees as input accept the trees in Newick format, an informal standard agreed to in 1986 by authors of seven major phylogeny packages. Output is written to files with names like outfile and outtree. Trees written to outtree are in the Newick format.

Component programs

Programs listed in PHYLIP [7]

| Program name | Description |
| --- | --- |
| protpars | Estimates phylogenies of peptide sequences using the parsimony method |
| dnapars | Estimates phylogenies of DNA sequences using the parsimony method |
| dnapenny | DNA parsimony branch-and-bound method; finds all of the most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search |
| dnamove | Interactive construction of phylogenies from nucleic acid sequences, with their evaluation by the DNA parsimony and compatibility methods and display of reconstructed ancestral bases |
| dnacomp | Estimates phylogenies from nucleic acid sequence data using the compatibility criterion |
| dnaml | Estimates phylogenies from nucleotide sequences using the maximum likelihood method |
| dnamlk | DNA maximum likelihood method with molecular clock; using both dnaml and dnamlk together permits a likelihood-ratio test for the molecular clock hypothesis |
| proml | Estimates phylogenies from protein amino acid sequences using the maximum likelihood method |
| promlk | Protein sequence maximum likelihood method with molecular clock |
| restml | Estimation of phylogenies by maximum likelihood using restriction sites data; not from restriction fragments but from the presence or absence of individual sites |
| dnainvar | For nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants, which test alternative tree topologies |
| dnadist | DNA distance method which computes four different distances between species from nucleic acid sequences; the distances can then be used in the distance matrix programs |
| protdist | Protein sequence distance method which computes a distance measure for sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid |
| restdist | Distances calculated from restriction sites data or restriction fragments data |
| seqboot | Bootstrapping and jackknifing program; reads in a data set and emits multiple data sets from it by bootstrap resampling |
| fitch | Fitch-Margoliash distance matrix method; estimates phylogenies from distance matrix data under the additive tree model, according to which the distances are expected to equal the sums of branch lengths between species |
| kitsch | Fitch-Margoliash distance matrix method with molecular clock; estimates phylogenies from distance matrix data under the ultrametric model, which is the same as the additive tree model except that an evolutionary clock is assumed |
| neighbor | Implementation of the neighbor-joining and UPGMA methods |
| contml | Maximum likelihood for continuous characters and gene frequencies; estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations; also does maximum likelihood analysis of continuous characters that evolve by a Brownian motion model, assuming that the characters evolve at equal rates and in an uncorrelated fashion; does not account for character correlations |
| contrast | Reads a tree from a tree file and a data set with continuous characters data, and emits the independent contrasts for those characters, for use in any multivariate statistics package |
| gendist | Genetic distance program which computes one of three different genetic distance formulas from gene frequency data |
| pars | Unordered multistate discrete-characters parsimony method |
| mix | Estimates phylogenies by some parsimony methods for discrete character data with two states (0, 1); allows the Wagner and Camin-Sokal methods or arbitrary mixes of them |
| penny | Branch-and-bound mixed method which finds all of the most parsimonious phylogenies for discrete-character data with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria, using the branch-and-bound method of exact search |
| move | Interactive construction of phylogenies from discrete character data with two states (0, 1); evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree |
| dollop | Estimates phylogenies by the Dollo or polymorphism parsimony criteria for discrete character data with two states (0, 1) |
| dolpenny | Finds all of the most parsimonious phylogenies for discrete-character data with two states, for the Dollo or polymorphism parsimony criteria, using the branch-and-bound method of exact search |
| dolmove | Interactive construction of phylogenies from discrete character data with two states (0, 1) using the Dollo or polymorphism parsimony criteria; evaluates parsimony and compatibility criteria for those phylogenies; displays reconstructed states throughout the tree |
| clique | Finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states (0, 1); the largest clique (or all cliques within a given size range of the largest one) is found by a fast branch-and-bound search method |
| factor | Character recoding program which takes discrete multistate data with character state trees and emits the corresponding data set with two states (0, 1) |
| drawgram | Rooted tree drawing program which plots rooted phylogenies, cladograms, and phenograms in a wide variety of user-controllable formats; the program is interactive and allows previewing of the tree on PC or Macintosh graphics screens, and Tektronix or Digital graphics terminals |
| drawtree | Unrooted tree drawing program similar to drawgram, but plots unrooted phylogenies |
| consense | Consensus tree program which computes trees by the majority-rule consensus tree method, which also allows easily finding the strict consensus tree; unable to compute the Adams consensus tree |
| treedist | Computes the Robinson–Foulds symmetric difference distance between trees, which measures differences in tree topology |
| retree | Interactive tree rearrangement program which reads in a tree (with branch lengths if needed) and allows the user to reroot the tree, flip branches, change species names and branch lengths, and then write the result out; can be used to convert between rooted and unrooted trees |

References

  1. Felsenstein, J. (1981). "Evolutionary trees from DNA sequences: A maximum likelihood approach". Journal of Molecular Evolution. 17 (6): 368–376. Bibcode:1981JMolE..17..368F. doi:10.1007/BF01734359. PMID 7288891. S2CID 8024924.
  2. "PHYLIP general information page". Retrieved 2010-02-14.
  3. Joseph Felsenstein (August 2003). Inferring Phylogenies. Sinauer Associates. ISBN 0-87893-177-5. Archived from the original on 2011-10-22. Retrieved 2006-03-24.
  4. Stamatakis, Alexandros (2014-05-01). "RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies". Bioinformatics. 30 (9): 1312–1313. doi:10.1093/bioinformatics/btu033. ISSN 1460-2059. PMC 3998144. PMID 24451623.
  5. Kozlov, Alexey M; Darriba, Diego; Flouri, Tomáš; Morel, Benoit; Stamatakis, Alexandros (2019-11-01). Wren, Jonathan (ed.). "RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference". Bioinformatics. 35 (21): 4453–4455. doi:10.1093/bioinformatics/btz305. ISSN 1367-4803. PMC 6821337. PMID 31070718.
  6. Minh, Bui Quang; Schmidt, Heiko A; Chernomor, Olga; Schrempf, Dominik; Woodhams, Michael D; von Haeseler, Arndt; Lanfear, Robert (2020-05-01). Teeling, Emma (ed.). "IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era". Molecular Biology and Evolution. 37 (5): 1530–1534. doi:10.1093/molbev/msaa015. ISSN 0737-4038. PMC 7182206. PMID 32011700.
  7. "PHYLIP package documentation mirror site". Archived from the original on 2005-10-19. Retrieved 2006-03-24.