Nexus file

Nexus format
Filename extensions: usually .nex or .nxs
Internet media type: application/octet-stream
Magic number: '#NEXUS\n'
Developed by: Maddison DR, Swofford DL, Maddison WP
Initial release: December 1997
Type of format: bioinformatics
Open format: Yes

The extensible NEXUS file format is widely used in phylogenetics, evolutionary biology, and bioinformatics. It stores information about taxa, morphological character states, DNA and protein sequence alignments, distances, and phylogenetic trees. [1] The format can also store data that facilitates analyses, such as named sets of characters or taxa. Many popular phylogenetic programs, including PAUP*, [2] MrBayes, [3] Mesquite, [4] MacClade, [5] and SplitsTree, [6] use this format. NEXUS file names typically have the extension .nxs or .nex.


Syntax

A NEXUS file consists of a fixed header, #NEXUS, followed by multiple blocks. Each block starts with BEGIN block_name; and ends with END;. Keywords are case-insensitive. Comments are enclosed in square brackets [...]. [7] Each of the predefined block types may appear only once.

Block Name    Description
TAXA          Specifies the OTUs (operational taxonomic units) in the data set
CHARACTERS    Specifies the character data (e.g., homologous morphological characters or a multiple sequence alignment)
DATA          Equivalent to a CHARACTERS block that includes the NewTaxa subcommand in the Dimensions command
TREES         Stores trees in Newick format
DISTANCES     Stores distance matrices
SETS          Assigns names to sets of characters (CHARSET) or OTUs (TAXSET)
ASSUMPTIONS   Assumptions about the data or directions regarding data treatment (e.g., the character exclusion status)
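
The syntax rules above (fixed header, BEGIN ... END; blocks, case-insensitive keywords, bracketed comments, one block per type) can be illustrated with a minimal Python sketch. The function names strip_comments and split_blocks are hypothetical, and the sketch assumes that square brackets never appear inside quoted tokens and that END; does not occur inside a block body; it is not a full NEXUS parser.

```python
import re


def strip_comments(text):
    """Remove [...] comments, tracking depth so nested brackets are skipped too."""
    out, depth = [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def split_blocks(text):
    """Return {BLOCK_NAME: body} for a NEXUS string (illustrative helper only)."""
    # The file must start with the fixed header; keywords are case-insensitive.
    if not text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("missing #NEXUS header")
    body = strip_comments(text)
    pattern = re.compile(
        r"\bBEGIN\s+(\w+)\s*;(.*?)\bEND\s*;", re.IGNORECASE | re.DOTALL
    )
    blocks = {}
    for name, content in pattern.findall(body):
        key = name.upper()
        # This sketch treats a repeated block type as an error.
        if key in blocks:
            raise ValueError(f"block {key} appears more than once")
        blocks[key] = content.strip()
    return blocks


if __name__ == "__main__":
    sample = "#NEXUS\nBegin TAXA;\n  Dimensions ntax=2;\n  TaxLabels A B;\nEnd;\n"
    for name, content in split_blocks(sample).items():
        print(name, "->", content)
```

Normalising block names to upper case is one simple way to honour case-insensitivity; real parsers also need to handle quoted labels and commands inside each block.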

The following example NEXUS uses the TAXA, CHARACTERS, and TREES blocks:

#NEXUS
Begin TAXA;
  Dimensions ntax=4;
  TaxLabels Alpha Beta Gamma Delta;
End;

Begin CHARACTERS;
  Dimensions nchar=15;
  Format datatype=dna missing=? gap=- matchchar=.;
  Matrix
    [ When a position is a "matchchar", it means that it is the same as the first entry at the same position. ]
    Alpha   atgctagctagctcg
    Beta    ......??...-.a.
    Gamma   ...t.......-.g. [ same as atgttagctag-tgg ]
    Delta   ...t.......-.a.
  ;
End;

Begin TREES;
  Tree tree1 = ((Alpha,Beta),Gamma,Delta);
END;
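
The matchchar convention used in the matrix above can be made concrete with a short Python sketch. The helper name expand_matchchar is hypothetical, and the rows are assumed to have already been extracted from the Matrix command as (label, sequence) pairs with comments removed.

```python
def expand_matchchar(rows, matchchar="."):
    """Replace each matchchar with the first row's character at that position.

    rows: list of (label, sequence) pairs in matrix order; the first row is
    taken as the reference sequence.
    """
    reference = rows[0][1]
    expanded = []
    for label, seq in rows:
        if len(seq) != len(reference):
            raise ValueError(
                f"{label}: expected {len(reference)} characters, got {len(seq)}"
            )
        full = "".join(
            ref if ch == matchchar else ch for ch, ref in zip(seq, reference)
        )
        expanded.append((label, full))
    return expanded


# Rows copied from the example CHARACTERS block.
rows = [
    ("Alpha", "atgctagctagctcg"),
    ("Beta", "......??...-.a."),
    ("Gamma", "...t.......-.g."),
    ("Delta", "...t.......-.a."),
]

if __name__ == "__main__":
    for label, seq in expand_matchchar(rows):
        print(f"{label:<6} {seq}")
```

For Gamma this yields atgttagctag-tgg, matching the comment in the example. Missing (?) and gap (-) symbols are left untouched because only the matchchar is substituted.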


References

  1. Maddison DR, Swofford DL, Maddison WP (1997). "NEXUS: An extensible file format for systematic information". Systematic Biology. 46 (4): 590–621. doi:10.1093/sysbio/46.4.590. PMID 11975335.
  2. PAUP*: Phylogenetic Analysis Using Parsimony *and other methods. Archived 2006-09-03 at the Wayback Machine.
  3. MrBayes
  4. Mesquite: A modular system for evolutionary analysis
  5. MacClade
  6. Huson and Bryant, Application of Phylogenetic Networks in Evolutionary Studies, Mol Biol Evol (2005) 23 (2): 254-267. https://doi.org/10.1093/molbev/msj030
  7. Detailed NEXUS specification