Nexus file

Nexus format
Filename extensions: usually .nex or .nxs
Internet media type: application/octet-stream
Magic number: '#NEXUS\n'
Developed by: Maddison DR, Swofford DL, Maddison WP
Initial release: December 1997
Type of format: bioinformatics
Open format? Yes

The extensible NEXUS file format is widely used in bioinformatics. It stores information about taxa, morphological and molecular characters, distances, genetic codes, assumptions, sets, trees, and related data. [1] Several popular phylogenetic programs, such as PAUP*, [2] MrBayes, [3] Mesquite, [4] MacClade [5] and SplitsTree, [6] use this format.


Syntax

A NEXUS file consists of a fixed header, #NEXUS, followed by a sequence of blocks. Each block starts with BEGIN block_name; and ends with END;. Keywords are case-insensitive. Comments are enclosed in square brackets [...]. [7]
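The header, block delimiters, and comment rules described above can be illustrated with a minimal Python sketch. It is not a full NEXUS parser: the function names strip_comments and split_blocks are hypothetical, quoted tokens are ignored, nested comments are assumed to be allowed, and a later block silently replaces an earlier block of the same name.

```python
import re

def strip_comments(text: str) -> str:
    """Remove [...] comments, tracking depth so nested brackets are dropped too."""
    out, depth = [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)

def split_blocks(text: str) -> dict[str, str]:
    """Return {lower-cased block name: block body} for a NEXUS document."""
    # The fixed header must come first; keywords are matched case-insensitively.
    if not text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("missing #NEXUS header")
    cleaned = strip_comments(text)
    pattern = re.compile(
        r"\bBEGIN\s+(\w+)\s*;(.*?)\bEND\s*;",
        re.IGNORECASE | re.DOTALL,
    )
    return {m.group(1).lower(): m.group(2).strip() for m in pattern.finditer(cleaned)}

example = """#NEXUS
Begin TAXA;
  Dimensions ntax=2;
  TaxLabels A B;
End;
BEGIN TREES;
  Tree t1 = (A,B); [ a comment ]
END;
"""

blocks = split_blocks(example)
print(sorted(blocks))    # ['taxa', 'trees']
print(blocks["trees"])   # Tree t1 = (A,B);
```

Lower-casing the block names is a design choice that follows from the case-insensitivity of keywords: Begin TAXA; and BEGIN taxa; end up under the same key.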

There are a few pre-defined block names for common types of data. Examples include: [7]

TAXA block
The TAXA block contains information about taxa.
DATA block
The DATA block contains the data matrix (e.g. sequence alignment).
TREES block
The TREES block contains phylogenetic trees described using the Newick format, e.g. ((A,B),C);.

The following example uses the three block types above:

#NEXUS
Begin TAXA;
  Dimensions ntax=4;
  TaxLabels SpaceDog SpaceCat SpaceOrc SpaceElf;
End;

Begin data;
  Dimensions nchar=15;
  Format datatype=dna missing=? gap=- matchchar=.;
  Matrix
    [ When a position is a "matchchar", it means that it is the same as the first entry at the same position. ]
    SpaceDog   atgctagctagctcg
    SpaceCat   ......??...-.a.
    SpaceOrc   ...t.......-.g. [ same as atgttagctag-tgg ]
    SpaceElf   ...t.......-.a.
    ;
End;

BEGIN TREES;
  Tree tree1 = (((SpaceDog,SpaceCat),SpaceOrc,SpaceElf));
END;
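The matchchar convention used in the matrix of the example can be made concrete with a short Python sketch. The helper name expand_matchchar is hypothetical, and the rows are assumed to have already been extracted from the Matrix command with comments removed. Each matchchar is replaced by the state of the first row at the same position.

```python
def expand_matchchar(rows: list[tuple[str, str]], matchchar: str = ".") -> dict[str, str]:
    """Replace each matchchar with the first row's state at the same position."""
    reference = rows[0][1]
    expanded = {}
    for name, seq in rows:
        if len(seq) != len(reference):
            raise ValueError(
                f"{name}: expected {len(reference)} characters, got {len(seq)}"
            )
        expanded[name] = "".join(
            ref if ch == matchchar else ch for ch, ref in zip(seq, reference)
        )
    return expanded

# Rows copied from the DATA block of the example.
rows = [
    ("SpaceDog", "atgctagctagctcg"),
    ("SpaceCat", "......??...-.a."),
    ("SpaceOrc", "...t.......-.g."),
    ("SpaceElf", "...t.......-.a."),
]

full = expand_matchchar(rows)
print(full["SpaceOrc"])  # atgttagctag-tgg
```

The missing (?) and gap (-) symbols are left untouched, since only the matchchar refers back to the first sequence.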


References

  1. Maddison DR, Swofford DL, Maddison WP (1997). "NEXUS: An extensible file format for systematic information". Systematic Biology. 46 (4): 590–621. doi:10.1093/sysbio/46.4.590. PMID 11975335.
  2. PAUP*: Phylogenetic Analysis Using Parsimony *and other methods. Archived 2006-09-03 at the Wayback Machine.
  3. MrBayes
  4. Mesquite: A modular system for evolutionary analysis
  5. MacClade
  6. Huson and Bryant (2005). "Application of Phylogenetic Networks in Evolutionary Studies". Mol Biol Evol. 23 (2): 254–267. https://doi.org/10.1093/molbev/msj030
  7. Detailed NEXUS specification