Newick format

Filename extensions: .tree
Internet media type: text/x-nh
Initial release: 24 June 1986
Type of format: graph-theoretical trees
Open format: Yes

In mathematics and phylogenetics, Newick tree format (or Newick notation or New Hampshire tree format) is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. It was adopted by James Archie, William H. E. Day, Joseph Felsenstein, Wayne Maddison, Christopher Meacham, F. James Rohlf, and David Swofford, at two meetings in 1986, the second of which was at Newick's restaurant [1] in Dover, New Hampshire, US. The adopted format is a generalization of the format developed by Meacham in 1984 for the first tree-drawing programs in Felsenstein's PHYLIP package. [2]


Examples

The following tree:

[Figure: NewickExample.svg]

could be represented in Newick format in several ways:

(,,(,));                               no nodes are named
(A,B,(C,D));                           leaf nodes are named
(A,B,(C,D)E)F;                         all nodes are named
(:0.1,:0.2,(:0.3,:0.4):0.5);           all but root node have a distance to parent
(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;       all have a distance to parent
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);       distances and leaf names (popular)
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;     distances and all names
((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;    a tree rooted on a leaf node (rare)

Newick format is typically used by tools such as PHYLIP and provides a minimal description of a phylogenetic tree.
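The variants listed above differ only in which names and lengths are written out. A minimal sketch of a writer that produces several of them from one tree follows; the nested-tuple representation `(name, length, children)` and the function names are assumptions made for this illustration, not part of any Newick tool.

```python
def write_subtree(node, leaf_names=True, internal_names=True, lengths=True):
    """Serialize one node (and its descendants) without the trailing ';'."""
    name, length, children = node
    text = ""
    if children:
        inner = ",".join(
            write_subtree(child, leaf_names, internal_names, lengths)
            for child in children
        )
        text = "(" + inner + ")"
        if internal_names:
            text += name
    elif leaf_names:
        text += name
    if lengths and length is not None:
        text += f":{length}"
    return text


def to_newick(root, **options):
    """Serialize a whole tree; options are passed through to write_subtree."""
    return write_subtree(root, **options) + ";"


# The example tree: root F with children A, B and E; E has children C and D.
TREE = ("F", None, [
    ("A", 0.1, []),
    ("B", 0.2, []),
    ("E", 0.5, [("C", 0.3, []), ("D", 0.4, [])]),
])

print(to_newick(TREE, leaf_names=False, internal_names=False, lengths=False))
print(to_newick(TREE, internal_names=False, lengths=False))
print(to_newick(TREE, internal_names=False))
print(to_newick(TREE))
```

The four calls print the "no nodes are named", "leaf nodes are named", "distances and leaf names" and "distances and all names" variants of the example tree.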

Rooted, unrooted, and binary trees

When an unrooted tree is represented in Newick notation, an arbitrary node is chosen as its root. Whether rooted or unrooted, typically a tree's representation is rooted on an internal node and it is rare (but legal) to root a tree on a leaf node.

A rooted binary tree that is rooted on an internal node has exactly two immediate descendant nodes for each internal node. An unrooted binary tree that is rooted on an arbitrary internal node has exactly three immediate descendant nodes for the root node, and each other internal node has exactly two immediate descendant nodes. A binary tree rooted from a leaf has at most one immediate descendant node for the root node, and each internal node has exactly two immediate descendant nodes.

Grammar

A grammar for parsing the Newick format (roughly based on [3] ):

The grammar nodes

Tree: the full input Newick format for a single tree
Subtree: an internal node (and its descendants) or a leaf node
Leaf: a node with no descendants
Internal: a node and its one or more descendants
BranchSet: a set of one or more Branches
Branch: a tree edge and its descendant subtree
Name: the name of a node
Length: the length of a tree edge

The grammar rules

Note, "|" separates alternatives.

Tree → Subtree ";"
Subtree → Leaf | Internal
Leaf → Name
Internal → "(" BranchSet ")" Name
BranchSet → Branch | Branch "," BranchSet
Branch → Subtree Length
Name → empty | string
Length → empty | ":" number

Whitespace (spaces, tabs, carriage returns, and linefeeds) within number is prohibited. Whitespace within string is often prohibited. Whitespace elsewhere is ignored. Sometimes the Name string must be of a specified fixed length; otherwise the punctuation characters from the grammar (semicolon, parentheses, comma, and colon) are prohibited. The Tree → Subtree ";" production is instead the Tree → Branch ";" production in those cases where having the entire tree descended from nowhere is permitted; this captures the replaced production as well because Length can be empty.
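The grammar maps fairly directly onto a recursive-descent parser, with one method per production. The following is a minimal sketch under stated assumptions: the `Node` class and `NewickParser` are hypothetical names, the Tree → Branch ";" variant is used so that a root length is accepted, and all whitespace is stripped up front, which is more permissive than the rules above (it would also accept whitespace inside a number or a name). Quoted labels and bracketed comments are not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PUNCTUATION = "();,:"


@dataclass
class Node:
    name: str = ""
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


class NewickParser:
    def __init__(self, text: str):
        # Simplification: drop every whitespace character before parsing.
        self.text = "".join(text.split())
        self.pos = 0

    def peek(self) -> str:
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def expect(self, char: str) -> None:
        if self.peek() != char:
            raise ValueError(f"expected {char!r} at position {self.pos}")
        self.pos += 1

    def parse_tree(self) -> Node:
        # Tree -> Branch ";"  (covers Tree -> Subtree ";" since Length may be empty)
        root = self.parse_branch()
        self.expect(";")
        if self.pos != len(self.text):
            raise ValueError("unexpected text after ';'")
        return root

    def parse_branch(self) -> Node:
        # Branch -> Subtree Length
        node = self.parse_subtree()
        node.length = self.parse_length()
        return node

    def parse_subtree(self) -> Node:
        # Subtree -> Leaf | Internal
        node = Node()
        if self.peek() == "(":
            # Internal -> "(" BranchSet ")" Name
            self.expect("(")
            node.children = self.parse_branch_set()
            self.expect(")")
        node.name = self.parse_name()  # Leaf -> Name
        return node

    def parse_branch_set(self) -> List[Node]:
        # BranchSet -> Branch | Branch "," BranchSet
        branches = [self.parse_branch()]
        while self.peek() == ",":
            self.pos += 1
            branches.append(self.parse_branch())
        return branches

    def parse_name(self) -> str:
        # Name -> empty | string (any run of non-punctuation characters)
        start = self.pos
        while self.peek() and self.peek() not in PUNCTUATION:
            self.pos += 1
        return self.text[start:self.pos]

    def parse_length(self) -> Optional[float]:
        # Length -> empty | ":" number
        if self.peek() != ":":
            return None
        self.pos += 1
        return float(self.parse_name())


root = NewickParser("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;").parse_tree()
print(root.name, [child.name for child in root.children])
```

Malformed input such as a missing closing parenthesis or an empty number after ":" raises `ValueError` in this sketch.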

Note that when a tree having more than one leaf is rooted from one of its leaves, a representation that is rarely seen in practice, the root leaf is characterized as an Internal node by the above grammar. Generally, a root node labeled as Internal should be construed as actually internal if and only if it has at least two Branches in its BranchSet. One can make a grammar that formalizes this distinction by replacing the above Tree production rule with

Tree → RootLeaf ";" | RootInternal ";"
RootLeaf → Name | "(" Branch ")" Name
RootInternal → "(" Branch "," BranchSet ")" Name

The first RootLeaf production is for a tree with exactly one leaf. The second RootLeaf production is for rooting a tree from one of its two or more leaves.
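The distinction between RootLeaf and RootInternal depends only on how many branches the outermost parentheses contain, so it can be sketched without a full parser by counting commas at nesting depth one. The function names below are hypothetical, and the scan assumes the input has no quoted labels or bracketed comments that might contain parentheses or commas.

```python
def root_branch_count(newick: str) -> int:
    """Count the branches attached directly to the root of a Newick string."""
    depth = 0
    commas_at_root = 0
    has_parentheses = False
    for char in newick:
        if char == "(":
            depth += 1
            has_parentheses = True
        elif char == ")":
            depth -= 1
        elif char == "," and depth == 1:
            commas_at_root += 1
    # A bare name such as "A;" has no branches at all.
    return commas_at_root + 1 if has_parentheses else 0


def root_kind(newick: str) -> str:
    """Apply the rule: the root is internal only with at least two branches."""
    return "RootInternal" if root_branch_count(newick) >= 2 else "RootLeaf"


print(root_kind("(A,B,(C,D)E)F;"))
print(root_kind("((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;"))
print(root_kind("A;"))
```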

Notes

Dialects

New Hampshire X format

The New Hampshire X (NHX) format is an extension to Newick that adds key-value data (gene duplication, etc.) to Newick nodes. This is done by putting the additional data in brackets [&&NHX:key=value:...] in the node labels. The brackets are used because they represent comments in the Nexus file format, so any parser that does not understand this additional information will ignore it. [4]
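As an illustration of the bracket syntax, the sketch below separates an NHX block from a node label and shows how a parser that treats brackets as comments would discard it. The function names are hypothetical, the tag names in the example are illustrative, and the regular expressions assume that values contain no "]" or ":" characters.

```python
import re

NHX_BLOCK = re.compile(r"\[&&NHX:([^\]]*)\]")
ANY_COMMENT = re.compile(r"\[[^\]]*\]")


def split_nhx(label: str):
    """Return (label without the NHX block, dict of key-value tags)."""
    match = NHX_BLOCK.search(label)
    if match is None:
        return label, {}
    tags = {}
    for item in match.group(1).split(":"):
        if "=" in item:
            key, value = item.split("=", 1)
            tags[key] = value
    plain = label[:match.start()] + label[match.end():]
    return plain, tags


def strip_comments(newick: str) -> str:
    """Mimic a parser that ignores everything inside square brackets."""
    return ANY_COMMENT.sub("", newick)


print(split_nhx("ADH2:0.1[&&NHX:S=human:E=1.1.1.1]"))
print(strip_comments("(A[&&NHX:S=human],B)C;"))
```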

Extended Newick

While the standard Newick notation is limited to phylogenetic trees, Extended Newick (Perl Bio::PhyloNetwork) can be used to encode explicit phylogenetic networks. [5] In a phylogenetic network, which is a generalization of a phylogenetic tree, a node either represents a divergence event (cladogenesis) or a reticulation event such as hybridization, introgression, horizontal (lateral) gene transfer or recombination. Nodes that represent a reticulation event are duplicated, annotated by introducing the # symbol into the Newick format, and numbered consecutively (using integer values starting with 1).

For example, if leaf Y is the product of hybridisation (x) between lineages leading to C and D in the tree above, one can express this situation by defining two trees in standard Newick notation:

[Figure: ExtendedNewickExample.png. Caption: Two trees in standard Newick]

(A,B,((C,Y)c,D)e)f; and (A,B,(C,(Y,D)d)e)f;  standard Newick, all nodes are named (internal nodes lowercase, leaves upper case)

or in extended Newick notation

(A,B,((C,(Y)x#H1)c,(x#H1,D)d)e)f;               extended Newick, all nodes are named; 1 is the integer identifying the hybrid node x

The x#H1 here is a hybrid node. It will be joined by the program into a single node when drawn. This is the picture drawn by Dendroscope for this example: [Figure: Dendroscope-network.png]
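The joining step can be sketched by reading the string into a set of parent-child edges keyed by node name, so that the two occurrences of x#H1 collapse into one node with two parents. This is a simplified illustration rather than how Dendroscope or any other program works: the function names are hypothetical, and the code assumes every node is named and that names are unique apart from repeated hybrid labels.

```python
import re

TOKEN = re.compile(r"[(),;]|[^(),;]+")


def network_edges(newick: str):
    """Return the set of (parent, child) name pairs in an all-named Newick string."""
    open_groups = []   # one list of child names per currently open "("
    closed = None      # children of the group just closed, waiting for its name
    edges = set()
    for token in TOKEN.findall("".join(newick.split())):
        if token == "(":
            open_groups.append([])
        elif token in "),;":
            if closed is not None:
                raise ValueError("internal node without a name")
            if token == ")":
                closed = open_groups.pop()
        else:
            name = token.split(":")[0]  # drop an optional branch length
            for child in closed or []:
                edges.add((name, child))
            closed = None
            if open_groups:
                open_groups[-1].append(name)
    return edges


def reticulation_nodes(edges):
    """Map each node with more than one parent to its sorted parent names."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


edges = network_edges("(A,B,((C,(Y)x#H1)c,(x#H1,D)d)e)f;")
print(reticulation_nodes(edges))
```

For the example string, the only node with two parents is x#H1, reached from both c and d, and Y hangs below it.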


The production rules above are modified by the following for labelling hybrid nodes (in general, nodes representing reticulation events): [6]

Leaf → Name Hybrid
Hybrid → empty | "#" Type integer     -- The #i part is an obligatory identifier for a hybrid node
Type → empty | string                 -- type of reticulation, e.g., H = hybridisation, LGT = lateral gene transfer, R = recombination.
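A label that follows these rules can be split into its parts with a regular expression. The sketch below is illustrative only: the function name is hypothetical, and restricting Type to ASCII letters is an assumption made here so that the type can be told apart from the trailing integer.

```python
import re

HYBRID_LABEL = re.compile(r"(?P<name>[^#]*)#(?P<type>[A-Za-z]*)(?P<index>\d+)")


def parse_hybrid_label(label: str):
    """Return (name, type, index) for a hybrid label, or None for a plain name."""
    match = HYBRID_LABEL.fullmatch(label)
    if match is None:
        return None
    return match.group("name"), match.group("type"), int(match.group("index"))


print(parse_hybrid_label("x#H1"))
print(parse_hybrid_label("#LGT12"))
print(parse_hybrid_label("Y"))
```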

In the visualization of LGT events, for a given reticulate node, one incoming edge is usually drawn as an "acceptor" edge and all other incoming edges are drawn as "transfer" edges. Some programs (e.g. Dendroscope and SplitsTree) allow exactly one copy of the reticulate node to be labeled with ## to indicate that it corresponds to the acceptor edge.

Extended Newick is backward-compatible: a legacy parser would simply interpret a hybrid node as several strangely named nodes.

Rich Newick format

The Rich Newick format, also known as the Rice Newick format, is a further extension of Extended Newick. [7] It adds support for:

Ad hoc extensions

Some other programs, like NWX, use comments starting with & to encode additional information in an ad hoc manner: [8]

Visualization

Many tools have been published to visualize Newick tree data. Specific examples include the ETE toolkit ("Environment for Tree Exploration") [9] and T-REX. [10] Phylogenetic software packages such as SplitsTree and the tree-viewer Dendroscope as well as the online tree viewing tool IcyTree can handle standard and extended Newick notation, while the phylogenetic network software PhyloNet makes use of both the Extended Newick and Rich Newick format.

References

  1. Newick's Lobster House home page
  2. "The Newick tree format".
  3. Olsen, Gary (August 30, 1990). "Interpretation of "Newick's 8:45" Tree Format".
  4. Zmasek, Christian M. (1999). "The New Hampshire X Format (NHX)" (PDF).
  5. Cardona, Gabriel; Rosselló, Francesc; Valiente, Gabriel (2008-03-27). "A perl package and an alignment tool for phylogenetic networks". BMC Bioinformatics. 9: 175. doi:10.1186/1471-2105-9-175. ISSN 1471-2105. PMC 2330044. PMID 18371228.
  6. Cardona, Gabriel; Rosselló, Francesc; Valiente, Gabriel (2008). "Extended Newick: it is time for a standard representation of phylogenetic networks". BMC Bioinformatics. 9: 532. doi:10.1186/1471-2105-9-532. PMC 2621367. PMID 19077301.
  7. Barnett, Robert Matthew (16 February 2012). "Rich Newick Format". Rice University Wiki.
  8. Yu, Guangchuang. "Chapter 1 Importing Tree with Data". Data Integration, Manipulation and Visualization of Phylogenetic Tree.
  9. Huerta-Cepas, Jaime; Serra, François; Bork, Peer (June 2016). "ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data". Molecular Biology and Evolution. 33 (6): 1635–1638. doi:10.1093/molbev/msw046. ISSN 0737-4038. PMC 4868116. PMID 26921390.
  10. Boc, Alix; Diallo, Alpha Boubacar; Makarenkov, Vladimir (July 2012). "T-REX: a web server for inferring, validating and visualizing phylogenetic trees and networks". Nucleic Acids Research. 40 (Web Server issue): W573–579. doi:10.1093/nar/gks485. ISSN 1362-4962. PMC 3394261. PMID 22675075.