Dendrogram

Last updated
Dendrogram of a hierarchical clustering (UPGMA) with the height of the nodes (adapted from bacterial 5S rRNA sequence data ). UPGMA Dendrogram Hierarchical.svg
Dendrogram of a hierarchical clustering (UPGMA) with the height of the nodes (adapted from bacterial 5S rRNA sequence data ).
Dendrogram output for hierarchical clustering of marine provinces using presence / absence of sponge species. Global-Diversity-of-Sponges-(Porifera)-pone.0035105.s008.tif
Dendrogram output for hierarchical clustering of marine provinces using presence / absence of sponge species.
A dendrogram of the Tree of Life. This phylogenetic tree is adapted from Woese et al. rRNA analysis. The vertical line at bottom represents the last universal common ancestor (LUCA). Phylogenetic tree.svg
A dendrogram of the Tree of Life. This phylogenetic tree is adapted from Woese et al. rRNA analysis. The vertical line at bottom represents the last universal common ancestor (LUCA).
Heatmap of RNA-Seq data showing two dendrograms in the left and top margins. Heatmap RNAseqV2 1.png
Heatmap of RNA-Seq data showing two dendrograms in the left and top margins.

A dendrogram is a diagram representing a tree. This diagrammatic representation is frequently used in different contexts:

Contents

The name dendrogram derives from the two ancient greek words δένδρον (déndron), meaning "tree", and γράμμα (grámma), meaning "drawing, mathematical figure". [7] [8]

Clustering example

For a clustering example, suppose that five taxa ( to ) have been clustered by UPGMA based on a matrix of genetic distances. The hierarchical clustering dendrogram would show a column of five nodes representing the initial data (here individual taxa), and the remaining nodes represent the clusters to which the data belong, with the arrows representing the distance (dissimilarity). The distance between merged clusters is monotone, increasing with the level of the merger: the height of each node in the plot is proportional to the value of the intergroup dissimilarity between its two daughters (the nodes on the right representing individual observations all plotted at zero height).

See also

Related Research Articles

<span class="mw-page-title-main">Phylogenetic tree</span> Branching diagram of evolutionary relationships between organisms

A phylogenetic tree, phylogeny or evolutionary tree is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time. In another word, it is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. All life on Earth is part of a single phylogenetic tree, indicating common ancestry. Phylogenetics is the field of the study for the phylogenetic trees. The main challenge is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of species or taxa. Computational phylogenetics focuses on the algorithms involved in finding optimal phylogenetic tree in the phylogenetic landscape.

Molecular phylogenetics is the branch of phylogeny that analyzes genetic, hereditary molecular differences, predominantly in DNA sequences, to gain information on an organism's evolutionary relationships. From these analyses, it is possible to determine the processes by which diversity among species has been achieved. The result of a molecular phylogenetic analysis is expressed in a phylogenetic tree. Molecular phylogenetics is one aspect of molecular systematics, a broader term that also includes the use of molecular data in taxonomy and biogeography.

In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa to create the phylogenetic tree.

UPGMA is a simple agglomerative (bottom-up) hierarchical clustering method. It also has a weighted variant, WPGMA, and they are generally attributed to Sokal and Michener.

<span class="mw-page-title-main">Hierarchical clustering</span> Statistical method of analysis which seeks to build a hierarchy of clusters

In data mining and statistics, hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

A phylogenetic network is any graph used to visualize evolutionary relationships between nucleotide sequences, genes, chromosomes, genomes, or species. They are employed when reticulation events such as hybridization, horizontal gene transfer, recombination, or gene duplication and loss are believed to be involved. They differ from phylogenetic trees by the explicit modeling of richly linked networks, by means of the addition of hybrid nodes instead of only tree nodes. Phylogenetic trees are a subset of phylogenetic networks. Phylogenetic networks can be inferred and visualised with software such as SplitsTree, the R-package, phangorn, and, more recently, Dendroscope. A standard format for representing phylogenetic networks is a variant of Newick format which is extended to support networks as well as trees.

Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. The goal is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of genes, species, or taxa. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data. Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms to search for optimal or the best phylogenetic tree. The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space.

Ancestral reconstruction is the extrapolation back in time from measured characteristics of individuals to their common ancestors. It is an important application of phylogenetics, the reconstruction and study of the evolutionary relationships among individuals, populations or species to their ancestors. In the context of evolutionary biology, ancestral reconstruction can be used to recover different kinds of ancestral character states of organisms that lived millions of years ago. These states include the genetic sequence, the amino acid sequence of a protein, the composition of a genome, a measurable characteristic of an organism (phenotype), and the geographic range of an ancestral population or species. This is desirable because it allows us to examine parts of phylogenetic trees corresponding to the distant past, clarifying the evolutionary history of the species in the tree. Since modern genetic sequences are essentially a variation of ancient ones, access to ancient sequences may identify other variations and organisms which could have arisen from those sequences. In addition to genetic sequences, one might attempt to track the changing of one character trait to another, such as fins turning to legs.

Bayesian inference of phylogeny combines the information in the prior and in the data likelihood to create the so-called posterior probability of trees, which is the probability that the tree is correct given the data, the prior and the likelihood model. Bayesian inference was introduced into molecular phylogenetics in the 1990s by three independent groups: Bruce Rannala and Ziheng Yang in Berkeley, Bob Mau in Madison, and Shuying Li in University of Iowa, the last two being PhD students at the time. The approach has become very popular since the release of the MrBayes software in 2001, and is now one of the most popular methods in molecular phylogenetics.

In mathematics, Newick tree format is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. It was adopted by James Archie, William H. E. Day, Joseph Felsenstein, Wayne Maddison, Christopher Meacham, F. James Rohlf, and David Swofford, at two meetings in 1986, the second of which was at Newick's restaurant in Dover, New Hampshire, US. The adopted format is a generalization of the format developed by Meacham in 1984 for the first tree-drawing programs in Felsenstein's PHYLIP package.

Hierarchical clustering is one method for finding community structures in a network. The technique arranges the network into a hierarchy of groups according to a specified weight function. The data can then be represented in a tree structure known as a dendrogram. Hierarchical clustering can either be agglomerative or divisive depending on whether one proceeds through the algorithm by adding links to or removing links from the network, respectively. One divisive technique is the Girvan–Newman algorithm.

In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion, at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other.

The Robinson–Foulds or symmetric difference metric, often abbreviated as the RF distance, is a simple way to calculate the distance between phylogenetic trees. It is defined as where A is the number of partitions of data implied by the first tree but not the second tree and B is the number of partitions of data implied by the second tree but not the first tree. The partitions are calculated for each tree by removing each branch. Thus, the number of eligible partitions for each tree is equal to the number of branches in that tree. RF distances have been criticized as biased, but they represent a relatively intuitive measure of the distances between phylogenetic trees and therefore remain widely used. Nevertheless, the biases inherent to the RF distances suggest that researches should consider using "Generalized" Robinson–Foulds metrics that may have better theoretical and practical performance and avoid the biases and misleading attributes of the original metric.

Complete-linkage clustering is one of several methods of agglomerative hierarchical clustering. At the beginning of the process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters until all elements end up being in the same cluster. The method is also known as farthest neighbour clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusion and the distance at which each fusion took place.

UniFrac is a distance metric used for comparing biological communities. It differs from dissimilarity measures such as Bray-Curtis dissimilarity in that it incorporates information on the relative relatedness of community members by incorporating phylogenetic distances between observed organisms in the computation.

T-REX is a freely available web server, developed at the department of Computer Science of the Université du Québec à Montréal, dedicated to the inference, validation and visualization of phylogenetic trees and phylogenetic networks. The T-REX web server allows the users to perform several popular methods of phylogenetic analysis as well as some new phylogenetic applications for inferring, drawing and validating phylogenetic trees and networks.

WPGMA is a simple agglomerative (bottom-up) hierarchical clustering method, generally attributed to Sokal and Michener.

In the field of epidemiology, source attribution refers to a category of methods with the objective of reconstructing the transmission of an infectious disease from a specific source, such as a population, individual, or location. For example, source attribution methods may be used to trace the origin of a new pathogen that recently crossed from another host species into humans, or from one geographic region to another. It may be used to determine the common source of an outbreak of a foodborne infectious disease, such as a contaminated water supply. Finally, source attribution may be used to estimate the probability that an infection was transmitted from one specific individual to another, i.e., "who infected whom".

References

Citations

  1. Swofford DL, Olsen GJ, Waddell PJ, Hillis DM (1996). "Phylogenetic inference". In Hillis DM, Moritz C, Mable BK (eds.). Molecular Systematics, 2nd edition. Sunderland, MA: Sinauer. pp. 407–514. ISBN   9780878932825.
  2. Van Soest R, Boury-Esnault N, Vacelet J, Dohrmann M, Erpenbeck D, De Voogd N, Santodomingo N, Vanhoorne B, Kelly M, Hooper J (2012). "Global Diversity of Sponges (Porifera)". PLOS ONE. 7 (4): e35105. Bibcode:2012PLoSO...735105V. doi: 10.1371/journal.pone.0035105 . PMC   3338747 . PMID   22558119.
  3. Woese, Carl R.; Kandler, O; Wheelis, M (1990). "Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya" (PDF). Proc Natl Acad Sci USA. 87 (12): 4576–4579. Bibcode:1990PNAS...87.4576W. doi: 10.1073/pnas.87.12.4576 . PMC   54159 . PMID   2112744.
  4. Everitt, Brian (1998). Dictionary of Statistics . Cambridge, UK: Cambridge University Press. p.  96. ISBN   0-521-59346-8.
  5. Wilkinson, Leland; Friendly, Michael (May 2009). "The History of the Cluster Heat Map". The American Statistician. 63 (2): 179–184. doi:10.1198/tas.2009.0033. S2CID   122792460.
  6. "Phylogenetic tree (biology)". Encyclopedia Britannica. Retrieved 2018-10-22.
  7. Bailly, Anatole (1981-01-01). Abrégé du dictionnaire grec français. Paris: Hachette. ISBN   2010035283. OCLC   461974285.
  8. Bailly, Anatole. "Greek-french dictionary online". www.tabularium.be. Retrieved October 20, 2018.

Sources