Distance matrix

Last updated

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix (two-dimensional array) containing the distances, taken pairwise, between the elements of a set. [1] Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

Contents

Non-metric distance matrix

In general, a distance matrix is a weighted adjacency matrix of some graph. In a network, a directed graph with weights assigned to the arcs, the distance between two nodes of the network can be defined as the minimum of the sums of the weights on the shortest paths joining the two nodes. [2] This distance function, while well defined, is not a metric. There need be no restrictions on the weights other than the need to be able to combine and compare them, so negative weights are used in some applications. Since paths are directed, symmetry can not be guaranteed, and if cycles exist the distance matrix may not be hollow.

An algebraic formulation of the above can be obtained by using the min-plus algebra. Matrix multiplication in this system is defined as follows: Given two n × n matrices A = (aij) and B = (bij), their distance product C = (cij) = AB is defined as an n × n matrix such that

Note that the off-diagonal elements that are not connected directly will need to be set to infinity or a suitable large value for the min-plus operations to work correctly. A zero in these locations will be incorrectly interpreted as an edge with no distance, cost, etc.

If W is an n × n matrix containing the edge weights of a graph, then Wk (using this distance product) gives the distances between vertices using paths of length at most k edges, and Wn is the distance matrix of the graph.

An arbitrary graph G on n vertices can be modeled as a weighted complete graph on n vertices by assigning a weight of one to each edge of the complete graph that corresponds to an edge of G and zero to all other edges. W for this complete graph is the adjacency matrix of G. The distance matrix of G can be computed from W as above, however, Wn calculated by the usual matrix multiplication only encodes the number of paths between any two vertices of length exactly n.

Metric distance matrix

The value of a distance matrix formalism in many applications is in how the distance matrix can manifestly encode the metric axioms and in how it lends itself to the use of linear algebra techniques. That is, if M = (xij) with 1 ≤ i, jN is a distance matrix for a metric distance, then

  1. the entries on the main diagonal are all zero (that is, the matrix is a hollow matrix), i.e. xii = 0 for all 1 ≤ iN,
  2. all the off-diagonal entries are positive (xij > 0 if ij), (that is, a non-negative matrix),
  3. the matrix is a symmetric matrix (xij = xji), and
  4. for any i and j, xijxik + xkj for all k (the triangle inequality). This can be stated in terms of tropical matrix multiplication

When a distance matrix satisfies the first three axioms (making it a semi-metric) it is sometimes referred to as a pre-distance matrix. A pre-distance matrix that can be embedded in a Euclidean space is called a Euclidean distance matrix.

Another common example of a metric distance matrix arises in coding theory when in a block code the elements are strings of fixed length over an alphabet and the distance between them is given by the Hamming distance metric. The smallest non-zero entry in the distance matrix measures the error correcting and error detecting capability of the code.

Additive distance matrix

An additive distance matrix is a special type of matrix used in bioinformatics to build a phylogenetic tree. Let x be the lowest common ancestor between two species i and j, we expect Mij = Mix + Mxj. This is where the additive metric comes from. A distance matrix M for a set of species S is said to be additive if and only if there exists a phylogeny T for S such that:

For this case, M is called an additive matrix and T is called an additive tree. Below we can see an example of an additive distance matrix and its corresponding tree:

Additive distance matrix (left) and its phylogeny tree (right) Additive distance matrix.png
Additive distance matrix (left) and its phylogeny tree (right)

Ultrametric distance matrix

The ultrametric distance matrix is defined as an additive matrix which models the constant molecular clock. It is used to build a phylogenetic tree. A matrix M is said to be ultrametric if there exists a tree T such that:

Here is an example of an ultrametric distance matrix with its corresponding tree:

Ultrametric tree.png

Bioinformatics

The distance matrix is widely used in the bioinformatics field, and it is present in several methods, algorithms and programs. Distance matrices are used to represent protein structures in a coordinate-independent manner, as well as the pairwise distances between two sequences in sequence space. They are used in structural and sequential alignment, and for the determination of protein structures from NMR or X-ray crystallography.

Sometimes it is more convenient to express data as a similarity matrix.

It is also used to define the distance correlation.

Sequence alignment

An alignment of two sequences is formed by inserting spaces in arbitrary locations along the sequences so that they end up with the same length and there are no two spaces at the same position of the two augmented sequences. [3] One of the primary methods for sequence alignment is dynamic programming. The method is used to fill the distance matrix and then obtain the alignment. In typical usage, for sequence alignment a matrix is used to assign scores to amino-acid matches or mismatches, and a gap penalty for matching an amino-acid in one sequence with a gap in the other.

Global alignment

The Needleman-Wunsch algorithm used to calculate global alignment uses dynamic programming to obtain the distance matrix.

Local alignment

The Smith-Waterman algorithm is also dynamic programming based which consists also in obtaining the distance matrix and then obtain the local alignment.

Multiple sequence alignment

Multiple sequence alignment is an extension of pairwise alignment to align several sequences at a time. Different MSA methods are based on the same idea of the distance matrix as global and local alignments.

  • Center star method. This method defines a center sequence Sc which minimizes the distance between the sequence Sc and any other sequence Si. Then it generates a multiple alignment M for the set of sequences S so that for every Si the alignment distance dM(Sc,Si) is the optimal pairwise alignment. This method has the characteristic that the computed alignment for S whose sum-of-pair distance is at most twice the optimal multiple alignment.
  • Progressive alignment method. This heuristic method to create MSA first aligns the two most related sequences, and then it progressively aligns the next two most related sequences until all sequences are aligned.

There are other methods that have their own program due to their popularity:

MAFFT

Multiple alignment using fast Fourier transform (MAFFT) is a program with an algorithm based on progressive alignment, and it offers various multiple alignment strategies. First, MAFFT constructs a distance matrix based on the number of shared 6-tuples. Second, it builds the guide tree based on the previous matrix. Third, it clusters the sequences with the help of the fast Fourier transform and starts the alignment. Based on the new alignment, it reconstructs the guide tree and align again.

Phylogenetic analysis

To perform phylogenetic analysis, the first step is to reconstruct the phylogenetic tree: given a collection of species, the problem is to reconstruct or infer the ancestral relationships among the species, i.e., the phylogenetic tree among the species. Distance matrix methods perform this activity.

Distance matrix methods

Distance matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore require multiple sequences as an input. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. [4] Given n species, the input is an n × n distance matrix M where Mij is the mutation distance between species i and j . The aim is to output a tree of degree 3 which is consistent with the distance matrix.

They are frequently used as the basis for progressive and iterative types of multiple sequence alignment. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees. [4] Despite potential problems, distance methods are extremely fast, and they often produce a reasonable estimate of phylogeny. They also have certain benefits over the methods that use characters directly. Notably, distance methods allow use of data that may not be easily converted to character data, such as DNA-DNA hybridization assays.

The following are distance based methods for phylogeny reconstruction:

Additive tree reconstruction

Additive tree reconstruction is based on additive and ultrametric distance matrices. These matrices have a special characteristic:

Consider an additive matrix M. For any three species i, j, k, the corresponding tree is unique. [3] Every ultrametric distance matrix is an additive matrix. We can observe this property for the tree below, which consists on the species i, j, k.

Phylogenetic tree from 3 species Unique tree additive matrix.png
Phylogenetic tree from 3 species

The additive tree reconstruction technique starts with this tree. And then adds one more species each time, based on the distance matrix combined with the property mentioned above. For example, consider an additive matrix M and 5 species a, b, c, d and e. First we form an additive tree for two species a and b. Then we chose a third one, let's say c and attach it to a point x on the edge between a and b. The edge weights are computed with the property above. Next we add the fourth species d to any of the edges. If we apply the property then we identify that d should be attached to only one specific edge. Finally, we add e following the same procedure as before.

UPGMA

The basic principle of UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is that similar species should be closer in the phylogenetic tree. Hence, it builds the tree by clustering similar sequences iteratively. The method works by building the phylogenetic tree bottom up from its leaves. Initially, we have n leaves (or n singleton trees), each representing a species in S. Those n leaves are referred as n clusters. Then, we perform n-1 iterations. In each iteration, we identify two clusters C1 and C2 with the smallest average distance and merge them to form a bigger cluster C. If we suppose M is ultrametric, for any cluster C created by the UPGMA algorithm, C is a valid ultrametric tree.

Neighbor joining

Neighbor is a bottom-up clustering method. It takes a distance matrix specifying the distance between each pair of sequences. The algorithm starts with a completely unresolved tree, whose topology corresponds to that of a star network, and iterates over the following steps until the tree is completely resolved and all branch lengths are known:

  1. Based on the current distance matrix calculate the matrix (defined below).
  2. Find the pair of distinct taxa i and j (i.e. with) for which has its lowest value. These taxa are joined to a newly created node, which is connected to the central node.
  3. Calculate the distance from each of the taxa in the pair to this new node.
  4. Calculate the distance from each of the taxa outside of this pair to the new node.
  5. Start the algorithm again, replacing the pair of joined neighbors with the new node and using the distances calculated in the previous step. [5]
Fitch-Margoliash

The Fitch–Margoliash method uses a weighted least squares method for clustering based on genetic distance. Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. [6]

Data Mining and Machine Learning

Data Mining

A common function in data mining is applying cluster analysis on a given set of data to group data based on how similar or more similar they are when compared to other groups. Distance matrices became heavily dependent and utilized in cluster analysis since similarity can be measured with a distance metric. Thus, distance matrix became the representation of the similarity measure between all the different pairs of data in the set.

Hierarchical clustering

A distance matrix is necessary for traditional hierarchical clustering algorithms which are often heuristic methods employed in biological sciences such as phylogeny reconstruction. When implementing any of the hierarchical clustering algorithms in data mining, the distance matrix will contain all pair-wise distances between every point and then will begin to create clusters between two different points or clusters based entirely on distances from the distance matrix.

If N be the number of points, the complexity of hierarchical clustering is:

  • Time complexity is due to the repetitive calculations done after every cluster to update the distance matrix
  • Space complexity is

Machine Learning

Distance metrics are a key part of several machine learning algorithms, which are used in both supervised and unsupervised learning. They are generally used to calculate the similarity between data points: this is where the distance matrix is an essential element. The use of an effective distance matrix improves the performance of the machine learning model, whether it is for classification tasks or for clustering. [7]

K-Nearest Neighbors

A distance matrix is utilized in the k-NN algorithm which is one of the slowest but simplest and most used instance-based machine learning algorithms that can be used both in classification and regression tasks. It is one of the slowest machine learning algorithms since each test sample's predicted result requires a fully computed distance matrix between the test sample and each training sample in the training set. Once the distance matrix is computed, the algorithm selects the K number of training samples that are the closest to the test sample to predict the test sample's result based on the selected set's majority (classification) or average (regression) value.

  • Prediction time complexity is , to compute the distance between each test sample with every training sample to construct the distance matrix where:
  1. k = number of nearest neighbors selected
  2. n = size of the training set
  3. d = number of dimensions being used for the data

This classification focused model predicts the label of the target based on the distance matrix between the target and each of the training samples to determine the K-number of samples that are the closest/nearest to the target.

DistanceMatrix KNN.png
K nearestNeighborVisual.png

Computer Vision

A distance matrix can be used in neural networks for 2D to 3D regression in image predicting machine learning models.

Information retrieval

Distance matrices using Gaussian mixture distance

Potential basic algorithms worth noting on the topic of information retrieval is Fish School Search algorithm an information retrieval that partakes in the act of using distance matrices in order for gathering collective behavior of fish schools. By using a feeding operator to update their weights

Eq. A:

Eq. B:

Stepvol defines the size of the maximum volume displacement preformed with the distance matrix, specifically using a Euclidean distance matrix.

Evaluation of the similarity or dissimilarity of Cosine similarity and Distance matrices

Conversion formula between cosine similarity and Euclidean Distance SimilarityTOidistance.png
Conversion formula between cosine similarity and Euclidean Distance

Clustering Documents

The implementation of hierarchical clustering with distance-based metrics to organize and group similar documents together will require the need and utilization of a distance matrix. The distance matrix will represent the degree of association that a document has with another document that will be used to create clusters of closely associated documents that will be utilized in retrieval methods of relevant documents for a user's query.

Isomap

Isomap incorporates distance matrices to utilize geodesic distances to able to compute lower-dimensional embeddings. This helps to address a collection of documents that reside within a massive number of dimensions and empowers to perform document clustering.

Neighborhood Retrieval Visualizer (NeRV)

An algorithm used for both unsupervised and supervised visualization that uses distance matrices to find similar data based on the similarities shown on a display/screen.

The distance matrix needed for Unsupervised NeRV can be computed through fixed input pairwise distances.

The distance matrix needed for Supervised NeRV requires formulating a supervised distance metric to be able to compute the distance of the input in a supervised manner.

Chemistry

The distance matrix is a mathematical object widely used in both graphical-theoretical (topological) and geometric (topographic) versions of chemistry. [8] The distance matrix is used in chemistry in both explicit and implicit forms.

Interconversion mechanisms between two permutational isomers

Distance matrices were used as the main approach to depict and reveal the shortest path sequence needed to determine the rearrangement between the two permutational isomers.

Distance Polynomials and Distance Spectra

Explicit use of Distance matrices is required in order to construct the distance polynomials and distance spectra of molecular structures.

Structure-property model

Implicit use of Distance matrices was applied through the use of the distance based metric Weiner number/Weiner Index which was formulated to represent the distances in all chemical structures. The Weiner number is equal to half-sum of the elements of the distance matrix.

Conversion formula between Weiner Number and Distance Matrix WeinerNumtoDistanceMatrix.png
Conversion formula between Weiner Number and Distance Matrix

Graph-theoretical Distance matrix

Distance matrix in chemistry that are used for the 2-D realization of molecular graphs, which are used to illustrate the main foundational features of a molecule in a myriad of applications.

Labeled tree representation of C6H14's carbon skeleton based on its distance matrix Chem DistanceMtrix.png
Labeled tree representation of C6H14's carbon skeleton based on its distance matrix
  1. Creating a label tree that represents the carbon skeleton of a molecule based on its distance matrix. The distance matrix is imperative in this application because similar molecules can have a myriad of label tree variants of their carbon skeleton. The labeled tree structure of hexane (C6H14) carbon skeleton that is created based on the distance matrix in the example, has different carbon skeleton variants that affect both the distance matrix and the labeled tree
  2. Creating a labeled graph with edge weights, used in chemical graph theory, that represent molecules with hetero-atoms.
  3. Le Verrier-Fadeev-Frame (LVFF) method is a computer oriented used to speed up the process of detecting the graph center in polycyclic graphs. However, LVFF requires the input to be a diagonalized distance matrix which is easily resolved by implementing the Householder tridiagonal-QL algorithm that takes in a distance matrix and returns the diagonalized distance needed for the LVFF method.

Geometric-Distance Matrix

Geometric distance matrix for 2,4-dimethylhexane Geometric distance matrix.png
Geometric distance matrix for 2,4-dimethylhexane

While the graph-theoretical distance matrix 2-D captures the constitutional features of the molecule, its three-dimensional (3D) character is encoded in the geometric-distance matrix. The geometric-distance matrix is a different type of distance matrix that is based on the graph-theoretical distance matrix of a molecule to represent and graph the 3-D molecule structure. [8] The geometric-distance matrix of a molecular structure G is a real symmetric n x n matrix defined in the same way as a 2-D matrix. However, the matrix elements Dij will hold a collection of shortest Cartesian distances between i and j in G. Also known as topographic matrix, the geometric-distance matrix can be constructed from the known geometry of the molecule. As an example, the geometric-distance matrix of the carbon skeleton of 2,4-dimethylhexane is shown below:

Other Applications

Time Series Analysis

Dynamic Time Warping distance matrices are utilized with the clustering and classification algorithms of a collection/group of time series objects.

Examples

For example, suppose these data are to be analyzed, where pixel Euclidean distance is the distance metric.

Raw data Clusters.svg
Raw data

The distance matrix would be:

abcdef
a0184222177216231
b184045123128200
c222450129121203
d17712312904683
e21612812146083
f23120020383830

These data can then be viewed in graphic form as a heat map. In this image, black denotes a distance of 0 and white is maximal distance.

Graphical View Distance matrix.PNG
Graphical View

See also

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

<span class="mw-page-title-main">Phylogenetic tree</span> Branching diagram of evolutionary relationships between organisms

A phylogenetic tree, phylogeny or evolutionary tree is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time. In other words, it is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. In evolutionary biology, all life on Earth is theoretically part of a single phylogenetic tree, indicating common ancestry. Phylogenetics is the study of phylogenetic trees. The main challenge is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of species or taxa. Computational phylogenetics focuses on the algorithms involved in finding optimal phylogenetic tree in the phylogenetic landscape.

<span class="mw-page-title-main">Nonlinear dimensionality reduction</span> Summary of algorithms for nonlinear dimensionality reduction

Nonlinear dimensionality reduction, also known as manifold learning, refers to various related techniques that aim to project high-dimensional data onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.

In mathematics, an ultrametric space is a metric space in which the triangle inequality is strengthened to . Sometimes the associated metric is also called a non-Archimedean metric or super-metric.

In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa to create the phylogenetic tree.

UPGMA is a simple agglomerative (bottom-up) hierarchical clustering method. It also has a weighted variant, WPGMA, and they are generally attributed to Sokal and Michener.

In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. Though, in more broad terms, a similarity function may also satisfy metric axioms.

<span class="mw-page-title-main">Clustal</span>

Clustal is a series of widely used computer programs used in bioinformatics for multiple sequence alignment. There have been many versions of Clustal over the development of the algorithm that are listed below. The analysis of each tool and its algorithm is also detailed in their respective categories. Available operating systems listed in the sidebar are a combination of the software availability and may not be supported for every current version of the Clustal tools. Clustal Omega has the widest variety of operating systems out of all the Clustal tools.

In computational phylogenetics, tree alignment is a computational problem concerned with producing multiple sequence alignments, or alignments of three or more sequences of DNA, RNA, or protein. Sequences are arranged into a phylogenetic tree, modeling the evolutionary relationships between species or taxa. The edit distances between sequences are calculated for each of the tree's internal vertices, such that the sum of all edit distances within the tree is minimized. Tree alignment can be accomplished using one of several algorithms with various trade-offs between manageable tree size and computational effort.

Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. The goal is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of genes, species, or taxa. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data. Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms to search for optimal or the best phylogenetic tree. The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

In bioinformatics, MAFFT is a program used to create multiple sequence alignments of amino acid or nucleotide sequences. Published in 2002, the first version of MAFFT used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the fast Fourier transform. Subsequent versions of MAFFT have added other algorithms and modes of operation, including options for faster alignment of large numbers of sequences, higher accuracy alignments, alignment of non-coding RNA sequences, and the addition of new sequences to existing alignments.

Distance matrices are used in phylogeny as non-parametric distance methods and were originally applied to phenetic data using a matrix of pairwise distances. These distances are then reconciled to produce a tree. The distance matrix can come from a number of different sources, including measured distance or morphometric analysis, various pairwise distance formulae applied to discrete morphological characters, or genetic distance from sequence, restriction fragment, or allozyme data. For phylogenetic character data, raw distance values can be calculated by simply counting the number of pairwise differences in character states.

In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion, at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other.

<span class="mw-page-title-main">Isomap</span>

Isomap is a nonlinear dimensionality reduction method. It is one of several widely used low-dimensional embedding methods. Isomap is used for computing a quasi-isometric, low-dimensional embedding of a set of high-dimensional data points. The algorithm provides a simple method for estimating the intrinsic geometry of a data manifold based on a rough estimate of each data point’s neighbors on the manifold. Isomap is highly efficient and generally applicable to a broad range of data sources and dimensionalities.

Consensus clustering is a method of aggregating results from multiple clustering algorithms. Also called cluster ensembles or aggregation of clustering, it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings. Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete, even when the number of input clusterings is three. Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning.

Complete-linkage clustering is one of several methods of agglomerative hierarchical clustering. At the beginning of the process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters until all elements end up being in the same cluster. The method is also known as farthest neighbour clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusion and the distance at which each fusion took place.

T-REX is a freely available web server, developed at the department of Computer Science of the Université du Québec à Montréal, dedicated to the inference, validation and visualization of phylogenetic trees and phylogenetic networks. The T-REX web server allows the users to perform several popular methods of phylogenetic analysis as well as some new phylogenetic applications for inferring, drawing and validating phylogenetic trees and networks.

WPGMA is a simple agglomerative (bottom-up) hierarchical clustering method, generally attributed to Sokal and Michener.

References

  1. Weyenberg, G., & Yoshida, R. (2015). Reconstructing the phylogeny: Computational methods. In Algebraic and Discrete Mathematical methods for modern Biology (pp. 293–319). Academic Press.
  2. Frank Harary, Robert Z. Norman and Dorwin Cartwright (1965) Structural Models: An Introduction to the Theory of Directed Graphs, pages 134–8, John Wiley & Sons MR 0184874
  3. 1 2 Sung, Wing-Kin (2010). Algorithms in bioinformatics: A practical introduction. Chapman & Hall. p. 29. ISBN   978-1-4200-7033-0.
  4. 1 2 Felsenstein, Joseph (2003). Inferring phylogenies. Sinauer Associates. ISBN   9780878931774.
  5. Saitou, Naruya (1987). "The neighbor-joining method: A new method for reconstructing phylogenetic trees". Molecular Biology and Evolution. 4 (4): 406–425. doi: 10.1093/oxfordjournals.molbev.a040454 . PMID   3447015.
  6. Fitch, Walter M. (1967). "Construction of Phylogenetic Trees: A method based on mutation distances as estimated from cytochrome c sequences is of general applicability". Science. 155 (3760): 279–284. doi:10.1126/science.155.3760.279. PMID   5334057.
  7. "4 types of distance metrics in machine learning". February 25, 2020.
  8. 1 2 Mihalic, Zlatko (1992). "The distance matrix in chemistry". Journal of Mathematical Chemistry. 11: 223–258. doi:10.1007/BF01164206. S2CID   121181446.