WPGMA

Last updated

WPGMA (Weighted Pair Group Method with Arithmetic Mean) is a simple agglomerative (bottom-up) hierarchical clustering method, generally attributed to Sokal and Michener. [1]

Contents

The WPGMA method is similar to its unweighted variant, the UPGMA method.

Algorithm

The WPGMA algorithm constructs a rooted tree (dendrogram) that reflects the structure present in a pairwise distance matrix (or a similarity matrix). At each step, the nearest two clusters, say and , are combined into a higher-level cluster . Then, its distance to another cluster is simply the arithmetic mean of the average distances between members of and and and  :

The WPGMA algorithm produces rooted dendrograms and requires a constant-rate assumption: it produces an ultrametric tree in which the distances from the root to every branch tip are equal. This ultrametricity assumption is called the molecular clock when the tips involve DNA, RNA and protein data.

Working example

This working example is based on a JC69 genetic distance matrix computed from the 5S ribosomal RNA sequence alignment of five bacteria: Bacillus subtilis (), Bacillus stearothermophilus (), Lactobacillus viridescens (), Acholeplasma modicum (), and Micrococcus luteus (). [2] [3]

First step

Let us assume that we have five elements and the following matrix of pairwise distances between them :

abcde
a017213123
b170303421
c213002839
d313428043
e232139430

In this example, is the smallest value of , so we join elements and .

Let denote the node to which and are now connected. Setting ensures that elements and are equidistant from . This corresponds to the expectation of the ultrametricity hypothesis. The branches joining and to then have lengths ( see the final dendrogram )

We then proceed to update the initial distance matrix into a new distance matrix (see below), reduced in size by one row and one column because of the clustering of with . Bold values in correspond to the new distances, calculated by averaging distances between each element of the first cluster and each of the remaining elements:

Italicized values in are not affected by the matrix update as they correspond to distances between elements not involved in the first cluster.

Second step

We now reiterate the three previous steps, starting from the new distance matrix  :

(a,b)cde
(a,b)025.532.522
c25.502839
d32.528043
e2239430

Here, is the smallest value of , so we join cluster and element .

Let denote the node to which and are now connected. Because of the ultrametricity constraint, the branches joining or to , and to are equal and have the following length:

We deduce the missing branch length: ( see the final dendrogram )

We then proceed to update the matrix into a new distance matrix (see below), reduced in size by one row and one column because of the clustering of with  :

Of note, this average calculation of the new distance does not account for the larger size of the cluster (two elements) with respect to (one element). Similarly:

The averaging procedure therefore gives differential weight to the initial distances of matrix . This is the reason why the method is weighted, not with respect to the mathematical procedure but with respect to the initial distances.

Third step

We again reiterate the three previous steps, starting from the updated distance matrix .

((a,b),e)cd
((a,b),e)032.2537.75
c32.25028
d37.75280

Here, is the smallest value of , so we join elements and .

Let denote the node to which and are now connected. The branches joining and to then have lengths ( see the final dendrogram )

There is a single entry to update:

Final step

The final matrix is:

((a,b),e)(c,d)
((a,b),e)035
(c,d)350

So we join clusters and .

Let denote the (root) node to which and are now connected. The branches joining and to then have lengths:

We deduce the two remaining branch lengths:

The WPGMA dendrogram

WPGMA Dendrogram 5S data WPGMA Dendrogram 5S data.svg
WPGMA Dendrogram 5S data

The dendrogram is now complete. It is ultrametric because all tips ( to ) are equidistant from  :

The dendrogram is therefore rooted by , its deepest node.

Comparison with other linkages

Alternative linkage schemes include single linkage clustering, complete linkage clustering, and UPGMA average linkage clustering. Implementing a different linkage is simply a matter of using a different formula to calculate inter-cluster distances during the distance matrix update steps of the above algorithm. Complete linkage clustering avoids a drawback of the alternative single linkage clustering method - the so-called chaining phenomenon, where clusters formed via single linkage clustering may be forced together due to single elements being close to each other, even though many of the elements in each cluster may be very distant to each other. Complete linkage tends to find compact clusters of approximately equal diameters. [4]

Comparison of dendrograms obtained under different clustering methods from the same distance matrix.
Simple linkage-5S.svg
Complete linkage Dendrogram 5S data.svg
WPGMA Dendrogram 5S data.svg
UPGMA Dendrogram 5S data.svg
Single-linkage clustering Complete-linkage clustering Average linkage clustering: WPGMA. Average linkage clustering: UPGMA

See also

Related Research Articles

<span class="mw-page-title-main">Spinor</span> Non-tensorial representation of the spin group; represents fermions in physics

In geometry and physics, spinors are elements of a complex number-based vector space that can be associated with Euclidean space. A spinor transforms linearly when the Euclidean space is subjected to a slight (infinitesimal) rotation, but unlike geometric vectors and tensors, a spinor transforms to its negative when the space rotates through 360°. It takes a rotation of 720° for a spinor to go back to its original state. This property characterizes spinors: spinors can be viewed as the "square roots" of vectors.

Kinematics is a subfield of physics, developed in classical mechanics, that describes the motion of points, bodies (objects), and systems of bodies without considering the forces that cause them to move. Kinematics, as a field of study, is often referred to as the "geometry of motion" and is occasionally seen as a branch of mathematics. A kinematics problem begins by describing the geometry of the system and declaring the initial conditions of any known values of position, velocity and/or acceleration of points within the system. Then, using arguments from geometry, the position, velocity and acceleration of any unknown parts of the system can be determined. The study of how forces act on bodies falls within kinetics, not kinematics. For further details, see analytical dynamics.

<span class="mw-page-title-main">Moment of inertia</span> Scalar measure of the rotational inertia with respect to a fixed axis of rotation

The moment of inertia, otherwise known as the mass moment of inertia, angular mass, second moment of mass, or most accurately, rotational inertia, of a rigid body is a quantity that determines the torque needed for a desired angular acceleration about a rotational axis, akin to how mass determines the force needed for a desired acceleration. It depends on the body's mass distribution and the axis chosen, with larger moments requiring more torque to change the body's rate of rotation.

In mathematics, the Hodge star operator or Hodge star is a linear map defined on the exterior algebra of a finite-dimensional oriented vector space endowed with a nondegenerate symmetric bilinear form. Applying the operator to an element of the algebra produces the Hodge dual of the element. This map was introduced by W. V. D. Hodge.

In mathematics, an ultrametric space is a metric space in which the triangle inequality is strengthened to . Sometimes the associated metric is also called a non-Archimedean metric or super-metric.

<span class="mw-page-title-main">Quantum group</span> Algebraic construct of interest in theoretical physics

In mathematics and theoretical physics, the term quantum group denotes one of a few different kinds of noncommutative algebras with additional structure. These include Drinfeld–Jimbo type quantum groups, compact matrix quantum groups, and bicrossproduct quantum groups. Despite their name, they do not themselves have a natural group structure, though they are in some sense 'close' to a group.

In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa to create the phylogenetic tree.

UPGMA is a simple agglomerative (bottom-up) hierarchical clustering method. It also has a weighted variant, WPGMA, and they are generally attributed to Sokal and Michener.

<span class="mw-page-title-main">Hierarchical clustering</span> Statistical method of analysis which seeks to build a hierarchy of clusters

In data mining and statistics, hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

<span class="mw-page-title-main">Total least squares</span>

In applied statistics, total least squares is a type of errors-in-variables regression, a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression and also of orthogonal regression, and can be applied to both linear and non-linear models.

In linear algebra, geometry, and trigonometry, the Cayley–Menger determinant is a formula for the content, i.e. the higher-dimensional volume, of a -dimensional simplex in terms of the squares of all of the distances between pairs of its vertices. The determinant is named after Arthur Cayley and Karl Menger.

In solid-state physics, the tight-binding model is an approach to the calculation of electronic band structure using an approximate set of wave functions based upon superposition of wave functions for isolated atoms located at each atomic site. The method is closely related to the LCAO method used in chemistry. Tight-binding models are applied to a wide variety of solids. The model gives good qualitative results in many cases and can be combined with other models that give better results where the tight-binding model fails. Though the tight-binding model is a one-electron model, the model also provides a basis for more advanced calculations like the calculation of surface states and application to various kinds of many-body problem and quasiparticle calculations.

<span class="mw-page-title-main">Gaussian network model</span>

The Gaussian network model (GNM) is a representation of a biological macromolecule as an elastic mass-and-spring network to study, understand, and characterize the mechanical aspects of its long-time large-scale dynamics. The model has a wide range of applications from small proteins such as enzymes composed of a single domain, to large macromolecular assemblies such as a ribosome or a viral capsid. Protein domain dynamics plays key roles in a multitude of molecular recognition and cell signalling processes. Protein domains, connected by intrinsically disordered flexible linker domains, induce long-range allostery via protein domain dynamics. The resultant dynamic modes cannot be generally predicted from static structures of either the entire protein or individual domains.

In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion, at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other.

<span class="mw-page-title-main">Line integral</span> Definite integral of a scalar or vector field along a path

In mathematics, a line integral is an integral where the function to be integrated is evaluated along a curve. The terms path integral, curve integral, and curvilinear integral are also used; contour integral is used as well, although that is typically reserved for line integrals in the complex plane.

Complete-linkage clustering is one of several methods of agglomerative hierarchical clustering. At the beginning of the process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters until all elements end up being in the same cluster. The method is also known as farthest neighbour clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusion and the distance at which each fusion took place.

The stretched grid method (SGM) is a numerical technique for finding approximate solutions of various mathematical and engineering problems that can be related to an elastic grid behavior. In particular, meteorologists use the stretched grid method for weather prediction and engineers use the stretched grid method to design tents and other tensile structures.

<span class="mw-page-title-main">Derivations of the Lorentz transformations</span>

There are many ways to derive the Lorentz transformations utilizing a variety of physical principles, ranging from Maxwell's equations to Einstein's postulates of special relativity, and mathematical tools, spanning from elementary algebra and hyperbolic functions, to linear algebra and group theory.

<span class="mw-page-title-main">Symmetry in quantum mechanics</span> Properties underlying modern physics

Symmetries in quantum mechanics describe features of spacetime and particles which are unchanged under some transformation, in the context of quantum mechanics, relativistic quantum mechanics and quantum field theory, and with applications in the mathematical formulation of the standard model and condensed matter physics. In general, symmetry in physics, invariance, and conservation laws, are fundamentally important constraints for formulating physical theories and models. In practice, they are powerful methods for solving problems and predicting what can happen. While conservation laws do not always give the answer to the problem directly, they form the correct constraints and the first steps to solving a multitude of problems.

References

  1. Sokal, Michener (1958). "A statistical method for evaluating systematic relationships". University of Kansas Science Bulletin. 38: 1409–1438.
  2. Erdmann VA, Wolters J (1986). "Collection of published 5S, 5.8S and 4.5S ribosomal RNA sequences". Nucleic Acids Research. 14 Suppl (Suppl): r1–59. doi:10.1093/nar/14.suppl.r1. PMC   341310 . PMID   2422630.
  3. Olsen GJ (1988). Phylogenetic analysis using ribosomal RNA. Methods in Enzymology. Vol. 164. pp. 793–812. doi:10.1016/s0076-6879(88)64084-5. PMID   3241556.
  4. Everitt, B. S.; Landau, S.; Leese, M. (2001). Cluster Analysis. 4th Edition. London: Arnold. pp. 62–64.