HCS clustering algorithm

HCS clustering algorithm
Class	Cluster analysis (on a similarity graph)
Data structure	Graph
Worst-case performance	O(2N x f(n,m))

Last updated September 06, 2022

The HCS (Highly Connected Subgraphs) clustering algorithm ^[1] (also known as the HCS algorithm, and other names such as Highly Connected Clusters/Components/Kernels) is an algorithm based on graph connectivity for cluster analysis. It works by representing the similarity data in a similarity graph, and then finding all the highly connected subgraphs. It does not make any prior assumptions on the number of the clusters. This algorithm was published by Erez Hartuv and Ron Shamir in 2000.

Similarity modeling and preprocessing

The goal of cluster analysis is to group elements into disjoint subsets, or clusters, based on similarity between elements, so that elements in the same cluster are highly similar to each other (homogeneity), while elements from different clusters have low similarity to each other (separation). Similarity graph is one of the models to represent the similarity between elements, and in turn facilitate generating of clusters. To construct a similarity graph from similarity data, represent elements as vertices, and elicit edges between vertices when the similarity value between them is above some threshold.

Algorithm

In the similarity graph, the more edges exist for a given number of vertices, the more similar such a set of vertices are between each other. In other words, if we try to disconnect a similarity graph by removing edges, the more edges we need to remove before the graph becomes disconnected, the more similar the vertices in this graph. Minimum cut is a minimum set of edges without which the graph will become disconnected.

HCS clustering algorithm finds all the subgraphs with n vertices such that the minimum cut of those subgraphs contain more than n/2 edges, and identifies them as clusters. Such a subgraph is called a Highly Connected Subgraph (HCS). Single vertices are not considered clusters and are grouped into a singletons set S.

Given a similarity graph G(V,E), HCS clustering algorithm will check if it is already highly connected, if yes, returns G, otherwise uses the minimum cut of G to partition G into two subgraphs H and H', and recursively run HCS clustering algorithm on H and H'.

Example

The following animation shows how the HCS clustering algorithm partitions a similarity graph into three clusters.

Pseudocode

function HCS(G(V, E)) isifG is highly connected thenreturn (G)     else         (H1, H2, C) ← MINIMUMCUT(G)         HCS(H1)         HCS(H2)     end ifend function

The step of finding the minimum cut on graph G is a subroutine that can be implemented using different algorithms for this problem. See below for an example algorithm for finding minimum cut using randomization.

Complexity

The running time of the HCS clustering algorithm is bounded by N × f(n, m). f(n, m) is the time complexity of computing a minimum cut in a graph with n vertices and m edges, and N is the number of clusters found. In many applications N << n.

For fast algorithms for finding a minimum cut in an unweighted graph:

Proofs of properties

The clusters produced by the HCS clustering algorithm possess several properties, which can demonstrate the homogeneity and separation of the solution.

Theorem 1 The diameter of every highly connected graph is at most two.

Proof: Let n=|G|. If G has a vertex x with degree <= n/2, then G has a minimum cut (that isolates x) with edges <= n/2, so G is not highly connected. So if G is highly connected, every vertex has degree >= n/2. There is a famous theorem in graph theory that says that if every vertex has degree >= n/2, then the diameter of G (the longest path between any two nodes) <= 2.

Theorem 2 (a) The number of edges in a highly connected graph is quadratic. (b) The number of edges removed by each iteration of the HCS algorithm is at most linear.

Proof: (a) From Theorem 1 we know that every vertex has degree >= n/2. Therefore, the number of edges in a highly connected graph must be at least (n × n/2)/2, where we sum the degrees of each vertex and divide by 2.

(b) By definition, each iteration removes a minimum cut with <= n/2 edges.

Theorems 1 and 2a provide a strong indication of a final cluster's homogeneity. Doing better approaches the case where all vertices of a cluster are connected, which is both too stringent and also NP-hard.

Theorem 2b indicates separation since any two final clusters C1 and C2 would not have been separated unless there were at most O(C1+C2) edges between them (contrast with the quadratic edges within clusters).

Variations

Singletons adoption: Elements left as singletons by the initial clustering process can be "adopted" by clusters based on similarity to the cluster. If the maximum number of neighbors to a specific cluster is large enough, then it can be added to that cluster.

Removing Low Degree Vertices: When the input graph has vertices with low degrees, it is not worthy to run the algorithm since it is computationally expensive and not informative. Alternatively, a refinement of the algorithm can first remove all vertices with a degree lower than certain threshold.

Examples of HCS usage

Gene expression analysis ^[2] The hybridization of synthetic oligonucleotides to arrayed cDNAs yields a fingerprint for each cDNA clone. Run HCS algorithm on these fingerprints can identify clones corresponding to the same gene.

PPI network structure discovery ^[3] Using HCS clustering to detect dense subnetworks in PPI that may have biological meaning and represent biological processes.
"Survey of clustering algorithms." Neural Networks, IEEE Transactions ^[4]
The CLICK clustering algorithm^[5] is an adaptation of HCS algorithm on weighted similarity graphs, where the weight is assigned with a probability flavor.
https://www.researchgate.net/publication/259350461_Partitioning_Biological_Networks_into_Highly_Connected_Clusters_with_Maximum_Edge_Coverage Partitioning Biological Networks into Highly Connected Clusters with Maximum Edge Coverage]^[6]

Related Research Articles

A minimum spanning tree (MST) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. More generally, any edge-weighted undirected graph has a minimum spanning forest, which is a union of the minimum spanning trees for its connected components.

Component (graph theory) Maximal subgraph whose vertices can reach each other

In graph theory, a component of an undirected graph is a connected subgraph that is not part of any larger connected subgraph. The components of any graph partition its vertices into disjoint sets, and are the induced subgraphs of those sets. A graph that is itself connected has exactly one component, consisting of the whole graph. Components are sometimes called connected components.

In graph theory, a branch of mathematics, the (binary) cycle space of an undirected graph is the set of its even-degree subgraphs.

This is a glossary of graph theory. Graph theory is the study of graphs, systems of nodes or vertices connected in pairs by lines or edges.

In the mathematical field of graph theory, a spanning treeT of an undirected graph G is a subgraph that is a tree which includes all of the vertices of G. In general, a graph may have several spanning trees, but a graph that is not connected will not contain a spanning tree. If all of the edges of G are also edges of a spanning tree T of G, then G is a tree and is identical to T.

In the mathematical area of graph theory, a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent. That is, a clique of a graph $is an induced subgraph of that is complete. Cliques are one of the basic concepts of graph theory and are used in many other mathematical problems and constructions on graphs. Cliques have also been studied in computer science: the task of finding whether there is a clique of a given size in a graph is NP-complete, but despite this hardness result, many algorithms for finding cliques have been studied.$

In the mathematical discipline of graph theory, the line graph of an undirected graph $G$ is another graph $L(G)$ that represents the adjacencies between edges of $G$ . $L(G)$ is constructed in the following way: for each edge in $G$ , make a vertex in $L(G)$ ; for every two edges in $G$ that have a vertex in common, make an edge between their corresponding vertices in $L(G)$ .

Strongly connected component Partition of a graph whose components are reachable from all vertices

In the mathematical theory of directed graphs, a graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected. It is possible to test the strong connectivity of a graph, or to find its strongly connected components, in linear time.

In graph theory, an edge coloring of a graph is an assignment of "colors" to the edges of the graph so that no two incident edges have the same color. For example, the figure to the right shows an edge coloring of a graph by the colors red, blue, and green. Edge colorings are one of several different types of graph coloring. The edge-coloring problem asks whether it is possible to color the edges of a given graph using at most $k$ different colors, for a given value of $k$ , or with the fewest possible colors. The minimum required number of colors for the edges of a given graph is called the chromatic index of the graph. For example, the edges of the graph in the illustration can be colored by three colors but cannot be colored by two colors, so the graph shown has chromatic index three.

In the mathematical theory of matroids, a graphic matroid is a matroid whose independent sets are the forests in a given finite undirected graph. The dual matroids of graphic matroids are called co-graphic matroids or bond matroids. A matroid that is both graphic and co-graphic is sometimes called a planar matroid ; these are exactly the graphic matroids formed from planar graphs.

In graph theory, a bridge, isthmus, cut-edge, or cut arc is an edge of a graph whose deletion increases the graph's number of connected components. Equivalently, an edge is a bridge if and only if it is not contained in any cycle. For a connected graph, a bridge can uniquely determine a cut. A graph is said to be bridgeless or isthmus-free if it contains no bridges.

In mathematics and computer science, connectivity is one of the basic concepts of graph theory: it asks for the minimum number of elements that need to be removed to separate the remaining nodes into two or more isolated subgraphs. It is closely related to the theory of network flow problems. The connectivity of a graph is an important measure of its resilience as a network.

The arboricity of an undirected graph is the minimum number of forests into which its edges can be partitioned. Equivalently it is the minimum number of spanning forests needed to cover all the edges of the graph. The Nash-Williams theorem provides necessary and sufficient conditions for when a graph is k-arboric.

In the mathematical discipline of graph theory, the dual graph of a plane graph $G$ is a graph that has a vertex for each face of $G$ . The dual graph has an edge for each pair of faces in $G$ that are separated from each other by an edge, and a self-loop when the same face appears on both sides of an edge. Thus, each edge $e$ of $G$ has a corresponding dual edge, whose endpoints are the dual vertices corresponding to the faces on either side of $e$ . The definition of the dual depends on the choice of embedding of the graph $G$ , so it is a property of plane graphs rather than planar graphs. For planar graphs generally, there may be multiple dual graphs, depending on the choice of planar embedding of the graph.

In graph theory, a minimum cut or min-cut of a graph is a cut that is minimal in some metric.

In graph theory, a connected graph is $k$ -edge-connected if it remains connected whenever fewer than $k$ edges are removed.

Biconnected component Maximal biconnected subgraph

In graph theory, a biconnected component is a maximal biconnected subgraph. Any connected graph decomposes into a tree of biconnected components called the block-cut tree of the graph. The blocks are attached to each other at shared vertices called cut vertices or separating vertices or articulation points. Specifically, a cut vertex is any vertex whose removal increases the number of connected components.

Pseudoforest Graph with one cycle per component

In graph theory, a pseudoforest is an undirected graph in which every connected component has at most one cycle. That is, it is a system of vertices and edges connecting pairs of vertices, such that no two cycles of consecutive edges share any vertex with each other, nor can any two cycles be connected to each other by a path of consecutive edges. A pseudotree is a connected pseudoforest.

In graph theory, the planar separator theorem is a form of isoperimetric inequality for planar graphs, that states that any planar graph can be split into smaller pieces by removing a small number of vertices. Specifically, the removal of $vertices from an n -vertex graph can partition the graph into disjoint subgraphs each of which has at most vertices.$

In graph theory, a k-degenerate graph is an undirected graph in which every subgraph has a vertex of degree at most k: that is, some vertex in the subgraph touches k or fewer of the subgraph's edges. The degeneracy of a graph is the smallest value of k for which it is k-degenerate. The degeneracy of a graph is a measure of how sparse it is, and is within a constant factor of other sparsity measures such as the arboricity of a graph.

References

↑ Hartuv, E.; Shamir, R. (2000), "A clustering algorithm based on graph connectivity", Information Processing Letters, 76 (4–6): 175–181, doi:10.1016/S0020-0190(00)00142-3
↑ E Hartuv, A O Schmitt, J Lange, S Meier-Ewert, H Lehrach, R Shamir. "An algorithm for clustering cDNA fingerprints." Genomics 66, no. 3 (2000): 249-256.
↑ Jurisica, Igor, and Dennis Wigle. Knowledge Discovery in Proteomics. Vol. 8. CRC press, 2006.
↑ Xu, Rui, and Donald Wunsch. "Survey of clustering algorithms." Neural Networks, IEEE Transactions on 16, no. 3 (2005): 645-678.
↑ Sharan, R.; Shamir, R. (2000), "CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis", Proceedings ISMB '00, 8: 307–316C, PMID 10977092
↑ Huffner, F.; Komusiewicz, C.; Liebtrau, A; Niedermeier, R (2014), "Partitioning Biological Networks into Highly Connected Clusters with Maximum Edge Coverage", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11 (3): 455–467, CiteSeerX 10.1.1.377.1900 , doi:10.1109/TCBB.2013.177, PMID 26356014, S2CID 991687

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Hartuv, E.; Shamir, R. (2000), "A clustering algorithm based on graph connectivity", Information Processing Letters, 76 (4–6): 175–181, doi:10.1016/S0020-0190(00)00142-3

[2] E Hartuv, A O Schmitt, J Lange, S Meier-Ewert, H Lehrach, R Shamir. "An algorithm for clustering cDNA fingerprints." Genomics 66, no. 3 (2000): 249-256.

[3] Jurisica, Igor, and Dennis Wigle. Knowledge Discovery in Proteomics. Vol. 8. CRC press, 2006.

[4] Xu, Rui, and Donald Wunsch. "Survey of clustering algorithms." Neural Networks, IEEE Transactions on 16, no. 3 (2005): 645-678.

[5] Sharan, R.; Shamir, R. (2000), "CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis", Proceedings ISMB '00, 8: 307–316C, PMID 10977092

[6] Huffner, F.; Komusiewicz, C.; Liebtrau, A; Niedermeier, R (2014), "Partitioning Biological Networks into Highly Connected Clusters with Maximum Edge Coverage", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11 (3): 455–467, CiteSeerX 10.1.1.377.1900 , doi:10.1109/TCBB.2013.177, PMID 26356014, S2CID 991687

[1]

[2]

[3]

[4]

[5]

[6]