HCS clustering algorithm

Last updated
HCS clustering algorithm
Class Cluster analysis (on a similarity graph)
Data structure Graph
Worst-case performance O(2N x f(n,m))

The HCS (Highly Connected Subgraphs) clustering algorithm [1] (also known as the HCS algorithm, and other names such as Highly Connected Clusters/Components/Kernels) is an algorithm based on graph connectivity for cluster analysis. It works by representing the similarity data in a similarity graph, and then finding all the highly connected subgraphs. It does not make any prior assumptions on the number of the clusters. This algorithm was published by Erez Hartuv and Ron Shamir in 2000.

Contents

The HCS algorithm gives a clustering solution, which is inherently meaningful in the application domain, since each solution cluster must have diameter 2 while a union of two solution clusters will have diameter 3.

Similarity modeling and preprocessing

The goal of cluster analysis is to group elements into disjoint subsets, or clusters, based on similarity between elements, so that elements in the same cluster are highly similar to each other (homogeneity), while elements from different clusters have low similarity to each other (separation). Similarity graph is one of the models to represent the similarity between elements, and in turn facilitate generating of clusters. To construct a similarity graph from similarity data, represent elements as vertices, and elicit edges between vertices when the similarity value between them is above some threshold.

Algorithm

In the similarity graph, the more edges exist for a given number of vertices, the more similar such a set of vertices are between each other. In other words, if we try to disconnect a similarity graph by removing edges, the more edges we need to remove before the graph becomes disconnected, the more similar the vertices in this graph. Minimum cut is a minimum set of edges without which the graph will become disconnected.

HCS clustering algorithm finds all the subgraphs with n vertices such that the minimum cut of those subgraphs contain more than n/2 edges, and identifies them as clusters. Such a subgraph is called a Highly Connected Subgraph (HCS). Single vertices are not considered clusters and are grouped into a singletons set S.

Given a similarity graph G(V,E), HCS clustering algorithm will check if it is already highly connected, if yes, returns G, otherwise uses the minimum cut of G to partition G into two subgraphs H and H', and recursively run HCS clustering algorithm on H and H'.

Example

The following animation shows how the HCS clustering algorithm partitions a similarity graph into three clusters.

HCS Algorithm.gif

Pseudocode

function HCS(G(V, E)) isifG is highly connected thenreturn (G)     else         (H1, H2, C) ← MINIMUMCUT(G)         HCS(H1)         HCS(H2)     end ifend function

The step of finding the minimum cut on graph G is a subroutine that can be implemented using different algorithms for this problem. See below for an example algorithm for finding minimum cut using randomization.

Complexity

The running time of the HCS clustering algorithm is bounded by N × f(n, m). f(n, m) is the time complexity of computing a minimum cut in a graph with n vertices and m edges, and N is the number of clusters found. In many applications N << n.

For fast algorithms for finding a minimum cut in an unweighted graph:

Proofs of properties

The clusters produced by the HCS clustering algorithm possess several properties, which can demonstrate the homogeneity and separation of the solution.

Theorem 1 The diameter of every highly connected graph is at most two.

Proof: Let n=|G|. If G has a vertex x with degree <= n/2, then G has a minimum cut (that isolates x) with edges <= n/2, so G is not highly connected. So if G is highly connected, every vertex has degree >= n/2. There is a famous theorem in graph theory that says that if every vertex has degree >= n/2, then the diameter of G (the longest path between any two nodes) <= 2.

Theorem 2 (a) The number of edges in a highly connected graph is quadratic. (b) The number of edges removed by each iteration of the HCS algorithm is at most linear.

Proof: (a) From Theorem 1 we know that every vertex has degree >= n/2. Therefore, the number of edges in a highly connected graph must be at least (n × n/2)/2, where we sum the degrees of each vertex and divide by 2.

(b) By definition, each iteration removes a minimum cut with <= n/2 edges.

Theorems 1 and 2a provide a strong indication of a final cluster's homogeneity. Doing better approaches the case where all vertices of a cluster are connected, which is both too stringent and also NP-hard.

Theorem 2b indicates separation since any two final clusters C1 and C2 would not have been separated unless there were at most O(C1+C2) edges between them (contrast with the quadratic edges within clusters).

Variations

Singletons adoption: Elements left as singletons by the initial clustering process can be "adopted" by clusters based on similarity to the cluster. If the maximum number of neighbors to a specific cluster is large enough, then it can be added to that cluster.

Removing Low Degree Vertices: When the input graph has vertices with low degrees, it is not worthy to run the algorithm since it is computationally expensive and not informative. Alternatively, a refinement of the algorithm can first remove all vertices with a degree lower than certain threshold.

Examples of HCS usage

Related Research Articles

<span class="mw-page-title-main">Minimum spanning tree</span> Least-weight tree connecting graph vertices

A minimum spanning tree (MST) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. More generally, any edge-weighted undirected graph has a minimum spanning forest, which is a union of the minimum spanning trees for its connected components.

<span class="mw-page-title-main">Hamiltonian path</span> Path in a graph that visits each vertex exactly once

In the mathematical field of graph theory, a Hamiltonian path is a path in an undirected or directed graph that visits each vertex exactly once. A Hamiltonian cycle is a cycle that visits each vertex exactly once. A Hamiltonian path that starts and ends at adjacent vertices can be completed by adding one more edge to form a Hamiltonian cycle, and removing any edge from a Hamiltonian cycle produces a Hamiltonian path. The computational problems of determining whether such paths and cycles exist in graphs are NP-complete; see Hamiltonian path problem for details.

In graph theory, a branch of mathematics, the (binary) cycle space of an undirected graph is the set of its even-degree subgraphs.

This is a glossary of graph theory. Graph theory is the study of graphs, systems of nodes or vertices connected in pairs by lines or edges.

<span class="mw-page-title-main">Spanning tree</span> Tree which includes all vertices of a graph

In the mathematical field of graph theory, a spanning treeT of an undirected graph G is a subgraph that is a tree which includes all of the vertices of G. In general, a graph may have several spanning trees, but a graph that is not connected will not contain a spanning tree. If all of the edges of G are also edges of a spanning tree T of G, then G is a tree and is identical to T.

<span class="mw-page-title-main">Clique (graph theory)</span> Adjacent subset of an undirected graph

In the mathematical area of graph theory, a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent. That is, a clique of a graph is an induced subgraph of that is complete. Cliques are one of the basic concepts of graph theory and are used in many other mathematical problems and constructions on graphs. Cliques have also been studied in computer science: the task of finding whether there is a clique of a given size in a graph is NP-complete, but despite this hardness result, many algorithms for finding cliques have been studied.

<span class="mw-page-title-main">Perfect graph</span> Graph with tight clique-coloring relation

In graph theory, a perfect graph is a graph in which the chromatic number equals the size of the maximum clique, both in the graph itself and in every induced subgraph. In all graphs, the chromatic number is greater than or equal to the size of the maximum clique, but they can be far apart. A graph is perfect when these numbers are equal, and remain equal after the deletion of arbitrary subsets of vertices.

In the mathematical discipline of graph theory, the line graph of an undirected graph G is another graph L(G) that represents the adjacencies between edges of G. L(G) is constructed in the following way: for each edge in G, make a vertex in L(G); for every two edges in G that have a vertex in common, make an edge between their corresponding vertices in L(G).

<span class="mw-page-title-main">Strongly connected component</span> Partition of a graph whose components are reachable from all vertices

In the mathematical theory of directed graphs, a graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of a directed graph form a partition into subgraphs that are themselves strongly connected. It is possible to test the strong connectivity of a graph, or to find its strongly connected components, in linear time (that is, Θ(V + E )).

In the mathematical theory of matroids, a graphic matroid is a matroid whose independent sets are the forests in a given finite undirected graph. The dual matroids of graphic matroids are called co-graphic matroids or bond matroids. A matroid that is both graphic and co-graphic is sometimes called a planar matroid ; these are exactly the graphic matroids formed from planar graphs.

<span class="mw-page-title-main">Connectivity (graph theory)</span> Basic concept of graph theory

In mathematics and computer science, connectivity is one of the basic concepts of graph theory: it asks for the minimum number of elements that need to be removed to separate the remaining nodes into two or more isolated subgraphs. It is closely related to the theory of network flow problems. The connectivity of a graph is an important measure of its resilience as a network.

The arboricity of an undirected graph is the minimum number of forests into which its edges can be partitioned. Equivalently it is the minimum number of spanning forests needed to cover all the edges of the graph. The Nash-Williams theorem provides necessary and sufficient conditions for when a graph is k-arboric.

<span class="mw-page-title-main">Dual graph</span> Graph representing faces of another graph

In the mathematical discipline of graph theory, the dual graph of a planar graph G is a graph that has a vertex for each face of G. The dual graph has an edge for each pair of faces in G that are separated from each other by an edge, and a self-loop when the same face appears on both sides of an edge. Thus, each edge e of G has a corresponding dual edge, whose endpoints are the dual vertices corresponding to the faces on either side of e. The definition of the dual depends on the choice of embedding of the graph G, so it is a property of plane graphs rather than planar graphs. For planar graphs generally, there may be multiple dual graphs, depending on the choice of planar embedding of the graph.

<span class="mw-page-title-main">Minimum cut</span> Partition of a graph by removing fewest possible edges

In graph theory, a minimum cut or min-cut of a graph is a cut that is minimal in some metric.

In the mathematical field of graph theory, a transitive reduction of a directed graph D is another directed graph with the same vertices and as few edges as possible, such that for all pairs of vertices v, w a (directed) path from v to w in D exists if and only if such a path exists in the reduction. Transitive reductions were introduced by Aho, Garey & Ullman (1972), who provided tight bounds on the computational complexity of constructing them.

In graph theory, a connected graph is k-edge-connected if it remains connected whenever fewer than k edges are removed.

<span class="mw-page-title-main">Biconnected component</span> Maximal biconnected subgraph

In graph theory, a biconnected component or block is a maximal biconnected subgraph. Any connected graph decomposes into a tree of biconnected components called the block-cut tree of the graph. The blocks are attached to each other at shared vertices called cut vertices or separating vertices or articulation points. Specifically, a cut vertex is any vertex whose removal increases the number of connected components. A block containing at most one cut vertex is called a leaf block, it corresponds to a leaf vertex in the block-cut tree.

<span class="mw-page-title-main">Pseudoforest</span> Graph with at most one cycle per component

In graph theory, a pseudoforest is an undirected graph in which every connected component has at most one cycle. That is, it is a system of vertices and edges connecting pairs of vertices, such that no two cycles of consecutive edges share any vertex with each other, nor can any two cycles be connected to each other by a path of consecutive edges. A pseudotree is a connected pseudoforest.

In graph theory, the planar separator theorem is a form of isoperimetric inequality for planar graphs, that states that any planar graph can be split into smaller pieces by removing a small number of vertices. Specifically, the removal of vertices from an n-vertex graph can partition the graph into disjoint subgraphs each of which has at most vertices.

<span class="mw-page-title-main">Degeneracy (graph theory)</span> Measurement of graph sparsity

In graph theory, a k-degenerate graph is an undirected graph in which every subgraph has a vertex of degree at most k: that is, some vertex in the subgraph touches k or fewer of the subgraph's edges. The degeneracy of a graph is the smallest value of k for which it is k-degenerate. The degeneracy of a graph is a measure of how sparse it is, and is within a constant factor of other sparsity measures such as the arboricity of a graph.

References

  1. Hartuv, E.; Shamir, R. (2000), "A clustering algorithm based on graph connectivity", Information Processing Letters, 76 (4–6): 175–181, doi:10.1016/S0020-0190(00)00142-3
  2. E Hartuv, A O Schmitt, J Lange, S Meier-Ewert, H Lehrach, R Shamir. "An algorithm for clustering cDNA fingerprints." Genomics 66, no. 3 (2000): 249-256.
  3. Jurisica, Igor, and Dennis Wigle. Knowledge Discovery in Proteomics. Vol. 8. CRC press, 2006.
  4. Xu, Rui, and Donald Wunsch. "Survey of clustering algorithms." Neural Networks, IEEE Transactions on 16, no. 3 (2005): 645-678.
  5. Sharan, R.; Shamir, R. (2000), "CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis", Proceedings ISMB '00, 8: 307–316C, PMID   10977092
  6. Huffner, F.; Komusiewicz, C.; Liebtrau, A; Niedermeier, R (2014), "Partitioning Biological Networks into Highly Connected Clusters with Maximum Edge Coverage", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11 (3): 455–467, CiteSeerX   10.1.1.377.1900 , doi:10.1109/TCBB.2013.177, PMID   26356014, S2CID   991687