Correlation clustering

Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance. [1]

Description of the problem

In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects. For example, given a weighted graph where the edge weight indicates whether two nodes are similar (positive edge weight) or different (negative edge weight), the task is to find a clustering that either maximizes agreements (the sum of positive edge weights within clusters plus the absolute value of the sum of negative edge weights between clusters) or minimizes disagreements (the absolute value of the sum of negative edge weights within clusters plus the sum of positive edge weights across clusters). Unlike other clustering algorithms, this does not require choosing the number of clusters in advance, because the objective, to minimize the sum of weights of the cut edges, is independent of the number of clusters.
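The two objectives can be made concrete with a small sketch. The function name `agreements_and_disagreements`, the dictionary-based graph encoding, and the example weights are assumptions chosen for illustration, not part of any particular library.

```python
def agreements_and_disagreements(weights, labels):
    """Score a clustering of a signed, weighted graph.

    weights: dict mapping an edge (u, v) to a signed weight
             (positive = similar, negative = different).
    labels:  dict mapping each node to a cluster identifier.
    Returns (total agreement weight, total disagreement weight).
    """
    agree = 0.0
    disagree = 0.0
    for (u, v), w in weights.items():
        same_cluster = labels[u] == labels[v]
        if (w > 0 and same_cluster) or (w < 0 and not same_cluster):
            agree += abs(w)      # edge is consistent with the clustering
        elif w != 0:
            disagree += abs(w)   # edge contradicts the clustering
    return agree, disagree


# Hypothetical example graph on four nodes.
weights = {("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "c"): -3.0, ("c", "d"): 0.5}
labels = {"a": 0, "b": 0, "c": 1, "d": 1}
print(agreements_and_disagreements(weights, labels))  # (5.5, 1.0)
```

The two totals always add up to the sum of the absolute edge weights, so maximizing one is the same as minimizing the other.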

It may not be possible to find a perfect clustering, in which all similar items are in the same cluster while all dissimilar ones are in different clusters. If the graph does admit a perfect clustering, then simply deleting all the negative edges and finding the connected components in the remaining graph will return the required clusters.
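The connected-components procedure can be sketched as follows. The function name `perfect_clustering` and the edge-list input format are assumptions made for this example; the check at the end reports `None` when a negative edge falls inside a component, which means no perfect clustering exists.

```python
from collections import defaultdict


def perfect_clustering(nodes, positive_edges, negative_edges):
    """Return a perfect clustering as a list of sets, or None if none exists."""
    # Build adjacency lists using the positive edges only
    # (this is the graph with all negative edges deleted).
    adjacency = defaultdict(set)
    for u, v in positive_edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    component = {}   # node -> index of its connected component
    clusters = []
    for start in nodes:
        if start in component:
            continue
        index = len(clusters)
        component[start] = index
        members = {start}
        stack = [start]
        while stack:                      # depth-first search
            u = stack.pop()
            for v in adjacency[u]:
                if v not in component:
                    component[v] = index
                    members.add(v)
                    stack.append(v)
        clusters.append(members)

    # A negative edge inside a component rules out a perfect clustering.
    for u, v in negative_edges:
        if component[u] == component[v]:
            return None
    return clusters


print(perfect_clustering(["a", "b", "c", "d"],
                         [("a", "b"), ("c", "d")],
                         [("a", "c"), ("b", "d")]))
```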

In general, however, a graph may not have a perfect clustering. For example, given nodes a, b, c such that a, b and a, c are similar while b, c are dissimilar, a perfect clustering is not possible. In such cases, the task is to find a clustering that maximizes the number of agreements (the number of + edges inside clusters plus the number of − edges between clusters) or minimizes the number of disagreements (the number of − edges inside clusters plus the number of + edges between clusters). This problem of maximizing the agreements is NP-complete (the multiway cut problem reduces to maximizing weighted agreements, and the problem of partitioning into triangles [2] can be reduced to the unweighted version).
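For the three-node example, an exhaustive search confirms that every clustering incurs at least one disagreement. This is a brute-force sketch for tiny inputs only; the function name `min_disagreements_brute_force` and the sign dictionary are illustrative assumptions.

```python
from itertools import product


def min_disagreements_brute_force(nodes, signs):
    """Try every labeling of the nodes and return (best cost, best labeling).

    signs: dict mapping an edge (u, v) to +1 (similar) or -1 (dissimilar).
    Using up to len(nodes) labels covers every possible partition
    (many of them more than once), so this is only feasible for tiny graphs.
    """
    best_cost, best_labels = None, None
    for assignment in product(range(len(nodes)), repeat=len(nodes)):
        labels = dict(zip(nodes, assignment))
        cost = 0
        for (u, v), sign in signs.items():
            same_cluster = labels[u] == labels[v]
            # A '+' edge that is cut or a '-' edge inside a cluster disagrees.
            if (sign > 0 and not same_cluster) or (sign < 0 and same_cluster):
                cost += 1
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels


nodes = ["a", "b", "c"]
signs = {("a", "b"): +1, ("a", "c"): +1, ("b", "c"): -1}
cost, labels = min_disagreements_brute_force(nodes, signs)
print(cost)  # 1
```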

Formal Definitions

Let $G = (V, E)$ be a graph with nodes $V$ and edges $E$. A clustering of $G$ is a partition $\Pi = \{\pi_1, \dots, \pi_k\}$ of its node set with $V = \pi_1 \cup \dots \cup \pi_k$ and $\pi_i \cap \pi_j = \emptyset$ for $i \neq j$. For a given clustering $\Pi$, let $\delta(\Pi) = \{\{u, v\} \in E \mid \{u, v\} \not\subseteq \pi \;\; \forall \pi \in \Pi\}$ denote the subset of edges of $G$ whose endpoints are in different subsets of the clustering $\Pi$. Now, let $w \colon E \to \mathbb{R}_{\geq 0}$ be a function that assigns a non-negative weight to each edge of the graph and let $E = E^+ \cup E^-$ be a partition of the edges into attractive ($E^+$) and repulsive ($E^-$) edges.

The minimum disagreement correlation clustering problem is the following optimization problem:

$$\min_{\Pi} \; \sum_{e \in E^+ \cap \delta(\Pi)} w_e + \sum_{e \in E^- \setminus \delta(\Pi)} w_e .$$

Here, the set $E^+ \cap \delta(\Pi)$ contains the attractive edges whose endpoints are in different components with respect to the clustering $\Pi$ and the set $E^- \setminus \delta(\Pi)$ contains the repulsive edges whose endpoints are in the same component with respect to the clustering $\Pi$. Together these two sets contain all edges that disagree with the clustering $\Pi$.

Similarly to the minimum disagreement correlation clustering problem, the maximum agreement correlation clustering problem is defined as

$$\max_{\Pi} \; \sum_{e \in E^+ \setminus \delta(\Pi)} w_e + \sum_{e \in E^- \cap \delta(\Pi)} w_e .$$

Here, the set $E^+ \setminus \delta(\Pi)$ contains the attractive edges whose endpoints are in the same component with respect to the clustering $\Pi$ and the set $E^- \cap \delta(\Pi)$ contains the repulsive edges whose endpoints are in different components with respect to the clustering $\Pi$. Together these two sets contain all edges that agree with the clustering $\Pi$.

Instead of formulating the correlation clustering problem in terms of non-negative edge weights and a partition of the edges into attractive and repulsive edges, the problem is also formulated in terms of positive and negative edge costs without partitioning the set of edges explicitly. For given weights $w \colon E \to \mathbb{R}_{\geq 0}$ and a given partition $E = E^+ \cup E^-$ of the edges into attractive and repulsive edges, the edge costs can be defined by

$$c_e = \begin{cases} w_e & \text{if } e \in E^+ \\ -w_e & \text{if } e \in E^- \end{cases}$$

for all $e \in E$.

An edge whose endpoints are in different clusters is said to be cut. The set $\delta(\Pi)$ of all edges that are cut is often called a multicut [3] of $G$.

The minimum cost multicut problem is the problem of finding a clustering $\Pi$ of $G$ such that the sum of the costs of the edges whose endpoints are in different clusters is minimal:

$$\min_{\Pi} \; \sum_{e \in \delta(\Pi)} c_e .$$

Similar to the minimum cost multicut problem, coalition structure generation in weighted graph games [4] is the problem of finding a clustering such that the sum of the costs of the edges that are not cut is maximal:

$$\max_{\Pi} \; \sum_{e \in E \setminus \delta(\Pi)} c_e .$$

This formulation is also known as the clique partitioning problem. [5]

It can be shown that all four problems that are formulated above are equivalent. This means that a clustering that is optimal with respect to any of the four objectives is optimal for all of the four objectives.
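A short derivation sketches why the equivalence holds. The notation is assumed here for illustration: $\delta(\Pi)$ is the set of edges cut by a clustering $\Pi$, $w_e \geq 0$ are the edge weights, $E^+$ and $E^-$ are the attractive and repulsive edges, and $c_e = w_e$ for $e \in E^+$, $c_e = -w_e$ for $e \in E^-$. The names $D$, $A$, $M$ and $K$ for the four objective values are likewise introduced only for this sketch.

```latex
\begin{align*}
D(\Pi) &= \sum_{e \in E^+ \cap \delta(\Pi)} w_e + \sum_{e \in E^- \setminus \delta(\Pi)} w_e
  && \text{(disagreements)} \\
A(\Pi) &= \sum_{e \in E^+ \setminus \delta(\Pi)} w_e + \sum_{e \in E^- \cap \delta(\Pi)} w_e
        = \sum_{e \in E} w_e - D(\Pi)
  && \text{(agreements)} \\
M(\Pi) &= \sum_{e \in \delta(\Pi)} c_e
        = \sum_{e \in E^+ \cap \delta(\Pi)} w_e - \sum_{e \in E^- \cap \delta(\Pi)} w_e
        = D(\Pi) - \sum_{e \in E^-} w_e
  && \text{(multicut cost)} \\
K(\Pi) &= \sum_{e \in E \setminus \delta(\Pi)} c_e
        = \sum_{e \in E} c_e - M(\Pi)
  && \text{(coalition value)}
\end{align*}
```

The terms $\sum_{e \in E} w_e$, $\sum_{e \in E^-} w_e$ and $\sum_{e \in E} c_e$ do not depend on $\Pi$, so minimizing $D$, maximizing $A$, minimizing $M$ and maximizing $K$ all select the same clusterings.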

Algorithms

Bansal et al. [6] discuss the NP-completeness proof and also present both a constant-factor approximation algorithm and a polynomial-time approximation scheme to find the clusters in this setting. Ailon et al. [7] propose a randomized 3-approximation algorithm for the same problem.

    CC-Pivot(G = (V, E+, E−))
        Pick random pivot i ∈ V
        Set C = {i}, V' = Ø
        For all j ∈ V, j ≠ i:
            If (i,j) ∈ E+ then
                Add j to C
            Else (If (i,j) ∈ E−)
                Add j to V'
        Let G' be the subgraph induced by V'
        Return clustering C, CC-Pivot(G')
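The CC-Pivot procedure above can be sketched in Python as follows. This is a minimal illustration rather than the authors' reference implementation: it assumes a complete graph in which every pair not listed in `positive_edges` is a − edge, it replaces the recursion with a loop, and the names `cc_pivot`, `positive_edges` and `rng` are chosen for this example.

```python
import random


def cc_pivot(nodes, positive_edges, rng=None):
    """Cluster nodes by repeatedly picking a random pivot.

    nodes:          iterable of hashable node identifiers.
    positive_edges: set of frozenset({u, v}) pairs marked '+';
                    every other pair is treated as '-'.
    rng:            optional random.Random instance for reproducibility.
    Returns a list of clusters (sets of nodes).
    """
    if rng is None:
        rng = random.Random()
    remaining = list(nodes)
    clusters = []
    while remaining:                       # iterative form of the recursion
        pivot = rng.choice(remaining)
        cluster = {pivot}
        rest = []                          # plays the role of V'
        for j in remaining:
            if j == pivot:
                continue
            if frozenset((pivot, j)) in positive_edges:
                cluster.add(j)             # '+' neighbor joins the pivot
            else:
                rest.append(j)             # '-' neighbor is deferred
        clusters.append(cluster)
        remaining = rest
    return clusters


# Two groups with '+' edges inside each group and '-' edges between them.
positive = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (4, 5)]}
print(cc_pivot([1, 2, 3, 4, 5], positive, rng=random.Random(0)))
```

When the + edges form disjoint cliques, as in this example, every choice of pivot recovers the same clusters; on inconsistent inputs the result depends on the random pivots.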

The authors show that the above algorithm is, in expectation, a 3-approximation algorithm for correlation clustering. The best polynomial-time approximation algorithm known at the moment for this problem achieves a ~2.06 approximation by rounding a linear program, as shown by Chawla, Makarychev, Schramm, and Yaroslavtsev. [8]

Karpinski and Schudy [9] proved the existence of a polynomial-time approximation scheme (PTAS) for that problem on complete graphs with a fixed number of clusters.

Optimal number of clusters

In 2011, it was shown by Bagon and Galun [10] that the optimization of the correlation clustering functional is closely related to well known discrete optimization methods. In their work they proposed a probabilistic analysis of the underlying implicit model that allows the correlation clustering functional to estimate the underlying number of clusters. This analysis suggests the functional assumes a uniform prior over all possible partitions regardless of their number of clusters. Thus, a non-uniform prior over the number of clusters emerges.

Several discrete optimization algorithms that scale gracefully with the number of elements are proposed in this work (experiments show results with more than 100,000 variables). The work of Bagon and Galun also evaluated how effectively the underlying number of clusters is recovered in several applications.

Correlation clustering (data mining)

Correlation clustering also relates to a different task, in which correlations among attributes of feature vectors in a high-dimensional space are assumed to exist and to guide the clustering process. These correlations may be different in different clusters, so a global decorrelation cannot reduce this to traditional (uncorrelated) clustering.

Correlations among subsets of attributes result in different spatial shapes of clusters. Hence, the similarity between cluster objects is defined by taking into account the local correlation patterns. With this notion, the term was introduced in [11] simultaneously with the notion discussed above. Different methods for correlation clustering of this type are discussed in [12], and the relationship to different types of clustering is discussed in [13]. See also Clustering high-dimensional data.

Correlation clustering (according to this definition) can be shown to be closely related to biclustering. As in biclustering, the goal is to identify groups of objects that share a correlation in some of their attributes, where the correlation is usually typical for the individual clusters.

References

  1. Becker, Hila, "A Survey of Correlation Clustering", 5 May 2005
  2. Garey, M.; Johnson, D. (2000). Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company.
  3. Deza, M.; Grötschel, M.; Laurent, M. (1992). "Clique-Web Facets for Multicut Polytopes". Mathematics of Operations Research. 17 (4): 981–1000. doi:10.1287/moor.17.4.981.
  4. Bachrach, Yoram; Kohli, Pushmeet; Kolmogorov, Vladimir; Zadimoghaddam, Morteza (2013). "Optimal coalition structure generation in cooperative graph games". Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 27. pp. 81–87.
  5. Grötschel, M.; Wakabayashi, Y. (1989). "A cutting plane algorithm for a clustering problem". Mathematical Programming. 45 (1–3): 59–96. doi:10.1007/BF01589097.
  6. Bansal, N.; Blum, A.; Chawla, S. (2004). "Correlation Clustering". Machine Learning. 56 (1–3): 89–113. doi:10.1023/B:MACH.0000033116.57574.95.
  7. Ailon, N.; Charikar, M.; Newman, A. (2005). "Aggregating inconsistent information". Proceedings of the thirty-seventh annual ACM symposium on Theory of computing – STOC '05. p. 684. doi:10.1145/1060590.1060692. ISBN 1581139608.
  8. Chawla, Shuchi; Makarychev, Konstantin; Schramm, Tselil; Yaroslavtsev, Grigory. "Near Optimal LP Rounding Algorithm for Correlation Clustering on Complete and Complete k-partite Graphs". Proceedings of the 46th Annual ACM on Symposium on Theory of Computing.
  9. Karpinski, M.; Schudy, W. (2009). "Linear time approximation schemes for the Gale-Berlekamp game and related minimization problems". Proceedings of the 41st annual ACM symposium on Symposium on theory of computing – STOC '09. p. 313. arXiv:0811.3244. doi:10.1145/1536414.1536458. ISBN 9781605585062.
  10. Bagon, S.; Galun, M. (2011). "Large Scale Correlation Clustering Optimization". arXiv:1112.2903v1.
  11. Böhm, C.; Kailing, K.; Kröger, P.; Zimek, A. (2004). "Computing Clusters of Correlation Connected objects". Proceedings of the 2004 ACM SIGMOD international conference on Management of data – SIGMOD '04. p. 455. CiteSeerX 10.1.1.5.1279. doi:10.1145/1007568.1007620. ISBN 978-1581138597. S2CID 6411037.
  12. Zimek, A. (2008). Correlation Clustering (PhD thesis). Ludwig-Maximilians-Universität München.
  13. Kriegel, H. P.; Kröger, P.; Zimek, A. (2009). "Clustering high-dimensional data". ACM Transactions on Knowledge Discovery from Data. 3: 1–58. doi:10.1145/1497577.1497578. S2CID 17363900.