Chinese whispers (clustering method)

Last updated

Chinese whispers is a clustering method used in network science named after the famous whispering game. [1] Clustering methods are basically used to identify communities of nodes or links in a given network. This algorithm was designed by Chris Biemann and Sven Teresniak in 2005. [1] The name comes from the fact that the process can be modeled as a separation of communities where the nodes send the same type of information to each other. [1]

Contents

Chinese whispers is a hard partitioning, randomized, flat clustering (no hierarchical relations between clusters) method. [1] The random property means that running the process on the same network several times can lead to different results, while because of hard partitioning one node can belong to only one cluster at a given moment. The original algorithm is applicable to undirected, weighted and unweighted graphs. Chinese whispers runs in linear time, which means that it is extremely fast even if there are very many nodes and links in the network. [1]

Algorithm

An example of how Chinese whispers works in action. The different colors represent different classes. (top) Initial cluster assignment. (middle) The graph after the first step 2, in which (in order) the pink, green, yellow, red, black, white, and blue nodes were selected. (bottom) The graph after a second step 2, in which the green and white nodes were selected, with the order of the remaining nodes after that not important. This clustering is stable because each node is colored the same as the plurality of its neighbors. Chinese Whispers example cluster.png
An example of how Chinese whispers works in action. The different colors represent different classes. (top) Initial cluster assignment. (middle) The graph after the first step 2, in which (in order) the pink, green, yellow, red, black, white, and blue nodes were selected. (bottom) The graph after a second step 2, in which the green and white nodes were selected, with the order of the remaining nodes after that not important. This clustering is stable because each node is colored the same as the plurality of its neighbors.

The algorithm works in the following way in an undirected unweighted graph: [1]

  1. All nodes are assigned to a distinct class, so that the number of initial classes equals the number of nodes.
  2. All of the network nodes are selected one by one in a random order. For each selected node, its cluster label is changed to the cluster with which it has the most connections. If there is a tie, one is picked randomly from the tied clusters.
  3. Step two repeats itself until a predetermined number of iterations or until the process converges. In the end the emerging classes represent the clusters of the network.

The predetermined threshold for the number of the iterations is needed because it is possible that process does not converge. On the other hand in a network with approximately 10000 nodes the clusters does not change significantly after 40-50 iterations even if there is no convergence. [1]

Strengths and weaknesses

The main strength of Chinese whispers lies in its time linear property. Because the processing time increases linearly with the number of nodes, the algorithm is capable of identifying communities in a network very fast. For this reason Chinese whispers is a good tool to analyze community structures in graph with a very high number of nodes. The effectiveness of the method increases further if the network has the small world property. [1]

On the other hand because the algorithm is not deterministic in the case of small node number the resulting clusters often significantly differ from each other. The reason for this is that in the case of a small network it matters more from which node the iteration process starts while in large networks the relevance of starting points disappears. [1] For this reason for small graphs other clustering methods are recommended.

Applications

Chinese whispers is used in many subfield of network science. Most frequently it is mentioned in the context of natural language processing problems. [2] [3] On the other hand the algorithm is applicable to any kind of community identification problem which is related to a network framework. Chinese whispers is available for personal use as an extension package for Gephi [4] which is an open source program designed for network analysis.

Related Research Articles

<span class="mw-page-title-main">Minimum spanning tree</span> Least-weight tree connecting graph vertices

A minimum spanning tree (MST) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. More generally, any edge-weighted undirected graph has a minimum spanning forest, which is a union of the minimum spanning trees for its connected components.

<span class="mw-page-title-main">Strongly connected component</span> Partition of a graph whose components are reachable from all vertices

In the mathematical theory of directed graphs, a graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of a directed graph form a partition into subgraphs that are themselves strongly connected. It is possible to test the strong connectivity of a graph, or to find its strongly connected components, in linear time (that is, Θ(V + E )).

<span class="mw-page-title-main">Force-directed graph drawing</span> Physical simulation to visualize graphs

Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically-pleasing way. Their purpose is to position the nodes of a graph in two-dimensional or three-dimensional space so that all the edges are of more or less equal length and there are as few crossing edges as possible, by assigning forces among the set of edges and the set of nodes, based on their relative positions, and then using these forces either to simulate the motion of the edges and nodes or to minimize their energy.

Belief propagation, also known as sum–product message passing, is a message-passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. Belief propagation is commonly used in artificial intelligence and information theory, and has demonstrated empirical success in numerous applications, including low-density parity-check codes, turbo codes, free energy approximation, and satisfiability.

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances, but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.

In graph theory, a cut is a partition of the vertices of a graph into two disjoint subsets. Any cut determines a cut-set, the set of edges that have one endpoint in each subset of the partition. These edges are said to cross the cut. In a connected graph, each cut-set determines a unique cut, and in some cases cuts are identified with their cut-sets rather than with their vertex partitions.

<span class="mw-page-title-main">Pathfinding</span> Plotting by a computer application

Pathfinding or pathing is the search, by a computer application, for the shortest route between two points. It is a more practical variant on solving mazes. This field of research is based heavily on Dijkstra's algorithm for finding the shortest path on a weighted graph.

<span class="mw-page-title-main">Minimum cut</span> Partition of a graph by removing fewest possible edges

In graph theory, a minimum cut or min-cut of a graph is a cut that is minimal in some metric.

<span class="mw-page-title-main">Community structure</span> Concept in graph theory

In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But overlapping communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community(ies), and less likely to be connected if they do not share communities. A related but different problem is community search, where the goal is to find a community that a certain vertex belongs to.

In mathematics, a graph partition is the reduction of a graph to a smaller graph by partitioning its set of nodes into mutually exclusive groups. Edges of the original graph that cross between the groups will produce edges in the partitioned graph. If the number of resulting edges is small compared to the original graph, then the partitioned graph may be better suited for analysis and problem-solving than the original. Finding a partition that simplifies graph analysis is a hard problem, but one that has applications to scientific computing, VLSI circuit design, and task scheduling in multiprocessor computers, among others. Recently, the graph partition problem has gained importance due to its application for clustering and detection of cliques in social, pathological and biological networks. For a survey on recent trends in computational methods and applications see Buluc et al. (2013). Two common examples of graph partitioning are minimum cut and maximum cut problems.

<span class="mw-page-title-main">Pseudoforest</span> Graph with at most one cycle per component

In graph theory, a pseudoforest is an undirected graph in which every connected component has at most one cycle. That is, it is a system of vertices and edges connecting pairs of vertices, such that no two cycles of consecutive edges share any vertex with each other, nor can any two cycles be connected to each other by a path of consecutive edges. A pseudotree is a connected pseudoforest.

<span class="mw-page-title-main">Spectral clustering</span> Clustering methods

In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.

<span class="mw-page-title-main">Modularity (networks)</span> Measure of network community structure

Modularity is a measure of the structure of networks or graphs which measures the strength of division of a network into modules. Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is often used in optimization methods for detecting community structure in networks. Biological networks, including animal brains, exhibit a high degree of modularity. However, modularity maximization is not statistically consistent, and finds communities in its own null model, i.e. fully random graphs, and therefore it cannot be used to find statistically significant community structures in empirical networks. Furthermore, it has been shown that modularity suffers a resolution limit and, therefore, it is unable to detect small communities.

In computational linguistics, word-sense induction (WSI) or discrimination is an open problem of natural language processing, which concerns the automatic identification of the senses of a word. Given that the output of word-sense induction is a set of senses for the target word, this task is strictly related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to solve the ambiguity of words in context.

The HCS clustering algorithm is an algorithm based on graph connectivity for cluster analysis. It works by representing the similarity data in a similarity graph, and then finding all the highly connected subgraphs. It does not make any prior assumptions on the number of the clusters. This algorithm was published by Erez Hartuv and Ron Shamir in 2000.

<span class="mw-page-title-main">Louvain method</span> Clustering and community detection algorithm

The Louvain method for community detection is a greedy optimization method intended to extract non-overlapping communities from large networks created by Blondel et al. from the University of Louvain.

<span class="mw-page-title-main">Lancichinetti–Fortunato–Radicchi benchmark</span> Algorithm

Lancichinetti–Fortunato–Radicchibenchmark is an algorithm that generates benchmark networks. They have a priori known communities and are used to compare different community detection methods. The advantage of the benchmark over other methods is that it accounts for the heterogeneity in the distributions of node degrees and of community sizes.

<span class="mw-page-title-main">Leiden algorithm</span> Clustering and community detection algorithm

The Leiden algorithm is a community detection algorithm developed by Traag et al at Leiden University. It was developed as a modification of the Louvain method. Like the Louvain method, the Leiden algorithm attempts to optimize modularity in extracting communities from networks; however, it addresses key issues present in the Louvain method, namely poorly connected communities and the resolution limit of modularity.

References

  1. 1 2 3 4 5 6 7 8 9 Chris Biemann,"Chinese Whispers- an Efficient Graph Clustering Algorithm and its Applications to Natural Language Processing Problems", 2006
  2. Antonio Di Marco - Roberto Navigili,"Clustering and Diversifying Web Search Results with Graph Based Word Sense Induction", 2013
  3. Ioannis Korkontzelos - Suresh Manandhar,"Detecting Compositionality in Multi-Word Expressions", 2009
  4. ""Gephi Marketplace"". Archived from the original on 2015-09-23. Retrieved 2015-06-02.