Louvain method

The Louvain method for community detection is a greedy optimization method for extracting non-overlapping communities from large networks. It was created by Blondel et al. [1] of the University of Louvain (the source of the method's name).

Modularity optimization

The inspiration for this method of community detection is the optimization of modularity as the algorithm progresses. Modularity is a scalar value between −0.5 (non-modular clustering) and 1 (fully modular clustering) that measures the relative density of edges inside communities with respect to edges outside communities. Optimizing this value theoretically results in the best possible grouping of the nodes of a given network. Because exhaustively examining all possible assignments of nodes to groups is impractical, heuristic algorithms are used instead.

In the Louvain method of community detection, small communities are first found by optimizing modularity locally over all nodes; each small community is then grouped into a single node, and the first step is repeated. The method is similar to the earlier method of Clauset, Newman and Moore, [2] which merges the pair of communities whose amalgamation produces the largest increase in modularity. The Louvain algorithm has been shown to correctly identify the community structure when it exists, in particular in the stochastic block model. [3]
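
In practice, the method is available in standard network-analysis libraries. The following is a minimal usage sketch, assuming a recent version of NetworkX (2.8 or later), which ships a Louvain implementation as louvain_communities; the built-in karate club graph is used only as a convenient example.

# Minimal usage sketch of a library implementation of the Louvain method.
# Assumes NetworkX >= 2.8, which provides louvain_communities.
import networkx as nx

G = nx.karate_club_graph()  # small built-in social network, used here as an example

# Partition the graph into non-overlapping communities; the seed fixes the
# random order in which nodes are visited, making the result reproducible.
communities = nx.community.louvain_communities(G, seed=42)

for i, community in enumerate(communities):
    print(f"Community {i}: {sorted(community)}")

# The modularity of the resulting partition can be checked directly.
print("Modularity:", nx.community.modularity(G, communities))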

Algorithm Description

Modularity

The value to be optimized is modularity, defined as a value in the range $[-\tfrac{1}{2}, 1]$ that measures the density of links inside communities compared to links between communities. [1] For a weighted graph, modularity is defined as:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$

where:

- $A_{ij}$ is the weight of the edge between nodes $i$ and $j$;
- $k_i$ and $k_j$ are the sums of the weights of the edges attached to nodes $i$ and $j$, respectively;
- $m$ is the sum of all edge weights in the graph;
- $c_i$ and $c_j$ are the communities to which nodes $i$ and $j$ belong; and
- $\delta$ is the Kronecker delta function: $\delta(x, y) = 1$ if $x = y$ and $0$ otherwise.

Based on the above equation, the modularity of a community $c$ can be calculated as: [4]

$$Q_c = \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2,$$

where

- $\Sigma_{in}$ is the sum of the weights of the edges between nodes inside community $c$ (each such edge counted twice), and
- $\Sigma_{tot}$ is the sum of the weights of all edges incident to nodes in community $c$.

As nodes in different communities do not contribute to the modularity $Q$, it can be written as:

$$Q = \sum_{c} Q_c.$$
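
As a concrete illustration of the definitions above, the short Python sketch below computes $Q$ directly from the formula for a small weighted graph given as an edge list. The edge list, partition and helper name are hypothetical, chosen only to make the roles of $A_{ij}$, $k_i$ and $2m$ visible in code.

# Direct computation of modularity Q for a weighted, undirected graph.
# The edge list and the partition below are illustrative only.
from collections import defaultdict

edges = [            # (u, v, weight)
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),   # triangle 1
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),   # triangle 2
    ("c", "d", 1.0),                                      # bridge between them
]
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

def modularity(edges, community):
    # k[i] is the sum of weights of edges attached to node i,
    # two_m is the sum of all edge weights counted twice (2m).
    k = defaultdict(float)
    for u, v, w in edges:
        k[u] += w
        k[v] += w
    two_m = sum(k.values())

    # Q = (1 / 2m) * sum over same-community node pairs of (A_ij - k_i * k_j / 2m).
    # Summing over the edge list gives the A_ij part (counted in both directions);
    # the expected-weight part is summed over all same-community node pairs.
    q = 0.0
    for u, v, w in edges:
        if community[u] == community[v]:
            q += 2.0 * w                      # A_ij counted for (i, j) and (j, i)
    for i in community:
        for j in community:
            if community[i] == community[j]:
                q -= k[i] * k[j] / two_m
    return q / two_m

print(modularity(edges, community))   # ≈ 0.357 for this partition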

The Louvain Method Algorithm

The Louvain method works by repeating two phases. [1] In phase one, nodes are sorted into communities based on how the modularity of the graph changes when a node moves communities. In phase two, the graph is reinterpreted so that communities are seen as individual nodes. A detailed explanation is provided below.

Phase 1:

Figure 1: Each node in the graph is randomly assigned to a singleton community.
1.1: Each node in the network is assigned to its own community.

The Louvain method begins by considering each node in a graph to be its own community. This can be seen in Figure 1, where each dot (representing nodes) is a unique color (representing which community the node belongs to).

1.2: Nodes are grouped into communities

For each node v, we consider how moving v from its current community into the community of one of its neighbours would affect the modularity of the graph partition. In the pseudo-code below, this happens in the for-loop. We select the community C' with the greatest change in modularity ΔQ, and if that change is positive, we move v into C'; otherwise we leave the node where it is. This continues until the modularity stops improving.

Figure 2: Nodes are assigned to communities based on their modularities.
function moveNodes(Graph G, Partition P):
    do
        old_modularity <- current_modularity_of_partition
        for v in V(G), do
            # find the community that causes the largest increase in modularity when v is moved into it
            C' <- argmax(delta_Q)   # delta_Q is the change in modularity
            if delta_Q > 0, then
                move v into C'
            end if
        end for
        update current_modularity_of_partition
    while current_modularity_of_partition > old_modularity
    return P
end function

[5]

This process is applied repeatedly and sequentially to all nodes until no modularity increase can occur. Once this local maximum of modularity is hit, the first phase has ended. Figure 2 shows how the graph in Figure 1 might look after one iteration of phase 1.
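
The local moving phase can be sketched in Python as follows. This is an illustrative sketch rather than the authors' reference implementation: the graph is assumed to be a dictionary mapping each node to a dictionary of neighbour edge weights, and the gain of a move is evaluated with the standard incremental expression k_in − Σ_tot · k_i / (2m), which is proportional to ΔQ and follows from the community modularity formula above.

# Sketch of the local moving phase (phase 1) on a weighted, undirected graph.
# graph: dict mapping each node to a dict {neighbour: edge_weight}.
# Returns a dict mapping each node to its community label.
def move_nodes(graph):
    # Each node starts in its own singleton community (step 1.1).
    community = {v: v for v in graph}
    degree = {v: sum(graph[v].values()) for v in graph}   # k_i (self-loops included)
    two_m = sum(degree.values())                          # 2m
    sigma_tot = dict(degree)     # total degree of the nodes currently in each community

    improved = True
    while improved:              # keep sweeping until no single move helps
        improved = False
        for v in graph:
            current = community[v]
            # k_in: weight of the links from v to each neighbouring community
            # (self-loops are excluded; they follow v wherever it goes).
            links_to = {}
            for u, w in graph[v].items():
                if u != v:
                    links_to[community[u]] = links_to.get(community[u], 0.0) + w

            # Temporarily take v out of its community.
            sigma_tot[current] -= degree[v]

            # Modularity gain of placing v in community c, up to a 1/m factor:
            #   k_in(c) - sigma_tot(c) * k_i / (2m)
            def gain(c):
                return links_to.get(c, 0.0) - sigma_tot[c] * degree[v] / two_m

            best = max(links_to, key=gain, default=current)
            if gain(best) > gain(current):
                community[v] = best
                improved = True
            # Put v back into whichever community it now belongs to.
            sigma_tot[community[v]] += degree[v]
    return community

Evaluating moves with this incremental expression, rather than recomputing Q from scratch for every candidate move, is what keeps a full pass roughly proportional to the number of links in the graph.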

Phase 2:

2.1: Communities are reduced to a single node

For each community in the graph's partition, the individual nodes making up that community are merged, and the community itself becomes a single node. The weights of the edges connecting distinct communities are summed to give the weights of the new edges connecting the aggregate nodes.

This process is modelled in the pseudo-code below, where the aggregateGraph function returns a new graph whose vertices are the communities of the old graph's partition, and whose edges are derived from the old graph's edges. The function as written does not weight the new edges, but a simple modification would allow that information to be tracked.

Figure 3: Communities are reduced to a single node with weighted edges.
function aggregateGraph(Graph G, Partition P):
    V <- P
    E <- [(A, B) | (x, y) is in E(G), x is in A and A is in P, y is in B and B is in P]
    return Graph(V, E)
end function

[5]

Figure 3 shows what the graph from Figure 2 would look like after being aggregated. This graph is analogous to the graph in Figure 1 in the sense that each node is assigned to a single community. From here, the process can be repeated so that more nodes are moved into existing communities until an optimal level of modularity is reached.
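
A compact Python sketch of the aggregation step is given below. It follows the aggregateGraph pseudo-code but, as suggested above, also accumulates edge weights; the dictionary-of-dictionaries representation matches the move_nodes sketch and is an illustrative choice rather than part of the published method.

# Sketch of the aggregation phase (phase 2) on a weighted, undirected graph.
# graph: dict mapping each node to a dict {neighbour: edge_weight}.
# community: dict mapping each node to its community label.
# Returns a new graph whose nodes are the communities. Edges between two
# communities carry the summed weight of all original edges between them;
# a self-loop on each community accumulates its internal edge weight (each
# internal edge contributes from both endpoint entries, matching the
# convention in which internal edges are counted twice).
def aggregate_graph(graph, community):
    aggregated = {c: {} for c in set(community.values())}
    for u in graph:
        for v, w in graph[u].items():
            a, b = community[u], community[v]
            aggregated[a][b] = aggregated[a].get(b, 0.0) + w
    return aggregated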

The pseudo-code below shows how the previous two functions work together to complete the process.

function louvain(Graph G, Partition P):
    do
        P <- moveNodes(G, P)
        done <- length(P) == length(V(G))   # every community is a single node, despite running moveNodes
        if not done, then
            G <- aggregateGraph(G, P)
            P <- singletonPartition(G)
        end if
    while not done
    return P
end function

function singletonPartition(Graph G):
    return [{v} | v is in V(G)]   # each node is placed in its own community
end function

[5]
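
Under the same assumptions, the two sketches above can be combined into a driver that mirrors the louvain pseudo-code. The move_nodes and aggregate_graph functions from the earlier sketches are assumed to be in scope, and the extra membership dictionary keeps track of which final community each original node ends up in.

# Sketch of the full Louvain loop, combining the move_nodes and
# aggregate_graph sketches from the previous sections (assumed in scope).
# graph: dict mapping each node to a dict {neighbour: edge_weight}.
# Returns a dict mapping every original node to its final community label.
def louvain(graph):
    # Initially every original node is its own representative.
    membership = {v: v for v in graph}
    while True:
        community = move_nodes(graph)                 # phase 1: local moving
        # Map the original nodes through the communities found at this level.
        membership = {v: community[membership[v]] for v in membership}

        # Stop when phase 1 could not merge anything: every community
        # is still a single node of the current (aggregated) graph.
        if len(set(community.values())) == len(graph):
            return membership
        graph = aggregate_graph(graph, community)     # phase 2: aggregation

As in the pseudo-code, the loop stops as soon as phase 1 leaves every community as a single node of the current (aggregated) graph.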

Time Complexity

Generally, the Louvain method is assumed to have a time complexity of O(n log n). [6] [7] Vincent Blondel, co-author of the paper that originally published the Louvain method, appears to support this notion, [8] but other sources claim the time complexity is "essentially linear in the number of links in the graph," [9] meaning the time complexity would instead be O(m), where m is the number of edges in the graph. Unfortunately, no source has published a detailed analysis of the Louvain method's time complexity, so one is attempted here.

In the pseudo-code above, the louvain function controls the execution of the algorithm. Inside louvain, moveNodes is repeated until it is no longer possible to combine nodes into communities. How long this takes depends on two factors: how much the modularity of the graph can still improve and, in the worst case where the modularity improves on every iteration of louvain, how quickly aggregateGraph reduces the graph down to a single node.

If, in each iteration of louvain, moveNodes is only able to move one node into a community, then aggregateGraph will only be able to reduce the size of the graph by one. This would cause louvain to repeat n times. Since moveNodes iterates through all nodes in the graph, this would result in a time complexity of O(n^2), where n is the number of nodes.

It is unclear whether this situation can actually arise, so the above result should be considered a loose bound. Blondel et al. state in their original publication that most of the run time is spent in the early iterations of the algorithm because "the number of communities decreases drastically after just a few passes." [1] This can be understood by considering a scenario in which moveNodes is able to move each node so that every community contains two nodes. In this case, aggregateGraph would return a graph half the size of the original. If this continued, the Louvain method would have a runtime of O(n log n), although it is unclear whether this would be the worst case, the best case, the average case, or none of these. Additionally, there is no guarantee the size of the graph would be reduced by the same factor with each iteration, so no single logarithmic function can perfectly describe the time complexity.
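
The halving scenario can be made explicit with a loose upper bound. Assuming the graph shrinks by half at every level, there are at most $\log_2 n$ aggregation levels, and if every call to moveNodes is charged $O(n)$ node evaluations (bounding each level by the size of the original graph), the total work is at most

$$\sum_{\ell = 0}^{\log_2 n} c\,n \;=\; c\,n\,(\log_2 n + 1) \;=\; O(n \log n).$$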

Previous uses

The Louvain method and its variants have been used in applications such as partitioning online social networks, [10] tracking the evolution of communities in dynamic social networks, [11] and predicting species emergence in simulated pre-biotic networks. [12]

Disadvantages

Louvain produces only non-overlapping communities, which means that each node can belong to at most one community. This is highly unrealistic in many real-world applications. For example, in social networks, most people belong to multiple communities: their family, their friends, their co-workers, old school buddies, etc. In biological networks, most genes or proteins belong to more than one pathway or complex. Furthermore, Louvain has been shown to sometimes produce arbitrarily badly connected communities, and has been effectively superseded (at least in the non-overlapping case) by the Leiden algorithm.

A graph illustrating how communities can become disconnected when using the Louvain algorithm.

These badly connected communities arise when a node that had been acting as a "bridge" between two groups of nodes in its community is moved to a new community, leaving the old one disconnected. The remaining nodes in the old community may also be relocated, but if their connection to the community is strong enough despite the removal of the "bridge" node, they will instead remain in place. For an example of this, see the image to the right; note how the removal of the bridge node, node 0, caused the red community to be split into two disjoint subgroups. While this is the worst-case scenario, there are other, more subtle problems with the Louvain algorithm that can also lead to arbitrarily badly connected communities, such as the formation of communities using nodes that are only weakly connected.
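
One practical way to detect this failure mode is to check whether each returned community induces a connected subgraph. The sketch below assumes NetworkX is available and that communities is a list of node sets, such as the list returned by louvain_communities in the earlier usage example.

# Check a partition for internally disconnected communities,
# the failure mode illustrated in the figure above.
import networkx as nx

def disconnected_communities(G, communities):
    """Return the communities whose induced subgraph is not connected."""
    bad = []
    for community in communities:
        subgraph = G.subgraph(community)
        if len(subgraph) > 0 and not nx.is_connected(subgraph):
            bad.append(set(community))
    return bad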

An image depicting how the resolution limit of modularity can cause subcommunities to become hidden.

Another common issue with the Louvain algorithm is the resolution limit of modularity: multiple small communities are grouped together into a larger community. This causes the smaller communities to be hidden; for an example of this, see the visual depiction of the resolution limit to the right. Note how, when the green community is absorbed into the blue community to increase the graph's modularity, the smaller group of nodes that it represented is lost. There is no longer a way to differentiate those nodes from the nodes that were already in the blue community. Conversely, the nodes that were already in the blue community no longer appear distinct from those that were in the green community; in other words, whatever difference initially caused them to be placed in separate communities has been obscured.

Both the resolution limit of modularity and the arbitrarily badly connected community problem are further exacerbated by each iteration of the algorithm. Ultimately, the only thing the Louvain algorithm guarantees is that the resulting communities cannot be merged further; in other words, they are well separated. To avoid the problems that arise from arbitrarily badly connected communities and the resolution limit of modularity, it is recommended to use the Leiden algorithm instead, as its refinement phase and other adjustments correct these issues. [5]

Comparison to other methods of non-overlapping community detection

When comparing modularity optimization methods, the two measures of importance are the speed and the resulting modularity value. Higher speed indicates a more efficient method, and a higher modularity value indicates better-defined communities. The compared methods are the algorithm of Clauset, Newman, and Moore, [2] that of Pons and Latapy, [13] and that of Wakita and Tsurumi. [14]

Modularity Optimization Comparison [15] (each cell gives modularity / run time)

Method                      | Karate  | Arxiv      | Internet   | Web nd.edu  | Phone       | Web uk-2005 | Web WebBase 2001
Nodes/links                 | 34/77   | 9k/24k     | 70k/351k   | 325k/1M     | 2.6M/6.3M   | 39M/783M    | 118M/1B
Clauset, Newman, and Moore  | .38/0s  | .772/3.6s  | .692/799s  | .927/5034s  | -/-         | -/-         | -/-
Pons and Latapy             | .42/0s  | .757/3.3s  | .729/575s  | .895/6666s  | -/-         | -/-         | -/-
Wakita and Tsurumi          | .42/0s  | .761/0.7s  | .667/62s   | .898/248s   | .56/464s    | -/-         | -/-
Louvain method              | .42/0s  | .813/0s    | .781/1s    | .935/3s     | .769/134s   | .979/738s   | .984/152mn

An entry of -/- indicates a method that took over 24 hours to run. This table (from [1] [16]) shows that the Louvain method outperforms many similar modularity optimization methods in both modularity and running time.

See also

- Leiden algorithm – a modification of the Louvain method, developed by Traag et al. at Leiden University, that addresses the poorly connected communities and resolution limit issues discussed above
- Modularity (networks) – the measure of network community structure that the Louvain method optimizes
- Community structure – the grouping of a network's nodes into sets that are densely connected internally
- Graph partition – the reduction of a graph to a smaller graph by partitioning its set of nodes into mutually exclusive groups

References

  1. Blondel, Vincent D; Guillaume, Jean-Loup; Lambiotte, Renaud; Lefebvre, Etienne (9 October 2008). "Fast unfolding of communities in large networks". Journal of Statistical Mechanics: Theory and Experiment. 2008 (10): 10008. arXiv: 0803.0476 . Bibcode:2008JSMTE..10..008B. doi:10.1088/1742-5468/2008/10/P10008. S2CID   334423.
  2. Clauset, Aaron; Newman, M. E. J.; Moore, Cristopher (2004-12-06). "Finding community structure in very large networks". Physical Review E. 70 (6): 066111. arXiv: cond-mat/0408187 . Bibcode:2004PhRvE..70f6111C. doi:10.1103/PhysRevE.70.066111. ISSN   1539-3755. PMID   15697438. S2CID   8977721.
  3. Cohen-Addad, Vincent; Kosowski, Adrian; Mallmann-Trenn, Frederik; Saulpic, David (2020). "On the Power of Louvain in the Stochastic Block Model". Advances in Neural Information Processing Systems (Neurips 2020). Curran Associates, Inc. pp. 4055–4066.
  4. https://eecs.wsu.edu/~ananth/papers/Ghosh_IPDPS18.pdf
  5. Traag, V. A.; Waltman, L.; van Eck, N. J. (2019-03-26). "From Louvain to Leiden: guaranteeing well-connected communities". Scientific Reports. 9 (1): 5233. arXiv: 1810.08473 . Bibcode:2019NatSR...9.5233T. doi:10.1038/s41598-019-41695-z. ISSN   2045-2322. PMC   6435756 . PMID   30914743.
  6. Kumar, Vikas; Sisodia, Anubhav; Maini, Umesh; Dhull, Pankaj; Anand, Abhineet (November 2016). "Comparing Algorithms of Community Structure in Networks". Indian Journal of Science and Technology. 9 (44): 1–5.
  7. Watson, Cullen (April 25, 2022). "Girvan-Newman and Louvain Algorithms for Community Detection". Medium. Retrieved November 21, 2024.
  8. "Louvain method for community detection". perso.uclouvain.be. Retrieved 2024-11-21.
  9. "Louvain - Analytics & Algorithms - Ultipa Graph". www.ultipa.com. Retrieved 2024-11-21.
  10. Pujol, Josep M.; Erramilli, Vijay; Rodriguez, Pablo (2009). "Divide and Conquer: Partitioning Online Social Networks". arXiv: 0905.4918v1 [cs.NI].
  11. Greene, Derek; Doyle, Dónal; Cunningham, Pádraig (May 2011). Tracking the Evolution of Communities in Dynamic Social Networks (PDF) (Technical report). University College Dublin. UCD-CSI-2011-06. Archived from the original (PDF) on 2013-05-12. Retrieved 2014-11-20.
  12. Markovitch, Omer; Krasnogor, Natalio (2018). "Predicting species emergence in simulated complex pre-biotic networks". PLOS ONE. 13 (2): e0192871. Bibcode:2018PLoSO..1392871M. doi: 10.1371/journal.pone.0192871 . PMC   5813963 . PMID   29447212.
  13. Pons, Pascal; Latapy, Matthieu (2006). "Computing Communities in Large Networks Using Random Walks" (PDF). Journal of Graph Algorithms and Applications. 10 (2): 191–218. arXiv: cond-mat/0412368 . doi: 10.7155/jgaa.00124 . S2CID   121714719.
  14. Wakita, Ken; Tsurumi, Toshiyuki (2007). "Finding Community Structure in Mega-scale Social Networks". arXiv: cs/0702048 .
  15. Blondel, Vincent D.; Guillaume, Jean-Loup; Lambiotte, Renaud; Lefebvre, Etienne (2008). "Fast unfolding of communities in large networks". Journal of Statistical Mechanics: Theory and Experiment. 2008 (10): 10008. arXiv: 0803.0476 . Bibcode:2008JSMTE..10..008B. doi:10.1088/1742-5468/2008/10/P10008. S2CID   334423.
  16. Aynaud, Thomas; Blondel, Vincent D.; Guillaume, Jean-Loup; Lambiotte, Renaud (2013). "Multilevel Local Optimization of Modularity". In Bichot, Charles-Edmond; Siarry, Patrick (eds.). Graph Partitioning (1 ed.). Wiley (published 13 February 2013). pp. 315–345. doi:10.1002/9781118601181.ch13. ISBN   978-1-84821-233-6.