Leiden algorithm

The Leiden algorithm is a community detection algorithm developed by Traag et al. [1] at Leiden University. It was developed as a modification of the Louvain method. Like the Louvain method, the Leiden algorithm attempts to optimize modularity when extracting communities from networks; however, it addresses key issues present in the Louvain method, namely poorly connected communities and the resolution limit of modularity.

Improvement over Louvain method

Broadly, the Leiden algorithm uses the same two primary phases as the Louvain algorithm: a local node-moving step (though the method by which nodes are considered in Leiden is more efficient [1]) and a graph aggregation step. However, to address the issues of poorly connected communities and the merging of smaller communities into larger communities (the resolution limit of modularity), the Leiden algorithm adds an intermediate refinement phase in which communities may be split to guarantee that all communities are well connected.

Consider, for example, the following graph:

[Figure: The initial graph.]

Three communities are present in this graph (each color represents a community). Additionally, the center "bridge" node (represented with an extra circle) is a member of the community represented by blue nodes. Now consider the result of a node-moving step which merges the communities denoted by red and green nodes into a single community (as the two communities are highly connected):

[Figure: The graph after the node-moving step.]

Notably, the center "bridge" node is now a member of the larger red community after node moving occurs (due to the greedy nature of the local node moving algorithm). In the Louvain method, such a merging would be followed immediately by the graph aggregation phase. However, this causes a disconnection between two different sections of the community represented by blue nodes. In the Leiden algorithm, the graph is instead refined:

[Figure: The graph after the refinement phase.]

The Leiden algorithm's refinement step keeps the center "bridge" node in the blue community, so that the blue community remains intact and connected, despite the potential improvement in modularity from moving the bridge node to the red community.

Graph components

Before defining the Leiden algorithm, it will be helpful to define some of the components of a graph.

Vertices and edges

A graph is composed of vertices (nodes) and edges. Each edge connects two vertices, and each vertex may be connected to zero or more edges. Edges are typically drawn as lines, and vertices as circles or points. In set notation, let V be the set of vertices and E be the set of edges:

$$G = (V, E)$$

where $e_{uv} \in E$ is the directed edge from vertex $u$ to vertex $v$. We can also write this as an ordered pair:

$$e_{uv} = (u, v)$$


Community

A community $C_i$ is a unique set of nodes:

$$C_i \subseteq V, \qquad C_i \cap C_j = \emptyset \quad \text{for all } i \neq j$$

and the union of all communities must be the total set of vertices:

$$V = C_1 \cup C_2 \cup \cdots \cup C_n$$

Partition

A partition $\mathcal{P}$ is the set of all communities:

$$\mathcal{P} = \{C_1, C_2, \ldots, C_n\}$$
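
To make this notation concrete, here is a small illustrative sketch in Python (the specific vertices, edges, and community labels are invented purely for the example):

# A toy undirected graph with six vertices and seven edges.
V = {0, 1, 2, 3, 4, 5}
E = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)}

# Two communities and the partition they form.
C1, C2 = {0, 1, 2}, {3, 4, 5}
P = [C1, C2]

assert C1 & C2 == set()   # communities are pairwise disjoint
assert C1 | C2 == V       # and together they cover every vertex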


Partition Quality

How communities are partitioned is an integral part of the Leiden algorithm, and how partitions are chosen depends on how their quality is measured. Additionally, many of these quality metrics contain parameters of their own that can change the communities that are found.

Modularity

Modularity is a widely used quality metric for assessing how well a set of communities partitions a graph. For an adjacency matrix $A$, it is defined as: [2]

$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$

where:
$A_{ij}$ is the weight of the edge between nodes $i$ and $j$ (an entry of the adjacency matrix),
$k_i$ and $k_j$ are the degrees of nodes $i$ and $j$,
$m$ is the total number of edges in the graph,
$c_i$ and $c_j$ are the communities to which nodes $i$ and $j$ belong, and
$\delta(c_i, c_j)$ equals 1 if $c_i = c_j$ and 0 otherwise.
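
As an illustration of the definition above, the following minimal Python sketch (assuming an undirected graph given as a NumPy adjacency matrix; the helper function and the toy two-triangle graph are invented for the example, not taken from any library) computes modularity directly from the formula:

import numpy as np

def modularity(A, communities):
    """Newman modularity Q of an undirected graph.
    A is a symmetric adjacency matrix; communities[i] is the label of node i."""
    k = A.sum(axis=1)       # node degrees
    two_m = A.sum()         # 2m: twice the number of edges
    Q = 0.0
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                Q += A[i, j] - k[i] * k[j] / two_m
    return Q / two_m

# Two triangles joined by a single edge, split into two communities.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
print(modularity(A, [0, 0, 0, 1, 1, 1]))  # ≈ 0.357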

Reichardt Bornholdt Potts Model (RB)

One of the most widely used quality metrics for the Leiden algorithm is the Reichardt Bornholdt Potts Model (RB). [3] This model is used by default in most mainstream Leiden algorithm libraries under the name RBConfigurationVertexPartition. [4] [5] It introduces a resolution parameter $\gamma$ and is very similar to the equation for modularity. For an adjacency matrix $A$, its quality function is defined as: [4]

$$Q = \sum_{ij} \left( A_{ij} - \gamma \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$

where:
$A_{ij}$, $k_i$, $k_j$, $m$, $c_i$, $c_j$, and $\delta$ are as in the modularity equation above, and
$\gamma$ is the resolution parameter; larger values favor more and smaller communities, while smaller values favor fewer and larger communities.
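
Because the RBConfigurationVertexPartition name above comes from the leidenalg Python library, a brief usage sketch may be helpful; the choice of Zachary's karate club as the input graph and the resolution value of 1.0 are illustrative assumptions only:

import igraph as ig
import leidenalg as la

g = ig.Graph.Famous("Zachary")   # example graph: Zachary's karate club

# RBConfigurationVertexPartition is leidenalg's default quality function;
# resolution_parameter corresponds to the γ in the RB quality function above.
partition = la.find_partition(
    g,
    la.RBConfigurationVertexPartition,
    resolution_parameter=1.0,
)
print(len(partition), "communities")
print(partition.membership)      # community label of each node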

Constant Potts Model (CPM)

Another metric similar to RB is the Constant Potts Model (CPM), which also relies on a resolution parameter $\gamma$. [6] Its quality function is defined as:

$$Q = \sum_{c} \left[ e_c - \gamma \binom{n_c}{2} \right]$$

where $e_c$ is the number of edges inside community $c$, $n_c$ is the number of nodes in community $c$, and $\gamma$ is the resolution parameter.
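
As a small worked example of the CPM quality function (the helper name and the toy numbers below are made up for illustration):

def cpm_quality(internal_edges, community_sizes, gamma):
    """Constant Potts Model: sum over communities of e_c − γ · n_c(n_c − 1)/2."""
    return sum(
        e_c - gamma * n_c * (n_c - 1) / 2
        for e_c, n_c in zip(internal_edges, community_sizes)
    )

# Two triangles treated as separate communities: 3 internal edges, 3 nodes each.
print(cpm_quality([3, 3], [3, 3], gamma=0.5))   # 6 − 0.5 · 3 · 2 = 3.0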

Understanding Potts model resolution parameters and the resolution limit

[Figure: Two community interpretations of the same graph. The first shows two communities (one blue, one red) that maximize modularity; the second shows three communities (one blue, one red, one green) that more accurately represent the substructures within the graph.]

Potts models such as RB and CPM typically include a resolution parameter in their calculation. [3] [6] They were introduced in response to the resolution limit problem of modularity-maximization-based community detection: for some graphs, maximizing modularity causes substructures of the graph to merge into a single community, so that the smaller structures are lost. [7] The resolution parameter lets the user tune the granularity at which communities are detected, so that small substructures can be recovered when needed.

The figure above illustrates why resolution can be a helpful parameter when using modularity-based quality metrics. In the first interpretation, modularity captures only the large-scale structure of the graph; in the second, a more granular quality metric can detect all of the substructures.
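
To see the effect of the resolution parameter in practice, the following sketch (again assuming the leidenalg and python-igraph packages, with the karate-club graph as an arbitrary example) sweeps γ for the Constant Potts Model and reports how many communities are found:

import igraph as ig
import leidenalg as la

g = ig.Graph.Famous("Zachary")

# Lower resolutions merge nodes into a few large communities; higher
# resolutions split the graph into many small, dense communities.
for gamma in (0.01, 0.05, 0.1, 0.5, 1.0):
    partition = la.find_partition(
        g, la.CPMVertexPartition, resolution_parameter=gamma
    )
    print(f"gamma={gamma}: {len(partition)} communities")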

Algorithm

[Figure: The steps of the Leiden algorithm, panels (a) through (f).]


The Leiden algorithm starts with a graph of disorganized nodes (a) and sorts it by partitioning the nodes to maximize modularity (the difference in quality between the generated partition and a hypothetical randomized partition of communities). The method it uses is similar to the Louvain algorithm, except that after moving each node it also considers that node's neighbors that are not already in the community it was placed in. This process results in our first partition (b), also referred to as P. The algorithm then refines this partition by first placing each node into its own individual community and then moving nodes between communities, within the communities of P, to maximize modularity. It does this iteratively until each node has been visited and each community has been refined; this creates partition (c), the refined partition P_refined. Then an aggregate network (d) is created by turning each community into a node. P_refined is used as the basis for the aggregate network, while P is used to create its initial partition. Because we use the original partition P in this step, we must retain it so that it can be used in future iterations. These steps together form the first iteration of the algorithm.

In subsequent iterations, the nodes of the aggregate network (each of which represents a community) are once again placed into their own individual communities and then sorted according to modularity to form a new partition P, shown as (e) in the above figure. In the case depicted, the nodes were already sorted optimally, so no change takes place, resulting in partition (f). The nodes of partition (f) would then be aggregated again using the same method as before, with the original partition still being retained. This portion of the algorithm repeats until every community in the aggregate network consists of a single node, at which point no further improvements can be made.

The Leiden algorithm consists of three main steps: local moving of nodes, refinement of the partition, and aggregation of the network based on the refined partition. All of the functions in the following steps are called from the main function Leiden_community_detection, shown below. The fast local moving procedure is borrowed by the authors of Leiden from "A Simple Acceleration Method for the Louvain Algorithm". [8]

function Leiden_community_detection(Graph G, Partition P)
    do
        P = fast_louvain_move_nodes(G, P)
            /* Move nodes between communities (more details in the function below). */
        done = (|P| == |V(G)|)
            /* If the number of communities in P equals the number of nodes in G, set the done flag to True to end the do-while loop, since every node of the (aggregated) graph is then in its own community. */
        if not done
            P_refined = get_p_refined(G, P)
                /* This refinement of the partition is a crucial part of what separates Leiden from Louvain: it ensures that only nodes that are well connected within their community are considered for moving out of it (more detail in refine_partition_subset below). */
            G = aggregate_graph(G, P_refined)
                /* Aggregate the communities of the refined partition into single nodes for the next iteration (details in the function below). */
            P = {{v | v ⊆ C, v ∈ V(G)} | C ∈ P}
                /* Each node v of the new aggregate graph is itself a community of P_refined; this line assigns it to the community C of P that contains it, so the unrefined partition P is carried over to the aggregate graph. */
        end if
    while not done
    return flattened(P)
        /* Return the final partition, in which every node of the original graph belongs to exactly one community. */
end function


[Figure: Step 1 of the Leiden algorithm (local moving of nodes).]

Step 1: Local Moving of Nodes

First, we move the nodes of the graph into neighboring communities to maximize modularity (the difference in quality between the generated partition and a hypothetical randomized partition of communities). In the above image, our initial collection of unsorted nodes is represented by the graph on the left, with each node's unique color indicating that it does not yet belong to a community. The graph on the right shows the result of this step, the partition P; note how the nodes have all been moved into one of three communities, as represented by the nodes' colors (red, blue, and green).

function fast_louvain_move_nodes(Graph G, Partition P)
    Q = queue(V(G))
        /* Place all of the nodes of G into a queue to ensure that they are all visited. */
    while Q not empty
        v = Q.pop_front()
            /* Take the next node to visit from the front of the queue. */
        C_prime = arg max_{C ∈ P ∪ {∅}} ΔH_P(v → C)
            /* Set C_prime to the community in P, or the empty set (a new community), that gives the largest increase in the quality function H when node v is moved into it. */
        if ΔH_P(v → C_prime) > 0
            /* Only perform moves that result in a positive change in the quality function. */
            v → C_prime
                /* Move node v to community C_prime. */
            N = {u | (u, v) ∈ E(G), u ∉ C_prime}
                /* Collect the direct neighbors of v that are not in community C_prime. */
            Q.add(N − Q)
                /* Add those neighbors to the queue, unless they are already in Q. */
        end if
    end while
    return P
        /* Return the updated partition. */
end function

[Figure: Step 2 of the Leiden algorithm (refinement of the partition).]

Step 2: Refinement of the Partition

Next, each node in the network is assigned to its own individual community and then moved from one community to another to maximize modularity. This occurs iteratively until each node has been visited, and is very similar to the creation of P in the previous step, except that each community is refined after a node is moved. The result is the refined partition P_refined, which serves as the basis for the aggregate network, as shown on the right. Note that we are also keeping track of the communities from P, which are represented by the colored backgrounds behind the nodes.

function get_p_refined(Graph G, Partition P)
    P_refined = get_singleton_partition(G)
        /* Assign each node in G to a singleton community (a community by itself). */
    for C ∈ P
        P_refined = refine_partition_subset(G, P_refined, C)
            /* Refine the partition within each community C of P. */
    end for
    return P_refined
        /* Return the newly refined partition. */
end function

function refine_partition_subset(Graph G, Partition P, Subset S)
    R = {v | v ∈ S, E(v, S − v) ≥ γ * degree(v) * (degree(S) − degree(v))}
        /* R is the set of nodes of S that are well connected within S: E(v, S − v) counts the edges between v and the rest of S, degree(v) is the degree of node v, and degree(S) is the total degree of the nodes in S. Requiring this threshold keeps the communities internally connected. */
    for v ∈ R
        if v in singleton_community
            /* Only consider nodes that are still in a community by themselves. */
            T = {C | C ∈ P, C ⊆ S, E(C, S − C) ≥ γ * degree(C) * (degree(S) − degree(C))}
                /* T is the set of communities inside S that are themselves well connected to the rest of S, using the same kind of threshold as above. */
            Pr(C_prime = C) ∝ exp((1/θ) * ΔH_P(v → C)) if ΔH_P(v → C) ≥ 0, and 0 otherwise, for C ∈ T
                /* Choose the target community C_prime at random among the communities in T, with probability proportional to the exponential of the (non-negative) gain in the quality function; moves that would decrease quality get probability 0. */
            v → C_prime
                /* Move node v into the randomly chosen community C_prime. */
        end if
    end for
    return P
        /* Return the refined partition. */
end function

[Figure: Step 3 of the Leiden algorithm (aggregation of the network).]

Step 3: Aggregation of the Network

We then convert each community in P_refined into a single node. Note how, as depicted in the above image, the communities of P are used to group these aggregate nodes after their creation.

function aggregate_graph(Graph G, Partition P)
    V = P
        /* Each community of P becomes a single node of the aggregate graph. */
    E = {(C, D) | (u, v) ∈ E(G), u ∈ C ∈ P, v ∈ D ∈ P}
        /* If u belongs to community C, v belongs to community D, and u and v share an edge in E(G), then an edge (C, D) is added to the new graph. */
    return Graph(V, E)
        /* Return the new graph's nodes and edges. */
end function

function get_singleton_partition(Graph G)
    return {{v} | v ∈ V(G)}
        /* Assign each node in G to a singleton community (a community by itself). */
end function
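
The aggregation step can also be sketched in plain Python; this is an illustrative re-implementation for an unweighted, undirected edge list (the function name and toy data are invented for the example), not the code of any particular library:

from collections import defaultdict

def aggregate_graph(edges, membership):
    """Collapse each community into one node; parallel edges between the same
    pair of communities (including self-loops) are summed into a weight."""
    weights = defaultdict(int)
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        weights[(min(cu, cv), max(cu, cv))] += 1
    nodes = set(membership.values())
    return nodes, dict(weights)

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
nodes, weights = aggregate_graph(edges, membership)
print(weights)   # {('a', 'a'): 3, ('a', 'b'): 1, ('b', 'b'): 3}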

We repeat these steps until each community contains only one node, with each of these nodes representing an aggregate of nodes from the original network that are strongly connected with each other.
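
To illustrate the well-connectedness guarantee that motivates the refinement step, the short check below (a sketch assuming the leidenalg and python-igraph packages, with the karate-club graph as an arbitrary example) verifies that every community returned by Leiden induces a connected subgraph:

import igraph as ig
import leidenalg as la

g = ig.Graph.Famous("Zachary")
partition = la.find_partition(g, la.ModularityVertexPartition)

# Leiden guarantees connected communities, whereas Louvain can return
# communities whose induced subgraphs are internally disconnected.
for community in partition:
    assert g.induced_subgraph(community).is_connected()
print("all", len(partition), "communities induce connected subgraphs")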

Limitations

The Leiden algorithm creates high-quality partitions that place nodes into distinct communities. However, it produces a hard partition, meaning that each node can belong to only one community. In many networks, such as social networks, nodes may belong to multiple communities, and in those cases methods that allow overlapping communities may be preferred.

Leiden is more efficient than Louvain, but processing massive graphs can still take a long time. Recent work has improved its speed with a parallel multicore implementation of the Leiden algorithm. [9]

The Leiden algorithm goes a long way toward overcoming the resolution limit problem, but small substructures can still be missed in certain cases. Selecting the resolution parameter γ carefully is therefore crucial to ensure that these structures are not missed, as the appropriate value can vary significantly from one graph to the next.

References

  1. Traag, Vincent A.; Waltman, Ludo; van Eck, Nees Jan (26 March 2019). "From Louvain to Leiden: guaranteeing well-connected communities". Scientific Reports. 9 (1): 5233. arXiv:1810.08473. Bibcode:2019NatSR...9.5233T. doi:10.1038/s41598-019-41695-z. PMC 6435756. PMID 30914743.
  2. Clauset, Aaron; Newman, M. E. J.; Moore, Cristopher (2004). "Finding community structure in very large networks". Physical Review E. 70 (6): 066111. arXiv:cond-mat/0408187. Bibcode:2004PhRvE..70f6111C. doi:10.1103/PhysRevE.70.066111. PMID 15697438. S2CID 8977721.
  3. Reichardt, Jörg; Bornholdt, Stefan (2004-11-15). "Detecting Fuzzy Community Structures in Complex Networks with a Potts Model". Physical Review Letters. 93 (21): 218701. arXiv:cond-mat/0402349. doi:10.1103/PhysRevLett.93.218701. ISSN 0031-9007.
  4. "Reference — leidenalg 0.10.3.dev0+gcb0bc63.d20240122 documentation". leidenalg.readthedocs.io. Retrieved 2024-11-23.
  5. https://cran.r-project.org/web/packages/leiden/leiden.pdf
  6. Traag, Vincent A.; Van Dooren, Paul; Nesterov, Yurii (29 July 2011). "Narrow scope for resolution-limit-free community detection". Physical Review E. 84 (1): 016114. arXiv:1104.3083. Bibcode:2011PhRvE..84a6114T. doi:10.1103/PhysRevE.84.016114. PMID 21867264.
  7. Fortunato, Santo; Barthélemy, Marc (2007-01-02). "Resolution limit in community detection". Proceedings of the National Academy of Sciences. 104 (1): 36–41. doi:10.1073/pnas.0605965104. ISSN 0027-8424. PMC 1765466. PMID 17190818.
  8. Blondel, Vincent D.; Guillaume, Jean-Loup; Lambiotte, Renaud; Lefebvre, Etienne (2008). "A Simple Acceleration Method for the Louvain Algorithm". arXiv:0803.0476.
  9. Yang, Zizhang; Wang, Junhao (2023). "GVE-Leiden: Fast Leiden Algorithm for Community Detection in Shared Memory Setting". arXiv:2312.13936.