Modularity (networks)

Last updated
Example of modularity measurement and colouring on a scale-free network. Gephi 0.9.1 Network Analysis and Visualization Software.png
Example of modularity measurement and colouring on a scale-free network.

Modularity is a measure of the structure of networks or graphs which measures the strength of division of a network into modules (also called groups, clusters or communities). Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is often used in optimization methods for detecting community structure in networks. Biological networks, including animal brains, exhibit a high degree of modularity. However, modularity maximization is not statistically consistent, and finds communities in its own null model, i.e. fully random graphs, and therefore it cannot be used to find statistically significant community structures in empirical networks. Furthermore, it has been shown that modularity suffers a resolution limit and, therefore, it is unable to detect small communities.

Contents

Motivation

Many scientifically important problems can be represented and empirically studied using networks. For example, biological and social patterns, the World Wide Web, metabolic networks, food webs, neural networks and pathological networks are real world problems that can be mathematically represented and topologically studied to reveal some unexpected structural features. [1] Most of these networks possess a certain community structure that has substantial importance in building an understanding regarding the dynamics of the network. For instance, a closely connected social community will imply a faster rate of transmission of information or rumor among them than a loosely connected community. Thus, if a network is represented by a number of individual nodes connected by links which signify a certain degree of interaction between the nodes, communities are defined as groups of densely interconnected nodes that are only sparsely connected with the rest of the network. Hence, it may be imperative to identify the communities in networks since the communities may have quite different properties such as node degree, clustering coefficient, betweenness, centrality, [2] etc., from that of the average network. Modularity is one such measure, which when maximized, leads to the appearance of communities in a given network.

Definition

Modularity is the fraction of the edges that fall within the given groups minus the expected fraction if edges were distributed at random. The value of the modularity for unweighted and undirected graphs lies in the range . [3] It is positive if the number of edges within groups exceeds the number expected on the basis of chance. For a given division of the network's vertices into some modules, modularity reflects the concentration of edges within modules compared with random distribution of links between all nodes regardless of modules.

There are different methods for calculating modularity. [1] In the most common version of the concept, the randomization of the edges is done so as to preserve the degree of each vertex. Consider a graph with nodes and links (edges) such that the graph can be partitioned into two communities using a membership variable . If a node belongs to community 1, , or if belongs to community 2, . Let the adjacency matrix for the network be represented by , where means there's no edge (no interaction) between nodes and and means there is an edge between the two. Also for simplicity we consider an undirected network. Thus . (It is important to note that multiple edges may exist between two nodes, but here we assess the simplest case).

Modularity is then defined as the fraction of edges that fall within group 1 or 2, minus the expected number of edges within groups 1 and 2 for a random graph with the same node degree distribution as the given network.

The expected number of edges shall be computed using the concept of a configuration model. [4] The configuration model is a randomized realization of a particular network. Given a network with nodes, where each node has a node degree , the configuration model cuts each edge into two halves, and then each half edge, called a stub, is rewired randomly with any other stub in the network, even allowing self-loops (which occur when a stub is rewired to another stub from the same node) and multiple-edges between the same two nodes. Thus, even though the node degree distribution of the graph remains intact, the configuration model results in a completely random network.

Expected Number of Edges Between Nodes

Now consider two nodes and , with node degrees and respectively, from a randomly rewired network as described above. We calculate the expected number of full edges between these nodes.

Let us consider each of the stubs of node and create associated indicator variables for them, , with if the -th stub happens to connect to one of the stubs of node in this particular random graph. If it does not, then . Since the -th stub of node can connect to any of the remaining stubs with equal probability (while is the number of edges in the original graph), and since there are stubs it can connect to associated with node , evidently

The total number of full edges between and is just , so the expected value of this quantity is

Many texts then make the following approximations, for random networks with a large number of edges. When is large, they drop the subtraction of in the denominator above and simply use the approximate expression for the expected number of edges between two nodes. Additionally, in a large random network, the number of self-loops and multi-edges is vanishingly small. [5] Ignoring self-loops and multi-edges allows one to assume that there is at most one edge between any two nodes. In that case, becomes a binary indicator variable, so its expected value is also the probability that it equals , which means one can approximate the probability of an edge existing between nodes and as .

Modularity

Hence, the difference between the actual number of edges between node and and the expected number of edges between them is

Summing over all node pairs gives the equation for modularity, . [1]

It is important to note that Eq. 3 holds good for partitioning into two communities only. Hierarchical partitioning (i.e. partitioning into two communities, then the two sub-communities further partitioned into two smaller sub communities only to maximize Q) is a possible approach to identify multiple communities in a network. Additionally, (3) can be generalized for partitioning a network into c communities. [6]

where eij is the fraction of edges with one end vertices in community i and the other in community j:

and ai is the fraction of ends of edges that are attached to vertices in community i:

Example of multiple community detection

We consider an undirected network with 10 nodes and 12 edges and the following adjacency matrix.

Fig 1. Sample Network corresponding to the Adjacency matrix with 10 nodes, 12 edges. Sample Network.jpg
Fig 1. Sample Network corresponding to the Adjacency matrix with 10 nodes, 12 edges.
Fig 2. Network partitions that maximize Q. Maximum Q=0.4896 Partitioned network.jpg
Fig 2. Network partitions that maximize Q. Maximum Q=0.4896
Node ID12345678910
10110000001
21010000000
31100000000
40000110001
50001010000
60001100000
70000000111
80000001010
90000001100
101001001000

The communities in the graph are represented by the red, green and blue node clusters in Fig 1. The optimal community partitions are depicted in Fig 2.

Matrix formulation

An alternative formulation of the modularity, useful particularly in spectral optimization algorithms, is as follows. [1] Define to be if vertex belongs to group and otherwise. Then

and hence

where is the (non-square) matrix having elements and is the so-called modularity matrix, which has elements

All rows and columns of the modularity matrix sum to zero, which means that the modularity of an undivided network is also always .

For networks divided into just two communities, one can alternatively define to indicate the community to which node belongs, which then leads to

where is the column vector with elements . [1]

This function has the same form as the Hamiltonian of an Ising spin glass, a connection that has been exploited to create simple computer algorithms, for instance using simulated annealing, to maximize the modularity. The general form of the modularity for arbitrary numbers of communities is equivalent to a Potts spin glass and similar algorithms can be developed for this case also. [7]

Overfitting

Although the method of modularity maximization is motivated by computing a deviation from a null model, this deviation is not computed in a statistically consistent manner. [8] Because of this, the method notoriously finds high-scoring communities in its own null model [9] (the configuration model), which by definition cannot be statistically significant. Because of this, the method cannot be used to reliably obtain statistically significant community structure in empirical networks.

Resolution limit

Modularity compares the number of edges inside a cluster with the expected number of edges that one would find in the cluster if the network were a random network with the same number of nodes and where each node keeps its degree, but edges are otherwise randomly attached. This random null model implicitly assumes that each node can get attached to any other node of the network. This assumption is however unreasonable if the network is very large, as the horizon of a node includes a small part of the network, ignoring most of it. Moreover, this implies that the expected number of edges between two groups of nodes decreases if the size of the network increases. So, if a network is large enough, the expected number of edges between two groups of nodes in modularity's null model may be smaller than one. If this happens, a single edge between the two clusters would be interpreted by modularity as a sign of a strong correlation between the two clusters, and optimizing modularity would lead to the merging of the two clusters, independently of the clusters' features. So, even weakly interconnected complete graphs, which have the highest possible density of internal edges, and represent the best identifiable communities, would be merged by modularity optimization if the network were sufficiently large. [10] For this reason, optimizing modularity in large networks would fail to resolve small communities, even when they are well defined. This bias is inevitable for methods like modularity optimization, which rely on a global null model. [11]

Multiresolution methods

There are two main approaches which try to solve the resolution limit within the modularity context: the addition of a resistance r to every node, in the form of a self-loop, which increases (r>0) or decreases (r<0) the aversion of nodes to form communities; [12] or the addition of a parameter γ>0 in front of the null-case term in the definition of modularity, which controls the relative importance between internal links of the communities and the null model. [7] Optimizing modularity for values of these parameters in their respective appropriate ranges, it is possible to recover the whole mesoscale of the network, from the macroscale in which all nodes belong to the same community, to the microscale in which every node forms its own community, hence the name multiresolution methods. However, it has been shown that these methods have limitations when communities are very heterogeneous in size. [13]

Software Tools

There are a couple of software tools available that are able to compute clusterings in graphs with good modularity.

Original implementation of the multi-level Louvain method. [14]

The Leiden algorithm which additionally avoids unconnected communities. [15]

The Vienna Graph Clustering (VieClus) algorithm, a parallel memetic algorithm. [16]

See also

Related Research Articles

<span class="mw-page-title-main">Scale-free network</span> Network whose degree distribution follows a power law

A scale-free network is a network whose degree distribution follows a power law, at least asymptotically. That is, the fraction P(k) of nodes in the network having k connections to other nodes goes for large values of k as

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes.

<span class="mw-page-title-main">Centrality</span> Degree of connectedness within a graph

In graph theory and network analysis, indicators of centrality assign numbers or rankings to nodes within a graph corresponding to their network position. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, super-spreaders of disease, and brain networks. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

<span class="mw-page-title-main">Giant component</span> Large connected component of a random graph

In network theory, a giant component is a connected component of a given random graph that contains a significant fraction of the entire graph's vertices.

<span class="mw-page-title-main">Barabási–Albert model</span> Scale-free network generation algorithm

The Barabási–Albert (BA) model is an algorithm for generating random scale-free networks using a preferential attachment mechanism. Several natural and human-made systems, including the Internet, the World Wide Web, citation networks, and some social networks are thought to be approximately scale-free and certainly contain few nodes with unusually high degree as compared to the other nodes of the network. The BA model tries to explain the existence of such nodes in real networks. The algorithm is named for its inventors Albert-László Barabási and Réka Albert.

<span class="mw-page-title-main">Community structure</span> Concept in graph theory

In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But overlapping communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community(ies), and less likely to be connected if they do not share communities. A related but different problem is community search, where the goal is to find a community that a certain vertex belongs to.

<span class="mw-page-title-main">Watts–Strogatz model</span> Method of generating random small-world graphs

The Watts–Strogatz model is a random graph generation model that produces graphs with small-world properties, including short average path lengths and high clustering. It was proposed by Duncan J. Watts and Steven Strogatz in their article published in 1998 in the Nature scientific journal. The model also became known as the (Watts) beta model after Watts used to formulate it in his popular science book Six Degrees.

<span class="mw-page-title-main">Assortativity</span> Tendency for similar nodes to be connected

Assortativity, or assortative mixing, is a preference for a network's nodes to attach to others that are similar in some way. Though the specific measure of similarity may vary, network theorists often examine assortativity in terms of a node's degree. The addition of this characteristic to network models more closely approximates the behaviors of many real world networks.

<span class="mw-page-title-main">Erdős–Rényi model</span> Two closely related models for generating random graphs

In the mathematical field of graph theory, the Erdős–Rényi model refers to one of two closely related models for generating random graphs or the evolution of a random network. These models are named after Hungarian mathematicians Paul Erdős and Alfréd Rényi, who introduced one of the models in 1959. Edgar Gilbert introduced the other model contemporaneously with and independently of Erdős and Rényi. In the model of Erdős and Rényi, all graphs on a fixed vertex set with a fixed number of edges are equally likely. In the model introduced by Gilbert, also called the Erdős–Rényi–Gilbert model, each edge has a fixed probability of being present or absent, independently of the other edges. These models can be used in the probabilistic method to prove the existence of graphs satisfying various properties, or to provide a rigorous definition of what it means for a property to hold for almost all graphs.

In mathematics, a graph partition is the reduction of a graph to a smaller graph by partitioning its set of nodes into mutually exclusive groups. Edges of the original graph that cross between the groups will produce edges in the partitioned graph. If the number of resulting edges is small compared to the original graph, then the partitioned graph may be better suited for analysis and problem-solving than the original. Finding a partition that simplifies graph analysis is a hard problem, but one that has applications to scientific computing, VLSI circuit design, and task scheduling in multiprocessor computers, among others. Recently, the graph partition problem has gained importance due to its application for clustering and detection of cliques in social, pathological and biological networks. For a survey on recent trends in computational methods and applications see Buluc et al. (2013). Two common examples of graph partitioning are minimum cut and maximum cut problems.

<span class="mw-page-title-main">Random geometric graph</span> In graph theory, the mathematically simplest spatial network

In graph theory, a random geometric graph (RGG) is the mathematically simplest spatial network, namely an undirected graph constructed by randomly placing N nodes in some metric space and connecting two nodes by a link if and only if their distance is in a given range, e.g. smaller than a certain neighborhood radius, r.

<span class="mw-page-title-main">Triadic closure</span>

Triadic closure is a concept in social network theory, first suggested by German sociologist Georg Simmel in his 1908 book Soziologie [Sociology: Investigations on the Forms of Sociation]. Triadic closure is the property among three nodes A, B, and C, that if the connections A-B and A-C exist, there is a tendency for the new connection B-C to be formed. Triadic closure can be used to understand and predict the growth of networks, although it is only one of many mechanisms by which new connections are formed in complex networks.

<span class="mw-page-title-main">Network science</span> Academic field

Network science is an academic field which studies complex networks such as telecommunication networks, computer networks, biological networks, cognitive and semantic networks, and social networks, considering distinct elements or actors represented by nodes and the connections between the elements or actors as links. The field draws on theories and methods including graph theory from mathematics, statistical mechanics from physics, data mining and information visualization from computer science, inferential modeling from statistics, and social structure from sociology. The United States National Research Council defines network science as "the study of network representations of physical, biological, and social phenomena leading to predictive models of these phenomena."

Graphlets in mathematics are induced subgraph isomorphism classes in a graph, i.e. two graphlet occurrences are isomorphic, whereas two graphlets are non-isomorphic. Graphlets differ from network motifs in a statistical sense, network motifs are defined as over- or under-represented graphlets with respect to some random graph null model.

Random walk closeness centrality is a measure of centrality in a network, which describes the average speed with which randomly walking processes reach a node from other nodes of the network. It is similar to the closeness centrality except that the farness is measured by the expected length of a random walk rather than by the shortest path.

Evolving networks are dynamic networks that change through time. In each period there are new nodes and edges that join the network while the old ones disappear. Such dynamic behaviour is characteristic for most real-world networks, regardless of their range - global or local. However, networks differ not only in their range but also in their topological structure. It is possible to distinguish:

<span class="mw-page-title-main">Multidimensional network</span> Networks with multiple kinds of relations

In network theory, multidimensional networks, a special type of multilayer network, are networks with multiple kinds of relations. Increasingly sophisticated attempts to model real-world systems as multidimensional networks have yielded valuable insight in the fields of social network analysis, economics, urban and international transport, ecology, psychology, medicine, biology, commerce, climatology, physics, computational neuroscience, operations management, and finance.

<span class="mw-page-title-main">Louvain method</span> Clustering and community detection algorithm

The Louvain method for community detection is a method to extract non-overlapping communities from large networks created by Blondel et al. from the University of Louvain. The method is a greedy optimization method that appears to run in time where is the number of nodes in the network.

Maximal entropy random walk (MERW) is a popular type of biased random walk on a graph, in which transition probabilities are chosen accordingly to the principle of maximum entropy, which says that the probability distribution which best represents the current state of knowledge is the one with largest entropy. While standard random walk chooses for every vertex uniform probability distribution among its outgoing edges, locally maximizing entropy rate, MERW maximizes it globally by assuming uniform probability distribution among all paths in a given graph.

<span class="mw-page-title-main">Configuration model</span>

In network science, the configuration model is a method for generating random networks from a given degree sequence. It is widely used as a reference model for real-life social networks, because it allows the modeler to incorporate arbitrary degree distributions.

References

  1. 1 2 3 4 5 Newman, M. E. J. (2006). "Modularity and community structure in networks". Proceedings of the National Academy of Sciences of the United States of America. 103 (23): 8577–8696. arXiv: physics/0602124 . Bibcode:2006PNAS..103.8577N. doi: 10.1073/pnas.0601602103 . PMC   1482622 . PMID   16723398.
  2. Newman, M. E. J. (2007). Palgrave Macmillan, Basingstoke (ed.). "Mathematics of networks". The New Palgrave Encyclopedia of Economics (2 ed.).
  3. Brandes, U.; Delling, D.; Gaertler, M.; Gorke, R.; Hoefer, M.; Nikoloski, Z.; Wagner, D. (February 2008). "On Modularity Clustering". IEEE Transactions on Knowledge and Data Engineering. 20 (2): 172–188. doi:10.1109/TKDE.2007.190689. S2CID   150684.
  4. van der Hofstad, Remco (2013). "Chapter 7" (PDF). Random Graphs and Complex Networks. Archived (PDF) from the original on 2013-12-18. Retrieved 2013-12-08.
  5. "NetworkScience". Albert-László Barabási. Archived from the original on 2020-03-05. Retrieved 2020-03-20.
  6. Clauset, Aaron and Newman, M. E. J. and Moore, Cristopher (2004). "Finding community structure in very large networks". Phys. Rev. E. 70 (6): 066111. arXiv: cond-mat/0408187 . Bibcode:2004PhRvE..70f6111C. doi:10.1103/PhysRevE.70.066111. PMID   15697438. S2CID   8977721.{{cite journal}}: CS1 maint: multiple names: authors list (link)
  7. 1 2 Joerg Reichardt & Stefan Bornholdt (2006). "Statistical mechanics of community detection". Physical Review E. 74 (1): 016110. arXiv: cond-mat/0603718 . Bibcode:2006PhRvE..74a6110R. doi:10.1103/PhysRevE.74.016110. PMID   16907154. S2CID   792965.
  8. Peixoto, Tiago P. (2021). "Descriptive vs. inferential community detection: pitfalls, myths and half-truths". arXiv: 2112.00183 .
  9. Guimera, Roger; Sales-Pardo, Marta (August 19, 2004), "Modularity from fluctuations in random graphs and complex networks", Physical Review, 70 (2): 025101, arXiv: cond-mat/0403660 , Bibcode:2004PhRvE..70b5101G, doi:10.1103/PhysRevE.70.025101, PMC   2441765 , PMID   15447530
  10. Santo Fortunato & Marc Barthelemy (2007). "Resolution limit in community detection". Proceedings of the National Academy of Sciences of the United States of America. 104 (1): 36–41. arXiv: physics/0607100 . Bibcode:2007PNAS..104...36F. doi: 10.1073/pnas.0605965104 . PMC   1765466 . PMID   17190818.
  11. J.M. Kumpula; J. Saramäki; K. Kaski & J. Kertész (2007). "Limited resolution in complex network community detection with Potts model approach". European Physical Journal B. 56 (1): 41–45. arXiv: cond-mat/0610370 . Bibcode:2007EPJB...56...41K. doi:10.1140/epjb/e2007-00088-4. S2CID   4411525.
  12. Alex Arenas, Alberto Fernández and Sergio Gómez (2008). "Analysis of the structure of complex networks at different resolution levels". New Journal of Physics. 10 (5): 053039. arXiv: physics/0703218 . Bibcode:2008NJPh...10e3039A. doi:10.1088/1367-2630/10/5/053039. S2CID   11544197.
  13. Andrea Lancichinetti & Santo Fortunato (2011). "Limits of modularity maximization in community detection". Physical Review E. 84 (6): 066122. arXiv: 1107.1155 . Bibcode:2011PhRvE..84f6122L. doi:10.1103/PhysRevE.84.066122. PMID   22304170. S2CID   16180375.
  14. First implementation of Louvain algorithm, archived from the original on 2021-03-17, retrieved 2020-11-30
  15. Leiden algorithm repository, 15 December 2021, archived from the original on 26 November 2020, retrieved 30 November 2020
  16. Vienna graph clustering repository, 13 April 2021, archived from the original on 21 October 2020, retrieved 30 November 2020