Clustering coefficient

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes (Holland and Leinhardt, 1971; [1] Watts and Strogatz, 1998 [2]).

Two versions of this measure exist: the global and the local. The global version was designed to give an overall indication of the clustering in the network, whereas the local gives an indication of the extent of "clustering" of a single node.

Local clustering coefficient

Example local clustering coefficient on an undirected graph. The local clustering coefficient of the blue node is computed as the proportion of connections among its neighbours that are actually realised, compared with the number of all possible connections. In the figure, the blue node has three neighbours, which can have a maximum of 3 connections among them. In the top part of the figure all three possible connections are realised (thick black segments), giving a local clustering coefficient of 1. In the middle part of the figure only one connection is realised (thick black line) and 2 connections are missing (dotted red lines), giving a local clustering coefficient of 1/3. Finally, in the bottom part of the figure none of the possible connections among the neighbours of the blue node are realised, producing a local clustering coefficient value of 0.

The local clustering coefficient of a vertex (node) in a graph quantifies how close its neighbours are to being a clique (complete graph). Duncan J. Watts and Steven Strogatz introduced the measure in 1998 to determine whether a graph is a small-world network.

A graph $G = (V, E)$ formally consists of a set of vertices $V$ and a set of edges $E$ between them. An edge $e_{ij}$ connects vertex $v_i$ with vertex $v_j$.

The neighbourhood $N_i$ for a vertex $v_i$ is defined as its immediately connected neighbours as follows:

$$N_i = \{v_j : e_{ij} \in E \lor e_{ji} \in E\}.$$

We define $k_i$ as the number of vertices, $|N_i|$, in the neighbourhood, $N_i$, of vertex $v_i$.

The local clustering coefficient $C_i$ for a vertex $v_i$ is then given by the number of links between the vertices within its neighbourhood divided by the number of links that could possibly exist between them. For a directed graph, $e_{ij}$ is distinct from $e_{ji}$, and therefore for each neighbourhood $N_i$ there are $k_i(k_i-1)$ links that could exist among the vertices within the neighbourhood ($k_i$ is the number of neighbours of a vertex). Thus, the local clustering coefficient for directed graphs is given as [2]

$$C_i = \frac{|\{e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E\}|}{k_i(k_i-1)}.$$

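An undirected graph has the property that $e_{ij}$ and $e_{ji}$ are considered identical. Therefore, if a vertex $v_i$ has $k_i$ neighbours, $\frac{k_i(k_i-1)}{2}$ edges could exist among the vertices within the neighbourhood. Thus, the local clustering coefficient for undirected graphs can be defined as

$$C_i = \frac{2\,|\{e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E\}|}{k_i(k_i-1)}.$$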

Let $\lambda_G(v)$ be the number of triangles on $v \in V(G)$ for undirected graph $G$. That is, $\lambda_G(v)$ is the number of subgraphs of $G$ with 3 edges and 3 vertices, one of which is $v$. Let $\tau_G(v)$ be the number of triples on $v \in G$. That is, $\tau_G(v)$ is the number of subgraphs (not necessarily induced) with 2 edges and 3 vertices, one of which is $v$ and such that $v$ is incident to both edges. Then, for $v = v_i$, we can also define the clustering coefficient as

$$C_i = \frac{\lambda_G(v)}{\tau_G(v)}.$$

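It is simple to show that the two preceding definitions are the same: every triangle on $v$ corresponds to exactly one edge between two neighbours of $v$, so $\lambda_G(v)$ equals the number of links within the neighbourhood, and

$$\tau_G(v) = \binom{k_i}{2} = \frac{1}{2} k_i (k_i - 1).$$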

These measures are 1 if every neighbour connected to $v_i$ is also connected to every other vertex within the neighbourhood, and 0 if no vertex that is connected to $v_i$ connects to any other vertex that is connected to $v_i$.
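The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the adjacency-set dictionaries `adj` and `succ`, the function names, and the convention of returning 0 for vertices with fewer than two neighbours are all assumptions made for this example. The undirected test graphs mirror the three cases in the figure caption.

```python
from itertools import combinations


def local_clustering_undirected(adj, v):
    """Local clustering coefficient of v in a simple undirected graph.

    adj maps each vertex to the set of its neighbours (an assumed
    representation chosen for this sketch).
    """
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        # No pair of neighbours exists; return 0 by convention.
        return 0.0
    # Count the links actually present among the neighbours of v.
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    # k * (k - 1) / 2 links could exist among k neighbours.
    return 2 * links / (k * (k - 1))


def local_clustering_directed(succ, v):
    """Local clustering coefficient of v in a directed graph without self-loops.

    succ maps every vertex to the set of vertices it points to.
    """
    # Neighbourhood: vertices linked to v in either direction.
    neighbours = set(succ[v]) | {u for u, targets in succ.items() if v in targets}
    neighbours.discard(v)
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count arcs a -> b with both endpoints in the neighbourhood.
    links = sum(1 for a in neighbours for b in succ[a] if b in neighbours)
    # k * (k - 1) ordered pairs of neighbours could be linked.
    return links / (k * (k - 1))


# Centre vertex "c" with neighbours 1, 2, 3, as in the figure.
full = {"c": {1, 2, 3}, 1: {"c", 2, 3}, 2: {"c", 1, 3}, 3: {"c", 1, 2}}
one = {"c": {1, 2, 3}, 1: {"c", 2}, 2: {"c", 1}, 3: {"c"}}
none = {"c": {1, 2, 3}, 1: {"c"}, 2: {"c"}, 3: {"c"}}

print(local_clustering_undirected(full, "c"))  # 1.0
print(local_clustering_undirected(one, "c"))   # 0.3333333333333333
print(local_clustering_undirected(none, "c"))  # 0.0

# Directed case: v -> a, v -> b, a -> b, b -> a, c -> v.
succ = {"v": {"a", "b"}, "a": {"b"}, "b": {"a"}, "c": {"v"}}
print(local_clustering_directed(succ, "v"))    # 0.3333333333333333
```

In the directed example the neighbourhood of `v` is {a, b, c}, so 6 arcs are possible and 2 are present.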

Since any graph is fully specified by its adjacency matrix $A$, the local clustering coefficient for a simple undirected graph can be expressed in terms of $A$ as: [3]

$$C_i = \frac{1}{k_i(k_i-1)} \sum_{j,k} A_{ij} A_{jk} A_{ki}$$

where:

$$k_i = \sum_j A_{ij}$$

and $C_i = 0$ when $k_i$ is zero or one. In the above expression, the numerator counts twice the number of complete triangles that vertex $i$ is involved in. In the denominator, $k_i^2$ counts the number of edge pairs that vertex $i$ is involved in plus the number of single edges traversed twice. $k_i$ is the number of edges connected to vertex $i$, and subtracting $k_i$ then removes the latter, leaving only a set of edge pairs that could conceivably be connected into triangles. For every such edge pair, there will be another edge pair which could form the same triangle, so the denominator counts twice the number of conceivable triangles that vertex $i$ could be involved in.
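The adjacency-matrix form can be checked directly with nested loops. The sketch below assumes a symmetric 0/1 matrix with a zero diagonal stored as a list of lists; the function name `local_clustering_matrix` is hypothetical, and the triple sum is written out naively rather than with a matrix library.

```python
def local_clustering_matrix(A):
    """Local clustering coefficients from a symmetric 0/1 adjacency matrix
    with a zero diagonal, given as a list of lists."""
    n = len(A)
    coefficients = []
    for i in range(n):
        k = sum(A[i])  # degree k_i = sum_j A_ij
        if k < 2:
            # C_i = 0 when k_i is zero or one.
            coefficients.append(0.0)
            continue
        # sum_{j,m} A_ij A_jm A_mi counts each triangle through i twice.
        closed = sum(A[i][j] * A[j][m] * A[m][i]
                     for j in range(n) for m in range(n))
        coefficients.append(closed / (k * (k - 1)))
    return coefficients


# Vertex 0 is linked to 1, 2 and 3; among those, only 1 and 2 are linked.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print(local_clustering_matrix(A))  # [0.3333333333333333, 1.0, 1.0, 0.0]
```

The double loop costs on the order of n² operations per vertex, which is fine for illustration but not for large sparse graphs.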

Global clustering coefficient

The global clustering coefficient is based on triplets of nodes. A triplet is three nodes that are connected by either two (open triplet) or three (closed triplet) undirected ties. A triangle graph therefore includes three closed triplets, one centred on each of the nodes (n.b. this means the three triplets in a triangle come from overlapping selections of nodes). The global clustering coefficient is the number of closed triplets (or 3 × triangles) over the total number of triplets (both open and closed). The first attempt to measure it was made by Luce and Perry (1949). [4] This measure gives an indication of the clustering in the whole network (global), and can be applied to both undirected and directed networks (often called transitivity, see Wasserman and Faust, 1994, page 243 [5]).

The global clustering coefficient is defined as:

$$C = \frac{\text{number of closed triplets}}{\text{number of all triplets (open and closed)}}.$$

The number of closed triplets has also been referred to as 3 × triangles in the literature, so:

$$C = \frac{3 \times \text{number of triangles}}{\text{number of all triplets}}.$$

A generalisation to weighted networks was proposed by Opsahl and Panzarasa (2009), [6] and a redefinition to two-mode networks (both binary and weighted) by Opsahl (2009). [7]

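Since any simple graph is fully specified by its adjacency matrix $A$, the global clustering coefficient for an undirected graph can be expressed in terms of $A$ as:

$$C = \frac{\sum_{i,j,k} A_{ij} A_{jk} A_{ki}}{\sum_i k_i (k_i - 1)}$$

where:

$$k_i = \sum_j A_{ij}$$

and $C = 0$ when the denominator is zero.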
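The triplet-counting definition translates almost literally into code. The following sketch assumes a simple undirected graph stored as a dictionary of neighbour sets; the name `global_clustering` and the test graph are illustrative choices, not taken from the cited sources.

```python
from itertools import combinations


def global_clustering(adj):
    """Global clustering coefficient (transitivity) of a simple undirected graph.

    adj maps each vertex to the set of its neighbours.
    """
    closed = 0  # triplets whose two end vertices are also linked
    total = 0   # all triplets, open and closed
    for centre, neighbours in adj.items():
        # Every unordered pair of neighbours forms one triplet centred here.
        for a, b in combinations(neighbours, 2):
            total += 1
            if b in adj[a]:
                closed += 1
    # C = 0 when there are no triplets at all.
    return closed / total if total else 0.0


# A triangle 0-1-2 with one extra vertex 3 attached to vertex 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(global_clustering(adj))  # 0.6
```

Here the single triangle contributes 3 closed triplets, and vertex 0 contributes 2 further open triplets, giving 3/5.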

Network average clustering coefficient

As an alternative to the global clustering coefficient, the overall level of clustering in a network was measured by Watts and Strogatz [2] as the average of the local clustering coefficients of all the vertices $n$: [8]

$$\bar{C} = \frac{1}{n} \sum_{i=1}^{n} C_i.$$

This metric places more weight on low-degree nodes, whereas the transitivity ratio (the global clustering coefficient) places more weight on high-degree nodes.
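The difference in weighting can be made concrete with a small example. The sketch below uses a hypothetical "windmill" graph (a hub shared by several triangles), built by the illustrative helper `windmill`; it also assumes that vertices with fewer than two neighbours get a local coefficient of 0, which is a convention rather than part of the definition.

```python
from itertools import combinations


def triplet_counts(adj, v):
    """Return (closed, total) counts of triplets centred on v."""
    pairs = list(combinations(adj[v], 2))
    closed = sum(1 for a, b in pairs if b in adj[a])
    return closed, len(pairs)


def average_clustering(adj):
    """Mean of the local clustering coefficients over all vertices."""
    values = []
    for v in adj:
        closed, total = triplet_counts(adj, v)
        values.append(closed / total if total else 0.0)
    return sum(values) / len(values)


def transitivity(adj):
    """Closed triplets divided by all triplets, summed over the whole graph."""
    closed = total = 0
    for v in adj:
        c, t = triplet_counts(adj, v)
        closed += c
        total += t
    return closed / total if total else 0.0


def windmill(m):
    """Hub 'h' joined to m disjoint edges, giving m triangles sharing the hub."""
    adj = {"h": set()}
    for i in range(m):
        a, b = ("a", i), ("b", i)
        adj[a] = {"h", b}
        adj[b] = {"h", a}
        adj["h"] |= {a, b}
    return adj


g = windmill(5)
print(round(average_clustering(g), 4))  # 0.9192
print(round(transitivity(g), 4))        # 0.2727
```

The ten degree-2 vertices each have a local coefficient of 1 and dominate the average, while the hub, with 45 triplets of which only 5 are closed, dominates the transitivity ratio.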

A generalisation to weighted networks was proposed by Barrat et al. (2004), [9] and a redefinition to bipartite graphs (also called two-mode networks) by Latapy et al. (2008) [10] and Opsahl (2009). [7]

Alternative generalisations to weighted and directed graphs have been provided by Fagiolo (2007) [11] and Clemente and Grassi (2018). [12]

This formula is not, by default, defined for graphs with isolated vertices; see Kaiser (2008) [13] and Barmpoutis et al. [14] The networks with the largest possible average clustering coefficient are found to have a modular structure, and at the same time, they have the smallest possible average distance among the different nodes. [14]

Percolation of clustered networks

For a random tree-like network without degree-degree correlation, it can be shown that such a network can have a giant component, and the percolation threshold (transmission probability) is given by $p_c = \frac{1}{g_1'(1)}$, where $g_1(z)$ is the generating function corresponding to the excess degree distribution.

In networks with low clustering, $0 < C \ll 1$, the critical point gets scaled by $(1-C)^{-1}$ such that:

$$p_c = \frac{1}{1-C} \, \frac{1}{g_1'(1)}.$$ [15]

This indicates that, for a given degree distribution, clustering leads to a larger percolation threshold, mainly because, for a fixed number of links, the clustering structure reinforces the core of the network at the price of diluting the global connections. For networks with high clustering, strong clustering could induce a core–periphery structure, in which the core and periphery might percolate at different critical points, and the above approximate treatment is not applicable. [16]
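As a rough numerical illustration of the low-clustering approximation, the sketch below assumes the scaling $p_c = \frac{1}{1-C}\,\frac{1}{g_1'(1)}$ together with the standard generating-function identity $g_1'(1) = \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle}$ for the excess degree distribution. The function name `percolation_threshold` and the dictionary format for the degree distribution are hypothetical, and the result is only meaningful for small $C$.

```python
def percolation_threshold(degree_dist, clustering=0.0):
    """Approximate percolation threshold for a given degree distribution.

    degree_dist maps each degree k to its probability p_k.
    clustering is the clustering coefficient C, assumed small (0 <= C << 1).
    """
    if not 0.0 <= clustering < 1.0:
        raise ValueError("clustering must lie in [0, 1)")
    mean_k = sum(k * p for k, p in degree_dist.items())
    mean_k2 = sum(k * k * p for k, p in degree_dist.items())
    if mean_k <= 0:
        raise ValueError("mean degree must be positive")
    # g_1'(1) = <k(k-1)> / <k> for the excess degree distribution.
    g1_prime = (mean_k2 - mean_k) / mean_k
    if g1_prime <= 0:
        raise ValueError("no giant component is possible for this distribution")
    # Clustering rescales the unclustered threshold by 1 / (1 - C).
    return 1.0 / ((1.0 - clustering) * g1_prime)


regular = {3: 1.0}  # every vertex has degree 3
print(percolation_threshold(regular))                 # 0.5
print(round(percolation_threshold(regular, 0.1), 4))  # 0.5556
```

A returned value above 1 would mean that, under this approximation, no transmission probability is large enough to produce a giant component.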

A percolation approach has been developed for studying the robustness of clustered networks. [17] [18]

References

  1. P. W. Holland & S. Leinhardt (1971). "Transitivity in structural models of small groups". Comparative Group Studies. 2 (2): 107–124. doi:10.1177/104649647100200201. S2CID 145544488.
  2. D. J. Watts & Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature. 393 (6684): 440–442. Bibcode:1998Natur.393..440W. doi:10.1038/30918. PMID 9623998. S2CID 4429113.
  3. Wang, Yu; Ghumare, Eshwar; Vandenberghe, Rik; Dupont, Patrick (2017). "Comparison of Different Generalizations of Clustering Coefficient and Local Efficiency for Weighted Undirected Graphs". Neural Computation. 29 (2): 313–331. doi:10.1162/NECO_a_00914. PMID 27870616. S2CID 11000115. Archived from the original on August 10, 2020. Retrieved August 8, 2020.
  4. R. D. Luce & A. D. Perry (1949). "A method of matrix analysis of group structure". Psychometrika. 14 (1): 95–116. doi:10.1007/BF02289146. hdl:10.1007/BF02289146. PMID 18152948. S2CID 16186758.
  5. Stanley Wasserman, Katherine Faust, 1994. Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.
  6. Tore Opsahl & Pietro Panzarasa (2009). "Clustering in Weighted Networks". Social Networks. 31 (2): 155–163. doi:10.1016/j.socnet.2009.02.002. Archived from the original on 2019-07-01. Retrieved 2009-06-11.
  7. Tore Opsahl (2009). "Clustering in Two-mode Networks". Conference and Workshop on Two-Mode Social Analysis (Sept 30-Oct 2, 2009). Archived from the original on March 21, 2016. Retrieved September 11, 2009.
  8. Kemper, Andreas (2009). Valuation of Network Effects in Software Markets: A Complex Networks Approach. Springer. p. 142. ISBN 9783790823660.
  9. Barrat, A.; Barthelemy, M.; Pastor-Satorras, R.; Vespignani, A. (2004). "The architecture of complex weighted networks". Proceedings of the National Academy of Sciences. 101 (11): 3747–3752. arXiv:cond-mat/0311416. Bibcode:2004PNAS..101.3747B. doi:10.1073/pnas.0400087101. PMC 374315. PMID 15007165.
  10. Latapy, M.; Magnien, C.; Del Vecchio, N. (2008). "Basic Notions for the Analysis of Large Two-mode Networks" (PDF). Social Networks. 30 (1): 31–48. doi:10.1016/j.socnet.2007.04.006.
  11. Fagiolo, G. (2007). "Clustering in complex directed networks". Physical Review E. 76 (2 Pt 2): 026107. arXiv:physics/0612169. CiteSeerX 10.1.1.262.1006. doi:10.1103/PhysRevE.76.026107. PMID 17930104. S2CID 2317676.
  12. Clemente, G.P.; Grassi, R. (2018). "Directed clustering in weighted networks: A new perspective". Chaos, Solitons & Fractals. 107: 26–38. arXiv:1706.07322. Bibcode:2018CSF...107...26C. doi:10.1016/j.chaos.2017.12.007. S2CID 21919524.
  13. Kaiser, Marcus (2008). "Mean clustering coefficients: the role of isolated nodes and leafs on clustering measures for small-world networks". New Journal of Physics. 10 (8): 083042. arXiv:0802.2512. Bibcode:2008NJPh...10h3042K. doi:10.1088/1367-2630/10/8/083042. S2CID 16480565.
  14. Barmpoutis, D.; Murray, R. M. (2010). "Networks with the Smallest Average Distance and the Largest Average Clustering". arXiv:1007.4031 [q-bio.MN].
  15. Berchenko, Yakir; Artzy-Randrup, Yael; Teicher, Mina; Stone, Lewi (2009-03-30). "Emergence and Size of the Giant Component in Clustered Random Graphs with a Given Degree Distribution". Physical Review Letters. 102 (13): 138701. doi:10.1103/PhysRevLett.102.138701. ISSN 0031-9007. PMID 19392410. Archived from the original on 2023-02-04. Retrieved 2022-02-24.
  16. Berchenko, Yakir; Artzy-Randrup, Yael; Teicher, Mina; Stone, Lewi (2009-03-30). "Emergence and Size of the Giant Component in Clustered Random Graphs with a Given Degree Distribution". Physical Review Letters. 102 (13): 138701. doi:10.1103/PhysRevLett.102.138701. ISSN 0031-9007. PMID 19392410. Archived from the original on 2023-02-04. Retrieved 2022-02-24.
  17. M. E. J. Newman (2009). "Random Graphs with Clustering". Phys. Rev. Lett. 103 (5): 058701. arXiv:0903.4009. doi:10.1103/PhysRevLett.103.058701. PMID 19792540. S2CID 28214709.
  18. A. Hackett; S. Melnik & J. P. Gleeson (2011). "Cascades on a class of clustered random networks". Phys. Rev. E. 83 (5 Pt 2): 056107. arXiv:1012.3651. doi:10.1103/PhysRevE.83.056107. PMID 21728605. S2CID 18071422.