Stochastic block model

Last updated

The stochastic block model is a generative model for random graphs. This model tends to produce graphs containing communities, subsets of nodes characterized by being connected with one another with particular edge densities. For example, edges may be more common within communities than between communities. Its mathematical formulation was first introduced in 1983 in the field of social network analysis by Paul W. Holland et al. [1] The stochastic block model is important in statistics, machine learning, and network science, where it serves as a useful benchmark for the task of recovering community structure in graph data.

Contents

Definition

The stochastic block model takes the following parameters:

The edge set is then sampled at random as follows: any two vertices and are connected by an edge with probability . An example problem is: given a graph with vertices, where the edges are sampled as described, recover the groups .

Special cases

An example of the assortative case for the stochastic block model. Assortative Case of the SBM.png
An example of the assortative case for the stochastic block model.

If the probability matrix is a constant, in the sense that for all , then the result is the Erdős–Rényi model . This case is degenerate—the partition into communities becomes irrelevant—but it illustrates a close relationship to the Erdős–Rényi model.

The planted partition model is the special case that the values of the probability matrix are a constant on the diagonal and another constant off the diagonal. Thus two vertices within the same community share an edge with probability , while two vertices in different communities share an edge with probability . Sometimes it is this restricted model that is called the stochastic block model. The case where is called an assortative model, while the case is called disassortative.

Returning to the general stochastic block model, a model is called strongly assortative if whenever : all diagonal entries dominate all off-diagonal entries. A model is called weakly assortative if whenever : each diagonal entry is only required to dominate the rest of its own row and column. [2] Disassortative forms of this terminology exist, by reversing all inequalities. For some algorithms, recovery might be easier for block models with assortative or disassortative conditions of this form. [2]

Typical statistical tasks

Much of the literature on algorithmic community detection addresses three statistical tasks: detection, partial recovery, and exact recovery.

Detection

The goal of detection algorithms is simply to determine, given a sampled graph, whether the graph has latent community structure. More precisely, a graph might be generated, with some known prior probability, from a known stochastic block model, and otherwise from a similar Erdos-Renyi model. The algorithmic task is to correctly identify which of these two underlying models generated the graph. [3]

Partial recovery

In partial recovery, the goal is to approximately determine the latent partition into communities, in the sense of finding a partition that is correlated with the true partition significantly better than a random guess. [4]

Exact recovery

In exact recovery, the goal is to recover the latent partition into communities exactly. The community sizes and probability matrix may be known [5] or unknown. [6]

Statistical lower bounds and threshold behavior

Stochastic block models exhibit a sharp threshold effect reminiscent of percolation thresholds. [7] [3] [8] Suppose that we allow the size of the graph to grow, keeping the community sizes in fixed proportions. If the probability matrix remains fixed, tasks such as partial and exact recovery become feasible for all non-degenerate parameter settings. However, if we scale down the probability matrix at a suitable rate as increases, we observe a sharp phase transition: for certain settings of the parameters, it will become possible to achieve recovery with probability tending to 1, whereas on the opposite side of the parameter threshold, the probability of recovery tends to 0 no matter what algorithm is used.

For partial recovery, the appropriate scaling is to take for fixed , resulting in graphs of constant average degree. In the case of two equal-sized communities, in the assortative planted partition model with probability matrix partial recovery is feasible [4] with probability whenever , whereas any estimator fails [3] partial recovery with probability whenever .

For exact recovery, the appropriate scaling is to take , resulting in graphs of logarithmic average degree. Here a similar threshold exists: for the assortative planted partition model with equal-sized communities, the threshold lies at . In fact, the exact recovery threshold is known for the fully general stochastic block model. [5]

Algorithms

In principle, exact recovery can be solved in its feasible range using maximum likelihood, but this amounts to solving a constrained or regularized cut problem such as minimum bisection that is typically NP-complete. Hence, no known efficient algorithms will correctly compute the maximum-likelihood estimate in the worst case.

However, a wide variety of algorithms perform well in the average case, and many high-probability performance guarantees have been proven for algorithms in both the partial and exact recovery settings. Successful algorithms include spectral clustering of the vertices, [9] [4] [5] [10] semidefinite programming, [2] [8] forms of belief propagation, [7] [11] and community detection [12] among others.

Variants

Several variants of the model exist. One minor tweak allocates vertices to communities randomly, according to a categorical distribution, rather than in a fixed partition. [5] More significant variants include the degree-corrected stochastic block model, [13] the hierarchical stochastic block model, [14] the geometric block model, [15] censored block model and the mixed-membership block model. [16]

Topic models

Stochastic block model have been recognised to be a topic model on bipartite networks. [17] In a network of documents and words, Stochastic block model can identify topics: group of words with a similar meaning.

Extensions to signed graphs

Signed graphs allow for both favorable and adverse relationships and serve as a common model choice for various data analysis applications, e.g., correlation clustering. The stochastic block model can be trivially extended to signed graphs by assigning both positive and negative edge weights or equivalently using a difference of adjacency matrices of two stochastic block models. [18]

DARPA/MIT/AWS Graph Challenge: streaming stochastic block partition

GraphChallenge [19] encourages community approaches to developing new solutions for analyzing graphs and sparse data derived from social media, sensor feeds, and scientific data to enable relationships between events to be discovered as they unfold in the field. Streaming stochastic block partition is one of the challenges since 2017. [20] Spectral clustering has demonstrated outstanding performance compared to the original and even improved [21] base algorithm, matching its quality of clusters while being multiple orders of magnitude faster. [22] [23]

See also

Related Research Articles

<span class="mw-page-title-main">Shortest path problem</span> Computational problem of graph theory

In graph theory, the shortest path problem is the problem of finding a path between two vertices in a graph such that the sum of the weights of its constituent edges is minimized.

<span class="mw-page-title-main">Random graph</span> Graph generated by a random process

In mathematics, random graph is the general term to refer to probability distributions over graphs. Random graphs may be described simply by a probability distribution, or by a random process which generates them. The theory of random graphs lies at the intersection between graph theory and probability theory. From a mathematical perspective, random graphs are used to answer questions about the properties of typical graphs. Its practical applications are found in all areas in which complex networks need to be modeled – many random graph models are thus known, mirroring the diverse types of complex networks encountered in different areas. In a mathematical context, random graph refers almost exclusively to the Erdős–Rényi random graph model. In other contexts, any graph model may be referred to as a random graph.

In statistical mechanics, the Potts model, a generalization of the Ising model, is a model of interacting spins on a crystalline lattice. By studying the Potts model, one may gain insight into the behaviour of ferromagnets and certain other phenomena of solid-state physics. The strength of the Potts model is not so much that it models these physical systems well; it is rather that the one-dimensional case is exactly solvable, and that it has a rich mathematical formulation that has been studied extensively.

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes.

<span class="mw-page-title-main">Triangle-free graph</span> Graph without triples of adjacent vertices

In the mathematical area of graph theory, a triangle-free graph is an undirected graph in which no three vertices form a triangle of edges. Triangle-free graphs may be equivalently defined as graphs with clique number ≤ 2, graphs with girth ≥ 4, graphs with no induced 3-cycle, or locally independent graphs.

<span class="mw-page-title-main">Community structure</span> Concept in graph theory

In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But overlapping communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community(ies), and less likely to be connected if they do not share communities. A related but different problem is community search, where the goal is to find a community that a certain vertex belongs to.

<span class="mw-page-title-main">Assortativity</span> Tendency for similar nodes to be connected

Assortativity, or assortative mixing, is a preference for a network's nodes to attach to others that are similar in some way. Though the specific measure of similarity may vary, network theorists often examine assortativity in terms of a node's degree. The addition of this characteristic to network models more closely approximates the behaviors of many real world networks.

<span class="mw-page-title-main">Erdős–Rényi model</span> Two closely related models for generating random graphs

In the mathematical field of graph theory, the Erdős–Rényi model refers to one of two closely related models for generating random graphs or the evolution of a random network. These models are named after Hungarian mathematicians Paul Erdős and Alfréd Rényi, who introduced one of the models in 1959. Edgar Gilbert introduced the other model contemporaneously with and independently of Erdős and Rényi. In the model of Erdős and Rényi, all graphs on a fixed vertex set with a fixed number of edges are equally likely. In the model introduced by Gilbert, also called the Erdős–Rényi–Gilbert model, each edge has a fixed probability of being present or absent, independently of the other edges. These models can be used in the probabilistic method to prove the existence of graphs satisfying various properties, or to provide a rigorous definition of what it means for a property to hold for almost all graphs.

In mathematics, a graph partition is the reduction of a graph to a smaller graph by partitioning its set of nodes into mutually exclusive groups. Edges of the original graph that cross between the groups will produce edges in the partitioned graph. If the number of resulting edges is small compared to the original graph, then the partitioned graph may be better suited for analysis and problem-solving than the original. Finding a partition that simplifies graph analysis is a hard problem, but one that has applications to scientific computing, VLSI circuit design, and task scheduling in multiprocessor computers, among others. Recently, the graph partition problem has gained importance due to its application for clustering and detection of cliques in social, pathological and biological networks. For a survey on recent trends in computational methods and applications see Buluc et al. (2013). Two common examples of graph partitioning are minimum cut and maximum cut problems.

<span class="mw-page-title-main">Random geometric graph</span> In graph theory, the mathematically simplest spatial network

In graph theory, a random geometric graph (RGG) is the mathematically simplest spatial network, namely an undirected graph constructed by randomly placing N nodes in some metric space and connecting two nodes by a link if and only if their distance is in a given range, e.g. smaller than a certain neighborhood radius, r.

<span class="mw-page-title-main">Network science</span> Academic field

Network science is an academic field which studies complex networks such as telecommunication networks, computer networks, biological networks, cognitive and semantic networks, and social networks, considering distinct elements or actors represented by nodes and the connections between the elements or actors as links. The field draws on theories and methods including graph theory from mathematics, statistical mechanics from physics, data mining and information visualization from computer science, inferential modeling from statistics, and social structure from sociology. The United States National Research Council defines network science as "the study of network representations of physical, biological, and social phenomena leading to predictive models of these phenomena."

<span class="mw-page-title-main">Modularity (networks)</span> Measure of network community structure

Modularity is a measure of the structure of networks or graphs which measures the strength of division of a network into modules. Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is often used in optimization methods for detecting community structure in networks. Biological networks, including animal brains, exhibit a high degree of modularity. However, modularity maximization is not statistically consistent, and finds communities in its own null model, i.e. fully random graphs, and therefore it cannot be used to find statistically significant community structures in empirical networks. Furthermore, it has been shown that modularity suffers a resolution limit and, therefore, it is unable to detect small communities.

<span class="mw-page-title-main">Graphon</span>

In graph theory and statistics, a graphon is a symmetric measurable function , that is important in the study of dense graphs. Graphons arise both as a natural notion for the limit of a sequence of dense graphs, and as the fundamental defining objects of exchangeable random graph models. Graphons are tied to dense graphs by the following pair of observations: the random graph models defined by graphons give rise to dense graphs almost surely, and, by the regularity lemma, graphons capture the structure of arbitrary large dense graphs.

t-distributed stochastic neighbor embedding Technique for dimensionality reduction

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by Geoffrey Hinton and Sam Roweis, where Laurens van der Maaten proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

<span class="mw-page-title-main">Louvain method</span> Clustering and community detection algorithm

The Louvain method for community detection is a method to extract non-overlapping communities from large networks created by Blondel et al. from the University of Louvain. The method is a greedy optimization method that appears to run in time where is the number of nodes in the network.

In mathematics, the hafnian is a scalar function of a symmetric matrix that generalizes the permanent.

Maximal entropy random walk (MERW) is a popular type of biased random walk on a graph, in which transition probabilities are chosen accordingly to the principle of maximum entropy, which says that the probability distribution which best represents the current state of knowledge is the one with largest entropy. While standard random walk chooses for every vertex uniform probability distribution among its outgoing edges, locally maximizing entropy rate, MERW maximizes it globally by assuming uniform probability distribution among all paths in a given graph.

Trophic coherence is a property of directed graphs. It is based on the concept of trophic levels used mainly in ecology, but which can be defined for directed networks in general and provides a measure of hierarchical structure among nodes. Trophic coherence is the tendency of nodes to fall into well-defined trophic levels. It has been related to several structural and dynamical properties of directed networks, including the prevalence of cycles and network motifs, ecological stability, intervality, and spreading processes like epidemics and neuronal avalanches.

<span class="mw-page-title-main">Configuration model</span>

In network science, the configuration model is a method for generating random networks from a given degree sequence. It is widely used as a reference model for real-life social networks, because it allows the modeler to incorporate arbitrary degree distributions.

In statistical mechanics, probability theory, graph theory, etc. the random cluster model is a random graph that generalizes and unifies the Ising model, Potts model, and percolation model. It is used to study random combinatorial structures, electrical networks, etc. It is also referred to as the RC model or sometimes the FK representation after its founders Cees Fortuin and Piet Kasteleyn. The random cluster model has a critical limit, described by a conformal field theory.

References

  1. Holland, Paul W; Laskey, Kathryn Blackmond; Leinhardt, Samuel (1983). "Stochastic blockmodels: First steps". Social Networks . 5 (2): 109–137. doi:10.1016/0378-8733(83)90021-7. ISSN   0378-8733. S2CID   34098453. Archived from the original on 2023-02-04. Retrieved 2021-06-16.
  2. 1 2 3 Amini, Arash A.; Levina, Elizaveta (June 2014). "On semidefinite relaxations for the block model". arXiv: 1406.5647 [cs.LG].
  3. 1 2 3 Mossel, Elchanan; Neeman, Joe; Sly, Allan (February 2012). "Stochastic Block Models and Reconstruction". arXiv: 1202.1499 [math.PR].
  4. 1 2 3 Massoulie, Laurent (November 2013). "Community detection thresholds and the weak Ramanujan property". arXiv: 1311.3085 [cs.SI].
  5. 1 2 3 4 Abbe, Emmanuel; Sandon, Colin (March 2015). "Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms". arXiv: 1503.00609 [math.PR].
  6. Abbe, Emmanuel; Sandon, Colin (June 2015). "Recovering communities in the general stochastic block model without knowing the parameters". arXiv: 1506.03729 [math.PR].
  7. 1 2 Decelle, Aurelien; Krzakala, Florent; Moore, Cristopher; Zdeborová, Lenka (September 2011). "Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications". Physical Review E. 84 (6): 066106. arXiv: 1109.3041 . Bibcode:2011PhRvE..84f6106D. doi:10.1103/PhysRevE.84.066106. PMID   22304154. S2CID   15788070.
  8. 1 2 Abbe, Emmanuel; Bandeira, Afonso S.; Hall, Georgina (May 2014). "Exact Recovery in the Stochastic Block Model". arXiv: 1405.3267 [cs.SI].
  9. Krzakala, Florent; Moore, Cristopher; Mossel, Elchanan; Neeman, Joe; Sly, Allan; Lenka, Lenka; Zhang, Pan (October 2013). "Spectral redemption in clustering sparse networks". Proceedings of the National Academy of Sciences. 110 (52): 20935–20940. arXiv: 1306.5550 . Bibcode:2013PNAS..11020935K. doi: 10.1073/pnas.1312486110 . PMC   3876200 . PMID   24277835.
  10. Lei, Jing; Rinaldo, Alessandro (February 2015). "Consistency of spectral clustering in stochastic block models". The Annals of Statistics. 43 (1): 215–237. arXiv: 1312.2050 . doi:10.1214/14-AOS1274. ISSN   0090-5364. S2CID   88519551.
  11. Mossel, Elchanan; Neeman, Joe; Sly, Allan (September 2013). "Belief Propagation, Robust Reconstruction, and Optimal Recovery of Block Models". The Annals of Applied Probability. 26 (4): 2211–2256. arXiv: 1309.1380 . Bibcode:2013arXiv1309.1380M. doi:10.1214/15-AAP1145. S2CID   184446.
  12. Fathi, Reza (April 2019). "Efficient Distributed Community Detection in the Stochastic Block Model". arXiv: 1904.07494 [cs.DC].
  13. Karrer, Brian; Newman, Mark E J (2011). "Stochastic blockmodels and community structure in networks". Physical Review E. 83 (1): 016107. arXiv: 1008.3926 . Bibcode:2011PhRvE..83a6107K. doi:10.1103/PhysRevE.83.016107. PMID   21405744. S2CID   9068097. Archived from the original on 2023-02-04. Retrieved 2021-06-16.
  14. Peixoto, Tiago (2014). "Hierarchical block structures and high-resolution model selection in large networks". Physical Review X. 4 (1): 011047. arXiv: 1310.4377 . Bibcode:2014PhRvX...4a1047P. doi:10.1103/PhysRevX.4.011047. S2CID   5841379. Archived from the original on 2021-06-24. Retrieved 2021-06-16.
  15. Galhotra, Sainyam; Mazumdar, Arya; Pal, Soumyabrata; Saha, Barna (February 2018). "The Geometric Block Model". AAAI. 32. arXiv: 1709.05510 . doi:10.1609/aaai.v32i1.11905. S2CID   19152144.
  16. Airoldi, Edoardo; Blei, David; Feinberg, Stephen; Xing, Eric (May 2007). "Mixed membership stochastic blockmodels". Journal of Machine Learning Research. 9: 1981–2014. arXiv: 0705.4485 . Bibcode:2007arXiv0705.4485A. PMC   3119541 . PMID   21701698.
  17. Martin Gerlach; Tiago Peixoto; Eduardo Altmann (2018). "A network approach to topic models". Science Advances. 4 (7): eaaq1360. arXiv: 1708.01677 . Bibcode:2018SciA....4.1360G. doi:10.1126/sciadv.aaq1360. PMC   6051742 . PMID   30035215.
  18. Alyson Fox; Geoffrey Sanders; Andrew Knyazev (2018). "Investigation of Spectral Clustering for Signed Graph Matrix Representations". 2018 IEEE High Performance extreme Computing Conference (HPEC). pp. 1–7. doi:10.1109/HPEC.2018.8547575. ISBN   978-1-5386-5989-2. OSTI   1476177. S2CID   54443034.
  19. Archived 2023-02-04 at the Wayback Machine DARPA/MIT/AWS Graph Challenge
  20. Archived 2023-02-04 at the Wayback Machine DARPA/MIT/AWS Graph Challenge Champions
  21. A. J. Uppal; J. Choi; T. B. Rolinger; H. Howie Huang (2021). "Faster Stochastic Block Partition Using Aggressive Initial Merging, Compressed Representation, and Parallelism Control". 2021 IEEE High Performance Extreme Computing Conference (HPEC). pp. 1–7. doi:10.1109/HPEC49654.2021.9622836. ISBN   978-1-6654-2369-4. S2CID   244780210.
  22. David Zhuzhunashvili; Andrew Knyazev (2017). "Preconditioned spectral clustering for stochastic block partition streaming graph challenge (Preliminary version at arXiv.)". 2017 IEEE High Performance Extreme Computing Conference (HPEC). pp. 1–6. arXiv: 1708.07481 . doi:10.1109/HPEC.2017.8091045. ISBN   978-1-5386-3472-1. S2CID   19781504.
  23. Lisa Durbeck; Peter Athanas (2020). "Incremental Streaming Graph Partitioning". 2020 IEEE High Performance Extreme Computing Conference (HPEC). pp. 1–8. doi:10.1109/HPEC43674.2020.9286181. ISBN   978-1-7281-9219-2. S2CID   229376193.