Disparity filter algorithm of weighted network

Last updated

Disparity filter [1] is a network reduction algorithm (a.k.a. graph sparsification algorithm [2] ) to extract the backbone structure of undirected weighted network. Many real world networks such as citation networks, food web, airport networks display heavy tailed statistical distribution of nodes' weight and strength. Disparity filter can sufficiently reduce the network without destroying the multi-scale nature of the network. The algorithm is developed by M. Angeles Serrano, Marian Boguna and Alessandro Vespignani.

Contents

Overview of other network reduction algorithms and their limitations

k-core decomposition

k-core decomposition is an algorithm that reduces a graph into a maximal connected subgraph of vertices with at least degree k. This algorithm can only be applied to unweighted graphs.

Minimum spanning tree

A minimum spanning tree is a tree-like subgraph of a given graph G, in which it keeps all the nodes of graph G but minimizes the total weight of the subgraph. A minimum spanning tree is the least expensive way to maintain the size of a connected component. The significant limitation of this algorithm is that it overly simplifies the structure of the network (graph). The minimum spanning tree destroys local cycles, clustering coefficients which are usually present in real networks and are considered important in network measurement.

Global weight threshold

A weighted graph can be easily reduced to a subgraph in which any of the edges' weight is larger than a given threshold wc. This technique has been applied to study the resistance of food webs [3] and functional networks that connect correlated human brain sites. [4] The shortcoming of this method is that it disregards nodes with small strength. In real networks, both strength and weight distribution in general follow heavy tailed distributions which span several degrees of magnitude. Applying a simple cutoff on weight will remove all the information below the cut-off.

Disparity filter algorithm

The null model of normalized weight distribution

In network science, the strength notated as si of a node i is defined as si = Σjwij, where wij is the weight of the link between i and j.

In order to apply the disparity filter algorithm without overlooking nodes with low strength, a normalized weight pij is defined as pij = wij/si. In the null model, the normalized weights of a certain node with degree k is generated like this: k  1 pins are randomly assigned between the interval 0 and 1. The interval is then divided into k subintervals. The length of the subinterval represents the normalized weight of each link in the null model.

Consecutively, and based on the null model, we can derive that the normalized weight distribution of a node with degree k follows . [1]

Disparity filter

The disparity filter algorithm is based on p-value [5] statistical significance test [6] of the null model: For a given normalized weight pij, the p-value αij of pij based on the null model is given by which reduces to . The meaning of αij is the probability of having normalized weight larger or equal to pij in the framework of the given null model. By setting a significance level α (between 0 and 1), for any link of normalized weight pij, if αij is larger than α, it will be filtered out. Changing α we can progressively remove irrelevant links thus effectively extracting the backbone structure of the weighted network. [1]

Generalizations

Pólya Filter

The disparity filter algorithm has been shown to be a particular case of the Pólya Filter [7] (built around the famous combinatorial scheme known as the Pólya Urn). The Pólya Filter is able to adapt the filtering procedure to the network's own heterogeneity by using a Maximum Likelihood procedure to set its free parameter , which represent the strength of the self reinforcing mechanism ruling the underlying urn scheme. Given a significance level , the Pólya Filter has been shown to produce backbones much more sparse than the disparity filter and yet able to retain the most salient [8] links of the system.

See also

Related Research Articles

<span class="mw-page-title-main">Graph theory</span> Area of discrete mathematics

In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices which are connected by edges. A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically. Graphs are one of the principal objects of study in discrete mathematics.

<span class="mw-page-title-main">Minimum spanning tree</span> Least-weight tree connecting graph vertices

A minimum spanning tree (MST) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. More generally, any edge-weighted undirected graph has a minimum spanning forest, which is a union of the minimum spanning trees for its connected components.

Kruskal's algorithm finds a minimum spanning forest of an undirected edge-weighted graph. If the graph is connected, it finds a minimum spanning tree. It is a greedy algorithm in graph theory as in each step it adds the next lowest-weight edge that will not form a cycle to the minimum spanning forest.

This is a glossary of graph theory. Graph theory is the study of graphs, systems of nodes or vertices connected in pairs by lines or edges.

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes.

<span class="mw-page-title-main">Centrality</span> Degree of connectedness within a graph

In graph theory and network analysis, indicators of centrality assign numbers or rankings to nodes within a graph corresponding to their network position. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, super-spreaders of disease, and brain networks. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

<span class="mw-page-title-main">Spectral clustering</span> Clustering methods

In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.

The image segmentation problem is concerned with partitioning an image into multiple regions according to some homogeneity criterion. This article is primarily concerned with graph theoretic approaches to image segmentation applying graph partitioning via minimum cut or maximum cut. Segmentation-based object categorization can be viewed as a specific case of spectral clustering applied to image segmentation.

Algebraic signal processing (ASP) is an emerging area of theoretical signal processing (SP). In the algebraic theory of signal processing, a set of filters is treated as an (abstract) algebra, a set of signals is treated as a module or vector space, and convolution is treated as an algebra representation. The advantage of algebraic signal processing is its generality and portability.

<span class="mw-page-title-main">Maximum cut</span> Problem of finding a maximum cut in a graph

For a graph, a maximum cut is a cut whose size is at least the size of any other cut. That is, it is a partition of the graph's vertices into two complementary sets S and T, such that the number of edges between S and T is as large as possible. Finding such a cut is known as the max-cut problem.

Capacitated minimum spanning tree is a minimal cost spanning tree of a graph that has a designated root node and satisfies the capacity constraint . The capacity constraint ensures that all subtrees incident on the root node have no more than nodes. If the tree nodes have weights, then the capacity constraint may be interpreted as follows: the sum of weights in any subtree should be no greater than . The edges connecting the subgraphs to the root node are called gates. Finding the optimal solution is NP-hard.

The random walker algorithm is an algorithm for image segmentation. In the first description of the algorithm, a user interactively labels a small number of pixels with known labels, e.g., "object" and "background". The unlabeled pixels are each imagined to release a random walker, and the probability is computed that each pixel's random walker first arrives at a seed bearing each label, i.e., if a user places K seeds, each with a different label, then it is necessary to compute, for each pixel, the probability that a random walker leaving the pixel will first arrive at each seed. These probabilities may be determined analytically by solving a system of linear equations. After computing these probabilities for each pixel, the pixel is assigned to the label for which it is most likely to send a random walker. The image is modeled as a graph, in which each pixel corresponds to a node which is connected to neighboring pixels by edges, and the edges are weighted to reflect the similarity between the pixels. Therefore, the random walk occurs on the weighted graph.

I-sites are short sequence-structure motifs that are mined from the Protein Data Bank (PDB) that correlate strongly with three-dimensional structural elements. These sequence-structure motifs are used for the local structure prediction of proteins. Local structure can be expressed as fragments or as backbone angles. Locations in the protein sequence that have high confidence I-sites predictions may be the initiation sites of folding. I-sites have also been identified as discrete models for folding pathways. I-sites consist of about 250 motifs. Each motif has an amino acid profile, a fragment structure and optionally, a 4-dimensional tensor of pairwise sequence covariance.

<span class="mw-page-title-main">Betweenness centrality</span> Measure of a graphs centrality, based on shortest paths

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through or the sum of the weights of the edges is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.

Bipartite network projection is an extensively used method for compressing information about bipartite networks. Since the one-mode projection is always less informative than the original bipartite graph, an appropriate method for weighting network connections is often required. Optimal weighting methods reflect the nature of the specific network, conform to the designer's objectives and aim at minimizing information loss.

<span class="mw-page-title-main">Hyperbolic geometric graph</span>

A hyperbolic geometric graph (HGG) or hyperbolic geometric network (HGN) is a special type of spatial network where (1) latent coordinates of nodes are sprinkled according to a probability density function into a hyperbolic space of constant negative curvature and (2) an edge between two nodes is present if they are close according to a function of the metric (typically either a Heaviside step function resulting in deterministic connections between vertices closer than a certain threshold distance, or a decaying function of hyperbolic distance yielding the connection probability). A HGG generalizes a random geometric graph (RGG) whose embedding space is Euclidean.

<span class="mw-page-title-main">Louvain method</span> Clustering and community detection algorithm

The Louvain method for community detection is a method to extract communities from large networks created by Blondel et al. from the University of Louvain. The method is a greedy optimization method that appears to run in time where is the number of nodes in the network.

Graph cut optimization is a combinatorial optimization method applicable to a family of functions of discrete variables, named after the concept of cut in the theory of flow networks. Thanks to the max-flow min-cut theorem, determining the minimum cut over a graph representing a flow network is equivalent to computing the maximum flow over the network. Given a pseudo-Boolean function , if it is possible to construct a flow network with positive weights such that

In network theory, link prediction is the problem of predicting the existence of a link between two entities in a network. Examples of link prediction include predicting friendship links among users in a social network, predicting co-authorship links in a citation network, and predicting interactions between genes and proteins in a biological network. Link prediction can also have a temporal aspect, where, given a snapshot of the set of links at time , the goal is to predict the links at time . Link prediction is widely applicable. In e-commerce, link prediction is often a subtask for recommending items to users. In the curation of citation databases, it can be used for record deduplication. In bioinformatics, it has been used to predict protein-protein interactions (PPI). It is also used to identify hidden groups of terrorists and criminals in security related applications.

<span class="mw-page-title-main">Graph neural network</span> Class of artificial neural networks

A graph neural network (GNN) is a class of artificial neural networks for processing data that can be represented as graphs.

References

  1. 1 2 3 Serrano, M.Angeles; Boguna, Marian; Vespignani, Alessandro (2009), "Extracting the multiscale backbone of complex weighted networks", Proceedings of the National Academy of Sciences , 106 (16): 6483–6488, arXiv: 0904.2389 , Bibcode:2009PNAS..106.6483S, doi: 10.1073/pnas.0808904106 , PMC   2672499 , PMID   19357301 .
  2. Coscia, Michele (2021-02-08), "The Atlas for the Aspiring Network Scientist", arXiv: 2101.00863 [cs.CY]
  3. Eguiluz, Victor M; Chialvo, Dante R; Cecchi, Guillermo A; Baliki, Marwan; Apkarian, A Vania (2005), "Scale-free brain functional networks", Physical Review Letters , 94 (1): 018102, arXiv: cond-mat/0309092 , Bibcode:2005PhRvL..94a8102E, doi:10.1103/PhysRevLett.94.018102, PMID   15698136, S2CID   7115880 .
  4. Allesina, Stefano; Bodini, Antonio; Bondavalli, Cristina (2006), "Secondary extinctions in ecological networks: bottlenecks unveiled", Ecological Modelling , 194 (1): 150–161, doi:10.1016/j.ecolmodel.2005.10.016 .
  5. Goodman, SN (1999). "Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy". Annals of Internal Medicine . 130 (12): 995–1004. doi:10.7326/0003-4819-130-12-199906150-00008. PMID   10383371. S2CID   7534212.
  6. R. A. Fisher (1925).Statistical Methods for Research Workers, Edinburgh: Oliver and Boyd, 1925, p. 43.
  7. Marcaccioli, Riccardo; Livan, Giacomo (2019-02-14). "A Pólya urn approach to information filtering in complex networks". Nature Communications. 10 (1): 745. Bibcode:2019NatCo..10..745M. doi: 10.1038/s41467-019-08667-3 . ISSN   2041-1723. PMC   6375975 . PMID   30765706.
  8. Grady, Daniel; Thiemann, Christian; Brockmann, Dirk (2012-05-29). "Robust classification of salient links in complex networks". Nature Communications. 3 (1): 864. arXiv: 1110.3864 . Bibcode:2012NatCo...3..864G. doi: 10.1038/ncomms1847 . ISSN   2041-1723. PMID   22643891.