Degree-preserving randomization

Degree-preserving randomization is a technique used in network science to assess whether variations observed in a given graph could simply be an artifact of the graph's inherent structural properties, in particular its degree distribution, rather than of properties unique to the nodes of the observed network.

Background

Cataloged as early as 1996, [1] the simplest implementation of degree-preserving randomization relies on a Markov chain Monte Carlo algorithm that rearranges, or "rewires", the network's edges at random in such a way that the degree distribution of the rewired network remains identical to that of the original network, even though the topological structure of the network becomes completely distinct from the original.

Figure: A demonstration of a single iteration of the degree-preserving randomization rewiring step.

Degree-preserving randomization, while it has many different forms, typically takes a relatively simple approach: for any network consisting of N nodes and E edges, select two dyadically tied pairs of nodes, that is, two edges. Swap the endpoints of these two edges so that the original dyadic pairs are mismatched, rejecting any swap that would introduce a self-loop or a duplicate edge. After a sufficient number of these swaps, the network increasingly loses its original observed topology while every node retains its original degree.
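
This edge-swap step is straightforward to sketch in code. The following is a minimal illustration in Python using the networkx library; it is not taken from any of the cited implementations, and the function name rewire_once and the example graph are illustrative choices.

```python
# Minimal sketch of one degree-preserving rewire ("edge swap") step.
# Illustrative only; not the code used in the cited work.
import random
import networkx as nx

def rewire_once(G: nx.Graph, rng: random.Random) -> bool:
    """Attempt a single swap; return True if the swap was performed."""
    (a, b), (c, d) = rng.sample(list(G.edges()), 2)  # two distinct edges
    # Reject swaps that would create self-loops or duplicate edges,
    # since those would alter the degree sequence of the simple graph.
    if len({a, b, c, d}) < 4 or G.has_edge(a, d) or G.has_edge(c, b):
        return False
    G.remove_edge(a, b)
    G.remove_edge(c, d)
    G.add_edge(a, d)  # the original dyads are now mismatched
    G.add_edge(c, b)
    return True

# Each node's degree is unchanged no matter how many swaps are attempted.
rng = random.Random(42)
G = nx.erdos_renyi_graph(100, 0.05, seed=1)
degrees_before = dict(G.degree())
for _ in range(10_000):
    rewire_once(G, rng)
assert dict(G.degree()) == degrees_before
```

For undirected graphs, networkx also provides a built-in double_edge_swap function that performs this same kind of swap.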

As is common with algorithms based on Markov chains, the number of iterations, or individual rewires, that must occur on a given graph such that the graph is sufficiently random and distinct from the original graph is unknown, though Espinoza [2] asserts that a safe minimum threshold is Q·E rewires, where E is the number of edges and Q "is at least 100" (Espinoza). Others have provided input for this issue, including one author who states that a safe minimum may instead be at least (E/2)·ln(1/ε) rewires, where ε lies in a range between 10⁻⁷ and 10⁻⁶, though ultimately the correct number is not presently known. [3] [4]
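
To make these thresholds concrete, the short calculation below evaluates both rules for a graph with E = 9,824 edges, the size of the Listserv network used in the example below; the specific choices Q = 100 and ε = 10⁻⁶ are assumptions drawn from the ranges quoted above.

```python
# Illustrative calculation of the two rewire-count heuristics quoted above.
import math

E = 9_824  # number of edges in the observed graph (see the example below)

# Espinoza's rule of thumb: at least Q * E attempted rewires, with Q >= 100.
espinoza_minimum = 100 * E                        # 982,400 rewires

# Alternative heuristic: at least (E / 2) * ln(1 / epsilon) rewires.
epsilon = 1e-6                                    # assumed; range is 1e-7 to 1e-6
alternative_minimum = (E / 2) * math.log(1 / epsilon)

print(espinoza_minimum)      # 982400
print(alternative_minimum)   # ~67861.8, i.e. roughly 67,861 rewires
```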

Uses

There are several cases in which published research has explicitly employed degree-preserving randomization in order to analyze network properties. Dekker [5] used rewiring in order to more accurately model observed social networks by adding a secondary parameter that introduces a high-degree attachment bias. Liu et al. [6] have additionally employed degree-preserving randomization to argue that Control Centrality, a metric they identify, changes little in their simulations when compared to the Control Centrality of an Erdős–Rényi model containing the same number of nodes; Liu et al. have also used degree-preserving randomization models in subsequent work exploring network controllability. [7]

Additionally, some work has investigated how degree-preserving randomization may be used to address concerns about anonymity in networked data research, which has been shown to be a cause for concern in social network analysis, as in the case of a study by Lewis et al. [8] [9] Ultimately the work conducted by Ying and Wu, which starts from a foundation of degree-preserving randomization and then proposes several modifications, has shown moderate advances in protecting anonymity without compromising the underlying utility of the observed network. [10]

Additionally, the method is similar in nature to the broadly used exponential random graph models popularized in social science, [11] [12] and indeed to the various other approaches that compare modeled networks against observed networks in order to identify and theorize about the differences expressed in real networks. Importantly, degree-preserving randomization provides a simple algorithmic design that lets those familiar with programming apply a null model to an available observed network.

Example

What follows is a small example showing how one may apply degree-preserving randomization to an observed network in an effort to compare that network against otherwise random variation while maintaining its degree distribution. The Association of Internet Researchers has a Listserv that constitutes the majority of discussion threads surrounding their work. On it, members post updates about their own research, upcoming conferences, and calls for papers, and also engage one another in substantive discussions in their field. These emails can in turn constitute a directed and temporal network graph, where nodes are individual e-mail accounts belonging to the Listserv and edges are cases in which one e-mail address responds to another e-mail address on the Listserv.
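
A hedged sketch of how such a reply network might be assembled in Python with networkx is shown below; the reply_pairs data and addresses are hypothetical placeholders, since the underlying Listserv data is not reproduced here.

```python
# Building a directed reply graph: nodes are e-mail accounts, and an edge
# u -> v records that account u responded to account v on the Listserv.
# The addresses below are hypothetical placeholders.
import networkx as nx

reply_pairs = [
    ("alice@example.org", "bob@example.org"),
    ("carol@example.org", "alice@example.org"),
    ("bob@example.org", "alice@example.org"),
]

G = nx.DiGraph()
G.add_edges_from(reply_pairs)   # nodes are created implicitly

print(G.number_of_nodes(), G.number_of_edges())   # 3 3
```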

Figure: Results from degree-preserving randomization trials, showing the distribution of reciprocity and average path length, observed versus trial.

In this observed network, the properties of the Listserv are relatively simple to calculate: for the network of 3,235 individual e-mail accounts and 9,824 exchanges in total, the observed reciprocity of the network is about 0.074, and the average path length is about 4.46. Could these values be arrived at simply through the nature of the network's inherent structure?
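
Both statistics can be computed directly with networkx, as in the sketch below. The original analysis does not state how disconnected portions of the directed graph were handled, so restricting the path-length calculation to the largest strongly connected component is an assumption made here purely for illustration.

```python
# Illustrative computation of reciprocity and average path length.
import networkx as nx

def summarize(G: nx.DiGraph) -> tuple[float, float]:
    reciprocity = nx.reciprocity(G)  # fraction of ties that are mutual
    # Average shortest path length is only defined on a (strongly)
    # connected graph, so use the largest strongly connected component.
    largest_scc = max(nx.strongly_connected_components(G), key=len)
    apl = nx.average_shortest_path_length(G.subgraph(largest_scc))
    return reciprocity, apl

# For the observed Listserv graph, the reported values were a reciprocity
# of ~0.074 and an average path length of ~4.46.
```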

Applying the (E/2)·ln(1/ε) rule with ε = 10⁻⁶, this network would require around 67,861 individual edge rewires to construct a likely sufficiently random degree-preserved graph. If we construct many random, degree-preserving graphs from the real graph, we can then create a probability space for characteristics such as reciprocity and average path length, and assess the degree to which the network could have expressed these characteristics at random. In this case, 534 networks were generated using degree-preserving randomization. As both reciprocity and average path length across these randomized trials are approximately normally distributed, and as the standard deviations of both distributions are far too narrow for them to include the observed values, we can reasonably posit that this network is expressing characteristics that are non-random (and thus open for further theory and modeling).
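
The overall null-model procedure can be sketched as follows. The helper functions below are illustrative rather than the code used in the original analysis: randomize_directed performs the directed analogue of the swap shown earlier (preserving every node's in- and out-degree), and the comparison uses z-scores as one simple way to judge how far the observed value sits from the randomized distribution.

```python
# Illustrative null-model comparison via degree-preserving randomization.
import random
import networkx as nx

def randomize_directed(G: nx.DiGraph, n_swaps: int, rng: random.Random) -> nx.DiGraph:
    """Return a copy of G rewired by n_swaps directed edge swaps."""
    H = G.copy()
    performed = attempts = 0
    while performed < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        (a, b), (c, d) = rng.sample(list(H.edges()), 2)
        # Rewire a->b, c->d into a->d, c->b unless that would create a
        # self-loop or a duplicate edge; in- and out-degrees are unchanged.
        if a == d or c == b or H.has_edge(a, d) or H.has_edge(c, b):
            continue
        H.remove_edge(a, b)
        H.remove_edge(c, d)
        H.add_edge(a, d)
        H.add_edge(c, b)
        performed += 1
    return H

def null_distribution(G, n_trials, n_swaps, statistic, rng):
    """Compute `statistic` on n_trials independent randomizations of G."""
    return [statistic(randomize_directed(G, n_swaps, rng)) for _ in range(n_trials)]

# Usage sketch, mirroring the trial described above (534 graphs, ~67,861 swaps):
# rng = random.Random(0)
# trials = null_distribution(G_observed, 534, 67_861, nx.reciprocity, rng)
# mean = sum(trials) / len(trials)
# sd = (sum((t - mean) ** 2 for t in trials) / (len(trials) - 1)) ** 0.5
# z = (nx.reciprocity(G_observed) - mean) / sd  # large |z| suggests non-randomness
```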

References

  1. Rao, A Ramachandra; Jana, Rabindranath; Bandyopadhyay, Suraj (1996). "A Markov chain Monte Carlo method for generating random (0, 1)-matrices with given marginals" (PDF). Indian Journal of Statistics Series A. Retrieved November 5, 2014.
  2. Espinoza, Max. "On Network Randomization Methods: A Negative Control Study" (PDF). Archived from the original (PDF) on 2016-03-04. Retrieved 2014-11-06.
  3. Re: [igraph] Degree-preserving rewiring of a large graph
  4. Pinar, Ali; Ray, Jaideep; Seshadri, S. (2012), Are we there yet? When to stop a Markov chain while generating random graphs (PDF), arXiv: 1202.3473 , Bibcode:2012arXiv1202.3473R
  5. Dekker, A.H. (2007), "Realistic Social Networks for Simulation using Network Rewiring" (PDF), Proceedings MODSIM 2007
  6. Liu, Y-Y.; Slotine, J-J; Barabási, A-L (2012), "Control Centrality and Hierarchical Structure in Complex Networks", PLOS ONE, 7 (9): e44459, arXiv: 1203.2655 , Bibcode:2012PLoSO...744459L, doi: 10.1371/journal.pone.0044459 , PMC   3459977 , PMID   23028542
  7. Liu, Yang-Yu; Slotine, Jean-Jacques; Barabási, Albert-Laszlo (2013), "Effect of correlations on network controllability", Sci. Rep., 3: 1067, arXiv: 1203.5161 , Bibcode:2013NatSR...3E1067P, doi:10.1038/srep01067, PMC   3545232 , PMID   23323210
  8. Parry, Marc (July 10, 2011), "Harvard Researchers Accused of Breaching Students' Privacy", The Chronicle of Higher Education, retrieved November 5, 2014
  9. Lewis, Kevin; Kaufman, Jason; Gonzalez, Marco; Wimmer, Andreas; Christakis, Nicholas (2008), "Tastes, ties, and time: A new social network dataset using Facebook.com" (PDF), Social Networks, 30 (4): 330–342, CiteSeerX   10.1.1.158.9087 , doi:10.1016/j.socnet.2008.07.002
  10. Ying, Xiaowei; Wu, Xintao (2008), "Randomizing Social Networks: A Spectrum Preserving Approach", Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 739–750, CiteSeerX   10.1.1.140.6647 , doi:10.1137/1.9781611972788.67, ISBN   978-0-89871-654-2
  11. Snijders, Tom AB. (2002), "Markov chain Monte Carlo estimation of exponential random graph models", Journal of Social Structure, 3 (2): 1–40
  12. Robins, Garry; Pattison, Pip; Kalish, Yuval; Lusher, Dean (2007), "An introduction to exponential random graph models for social networks", Social Networks, 29 (2): 173–191, doi:10.1016/j.socnet.2006.08.002, hdl: 1959.3/216571