Random surfing model

The random surfing model is a graph model that describes the probability of a random user visiting a given web page. The model tries to predict the chance that a random internet surfer arrives at a page either by following a link or by accessing the site directly, for example by typing the website's URL into the address bar. The model also assumes that every user surfing the internet eventually stops following links and switches to a completely different site. The model resembles a Markov chain in which the states are the web pages the user lands on and the transitions are the links between these pages, each followed with equal probability.

Description

Figure: hyperlink diagram showing navigation through hyperlinks, after directly arriving at a site's home page.

A user navigates the internet in two primary ways: the user may access a site directly, by entering the site's URL or clicking a bookmark, or the user may follow a series of hyperlinks to reach the desired page. The random surfer model assumes that the next link the user follows is chosen at random. The model also assumes that the number of successive links followed is finite – the user will at some point lose interest and leave the current site for a completely new one. [1]

The random surfer model is presented as a series of nodes which represent web pages that users can access at random. A new node is added to the graph when a new website is published. Movement between the graph's nodes is modeled by choosing a start node at random and then performing a short random traversal of the nodes, or random walk. This traversal is analogous to a user accessing a website and then following hyperlinks a number of times, until the user either exits the page or leaves for another site entirely. Connections to other nodes in the graph are formed when outbound links are placed on a page.
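
To make the traversal concrete, a single surfer session can be simulated as a short random walk that, at each page, either follows a random outgoing link or stops. The sketch below is illustrative only; the adjacency-list representation, the continue probability p, and the names surf_session and web are assumptions, not taken from the sources.

```python
import random

def surf_session(graph, p=0.7, rng=random):
    """Simulate one surfer session on a web graph given as {page: [linked pages]}.

    p is the assumed probability of following another link instead of leaving;
    the walk also stops if the current page has no outgoing links.
    """
    page = rng.choice(list(graph))       # arrive directly (typed URL or bookmark)
    visited = [page]
    while graph.get(page) and rng.random() < p:
        page = rng.choice(graph[page])   # follow an outgoing link chosen at random
        visited.append(page)
    return visited

# Hypothetical four-page site used only for illustration.
web = {"home": ["about", "blog"], "about": ["home"], "blog": ["home", "about"], "contact": []}
print(surf_session(web))
```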

Graph definitions

In the random surfing model, webgraphs are presented as a sequence of directed graphs G_1, G_2, …, G_t, …, where graph G_t contains the vertices and edges present after step t. The process of defining the graphs is parameterized with a probability p, and we let q = 1 − p. [2]

Nodes of the model arrive one at a time, forming connections to the existing graph G_{t−1}. In some models, connections represent directed edges, and in others, connections represent undirected edges. Models start with a single node v_1, which has a self-loop. The vertex added in the t-th step is denoted v_t, and n denotes the total number of vertices. [1]

Model 1. (1-step walk with self-loop)

At time t, the new vertex v_t makes k connections by k iterations of the following steps:

  1. Pick an existing node u uniformly at random from v_1, …, v_{t−1}
  2. With probability p stay at u; with probability q take a 1-step walk to a random neighbor of u
  3. Add an edge from v_t to the current node

For directed graphs, edges added are directed from v_t into the existing graph. Edges are undirected in the respective undirected graphs.
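
A minimal sketch of Model 1 for the undirected case is given below; the parameters steps, k, and p and the adjacency-list representation are illustrative assumptions, and only the initial vertex is given a self-loop.

```python
import random

def model1(steps, k=1, p=0.5, rng=random):
    """Grow an undirected graph by the 1-step-walk-with-self-loop rule (Model 1).

    Returns an adjacency list {vertex: [neighbors]}; vertex 0 is the initial
    node with a self-loop, and vertices 1..steps are added one at a time.
    """
    adj = {0: [0]}                        # single starting vertex with a self-loop
    for t in range(1, steps + 1):
        adj[t] = []
        for _ in range(k):
            u = rng.randrange(t)          # step 1: uniform choice among existing nodes
            if rng.random() >= p:         # step 2: with probability q = 1 - p ...
                u = rng.choice(adj[u])    # ... take a 1-step walk to a random neighbor
            adj[t].append(u)              # step 3: connect the new vertex to the current node
            adj[u].append(t)
    return adj

print(model1(steps=5))
```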

Model 2. (Random walks with coin flips)

At time t, the new vertex v_t makes k connections by k iterations of the following steps:

  1. Pick an existing node u uniformly at random from v_1, …, v_{t−1}
  2. Flip a coin of bias p
  3. If the coin comes up heads, add an edge from v_t to the current node and stop
  4. If the coin comes up tails, move to a random neighbor of the current node and go back to step 2

For directed graphs, edges added are directed from v_t into the existing graph. Edges are undirected in the respective undirected graphs.
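
Under the same assumptions as the previous sketch, Model 2 replaces the single optional step with a walk of geometric length that ends on the first heads of a coin with bias p.

```python
import random

def model2(steps, k=1, p=0.5, rng=random):
    """Grow an undirected graph by the random-walk-with-coin-flips rule (Model 2)."""
    adj = {0: [0]}                        # single starting vertex with a self-loop
    for t in range(1, steps + 1):
        adj[t] = []
        for _ in range(k):
            u = rng.randrange(t)          # step 1: uniform choice among existing nodes
            while rng.random() >= p:      # steps 2 and 4: tails, keep walking
                u = rng.choice(adj[u])    # move to a random neighbor and flip again
            adj[t].append(u)              # step 3: heads, connect to the current node
            adj[u].append(t)
    return adj

print(model2(steps=5))
```

The two sketches differ only in how far the walk from u may go: Model 1 takes at most one step, while Model 2 keeps walking until the biased coin comes up heads.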

Limitations

There are some caveats to the standard random surfer model, one of which is that the model ignores the content of the pages users select, since it assumes links are chosen at random. Because users tend to have a goal in mind when surfing the internet, the content of the linked pages is a determining factor in whether or not the user will click a link. [1] [2]

Application

The random surfer model's assumption of random jumps, combined with normalized eigenvector centrality, forms the foundation of Google's PageRank algorithm. [2] [3]
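
As an illustration of that connection, the sketch below runs the standard power-iteration form of PageRank, in which a damping factor d plays the role of the probability of following a link rather than jumping to a random page; the example graph, function name, and parameter values are assumptions made for this illustration.

```python
def pagerank(graph, d=0.85, iterations=50):
    """Power-iteration PageRank for a directed graph given as {page: [outlinks]}."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - d) / n for page in pages}    # random-jump term
        for page, outlinks in graph.items():
            if outlinks:
                share = d * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share                 # rank passed along links
            else:
                for target in pages:                          # dangling page: spread evenly
                    new_rank[target] += d * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site reused for illustration.
web = {"home": ["about", "blog"], "about": ["home"], "blog": ["home", "about"], "contact": []}
print(pagerank(web))
```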

See also

  * PageRank
  * Random graph
  * Scale-free network
  * Erdős–Rényi model
  * Copying network models

References

  1. Blum, Avrim; Chan, T-H. Hubert; Rwebangira, Mugizi Robert (21 January 2006). "A Random-Surfer Web-Graph Model" (PDF). ANALCO '06: Proceedings of the Meeting on Analytic Algorithmics and Combinatorics. Society for Industrial and Applied Mathematics: 238–246.
  2. Chebolu, Prasad; Melsted, Páll (1 January 2008). "PageRank and the random surfer model" (PDF). Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms: 1010–1018.
  3. Zaki, Mohammed J.; Meira, Jr., Wagner (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. ISBN 9780521766333.