SALSA algorithm

Last updated August 08, 2023

Stochastic Approach for Link-Structure Analysis (SALSA) is a web page ranking algorithm designed by R. Lempel and S. Moran to assign high scores to hub and authority web pages based on the quantity of hyperlinks among them.^[1]

Origins

SALSA is inspired by two other link-based ranking algorithms, namely HITS and PageRank, in the following ways:

like HITS, the algorithm assigns two scores to each web page: a hub score and an authority score. An authority is a page which is significantly more relevant to a given topic than other pages, whereas a hub is a page which contains many links to authorities;
like HITS, SALSA works on a focused, topic-dependent subgraph. This subgraph is obtained by first finding a set of pages most relevant to a given topic (e.g. the top-n pages returned by a text-based search algorithm) and then augmenting this set with the web pages that link directly to it and the pages that are linked directly from it. Because of this selection process, the hub and authority scores are topic-dependent;
like PageRank, the algorithm computes the scores by simulating a random walk through a Markov chain that represents the graph of web pages. SALSA, however, works with two different Markov chains: a chain of hubs and a chain of authorities. This departs from HITS, in which hubs and authorities are defined through a mutually reinforcing relationship.
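
The two chains can be illustrated with a minimal Python sketch. It assumes an unweighted link graph given as (source, target) pairs and a uniform starting distribution; the function names salsa_chains and stationary and the page labels are hypothetical, chosen only for this example. A step of the authority chain follows a random in-link backward to a hub and then a random out-link of that hub forward; the hub chain does the reverse.

```python
from collections import defaultdict


def salsa_chains(edges):
    """Return (hub_chain, authority_chain) as nested dicts of transition probabilities."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    # Authority chain: authority a -> (random in-link) -> hub h -> (random out-link) -> authority b.
    authority_chain = {a: defaultdict(float) for a in in_links}
    for a, hubs in in_links.items():
        for h in hubs:
            for b in out_links[h]:
                authority_chain[a][b] += 1.0 / (len(hubs) * len(out_links[h]))

    # Hub chain: hub h -> (random out-link) -> authority a -> (random in-link) -> hub g.
    hub_chain = {h: defaultdict(float) for h in out_links}
    for h, auths in out_links.items():
        for a in auths:
            for g in in_links[a]:
                hub_chain[h][g] += 1.0 / (len(auths) * len(in_links[a]))
    return hub_chain, authority_chain


def stationary(chain, steps=200):
    """Approximate the long-run distribution by repeated steps from a uniform start."""
    states = list(chain)
    dist = {s: 1.0 / len(states) for s in states}
    for _ in range(steps):
        nxt = dict.fromkeys(states, 0.0)
        for s, row in chain.items():
            for t, p in row.items():
                nxt[t] += dist[s] * p
        dist = nxt
    return dist


# Toy graph (hypothetical pages): a2 is linked from three hubs, a1 from one.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a2"), ("h3", "a2")]
hub_chain, authority_chain = salsa_chains(edges)
authority_scores = stationary(authority_chain)
hub_scores = stationary(hub_chain)
```

In this toy graph the authority scores settle at 0.25 for a1 and 0.75 for a2, which matches their shares of the four in-links.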

Properties

SALSA can be seen as an improvement of HITS.

It is computationally lighter since its ranking is equivalent to a weighted in/out degree ranking. The computational cost of the algorithm is a crucial factor since HITS and SALSA are computed at query time and can therefore significantly affect the response time of a search engine. This should be contrasted with query-independent algorithms like PageRank that can be computed off-line.
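
The degree-based view can be sketched without simulating any random walk. The following Python sketch rests on stated assumptions: an unweighted graph, a uniform starting distribution over authorities, and a weighting in which each authority's in-degree share within its connected component is scaled by that component's share of all authorities. The function name salsa_authority_scores and the page labels are hypothetical.

```python
from collections import defaultdict


def salsa_authority_scores(edges):
    """Authority scores computed from in-degrees and connected components only."""
    edges = set(edges)  # ignore duplicate links
    in_degree = defaultdict(int)
    neighbours = defaultdict(set)
    for src, dst in edges:
        in_degree[dst] += 1
        # Keep a page's hub role and authority role as separate bipartite nodes.
        hub, auth = ("hub", src), ("auth", dst)
        neighbours[hub].add(auth)
        neighbours[auth].add(hub)

    total_authorities = len(in_degree)
    scores, seen = {}, set()
    for start in list(neighbours):
        if start in seen:
            continue
        # Depth-first search collects one connected component of the bipartite graph.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        auths = [page for side, page in component if side == "auth"]
        links = sum(in_degree[p] for p in auths)
        for p in auths:
            # Component weight times the in-degree share inside the component.
            scores[p] = (len(auths) / total_authorities) * (in_degree[p] / links)
    return scores


# Toy graph (hypothetical pages) with two components: {a1, a2} and {a3}.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a2"), ("h3", "a2"), ("h4", "a3")]
scores = salsa_authority_scores(edges)
```

Hub scores follow symmetrically from out-degrees. The work is a single pass over the links plus a graph traversal, which is why the approach suits query-time computation.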

SALSA is less vulnerable to the Tightly Knit Community (TKC) effect than HITS. A TKC is a topological structure within the Web that consists of a small set of highly interconnected pages. The presence of TKCs in a focused subgraph is known to negatively affect the detection of meaningful authorities by HITS.
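
The TKC effect can be reproduced on a small constructed graph. The sketch below is illustrative only: the graph, the page labels and both function names are hypothetical, the HITS iteration normalizes by the sum of scores rather than the Euclidean norm (which does not change the ranking), and SALSA starts from a uniform distribution over authorities. A tightly knit community of three hubs that all link to three authorities competes with a page b_a1 that has seven in-links.

```python
from collections import defaultdict


def hits_authorities(edges, iterations=100):
    """HITS authority scores by power iteration (sum-normalized)."""
    pages = {p for edge in edges for p in edge}
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        hub = dict.fromkeys(pages, 0.0)
        for src, dst in edges:
            hub[src] += auth[dst]  # a hub collects the scores of pages it links to
        auth = dict.fromkeys(pages, 0.0)
        for src, dst in edges:
            auth[dst] += hub[src]  # an authority collects the scores of its hubs
        norm = sum(auth.values())
        auth = {p: v / norm for p, v in auth.items()}
    return auth


def salsa_authorities(edges, iterations=100):
    """SALSA authority scores by simulating the authority chain from a uniform start."""
    out_links, in_links = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    auth = {a: 1.0 / len(in_links) for a in in_links}
    for _ in range(iterations):
        hub = defaultdict(float)
        for a, hubs in in_links.items():
            for h in hubs:
                hub[h] += auth[a] / len(hubs)  # step backward along a random in-link
        nxt = dict.fromkeys(in_links, 0.0)
        for h, mass in hub.items():
            for a in out_links[h]:
                nxt[a] += mass / len(out_links[h])  # step forward along a random out-link
        auth = nxt
    return auth


# Tightly knit community: 3 hubs, each linking to all 3 authorities.
tkc = [(f"t_h{i}", f"t_a{j}") for i in range(3) for j in range(3)]
# Larger, looser community: 7 hubs link to b_a1, and one of them also links to b_a2.
big = [(f"b_h{i}", "b_a1") for i in range(7)] + [("b_h0", "b_a2")]
edges = tkc + big

hits = hits_authorities(edges)
salsa = salsa_authorities(edges)
top_hits = max(hits, key=hits.get)
top_salsa = max(salsa, key=salsa.get)
```

On this graph HITS places essentially all authority weight on the three TKC pages and drives b_a1 toward zero, even though b_a1 has the most in-links, whereas SALSA ranks b_a1 first.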

Twitter uses a SALSA-style algorithm to suggest accounts to follow.^[2]

References

[1] Wang, Ziyang. "Improved Link-Based Algorithms for Ranking Web Pages" (PDF). cs.nyu.edu. New York University, Department of Computer Science. Retrieved 7 August 2023.
[2] Gupta, Pankaj; Goel, Ashish; Lin, Jimmy; Sharma, Aneesh; Wang, Dong; Zadeh, Reza Bosagh. "WTF: The Who-to-Follow System at Twitter". Proceedings of the 22nd International Conference on World Wide Web.

Lempel, R.; Moran, S. (April 2001). "SALSA: The Stochastic Approach for Link-Structure Analysis". ACM Transactions on Information Systems. 19 (2): 131–160. CiteSeerX 10.1.1.38.5859. doi:10.1145/382979.383041. S2CID 9607841.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
