Ranking (information retrieval)

Last updated

Ranking of query is one of the fundamental problems in information retrieval (IR), [1] the scientific/engineering discipline behind search engines. [2] Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems. [3] A majority of search engines use ranking algorithms to provide users with accurate and relevant results. [4]

Contents

History

The notion of page rank dates back to the 1940s and the idea originated in the field of economics. In 1941, Wassily Leontief developed an iterative method of valuing a country's sector based on the importance of other sectors that supplied resources to it. In 1965, Charles H Hubbell at the University of California, Santa Barbara, published a technique for determining the importance of individuals based on the importance of the people who endorse them. [5]

Gabriel Pinski and Francis Narin came up with an approach to rank journals. [6] Their rule was that a journal is important if it is cited by other important journals. Jon Kleinberg, a computer scientist at Cornell University, developed an almost identical approach to PageRank which was called Hypertext Induced Topic Search or HITS and it treated web pages as "hubs" and "authorities".

Google’s PageRank algorithm was developed in 1998 by Google’s founders Sergey Brin and Larry Page and it is a key part of Google’s method of ranking web pages in search results. [7] All the above methods are somewhat similar as all of them exploit the structure of links and require an iterative approach. [8]

Ranking models

Ranking functions are evaluated by a variety of means; one of the simplest is determining the precision of the first k top-ranked results for some fixed k; for example, the proportion of the top 10 results that are relevant, on average over many queries.

IR models can be broadly divided into three types: Boolean models or BIR, Vector Space Models, and Probabilistic Models. [9] Various comparisons between retrieval models can be found in the literature (e.g., [10] ).

Boolean Models

Boolean Model or BIR is a simple baseline query model where each query follows the underlying principles of relational algebra with algebraic expressions and where documents are not fetched unless they completely match with each other. Since the query is either fetch the document (1) or doesn’t fetch the document (0), there is no methodology to rank them.

Vector Space Model

Since the Boolean Model only fetches complete matches, it doesn’t address the problem of the documents being partially matched. The Vector Space Model solves this problem by introducing vectors of index items each assigned with weights. The weights are ranged from positive (if matched completely or to some extent) to negative (if unmatched or completely oppositely matched) if documents are present. Term Frequency - Inverse Document Frequency (tf-idf) is one of the most popular techniques where weights are terms (e.g. words, keywords, phrases etc.) and dimensions is number of words inside corpus.

The similarity score between query and document can be found by calculating cosine value between query weight vector and document weight vector using cosine similarity. Desired documents can be fetched by ranking them according to similarity score and fetched top k documents which has the highest scores or most relevant to query vector.

Probabilistic Model

In probabilistic model, probability theory has been used as a principal means for modeling the retrieval process in mathematical terms. The probability model of information retrieval was introduced by Maron and Kuhns in 1960 and further developed by Roberston and other researchers. According to Spack Jones and Willett (1997): The rationale for introducing probabilistic concepts is obvious: IR systems deal with natural language, and this is too far imprecise to enable a system to state with certainty which document will be relevant to a particular query.

The model applies the theory of probability to information retrieval (An event has a possibility from 0 percent to 100 percent of occurring). i.e, in probability model, relevance is expressed in terms of probability. Here, documents are ranked in order of decreasing probability of relevance. It takes into the consideration of uncertainty element in the IR process. i.e., uncertainty about whether documents retrieved by the system are relevant to a given query.

The probability model intends to estimate and calculate the probability that a document will be relevant to a given query based on some methods. The “event” in this context of information retrieval refers to the probability of relevance between a query and a document. Unlike other IR models, the probability model does not treat relevance as an exact miss-or-match measurement.

The model adopts various methods to determine the probability of relevance between queries and documents. Relevance in the probability model is judged according to the similarity between queries and documents. The similarity judgment is further dependent on term frequency.

Thus, for a query consisting of only one term (B), the probability that a particular document (Dm) will be judged relevant is the ratio of users who submit query term (B) and consider the document (Dm) to be relevant in relation to the number of users who submitted the term (B). As represented in Maron’s and Kuhn’s model, can be represented as the probability that users submitting a particular query term (B) will judge an individual document (Dm) to be relevant.

According to Gerard Salton and Michael J. McGill, the essence of this model is that if estimates for the probability of occurrence of various terms in relevant documents can be calculated, then the probabilities that a document will be retrieved, given that it is relevant, or that it is not, can be estimated. [11]

Several experiments have shown that the probabilistic model can yield good results. However, such results have not been sufficiently better than those obtained using the Boolean or Vector Space model. [12] [13]

Evaluation Measures

The most common measures of evaluation are precision, recall, and f-score. They are computed using unordered sets of documents. These measures must be extended, or new measures must be defined, in order to evaluate the ranked retrieval results that are standard in modern search engines. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve. [14]

Precision

Precision measures the exactness of the retrieval process. If the actual set of relevant documents is denoted by I and the retrieved set of documents is denoted by O, then the precision is given by:

Recall

Recall is a measure of completeness of the IR process. If the actual set of relevant documents is denoted by I and the retrieved set of documents is denoted by O, then the recall is given by:

F1 Score

F1 Score tries to combine the precision and recall measure. It is the harmonic mean of the two. If P is the precision and R is the recall then the F-Score is given by:

Page Rank Algorithm

The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on the links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value. The formulae is given below:

i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set Bu (the set containing all pages linking to page u), divided by the amount L(v) of links from page v.

HITS Algorithm

Similar to PageRank, HITS uses Link Analysis for analyzing the relevance of the pages but only works on small sets of subgraph (rather than entire web graph) and as well as being query dependent. The subgraphs are ranked according to weights in hubs and authorities where pages that rank highest are fetched and displayed. [15]

See also

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

In general computing, a search engine is an information retrieval system designed to help find information stored on a computer system. It is an information retrieval software program that discovers, crawls, transforms, and stores information for retrieval and presentation in response to user queries. The search results are usually presented in a list and are commonly called hits. A search engine normally consists of four components, as follows: a search interface, a crawler, an indexer, and a database. The crawler traverses a document collection, deconstructs document text, and assigns surrogates for storage in the search engine index. Online search engines store images, link data and metadata for the document as well.

Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

<span class="mw-page-title-main">Text Retrieval Conference</span> Meetings for information retrieval research

The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity, and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology.

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. It was often used as a weighting factor in searches of information retrieval, text mining, and user modeling. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries used tf–idf.

The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one. The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the boolean conditions described by the query.

Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not those results are relevant to perform a new query. We can usefully distinguish between three types of feedback: explicit feedback, implicit feedback, and blind or "pseudo" feedback.

Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input and expanding the search query to match additional documents. Query expansion involves techniques such as:

Human–computer information retrieval (HCIR) is the study and engineering of information retrieval techniques that bring human intelligence into the search process. It combines the fields of human-computer interaction (HCI) and information retrieval (IR) and creates systems that improve search by taking into account the human context, or through a multi-step search process that provides the opportunity for human feedback.

Discounted cumulative gain (DCG) is a measure of ranking quality which is often presented in its normalized and more easily interpretable form, known as Normalized DCG (nDCG or NDCG). In information retrieval, NDCG is often used to measure effectiveness of search engine algorithms and related applications. Using a graded relevance scale of documents in a search-engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. To compute NDCG, the gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks. The normalized version corrects for the fact that different queries may have different numbers of relevant results.

Vector space model or term vector model is an algebraic model for representing text documents as vectors such that the distance between vectors represents the relevance between the documents. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data may, for example, consist of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data.

The Rocchio algorithm is based on a method of relevance feedback found in information retrieval systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio algorithm was developed using the vector space model. Its underlying assumption is that most users have a general conception of which documents should be denoted as relevant or irrelevant. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the search engine's recall, and possibly the precision as well. The number of relevant and irrelevant documents allowed to enter a query is dictated by the weights of the a, b, c variables listed below in the Algorithm section.

The Extended Boolean model was described in a Communications of the ACM article appearing in 1983, by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the Extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in information retrieval. The Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the Vector Space Model with the properties of Boolean algebra and ranks the similarity between queries and documents. This way a document may be somewhat relevant if it matches some of the queried terms and will be returned as a result, whereas in the Standard Boolean model it wasn't.

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity probable and feasible.

The probabilistic relevance model was devised by Stephen E. Robertson and Karen Spärck Jones as a framework for probabilistic models to come. It is a formalism of information retrieval useful to derive ranking functions used by search engines and web search engines in order to rank matching documents according to their relevance to a given search query.

In machine learning, a ranking SVM is a variant of the support vector machine algorithm, which is used to solve certain ranking problems. The ranking SVM algorithm was published by Thorsten Joachims in 2002. The original purpose of the algorithm was to improve the performance of an internet search engine. However, it was found that ranking SVM also can be used to solve other problems such as Rank SIFT.

Evaluation measures for an information retrieval (IR) system assess how well an index, search engine or database returns results from a collection of resources that satisfy a user's query. They are therefore fundamental to the success of information systems and digital platforms. The success of an IR system may be judged by a range of criteria including relevance, speed, user satisfaction, usability, efficiency and reliability. However, the most important factor in determining a system's effectiveness for users is the overall relevance of results retrieved in response to a query. Evaluation measures may be categorised in various ways including offline or online, user-based or system-based and include methods such as observed user behaviour, test collections, precision and recall, and scores from prepared benchmark test sets.

References

  1. Piccoli, Gabriele; Pigni, Federico (July 2018). Information systems for managers: with cases (Edition 4.0 ed.). Prospect Press. p. 28. ISBN   978-1-943153-50-3 . Retrieved 25 November 2018.
  2. Mogotsi, I. C. "Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to information retrieval: Cambridge University Press, Cambridge, England, 2008, 482 pp, ISBN: 978-0-521-86571-5". Information Retrieval. 13 (2): 192–195. doi:10.1007/s10791-009-9115-y. ISSN   1386-4564. S2CID   31674042.
  3. "What is Information Retrieval?". GeeksforGeeks. 2020-07-02. Retrieved 2022-03-02.
  4. "Google's Search Algorithm and Ranking System - Google Search". www.google.com. Retrieved 2022-03-02.
  5. "Scientist Finds PageRank-Type Algorithm from the 1940s". MIT Technology Review. Retrieved 2022-03-02.
  6. Pinski, Gabriel; Narin, Francis (1976). "Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics". Information Processing & Management. 12 (5): 297–312. doi:10.1016/0306-4573(76)90048-0.
  7. "What are SERP Features?". www.accuranker.com. 2019-03-28. Retrieved 2022-03-02.
  8. Franceschet, Massimo (17 February 2010). "Scientist Finds PageRank-Type Algorithm from the 1940s". www.technologyreview.com.
  9. Datta, Joydip (16 April 2010). "Ranking in Information Retrieval" (PDF). Department of Computer Science and Engineering, Indian Institute of Technology. p. 7. Retrieved 25 April 2019.
  10. Turtle, Howard R.; Croft, W.Bruce (1992). "A comparison of text retrieval models". The Computer Journal. OUP. 35 (3): 279–290. doi: 10.1093/comjnl/35.3.279 .
  11. Harter, Stephen P. (1984-07-01). "Introduction to modem information retrieval (Gerard Salton and Michael J. McGill)". Education for Information. 2 (3): 237–238. doi:10.3233/EFI-1984-2307.
  12. Chu, H. Information Representation and Retrieval in the Digital Age. New Delhi: Ess Ess Publication.
  13. G.G.Choudhary. Introduction to Modern Information Retrieval. Facet Publishing.
  14. Manning, Christopher; Raghavan, Prabhakar; Schutze, Hinrich. Evaluation of ranked retrieval results. Cambridge University Press.
  15. Tanase, Racula; Radu, Remus (16 April 2010). "Lecture #4: HITS Algorithm - Hubs and Authorities on the Internet".