SMART Information Retrieval System

The SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s. [1] Many important concepts in information retrieval were developed as part of research on the SMART system, including the vector space model, relevance feedback, and Rocchio classification.

Gerard Salton led the group that developed SMART. Other contributors included Mike Lesk.

The SMART system also provides a set of corpora, queries, and reference rankings, drawn from different subject areas.

The legacy of the SMART system includes the so-called SMART triple notation, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. A combination of weights is written in the form ddd.qqq, where the first three letters denote the term weighting applied to the collection's document vectors and the last three letters denote the term weighting applied to the query vector. For example, ltc.lnn means that ltc weighting is applied to the collection documents and lnn weighting is applied to the query.

The following tables establish the SMART notation: [2]

Symbols and notation

A document is represented as a vector of term weights, one weight per unique term in the document. Positive weights characterize terms that are present in a document, and a weight of zero is used for terms that are absent from a document. The weighting schemes below are defined in terms of the following quantities:

- the occurrence frequency of a term in a document
- the number of unique terms in a document
- the number of documents in the collection
- the average number of unique terms in a document
- the number of documents in which a term is present
- the number of characters in a document
- the occurrence frequency of the most common term in a document
- the average number of characters in a document
- the average occurrence frequency of a term in a document
- the slope used in pivoted document length normalization [3]

The number of documents in the collection and the collection-wide averages are global collection statistics.
SMART term-weighting triple notation

Term frequency:
- b: Binary weight
- t, n: Raw term frequency
- a: Augmented normalized term frequency
- l: Logarithm
- L: Average-term-frequency-based normalization [3]
- d: Double logarithm

Document frequency:
- x, n: Disregards the collection frequency
- f: Inverse collection frequency
- t: Inverse collection frequency
- p: Probabilistic inverse collection frequency

Document length normalization:
- x, n: No document length normalization
- c: Cosine normalization
- u: Pivoted unique normalization [3]
- b: Pivoted character length normalization [3]

The notation mixes two letter schemes: the one used by Salton and Buckley in their 1988 paper [4], and the one used in experiments reported thereafter. Where an entry above lists two letters, the first belongs to the 1988 scheme and the second to the later scheme.
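
To make the notation concrete, the following sketch computes an ltc-weighted document vector and an lnn-weighted query vector and scores them with an inner product. It assumes the common reading of the letters (l: logarithmic term frequency, t: inverse collection frequency, c: cosine normalization, n: none); the function names are illustrative and not part of SMART itself.

```python
import math
from collections import Counter

def ltc_vector(doc_terms, df, n_docs):
    """ltc: logarithmic tf, inverse document frequency, cosine normalization.
    doc_terms: list of tokens in one document; df: term -> document frequency."""
    tf = Counter(doc_terms)
    w = {t: (1 + math.log(f)) * math.log(n_docs / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def lnn_vector(query_terms):
    """lnn: logarithmic tf, no collection frequency weighting, no normalization."""
    tf = Counter(query_terms)
    return {t: 1 + math.log(f) for t, f in tf.items()}

def ltc_lnn_score(doc_vec, query_vec):
    """Inner product of the two weighted vectors (higher means more similar)."""
    return sum(w * query_vec.get(t, 0.0) for t, w in doc_vec.items())
```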

Related Research Articles

Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Gerard A. "Gerry" Salton was a professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time, and has been called "the father of Information Retrieval". His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard. It was the very first system to use the now popular vector space model for Information Retrieval.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
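
As a rough sketch of the procedure described above (NumPy assumed; the tiny count matrix is invented purely for illustration), one can reduce a word-by-document count matrix with a truncated SVD and compare two documents by cosine similarity in the reduced space:

```python
import numpy as np

# Toy word-by-document count matrix: rows are words, columns are documents.
counts = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
], dtype=float)

# Truncated SVD: keep only the k largest singular values/components.
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional row per document

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Values near 1 mean the two documents are similar in the latent space.
print(cosine(doc_vectors[0], doc_vectors[2]))
```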

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix where "features" may refer to other properties of a document besides terms. It is also common to encounter the transpose, or term-document matrix where documents are the columns and terms are the rows. They are useful in the field of natural language processing and computational text analysis.
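
A minimal sketch of building such a matrix from already-tokenized documents (plain Python; the toy documents are invented for illustration):

```python
from collections import Counter

docs = [
    ["smart", "retrieval", "system"],
    ["vector", "space", "model", "retrieval"],
    ["relevance", "feedback", "retrieval", "system"],
]

# One column per distinct term in the collection, in a fixed order.
vocab = sorted({term for doc in docs for term in doc})

# One row per document; each cell holds the raw count of that term in that document.
matrix = [[Counter(doc)[term] for term in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```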

In the field of information retrieval, divergence from randomness (DFR), one of the first models of its kind, is a type of probabilistic model. It is used to measure the amount of information carried by terms in documents. It is based on Harter's 2-Poisson indexing model, which hypothesizes that each informative term has an "elite" set of documents in which it occurs relatively more frequently than in the rest of the collection. Strictly speaking, DFR is not a single model but a framework for weighting terms using probabilistic methods, and its term weighting is based on this notion of eliteness.

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf has been one of the most popular term-weighting schemes. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.
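
A small sketch of one common tf-idf variant (raw term frequency multiplied by the logarithm of the inverse document frequency; many other variants exist, and the function name here is just illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} mapping per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # Terms frequent in this document score higher; terms frequent across
        # the whole collection are damped by the inverse document frequency.
        weighted.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return weighted
```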

The Lemur Project is a collaboration between the Center for Intelligent Information Retrieval at the University of Massachusetts Amherst and the Language Technologies Institute at Carnegie Mellon University. The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software. The project is best known for its Indri and Galago search engines, the ClueWeb09 and ClueWeb12 datasets, and the RankLib learning-to-rank library. The software and datasets are used widely in scientific and research applications, as well as in some commercial applications.

Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not those results are relevant to perform a new query. We can usefully distinguish between three types of feedback: explicit feedback, implicit feedback, and blind or "pseudo" feedback.
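
Rocchio's method, developed as part of the SMART research mentioned above, is the classic formulation of explicit relevance feedback. A minimal sketch (NumPy assumed; the alpha, beta, and gamma values are conventional illustrative defaults, not prescribed constants):

```python
import numpy as np

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones.
    query_vec: 1-D term-weight array; relevant/nonrelevant: lists of such arrays."""
    new_query = alpha * query_vec
    if relevant:
        new_query = new_query + beta * np.mean(relevant, axis=0)
    if nonrelevant:
        new_query = new_query - gamma * np.mean(nonrelevant, axis=0)
    # Negative term weights are usually clipped to zero before re-querying.
    return np.maximum(new_query, 0.0)
```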

A search engine is an information retrieval software program that discovers, crawls, transforms, and stores information for retrieval and presentation in response to user queries.

Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input and expanding the search query to match additional documents, for example by adding synonyms or other morphological forms of the query terms.

In information retrieval, Okapi BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.
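
A sketch of the core BM25 scoring function (the k1 and b defaults are conventional; the exact idf form varies between presentations, and this one uses a common variant that stays non-negative):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, collection, k1=1.5, b=0.75):
    """Score one tokenized document against a query, given the whole tokenized collection."""
    n_docs = len(collection)
    avg_len = sum(len(d) for d in collection) / n_docs
    df = Counter(term for d in collection for term in set(d))
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf or term not in df:
            continue
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Term-frequency saturation plus length normalization toward the average document.
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score
```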

TREX is a search engine in the SAP NetWeaver integrated technology platform produced by SAP SE using columnar storage. The TREX engine is a standalone component that can be used in a range of system environments but is used primarily as an integral part of SAP products such as Enterprise Portal, Knowledge Warehouse, and Business Intelligence. In SAP NetWeaver BI, the TREX engine powers the BI Accelerator, which is a plug-in appliance for enhancing the performance of online analytical processing. The name "TREX" stands for Text Retrieval and information EXtraction, but it is not a registered trademark of SAP and is not used in marketing collateral.

Term discrimination is a way to rank keywords according to how useful they are for information retrieval.

Karen Spärck Jones was a self-taught programmer and a pioneering British computer scientist responsible for the concept of inverse document frequency (IDF), a technology that underlies most modern search engines. She was an advocate for women in the field of computer science and coined the slogan "Computing is too important to be left to men." In 2019, The New York Times published her belated obituary in its series Overlooked, calling her "a pioneer of computer science for work combining statistics and linguistics, and an advocate for women in the field." Since 2008, the Karen Spärck Jones Award has been given, in recognition of her achievements, to a recipient with outstanding research in one or both of her fields, information retrieval (IR) and natural language processing (NLP).

Ranking of query results is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in information retrieval is an important concept in computer science and is used in many different applications, such as search engine queries and recommender systems. Most search engines use ranking algorithms to provide users with accurate and relevant results.

Vector space model or term vector model is an algebraic model for representing text documents as vectors of identifiers. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

The Extended Boolean model was described in a Communications of the ACM article appearing in 1983, by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the Extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in information retrieval. The Boolean model does not consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the vector space model with the properties of Boolean algebra and ranks the similarity between queries and documents. In this way a document may be somewhat relevant if it matches some of the queried terms, and it will be returned as a result, whereas under the standard Boolean model it would not be.
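
The model is usually presented through p-norm similarity functions; a minimal sketch is below (term weights are assumed to lie in [0, 1]; p = 1 behaves like the vector space model and large p approaches strict Boolean logic):

```python
def sim_or(weights, p=2.0):
    """Similarity of a document to (t1 OR t2 OR ...), given its query-term weights in [0, 1]."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)

def sim_and(weights, p=2.0):
    """Similarity of a document to (t1 AND t2 AND ...)."""
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / len(weights)) ** (1.0 / p)

# A document matching only one of two query terms still gets a non-zero AND score,
# unlike in the strict Boolean model.
print(sim_and([1.0, 0.0]), sim_or([1.0, 0.0]))
```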

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity tractable and feasible.

Visual words, as used in image retrieval systems, refer to small parts of an image that carry some kind of information related to the image's features or to changes occurring in its pixels, such as filter responses or low-level feature descriptors.

In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.
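
A minimal sketch of that representation (NumPy assumed; the word vectors below are invented stand-ins for columns of a real tf–idf matrix over a corpus such as Wikipedia, where each dimension corresponds to one corpus document, i.e. one "concept"):

```python
import numpy as np

# Hypothetical ESA word vectors: each would be one column of the corpus tf-idf matrix.
word_vectors = {
    "retrieval": np.array([0.8, 0.1, 0.0]),
    "system":    np.array([0.2, 0.5, 0.3]),
    "query":     np.array([0.0, 0.4, 0.6]),
}

def esa_document_vector(tokens):
    """Represent a document as the centroid of its words' concept vectors."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3)

print(esa_document_vector(["retrieval", "query"]))
```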

References

  1. Salton, G.; Lesk, M. E. (June 1965). "The SMART automatic document retrieval systems—an illustration". Communications of the ACM. 8 (6): 391–398. doi:10.1145/364955.364990.
  2. Palchowdhury, Sauparna (2016). "On The Provenance of tf-idf". sauparna.sdf.org. Retrieved 2019-07-29.
  3. Singhal, A.; Buckley, C.; Mitra, M. (1996). "Pivoted Document Length Normalization". SIGIR Forum, 51, 176–184.
  4. Salton, G.; Buckley, C. (1988). "Term-Weighting Approaches in Automatic Text Retrieval". Information Processing & Management, 24, 513–523.