SMART Information Retrieval System

The SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s. [1] Many important concepts in information retrieval were developed as part of research on the SMART system, including the vector space model, relevance feedback, and Rocchio classification.

Gerard Salton led the group that developed SMART. Other contributors included Mike Lesk.

The SMART system also provides a set of corpora, queries, and reference rankings, drawn from different subject areas.

The legacy of the SMART system includes the so-called SMART triple notation, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. A combination of weights is written in the form ddd.qqq, where the first three letters denote the term weighting applied to the collection's document vectors and the last three letters denote the term weighting applied to the query vector. For example, ltc.lnn means that ltc weighting is applied to the collection documents and lnn weighting is applied to the query.

The following tables establish the SMART notation: [2]

Symbols and notation

A document is represented as a vector of term weights, one weight per unique term in the document. Positive weights characterize terms that are present in a document, and a weight of zero is used for terms that are absent from a document. The weighting schemes below are defined in terms of the following quantities:

- the occurrence frequency of a term in a document
- the number of unique terms in a document
- the number of documents in the collection
- the average number of unique terms in a document
- the number of documents in which a term is present
- the number of characters in a document
- the occurrence frequency of the most common term in a document
- the average number of characters in a document
- the average occurrence frequency of a term in a document
- the slope used in pivoted document length normalization [3]

The number of documents in the collection and the collection-wide averages are global collection statistics.
SMART term-weighting triple notation

Term frequency:
- b: Binary weight
- t, n: Raw term frequency
- a: Augmented normalized term frequency
- l: Logarithm
- L: Average-term-frequency-based normalization [3]
- d: Double logarithm

Document frequency:
- x, n: Disregards the collection frequency
- f: Inverse collection frequency
- t: Inverse collection frequency
- p: Probabilistic inverse collection frequency

Document length normalization:
- x, n: No document length normalization
- c: Cosine normalization
- u: Pivoted unique normalization [3]
- b: Pivoted character length normalization [3]

The notation mixes two letter schemes: the one used by Salton and Buckley in their 1988 paper [4], and the one used in experiments reported thereafter. Where an entry above lists two letters, the first belongs to the 1988 scheme and the second to the later scheme.
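
To make the notation concrete, the following sketch computes an ltc-weighted document vector and an lnn-weighted query vector and scores them with an inner product. It assumes the common reading of the letters (l: logarithmic term frequency, t: inverse collection frequency, c: cosine normalization, n: none); the function names are illustrative and not part of SMART itself.

```python
import math
from collections import Counter

def ltc_vector(doc_terms, df, n_docs):
    """ltc: logarithmic tf, inverse document frequency, cosine normalization.
    doc_terms: list of tokens in one document; df: term -> document frequency."""
    tf = Counter(doc_terms)
    w = {t: (1 + math.log(f)) * math.log(n_docs / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def lnn_vector(query_terms):
    """lnn: logarithmic tf, no collection frequency weighting, no normalization."""
    tf = Counter(query_terms)
    return {t: 1 + math.log(f) for t, f in tf.items()}

def ltc_lnn_score(doc_vec, query_vec):
    """Inner product of the two weighted vectors (higher means more similar)."""
    return sum(w * query_vec.get(t, 0.0) for t, w in doc_vec.items())
```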

Related Research Articles

Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Gerard A. "Gerry" Salton was a professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time, and has been called "the father of Information Retrieval". His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard. It was the very first system to use the now popular vector space model for Information Retrieval.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
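
As a rough sketch of the procedure described above (NumPy assumed; the tiny count matrix is invented purely for illustration), one can reduce a word-by-document count matrix with a truncated SVD and compare two documents by cosine similarity in the reduced space:

```python
import numpy as np

# Toy word-by-document count matrix: rows are words, columns are documents.
counts = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
], dtype=float)

# Truncated SVD: keep only the k largest singular values/components.
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional row per document

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Values near 1 mean the two documents are similar in the latent space.
print(cosine(doc_vectors[0], doc_vectors[2]))
```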

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix where "features" may refer to other properties of a document besides terms. It is also common to encounter the transpose, or term-document matrix where documents are the columns and terms are the rows. They are useful in the field of natural language processing and computational text analysis.
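
A minimal sketch of building such a matrix from already-tokenized documents (plain Python; the toy documents are invented for illustration):

```python
from collections import Counter

docs = [
    ["smart", "retrieval", "system"],
    ["vector", "space", "model", "retrieval"],
    ["relevance", "feedback", "retrieval", "system"],
]

# One column per distinct term in the collection, in a fixed order.
vocab = sorted({term for doc in docs for term in doc})

# One row per document; each cell holds the raw count of that term in that document.
matrix = [[Counter(doc)[term] for term in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```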

In the field of information retrieval, divergence from randomness (DFR), one of the first models of its kind, is a type of probabilistic model. It is used to measure the amount of information carried by terms in documents. It is based on Harter's 2-Poisson indexing model, which hypothesizes that each informative term has an "elite" set of documents in which it occurs relatively more frequently than in the rest of the collection. Strictly speaking, DFR is not a single model but a framework for weighting terms using probabilistic methods, and its term weighting is based on this notion of eliteness.

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf has been one of the most popular term-weighting schemes. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.
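
A small sketch of one common tf-idf variant (raw term frequency multiplied by the logarithm of the inverse document frequency; many other variants exist, and the function name here is just illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} mapping per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # Terms frequent in this document score higher; terms frequent across
        # the whole collection are damped by the inverse document frequency.
        weighted.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return weighted
```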

The Lemur Project is a collaboration between the Center for Intelligent Information Retrieval at the University of Massachusetts Amherst and the Language Technologies Institute at Carnegie Mellon University. The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software. The project is best known for its Indri and Galago search engines, the ClueWeb09 and ClueWeb12 datasets, and the RankLib learning-to-rank library. The software and datasets are used widely in scientific and research applications, as well as in some commercial applications.

Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not those results are relevant to perform a new query. We can usefully distinguish between three types of feedback: explicit feedback, implicit feedback, and blind or "pseudo" feedback.
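
Rocchio's method, developed as part of the SMART research mentioned above, is the classic formulation of explicit relevance feedback. A minimal sketch (NumPy assumed; the alpha, beta, and gamma values are conventional illustrative defaults, not prescribed constants):

```python
import numpy as np

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones.
    query_vec: 1-D term-weight array; relevant/nonrelevant: lists of such arrays."""
    new_query = alpha * query_vec
    if relevant:
        new_query = new_query + beta * np.mean(relevant, axis=0)
    if nonrelevant:
        new_query = new_query - gamma * np.mean(nonrelevant, axis=0)
    # Negative term weights are usually clipped to zero before re-querying.
    return np.maximum(new_query, 0.0)
```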

A search engine is an information retrieval software program that discovers, crawls, transforms, and stores information for retrieval and presentation in response to user queries.

Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input and expanding the search query to match additional documents, for example by adding synonyms or other morphological forms of the query terms.

In information retrieval, Okapi BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.
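
A sketch of the core BM25 scoring function (the k1 and b defaults are conventional; the exact idf form varies between presentations, and this one uses a common variant that stays non-negative):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, collection, k1=1.5, b=0.75):
    """Score one tokenized document against a query, given the whole tokenized collection."""
    n_docs = len(collection)
    avg_len = sum(len(d) for d in collection) / n_docs
    df = Counter(term for d in collection for term in set(d))
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf or term not in df:
            continue
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Term-frequency saturation plus length normalization toward the average document.
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score
```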

TREX is a search engine in the SAP NetWeaver integrated technology platform produced by SAP SE using columnar storage. The TREX engine is a standalone component that can be used in a range of system environments but is used primarily as an integral part of SAP products such as Enterprise Portal, Knowledge Warehouse, and Business Intelligence. In SAP NetWeaver BI, the TREX engine powers the BI Accelerator, which is a plug-in appliance for enhancing the performance of online analytical processing. The name "TREX" stands for Text Retrieval and information EXtraction, but it is not a registered trademark of SAP and is not used in marketing collateral.

Term discrimination is a way to rank keywords according to how useful they are for information retrieval.

Karen Spärck Jones was a self-taught programmer and a pioneering British computer scientist responsible for the concept of inverse document frequency (IDF), a technology that underlies most modern search engines. She was an advocate for women in the field of computer science and coined the slogan "Computing is too important to be left to men." In 2019, The New York Times published her belated obituary in its series Overlooked, calling her "a pioneer of computer science for work combining statistics and linguistics, and an advocate for women in the field." Since 2008, the Karen Spärck Jones Award has been given, in recognition of her achievements, to a recipient with outstanding research in one or both of her fields, information retrieval (IR) and natural language processing (NLP).

Ranking of query results is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in information retrieval is an important concept in computer science and is used in many different applications, such as search engine queries and recommender systems. Most search engines use ranking algorithms to provide users with accurate and relevant results.

Vector space model or term vector model is an algebraic model for representing text documents as vectors of identifiers. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

The Extended Boolean model was described in a Communications of the ACM article appearing in 1983, by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the Extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in information retrieval. The Boolean model does not consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the vector space model with the properties of Boolean algebra and ranks the similarity between queries and documents. In this way a document may be somewhat relevant if it matches some of the queried terms, and it will be returned as a result, whereas under the standard Boolean model it would not be.
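
The model is usually presented through p-norm similarity functions; a minimal sketch is below (term weights are assumed to lie in [0, 1]; p = 1 behaves like the vector space model and large p approaches strict Boolean logic):

```python
def sim_or(weights, p=2.0):
    """Similarity of a document to (t1 OR t2 OR ...), given its query-term weights in [0, 1]."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)

def sim_and(weights, p=2.0):
    """Similarity of a document to (t1 AND t2 AND ...)."""
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / len(weights)) ** (1.0 / p)

# A document matching only one of two query terms still gets a non-zero AND score,
# unlike in the strict Boolean model.
print(sim_and([1.0, 0.0]), sim_or([1.0, 0.0]))
```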

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity tractable and feasible.

Visual words, as used in image retrieval systems, refer to small parts of an image that carry some kind of information related to the image's features or to changes occurring in its pixels, such as filter responses or low-level feature descriptors.

In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.
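
A minimal sketch of that representation (NumPy assumed; the word vectors below are invented stand-ins for columns of a real tf–idf matrix over a corpus such as Wikipedia, where each dimension corresponds to one corpus document, i.e. one "concept"):

```python
import numpy as np

# Hypothetical ESA word vectors: each would be one column of the corpus tf-idf matrix.
word_vectors = {
    "retrieval": np.array([0.8, 0.1, 0.0]),
    "system":    np.array([0.2, 0.5, 0.3]),
    "query":     np.array([0.0, 0.4, 0.6]),
}

def esa_document_vector(tokens):
    """Represent a document as the centroid of its words' concept vectors."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3)

print(esa_document_vector(["retrieval", "query"]))
```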

References

  1. Salton, G.; Lesk, M. E. (June 1965). "The SMART automatic document retrieval systems—an illustration". Communications of the ACM. 8 (6): 391–398. doi:10.1145/364955.364990.
  2. Palchowdhury, Sauparna (2016). "On The Provenance of tf-idf". sauparna.sdf.org. Retrieved 2019-07-29.
  3. Singhal, A.; Buckley, C.; Mitra, M. (1996). "Pivoted Document Length Normalization". SIGIR Forum, 51, 176–184.
  4. Salton, G.; Buckley, C. (1988). "Term-Weighting Approaches in Automatic Text Retrieval". Information Processing & Management, 24, 513–523.