Index term

Last updated

In information retrieval, an index term (also known as subject term, subject heading, descriptor, or keyword) is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic records. They are an integral part of bibliographic control, which is the function by which libraries collect, organize and disseminate documents. They are used as keywords to retrieve documents in an information system, for instance, a catalog or a search engine. A popular form of keywords on the web are tags, which are directly visible and can be assigned by non-experts. Index terms can consist of a word, phrase, or alphanumerical term. They are created by analyzing the document either manually with subject indexing or automatically with automatic indexing or more sophisticated methods of keyword extraction. Index terms can either come from a controlled vocabulary or be freely assigned.

Contents

Keywords are stored in a search index. Common words like articles (a, an, the) and conjunctions (and, or, but) are not treated as keywords because it's inefficient. Almost every English-language site on the Internet has the article "the", and so it makes no sense to search for it. The most popular search engine, Google removed stop words such as "the" and "a" from its indexes for several years, but then re-introduced them, making certain types of precise search possible again.

The term "descriptor" was by Calvin Mooers in 1948. It is in particular used about a preferred term from a thesaurus.

The Simple Knowledge Organization System language (SKOS) provides a way to express index terms with Resource Description Framework for use in the context of the Semantic Web. [1]

In web search engines

Most web search engines are designed to search for words anywhere in a document—the title, the body, and so on. This being the case, a keyword can be any term that exists within the document. However, priority is given to words that occur in the title, words that recur numerous times, and words that are explicitly assigned as keywords within the coding. [2] Index terms can be further refined using Boolean operators such as "AND, OR, NOT." "AND" is normally unnecessary as most search engines infer it. "OR" will search for results with one search term or another or both. "NOT" eliminates a word or phrase from the search, getting rid of any results that include it. Multiple words can also be enclosed in quotation marks to turn the individual index terms into a specific index phrase. These modifiers and methods all help to refine search terms, to better maximize the accuracy of search results. [3]

Author keywords

Author keywords are an integral part of literature. [1] Many journals and databases provide access to index terms made by authors of the respective articles. How qualified the provider is decides the quality of both indexer-provided index terms and author-provided index terms. The quality of these two types of index terms is of research interest, particularly in relation to information retrieval. In general, an author will have difficulty providing indexing terms that characterize his or her document relative to other documents in the database.

Examples

See also

Related Research Articles

<span class="mw-page-title-main">Glossary of library and information science</span>

This page is a glossary of library and information science.

PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintain the database as part of the Entrez system of information retrieval.

PsycINFO is a database of abstracts of literature in the field of psychology. It is produced by the American Psychological Association and distributed on the association's APA PsycNET and through third-party vendors. It is the electronic version of the now-ceased Psychological Abstracts. In 2000, it absorbed PsycLIT which had been published on CD-ROM.

<span class="mw-page-title-main">Medical Subject Headings</span> Controlled vocabulary

Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences. It serves as a thesaurus that facilitates searching. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM's catalog of book holdings. MeSH is also used by ClinicalTrials.gov registry to classify which diseases are studied by trials registered in ClinicalTrials.

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.

<span class="mw-page-title-main">Tag (metadata)</span> Keyword assigned to information

In information systems, a tag is a keyword or term assigned to a piece of information. This kind of metadata helps describe an item and allows it to be found again by browsing or searching. Tags are generally chosen informally and personally by the item's creator or by its viewer, depending on the system, although they may also be chosen from a controlled vocabulary.

Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, preferred terms that have been preselected by the designers of the schemes, in contrast to natural language vocabularies, which have no such restriction.

A bibliogram is a graphical representation of the frequency of certain target words, usually noun phrases, in a given text. The term was introduced in 2005 by Howard D. White to name the linguistic object studied, but not previously named, in informetrics, scientometrics and bibliometrics. The noun phrases in the ranking may be authors, journals, subject headings, or other indexing terms. The "stretches of text” may be a book, a set of related articles, a subject bibliography, a set of Web pages, and so on. Bibliograms are always generated from writings, usually from scholarly or scientific literature.

The Library of Congress Subject Headings (LCSH) comprise a thesaurus of subject headings, maintained by the United States Library of Congress, for use in bibliographic records. LC Subject Headings are an integral part of bibliographic control, which is the function by which libraries collect, organize, and disseminate documents. It was first published in 1898, a year after the publication of Library of Congress Classification (1897). The last print edition was published in 2016. Access to the continuously revised vocabulary is now available via subscription and free services.

A bibliographic database is a database of bibliographic records, an organized digital collection of references to published literature, including journal and newspaper articles, conference proceedings, reports, government and legal publications, patents, books, etc. In contrast to library catalogue entries, a large proportion of the bibliographic records in bibliographic databases describe articles, conference papers, etc., rather than complete monographs, and they generally contain very rich subject descriptions in the form of keywords, subject classification terms, or abstracts.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are about, to summarize their contents or to increase findability. In other words, it is about identifying and describing the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents within a field of knowledge.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Compound-term processing, in information-retrieval, is search result matching on the basis of compound terms. Compound terms are built by combining two or more simple terms; for example, "triple" is a single word term, but "triple heart bypass" is a compound term.

Vector space model or term vector model is an algebraic model for representing text documents as vectors of identifiers. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

<span class="mw-page-title-main">Polythematic Structured Subject Heading System</span>

Polythematic structured-subject heading system is a bilingual Czech–English controlled vocabulary of subject headings developed and maintained by the National Technical Library in Prague. It was designed for describing and searching information resources according to their subject. PSH contains more than 13,900 terms, which cover the main fields of human knowledge.

The Schlagwortnormdatei or SWD is a controlled vocabulary index term system used primarily for subject indexing in library catalogs. The SWD is managed by the German National Library (DNB) in cooperation with various library networks. The inclusion of keywords in the SWD is defined by Regeln für die Schlagwortkatalogisierung (RSWK). Similar authority systems in other languages include the Library of Congress Subject Headings (LCSH) and the Répertoire d’autorité-matière encyclopédique et alphabétique unifié (RAMEAU). Since April 2012 the SWD is part of the Gemeinsame Normdatei (GND).

Automatic indexing is the computerized process of scanning large volumes of documents against a controlled vocabulary, taxonomy, thesaurus or ontology and using those controlled terms to quickly and effectively index large electronic document depositories. These keywords or language are applied by training a system on the rules that determine what words to match. There are additional parts to this such as syntax, usage, proximity, and other algorithms based on the system and what is required for indexing. This is taken into account using Boolean statements to gather and capture the indexing information out of the text. As the number of documents exponentially increases with the proliferation of the Internet, automatic indexing will become essential to maintaining the ability to find relevant information in a sea of irrelevant information. Natural language systems are used to train a system based on seven different methods to help with this sea of irrelevant information. These methods are Morphological, Lexical, Syntactic, Numerical, Phraseological, Semantic, and Pragmatic. Each of these look and different parts of speed and terms to build a domain for the specific information that is being covered for indexing. This is used in the automated process of indexing.

In the context of information retrieval, a thesaurus is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object.

References

  1. 1 2 Svenonius, Elaine (2009). The intellectual foundation of information organization (1st MIT Press pbk. ed.). Cambridge, Massachusetts: MIT Press. ISBN   9780262512619.
  2. Cutts, Matt. (2010, March 4). How search works. Retrieved from https://www.youtube.com/watch?v=BNHR6IQJGZs
  3. CLIO. Keyword search. Columbia University Libraries. Retrieved from http://www.columbia.edu/cu/lweb/help/clio/keyword.html

Further reading