Thesaurus (information retrieval)

In the context of information retrieval, a thesaurus (plural: "thesauri") is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". [1] The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object. [2]

A thesaurus serves to guide both an indexer and a searcher in selecting the same preferred term or combination of preferred terms to represent a given subject. ISO 25964, the international standard for information retrieval thesauri, defines a thesaurus as a “controlled and structured vocabulary in which concepts are represented by terms, organized so that relationships between concepts are made explicit, and preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms.”

A thesaurus is composed of at least three elements: (1) a list of words (or terms); (2) the relationships among those words (or terms), indicated by their relative hierarchical position (e.g. parent/broader term, child/narrower term) or by equivalence (synonyms); and (3) a set of rules on how to use the thesaurus.
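The three elements can be sketched as a small data structure. This is a minimal illustration only, not a format prescribed by any standard: the class names `Concept` and `Thesaurus`, the field names, the method names and the sample terms are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One thesaurus entry (hypothetical model for illustration)."""
    preferred: str                                   # element 1: the term itself
    broader: set[str] = field(default_factory=set)   # element 2: parent terms
    narrower: set[str] = field(default_factory=set)  # element 2: child terms
    synonyms: set[str] = field(default_factory=set)  # element 2: non-preferred entry terms


class Thesaurus:
    def __init__(self) -> None:
        self.concepts: dict[str, Concept] = {}

    def add(self, preferred: str, synonyms: tuple[str, ...] = ()) -> None:
        self.concepts[preferred] = Concept(preferred, synonyms=set(synonyms))

    def link(self, broader: str, narrower: str) -> None:
        """Record a hierarchical relationship in both directions."""
        self.concepts[broader].narrower.add(narrower)
        self.concepts[narrower].broader.add(broader)

    def check_index_term(self, term: str) -> str:
        """Element 3, a usage rule: only preferred terms may be assigned."""
        if term in self.concepts:
            return term
        for concept in self.concepts.values():
            if term in concept.synonyms:
                raise ValueError(f"use {concept.preferred!r} instead of {term!r}")
        raise ValueError(f"{term!r} is not in the thesaurus")


# Sample entries (assumed for the example).
thesaurus = Thesaurus()
thesaurus.add("fruits")
thesaurus.add("citrus fruits", synonyms=("citrus",))
thesaurus.link("fruits", "citrus fruits")
```

In this sketch the usage rule is enforced in code, so an indexer who tries to assign the non-preferred term "citrus" is pointed to "citrus fruits" instead.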

History

Wherever there have been large collections of information, whether on paper or in computers, scholars have faced a challenge in pinpointing the items they seek. The use of classification schemes to arrange the documents in order was only a partial solution. Another approach was to index the contents of the documents using words or terms, rather than classification codes. In the 1940s and 1950s some pioneers, such as Calvin Mooers, Charles L. Bernier, Evan J. Crane and Hans Peter Luhn, compiled their index terms into various kinds of list that they called a “thesaurus” (by analogy with the well-known thesaurus developed by Peter Roget). [3] The first such list put seriously to use in information retrieval was the thesaurus developed in 1959 at E. I. du Pont de Nemours and Company. [4] [5]

The first two of these lists to be published were the Thesaurus of ASTIA Descriptors (1960) and the Chemical Engineering Thesaurus of the American Institute of Chemical Engineers (1961), a descendant of the DuPont thesaurus. More followed, culminating in the influential Thesaurus of Engineering and Scientific Terms (TEST) published jointly by the Engineers Joint Council and the US Department of Defense in 1967. TEST did more than just serve as an example; its Appendix 1, “Thesaurus rules and conventions”, has guided thesaurus construction ever since. Hundreds of thesauri have been produced since then, perhaps thousands. The most notable innovations since TEST have been: (a) extension from monolingual to multilingual capability; and (b) addition of a conceptually organized display to the basic alphabetical presentation.

Many national and international standards have built steadily on the basic rules set out in TEST, including ISO 2788, ISO 5964, ANSI/NISO Z39.19 and ISO 25964.

The most clearly visible trend across this history of thesaurus development has been from the context of small-scale isolation to a networked world. [6] Access to information was notably enhanced when thesauri crossed the divide between monolingual and multilingual applications. More recently, as can be seen from the titles of the latest ISO and NISO standards, there is a recognition that thesauri need to work in harness with other forms of vocabulary or knowledge organization system, such as subject heading schemes, classification schemes, taxonomies and ontologies. The official website for ISO 25964 gives more information, including a reading list. [7]

Purpose

In information retrieval, a thesaurus can be used as a form of controlled vocabulary to aid in assigning appropriate metadata to information-bearing entities. By prescribing how each concept is to be expressed, a thesaurus helps to improve both precision and recall: because indexers and searchers use the same language, information-bearing entities on a given subject are easier to locate. Additionally, a thesaurus maintains a hierarchical listing of terms, usually single words or bound phrases, that helps the indexer to narrow the choice of terms and limit semantic ambiguity.
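Precision is the share of retrieved items that are relevant, and recall is the share of relevant items that are retrieved. The sketch below shows how indexing under a single preferred term can raise recall compared with uncontrolled indexing in which synonyms are scattered. The document identifiers, the index contents and the function names are made up for illustration.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: three documents on the same disease, one unrelated.
relevant = {"doc1", "doc2", "doc3"}

# Uncontrolled indexing: each indexer chose a different synonym.
free_text_index = {
    "doc1": {"BSE"},
    "doc2": {"mad cow disease"},
    "doc3": {"bovine spongiform encephalopathy"},
    "doc4": {"citrus fruits"},
}

# Controlled indexing: every synonym is mapped to one preferred term.
PREFERRED = "bovine spongiform encephalopathy"
to_preferred = {"BSE": PREFERRED, "mad cow disease": PREFERRED}
controlled_index = {
    doc: {to_preferred.get(term, term) for term in terms}
    for doc, terms in free_text_index.items()
}


def search(index: dict[str, set[str]], term: str) -> set[str]:
    """Return the documents indexed under exactly this term."""
    return {doc for doc, terms in index.items() if term in terms}


before = precision_recall(search(free_text_index, PREFERRED), relevant)
after = precision_recall(search(controlled_index, PREFERRED), relevant)
```

With this toy data the same query finds one of the three relevant documents in the uncontrolled index and all three in the controlled one, while precision is unchanged.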

The Art & Architecture Thesaurus, for example, is used by museums around the world to catalogue their collections. AGROVOC, the thesaurus of the UN's Food and Agriculture Organization, is used to index and/or search its AGRIS database of worldwide literature on agricultural research.

Structure

Information retrieval thesauri are formally organized so that existing relationships between concepts are made clear. For example, "citrus fruits" might be linked to the broader concept of "fruits" and to the narrower ones of "oranges", "lemons", etc. When the terms are displayed online, the links between them make it very easy to browse the thesaurus, selecting useful terms for a search. When a single term could have more than one meaning, like tables (furniture) or tables (data), these are listed separately so that the user can choose which concept to search for and avoid retrieving irrelevant results. For any one concept, all known synonyms are listed, such as "mad cow disease", "bovine spongiform encephalopathy", "BSE", etc. The idea is to guide all the indexers and all the searchers to use the same term for the same concept, so that search results will be as complete as possible. If the thesaurus is multilingual, equivalent terms in other languages are shown too. Following international standards, concepts are generally arranged hierarchically within facets or grouped by themes or topics. Unlike a general thesaurus that is used for literary purposes, information retrieval thesauri typically focus on one discipline, subject or field of study.
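The relationships described above lend themselves to a simple search aid. The following sketch is an illustration under assumptions: the dictionaries and the function names `resolve` and `expand` are hypothetical, and only the sample terms come from the text. It maps an entry term to its preferred term, returns the separate meanings of a homograph so the user can choose, and expands a search to all narrower concepts.

```python
# Hierarchical links (broader term -> narrower terms), as in the fruit example.
NARROWER = {
    "fruits": ["citrus fruits"],
    "citrus fruits": ["oranges", "lemons"],
}

# Lead-in entries: non-preferred term -> preferred term.
USE = {
    "mad cow disease": "bovine spongiform encephalopathy",
    "BSE": "bovine spongiform encephalopathy",
}

# Homographs are kept apart by a parenthetical qualifier.
HOMOGRAPHS = {"tables": ["tables (furniture)", "tables (data)"]}


def resolve(entry: str) -> list[str]:
    """Map what the user typed to one or more candidate preferred terms."""
    if entry in HOMOGRAPHS:
        return HOMOGRAPHS[entry]      # the user must pick one meaning
    return [USE.get(entry, entry)]    # follow a lead-in entry if there is one


def expand(term: str) -> list[str]:
    """Return the term followed by all of its narrower terms, depth first."""
    result = [term]
    for child in NARROWER.get(term, []):
        result.extend(expand(child))
    return result
```

A search interface built this way could call `resolve` on the user's input and then `expand` on the chosen term, so that a search on "fruits" also retrieves items indexed under "oranges" or "lemons".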

References

  1. ANSI & NISO 2005, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, NISO, Maryland, U.S.A, p.11
  2. ANSI & NISO 2005, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, NISO, Maryland, U.S.A, p.12
  3. Roberts, N. The pre-history of the information retrieval thesaurus. Journal of Documentation, 40(4), 1984, p.271-285.
  4. Aitchison, J. and Dextre Clarke, S. The thesaurus: a historical viewpoint, with a look to the future. Cataloging & Classification Quarterly, 37 (3/4), 2004, p.5-21.
  5. Krooks, D.A. and Lancaster, F.W. The evolution of guidelines for thesaurus construction. Libri, 43(4), 1993, p.326-342.
  6. Dextre Clarke, Stella G. and Zeng, Marcia Lei. From ISO 2788 to ISO 25964: the evolution of thesaurus standards towards interoperability and data modeling. Information Standards Quarterly, 24(1), 2012, p.20-26.
  7. ISO 25964 – the international standard for thesauri and interoperability with other vocabularies. National Information Standards Organization, 2013.