Controlled vocabulary

Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, [1] [2] taxonomies and other knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, preferred terms that have been preselected by the designers of the schemes, in contrast to natural language vocabularies, which have no such restriction.

In library and information science

In library and information science, a controlled vocabulary is a carefully selected list of words and phrases that are used to tag units of information (documents or works) so that they may be more easily retrieved by a search. [3] [4] Controlled vocabularies solve the problems of homographs, synonyms and polysemes by establishing a one-to-one correspondence (a bijection) between concepts and preferred terms. In short, controlled vocabularies ensure consistency and reduce the ambiguity inherent in natural human languages, where the same concept can be given different names.

For example, in the Library of Congress Subject Headings [5] (a subject heading system that uses a controlled vocabulary), preferred terms (subject headings, in this case) have to be chosen to handle choices between variant spellings of the same word (American versus British), between scientific and popular terms (cockroach versus Periplaneta americana), and between synonyms (automobile versus car), among other difficult issues.

Choices of preferred terms are based on the principles of user warrant (what terms users are likely to use), literary warrant (what terms are generally used in the literature and documents), and structural warrant (what terms fit the structure and scope of the controlled vocabulary).

Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term pool has to be qualified to refer to either swimming pool or the game pool to ensure that each preferred term or heading refers to only one concept.
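The mapping from variant terms to preferred terms, and the use of qualifiers for homographs, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term lists, the choice of "automobile" and "color" as preferred forms, and the function name `resolve` are invented for the example and do not reproduce any real subject heading system.

```python
# Hypothetical entry vocabulary: each variant (non-preferred) term points to
# exactly one preferred term, so every concept has a single heading.
PREFERRED = {
    "car": "automobile",
    "automobile": "automobile",
    "colour": "color",
    "color": "color",
}

# A homograph leads to several qualified headings, one per concept.
HOMOGRAPHS = {
    "pool": ["pool (swimming)", "pool (game)"],
}


def resolve(term: str) -> list[str]:
    """Return the preferred heading(s) that a searcher's term leads to."""
    key = term.strip().lower()
    if key in HOMOGRAPHS:
        # Ambiguous term: the searcher must pick one qualified heading.
        return HOMOGRAPHS[key]
    if key in PREFERRED:
        return [PREFERRED[key]]
    return []  # the term is not part of this vocabulary


print(resolve("Car"))
print(resolve("pool"))
```

Keeping homographs in a separate table makes the ambiguity explicit: a lookup that returns more than one heading signals that the term needs a qualifier before it can be used for tagging or searching.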

Types used in libraries

There are two main kinds of controlled vocabulary tools used in libraries: subject headings and thesauri. The differences between the two are diminishing, but some remain. Historically, subject headings were designed for catalogers to describe books in library catalogs, while thesauri were used by indexers to apply index terms to documents and articles. Subject headings tend to be broader in scope, describing whole books, while thesauri tend to be more specialized, covering very specific disciplines. Because of the card catalog system, subject headings also tend to have terms in indirect order (though this is being removed with the rise of automated systems), while thesaurus terms are always in direct order. Subject headings tend to use more pre-coordination of terms: the designer of the controlled vocabulary combines several concepts into one preferred subject heading (e.g., children and terrorism), while thesauri tend to use singular direct terms. Lastly, thesauri list not only equivalent terms but also narrower, broader and related terms among the various preferred and non-preferred (but potentially synonymous) terms, while historically most subject headings did not. For example, the Library of Congress Subject Headings did not have much syndetic structure until 1943, and it was not until 1985 that it began to adopt the thesaurus-style labels "Broader term" and "Narrower term".
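The syndetic structure of a thesaurus (equivalent, broader, narrower and related terms) maps naturally onto a small record type. The sketch below is a hypothetical data model, not the schema of any published thesaurus; the class name `ThesaurusEntry`, the helper `narrower_closure` and the sample entries are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One preferred term with its cross-references."""
    term: str
    used_for: list[str] = field(default_factory=list)  # non-preferred equivalents
    broader: list[str] = field(default_factory=list)   # broader terms
    narrower: list[str] = field(default_factory=list)  # narrower terms
    related: list[str] = field(default_factory=list)   # related terms


# Illustrative entries only.
entries = {
    "football": ThesaurusEntry(
        term="football",
        broader=["team sports"],
        narrower=["association football", "rugby football"],
    ),
    "association football": ThesaurusEntry(
        term="association football",
        used_for=["soccer"],
        broader=["football"],
    ),
}


def narrower_closure(term: str) -> set[str]:
    """Collect a term together with all of its narrower terms."""
    found: set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in found:
            continue  # already visited; also guards against cycles
        found.add(current)
        entry = entries.get(current)
        if entry is not None:
            stack.extend(entry.narrower)
    return found


print(sorted(narrower_closure("football")))
```

A search system could use such a closure to let a query on a broad term also match documents tagged with its narrower terms.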

The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in the subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not occur within the document's text. Well-known subject heading systems include the Library of Congress system, the Medical Subject Headings (MeSH) of the U.S. NIH National Library of Medicine, and Sears. Well-known thesauri include the Art and Architecture Thesaurus and the ERIC Thesaurus.

Choosing preferred terms is a tricky business. [6] Besides the areas already considered above, the designer has to consider the specificity of the term chosen, whether to use direct entry, inter-indexer consistency, and the stability of the language.

Lastly, the amount of pre-coordination (in which case the degree of enumeration versus synthesis becomes an issue) and post-coordination in the system is another important issue. Controlled vocabulary elements (terms or phrases) employed as tags to aid in identifying the content of documents or other information system entities (e.g. DBMS, Web services) qualify as metadata.

Indexing languages

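There are three main types of indexing languages: controlled indexing languages, in which only approved terms can be used by the indexer to describe the document; natural language indexing languages, in which any term from the document in question can be used; and free indexing languages, in which any term, whether or not it appears in the document, can be used.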

When indexing a document, the indexer also has to choose the level of indexing exhaustivity, the level of detail in which the document is described. For example, using low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general the higher the indexing exhaustivity, the more terms indexed for each document.

In recent years, free text search as a means of access to documents has become popular. This involves using natural language indexing with indexing exhaustivity set to the maximum (every word in the text is indexed). Many studies [citation needed] have been done to compare the efficiency and effectiveness of free text searches against documents that have been indexed by experts using a few well-chosen controlled vocabulary descriptors.
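Free text indexing at maximum exhaustivity amounts to recording every word of every document. A minimal sketch, assuming a toy in-memory collection and a simple inverted index (a word-to-documents mapping); the documents and the function name `full_text_index` are invented for the example, and real systems add tokenization rules, stemming and stop-word handling that are left out here.

```python
import re
from collections import defaultdict

# Toy documents; the texts are invented for the example.
documents = {
    1: "The stadium hosted a football match and a concert.",
    2: "Stadium architecture in the twentieth century.",
}


def full_text_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Index every word of every document (maximum exhaustivity)."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


index = full_text_index(documents)
print(sorted(index["stadium"]))   # documents containing "stadium"
print(sorted(index["football"]))  # documents containing "football"
```

An expert indexer working at low exhaustivity might tag document 1 only with a descriptor for stadiums; the full-text index still finds it under "football" because every word is recorded.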

Advantages

Controlled vocabularies are often claimed to improve the accuracy of free text searching, for example by reducing the number of irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word football, for example. Football is the name given to a number of different team sports. Worldwide, the most popular of these team sports is association football, which is also called soccer in several countries. The word football is also applied to rugby football (rugby union and rugby league), American football, Australian rules football, Gaelic football, and Canadian football. A search for football will therefore retrieve documents that are about several completely different sports. A controlled vocabulary solves this problem by tagging the documents in such a way that the ambiguities are eliminated.

Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic).

In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once the correct preferred term is searched, there is no need to search for other terms that might be synonyms of that term.
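Precision and recall can be computed directly from the set of retrieved documents and the set of documents that are actually relevant. The sketch below uses the standard definitions (precision as the share of retrieved documents that are relevant, recall as the share of relevant documents that were retrieved); the document ids are invented, and the function name `precision_recall` is an assumption for illustration.

```python
def precision_recall(retrieved: set[int], relevant: set[int]) -> tuple[float, float]:
    """Return (precision, recall) for one search result."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents retrieved, three documents relevant.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, which shows how a search can trade one measure against the other.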

Problems

A controlled vocabulary search may lead to unsatisfactory recall, in that it will fail to retrieve some documents that are actually relevant to the search question.

This is particularly problematic when the search question involves terms that are sufficiently tangential to the subject area that the indexer may have tagged the document with a different term, one that the searcher would nonetheless consider equivalent. Essentially, this can be avoided only by an experienced user of the controlled vocabulary whose understanding of the vocabulary coincides with that of the indexer.

Another possibility is that the article is just not tagged by the indexer because indexing exhaustivity is low. For example, an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant and hence recall fails. A free text search would automatically pick up that article regardless.

On the other hand, free text searches have high exhaustivity (every word is searched), so although they have much lower precision, they have the potential for high recall as long as the searcher overcomes the problem of synonyms by entering every combination.
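The synonym problem in free text searching can be illustrated with a toy collection in which one relevant document never uses the searcher's word. This is a sketch under stated assumptions: the texts, the synonym list `SYNONYMS` and the function `free_text_search` are invented for the example, and matching is done on whole whitespace-separated words only.

```python
# Invented mini-collection: document 2 only uses the word "soccer".
texts = {
    1: "a history of football clubs",
    2: "soccer tactics for beginners",
    3: "swimming pool maintenance",
}

# Hypothetical synonym list that the searcher has to supply by hand.
SYNONYMS = {"football": ["football", "soccer"]}


def free_text_search(query_terms: list[str]) -> set[int]:
    """Return ids of documents containing any of the query terms."""
    return {
        doc_id
        for doc_id, text in texts.items()
        if any(term in text.split() for term in query_terms)
    }


print(sorted(free_text_search(["football"])))          # single term
print(sorted(free_text_search(SYNONYMS["football"])))  # expanded query
```

The single-term query misses document 2, while the expanded query finds it. With a controlled vocabulary, the expansion is done once by the vocabulary designer rather than by every searcher.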

Controlled vocabularies may become outdated rapidly in fast-developing fields of knowledge unless the preferred terms are updated regularly. Even in an ideal scenario, a controlled vocabulary is often less specific than the words of the text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, while this precise problem is not a factor in a free text search, as it uses the author's own words.

The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with the controlled vocabulary scheme to make best use of the system. But as already mentioned, the control of synonyms and homographs can help increase precision.

Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be described in multiple ways.
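In a faceted scheme, each record carries a value for several independent facets, so the same record can be reached along more than one path. The following is a minimal sketch; the facet names (`sport`, `region`, `format`), the records and the function `filter_by_facets` are hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical records, each described under three independent facets.
records = [
    {"title": "Rugby league rules", "sport": "rugby football",
     "region": "Australia", "format": "book"},
    {"title": "Gaelic football today", "sport": "Gaelic football",
     "region": "Ireland", "format": "article"},
    {"title": "Rugby in Ireland", "sport": "rugby football",
     "region": "Ireland", "format": "article"},
]


def filter_by_facets(items: list[dict], **facets: str) -> list[dict]:
    """Keep the items whose facet values match every requested facet."""
    return [
        item for item in items
        if all(item.get(name) == value for name, value in facets.items())
    ]


by_region = filter_by_facets(records, region="Ireland", format="article")
by_sport = filter_by_facets(records, sport="rugby football")

print([item["title"] for item in by_region])
print([item["title"] for item in by_sport])
```

"Rugby in Ireland" appears in both result lists, once through its region and format and once through its sport.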

Word choice in controlled vocabularies is not neutral, and the indexer must carefully consider the ethics of their word choices. For example, colonialist terms have traditionally often been the preferred terms in controlled vocabularies when discussing First Nations issues, which has caused controversy. [7]

Applications

Controlled vocabularies, such as the Library of Congress Subject Headings, are an essential component of bibliography, the study and classification of books. They were initially developed in library and information science. In the 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the U.S. National Library of Medicine. Subsequently, for-profit firms (called abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed based on dialup X.25 networking. These services were seldom made available to the public because they were difficult to use; specialist librarians called search intermediaries handled the searching job. In the 1980s, the first full text databases appeared; these databases contain the full text of the indexed articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services without charge, and some may be accessible without charge at a public library.

Technical communication

In large organizations, controlled vocabularies may be introduced to improve technical communication. The use of controlled vocabulary ensures that everyone is using the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management, where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to the same thing.

Semantic web and structured data

Web searching could be dramatically improved by the development of a controlled vocabulary for describing Web pages; the use of such a vocabulary could culminate in a Semantic Web, in which the content of Web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative. An example of a controlled vocabulary that can be used for indexing Web pages is the Polythematic Structured Subject Heading System (PSH).

It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire Web. [8] To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page's contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML is designed on faceted classification principles. [9] [non-primary source needed]

Controlled vocabularies of the Semantic Web define the concepts and relationships (terms) used to describe a field of interest or area of concern. For instance, to declare a person in a machine-readable format, a vocabulary is needed that has a formal definition of "Person". Examples are the Friend of a Friend (FOAF) vocabulary, whose Person class defines typical properties of a person including, but not limited to, name, honorific prefix, affiliation, email address, and homepage, and the Person vocabulary of Schema.org. [10] Similarly, a book can be described using the Book vocabulary of Schema.org [11] and general publication terms from the Dublin Core vocabulary, [12] an event with the Event vocabulary of Schema.org, [13] and so on.

To use machine-readable terms from any controlled vocabulary, web designers can choose from a variety of annotation formats, including RDFa, HTML5 Microdata, or JSON-LD in the markup, or RDF serializations (RDF/XML, Turtle, N3, TriG, TriX) in external files.
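As an illustration of one of these formats, the sketch below builds a JSON-LD description of a person using the Schema.org Person vocabulary and wraps it in the script element commonly used to embed JSON-LD in a page. The person, the organization and all values are placeholders invented for the example; only the property names (`name`, `honorificPrefix`, `affiliation`, `email`, `url`) come from Schema.org.

```python
import json

# Hypothetical person; every value below is a placeholder.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "honorificPrefix": "Dr.",
    "name": "Jane Example",
    "affiliation": {"@type": "Organization", "name": "Example University"},
    "email": "mailto:jane@example.org",
    "url": "https://example.org/~jane",
}

# Serialize the description and wrap it for embedding in HTML markup.
json_ld = json.dumps(person, indent=2)
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'

print(script_tag)
```

Because the property names come from a shared vocabulary, a consumer that understands Schema.org can interpret the description without any agreement specific to this page.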

References

  1. Controlled Vocabularies. Links to examples of thesauri and classification schemes.
  2. Controlled Vocabularies. Links to examples of thesauri and classification schemes used in the domain of Agriculture, Fisheries, Forestry etc.
  3. Amy Warner, A taxonomy primer (dead link). Archived 2016-03-05 at the Wayback Machine.
  4. "Karl Fast, Fred Leise and Mike Steckel". Archived from the original on 2017-11-17. Retrieved 2014-09-15.
  5. "Controlled Vocabularies | Librarians | Library of Congress". The Library of Congress. Archived from the original on 2019-11-16. Retrieved 2018-05-22.
  6. who said this subjective statement?
  7. Smith, Catherine (2021-04-03). "Controlled Vocabularies: Past, Present and Future of Subject Access". Cataloging & Classification Quarterly. 59 (2–3): 186–202. doi:10.1080/01639374.2021.1881007. ISSN 0163-9374. S2CID 233205938.
  8. Cory Doctorow, Metacrap. Archived 2007-05-08 at the Wayback Machine.
  9. Mark Pilgrim, eXchangeable Faceted Metadata Language. Archived 2012-02-08 at the Wayback Machine.
  10. "The Person vocabulary of Schema.org". Archived from the original on 28 July 2015. Retrieved 13 March 2015.
  11. "The Book vocabulary of Schema.org". Archived from the original on 11 March 2015. Retrieved 13 March 2015.
  12. "Dublin Core Metadata Element Set, Version 1.1". Archived from the original on 16 August 2013. Retrieved 13 March 2015.
  13. "The Event vocabulary of Schema.org". Archived from the original on 13 March 2015. Retrieved 13 March 2015.