Uniterm

Uniterm is a subject indexing system introduced by Mortimer Taube in 1951. The name is a contraction of "unit" and "term", referring to its use of single words as the basis of the index, the "uniterms". Taube referred to the overall concept as "Coordinate Indexing", but today the entire concept is generally referred to as Uniterm as well.

Uniterm is designed to allow rapid lookups on topic keywords, which are then cross-referenced with one another to find documents that match all of the terms. The result of a uniterm search is a set of accession numbers that can then be used to retrieve the matching documents. Because the index is built on top of existing accession numbers, Uniterm is technically a post-coordinate system. This is opposed to a pre-coordinate system, where the subject of the document results in its being given a particular number, as in the Dewey Decimal Classification. Uniterm was among the most popular post-coordinate indexing systems, although some of its success was due to Taube's company winning contracts to index huge technical libraries.

History

The development of Uniterm, and of other new indexing systems, ultimately traces its history to the late World War II period. Aware of the advanced aircraft and rocket technologies developed in Germany, the US formed Operation Lusty and the UK the similar Fedden Mission in order to gather as much of this material as possible. Along with examples of the aircraft and various weapons, these efforts returned millions of pages of technical documentation. The desire to ease access to these enormous collections led to a great expansion in the field of information retrieval. [1]

In the US, the aeronautical collection was first sent to the US Army Air Forces at Wright Field, but over time it was merged with similar caches of US research to form an ever-growing collection of technical papers. The collection grew so large and varied that a new operational group, the Armed Services Technical Information Agency (ASTIA), was formed in 1951 to manage it. This group eventually came under the management of the Atomic Energy Commission. ASTIA began running experiments in indexing the collection, and it was from this work that Uniterm emerged. [2]

Taube introduced the Uniterm concept in a 1951 paper, "Coordinate Indexing of Scientific Fields", part of the Symposium on Mechanical Aids to Chemical Documentation. The next year, in partnership with Gerald Sophar, Taube formed Documentation, Inc. The company offered commercial retrieval and indexing services. Among its largest efforts was a 1958 contract with the newly formed NASA to index the agency's entire technical library and, later, to make microfilm copies of it. [3]

Taube's original paper indicates that a significant advantage of the Uniterm concept is its ability to be automated. In essence, the uniterm lookup process looks for the intersection of several terms, or as Taube referred to them, the "coordinates". [note 1] To this end, the company partnered with IBM to develop the "Continuous Multiple Access Collator", or COMAC, also known as the IBM 9900. Users would make search term selections on a punch card writer and then feed them into the COMAC. [4] The COMAC pulled the matching uniterm cards and then used optical systems to find the accession numbers common to them. It then produced a new card carrying those numbers, which was sent to the IBM 305 RAMAC, the first computer with a hard drive, which returned the complete document information for those numbers. [4]

Concept

Uniterm is based on the concept of making a separate card catalog that refers to the documents in the collection by their accession numbers. The accession numbers have no meaning in the Uniterm index, so they may use any of the common systems like the Dewey Decimal Classification or Universal Decimal Classification, or in many cases, simply an incrementing serial number. [5] [2]

As new works are added to the collection, the librarian makes a normal index card for the primary card index, as they would for any work. Additionally, they select a small number of keywords from the title or body of the work that can be used to look it up, and these are also written on the card. For instance, a document on icing of air ducts in aircraft might be filed under "air", "ducts" and "icing", but perhaps not "aircraft", which would be found on too many documents. [6]

The librarian then looks in the Uniterm catalog for cards with those terms on them. If a card is not found, one is created by writing the keyword at the top of the card and dividing the lower portion into ten vertical columns, labeled 0 to 9. The accession number is then written in the column matching its last digit; for instance, if the last digit of the accession number is 5, the entire accession number is written in column 5. If a card for that term already exists in the catalog, the new accession number is simply added to the correct column of the existing card. [7]
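The filing procedure can be sketched in a few lines of Python. This is only an illustration of the card layout described above, not a reconstruction of any historical tool: the function names, the use of integers for accession numbers, and the sample numbers 1045 and 2315 are all assumptions made for the example.

```python
def new_card():
    # One uniterm card: ten columns labeled 0 to 9, each a list of accession numbers.
    return {digit: [] for digit in range(10)}


def file_document(index, accession, terms):
    """Post one accession number onto the card for each chosen uniterm."""
    for term in terms:
        # Create the card if the term has not been used before.
        card = index.setdefault(term, new_card())
        # The column is chosen by the last digit of the accession number.
        column = card[accession % 10]
        if accession not in column:
            column.append(accession)


# Hypothetical accession numbers, for illustration only.
index = {}
file_document(index, 1045, ["air", "ducts", "icing"])
file_document(index, 2315, ["air", "intakes"])
print(index["air"][5])  # [1045, 2315]
```

Both sample numbers end in 5, so they share column 5 of the "air" card, while the "icing" card holds only the first.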

To retrieve a document, the user selects potentially useful key terms and extracts those cards from the uniterm index. To find this article, for example, the user might select "indexing" and "library" and retrieve those two cards from the uniterm catalog. Each card will carry numbers for many different documents; the "library" card might, for instance, contain a listing for a book on the Library of Alexandria. However, only those documents on "library indexing" will appear on both cards. [8]

The user then scans the cards to see whether a particular accession number appears on both; splitting the cards into 10 columns is intended to make this visual scanning simpler, since a number can only match numbers in the same column of the other card. Numbers that appear on both cards are likely relevant to the search, and can then be looked up directly or, if partial accession numbers are used, by looking in the main card catalog. [8]
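The scan amounts to a set intersection carried out one column at a time. The following Python sketch assumes cards are stored as ten lists keyed by last digit; the helper names and the accession numbers are hypothetical.

```python
def make_card(*accessions):
    # Helper for the example: place each number in the column of its last digit.
    columns = {digit: [] for digit in range(10)}
    for accession in accessions:
        columns[accession % 10].append(accession)
    return columns


def coordinate(cards):
    """Return the accession numbers present on every card (at least one card required)."""
    matches = []
    for digit in range(10):
        # Only numbers in the same column can match, so compare column by column.
        common = set(cards[0][digit])
        for card in cards[1:]:
            common &= set(card[digit])
        matches.extend(common)
    return sorted(matches)


# Hypothetical postings for two uniterms.
indexing = make_card(102, 417, 933)
library = make_card(58, 417, 640, 933)
print(coordinate([indexing, library]))  # [417, 933]
```

Comparing column against column mirrors the visual shortcut on the paper cards: a number ending in 7 never needs to be checked against anything outside column 7.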

The cards in the main catalog also contain the uniterms used to file that entry, forming a cross-index. A user who selects the cards for "propeller" and "aeroplane" may find many intersecting works on the cards. Returning to the main index, they can look at the uniterms recorded on the main index cards and find that other terms commonly appear, perhaps "aerodynamics". These suggest additional terms that could be used to narrow the search. The user can then return to the uniterm catalog and apply these new terms to find additional documents or to further focus the search. [9]
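This cross-index step can be illustrated by counting which other uniterms recur among the matching documents. The sketch below assumes the main catalog is available as a mapping from accession number to the set of uniterms recorded on its card; the function name and all of the data are invented for the example.

```python
from collections import Counter


def suggest_terms(main_catalog, hits, used_terms):
    """Count the uniterms on the matching main-catalog cards, skipping terms already searched."""
    counts = Counter()
    for accession in hits:
        for term in main_catalog[accession]:
            if term not in used_terms:
                counts[term] += 1
    # Most frequent co-occurring terms first.
    return counts.most_common()


# Hypothetical main-catalog entries: accession number -> uniterms on the card.
main_catalog = {
    11: {"propeller", "aeroplane", "aerodynamics"},
    24: {"propeller", "aeroplane", "aerodynamics", "vibration"},
    37: {"propeller", "aeroplane", "materials"},
}
suggestions = suggest_terms(main_catalog, [11, 24, 37], {"propeller", "aeroplane"})
print(suggestions[0])  # ('aerodynamics', 2)
```

A term that appears on most of the matching cards is a candidate for narrowing the search, while a rare one may point to a more specialized subset.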

Advantages and criticisms

Uniterm was popular in the United States for large technical collections, which led to considerable study of the system. One particularly useful study came out of the National Security Agency's effort to catalog its 70,000-work collection. [10]

They found one major advantage of the Uniterm system was that the librarians did not have to have an understanding of the material in order to correctly catalog it. Simply selecting terms that appeared in the title or were obviously important within the text would often result in a useful uniterm entry. This contrasted with traditional hierarchical approaches, where selecting the proper spot within the hierarchy often required some, or considerable, knowledge of the underlying field. [10]

The same effort also revealed a number of problems and suggested solutions. One problem was synonyms: was a paper on "air ducts" the same as, or different from, one on "air intakes"? They suggested this could be addressed by splitting the works into sets of about 1,000 entries and building the catalog out in sections. The first set of 1,000 documents might produce 1,000 uniterms, which were then studied to weed out synonyms. When synonyms were found, "see also" headings were added to those cards. The second set would then be added, using those synonyms. They found that the addition of new terms started to flatten out at about 4,000 entries, and that after 10,000 only very specific technical terms were being added. [11]
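One way to picture the effect of "see also" headings is to let a search follow them before intersecting. The source does not describe how searchers actually used the headings, so the following Python sketch is an assumption throughout: the data structures, function names, and accession numbers are all hypothetical.

```python
def expand(term, see_also):
    """Return the term together with every term its card lists under 'see also'."""
    return {term} | see_also.get(term, set())


def postings(term, index, see_also):
    """Collect the accession numbers for a term and its see-also terms."""
    found = set()
    for linked in expand(term, see_also):
        found |= index.get(linked, set())
    return found


# Hypothetical index: uniterm -> accession numbers posted on its card.
index = {"ducts": {1045}, "intakes": {2315}, "icing": {1045, 2315, 3001}}
# Hypothetical see-also headings linking the two near-synonyms.
see_also = {"ducts": {"intakes"}, "intakes": {"ducts"}}

result = postings("ducts", index, see_also) & postings("icing", index, see_also)
print(sorted(result))  # [1045, 2315]
```

Without the heading, a search on "ducts" and "icing" would find only the first document; following the heading also finds the one filed under "intakes".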

A concern that was raised when the concept was first introduced was that the terms might return a large number of false positives due to terms being used to describe completely different concepts. In particular, terms that might mean different things depending on their order were believed to be an issue. If one was looking for "American exports to Canada", "Canada", "US" and "exports" would return a large number of documents on Canadian exports into the US as well, perhaps overwhelming the result set. [12]

However, this was found not to be a serious problem in practice, and those few examples that did crop up were solved by adding "delta cards", see-also entries that incorporated a direction. In this case, the "US" card would have a see-also entry for "USΔ", and that card would contain only entries for exports from the US. [12]
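The effect of a delta card on the export example can be shown with plain set intersection. The sketch below is an assumption-laden toy: the three accession numbers and their contents are invented, and the "USΔ" key simply stands in for the directional card described above.

```python
# Hypothetical postings. Documents 201, 202 and 203 all mention the US, Canada
# and exports, but only 201 is assumed to concern exports from the US.
index = {
    "US": {201, 202, 203},
    "Canada": {201, 202, 203},
    "exports": {201, 202, 203},
    "USΔ": {201},  # delta card: only entries where the US is the exporter
}


def search(terms):
    """Return the accession numbers common to the cards for all given terms."""
    return set.intersection(*(index[term] for term in terms))


print(sorted(search(["US", "Canada", "exports"])))   # [201, 202, 203]
print(sorted(search(["USΔ", "Canada", "exports"])))  # [201]
```

Swapping the plain "US" card for the delta card removes the documents on trade in the opposite direction without changing how the search itself is carried out.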

Notes

  1. As in "things that are coordinated", not "a physical location".

References

Citations

  1. Lesk, Michael. "The Seven Ages of Information Retrieval". Bellcore.
  2. Sharma & Sharma 2007, p. 19.
  3. Times 1965.
  4. Taube 1962.
  5. Install 1953, p. 1.
  6. Install 1953, p. 2.
  7. Install 1953, pp. 6, 7.
  8. Install 1953, p. 9.
  9. Install 1953, p. 11.
  10. Sanford & Theriault 1956, p. 19.
  11. Sanford & Theriault 1956, p. 20.
  12. Sanford & Theriault 1956, p. 23.

Bibliography