Automatic indexing

Automatic indexing is the computerized process of scanning large volumes of documents against a controlled vocabulary, taxonomy, thesaurus or ontology and using those controlled terms to quickly and effectively index large electronic document repositories. The controlled terms are applied by training a system on the rules that determine which words to match, together with additional criteria such as syntax, usage, and proximity that vary by system and by what is required for indexing. These rules are typically expressed as Boolean statements that gather and capture the indexing information out of the text. [1] As the number of documents grows exponentially with the proliferation of the Internet, automatic indexing will become essential to maintaining the ability to find relevant information in a sea of irrelevant information. Natural language systems are trained using seven different methods to help navigate this sea: morphological, lexical, syntactic, numerical, phraseological, semantic, and pragmatic. Each of these examines different parts of speech and terms to build a domain model for the specific body of information being indexed. [1]
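
As a minimal illustration of this kind of rule-based matching, the following Python sketch applies Boolean and proximity rules from a tiny invented controlled vocabulary. The terms, rule format, and thresholds here are hypothetical, not taken from any particular indexing system:

```python
import re

# Hypothetical controlled vocabulary: each preferred term carries a matching
# rule. A rule is a Boolean combination of word tests, optionally with a
# proximity constraint (all values invented for illustration).
VOCABULARY = {
    "myocardial infarction": {
        "all_of": ["heart", "attack"],  # Boolean AND: both words must appear...
        "within": 3,                    # ...within 3 tokens of each other
    },
    "hypertension": {
        "any_of": ["hypertension", "high blood pressure"],  # Boolean OR
    },
}

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def matches(rule: dict, tokens: list[str]) -> bool:
    text = " ".join(tokens)
    if "any_of" in rule:
        return any(phrase in text for phrase in rule["any_of"])
    positions = []
    for word in rule["all_of"]:
        if word not in tokens:
            return False  # Boolean AND fails if any word is absent
        positions.append(tokens.index(word))
    # Proximity: matched words must fall within the allowed token window.
    return max(positions) - min(positions) <= rule.get("within", len(tokens))

def index_document(text: str) -> list[str]:
    """Return the controlled terms whose rules fire on this document."""
    tokens = tokenize(text)
    return [term for term, rule in VOCABULARY.items() if matches(rule, tokens)]

print(index_document(
    "Patient admitted after a heart attack; history of high blood pressure."))
# -> ['myocardial infarction', 'hypertension']
```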

The automated process can encounter problems, and these are primarily caused by two factors: 1) the complexity of the language; and 2) the lack of intuitiveness and the difficulty in extrapolating concepts out of statements on the part of the computing technology. [2] These are primarily linguistic challenges, involving semantic and syntactic aspects of language. [2] The problems are measured against defined keywords, which allow the accuracy of a system to be expressed in terms of hits, misses, and noise: exact matches, keywords that a computerized system missed but a human would not have, and keywords that the computer selected but a human would not have. Taking human indexing as the 100% baseline, hits should account for 85% or more, which puts misses and noise combined at 15% or less. This scale provides a basis for judging what counts as a good automatic indexing system and shows where problems are being encountered. [1]
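
The hit/miss/noise measure can be computed directly once a document carries both machine-assigned and human-assigned terms. A minimal sketch, assuming terms are compared as simple sets and the human term set is the 100% baseline (the term sets below are invented):

```python
def evaluate_indexing(machine_terms: set[str], human_terms: set[str]) -> dict:
    """Score machine indexing against a human indexer's term set.

    Hits = terms both selected; misses = human terms the machine skipped;
    noise = machine terms the human would not have assigned.
    """
    hits = machine_terms & human_terms
    misses = human_terms - machine_terms
    noise = machine_terms - human_terms
    total = len(human_terms)  # human indexing is the 100% baseline
    return {
        "hit_rate": len(hits) / total,
        "miss_rate": len(misses) / total,
        "noise_rate": len(noise) / total,
        "acceptable": len(hits) / total >= 0.85,  # the 85% threshold above
    }

human = {"indexing", "thesauri", "controlled vocabulary", "ontology"}
machine = {"indexing", "thesauri", "controlled vocabulary", "data"}
print(evaluate_indexing(machine, human))
# hit_rate 0.75, miss_rate 0.25, noise_rate 0.25 -> below the 85% target
```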

History

Scholars note that the subject of automatic indexing attracted attention as early as the 1950s, particularly with the demand for faster and more comprehensive access to scientific and engineering literature. [3] This attention began with work on text processing published by H. P. Luhn in a series of papers between 1957 and 1959. Luhn proposed that a computer could handle keyword matching, sorting, and content analysis. This was the beginning of automatic indexing and of formulas that pull keywords from text based on frequency analysis. It was later determined that frequency alone was not sufficient for good descriptors; however, this work began the path to where we are now with automatic indexing. [4]

This was highlighted by the information explosion, which was predicted in the 1960s [5] and arrived with the emergence of information technology and the World Wide Web. The prediction was made by Mooers, who outlined the expected role that computing would have in text processing and information retrieval. He predicted that machines would be used to store documents in large collections and that we would use these machines to run searches. Mooers also foresaw the online aspect and retrieval environments for indexing databases, which led him to predict an inductive inference machine that would revolutionize indexing. [4] This phenomenon required the development of an indexing system that could cope with the challenge of storing and organizing vast amounts of data and could facilitate information access. [6] [7]

New electronic hardware further advanced automated indexing, since it overcame the barrier imposed by old paper archives by allowing the encoding of information at the molecular level. [5] With this new hardware, tools were developed for assisting users in managing files, organized into categories such as PDM suites like Outlook or Lotus Notes and mind-mapping tools such as MindManager and FreeMind. These allow users to focus on storage and on building a cognitive model. [8] Automatic indexing was also partly driven by the emergence of the field of computational linguistics, which steered research that eventually produced techniques such as the application of computer analysis to the structure and meaning of languages. [3] [9] It was further spurred by research and development in artificial intelligence and self-organizing systems, also referred to as thinking machines. [3]
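
Luhn's frequency criterion can be illustrated with a short sketch. The stopword list and sample text are invented for illustration, and this is a simplification rather than Luhn's exact formulation:

```python
from collections import Counter
import re

# A tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "that", "for", "it"}

def frequency_keywords(text: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Rank candidate index terms by raw frequency, Luhn-style.

    Very common words are excluded; everything else is ranked by how often
    it occurs. As noted above, frequency alone proved insufficient for good
    descriptors -- later methods weight terms against a whole collection
    (e.g. tf-idf) -- but this is where automatic indexing began.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)

sample = ("Automatic indexing scans documents and assigns index terms. "
          "Indexing systems match document terms against a vocabulary.")
print(frequency_keywords(sample))
# -> [('indexing', 2), ('terms', 2), ('automatic', 1), ...]
```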

Medicine

Automatic indexing has many practical applications, for instance in the field of medicine. In research published in 2009, researchers describe how automatic indexing can be used to create an information portal where users can find reliable information about a drug. CISMeF is one such health portal, designed to give information about drugs. The website uses the MeSH thesaurus to index the scientific articles of the MEDLINE database, along with the Dublin Core metadata. The system creates a meta-term "drug" and uses it as a search criterion to find all information about a specific drug. The portal offers both simple and advanced search: simple search allows you to search by a brand name or by any code assigned to the drug, while advanced search allows a more specific query in which you enter everything that describes the drug you are looking for. [10]
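
The meta-term mechanism can be sketched as a simple query expansion over an inverted index. All terms and document identifiers below are invented for illustration; the real CISMeF portal works over the MeSH thesaurus and the MEDLINE database:

```python
# Hypothetical meta-term table: a meta-term groups related controlled terms.
META_TERMS = {
    "drug": {"pharmaceutical preparations", "drug therapy", "adverse effects"},
}

# Hypothetical inverted index: controlled term -> documents indexed with it.
INDEX = {
    "pharmaceutical preparations": {"doc1", "doc3"},
    "drug therapy": {"doc2"},
    "cardiology": {"doc4"},
}

def search_meta_term(meta_term: str) -> set[str]:
    """Expand a meta-term into its member terms and union their results."""
    results: set[str] = set()
    for term in META_TERMS.get(meta_term, set()):
        results |= INDEX.get(term, set())
    return results

print(search_meta_term("drug"))
# -> {'doc1', 'doc2', 'doc3'} (set order may vary)
```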

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Spamdexing is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system.

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.

An image retrieval system is a computer system used for browsing, searching and retrieving images from a large database of digital images. Most traditional and common methods of image retrieval utilize some method of adding metadata such as captioning, keywords, title or descriptions to the images so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, there has been a large amount of research done on automatic image annotation. Additionally, the increase in social web applications and the semantic web have inspired the development of several web-based image annotation tools.

Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP) that is concerned with building systems that automatically answer questions that are posed by humans in a natural language.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction.

A query language, also known as data query language or database query language (DQL), is a computer language used to make queries in databases and information systems. In database systems, query languages rely on strict theory to retrieve information. A well known example is the Structured Query Language (SQL).

Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences. It serves as a thesaurus that facilitates searching. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM's catalog of book holdings. MeSH is also used by the ClinicalTrials.gov registry to classify which diseases are studied by registered trials.

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.

Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, preferred terms that have been preselected by the designers of the schemes, in contrast to natural language vocabularies, which have no such restriction.

In information retrieval, an index term is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic records. They are an integral part of bibliographic control, which is the function by which libraries collect, organize and disseminate documents. They are used as keywords to retrieve documents in an information system, for instance, a catalog or a search engine. A popular form of keywords on the web are tags, which are directly visible and can be assigned by non-experts. Index terms can consist of a word, phrase, or alphanumerical term. They are created by analyzing the document either manually with subject indexing or automatically with automatic indexing or more sophisticated methods of keyword extraction. Index terms can either come from a controlled vocabulary or be freely assigned.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

The National Centre for Text Mining (NaCTeM) is a publicly funded text mining (TM) centre. It was established to provide support, advice and information on TM technologies and to disseminate information from the larger TM community, while also providing services and tools in response to the requirements of the United Kingdom academic community.

Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are about, to summarize their contents or to increase findability. In other words, it is about identifying and describing the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents within a field of knowledge.

Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term ‘audio mining’ is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, is mostly used to describe the pre-process of audio mining, in which the audio file is broken down into a searchable index of words.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Natural-language user interface is a type of computer human interface where linguistic phenomena such as verbs, phrases and clauses act as UI controls for creating, selecting and modifying data in software applications.

Nota Bene is an integrated software suite of applications, including word processing, reference management, and document text analysis software that is focused on writers and scholars in the Humanities, Social Sciences, and the Arts. The integrated suite is referred to as the Nota Bene Workstation. It runs on Microsoft Windows and Macintosh.

The following outline is provided as an overview of and topical guide to natural-language processing.

Gregory Grefenstette is a French–American researcher and professor in computer science, in particular artificial intelligence and natural language processing. As of 2020, he is the chief scientific officer at Biggerpan, a company developing a predictive contextual engine for the mobile web. Grefenstette is also a senior associate researcher at the Florida Institute for Human and Machine Cognition (IHMC).

References

  1. Hlava, Marjorie M. (31 January 2005). "Automatic Indexing: A Matter of Degree". Bulletin of the American Society for Information Science and Technology. 29 (1): 12–15. doi:10.1002/bult.261.
  2. Cleveland, Ana; Cleveland, Donald (2013). Introduction to Indexing and Abstracting: Fourth Edition. Santa Barbara, CA: ABC-CLIO. p. 289. ISBN 9781598849769.
  3. Riaz, Muhammad (1989). Advanced Indexing and Abstracting Practices. Delhi: Atlantic Publishers & Distributors. p. 263.
  4. Salton, Gerard (September 1987). "Historical Note: The Past Thirty Years in Information Retrieval". Journal of the American Society for Information Science. 38 (5): 375.
  5. Torres-Moreno, Juan-Manuel (2014). Automatic Text Summarization. Hoboken, NJ: John Wiley & Sons. p. xii. ISBN 9781848216686.
  6. Kapetanios, Epaminondas; Sugumaran, Vijayan; Spiliopoulou, Myra, eds. (2008). Natural Language and Information Systems: 13th International Conference on Applications of Natural Language to Information Systems, NLDB 2008, London, UK, June 24–27, 2008, Proceedings. Berlin: Springer Science & Business Media. p. 350. ISBN 978-3-540-69857-9.
  7. Basch, Reva (1996). Secrets of the Super Net Searchers: The Reflections, Revelations, and Hard-won Wisdom of 35 of the World's Top Internet Researchers. Medford, NJ: Information Today, Inc. p. 271. ISBN 0910965226.
  8. Jayaweera, Y. D.; Johar, Md Gapar Md; Perera, S. N. "Open Journal Systems".
  9. Armstrong, Susan (1994). Using Large Corpora. Cambridge, MA: MIT Press. p. 291. ISBN 0262510820.
  10. Sakji, Saoussen; Letord, Catherine; Dahamna, Badisse; Kergourlay, Ivan; Pereira, Suzanne; Joubert, Michel; Darmoni, Stéfan (2009). "Automatic indexing in a drug information portal". Studies in Health Technology and Informatics. 148: 112–122. ISSN 0926-9630. PMID 19745241.