Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are about, to summarize their contents, or to increase findability. In other words, it is about identifying and describing the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents (such as books and articles) within a field of knowledge.
Subject indexing is used in information retrieval, especially to create bibliographic indexes for retrieving documents on a particular subject. Examples of academic indexing services are Zentralblatt MATH, Chemical Abstracts and PubMed. Index terms are mostly assigned by experts, but author keywords are also common.
The process of indexing begins with an analysis of the subject of the document. The indexer must then identify terms which appropriately identify the subject, either by extracting words directly from the document or by assigning words from a controlled vocabulary. [1] The terms in the index are then presented in a systematic order.
Indexers must decide how many terms to include and how specific the terms should be. Together, these decisions determine the depth of indexing.
The first step in indexing is to decide on the subject matter of the document. In manual indexing, the indexer considers the subject matter in terms of answers to a set of questions, such as "Does the document deal with a specific product, condition or phenomenon?". [2] As the analysis is influenced by the knowledge and experience of the indexer, two indexers may analyze the content differently and so come up with different index terms. This will affect the success of retrieval.
Automatic indexing follows set processes of analyzing frequencies of word patterns and comparing the results to other documents in order to assign documents to subject categories. It requires no understanding of the material being indexed. This leads to more uniform indexing, but at the expense of interpreting the true meaning: a computer program does not understand the meaning of statements and may therefore fail to assign some relevant terms or assign terms incorrectly. Human indexers focus their attention on certain parts of the document, such as the title, abstract, summary and conclusions, as analyzing the full text in depth is costly and time-consuming. [3] An automated system removes this time constraint and allows the entire document to be analyzed, but it can also be directed to particular parts of the document.
The second stage of indexing involves the translation of the subject analysis into a set of index terms. This can involve extracting terms from the document or assigning them from a controlled vocabulary. With the ability to conduct a full text search widely available, many people have come to rely on their own expertise in conducting information searches, and full text search has become very popular. Nevertheless, subject indexing and its experts (professional indexers, catalogers, and librarians) remain crucial to information organization and retrieval. These experts understand controlled vocabularies and are able to find information that cannot be located by full text search. The cost of expert analysis to create subject indexing is not easily compared to the cost of the hardware, software and labor needed to produce a comparable set of full-text, fully searchable materials. With new web applications that allow every user to annotate documents, social tagging has gained popularity, especially on the Web. [4]
One application of indexing, the book index, remains relatively unchanged despite the information revolution.
Extraction indexing involves taking words directly from the document. It uses natural language and lends itself well to automated techniques in which word frequencies are calculated and words with a frequency over a pre-determined threshold are used as index terms. A stop-list containing common words (such as "the" and "and") is consulted, and such stop words are excluded as index terms.
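As an illustration, a frequency-based extraction indexer along these lines might be sketched as follows in Python; the stop-list, the threshold of 2, and the sample text are invented for the example:

```python
from collections import Counter
import re

# Illustrative stop-list and threshold; real systems use much larger lists.
STOP_WORDS = {"the", "and", "of", "a", "to", "in", "is", "for", "on", "with"}
THRESHOLD = 2  # minimum frequency for a word to become an index term

def extract_index_terms(text):
    """Extract index terms: frequent words not on the stop-list."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return sorted(term for term, n in counts.items() if n >= THRESHOLD)

doc = ("Glucose monitoring is central to diabetes care. "
       "Patients with diabetes measure glucose daily.")
print(extract_index_terms(doc))  # ['diabetes', 'glucose']
```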
Automated extraction indexing may lead to loss of meaning by indexing single words as opposed to phrases. Although it is possible to extract commonly occurring phrases, it becomes more difficult if key concepts are inconsistently worded in phrases. Automated extraction indexing also has the problem that, even with the use of a stop-list to remove common words, some frequent words may not be useful for discriminating between documents. For example, the term glucose is likely to occur frequently in any document related to diabetes, so using it as an index term would likely return most or all of the documents in the database. Post-coordinated indexing, where terms are combined at the time of searching, would reduce this effect, but the onus would be on the searcher, rather than the information professional, to link appropriate terms. In addition, terms that occur infrequently may be highly significant; for example, a new drug may be mentioned only rarely, but the novelty of the subject makes any reference significant. One method for allowing rarer terms to be included and common words to be excluded by automated techniques is a relative frequency approach, where the frequency of a word in a document is compared to its frequency in the database as a whole. A term that occurs more often in a document than would be expected based on the rest of the database can then be used as an index term, while terms that occur equally frequently throughout are excluded.
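The relative frequency approach can be sketched as follows; the cut-off ratio of 2.0 and the toy corpus are assumptions made for illustration:

```python
from collections import Counter
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def relative_frequency_terms(doc, corpus, ratio=2.0):
    """Select terms markedly more frequent in `doc` than in the corpus.

    `ratio` is an illustrative cut-off: a term qualifies when its share of
    the document exceeds `ratio` times its share of the whole database.
    """
    doc_counts = Counter(tokenize(doc))
    corpus_counts = Counter()
    for d in corpus:
        corpus_counts.update(tokenize(d))
    doc_total = sum(doc_counts.values())
    corpus_total = sum(corpus_counts.values())
    terms = []
    for term, n in doc_counts.items():
        doc_share = n / doc_total
        corpus_share = corpus_counts[term] / corpus_total
        if doc_share > ratio * corpus_share:
            terms.append(term)
    return sorted(terms)

corpus = [
    "diabetes glucose glucose treatment",
    "diabetes insulin insulin dosing",
    "diabetes diet exercise plan",
]
# "diabetes" is spread evenly across the database, so it is excluded.
print(relative_frequency_terms(corpus[1], corpus))  # ['dosing', 'insulin']
```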
Another problem with automated extraction is that it does not recognize when a concept is discussed but is not identified in the text by an indexable keyword. [5]
Since this process is based on simple string matching and involves no intellectual analysis, the resulting product is more appropriately known as a concordance than an index.
An alternative is assignment indexing where index terms are taken from a controlled vocabulary. This has the advantage of controlling for synonyms as the preferred term is indexed and synonyms or related terms direct the user to the preferred term. This means the user can find articles regardless of the specific term used by the author and saves the user from having to know and check all possible synonyms. [6] It also removes any confusion caused by homographs by inclusion of a qualifying term. A third advantage is that it allows the linking of related terms whether they are linked by hierarchy or association, e.g. an index entry for an oral medication may list other oral medications as related terms on the same level of the hierarchy but would also link to broader terms such as treatment. Assignment indexing is used in manual indexing to improve inter-indexer consistency as different indexers will have a controlled set of terms to choose from. Controlled vocabularies do not completely remove inconsistencies as two indexers may still interpret the subject differently. [2]
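A minimal sketch of assignment indexing against a controlled vocabulary might look as follows; the vocabulary, its synonym mappings, and the broader-term links are invented for the example:

```python
# Toy controlled vocabulary: synonyms map to a preferred term, and each
# preferred term may carry a broader term. Not from any published thesaurus.
PREFERRED = {
    "myocardial infarction": "myocardial infarction",
    "heart attack": "myocardial infarction",   # synonym -> preferred term
    "aspirin": "aspirin",
    "acetylsalicylic acid": "aspirin",
}
BROADER = {
    "myocardial infarction": "heart diseases",
    "aspirin": "anti-inflammatory agents",
}

def assign_terms(extracted_phrases):
    """Map extracted phrases onto preferred terms plus their broader terms."""
    assigned = set()
    for phrase in extracted_phrases:
        preferred = PREFERRED.get(phrase.lower())
        if preferred:
            assigned.add(preferred)
            if preferred in BROADER:
                assigned.add(BROADER[preferred])
    return sorted(assigned)

print(assign_terms(["Heart attack", "acetylsalicylic acid"]))
# ['anti-inflammatory agents', 'aspirin', 'heart diseases',
#  'myocardial infarction']
```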
The final phase of indexing is to present the entries in a systematic order. This may involve linking entries. In a pre-coordinated index the indexer determines the order in which terms are linked in an entry by considering how a user may formulate their search. In a post-coordinated index, the entries are presented singly and the user can link the entries through searches, most commonly carried out by computer software. Post-coordination results in a loss of precision in comparison to pre-coordination. [7]
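The distinction is easiest to see on the post-coordinated side, where single-term entries are combined at search time; in this sketch the index postings and document numbers are invented for illustration:

```python
# A post-coordinated index stores each term separately; the searcher
# combines them at query time. A pre-coordinated index would instead
# store a compound heading such as "Surgery -- Pediatric".
index = {
    "surgery": {1, 2, 5},
    "pediatric": {2, 3},
    "anesthesia": {2, 5, 7},
}

def search_all(*terms):
    """Post-coordinate: intersect the posting sets of the query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search_all("surgery", "pediatric"))  # {2}
```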
Indexers must make decisions about what entries should be included and how many entries an index should incorporate. The depth of indexing describes the thoroughness of the indexing process with reference to exhaustivity and specificity. [8]
An exhaustive index is one which lists all possible index terms. Greater exhaustivity gives higher recall, or a greater likelihood that all the relevant articles will be retrieved; however, this occurs at the expense of precision, meaning that the user may retrieve a larger number of irrelevant documents or documents which deal with the subject in little depth. In a manual system, a greater level of exhaustivity brings with it a greater cost, as more man-hours are required. The additional time taken in an automated system is much less significant. At the other end of the scale, in a selective index only the most important aspects are covered. [9] Recall is reduced in a selective index because, if an indexer does not include enough terms, a highly relevant article may be overlooked. Indexers should therefore strive for a balance and consider what the document may be used for. They may also have to consider the implications of time and expense.
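The recall/precision trade-off can be made concrete with the standard definitions; the retrieved and relevant document sets below are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved & relevant| / |retrieved|;
    recall = |retrieved & relevant| / |relevant|."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Exhaustive indexing: more documents retrieved, all relevant ones included.
print(precision_recall(retrieved={1, 2, 3, 4, 5, 6}, relevant={1, 2, 3}))
# (0.5, 1.0) -- high recall, low precision

# Selective indexing: fewer retrieved, but a relevant document is missed.
print(precision_recall(retrieved={1, 2}, relevant={1, 2, 3}))
# (1.0, 0.666...) -- high precision, reduced recall
```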
Specificity describes how closely the index terms match the topics they represent. [10] An index is said to be specific if the indexer uses descriptors that parallel the concepts of the document and reflect them precisely. [11] Specificity tends to increase with exhaustivity, as the more terms that are included, the narrower those terms will be.
Hjørland (2011) [12] found that theories of indexing are at the deepest level connected to different theories of knowledge.
The core of indexing is, as stated by Rowley and Farrow, [16] to evaluate a paper's contribution to knowledge and index it accordingly. Or, in the words of Hjørland (1992, [17] 1997), to index its informative potentials. "In order to achieve good consistent indexing, the indexer must have a thorough appreciation of the structure of the subject and the nature of the contribution that the document is making to the advancement of knowledge" (Rowley & Farrow, 2000, [16] p. 99).
Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.
Relevance is the concept of one topic being connected to another topic in a way that makes it useful to consider the second topic when considering the first. The concept of relevance is studied in many different fields, including cognitive sciences, logic, and library and information science. Most fundamentally, however, it is studied in epistemology. Different theories of knowledge have different implications for what is considered relevant and these fundamental views have implications for all other fields as well.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
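A compact sketch of this pipeline, using NumPy's singular value decomposition on a tiny term-document matrix invented for the example:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
X = np.array([
    [2, 1, 0],   # "glucose"
    [1, 2, 0],   # "diabetes"
    [0, 0, 3],   # "opera"
])

# Reduce to k latent concepts with a truncated SVD.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vectors[0], doc_vectors[1]))  # near 1: both about diabetes
print(cosine(doc_vectors[0], doc_vectors[2]))  # near 0: unrelated topics
```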
A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in each document in a collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix where "features" may refer to other properties of a document besides terms. It is also common to encounter the transpose, or term-document matrix where documents are the columns and terms are the rows. They are useful in the field of natural language processing and computational text analysis.
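Such a matrix can be built directly from token counts; the three short documents below are invented for illustration:

```python
# Build a small document-term matrix (rows: documents, columns: terms).
docs = ["the cat sat", "the cat ran", "dogs ran fast"]

vocabulary = sorted({w for d in docs for w in d.split()})
matrix = [[d.split().count(term) for term in vocabulary] for d in docs]

print(vocabulary)
# ['cat', 'dogs', 'fast', 'ran', 'sat', 'the']
for row in matrix:
    print(row)
# [1, 0, 0, 0, 1, 1]
# [1, 0, 0, 1, 0, 1]
# [0, 1, 1, 1, 0, 0]
```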
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.
Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, preferred terms that have been preselected by the designers of the schemes, in contrast to natural language vocabularies, which have no such restriction.
In information retrieval, an index term is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic records. They are an integral part of bibliographic control, which is the function by which libraries collect, organize and disseminate documents. They are used as keywords to retrieve documents in an information system, for instance, a catalog or a search engine. A popular form of keywords on the web are tags, which are directly visible and can be assigned by non-experts. Index terms can consist of a word, phrase, or alphanumerical term. They are created by analyzing the document either manually with subject indexing or automatically with automatic indexing or more sophisticated methods of keyword extraction. Index terms can either come from a controlled vocabulary or be freely assigned.
Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.
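At its core, such an index is usually an inverted index mapping each term to the documents containing it; this sketch uses invented documents and omits the parsing and ranking machinery of a real engine:

```python
from collections import defaultdict

# Minimal inverted index: each term maps to the set of document IDs
# containing it.
documents = {
    1: "full text search of web pages",
    2: "indexing web pages for retrieval",
}

inverted = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        inverted[token].add(doc_id)

print(sorted(inverted["web"]))     # [1, 2]
print(sorted(inverted["search"]))  # [1]
```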
Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input and expanding the search query to match additional documents, for example by finding synonyms or semantically related words, stemming query words to match their morphological variants, correcting spelling errors, and re-weighting the terms in the original query.
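For instance, expansion by morphological variants can be sketched with a deliberately naive suffix-stripping stemmer (a toy stand-in for a real stemming algorithm; the suffix list and vocabulary are invented):

```python
# Naive query expansion via morphological variants: strip a few common
# English suffixes so a query term matches its other forms.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def expand(query_terms, vocabulary):
    """Expand each query term to all vocabulary words sharing its stem."""
    expanded = set()
    for term in query_terms:
        target = stem(term)
        expanded.update(w for w in vocabulary if stem(w) == target)
    return sorted(expanded)

vocab = {"index", "indexes", "indexing", "indexed", "search", "searched"}
print(expand(["indexing"], vocab))
# ['index', 'indexed', 'indexes', 'indexing']
```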
Knowledge organization (KO), organization of knowledge, organization of information, or information organization is an intellectual discipline concerned with activities such as document description, indexing, and classification that serve to provide systems of representation and order for knowledge and information objects. According to The Organization of Information by Joudrey and Taylor, information organization:
examines the activities carried out and tools used by people who work in places that accumulate information resources for the use of humankind, both immediately and for posterity. It discusses the processes that are in place to make resources findable, whether someone is searching for a single known item or is browsing through hundreds of resources just hoping to discover something useful. Information organization supports a myriad of information-seeking scenarios.
Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term ‘audio mining’ is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, is mostly used to describe the pre-process of audio mining, in which the audio file is broken down into a searchable index of words.
A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.
Aboutness is a term used in library and information science (LIS), linguistics, philosophy of language, and philosophy of mind. In general, the term refers to the concept that a text, utterance, image, or action is on or of something. In LIS, it is often considered synonymous with a document's subject. In the philosophy of mind, it has often been considered synonymous with intentionality, perhaps since John Searle (1983). In the philosophy of logic and language, it is understood as the way a piece of text relates to a subject matter or topic.
Polythematic Structured Subject Heading System is a bilingual Czech–English controlled vocabulary of subject headings developed and maintained by the National Technical Library in Prague. It was designed for describing and searching information resources according to their subject. PSH contains more than 13,900 terms, which cover the main fields of human knowledge.
Automatic indexing is the computerized process of scanning large volumes of documents against a controlled vocabulary, taxonomy, thesaurus or ontology and using those controlled terms to quickly and effectively index large electronic document depositories. These controlled terms are applied by training a system on the rules that determine which words to match. Additional components, such as syntax, usage, proximity, and other algorithms, are taken into account depending on the system and what is required for indexing. Boolean statements are used to gather and capture the indexing information from the text. As the number of documents grows exponentially with the proliferation of the Internet, automatic indexing is becoming essential to maintaining the ability to find relevant information in a sea of irrelevant information. Natural language systems are trained using seven different methods to help with this: morphological, lexical, syntactic, numerical, phraseological, semantic, and pragmatic. Each of these looks at different parts of speech and terms to build a domain for the specific information being covered for indexing.
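A rule-based matcher of this kind can be sketched as follows; the controlled terms and their Boolean rules are invented for illustration:

```python
# Rule-based automatic indexing: each controlled term carries a Boolean
# rule over the words that must (or must not) appear in the text.
RULES = {
    "Diabetes mellitus": lambda w: "diabetes" in w
        or ("insulin" in w and "glucose" in w),
    "Veterinary medicine": lambda w: "veterinary" in w and "human" not in w,
}

def auto_index(text):
    words = set(text.lower().split())
    return sorted(term for term, rule in RULES.items() if rule(words))

print(auto_index("Insulin therapy regulates blood glucose levels"))
# ['Diabetes mellitus']
```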
In library and information science, documents are classified and searched by subject, as well as by other attributes such as author, genre and document type. This makes "subject" a fundamental term in this field. Library and information specialists assign subject labels to documents to make them findable. There are many ways to do this, and in general there is not always consensus about which subject should be assigned to a given document. To optimize subject indexing and searching, we need a deeper understanding of what a subject is. The question "what is to be understood by the statement 'document A belongs to subject category X'?" has been debated in the field for more than 100 years.
The following outline is provided as an overview of and topical guide to natural-language processing.
In the context of information retrieval, a thesaurus is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object.