Key Word in Context

Key Word In Context (KWIC) is the most common format for concordance lines. The term KWIC was coined by Hans Peter Luhn. [1] The system was based on a concept called keyword in titles, which was first proposed for Manchester libraries in 1864 by Andrea Crestadoro. [2]

A KWIC index is formed by sorting and aligning the words within an article title so that each word in the title (except the stop words) can be looked up alphabetically in the index. [3] It was a useful indexing method for technical manuals before computerized full-text search became common.

For example, a search query including all of the words in an example definition ("KWIC is an acronym for Key Word In Context, the most common format for concordance lines") and the Wikipedia slogan in English ("the free encyclopedia"), searched against a Wikipedia page, might yield a KWIC index as follows. A KWIC index usually uses a wide layout to allow the display of maximum 'in context' information (not shown in the following example).

| Left context | Keyword | Right context | Locator |
| ---: | :--- | :--- | :--- |
| KWIC is an | acronym | for Key Word In Context, ... | page 1 |
| ... Key Word In Context, the most | common | format for concordance lines. | page 1 |
| ... the most common format for | concordance | lines. | page 1 |
| ... is an acronym for Key Word In | Context | , the most common format ... | page 1 |
| Wikipedia, The Free | Encyclopedia | | page 0 |
| ... In Context, the most common | format | for concordance lines. | page 1 |
| Wikipedia, The | Free | Encyclopedia | page 0 |
| KWIC is an acronym for | Key | Word In Context, the most ... | page 1 |
| | KWIC | is an acronym for Key Word ... | page 1 |
| ... common format for concordance | lines | . | page 1 |
| ... for Key Word In Context, the | most | common format for concordance ... | page 1 |
| | Wikipedia | , The Free Encyclopedia | page 0 |
| KWIC is an acronym for Key | Word | In Context, the most common ... | page 1 |
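
The sorting and alignment described above can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of any historical KWIC system: the function name `kwic_index`, the `width` parameter, and the small `STOP_WORDS` list are all assumptions made for the example.

```python
# Hypothetical stop list for illustration only; real indexes choose their own.
STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "is", "of", "the", "to"})


def kwic_index(titles, width=30, stop_words=STOP_WORDS):
    """Return one aligned line per non-stop word, sorted by keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            # Normalise the keyword for sorting and stop-word lookup.
            key = word.strip(".,;:!?\"'()").lower()
            if not key or key in stop_words:
                continue
            left = " ".join(words[:i])    # context before the keyword
            right = " ".join(words[i:])   # keyword plus context after it
            entries.append((key, left, right))

    # Stable sort: entries sharing a keyword keep their input order.
    entries.sort(key=lambda entry: entry[0])

    lines = []
    for _, left, right in entries:
        # Right-align (and truncate) the left context so that every
        # keyword starts in the same column, width + 1.
        lines.append(f"{left[-width:]:>{width}} {right[:width]}")
    return lines
```

Calling `kwic_index(["The Art of Computer Programming", "Programming Pearls"], width=20)` yields five lines, one per non-stop word, with the keywords stacked in a single column. A page locator could be carried along as a fourth field of each entry; it is omitted here to keep the sketch short.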

A KWIC index is a special case of a permuted index. [4] This term refers to the fact that it indexes all cyclic permutations of the headings. Books composed of many short sections with their own descriptive headings, most notably collections of manual pages, often ended with a permuted index section, allowing the reader to easily find a section by any word from its heading. This practice, also known as Key Word Out of Context (KWOC), is no longer common.
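
To make "cyclic permutations of the headings" concrete, the following Python sketch rotates each heading so that every word takes a turn at the front, then sorts the rotations. The names `cyclic_rotations` and `permuted_index` are hypothetical, and the output format is an assumption for illustration; it is not the algorithm of any particular tool.

```python
def cyclic_rotations(heading):
    """Return every cyclic rotation of the words in a heading."""
    words = heading.split()
    return [" ".join(words[i:] + words[:i]) for i in range(len(words))]


def permuted_index(headings, stop_words=frozenset()):
    """Return (rotation, original heading) pairs sorted by rotation.

    Rotations that would begin with a stop word are skipped, so stop
    words never appear as index entries.
    """
    entries = []
    for heading in headings:
        for rotation in cyclic_rotations(heading):
            first_word = rotation.split()[0].lower()
            if first_word in stop_words:
                continue
            entries.append((rotation, heading))
    # Case-insensitive alphabetical order, as in a printed index.
    return sorted(entries, key=lambda entry: entry[0].lower())
```

For the heading "list directory contents", `cyclic_rotations` returns "list directory contents", "directory contents list" and "contents list directory", so a reader can find the section under any of its three words.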

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

A translation memory (TM) is a database that stores "segments", which can be sentences, paragraphs or sentence-like units that have previously been translated, in order to aid human translators. The translation memory stores the source text and its corresponding translation in language pairs called “translation units”. Individual words are handled by terminology bases and are not within the domain of TM.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious and automatic, but it can come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other language-processing tasks, such as discourse analysis, improving the relevance of search engines, anaphora resolution, coherence, and inference.

This page is a glossary of library and information science.

Computer-aided translation (CAT), also referred to as computer-assisted translation or computer-aided human translation (CAHT), is the use of software, also known as a translator, to assist a human translator in the translation process. The translation is created by a human, and certain aspects of the process are facilitated by software; this is in contrast with machine translation (MT), in which the translation is created by a computer, optionally with some human intervention.

Stop words are the words in a stop list which are filtered out before or after processing of natural language data (text) because they are insignificant. There is no single universal list of stop words used by all natural language processing tools, nor any agreed-upon rules for identifying stop words, and indeed not all tools even use such a list. Therefore, any group of words can be chosen as the stop words for a given purpose. The "general trend in [information retrieval] systems over time has been from standard use of quite large stop lists to very small stop lists to no stop list whatsoever".

Hans Peter Luhn was an American researcher in the field of computer science and Library & Information Science for IBM, and creator of the Luhn algorithm, KWIC indexing, and selective dissemination of information ("SDI"). His inventions have found applications in diverse areas like computer science, the textile industry, linguistics, and information science. He was awarded over 80 patents.

The anchor text, link label or link text is the visible, clickable text in an HTML hyperlink. The term "anchor" was used in older versions of the HTML specification for what is currently referred to as the a element, or <a>. The HTML specification does not have a specific term for anchor text, but refers to it as "text that the a element wraps around". In XML terms, the anchor text is the content of the element, provided that the content is text.

An index is a list of words or phrases ('headings') and associated pointers ('locators') to where useful material relating to that heading can be found in a document or collection of documents. Examples are an index in the back matter of a book and an index that serves as a library catalog. An index differs from a word index, or concordance, in focusing on the subject of the text rather than the exact words in a text. It differs from a table of contents because the index is ordered by subject, regardless of whether the material appears early or late in the book, while the items listed in a table of contents are placed in the same order as the book.

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

ptx is a Unix utility, named after the permuted index algorithm which it uses to produce a search or concordance report in the Keyword in Context (KWIC) format. It is available on most Unix and Unix-like operating systems. The GNU implementation uses extensions that are more powerful than the older SysV implementation. The command is available as a separate package for Microsoft Windows as part of the UnxUtils collection of native Win32 ports of common GNU Unix-like utilities.

A tag cloud is a visual representation of text data which is often used to depict keyword metadata on websites, or to visualize free form text. Tags are usually single words, and the importance of each tag is shown with font size or color. When used as website navigation aids, the terms are hyperlinked to items associated with the tag.

A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Historically, concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era.

In linguistics, the term lexis designates the complete set of all possible words in a language, or a particular subset of words that are grouped by some specific linguistic criteria. For example, the general term English lexis refers to all words of the English language, while the more specific term English religious lexis refers to a particular subset within English lexis, encompassing only words that are semantically related to the religious sphere of life.

In information retrieval, an index term is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic records. They are an integral part of bibliographic control, which is the function by which libraries collect, organize and disseminate documents. They are used as keywords to retrieve documents in an information system, for instance, a catalog or a search engine. A popular form of keywords on the web are tags, which are directly visible and can be assigned by non-experts. Index terms can consist of a word, phrase, or alphanumerical term. They are created by analyzing the document either manually with subject indexing or automatically with automatic indexing or more sophisticated methods of keyword extraction. Index terms can either come from a controlled vocabulary or be freely assigned.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

Herbert Marvin Ohlman (1927–2002) was the inventor of permutation indexing, or Permuterm, and one of the pioneers of Information Science and Technology. He has been recognized and included in the Pioneers of Information Science in North America Project by ASIS.

Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are about, to summarize their contents or to increase findability. In other words, it is about identifying and describing the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents within a field of knowledge.

In computing, apropos is a command to search the man page files in Unix and Unix-like operating systems. Apropos takes its name from the French "à propos", which means "about". It is particularly useful when searching for commands without knowing their exact names.

Dr. Andrea Crestadoro (1808–1879) was a bibliographer who became Chief Librarian of Manchester Free Library, 1864–1879. He is credited with being the first person to propose that books could be catalogued by using keywords that did not occur in the title of the book. His ideas also included a metallic balloon, reform of the tax system, and improvements to a railway locomotive – the Impulsoria – that was powered by four horses on a treadmill.

References

  1. Manning, C. D.; Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press. p. 35.
  2. "Advanced Indexing and Abstracting Practices". Atlantic Publishers & Distri. Retrieved 26 March 2019 via Google Books.
  3. "KWIC indexes and concordances". Archived from the original on 2016-06-06. Retrieved 2016-06-17.
  4. "3. Theory of KWIC indexing". Infohost.nmt.edu. Archived from the original on 14 May 2019. Retrieved 26 March 2019.