Key Word In Context (KWIC) is the most common format for concordance lines. The term KWIC was coined by Hans Peter Luhn. [1] The system was based on a concept called keyword in titles, which was first proposed for Manchester libraries in 1864 by Andrea Crestadoro. [2]
A KWIC index is formed by sorting and aligning the words within an article title to allow each word (except the stop words) in titles to be searchable alphabetically in the index. [3] It was a useful indexing method for technical manuals before computerized full text search became common.
For example, a search query including all of the words in an example definition ("KWIC is an acronym for Key Word In Context, the most common format for concordance lines") and the Wikipedia slogan in English ("the free encyclopedia"), searched against a Wikipedia page, might yield a KWIC index as follows. A KWIC index usually uses a wide layout to allow the display of maximum 'in context' information (not shown in the following example).
KWIC is an | acronym for Key Word In Context, ... | page 1 |
... Key Word In Context, the most | common format for concordance lines. | page 1 |
... the most common format for | concordance lines. | page 1 |
... is an acronym for Key Word In | Context, the most common format ... | page 1 |
Wikipedia, The Free | Encyclopedia | page 0 |
... In Context, the most common | format for concordance lines. | page 1 |
Wikipedia, The | Free Encyclopedia | page 0 |
KWIC is an acronym for | Key Word In Context, the most ... | page 1 |
| KWIC is an acronym for Key Word ... | page 1 |
... common format for concordance | lines. | page 1 |
... for Key Word In Context, the | most common format for concordance ... | page 1 |
| Wikipedia, The Free Encyclopedia | page 0 |
KWIC is an acronym for Key | Word In Context, the most common ... | page 1 |
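The construction described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names (`kwic_entries`, `format_kwic`), the small stop list, and the 30-character column width are all assumptions chosen for the example.

```python
import re

# Hypothetical stop list for illustration; real indexes choose their own.
STOP_WORDS = {"a", "an", "for", "in", "is", "the"}

def kwic_entries(titles, stop_words=STOP_WORDS):
    """Return (keyword, left context, keyword + right context, locator) tuples,
    sorted alphabetically by keyword."""
    entries = []
    for locator, title in titles:
        words = title.split()
        for i, word in enumerate(words):
            # Strip punctuation and case so "Context," sorts as "context".
            key = re.sub(r"\W+", "", word).lower()
            if not key or key in stop_words:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i:])
            entries.append((key, left, right, locator))
    entries.sort(key=lambda entry: entry[0])
    return entries

def format_kwic(entries, width=30):
    """Right-align the left context so every keyword starts in the same column."""
    return [
        f"{left[-width:]:>{width}} | {right[:width]:<{width}} | {locator}"
        for _, left, right, locator in entries
    ]

titles = [
    ("page 1", "KWIC is an acronym for Key Word In Context"),
    ("page 0", "Wikipedia, The Free Encyclopedia"),
]
for line in format_kwic(kwic_entries(titles)):
    print(line)
```

Each non-stop word yields one line, and long contexts are truncated to the column width, which is why printed KWIC indexes favour a wide layout.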
A KWIC index is a special case of a permuted index. [4] This term refers to the fact that it indexes all cyclic permutations of the headings. Books composed of many short sections with their own descriptive headings, most notably collections of manual pages, often ended with a permuted index section, allowing the reader to easily find a section by any word from its heading. This practice, also known as Key Word Out of Context (KWOC), is no longer common.
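As a rough illustration of what "all cyclic permutations of the headings" means, the sketch below rotates each heading so that every word appears once at the front, then sorts the rotations. The helper names `cyclic_rotations` and `permuted_index` are hypothetical, and stop-word filtering is deliberately left out to keep the example short.

```python
def cyclic_rotations(heading):
    """Return every cyclic rotation of the words in a heading."""
    words = heading.split()
    return [" ".join(words[i:] + words[:i]) for i in range(len(words))]

def permuted_index(headings):
    """Pair each rotation with its locator and sort case-insensitively."""
    rows = []
    for locator, heading in headings:
        for rotation in cyclic_rotations(heading):
            rows.append((rotation, locator))
    return sorted(rows, key=lambda row: row[0].lower())

for rotation, locator in permuted_index([("page 1", "Key Word In Context")]):
    print(f"{rotation:<25} {locator}")
```

A heading of n words produces n rows, so a reader can look it up under any of its words.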
ptx, a Unix command-line utility producing a permuted index
A translation memory (TM) is a database that stores "segments", which can be sentences, paragraphs or sentence-like units that have previously been translated, in order to aid human translators. The translation memory stores the source text and its corresponding translation in language pairs called “translation units”. Individual words are handled by terminology bases and are not within the domain of TM.
A citation is a reference to a source. More precisely, a citation is an abbreviated alphanumeric expression embedded in the body of an intellectual work that denotes an entry in the bibliographic references section of the work for the purpose of acknowledging the relevance of the works of others to the topic of discussion at the spot where the citation appears.
This page is a glossary of library and information science.
Computer-aided translation (CAT), also referred to as computer-assisted translation or computer-aided human translation (CAHT), is the use of software, also known as a translator, to assist a human translator in the translation process. The translation is created by a human, and certain aspects of the process are facilitated by software; this is in contrast with machine translation (MT), in which the translation is created by a computer, optionally with some human intervention.
A lightweight markup language (LML), also termed a simple or humane markup language, is a markup language with simple, unobtrusive syntax. It is designed to be easy to write using any generic text editor and easy to read in its raw form. Lightweight markup languages are used in applications where it may be necessary to read the raw document as well as the final rendered output.
Stop words are the words in a stop list which are filtered out before or after processing of natural language data (text) because they are deemed insignificant. There is no single universal list of stop words used by all natural language processing tools, nor any agreed upon rules for identifying stop words, and indeed not all tools even use such a list. Therefore, any group of words can be chosen as the stop words for a given purpose. The "general trend in [information retrieval] systems over time has been from standard use of quite large stop lists to very small stop lists to no stop list whatsoever".
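Since any group of words can serve as a stop list, filtering amounts to a set-membership test on each token. The following is a minimal sketch assuming whitespace tokenization and a caller-supplied stop list; the function name `remove_stop_words` is hypothetical and does not come from any particular toolkit.

```python
import string

def remove_stop_words(text, stop_words):
    """Lower-case the text, strip surrounding punctuation, drop stop words."""
    tokens = (token.strip(string.punctuation).lower() for token in text.split())
    return [token for token in tokens if token and token not in stop_words]

print(remove_stop_words(
    "The most common format for concordance lines.",
    {"the", "for", "most"},
))
```

Passing an empty set keeps every token, which corresponds to the "no stop list whatsoever" end of the trend described above.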
Hans Peter Luhn was an American researcher in the field of computer science and Library & Information Science for IBM, and creator of the Luhn algorithm, KWIC indexing, and selective dissemination of information ("SDI"). His inventions have found applications in diverse areas like computer science, the textile industry, linguistics, and information science. He was awarded over 80 patents.
The anchor text, link label, or link text is the visible, clickable text in an HTML hyperlink. The term "anchor" was used in older versions of the HTML specification for what is currently referred to as the a element, or <a>. The HTML specification does not have a specific term for anchor text, but refers to it as "text that the a element wraps around". In XML terms, the anchor text is the content of the element, provided that the content is text.
An index is a list of words or phrases ('headings') and associated pointers ('locators') to where useful material relating to that heading can be found in a document or collection of documents. Examples are an index in the back matter of a book and an index that serves as a library catalog. An index differs from a word index, or concordance, in focusing on the subject of the text rather than the exact words in a text, and it differs from a table of contents because the index is ordered by subject, regardless of whether it is early or late in the book, while the listed items in a table of contents are placed in the same order as the book.
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.
ptx is a Unix utility, named after the permuted index algorithm which it uses to produce a search or concordance report in the Keyword in Context (KWIC) format. It is available on most Unix and Unix-like operating systems. The GNU implementation uses extensions that are more powerful than the older SysV implementation. The command is available as a separate package for Microsoft Windows as part of the UnxUtils collection of native Win32 ports of common GNU Unix-like utilities.
A tag cloud is a visual representation of text data which is often used to depict keyword metadata on websites, or to visualize free form text. Tags are usually single words, and the importance of each tag is shown with font size or color. When used as website navigation aids, the terms are hyperlinked to items associated with the tag.
A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document than in some larger corpus. Amazon.com uses this concept in determining keywords for a given book or chapter, since keywords of a book or chapter are likely to appear disproportionately within that section. Christian Rudder has also used this concept with data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender in his book Dataclysm. SIPs with a linguistic density of two or three words (adjective, adjective, noun or adverb, adverb, verb) will signal the author's attitude, premise or conclusions to the reader or express an important idea.
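One simple way to make "more frequently in a document than in some larger corpus" concrete is to compare relative frequencies of two-word phrases. The sketch below is an assumption-laden illustration, not the method used by Amazon.com or by Rudder: the bigram window, the add-one smoothing, and the names `bigrams` and `improbable_phrases` are all choices made for the example.

```python
from collections import Counter

def bigrams(text):
    """Return the consecutive word pairs in a text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def improbable_phrases(document, corpus, top=3):
    """Rank document bigrams by document rate divided by smoothed corpus rate."""
    doc_counts = Counter(bigrams(document))
    corpus_counts = Counter(bigrams(corpus))
    doc_total = sum(doc_counts.values()) or 1
    corpus_total = sum(corpus_counts.values()) or 1

    def score(phrase):
        doc_rate = doc_counts[phrase] / doc_total
        # Add-one smoothing avoids division by zero for unseen phrases.
        corpus_rate = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        return doc_rate / corpus_rate

    ranked = sorted(doc_counts, key=score, reverse=True)
    return [" ".join(phrase) for phrase in ranked[:top]]

document = "permuted index permuted index of the manual"
corpus = "of the people by the people for the people of the world"
print(improbable_phrases(document, corpus, top=1))
```

Phrases that are common everywhere, such as "of the", score low because their corpus rate is high, while phrases concentrated in the document rise to the top.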
A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Historically, concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era.
In linguistics, the term lexis designates the complete set of all possible words in a language, or a particular subset of words that are grouped by some specific linguistic criteria. For example, the general term English lexis refers to all words of the English language, while the more specific term English religious lexis refers to a particular subset within English lexis, encompassing only words that are semantically related to the religious sphere of life.
In information retrieval, an index term is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic records. They are an integral part of bibliographic control, which is the function by which libraries collect, organize and disseminate documents. They are used as keywords to retrieve documents in an information system, for instance, a catalog or a search engine. Tags, which are directly visible and can be assigned by non-experts, are a popular form of keyword on the web. Index terms can consist of a word, phrase, or alphanumerical term. They are created by analyzing the document either manually with subject indexing or automatically with automatic indexing or more sophisticated methods of keyword extraction. Index terms can either come from a controlled vocabulary or be freely assigned.
Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.
Herbert Marvin Ohlman (1927–2002) was the inventor of permutation indexing, or Permuterm, and one of the pioneers of Information Science and Technology. He has been recognized and included in the Pioneers of Information Science in North America Project.
Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are about, to summarize their contents or to increase findability. In other words, it is about identifying and describing the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents within a field of knowledge.
Dr. Andrea Crestadoro (1808–1879) was a bibliographer who became Chief Librarian of Manchester Free Library, 1864–1879. He is credited with being the first person to propose that books could be catalogued by using keywords that did not occur in the title of the book. His ideas also included a metallic balloon, reform of the tax system, and improvements to a railway locomotive – the Impulsoria – that was powered by four horses on a treadmill.