Statistically improbable phrase

A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document (or collection of documents) than in some larger corpus. [1] [2] [3] Amazon.com uses this concept to determine keywords for a given book or chapter, since the keywords of a book or chapter are likely to appear disproportionately within that section. [4] [5] Christian Rudder also used this concept in his book Dataclysm, applying it to data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender. [6] SIPs of two or three words, following patterns such as adjective-adjective-noun or adverb-adverb-verb, can signal the author's attitude, premises, or conclusions to the reader, or express an important idea.
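Amazon has not published its exact method, but the core idea can be sketched with a simple document-to-corpus rate ratio over n-grams, using add-one smoothing for phrases absent from the corpus (the function names here are illustrative, not any published API):

```python
from collections import Counter

def ngrams(words, n):
    """Return all contiguous n-word phrases in a token list."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def sip_scores(document, corpus, n=2):
    """Score each n-gram in the document by how much more often it
    occurs there than in the reference corpus (simple rate ratio,
    with add-one smoothing so unseen corpus phrases score highest)."""
    doc_counts = Counter(ngrams(document.lower().split(), n))
    corpus_counts = Counter(ngrams(corpus.lower().split(), n))
    doc_total = max(sum(doc_counts.values()), 1)
    corpus_total = max(sum(corpus_counts.values()), 1)
    return {
        phrase: (count / doc_total)
                / ((corpus_counts[phrase] + 1) / (corpus_total + 1))
        for phrase, count in doc_counts.items()
    }
```

The highest-scoring phrases are candidate SIPs. A production system would use a very large reference corpus and a proper significance test (such as log-likelihood) rather than a raw ratio, so that rare-by-chance phrases are not over-weighted.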

Plagiarism detection

Another use of SIPs is as a detection tool for plagiarism. (Almost) unique combinations of words can be searched for online, and if they have appeared in a published text, the search will identify where. This method can only check texts that have been published and digitized online.

For example, a student submission containing the phrase "garden style, praising irregularity in design" could be searched for on Google, and the search would yield the original Wikipedia article about Sir William Temple, an English political figure and essayist.
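The selection step can be sketched as follows: pick the n-grams of a submission that are least frequent in a reference corpus, and quote them as web-search queries. This is a minimal illustration, assuming a precomputed mapping from phrases to corpus counts (the function name and parameters are hypothetical):

```python
def plagiarism_queries(submission, reference_counts, n=5, top=3):
    """Return the `top` n-grams of a submission that are rarest in a
    reference corpus, quoted for use as exact-phrase search queries.
    `reference_counts` maps n-grams to their corpus frequencies;
    phrases absent from the corpus count as zero and rank first."""
    words = submission.lower().split()
    candidates = [" ".join(words[i:i + n])
                  for i in range(len(words) - n + 1)]
    rarest = sorted(candidates,
                    key=lambda p: reference_counts.get(p, 0))[:top]
    return ['"%s"' % p for p in rarest]
```

Each returned string can be pasted into a search engine as an exact-phrase query; a match against a published text points to the likely source.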

Example

In a document about computers, the most common word is likely to be "the", but since "the" is the most commonly used word in the English language, it is probable that any given document will use "the" very frequently. However, a phrase like "explicit Boolean algorithm" might occur in the document at a much higher rate than its average rate in the English language as a whole. It is a phrase unlikely to appear in any given document, yet it did appear in this one; "explicit Boolean algorithm" would therefore be a statistically improbable phrase.

Statistically improbable phrases of Darwin's On the Origin of Species could be: temperate productions, genera descended, transitional gradations, unknown progenitor, fossiliferous formations, our domestic breeds, modified offspring, doubtful forms, closely allied forms, profitable variations, enormously remote, transitional grades, very distinct species and mongrel offspring. [7]


References

  1. "SIPping Wikipedia" (PDF). Courses.cms.caltech.edu. Retrieved 2017-01-01.
  2. Jonathan Bailey (3 July 2012). "How Long Should a Statistically Improbable Phrase Be?". Plagiarism Today.
  3. Errami, Mounir; Sun, Zhaohui; George, Angela C.; Long, Tara C.; Skinner, Michael A.; Wren, Jonathan D.; Garner, Harold R. (1 June 2010). "Identifying duplicate content using statistically improbable phrases". Bioinformatics. 26 (11): 1453–1457. doi:10.1093/bioinformatics/btq146. PMC 2872002. PMID 20472545.
  4. "What are Statistically Improbable Phrases?". Amazon.com. Retrieved 2007-12-18.
  5. Weeks, Linton (August 30, 2005). "Amazon's Vital Statistics Show How Books Stack Up". The Washington Post. Retrieved September 8, 2015.
  6. Rudder, Christian (2014). Dataclysm: Who We Are When We Think No One's Looking. New York: Crown Publishers. ISBN 978-0-385-34737-2.
  7. "Sociologically Improbable Phrases". Crooked Timber. April 2005.