Collocation extraction

Collocation extraction is the task of using a computer to extract collocations automatically from a corpus.

Within the area of corpus linguistics, a collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance. 'Crystal clear', 'middle management', 'nuclear family', and 'cosmetic surgery' are examples of collocated pairs of words. Some words are often found together because they make up a compound noun, for example 'riding boots' or 'motor cyclist'.

The traditional method of performing collocation extraction is to apply a formula, based on statistical quantities computed over the corpus, that assigns a score to each word pair. Proposed measures include mutual information, the t-test, the z-test, the chi-squared test and the likelihood ratio. [1]
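As a concrete illustration, the following sketch ranks adjacent word pairs by pointwise mutual information (PMI), one of the association measures listed above. It assumes a pre-tokenised, lower-cased corpus; the function name and thresholds are illustrative rather than taken from any particular library.

    import math
    from collections import Counter

    def extract_collocations(tokens, min_count=5, top_n=20):
        """Rank adjacent word pairs by pointwise mutual information (PMI)."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        total_unigrams = sum(unigrams.values())
        total_bigrams = sum(bigrams.values())

        scored = []
        for (w1, w2), joint in bigrams.items():
            if joint < min_count:  # PMI overrates rare pairs, so drop them
                continue
            p_pair = joint / total_bigrams
            p_w1 = unigrams[w1] / total_unigrams
            p_w2 = unigrams[w2] / total_unigrams
            scored.append((math.log2(p_pair / (p_w1 * p_w2)), w1, w2))

        return sorted(scored, reverse=True)[:top_n]

    # On a sufficiently large corpus, pairs such as ('crystal', 'clear')
    # or ('cosmetic', 'surgery') would be expected near the top of the list.

The same loop can be re-scored with the t-test, chi-squared or log-likelihood statistics by replacing only the scoring line.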

Related Research Articles

In linguistics, a corpus or text corpus is a language resource consisting of a large and structured set of texts. In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

Morphological derivation, in linguistics, is the process of forming a new word from an existing word, often by adding a prefix or suffix, such as un- or -ness. For example, unhappy and happiness derive from the root word happy.

English compound: aspect of English grammar

A compound is a word composed of more than one free morpheme. The English language, like many others, uses compounds frequently. English compounds may be classified in several ways, such as the word classes or the semantic relationship of their components.

Collocation: frequent occurrence of words next to each other

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.

Lemmatisation in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

Hendiadys is a figure of speech used for emphasis: "the substitution of a conjunction for a subordination". The basic idea is to use two words linked by the conjunction "and" instead of the one modifying the other. English names for hendiadys include two for one and figure of twins. The term hendiaduo may also be used. The 17th-century English biblical commentator Matthew Poole referred to "hendiaduos" in his comments on Genesis 3:16, Proverbs 1:6, and Isaiah 19:20.

In corpus linguistics, part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
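For instance, a tagger such as the one bundled with NLTK assigns a part-of-speech label to each token. This is one possible tool among many, and the model download shown may differ between NLTK versions.

    import nltk

    # The perceptron tagger model must be available locally; the resource
    # name may vary in newer NLTK releases.
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = "Riding boots fit the motor cyclist".split()
    print(nltk.pos_tag(tokens))
    # Each token is paired with a tag such as NN (noun) or JJ (adjective),
    # chosen from both the word itself and its surrounding context.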

In linguistics, a compound is a lexeme that consists of more than one stem. Compounding, composition or nominal composition is the process of word formation that creates compound lexemes. Compounding occurs when two or more words or signs are joined to make a longer word or sign. A compound that uses a space rather than a hyphen or concatenation is called an open compound or a spaced compound; the alternative is a closed compound.

Semantic prosody, also discourse prosody, describes the way in which certain seemingly neutral words can be perceived with positive or negative associations through frequent occurrences with particular collocations. The term was coined by analogy with linguistic prosody and popularised by Bill Louw.

Word: smallest linguistic element that can be uttered in isolation with semantic or pragmatic content

A word can be generally defined as a basic element of language that carries an objective or practical meaning, can be used on its own, and is uninterruptible. Although language speakers often have an intuitive grasp of what a word is, there is no consensus among linguists on its definition, and numerous attempts to find specific criteria for the concept remain controversial. Different standards have been proposed, depending on the theoretical background and descriptive context; these do not converge on a single definition. Some specific definitions of the term "word" are employed to convey its different meanings at different levels of description, for example on a phonological, grammatical or orthographic basis. Others suggest that the concept is simply a convention used in everyday situations.

In morphology and lexicography, a lemma is the canonical form, dictionary form, or citation form of a set of word forms. In English, for example, break, breaks, broke, broken and breaking are forms of the same lexeme, with break as the lemma by which they are indexed. Lexeme, in this context, refers to the set of all the inflected or alternating forms in the paradigm of a single word, and lemma refers to the particular form that is chosen by convention to represent the lexeme. Lemmas have special significance in highly inflected languages such as Arabic, Turkish and Russian. The process of determining the lemma for a given lexeme is called lemmatisation. The lemma can be viewed as the chief of the principal parts, although lemmatisation is at least partly arbitrary.
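A brief sketch of lemmatisation, applied to the forms of break mentioned above, using NLTK's WordNet lemmatiser. This is one tool among many, and the WordNet data may need to be downloaded first.

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)
    lemmatiser = WordNetLemmatizer()

    for form in ["breaks", "broke", "broken", "breaking"]:
        # pos="v" tells the lemmatiser to treat each form as a verb, so the
        # irregular forms are looked up in WordNet's exception lists and all
        # map back to the lemma "break".
        print(form, "->", lemmatiser.lemmatize(form, pos="v"))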

A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document than in some larger corpus. Amazon.com uses this concept in determining keywords for a given book or chapter, since keywords of a book or chapter are likely to appear disproportionately within that section. Christian Rudder has also used this concept with data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender in his book Dataclysm. SIPs with a linguistic density of two or three words (adjective, adjective, noun or adverb, adverb, verb) will signal the author's attitude, premise or conclusions to the reader or express an important idea.
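A minimal sketch of the underlying comparison, assuming tokenised input; the function name, the add-one smoothing and the thresholds are illustrative choices, not a description of Amazon's actual method.

    from collections import Counter

    def improbable_phrases(doc_tokens, corpus_tokens, n=2, min_count=3, top_k=10):
        """Rank n-grams that are over-represented in a document relative to a
        larger background corpus."""
        def ngram_counts(tokens):
            return Counter(zip(*(tokens[i:] for i in range(n))))

        doc = ngram_counts(doc_tokens)
        background = ngram_counts(corpus_tokens)
        doc_total = max(sum(doc.values()), 1)
        bg_total = max(sum(background.values()), 1)

        scored = []
        for gram, count in doc.items():
            if count < min_count:
                continue
            doc_rate = count / doc_total
            bg_rate = (background[gram] + 1) / (bg_total + 1)  # add-one smoothing
            scored.append((doc_rate / bg_rate, " ".join(gram)))

        return sorted(scored, reverse=True)[:top_k]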

In linguistics, the term lexis designates the complete set of all possible words in a language, or a particular subset of words that are grouped by some specific linguistic criteria. For example, the general term English lexis refers to all words of the English language, while the more specific term English religious lexis refers to a particular subset within English lexis, encompassing only words that are semantically related to the religious sphere of life.

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.
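A toy word and sentence segmenter for a space-delimited script such as English is sketched below; the regular expressions are illustrative only, and the deliberate failure on an abbreviation shows why the problem is non-trivial.

    import re

    def segment(text):
        """Split text into sentences, then each sentence into word tokens."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        return [re.findall(r"\w+(?:'\w+)?", s) for s in sentences]

    print(segment("Dr. Smith arrived. He didn't stay long!"))
    # The naive rule splits after every full stop, so "Dr." wrongly ends a
    # sentence: exactly the kind of ambiguity described above.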

A phraseme, also called a set phrase, idiomatic phrase, multi-word expression, or idiom, is a multi-word or multi-morphemic utterance at least one of whose components is selectionally constrained or restricted by linguistic convention such that it is not freely chosen. In the most extreme cases, there are expressions such as X kicks the bucket ≈ ‘person X dies of natural causes, the speaker being flippant about X’s demise’ where the unit is selected as a whole to express a meaning that bears little or no relation to the meanings of its parts. All of the words in this expression are chosen restrictedly, as part of a chunk. At the other extreme, there are collocations such as stark naked, hearty laugh, or infinite patience where one of the words is chosen freely based on the meaning the speaker wishes to express, while the choice of the other (intensifying) word is constrained by the conventions of the English language. Both kinds of expression are phrasemes, and can be contrasted with "free phrases", expressions where all of the members are chosen freely, based exclusively on their meaning and the message that the speaker wishes to communicate.

In linguistics, a catena is a unit of syntax and morphology, closely associated with dependency grammars. It is a more flexible and inclusive unit than the constituent and may therefore be better suited than the constituent to serve as the fundamental unit of syntactic and morphosyntactic analysis.

The following outline is provided as an overview of and topical guide to natural-language processing.

English phrasal verbs: concept in English grammar

In the traditional grammar of Modern English, a phrasal verb typically constitutes a single semantic unit composed of a verb followed by a particle, sometimes combined with a preposition.

In grammar, sentence and clause structure, commonly known as sentence composition, is the classification of sentences based on the number and kind of clauses in their syntactic structure. Such division is an element of traditional grammar.

Sketch Engine: corpus manager and text analysis software

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine takes its name from one of its key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in more than 90 languages.

References

  1. Manning, C. D.; Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. ISBN 978-0-262-13360-9.