Collocation

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.

There are seven main types of collocation: adjective + noun, noun + noun (such as collective nouns), noun + verb, verb + noun, adverb + adjective, verb + prepositional phrase (phrasal verbs), and verb + adverb.

Collocation extraction is a computational technique that finds collocations in a document or corpus, using methods from computational linguistics that resemble data mining.

Expanded definition

Collocations are partly or fully fixed expressions that become established through repeated context-dependent use. Such terms as crystal clear, middle management, nuclear family, and cosmetic surgery are examples of collocated pairs of words.

Collocations can be in a syntactic relation (such as verb–object: make and decision), lexical relation (such as antonymy), or they can be in no linguistically defined relation. Knowledge of collocations is vital for the competent use of a language: a grammatically correct sentence will stand out as awkward if collocational preferences are violated. This makes collocation an interesting area for language teaching.

Corpus linguists specify a key word in context (KWIC) and identify the words immediately surrounding it. This gives an idea of the way words are used.
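A key-word-in-context listing can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `kwic`, the `window` parameter, the regex tokenizer, and the sample sentences are hypothetical and not taken from any particular corpus tool.

```python
import re

def kwic(text, keyword, window=3):
    """Return (left context, node, right context) for every hit of keyword.

    Assumption: tokens are lowercase runs of word characters; real corpus
    tools use far more careful tokenization.
    """
    tokens = re.findall(r"\w+", text.lower())
    node = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]      # up to `window` words before
            right = tokens[i + 1:i + 1 + window]     # up to `window` words after
            hits.append((left, tok, right))
    return hits

# Hypothetical two-sentence sample text.
sample = ("She had to make a decision quickly. "
          "The committee will make a final decision tomorrow.")

for left, node, right in kwic(sample, "decision", window=2):
    # Right-align the left context so the key word lines up in a column.
    print(" ".join(left).rjust(20), f"[{node}]", " ".join(right))
```

Aligning the key word in a column, as the print loop does, is what makes recurring neighbours such as make easy to spot by eye.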

The processing of collocations involves a number of parameters, the most important of which is the measure of association, which evaluates whether the co-occurrence is purely by chance or statistically significant. Due to the non-random nature of language, most collocations are classed as significant, and the association scores are simply used to rank the results. Commonly used measures of association include mutual information, t-scores, and log-likelihood. [1] [2]
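As a rough illustration of ranking by an association measure, the sketch below scores adjacent word pairs with pointwise mutual information, one common form of the mutual information measure. The function name `pmi_scores`, the `min_count` threshold, the toy token list, and the choice to divide both unigram and bigram counts by the total token count are all assumptions made for this example.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI(w1, w2) = log2( P(w1 w2) / (P(w1) * P(w2)) ).
    Assumption: all probabilities are estimated as count / len(tokens).
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue  # PMI is unreliable for very rare pairs
        p12 = c12 / n
        p1 = unigrams[w1] / n
        p2 = unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p12 / (p1 * p2))
    # Highest score first: the scores serve only to rank candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy corpus.
tokens = "crystal clear water is crystal clear and the water is cold".split()
for pair, score in pmi_scores(tokens):
    print(pair, round(score, 3))
```

The frequency threshold matters in practice because PMI tends to favour pairs of rare words; t-scores and log-likelihood weight frequent pairs differently, which is one reason the measures produce different rankings.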

Rather than select a single definition, Gledhill [3] proposes that collocation involves at least three different perspectives: co-occurrence, a statistical view, which sees collocation as the recurrent appearance in a text of a node and its collocates; [4] [5] [6] construction, which sees collocation either as a correlation between a lexeme and a lexical-grammatical pattern, [7] or as a relation between a base and its collocative partners; [8] and expression, a pragmatic view of collocation as a conventional unit of expression, regardless of form. [9] [10] These different perspectives contrast with the usual way of presenting collocation in phraseological studies. Traditionally speaking, collocation is explained in terms of all three perspectives at once, in a continuum:

Free combination ↔ bound collocation ↔ frozen idiom

In dictionaries

In 1933, Harold Palmer's Second Interim Report on English Collocations highlighted the importance of collocation as a key to producing natural-sounding language, for anyone learning a foreign language. [11] Thus from the 1940s onwards, information about recurrent word combinations became a standard feature of monolingual learner's dictionaries. As these dictionaries became "less word-centred and more phrase-centred", [12] more attention was paid to collocation. This trend was supported, from the beginning of the 21st century, by the availability of large text corpora and intelligent corpus-querying software, making it possible to provide a more systematic account of collocation in dictionaries. Using these tools, dictionaries such as the Macmillan English Dictionary and the Longman Dictionary of Contemporary English included boxes or panels with lists of frequent collocations. [13]

There are also a number of specialized dictionaries devoted to describing the frequent collocations in a language. [14] These include (for Spanish) Redes: Diccionario combinatorio del español contemporáneo (2004), (for French) Le Robert: Dictionnaire des combinaisons de mots (2007), and (for English) the LTP Dictionary of Selected Collocations (1997) and the Macmillan Collocations Dictionary (2010). [15]

Statistically significant collocation

Student's t-test can be used to determine whether the occurrence of a collocation in a corpus is statistically significant. [16] For a bigram $w_1 w_2$, let $P(w_1) = \frac{\#w_1}{N}$ be the unconditional probability of occurrence of $w_1$ in a corpus with size $N$, and let $P(w_2) = \frac{\#w_2}{N}$ be the unconditional probability of occurrence of $w_2$ in the corpus. The t-score for the bigram $w_1 w_2$ is calculated as:

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$$

where $\bar{x} = \frac{\#w_1 w_2}{N}$ is the sample mean of the occurrence of $w_1 w_2$, $\#w_1 w_2$ is the number of occurrences of $w_1 w_2$, $\mu = P(w_1) P(w_2)$ is the probability of $w_1 w_2$ under the null-hypothesis that $w_1$ and $w_2$ appear independently in the text, and $s^2 = \bar{x}(1 - \bar{x}) \approx \bar{x}$ is the sample variance. With a large $N$, the t-test is equivalent to a Z-test.
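The t-score calculation can be sketched in Python as follows. The sketch assumes the usual setup for this test: the sample mean is the bigram count divided by the corpus size, the null-hypothesis mean is the product of the two unigram probabilities, and the sample variance is that of a Bernoulli variable. The function name `t_score` and the counts in the usage lines are illustrative assumptions, not measurements from a corpus discussed here.

```python
import math

def t_score(count_w1, count_w2, count_bigram, n):
    """t-score of a bigram w1 w2 in a corpus of n tokens.

    x_bar : observed relative frequency of the bigram (sample mean)
    mu    : expected frequency if w1 and w2 were independent
    s2    : Bernoulli sample variance x_bar * (1 - x_bar), close to x_bar
    """
    x_bar = count_bigram / n
    mu = (count_w1 / n) * (count_w2 / n)
    s2 = x_bar * (1 - x_bar)
    return (x_bar - mu) / math.sqrt(s2 / n)

# Illustrative counts (assumed for this example).
t = t_score(count_w1=15828, count_w2=4675, count_bigram=8, n=14307668)
print(f"t = {t:.4f}")
```

With these counts the score comes out close to 1, well below 2.576, the critical value commonly used at the 0.005 level, so independence would not be rejected and the pair would not count as a significant collocation under this test.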

Related Research Articles

A morpheme is the smallest meaningful constituent of a linguistic expression. The field of linguistic study dedicated to morphemes is called morphology.

An idiom is a phrase or expression that usually presents a figurative, non-literal meaning attached to the phrase. Some phrases which become figurative idioms, however, do retain the phrase's literal meaning. Categorized as formulaic language, an idiom's figurative meaning is different from the literal meaning. Idioms occur frequently in all languages; in English alone there are an estimated twenty-five million idiomatic expressions.

In linguistics, a calque or loan translation is a word or phrase borrowed from another language by literal word-for-word or root-for-root translation. When used as a verb, “to calque” means to borrow a word or phrase from another language while translating its components, so as to create a new lexeme in the target language. For instance, the English word "skyscraper" has been calqued in dozens of other languages, combining words for "sky" and "scrape" in each language, as for example, German: Wolkenkratzer, Portuguese: Arranha-céu, Turkish: Gökdelen. Another notable example is the Latin weekday names, which came to be associated by ancient Germanic speakers with their own gods following a practice known as interpretatio germanica: the Latin "Day of Mercury", Mercurii dies, was borrowed into Late Proto-Germanic as the "Day of Wōđanaz" (Wodanesdag), which became Wōdnesdæg in Old English, then "Wednesday" in Modern English.

Semantic prosody, also called discourse prosody, describes the way in which certain seemingly neutral words can be perceived with positive or negative associations through frequent occurrences with particular collocations. The term was coined in analogy to linguistic prosody and popularised by Bill Louw.

In linguistics, phraseology is the study of set or fixed expressions, such as idioms, phrasal verbs, and other types of multi-word lexical units, in which the component parts of the expression take on a meaning more specific than, or otherwise not predictable from, the sum of their meanings when used independently. For example, ‘Dutch auction’ is composed of the words Dutch ‘of or pertaining to the Netherlands’ and auction ‘a public sale in which goods are sold to the highest bidder’, but its meaning is not ‘a sale in the Netherlands where goods are sold to the highest bidder’; instead, the phrase has a conventionalized meaning referring to any auction where, instead of rising, the prices fall.

In linguistics, the term lexis designates the complete set of all possible words in a language, or a particular subset of words that are grouped by some specific linguistic criteria. For example, the general term English lexis refers to all words of the English language, while the more specific term English religious lexis refers to a particular subset within English lexis, encompassing only words that are semantically related to the religious sphere of life.

John McHardy Sinclair was a Professor of Modern English Language at Birmingham University from 1965 to 2000. He pioneered work in corpus linguistics, discourse analysis, lexicography, and language teaching.

In lexicography, a lexical item is a single word, a part of a word, or a chain of words (catena) that forms the basic elements of a language's lexicon (≈ vocabulary). Examples are cat, traffic light, take care of, by the way, and it's raining cats and dogs. Lexical items can be generally understood to convey a single meaning, much as a lexeme, but are not limited to single words. Lexical items are like semes in that they are "natural units" translating between languages, or in learning a new language. In this last sense, it is sometimes said that language consists of grammaticalized lexis, and not lexicalized grammar. The entire store of lexical items in a language is called its lexis.

In linguistics, co-occurrence or cooccurrence is an above-chance frequency of ordered occurrence of two adjacent terms in a text corpus. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. Corpus linguistics and its statistical analyses reveal patterns of co-occurrence within a language and make it possible to work out typical collocations for its lexical items. A co-occurrence restriction is identified when linguistic elements never occur together. Analysis of these restrictions can lead to discoveries about the structure and development of a language.

A phraseme, also called a set phrase, fixed expression, idiomatic phrase, multiword expression, or idiom, is a multi-word or multi-morphemic utterance whose components include at least one that is selectionally constrained or restricted by linguistic convention such that it is not freely chosen. In the most extreme cases, there are expressions such as X kicks the bucket ≈ ‘person X dies of natural causes, the speaker being flippant about X’s demise’ where the unit is selected as a whole to express a meaning that bears little or no relation to the meanings of its parts. All of the words in this expression are chosen restrictedly, as part of a chunk. At the other extreme, there are collocations such as stark naked, hearty laugh, or infinite patience where one of the words is chosen freely based on the meaning the speaker wishes to express while the choice of the other (intensifying) word is constrained by the conventions of the English language. Both kinds of expression are phrasemes, and can be contrasted with "free phrases", expressions where all of the members are chosen freely, based exclusively on their meaning and the message that the speaker wishes to communicate.

Christiane D. Fellbaum is an American linguist and computational linguistics researcher who is Lecturer with Rank of Professor in the Program in Linguistics and the Computer Science Department at Princeton University. The co-developer of the WordNet project, she is also its current director.

Collocation extraction is the task of using a computer to extract collocations automatically from a corpus.

In computational linguistics the Yarowsky algorithm is an unsupervised learning algorithm for word sense disambiguation that uses the "one sense per collocation" and the "one sense per discourse" properties of human languages for word sense disambiguation. From observation, words tend to exhibit only one sense in most given discourse and in a given collocation.

Macmillan English Dictionary for Advanced Learners, also known as MEDAL, is an advanced learner's dictionary first published in 2002 by Macmillan Education. It shares most of the features of this type of dictionary: it provides definitions in simple language, using a controlled defining vocabulary; most words have example sentences to illustrate how they are typically used; and information is given about how words combine grammatically or in collocations. MEDAL also introduced a number of innovations.

Norbert Schmitt is an American applied linguist and Emeritus Professor of Applied Linguistics at the University of Nottingham in the United Kingdom. He is known for his work on second-language vocabulary acquisition and second-language vocabulary teaching. He has published numerous books and papers on vocabulary acquisition.

An explanatory combinatorial dictionary (ECD) is a type of monolingual dictionary designed to be part of a meaning-text linguistic model of a natural language. It is intended to be a complete record of the lexicon of a given language. As such, it identifies and describes, in separate entries, each of the language's lexemes and phrasemes. Among other things, each entry contains (1) a definition that incorporates a lexeme's semantic actants; (2) complete information on lexical co-occurrence; and (3) an extensive set of examples. The ECD is a production dictionary: it aims to provide all the information needed for a foreign learner or automaton to produce perfectly formed utterances of the language. Since the lexemes and phrasemes of a natural language number in the hundreds of thousands, a complete ECD, in paper form, would occupy the space of a large encyclopaedia. Such a work has yet to be achieved; while ECDs of Russian and French have been published, each describes less than one percent of the vocabulary of the respective languages.

A lexical function (LF) is a tool developed within Meaning-Text Theory for the description and systematization of semantic relationships, specifically collocations and lexical derivation, between particular lexical units (LUs) of a language. LFs are also used in the construction of technical lexica and as abstract nodes in certain types of syntactic representation. Basically, an LF is a function ƒ( ) representing a correspondence ƒ that associates a set ƒ(L) of lexical expressions with an LU L; in ƒ(L), L is the keyword of ƒ, and ƒ(L) = {L´i} is ƒ’s value. Detailed discussions of lexical functions are found in Žolkovskij & Mel’čuk 1967, Mel’čuk 1974, 1996, 1998, 2003, 2007, and Wanner (ed.) 1996; analysis of the most frequent type of lexical function, verb-noun collocations, can be found in Gelbukh & Kolesnikova 2013.

In the traditional grammar of Modern English, a phrasal verb typically constitutes a single semantic unit consisting of a verb followed by a particle, sometimes collocated with a preposition.

Idiom, also called idiomaticness or idiomaticity, is the syntactical, grammatical, or structural form peculiar to a language. Idiom is the realized structure of a language, as opposed to possible but unrealized structures that could have developed to serve the same semantic functions but did not.

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages.

References

  1. Dunning, Ted (1993): "Accurate methods for the statistics of surprise and coincidence". Computational Linguistics 19, 1 (Mar. 1993), 61–74.
  2. Dunning, Ted (2008-03-21). "Surprise and Coincidence". blogspot.com. Archived from the original on 2012-01-20. Retrieved 2012-04-09.
  3. Gledhill C. (2000): Collocations in Science Writing, Narr, Tübingen.
  4. Firth J.R. (1957): Papers in Linguistics 1934–1951. Oxford: Oxford University Press.
  5. Sinclair J. (1996): "The Search for Units of Meaning", in Textus, IX, 75–106.
  6. Smadja F. A. & McKeown, K. R. (1990): "Automatically extracting and representing collocations for language generation", Proceedings of ACL'90, 252–259, Pittsburgh, Pennsylvania.
  7. Hunston S. & Francis G. (2000): Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English, Amsterdam, John Benjamins.
  8. Hausmann F. J. (1989): Le dictionnaire de collocations. In Hausmann F.J., Reichmann O., Wiegand H.E., Zgusta L. (eds), Wörterbücher: ein internationales Handbuch zur Lexikographie. Dictionaries. Dictionnaires. Berlin/New York: De Gruyter. 1010–1019.
  9. Moon R. (1998): Fixed Expressions and Idioms, a Corpus-Based Approach. Oxford, Oxford University Press.
  10. Frath P. & Gledhill C. (2005): "Free-Range Clusters or Frozen Chunks? Reference as a Defining Criterion for Linguistic Units", in Recherches anglaises et Nord-américaines, vol. 38: 25–43.
  11. Cowie, A.P., English Dictionaries for Foreign Learners, Oxford University Press 1999: 54–56.
  12. Béjoint, H., The Lexicography of English, Oxford University Press 2010: 318.
  13. "MED Second Edition – Key features – Macmillan". macmillandictionaries.com. Archived from the original on 2020-09-28. Retrieved 2011-08-24.
  14. Herbst, T. and Klotz, M. 'Syntagmatic and Phraseological Dictionaries' in Cowie, A.P. (Ed.) The Oxford History of English Lexicography, 2009: part 2, 234–243.
  15. "Macmillan Collocation Dictionary – How it was written – Macmillan". macmillandictionaries.com. Archived from the original on 2018-12-21. Retrieved 2011-08-24.
  16. Manning, Chris; Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. pp. 163–166. ISBN 0262133601.