Collostructional analysis

Collostructional analysis is a family of corpus-linguistic methods developed by (in alphabetical order) Stefan Th. Gries (University of California, Santa Barbara) and Anatol Stefanowitsch (Free University of Berlin). Collostructional analysis aims at measuring the degree of attraction or repulsion that words exhibit toward constructions, where the notion of construction has so far been that of Goldberg's construction grammar.

Collostructional methods

Collostructional analysis so far comprises three different methods:

(i) collexeme analysis, which measures the attraction or repulsion of words to a slot of a single construction;

(ii) distinctive collexeme analysis, which identifies the words that distinguish best between two (or more) semantically or functionally similar constructions;

(iii) covarying collexeme analysis, which measures the association between words in two different slots of the same construction.

Input frequencies

Collostructional analysis requires frequencies of words and constructions and is similar to a wide variety of collocation statistics. It differs from raw frequency counts by providing not only observed co-occurrence frequencies of words and constructions, but also

(i) a comparison of the observed frequency to the one expected by chance; thus, collostructional analysis can distinguish attraction and repulsion of words and constructions;

(ii) a measure of the strength of the attraction or repulsion; this is usually the log-transformed p-value of a Fisher-Yates exact test.
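The two points above can be sketched in a few lines of Python. The function below is a minimal, self-contained illustration of a simple collexeme analysis for one word/construction pair, not an implementation from the collostructional literature; the input numbers in the usage example are invented. It compares observed and expected frequencies to decide between attraction and repulsion, and computes the one-tailed Fisher-Yates exact p-value directly from the hypergeometric distribution.

```python
from math import comb, log10

def collostruction_strength(word_in_cx, word_total, cx_total, corpus_size):
    """Sketch of a simple collexeme analysis for one word/construction pair.

    word_in_cx  : observed tokens of the word inside the construction
    word_total  : tokens of the word in the whole corpus
    cx_total    : tokens of the construction in the whole corpus
    corpus_size : total number of relevant tokens in the corpus
    """
    # Frequency expected if word and construction co-occurred only by chance
    expected = word_total * cx_total / corpus_size
    attracted = word_in_cx > expected

    # Hypergeometric probability of a table with exactly k co-occurrences
    def hypergeom_p(k):
        return (comb(word_total, k)
                * comb(corpus_size - word_total, cx_total - k)
                / comb(corpus_size, cx_total))

    # One-tailed Fisher-Yates exact test: sum the probabilities of the
    # observed table and all more extreme tables in the same direction
    if attracted:
        ks = range(word_in_cx, min(word_total, cx_total) + 1)
    else:
        ks = range(0, word_in_cx + 1)
    p = sum(hypergeom_p(k) for k in ks)

    # Collostruction strength: the (negative) log-transformed p-value
    return {"expected": expected, "attracted": attracted,
            "p": p, "strength": -log10(p)}

# Invented example: a word occurring 10 times in a construction, where
# chance co-occurrence would predict only 1 occurrence
result = collostruction_strength(10, 20, 50, 1000)
```

Larger strength values indicate a stronger association; the sign of the deviation from the expected frequency (the `attracted` flag) distinguishes attracted from repelled collexemes.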

Versus other collocation statistics

Collostructional analysis differs from most collocation statistics in that

(i) it measures not the association of words to words, but of words to syntactic patterns or constructions; thus, it takes syntactic structure more seriously than most collocation-based analyses;

(ii) it has so far used only the most precise statistic available, namely the Fisher-Yates exact test based on the hypergeometric distribution; thus, unlike t-scores, z-scores, chi-square tests, etc., the analysis neither relies on nor violates any distributional assumptions.
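The contrast between the exact test and an approximate one can be illustrated on a small 2x2 word-by-construction table (the counts are invented for illustration). The Fisher-Yates exact p-value is a finite sum of hypergeometric probabilities and is valid for any table, whereas the chi-square test rests on a large-sample approximation that breaks down for the low expected frequencies typical of word-construction data.

```python
from math import comb, erfc, sqrt

def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher-Yates exact test (attraction direction) for the
    2x2 table [[a, b], [c, d]]: an exact sum of hypergeometric
    probabilities, with no distributional assumptions."""
    row1, col1, n = a + b, a + c, a + b + c + d
    def p_table(k):
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    return sum(p_table(k) for k in range(a, min(row1, col1) + 1))

def chi_square_p(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the same
    table: relies on a large-sample approximation."""
    n = a + b + c + d
    x = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 df
    return erfc(sqrt(x / 2))

# Invented sparse table: expected frequency of cell a is only
# (a+b)*(a+c)/n = 4*5/100 = 0.2, far below what chi-square requires
p_exact = fisher_one_tailed(3, 1, 2, 94)
p_approx = chi_square_p(3, 1, 2, 94)
```

On this table the approximate p-value diverges sharply from the exact one, which is why the low-frequency tables common in collostructional work call for the exact test.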

