The Survey of English Usage was the first research centre in Europe to carry out research with corpora. The Survey is based in the Department of English Language and Literature at University College London.
The Survey of English Usage was founded as the Survey of Spoken English at Durham University in 1959 by Randolph Quirk, moving with him to University College London in 1960. [1] Many well-known linguists have spent time doing research at the Survey, including Bas Aarts, Valerie Adams, John Algeo, Dwight Bolinger, Noël Burton-Roberts, David Crystal, Derek Davy, Jan Firbas, Sidney Greenbaum, Liliane Haegeman, Robert Ilson, Ruth Kempson, Geoffrey Leech, Jan Rusiecki, Jan Svartvik, and Joe Taglicht. The current director is Bas Aarts. [2]
The original Survey Corpus predated modern computing. It was recorded on reel-to-reel tapes, transcribed on paper, filed in filing cabinets, and indexed on paper cards. Transcriptions were marked up using a detailed scheme of prosodic and paralinguistic annotation developed by Crystal and Quirk (1964). [3] Sets of paper cards were manually annotated for grammatical structures and filed by structure, so that, for example, all noun phrases could be found in the noun phrase filing cabinet in the Survey. Consequently, corpus searches required a visit to the Survey.
This corpus is now known more widely as the London-Lund Corpus (LLC), as it was the responsibility of co-workers in Lund, Sweden, to computerise the corpus. Thirty-four of the spoken texts were published in book form as Svartvik and Quirk (1980), [4] and the corpus was used as the basis for the famous book A Comprehensive Grammar of the English Language (Quirk et al. 1985). [5]
In 1988 Sidney Greenbaum proposed a new project, ICE, the International Corpus of English. ICE was to be an international project, carried out at research centres around the world, to compile corpora of English varieties where English was the first or second official language. ICE texts would contain spoken and written English in a balanced sample of one million words per component so that these samples could be compared in a wide variety of ways. The ICE project continues around the world to the present day.
ICE-GB, the British Component of ICE, was compiled at the Survey. ICE-GB was annotated to a very detailed level, including a full grammatical analysis (parse) of every sentence in the corpus. The first release of ICE-GB took place in 1998. ICE-GB was distributed with ICECUP, software for searching and exploring the parsed corpus. Release 2 of ICE-GB is now available on CD.
As well as contrasting varieties of English, many researchers are interested in language development and change over time. A recent project at the Survey parsed a large (400,000-word) selection of the spoken part of the LLC in a manner directly comparable with ICE-GB, forming a new, 800,000-word diachronic corpus called the Diachronic Corpus of Present-Day Spoken English (DCPSE). DCPSE is now available on CD from the Survey.
These two corpora comprise the largest collection of parsed and corrected, orthographically transcribed spoken English language data in the world, with over one million words of spoken English in this form.
Parsed corpora are large databases containing detailed grammatical tree structures. One of the consequences of forming large collections of valuable linguistic data is a pressing need for methods and tools to help researchers and other users make the most of them. So in parallel with the parsing of natural language data, the Survey team have carried out research and development of software tools to help linguists use these corpora. The ICECUP research platform uses an intuitive grammatical query representation called Fuzzy Tree Fragments (FTFs) to search parsed corpora.
As well as distributing corpora and tools to the corpus linguistics research community, the Survey carries out research into the English language. Recent projects include research on the English Noun Phrase, Subordination in Spoken and Written English, and the English Verb Phrase. The Survey also provides support for PhD students who carry out research into English language corpora.