Oxford English Corpus

Last updated

The Oxford English Corpus (OEC) is a text corpus of 21st-century English, used by the makers of the Oxford English Dictionary and by Oxford University Press' language research programme. It is the largest corpus of its kind, containing nearly 2.1 billion words. [1] It includes language from the UK, the United States, Ireland, Australia, New Zealand, the Caribbean, Canada, India, Singapore, and South Africa. [2] The text is mainly collected from web pages; some printed texts, such as academic journals, have been collected to supplement particular subject areas. [2] The sources are writings of all sorts, from "literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of blogs, emails, and social media". [2] This may be contrasted with similar databases that sample only a specific kind of writing. The corpus is generally available only to researchers at Oxford University Press, but other researchers who can demonstrate a strong need may apply for access. [2] [3]

The digital version of the Oxford English Corpus is formatted in XML and usually analysed with Sketch Engine software. [4] By April 27, 2006, the dictionary database had 1 billion words. [5]

Each document in the OE Corpus is accompanied by metadata including:

See also

Related Research Articles

<i>Oxford English Dictionary</i> Premier historical dictionary of the English language

The Oxford English Dictionary (OED) is the principal historical dictionary of the English language, published by Oxford University Press (OUP). It traces the historical development of the English language, providing a comprehensive resource to scholars and academic researchers, as well as describing usage in its many variations throughout the world.

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference.

In linguistics, a corpus or text corpus is a language resource consisting of a large and structured set of texts. In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

<span class="mw-page-title-main">Collocation</span> Frequent occurrence of words next to each other

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.

The Bank of English is a representative subset of the 4.5 billion words COBUILD corpus, a collection of English texts. These are mainly British in origin, but content from North America, Australia, New Zealand, South Africa and other Commonwealth countries is also being included.

<span class="mw-page-title-main">Brown Corpus</span> Data set of American English in 1961

The Brown University Standard Corpus of Present-Day American English is an electronic collection of text samples of American English, the first major structured corpus of varied genres. This corpus first set the bar for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University, in Rhode Island, it is a general language corpus containing 500 samples of English, totaling roughly one million words, compiled from works published in the United States in 1961.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one that was conducted against the Oxford English Corpus (OEC), a massive text corpus that is written in the English language.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistic for analysis of corpora

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU).

The German Reference Corpus is an electronic archive of text corpora of contemporary written German. It was first created in 1964 and is hosted at the Institute for the German Language (IDS) in Mannheim, Germany. The corpus archive is continuously updated and expanded. It currently comprises more than 4.0 billion word tokens and constitutes the largest linguistically motivated collection of contemporary German texts. Today, it is one of the major resources worldwide for the study of written German.

<i>Oxford Dictionary of English</i> Single-volume completely new dictionary first published in 1998

The Oxford Dictionary of English (ODE) is a single-volume English dictionary published by Oxford University Press, first published in 1998 as The New Oxford Dictionary of English (NODE). The word "new" was dropped from the title with the Second Edition in 2003. This dictionary is not based on the Oxford English Dictionary (OED) and should not be mistaken for a new or updated version of the OED. It is a completely new dictionary which strives to represent as faithfully as possible the current usage of English words. The Revised Second Edition contains 355,000 words, phrases, and definitions, including biographical references and thousands of encyclopaedic entries. The Third Edition was published in August 2010, with some new words, including "vuvuzela".

<span class="mw-page-title-main">Mark Davies (linguist)</span> American linguist (born 1963)

Mark E. Davies is an American linguist. He specializes in corpus linguistics and language variation and change. He is the creator of most of the text corpora from English-Corpora.org as well as the Corpus del español and the Corpus do português. He has also created large datasets of word frequency, collocates, and n-grams data, which have been used by many large companies in the fields of technology and also language learning.

<span class="mw-page-title-main">Sketch Engine</span> Corpus manager and text analysis software

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages.

<span class="mw-page-title-main">LIVAC Synchronous Corpus</span>

LIVAC is an uncommon language corpus dynamically maintained since 1995. Different from other existing corpora, LIVAC has adopted a rigorous and regular as well as "Windows" approach in processing and filtering massive media texts from representative Chinese speech communities such as Hong Kong, Macau, Taipei, Singapore, Shanghai, Beijing, as well as Guangzhou, and Shenzhen. The contents are thus deliberately repetitive in most cases, represented by textual samples drawn from editorials, local and international news, cross-Formosan Straits news, as well as news on finance, sports and entertainment. By 2020, 3 billion characters of news media texts have been filtered so far, of which 700 million characters have been processed and analyzed and have yielded an expanding Pan-Chinese dictionary of 2.5 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational linguistic methodology, LIVAC has at the same time accumulated a large amount of accurate and meaningful statistical data on the Chinese language and their speech communities in the Pan-Chinese region, and the results show considerable and important variations.

<span class="mw-page-title-main">Word sketch</span>

A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. Word sketches were first introduced by the British corpus linguist Adam Kilgarriff and exploited within the Sketch Engine corpus management system. They are an extension of the general collocation concept used in corpus linguistics in that they group collocations according to particular grammatical relations. The collocation candidates in a word sketch are sorted either by their frequency or using a lexicographic association score like Dice, T-score or MI-score.

<span class="mw-page-title-main">SkELL</span>

SkELL is a free corpus-based web tool that allows language learners and teachers find authentic sentences for specific target word(s). For any word or a phrase, SkELL displays a concordance that lists example sentences drawn from a special text corpus crawled from the World Wide Web, which has been cleaned of spam and includes only high-quality texts covering everyday, standard, formal, and professional language. There are versions of SkELL for English, Russian, German, Italian, Czech and Estonian.

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (1010) words per language, which gave rise to the corpus family's name.

A historical dictionary or dictionary on historical principles is a dictionary which deals not only with the latterday meanings of words but also the historical development of their forms and meanings. It may also describe the vocabulary of an earlier stage of a language's development without covering present-day usage at all. A historical dictionary is primarily of interest to scholars of language, but may also be used as a general dictionary.

References

  1. "The Oxford English Corpus". Sketch Engine. Lexical Computing CZ s.r.o. Retrieved 27 October 2016.
  2. 1 2 3 4 "The Oxford English Corpus". Oxford Dictionaries Online. Oxford University Press. Archived from the original on 1 January 2012. Retrieved 8 November 2014.
  3. "Compare COCA". Corpus of Contemporary American English. Archived from the original on 7 November 2014. Retrieved 8 November 2014.
  4. 1 2 The Oxford English Corpus. Retrieved February 4, 2014.
  5. "Dictionary database has billion words". Northwest Herald. 27 April 2006. p. 2. Retrieved 15 March 2020 via Newspapers.com.