Oxford English Corpus

Last updated January 12, 2025

The Oxford English Corpus (OEC) is a text corpus of 21st-century English, used by the makers of the Oxford English Dictionary and by Oxford University Press' language research programme. It is the largest corpus of its kind, containing nearly 2.1 billion words.^[1] It includes language from the UK, the United States, Ireland, Australia, New Zealand, the Caribbean, Canada, India, Singapore, and South Africa.^[2] The text is mainly collected from web pages; some printed texts, such as academic journals, have been collected to supplement particular subject areas.^[2] The sources are writings of all sorts, from "literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of blogs, emails, and social media".^[2] This may be contrasted with similar databases that sample only a specific kind of writing. The corpus is generally available only to researchers at Oxford University Press, but other researchers who can demonstrate a strong need may apply for access.^[2]^[3]

The digital version of the Oxford English Corpus is formatted in XML and is usually analysed with the Sketch Engine software.^[4] By April 27, 2006, the database contained 1 billion words.^[5]

Each document in the OEC is accompanied by metadata, including:

title
author (if known; many websites make this difficult to determine reliably)
author gender (if known)
language type (e.g. British English, American English)
source website
year (+ date, if known)
date of collection
domain + subdomain
document statistics (number of tokens, sentences, etc.)^[4]
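
The metadata fields listed above could be modelled as a simple record. The following is a minimal sketch in Python, not the actual OEC schema: the `<doc>` element, its attribute names, the `DocumentMetadata` class, and the `parse_metadata` function are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentMetadata:
    """Hypothetical record mirroring the metadata fields listed above."""
    title: str
    author: Optional[str]         # None when the author is unknown
    author_gender: Optional[str]  # None when unknown
    language_type: str            # e.g. "British English"
    source_website: str
    year: int
    collection_date: str
    domain: str
    subdomain: Optional[str]
    token_count: int


def parse_metadata(xml_text: str) -> DocumentMetadata:
    """Read metadata from the attributes of an assumed <doc> element."""
    attrs = ET.fromstring(xml_text).attrib
    return DocumentMetadata(
        title=attrs["title"],
        author=attrs.get("author"),
        author_gender=attrs.get("author_gender"),
        language_type=attrs["language_type"],
        source_website=attrs["source_website"],
        year=int(attrs["year"]),
        collection_date=attrs["collection_date"],
        domain=attrs["domain"],
        subdomain=attrs.get("subdomain"),
        token_count=int(attrs["tokens"]),
    )


# Invented sample document; the values are not taken from the corpus.
SAMPLE = (
    '<doc title="Example post" language_type="British English" '
    'source_website="example.org" year="2005" '
    'collection_date="2006-01-15" domain="blog" tokens="3">'
    "one two three</doc>"
)

metadata = parse_metadata(SAMPLE)
print(metadata.title, metadata.year, metadata.author)
```

Optional fields such as the author are read with `attrs.get`, so a document with no known author yields `None` rather than raising an error, which matches the "if known" qualifiers in the list.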

Related Research Articles

The Oxford English Dictionary (OED) is the principal historical dictionary of the English language, published by Oxford University Press (OUP), a University of Oxford publishing house. The dictionary, which published its first edition in 1884, traces the historical development of the English language, providing a comprehensive resource to scholars and academic researchers, and provides ongoing descriptions of English language usage in its variations around the world.

Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.

In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.

The Bank of English (BoE) is a representative subset of the 4.5-billion-word COBUILD corpus, a collection of English texts. These are mainly British in origin, but content from North America, Australia, New Zealand, South Africa and other Commonwealth countries is also included.

The Brown University Standard Corpus of Present-Day American English, better known as simply the Brown Corpus, is an electronic collection of text samples of American English, the first major structured corpus of varied genres. This corpus first set the bar for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University, in Rhode Island, it is a general language corpus containing 500 samples of English, totaling roughly one million words, compiled from works published in the United States in 1961.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one that was conducted against the Oxford English Corpus (OEC), a massive text corpus that is written in the English language.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU).

The German Reference Corpus is an electronic archive of text corpora of contemporary written German. It was first created in 1964 and is hosted at the Institute for the German Language in Mannheim, Germany. The corpus archive is continuously updated and expanded. It currently comprises more than 4.0 billion word tokens and constitutes the largest linguistically motivated collection of contemporary German texts. Today, it is one of the major resources worldwide for the study of written German.

Oxford Dictionary of English: single-volume dictionary, first published in 1998

The Oxford Dictionary of English (ODE) is a single-volume English dictionary published by Oxford University Press, first published in 1998 as The New Oxford Dictionary of English (NODE). The word "new" was dropped from the title with the Second Edition in 2003. The dictionary is not based on the Oxford English Dictionary (OED) – it is a separate dictionary which strives to represent faithfully the current usage of English words. The Revised Second Edition contains 355,000 words, phrases, and definitions, including biographical references and thousands of encyclopaedic entries. The Third Edition was published in August 2010, with some new words, including "vuvuzela".

Mark Davies (linguist): American linguist (born 1963)

Mark E. Davies is an American linguist. He specializes in corpus linguistics and language variation and change. He is the creator of most of the text corpora from English-Corpora.org as well as the Corpus del español and the Corpus do português. He has also created large datasets of word frequency, collocates, and n-grams data, which have been used by many large companies in the fields of technology and also language learning.

The Cambridge International Corpus (CIC) is a collection of over 2 billion words of real spoken and written English. The texts are stored in a database that can be searched to see how English is used. The CIC also contains the Cambridge Learner Corpus, a unique collection of over 60,000 exam papers from Cambridge ESOL. It shows real mistakes students make and highlights the parts of English which cause problems for students.

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.

LIVAC is an uncommon language corpus that has been dynamically maintained since 1995. Unlike other existing corpora, LIVAC has adopted a rigorous and regular "Windows" approach in processing and filtering massive media texts from representative Chinese speech communities such as Beijing, Hong Kong, Macau, Taipei, Singapore, Shanghai, Guangzhou, and Shenzhen. The contents are thus deliberately repetitive in most cases, represented by textual samples drawn from editorials, local and international news, cross-Taiwan Strait news, as well as news on finance, sports and entertainment. By 2023, more than 3 billion characters of news media texts had been filtered, of which 700 million characters had been processed and analyzed, yielding an expanding Pan-Chinese dictionary of 2.5 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational linguistic methodology, LIVAC has at the same time accumulated a large amount of accurate and meaningful statistical data on the Chinese language and on its diverse speech communities in the Pan-Chinese context, and the results show considerable and important long-standing as well as evolving variations.

A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. Word sketches were first introduced by the British corpus linguist Adam Kilgarriff and exploited within the Sketch Engine corpus management system. They are an extension of the general collocation concept used in corpus linguistics in that they group collocations according to particular grammatical relations. The collocation candidates in a word sketch are sorted either by their frequency or using a lexicographic association score like Dice, T-score or MI-score.
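
The association scores named above can be illustrated with their common textbook definitions. The following Python sketch is an assumption-laden illustration, not Sketch Engine's implementation: the function names and the frequency counts are invented, and real tools may use variants of these formulas. Here `f_xy` is the co-occurrence frequency of the two words, `f_x` and `f_y` are their individual frequencies, and `n` is the corpus size in tokens.

```python
import math


def mi_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))


def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """T-score: (observed - expected) divided by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


def dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Dice coefficient: 2 * f_xy / (f_x + f_y); does not use corpus size."""
    return 2 * f_xy / (f_x + f_y)


# Made-up counts for a candidate collocation in a 1,024-token corpus.
f_xy, f_x, f_y, n = 8, 16, 32, 1024

print("MI:", mi_score(f_xy, f_x, f_y, n))
print("T-score:", t_score(f_xy, f_x, f_y, n))
print("Dice:", dice(f_xy, f_x, f_y))
```

With these counts the pair co-occurs 16 times more often than chance would predict (expected frequency 0.5, observed 8), so the MI score is 4. MI tends to favour rare pairs, while the T-score rewards high raw frequency, which is why a tool may offer several scores for sorting the same candidates.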

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10¹⁰) words per language, which gave rise to the corpus family's name.

A historical dictionary or dictionary on historical principles is a dictionary which deals not only with the latterday meanings of words but also the historical development of their forms and meanings. It may also describe the vocabulary of an earlier stage of a language's development without covering present-day usage at all. A historical dictionary is primarily of interest to scholars of language, but may also be used as a general dictionary.

References

[1] "The Oxford English Corpus". Sketch Engine. Lexical Computing CZ s.r.o. 6 June 2015. Retrieved 27 October 2016.
[2] "The Oxford English Corpus". Oxford Dictionaries Online. Oxford University Press. Archived from the original on 1 January 2012. Retrieved 8 November 2014.
[3] "Compare COCA". Corpus of Contemporary American English. Archived from the original on 7 November 2014. Retrieved 8 November 2014.
[4] "The Oxford English Corpus". Retrieved 4 February 2014.
[5] "Dictionary database has billion words". Northwest Herald. 27 April 2006. p. 2. Retrieved 15 March 2020 – via Newspapers.com.


This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

