Corpus of Contemporary American English

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus [1] of contemporary American English. It was created by Mark Davies, a retired professor of corpus linguistics at Brigham Young University (BYU). [2] [3]

Content

The Corpus of Contemporary American English (COCA) is composed of one billion words as of November 2021. [1] [2] [4] The corpus has grown steadily: in 2009 it contained more than 385 million words; [5] in 2010 it grew to 400 million words; [6] and by March 2019 it had reached 560 million words. [7]

As of November 2021, the Corpus of Contemporary American English comprises 485,202 texts. [4] According to the corpus website, it contains 24–25 million words for each year from 1990 to 2019. [4]

For each year from 1990 to 2019, the corpus is evenly divided among six registers/genres: TV/movies, spoken, fiction, magazine, newspaper, and academic (see the Texts and Registers page of the COCA website). In addition to these six registers, COCA (as of November 2021) also contains 125,496,215 words from blogs and 129,899,426 words from web pages, broadening its coverage of contemporary English. [4]
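As a rough, unofficial sanity check, the figures cited above do sum to approximately one billion words. The short sketch below illustrates the arithmetic, assuming roughly 25 million words per year for the six core registers (the website reports 24–25 million).

```python
# Rough sanity check of the COCA composition figures cited above (not official totals):
# ~25 million words per year across the six core registers for 1990-2019,
# plus the blog and web-page word counts reported as of November 2021.
per_year_words = 25_000_000
years = 2019 - 1990 + 1          # 30 years of core-register texts
blog_words = 125_496_215
web_words = 129_899_426

total = per_year_words * years + blog_words + web_words
print(f"{total:,}")              # 1,005,395,641 -- roughly one billion words
```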


The texts come from a variety of sources, including transcripts of unscripted conversation from television and radio programs, fiction, popular magazines, newspapers, academic journals, and, more recently, TV and movie subtitles, blogs, and other web pages.

Availability

The Corpus of Contemporary American English is free to search for registered users.

Queries

The Corpus of Global Web-based English (GloWbE; pronounced "globe") contains about 1.9 billion words of text from twenty different countries. This makes it about 100 times as large as other corpora such as the International Corpus of English, and it allows many types of searches that would not otherwise be possible. In addition to the online interface, full-text data from the corpus can also be downloaded.

It is unique in allowing comparisons between different varieties of English. GloWbE is related to the many other corpora of English available from English-Corpora.org. [8]

See also

Related Research Articles

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field, the natural context ("realia") of that language, with minimal experimental interference. The large collections of text allow linguists to run quantitative analyses of linguistic concepts that would otherwise be harder to quantify.

In linguistics and natural language processing, a corpus or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious and automatic, but it can come to conscious attention when ambiguity impairs the clarity of communication, given the pervasive polysemy of natural language. In computational linguistics, it is an open problem that affects other tasks, such as discourse analysis, improving the relevance of search engines, anaphora resolution, coherence, and inference.

<span class="mw-page-title-main">Collocation</span> Frequent occurrence of words next to each other

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.

Semantic prosody, also called discourse prosody, describes the way in which certain seemingly neutral words can be perceived as having positive or negative associations through frequent occurrence with particular collocations. The term was coined by analogy to linguistic prosody and popularised by Bill Louw.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

<span class="mw-page-title-main">Concordance (publishing)</span> List of words or terms in a published book

A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Historically, concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era.

Studies that estimate and rank the most common words in English examine large bodies of English-language text. Perhaps the most comprehensive such analysis is one conducted on the Oxford English Corpus (OEC), a massive text corpus of written English.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.

The Russian National Corpus is a corpus of the Russian language that has been partially accessible through a query interface online since April 29, 2004. It is being created by the Institute of Russian language, Russian Academy of Sciences.

The German Reference Corpus is an electronic archive of text corpora of contemporary written German. It was first created in 1964 and is hosted at the Institute for the German Language in Mannheim, Germany. The corpus archive is continuously updated and expanded. It currently comprises more than 4.0 billion word tokens and constitutes the largest linguistically motivated collection of contemporary German texts. Today, it is one of the major resources worldwide for the study of written German.

A word list is a list of a language's lexicon within some given text corpus, serving the purpose of vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort", but it is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". While word counting is a thousand years old, with gigantic analyses still done by hand into the mid-20th century, electronic natural-language processing of large corpora such as movie subtitles has accelerated the research field.

<span class="mw-page-title-main">Mark Davies (linguist)</span> American linguist (born 1963)

Mark E. Davies (born 1963) is an American linguist. He specializes in corpus linguistics and language variation and change. He is the creator of most of the text corpora from English-Corpora.org, as well as the Corpus del español and the Corpus do português. He has also created large datasets of word frequency, collocates, and n-grams, which have been used by many large companies in the fields of technology and language learning.

The following outline is provided as an overview of and topical guide to natural-language processing.

<span class="mw-page-title-main">Sketch Engine</span> Corpus manager and text analysis software

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages.

A corpus manager is a tool for multilingual corpus analysis, which allows effective searching in corpora.

The Bulgarian Sense-annotated Corpus (BulSemCor) is a structured corpus of Bulgarian texts in which each lexical item is assigned a sense tag. BulSemCor was created by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.

Law and corpus linguistics (LCL) is a new academic sub-discipline that uses large databases of examples of language usage, called corpora, equipped with tools designed by linguists, to better get at the meaning of words and phrases in legal texts. Thus, LCL is the application of corpus-linguistic tools, theories, and methodologies to issues of legal interpretation, in much the same way that law and economics is the application of economic tools, theories, and methodologies to various legal issues.

<span class="mw-page-title-main">CorCenCC</span> Welsh corpus

CorCenCC or the National Corpus of Contemporary Welsh is a language resource for Welsh speakers, Welsh learners, Welsh language researchers, and anyone who is interested in the Welsh language. CorCenCC is a freely accessible collection of multiple language samples, gathered from real-life communication, and presented in the searchable online CorCenCC text corpus. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels.

References

  1. Milana, Prior (2021). A Comparative Corpus Study on Intensifier Usage across Registers in American English (Thesis).
  2. "Mark Davies, Professor of (Corpus) Linguistics, Brigham Young University (BYU)". www.mark-davies.org. Retrieved November 9, 2021.
  3. Kauhanen, Henri (March 21, 2011). "The Corpus of Contemporary American English: Background and history". VARIENG. Retrieved October 13, 2011.
  4. "Homepage". Corpus of Contemporary American English. Retrieved April 24, 2022.
  5. Davies, Mark (January 1, 2009). "The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights". International Journal of Corpus Linguistics. 14 (2): 159–190. doi:10.1075/ijcl.14.2.02dav. ISSN 1384-6655.
  6. Davies, Mark (December 1, 2010). "The Corpus of Contemporary American English as the first reliable monitor corpus of English". Literary and Linguistic Computing. 25 (4): 447–464. doi:10.1093/llc/fqq018. ISSN 0268-1145.
  7. Davies, Mark; Kim, Jong Bok (March 1, 2019). "The advantages and challenges of "big data": Insights from the 14 billion word iWeb corpus". Linguistic Research. 36 (1): 1–34. doi:10.17250/khisli.36.1.201903.001. ISSN 1229-1374. S2CID 133013527.
  8. "Corpus of Web-Based Global English". www.english-corpora.org. Retrieved December 18, 2019.

Further reading