Lancaster-Oslo-Bergen Corpus

Last updated

The Lancaster-Oslo/Bergen (LOB) Corpus is a one-million-word collection of British English texts which was compiled in the 1970s in collaboration between the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen, to provide a British counterpart to the Brown Corpus compiled by Henry Kučera and W. Nelson Francis for American English in the 1960s.

Its composition was designed to match the original Brown corpus in terms of its size and genres as closely as possible using documents published in the UK in 1961 by British authors. [1] Both corpora consist of 500 samples each comprising about 2000 words in the following genres:

LabelText categoryBrown CorpusLOB Corpus
APress: reportage4444
BPress: editorial2727
CPress: reviews1717
DReligion1717
ESkills, trades and hobbies3638
FPopular lore4844
GBelles lettres, biography, essays7577
HMiscellaneous (documents, reports, etc.)3030
JLearned and scientific writings8080
KGeneral fiction2929
LMystery and detective fiction2424
MScience fiction66
NAdventure and western fiction2929
PRomance and love story2929
RHumour99
Total500500

The corpus has been also tagged, i.e. part-of-speech categories have been assigned to every word.[ citation needed ]

Related Research Articles

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. The large collections of text allow linguistics to run quantitative analyses on linguistic concepts, otherwise harder to quantify.

Henry Kučera, born Jindřich Kučera, was a Czech-American linguist who pioneered corpus linguistics, linguistic software, a major contributor to the American Heritage Dictionary, and a pioneer in the development of spell checking computer software. He is remembered in particular as one of the initiators of the Brown Corpus.

In corpus linguistics, part-of-speech tagging, also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

<span class="mw-page-title-main">Brown Corpus</span> Data set of American English in 1961

The Brown University Standard Corpus of Present-Day American English, better known as simply the Brown Corpus, is an electronic collection of text samples of American English, the first major structured corpus of varied genres. This corpus first set the bar for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University, in Rhode Island, it is a general language corpus containing 500 samples of English, totaling roughly one million words, compiled from works published in the United States in 1961.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one that was conducted against the Oxford English Corpus (OEC), a massive text corpus that is written in the English language.

LOB or LoB may refer to:

Geoffrey Neil Leech FBA was a specialist in English language and linguistics. He was the author, co-author, or editor of more than 30 books and more than 120 published papers. His main academic interests were English grammar, corpus linguistics, stylistics, pragmatics, and semantics.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.

The International Corpus of English (ICE) is a set of text corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU).

The Russian National Corpus is a corpus of the Russian language that has been partially accessible through a query interface online since April 29, 2004. It is being created by the Institute of Russian language, Russian Academy of Sciences.

The Constituent Likelihood Automatic Word-tagging System (CLAWS) is a program that performs part-of-speech tagging. It was developed in the 1980s at Lancaster University by the University Centre for Computer Corpus Research on Language. It has an overall accuracy rate of 96-97% with the latest version (CLAWS4) tagging around 100 million words of the British National Corpus.

<i>A Comprehensive Grammar of the English Language</i> 1985 compendium on the English language

A Comprehensive Grammar of the English Language is a descriptive grammar of English written by Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. It was first published by Longman in 1985.

The Spoken English Corpus (SEC) is a speech corpus collection of recordings of spoken British English compiled during 1984–1987. The corpus manual can be found on ICAME.

<span class="mw-page-title-main">W. Nelson Francis</span> American linguist

W. Nelson Francis was an American author, linguist, and university professor. He served as a member of the faculties of Franklin & Marshall College and Brown University, where he specialized in English and corpus linguistics. He is known for his work compiling a text collection entitled the Brown University Standard Corpus of Present-Day American English, which he completed with Henry Kučera.

The International Computer Archive of Modern and Medieval English (ICAME) is an international group of linguists and data scientists working in corpus linguistics to digitise English texts. The organisation was founded in Oslo, Norway in 1977 as the International Computer Archive of Modern English, before being renamed to its current title.

Stig Johansson was a Swedish-Norwegian linguist.

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (1010) words per language, which gave rise to the corpus family's name.

References