Lancaster-Oslo-Bergen Corpus


The Lancaster-Oslo/Bergen (LOB) Corpus is a one-million-word collection of British English texts compiled in the 1970s in collaboration between the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen. It was designed as a British counterpart to the Brown Corpus, which Henry Kučera and W. Nelson Francis compiled for American English in the 1960s. [1]

Its composition was designed to match the original Brown Corpus as closely as possible in size and genres, using documents published in the UK in 1961 by British authors. [2] Both corpora consist of 500 samples, each comprising about 2,000 words, in the following genres:

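| Label | Text category | Brown Corpus | LOB Corpus |
|-------|---------------|--------------|------------|
| A | Press: reportage | 44 | 44 |
| B | Press: editorial | 27 | 27 |
| C | Press: reviews | 17 | 17 |
| D | Religion | 17 | 17 |
| E | Skills, trades and hobbies | 36 | 38 |
| F | Popular lore | 48 | 44 |
| G | Belles lettres, biography, essays | 75 | 77 |
| H | Miscellaneous (documents, reports, etc.) | 30 | 30 |
| J | Learned and scientific writings | 80 | 80 |
| K | General fiction | 29 | 29 |
| L | Mystery and detective fiction | 24 | 24 |
| M | Science fiction | 6 | 6 |
| N | Adventure and western fiction | 29 | 29 |
| P | Romance and love story | 29 | 29 |
| R | Humour | 9 | 9 |
| Total | | 500 | 500 |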
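The figures in the table can be checked with a few lines of Python. The sketch below is only an illustration: the names `SAMPLES`, `totals`, and `differing_categories` are hypothetical and not part of any corpus distribution, and the 2,000 words per sample is the approximate figure quoted above, so the word count it prints is an estimate rather than an exact size.

```python
# Sample counts per text category, copied from the table above:
# label -> (text category, Brown Corpus samples, LOB Corpus samples)
SAMPLES = {
    "A": ("Press: reportage", 44, 44),
    "B": ("Press: editorial", 27, 27),
    "C": ("Press: reviews", 17, 17),
    "D": ("Religion", 17, 17),
    "E": ("Skills, trades and hobbies", 36, 38),
    "F": ("Popular lore", 48, 44),
    "G": ("Belles lettres, biography, essays", 75, 77),
    "H": ("Miscellaneous (documents, reports, etc.)", 30, 30),
    "J": ("Learned and scientific writings", 80, 80),
    "K": ("General fiction", 29, 29),
    "L": ("Mystery and detective fiction", 24, 24),
    "M": ("Science fiction", 6, 6),
    "N": ("Adventure and western fiction", 29, 29),
    "P": ("Romance and love story", 29, 29),
    "R": ("Humour", 9, 9),
}

# Approximate sample length stated in the text (an assumption, not an exact count).
WORDS_PER_SAMPLE = 2000


def totals(samples=SAMPLES):
    """Return (Brown total, LOB total) summed over all categories."""
    brown = sum(b for _, b, _ in samples.values())
    lob = sum(l for _, _, l in samples.values())
    return brown, lob


def differing_categories(samples=SAMPLES):
    """Return {label: (Brown, LOB)} for categories whose counts differ."""
    return {label: (b, l) for label, (_, b, l) in samples.items() if b != l}


if __name__ == "__main__":
    brown_total, lob_total = totals()
    print(brown_total, lob_total)          # 500 500
    print(lob_total * WORDS_PER_SAMPLE)    # 1000000 (approximate corpus size)
    print(differing_categories())          # {'E': (36, 38), 'F': (48, 44), 'G': (75, 77)}
```

Both columns sum to 500, and the two corpora differ only in categories E, F, and G, where the differences cancel out.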

The chief compilers of the LOB corpus were Geoffrey Leech (Lancaster University) and Stig Johansson (University of Oslo); see Leech & Johansson (2009). [3]

The corpus has also been tagged, i.e. a part-of-speech category has been assigned to every word. [1]


References

  1. "CoRD | The Lancaster-Oslo/Bergen Corpus (LOB)". varieng.helsinki.fi. Retrieved 2024-11-12.
  2. LOB Corpus Manual
  3. Leech, Geoffrey; Johansson, Stig (2009). "The coming of ICAME" (PDF). ICAME Journal. 33: 5–20.