Lancaster-Oslo-Bergen Corpus

Last updated November 13, 2024

The Lancaster-Oslo/Bergen (LOB) Corpus is a one-million-word collection of British English texts compiled in the 1970s through a collaboration between the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen. It was designed as a British counterpart to the Brown Corpus, which Henry Kučera and W. Nelson Francis compiled for American English in the 1960s.^[1]

Its composition was designed to match the original Brown Corpus as closely as possible in size and genre, using documents published in the UK in 1961 by British authors.^[2] Both corpora consist of 500 samples of about 2,000 words each, distributed across the following genres:

Label	Text category	Brown Corpus	LOB Corpus
A	Press: reportage	44	44
B	Press: editorial	27	27
C	Press: reviews	17	17
D	Religion	17	17
E	Skills, trades and hobbies	36	38
F	Popular lore	48	44
G	Belles lettres, biography, essays	75	77
H	Miscellaneous (documents, reports, etc.)	30	30
J	Learned and scientific writings	80	80
K	General fiction	29	29
L	Mystery and detective fiction	24	24
M	Science fiction	6	6
N	Adventure and western fiction	29	29
P	Romance and love story	29	29
R	Humour	9	9
	Total	500	500
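
The figures in the table can be checked with a few lines of code. The following is a minimal sketch, not part of either corpus's distribution: the names `SAMPLES`, `totals`, and `differing_categories` are made up for illustration, and the 2,000-word sample size is the approximate figure given above.

```python
# Sample counts per text category, copied from the table above.
# Each entry maps label -> (category name, Brown samples, LOB samples).
SAMPLES = {
    "A": ("Press: reportage", 44, 44),
    "B": ("Press: editorial", 27, 27),
    "C": ("Press: reviews", 17, 17),
    "D": ("Religion", 17, 17),
    "E": ("Skills, trades and hobbies", 36, 38),
    "F": ("Popular lore", 48, 44),
    "G": ("Belles lettres, biography, essays", 75, 77),
    "H": ("Miscellaneous (documents, reports, etc.)", 30, 30),
    "J": ("Learned and scientific writings", 80, 80),
    "K": ("General fiction", 29, 29),
    "L": ("Mystery and detective fiction", 24, 24),
    "M": ("Science fiction", 6, 6),
    "N": ("Adventure and western fiction", 29, 29),
    "P": ("Romance and love story", 29, 29),
    "R": ("Humour", 9, 9),
}

WORDS_PER_SAMPLE = 2000  # approximate sample length stated in the text


def totals(samples):
    """Return (Brown total, LOB total) summed over all categories."""
    brown = sum(b for _, b, _ in samples.values())
    lob = sum(l for _, _, l in samples.values())
    return brown, lob


def differing_categories(samples):
    """Return the categories whose sample counts differ between the corpora."""
    return {
        label: (name, brown, lob)
        for label, (name, brown, lob) in samples.items()
        if brown != lob
    }


brown_total, lob_total = totals(SAMPLES)
print("Samples:", brown_total, lob_total)
print("Approximate words per corpus:", lob_total * WORDS_PER_SAMPLE)
for label, (name, brown, lob) in differing_categories(SAMPLES).items():
    print(f"{label} {name}: Brown {brown}, LOB {lob}")
```

Both columns sum to 500 samples, which at about 2,000 words per sample gives the one-million-word size quoted above. The two corpora differ only in categories E, F, and G.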

The chief compilers of the LOB Corpus were Geoffrey Leech (Lancaster University) and Stig Johansson (University of Oslo); see Leech & Johansson (2009).^[3]

The corpus has also been tagged, i.e. a part-of-speech category has been assigned to every word.^[1]
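
As a rough illustration of what a tagged text makes possible, the sketch below reads word/tag pairs and counts the tags. It rests on assumptions throughout: the `word_TAG` layout, the sample sentence, and the coarse tags (`DET`, `NOUN`, and so on) are invented for the example and are not the LOB tagset or file format.

```python
from collections import Counter

# Hypothetical tagged sentence. The "word_TAG" layout and the coarse tags
# are illustrative assumptions, not the actual LOB tagset or file format.
TAGGED = "the_DET corpus_NOUN contains_VERB British_ADJ texts_NOUN"


def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last underscore, so words that
        # themselves contain underscores keep their full form.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(TAGGED)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts.most_common())
```

With every word carrying a tag, frequency counts of this kind can be made per word class rather than per word form.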

Related Research Articles

Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world" text, whether speech or writing, that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.

In corpus linguistics, part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, who learn to identify words as nouns, verbs, adjectives, adverbs, etc.

The Brown University Standard Corpus of Present-Day American English, better known as simply the Brown Corpus, is an electronic collection of text samples of American English, the first major structured corpus of varied genres. This corpus first set the bar for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University, in Rhode Island, it is a general language corpus containing 500 samples of English, totaling roughly one million words, compiled from works published in the United States in 1961.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

Geoffrey Neil Leech FBA was a specialist in English language and linguistics. He was the author, co-author, or editor of more than 30 books and more than 120 published papers. His main academic interests were English grammar, corpus linguistics, stylistics, pragmatics, and semantics.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.

Language and Computers: Studies in Practical Linguistics is a book series on corpus linguistics and related areas. As studies in linguistics, volumes in the series have, by definition, their foundations in linguistic theory; however, they are not concerned with theory for theory's sake, but always with a definite direct or indirect interest in the possibilities of practical application in the dynamic area where language and computers meet.

The International Corpus of English (ICE) is a set of text corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU).

The Survey of English Usage was the first research centre in Europe to carry out research with corpora. The Survey is based in the Department of English Language and Literature at University College London.

The Constituent Likelihood Automatic Word-tagging System (CLAWS) is a program that performs part-of-speech tagging. It was developed in the 1980s at Lancaster University by the University Centre for Computer Corpus Research on Language. It has an overall accuracy rate of 96–97% with the latest version (CLAWS4) tagging around 100 million words of the British National Corpus.

A Comprehensive Grammar of the English Language is a descriptive grammar of English written by Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. It was first published by Longman in 1985.

The Spoken English Corpus (SEC) is a speech corpus collection of recordings of spoken British English compiled during 1984–1987. The corpus manual can be found on ICAME.

W. Nelson Francis was an American author, linguist, and university professor. He served as a member of the faculties of Franklin & Marshall College and Brown University, where he specialized in English and corpus linguistics. He is known for his work compiling a text collection entitled the Brown University Standard Corpus of Present-Day American English, which he completed with Henry Kučera.

Longman Grammar of Spoken and Written English (LGSWE) is a descriptive grammar of English written by Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan, first published by Longman in 1999. It is an authoritative description of modern English, a successor to A Comprehensive Grammar of the English Language (ComGEL) published in 1985 and a predecessor of the Cambridge Grammar of the English Language (CamGEL) published in 2002. The authors and some reviewers consider it a complement rather than a replacement of the former since it follows – with few exceptions – the grammatical framework and concepts from ComGEL, which is also corroborated by the fact that one of LGSWE's authors, Geoffrey Leech, is also a co-author of ComGEL.

The International Computer Archive of Modern and Medieval English (ICAME) is an international group of linguists and data scientists working in corpus linguistics to digitise English texts. The organisation was founded in Oslo, Norway in 1977 as the International Computer Archive of Modern English, before being renamed to its current title.

Stig Johansson was a Swedish-Norwegian linguist.

Michael Henry 'Mick' Short is a British linguist. He is currently an honorary professor at the Department of Linguistics and English Language of Lancaster University, United Kingdom. His research focuses on applied linguistics with a special focus on stylistics.

References

[1] "CoRD | The Lancaster-Oslo/Bergen Corpus (LOB)". varieng.helsinki.fi. Retrieved 2024-11-12.
[2] LOB Corpus Manual.
[3] Leech, Geoffrey; Johansson, Stig (2009). "The coming of ICAME" (PDF). ICAME Journal. 33: 5–20.
