German Reference Corpus

The German Reference Corpus (German: Deutsches Referenzkorpus; short: DeReKo) is an electronic archive of text corpora of contemporary written German. It was first created in 1964 and is hosted at the Leibniz Institute for the German Language (abbr.: IDS) in Mannheim, Germany. The corpus archive is continuously updated and expanded. As of August 2010, it comprised more than 4.0 billion word tokens, and it constitutes the largest linguistically motivated collection of contemporary German texts. Today, it is one of the major resources worldwide for the study of written German.

Alternative names

The German Reference Corpus is often referred to by other names, such as Mannheim corpora, IDS corpora, COSMAS corpora, and the corresponding German translations. The name Deutsches Referenzkorpus (DeReKo) was originally used for a specific portion of the current archive, which was collected between 1999 and 2002 by a number of institutions in a joint project of the same name. Since 2004, Deutsches Referenzkorpus (DeReKo) has been the official name of the full corpus archive.

Conception and composition

The German Reference Corpus comprises fictional and academic texts, a large number of newspaper texts and several other text types. The texts cover the time range from around 1950 to the present.

In contrast to other well-known corpora and corpus archives (such as the British National Corpus), however, the German Reference Corpus is explicitly not designed as a balanced corpus: the distribution of DeReKo texts across time or text types is not made to match predefined percentages.

This conception reflects the fact that whether a given corpus constitutes a balanced or even representative language sample can only be assessed with respect to a specific language domain (i.e., the statistical population). Because different linguistic investigations generally target different language domains, the declared purpose of the German Reference Corpus is to serve as a versatile superordinate sample, or primordial sample (German: Ur-Stichprobe), of contemporary written German, from which corpus users may draw a specialised subsample (a so-called virtual corpus) to represent the language domain they wish to investigate.
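The idea of drawing a virtual corpus from a larger, unbalanced archive can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `TextRecord` fields, the sample records, and the function names are hypothetical and do not reflect DeReKo's actual metadata schema or the COSMAS II interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class TextRecord:
    # Hypothetical metadata fields, not DeReKo's real schema.
    text_id: str
    year: int
    text_type: str
    tokens: int


def draw_virtual_corpus(
    archive: Iterable[TextRecord],
    in_domain: Callable[[TextRecord], bool],
) -> List[TextRecord]:
    """Select the texts that belong to the language domain under study."""
    return [text for text in archive if in_domain(text)]


def token_share_by_type(corpus: Iterable[TextRecord]) -> Dict[str, float]:
    """Report how the tokens of a corpus are distributed across text types."""
    counts: Dict[str, int] = {}
    for text in corpus:
        counts[text.text_type] = counts.get(text.text_type, 0) + text.tokens
    total = sum(counts.values())
    if total == 0:
        return {}
    return {text_type: n / total for text_type, n in counts.items()}


# A toy archive: deliberately unbalanced, like a primordial sample.
archive = [
    TextRecord("a1", 1995, "newspaper", 600),
    TextRecord("a2", 2003, "newspaper", 400),
    TextRecord("a3", 2005, "fiction", 300),
    TextRecord("a4", 2008, "academic", 200),
    TextRecord("a5", 1962, "fiction", 500),
]

# The researcher defines the domain: non-newspaper texts from 2000 onwards.
virtual = draw_virtual_corpus(
    archive,
    lambda t: t.year >= 2000 and t.text_type != "newspaper",
)

print([t.text_id for t in virtual])
print(token_share_by_type(virtual))
```

The point of the sketch is that the archive itself carries no fixed proportions; the composition of the sample is decided only when a user states the domain of interest, and it can be inspected afterwards.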

Access

Due to copyright and licence restrictions, the DeReKo archive may be neither copied nor offered for download. It can be queried and analyzed free of charge via the COSMAS II system; end-users are required to register by name and to agree to use the corpus data exclusively for non-commercial, academic purposes. COSMAS II enables users to compile from DeReKo a virtual corpus suited to their specific research questions.

Related Research Articles

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference.

In linguistics, a corpus or text corpus is a language resource consisting of a large and structured set of texts. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic but can often come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, and inference.

Dr. Hermann Moisl is a retired senior lecturer and visiting fellow in Linguistics at Newcastle University. He was educated at various institutes, including Trinity College Dublin and the University of Oxford.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

Treebank

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for the analysis of corpora.

Internet linguistics

Internet linguistics is a domain of linguistics advocated by the English linguist David Crystal. It studies new language styles and forms that have arisen under the influence of the Internet and of other new media, such as Short Message Service (SMS) text messaging. Since the beginning of human–computer interaction (HCI), leading to computer-mediated communication (CMC) and Internet-mediated communication (IMC), experts such as Gretchen McCulloch have acknowledged that linguistics has a contributing role in it, in terms of web interface and usability. Studying the emerging language on the Internet can help improve conceptual organization, translation and web usability. Such study aims to benefit both linguists and web users.

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU).

Mark Davies (linguist), American linguist (born 1963)

Mark E. Davies is an American linguist. He specializes in corpus linguistics and language variation and change. He is the creator of most of the text corpora from English-Corpora.org as well as the Corpus del español and the Corpus do português. He has also created large datasets of word frequency, collocates, and n-grams data, which have been used by many large companies in the fields of technology and also language learning.

The Croatian Language Corpus (CLC) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).

Corpora in Translation Studies: The translator's workplace has changed gradually over the last ten years. Personal computers can now process information more easily and quickly than ever before, so today's computer could be considered an important or even essential tool in translation. However, problems arise in the use of computers in translation: the computer is no substitute for traditional tools such as monolingual and bilingual dictionaries, terminologies and encyclopaedias, whether on paper or in digital format, and although a large amount of information is easy to access, the translator still needs to find the right and reliable information.

Leibniz Institute for the German Language, institute in Germany

The Leibniz Institute for the German Language (IDS) in Mannheim, Germany, is a linguistic and social research institute and a member of the Leibniz Association. Under the leadership of Prof. Dr. Henning Lobin, director of the institute, and Prof. Dr. Arnulf Deppermann, vice director of the institute, the IDS employs a staff of about 160. The IDS was established in Mannheim in 1964 and is still headquartered there. It is the central extramural institute for research and documentation of the German language in its contemporary usage and its recent history. As a member of the Leibniz-Gemeinschaft (Leibniz Association), the IDS is financed both by the federal government and by the state of Baden-Württemberg.

The Bulgarian Sense-annotated Corpus (BulSemCor) is a structured corpus of Bulgarian texts in which each lexical item is assigned a sense tag. BulSemCor was created by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.

The Bulgarian National Corpus (BulNC) is a large representative corpus of Bulgarian comprising about 200,000 texts and amounting to over 1 billion words.

Manually Annotated Sub-Corpus (MASC) is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC). The OANC is a 15 million word corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions.

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10^10) words per language, which gave rise to the corpus family's name.

The Czech National Corpus (CNC) is a large electronic corpus of written and spoken Czech language, developed by the Institute of the Czech National Corpus (ICNC) in the Faculty of Arts at Charles University in Prague. The collection is used for teaching and research in corpus linguistics. The ICNC collaborates with over 200 researchers and students, 270 publishers, and other similar research projects.
