Corpus manager

Last updated April 21, 2022

A corpus manager (corpus browser or corpus query system) is a tool for multilingual corpus analysis, which allows effective searching in corpora.^[1]

A corpus manager is usually a complex tool that lets the user search for language forms or sequences. It may display each hit together with its surrounding context, a listing called a concordance, and may allow the user to search by positional attributes such as lemma or tag. Other features include collocation search, frequency statistics, and metadata about the processed texts.^[2] In the narrower sense, the term corpus manager refers only to the server side, that is, the corpus query engine, whereas the client side is simply called the user interface.
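As an illustration of searching by positional attributes, the following is a minimal sketch and not the implementation of any corpus manager listed here. It assumes a tiny hand-annotated toy corpus in which every position carries a word form, a lemma and a tag; the names Token, CORPUS, concordance and hit_frequencies are hypothetical and introduced only for this example.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """One corpus position with its positional attributes."""
    word: str
    lemma: str
    tag: str


# Tiny hand-annotated toy corpus (hypothetical data, for illustration only).
CORPUS = [
    Token("The", "the", "DET"), Token("cats", "cat", "NOUN"),
    Token("sat", "sit", "VERB"), Token("on", "on", "ADP"),
    Token("the", "the", "DET"), Token("mat", "mat", "NOUN"),
    Token(".", ".", "PUNCT"),
    Token("A", "a", "DET"), Token("cat", "cat", "NOUN"),
    Token("sits", "sit", "VERB"), Token("here", "here", "ADV"),
    Token(".", ".", "PUNCT"),
]


def concordance(corpus, attribute, value, window=2):
    """Return (left context, keyword, right context) for every matching position."""
    lines = []
    for i, token in enumerate(corpus):
        if getattr(token, attribute) == value:
            left = " ".join(t.word for t in corpus[max(0, i - window):i])
            right = " ".join(t.word for t in corpus[i + 1:i + 1 + window])
            lines.append((left, token.word, right))
    return lines


def hit_frequencies(corpus, attribute, value):
    """Count the word forms found at positions matching the query."""
    return Counter(t.word for t in corpus if getattr(t, attribute) == value)


if __name__ == "__main__":
    # Query by lemma: finds both "sat" and "sits".
    for left, keyword, right in concordance(CORPUS, "lemma", "sit"):
        print(f"{left:>12} [{keyword}] {right}")
    print(hit_frequencies(CORPUS, "lemma", "sit"))
```

Querying by lemma rather than by word form is what lets a single search retrieve all inflected forms. Real corpus query engines use indexed storage and dedicated query languages instead of a linear scan like this one.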

A corpus manager can be software installed on a personal computer or it might be provided as a web service.

List of corpus managers

BNCweb^[3] – a web-based interface for the British National Corpus
CQPweb^[4] – a web-based interface for studying a wide variety of corpora, including the Spoken BNC2014
BYU-BNC^[5] – a website for searching the British National Corpus and other corpora created at Brigham Young University
Coma^[6] – a tool extension of the system EXMARaLDA for working with oral corpora on a computer
NoSketch Engine^[7] – a free open-source corpus management system combining Manatee (back-end) and Bonito (web interface)
KonText^[8] – an extended and modified web interface to NoSketch Engine (a Bonito replacement)
Sketch Engine^[9]^[10] – text corpus management and analysis software with more than 500 corpora in 90+ languages
WordSmith Tools^[11] – a software package primarily for linguists

Related Research Articles

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference.

In linguistics, a corpus or text corpus is a language resource consisting of a large and structured set of texts. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language territory.

A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery made it possible to begin deciphering the Ancient Egyptian language.

Computer-aided translation (CAT), also referred to as machine-assisted translation (MAT) or machine-aided human translation (MAHT), is the use of software to assist a human translator in the translation process. The translation is created by a human, and certain aspects of the process are facilitated by software; this is in contrast with machine translation (MT), in which the translation is created by a computer, optionally with some human intervention.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era.

The Croatian National Corpus is the largest and most important corpus of Croatian. Its compilation started in 1998 at the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of Marko Tadić. The theoretical foundations, and expressions of the need for a general-purpose, representative, multi-million-word corpus of Croatian, had started to appear even earlier. The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time.

Data-driven learning (DDL) is an approach to foreign language learning. Whereas most language learning is guided by teachers and textbooks, data-driven learning treats language as data and students as researchers undertaking guided discovery tasks. Underpinning this pedagogical approach is the data–information–knowledge paradigm. It is informed by a pattern-based approach to grammar and vocabulary, and a lexicogrammatical approach to language in general. Thus the basic task in DDL is to identify patterns at all levels of language. From their findings, foreign language students can see how an aspect of language is typically used, which in turn informs how they can use it in their own speaking and writing. Learning how to frame language questions and use the resources to obtain data and interpret it is fundamental to learner autonomy. When students arrive at their own conclusions through such procedures, they use their higher-order thinking skills and are creating knowledge.

A speech corpus is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models. In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of Corpus Linguistics at Brigham Young University (BYU).

WordSmith Tools is a software package primarily for linguists, in particular for work in the field of corpus linguistics. It is a collection of modules for searching for patterns in a language. The software handles many languages.

EXMARaLDA is a set of free software tools for creating, managing and analyzing spoken language corpora. It consists of a transcription tool, a tool for administering corpus metadata and a tool for querying spoken language corpora. EXMARaLDA is used for conversation and discourse analysis, dialectology, phonology and research into first and second language acquisition in children and adults. EXMARaLDA is based on the open standards XML and Unicode and is programmed in Java.

Corpora in Translation Studies: The translator's workplace has changed gradually over the last ten years. Personal computers can now process information more easily and quickly than ever before, so today's computer can be considered an important or even essential tool in translation. Problems nevertheless arise in the use of computers in translation, because the computer is no substitute for traditional tools such as monolingual and bilingual dictionaries, terminologies and encyclopaedias, whether on paper or in digital format. Although a large amount of information is easy to access, the translator still needs to find information that is correct and reliable.

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing Limited since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages.

SkELL is a free corpus-based web tool that allows language learners and teachers to find authentic sentences for specific target words. For any word or phrase, SkELL displays a concordance that lists example sentences drawn from a special text corpus crawled from the World Wide Web, which has been cleaned of spam and includes only high-quality texts covering everyday, standard, formal, and professional language. There are versions of SkELL for English, Russian, German, Italian, Czech and Estonian.

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10¹⁰) words per language, which gave rise to the corpus family's name.

The Czech National Corpus (CNC) is a large electronic corpus of written and spoken Czech language, developed by the Institute of the Czech National Corpus (ICNC) in the Faculty of Arts at Charles University in Prague. The collection is used for teaching and research in corpus linguistics. The ICNC collaborates with over 200 researchers and students, 270 publishers, and other similar research projects.

References

[1] "Korpusový manažer". Wiki Český národní korpus. Český národní korpus. 8 April 2015. Retrieved 18 April 2015.
[2] Kouklakis, George; Mikros, George; Markopoulos, George; Koutsis, Ilias (2007). "Corpus Manager A Tool for Multilingual Corpus Analysis" (PDF). Proceedings from Corpus Linguistics Conference. University of Athens: 1–12.
[3] Interface to the British National Corpus; more about British National Corpus
[4] CQPweb Main Page
[5] BYU-BNC: BRITISH NATIONAL CORPUS interface
[6] EXMARaLDA Corpus-Manager, Hamburger Zentrum für Sprachkorpora
[7] NoSketch Engine (an open-source project combining Manatee, Bonito and Crystal into a powerful and free corpus management system)
[8] A basic query interface for working with corpora, Institute of the Czech National Corpus (ICNC), Faculty of Arts, Charles University in Prague
[9] The Sketch Engine homepage
[10] Concordancers, Search Engines, Text-analysis Tools, a list on the University of Wollongong website. Archived 15 March 2015 at the Wayback Machine.
[11] WordSmith Tools homepage

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
