EXMARaLDA

Developer(s) Thomas Schmidt and Kai Wörner, from the Special Research Centre on Multilingualism - University of Hamburg.
Initial release 2001
Written in Java
Operating system Windows, Linux, Macintosh, FreeBSD, Solaris
Available in English, German, French, Swedish, Turkish
Type Corpus manager, Linguistic research software
License GNU GPL
Website exmaralda.org

EXMARaLDA (Extensible Markup Language for Discourse Annotation) is a set of free software tools for creating, managing and analyzing spoken language corpora. It consists of a transcription tool (comparable to tools such as Praat or Transcriber), [1] a tool for administering corpus metadata, and a tool for running queries (KWIC searches) on spoken language corpora. EXMARaLDA is used in conversation and discourse analysis, dialectology, phonology, and research into first and second language acquisition in children and adults. It is based on the open standards XML and Unicode and is written in Java.
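As an illustration of what a KWIC (keyword in context) search does, here is a minimal Python sketch. It is not EXMARaLDA's implementation and does not read EXMARaLDA's file formats; the function name `kwic`, the `width` parameter, and the sample transcript are assumptions made up for this example. It treats a transcript as a plain string of words and returns each hit with a few words of context on either side.

```python
import re
from typing import List, Tuple


def kwic(text: str, keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, match, right context) for each hit of `keyword`.

    Hypothetical helper for illustration only: tokenization is a simple
    word-character split, and matching is case-insensitive on whole tokens.
    """
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented sample transcript line (not taken from any real corpus).
transcript = "well I think the corpus is large but the corpus is not balanced"

if __name__ == "__main__":
    # Print one concordance line per hit, with the keyword aligned in a column.
    for left, match, right in kwic(transcript, "corpus", width=2):
        print(f"{left:>15} [{match}] {right}")
```

A real corpus query tool would also search across speakers and annotation tiers and link each hit back to the recording; the sketch only shows the concordance idea itself.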

Related Research Articles

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field, in the natural context ("realia") of that language, with minimal experimental interference.

In linguistics, a corpus or text corpus is a language resource consisting of a large and structured set of texts. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

Transcription in the linguistic sense is the systematic representation of spoken language in written form. The source can either be utterances or preexisting text in another writing system.

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

The Child Language Data Exchange System (CHILDES) is a corpus established in 1984 by Brian MacWhinney and Catherine Snow to serve as a central repository for data on first language acquisition. Its earliest transcripts date from the 1960s, and as of 2015 it has content in 26 languages from 230 different corpora, all of which are publicly available worldwide. CHILDES has since been made a component of the larger corpus TalkBank, which also includes language data from aphasics, second language acquisition, conversation analysis, and classroom language learning. CHILDES is mainly used for analyzing the language of young children and the child-directed speech of adults.

Geoffrey Neil Leech FBA was a specialist in English language and linguistics. He was the author, co-author, or editor of more than 30 books and more than 120 published papers. His main academic interests were English grammar, corpus linguistics, stylistics, pragmatics, and semantics.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.

A speech corpus is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models. In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.

The International Corpus of English (ICE) is a set of corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.

Corpus-assisted discourse studies is related historically and methodologically to the discipline of corpus linguistics. The principal endeavor of corpus-assisted discourse studies is the investigation and comparison of features of particular discourse types, integrating into the analysis the techniques and tools developed within corpus linguistics. These include the compilation of specialised corpora and analyses of word and word-cluster frequency lists, comparative keyword lists and, above all, concordances.

The Survey of English Usage was the first research centre in Europe to carry out research with corpora. The Survey is based in the Department of English Language and Literature at University College London.

Phonologie du Français Contemporain (PFC) is an international linguistics research project devoted to the creation and use of resources for the study of the phonology of contemporary French.

Linguistics is the scientific study of language. The modern-day scientific study of linguistics takes all aspects of language into account, i.e., the cognitive, the social, the cultural, the psychological, the environmental, the biological, the literary, the grammatical, the paleographical, and the structural. Linguistics is based on the theoretical as well as descriptive study of language, and is also interlinked with the applied fields of language studies and language learning, which entails the study of specific languages. Before the 20th century, linguistics evolved in an informal manner that did not employ scientific methods.

Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language. This differs from phrase translation, which is where the system only translates a fixed and finite set of phrases that have been manually entered into the system. Speech translation technology enables speakers of different languages to communicate. It thus is of tremendous value for humankind in terms of science, cross-cultural exchange and global business.

Julia Hirschberg is an American computer scientist noted for her research on computational linguistics and natural language processing.

Anna-Brita Stenström is a linguist whose areas of research include corpus linguistics, sociolinguistics, pragmatics, and discourse analysis. She has initiated and co-directed three online corpora of adolescent language: The Bergen Corpus of London Teenage Language (COLT), Ungdomsspråk och språkkontakt i Norden (UNO), and Corpus Oral de Lenguaje Adolescente (COLA). She is Professor Emerita of English Linguistics at the University of Bergen, Norway.

A corpus manager is a tool for multilingual corpus analysis, which allows effective searching in corpora.

The CLAN (Computerized Language ANalysis) program is a cross-platform program designed by Brian MacWhinney and written by Leonid Spektor for the purpose of creating and analyzing transcripts in the Child Language Data Exchange System (CHILDES) database. CLAN is open source software and can be freely downloaded.

ELAN is computer software, a professional tool to manually and semi-automatically annotate and transcribe audio or video recordings. It has a tier-based data model that supports multi-level, multi-participant annotation of time-based media. It is applied in humanities and social sciences research for the purpose of documentation and of qualitative and quantitative analysis. It is distributed as free and open source software under the GNU General Public License, version 3.

The Czech National Corpus (CNC) is a large electronic corpus of written and spoken Czech language, developed by the Institute of the Czech National Corpus (ICNC) in the Faculty of Arts at Charles University in Prague. The collection is used for teaching and research in corpus linguistics. The ICNC collaborates with over 200 researchers and students, 270 publishers, and other similar research projects.

References

  1. "EXMARALDA". EXMARaLDA. Hamburger Zentrum für Sprachkorpora. 9 April 2015. Retrieved 19 April 2015.