Transcriber

Stable release: 1.5.1 / June 6, 2005
Written in: Tcl/Tk, C
Operating system: Cross-platform
License: GPL
Website: transag.sourceforge.net

Transcriber is an open-source software tool for the transcription and annotation of speech signals for linguistic research. It supports multiple hierarchical layers of segmentation, named entity annotation, speaker lists, topic lists, and overlapping speakers. Two views of the sound pressure waveform, at different resolutions, can be displayed simultaneously. Various character encodings, including Unicode, are supported.

Annotations from Transcriber can be exported as XML. The open DTD used by Transcriber is published on the OASIS Cover Pages. [1]
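As a rough illustration of working with such an XML export, the sketch below reads time-stamped segments from a small sample using Python's standard library. The element and attribute names (`Trans`, `Speaker`, `Turn`, `Sync`, `time`) and the sample content are assumptions made for this example; the article itself does not describe the DTD, so check them against the published DTD before relying on them. `read_segments` is a hypothetical helper, not part of Transcriber.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The element and attribute names are assumed for
# illustration and are not taken from the article.
SAMPLE = """<Trans version="1">
  <Speakers>
    <Speaker id="spk1" name="Alice"/>
    <Speaker id="spk2" name="Bob"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0" endTime="5.0">
      <Turn speaker="spk1" startTime="0" endTime="2.5">
        <Sync time="0"/>
        hello there
        <Sync time="1.2"/>
        how are you
      </Turn>
      <Turn speaker="spk2" startTime="2.5" endTime="5.0">
        <Sync time="2.5"/>
        fine thanks
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def read_segments(xml_text):
    """Return a list of (start_time, speaker_name, text) tuples."""
    root = ET.fromstring(xml_text)
    # Map speaker ids to display names.
    names = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    segments = []
    for turn in root.iter("Turn"):
        speaker_id = turn.get("speaker")
        speaker = names.get(speaker_id, speaker_id)
        # In this assumed layout, the text of a segment follows its
        # Sync marker, so it is stored in the marker's "tail".
        for sync in turn.iter("Sync"):
            text = (sync.tail or "").strip()
            if text:
                segments.append((float(sync.get("time")), speaker, text))
    return segments


if __name__ == "__main__":
    for start, speaker, text in read_segments(SAMPLE):
        print(f"{start:6.2f}  {speaker}: {text}")
```

This sketch ignores overlapping speakers, topics, and named entity markup, all of which a real export may contain.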

Transcriber is written in Tcl/Tk with the Snack audio library and is therefore available on most major platforms. It is distributed under the GNU General Public License. Transcriber has been superseded by TranscriberAG. [2]

Related Research Articles

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field, in the natural context ("realia") of that language, with minimal experimental interference. Large collections of text allow linguists to run quantitative analyses on linguistic concepts that are otherwise hard to quantify.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing.

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.

TIMIT is a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects. Each transcribed element has been delineated in time.

The ELRA Language Resources Association (ELRA) is a not-for-profit organisation established under the law of the Grand Duchy of Luxembourg. Its seat is in Luxembourg, and its headquarters is in Paris, France.

The Pangloss Collection is a digital library whose objective is to store and facilitate access to audio recordings in endangered languages of the world. Developed by the LACITO centre of CNRS in Paris, the collection provides free online access to documents of connected, spontaneous speech, in otherwise little-documented languages of all continents.

Linguistic categories include

Language resource management: Lexical markup framework (LMF) is the International Organization for Standardization ISO/TC37 standard for natural language processing (NLP) and machine-readable dictionary (MRD) lexicons. Its scope is the standardization of principles and methods relating to language resources in the context of multilingual communication.

The International Conference on Language Resources and Evaluation (LREC) is an international conference organised every other year by the ELRA Language Resources Association with the support of institutions and organisations involved in natural language processing. The series of LREC conferences was launched in Granada in 1998.

A discourse relation is a description of how two segments of discourse are logically and/or structurally connected to one another.

Speaker diarisation is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It is used to answer the question "who spoke when?" Speaker diarisation is a combination of speaker segmentation and speaker clustering. The first aims at finding speaker change points in an audio stream. The second aims at grouping together speech segments on the basis of speaker characteristics.

The LRE Map is a large, freely accessible database of resources dedicated to natural language processing. Its distinguishing feature is that the records are collected during the submission process of several major natural language processing conferences. The records are then cleaned and gathered into a global database called "LRE Map".

Grażyna Małgorzata Vetulani née Świerczyńska is a Polish philologist and linguist, professor of the humanities, professor at the Adam Mickiewicz University in Poznań and the Nicolaus Copernicus University in Toruń.

Manually Annotated Sub-Corpus (MASC) is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC). The OANC is a 15 million word corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions.

Joseph Mariani is a French computer science researcher and pioneer in the field of speech processing.

ELAN is computer software, a professional tool to manually and semi-automatically annotate and transcribe audio or video recordings. It has a tier-based data model that supports multi-level, multi-participant annotation of time-based media. It is applied in humanities and social sciences research for the purpose of documentation and of qualitative and quantitative analysis. It is distributed as free and open source software under the GNU General Public License, version 3.

Universal Dependencies, frequently abbreviated as UD, is an international cooperative project to create treebanks of the world's languages. These treebanks are openly accessible and available. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology. The project's primary aim is to achieve cross-linguistic consistency of annotation, while still permitting language-specific extensions when necessary. The annotation scheme has its roots in three related projects: Stanford Dependencies, Google universal part-of-speech tags, and the Interset interlingua for morphosyntactic tagsets. The UD annotation scheme uses a representation in the form of dependency trees as opposed to phrase structure trees. At the present time, there are just over 200 treebanks of more than 100 languages available in the UD inventory.

In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications."

CorCenCC or the National Corpus of Contemporary Welsh is a language resource for Welsh speakers, Welsh learners, Welsh language researchers, and anyone who is interested in the Welsh language. CorCenCC is a freely accessible collection of multiple language samples, gathered from real-life communication, and presented in the searchable online CorCenCC text corpus. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels.

References

  1. "Transcriber - Speech Segmentation and Annotation DTD". OASIS. November 16, 2000. Retrieved February 20, 2009.
  2. "TranscriberAG a tool for segmenting, labeling and transcribing speech". Retrieved March 11, 2017.
