| Developer(s) | The Language Archive |
| --- | --- |
| Initial release | 2000 |
| Stable release | 6.1 / March 12, 2021 [1] |
| Written in | Java |
| Operating system | Windows, macOS, Linux |
| Platform | IA-32, x86-64 |
| Available in | English |
| Type | Language documentation, qualitative data analysis |
| License | GPLv3 |
| Website | archive.mpi.nl/tla/elan |
ELAN is a professional software tool for manually and semi-automatically annotating and transcribing audio or video recordings. [2] It has a tier-based data model that supports multi-level, multi-participant annotation of time-based media. It is used in humanities and social sciences research (language documentation, sign language, and gesture research) for documentation and for qualitative and quantitative analysis. [3] It is distributed as free and open-source software under the GNU General Public License, version 3.
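The tier-based model can be pictured as a set of named layers, each holding time-aligned annotations and optionally depending on a parent tier. The following Python sketch is purely illustrative; the class and field names are hypothetical and do not reflect ELAN's internal API.

```python
# A minimal sketch of a tier-based annotation model like ELAN's
# (illustrative only; names are hypothetical, not ELAN's API).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Annotation:
    start_ms: int   # annotation onset on the media timeline (milliseconds)
    end_ms: int     # annotation offset
    value: str      # the annotation text, e.g. a transcript segment

@dataclass
class Tier:
    tier_id: str                          # e.g. "A-transcription"
    participant: Optional[str] = None     # multi-participant annotation
    parent: Optional["Tier"] = None       # multi-level: tiers may depend on a parent tier
    annotations: list[Annotation] = field(default_factory=list)

# One participant with a transcription tier and a dependent translation tier.
a_trans = Tier("A-transcription", participant="A")
a_gloss = Tier("A-translation", participant="A", parent=a_trans)
a_trans.annotations.append(Annotation(1200, 2450, "natke yawoe"))
a_gloss.annotations.append(Annotation(1200, 2450, "hello there"))
```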
ELAN is well-established, professional-grade software that is widely used in academia. [4] [5] [6] It has been well received in several academic disciplines, including psychology, medicine, psychiatry, education, and behavioral studies, on topics such as human–computer interaction, [7] sign language and conversation analysis, [8] [9] [10] group interactions, [11] music therapy, [12] bilingualism and child language acquisition, [13] analysis of non-verbal communication and gesture, [14] and animal behavior. [15]
Several third-party tools have been developed to enrich and analyse ELAN data and corpora. [16] [17] [18] [19]
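As an example of such downstream processing, the sketch below (standard library only) totals annotation durations per tier in an .eaf file. It assumes the standard EAF XML layout of TIME_ORDER/TIME_SLOT and TIER/ANNOTATION/ALIGNABLE_ANNOTATION elements; real files may also contain REF_ANNOTATIONs, which this sketch ignores.

```python
# Sketch: summarise annotation durations per tier from an ELAN .eaf file.
import xml.etree.ElementTree as ET
from collections import defaultdict

def tier_durations(eaf_path: str) -> dict[str, int]:
    root = ET.parse(eaf_path).getroot()
    # Map each time slot ID to its value in milliseconds.
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT") if ts.get("TIME_VALUE")}
    totals: dict[str, int] = defaultdict(int)
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is not None and end is not None:
                totals[tier.get("TIER_ID")] += end - start
    return dict(totals)

# Example: print(tier_durations("session.eaf"))
```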
Its features include synchronized playback of multiple audio and video streams alongside their annotations, an unrestricted number of user-definable annotation tiers, and structured search within and across annotation documents.
ELAN is developed by the Max Planck Institute for Psycholinguistics in Nijmegen. The first version was released around 2000 under the name EAT, the EUDICO Annotation Tool, and it was renamed ELAN in 2002. Since then, two to three new versions have been released each year. ELAN is written in Java, with interfaces to platform-native media frameworks written in C, C++, and Objective-C.
Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field, in the natural context ("realia") of that language, with minimal experimental interference. Large collections of text allow linguists to run quantitative analyses of linguistic concepts that are otherwise hard to quantify.
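A minimal illustration of such a quantitative analysis, with a toy string standing in for a corpus:

```python
# Relative frequencies of word forms, the simplest corpus statistic.
from collections import Counter
import re

corpus = "the cat sat on the mat because the mat was warm"
tokens = re.findall(r"[a-z']+", corpus.lower())
freqs = Counter(tokens)
total = sum(freqs.values())
for word, n in freqs.most_common(3):
    print(f"{word}: {n} ({n / total:.1%})")   # e.g. "the: 3 (27.3%)"
```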
Transcription in the linguistic sense is the systematic representation of spoken language in written form. The source can either be utterances or preexisting text in another writing system.
The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. GO is part of a larger classification effort, the Open Biomedical Ontologies, being one of the Initial Candidate Members of the OBO Foundry.
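Enrichment analysis typically reduces to an over-representation test: given a study set of genes, is a GO term annotated to more of them than chance would predict? The sketch below implements the standard hypergeometric tail probability with made-up numbers; it is not the API of any actual GO tool.

```python
# Over-representation test behind GO enrichment analysis (illustrative).
from math import comb

def hypergeom_sf(k: int, M: int, n: int, N: int) -> float:
    """P(X >= k) when drawing N genes from M, of which n carry the GO term."""
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)

M, n = 20000, 400   # genome size; genes annotated with the term
N, k = 150, 12      # study-set size; study genes carrying the term
print(f"enrichment p-value: {hypergeom_sf(k, M, n, N):.2e}")
```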
The Ensembl genome database project is a scientific project at the European Bioinformatics Institute which provides a centralized resource for geneticists, molecular biologists, and other researchers studying the genomes of our own species, other vertebrates, and model organisms. Ensembl is one of several well-known genome browsers for the retrieval of genomic information.
The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. They are also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.
The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.
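These annotation layers can be pictured as stand-off columns attached to each token. The values below are hand-written examples using common tag conventions (Penn part-of-speech tags, IOB chunk and entity labels), not actual ANC data.

```python
# Stacked annotation layers per token, as a corpus like the ANC provides.
token_annotations = [
    # (token, part of speech, lemma, shallow-parse chunk, named-entity tag)
    ("Barack",  "NNP", "Barack",  "B-NP", "B-PERSON"),
    ("Obama",   "NNP", "Obama",   "I-NP", "I-PERSON"),
    ("spoke",   "VBD", "speak",   "B-VP", "O"),
    ("in",      "IN",  "in",      "B-PP", "O"),
    ("Chicago", "NNP", "Chicago", "B-NP", "B-GPE"),
]
# Collect all tokens inside named-entity spans:
entities = [tok for tok, *_, ne in token_annotations if ne != "O"]
print(entities)  # ['Barack', 'Obama', 'Chicago']
```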
Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
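Treebank annotations are commonly stored as bracketed trees. Assuming NLTK is installed (pip install nltk), the snippet below reads one Penn-Treebank-style parse and extracts its noun phrases.

```python
# Read a bracketed treebank parse and list its noun phrases.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
for subtree in parse.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(subtree.leaves()))   # -> "the cat", then "the mat"
```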
The Child Language Data Exchange System (CHILDES) is a corpus established in 1984 by Brian MacWhinney and Catherine Snow to serve as a central repository for data on first language acquisition. Its earliest transcripts date from the 1960s, and as of 2015 it held contents in 26 languages from 230 different corpora, all of which are publicly available worldwide. CHILDES has since been made a component of the larger TalkBank corpus, which also includes language data from aphasics, second language acquisition, conversation analysis, and classroom language learning. CHILDES is mainly used for analyzing the language of young children and the child-directed speech of adults.
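CHILDES transcripts use the CHAT format, in which main lines begin with a speaker code such as *CHI: followed by a tab. A minimal sketch with a hand-made excerpt:

```python
# Pull child utterances out of a CHAT-format transcript (simplified excerpt).
chat_excerpt = """\
*MOT:\tdo you want the ball ?
*CHI:\tball !
%mor:\tn|ball !
*CHI:\twant ball .
"""
child_utts = [line.split("\t", 1)[1].strip()
              for line in chat_excerpt.splitlines()
              if line.startswith("*CHI:")]
print(child_utts)  # ['ball !', 'want ball .']
```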
InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.
The National Centre for Text Mining (NaCTeM) is a publicly funded text mining (TM) centre. It was established to provide support, advice and information on TM technologies and to disseminate information from the larger TM community, while also providing services and tools in response to the requirements of the United Kingdom academic community.
A speech corpus is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models. In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology, and other fields.
The International Corpus of English (ICE) is a set of corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.
The Survey of English Usage was the first research centre in Europe to carry out research with corpora. The Survey is based in the Department of English Language and Literature at University College London.
Computer-assisted qualitative data analysis software (CAQDAS) offers tools that assist with qualitative research, such as transcription analysis, coding and text interpretation, recursive abstraction, content analysis, discourse analysis, and grounded theory methodology.
In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.
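Gene locations produced by genome annotation are commonly exchanged in tab-separated formats such as GFF3 (nine columns: sequence ID, source, feature type, start, end, score, strand, phase, attributes). The two lines below are made up for illustration.

```python
# Sketch: read gene locations from GFF3-style annotation lines.
gff_lines = [
    "chr1\thavana\tgene\t11869\t14409\t.\t+\t.\tID=gene:ENSG00000223972",
    "chr1\thavana\texon\t11869\t12227\t.\t+\t.\tParent=transcript:ENST...",
]
for line in gff_lines:
    seqid, source, feature, start, end, *_ = line.split("\t")
    if feature == "gene":
        print(f"gene on {seqid}: {start}-{end}")  # gene on chr1: 11869-14409
```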
Emotion recognition is the process of identifying human emotion. People vary widely in their accuracy at recognizing the emotions of others. The use of technology to help people with emotion recognition is a relatively nascent research area. Generally, the technology works best when it uses multiple modalities in context. To date, most work has been conducted on automating the recognition of facial expressions from video, spoken expressions from audio, written expressions from text, and physiology as measured by wearables.
The field of language documentation in the modern context involves a complex and ever-evolving set of tools and methods, and the study and development of their use, especially the identification and promotion of best practices, can be considered a sub-field of language documentation proper. Among these are ethical and recording principles, workflows and methods, hardware tools, and software tools.
In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications."
Hierarchical Event Descriptors (HED) is a conceptual and software framework that includes a family of controlled vocabularies for annotating experimental metadata and experienced events on the timeline of neuroimaging and behavioral experiments. The goal of HED is to standardize annotations and the mechanisms for handling these annotations to enable searching, comparing, and extracting data of interest for analysis. HED is the event annotation mechanism used by the Brain Imaging Data Structure standard for describing events.
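HED annotations are strings of comma-separated tags, with parentheses grouping related tags. The splitter below is a hand-rolled sketch for illustration, not the official HED validator, and the example string is hand-made.

```python
# Split a HED annotation string into top-level tags and tag groups.
def split_hed(s: str) -> list[str]:
    parts, buf, depth = [], [], 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(buf).strip())  # top-level separator
            buf = []
        else:
            buf.append(ch)
    if buf:
        parts.append("".join(buf).strip())
    return parts

print(split_hed("Sensory-event, Visual-presentation, (Red, Triangle)"))
# ['Sensory-event', 'Visual-presentation', '(Red, Triangle)']
```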