Linguistic Data Consortium

Last updated

The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for linguistics research and development purposes. The University of Pennsylvania is the LDC's host institution. The LDC was founded in 1992 with a grant from the US Defense Advanced Research Projects Agency (DARPA), and is partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation. [1] The director of LDC is Mark Liberman. [2]

See also

  1. "About LDC". Linguistic Data Consortium. Retrieved June 18, 2024.
  2. "Staff". Linguistic Data Consortium. Retrieved June 18, 2024.

Related Research Articles

Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.

In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

<span class="mw-page-title-main">Wiktionary</span> Multilingual online dictionary

Wiktionary is a multilingual, web-based project to create a free content dictionary of terms in all natural languages and in a number of artificial languages. These entries may contain definitions, images for illustration, pronunciations, etymologies, inflections, usage examples, quotations, related terms, and translations of terms into other languages, among other features. It is collaboratively edited via a wiki. Its name is a portmanteau of the words wiki and dictionary. It is available in 193 languages and in Simple English. Like its sister project Wikipedia, Wiktionary is run by the Wikimedia Foundation, and is written collaboratively by volunteers, dubbed "Wiktionarians". Its wiki software, MediaWiki, allows almost anyone with access to the website to create and edit entries.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

Brian James MacWhinney is a Professor of Psychology and Modern Languages at Carnegie Mellon University. He specializes in first and second language acquisition, psycholinguistics, and the neurological bases of language, and he has written and edited several books and over 100 peer-reviewed articles and book chapters on these subjects. MacWhinney is best known for his competition model of language acquisition and for creating the CHILDES and TalkBank corpora. He has also helped to develop a stream of pioneering software programs for creating and running psychological experiments, including PsyScope, an experimental control system for the Macintosh; E-Prime, an experimental control system for the Microsoft Windows platform; and System for Teaching Experimental Psychology (STEP), a database of scripts for facilitating and improving psychological and linguistic research.

Mark Yoffe Liberman is an American linguist. He has a dual appointment at the University of Pennsylvania, as Trustee Professor of Phonetics in the Department of Linguistics, and as a professor in the Department of Computer and Information Sciences. He is the founder and director of the Linguistic Data Consortium. Liberman is the Faculty Director of Ware College House at the University of Pennsylvania.

<span class="mw-page-title-main">Eva Hajičová</span> Czech linguist

Eva Hajičová [ˈɛva ˈɦajɪt͡ʃovaː] is a Czech linguist, specializing in topic–focus articulation and corpus linguistics. In 2006, she was awarded the Association for Computational Linguistics (ACL) Lifetime Achievement Award. She was named a fellow of the ACL in 2011.

Linguistic categories include

Computational lexicology is a branch of computational linguistics, which is concerned with the use of computers in the study of lexicon. It has been more narrowly described by some scholars as the use of computers in the study of machine-readable dictionaries. It is distinguished from computational lexicography, which more properly would be the use of computers in the construction of dictionaries, though some researchers have used computational lexicography as synonymous.

Sandra Annear Thompson is an American linguist specializing in discourse analysis, typology, and interactional linguistics. She is Professor Emerita of Linguistics at the University of California, Santa Barbara (UCSB). She has published numerous books, her research has appeared in many linguistics journals, and she serves on the editorial board of several prominent linguistics journals.

Linguistics is the scientific study of language. Linguistics is based on a theoretical as well as a descriptive study of language and is also interlinked with the applied fields of language studies and language learning, which entails the study of specific languages. Before the 20th century, linguistics evolved in conjunction with literary study and did not employ scientific methods. Modern-day linguistics is considered a science because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language – i.e., the cognitive, the social, the cultural, the psychological, the environmental, the biological, the literary, the grammatical, the paleographical, and the structural.

The World Oral Literature Project was "an urgent global initiative to document and disseminate endangered oral literatures before they disappear without record". Directed by Dr Mark Turin and co-located at the Museum of Archaeology and Anthropology, at the University of Cambridge and Yale University, the project was established in January 2009.

The LINGUIST List is an online resource for the academic field of linguistics. It was founded by Anthony Aristar in early 1990 at the University of Western Australia, and is used as a reference by the National Science Foundation in the United States. Its main and oldest feature is the premoderated electronic mailing list, with subscribers all over the world.

The BABEL speech corpus is a corpus of recorded speech materials from five Central and Eastern European languages. Intended for use in speech technology applications, it was funded by a grant from the European Union and completed in 1998. It is distributed by the European Language Resources Association.

In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, but has been a point of focal activity for several W3C community groups, research projects, and infrastructure efforts since then.

Helen Aristar-Dry is an American linguist who currently serves as the series editor for SpringerBriefs in Linguistics. Most notably, from 1991 to 2013 she co-directed The LINGUIST List with Anthony Aristar. She has served as principal investigator or co-Principal Investigator on over $5,000,000 worth of research grants from the National Science Foundation and the National Endowment for the Humanities. She retired as Professor of English Language and Literature from Eastern Michigan University in 2013.

<span class="mw-page-title-main">Dafydd Gibbon</span> British professor

Dafydd Gibbon is a British emeritus professor of English and General Linguistics at Bielefeld University in Germany, specialising in computational linguistics, the lexicography of spoken languages, applied phonetics and phonology. He is particularly concerned with endangered languages and has received awards from the Ivory Coast, Nigeria and Poland.

Claire Hardaker is a British linguist. She is senior lecturer at the Department of Linguistics and English Language of Lancaster University, United Kingdom. Her research involves forensic linguistics and corpus linguistics. Her research focuses on deceptive, manipulative, and aggressive language in a range of online data. She has investigated behaviours ranging from trolling and disinformation to human trafficking and online scams. Her research typically uses corpus linguistic methods to approach forensic linguistic analyses.

In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications."

The Switchboard Telephone Speech Corpus is a corpus of spoken English language consisted of almost 260 hours of speech. It was created in 1990 by Texas Instruments via a DARPA grant, and released in 1992 by NIST. The corpus contains 2,400 telephone conversations among 543 US speakers. Participants did not know each other, and conversations were held on topics from a predetermined list.