The LRE Map (Language Resources and Evaluation Map) is a large, freely accessible database of resources dedicated to natural language processing (NLP). A distinctive feature of the LRE Map is that its records are collected during the submission process of major NLP conferences. The records are then cleaned and gathered into a global database called the "LRE Map". [1]
The LRE Map is intended to be an instrument for collecting information about language resources and, at the same time, a community for users: a place to share and discover resources, discuss opinions, provide feedback, and identify new trends. It is an instrument for discovering, searching and documenting language resources, understood here in a broad sense as both data and tools.
The large amount of information contained in the Map can be analyzed in many different ways. For instance, the LRE Map can provide information about the most frequent type of resource, the most represented language, the applications for which resources are used or are being developed, the proportion of new resources vs. already existing ones, or the way in which resources are distributed to the community.
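Analyses of this kind are straightforward frequency counts over the resource records. The following sketch illustrates the idea in Python on a toy dataset; the field names and values are assumptions for illustration, not the actual LRE Map schema.

```python
from collections import Counter

# Hypothetical record format; the actual LRE Map schema is richer.
records = [
    {"type": "Corpus", "language": "English", "status": "Existing"},
    {"type": "Lexicon", "language": "French", "status": "Newly created"},
    {"type": "Corpus", "language": "English", "status": "Newly created"},
]

# Most frequent resource type and most represented language.
type_counts = Counter(r["type"] for r in records)
lang_counts = Counter(r["language"] for r in records)

# Proportion of newly created resources vs. already existing ones.
new_share = sum(r["status"] == "Newly created" for r in records) / len(records)

print(type_counts.most_common(1))   # [('Corpus', 2)]
print(lang_counts.most_common(1))   # [('English', 2)]
```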
Several institutions worldwide maintain catalogues of language resources (ELRA, LDC, NICT Universal Catalogue, ACL Data and Code Repository, OLAC, LT World, etc.). [2] However, it has been estimated that only 10% of existing resources are known, either through distribution catalogues or via direct publicity by providers (web sites and the like). The rest remains hidden, emerging only briefly when a resource is presented in the context of a research paper or report at some conference. Even then, a resource may remain in the background simply because the focus of the research is not on the resource per se.
The LRE Map originated under the name "LREC Map" during the preparation of the LREC 2010 conference. [3] More specifically, the idea was discussed within the FLaReNet project, and in collaboration with ELRA and the Institute of Computational Linguistics of CNR in Pisa, the Map was put in place at LREC 2010. [4] The LREC organizers asked authors to provide some basic information about all the resources (in a broad sense, i.e. including tools, standards and evaluation packages), either used or created, described in their papers. All these descriptors were then gathered into a global matrix called the LREC Map.
The same methodology and requirements for authors were then applied and extended to other conferences, namely COLING-2010, [5] EMNLP-2010, [6] RANLP-2011, [7] LREC 2012, [8] LREC 2014 [9] and LREC 2016. [10]
After this generalization to other conferences, the LREC Map was renamed the LRE Map.
The database grows over time; the data collected so far amount to 4,776 entries.
Each resource is described according to a fixed set of attributes, covering among other things its type, language, and intended application.
The LRE Map is an important tool for charting the NLP field. Unlike other studies based on subjective scorings, the LRE Map is built from factual records.
Beyond information gathering, the Map has great potential for many other uses.
The data were then cleaned and sorted by Joseph Mariani (CNRS-LIMSI IMMI) and Gil Francopoulo (CNRS-LIMSI IMMI + Tagmatica) in order to compute the various matrices of the final FLaReNet [11] reports. One of them, the matrix for written data at LREC 2010, is as follows:
Language | Corpus | Lexicon | Ontology | Grammar/Language Model | Terminology
---|---|---|---|---|---
Bulgarian | 7 | 6 | 1 | 1 | 1
Czech | 12 | 7 | 2 | 1 | 1
Danish | 6 | 2 | 0 | 2 | 0
Dutch | 17 | 8 | 2 | 1 | 2
English | 206 | 77 | 18 | 11 | 10
Estonian | 3 | 1 | 0 | 0 | 1
Finnish | 3 | 2 | 0 | 1 | 0
French | 44 | 24 | 3 | 4 | 5
German | 43 | 15 | 4 | 2 | 3
Greek | 10 | 3 | 2 | 0 | 0
Hungarian | 8 | 4 | 0 | 1 | 1
Irish | 1 | 0 | 0 | 0 | 0
Italian | 32 | 16 | 4 | 2 | 0
Latvian | 9 | 0 | 0 | 0 | 1
Lithuanian | 4 | 0 | 2 | 0 | 1
Maltese | 1 | 0 | 0 | 1 | 0
Polish | 7 | 2 | 1 | 2 | 1
Portuguese | 19 | 6 | 1 | 1 | 0
Romanian | 12 | 7 | 1 | 1 | 0
Slovak | 2 | 0 | 0 | 1 | 0
Slovene | 5 | 1 | 0 | 0 | 0
Spanish | 29 | 19 | 4 | 5 | 2
Swedish | 19 | 4 | 0 | 1 | 0
Other Europe | 19 | 11 | 3 | 3 | 2
Regional Europe | 18 | 8 | 0 | 1 | 3
Multilingual | 5 | 3 | 1 | 0 | 1
Language independent | 9 | 3 | 16 | 2 | 1
Non applicable | 2 | 0 | 2 | 1 | 0
Total | 552 | 229 | 67 | 45 | 36
English is by far the most studied language, followed by French and German, and then by Italian and Spanish.
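A cross-tabulation like the matrix above can be derived from per-resource records. The following is a minimal sketch in Python; the record fields are assumptions for illustration, not the actual LRE Map schema.

```python
from collections import defaultdict

# Hypothetical per-resource records (field names are assumptions).
records = [
    {"language": "English", "type": "Corpus"},
    {"language": "English", "type": "Lexicon"},
    {"language": "French", "type": "Corpus"},
    {"language": "German", "type": "Corpus"},
]

# Cross-tabulate language x resource type.
matrix = defaultdict(lambda: defaultdict(int))
for r in records:
    matrix[r["language"]][r["type"]] += 1

# Column totals, as in the "Total" row of the published matrix.
totals = defaultdict(int)
for row in matrix.values():
    for rtype, n in row.items():
        totals[rtype] += n

print(matrix["English"]["Corpus"], totals["Corpus"])  # 1 3
```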
The LRE Map has since been extended to the Language Resources and Evaluation journal [12] and other conferences.
WordNet is a lexical database of semantic relations between words. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples. WordNet can thus be seen as a combination and extension of a dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. WordNet was first created in the English language, and the English WordNet database and software tools have been released under a BSD-style license and are freely available for download from the WordNet website. There are now WordNets in more than 200 languages.
The European Language Resources Association (ELRA) is a not-for-profit organisation established under the law of the Grand Duchy of Luxembourg. Its seat is in Luxembourg and its headquarters is in Paris, France.
Beryl T. "Sue" Atkins was a British lexicographer, specialising in computational lexicography, who pioneered the creation of bilingual dictionaries from corpus data.
Christiane D. Fellbaum is a Lecturer with Rank of Professor in the Program in Linguistics and the Computer Science Department at Princeton University. The co-developer of the WordNet project, she is also its current director.
Language resource management – Lexical markup framework (LMF) is the International Organization for Standardization ISO/TC37 standard for natural language processing (NLP) and machine-readable dictionary (MRD) lexicons. Its scope is the standardization of principles and methods relating to language resources in the contexts of multilingual communication.
The International Conference on Language Resources and Evaluation is an international conference organised by the European Language Resources Association every other year with the support of institutions and organisations involved in Natural language processing. The series of LREC conferences was launched in Granada in 1998.
In digital lexicography, natural language processing, and digital humanities, a lexical resource is a language resource consisting of data regarding the lexemes of the lexicon of one or more languages, e.g., in the form of a database.
The German Reference Corpus is an electronic archive of text corpora of contemporary written German. It was first created in 1964 and is hosted at the Institute for the German Language in Mannheim, Germany. The corpus archive is continuously updated and expanded. It currently comprises more than 4.0 billion word tokens and constitutes the largest linguistically motivated collection of contemporary German texts. Today, it is one of the major resources worldwide for the study of written German.
The knowledge acquisition bottleneck is perhaps the major impediment to solving the word sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requisite that can so far be met only for a handful of words for testing purposes, as it is done in the Senseval exercises.
Deep Linguistic Processing with HPSG - INitiative (DELPH-IN) is a collaboration where computational linguists worldwide develop natural language processing tools for deep linguistic processing of human language. The goal of DELPH-IN is to combine linguistic and statistical processing methods in order to computationally understand the meaning of texts and utterances.
UBY-LMF is a format for standardizing lexical resources for Natural Language Processing (NLP). UBY-LMF conforms to the ISO standard for lexicons: LMF, designed within the ISO-TC37, and constitutes a so-called serialization of this abstract standard. In accordance with the LMF, all attributes and other linguistic terms introduced in UBY-LMF refer to standardized descriptions of their meaning in ISOCat.
NooJ is a linguistic development environment software as well as a corpus processor constructed by Max Silberztein. NooJ allows linguists to construct the four classes of the Chomsky-Schützenberger hierarchy of generative grammars: Finite-State Grammars, Context-Free Grammars, Context-Sensitive Grammars as well as Unrestricted Grammars, using either a text editor, or a Graph editor.
UBY is a large-scale lexical-semantic resource for natural language processing (NLP) developed at the Ubiquitous Knowledge Processing Lab (UKP) in the department of Computer Science of the Technische Universität Darmstadt. UBY is based on the ISO standard Lexical Markup Framework (LMF) and combines information from several expert-constructed and collaboratively constructed resources for English and German.
Manually Annotated Sub-Corpus (MASC) is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC). The OANC is a 15 million word corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions.
Joseph Mariani is a French computer science researcher and pioneer in the field of speech processing.
Helen Aristar-Dry is an American linguist who currently serves as the series editor for SpringerBriefs in Linguistics. Most notably, from 1991 to 2013 she co-directed The LINGUIST List with Anthony Aristar. She has served as principal investigator or co-Principal Investigator on over $5,000,000 worth of research grants from the National Science Foundation and the National Endowment for the Humanities. She retired as Professor of English Language and Literature from Eastern Michigan University in 2013.
In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications."
CorCenCC or the National Corpus of Contemporary Welsh is a language resource for Welsh speakers, Welsh learners, Welsh language researchers, and anyone who is interested in the Welsh language. CorCenCC is a freely accessible collection of multiple language samples, gathered from real-life communication, and presented in the searchable online CorCenCC text corpus. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels.
Concepticon is an open-source online lexical database of linguistic concept lists. It links concept labels in concept lists to concept sets.