OntoLex

OntoLex is the short name of a vocabulary for lexical resources in the web of data (OntoLex-Lemon) and the short name of the W3C community group that created it (W3C Ontology-Lexica Community Group). [1]

OntoLex-Lemon vocabulary

The OntoLex-Lemon vocabulary is a model for publishing lexical data as a knowledge graph, in an RDF format and/or as Linguistic Linked Open Data. Since its publication as a W3C Community Group report in 2016, [2] it has served as "a de facto standard to represent ontology-lexica on the Web". [3] OntoLex-Lemon is a revision of the Lemon vocabulary originally proposed by McCrae et al. (2011). [4]

Fig. 1. OntoLex-Lemon core model

The core elements of OntoLex-Lemon, shown in Fig. 1, are:

- lexical entries (ontolex:LexicalEntry), representing individual words, multi-word expressions or affixes;
- forms (ontolex:Form), the inflected variants of a lexical entry, linked to it via ontolex:canonicalForm or ontolex:otherForm and carrying written (ontolex:writtenRep) and phonetic representations;
- lexical senses (ontolex:LexicalSense), linked to the entry via ontolex:sense, which connect a lexical entry to the ontology entity that captures its meaning (ontolex:reference); the property ontolex:denotes links entry and ontology entity directly;
- lexical concepts (ontolex:LexicalConcept), language-independent units of meaning, such as wordnet synsets, that a lexical entry can evoke (ontolex:evokes).
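The following minimal sketch, which is not part of the W3C report, illustrates how these core elements can be instantiated with the Python rdflib library; the lexicon namespace http://example.org/lexicon/ and the DBpedia IRI are placeholders chosen for the example.

```python
# Illustrative sketch only: builds one OntoLex-Lemon lexical entry for the
# English noun "cat" with a canonical form and a sense that points to an
# ontology entity. The example namespace EX is a placeholder.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("http://example.org/lexicon/")  # hypothetical namespace

g = Graph()
g.bind("ontolex", ONTOLEX)
g.bind("ex", EX)

entry = EX["cat_n"]         # the lexical entry
form = EX["cat_n_form"]     # its canonical form
sense = EX["cat_n_sense1"]  # one of its senses

g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("cat", lang="en")))

g.add((entry, ONTOLEX.sense, sense))
g.add((sense, RDF.type, ONTOLEX.LexicalSense))
# The sense refers to an ontology entity that captures the meaning.
g.add((sense, ONTOLEX.reference, URIRef("http://dbpedia.org/resource/Cat")))

print(g.serialize(format="turtle"))
```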

Aside from the core module (namespace http://www.w3.org/ns/lemon/ontolex#), further modules provide dedicated vocabulary for representing lexicon metadata [6] (namespace http://www.w3.org/ns/lemon/lime#), lexical-semantic relations such as translation and variation (namespace http://www.w3.org/ns/lemon/vartrans#), the decomposition of multi-word expressions (namespace http://www.w3.org/ns/lemon/decomp#) and syntactic frames (namespace http://www.w3.org/ns/lemon/synsem#).
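As a sketch of how one of these modules is used, the following rdflib fragment (again with the placeholder example.org namespace) describes a hypothetical one-entry lexicon using the lime metadata vocabulary; the entry count and IRIs are assumptions made for the example only.

```python
# Illustrative sketch only: lime metadata for a hypothetical one-entry lexicon.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

LIME = Namespace("http://www.w3.org/ns/lemon/lime#")
EX = Namespace("http://example.org/lexicon/")  # hypothetical namespace

g = Graph()
g.bind("lime", LIME)
g.bind("ex", EX)

lexicon = EX["englishLexicon"]
g.add((lexicon, RDF.type, LIME.Lexicon))
g.add((lexicon, LIME.language, Literal("en")))                           # language of the lexicon
g.add((lexicon, LIME.lexicalEntries, Literal(1, datatype=XSD.integer)))  # number of entries
g.add((lexicon, LIME.entry, EX["cat_n"]))                                # the entry from the sketch above

print(g.serialize(format="turtle"))
```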

The data structures of OntoLex-Lemon are comparable with those of other dictionary formats (see related vocabularies below). The innovative aspect of OntoLex-Lemon is that it provides such a data model as an RDF vocabulary, which enables novel use cases based on web technologies rather than stand-alone dictionaries (e.g., translation inference, see applications below). For the foreseeable future, OntoLex-Lemon is also likely to remain unique in this role, as the (Linguistic) Linked Open Data community strongly encourages the reuse of existing vocabularies [7] and, as of December 2019, OntoLex-Lemon is the only established (i.e., published by W3C or another standardization initiative) vocabulary for its purpose. This is also reflected in recent extensions to the original OntoLex-Lemon specification, where new modules have been developed to carry OntoLex-Lemon into further areas of application:

- OntoLex-Lexicog, a module for lexicography that captures the structure of dictionaries and other lexicographic resources, published as a final community group report in September 2019; [5] [8]
- OntoLex-Morph, a module for morphology (inflection and word formation), under development; [9] [10]
- OntoLex-FrAC, a module for frequency, attestation and corpus information, available as a draft specification; [11] [12]
- in addition, the LexInfo data category ontology provides linguistic categories (e.g., parts of speech) for use with OntoLex-Lemon. [13]

Applications

OntoLex-Lemon is widely used for lexical resources in the context of Linguistic Linked Open Data. Selected applications include:

- standardization and infrastructure efforts, such as the OASIS LEXIDMA technical committee on a lexicographic data model and API, [14] the European Commission's Public Multilingual Knowledge Infrastructure (PMKI), [15] [16] the LexO editor presented by CLARIN-IT [17] and the collaborative editing platform VocBench; [18] [19] [20]
- dictionary publishing, e.g., the Lexicala API [21] and a retro-digitized dictionary of Old Occitan medico-botanical terminology; [22]
- the Translation Inference Across Dictionaries (TIAD) shared tasks, in which new bilingual dictionaries are inferred from existing ones published with OntoLex-Lemon; [23] [24] [25] [26] [27]
- linked-data editions of large lexical resources, including DBnary (an RDF edition of Wiktionary), [28] [29] PanLex, [30] the Princeton WordNet and the Global WordNet formats, [31] [32] BabelNet [33] [34] and the LiLa knowledge base of linguistic resources for Latin. [35] [36] [37]
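To illustrate the kind of web-based processing such applications perform, the following self-contained sketch builds a small OntoLex graph in memory and queries it with SPARQL via rdflib; real systems would run comparable queries against the much larger datasets and endpoints listed above, and all IRIs are illustrative placeholders.

```python
# Illustrative sketch only: query a small in-memory OntoLex graph with SPARQL.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("http://example.org/lexicon/")  # hypothetical namespace

g = Graph()
for lemma in ("cat", "dog"):
    entry, form = EX[f"{lemma}_n"], EX[f"{lemma}_n_form"]
    g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
    g.add((entry, ONTOLEX.canonicalForm, form))
    g.add((form, ONTOLEX.writtenRep, Literal(lemma, lang="en")))

# List every lexical entry with the written representation of its canonical form.
query = """
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
SELECT ?entry ?writtenRep WHERE {
    ?entry a ontolex:LexicalEntry ;
           ontolex:canonicalForm/ontolex:writtenRep ?writtenRep .
}
"""
for row in g.query(query):
    print(row.entry, row.writtenRep)
```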

OntoLex development is regularly addressed in scientific events dedicated to ontologies, linked data or lexicography. Since 2017, a dedicated workshop series on the OntoLex model has been held every two years. [38]

Related vocabularies that focus on standardizing and publishing lexical resources include DICT (a text-based format), the XML Dictionary eXchange Format, TEI-Dict (XML) and the Lexical Markup Framework (an abstract model usually serialized in XML; the Lemon vocabulary originally evolved from an RDF serialization of LMF). OntoLex-Lemon differs from these earlier models in being a native Linked Open Data vocabulary that does not just formalize the structure and semantics of machine-readable dictionaries but is designed to facilitate information integration between them.

Related Research Articles

Lexicology is the branch of linguistics that analyzes the lexicon of a specific language. A word is the smallest meaningful unit of a language that can stand on its own, and is made up of small components called morphemes and even smaller elements known as phonemes, or distinguishing sounds. Lexicology examines every feature of a word – including formation, spelling, origin, usage, and definition.

WordNet: computational lexicon of English

WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples. WordNet can thus be seen as a combination and extension of a dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. WordNet was first created in English; the English WordNet database and software tools have been released under a BSD-style license and are freely available for download from the WordNet website.

In computer science and information science, an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject.

Glossary: alphabetical list of terms relevant to a certain field of study or action

A glossary, also known as a vocabulary or clavis, is an alphabetical list of terms in a particular domain of knowledge with the definitions for those terms. Traditionally, a glossary appears at the end of a book and includes terms within that book that are either newly introduced, uncommon, or specialized. While glossaries are most commonly associated with non-fiction books, some works of fiction also include a glossary for unfamiliar terms.

Wiktionary: multilingual online dictionary

Wiktionary is a multilingual, web-based project to create a free content dictionary of terms in all natural languages and in a number of artificial languages. These entries may contain definitions, images for illustrations, pronunciations, etymologies, inflections, usage examples, quotations, related terms, and translations of words into other languages, among other features. It is collaboratively edited via a wiki. Its name is a portmanteau of the words wiki and dictionary. It is available in 185 languages and in Simple English. Like its sister project Wikipedia, Wiktionary is run by the Wikimedia Foundation, and is written collaboratively by volunteers, dubbed "Wiktionarians". Its wiki software, MediaWiki, allows almost anyone with access to the website to create and edit entries.

Simple Knowledge Organization System (SKOS) is a W3C recommendation designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is part of the Semantic Web family of standards built upon RDF and RDFS, and its main objective is to enable easy publication and use of such vocabularies as linked data.

Linguistic categories include lexical categories (parts of speech such as noun and verb), syntactic categories, and grammatical categories (features such as tense, number and gender).

DOGMA, short for Developing Ontology-Grounded Methods and Applications, is the name of a research project in progress at Vrije Universiteit Brussel's STARLab, Semantics Technology and Applications Research Laboratory. It is an internally funded project concerned with the more general aspects of extracting, storing, representing and browsing information.

Computational lexicology is a branch of computational linguistics that is concerned with the use of computers in the study of the lexicon. It has been more narrowly described by some scholars as the use of computers in the study of machine-readable dictionaries. It is distinguished from computational lexicography, which more properly would be the use of computers in the construction of dictionaries, though some researchers have used the two terms synonymously.

OneSource is an evolving data analysis tool used internally by the Air Combat Command (ACC) Vocabulary Services Team and made available to the general data management community. It is used by the greater US Department of Defense (DoD) and NATO community for controlled vocabulary management and exploration. It provides its users with a consistent view of syntactical, lexical, and semantic data vocabularies through a community-driven web environment. It was created with the intention of directly supporting the DoD Net-centric Data Strategy of visible, understandable, and accessible data assets.

A semantic reasoner, reasoning engine, rules engine, or simply a reasoner, is a piece of software able to infer logical consequences from a set of asserted facts or axioms. The notion of a semantic reasoner generalizes that of an inference engine, by providing a richer set of mechanisms to work with. The inference rules are commonly specified by means of an ontology language, and often a description logic language. Many reasoners use first-order predicate logic to perform reasoning; inference commonly proceeds by forward chaining and backward chaining. There are also examples of probabilistic reasoners, including non-axiomatic reasoning systems, and probabilistic logic networks.

Language resource management – Lexical markup framework (LMF) is the International Organization for Standardization ISO/TC37 standard for natural language processing (NLP) and machine-readable dictionary (MRD) lexicons. Its scope is the standardization of principles and methods relating to language resources in the contexts of multilingual communication.

In digital lexicography, natural language processing, and digital humanities, a lexical resource is a language resource consisting of data regarding the lexemes of the lexicon of one or more languages, e.g., in the form of a database.

In the fields of computational linguistics and applied linguistics, a morphological dictionary is a linguistic resource that contains correspondences between surface forms and lexical forms of words. Surface forms of words are those found in natural language text. The corresponding lexical form of a surface form is the lemma followed by grammatical information. In English, give, gives, giving, gave and given are surface forms of the verb give. The lexical form would be "give", verb. There are two kinds of morphological dictionaries: morpheme-aligned dictionaries and full-form (non-aligned) dictionaries.
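As a minimal, purely illustrative sketch (not a description of any particular resource), a full-form dictionary can be represented as a mapping from surface forms to pairs of lemma and grammatical information, using the forms of give mentioned above:

```python
# Illustrative sketch of a full-form (non-aligned) morphological dictionary:
# each surface form maps to its lexical form (lemma plus grammatical information).
FULL_FORM_DICT = {
    "give":   ("give", "verb, base form"),
    "gives":  ("give", "verb, 3rd person singular present"),
    "giving": ("give", "verb, present participle"),
    "gave":   ("give", "verb, past tense"),
    "given":  ("give", "verb, past participle"),
}

def analyse(surface_form: str):
    """Return the (lemma, grammatical information) pair for a surface form, if known."""
    return FULL_FORM_DICT.get(surface_form)

print(analyse("gave"))  # ('give', 'verb, past tense')
```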

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

BabelNet: multilingual semantic network and encyclopedic dictionary

BabelNet is a multilingual lexicalized semantic network and ontology developed at the NLP group of the Sapienza University of Rome. BabelNet was automatically created by linking Wikipedia to the most popular computational lexicon of the English language, WordNet. The integration is done using an automatic mapping and by filling in lexical gaps in resource-poor languages by using statistical machine translation. The result is an encyclopedic dictionary that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations. Additional lexicalizations and definitions are added by linking to free-license wordnets, OmegaWiki, the English Wiktionary, Wikidata, FrameNet, VerbNet and others. Similarly to WordNet, BabelNet groups words in different languages into sets of synonyms, called Babel synsets. For each Babel synset, BabelNet provides short definitions in many languages harvested from both WordNet and Wikipedia.

UBY-LMF is a format for standardizing lexical resources for Natural Language Processing (NLP). UBY-LMF conforms to the ISO standard for lexicons: LMF, designed within the ISO-TC37, and constitutes a so-called serialization of this abstract standard. In accordance with the LMF, all attributes and other linguistic terms introduced in UBY-LMF refer to standardized descriptions of their meaning in ISOCat.

UBY is a large-scale lexical-semantic resource for natural language processing (NLP) developed at the Ubiquitous Knowledge Processing Lab (UKP) in the department of Computer Science of the Technische Universität Darmstadt. UBY is based on the ISO standard Lexical Markup Framework (LMF) and combines information from several expert-constructed and collaboratively constructed resources for English and German.

In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data cloud was conceived and is maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, but has since become a focal point of activity for several W3C community groups, research projects, and infrastructure efforts.

In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications."

References

  1. "OntoLex community portal". W3C. Retrieved 6 December 2019.
  2. Cimiano, Philipp; McCrae, John P.; Buitelaar, Paul. "Lexicon Model for Ontologies: Community Report, 10 May 2016". W3C. Retrieved 6 December 2019.
  3. Julia Bosque-Gil, Jorge Gracia and Elena Montiel-Ponsoda (July 2017). "Towards a module for lexicography in OntoLex" (PDF). Kernerman Dictionary News. No. 25. Retrieved 5 April 2020.
  4. McCrae, John; Spohr, Dennis; Cimiano, Philipp (2011). "Linking lexical resources and ontologies on the Semantic Web with Lemon". Proceedings of the Extended Semantic Web Conference (ESWC-2011), Iraklion, Greece: 245–259.
  5. Bosque-Gil, Julia; Gracia, Jorge. "The OntoLex Lemon Lexicography Module". W3C. Retrieved 6 December 2019.
  6. Fiorelli, Manuel; Stellato, Armando; McCrae, John P.; Cimiano, Philipp; Pazienza, Maria Teresa (2015). Gandon, Fabien; Sabou, Marta; Sack, Harald; d’Amato, Claudia; Cudré-Mauroux, Philippe; Zimmermann, Antoine (eds.). "LIME: The Metadata Module for OntoLex". The Semantic Web. Latest Advances and New Domains. Lecture Notes in Computer Science. Springer International Publishing. 9088: 321–336. doi:10.1007/978-3-319-18818-8_20. ISBN   978-3-319-18818-8.
  7. "Linguistic Linked Open Data. Information about the current status of the growing cloud of linguistic linked open data" . Retrieved 10 December 2019.
  8. Bosque-Gil, Julia; Gracia, Jorge. "The OntoLex Lemon Lexicography Module Final Community Group Report 17 September 2019". W3C. Retrieved 10 December 2019.
  9. "Morphology" . Retrieved 10 December 2019.
  10. Klimek, Bettina; McCrae, John P.; Bosque-Gil, Julia; Ionov, Maxim; Tauber, James K.; Chiarcos, Christian. Challenges for the Representation of Morphology in Ontology Lexicons, in: Kosem, I., Zingano Kuhn, T., Correia, M., Ferreria, J. P., Jansen, M., Pereira, I., Kallas, J., Jakubíček, M., Krek, S. & Tiberius, C. (eds.) 2019. Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference. 1-3 October 2019, Sintra, Portugal (PDF). Brno: Lexical Computing CZ, s.r.o. pp. 570–591.
  11. "Frequency, Attestation and Corpus Information" . Retrieved 10 December 2019.
  12. Chiarcos, Christian; Ionov, Maxim. "OntoLex-Lemon Module for Frequency, Attestation and Corpus Information (draft specification)". Retrieved 9 April 2020.
  13. "LexInfo - Data Category Ontology for OntoLex-Lemon" . Retrieved 4 January 2020.
  14. censign. "Call for Participation: OASIS Lexicographic Infrastructure Data Model and API (LEXIDMA) TC". OASIS. Retrieved 10 December 2019.
  15. Schmitz, P.; Francesconi, E.; Hajlaoui, N.; Batouche, B.; Stellato, A. (2018). Semantic Interoperability of Multilingual Language Resources by Automatic Mapping, In: International Conference on Electronic Government and the Information Systems Perspective. Cham: Springer. pp. 153–163.
  16. Batouche, Brahim; Schmitz, Peter; Francesconi, Enrico; Hajlaoui, Najeh (December 2, 2018). PMKI – Public Multilingual Knowledge Infrastructure. Documentation of the PMKI data model (PDF). European Technical Specification. Retrieved 10 December 2019.
  17. Lenardič, Jakob. "CLARIN-IT presents LexO: Where Lexicography Meets the Semantic Web". CLARIN. Retrieved 10 December 2019.
  18. The AIMS Team. "Version 4.0.2 of VocBench was released in August 2018". FAO of the United Nations in Italy. Retrieved 10 December 2019.
  19. Stellato, Armando; Rajbhandari, Sachit; Turbati, Andrea; Fiorelli, Manuel; Caracciolo, Caterina; Lorenzetti, Tiziano; Keizer, Johannes; Pazienza, Maria Teresa (2015). Gandon, Fabien; Sabou, Marta; Sack, Harald; d’Amato, Claudia; Cudré-Mauroux, Philippe; Zimmermann, Antoine (eds.). "VocBench: A Web Application for Collaborative Development of Multilingual Thesauri" (PDF). The Semantic Web. Latest Advances and New Domains. Lecture Notes in Computer Science. Springer International Publishing. 9088: 38–53. doi:10.1007/978-3-319-18818-8_3. ISBN   978-3-319-18818-8.
  20. "VocBench 3: a Collaborative Semantic Web Editor for Ontologies, Thesauri and Lexicons | www.semantic-web-journal.net". semantic-web-journal.net. Retrieved 2020-01-17.
  21. Ilan Kernerman and Dorielle Lonke (July 2019). "Lexicala API: A new era in dictionary data" (PDF). Kernerman Dictionary News. No. 27. Retrieved 5 April 2020.
  22. "Dictionary of Old Occitan medico-botanical terminology" . Retrieved 10 December 2019.
  23. "TIAD-2017 Shared Task – Translation Inference Across Dictionaries. Call for Participation" . Retrieved 10 December 2019.
  24. McCrae, John P.; Bond, Francis; Buitelaar, Paul; Cimiano, Philipp; Declerck, Thierry; Gracia, Jorge; Kernerman, Ilan; Montiel Ponsoda, Elena; Ordan, Noam; Piasecki, Maciej (June 18, 2017). Proceedings of the LDK 2017 Workshops: 1st Workshop on the OntoLex Model (OntoLex-2017), Shared Task on Translation Inference Across Dictionaries & Challenges for Wordnets. CEUR. Retrieved 10 December 2019.
  25. "TIAD 2019. 2nd Translation Inference Across Dictionaries (TIAD) Shared Task" . Retrieved 10 December 2019.
  26. Gracia, Jorge; Kabashi, Besim; Kernerman, Ilan (May 20, 2019). Proceedings of TIAD-2019 Shared Task – Translation Inference Across Dictionaries. Leipzig, Germany: CEUR.
  27. "TIAD 2020 -- 2rd Translation Inference Across Dictionaries (TIAD) shared task".
  28. "Dbnary Wiktionary as Linguistic Linked Open Data" . Retrieved 10 December 2019.
  29. Sérasset, Gilles (2016). "DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF". Semantic Web. Retrieved 10 December 2019.
  30. Kamholz, David; Pool, Jonathan; Colowick, Susan M. (2014). PanLex: Building a Resource for Panlingual Lexical Translation, In Proceedings of the 9th Language Resource and Evaluation Conference (LREC-2014), Reykjavik, Iceland, May 2014. European Language Resource Association. pp. 3145–3150. Retrieved 10 December 2019.
  31. "Princeton WordNet 3.1. WordNet RDF" . Retrieved 10 December 2019.
  32. "Global Wordnet Formats: RDF" . Retrieved 10 December 2019.
  33. "BabelNet SPARQL endpoint" . Retrieved 10 December 2019.
  34. Ehrmann, M.; Cecconi, F.; Vannella, D.; McCrae, J.P.; Cimiano, P.; Navigli, R. Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0. In: Proceedings of the 9th Language Resource and Evaluation Conference (LREC-2014), Reykjavik, Iceland, May 2014. European Language Resource Association. pp. 401–408. Retrieved 10 December 2019.
  35. "LiLa SPARQL endpoint" . Retrieved 4 April 2020.
  36. "LiLa query interface" . Retrieved 4 April 2020.
  37. Passarotti, M.C.; Cecchini, F.M.; Franzini, G.; Litta, E.; Mambrini, F.; Ruffolo, P. LiLa: Linking Latin. A Knowledge Base of Linguistic Resources and NLP Tools. In: Proceedings of the 2nd Conference on Language, Data and Knowledge (LDK 2019), Leipzig, Germany, 20-23 May 2019. CEUR Workshop Proceedings. Retrieved 4 April 2020.
  38. Cimiano, Philipp (July 2017). "OntoLex 2017 – 1st workshop on the OntoLex model" (PDF). Kernerman Dictionary News. No. 25. Retrieved 5 April 2020.