Linguistic Linked Open Data

In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, and has since been a focal point of activity for several W3C community groups, research projects, and infrastructure efforts.

Definition and Development

LLOD Cloud (2016-05-24)

Linguistic Linked Open Data describes the publication of data for linguistics and natural language processing using the following principles: [1]

The primary benefits of LLOD have been identified as: [2]

The LLOD cloud diagram is hosted at linguistic-lod.org. [3]

LLOD vocabularies

Aside from gathering metadata and generating the LLOD cloud diagram, the LLOD community is driving the development of community standards with respect to vocabularies, metadata and best practice recommendations.

According to the state-of-the-art overview by Cimiano et al. (2020), [4] these include:

As of mid-2020, most of these community standards were under active development. Particularly problematic is the existence of multiple incompatible standards for linguistic annotations; in early 2020, the W3C Community Group Linked Data for Language Technology began working towards a consolidation of these (and other) vocabularies for linguistic annotations on the web. [15]

Community

The LLOD cloud diagram has been developed and is maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation (since 2014 Open Knowledge), an open and interdisciplinary group of experts in language resources.

The OWLG organizes community events, coordinates LLOD developments, and facilitates interdisciplinary communication among LLOD contributors and users.

Several W3C Business and Community Groups focus on specialized aspects of LLOD:

LLOD development is driven forward by and documented in a series of international workshops, datathons, and associated publications. Among others, these include

Applications of LLOD

Linguistic Linked Open Data is applied to address a number of scientific research problems:

Linguistic Linked Open Data is closely related to the development of

Selected research projects

The use and development of LLOD have been the subject of several large-scale research projects, including

Selected resources

As of October 2018, the 10 most frequently linked resources in the LLOD diagram are (in order of the number of linked datasets):

Aspects

There are recurring discussions about different aspects of the term and its applicability to particular types of resources. [32]

Linguistic Data: Scope and Classification

Aside from resources used in and created for linguistic research, the LLOD cloud diagram also includes ontologies, terminologies and general knowledge bases whose development was not originally driven by an interest in language sciences or language technology, e.g., DBpedia. As a criterion for inclusion in the LLOD diagram, the OWLG requires "linguistic relevance": "[A] dataset is linguistically relevant if it provides or describes language data that can be used for the purpose of linguistic research or natural language processing." [33] This includes linguistic resources in a strict sense ("condition 1": an annotated or otherwise structured resource created for application in language sciences or language technology, as demonstrated, for example, by a scientific publication in a linguistics-related journal or at a linguistics-related conference), but also resources "that can be used for annotating, enriching, retrieving or classifying language resources ... [if their relevance] can be verified by the existence of links between a resource (whose linguistic relevance is to be confirmed) and resources fulfilling condition (1)" ("condition 2"). [34]
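The two conditions can be read as a simple check over a set of dataset descriptions. The following is a minimal sketch of that reading, not the OWLG's actual procedure: the `Dataset` class, its fields, and the dataset names are hypothetical, and links are treated as undirected, which is one possible interpretation of "links between".

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical, simplified description of a dataset in the cloud."""
    name: str
    # Condition 1: created as a language resource, e.g. documented by a
    # publication at a linguistics-related venue (assumed to be known).
    is_language_resource: bool = False
    # Names of other datasets this dataset links to.
    links: set[str] = field(default_factory=set)


def linguistically_relevant(datasets: list[Dataset]) -> set[str]:
    """Return the names of datasets meeting condition 1 or condition 2."""
    by_name = {d.name: d for d in datasets}
    condition1 = {d.name for d in datasets if d.is_language_resource}

    # Build an undirected view of the links (an assumption of this sketch).
    neighbours: dict[str, set[str]] = {name: set() for name in by_name}
    for d in datasets:
        for target in d.links:
            if target in by_name:
                neighbours[d.name].add(target)
                neighbours[target].add(d.name)

    # Condition 2: directly linked with at least one condition-1 resource.
    condition2 = {
        name for name in by_name
        if name not in condition1 and neighbours[name] & condition1
    }
    return condition1 | condition2


if __name__ == "__main__":
    cloud = [
        Dataset("example-corpus", is_language_resource=True,
                links={"example-kb"}),
        Dataset("example-kb"),                       # linked from the corpus
        Dataset("example-other", links={"example-kb"}),
        Dataset("example-unrelated"),
    ]
    print(sorted(linguistically_relevant(cloud)))
```

In this sketch, condition 2 is not applied transitively: `example-other` links only to a resource that itself qualifies via condition 2, so it is not counted as relevant.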

A related issue is the classification of linguistically relevant datasets (or language resources in general). The OWLG developed the following classification for the LLOD cloud diagram: [35]

Note that in this classification, term bases differ slightly from the other categories in that they do not provide grammatical information. However, since they formalize semantic knowledge, they are of immediate relevance for natural language processing tasks such as named entity recognition or anaphora resolution.

Open Data: Availability

LLOD is defined in relation to Linked Open Data, and LLOD resources (data) should thus be published under licenses that conform to the Open Definition. [36] For generating the LLOD cloud diagram (and the LOD diagram), however, this does not yet seem to be enforced, so the technical criterion is availability over the web and a metadata entry. In the OWLG, it has been repeatedly discussed whether non-commercial (academic) resources could be included; the general consensus was to admit them for the moment (2015) but to enforce stricter requirements as the LLOD cloud grows. As of January 2018, no agreement had been reached on when this change would take place. [37] As of January 2020, machine-readable license metadata was available for 86 LLOD resources; of these, 82 adopted open licenses and 4 adopted non-commercial licenses. [38]

In a broader sense, the term LLOD technology (infrastructures, tools, vocabularies) can also be used to refer to the technology regardless of whether open resources are actually involved, e.g., in the name of the EU project Pret-a-LLOD, which features several commercial business cases. [39] This is justified for applications that consume (rather than provide) open data, and also when linked data technology and other LLOD conventions (especially the use of RDF vocabularies developed in the context of LLOD) are adopted in order to facilitate the seamless integration of LLOD resources (open resources).

The abbreviation "LLOD" can be used to refer to either LLOD technology (use of Linked Data and LLOD vocabularies, independent of the legal status of the data being processed) or LLOD resources (open data). For disambiguation, the terms "LLOD resources" and "LLOD technology" can be used. To emphasize application or applicability to non-open resources, "LLD" (Linguistic Linked Data) has also been used. [40] A possible compromise is the acronym "LL(O)D" for the technology. As of June 2020, a "Licensed Linguistic Linked Data" cloud that contains non-open resources did not exist. [38]

Linked Data: Formats

The definition of Linked Data requires the application of RDF or related standards. This includes the W3C recommendations SPARQL, Turtle, JSON-LD, RDF/XML, RDFa, etc. In language technology and the language sciences, however, other formalisms are currently more popular, and the inclusion of such data in the LLOD cloud diagram has occasionally been requested. [32] For several such formats, W3C-standardized wrapping mechanisms exist (e.g., for XML, CSV or relational databases; see the section "Extraction from structured sources to RDF" of the article on knowledge extraction), and such data can be integrated under the condition that the corresponding mapping is provided along with the source data.
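The idea of publishing a mapping together with non-RDF source data can be illustrated with a small, hand-written converter from CSV to N-Triples. This is only a sketch under stated assumptions and not one of the W3C-standardized wrapping mechanisms: the `example.org` namespace, the column names, the `partOfSpeech` property, and the `MAPPING` table are all hypothetical; only the `rdfs:label` IRI is a real vocabulary term.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace for the generated subject IRIs.
BASE = "http://example.org/lexicon/"

# The "mapping provided along with the source data": CSV column -> predicate.
# The partOfSpeech predicate is a made-up placeholder.
MAPPING = {
    "lemma": "http://www.w3.org/2000/01/rdf-schema#label",
    "pos": "http://example.org/vocab/partOfSpeech",
}


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text: str, id_column: str = "id") -> list[str]:
    """Turn each mapped, non-empty CSV cell into one N-Triples statement."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Percent-encode the identifier so the subject is a valid IRI.
        subject = f"<{BASE}{quote(row[id_column], safe='')}>"
        for column, predicate in MAPPING.items():
            value = row.get(column)
            if value:
                triples.append(
                    f'{subject} <{predicate}> "{escape_literal(value)}" .'
                )
    return triples


if __name__ == "__main__":
    source = "id,lemma,pos\n1,house,noun\n2,run,verb\n"
    for line in csv_to_ntriples(source):
        print(line)
```

Keeping the column-to-predicate table separate from the conversion code mirrors the requirement in the text: the source file stays in its original format, and the mapping is what makes it interpretable as RDF.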

Selected literature

A 2022 review paper is:

An exhaustive description of the state of the art on LLOD is provided by

The concept of a Linguistic Linked Open Data cloud was originally introduced by

The first book on the topic is

According to Cimiano et al. (2020), [41] other seminal publications since then include

Developments from 2015 to 2019 are summarized in the collected volume by

References

  1. Open Linguistics Working Group. "Linguistic LOD". linguistic-lod.org. LIDER project. Retrieved 2016-05-24.
  2. Chiarcos, Christian; McCrae, John; Cimiano, Philipp; Fellbaum, Christiane (2013). "Towards open data for linguistics: Lexical Linked Data" (PDF). In: Alessandro Oltramari, Piek Vossen, Lu Qin, and Eduard Hovy (eds.), New Trends of Research in Ontologies and Lexical Resources. Heidelberg: Springer. Retrieved 2016-05-24.
  3. "Linguistic Linked Open Data. Information about the current status of the growing cloud of linguistic linked open data". Retrieved 10 December 2019.
  4. Cimiano, Philipp; Chiarcos, Christian; McCrae, John P.; Gracia, Jorge (2020). Linguistic Linked Data: Representation, Generation and Applications. Springer International Publishing. ISBN 978-3-030-30224-5.
  5. "Lexicon Model for Ontologies: Community Report, 10 May 2016". www.w3.org. Retrieved 2020-06-05.
  6. "Deliverables of W3C's Web Annotation Working Group". w3c.github.io. Retrieved 2020-06-05.
  7. Hellmann, Sebastian; Lehmann, Jens; Auer, Sören; Brümmer, Martin (2013). "Integrating NLP Using Linked Data". In Alani, Harith; Kagal, Lalana; Fokoue, Achille; Groth, Paul; Biemann, Chris; Parreira, Josiane Xavier; Aroyo, Lora; Noy, Natasha; Welty, Chris (eds.). Advanced Information Systems Engineering. pp. 98–113. doi:10.1007/978-3-642-41338-4_7. ISBN 978-3-642-41338-4.
  8. Chiarcos, Christian; Fäth, Christian (2017). "CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way". In Gracia, Jorge; Bond, Francis; McCrae, John P.; Buitelaar, Paul; Chiarcos, Christian; Hellmann, Sebastian (eds.). Language, Data, and Knowledge. Lecture Notes in Computer Science. Vol. 10318. Cham: Springer International Publishing. pp. 74–88. doi:10.1007/978-3-319-59888-8_6. ISBN 978-3-319-59888-8.
  9. Chiarcos, Christian (2012). "POWLA: Modeling Linguistic Corpora in OWL/DL". In Simperl, Elena; Cimiano, Philipp; Polleres, Axel; Corcho, Oscar; Presutti, Valentina (eds.). The Semantic Web: Research and Applications. Lecture Notes in Computer Science. Vol. 7295. Berlin, Heidelberg: Springer. pp. 225–239. doi:10.1007/978-3-642-30284-8_22. ISBN 978-3-642-30284-8.
  10. Chiarcos, Christian; Sukhareva, Maria (2015-01-01). "OLiA – Ontologies of Linguistic Annotation". Semantic Web. 6 (4): 379–386. doi:10.3233/SW-140167. ISSN 1570-0844. S2CID 5956950.
  11. Cimiano, P.; Buitelaar, P.; McCrae, J.; Sintek, M. (2011-03-01). "LexInfo: A declarative model for the lexicon-ontology interface". Journal of Web Semantics. 9 (1): 29–51. doi:10.1016/j.websem.2010.11.001. ISSN 1570-8268.
  12. de Melo, Gerard (2015-01-01). "Lexvo.org: Language-related information for the Linguistic Linked Data cloud". Semantic Web. 6 (4): 393–400. doi:10.3233/SW-150171. ISSN 1570-0844.
  13. "Data Catalog Vocabulary (DCAT) - Version 2". www.w3.org. Retrieved 2020-06-05.
  14. McCrae, John P.; Labropoulou, Penny; Gracia, Jorge; Villegas, Marta; Rodríguez-Doncel, Víctor; Cimiano, Philipp (2015). "One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web". In Gandon, Fabien; Guéret, Christophe; Villata, Serena; Breslin, John; Faron-Zucker, Catherine; Zimmermann, Antoine (eds.). The Semantic Web: ESWC 2015 Satellite Events. Lecture Notes in Computer Science. Vol. 9341. Cham: Springer International Publishing. pp. 271–282. doi:10.1007/978-3-319-25639-9_42. ISBN 978-3-319-25639-9.
  15. ld4lt/linguistic-annotation, ld4lt, 2020-05-19, retrieved 2020-06-05
  16. "Best Practices for Multilingual Linked Open Data Community Group". Retrieved 9 December 2019.
  17. "Linked Data for Language Technology Community Group". Retrieved 9 December 2019.
  18. Bird, Steven; Liberman, Mark. "Towards a formal framework for linguistic annotations" (PDF). In: Proceedings of the International Conference on Spoken Language Processing, Sydney, 1998. Retrieved 2016-05-25.
  19. ISO 24612:2012. "Language resource management -- Linguistic annotation framework (LAF)". ISO. Retrieved 2016-05-25.
  20. Eckart, Richard (2008). Choosing an XML database for linguistically annotated corpora. SDV. Sprache und Datenverarbeitung 32.1/2008: International Journal for Language Data Processing, Workshop Datenbanktechnologien für hypermediale linguistische Anwendungen (KONVENS 2008), Universitätsverlag Rhein-Ruhr, Berlin, Sep 2008. pp. 7–22.
  21. Chiarcos, Christian. "Interoperability of Corpora and Annotations (draft version)" (PDF). In: Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann (eds.) Linked Data in Linguistics. Representing and Connecting Language Data and Language Metadata, 2012. Retrieved 2016-05-25.
  22. "lod2.okfn.org (archived version)". Archived from the original on 7 March 2014. Retrieved 9 December 2019.
  23. "Multilingual Ontologies for Networked Knowledge (Monnet)". European Commission, CORDIS EU research results. Retrieved 10 December 2019.
  24. "LIDER: Linked Data as an enabler of cross-media and multilingual content analytics for enterprises across Europe". European Commission, CORDIS EU research results. Retrieved 10 December 2019.
  25. "Quality Translation by Deep Language Engineering Approaches". European Commission, CORDIS EU research results. Retrieved 10 December 2019.
  26. "Linked Open Dictionaries (LiODi)". Retrieved 10 December 2019.
  27. "Open Framework of E-Services for Multilingual and Semantic Enrichment of Digital Content". Retrieved 10 December 2019.
  28. "POSTDATA – Poetry Standardization and Linked Open Data". Retrieved 10 December 2019.
  29. "Linking Latin. Building a Knowledge Base of Linguistic Resources for Latin". Retrieved 10 December 2019.
  30. "Pret-a-LLOD project home page". Retrieved 10 December 2019. "Pret-a-LLOD". European Commission, CORDIS EU research results. Retrieved 10 December 2019.
  31. "CA18209 - European network for Web-centred linguistic data science". cost. European Cooperation in Science and Technology. Retrieved 10 December 2019.
  32. For a history of these discussions, see the Open Linguistics mailing list archives, available only as a backup under https://github.com/open-linguistics/linguistics.okfn.org/tree/master/backup
  33. Cimiano, Philipp; Chiarcos, Christian; McCrae, John P.; Gracia, Jorge (2020). Linguistic Linked Data: Representation, Generation and Applications. Springer International Publishing. p. 33. ISBN 978-3-030-30224-5.
  34. Cimiano, Philipp; Chiarcos, Christian; McCrae, John P.; Gracia, Jorge (2020). Linguistic Linked Data: Representation, Generation and Applications. Springer International Publishing. pp. 33–34. ISBN 978-3-030-30224-5.
  35. Cimiano, Philipp; Chiarcos, Christian; McCrae, John P.; Gracia, Jorge (2020). Linguistic Linked Data: Representation, Generation and Applications. Springer International Publishing. pp. 36f. ISBN 978-3-030-30224-5.
  36. Chiarcos, Christian; Pareja-Lora, Antonio (2020). "Open Data—Linked Data—Linked Open Data—Linguistic Linked Open Data (LLOD): A General Introduction". In Pareja-Lora, Antonio; Lust, Barbara; Blume, Maria; Chiarcos, Christian (eds.). Development of Linguistic Linked Open Data Resources for Collaborative Data-Intensive Research in the Language Sciences. The MIT Press. pp. 1–18.
  37. "linguistics.okfn.org/003004.html at master · open-linguistics/linguistics.okfn.org · GitHub". GitHub. Retrieved 2020-06-05.
  38. Cimiano, Philipp; Chiarcos, Christian; McCrae, John P.; Gracia, Jorge (2020). Linguistic Linked Data: Representation, Generation and Applications. Springer International Publishing. p. 37. ISBN 978-3-030-30224-5.
  39. "Prêt-à-LLOD – Prêt-à-LLOD project website". Retrieved 2020-06-05.
  40. See the title of the book by Cimiano, Chiarcos, Gracia, McCrae (2020). However, the acronym LLD (June 2020: 7 unambiguous Google Scholar matches) seems to be rarely used in comparison to LLOD (June 2020: 309 unambiguous Google Scholar matches).
  41. Cimiano, Philipp; Chiarcos, Christian; McCrae, John P.; Gracia, Jorge (2020). Linguistic Linked Data: Representation, Generation and Applications. Springer International Publishing. p. vi. ISBN 978-3-030-30224-5.