Version | 1.7 |
---|---|
Framework | Java |
Type | Multilingual lexical semantic resource |
License | Free licenses for the software, mix of licenses for the included resources |
Website | https://www.ukp.tu-darmstadt.de/data/lexical-resources/uby |
UBY [1] is a large-scale lexical-semantic resource for natural language processing (NLP) developed at the Ubiquitous Knowledge Processing Lab (UKP) in the Department of Computer Science of the Technische Universität Darmstadt. UBY is based on the ISO standard Lexical Markup Framework (LMF) and combines information from several expert-constructed and collaboratively constructed resources for English and German.
UBY applies a word sense alignment approach (a subfield of word sense disambiguation) to combine information about nouns and verbs. [2] Currently, UBY contains 12 integrated resources in English and German.
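The idea of aligning senses across resources can be illustrated with a small sketch. The text does not say which alignment algorithm UBY uses, so the example below substitutes a simple gloss-overlap heuristic (Jaccard similarity between definitions). The resource names, sense identifiers, glosses, stopword list, and threshold are all hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

# Tiny illustrative stopword list (an assumption, not taken from UBY).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "to", "in", "for", "that", "is"}


def tokens(gloss: str) -> Set[str]:
    """Lowercased word tokens of a gloss, minus stopwords."""
    return {t for t in re.findall(r"[a-z]+", gloss.lower()) if t not in STOPWORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_senses(
    senses_a: Dict[str, str], senses_b: Dict[str, str], threshold: float = 0.2
) -> List[Tuple[str, str]]:
    """Pair each sense of resource A with its best-overlapping sense in B.

    A pair is kept only if its gloss overlap reaches the threshold.
    """
    alignments = []
    for id_a, gloss_a in senses_a.items():
        best_id, best_score = None, 0.0
        for id_b, gloss_b in senses_b.items():
            score = jaccard(tokens(gloss_a), tokens(gloss_b))
            if score > best_score:
                best_id, best_score = id_b, score
        if best_id is not None and best_score >= threshold:
            alignments.append((id_a, best_id))
    return alignments


# Hypothetical senses of "bank" in two made-up resources.
resource_a = {
    "A:bank#1": "a financial institution that accepts deposits and lends money",
    "A:bank#2": "sloping land beside a river or lake",
}
resource_b = {
    "B:bank#1": "the land alongside a river",
    "B:bank#2": "an institution for depositing and lending money",
}

ALIGNMENTS = align_senses(resource_a, resource_b)
```

Here the financial sense of one resource is paired with the financial sense of the other, and likewise for the river senses. Real alignment methods typically use richer signals than raw word overlap.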
UBY-LMF [3] [4] is a format for standardizing lexical resources for Natural Language Processing (NLP). [5] UBY-LMF conforms to the ISO standard for lexicons: LMF, designed within the ISO-TC37, and constitutes a so-called serialization of this abstract standard. [6] In accordance with the LMF, all attributes and other linguistic terms introduced in UBY-LMF refer to standardized descriptions of their meaning in ISOCat.
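A rough picture of an LMF-style lexicon model is sketched below. The class names loosely follow LMF terminology (lexical resource, lexicon, lexical entry, sense, sense axis), but the fields, methods, and example data are simplifying assumptions. This is neither the UBY-LMF schema nor the DKPro UBY Java API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sense:
    id: str
    definition: str


@dataclass
class LexicalEntry:
    lemma: str
    part_of_speech: str
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Lexicon:
    name: str
    language: str
    entries: List[LexicalEntry] = field(default_factory=list)


@dataclass
class SenseAxis:
    """Links two senses, possibly from different lexicons."""
    sense_one: str
    sense_two: str


@dataclass
class LexicalResource:
    lexicons: List[Lexicon] = field(default_factory=list)
    sense_axes: List[SenseAxis] = field(default_factory=list)

    def senses_of(self, lemma: str) -> List[Tuple[str, Sense]]:
        """Return (lexicon name, sense) for every sense of a lemma."""
        return [
            (lex.name, sense)
            for lex in self.lexicons
            for entry in lex.entries
            if entry.lemma == lemma
            for sense in entry.senses
        ]

    def aligned_with(self, sense_id: str) -> List[str]:
        """Return ids of senses linked to sense_id by a sense axis."""
        linked = []
        for axis in self.sense_axes:
            if axis.sense_one == sense_id:
                linked.append(axis.sense_two)
            elif axis.sense_two == sense_id:
                linked.append(axis.sense_one)
        return linked


# Two hypothetical lexicons kept separate, connected only by a sense axis.
lex_a = Lexicon("LexiconA", "en", [
    LexicalEntry("bank", "noun", [Sense("a1", "a financial institution")]),
])
lex_b = Lexicon("LexiconB", "en", [
    LexicalEntry("bank", "noun", [Sense("b1", "an organization that lends money")]),
])
resource = LexicalResource([lex_a, lex_b], [SenseAxis("a1", "b1")])
```

Keeping each lexicon as its own object while storing links separately means a single lexicon can still be queried on its own, and the links give combined access across lexicons.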
UBY is available as part of the open resource repository DKPro. DKPro UBY is a Java framework for creating and accessing sense-linked lexical resources in accordance with the UBY-LMF lexicon model. While the code of UBY is licensed under a mix of free licenses such as GPL and CC BY-SA, some of the included resources are under different licenses, such as academic use only.
There is also a Semantic Web version of UBY called lemonUby. [7] lemonUby is based on the lemon model as proposed in the Monnet project. lemon is a model for representing lexicons and machine-readable dictionaries that are linked to the Semantic Web and the Linked Data cloud.
BabelNet is an automatically constructed lexical semantic resource that links Wikipedia to the most popular computational lexicons such as WordNet. At first glance, UBY and BabelNet seem to be identical and competing projects; however, the two resources follow different philosophies. In its early stage, BabelNet was primarily based on the alignment of WordNet and Wikipedia, which by the very nature of Wikipedia implied a strong focus on nouns, and especially named entities. Later on, the focus of BabelNet shifted more towards other parts of speech. UBY, however, was focused from the very beginning on verb information, especially syntactic information, which is contained in resources such as VerbNet or FrameNet. Another main difference is that UBY models other resources completely and independently from each other, so that UBY can be used as a wholesale replacement of each of the contained resources. Collective access to multiple resources is provided through the available resource alignments. Moreover, the LMF model in UBY allows a unified way of accessing all resources together as well as individual resources. Meanwhile, BabelNet follows an approach similar to WordNet and bakes selected information types into so-called Babel synsets. This makes access and processing of the knowledge more convenient; however, it blurs the lines between the linked knowledge bases. Additionally, BabelNet enriches the original resources, e.g., by providing automatically created translations for concepts which are not lexicalized in a particular language. Although this provides a great boost of coverage for multilingual applications, the automatic inference of information is always prone to a certain degree of error.
In summary, due to the listed differences between the two resources, the usage of one or the other might be preferred depending on the particular application scenario. In fact, the two resources can be used to provide extensive lexicographic knowledge, especially if they are linked together. The open and well-documented structure of the two resources is a crucial milestone towards this goal.
UBY has been successfully used in different NLP tasks such as Word Sense Disambiguation, [8] Word Sense Clustering, [9] Verb Sense Labeling [10] and Text Classification. [11] UBY also inspired other projects on automatic construction of lexical semantic resources. [12] Furthermore, lemonUby was used to improve machine translation results, especially in finding translations for unknown words. [13]
WordNet is a lexical database that links words through semantic relations, including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples. It can thus be seen as a combination and extension of a dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. It was first created in the English language, and the English WordNet database and software tools have been released under a BSD-style license and are freely available for download from the WordNet website. There are now WordNets in more than 200 languages.
Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.
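One classic baseline for this task is a simplified Lesk-style overlap count: choose the sense whose definition shares the most words with the surrounding context. The sketch below is a minimal illustration under that assumption. The sense labels, glosses, and stopword list are made up, and the sketch is not tied to any particular resource.

```python
import re
from typing import Dict, Set

# Small illustrative stopword list (an assumption).
STOPWORDS = {"a", "an", "the", "of", "and", "that", "as", "such", "on", "at"}


def content_words(text: str) -> Set[str]:
    """Lowercased word tokens of a text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(sense_glosses: Dict[str, str], context: str) -> str:
    """Return the sense id whose gloss overlaps most with the context.

    Ties are resolved in favor of the sense listed first.
    """
    ctx = content_words(context)
    return max(
        sense_glosses,
        key=lambda sense_id: len(content_words(sense_glosses[sense_id]) & ctx),
    )


# Hypothetical sense inventory for the word "bank".
BANK_SENSES = {
    "bank#finance": "an institution that accepts deposits and lends money",
    "bank#river": "sloping land beside a body of water such as a river",
}

river_reading = simplified_lesk(
    BANK_SENSES, "She sat on the bank of the river and watched the water"
)
finance_reading = simplified_lesk(BANK_SENSES, "He deposits money at the bank")
```

The first context shares "river" and "water" with the river gloss, and the second shares "deposits" and "money" with the financial gloss, so each context selects a different sense.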
Wiktionary is a multilingual, web-based project to create a free content dictionary of terms in all natural languages and in a number of artificial languages. These entries may contain definitions, images for illustration, pronunciations, etymologies, inflections, usage examples, quotations, related terms, and translations of terms into other languages, among other features. It is collaboratively edited via a wiki. Its name is a portmanteau of the words wiki and dictionary. It is available in 193 languages and in Simple English. Like its sister project Wikipedia, Wiktionary is run by the Wikimedia Foundation, and is written collaboratively by volunteers, dubbed "Wiktionarians". Its wiki software, MediaWiki, allows almost anyone with access to the website to create and edit entries.
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".
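The "is a" notion of similarity can be made concrete with a path-based measure over a taxonomy: the shorter the path between two concepts through a shared ancestor, the more similar they are. The sketch below assumes a tiny, invented hierarchy and scores a pair as 1 / (1 + shortest path length). It is one of several possible measures.

```python
from typing import Dict, Optional

# Hypothetical "is a" hierarchy: child -> parent.
IS_A: Dict[str, str] = {
    "car": "motor_vehicle",
    "bus": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "vehicle": "artifact",
    "road": "way",
    "way": "artifact",
    "artifact": "entity",
}


def ancestor_distances(node: str) -> Dict[str, int]:
    """Map the node and each of its ancestors to its distance from the node."""
    distances: Dict[str, int] = {}
    current: Optional[str] = node
    steps = 0
    while current is not None:
        distances[current] = steps
        current = IS_A.get(current)
        steps += 1
    return distances


def path_similarity(a: str, b: str) -> float:
    """1 / (1 + length of the shortest is-a path between a and b)."""
    dist_a, dist_b = ancestor_distances(a), ancestor_distances(b)
    common = set(dist_a) & set(dist_b)
    if not common:
        return 0.0
    shortest = min(dist_a[c] + dist_b[c] for c in common)
    return 1.0 / (1.0 + shortest)


car_bus = path_similarity("car", "bus")    # path of length 2 via motor_vehicle
car_road = path_similarity("car", "road")  # path of length 5 via artifact
```

Under this measure "car" scores higher with "bus" than with "road". A relatedness measure would also credit the car and road association, which the is-a hierarchy alone does not capture.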
A sequence of semantically related, ordered words is called a lexical chain. A lexical chain is a sequence of related words in writing, spanning a narrow or wide context window. A lexical chain is independent of the grammatical structure of the text; in effect, it is a list of words that captures a portion of the cohesive structure of the text. A lexical chain can provide a context for the resolution of an ambiguous term and enable disambiguation of the concepts that the term represents.
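A minimal way to build such chains is a greedy pass over the words of a text: each word joins the first existing chain that contains a related word, and otherwise starts a new chain. In the sketch below, the relatedness pairs are invented for illustration. A real system would usually look them up in a lexical resource.

```python
from typing import List

# Hypothetical set of related word pairs (unordered).
RELATED = {
    frozenset(pair)
    for pair in [
        ("car", "road"),
        ("car", "driver"),
        ("road", "traffic"),
        ("bank", "money"),
        ("money", "loan"),
    ]
}


def related(a: str, b: str) -> bool:
    """Two words are related if identical or listed as a related pair."""
    return a == b or frozenset((a, b)) in RELATED


def build_chains(words: List[str]) -> List[List[str]]:
    """Greedily assign each word to the first chain holding a related word."""
    chains: List[List[str]] = []
    for word in words:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)
                break
        else:
            # No existing chain fits, so the word opens a new chain.
            chains.append([word])
    return chains


CHAINS = build_chains(["car", "bank", "road", "money", "traffic", "loan", "driver"])
```

The input interleaves two topics, and the result separates them into a transport chain and a finance chain regardless of where the words appear in the sequence.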
The VerbNet project maps PropBank verb types to their corresponding Levin classes. It is a lexical resource that incorporates both semantic and syntactic information about its contents.
Linguistic categories include
Computational lexicology is a branch of computational linguistics, which is concerned with the use of computers in the study of lexicon. It has been more narrowly described by some scholars as the use of computers in the study of machine-readable dictionaries. It is distinguished from computational lexicography, which more properly would be the use of computers in the construction of dictionaries, though some researchers have used computational lexicography as synonymous.
A machine-readable dictionary (MRD) is a dictionary stored as machine-readable data instead of being printed on paper. It is an electronic dictionary and lexical database.
Language resource management – Lexical markup framework, produced by ISO/TC 37, is the ISO standard for natural language processing (NLP) and machine-readable dictionary (MRD) lexicons. The scope is standardization of principles and methods relating to language resources in the contexts of multilingual communication.
In digital lexicography, natural language processing, and digital humanities, a lexical resource is a language resource consisting of data regarding the lexemes of the lexicon of one or more languages, e.g., in the form of a database.
The Ubiquitous Knowledge Processing Lab is a research lab at the Department of Computer Science at the Technische Universität Darmstadt. It was founded in 2006 by Iryna Gurevych.
SemEval is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive.
GermaNet is a semantic network for the German language. It relates nouns, verbs, and adjectives semantically by grouping lexical units that express the same concept into synsets and by defining semantic relations between these synsets. GermaNet is free for academic use, after signing a license. GermaNet has much in common with the English WordNet and can be viewed as an on-line thesaurus or a light-weight ontology. GermaNet has been developed and maintained at the University of Tübingen since 1997 within the research group for General and Computational Linguistics. It has been integrated into the EuroWordNet, a multilingual lexical-semantic database.
BabelNet is a multilingual lexicalized semantic network and ontology developed at the NLP group of the Sapienza University of Rome. BabelNet was automatically created by linking Wikipedia to the most popular computational lexicon of the English language, WordNet. The integration is done using an automatic mapping and by filling in lexical gaps in resource-poor languages by using statistical machine translation. The result is an encyclopedic dictionary that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations. Additional lexicalizations and definitions are added by linking to free-license wordnets, OmegaWiki, the English Wiktionary, Wikidata, FrameNet, VerbNet and others. Similarly to WordNet, BabelNet groups words in different languages into sets of synonyms, called Babel synsets. For each Babel synset, BabelNet provides short definitions in many languages harvested from both WordNet and Wikipedia.
In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.
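Closeness in the vector space is commonly measured with cosine similarity. The sketch below uses three hand-written, three-dimensional vectors purely for illustration. Learned embeddings typically have far more dimensions, and the values here are assumptions.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings (made-up values, not produced by any model).
EMBEDDINGS: Dict[str, List[float]] = {
    "car": [0.9, 0.1, 0.0],
    "bus": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

car_bus = cosine_similarity(EMBEDDINGS["car"], EMBEDDINGS["bus"])
car_banana = cosine_similarity(EMBEDDINGS["car"], EMBEDDINGS["banana"])
```

With these vectors, "car" and "bus" point in nearly the same direction while "car" and "banana" are almost orthogonal, so the first pair scores much higher.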
In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, but has since been a focal point of activity for several W3C community groups, research projects, and infrastructure efforts.
Iryna Gurevych, a member of the Leopoldina, is a Ukrainian computer scientist. She is a professor at the Department of Computer Science of the Technical University of Darmstadt and Director of the Ubiquitous Knowledge Processing Lab.
OntoLex is the short name of a vocabulary for lexical resources in the web of data (OntoLex-Lemon) and the short name of the W3C community group that created it.