BulNet


The Bulgarian WordNet (BulNet) is an electronic multilingual dictionary of synonym sets along with their explanatory definitions and sets of semantic relations with other words in the language. [1] [2]


It follows the Princeton WordNet (PWN) framework, which implements a traditional semantic network: a structure made up of nodes and the relations between them. [3] [4] [5]

General information

BulNet was started within the EU-funded project BalkaNet - a Multilingual Semantic Network of the Balkan Languages. After BalkaNet's completion, development of BulNet continued with Bulgarian government support.

Contents of BulNet

Categories

As of 2015, BulNet contained more than 80,000 synonym sets distributed across nine parts of speech: nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, particles and interjections.

The words included in BulNet were selected according to several criteria. The main ones are the frequency of a word's occurrences in large text corpora and the coverage of particular synsets: those already featured in the wordnets of other languages, and those that correspond to high-frequency word senses found in parallel corpora.

Synsets

Each synset encodes a relation of equivalence between a number of lexical items, or LITERALS; at least one must be explicitly represented in the SYNSET. Each literal has a unique meaning, specified by the value of SENSE. All literals in a synset pertain to one and the same part of speech (the value of POS) and represent one and the same lexical meaning (the value of DEF). Each synset is linked to its counterpart in PWN 3.0 by means of a unique identification number, ID. Synsets common to the Balkan languages are marked as common concept subsets (BCS).

In a monolingual database, a synset should be linked to at least one other synset through an intralingual relation. Optional information may also be encoded, such as examples of usage, stylistic peculiarities, morphological or syntactic properties, and author and last-edit details.
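The synset structure described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class and field names, the ID strings and the sample Bulgarian entry are hypothetical and do not reproduce BulNet's actual storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Literal:
    word: str   # LITERAL: the lexical item itself
    sense: int  # SENSE: identifies this particular meaning of the word


@dataclass
class Synset:
    id: str          # ID: shared with the counterpart synset in PWN 3.0
    pos: str         # POS: part of speech common to all literals
    definition: str  # DEF: the lexical meaning the literals share
    literals: List[Literal] = field(default_factory=list)
    # (relation name, target synset ID) pairs; names are illustrative
    relations: List[Tuple[str, str]] = field(default_factory=list)
    bcs: Optional[str] = None  # BCS marker; its encoding here is an assumption
    usage: List[str] = field(default_factory=list)  # optional usage examples


def validate(synset: Synset) -> List[str]:
    """Return the obligatory-field problems found in a synset."""
    problems = []
    if not synset.literals:
        problems.append("no LITERAL")
    if not synset.relations:
        problems.append("no intralingual relation")
    return problems


# Hypothetical entry: the IDs below are placeholders, not real BulNet IDs.
dog = Synset(
    id="example-0001",
    pos="n",
    definition="домашно животно от семейство кучета",
    literals=[Literal("куче", 1), Literal("пес", 1)],
    relations=[("hypernym", "example-0002")],
)
print(validate(dog))
```

Keeping the optional information in fields with defaults mirrors the distinction the text draws between obligatory and non-obligatory data: `validate` checks only the former.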

Semantic relations

The large number of relations encoded in BulNet reflects the semantic and derivational richness of the language and supports a range of applications of the multilingual database. At the semantic level, BulNet supports synonym selection, queries for a word's semantic relations within the lexical system of the language (antonymy, holonymy, etc.), queries for explanatory definitions, and lookup of translation equivalents for a lexical item.
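As a rough illustration of the kinds of queries mentioned above, the following sketch runs synonym and relation lookups over a toy in-memory network. The data layout, function names, synset IDs and relation labels are assumptions made for this example; they are not BulNet's interface or data.

```python
from typing import Dict, List

# Toy data: two adjective synsets linked by antonymy (illustrative only).
SYNSETS: Dict[str, dict] = {
    "s1": {"literals": ["голям", "едър"], "relations": {"antonym": ["s2"]}},
    "s2": {"literals": ["малък", "дребен"], "relations": {"antonym": ["s1"]}},
}


def synsets_of(word: str) -> List[str]:
    """IDs of all synsets that contain the word as a literal."""
    return [sid for sid, s in SYNSETS.items() if word in s["literals"]]


def synonyms(word: str) -> List[str]:
    """Other literals that share a synset with the word."""
    result = []
    for sid in synsets_of(word):
        result.extend(w for w in SYNSETS[sid]["literals"] if w != word)
    return result


def related(word: str, relation: str) -> List[str]:
    """Literals of synsets reachable from the word via the given relation."""
    result = []
    for sid in synsets_of(word):
        for target in SYNSETS[sid]["relations"].get(relation, []):
            result.extend(SYNSETS[target]["literals"])
    return result


print(synonyms("голям"))
print(related("голям", "antonym"))
```

Relations are stored between synsets rather than between individual words, so a relation query returns every literal of the target synset.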


Hydra

Hydra is an OS-independent system designed for wordnet development, validation and exploration. The program enables users to browse and edit any number of monolingual wordnets at a time. The individual wordnets are synchronised, so that equivalent synonym sets, or synsets, may be viewed and explored in parallel. [6]
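One simple way to picture a parallel view of equivalent synsets is a lookup by a shared synset identifier across several monolingual wordnets. The sketch below assumes such a shared ID and a plain dictionary layout; it is not Hydra's implementation, and all names and data in it are hypothetical.

```python
from typing import Dict, List, Optional

# Each wordnet maps a shared synset ID to its literals (illustrative data).
WORDNETS: Dict[str, Dict[str, List[str]]] = {
    "bg": {"id-1": ["куче", "пес"]},
    "en": {"id-1": ["dog", "domestic dog"], "id-2": ["cat"]},
}


def parallel_view(
    wordnets: Dict[str, Dict[str, List[str]]], synset_id: str
) -> Dict[str, Optional[List[str]]]:
    """Collect the synset with the given ID from every wordnet.

    A language with no synset under that ID maps to None, which makes
    gaps between the wordnets visible.
    """
    return {lang: wn.get(synset_id) for lang, wn in wordnets.items()}


print(parallel_view(WORDNETS, "id-1"))
print(parallel_view(WORDNETS, "id-2"))
```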


References

  1. Koeva, S. Derivational and morphosemantic relations in Bulgarian Wordnet. In Intelligent Information Systems, XVI, Warsaw, Academic Publishing House, 2008, 359-389. ISBN 978-83-60434-44-4. Archived from the original (PDF) on 2011-07-08. Retrieved 2015-05-12.
  2. Tsvetana Dimitrova, Ekaterina Tarpomanova and Borislav Rizov. Coping with Derivation in the Bulgarian Wordnet. In: Heili Orav, Christiane Fellbaum and Piek Vossen (Eds.) Proceedings of the Seventh Global Wordnet Conference, Tartu, Estonia, 2014, pp. 109-117.
  3. Koeva, S., G. Totkov and A. Genov. Towards Bulgarian WordNet. Romanian Journal of Information Science and Technology, Vol. 7, No. 1-2, 45-61, 2004. ISSN 1453-8245.
  4. Koeva, S. Bulgarian WordNet – development and perspectives. In International Conference Cognitive Modeling in Linguistics, Varna, 2005, 270-271.
  5. Koeva, S. Bulgarian Wordnet - current state, applications and prospects. In Bulgarian-American Dialogues, Prof. M. Drinov Academic Publishing House, Sofia, 2010, 120-132. ISBN 978-954-322-383-1.
  6. Borislav Rizov. Hydra: A Software System for Wordnet. In: Heili Orav, Christiane Fellbaum and Piek Vossen (Eds.) Proceedings of the Seventh Global Wordnet Conference, Tartu, Estonia, 2014, pp. 142-147.
