PlWordNet


plWordNet is a lexico-semantic database of the Polish language. It includes sets of synonymous lexical units (synsets) followed by short definitions. plWordNet serves as a thesaurus-dictionary where concepts (synsets) and individual word meanings (lexical units) are defined by their location in the network of mutual relations, reflecting the lexico-semantic system of the Polish language. [1] plWordNet is also used as one of the basic resources for the construction of natural language processing tools for Polish. [1]


History

plWordNet is being developed at Wrocław University of Technology as part of CLARIN. The work has been carried out by the WrocUT Language Technology Group G4.19 since 2005, [2] funded by the Ministry of Science and Higher Education and by the EU. The thesaurus has been built from the ground up by lexicographers and natural language engineers. [3] The first version of plWordNet, published in 2009, contained 20,223 lemmas, 26,990 lexical units and 17,695 synsets. [4] Version 4.0 was released in 2018. The most recent version is plWordNet 4.2.

Content

Figure: plWordNet and Princeton WordNet content statistics (data retrieved 2014-05-30).

Currently, plWordNet contains 195k lemmas, 295k lexical units and 228k synsets. [5] It has already outgrown Princeton WordNet in the number of lexical units. plWordNet consists of nouns (135k), verbs (21k), adjectives (29k) and adverbs (8k). [5] Each meaning of a given word is a separate lexical unit. Units that represent the same concept and do not differ significantly in stylistic register are combined into synsets (sets of synonyms). Each lexical unit is assigned to one of the domains (semantic categories), which indicates its general meaning. plWordNet domains correspond to the Princeton WordNet lexicographer files.
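
The organisation described above (lexical units grouped into synsets, each unit carrying a domain) can be sketched as a small data model. This is only an illustration, not plWordNet's actual storage format: the class names LexicalUnit and Synset, their fields, and the domain code given to the example units are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LexicalUnit:
    """One meaning of one word: a lemma plus a sense number."""
    lemma: str
    sense: int
    pos: str      # part of speech: "noun", "verb", "adjective" or "adverb"
    domain: str   # domain code, e.g. "zw" (animal)


@dataclass
class Synset:
    """Lexical units that express the same concept."""
    units: List[LexicalUnit] = field(default_factory=list)

    def lemmas(self) -> List[str]:
        # Lemmas of all units in this synset, in insertion order.
        return [unit.lemma for unit in self.units]


# The synset {kot 2; kot domowy 1}, 'cat, domestic cat'.
# The domain code "zw" is assumed here for illustration.
cat = Synset([
    LexicalUnit("kot", 2, "noun", "zw"),
    LexicalUnit("kot domowy", 1, "noun", "zw"),
])

print(cat.lemmas())
```

Because each meaning is a separate unit, a polysemous lemma such as kot would appear in several synsets, once per sense number.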

Semantic categories in plWordNet

Noun domains [6]
  • the highest in the hierarchy (bhp)
  • attribute (cech)
  • motive (cel)
  • time (czas)
  • body (czc)
  • emotion (czuj)
  • act (czy)
  • group (grp)
  • quantity (il)
  • food (jedz)
  • shape (ksz)
  • location (msc)
  • person (os)
  • communication (por)
  • possession (pos)
  • process (prc)
  • plant (rsl)
  • natural object (rz)
  • substance (sbst)
  • state (st)
  • classification (sys)
  • cognition (umy)
  • artefact (wytw)
  • event (zdarz)
  • natural phenomenon (zj)
  • animal (zw)

Verb domains [7]
  • emotion (cczuj)
  • consumption (cjedz)
  • communication (cpor)
  • possession (cpos)
  • state (cst)
  • cognition (cumy)
  • creation (cwytw)
  • contact (dtk)
  • body (hig)
  • weather (pog)
  • perception (pst)
  • motion (ruch)
  • social (sp)
  • competition (wal)
  • change (zmn)

Adjective domains [8]
  • deadjectival (grad)
  • quality (jak)
  • deverbal (odcz)
  • relation (rel)

Lexical unit description

Some lexical units are provided with information about their stylistic register, a short definition, usage examples and a link to the relevant Wikipedia article.

noun: miasto ('town, city')
domain: miejsce i umiejscowienie ('place and location')
definition: duży, gęsto zabudowany i zaludniony teren posiadający odrębną administrację; miejsce życia ludzi pracujących w przemyśle lub usługach ('big, densely built-up and populated area with a separate administration; living place of people working in industry or services')
example: W mieście człowiek ma większą szansę na zrobienie kariery i zarobienie pieniędzy, choć jednocześnie łatwiej tam niż na wsi popaść w ubóstwo. ('It is much easier to make a career in a city than in a village, but it is also much easier to fall into poverty.')

The most important elements defining word meanings are lexico-semantic and derivational relations, which hold between synsets and between lexical units. A synset groups lexical units that share the same set of relations. [9] Based on the relations assigned to synsets and units, natural language processing tools can infer the meaning of a lemma, which is important, for example, in word-sense disambiguation.

Selected noun relations [9]

Relation: synonymy
Test:
  • If he/she/it is X, then he/she/it is also Y
  • If he/she/it is Y, then he/she/it is also X
Example: {kot 2; kot domowy 1}, 'cat, domestic cat'

Relation: inter-register synonymy
Test:
  • X and Y share a hypernym, their sets of hyponyms do not overlap
  • X and Y are not synonyms
  • If he/she/it is X, then he/she/it is also Y [to the extent of the stylistic register difference]
  • If he/she/it is Y, then he/she/it is also X [to the extent of the stylistic register difference]
Example: {chłopiec 1}, {gówniarz 1}, 'boy, ~brat, squirt'

Relation: hypo-/hypernymy
Test:
  • If he/she/it is X, then he/she/it must be Y
  • If he/she/it is Y, then he/she/it is not necessarily X
  • If he/she/it is not Y, then he/she/it cannot be X
Example: {buk 1} is a kind of {drzewo liściaste 1}, 'beech' is a kind of 'deciduous tree'

Relation: mero-/holonymy
Test:
  • X is a part of Y
  • Y is not a part of X
  • Y is a whole of which X is a part
Example: {poduszka powietrzna 1} is a part of {samochód 1}, 'air bag' is a part of 'car'
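
The one-directional tests for hypo-/hypernymy and mero-/holonymy can be illustrated by treating each relation as a set of directed links and checking reachability. The sketch below uses only the two example pairs from the table; the dictionary layout and the function name reachable are assumptions made for illustration and do not reflect plWordNet's own tooling.

```python
# Directed links, written as "lemma sense-number".
# Keys point to their hypernyms (kinds) or holonyms (wholes).
HYPERNYMS = {"buk 1": {"drzewo liściaste 1"}}
HOLONYMS = {"poduszka powietrzna 1": {"samochód 1"}}


def reachable(graph, start, target):
    """Return True if target can be reached from start by following links."""
    stack = [start]
    seen = set()
    while stack:
        node = stack.pop()
        for parent in graph.get(node, ()):
            if parent == target:
                return True
            if parent not in seen:  # guard against cycles in the data
                seen.add(parent)
                stack.append(parent)
    return False


# 'beech' is a kind of 'deciduous tree', but not the other way round.
print(reachable(HYPERNYMS, "buk 1", "drzewo liściaste 1"))
print(reachable(HYPERNYMS, "drzewo liściaste 1", "buk 1"))

# 'air bag' is a part of 'car', but 'car' is not a part of 'air bag'.
print(reachable(HOLONYMS, "poduszka powietrzna 1", "samochód 1"))
print(reachable(HOLONYMS, "samochód 1", "poduszka powietrzna 1"))
```

Following links upward also covers indirect cases: if a further hypernym were added above drzewo liściaste 1, buk 1 would reach it through the intermediate synset.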

Polish synsets are connected to the corresponding Princeton WordNet synsets by a set of inter-lingual lexico-semantic relations (for instance synonymy, partial synonymy and hyponymy). 91,578 synsets have been mapped so far (about 2/3 of plWordNet synsets, mainly nouns). [10] The mapping enables the application of plWordNet in machine translation, e.g. in the online service offered by Google Translate. Mapping can also be instrumental in transferring textual analysis tools from English to Polish. [11]
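
A minimal sketch of how such a mapping might be queried is given below. Everything in it is hypothetical: the MAPPING entries, the relation assigned to each entry, the preference order in PRIORITY and the function name english_candidates are invented for illustration and are not taken from the actual plWordNet mapping data.

```python
# Hypothetical inter-lingual links: Polish synset -> [(relation, English synset)].
MAPPING = {
    "kot 2": [("synonymy", "cat")],
    "buk 1": [("synonymy", "beech")],
    "drzewo liściaste 1": [("hyponymy", "tree"), ("partial synonymy", "deciduous tree")],
}

# Assumed preference order: closer relations first.
PRIORITY = ["synonymy", "partial synonymy", "hyponymy"]


def english_candidates(polish_synset):
    """Return mapped English synsets, closest relation first."""
    links = MAPPING.get(polish_synset, [])
    ranked = sorted(links, key=lambda link: PRIORITY.index(link[0]))
    return [english for _relation, english in ranked]


print(english_candidates("kot 2"))
print(english_candidates("drzewo liściaste 1"))
print(english_candidates("miasto 1"))  # unmapped synset: empty list
```

Ranking by relation type is one possible design choice: a tool could fall back to a looser relation such as hyponymy only when no synonym link exists.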

Applications

plWordNet is available under an open-access license that allows free browsing. It has been made available to users as an online dictionary, a mobile application and web services.

References

  1. "Słowosieć".
  2. Maziarz M., Piasecki M., Szpakowicz S., Approaching plWordNet 2.0, http://nlp.pwr.wroc.pl/ltg/files/publications/paper%2042.pdf
  3. "PlWordNet 3.1".
  4. Piasecki M., Szpakowicz S., Broda B., A Wordnet from the Ground Up, Wrocław 2009, p. 170, http://www.plwordnet.pwr.wroc.pl/main/content/files/publications/A_Wordnet_from_the_Ground_Up.pdf
  5. Detailed comparative statistics of plWN and PWN can be found on the plWN webpage: http://plwordnet.pwr.wroc.pl/wordnet/stats [accessed 30.06.2014]
  6. Rabiega-Wiśniewska J., Maziarz M., Piasecki M., Szpakowicz S., Opis relacji leksykalno-semantycznych w Słowosieci 2.0. Rzeczownik, p. 4.
  7. Hojka B., Maziarz M., Piasecki M., Rabiega-Wiśniewska J., Szpakowicz S., Opis relacji leksykalno-semantycznych w Słowosieci 2.0. Czasownik, pp. 15-16.
  8. Maziarz M., Szpakowicz S., Piasecki M., Semantic Relations among Adjectives in Polish WordNet 2.0: A New Relation Set, Discussion and Evaluation, Cognitive Studies / Études Cognitives, vol. 12, pp. 149-179, 2012.
  9. Maziarz M., Piasecki M., Szpakowicz S., Rabiega-Wiśniewska J., Semantic Relations Among Nouns in Polish Wordnet Grounded in Lexicographic and Semantic Tradition, Cognitive Studies / Études Cognitives, vol. 11, pp. 161-181, 2011.
  10. http://plwordnet.pwr.wroc.pl/wordnet/stats [accessed 30.05.2014]
  11. Klimczak, Karol M. (2020). "Text Analysis in Finance: The challenges for efficient application". Innovation in Financial Services: Balancing Public and Private Interests. Routledge. pp. 199-216. doi:10.4324/9781003051664-15. ISBN 9781003051664.