Semantic lexicon

A visual representation of a semantic lexicon (a hierarchical model of the mental lexicon)

A semantic lexicon is a digital dictionary of words labeled with semantic classes so that associations can be drawn between words that have not previously been encountered. [1] Semantic lexicons are built upon semantic networks, which represent the semantic relations between words. The difference between a semantic lexicon and a semantic network is that a semantic lexicon attaches a definition, or "gloss", to each word. [2]

Structure

Semantic lexicons are made up of lexical entries. These entries are not orthographic but semantic, eliminating issues of homonymy and polysemy. The entries are interconnected by semantic relations such as hyperonymy, hyponymy, meronymy, and troponymy. Synonymous entries are grouped together in what the Princeton WordNet calls "synsets". [2] Most semantic lexicons are made up of four different "sub-nets": nouns, verbs, adjectives, and adverbs, [2] though some researchers have taken steps to add an "artificial node" interconnecting the sub-nets. [3]
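
The synset-and-gloss structure can be inspected directly in the Princeton WordNet data. The following is a minimal sketch, assuming Python with the NLTK interface to WordNet; the choice of tooling and of the example word "bank" is illustrative, not prescribed by the sources cited here.

    # Minimal sketch using NLTK's WordNet interface.
    # Requires: pip install nltk, then nltk.download('wordnet').
    from nltk.corpus import wordnet as wn

    # One orthographic word maps to several semantic entries (synsets),
    # each with its own gloss and its own sub-net (part of speech).
    for synset in wn.synsets("bank"):
        print(synset.name(),        # e.g. bank.n.01: the semantic entry
              synset.pos(),         # n, v, a/s, or r: the four sub-nets
              synset.definition())  # the gloss attached to the entry
        print("  synonyms:", synset.lemma_names())  # members of the synset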

Nouns

Nouns are ordered into a taxonomy: a hierarchy in which the broadest and most encompassing noun, such as "thing", sits at the top, with nouns becoming more and more specific the further they are from the top. The topmost noun in a semantic lexicon is called a unique beginner. [4] The most specific nouns (those that do not have any subordinates) are terminal nodes. [3]
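
As a rough illustration of the noun hierarchy, the sketch below walks from a specific entry up through its superordinates; in the Princeton WordNet data the unique beginner reached this way is entity.n.01. The example word "dog" is an assumption for illustration, not drawn from the cited sources.

    # Sketch: climb the noun taxonomy from a specific synset to the unique beginner.
    from nltk.corpus import wordnet as wn

    node = wn.synset("dog.n.01")
    while True:
        print(node.name())
        hypernyms = node.hypernyms()
        if not hypernyms:        # no superordinate left: the unique beginner
            break
        node = hypernyms[0]      # follow one superordinate path upward

    # A terminal node is one with no hyponyms (no subordinates).
    print("terminal node?", not wn.synset("dog.n.01").hyponyms())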

Semantic lexicons also distinguish between types, where something shares the defining characteristics of a broader category, such as a Rhodesian Ridgeback being a type of dog, and instances, where something is a single example of a category, such as Dave Grohl being an instance of a musician. Instances are always terminal nodes because they are solitary and do not have other words or ontological categories belonging to them. [2]
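
In WordNet-style lexicons the type/instance distinction is encoded as a separate relation, so it can be queried directly. A small sketch, using "London" as a stand-in instance (my example, and assuming the NLTK WordNet data):

    # Sketch: instances are attached through instance_hypernyms and, being
    # terminal nodes, should have no hyponyms of their own.
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("London"):
        print(synset.name(),
              "instance of:", [s.name() for s in synset.instance_hypernyms()],
              "hyponyms:", synset.hyponyms())  # expected to be empty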

Semantic lexicons also address meronymy, [5] which is a "part-to-whole" relationship, such as keys being part of a laptop. The necessary attributes that define a specific entry are also necessarily present in that entry's hyponym. So, if a computer has keys, and a laptop is a type of computer, then a laptop must have keys. However, there are many cases where this inheritance becomes vague. A good example is the item chair. Most would define a chair as having legs and a seat (the part one sits on). However, some artistic or modern chairs do not have legs at all, and beanbags do not have legs either, yet few would argue that they are not chairs. Questions like this drive much of the research and work in the fields of taxonomy and ontology.
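
Part-to-whole relations can likewise be queried in both directions. A sketch using "car" and "wheel", chosen because their part structure is densely encoded in the Princeton data (the laptop/keys example above may be sparser):

    # Sketch: meronymy (part-of) and its inverse, holonymy (has-part).
    from nltk.corpus import wordnet as wn

    car = wn.synset("car.n.01")
    print("parts of a car:", [s.name() for s in car.part_meronyms()])

    wheel = wn.synset("wheel.n.01")
    print("a wheel is part of:", [s.name() for s in wheel.part_holonyms()])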

Verbs

Verb synsets are arranged much like their noun counterparts: the more general and encompassing verbs sit near the top of the hierarchy, while troponyms (verbs that describe a more specific way of doing something) are grouped beneath them. Verb specificity moves along a vector, with the verbs becoming more and more specific with respect to a certain quality. [2] For example, the set "walk / run / sprint" becomes more specific in terms of speed, and "dislike / hate / abhor" becomes more specific in terms of the intensity of the emotion.
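
In the NLTK interface, troponyms of a verb synset are returned by the generic hyponyms() call; a brief sketch mirroring the walk/run example (the specific sense keys are assumptions about the installed data):

    # Sketch: verb troponyms (more specific manners of doing something).
    from nltk.corpus import wordnet as wn

    walk = wn.synset("walk.v.01")
    print("troponyms of walk:", [s.name() for s in walk.hyponyms()][:10])
    print("superordinate of walk:", [s.name() for s in walk.hypernyms()])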

The ontological groupings and separations of verbs are far more debatable than those of nouns. It is widely accepted that a dog is a type of animal and that a stool is a type of chair, but it can be argued that abhor is on the same emotional plane as hate (that they are synonyms rather than super/subordinates). It can also be argued that love and adore are synonyms, or that one is more specific than the other. Thus, the relations between verbs are not as widely agreed upon as those between nouns.

Another attribute of verb synset relations is that verbs are also ordered into pairs in which one verb necessarily entails the other, in the way that massacre entails kill and know entails believe. [2] These verb pairs can be troponyms and their superordinates, as in the first example, or they can belong to completely different ontological categories, as in the second example.
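
Entailment pairs are stored as their own relation and can be listed directly; a sketch with two stock examples (snore/sleep is my substitution for the massacre/kill pair in the text):

    # Sketch: verb entailment pairs.
    from nltk.corpus import wordnet as wn

    print(wn.synset("snore.v.01").entailments())  # expected: [Synset('sleep.v.01')]
    print(wn.synset("eat.v.01").entailments())    # expected: chewing and swallowing senses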

Adjectives

Adjective synset relations are very similar to verb synset relations. They are not quite as neatly hierarchical as the noun synset relations, and they have fewer tiers and more terminal nodes. However, there are generally fewer terminal nodes per ontological category in adjective synset relations than in verb synset relations. Adjectives in semantic lexicons are organized in word pairs as well, with the difference that their word pairs are antonyms instead of entailments. More generic polar adjectives, such as hot and cold or happy and sad, are paired. Other adjectives that are semantically similar are then linked to each of these words: hot is linked to warm, heated, sizzling, and sweltering, while cold is linked to cool, chilly, freezing, and nippy. These semantically similar adjectives are considered indirect antonyms [2] of the opposite polar adjective (e.g. nippy is an indirect antonym of hot). Adjectives that are derived from a verb or a noun are also directly linked to that verb or noun across sub-nets. For example, enjoyable is linked to the semantically similar adjectives agreeable and pleasant, as well as to its origin verb, enjoy.
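
The adjective machinery described above (direct antonym pairs, clusters of semantically similar adjectives, and cross-sub-net links to the source word) is exposed through three separate calls in NLTK; a minimal sketch, with sense keys that are assumptions about the installed data:

    # Sketch: adjective organization around polar antonym pairs.
    from nltk.corpus import wordnet as wn

    hot = wn.synset("hot.a.01")
    # Adjectives clustered around the polar adjective (indirect antonyms of "cold").
    print("similar to hot:", [s.name() for s in hot.similar_tos()])
    # Direct antonymy is recorded on the lemma, not the synset.
    print("antonym of hot:", wn.lemma("hot.a.01.hot").antonyms())

    # Cross-sub-net link from a derived adjective back toward its source word.
    for lemma in wn.lemmas("enjoyable", pos="a"):
        print(lemma.name(), "derived from:", lemma.derivationally_related_forms())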

Adverbs

There are very few adverbs accounted for in semantic lexicons. This is because most adverbs are taken directly from their adjective counterparts, in both meaning and form, and changed only morphologically (e.g. happily is derived from happy, and luckily is derived from lucky, which in turn is derived from luck). The only adverbs that are specifically accounted for are those without such connections, such as really, mostly, and hardly. [2]
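
NLTK exposes the adverb-to-adjective link on the lemma as pertainyms(); a short sketch, looked up dynamically rather than hard-coding a sense number:

    # Sketch: adverbs linked back to the adjective they are derived from.
    from nltk.corpus import wordnet as wn

    for lemma in wn.lemmas("happily", pos="r"):
        print(lemma.name(), "derived from:", lemma.pertainyms())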

Challenges facing semantic lexicons

The effects of the Princeton WordNet project extend far beyond English, though most research in the field revolves around the English language. Creating semantic lexicons for other languages has proved very useful for natural language processing applications. One of the main focuses of research in semantic lexicons is linking lexicons of different languages to aid in machine translation. The most common approach is to create a shared ontology that serves as a "middleman" of sorts between the semantic lexicons of two different languages. [6] This is an extremely challenging and as yet unsolved issue in the machine translation field. One issue arises from the fact that no two languages are word-for-word translations of each other; every language has some structural or syntactic difference from every other. In addition, languages often have words that do not translate easily into other languages, and certainly not with an exact word-for-word match. Proposals have been made to create a set framework for wordnets, and research has shown that every known human language has some concept resembling synonymy, hyponymy, meronymy, and antonymy. However, every framework proposed so far has been criticized for using a pattern that works well for English but less well for other languages. [6]
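
One concrete realization of the "middleman" idea is the Open Multilingual Wordnet, which links wordnets in many languages through shared Princeton synset identifiers. A sketch, assuming the OMW data shipped with NLTK has been downloaded (nltk.download('omw-1.4')); coverage varies widely by language:

    # Sketch: one synset, lemmas in several languages via shared identifiers.
    from nltk.corpus import wordnet as wn

    dog = wn.synset("dog.n.01")
    for lang in ("jpn", "fra", "spa"):
        print(lang, dog.lemma_names(lang=lang))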

Another obstacle in the field is that no solid guidelines exist for the framework and contents of a semantic lexicon. Each lexicon project in each language has taken a slightly (or not so slightly) different approach to its wordnet. There is not even an agreed-upon definition of what a "word" is. Orthographically, a word can be defined as a string of letters with spaces on either side, but semantically the question is much debated. For example, it is not difficult to define dog or rod as words, but what about guard dog or lightning rod? The latter two would be considered orthographically separate words, though semantically each makes up a single concept: one is a type of dog and one is a type of rod. In addition to these confusions, wordnets are also idiosyncratic, in that they do not label items consistently. They are redundant, in that they often have several words assigned to each meaning (synsets). They are also open-ended, in that they often focus on and extend into terminology and domain-specific vocabulary. [6]

References

  1. Theng, Yin-Leng (2009). Handbook of Research on Digital Libraries: Design, Development, and Impact. Information Science Reference. ISBN 9781599048796.
  2. "About WordNet".
  3. Lemnitzer, L. "Enriching GermaNet: a case study of lexical acquisition". Seminar für Sprachwissenschaft, Universität Tübingen.
  4. Boyd-Graber, J. (2006). "Adding Dense, Weighted Connections to WordNet". Proceedings of the Third International Wordnet Conference.
  5. Hinrichs, E. (December 2012). "Using part-whole relations for automatic deduction of compound-internal relations in GermaNet". International Journal on Semantic Web and Information Systems. 3.
  6. Fellbaum, C. (May 2012). "Challenges for a Multilingual Wordnet". Language Resources and Evaluation. 46 (2): 313–326. doi:10.1007/s10579-012-9186-z. S2CID 254379442.