Distributional semantics

How words are related in a given language is demonstrated in the "semantic space", which mathematically corresponds to the vector space.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.


Distributional hypothesis

The distributional hypothesis in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to have similar meanings. [1]

The underlying idea that "a word is characterized by the company it keeps" was popularized by Firth in the 1950s. [2]

The distributional hypothesis is the basis for statistical semantics. Although the distributional hypothesis originated in linguistics, [3] it is now receiving attention in cognitive science, especially regarding the context of word use. [4]

In recent years, the distributional hypothesis has provided the basis for the theory of similarity-based generalization in language learning: the idea that children can figure out how to use words they have rarely encountered before by generalizing about their use from distributions of similar words. [5] [6]

The distributional hypothesis suggests that the more semantically similar two words are, the more distributionally similar they will be in turn, and thus the more that they will tend to occur in similar linguistic contexts.

Whether or not this suggestion holds has significant implications for both the data-sparsity problem in computational modeling, [7] and for the question of how children are able to learn language so rapidly given relatively impoverished input (this is also known as the problem of the poverty of the stimulus).

Distributional semantic modeling in vector spaces

Distributional semantics favors the use of linear algebra as a computational tool and representational framework. The basic approach is to collect distributional information in high-dimensional vectors and to define distributional/semantic similarity in terms of vector similarity. [8] Different kinds of similarities can be extracted depending on which type of distributional information is used to populate the vectors: topical similarities can be extracted by recording which text regions the linguistic items occur in; paradigmatic similarities can be extracted by recording which other linguistic items each item co-occurs with. Note that the latter type of vectors can also be used to extract syntagmatic similarities by looking at the individual vector components.
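This approach can be illustrated with a minimal sketch: build co-occurrence vectors over a toy corpus (the corpus, and the choice of a whole sentence as the context window, are invented for illustration) and compare them by cosine similarity.

```python
import math

corpus = [
    "tigers love rabbits",
    "cats love rabbits",
    "tigers chase rabbits",
    "cats chase mice",
]

# Collect co-occurrence counts within each sentence (window = whole sentence).
vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}
vectors = {w: [0] * len(vocab) for w in vocab}
for line in corpus:
    words = line.split()
    for w in words:
        for c in words:
            if c != w:
                vectors[w][index[c]] += 1

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

# "tigers" and "cats" share contexts (love, chase, rabbits), so they come
# out more similar to each other than to "rabbits".
print(cosine(vectors["tigers"], vectors["cats"]))
print(cosine(vectors["tigers"], vectors["rabbits"]))
```

Because "tigers" and "cats" occur with the same neighbouring words, their vectors point in similar directions even though the two words never co-occur with each other, which is exactly the distributional hypothesis at work.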

The basic idea of a correlation between distributional and semantic similarity can be operationalized in many different ways. There is a rich variety of computational models implementing distributional semantics, including latent semantic analysis (LSA), [9] [10] Hyperspace Analogue to Language (HAL), syntax- or dependency-based models, [11] random indexing, semantic folding [12] and variants of the topic model. [13]

Distributional semantic models differ primarily with respect to the following parameters:

  - Context type (text regions vs. linguistic items)
  - Context window (size, extension, and granularity of context)
  - Frequency weighting (e.g. entropy, pointwise mutual information, [14] etc.)
  - Dimension reduction (e.g. random indexing, singular value decomposition, etc.)
  - Similarity measure (e.g. cosine similarity, Minkowski distance, etc.)
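One widely used frequency-weighting scheme is pointwise mutual information (PMI), [14] which replaces raw co-occurrence counts with a measure of how much more often a word and a context co-occur than chance would predict. A minimal sketch of its positive variant (PPMI), over an invented word-by-context count matrix:

```python
import math

# Toy word-by-context count matrix (rows: words, columns: contexts).
words = ["tiger", "cat"]
contexts = ["stripes", "purr"]
counts = [
    [8, 0],   # tiger
    [1, 7],   # cat
]

total = sum(sum(row) for row in counts)
row_sums = [sum(row) for row in counts]
col_sums = [sum(counts[i][j] for i in range(len(words))) for j in range(len(contexts))]

def ppmi(i, j):
    """Positive pointwise mutual information of word i with context j."""
    if counts[i][j] == 0:
        return 0.0
    p_wc = counts[i][j] / total            # joint probability
    p_w = row_sums[i] / total              # word marginal
    p_c = col_sums[j] / total              # context marginal
    return max(0.0, math.log2(p_wc / (p_w * p_c)))

weighted = [[ppmi(i, j) for j in range(len(contexts))] for i in range(len(words))]
print(weighted)
```

The `max(0.0, ...)` clips negative PMI values, which are considered unreliable when estimated from sparse counts; this clipping is the standard PPMI convention rather than something specific to any one model.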

Distributional semantic models that use linguistic items as context have also been referred to as word space, or vector space models. [15] [16]

Beyond lexical semantics

Distributional semantics has typically been applied to lexical items (words and multi-word terms) with considerable success, not least because of its applicability as an input layer for neurally inspired deep learning models. However, lexical semantics, i.e. the meaning of words, carries only part of the semantics of an entire utterance. The meaning of a clause such as "Tigers love rabbits" can only partially be understood from the meanings of the three lexical items it consists of. Distributional semantics can be extended straightforwardly to cover larger linguistic items such as constructions, with and without non-instantiated items, but some of the base assumptions of the model need to be adjusted somewhat. Construction grammar, with its formulation of the lexical-syntactic continuum, offers one approach for including more elaborate constructions in a distributional semantic model, and some experiments have been implemented using the Random Indexing approach. [17]

Compositional distributional semantic models extend distributional semantic models by explicit semantic functions that use syntactically based rules to combine the semantics of participating lexical units into a compositional model to characterize the semantics of entire phrases or sentences. This work was originally proposed by Stephen Clark, Bob Coecke, and Mehrnoosh Sadrzadeh of Oxford University in their 2008 paper, "A Compositional Distributional Model of Meaning". [18] Different approaches to composition have been explored—including neural models—and are under discussion at established workshops such as SemEval. [19]
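Before applying syntax-driven semantic functions, two simple and widely used composition baselines are elementwise vector addition and elementwise multiplication. A minimal sketch, with word vectors invented purely for illustration:

```python
# Toy word vectors; the numbers are invented for illustration only.
tigers  = [0.9, 0.1, 0.4]
love    = [0.2, 0.8, 0.3]
rabbits = [0.7, 0.2, 0.6]

def add(u, v):
    """Additive composition: sum the vectors elementwise."""
    return [a + b for a, b in zip(u, v)]

def mult(u, v):
    """Multiplicative composition: multiply the vectors elementwise."""
    return [a * b for a, b in zip(u, v)]

# Phrase vector for "tigers love rabbits" under each baseline.
additive = add(add(tigers, love), rabbits)
multiplicative = mult(mult(tigers, love), rabbits)
print(additive)
print(multiplicative)
```

Both baselines ignore word order ("tigers love rabbits" and "rabbits love tigers" get identical vectors), which is precisely the shortcoming that syntactically informed compositional models aim to fix.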

Applications

Distributional semantic models have been applied successfully to the following tasks:

Software

See also

People

Related Research Articles

The following outline is provided as an overview of and topical guide to linguistics.

Semantics: Study of meaning in language

Semantics is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy, linguistics and computer science.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious and automatic, but it can come to conscious attention when ambiguity impairs the clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other language-processing tasks, such as discourse analysis, improving the relevance of search engines, anaphora resolution, coherence, and inference.
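A classic knowledge-based approach to WSD, the simplified Lesk algorithm, picks the sense whose dictionary gloss shares the most words with the surrounding context. A minimal sketch, using an invented two-sense inventory rather than a real dictionary:

```python
# Tiny invented sense inventory: sense label -> dictionary gloss.
senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water such as a river",
}

def lesk(context_sentence):
    """Return the sense whose gloss overlaps most with the context words."""
    context = set(context_sentence.lower().split())
    def overlap(sense):
        return len(context & set(senses[sense].split()))
    return max(senses, key=overlap)

print(lesk("she sat on the bank of the river and watched the water"))
print(lesk("he deposits money at the bank"))
```

Real implementations add stop-word removal, stemming, and gloss expansion; this sketch keeps only the core overlap idea.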

Lexical semantics, as a subfield of linguistic semantics, is the study of word meanings. It includes the study of how words structure their meaning, how they act in grammar and compositionality, and the relationships between the distinct senses and uses of a word.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large body of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by taking the cosine similarity between any two columns; values close to 1 represent very similar documents, while values close to 0 represent very dissimilar documents.
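The LSA pipeline just described can be sketched with NumPy: build a term-document count matrix, truncate its SVD, and compare documents by cosine similarity in the latent space. The toy counts below are invented for illustration.

```python
import numpy as np

# Term-document count matrix: rows are terms, columns are documents.
# Documents 0 and 1 share animal vocabulary; document 2 is about finance.
terms = ["tiger", "rabbit", "loan", "bank"]
X = np.array([
    [2, 1, 0],   # tiger
    [1, 2, 0],   # rabbit
    [0, 0, 3],   # loan
    [0, 1, 2],   # bank
], dtype=float)

# Truncated SVD: keep k latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The two animal documents end up closer to each other than to the
# finance document.
print(cosine(doc_vectors[0], doc_vectors[1]))
print(cosine(doc_vectors[0], doc_vectors[2]))
```

In practice the count matrix is weighted (e.g. with tf-idf or entropy weighting) before the decomposition; the raw counts here keep the sketch short.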

Conceptual semantics is a framework for semantic analysis developed mainly by Ray Jackendoff in 1976. Its aim is to provide a characterization of the conceptual elements by which a person understands words and sentences, and thus to provide an explanatory semantic representation. Explanatory in this sense refers to the ability of a given linguistic theory to describe how a component of language is acquired by a child.

Semantic similarity is a metric defined over a set of documents or terms in which the distance between items is based on the likeness of their meaning or semantic content, as opposed to lexicographical similarity. Such metrics are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts, or instances, through a numerical description obtained by comparing information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness: semantic relatedness includes any relation between two terms, while semantic similarity includes only "is a" relations. For example, "car" is similar to "bus", but it is also related to "road" and "driving".

Cognitive semantics is part of the cognitive linguistics movement. Semantics is the study of linguistic meaning. Cognitive semantics holds that language is part of a more general human cognitive ability, and can therefore only describe the world as people conceive of it. The implication is that different linguistic communities conceive of simple things and processes in the world differently, not necessarily that there is some difference between a person's conceptual world and the real world.

Frame semantics is a theory of linguistic meaning developed by Charles J. Fillmore that extends his earlier case grammar. It relates linguistic semantics to encyclopedic knowledge. The basic idea is that one cannot understand the meaning of a single word without access to all the essential knowledge that relates to that word. For example, one would not be able to understand the word "sell" without knowing anything about the situation of commercial transfer, which also involves, among other things, a seller, a buyer, goods, money, the relation between the money and the goods, the relations between the seller and the goods and the money, the relation between the buyer and the goods and the money and so on. Thus, a word activates, or evokes, a frame of semantic knowledge relating to the specific concept to which it refers.

In linguistics, a semantic field is a lexical set of words grouped semantically that refers to a specific subject. The term is also used in anthropology, computational semiotics, and technical exegesis.

A lexical chain is a sequence of semantically related words in a text, spanning a narrow or wide context window. It is independent of the grammatical structure of the text; in effect, it is a list of words that captures a portion of the cohesive structure of the text. A lexical chain can provide a context for the resolution of an ambiguous term and enable disambiguation of the concepts that the term represents.

In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

In computational linguistics, word-sense induction (WSI) or discrimination is an open problem of natural language processing, which concerns the automatic identification of the senses of a word. Given that the output of word-sense induction is a set of senses for the target word, this task is closely related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to resolve the ambiguity of words in context.

The mental lexicon is defined as a mental dictionary containing information about a language user's store of words, such as their meanings, pronunciations, and syntactic characteristics. The term is used in linguistics and psycholinguistics to refer to individual speakers' lexical (word) representations. However, there is some disagreement about the utility of the mental lexicon as a scientific construct.

The following outline is provided as an overview of and topical guide to natural-language processing.

Word embedding: Method in natural language processing

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.

Word2vec: Models used to produce word embeddings

Word2vec is a technique for natural language processing (NLP) published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that they capture the semantic and syntactic qualities of words; as such, a simple mathematical function can indicate the level of semantic similarity between the words represented by those vectors.
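In word2vec's skip-gram variant, each word is trained to predict its neighbours within a fixed window. The data-preparation step, extracting (center, context) training pairs, can be sketched as follows; this illustrates only pair extraction, not the neural training itself.

```python
def skipgram_pairs(tokens, window=2):
    """Extract (center, context) training pairs within a fixed window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "tigers love rabbits very much".split()
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

The window size controls the trade-off noted in the distributional literature: small windows emphasize syntactic/paradigmatic similarity, larger windows emphasize topical similarity.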

Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex.

Semantic spaces in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: Vocabulary mismatch and ambiguity of natural language.

DisCoCat is a mathematical framework for natural language processing which uses category theory to unify distributional semantics with the principle of compositionality. The grammatical derivations in a categorial grammar are interpreted as linear maps acting on the tensor product of word vectors to produce the meaning of a sentence or a piece of text. String diagrams are used to visualise information flow and reason about natural language semantics.
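A drastically simplified numerical illustration of the tensor-contraction idea: in DisCoCat proper, words are typed in a pregroup grammar and relational words are higher-order tensors, but the core mechanism can be glimpsed by representing a noun as a vector and an adjective as a matrix that acts on it. All numbers below are invented.

```python
import numpy as np

# Noun as a vector, adjective as a matrix (invented values).
tiger = np.array([0.9, 0.2])
RED = np.array([[0.5, 0.1],
                [0.2, 0.8]])

# The phrase meaning is obtained by tensor contraction, which for a
# matrix acting on a vector is just matrix-vector multiplication.
red_tiger = RED @ tiger
print(red_tiger)
```

A transitive verb such as "love" would analogously be an order-3 tensor contracted with both its subject and object vectors to yield a sentence vector.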

References

  1. Harris 1954
  2. Firth 1957
  3. Sahlgren 2008
  4. McDonald & Ramscar 2001
  5. Gleitman 2002
  6. Yarlett 2008
  7. Wishart, Ryder; Prokopidis, Prokopis (2017). Topic Modelling Experiments on Hellenistic Corpora (PDF). Proceedings of the Workshop on Corpora in the Digital Humanities 17. S2CID 9191936.
  8. Rieger 1991
  9. Deerwester et al. 1990
  10. Landauer, Thomas K.; Dumais, Susan T. (1997). "A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge". Psychological Review. 104 (2): 211–240. doi:10.1037/0033-295x.104.2.211.
  11. Padó & Lapata 2007
  12. De Sousa Webber, Francisco (2015). "Semantic Folding Theory And its Application in Semantic Fingerprinting". arXiv:1511.08855 [cs.AI].
  13. Jordan, Michael I.; Ng, Andrew Y.; Blei, David M. (2003). "Latent Dirichlet Allocation". Journal of Machine Learning Research. 3 (Jan): 993–1022.
  14. Church, Kenneth Ward; Hanks, Patrick (1989). "Word association norms, mutual information, and lexicography". Proceedings of the 27th Annual Meeting on Association for Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics: 76–83. doi:10.3115/981623.981633.
  15. Schütze 1993
  16. Sahlgren 2006
  17. Karlgren, Jussi; Kanerva, Pentti (July 2019). "High-dimensional distributed semantic spaces for utterances". Natural Language Engineering. 25 (4): 503–517. arXiv:2104.00424. doi:10.1017/S1351324919000226. S2CID 201141249.
  18. Clark, Stephen; Coecke, Bob; Sadrzadeh, Mehrnoosh (2008). "A compositional distributional model of meaning" (PDF). Proceedings of the Second Quantum Interaction Symposium: 133–140.
  19. "SemEval-2014, Task 1".

Sources