Statistical semantics

In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

History

The term statistical semantics was first used by Warren Weaver in his well-known paper on machine translation. [1] He argued that word sense disambiguation for machine translation should be based on the co-occurrence frequency of the context words near a given target word. The underlying assumption that "a word is characterized by the company it keeps" was advocated by J.R. Firth. [2] This assumption is known in linguistics as the distributional hypothesis. [3] Emile Delavenay defined statistical semantics as the "statistical study of the meanings of words and their frequency and order of recurrence". [4] The paper by Furnas et al. (1983) is frequently cited as a foundational contribution to statistical semantics. [5] An early success in the field was latent semantic analysis.

Applications

Research in statistical semantics has resulted in a wide variety of algorithms that use the distributional hypothesis to discover many aspects of semantics, by applying statistical techniques to large corpora:

  - Measuring the similarity in word meanings [6] [7] [8] [9]
  - Measuring the similarity in word relations [10]
  - Modeling similarity-based generalization [11]
  - Discovering words with a given relation [12]
  - Classifying relations between words [13]
  - Extracting keywords from documents [14] [15]
  - Measuring the cohesiveness of text [16]
  - Discovering the different senses of words [17]
  - Distinguishing the different senses of words [18]
  - Subcognitive aspects of words [19]
  - Distinguishing praise from criticism [20]

Statistical semantics focuses on the meanings of common words and the relations between common words, unlike text mining, which tends to focus on whole documents, document collections, or named entities (names of people, places, and organizations). Statistical semantics is a subfield of computational semantics, which is in turn a subfield of computational linguistics and natural language processing.

Many of the applications of statistical semantics (listed above) can also be addressed by lexicon-based algorithms, instead of the corpus-based algorithms of statistical semantics. One advantage of corpus-based algorithms is that they are typically not as labour-intensive as lexicon-based algorithms. Another is that they are usually easier to adapt to new languages and to noisier text types, such as text from social media, than lexicon-based algorithms are. [21] However, the best performance on an application is often achieved by combining the two approaches. [22]

Related Research Articles

Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language, and is thus closely related to information retrieval, knowledge representation, and computational linguistics, a subfield of linguistics. Typically, data is collected in text corpora and processed with rule-based, statistical, or neural approaches from machine learning and deep learning.

Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
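
As a rough illustration of this procedure, the sketch below builds a toy term-document count matrix (the counts and the rank k are made up for illustration), reduces it with a truncated SVD, and compares documents by cosine similarity:

```python
# A minimal LSA sketch on an invented toy corpus.
import numpy as np

# Rows = terms, columns = documents (word counts per document).
X = np.array([
    [2, 1, 0],   # "dog"
    [1, 2, 0],   # "bone"
    [0, 0, 3],   # "cat"
    [0, 1, 2],   # "meow"
], dtype=float)

# Truncated SVD keeps the k largest singular values, reducing the
# term dimension while preserving similarity among document columns.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs = np.diag(s[:k]) @ Vt[:k]   # documents in the k-dimensional latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(docs[:, 0], docs[:, 1]))   # higher for similar documents
print(cosine(docs[:, 0], docs[:, 2]))   # near 0 for dissimilar documents
```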

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".
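
For the taxonomy-based ("is a") notion of similarity, one common tool is WordNet; the snippet below uses NLTK's WordNet interface (it assumes the wordnet corpus can be downloaded) to score the car/bus/road example:

```python
# Taxonomy-based similarity via WordNet; requires the wordnet corpus.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")
bus = wn.synset("bus.n.01")
road = wn.synset("road.n.01")

# "car" and "bus" are similar (both vehicles); "car" and "road" are
# related but not similar, so the taxonomy-based score is lower.
print(car.path_similarity(bus))    # relatively high
print(car.path_similarity(road))   # lower
```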

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.
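
A minimal sketch of the statistical idea behind language modeling, using bigram maximum-likelihood estimates over a made-up toy corpus:

```python
# Bigram language model with maximum-likelihood estimates from raw counts.
from collections import Counter

corpus = "the dog chased the cat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p(word, prev):
    """P(word | prev) estimated by count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p("cat", "the"))   # 2/3: "the" is followed by "cat" in 2 of its 3 occurrences
```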

Non-negative matrix factorization, also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically.
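
A small sketch of NMF using scikit-learn; the matrix V is a made-up non-negative example:

```python
# Approximate factorization V ~ W @ H with all entries non-negative.
import numpy as np
from sklearn.decomposition import NMF

V = np.array([[1.0, 1.0, 2.0, 0.0],
              [2.0, 1.0, 3.0, 1.0],
              [3.0, 1.2, 4.8, 2.0]])

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)     # basis matrix, all elements >= 0
H = model.components_          # coefficient matrix, all elements >= 0

print(np.round(W @ H, 2))      # approximate reconstruction of V
```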

A lexical chain is a sequence of semantically related words in a text, spanning a narrow or a wide context window. A lexical chain is independent of the grammatical structure of the text; in effect, it is a list of words that captures a portion of the cohesive structure of the text. A lexical chain can provide a context for the resolution of an ambiguous term and enable disambiguation of the concepts that the term represents.

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models such as RoBERTa, more difficult data domains can also be analyzed, e.g., news texts, where authors typically express their opinions and sentiments less explicitly.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
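
The hypothesis can be illustrated with a toy sketch: represent each word by its co-occurrence counts within a context window and compare words by cosine similarity (the corpus below is invented for illustration):

```python
# Word-by-word co-occurrence counts within a fixed context window.
import numpy as np

corpus = "the cat drinks milk the dog drinks water the cat chases the dog".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
window = 2

counts = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[index[w], index[corpus[j]]] += 1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "cat" and "dog" occur in similar contexts, so their vectors are close.
print(cosine(counts[index["cat"]], counts[index["dog"]]))
print(cosine(counts[index["cat"]], counts[index["milk"]]))
```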

In computational linguistics, word-sense induction (WSI) or discrimination is an open problem of natural language processing, which concerns the automatic identification of the senses of a word. Given that the output of word-sense induction is a set of senses for the target word, this task is closely related to word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to resolve the ambiguity of words in context.

In statistics and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
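
A short sketch of topic modeling with scikit-learn's latent Dirichlet allocation, on toy documents that echo the cats/dogs intuition above:

```python
# Fit a two-topic LDA model on four toy documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the dog chewed the bone",
    "the dog buried a bone",
    "the cat said meow",
    "the cat and the dog",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each document gets a mixture over the two topics.
print(lda.transform(X).round(2))

# Top words per topic: the "topics" are clusters of similar words.
words = vectorizer.get_feature_names_out()
for topic in lda.components_:
    print([words[i] for i in topic.argsort()[-3:]])
```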

In natural language processing, textual entailment (TE), also known as natural language inference (NLI), is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text.

In natural language processing, entity linking, also referred to as named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD), named-entity normalization (NEN), or concept recognition, is the task of assigning a unique identity to entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to first identify "Paris" and "France" as named entities, and then to determine that "Paris" refers to the city of Paris rather than to Paris Hilton or any other entity that could be referred to as "Paris", and that "France" refers to the country.

In natural language processing, a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov and colleagues at Google and published in 2013.
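
A minimal word2vec sketch using the Gensim library; the three toy sentences stand in for the large corpus a real model would need:

```python
# Train a small word2vec model and query the resulting vector space.
from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "barked", "at", "the", "mouse"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["dog"].shape)          # a 50-dimensional word vector
print(model.wv.most_similar("dog"))   # nearest neighbours in the vector space
```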

Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex.

Semantic spaces in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: vocabulary mismatch and the ambiguity of natural language.

Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Applications of semantic parsing include machine translation, question answering, ontology induction, automated reasoning, and code generation. The phrase was first used in the 1970s by Yorick Wilks as the basis for machine translation programs working with only semantic representations. Semantic parsing is one of the important tasks in computational linguistics and natural language processing.

In natural language processing, a sentence embedding refers to a numeric representation of a sentence in the form of a vector of real numbers which encodes meaningful semantic information.
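
One simple baseline, sketched below, is to average word vectors (mean pooling); trained sentence encoders generally work better. The snippet reuses the hypothetical Gensim model from the word2vec sketch above:

```python
# Sentence embedding by mean pooling of word vectors (a simple baseline).
import numpy as np

def sentence_embedding(tokens, wv):
    """Average the word vectors of the in-vocabulary tokens."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0)

emb = sentence_embedding(["the", "dog", "chased", "the", "cat"], model.wv)
print(emb.shape)   # same dimensionality as the word vectors
```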

DisCoCat is a mathematical framework for natural language processing which uses category theory to unify distributional semantics with the principle of compositionality. The grammatical derivations in a categorial grammar are interpreted as linear maps acting on the tensor product of word vectors to produce the meaning of a sentence or a piece of text. String diagrams are used to visualise information flow and reason about natural language semantics.
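
As a toy illustration of the idea (not the full categorical machinery), an adjective can be interpreted as a linear map acting on a noun vector; all numbers below are invented:

```python
# DisCoCat-style composition: adjective as a matrix applied to a noun vector.
import numpy as np

noun_dog = np.array([1.0, 0.2])        # hypothetical noun vector
adj_big = np.array([[2.0, 0.0],        # hypothetical adjective-as-matrix
                    [0.0, 0.5]])

# "big dog": the adjective's linear map applied to the noun vector.
print(adj_big @ noun_dog)
```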

References

  1. Weaver 1955
  2. Firth 1957
  3. Sahlgren 2008
  4. Delavenay 1960
  5. Furnas et al. 1983
  6. Lund, Burgess & Atchley 1995
  7. Landauer & Dumais 1997
  8. McDonald & Ramscar 2001
  9. Terra & Clarke 2003
  10. Turney 2006
  11. Yarlett 2008
  12. Hearst 1992
  13. Turney & Littman 2005
  14. Frank et al. 1999
  15. Turney 2000
  16. Turney 2003
  17. Pantel & Lin 2002
  18. Turney 2004
  19. Turney 2001
  20. Turney & Littman 2003
  21. Sahlgren & Karlgren 2009
  22. Turney et al. 2003

Sources