Automatic acquisition of sense-tagged corpora

Last updated January 22, 2024

The knowledge acquisition bottleneck is perhaps the major impediment to solving the word-sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requisite that can so far^[update] be met only for a handful of words for testing purposes, as it is done in the Senseval exercises.

Existing methods

Therefore, one of the most promising trends in WSD research is using the largest corpus ever accessible, the World Wide Web, to acquire lexical information automatically.^[1] WSD has been traditionally understood as an intermediate language engineering technology which could improve applications such as information retrieval (IR). In this case, however, the reverse is also true: Web search engines implement simple and robust IR techniques that can be successfully used when mining the Web for information to be employed in WSD. The most direct way of using the Web (and other corpora) to enhance WSD performance is the automatic acquisition of sense-tagged corpora, the fundamental resource to feed supervised WSD algorithms. Although this is far from being commonplace in the WSD literature, a number of different and effective strategies to achieve this goal have already been proposed. Some of these strategies are:

acquisition by direct Web searching (searches for monosemous synonyms, hypernyms, hyponyms, parsed gloss' words, etc.),
Yarowsky algorithm (bootstrapping),
acquisition via Web directories, and
acquisition via cross-language meaning evidences.

Summary

Optimistic results

The automatic extraction of examples to train supervised learning algorithms reviewed has been, by far, the best explored approach to mine the web for word-sense disambiguation. Some results are certainly encouraging:

In some experiments, the quality of the Web data for WSD equals that of human-tagged examples. This is the case of the monosemous relatives plus bootstrapping with Semcor seeds technique^[2] and the examples taken from the ODP Web directories.^[3] In the first case, however, Semcor-size example seeds are necessary (and only available for English), and it has only been tested with a very limited set of nouns; in the second case, the coverage is quite limited, and it is not yet clear whether it can be grown without compromising the quality of the examples retrieved.
It has been shown^[4] that a mainstream supervised learning technique trained exclusively with web data can obtain better results than all unsupervised WSD systems which participated at Senseval-2.
Web examples made a significant contribution to the best Senseval-2 English all-words system.^[5]

Difficulties

There are, however, several open research issues related to the use of Web examples in WSD:

High precision in the retrieved examples (i.e., correct sense assignments for the examples) does not necessarily lead to good supervised WSD results (i.e., the examples are possibly not useful for training).^[6]
The most complete evaluation of Web examples for supervised WSD^[4] indicates that learning with Web data improves over unsupervised techniques, but the results are nevertheless far from those obtained with hand-tagged data, and do not even beat the most-frequent-sense baseline.
Results are not always reproducible; the same or similar techniques may lead to different results in different experiments. Compare, for instance, Mihalcea (2002^[7]) with Agirre and Martínez (2004^[4]), or Agirre and Martínez (2000^[6]) with Mihalcea and Moldovan (1999^[8]). Results with Web data seem to be very sensitive to small differences in the learning algorithm, to when the corpus was extracted (search engines change continuously), and on small heuristic issues (e.g., differences in filters to discard part of the retrieved examples).
Results are strongly dependent on bias (i.e., on the relative frequencies of examples per word sense).^[4] It is unclear whether this is simply a problem of Web data, or an intrinsic problem of supervised learning techniques, or just a problem of how WSD systems are evaluated (indeed, testing with rather small Senseval data may overemphasize sense distributions compared to sense distributions obtained from the full Web as corpus).
In any case, Web data has an intrinsic bias, because queries to search engines directly constrain the context of the examples retrieved. There are approaches that alleviate this problem, such as using several different seeds/queries per sense^[7] or assigning senses to Web directories and then scanning directories for examples;^[9] but this problem is nevertheless far from being solved.
Once a Web corpus of examples is built, it is not entirely clear whether its distribution is safe from a legal perspective.

Future

Besides automatic acquisition of examples from the Web, there are some other WSD experiments that have profited from the Web:

The Web as a social network has been successfully used for cooperative annotation of a corpus (OMWE, Open Mind Word Expert project),^[10] which has already been used in three Senseval-3 tasks (English, Romanian and Multilingual).
The Web has been used to enrich WordNet senses with domain information: topic signatures^[11] and Web directories,^[9] which have in turn been successfully used for WSD.
Also, some research benefited from the semantic information that the Wikipedia maintains on its disambiguation pages.^[12]^[13]

It is clear,^{[ according to whom? ]} however, that most research opportunities remain largely unexplored. For instance, little is known about how to use lexical information extracted from the Web in knowledge-based WSD systems; and it is also hard to find systems that use Web-mined parallel corpora for WSD, even though there are already efficient algorithms that use parallel corpora in WSD.

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. The large collections of text allow linguistics to run quantitative analyses on linguistic concepts, otherwise harder to quantify.

In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic but can often come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, and inference.

Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

In corpus linguistics, part-of-speech tagging, also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.

In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

Terminology extraction is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus.

Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.

The Lesk algorithm is a classical algorithm for word sense disambiguation introduced by Michael E. Lesk in 1986. It operates on the premise that words within a given context are likely to share a common meaning. This algorithm compares the dictionary definitions of an ambiguous word with the words in its surrounding context to determine the most appropriate sense. Variations, such as the Simplified Lesk algorithm, have demonstrated improved precision and efficiency. However, the Lesk algorithm has faced criticism for its sensitivity to definition wording and its reliance on brief glosses. Researchers have sought to enhance its accuracy by incorporating additional resources like thesauruses and syntactic models.

In computational linguistics the Yarowsky algorithm is an unsupervised learning algorithm for word sense disambiguation that uses the "one sense per collocation" and the "one sense per discourse" properties of human languages for word sense disambiguation. From observation, words tend to exhibit only one sense in most given discourse and in a given collocation.

In computational linguistics, word-sense induction (WSI) or discrimination is an open problem of natural language processing, which concerns the automatic identification of the senses of a word. Given that the output of word-sense induction is a set of senses for the target word, this task is strictly related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to solve the ambiguity of words in context.

SemEval is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive.

Classic monolingual Word Sense Disambiguation evaluation tasks uses WordNet as its sense inventory and is largely based on supervised / semi-supervised classification with the manually sense annotated corpora:

The following outline is provided as an overview of and topical guide to natural-language processing:

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages.

Rada Mihalcea is a professor of computer science and engineering at the University of Michigan. Her research focuses on natural language processing, multimodal processing, and computational social science.

Mona Talat Diab is a computer science professor and director of Carnegie Mellon University's Language Technologies Institute. Previously, she was a professor at George Washington University and a research scientist with Facebook AI. Her research focuses on natural language processing, computational linguistics, cross lingual/multilingual processing, computational socio-pragmatics, Arabic language processing, and applied machine learning.

References

↑ Kilgarriff, A.; G. Grefenstette. 2003. Introduction to the special issue on the Web as corpus. Computational Linguistics 29(3)
↑ Mihalcea, Rada. 2002. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.
↑ Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.
1 2 3 4 Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.
↑ Mihalcea, Rada. 2002a. Word sense disambiguation with pattern learning and automatic feature selection. Natural Language Engineering, 8(4): 348–358.
1 2 Agirre, Eneko & David Martínez. 2000. Exploring automatic word sense disambiguation with decision lists and the Web. Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Annotation, Luxembourg, 11–19.
1 2 Mihalcea, Rada. 2002b. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.
↑ Mihalcea, Rada & Dan Moldovan. 1999. An automatic method for generating sense tagged corpora. Proceedings of the American Association for Artificial Intelligence (AAAI), Orlando, U.S.A., 461–466.
1 2 Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.
↑ Chklovski, Tim & Rada Mihalcea. 2002. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of the ACL SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, U.S.A., 116–122.
↑ Agirre, Eneko, Olatz Ansa, Eduard H. Hovy & David Martínez. 2000. Enriching very large ontologies using the WWW. Proceedings of the Ontology Learning Workshop, European Conference on Artificial Intelligence (ECAI), Berlin, Germany.
↑ Denis Turdakov, Pavel Velikhov. Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation // SYRCoDIS.— 2008.
↑ Турдаков Денис. Устранение лексической многозначности терминов Википедии на основе скрытой модели Маркова // XI Всероссийская научная конференция «Электронные библиотеки: перспективные методы и технологии, электронные коллекции».— 2009. pdf (russian)

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Kilgarriff, A.; G. Grefenstette. 2003. Introduction to the special issue on the Web as corpus. Computational Linguistics 29(3)

[2] Mihalcea, Rada. 2002. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.

[3] Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.

[:1-4] 1 2 3 4 Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.

[5] Mihalcea, Rada. 2002a. Word sense disambiguation with pattern learning and automatic feature selection. Natural Language Engineering, 8(4): 348–358.

[:0-6] 1 2 Agirre, Eneko & David Martínez. 2000. Exploring automatic word sense disambiguation with decision lists and the Web. Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Annotation, Luxembourg, 11–19.

[:2-7] 1 2 Mihalcea, Rada. 2002b. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.

[8] Mihalcea, Rada & Dan Moldovan. 1999. An automatic method for generating sense tagged corpora. Proceedings of the American Association for Artificial Intelligence (AAAI), Orlando, U.S.A., 461–466.

[Santamaría,_Celina_2003-9] 1 2 Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.

[10] Chklovski, Tim & Rada Mihalcea. 2002. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of the ACL SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, U.S.A., 116–122.

[11] Agirre, Eneko, Olatz Ansa, Eduard H. Hovy & David Martínez. 2000. Enriching very large ontologies using the WWW. Proceedings of the Ontology Learning Workshop, European Conference on Artificial Intelligence (ECAI), Berlin, Germany.

[12] Denis Turdakov, Pavel Velikhov. Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation // SYRCoDIS.— 2008.

[13] Турдаков Денис. Устранение лексической многозначности терминов Википедии на основе скрытой модели Маркова // XI Всероссийская научная конференция «Электронные библиотеки: перспективные методы и технологии, электронные коллекции».— 2009. pdf (russian)

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]