Yarowsky algorithm

In computational linguistics, the Yarowsky algorithm is an unsupervised learning algorithm for word sense disambiguation that exploits the "one sense per collocation" and "one sense per discourse" properties of human languages: empirically, words tend to exhibit only one sense within a given discourse and within a given collocation.

Application

The algorithm starts with a large, untagged corpus, in which it identifies examples of the given polysemous word and stores all the relevant sentences as lines. For instance, Yarowsky uses the word "plant" in his 1995 paper to demonstrate the algorithm. If it is assumed that there are two possible senses of the word, the next step is to identify a small number of seed collocations representative of each sense, give each sense a label (e.g., sense A and sense B), then assign the appropriate label to all training examples containing the seed collocations. In this case, the words "life" and "manufacturing" are chosen as initial seed collocations for senses A and B respectively. The residual examples (85%–98% according to Yarowsky) remain untagged.
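
A rough sketch of this seeding step (representing each training example as a plain list of tokens is an assumption, carried through the later snippets):

```python
# Seed labeling: tag a context A or B if it contains a seed collocation;
# everything else stays in the untagged residual.
SEEDS = {"life": "A", "manufacturing": "B"}  # Yarowsky's seeds for "plant"

def seed_label(contexts):
    """contexts: list of token lists, each containing the target word."""
    labeled, residual = [], []
    for tokens in contexts:
        senses = {SEEDS[t] for t in tokens if t in SEEDS}
        if len(senses) == 1:          # exactly one seed sense matched
            labeled.append((tokens, senses.pop()))
        else:                         # no seed hit, or conflicting seeds
            residual.append(tokens)
    return labeled, residual
```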

Initially, the algorithm should choose representative seed collocations that will distinguish senses A and B accurately and productively. This can be done by selecting seed words from a dictionary's entry for each sense. Collocations tend to have a stronger effect if they are adjacent to the target word; the effect weakens with distance. According to the criteria given in Yarowsky (1993), seed words that appear in the most reliable collocational relationships with the target word are selected. The effect is much stronger for words in a predicate-argument relationship than for arbitrary associations at the same distance from the target word, and is much stronger for collocations with content words than with function words. That said, a collocation word can enter into several collocational relationships with the target word throughout the corpus, which can give the word different rankings or even different classifications. Alternatively, one can identify a single defining collocate for each class and use as seeds only those contexts containing one of these defining words. The publicly available database WordNet can serve as an automatic source for such defining terms. In addition, words that occur near the target word with great frequency can be selected as representative seed collocations. This last approach is not fully automatic: a human judge must decide which word is selected for each of the target word's senses, but the outputs are reliable indicators of the senses.
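
The frequency-based variant could be sketched as follows; the window size, stopword list, and candidate count are stand-ins, and a human judge still assigns each returned candidate to a sense:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that"}

def seed_candidates(contexts, target, window=5, k=20):
    """Propose frequent content words near the target as seed candidates."""
    counts = Counter()
    for tokens in contexts:
        for i, tok in enumerate(tokens):
            if tok == target:
                nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(t for t in nearby if t not in STOPWORDS)
    return counts.most_common(k)
```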

A decision list algorithm is then used to identify other reliable collocations. This training algorithm calculates the probability Pr(Sense | Collocation), and the decision list is ranked by the log-likelihood ratio:

log( Pr(Sense_A | Collocation_i) / Pr(Sense_B | Collocation_i) )

A smoothing algorithm is then used to avoid zero counts. The decision-list algorithm sidesteps the problems posed by a large set of non-independent evidence sources by using only the single most reliable piece of evidence, rather than the whole set of matching collocations.
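
A sketch of this training step, reusing the labeled examples from above; reducing collocations to single context words and the additive smoothing constant alpha are simplifications of Yarowsky's procedure:

```python
import math
from collections import defaultdict

def train_decision_list(labeled, alpha=0.1):
    """Rank context words by |log(Pr(A|c) / Pr(B|c))|, smoothed by alpha."""
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for tokens, sense in labeled:
        for c in set(tokens):
            counts[c][sense] += 1
    rules = []
    for c, n in counts.items():
        ll = math.log((n["A"] + alpha) / (n["B"] + alpha))
        rules.append((abs(ll), c, "A" if ll > 0 else "B"))
    rules.sort(reverse=True)          # most reliable evidence first
    return rules
```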

The resulting classifier is then applied to the whole sample set. Those examples in the residual that are tagged as A or B with probability above a reasonable threshold are added to the seed sets. The decision-list training and this augmentation step are applied iteratively: as newly learned collocations are added to the seed sets, the sense A and sense B sets grow and the original residual shrinks. Collocations stay in the seed sets only while their classification probability remains above the threshold; otherwise they are returned to the residual for later classification. At the end of each iteration, the "one sense per discourse" property can be used to weed out initially mistagged collocates and thus improve the purity of the seed sets.
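
The bootstrapping loop itself might look like the following, reusing the helpers above; the threshold on the absolute log-likelihood and the size-based convergence test are simplifications:

```python
def bootstrap(contexts, threshold=2.0, max_iters=10):
    """Grow the labeled sets from the seeds until the residual is stable."""
    labeled, residual = seed_label(contexts)
    for _ in range(max_iters):
        rules = train_decision_list(labeled)
        labeled, new_residual = [], []
        for tokens in contexts:
            # The single best matching rule decides; weak evidence means
            # the example returns to the residual for later classification.
            match = next(((ll, s) for ll, c, s in rules if c in tokens), None)
            if match and match[0] >= threshold:
                labeled.append((tokens, match[1]))
            else:
                new_residual.append(tokens)
        if len(new_residual) == len(residual):    # no change: converged
            break
        residual = new_residual
    return train_decision_list(labeled), labeled
```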

To prevent strong collocates from becoming indicators for the wrong class, the class-inclusion threshold needs to be randomly perturbed. For the same reason, after intermediate convergence the algorithm also needs to increase the width of the context window.

The algorithm continues to iterate until no more reliable collocations are found. The "one sense per discourse" property can be used here for error correction: for a target word with a binary sense partition, if the occurrences of the majority sense A exceed those of the minority sense B by a certain threshold, the minority occurrences are relabeled as A. According to Yarowsky, for any sense to be clearly dominant, the target word should occur at least 4 times in the discourse.
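
A sketch of this correction, assuming labeled examples are grouped by the discourse (document) they came from; min_count=4 follows Yarowsky's minimum, while the 75% dominance ratio is an illustrative choice:

```python
from collections import Counter

def one_sense_per_discourse(labeled_by_doc, min_count=4, ratio=0.75):
    """Relabel minority-sense occurrences when one sense clearly dominates."""
    corrected = {}
    for doc, pairs in labeled_by_doc.items():
        tally = Counter(sense for _, sense in pairs)
        majority, n = tally.most_common(1)[0]
        if n >= min_count and n / sum(tally.values()) >= ratio:
            pairs = [(tokens, majority) for tokens, _ in pairs]
        corrected[doc] = pairs
    return corrected
```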

When the algorithm converges on a stable residual set, the final decision list for the target word is obtained, with the most reliable collocations at the top of the list in place of the original seed words. The original untagged corpus is then tagged with sense labels and probabilities. The final decision list may now be applied to new data: the highest-ranked collocation present in the new context is used to classify it. For example, if the highest-ranking collocation of the target word in the new data set is of sense A, then the target word is classified as sense A.
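
Classification of new data then reduces to walking down the ranked list until a collocation matches:

```python
def classify(tokens, rules):
    """Apply the final decision list: the highest-ranked matching rule wins."""
    for _, collocation, sense in rules:   # rules are sorted, best first
        if collocation in tokens:
            return sense
    return None                           # no evidence for either sense

# e.g. classify("the plant closed its manufacturing line".split(), rules)
```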

References