Statistical machine translation


Statistical machine translation (SMT) was a machine translation approach that superseded the earlier rule-based approach, which required explicit description of each and every linguistic rule, was costly to develop, and often did not generalize to other languages. Since the mid-2010s, the statistical approach has itself been gradually superseded by the deep learning-based neural network approach.

History

The first ideas of statistical machine translation were introduced by Warren Weaver in 1949, [1] including the idea of applying Claude Shannon's information theory. Statistical machine translation was re-introduced in the late 1980s and early 1990s by researchers at IBM's Thomas J. Watson Research Center. [2] [3] [4]

Basis

The idea behind statistical machine translation comes from information theory. A document is translated according to the probability distribution \(P(e \mid f)\) that a string \(e\) in the target language (for example, English) is the translation of a string \(f\) in the source language (for example, French).

The problem of modeling the probability distribution has been approached in a number of ways. One approach which lends itself well to computer implementation is to apply Bayes' theorem, that is \(P(e \mid f) \propto P(f \mid e)\, P(e)\), where the translation model \(P(f \mid e)\) is the probability that the source string is the translation of the target string, and the language model \(P(e)\) is the probability of seeing that target language string. This decomposition is attractive as it splits the problem into two subproblems. Finding the best translation \(\tilde{e}\) is done by picking the one that gives the highest probability:

\[ \tilde{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e). \]

For a rigorous implementation of this one would have to perform an exhaustive search by going through all strings in the native language. Performing the search efficiently is the work of a machine translation decoder that uses the foreign string, heuristics and other methods to limit the search space while keeping acceptable quality. This trade-off between quality and time usage can also be found in speech recognition.
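As a concrete illustration of this decomposition, the following is a minimal sketch (with made-up probability tables rather than models estimated from data) of scoring a handful of candidate target strings by \(P(f \mid e)\, P(e)\) and keeping the argmax; a real decoder searches a vastly larger space with pruning heuristics.

```python
# Toy noisy-channel reranking: pick the target string e maximizing P(f|e) * P(e).
# The probability tables below are illustrative values, not real estimates.

candidates = ["the house is small", "the home is small", "small is the house"]

# Hypothetical translation-model scores P(f | e) for a fixed source sentence f.
translation_model = {
    "the house is small": 0.20,
    "the home is small": 0.25,
    "small is the house": 0.20,
}

# Hypothetical language-model scores P(e): how fluent each target string is.
language_model = {
    "the house is small": 0.015,
    "the home is small": 0.003,
    "small is the house": 0.001,
}

def score(e: str) -> float:
    """Noisy-channel score P(f|e) * P(e) for one candidate."""
    return translation_model[e] * language_model[e]

best = max(candidates, key=score)
print(best, score(best))   # -> "the house is small"
```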

As the translation systems were not able to store all native strings and their translations, a document was typically translated sentence by sentence, but even this was not enough. Language models were typically approximated by smoothed n-gram models, and similar approaches were applied to translation models, but there was additional complexity due to the different sentence lengths and word orders of the languages.
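To make the n-gram approximation concrete, here is a minimal bigram language model with add-one (Laplace) smoothing over a toy three-sentence corpus; this is only a sketch, as production systems were trained on millions of sentences and used stronger smoothing methods such as Kneser-Ney.

```python
import math
from collections import Counter

# A toy corpus; real SMT language models were trained on millions of sentences.
corpus = [
    "<s> the house is small </s>",
    "<s> the house is big </s>",
    "<s> the home is small </s>",
]

unigrams = Counter()
bigrams = Counter()
for line in corpus:
    tokens = line.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_logprob(sentence: str) -> float:
    """Log-probability of a sentence under the bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(bigram_prob(p, w)) for p, w in zip(tokens, tokens[1:]))

print(sentence_logprob("the house is small"))   # higher (less negative) ...
print(sentence_logprob("small is the house"))   # ... than this reordered string
```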

The statistical translation models were initially word-based (Models 1-5 from IBM, the Hidden Markov model from Stephan Vogel [5] and Model 6 from Franz Josef Och [6]), but significant advances were made with the introduction of phrase-based models. [7] Later work incorporated syntax or quasi-syntactic structures. [8]

Benefits

The most frequently cited benefits of statistical machine translation over the rule-based approach were:

Shortcomings

Phrase-based translation

In phrase-based translation, the aim was to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. The sequences of words were called blocks or phrases; however, they were typically not linguistic phrases, but phrasemes found using statistical methods from corpora. It has been shown that restricting the phrases to linguistic phrases (syntactically motivated groups of words, see syntactic categories) decreased the quality of translation. [10]

The chosen phrases were then mapped one-to-one based on a phrase translation table, and could be reordered. This table could be learnt based on word alignment, or directly from a parallel corpus. The latter model was trained using the expectation-maximization algorithm, similarly to the word-based IBM models. [11]
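The following is a minimal sketch of how such a phrase table could be applied, assuming a hand-written table with illustrative probabilities and a purely monotone, greedy left-to-right segmentation; real systems learned the table from aligned data and combined it with a language model and a reordering model inside a beam-search decoder.

```python
# A toy phrase table mapping source phrases to (target phrase, probability) pairs.
# The entries and probabilities are illustrative, not learned from data.
phrase_table = {
    ("das", "haus"): [("the house", 0.8), ("the home", 0.2)],
    ("ist",): [("is", 0.9)],
    ("klein",): [("small", 0.7), ("little", 0.3)],
    ("das",): [("the", 0.6), ("that", 0.3)],
    ("haus",): [("house", 0.8)],
}

def decode_monotone(source_tokens):
    """Greedy monotone phrase-based translation: at each position take the
    longest matching source phrase and its most probable target phrase."""
    output, i = [], 0
    while i < len(source_tokens):
        # Try the longest phrase first, falling back to shorter ones.
        for length in range(len(source_tokens) - i, 0, -1):
            phrase = tuple(source_tokens[i:i + length])
            if phrase in phrase_table:
                target, _prob = max(phrase_table[phrase], key=lambda t: t[1])
                output.append(target)
                i += length
                break
        else:
            # Unknown word: pass it through untranslated (a common OOV fallback).
            output.append(source_tokens[i])
            i += 1
    return " ".join(output)

print(decode_monotone("das haus ist klein".split()))  # -> "the house is small"
```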

Syntax-based translation

Syntax-based translation was based on the idea of translating syntactic units, rather than single words or strings of words (as in phrase-based MT), i.e. (partial) parse trees of sentences/utterances. [12] Until the advent of strong stochastic parsers in the 1990s, the statistical counterpart of this old idea did not take off. Examples of this approach included DOP-based MT and, later, synchronous context-free grammars.

Hierarchical phrase-based translation

Hierarchical phrase-based translation combined the phrase-based and syntax-based approaches to translation. It used synchronous context-free grammar rules, but the grammars could be constructed by an extension of methods for phrase-based translation without reference to linguistically motivated syntactic constituents. This idea was first introduced in Chiang's Hiero system (2005). [8]
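As an illustration of the rule format, the toy synchronous rules below (invented for this example, French to English) pair a source pattern with a target pattern sharing a gap X1, capturing reordering without reference to linguistic constituents; Hiero extracted such rules automatically from aligned phrase pairs.

```python
# Toy Hiero-style synchronous rules: each rule pairs a source-side pattern with
# a target-side pattern, and X1 marks a shared gap (nonterminal) filled by the
# same sub-translation on both sides. The rules are illustrative only.
rules = {
    "NEG": ("ne X1 pas", "does not X1"),   # French negation wraps the verb
    "ADJ": ("X1 bleu", "blue X1"),         # adjective-noun reordering
}

def apply_rule(name: str, src_filler: str, tgt_filler: str):
    """Substitute the gap X1 on both sides of a synchronous rule."""
    src_pattern, tgt_pattern = rules[name]
    return src_pattern.replace("X1", src_filler), tgt_pattern.replace("X1", tgt_filler)

print(apply_rule("NEG", "va", "go"))       # -> ('ne va pas', 'does not go')
print(apply_rule("ADJ", "chat", "cat"))    # -> ('chat bleu', 'blue cat')
```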

Challenges with statistical machine translation

Problems that statistical machine translation did not solve included:

Sentence alignment

In parallel corpora single sentences in one language can be found translated into several sentences in the other and vice versa. [12] Long sentences may be broken up, short sentences may be merged. There are even some languages that use writing systems without clear indication of a sentence end (for example, Thai). Sentence alignment can be performed through the Gale-Church alignment algorithm. Through this and other mathematical models, efficient search and retrieval of the highest scoring sentence alignment is possible.
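The sketch below illustrates the idea with a simplified length-based dynamic program in the spirit of Gale-Church: it allows 1-1, 1-0, 0-1, 2-1 and 1-2 alignments and uses a crude character-length difference as the cost, whereas the actual algorithm models the ratio of character lengths with a normal distribution.

```python
# Simplified length-based sentence alignment in the spirit of Gale-Church.
# The cost is a crude length-difference penalty, not the original Gaussian model.

def span_cost(src_chars: int, tgt_chars: int) -> float:
    return abs(src_chars - tgt_chars)

def align(src_sents, tgt_sents):
    """Dynamic programming over 1-1, 1-0, 0-1, 2-1 and 1-2 alignment moves."""
    n, m = len(src_sents), len(tgt_sents)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in moves:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or best[pi][pj] == INF:
                    continue
                src_chars = sum(len(s) for s in src_sents[pi:i])
                tgt_chars = sum(len(t) for t in tgt_sents[pj:j])
                penalty = 0.0 if (di, dj) == (1, 1) else 5.0  # prefer 1-1 beads
                c = best[pi][pj] + span_cost(src_chars, tgt_chars) + penalty
                if c < best[i][j]:
                    best[i][j], back[i][j] = c, (pi, pj)
    # Trace back the lowest-cost path into aligned sentence groups.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src_sents[pi:i], tgt_sents[pj:j]))
        i, j = pi, pj
    return list(reversed(beads))

print(align(["Short.", "A much longer sentence follows here."],
            ["Court.", "Une phrase beaucoup plus longue suit ici."]))
```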

Word alignment

Sentence alignment is usually either provided by the corpus or obtained by the aforementioned Gale-Church alignment algorithm. To learn e.g. the translation model, however, we need to know which words align in a source-target sentence pair. The IBM models and the HMM approach were attempts at solving this challenge.
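A minimal sketch of the expectation-maximization loop of IBM Model 1, the simplest of these lexical translation models, on a toy parallel corpus follows; the higher IBM models add alignment, fertility and distortion parameters on top of this, and real training used far larger corpora.

```python
from collections import defaultdict

# Toy parallel corpus of (source, target) sentence pairs.
corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]
corpus = [(f.split(), e.split()) for f, e in corpus]

src_vocab = {w for f, _ in corpus for w in f}
tgt_vocab = {w for _, e in corpus for w in e}

# Initialize translation probabilities t(f|e) uniformly.
t = {(f, e): 1.0 / len(src_vocab) for f in src_vocab for e in tgt_vocab}

for _ in range(10):                      # a few EM iterations
    count = defaultdict(float)           # expected counts c(f, e)
    total = defaultdict(float)           # expected counts c(e)
    # E-step: collect expected counts from each sentence pair.
    for f_sent, e_sent in corpus:
        for f in f_sent:
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 2))    # converges towards 1.0
print(round(t[("das", "the")], 2))       # converges towards 1.0
```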

Function words that have no clear equivalent in the target language were another challenge for the statistical models. For example, when translating the sentence "John does not live here" from English to German, the word "does" has no clear alignment in the translated sentence "John wohnt hier nicht." Through logical reasoning, it may be aligned with the word "wohnt" (as in English it carries grammatical information for the word "live") or with "nicht" (as it only appears in the sentence because it is negated), or it may be left unaligned. [11]

Statistical anomalies

An example of such an anomaly was that "I took the train to Berlin" was mis-translated as "I took the train to Paris" due to the statistical abundance of "train to Paris" in the training set.

Idioms

Depending on the corpora used, idioms might not be translated "idiomatically". For example, using the Canadian Hansard as the bilingual corpus, "hear" was almost invariably translated to "Bravo!" since in Parliament "Hear, Hear!" becomes "Bravo!". [13]

This problem is connected with word alignment: in very specific contexts the idiomatic expression may align with words that result in an idiomatic expression of the same meaning in the target language, but such an alignment is unlikely to work in other contexts. For that reason, idioms could only be subjected to phrasal alignment, as they could not be decomposed further without losing their meaning. This problem was specific to word-based translation. [11]

Different word orders

Word order differs between languages. Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for instance, of SVO or VSO languages. There are also additional differences in word order, for instance, where modifiers for nouns are located, or where the same words are used as a question or a statement.

In speech recognition, the speech signal and the corresponding textual representation can be mapped to each other in blocks in order. This is not always the case with the same text in two languages. For SMT, the machine translator can only manage small sequences of words, and word order has to be addressed by the program designer. Attempts at solutions have included re-ordering models, where a distribution of location changes for each item of translation is estimated from aligned bi-text. Different location changes can be ranked with the help of the language model and the best can be selected.
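The sketch below illustrates one simple variant, a distance-based distortion penalty of the kind used in phrase-based systems: each jump between the source positions of consecutively translated phrases is penalized, so that among otherwise equal hypotheses the more monotone ordering wins; the constant is illustrative.

```python
# Distance-based distortion penalty: penalize jumps between the end of the
# previously translated source span and the start of the next one.
ALPHA = 0.9   # illustrative distortion factor (0 < ALPHA < 1)

def distortion_score(source_spans):
    """source_spans: list of (start, end) source positions in the order the
    phrases are translated. Returns a product of per-jump penalties."""
    score, prev_end = 1.0, 0
    for start, end in source_spans:
        jump = abs(start - prev_end)
        score *= ALPHA ** jump
        prev_end = end
    return score

# Translating the source spans in order (monotone) vs. jumping around:
print(distortion_score([(0, 2), (2, 3), (3, 4)]))   # 1.0  (no jumps)
print(distortion_score([(3, 4), (0, 2), (2, 3)]))   # smaller (two jumps)
```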

Out of vocabulary (OOV) words

SMT systems typically stored different word forms as separate symbols without any relation to each other, and word forms or phrases that were not in the training data could not be translated. This might be because of the lack of training data, changes in the human domain where the system is used, or differences in morphology.


Notes and references

  1. W. Weaver (1955). Translation (1949). In: Machine Translation of Languages, MIT Press, Cambridge, MA.
  2. P. Brown; John Cocke; S. Della Pietra; V. Della Pietra; Frederick Jelinek; Robert L. Mercer; P. Roossin (1988). "A statistical approach to language translation". Coling'88. Association for Computational Linguistics. 1: 71–76. Retrieved 22 March 2015.
  3. P. Brown; John Cocke; S. Della Pietra; V. Della Pietra; Frederick Jelinek; John D. Lafferty; Robert L. Mercer; P. Roossin (1990). "A statistical approach to machine translation". Computational Linguistics. MIT Press. 16 (2): 79–85. Retrieved 22 March 2015.
  4. P. Brown; S. Della Pietra; V. Della Pietra; R. Mercer (1993). "The mathematics of statistical machine translation: parameter estimation". Computational Linguistics. MIT Press. 19 (2): 263–311. Retrieved 22 March 2015.
  5. S. Vogel, H. Ney and C. Tillmann. 1996. HMM-based Word Alignment in Statistical Translation. In COLING ’96: The 16th International Conference on Computational Linguistics, pp. 836-841, Copenhagen, Denmark.
  6. Och, Franz Josef; Ney, Hermann (2003). "A Systematic Comparison of Various Statistical Alignment Models". Computational Linguistics. 29: 19–51. doi:10.1162/089120103321337421.
  7. P. Koehn, F.J. Och, and D. Marcu (2003). Statistical phrase based translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT/NAACL).
  8. D. Chiang (2005). A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05).
  9. Zhou, Sharon (July 25, 2018). "Has AI surpassed humans at translation? Not even close!". Skynet Today. Retrieved 2 August 2018.
  10. Philipp Koehn, Franz Josef Och, Daniel Marcu: Statistical Phrase-Based Translation (2003)
  11. Koehn, Philipp (2010). Statistical Machine Translation. Cambridge University Press. ISBN 978-0-521-87415-1.
  12. Philip Williams; Rico Sennrich; Matt Post; Philipp Koehn (1 August 2016). Syntax-based Statistical Machine Translation. Morgan & Claypool Publishers. ISBN 978-1-62705-502-4.
  13. W. J. Hutchins and H. Somers (1992). An Introduction to Machine Translation, 18.3:322. ISBN 978-0-12-362830-5.

