Statistical machine translation (SMT) is a machine translation approach where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation. [1] It superseded the earlier rule-based approach, which required the explicit description of each and every linguistic rule, was costly to develop, and often did not generalize well to other languages. Since the mid-2010s, the statistical approach has itself been gradually superseded by deep learning-based neural machine translation.
The first ideas of statistical machine translation were introduced by Warren Weaver in 1949, [2] including the ideas of applying Claude Shannon's information theory. Statistical machine translation was re-introduced in the late 1980s and early 1990s by researchers at IBM's Thomas J. Watson Research Center. [3] [4] [5] Before the introduction of neural machine translation, it was by far the most widely studied machine translation method.
The idea behind statistical machine translation comes from information theory. A document is translated according to the probability distribution that a string in the target language (for example, English) is the translation of a string in the source language (for example, French).
The problem of modeling the probability distribution has been approached in a number of ways. One approach which lends itself well to computer implementation is to apply Bayes' theorem, that is $P(e|f) \propto P(f|e)\,P(e)$, where the translation model $P(f|e)$ is the probability that the source string $f$ is the translation of the target string $e$, and the language model $P(e)$ is the probability of seeing that target-language string. This decomposition is attractive as it splits the problem into two subproblems. Finding the best translation $\tilde{e}$ is done by picking the one that gives the highest probability:

$\tilde{e} = \arg\max_{e \in e^*} P(e|f) = \arg\max_{e \in e^*} P(f|e)\,P(e)$
For a rigorous implementation, one would have to perform an exhaustive search by going through all strings $e^*$ in the native language. Performing the search efficiently is the work of a machine translation decoder, which uses the foreign string, heuristics and other methods to limit the search space while keeping acceptable quality. This trade-off between quality and time usage can also be found in speech recognition.
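The decomposition can be illustrated with a toy scorer. The sketch below is a minimal illustration assuming tiny, hand-made probability tables (the candidate strings and all numbers are hypothetical); a real decoder searches a vastly larger hypothesis space with pruning heuristics rather than enumerating candidates.

```python
import math

# Toy, hand-made (hypothetical) model scores, for illustration only.
# Translation model: log P(f | e) for the French input given an English candidate.
log_p_f_given_e = {
    "the house is small": math.log(0.20),
    "small the is house": math.log(0.25),  # word salad can still "explain" the source words
    "the home is little": math.log(0.15),
}
# Language model: log P(e), how plausible the English string is on its own.
log_p_e = {
    "the house is small": math.log(0.010),
    "small the is house": math.log(0.00001),
    "the home is little": math.log(0.004),
}

def best_translation(candidates):
    """Pick argmax_e P(f|e) * P(e), i.e. the noisy-channel objective, in log space."""
    return max(candidates, key=lambda e: log_p_f_given_e[e] + log_p_e[e])

print(best_translation(list(log_p_e)))  # -> "the house is small"
```

Even in this toy setting it is the language-model term that penalizes the ungrammatical candidate despite its competitive translation-model score.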
As the translation systems are not able to store all native strings and their translations, a document is typically translated sentence by sentence. Language models are typically approximated by smoothed n-gram models, and similar approaches have been applied to translation models, but this introduces additional complexity due to different sentence lengths and word orders in the languages.
Statistical translation models were initially word-based (IBM Models 1-5, the hidden Markov model from Stephan Vogel [6] and Model 6 from Franz Josef Och [7]), but significant advances were made with the introduction of phrase-based models. [8] Later work incorporated syntax or quasi-syntactic structures. [9]
The most frequently cited[ citation needed ] benefits of statistical machine translation (SMT) over the rule-based approach are the more efficient use of human and data resources and the more fluent translations produced through the use of a language model.
In word-based translation, the fundamental unit of translation is a word in some natural language. Typically, the number of words in translated sentences differs because of compound words, morphology and idioms. The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces. The model necessarily assumes that each word and its translation cover the same concept; in practice this is not always true. For example, the English word corner can be translated into Spanish as either rincón or esquina, depending on whether it refers to an internal or an external angle.
Simple word-based translation cannot translate between languages with different fertility. Word-based translation systems can relatively simply be made to cope with high fertility, such that they can map a single word to multiple words, but not the other way around[ citation needed ]. For example, if we were translating from English to French, each word in English could produce any number of French words, sometimes none at all, but there is no way to group two English words to produce a single French word.
An example of a word-based translation system is the freely available GIZA++ package (licensed under the GPL), which includes training programs for the IBM models as well as the HMM model and Model 6. [7]
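The expectation-maximization training behind the simplest of these models can be sketched compactly. The following is a simplified, illustrative implementation of IBM Model 1 on a toy corpus, not the GIZA++ code; the NULL word handling and the toy sentence pairs are assumptions made for the example.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=50):
    """Toy IBM Model 1: learn word translation probabilities t(f|e) by EM.

    corpus: list of (foreign_sentence, english_sentence) token-list pairs.
    A NULL token on the English side lets foreign words align to nothing.
    """
    corpus = [(f, ["NULL"] + e) for f, e in corpus]
    f_vocab = {w for f, _ in corpus for w in f}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization of t(f|e)

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for f_sent, e_sent in corpus:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

toy = [(["la", "maison"], ["the", "house"]),
       (["la", "fleur"], ["the", "flower"])]
t = train_ibm_model1(toy)
print(t[("la", "the")])  # close to 1.0 after enough iterations: EM learns that "la" renders "the"
```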
Word-based translation is not widely used today; phrase-based systems are more common. Most phrase-based systems still use GIZA++ to align the corpus[ citation needed ]. The alignments are used to extract phrases or deduce syntax rules, [11] and matching words in bi-text remains a problem actively discussed in the community. Because of the predominance of GIZA++, there are now several distributed implementations of it online. [12]
In phrase-based translation, the aim is to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. The sequences of words are called blocks or phrases. These are typically not linguistic phrases, but phrasemes that were found using statistical methods from corpora. It has been shown that restricting the phrases to linguistic phrases (syntactically motivated groups of words, see syntactic categories) decreased the quality of translation. [13]
The chosen phrases are then mapped one-to-one based on a phrase translation table, and may be reordered. This table can be learnt from a word alignment, or directly from a parallel corpus. The latter model is trained using the expectation-maximization algorithm, similarly to the word-based IBM models. [14]
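One widely described way to build such a table from word alignments is to extract every phrase pair that is consistent with the alignment, meaning that no word inside the pair is aligned to a word outside it. The sketch below is a brute-force, simplified version of that extraction heuristic on a toy sentence pair (the alignment shown is hypothetical); it illustrates the idea rather than a production implementation.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Brute-force extraction of phrase pairs consistent with a word alignment.

    src, tgt: lists of tokens; alignment: set of (src_index, tgt_index) links.
    A pair of spans is consistent if no alignment link leaves the box they define
    and at least one link lies inside it.
    """
    pairs = set()
    n, m = len(src), len(tgt)
    for s1 in range(n):
        for s2 in range(s1, min(n, s1 + max_len)):
            for t1 in range(m):
                for t2 in range(t1, min(m, t1 + max_len)):
                    touching = [(i, j) for (i, j) in alignment
                                if s1 <= i <= s2 or t1 <= j <= t2]
                    inside = [(i, j) for (i, j) in touching
                              if s1 <= i <= s2 and t1 <= j <= t2]
                    if inside and touching == inside:
                        pairs.add((" ".join(src[s1:s2 + 1]),
                                   " ".join(tgt[t1:t2 + 1])))
    return pairs

src = "michael geht davon aus".split()
tgt = "michael assumes".split()
# Hypothetical alignment: "geht davon aus" jointly renders "assumes".
alignment = {(0, 0), (1, 1), (2, 1), (3, 1)}
for pair in sorted(extract_phrase_pairs(src, tgt, alignment)):
    print(pair)
```

On this toy input the extraction yields "michael" / "michael", "geht davon aus" / "assumes", and the whole sentence pair, while rejecting pairs that would cut through the alignment.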
Syntax-based translation is based on the idea of translating syntactic units rather than single words or strings of words (as in phrase-based MT), i.e. (partial) parse trees of sentences/utterances. [15] The statistical counterpart of this old idea did not take off until the 1990s, with the advent of strong stochastic parsers. Examples of this approach include DOP-based MT and, later, synchronous context-free grammars.
Hierarchical phrase-based translation combines the phrase-based and syntax-based approaches to translation. It uses synchronous context-free grammar rules, but the grammars can be constructed by an extension of methods for phrase-based translation without reference to linguistically motivated syntactic constituents. This idea was first introduced in Chiang's Hiero system (2005). [9]
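A hierarchical rule can be pictured as a pair of patterns with a shared gap, for example X → ⟨ne X₁ pas, does not X₁⟩. The snippet below is only a toy illustration of applying one hand-written rule of this kind by pattern matching; the rule, the word table and the recursion are assumptions of the sketch, whereas real systems extract large rule sets automatically and apply them with a chart-based decoder.

```python
import re

# A toy, hand-written Hiero-style synchronous rule:
#   X -> < "ne X1 pas" , "does not X1" >
# The shared gap X1 is translated recursively (here with a tiny word dictionary).
WORD_TABLE = {"il": "he", "va": "go", "mange": "eat"}

def translate(phrase):
    """Apply the synchronous rule if it matches, otherwise translate word by word."""
    m = re.match(r"^(.*)\bne (.+) pas\b(.*)$", phrase)
    if m:
        before, gap, after = m.groups()
        return (translate(before.strip()) + " does not " +
                translate(gap.strip()) + " " + translate(after.strip())).strip()
    return " ".join(WORD_TABLE.get(w, w) for w in phrase.split())

print(translate("il ne va pas"))  # -> "he does not go"
```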
A language model is an essential component of any statistical machine translation system, which aids in making the translation as fluent as possible. It is a function that takes a translated sentence and returns the probability of it being said by a native speaker. A good language model will for example assign a higher probability to the sentence "the house is small" than to "small the is house". Other than word order, language models may also help with word choice: if a foreign word has multiple possible translations, these functions may give better probabilities for certain translations in specific contexts in the target language. [14]
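A minimal sketch of such a function is shown below: a bigram model with add-one (Laplace) smoothing trained on a toy corpus, which is enough to prefer "the house is small" over "small the is house". The corpus and the smoothing choice are assumptions of the example; real systems were trained on far larger corpora with more sophisticated smoothing such as Kneser-Ney.

```python
import math
from collections import Counter

# Toy training corpus; real language models are trained on billions of words.
corpus = [
    "the house is small",
    "the house is big",
    "the small house is old",
    "small houses are cheap",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def log_prob(sentence):
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    score = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        score += math.log((bigrams[(prev, word)] + 1) /
                          (unigrams[prev] + vocab_size))
    return score

print(log_prob("the house is small") > log_prob("small the is house"))  # True
```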
Problems with statistical machine translation include sentence alignment, word alignment, statistical anomalies, idioms, differing word orders, and out-of-vocabulary words, each discussed below.
Single sentences in one language can be found translated into several sentences in the other and vice versa. [15] Long sentences may be broken up, while short sentences may be merged. There are even languages that use writing systems without clear indication of a sentence end, such as Thai. Sentence aligning can be performed through the Gale-Church alignment algorithm. Efficient search and retrieval of the highest scoring sentence alignment is possible through this and other mathematical models.
Sentence alignment is usually either provided by the corpus or obtained by the aforementioned Gale-Church alignment algorithm. To learn e.g. the translation model, however, we need to know which words align in a source-target sentence pair. The IBM models and the HMM approach were attempts to solve this challenge.
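The dynamic-programming idea behind length-based sentence alignment can be sketched as follows. This is a deliberately simplified cost model (absolute character-length difference plus a flat penalty for 1-2 and 2-1 merges), not the published Gale-Church statistics, and it omits 1-0 and 0-1 beads for brevity.

```python
def align_sentences(src, tgt, merge_penalty=10):
    """Simplified length-based sentence alignment by dynamic programming.

    Allows 1-1, 1-2 and 2-1 beads; the cost of a bead is the absolute difference
    of the character lengths of its two sides, plus a flat penalty for merges.
    """
    INF = float("inf")
    n, m = len(src), len(tgt)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    def bead_cost(s_chunk, t_chunk):
        extra = merge_penalty if len(s_chunk) + len(t_chunk) > 2 else 0
        return abs(sum(map(len, s_chunk)) - sum(map(len, t_chunk))) + extra

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                if i + di <= n and j + dj <= m:
                    c = cost[i][j] + bead_cost(src[i:i + di], tgt[j:j + dj])
                    if c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = c
                        back[i + di][j + dj] = (i, j)

    # Trace back the best path into aligned (source_group, target_group) beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(beads))

src = ["Das Haus ist klein.", "Es ist alt."]
tgt = ["The house is small.", "It is old."]
print(align_sentences(src, tgt))  # two 1-1 beads in this toy case
```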
Function words that have no clear equivalent in the target language are another issue for the statistical models. For example, when translating from English to German, in the sentence "John does not live here", the word "does" has no clear alignment in the translated sentence "John wohnt hier nicht". Through logical reasoning, it may be aligned with the words "wohnt" (as it contains grammatical information for the English word "live") or "nicht" (as it only appears in the sentence because it is negated) or it may be unaligned. [14]
Real-world training sets can produce statistical anomalies that override intuitive translations. An example of such an anomaly is the phrase "I took the train to Berlin" being mistranslated as "I took the train to Paris" due to the statistical abundance of "train to Paris" in the training set.
Depending on the corpora used, the use of idiom and linguistic register might not receive a translation that accurately represents the original intent. For example, the popular Canadian Hansard bilingual corpus primarily consists of parliamentary speech examples, where "Hear, Hear!" is frequently associated with "Bravo!" Using a model built on this corpus to translate ordinary speech in a conversational register would lead to incorrect translation of the word hear as Bravo! [19]
This problem is connected with word alignment: in very specific contexts, an idiomatic expression may align with words that result in an idiomatic expression of the same meaning in the target language, but such an alignment usually does not work in any other context. For that reason, idioms could only be subjected to phrasal alignment, as they cannot be decomposed further without losing their meaning. This problem was specific to word-based translation. [14]
Word order differs between languages. Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for instance, of SVO or VSO languages. There are also additional differences in word order, for instance where modifiers of nouns are located, or where the same words are used as a question or a statement.
In speech recognition, the speech signal and the corresponding textual representation can be mapped to each other in blocks in order. This is not always the case with the same text in two languages. For SMT, the machine translator can only manage small sequences of words, and word order has to be handled explicitly by the system designer. Attempts at solutions have included re-ordering models, where a distribution of location changes for each item of translation is estimated from aligned bi-text. Different location changes can be ranked with the help of the language model and the best can be selected.
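A common reordering component in phrase-based systems is a distance-based distortion penalty that discourages emitting a phrase far from where the previously translated phrase ended in the source. The sketch below shows this standard formulation on toy spans; the penalty weight is an arbitrary assumption.

```python
import math

def distortion_log_score(segments, alpha=0.6):
    """Distance-based distortion score for a sequence of translated phrases.

    segments: (source_start, source_end) spans of the phrases, listed in the
    order in which their translations are emitted in the target sentence.
    A jump of distance d contributes d * log(alpha), i.e. alpha ** d in
    probability space, so monotone translation is penalized least.
    """
    score, prev_end = 0.0, -1
    for start, end in segments:
        distance = abs(start - prev_end - 1)
        score += distance * math.log(alpha)
        prev_end = end
    return score

# A monotone hypothesis (no jumps) scores higher than a reordered one.
monotone = [(0, 1), (2, 2), (3, 4)]
reordered = [(2, 2), (0, 1), (3, 4)]
print(distortion_log_score(monotone) > distortion_log_score(reordered))  # True
```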
SMT systems typically store different word forms as separate symbols without any relation to each other, and word forms or phrases that were not in the training data cannot be translated. This might be because of the lack of training data, changes in the human domain where the system is used, or differences in morphology.
Statistical machine translation is related to other data-driven methods in machine translation, such as the earlier work on example-based machine translation. Contrast this to systems that are based on hand-crafted rules.
Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.
Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches in machine learning and deep learning.
Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.
In linguistics and natural language processing, a corpus or text corpus is a dataset consisting of natively digital and older, digitized, language resources, either annotated or unannotated.
Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.
In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.
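"More often than would be expected by chance" is commonly quantified with an association measure such as pointwise mutual information (PMI). The sketch below computes PMI for adjacent word pairs over a toy token sequence; the corpus, the restriction to adjacent pairs and the absence of frequency thresholds are simplifications.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Pointwise mutual information for adjacent word pairs in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        p_joint = c / (n - 1)
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return scores

tokens = "strong tea and strong coffee and powerful computers and strong tea".split()
pmi = bigram_pmi(tokens)
# "strong tea" co-occurs more often than its word frequencies predict, so its PMI is higher.
print(pmi[("strong", "tea")] > pmi[("and", "strong")])  # True
```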
Machine translation can use a method based on dictionary entries, which means that the words will be translated as a dictionary does – word by word, usually without much correlation of meaning between them. Dictionary lookups may be done with or without morphological analysis or lemmatisation. While this approach to machine translation is probably the least sophisticated, dictionary-based machine translation is ideally suitable for the translation of long lists of phrases on the subsentential level, e.g. inventories or simple catalogs of products and services.
In corpus linguistics, part-of-speech tagging, also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Invented at IBM in 2001, BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
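A deliberately simplified, single-reference, sentence-level version of the metric can be sketched as follows: clipped n-gram precisions up to 4-grams are combined geometrically and multiplied by a brevity penalty. The official metric is defined over a whole test corpus and is usually smoothed differently, so the small-constant smoothing used here is an assumption for illustration.

```python
import math
from collections import Counter

def toy_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference translation."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
        ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # Toy smoothing: avoid log(0) when an n-gram order has no match at all.
        log_precisions += math.log(max(overlap, 0.1) / total)
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_precisions / max_n)

reference = "the cat is on the mat"
print(toy_bleu("the cat is on the mat", reference))    # 1.0 for an exact match
print(toy_bleu("the the the the the the", reference))  # much lower: counts are clipped
```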
Bitext word alignment or simply word alignment is the natural language processing task of identifying translation relationships among the words in a bitext, resulting in a bipartite graph between the two sides of the bitext, with an arc between two words if and only if they are translations of one another. Word alignment is typically done after sentence alignment has already identified pairs of sentences that are translations of one another.
The noisy channel model is a framework used in spell checkers, question answering, speech recognition, and machine translation. In this model, the goal is to find the intended word given a word where the letters have been scrambled in some manner.
The history of natural language processing describes the advances of natural language processing. There is some overlap with the history of machine translation, the history of speech recognition, and the history of artificial intelligence.
The following outline is provided as an overview of and topical guide to natural-language processing:
Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov and colleagues at Google and published in 2013.
Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.
IBM alignment models are a sequence of increasingly complex models used in statistical machine translation to train a translation model and an alignment model, starting with lexical translation probabilities and moving to reordering and word duplication. They underpinned the majority of statistical machine translation systems for almost twenty years starting in the early 1990s, until neural machine translation began to dominate. These models offer principled probabilistic formulation and (mostly) tractable inference.
Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as semantic parsing and generation of new samples to expand existing corpora.
A word n-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network-based models, which have in turn been superseded by large language models. It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model. Special tokens, commonly written ⟨s⟩ and ⟨/s⟩, are introduced to denote the start and end of a sentence.