Neural machine translation

Last updated April 20, 2024

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.

It is the dominant approach today^[1]^: 293^[2]^: 1 and can produce translations that rival human translations when translating between high-resource languages under specific conditions.^[3] However, there still remain challenges, especially with languages where less high-quality data is available,^[4]^[5]^[1]^: 293 and with domain shift between the data a system was trained on and the texts it is supposed to translate.^[1]^: 293 NMT systems also tend to produce fairly literal translations.^[5]

Overview

In the translation task, a sentence $\mathbf {x} =x_{1,I}$ (consisting of $I$ tokens $x_{i}$ ) in the source language is to be translated into a sentence $\mathbf {y} =y_{1,J}$ (consisting of $J$ tokens $y_{j}$ ) in the target language. The source and target tokens are represented as vectors, so they can be processed mathematically.

NMT models assign a probability $P(y|x)$ ^[2]^: 5^[6]^: 1 to potential translations y and then search a subset of potential translations for the one with the highest probability. Most NMT models are auto-regressive: They model the probability of each target token as a function of the source sentence and the previously predicted target tokens. The probability of the whole translation then is the product of the probabilities of the individual predicted tokens:^[2]^: 5^[6]^: 2

P(y|x)=\prod _{j=1}^{J}P(y_{j}|y_{1,j-1},\mathbf {x} )
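As a small numerical illustration of this factorization, the following Python sketch multiplies per-token probabilities into a sentence probability. The three probability values are made up for the example and do not come from any real model.

```python
import math

def sequence_probability(token_probs):
    """Probability of a whole translation: the product of the per-token probabilities."""
    return math.prod(token_probs)

def sequence_log_probability(token_probs):
    """Same quantity in log space: the sum of the per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical values of P(y_j | y_{1,j-1}, x) for a three-token translation.
token_probs = [0.5, 0.8, 0.25]

p = sequence_probability(token_probs)
log_p = sequence_log_probability(token_probs)
print(f"P(y|x) = {p:.3f}, log P(y|x) = {log_p:.3f}")
```

Exponentiating the log-probability recovers the product, which is why implementations commonly work with sums of logarithms rather than products.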

NMT models differ in how exactly they model this function $P$ , but most use some variation of the encoder-decoder architecture:^[6]^: 2^[7]^: 469 They first use an encoder network to process $\mathbf {x}$ and encode it into a vector or matrix representation of the source sentence. Then they use a decoder network that usually produces one target word at a time, taking into account the source representation and the tokens it previously produced. As soon as the decoder produces a special end of sentence token, the decoding process is finished. Since the decoder refers to its own previous outputs during decoding, this way of decoding is called auto-regressive.
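The decoding loop described above can be sketched in Python as follows. This is only an illustration: `toy_next_token_probs` is a hypothetical stand-in for a trained encoder-decoder network, the `</s>` and `<unk>` token names are assumptions, and the loop picks the single most probable token at each step (greedy search), whereas practical systems often use beam search instead.

```python
EOS = "</s>"  # assumed name for the special end of sentence token

def toy_next_token_probs(source, prefix):
    """Hypothetical stand-in for a trained model.

    Returns a distribution over the next target token given the source and the
    tokens produced so far. It 'translates' by upper-casing the source tokens
    and then predicts the end of sentence token.
    """
    target = [tok.upper() for tok in source] + [EOS]
    expected = target[min(len(prefix), len(target) - 1)]
    return {expected: 0.9, "<unk>": 0.1}

def greedy_decode(source, next_token_probs, max_len=50):
    """Auto-regressive decoding: feed the model its own previous outputs."""
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(source, prefix)
        token = max(probs, key=probs.get)  # most probable next token
        if token == EOS:
            break  # decoding is finished
        prefix.append(token)
    return prefix

print(greedy_decode(["hallo", "welt"], toy_next_token_probs))
```

The `max_len` guard is a design choice for the sketch: it ensures termination even if the model never predicts the end of sentence token.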

History

Early approaches

In 1987, Robert B. Allen demonstrated the use of feed-forward neural networks for translating auto-generated English sentences with a limited vocabulary of 31 words into Spanish. In this experiment, the size of the network's input and output layers was chosen to be just large enough for the longest sentences in the source and target language, respectively, because the network did not have any mechanism to encode sequences of arbitrary length into a fixed-size representation. In his summary, Allen also already hinted at the possibility of using auto-associative models, one for encoding the source and one for decoding the target.^[8]

Lonnie Chrisman built upon Allen's work in 1991 by training separate recursive auto-associative memory (RAAM) networks (developed by Jordan B. Pollack^[9]) for the source and the target language. Each of the RAAM networks is trained to encode an arbitrary-length sentence into a fixed-size hidden representation and to decode the original sentence again from that representation. Additionally, the two networks are also trained to share their hidden representation; this way, the source encoder can produce a representation that the target decoder can decode.^[10] Forcada and Ñeco simplified this procedure in 1997 to directly train a source encoder and a target decoder in what they called a recursive hetero-associative memory.^[11]

Also in 1997, Castaño and Casacuberta employed an Elman recurrent neural network in another machine translation task with very limited vocabulary and complexity.^[12]^[13]

Even though these early approaches were already similar to modern NMT, the computing resources of the time were not sufficient to process datasets large enough for the computational complexity of the machine translation problem on real-world texts.^[1]^: 39^[14]^: 2 Instead, other methods like statistical machine translation rose to become the state of the art of the 1990s and 2000s.

Hybrid approaches

During the time when statistical machine translation was prevalent, some works used neural methods to replace various components of statistical machine translation systems while still using the log-linear approach to tie them together.^[1]^: 39^[2]^: 1 For example, in various works together with other researchers, Holger Schwenk replaced the usual n-gram language model with a neural one^[15]^[16] and estimated phrase translation probabilities using a feed-forward network.^[17]

NMT becomes dominant

CNNs and RNNs

In 2013 and 2014, end-to-end neural machine translation had its breakthrough with Kalchbrenner & Blunsom using a convolutional neural network (CNN) for encoding the source^[18] and both Cho et al. and Sutskever et al. using a recurrent neural network (RNN) instead.^[19]^[20] All three used an RNN conditioned on a fixed encoding of the source as their decoder to produce the translation. However, these models performed poorly on longer sentences.^[21]^: 107^[1]^: 39^[2]^: 7 This problem was addressed when Bahdanau et al. introduced attention to their encoder-decoder architecture: At each decoding step, the state of the decoder is used to calculate a source representation that focuses on different parts of the source, and that representation is then used in the calculation of the probabilities for the next token.^[22] Based on these RNN-based architectures, Baidu launched the "first large-scale NMT system"^[23]^: 144 in 2015, followed by Google in 2016.^[23]^: 144^[24] From that year on, neural models also became the prevailing choice in the main machine translation conference Workshop on Statistical Machine Translation.^[25]

Gehring et al. combined a CNN encoder with an attention mechanism in 2017, which handled long-range dependencies in the source better than previous approaches and also increased translation speed because a CNN encoder is parallelizable, whereas an RNN encoder has to encode one token at a time due to its recurrent nature.^[26]^: 230 In the same year, Microsoft Translator released AI-powered online neural machine translation (NMT).^[27] DeepL Translator, which was at the time based on a CNN encoder, was also released in the same year and was judged by several news outlets to outperform its competitors.^[28]^[29]^[30] OpenAI's GPT-3, released in 2020, can also function as a neural machine translation system. Other machine translation systems, such as Microsoft Translator and SYSTRAN, have also integrated neural networks into their operations.

The transformer

Another network architecture that lends itself to parallelization is the transformer, which was introduced by Vaswani et al. also in 2017.^[31] Like previous models, the transformer still uses the attention mechanism for weighting encoder output for the decoding steps. However, the transformer's encoder and decoder networks themselves are also based on attention instead of recurrence or convolution: Each layer weights and transforms the previous layer's output in a process called self-attention. Since the attention mechanism does not have any notion of token order, but the order of words in a sentence is obviously relevant, the token embeddings are combined with an explicit encoding of their position in the sentence.^[2]^: 15^[6]^: 7 Since both the transformer's encoder and decoder are free from recurrent elements, they can both be parallelized during training. However, the original transformer's decoder is still auto-regressive, which means that decoding still has to be done one token at a time during inference.
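The point about token order can be made concrete with a deliberately simplified Python sketch. It implements dot-product self-attention without the learned query, key and value projections or the multiple heads of a real transformer, and it uses the sinusoidal position encoding from the original transformer paper as one possible explicit position encoding. The two four-dimensional embeddings are invented for the example.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal position encoding: sine on even dimensions, cosine on odd ones."""
    return [
        math.sin(position / 10000 ** (k / dim)) if k % 2 == 0
        else math.cos(position / 10000 ** ((k - 1) / dim))
        for k in range(dim)
    ]

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: the inputs serve as queries, keys and values."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs

# Hypothetical token embeddings; the token "a" occurs at positions 0 and 2.
embeddings = {"a": [1.0, 0.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0, 0.0]}
sentence = ["a", "b", "a"]

plain = [embeddings[t] for t in sentence]
with_pos = [[e + p for e, p in zip(embeddings[t], positional_encoding(i, 4))]
            for i, t in enumerate(sentence)]

out_plain = self_attention(plain)
out_pos = self_attention(with_pos)
print("same output for both 'a' tokens without positions:", out_plain[0] == out_plain[2])
print("same output for both 'a' tokens with positions:", out_pos[0] == out_pos[2])
```

Without position information the two occurrences of the same token are indistinguishable to the attention layer; adding the position encoding to the embeddings makes their representations differ.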

The transformer model quickly became the dominant choice for machine translation systems^[2]^: 44 and was still by far the most-used architecture in the Workshop on Statistical Machine Translation in 2022 and 2023.^[32]^: 35–40^[33]^: 28–31

Usually, NMT models’ weights are initialized randomly and then learned by training on parallel datasets. However, since using large language models (LLMs) such as BERT pre-trained on large amounts of monolingual data as a starting point for learning other tasks has proven very successful in wider NLP, this paradigm is also becoming more prevalent in NMT. This is especially useful for low-resource languages, where large parallel datasets do not exist.^[4]^: 689–690 An example of this is the mBART model, which first trains one transformer on a multilingual dataset to recover masked tokens in sentences, and then fine-tunes the resulting autoencoder on the translation task.^[34]
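The idea of training a model to recover masked tokens can be illustrated with a minimal Python sketch that builds one training pair. It is a simplification under stated assumptions: the `<mask>` symbol, the masking ratio and the fixed random seed are illustrative choices, tokens are masked independently, and the actual mBART noising scheme is more elaborate.

```python
import random

MASK = "<mask>"  # assumed name for the mask symbol

def mask_tokens(tokens, ratio=0.35, seed=0):
    """Return (corrupted_input, reconstruction_target) for denoising pre-training."""
    if not tokens:
        return [], []
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n_mask = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return corrupted, list(tokens)

corrupted, target = mask_tokens("the cat sat on the mat".split())
print("model input: ", corrupted)
print("model target:", target)
```

During pre-training the model sees the corrupted sentence and is trained to output the original one; no parallel data is needed for this step.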

Generative LLMs

Instead of fine-tuning a pre-trained language model on the translation task, sufficiently large generative models can also be directly prompted to translate a sentence into the desired language. This approach was first comprehensively tested and evaluated for GPT-3.5 in 2023 by Hendy et al. They found that "GPT systems can produce highly fluent and competitive translation outputs even in the zero-shot setting especially for the high-resource language translations".^[35]^: 22 WMT23 evaluated the same approach (but using GPT-4) and found that it was on par with the state of the art when translating into English, but not quite when translating into lower-resource languages.^[33]^: 16–17 This is plausible considering that GPT models are trained mainly on English text.^[36]

Comparison with statistical machine translation

NMT has overcome several challenges that were present in statistical machine translation (SMT):

NMT's full reliance on continuous representation of tokens overcame sparsity issues caused by rare words or phrases. Models were able to generalize more effectively.^[18]^: 1^[37]^: 900–901
The limited n-gram length used in SMT's n-gram language models caused a loss of context. NMT systems overcome this by not having a hard cut-off after a fixed number of tokens and by using attention to choose which tokens to focus on when generating the next token.^[37]^: 900–901
End-to-end training of a single model improved translation performance and also simplified the whole process.
The huge n-gram models (up to 7-gram) used in SMT required large amounts of memory,^[38]^: 88 whereas NMT requires less.

Training procedure

Cross-entropy loss

NMT models are usually trained to maximize the likelihood of observing the training data. I.e., for a dataset of $T$ source sentences $X=\mathbf {x} ^{(1)},...,\mathbf {x} ^{(T)}$ and corresponding target sentences $Y=\mathbf {y} ^{(1)},...,\mathbf {y} ^{(T)}$ , the goal is finding the model parameters $\theta ^{*}$ that maximize the product of the likelihoods of each target sentence in the training data given the corresponding source sentence:

\theta ^{*}={\underset {\theta }{\operatorname {arg\,max} }}\prod _{i}^{T}P_{\theta }(\mathbf {y} ^{(i)}|\mathbf {x} ^{(i)})

Expanding to token level yields:

\theta ^{*}={\underset {\theta }{\operatorname {arg\,max} }}\prod _{i}^{T}\prod _{j=1}^{J^{(i)}}P(y_{j}^{(i)}|y_{1,j-1}^{(i)},\mathbf {x} ^{(i)})

Since we are only interested in the maximum and the logarithm is monotonically increasing, we can just as well search for the maximum of the logarithm instead (which has the advantage that it avoids floating point underflow that could happen with the product of low probabilities).

\theta ^{*}={\underset {\theta }{\operatorname {arg\,max} }}\log \prod _{i}^{T}\prod _{j=1}^{J^{(i)}}P(y_{j}^{(i)}|y_{1,j-1}^{(i)},\mathbf {x} ^{(i)})

Using the fact that the logarithm of a product is the sum of the factors’ logarithms and flipping the sign yields the classic cross-entropy loss:

\theta ^{*}={\underset {\theta }{\operatorname {arg\,min} }}-\sum _{i}^{T}\sum _{j=1}^{J^{(i)}}\log P(y_{j}^{(i)}|y_{1,j-1}^{(i)},\mathbf {x} ^{(i)})

In practice, this minimization is done iteratively on small subsets (mini-batches) of the training set using stochastic gradient descent.
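The following Python sketch computes this loss for a toy mini-batch and also demonstrates the underflow argument made above. The probabilities are invented for illustration; in a real system they would be the model's predicted probabilities of the reference tokens, and the gradient step is omitted here.

```python
import math

def negative_log_likelihood(batch_token_probs):
    """Sum of -log P over all target tokens of all sentences in the batch."""
    return -sum(math.log(p) for sentence in batch_token_probs for p in sentence)

# Hypothetical probabilities the model assigns to the reference tokens of
# two target sentences (with two tokens and one token, respectively).
batch = [[0.5, 0.25], [0.125]]
loss = negative_log_likelihood(batch)
print(f"loss = {loss:.4f}")

# Why logarithms help: the product of many small probabilities underflows
# to zero in floating point, while the sum of their logarithms stays finite.
many_small = [1e-5] * 1000
product = math.prod(many_small)
log_sum = sum(math.log(p) for p in many_small)
print("product:", product, "sum of logs:", round(log_sum, 1))
```

A perfect model would assign probability 1 to every reference token and reach a loss of 0; lower probabilities increase the loss.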

Teacher forcing

During inference, auto-regressive decoders use the token generated in the previous step as the input token. However, the vocabulary of target tokens is usually very large, so at the beginning of the training phase, untrained models will almost always pick the wrong token. Subsequent steps would then have to work with wrong input tokens, which would slow down training considerably. Instead, teacher forcing is used during the training phase: The model (the “student” in the teacher forcing metaphor) is always fed the previous ground-truth tokens as input for the next token, regardless of what it predicted in the previous step.
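A minimal Python sketch of how training examples can be laid out under teacher forcing is given below. The begin-of-sentence token `<s>` and end-of-sentence token `</s>` are assumed names introduced for the example; the important point is that every decoder input is built from the reference translation, never from the model's own predictions.

```python
BOS, EOS = "<s>", "</s>"  # assumed special tokens

def teacher_forcing_pairs(target_tokens):
    """Return (ground-truth prefix, expected next token) for every decoding step."""
    full = [BOS] + list(target_tokens) + [EOS]
    return [(full[:j], full[j]) for j in range(1, len(full))]

pairs = teacher_forcing_pairs(["das", "Haus"])
for prefix, expected in pairs:
    print(prefix, "->", expected)
```

Because all prefixes are known in advance, the steps do not depend on each other's predictions and can be computed in parallel during training.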

Translation by prompt engineering LLMs

As outlined in the history section above, instead of using an NMT system that is trained on parallel text, one can also prompt a generative LLM to translate a text. These models differ from an encoder-decoder NMT system in a number of ways:^[35]^: 1

Generative language models are not trained on the translation task, let alone on a parallel dataset. Instead, they are trained on a language modeling objective, such as predicting the next word in a sequence drawn from a large dataset of text. This dataset can contain documents in many languages, but is in practice dominated by English text.^[36] After this pre-training, they are fine-tuned on another task, usually to follow instructions.^[39]
Since they are not trained on translation, they also do not feature an encoder-decoder architecture. Instead, they just consist of a transformer's decoder.
In order to be competitive on the machine translation task, LLMs need to be much larger than other NMT systems. E.g., GPT-3 has 175 billion parameters,^[40]^: 5 while mBART has 680 million^[34]^: 727 and the original transformer-big has “only” 213 million.^[31]^: 9 This means that they are computationally more expensive to train and use.
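To put the parameter counts quoted above into proportion, a short Python calculation of the size ratios follows. It uses only the three figures given in the text and says nothing about actual training or inference cost, which depends on further factors.

```python
# Parameter counts as quoted in the text.
params = {"GPT-3": 175e9, "mBART": 680e6, "transformer-big": 213e6}

ratios = {name: params["GPT-3"] / count
          for name, count in params.items() if name != "GPT-3"}

for name, ratio in ratios.items():
    print(f"GPT-3 has about {ratio:.0f} times as many parameters as {name}")
```

By these figures, GPT-3 is roughly 257 times as large as mBART and roughly 822 times as large as transformer-big.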

A generative LLM can be prompted in a zero-shot fashion by just asking it to translate a text into another language without giving any further examples in the prompt. Or one can include one or several example translations in the prompt before asking to translate the text in question. This is then called one-shot or few-shot learning, respectively. For example, the following prompts were used by Hendy et al. (2023) for zero-shot and one-shot translation:^[35]

### Translate this sentence from [source language] to [target language], Source: [source sentence] ### Target:

Translate this into 1. [target language]: [shot 1 source] 1. [shot 1 reference] Translate this into 1. [target language]: [input] 1.
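Filling such templates programmatically could look like the following Python sketch. The function names are hypothetical, the templates are reproduced on a single line exactly as they appear above (the line breaks of the original prompts may differ), and the sketch only builds the prompt string; sending it to a model is left out.

```python
def zero_shot_prompt(source_language, target_language, sentence):
    """Fill the zero-shot template with the languages and the source sentence."""
    return (
        f"### Translate this sentence from {source_language} to {target_language}, "
        f"Source: {sentence} ### Target:"
    )

def few_shot_prompt(target_language, shots, sentence):
    """Fill the shot-based template; `shots` is a list of (source, reference) pairs."""
    parts = [
        f"Translate this into 1. {target_language}: {shot_source} 1. {shot_reference}"
        for shot_source, shot_reference in shots
    ]
    parts.append(f"Translate this into 1. {target_language}: {sentence} 1.")
    return " ".join(parts)

print(zero_shot_prompt("German", "English", "Guten Morgen"))
print(few_shot_prompt("English", [("Hallo", "Hello")], "Danke"))
```

Passing one example pair yields a one-shot prompt and passing several yields a few-shot prompt; an empty list reduces it to the final instruction only.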

Literature

Koehn, Philipp (2020). Neural Machine Translation. Cambridge University Press.
Stahlberg, Felix (2020). Neural Machine Translation: A Review and Survey.

References

Koehn, Philipp (2020). Neural Machine Translation. Cambridge University Press.
Stahlberg, Felix (2020-09-29). "Neural Machine Translation: A Review and Survey". arXiv: 1912.02047v2 [cs.CL].
Popel, Martin; Tomkova, Marketa; Tomek, Jakub; Kaiser, Łukasz; Uszkoreit, Jakob; Bojar, Ondřej; Žabokrtský, Zdeněk (2020-09-01). "Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals". Nature Communications. 11 (1): 4381. doi:10.1038/s41467-020-18073-9. hdl: 11346/BIBLIO@id=368112263610994118 . ISSN 2041-1723.
Haddow, Barry; Bawden, Rachel; Miceli Barone, Antonio Valerio; Helcl, Jindřich; Birch, Alexandra (2022). "Survey of Low-Resource Machine Translation". Computational Linguistics. 48 (3): 673–732. arXiv: 2109.00486 . doi:10.1162/coli_a_00446.
Poibeau, Thierry (2022). Calzolari, Nicoletta; Béchet, Frédéric; Blache, Philippe; Choukri, Khalid; Cieri, Christopher; Declerck, Thierry; Goggi, Sara; Isahara, Hitoshi; Maegaard, Bente (eds.). "On "Human Parity" and "Super Human Performance" in Machine Translation Evaluation". Proceedings of the Thirteenth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association: 6018–6023.
Tan, Zhixing; Wang, Shuo; Yang, Zonghan; Chen, Gang; Huang, Xuancheng; Sun, Maosong; Liu, Yang (2020-12-31). "Neural Machine Translation: A Review of Methods, Resources, and Tools". arXiv: 2012.15515 [cs.CL].
Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "12.4.5 Neural Machine Translation". Deep Learning. MIT Press. pp. 468–471. Retrieved 2022-12-29.
Allen, Robert B. (1987). Several Studies on Natural Language and Back-Propagation. IEEE First International Conference on Neural Networks. Vol. 2. San Diego. pp. 335–341. Retrieved 2022-12-30.
Chrisman, Lonnie (1991). "Learning Recursive Distributed Representations for Holistic Computation". Connection Science. 3 (4): 345–366. doi:10.1080/09540099108946592. ISSN 0954-0091.
Pollack, Jordan B. (1990). "Recursive distributed representations". Artificial Intelligence. 46 (1): 77–105.
Forcada, Mikel L.; Ñeco, Ramón P. (1997). "Recursive hetero-associative memories for translation". Biological and Artificial Computation: From Neuroscience to Technology: 453–462.
Castaño, Asunción; Casacuberta, Francisco (1997). A connectionist approach to machine translation. 5th European Conference on Speech Communication and Technology (Eurospeech 1997). Rhodes, Greece. pp. 91–94. doi:10.21437/Eurospeech.1997-50.
Castaño, Asunción; Casacuberta, Francisco; Vidal, Enrique (1997-07-23). Machine translation using neural networks and finite-state models. Proceedings of the 7th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages. St John's College, Santa Fe.
Yang, Shuoheng; Wang, Yuxin; Chu, Xiaowen (2020-02-18). "A Survey of Deep Learning Techniques for Neural Machine Translation". arXiv: 2002.07526 [cs.CL].
Schwenk, Holger; Dechelotte, Daniel; Gauvain, Jean-Luc (2006). Continuous Space Language Models for Statistical Machine Translation. Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Sydney, Australia. pp. 723–730.
Schwenk, Holger (2007). "Continuous space language models". Computer Speech and Language. 3 (21): 492–518. doi:10.1016/j.csl.2006.09.003.
Schwenk, Holger (2012). Continuous Space Translation Models for Phrase-Based Statistical Machine Translation. Proceedings of COLING 2012: Posters. Mumbai, India. pp. 1071–1080.
Kalchbrenner, Nal; Blunsom, Philip (2013). "Recurrent Continuous Translation Models". Proceedings of the Association for Computational Linguistics: 1700–1709.
Cho, Kyunghyun; van Merriënboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics. pp. 1724–1734. arXiv: 1406.1078 . doi:10.3115/v1/D14-1179.
Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V. (2014). "Sequence to Sequence Learning with Neural Networks". Advances in Neural Information Processing Systems. 27. Curran Associates, Inc.
Cho, Kyunghyun; van Merriënboer, Bart; Bahdanau, Dzmitry; Bengio, Yoshua (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics. pp. 103–111. arXiv: 1409.1259 . doi:10.3115/v1/W14-4012.
Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv: 1409.0473 [cs.CL].
Wang, Haifeng; Wu, Hua; He, Zhongjun; Huang, Liang; Church, Kenneth Ward (2022-11-01). "Progress in Machine Translation". Engineering. 18: 143–153. doi:10.1016/j.eng.2021.03.023.
Wu, Yonghui; Schuster, Mike; Chen, Zhifeng; Le, Quoc V.; Norouzi, Mohammad; Macherey, Wolfgang; Krikun, Maxim; Cao, Yuan; Gao, Qin; Macherey, Klaus; Klingner, Jeff; Shah, Apurva; Johnson, Melvin; Liu, Xiaobing; Kaiser, Łukasz (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv: 1609.08144 [cs.CL].
Bojar, Ondrej; Chatterjee, Rajen; Federmann, Christian; Graham, Yvette; Haddow, Barry; Huck, Matthias; Yepes, Antonio Jimeno; Koehn, Philipp; Logacheva, Varvara; Monz, Christof; Negri, Matteo; Névéol, Aurélie; Neves, Mariana; Popel, Martin; Post, Matt; Rubino, Raphael; Scarton, Carolina; Specia, Lucia; Turchi, Marco; Verspoor, Karin; Zampieri, Marcos (2016). "Findings of the 2016 Conference on Machine Translation" (PDF). ACL 2016 First Conference on Machine Translation (WMT16). The Association for Computational Linguistics: 131–198. Archived from the original (PDF) on 2018-01-27. Retrieved 2018-01-27.
Gehring, Jonas; Auli, Michael; Grangier, David; Dauphin, Yann (2017). A Convolutional Encoder Model for Neural Machine Translation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics. pp. 123–135. arXiv: 1611.02344 . doi:10.18653/v1/P17-1012.
Translator, Microsoft (2018-04-18). "Microsoft brings AI-powered translation to end users and developers, whether you're online or offline". Microsoft Translator Blog. Retrieved 2024-04-19.
Coldewey, Devin (2017-08-29). "DeepL schools other online translators with clever machine learning". TechCrunch. Retrieved 2023-12-26.
Leloup, Damien; Larousserie, David (2022-08-29). "Quel est le meilleur service de traduction en ligne?". Le Monde. Retrieved 2023-01-10.
Pakalski, Ingo (2017-08-29). "DeepL im Hands On: Neues Tool übersetzt viel besser als Google und Microsoft". Golem. Retrieved 2023-01-10.
Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017). pp. 5998–6008.
Kocmi, Tom; Bawden, Rachel; Bojar, Ondřej; Dvorkovich, Anton; Federmann, Christian; Fishel, Mark; Gowda, Thamme; Graham, Yvette; Grundkiewicz, Roman; Haddow, Barry; Knowles, Rebecca; Koehn, Philipp; Monz, Christof; Morishita, Makoto; Nagata, Masaaki (2022). Koehn, Philipp; Barrault, Loïc; Bojar, Ondřej; Bougares, Fethi; Chatterjee, Rajen; Costa-jussà, Marta R.; Federmann, Christian; Fishel, Mark; Fraser, Alexander (eds.). Findings of the 2022 Conference on Machine Translation (WMT22). Proceedings of the Seventh Conference on Machine Translation (WMT). Abu Dhabi, United Arab Emirates (Hybrid): Association for Computational Linguistics. pp. 1–45.
Kocmi, Tom; Avramidis, Eleftherios; Bawden, Rachel; Bojar, Ondřej; Dvorkovich, Anton; Federmann, Christian; Fishel, Mark; Freitag, Markus; Gowda, Thamme; Grundkiewicz, Roman; Haddow, Barry; Koehn, Philipp; Marie, Benjamin; Monz, Christof; Morishita, Makoto (2023). Koehn, Philipp; Haddow, Barry; Kocmi, Tom; Monz, Christof (eds.). Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet. Proceedings of the Eighth Conference on Machine Translation. Singapore: Association for Computational Linguistics. pp. 1–42. doi: 10.18653/v1/2023.wmt-1.1 .
Liu, Yinhan; Gu, Jiatao; Goyal, Naman; Li, Xian; Edunov, Sergey; Ghazvininejad, Marjan; Lewis, Mike; Zettlemoyer, Luke (2020). "Multilingual Denoising Pre-training for Neural Machine Translation". Transactions of the Association for Computational Linguistics. 8: 726–742. arXiv: 2001.08210 . doi:10.1162/tacl_a_00343.
Hendy, Amr; Abdelrehim, Mohamed; Sharaf, Amr; Raunak, Vikas; Gabr, Mohamed; Matsushita, Hitokazu; Kim, Young Jin; Afify, Mohamed; Awadalla, Hany (2023-02-18). "How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation". arXiv: 2302.09210 [cs.CL].
"GPT 3 dataset statistics: languages by character count". OpenAI. 2020-06-01. Retrieved 2023-12-23.
Russell, Stuart; Norvig, Peter. Artificial Intelligence: A Modern Approach (4th, global ed.). Pearson.
Federico, Marcello; Cettolo, Mauro (2007). Callison-Burch, Chris; Koehn, Philipp; Fordyce, Cameron Shaw; Monz, Christof (eds.). "Efficient Handling of N-gram Language Models for Statistical Machine Translation". Proceedings of the Second Workshop on Statistical Machine Translation. Prague, Czech Republic: Association for Computational Linguistics: 88–95.
Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (2018). Improving Language Understanding by Generative Pre-Training (PDF) (Technical report). OpenAI. Retrieved 2023-12-26.
Brown, Tom; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared D; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon (2020). "Language Models are Few-Shot Learners". Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 1877–1901.
