Attention (machine learning)

Last updated

Machine learning-based attention is a mechanism which intuitively mimics cognitive attention. It calculates "soft" weights for each word, more precisely for its embedding, in the context window. These weights can be computed either in parallel (such as in transformers) or sequentially (such as recurrent neural networks). "Soft" weights can change during each runtime, in contrast to "hard" weights, which are (pre-)trained and fine-tuned and remain frozen afterwards.

Contents

Attention was developed to address the weaknesses of leveraging information from the hidden outputs of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence is expected to be attenuated. Attention allows the calculation of the hidden representation of a token equal access to any part of a sentence directly, rather than only through the previous hidden state.

Earlier uses attached this mechanism to a serial recurrent neural network's language translation system (below), but later uses in Transformers large language models removed the recurrent neural network and relied heavily on the faster parallel attention scheme.

Predecessors

Predecessors of the mechanism were used in recurrent neural networks which, however, calculated "soft" weights sequentially and, at each step, considered the current word and other words within the context window. They were known as multiplicative modules, sigma pi units, [1] and hyper-networks. [2] They have been used in long short-term memory (LSTM) networks, multi-sensory data processing (sound, images, video, and text) in perceivers, fast weight controller's memory, [3] reasoning tasks in differentiable neural computers, and neural Turing machines. [4] [5] [6] [7] [8]

Core calculations

The attention network was designed to identify the highest correlations amongst words within a sentence, assuming that it has learned those patterns from the training corpus. This correlation is captured in neuronal weights through backpropagation, either from self-supervised pretraining or supervised fine-tuning.

The example below (a encoder-only QKV variant of an attention network) shows how correlations are identified once a network has been trained and has the right weights. When looking at the word "that" in the sentence "see that girl run", the network should be able to identify "girl" as a highly correlated word. For simplicity this example focuses on the word "that", but in reality all words receive this treatment in parallel and the resulting soft-weights and context vectors are stacked into matrices for further task-specific use.

Attention-qkv.png
The Qw and Kw sub-networks of a single "attention head" calculate the soft weights, originating from the word "that". (Encoder-only QKV variant).

The sentence is sent through 3 parallel streams (left), which emerge at the end as the context vector (right). The word embedding size is 300 and the neuron count is 100 in each sub-network of the attention head.

  • The capital letter X denotes a matrix sized 4 × 300, consisting of the embeddings of all four words.
  • The small underlined letter x denotes the embedding vector (sized 300) of the word "that".
  • The attention head includes three (vertically arranged in the illustration) sub-networks, each having 100 neurons, being Wq, Wk and Wv their respective weight matrices, all them sized 300 × 100.
  • q (from "query") is a vector sized 100, Wk ("key") and Wv ("value") are 4x100 matrices.
  • The asterisk within parenthesis "(*)" denotes the softmax( qWk / 100 ). Softmax result is a vector sized 4 that later on is multiplied by the matrix V=XWv to obtain the context vector.
  • Rescaling by 100 prevents a high variance in qWkT that would allow a single word to excessively dominate the softmax resulting in attention to only one word, as a discrete hard max would do.

Notation: the commonly written row-wise softmax formula above assumes that vectors are rows, which contradicts the standard math notation of column vectors. More correctly, we should take the transpose of the context vector and use the column-wise softmax, resulting in the more correct form

Context = (XWv)T × softmax( (Wk XT) × (xWq)T / 100 ).

The query vector is compared (via dot product) with each word in the keys. This helps the model discover the most relevant word for the query word. In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words. [5]

The structure of the input data is captured in the Wq and Wk weights, and the Wv weights express that structure in terms of more meaningful features for the task being trained for. For this reason, the attention head components are called Query (Wq), Key (Wk), and Value (Wv)—a loose and possibly misleading analogy with relational database systems.

Note that the context vector for "that" does not rely on context vectors for the other words; therefore the context vectors of all words can be calculated using the whole matrix X, which includes all the word embeddings, instead of a single word's embedding vector x in the formula above, thus parallelizing the calculations. Now, the softmax can be interpreted as a matrix softmax acting on separate rows. This is a huge advantage over recurrent networks which must operate sequentially.

The common query-key analogy with database queries suggests an asymmetric role for these vectors, where one item of interest (the query) is matched against all possible items (the keys). However, parallel calculations matches all words of the sentence with itself; therefore the roles of these vectors are symmetric. Possibly because the simplistic database analogy is flawed, much effort has gone into understand Attention further by studying their roles in focused settings, such as in-context learning, [9] masked language tasks, [10] stripped down transformers, [11] bigram statistics, [12] pairwise convolutions, [13] and arithmetic factoring. [14]

A language translation example

To build a machine that translates English to French, an attention unit is grafted to the basic Encoder-Decoder (diagram below). In the simplest case, the attention unit consists of dot products of the recurrent encoder states and does not need training. In practice, the attention unit consists of 3 trained, fully-connected neural network layers called query, key, and value.

A step-by-step sequence of a language translation.
Attention-1-sn.png
Encoder-decoder with attention. [15] The left part (black lines) is the encoder-decoder, the middle part (orange lines) is the attention unit, and the right part (in grey & colors) is the computed data. Grey regions in H matrix and w vector are zero values. Numerical subscripts indicate vector sizes while lettered subscripts i and i − 1 indicate time steps.
Legend
LabelDescription
100Max. sentence length
300 Embedding size (word dimension)
500Length of hidden vector
9k, 10kDictionary size of input & output languages respectively.
x, Y9k and 10k 1-hot dictionary vectors. x → x implemented as a lookup table rather than vector multiplication. Y is the 1-hot maximizer of the linear Decoder layer D; that is, it takes the argmax of D's linear layer output.
x300-long word embedding vector. The vectors are usually pre-calculated from other projects such as GloVe or Word2Vec.
h500-long encoder hidden vector. At each point in time, this vector summarizes all the preceding words before it. The final h can be viewed as a "sentence" vector, or a thought vector as Hinton calls it.
s500-long decoder hidden state vector.
E500 neuron recurrent neural network encoder. 500 outputs. Input count is 800–300 from source embedding + 500 from recurrent connections. The encoder feeds directly into the decoder only to initialize it, but not thereafter; hence, that direct connection is shown very faintly.
D2-layer decoder. The recurrent layer has 500 neurons and the fully-connected linear layer has 10k neurons (the size of the target vocabulary). [16] The linear layer alone has 5 million (500 × 10k) weights – ~10 times more weights than the recurrent layer.
score100-long alignment score
w100-long vector attention weight. These are "soft" weights which changes during the forward pass, in contrast to "hard" neuronal weights that change during the learning phase.
AAttention module – this can be a dot product of recurrent states, or the query-key-value fully-connected layers. The output is a 100-long vector w.
H500×100. 100 hidden vectors h concatenated into a matrix
c500-long context vector = H * w. c is a linear combination of h vectors weighted by w.

Viewed as a matrix, the attention weights show how the network adjusts its focus according to context. [17]

Iloveyou
je0.940.020.04
t'0.110.010.88
aime0.030.950.02

This view of the attention weights addresses the neural network "explainability" problem. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. On the first pass through the decoder, 94% of the attention weight is on the first English word "I", so the network offers the word "je". On the second pass of the decoder, 88% of the attention weight is on the third English word "you", so it offers "t'". On the last pass, 95% of the attention weight is on the second English word "love", so it offers "aime".

Variants

Many variants of attention implement soft weights, such as

For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention, [23] channel attention, [24] or combinations. [25] [26]

These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in Core Calculations section above.

1. encoder-decoder dot product2. encoder-decoder QKV3. encoder-only dot product4. encoder-only QKV5. Pytorch tutorial
Both encoder & decoder are needed to calculate attention. Attn-xy-dot.png
Both encoder & decoder are needed to calculate attention.
Both encoder & decoder are needed to calculate attention. Attn-xy-qkv.png
Both encoder & decoder are needed to calculate attention.
Decoder is not used to calculate attention. With only 1 input into corr, W is an auto-correlation of dot products. wij = xi xj Attn-xx-dot.png
Decoder is not used to calculate attention. With only 1 input into corr, W is an auto-correlation of dot products. wij = xi xj
Decoder is not used to calculate attention. Attn-xx-qkv.png
Decoder is not used to calculate attention.
A fully-connected layer is used to calculate attention instead of dot product correlation. Attn-pytorch-tutorial.png
A fully-connected layer is used to calculate attention instead of dot product correlation.
Legend
LabelDescription
Variables X, H, S, TUpper case variables represent the entire sentence, and not just the current word. For example, H is a matrix of the encoder hidden state—one word per column.
S, TS, decoder hidden state; T, target word embedding. In the Pytorch Tutorial variant training phase, T alternates between 2 sources depending on the level of teacher forcing used. T could be the embedding of the network's output word; i.e. embedding(argmax(FC output)). Alternatively with teacher forcing, T could be the embedding of the known correct word which can occur with a constant forcing probability, say 1/2.
X, HH, encoder hidden state; X, input word embeddings.
WAttention coefficients
Qw, Kw, Vw, FCWeight matrices for query, key, value respectively. FC is a fully-connected weight matrix.
⊕, ⊗⊕, vector concatenation; ⊗, matrix multiplication.
corrColumn-wise softmax(matrix of all combinations of dot products). The dot products are xi * xj in variant #3, hi* sj in variant 1, and column i ( Kw * H ) * column j ( Qw * S ) in variant 2, and column i ( Kw * X ) * column j ( Qw * X ) in variant 4. Variant 5 uses a fully-connected layer to determine the coefficients. If the variant is QKV, then the dot products are normalized by the d where d is the height of the QKV matrices.

Mathematical representation

Standard Scaled Dot-Product Attention

where are the query, key, and value matrices, is the dimension of the keys. Value vectors in matrix are weighted using the weights resulting from the softmax operation.

Multi-Head Attention

where each head is computed as:

and , and are parameter matrices.

Bahdanau (Additive) Attention

where and and are learnable weight matrices. [17]

Luong Attention (General)

where is a learnable weight matrix. [21]

See also

Related Research Articles

A recurrent neural network (RNN) is one of the two broad types of artificial neural network, characterized by direction of the flow of information between its layers. In contrast to the uni-directional feedforward neural network, it is a bi-directional artificial neural network, meaning that it allows the output from some nodes to affect subsequent input to the same nodes. Their ability to use internal state (memory) to process arbitrary sequences of inputs makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. The term "recurrent neural network" is used to refer to the class of networks with an infinite impulse response, whereas "convolutional neural network" refers to the class of finite impulse response. Both classes of networks exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that can not be unrolled.

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

<span class="mw-page-title-main">Long short-term memory</span> Artificial recurrent neural network architecture used in deep learning

Long short-term memory (LSTM) network is a recurrent neural network (RNN), aimed at dealing with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models and other sequence learning methods. It aims to provide a short-term memory for RNN that can last thousands of timesteps, thus "long short-term memory". It is applicable to classification, processing and predicting data based on time series, such as in handwriting, speech recognition, machine translation, speech activity detection, robot control, video games, and healthcare.

There are many types of artificial neural networks (ANN).

<span class="mw-page-title-main">Feature learning</span> Set of learning techniques in machine learning

In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

Convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns feature engineering by itself via filters optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 neurons are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.

Multimodal learning, in the context of machine learning, is a type of deep learning using a combination of various modalities of data, such as text, audio, or images, in order to create a more robust model of the real-world phenomena in question. In contrast, singular modal learning would analyze text or imaging data independently. Multimodal machine learning combines these fundamentally different statistical analyses using specialized modeling strategies and algorithms, resulting in a model that comes closer to representing the real world.

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov and colleagues at Google and published in 2013.

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.

Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions. It differs from ensemble techniques in that for MoE, typically only one or a few expert models are run for each input, whereas in ensemble techniques, all models are run on every input.

<span class="mw-page-title-main">Residual neural network</span> Deep learning method

A residual neural network is a deep learning model in which the weight layers learn residual functions with reference to the layer inputs. It behaves like a highway network whose gates are opened through strongly positive bias weights. This enables deep learning models with tens or hundreds of layers to train easily and approach better accuracy when going deeper. The identity skip connections, often referred to as "residual connections", are also used in the 1997 LSTM networks, Transformer models, the AlphaGo Zero system, the AlphaStar system, and the AlphaFold system.

A capsule neural network (CapsNet) is a machine learning system that is a type of artificial neural network (ANN) that can be used to better model hierarchical relationships. The approach is an attempt to more closely mimic biological neural organization.

A Siamese neural network is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints but can be described more technically as a distance function for locality-sensitive hashing.

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Machine learning algorithm used for natural-language processing

A transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism, proposed in a 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via looking up from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer paper, published in 2017, is based on the softmax-based attention mechanism proposed by Bahdanau et. al. in 2014 for machine translation, and the Fast Weight Controller, similar to a transformer, proposed in 1992.

Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state of the art models. It was introduced in October 2018 by researchers at Google. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."

A Neural Network Gaussian Process (NNGP) is a Gaussian process (GP) obtained as the limit of a certain type of sequence of neural networks. Specifically, a wide variety of network architectures converges to a GP in the infinitely wide limit, in the sense of distribution. The concept constitutes an intensional definition, i.e., a NNGP is just a GP, but distinguished by how it is obtained.

<span class="mw-page-title-main">Knowledge graph embedding</span> Dimensionality reduction of graph-based semantic data objects [machine learning task]

In representation learning, knowledge graph embedding (KGE), also referred to as knowledge representation learning (KRL), or multi-relation learning, is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning. Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction.

A graph neural network (GNN) belongs to a class of artificial neural networks for processing data that can be represented as graphs.

A vision transformer (ViT) is a transformer designed for computer vision. A ViT breaks down an input image into a series of patches, serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

Tensor informally refers in machine learning to two different concepts that organize and represent data. Data may be organized in a multidimensional array (M-way array) that is informally referred to as a "data tensor"; however in the strict mathematical sense, a tensor is a multilinear mapping over a set of domain vector spaces to a range vector space. Observations, such as images, movies, volumes, sounds, and relationships among words and concepts, stored in an M-way array ("data tensor") may be analyzed either by artificial neural networks or tensor methods.

References

  1. Rumelhart, David E.; Mcclelland, James L.; Group, PDP Research (1987-07-29). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2 (PDF). Cambridge, Mass: Bradford Books. ISBN   978-0-262-68053-0.
  2. Yann Lecun (2020). Deep Learning course at NYU, Spring 2020, video lecture Week 6. Event occurs at 53:00. Retrieved 2022-03-08.
  3. 1 2 Schmidhuber, Jürgen (1992). "Learning to control fast-weight memories: an alternative to recurrent nets". Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. S2CID   16683347.
  4. Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward; Ramalho, Tiago; Agapiou, John; Badia, Adrià Puigdomènech; Hermann, Karl Moritz; Zwols, Yori; Ostrovski, Georg; Cain, Adam; King, Helen; Summerfield, Christopher; Blunsom, Phil; Kavukcuoglu, Koray; Hassabis, Demis (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. Bibcode:2016Natur.538..471G. doi:10.1038/nature20101. ISSN   1476-4687. PMID   27732574. S2CID   205251479.
  5. 1 2 Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  6. Ramachandran, Prajit; Parmar, Niki; Vaswani, Ashish; Bello, Irwan; Levskaya, Anselm; Shlens, Jonathon (2019-06-13). "Stand-Alone Self-Attention in Vision Models". arXiv: 1906.05909 [cs.CV].
  7. Jaegle, Andrew; Gimeno, Felix; Brock, Andrew; Zisserman, Andrew; Vinyals, Oriol; Carreira, Joao (2021-06-22). "Perceiver: General Perception with Iterative Attention". arXiv: 2103.03206 [cs.CV].
  8. Ray, Tiernan. "Google's Supermodel: DeepMind Perceiver is a step on the road to an AI machine that could process anything and everything". ZDNet. Retrieved 2021-08-19.
  9. Zhang, Ruiqi (2024). "Trained Transformers Learn Linear Models In-Context" (PDF). Journal of Machine Learning Research 1-55. 25.
  10. Rende, Riccardo (2023). "Mapping of attention mechanisms to a generalized Potts model". arXiv: 2304.07235 .
  11. He, Bobby (2023). "Simplifying Transformers Blocks". arXiv: 2311.01906 .
  12. "Transformer Circuits". transformer-circuits.pub.
  13. Transformer Neural Network Derived From Scratch. 2023. Event occurs at 05:30. Retrieved 2024-04-07.
  14. Charton, François (2023). "Learning the Greatest Common Divisor: Explaining Transformer Predictions". arXiv: 2308.15594 .
  15. Britz, Denny; Goldie, Anna; Luong, Minh-Thanh; Le, Quoc (2017-03-21). "Massive Exploration of Neural Machine Translation Architectures". arXiv: 1703.03906 [cs.CV].
  16. "Pytorch.org seq2seq tutorial" . Retrieved December 2, 2021.
  17. 1 2 3 Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv: 1409.0473 [cs.CL].
  18. Schmidhuber, Jürgen (1993). "Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets". ICANN 1993. Springer. pp. 460–463.
  19. Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
  20. Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz; Kaiser, Lukasz; Belanger, David; Colwell, Lucy; Weller, Adrian (2020). "Rethinking Attention with Performers". arXiv: 2009.14794 [cs.CL].
  21. 1 2 3 Luong, Minh-Thang (2015-09-20). "Effective Approaches to Attention-Based Neural Machine Translation". arXiv: 1508.04025v5 [cs.CL].
  22. "Papers with Code - A Decomposable Attention Model for Natural Language Inference". paperswithcode.com.
  23. Zhu, Xizhou; Cheng, Dazhi; Zhang, Zheng; Lin, Stephen; Dai, Jifeng (2019). "An Empirical Study of Spatial Attention Mechanisms in Deep Networks". 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6687–6696. arXiv: 1904.05873 . doi:10.1109/ICCV.2019.00679. ISBN   978-1-7281-4803-8. S2CID   118673006.
  24. Hu, Jie; Shen, Li; Sun, Gang (2018). "Squeeze-and-Excitation Networks". 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7132–7141. arXiv: 1709.01507 . doi:10.1109/CVPR.2018.00745. ISBN   978-1-5386-6420-9. S2CID   206597034.
  25. Woo, Sanghyun; Park, Jongchan; Lee, Joon-Young; Kweon, In So (2018-07-18). "CBAM: Convolutional Block Attention Module". arXiv: 1807.06521 [cs.CV].
  26. Georgescu, Mariana-Iuliana; Ionescu, Radu Tudor; Miron, Andreea-Iuliana; Savencu, Olivian; Ristea, Nicolae-Catalin; Verga, Nicolae; Khan, Fahad Shahbaz (2022-10-12). "Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution". arXiv: 2204.04218 [eess.IV].
  27. Neil Rhodes (2021). CS 152 NN—27: Attention: Keys, Queries, & Values. Event occurs at 06:30. Retrieved 2021-12-22.
  28. Alfredo Canziani & Yann Lecun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 05:30. Retrieved 2021-12-22.
  29. Alfredo Canziani & Yann Lecun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 20:15. Retrieved 2021-12-22.
  30. Robertson, Sean. "NLP From Scratch: Translation With a Sequence To Sequence Network and Attention". pytorch.org. Retrieved 2021-12-22.