Attention is a machine learning method that determines the relative importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.
Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serial recurrent neural network (RNN) language translation system, but a more recent design, namely the transformer, removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme.
Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of leveraging information from the hidden layers of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence tends to be attenuated. Attention allows a token equal access to any part of a sentence directly, rather than only through the previous state.
Year | Development
---|---
1950s–1960s | Psychology and biology of attention: the cocktail party effect [1] (focusing on content by filtering out background noise), the filter model of attention, [2] the partial report paradigm, and saccade control. [3] 1965: Group Method of Data Handling [4] [5] (Kolmogorov–Gabor polynomials implement multiplicative units or "gates" [6]).
1980s | Sigma-pi units, [7] higher-order neural networks, [8] the Neocognitron and its variants. [9] [10]
1990s | Fast weight controllers. [11] [12] [13] [14] Neuron weights generate fast "dynamic links" similar to keys and values. [15]
2014 | RNN + attention. [16] An attention network was grafted onto an RNN encoder-decoder to improve language translation of long sentences. See the Overview section.
2015 | Attention applied to images. [17] [18] [19]
2017 | Transformers [20] = attention + position encoding + MLP + skip connections. This design improved accuracy and removed the sequential disadvantages of the RNN.
Academic reviews of the history of the attention mechanism are provided in Niu et al. [21] and Soydaner. [22]
The modern era of machine attention was revitalized by grafting an attention mechanism (Fig 1. orange) to an Encoder-Decoder.
Fig 1. Encoder-decoder with attention. [23] Numerical subscripts (100, 300, 500, 9k, 10k) indicate vector sizes while lettered subscripts i and i − 1 indicate time steps. Pinkish regions in the H matrix and the w vector are zero values. See the Legend for details.
Figure 2 shows the internal step-by-step operation of the attention block (A) in Fig 1.
This attention scheme has been compared to the Query-Key analogy of relational databases. That comparison suggests an asymmetric role for the Query and Key vectors, where one item of interest (the Query vector "that") is matched against all possible items (the Key vectors of each word in the sentence). However, the parallel calculations of both self- and cross-attention match every token of the K matrix against every token of the Q matrix, so the roles of these vectors are in fact symmetric. Possibly because the simplistic database analogy is flawed, much effort has gone into understanding attention mechanisms further by studying their roles in focused settings, such as in-context learning, [25] masked language tasks, [26] stripped-down transformers, [27] bigram statistics, [28] N-gram statistics, [29] pairwise convolutions, [30] and arithmetic factoring. [31]
In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced.
Consider an example of translating I love you to French. On the first pass through the decoder, 94% of the attention weight is on the first English word I, so the network offers the word je. On the second pass of the decoder, 88% of the attention weight is on the third English word you, so it offers t'. On the last pass, 95% of the attention weight is on the second English word love, so it offers aime.
In the I love you example, the second word love is aligned with the third word aime. Stacking soft row vectors together for je, t', and aime yields an alignment matrix:
 | I | love | you
---|---|---|---
je | 0.94 | 0.02 | 0.04 |
t' | 0.11 | 0.01 | 0.88 |
aime | 0.03 | 0.95 | 0.02 |
Sometimes, alignment can be multiple-to-multiple. For example, the English phrase look it up corresponds to cherchez-le. Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there may not be a best hidden vector.
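As a minimal sketch of how these soft weights are used, the snippet below forms each decoder step's context vector as a weighted sum of the encoder hidden vectors, using the alignment matrix above; the 4-dimensional hidden vectors are invented for illustration.

```python
import numpy as np

# Hypothetical encoder hidden vectors for "I", "love", "you" (one row per word).
H = np.array([[0.1, 0.5, -0.3, 0.2],   # I
              [0.4, -0.2, 0.7, 0.1],   # love
              [-0.5, 0.3, 0.2, 0.6]])  # you

# Soft attention weights from the alignment matrix above (one row per output word).
W = np.array([[0.94, 0.02, 0.04],   # je
              [0.11, 0.01, 0.88],   # t'
              [0.03, 0.95, 0.02]])  # aime

# Each decoder step receives a context vector: a weighted sum of all hidden vectors.
context = W @ H          # shape (3, 4): one context vector per output word

# A "hard" alternative would pick only the single best hidden vector per step,
# discarding information from the rest of the sentence.
hard = H[np.argmax(W, axis=1)]
```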
Many variants of attention implement soft weights, such as the variants described in the sections below.
For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention, [35] channel attention, [36] or combinations. [37] [38]
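As a hedged illustration of channel attention (a schematic squeeze-and-excitation-style gate, not any specific published module), the sketch below re-weights each channel of a CNN feature map by a scalar computed from its global average; the layer sizes and weights are invented for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feature_map, W1, W2):
    """Schematic channel attention for a feature map of shape (channels, height, width).
    W1, W2 are the weights of a small two-layer gating MLP."""
    # "Squeeze": global average pool each channel down to one scalar.
    squeezed = feature_map.mean(axis=(1, 2))              # (channels,)
    # "Excitation": a small MLP produces one gate per channel in [0, 1].
    gates = sigmoid(W2 @ np.maximum(W1 @ squeezed, 0.0))  # (channels,)
    # Re-weight each channel of the feature map by its gate.
    return feature_map * gates[:, None, None]

# Toy usage with random weights (purely illustrative).
rng = np.random.default_rng(0)
C = 8
fmap = rng.normal(size=(C, 16, 16))
W1 = rng.normal(size=(C // 2, C))
W2 = rng.normal(size=(C, C // 2))
out = channel_attention(fmap, W1, W2)
```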
These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in the Core Calculations section above.
Label | Description |
---|---|
Variables X, H, S, T | Upper case variables represent the entire sentence, and not just the current word. For example, H is a matrix of the encoder hidden state—one word per column. |
S, T | S, decoder hidden state; T, target word embedding. In the Pytorch Tutorial variant training phase, T alternates between 2 sources depending on the level of teacher forcing used. T could be the embedding of the network's output word; i.e. embedding(argmax(FC output)). Alternatively with teacher forcing, T could be the embedding of the known correct word which can occur with a constant forcing probability, say 1/2. |
X, H | H, encoder hidden state; X, input word embeddings. |
W | Attention coefficients |
Qw, Kw, Vw, FC | Qw, Kw, Vw: weight matrices for the query, key, and value vectors, respectively. FC: a fully-connected weight matrix.
⊕, ⊗ | ⊕, vector concatenation; ⊗, matrix multiplication. |
corr | Column-wise softmax(matrix of all combinations of dot products). The dot products are x_i * x_j in variant #3, h_i * s_j in variant #1, column_i(Kw * H) * column_j(Qw * S) in variant #2, and column_i(Kw * X) * column_j(Qw * X) in variant #4. Variant #5 uses a fully-connected layer to determine the coefficients. If the variant is QKV, the dot products are normalized by √d, where d is the height of the QKV matrices.
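As a concrete toy illustration of the corr entry for one of the variants, the sketch below computes variant-2-style coefficients: all dot products between the projected encoder columns (Kw * H) and the projected decoder columns (Qw * S), followed by a column-wise softmax. Dimensions and random weights are made up for illustration.

```python
import numpy as np

def column_softmax(scores):
    # Normalize each column so the weights over encoder positions sum to 1.
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# Toy dimensions: d = hidden size, n = source words, m = decoder steps.
rng = np.random.default_rng(0)
d, n, m = 4, 5, 3
H = rng.normal(size=(d, n))   # encoder hidden states, one word per column
S = rng.normal(size=(d, m))   # decoder hidden states, one step per column
Kw = rng.normal(size=(d, d))  # key projection
Qw = rng.normal(size=(d, d))  # query projection

# Variant-2-style scores: all dot products between projected encoder columns
# (keys) and projected decoder columns (queries).
scores = (Kw @ H).T @ (Qw @ S)   # shape (n, m)
W = column_softmax(scores)       # attention coefficients, each column sums to 1

# The weights then recombine the encoder-side vectors for each decoder step.
context = H @ W                  # shape (d, m)
```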
The size of the attention matrix is proportional to the square of the number of input tokens. Therefore, when the input is long, calculating the attention matrix requires a lot of GPU memory. Flash attention is an implementation that reduces the memory needs and increases efficiency without sacrificing accuracy. It achieves this by partitioning the attention computation into smaller blocks that fit into the GPU's faster on-chip memory, reducing the need to store large intermediate matrices and thus lowering memory usage while increasing computational efficiency. [43]
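The following NumPy sketch illustrates the block-wise idea (tiling plus a running, "online" softmax) rather than the actual FlashAttention GPU kernel; the block size and shapes are arbitrary assumptions.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Computes softmax(Q K^T / sqrt(d)) V without materializing the full n-by-n
    score matrix: keys/values are processed block by block with an online softmax.
    A sketch of the memory-efficient idea, not an optimized kernel."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, V.shape[1]))
    row_max = np.full(n, -np.inf)   # running maximum score per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = (Q @ Kb.T) * scale                    # only an (n, block) tile
        new_max = np.maximum(row_max, scores.max(axis=1))
        rescale = np.exp(row_max - new_max)            # re-weight earlier partial sums
        p = np.exp(scores - new_max[:, None])
        out = out * rescale[:, None] + p @ Vb
        row_sum = row_sum * rescale + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against the direct (quadratic-memory) formula.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(128, 16)) for _ in range(3))
scores = Q @ K.T / np.sqrt(16)
p = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (p / p.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V, block=32), reference)
```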
For matrices $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$, the scaled dot-product, or QKV attention is defined as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}}\right)V \in \mathbb{R}^{m \times d_v}$$

where $\mathrm{T}$ denotes transpose and the softmax function is applied independently to every row of its argument. The matrix $Q$ contains $m$ queries, while matrices $K, V$ jointly contain an unordered set of $n$ key-value pairs. Value vectors in matrix $V$ are weighted using the weights resulting from the softmax operation, so that the rows of the $m$-by-$d_v$ output matrix are confined to the convex hull of the points in $\mathbb{R}^{d_v}$ given by the rows of $V$.
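A direct NumPy transcription of this definition may help; the toy shapes (m = 2 queries, n = 5 key-value pairs, widths 8 and 4) are assumptions for illustration.

```python
import numpy as np

def softmax_rows(x):
    # Softmax applied independently to every row of its argument.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product (QKV) attention.
    Q: (m, d_k) queries; K: (n, d_k) keys; V: (n, d_v) values.
    Returns an (m, d_v) matrix whose rows are convex combinations of the rows of V."""
    d_k = Q.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))  # (m, n), each row sums to 1
    return weights @ V                               # (m, d_v)

# Toy usage.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # m = 2 queries
K = rng.normal(size=(5, 8))   # n = 5 keys
V = rng.normal(size=(5, 4))   # n = 5 values of width d_v = 4
out = attention(Q, K, V)      # shape (2, 4)
```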
To understand the permutation invariance and permutation equivariance properties of QKV attention, [44] let $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{n \times n}$ be permutation matrices, and $D \in \mathbb{R}^{m \times n}$ an arbitrary matrix. The softmax function is permutation equivariant in the sense that:

$$\text{softmax}(ADB) = A\,\text{softmax}(D)\,B$$
By noting that the transpose of a permutation matrix is also its inverse, it follows that:

$$\text{Attention}(AQ, BK, BV) = A\,\text{Attention}(Q, K, V)$$
which shows that QKV attention is equivariant with respect to re-ordering the queries (rows of $Q$), and invariant to re-ordering of the key-value pairs in $K$ and $V$. These properties are inherited when applying linear transforms to the inputs and outputs of QKV attention blocks. For example, a simple self-attention function defined as:

$$X \mapsto \text{Attention}(XW_Q, XW_K, XW_V)$$

is permutation equivariant with respect to re-ordering the rows of the input matrix $X$ in a non-trivial way, because every row of the output is a function of all the rows of the input. Similar properties hold for multi-head attention, which is defined below.
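These properties can be verified numerically; the sketch below permutes the queries and the key-value pairs of a toy QKV attention and checks equivariance and invariance.

```python
import numpy as np

def attention(Q, K, V):
    # Row-wise softmax of scaled dot products, then weighted sum of values.
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
m, n, d = 4, 6, 8
Q, K, V = rng.normal(size=(m, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))

A = np.eye(m)[rng.permutation(m)]   # permutation of the queries
B = np.eye(n)[rng.permutation(n)]   # permutation of the key-value pairs

# Equivariant in the queries: permuting Q permutes the output rows the same way.
assert np.allclose(attention(A @ Q, K, V), A @ attention(Q, K, V))

# Invariant in the key-value pairs: permuting K and V together changes nothing.
assert np.allclose(attention(Q, B @ K, B @ V), attention(Q, K, V))
```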
When QKV attention is used as a building block for an autoregressive decoder, and when at training time all input and output matrices have $n$ rows, a masked attention variant is used:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}} + M\right)V$$

where the mask, $M \in \mathbb{R}^{n \times n}$, is a strictly upper triangular matrix, with zeros on and below the diagonal and $-\infty$ in every element above the diagonal. The softmax output, also in $\mathbb{R}^{n \times n}$, is then lower triangular, with zeros in all elements above the diagonal. The masking ensures that for all $1 \le i < j \le n$, row $i$ of the attention output is independent of row $j$ of any of the three input matrices. The permutation invariance and equivariance properties of standard QKV attention do not hold for the masked variant.
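A minimal sketch of the masked variant, with the mask built exactly as described (zeros on and below the diagonal, $-\infty$ above); the shapes are illustrative.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Causally masked QKV attention: position i may only attend to positions <= i.
    All three inputs are assumed to have the same number of rows n."""
    n, d_k = Q.shape
    # Strictly upper-triangular mask: 0 on/below the diagonal, -inf above it.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = Q @ K.T / np.sqrt(d_k) + mask
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = p / p.sum(axis=-1, keepdims=True)   # lower triangular after softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = masked_attention(X, X, X)
# Row 0 of the output depends only on row 0 of the inputs, row 1 on rows 0-1, and so on.
```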
Multi-head attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$

where each head is computed with QKV attention as:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

and $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are parameter matrices.
The permutation properties of (standard, unmasked) QKV attention apply here also. For permutation matrices $A$, $B$:

$$\text{MultiHead}(AQ, BK, BV) = A\,\text{MultiHead}(Q, K, V)$$
from which we also see that multi-head self-attention:

$$X \mapsto \text{MultiHead}(X, X, X)$$

is equivariant with respect to re-ordering of the rows of input matrix $X$.
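A compact NumPy sketch of multi-head attention as defined above; the model width, head count, and random parameter matrices are assumptions for illustration.

```python
import numpy as np

def attention(Q, K, V):
    # Standard scaled dot-product attention (row-wise softmax).
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """Each head projects Q, K, V with its own parameter matrices, runs QKV
    attention, and the concatenated head outputs are mixed by W_O."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

# Toy setup: model width 16 split across 4 heads of width 4 (sizes are illustrative).
rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
d_head = d_model // h
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_O = rng.normal(size=(h * d_head, d_model))

out = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)   # multi-head self-attention
```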
Bahdanau (additive) attention computes the scores with a small feed-forward network, $e_{ij} = v_a^\mathrm{T} \tanh(W_Q q_i + W_K k_j)$, giving $\text{Attention}(Q, K, V) = \text{softmax}(e)\,V$, where $W_Q$ and $W_K$ are learnable weight matrices (and $v_a$ a learnable vector). [16]
Luong (general) attention scores each query-key pair bilinearly, $\text{Attention}(Q, K, V) = \text{softmax}(Q W_a K^\mathrm{T})\,V$, where $W_a$ is a learnable weight matrix. [32]
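The two scoring schemes can be sketched side by side; the additive version below uses the classic feed-forward score with a learnable vector v_a, and all shapes and weights are invented for illustration.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(Q, K, V, W_q, W_k, v_a):
    """Bahdanau-style (additive) scoring: score(q_i, k_j) = v_a . tanh(W_q q_i + W_k k_j)."""
    hidden = np.tanh((Q @ W_q.T)[:, None, :] + (K @ W_k.T)[None, :, :])  # (m, n, h)
    scores = hidden @ v_a                                                # (m, n)
    return softmax_rows(scores) @ V

def general_attention(Q, K, V, W_a):
    """Luong-style (general / bilinear) scoring: score matrix Q W_a K^T."""
    return softmax_rows(Q @ W_a @ K.T) @ V

# Toy usage with random inputs and parameters.
rng = np.random.default_rng(0)
m, n, d, h = 3, 5, 8, 16
Q, K, V = rng.normal(size=(m, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
out_add = additive_attention(Q, K, V, rng.normal(size=(h, d)), rng.normal(size=(h, d)),
                             rng.normal(size=h))
out_gen = general_attention(Q, K, V, rng.normal(size=(d, d)))
```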
Self-attention is essentially the same as cross-attention, except that the query, key, and value vectors are all derived from the same sequence of hidden states. Both encoder and decoder can use self-attention, but with subtle differences.
For encoder self-attention, we can start with a simple encoder without self-attention, such as an "embedding layer", which simply converts each input word into a vector by a fixed lookup table. This gives a sequence of hidden vectors $h_1, h_2, \dots, h_n$, collected as the rows of a matrix $H$. These can then be applied to a dot-product attention mechanism, to obtain

$$h_i' = \text{Attention}(h_i W_Q, H W_K, H W_V)$$

or more succinctly, $H' = \text{Attention}(H W_Q, H W_K, H W_V)$. This can be applied repeatedly, to obtain a multilayered encoder. This is the "encoder self-attention", sometimes called the "all-to-all attention", as the vector at every position can attend to every other.
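A minimal sketch of such a multilayered encoder: token ids are looked up in a fixed embedding table and then passed through repeated all-to-all self-attention layers. The vocabulary size, width, and random weights are assumptions for illustration.

```python
import numpy as np

def attention(Q, K, V):
    # Standard scaled dot-product attention (row-wise softmax).
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
vocab_size, d, n_layers = 100, 16, 3

# "Embedding layer": a fixed lookup table mapping token ids to vectors.
embedding_table = rng.normal(size=(vocab_size, d))
token_ids = np.array([12, 7, 55, 3])
H = embedding_table[token_ids]          # initial hidden vectors, one row per token

# Repeatedly applying all-to-all self-attention gives a multilayered encoder.
for _ in range(n_layers):
    W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
    H = attention(H @ W_Q, H @ W_K, H @ W_V)   # H' = Attention(H W_Q, H W_K, H W_V)
```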
For decoder self-attention, all-to-all attention is inappropriate, because during the autoregressive decoding process, the decoder cannot attend to future outputs that have yet to be decoded. This can be solved by forcing the attention weights $w_{ij} = 0$ for all $i < j$, called "causal masking". This attention mechanism is the "causally masked self-attention".