Attention (machine learning)

Attention mechanism, overview
Attention mechanism with attention weights, overview

Attention is a machine learning method that determines the importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.


Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serial recurrent neural network language translation system, but the later transformer design removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme.

Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of leveraging information from the hidden layers of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence tends to be attenuated. Attention allows a token equal access to any part of a sentence directly, rather than only through the previous state.

History

Academic reviews of the history of the attention mechanism are provided in Niu et al. [1] and Soydaner. [2]

Predecessors

Selective attention in humans had been well studied in neuroscience and cognitive psychology. [3] In 1953, Colin Cherry studied selective attention in the context of audition, known as the cocktail party effect. [4]

In 1958, Donald Broadbent proposed the filter model of attention. [5] Selective attention in vision was studied in the 1960s using George Sperling's partial report paradigm. It was also noticed that saccade control is modulated by cognitive processes, insofar as the eye moves preferentially towards areas of high salience. As the fovea of the eye is small, the eye cannot sharply resolve the entire visual field at once. The use of saccade control allows the eye to quickly scan important features of a scene. [6]

These research developments inspired algorithms such as the Neocognitron and its variants. [7] [8] Meanwhile, developments in neural networks had inspired circuit models of biological visual attention. [9] [2] One well-cited network from 1998, for example, was inspired by the low-level primate visual system. It produced saliency maps of images using handcrafted (not learned) features, which were then used to guide a second neural network in processing patches of the image in order of decreasing saliency. [10]

A key aspect of the attention mechanism can be written (schematically) as $\langle \text{query}, \text{key} \rangle$, where the angled brackets denote the dot product. This shows that it involves a multiplicative operation. Multiplicative operations within artificial neural networks had been studied under the names of Group Method of Data Handling (1965) [11] [12] (where Kolmogorov-Gabor polynomials implement multiplicative units or "gates" [13]), higher-order neural networks, [14] multiplication units, [15] sigma-pi units, [16] fast weight controllers, [17] and hyper-networks. [18]

Linearized attention

Jürgen Schmidhuber's fast weight controller (1992) [17] implements what was later called "linearized attention" or "linear attention" by Angelos Katharopoulos et al. (2020). [19] [20] One of its two networks has "fast weights" or "dynamic links" (1981). [21] [22] [23] A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries. [17] This was later shown to be equivalent to the unnormalized linear Transformer. [20] A follow-up paper (1993) on a similar system with "active weight changing capabilities" states: "Weights to units that are not illuminated by adaptive internal spotlights of attention essentially remain invariant." [24]

Recurrent attention

During the deep learning era, the attention mechanism was developed to solve similar problems in encoder-decoder models. [1]

In machine translation, the seq2seq model, as it was proposed in 2014, [25] would encode an input text into a fixed-length vector, which would then be decoded into an output text. If the input text is long, the fixed-length vector would be unable to carry enough information for accurate decoding. An attention mechanism was proposed to solve this problem.

An image captioning model was proposed in 2015, citing inspiration from the seq2seq model, that would encode an input image into a fixed-length vector. [26] Xu et al. (2015), [27] citing Bahdanau et al. (2014), [28] applied the attention mechanism as used in the seq2seq model to image captioning.

Transformer

One problem with seq2seq models was their use of recurrent neural networks, which are not parallelizable, as both the encoder and the decoder must process the sequence token-by-token. Decomposable attention [29] attempted to solve this problem by processing the input sequence in parallel before computing a "soft alignment matrix" (alignment is the terminology used by Bahdanau et al. [28]), allowing for parallelization.

The idea of using the attention mechanism for self-attention, instead of in an encoder-decoder (cross-attention), was also proposed during this period, such as in differentiable neural computers [30] and neural Turing machines. [31] It was termed intra-attention [32] where an LSTM is augmented with a memory network as it encodes an input sequence.

These strands of development were brought together in 2017 with the Transformer architecture, published in the Attention Is All You Need paper.

Machine translation

Comparison of the data flow in CNN, RNN, and self-attention

In neural machine translation, the seq2seq method developed in the early 2010s uses two neural networks. An encoder network encodes an input sentence into numerical vectors, which a decoder network decodes into an output sentence in another language. During the evolution of seq2seq in the 2014-2017 period, the attention mechanism was refined, until it appeared in the Transformer in 2017.

seq2seq machine translation

Decoder cross-attention, computing the context vector with alignment soft weights. Legend: c = Context, a = alignment soft weights, v = output vectors of the Value network.
Animation of seq2seq with RNN and attention mechanism

Consider the English-to-French translation task in the seq2seq setting. To be concrete, let us consider the translation of "the zone of international control <end>", which should translate to "la zone de contrôle international <end>". Here, we use the special <end> token as a control character to delimit the end of input for both the encoder and the decoder.

An input sequence of text is processed by a neural network (which can be an LSTM, a Transformer encoder, or some other network) into a sequence of real-valued vectors $h_0, h_1, \dots$, where $h$ stands for "hidden vector".

After the encoder has finished processing, the decoder starts operating over the hidden vectors, to produce an output sequence $y_0, y_1, \dots$, autoregressively. That is, it always takes as input both the hidden vectors produced by the encoder, and what the decoder itself has produced before, to produce the next output word:

  1. ($h_0, h_1, \dots$, "<start>") → "la"
  2. ($h_0, h_1, \dots$, "<start> la") → "la zone"
  3. ($h_0, h_1, \dots$, "<start> la zone") → "la zone de"
  4. ...
  5. ($h_0, h_1, \dots$, "<start> la zone de contrôle international") → "la zone de contrôle international <end>"

Here, we use the special <start> token as a control character to delimit the start of input for the decoder. The decoding terminates as soon as "<end>" appears in the decoder output.
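This decoding loop can be sketched in a few lines of Python. The `encode` and `decode_step` functions below are hypothetical stand-ins for the encoder and decoder networks, so the sketch shows only the control flow, not any particular implementation:

```python
# Minimal sketch of the autoregressive decoding loop described above.
# `encode` and `decode_step` are hypothetical stand-ins for the encoder and
# decoder networks; only the control flow is illustrated here.

def greedy_decode(encode, decode_step, source_tokens, max_len=50):
    """Autoregressive decoding: feed the encoder's hidden vectors and the
    partial output back into the decoder until "<end>" is produced."""
    hidden_vectors = encode(source_tokens)      # h_0, h_1, ..., one per source token
    output = ["<start>"]
    for _ in range(max_len):
        next_word = decode_step(hidden_vectors, output)   # attends over hidden_vectors
        if next_word == "<end>":
            break
        output.append(next_word)
    return output[1:]                           # drop the "<start>" control token
```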

Attention weights

In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. [33] Consider the example of translating I love you into French as je t'aime: the second word, love, is aligned with the third word, aime. Stacking soft row vectors together for je, t', and aime yields an alignment matrix:

|      | I    | love | you  |
|------|------|------|------|
| je   | 0.94 | 0.02 | 0.04 |
| t'   | 0.11 | 0.01 | 0.88 |
| aime | 0.03 | 0.95 | 0.02 |

Sometimes, alignment can be multiple-to-multiple. For example, the English phrase look it up corresponds to cherchez-le. Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there may not be a best hidden vector.

This view of the attention weights addresses some of the neural network explainability problem. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. On the first pass through the decoder, 94% of the attention weight is on the first English word I, so the network offers the word je. On the second pass of the decoder, 88% of the attention weight is on the third English word you, so it offers t'. On the last pass, 95% of the attention weight is on the second English word love, so it offers aime.

Core calculations

Decoder cross-attention, computing the attention weights by dot-product

As hand-crafting weights defeats the purpose of machine learning, the model must compute the attention weights on its own. By analogy with the language of database queries, we make the model construct a triple of vectors: key, query, and value. The rough idea is that we have a "database" in the form of a list of key-value pairs. The decoder sends in a query and obtains a reply in the form of a weighted sum of the values, where the weight is proportional to how closely the query resembles each key.

The decoder first processes the "<start>" input partially, to obtain an intermediate vector $h_0^d$, the 0th hidden vector of the decoder. Then, the intermediate vector is transformed by a linear map $W^Q$ into a query vector $q_0 = h_0^d W^Q$. Meanwhile, the hidden vectors output by the encoder are transformed by another linear map $W^K$ into key vectors $k_i = h_i W^K$. The linear maps are useful for providing the model with enough freedom to find the best way to represent the data.

Now, the query and keys are compared by taking dot products: $q_0 k_0^{\mathrm T}, q_0 k_1^{\mathrm T}, \dots$. Ideally, the model should have learned to compute the keys and values such that $q_0 k_0^{\mathrm T}$ is large, $q_0 k_1^{\mathrm T}$ is small, and the rest are very small. This can be interpreted as saying that the attention weight should be mostly applied to the 0th hidden vector of the encoder, a little to the 1st, and essentially none to the rest.

In order to make a properly weighted sum, we need to transform this list of dot products into a probability distribution over $0, 1, 2, \dots$. This can be accomplished by the softmax function, thus giving us the attention weights:

$(a_{00}, a_{01}, \dots) = \mathrm{softmax}(q_0 k_0^{\mathrm T}, q_0 k_1^{\mathrm T}, \dots)$

This is then used to compute the context vector:

$c_0 = a_{00} v_0 + a_{01} v_1 + \cdots$

where $v_i = h_i W^V$ are the value vectors, linearly transformed by another matrix to provide the model with freedom to find the best way to represent values. Without the matrices $W^Q, W^K, W^V$, the model would be forced to use the same hidden vector for both key and value, which might not be appropriate, as these two tasks are not the same.
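A minimal NumPy sketch of these calculations, with arbitrary small dimensions and randomly initialized matrices standing in for the learned linear maps $W^Q$, $W^K$, $W^V$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
H = rng.normal(size=(6, d_model))        # encoder hidden vectors h_0 ... h_5, one per row
h0_dec = rng.normal(size=d_model)        # decoder's 0th hidden vector

W_Q = rng.normal(size=(d_model, d_k))    # linear maps producing queries, keys, values
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q0 = h0_dec @ W_Q                        # query from the decoder
K = H @ W_K                              # keys from the encoder
V = H @ W_V                              # values from the encoder

a0 = softmax(q0 @ K.T)                   # attention weights a_00, a_01, ...
c0 = a0 @ V                              # context vector c_0, a weighted sum of the values
```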

This is the dot-attention mechanism. The particular version described in this section is "decoder cross-attention", as the output context vector is used by the decoder, and the input keys and values come from the encoder, but the query comes from the decoder, thus "cross-attention".

More succinctly, we can write it as

$c_0 = \mathrm{Attention}(h_0^d W^Q, H W^K, H W^V) = \mathrm{softmax}\big((h_0^d W^Q)(H W^K)^{\mathrm T}\big)\,(H W^V)$

where the matrix $H$ is the matrix whose rows are $h_0, h_1, \dots$. Note that the querying vector, $h_0^d$, is not necessarily the same as the key-value vectors in $H$. In fact, it is theoretically possible for query, key, and value vectors to all be different, though that is rarely done in practice.

Self-attention

Encoder self-attention, block diagram
Encoder self-attention, detailed diagram

Self-attention is essentially the same as cross-attention, except that the query, key, and value vectors are all derived from the same sequence of vectors. Both encoder and decoder can use self-attention, but with subtle differences.

For encoder self-attention, we can start with a simple encoder without self-attention, such as an "embedding layer", which simply converts each input word into a vector by a fixed lookup table. This gives a sequence of hidden vectors $h_0, h_1, \dots$. These can then be fed into a dot-product attention mechanism, to obtain

$h_0' = \mathrm{Attention}(h_0 W^Q, H W^K, H W^V)$
$h_1' = \mathrm{Attention}(h_1 W^Q, H W^K, H W^V)$
$\cdots$

or more succinctly, $H' = \mathrm{Attention}(H W^Q, H W^K, H W^V)$. This can be applied repeatedly to obtain a multilayered encoder. This is "encoder self-attention", sometimes called "all-to-all attention", as the vector at every position can attend to every other.
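A minimal NumPy sketch of encoder self-attention, with randomly initialized projection matrices standing in for learned weights; applying the function twice illustrates a two-layer (stacked) encoder:

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    """All-to-all (encoder) self-attention: queries, keys, and values are all
    derived from the same sequence of hidden vectors H (one row per token)."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    return softmax_rows(Q @ K.T) @ V

rng = np.random.default_rng(0)
d = 8
H = rng.normal(size=(5, d))                              # e.g. 5 embeddings from a lookup table
layer1 = [rng.normal(size=(d, d)) for _ in range(3)]     # W_Q, W_K, W_V of the first layer
layer2 = [rng.normal(size=(d, d)) for _ in range(3)]     # a second layer's projections
H1 = self_attention(H, *layer1)                          # first self-attention layer
H2 = self_attention(H1, *layer2)                         # stacked: a two-layer encoder
```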

Masking

Decoder self-attention with causal masking, detailed diagram

For decoder self-attention, all-to-all attention is inappropriate, because during the autoregressive decoding process, the decoder cannot attend to future outputs that have yet to be decoded. This can be solved by forcing the attention weights $a_{ij} = 0$ for all $i < j$, called "causal masking". This attention mechanism is the "causally masked self-attention".

General attention

A step-by-step sequence of a language translation

In general, the attention unit consists of dot products, with 3 trained, fully-connected neural network layers called query, key, and value.

Encoder-decoder with attention. [34] Numerical subscripts (100, 300, 500, 9k, 10k) indicate vector sizes while lettered subscripts i and i − 1 indicate time steps. Grey regions in H matrix and w vector are zero values. See Legend for details.
Legend
| Label | Description |
| --- | --- |
| 100 | Maximum sentence length |
| 300 | Embedding size (word dimension) |
| 500 | Length of hidden vector |
| 9k, 10k | Dictionary size of input & output languages respectively. |
| x, Y | 9k and 10k 1-hot dictionary vectors. x → x implemented as a lookup table rather than vector multiplication. Y is the 1-hot maximizer of the linear decoder layer D; that is, it takes the argmax of D's linear layer output. |
| x | 300-long word embedding vector. The vectors are usually pre-calculated from other projects such as GloVe or Word2Vec. |
| h | 500-long encoder hidden vector. At each point in time, this vector summarizes all the preceding words before it. The final h can be viewed as a "sentence" vector, or a thought vector as Hinton calls it. |
| s | 500-long decoder hidden state vector. |
| E | 500-neuron recurrent neural network encoder. 500 outputs. Input count is 800 (300 from source embedding + 500 from recurrent connections). The encoder feeds directly into the decoder only to initialize it, but not thereafter; hence, that direct connection is shown very faintly. |
| D | 2-layer decoder. The recurrent layer has 500 neurons and the fully-connected linear layer has 10k neurons (the size of the target vocabulary). [35] The linear layer alone has 5 million (500 × 10k) weights, about 10 times more weights than the recurrent layer. |
| score | 100-long alignment score |
| w | 100-long vector of attention weights. These are "soft" weights which change during the forward pass, in contrast to "hard" neuronal weights that change during the learning phase. |
| A | Attention module, which can be a dot product of recurrent states, or the query-key-value fully-connected layers. The output is a 100-long vector w. |
| H | 500×100 matrix: 100 hidden vectors h concatenated into a matrix. |
| c | 500-long context vector = H * w. c is a linear combination of h vectors weighted by w. |
The diagram shows the Attention forward pass calculating correlations of the word "that" with other words in "See that girl run." Given the right weights from training, the network should be able to identify "girl" as a highly correlated word. Some things to note:
  • This example focuses on the attention of a single word "that". In practice, the attention of each word is calculated in parallel to speed up calculations. Simply changing the lowercase "x" vector to the uppercase "X" matrix will yield the formula for this (see the sketch after this list).
  • Softmax scaling, $q W_k^{\mathrm T} / \sqrt{100}$, prevents a high variance in $q W_k^{\mathrm T}$ that would allow a single word to excessively dominate the softmax, resulting in attention to only one word, as a discrete hard max would do.
  • Notation: the commonly written row-wise softmax formula above assumes that vectors are rows, which runs contrary to the standard mathematical convention of column vectors. More correctly, we should take the transpose of the context vector and use the column-wise softmax, resulting in the more correct form $c^{\mathrm T} = (X W_v)^{\mathrm T}\, \mathrm{softmax}\!\left(\frac{(X W_k)\,(x W_q)^{\mathrm T}}{\sqrt{100}}\right)$.
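A NumPy sketch of the two points above, assuming a 4-word input and dimension d = 100; replacing the single row x with the full matrix X computes attention for every word in parallel, and both versions use the $1/\sqrt{d}$ scaling:

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 100                              # 4 words, dimension 100 as in the legend above
X = rng.normal(size=(n, d))                # embeddings for the 4 words of the sentence
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Attention for the single word at position 1, with scores scaled by sqrt(d) = 10:
x = X[1]
attn_single = softmax_rows((x @ Wq) @ (X @ Wk).T / np.sqrt(d)) @ (X @ Wv)

# Replacing the lowercase x with the full matrix X computes every word in parallel:
attn_all = softmax_rows((X @ Wq) @ (X @ Wk).T / np.sqrt(d)) @ (X @ Wv)

assert np.allclose(attn_all[1], attn_single)   # row 1 matches the single-word result
```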

Variants

Many variants of attention implement soft weights, such as

  • fast weight programmers, or fast weight controllers (1992), [17] one of whose networks writes the weights of the other (see Linearized attention above),
  • Bahdanau-style attention, also referred to as additive attention, [33]
  • Luong-style attention, which is known as multiplicative attention, [36] and
  • positional attention and factorized positional attention. [37]

For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention, [38] channel attention, [39] or combinations. [40] [41]

Much effort has gone into understanding attention further by studying its role in focused settings, such as in-context learning, [42] masked language tasks, [43] stripped-down transformers, [44] bigram statistics, [45] N-gram statistics, [46] pairwise convolutions, [47] and arithmetic factoring. [48]

These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in Core Calculations section above.

  1. Encoder-decoder dot product: both encoder and decoder are needed to calculate attention.
  2. Encoder-decoder QKV: both encoder and decoder are needed to calculate attention.
  3. Encoder-only dot product: the decoder is not used to calculate attention. With only one input into corr, W is an auto-correlation of dot products: w_ij = x_i x_j.
  4. Encoder-only QKV: the decoder is not used to calculate attention.
  5. Pytorch tutorial: a fully-connected layer is used to calculate attention instead of dot-product correlation.
Legend
| Label | Description |
| --- | --- |
| Variables X, H, S, T | Upper case variables represent the entire sentence, and not just the current word. For example, H is a matrix of the encoder hidden state, one word per column. |
| S, T | S, decoder hidden state; T, target word embedding. In the Pytorch tutorial variant training phase, T alternates between 2 sources depending on the level of teacher forcing used. T could be the embedding of the network's output word, i.e. embedding(argmax(FC output)). Alternatively, with teacher forcing, T could be the embedding of the known correct word, which can occur with a constant forcing probability, say 1/2. |
| X, H | H, encoder hidden state; X, input word embeddings. |
| W | Attention coefficients |
| Qw, Kw, Vw, FC | Weight matrices for query, key, value respectively. FC is a fully-connected weight matrix. |
| ⊕, ⊗ | ⊕, vector concatenation; ⊗, matrix multiplication. |
| corr | Column-wise softmax of the matrix of all combinations of dot products. The dot products are x_i * x_j in variant #3, h_i * s_j in variant 1, column i (Kw * H) * column j (Qw * S) in variant 2, and column i (Kw * X) * column j (Qw * X) in variant 4. Variant 5 uses a fully-connected layer to determine the coefficients. If the variant is QKV, then the dot products are normalized by √d, where d is the height of the QKV matrices. |

Mathematical representation

Standard Scaled Dot-Product Attention

For matrices $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$, the scaled dot-product, or QKV attention is defined as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathrm T}}{\sqrt{d_k}}\right)V \in \mathbb{R}^{m \times d_v}$

where ${}^{\mathrm T}$ denotes transpose and the softmax function is applied independently to every row of its argument. The matrix $Q$ contains $m$ queries, while matrices $K, V$ jointly contain an unordered set of $n$ key-value pairs. Value vectors in matrix $V$ are weighted using the weights resulting from the softmax operation, so that the rows of the $m$-by-$d_v$ output matrix are confined to the convex hull of the points in $\mathbb{R}^{d_v}$ given by the rows of $V$.
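The definition above translates directly into code. The following is a minimal NumPy sketch (a reference illustration, not an optimized implementation); all dimensions are arbitrary choices for the example:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, softmax applied row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (m, n) query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # each row is a probability distribution
    return weights @ V                               # (m, d_v); rows are convex combinations of rows of V

rng = np.random.default_rng(0)
m, n, d_k, d_v = 3, 5, 4, 6
Q = rng.normal(size=(m, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = scaled_dot_product_attention(Q, K, V)          # shape (3, 6)
```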

To understand the permutation invariance and permutation equivariance properties of QKV attention, [53] let $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{n \times n}$ be permutation matrices; and $D \in \mathbb{R}^{m \times n}$ an arbitrary matrix. The softmax function is permutation equivariant in the sense that:

$\mathrm{softmax}(ADB) = A\,\mathrm{softmax}(D)\,B$

By noting that the transpose of a permutation matrix is also its inverse, it follows that:

$\mathrm{Attention}(AQ, BK, BV) = A\,\mathrm{Attention}(Q, K, V)$

which shows that QKV attention is equivariant with respect to re-ordering the queries (rows of $Q$); and invariant to re-ordering of the key-value pairs in $K, V$. These properties are inherited when applying linear transforms to the inputs and outputs of QKV attention blocks. For example, a simple self-attention function defined as:

$X \mapsto \mathrm{Attention}(X W^Q, X W^K, X W^V)$

is permutation equivariant with respect to re-ordering the rows of the input matrix $X$ in a non-trivial way, because every row of the output is a function of all the rows of the input. Similar properties hold for multi-head attention, which is defined below.
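These properties can be verified numerically. A small NumPy sketch, assuming the scaled dot-product attention defined above (re-implemented here so the example is self-contained):

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 6))
A = np.eye(3)[rng.permutation(3)]        # permutation of the queries
B = np.eye(5)[rng.permutation(5)]        # permutation of the key-value pairs

# Equivariant in the queries: permuting Q permutes the output rows the same way.
assert np.allclose(attention(A @ Q, K, V), A @ attention(Q, K, V))
# Invariant in the key-value pairs: permuting K and V together changes nothing.
assert np.allclose(attention(Q, B @ K, B @ V), attention(Q, K, V))
```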

Masked Attention

When QKV attention is used as a building block for an autoregressive decoder, and when at training time all input and output matrices have $n$ rows, a masked attention variant is used:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathrm T}}{\sqrt{d_k}} + M\right)V$

where the mask, $M \in \mathbb{R}^{n \times n}$, is a strictly upper triangular matrix, with zeros on and below the diagonal and $-\infty$ in every element above the diagonal. The softmax output, also in $\mathbb{R}^{n \times n}$, is then lower triangular, with zeros in all elements above the diagonal. The masking ensures that for all $1 \le i < j \le n$, row $i$ of the attention output is independent of row $j$ of any of the three input matrices. The permutation invariance and equivariance properties of standard QKV attention do not hold for the masked variant.
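A NumPy sketch of the masked variant, with a final check illustrating that earlier rows of the output are unaffected by changes to later rows of the inputs:

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(Q, K, V):
    """Causally masked QKV attention: the mask M is -inf strictly above the
    diagonal, so row i of the output ignores rows j > i of the inputs."""
    n, d_k = Q.shape
    M = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, 0.0)
    return softmax_rows(Q @ K.T / np.sqrt(d_k) + M) @ V

rng = np.random.default_rng(0)
n, d = 5, 4
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = masked_attention(Q, K, V)

# Changing the last key-value pair leaves all earlier output rows unchanged:
K2, V2 = K.copy(), V.copy()
K2[-1], V2[-1] = rng.normal(size=d), rng.normal(size=d)
assert np.allclose(masked_attention(Q, K2, V2)[:-1], out[:-1])
```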

Multi-Head Attention

Decoder multiheaded cross-attention

Multi-head attention

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$

where each head is computed with QKV attention as:

$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$

and $W_i^Q, W_i^K, W_i^V$, and $W^O$ are parameter matrices.

The permutation properties of (standard, unmasked) QKV attention apply here also. For permutation matrices $A$, $B$:

$\mathrm{MultiHead}(AQ, BK, BV) = A\,\mathrm{MultiHead}(Q, K, V)$

from which we also see that multi-head self-attention:

$X \mapsto \mathrm{MultiHead}(X, X, X)$

is equivariant with respect to re-ordering of the rows of input matrix $X$.
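A minimal NumPy sketch of multi-head attention as defined above; the head count, dimensions, and randomly initialized matrices are arbitrary stand-ins for learned parameters:

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Each head applies QKV attention to its own projections of Q, K and V;
    the concatenated head outputs are then mixed by the output matrix Wo."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n_heads, d_model, d_head = 2, 8, 4
Q = rng.normal(size=(3, d_model))
K = rng.normal(size=(5, d_model))
V = rng.normal(size=(5, d_model))
Wq = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
Wk = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
Wv = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo)   # shape (3, d_model)
```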

Bahdanau (Additive) Attention

$\mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\big(\tanh(q W_a + K U_a)\,v_a\big)\,V$

where $W_a$, $U_a$ and $v_a$ are learnable weight matrices. [33]
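A NumPy sketch of additive attention for a single query vector, following the parameterization above, with $W_a$, $U_a$ and $v_a$ as randomly initialized stand-ins for learned weights:

```python
import numpy as np

def bahdanau_attention(q, K, V, Wa, Ua, va):
    """Additive attention: score_j = tanh(q Wa + k_j Ua) . va; the softmaxed
    scores weight the value vectors. Wa, Ua and va are learned parameters."""
    scores = np.tanh(q @ Wa + K @ Ua) @ va        # one score per key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                            # context vector

rng = np.random.default_rng(0)
d_q, d_k, d_v, d_a = 6, 6, 8, 10
q = rng.normal(size=d_q)                          # decoder state (query)
K = rng.normal(size=(5, d_k))                     # encoder states (keys)
V = rng.normal(size=(5, d_v))                     # values
Wa = rng.normal(size=(d_q, d_a))
Ua = rng.normal(size=(d_k, d_a))
va = rng.normal(size=d_a)
context = bahdanau_attention(q, K, V, Wa, Ua, va)
```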

Luong Attention (General)

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(Q W_a K^{\mathrm T})\,V$

where $W_a$ is a learnable weight matrix. [36]
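A NumPy sketch of this multiplicative ("general") scoring, with arbitrary small dimensions:

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def luong_general_attention(Q, K, V, Wa):
    """Luong's "general" (multiplicative) score: softmax(Q Wa K^T) V."""
    return softmax_rows(Q @ Wa @ K.T) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 6))      # decoder states (queries)
K = rng.normal(size=(5, 6))      # encoder states (keys)
V = rng.normal(size=(5, 8))      # values
Wa = rng.normal(size=(6, 6))     # learnable weight matrix
context = luong_general_attention(Q, K, V, Wa)   # shape (3, 8)
```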


References

  1. Niu, Zhaoyang; Zhong, Guoqiang; Yu, Hui (2021-09-10). "A review on the attention mechanism of deep learning". Neurocomputing. 452: 48–62. doi:10.1016/j.neucom.2021.03.091. ISSN 0925-2312.
  2. Soydaner, Derya (August 2022). "Attention mechanism in neural networks: where it comes and where it goes". Neural Computing and Applications. 34 (16): 13371–13385. doi:10.1007/s00521-022-07366-3. ISSN 0941-0643.
  3. Kramer, Arthur F.; Wiegmann, Douglas A.; Kirlik, Alex (2006-12-28). "1 Attention: From History to Application". Attention: From Theory to Practice. Oxford University Press. doi:10.1093/acprof:oso/9780195305722.003.0001. ISBN   978-0-19-530572-2.
  4. Cherry EC (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears" (PDF). The Journal of the Acoustical Society of America. 25 (5): 975–79. Bibcode:1953ASAJ...25..975C. doi:10.1121/1.1907229. hdl: 11858/00-001M-0000-002A-F750-3 . ISSN   0001-4966.
  5. Broadbent, D (1958). Perception and Communication. London: Pergamon Press.
  6. Kowler, Eileen; Anderson, Eric; Dosher, Barbara; Blaser, Erik (1995-07-01). "The role of attention in the programming of saccades". Vision Research. 35 (13): 1897–1916. doi:10.1016/0042-6989(94)00279-U. ISSN   0042-6989. PMID   7660596.
  7. Fukushima, Kunihiko (1987-12-01). "Neural network model for selective attention in visual pattern recognition and associative recall". Applied Optics. 26 (23): 4985–4992. doi:10.1364/AO.26.004985. ISSN   0003-6935. PMID   20523477.
  8. Ba, Jimmy; Mnih, Volodymyr; Kavukcuoglu, Koray (2015-04-23). "Multiple Object Recognition with Visual Attention". arXiv: 1412.7755 [cs.LG].
  9. Koch, Christof; Ullman, Shimon (1987). "Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry". In Vaina, Lucia M. (ed.). Matters of Intelligence: Conceptual Structures in Cognitive Neuroscience. Dordrecht: Springer Netherlands. pp. 115–141. doi:10.1007/978-94-009-3833-5_5. ISBN   978-94-009-3833-5 . Retrieved 2024-08-06.
  10. Itti, L.; Koch, C.; Niebur, E. (November 1998). "A model of saliency-based visual attention for rapid scene analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (11): 1254–1259. doi:10.1109/34.730558.
  11. Ivakhnenko, A. G. (1973). Cybernetic Predicting Devices. CCM Information Corporation.
  12. Ivakhnenko, A. G.; Grigorʹevich Lapa, Valentin (1967). Cybernetics and forecasting techniques. American Elsevier Pub. Co.
  13. Schmidhuber, Jürgen (2022). "Annotated History of Modern AI and Deep Learning". arXiv: 2212.11279 [cs.NE].
  14. Giles, C. Lee; Maxwell, Tom (1987-12-01). "Learning, invariance, and generalization in high-order neural networks". Applied Optics. 26 (23): 4972–4978. doi:10.1364/AO.26.004972. ISSN   0003-6935. PMID   20523475.
  15. Feldman, J. A.; Ballard, D. H. (1982-07-01). "Connectionist models and their properties". Cognitive Science. 6 (3): 205–254. doi:10.1016/S0364-0213(82)80001-3. ISSN   0364-0213.
  16. Rumelhart, David E.; Hinton, G. E.; Mcclelland, James L. (1987-07-29). "A General Framework for Parallel Distributed Processing" (PDF). In Rumelhart, David E.; Hinton, G. E.; PDP Research Group (eds.). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations. Cambridge, Massachusetts: MIT Press. ISBN   978-0-262-68053-0.
  17. Schmidhuber, Jürgen (1992). "Learning to control fast-weight memories: an alternative to recurrent nets". Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.
  18. Ha, David; Dai, Andrew; Le, Quoc V. (2016-12-01). "HyperNetworks". arXiv: 1609.09106 [cs.LG].
  19. Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François (2020). "Transformers are RNNs: Fast autoregressive Transformers with linear attention". ICML 2020. PMLR. pp. 5156–5165.
  20. Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
  21. Christoph von der Malsburg: The correlation theory of brain function. Internal Report 81-2, MPI Biophysical Chemistry, 1981. http://cogprints.org/1380/1/vdM_correlation.pdf See Reprint in Models of Neural Networks II, chapter 2, pages 95-119. Springer, Berlin, 1994.
  22. Jerome A. Feldman, "Dynamic connections in neural networks," Biological Cybernetics, vol. 46, no. 1, pp. 27-39, Dec. 1982.
  23. Hinton, Geoffrey E.; Plaut, David C. (1987). "Using Fast Weights to Deblur Old Memories". Proceedings of the Annual Meeting of the Cognitive Science Society. 9.
  24. Schmidhuber, Jürgen (1993). "Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets". ICANN 1993. Springer. pp. 460–463.
  25. Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (2014). "Sequence to sequence learning with neural networks". arXiv: 1409.3215 [cs.CL].
  26. Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2015). "Show and Tell: A Neural Image Caption Generator". pp. 3156–3164.
  27. Xu, Kelvin; Ba, Jimmy; Kiros, Ryan; Cho, Kyunghyun; Courville, Aaron; Salakhudinov, Ruslan; Zemel, Rich; Bengio, Yoshua (2015-06-01). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 2048–2057.
  28. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (19 May 2016). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv: 1409.0473 [cs.CL]. (orig-date 1 Sep 2014)
  29. Parikh, Ankur; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (2016). "A Decomposable Attention Model for Natural Language Inference". Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 2249–2255. arXiv: 1606.01933 . doi:10.18653/v1/d16-1244.
  30. Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward; Ramalho, Tiago; Agapiou, John; Badia, Adrià Puigdomènech; Hermann, Karl Moritz; Zwols, Yori; Ostrovski, Georg; Cain, Adam; King, Helen; Summerfield, Christopher; Blunsom, Phil; Kavukcuoglu, Koray; Hassabis, Demis (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. Bibcode:2016Natur.538..471G. doi:10.1038/nature20101. ISSN   1476-4687. PMID   27732574. S2CID   205251479.
  31. Graves, Alex; Wayne, Greg; Danihelka, Ivo (2014-12-10). "Neural Turing Machines". arXiv: 1410.5401 [cs.NE].
  32. Cheng, Jianpeng; Dong, Li; Lapata, Mirella (2016-09-20). "Long Short-Term Memory-Networks for Machine Reading". arXiv: 1601.06733 [cs.CL].
  33. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv: 1409.0473 [cs.CL].
  34. Britz, Denny; Goldie, Anna; Luong, Minh-Thanh; Le, Quoc (2017-03-21). "Massive Exploration of Neural Machine Translation Architectures". arXiv: 1703.03906 [cs.CV].
  35. "Pytorch.org seq2seq tutorial" . Retrieved December 2, 2021.
  36. Luong, Minh-Thang (2015-09-20). "Effective Approaches to Attention-Based Neural Machine Translation". arXiv: 1508.04025v5 [cs.CL].
  37. "Learning Positional Attention for Sequential Recommendation". catalyzex.com.
  38. Zhu, Xizhou; Cheng, Dazhi; Zhang, Zheng; Lin, Stephen; Dai, Jifeng (2019). "An Empirical Study of Spatial Attention Mechanisms in Deep Networks". 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6687–6696. arXiv: 1904.05873 . doi:10.1109/ICCV.2019.00679. ISBN   978-1-7281-4803-8. S2CID   118673006.
  39. Hu, Jie; Shen, Li; Sun, Gang (2018). "Squeeze-and-Excitation Networks". 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7132–7141. arXiv: 1709.01507 . doi:10.1109/CVPR.2018.00745. ISBN   978-1-5386-6420-9. S2CID   206597034.
  40. Woo, Sanghyun; Park, Jongchan; Lee, Joon-Young; Kweon, In So (2018-07-18). "CBAM: Convolutional Block Attention Module". arXiv: 1807.06521 [cs.CV].
  41. Georgescu, Mariana-Iuliana; Ionescu, Radu Tudor; Miron, Andreea-Iuliana; Savencu, Olivian; Ristea, Nicolae-Catalin; Verga, Nicolae; Khan, Fahad Shahbaz (2022-10-12). "Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution". arXiv: 2204.04218 [eess.IV].
  42. Zhang, Ruiqi (2024). "Trained Transformers Learn Linear Models In-Context" (PDF). Journal of Machine Learning Research 1-55. 25. arXiv: 2306.09927 .
  43. Rende, Riccardo (2024). "Mapping of attention mechanisms to a generalized Potts model". Physical Review Research. 6 (2): 023057. arXiv: 2304.07235 . Bibcode:2024PhRvR...6b3057R. doi:10.1103/PhysRevResearch.6.023057.
  44. He, Bobby (2023). "Simplifying Transformers Blocks". arXiv: 2311.01906 [cs.LG].
  45. Nguyen, Timothy (2024). "Understanding Transformers via N-gram Statistics". arXiv: 2407.12034 [cs.CL].
  46. "Transformer Circuits". transformer-circuits.pub.
  47. Transformer Neural Network Derived From Scratch. 2023. Event occurs at 05:30. Retrieved 2024-04-07.
  48. Charton, François (2023). "Learning the Greatest Common Divisor: Explaining Transformer Predictions". arXiv: 2308.15594 [cs.LG].
  49. Neil Rhodes (2021). CS 152 NN—27: Attention: Keys, Queries, & Values. Event occurs at 06:30. Retrieved 2021-12-22.
  50. Alfredo Canziani & Yann Lecun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 05:30. Retrieved 2021-12-22.
  51. Alfredo Canziani & Yann Lecun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 20:15. Retrieved 2021-12-22.
  52. Robertson, Sean. "NLP From Scratch: Translation With a Sequence To Sequence Network and Attention". pytorch.org. Retrieved 2021-12-22.
  53. Lee, Juho; Lee, Yoonho; Kim, Jungtaek; Kosiorek, Adam R; Choi, Seungjin; Teh, Yee Whye (2018). "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks". arXiv: 1810.00825 [cs.LG].