machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

, attention is a method that determines the importance of each component in a sequence relative to the other components in that sequence. In

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

, importance is represented b
"soft"
weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width

sequence In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is cal ...

that can range from tens to millions of tokens in size. Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serial

recurrent neural network Recurrent neural networks (RNNs) are a class of artificial neural networks designed for processing sequential data, such as text, speech, and time series, where the order of elements is important. Unlike feedforward neural networks, which proces ...

(RNN) language translation system, but a more recent design, namely the

transformer In electrical engineering, a transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple Electrical network, circuits. A varying current in any coil of the transformer produces ...

, removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme. Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of leveraging information from the hidden layers of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence tends to be attenuated. Attention allows a token equal access to any part of a sentence directly, rather than only through the previous state.

History

Academic reviews of the history of the attention mechanism are provided in Niu et al. and Soydaner.

Overview

The modern era of machine attention was revitalized by grafting an attention mechanism (Fig 1. orange) to an Encoder-Decoder. Figure 2 shows the internal step-by-step operation of the attention block (A) in Fig 1. This attention scheme has been compared to the Query-Key analogy of relational databases. That comparison suggests an asymmetric role for the Query and Key vectors, where one item of interest (the Query vector "that") is matched against all possible items (the Key vectors of each word in the sentence). However, both Self and Cross Attentions' parallel calculations matches all tokens of the K matrix with all tokens of the Q matrix; therefore the roles of these vectors are symmetric. Possibly because the simplistic database analogy is flawed, much effort has gone into understanding attention mechanisms further by studying their roles in focused settings, such as in-context learning, masked language tasks, stripped down transformers, bigram statistics, N-gram statistics, pairwise convolutions, and arithmetic factoring.

Interpreting attention weights

In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. Consider an example of translating ''I love you'' to French. On the first pass through the decoder, 94% of the attention weight is on the first English word ''I'', so the network offers the word ''je''. On the second pass of the decoder, 88% of the attention weight is on the third English word ''you'', so it offers ''t. On the last pass, 95% of the attention weight is on the second English word ''love'', so it offers ''aime''. In the ''I love you'' example, the second word ''love'' is aligned with the third word ''aime''. Stacking soft row vectors together for ''je'', ''t, and ''aime'' yields an alignment matrix: Sometimes, alignment can be multiple-to-multiple. For example, the English phrase ''look it up'' corresponds to ''cherchez-le''. Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there may not be a best hidden vector.

Variants

Many variants of attention implement soft weights, such as * fast weight programmers, or fast weight controllers (1992). A "slow"

neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...

outputs the "fast" weights of another neural network through

outer product In linear algebra, the outer product of two coordinate vectors is the matrix whose entries are all products of an element in the first vector with an element in the second vector. If the two coordinate vectors have dimensions ''n'' and ''m'', the ...

s. The slow network learns by gradient descent. It was later renamed as "linearized self-attention". * Bahdanau-style attention, also referred to as ''additive attention'', * Luong-style attention, which is known as ''multiplicative attention'', * Early attention mechanisms similar to modern self-attention were proposed using recurrent neural networks. However, the highly parallelizable self-attention was introduced in 2017 and successfully used in the Transformer model, * ''positional attention'' and ''factorized positional attention''. For

convolutional neural network A convolutional neural network (CNN) is a type of feedforward neural network that learns features via filter (or kernel) optimization. This type of deep learning network has been applied to process and make predictions from many different ty ...

s, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention, channel attention, or combinations. These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in Overview section above.

Optimizations

Flash attention

The size of the attention matrix is proportional to the square of the number of input tokens. Therefore, when the input is long, calculating the attention matrix requires a lot of GPU memory. Flash attention is an implementation that reduces the memory needs and increases efficiency without sacrificing accuracy. It achieves this by partitioning the attention computation into smaller blocks that fit into the GPU's faster on-chip memory, reducing the need to store large intermediate matrices and thus lowering memory usage while increasing computational efficiency.

FlexAttention

FlexAttentionhttps://pytorch.org/blog/flexattention/ is an attention kernel developed by Meta that allows users to modify attention scores prior to softmax and dynamically chooses the optimal attention algorithm.

Self-Attention and Transformers

The major breakthrough came with self-attention, where each element in the input sequence attends to all others, enabling the model to capture global dependencies. This idea was central to the Transformer architecture, which completely replaced recurrence with attention mechanisms. As a result, Transformers became the foundation for models like BERT, GPT, and T5 (Vaswani et al., 2017).

Applications

Attention is widely used in natural language processing, computer vision, and speech recognition. In NLP, it improves context understanding in tasks like question answering and summarization. In vision, visual attention helps models focus on relevant image regions, enhancing object detection and image captioning.

Mathematical representation

Standard Scaled Dot-Product Attention

For matrices:

\mathbf\in\mathbb^, \mathbf\in\mathbb^

and

\mathbf\in\mathbb^

, the scaled dot-product, or QKV attention is defined as:

\text(\mathbf, \mathbf, \mathbf) = \text\left(\frac\right)\mathbf\in\mathbb^

where

^T

denotes

transpose In linear algebra, the transpose of a Matrix (mathematics), matrix is an operator which flips a matrix over its diagonal; that is, it switches the row and column indices of the matrix by producing another matrix, often denoted by (among other ...

and the

softmax function The softmax function, also known as softargmax or normalized exponential function, converts a tuple of real numbers into a probability distribution of possible outcomes. It is a generalization of the logistic function to multiple dimensions, a ...

is applied independently to every row of its argument. The matrix

\mathbf

contains

m

queries, while matrices

\mathbf, \mathbf

jointly contain an ''unordered'' set of

n

key-value pairs. Value vectors in matrix

\mathbf

are weighted using the weights resulting from the softmax operation, so that the rows of the

m

-by-

d_v

output matrix are confined to the

convex hull In geometry, the convex hull, convex envelope or convex closure of a shape is the smallest convex set that contains it. The convex hull may be defined either as the intersection of all convex sets containing a given subset of a Euclidean space, ...

of the points in

\mathbb^

given by the rows of

\mathbf

. To understand the permutation invariance and permutation equivariance properties of QKV attention, let

\mathbf\in\mathbb^

and

\mathbf\in\mathbb^

permutation matrices In mathematics, particularly in Matrix (mathematics), matrix theory, a permutation matrix is a square binary matrix that has exactly one entry of 1 in each row and each column with all other entries 0. An permutation matrix can represent a permu ...

; and

\mathbf\in\mathbb^

an arbitrary matrix. The softmax function is permutation equivariant in the sense that: :

\text(\mathbf\mathbf\mathbf) = \mathbf\,\text(\mathbf)\mathbf

By noting that the transpose of a permutation matrix is also its inverse, it follows that: :

\text(\mathbf\mathbf, \mathbf\mathbf, \mathbf\mathbf) = \mathbf\,\text(\mathbf, \mathbf, \mathbf)

which shows that QKV attention is equivariant with respect to re-ordering the queries (rows of

\mathbf

); and invariant to re-ordering of the key-value pairs in

\mathbf,\mathbf

. These properties are inherited when applying linear transforms to the inputs and outputs of QKV attention blocks. For example, a simple self-attention function defined as: :

\mathbf\mapsto\text(\mathbf\mathbf_q, \mathbf\mathbf_k, \mathbf\mathbf_v)

is permutation equivariant with respect to re-ordering the rows of the input matrix

X

in a non-trivial way, because every row of the output is a function of all the rows of the input. Similar properties hold for ''multi-head attention'', which is defined below.

Masked Attention

When QKV attention is used as a building block for an autoregressive decoder, and when at training time all input and output matrices have

n

rows, a masked attention variant is used:

\text(\mathbf, \mathbf, \mathbf) = \text\left(\frac+\mathbf\right)\mathbf

where the mask,

\mathbf\in\mathbb^

is a strictly upper triangular matrix, with zeros on and below the diagonal and

-\infty

in every element above the diagonal. The softmax output, also in

\mathbb^

is then ''lower triangular'', with zeros in all elements above the diagonal. The masking ensures that for all

1\le i, row i of the attention output is independent of row j of any of the three input matrices. The permutation invariance and equivariance properties of standard QKV attention do not hold for the masked variant.

Multi-Head Attention

Multi-head attention

\text(\mathbf, \mathbf, \mathbf) = \text(\text_1, ..., \text_h)\mathbf^O

where each head is computed with QKV attention as:

\text_i = \text(\mathbf\mathbf_i^Q, \mathbf\mathbf_i^K, \mathbf\mathbf_i^V)

and

\mathbf_i^Q, \mathbf_i^K, \mathbf_i^V

, and

\mathbf^O

are parameter matrices. The permutation properties of (standard, unmasked) QKV attention apply here also. For permutation matrices,

\mathbf, \mathbf

: :

\text(\mathbf\mathbf, \mathbf\mathbf, \mathbf\mathbf) = \mathbf\,\text(\mathbf, \mathbf, \mathbf)

from which we also see that multi-head self-attention: :

\mathbf\mapsto\text(\mathbf\mathbf_q, \mathbf\mathbf_k, \mathbf\mathbf_v)

is equivariant with respect to re-ordering of the rows of input matrix

X

Bahdanau (Additive) Attention

\text(\mathbf, \mathbf, \mathbf) = \text(\tanh(\mathbf_Q\mathbf + \mathbf_K\mathbf)\mathbf)

where

\mathbf_Q

and

\mathbf_K

are learnable weight matrices.

Luong Attention (General)

\text(\mathbf, \mathbf, \mathbf) = \text(\mathbf\mathbf\mathbf^T)\mathbf

where

\mathbf

is a learnable weight matrix.

Self Attention

Self-attention is essentially the same as cross-attention, except that query, key, and value vectors all come from the same model. Both encoder and decoder can use self-attention, but with subtle differences. For encoder self-attention, we can start with a simple encoder without self-attention, such as an "embedding layer", which simply converts each input word into a vector by a fixed

lookup table In computer science, a lookup table (LUT) is an array data structure, array that replaces runtime (program lifecycle phase), runtime computation of a mathematical function (mathematics), function with a simpler array indexing operation, in a proc ...

. This gives a sequence of hidden vectors

h_0, h_1, \dots

. These can then be applied to a dot-product attention mechanism, to obtain

\begin
h_0' &= \mathrm(h_0 W^Q, HW^K, H W^V) \\ 
h_1' &= \mathrm(h_1 W^Q, HW^K, H W^V) \\
&\cdots
\end

or more succinctly,

H' = \mathrm(H W^Q, HW^K, H W^V)

. This can be applied repeatedly, to obtain a multilayered encoder. This is the "encoder self-attention", sometimes called the "all-to-all attention", as the vector at every position can attend to every other.

Masking

For decoder self-attention, all-to-all attention is inappropriate, because during the autoregressive decoding process, the decoder cannot attend to future outputs that has yet to be decoded. This can be solved by forcing the attention weights

w_ = 0

for all

i < j

, called "causal masking". This attention mechanism is the "causally masked self-attention".

References

External links

* * Dan Jurafsky and James H. Martin (2022
''Speech and Language Processing'' (3rd ed. draft, January 2022)
ch. 10.4 Attention and ch. 9.7 Self-Attention Networks: Transformers * Alex Graves (4 May 2020)
Attention and Memory in Deep Learning
(video lecture),

DeepMind DeepMind Technologies Limited, trading as Google DeepMind or simply DeepMind, is a British–American artificial intelligence research laboratory which serves as a subsidiary of Alphabet Inc. Founded in the UK in 2010, it was acquired by Go ...

/ UCL, via YouTube {{Artificial intelligence navbox Machine learning

History

Overview

Interpreting attention weights

Variants

Optimizations

Flash attention

FlexAttention

Self-Attention and Transformers

Applications

Mathematical representation

Standard Scaled Dot-Product Attention

Masked Attention

Multi-Head Attention

Bahdanau (Additive) Attention

Luong Attention (General)

Self Attention

Masking

See also

References

External links