The transformer is a deep learning architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.
At each layer, each
token is then
contextualized within the scope of the
context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.
Transformers have the advantage of having no recurrent units, therefore requiring less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM).
Later variations have been widely adopted for training large language models (LLMs) on large (language) datasets.
The modern version of the transformer was proposed in the 2017 paper "Attention Is All You Need" by researchers at Google.
Transformers were first developed as an improvement over previous architectures for machine translation, but have found many applications since. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning, audio, multimodal learning, robotics, and even playing chess.
They have also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT (bidirectional encoder representations from transformers).
History
Predecessors
For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.
A key breakthrough was LSTM (1995), an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called ''multiplicative units''. Neural networks using multiplicative units were later called ''sigma-pi networks'' or ''higher-order networks''. LSTM became the standard architecture for long sequence modelling until the 2017 publication of Transformers.
However, LSTM still used sequential processing, like most other RNNs. Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.
Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is
quadratic in the size of the context window. The linearly scaling
fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input.
One of its two networks has "fast weights" or "dynamic links" (1981).
[Christoph von der Malsburg: The correlation theory of brain function. Internal Report 81-2, MPI Biophysical Chemistry, 1981. http://cogprints.org/1380/1/vdM_correlation.pdf See Reprint in Models of Neural Networks II, chapter 2, pages 95–119. Springer, Berlin, 1994.][Jerome A. Feldman, "Dynamic connections in neural networks," Biological Cybernetics, vol. 46, no. 1, pp. 27–39, Dec. 1982.] A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries.
This was later shown to be equivalent to the unnormalized linear Transformer.
Attention with seq2seq
The idea of encoder-decoder sequence transduction had been developed in the early 2010s; commonly cited as the originators that produced seq2seq are two concurrently published papers from 2014.
A 380M-parameter model for machine translation uses two long short-term memories (LSTM). Its architecture consists of two parts. The ''encoder'' is an LSTM that takes in a sequence of tokens and turns it into a vector. The ''decoder'' is another LSTM that converts the vector into a sequence of tokens. Similarly, another 130M-parameter model used gated recurrent units (GRU) instead of LSTM. Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq.
These early seq2seq models had no attention mechanism, and the state vector was accessible only after the ''last'' word of the source text was processed. Although in theory such a vector retains the information about the whole original sentence, in practice the information is poorly preserved. This is because the input is processed sequentially by one recurrent network into a ''fixed''-size output vector, which is then processed by another recurrent network into an output. If the input is long, the output vector cannot contain all the relevant information, degrading the output. As evidence, reversing the input sentence improved seq2seq translation.
The ''RNNsearch'' model introduced an attention mechanism to seq2seq for machine translation to solve the bottleneck problem (of the ''fixed-size'' output vector), allowing the model to process long-distance dependencies more easily. The name is because it "emulates searching through a source sentence during decoding a translation".
The relative performances were compared between global (that of ''RNNsearch'') and local (sliding window) attention model architectures for machine translation, finding that mixed attention had higher quality than global attention, while local attention reduced translation time.
In 2016, Google Translate was revamped to Google Neural Machine Translation, which replaced the previous model based on statistical machine translation. The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM. It took nine months to develop, and it outperformed the statistical approach, which took ten years to develop.
Parallelizing attention
Seq2seq models with attention (including self-attention) still suffered from the same issue as other recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, ''decomposable attention'' applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved a state-of-the-art result in textual entailment with an order of magnitude fewer parameters than LSTMs. One of its authors, Jakob Uszkoreit, suspected that attention ''without'' recurrence would be sufficient for language translation, hence the title "attention is ''all'' you need". That hypothesis was against conventional wisdom at the time, and even his father Hans Uszkoreit, a well-known computational linguist, was skeptical. In the same year, self-attention (called ''intra-attention'' or ''intra-sentence attention'') was proposed for LSTMs.
In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation, by removing its recurrence to process all tokens in parallel, but preserving its dot-product attention mechanism to keep its text processing performance. This led to the introduction of a multi-head attention model that was easier to parallelize due to the use of independent heads and the lack of recurrence. Its parallelizability was an important factor in its widespread use in large neural networks.
AI boom era
Already in spring 2017, even before the "Attention is all you need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Wikipedia articles. The transformer architecture is now used alongside many generative models that contribute to the ongoing AI boom.
In language modelling, ELMo (2018) was a bi-directional LSTM that produces contextualized word embeddings, improving upon the line of research from bag of words and word2vec. It was followed by BERT (2018), an encoder-only Transformer model. In October 2019, Google started using BERT to process search queries. In 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model with a Transformer-encoder–RNN-decoder model.
Starting in 2018, the OpenAI GPT series of decoder-only Transformers became state of the art in natural language generation. In 2022, a chatbot based on GPT-3, ChatGPT, became unexpectedly popular, triggering a boom around large language models.
Since 2020, Transformers have been applied in modalities beyond text, including the vision transformer, speech recognition, robotics, and multimodal learning. The vision transformer, in turn, stimulated new developments in convolutional neural networks. Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024), and Sora (2024) use Transformers to analyse input data (like text prompts) by breaking it down into "tokens" and then calculating the relevance between each token using self-attention, which helps the model understand the context and relationships within the data.
Training
Methods for stabilizing training
The plain transformer architecture had difficulty converging. In the original paper the authors recommended using learning rate warmup. That is, the learning rate should scale up linearly from 0 to its maximal value over the first part of training (a warmup period usually recommended to be 2% of the total number of training steps) before decaying again.
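As a rough sketch, the warmup-then-decay schedule reported in the original paper can be written as follows; the function name transformer_lr is ours, and the default values of d_model and warmup_steps are only illustrative:

<syntaxhighlight lang="python">
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given training step: linear warmup from 0,
    then decay proportional to the inverse square root of the step."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# During warmup (step < warmup_steps) the second term is smaller, giving a
# linear ramp; afterwards the first term takes over and the rate decays.
print(transformer_lr(100), transformer_lr(4000), transformer_lr(100000))
</syntaxhighlight>

The peak learning rate is reached at warmup_steps, where the two terms coincide, and it scales with the model width.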
A 2020 paper found that placing layer normalization ''before'' (instead of after) the multi-head attention and feedforward layers stabilizes training and removes the need for learning rate warmup.
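The structural difference between the two layouts can be sketched as follows; layer_norm, attention, and ffn are illustrative stand-ins (the last two are trivial placeholders rather than real sublayers), intended only to show where normalization sits relative to the residual connections:

<syntaxhighlight lang="python">
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last dimension to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

# Placeholder sublayers; a real block would use multi-head attention and a
# position-wise feedforward network here.
def attention(x):
    return x

def ffn(x):
    return np.tanh(x)

def post_ln_block(x):
    # Original 2017 layout: normalize after each residual addition.
    x = layer_norm(x + attention(x))
    return layer_norm(x + ffn(x))

def pre_ln_block(x):
    # 2020 variant: normalize the input of each sublayer instead, which
    # stabilizes training and avoids the need for learning rate warmup.
    x = x + attention(layer_norm(x))
    return x + ffn(layer_norm(x))

x = np.random.default_rng(0).normal(size=(3, 8))  # (sequence length, model width)
print(post_ln_block(x).shape, pre_ln_block(x).shape)
</syntaxhighlight>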
Pretrain-finetune
Transformers are typically first pretrained by self-supervised learning on a large generic dataset, followed by supervised fine-tuning on a small task-specific dataset. The pretraining dataset is typically an unlabeled large corpus, such as The Pile. Tasks for pretraining and fine-tuning commonly include:
* language modeling
* next-sentence prediction
* question answering
* reading comprehension
* sentiment analysis
* paraphrasing
The T5 transformer report documents a large number of natural language pretraining tasks. Some examples are:
* restoring or repairing incomplete or corrupted text. For example, the input, ''"Thank you <X> me to your party <Y> week"'' (where <X> and <Y> mark corrupted spans), might generate the output, ''"Thank you for inviting me to your party last week".''
* translation between natural languages (machine translation)
* judging the pragmatic acceptability of natural language. For example, the following sentence might be judged "not acceptable", because even though it is syntactically well-formed, it is improbable in ordinary human usage: ''The course is jumping well.''
Note that while each of these tasks is trivial or obvious for human native speakers of the language (or languages), they have typically proved challenging for previous generations of machine learning architecture.
Tasks
In general, there are three classes of language modelling tasks: "masked", "autoregressive", and "prefixLM". These classes are independent of a specific modelling architecture such as the transformer, but they are often discussed in the context of transformers.
In a masked task, one or more of the tokens is masked out, and the model produces a probability distribution predicting what the masked-out tokens are based on the context. The loss function for the task is typically the sum of log-perplexities for the masked-out tokens, <math>\text{Loss} = -\sum_{t \,\in\, \text{masked tokens}} \ln \Pr(t \mid \text{context})</math>, and the model is trained to minimize this loss function. The BERT series of models are trained for masked token prediction and another task (next-sentence prediction).
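As a rough NumPy illustration of this loss, with made-up shapes and numbers, where logits stands for the model's output scores over the vocabulary at each position:

<syntaxhighlight lang="python">
import numpy as np

def masked_lm_loss(logits, targets, mask):
    """Sum of negative log-probabilities of the true tokens at the masked positions.

    logits:  (sequence_length, vocab_size) model output scores
    targets: (sequence_length,) true token ids
    mask:    (sequence_length,) True where the token was masked out
    """
    shifted = logits - logits.max(-1, keepdims=True)            # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]        # log P(true token | context)
    return -(picked * mask).sum()

# Toy example: 4 positions, a 5-token vocabulary, positions 1 and 3 masked out.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
targets = np.array([2, 0, 4, 1])
mask = np.array([False, True, False, True])
print(masked_lm_loss(logits, targets, mask))
</syntaxhighlight>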
In an autoregressive task, the entire sequence is masked at first, and the model produces a probability distribution for the first token. Then the first token is revealed and the model predicts the second token, and so on. The loss function for the task is still typically the same. The GPT series of models are trained by autoregressive tasks.
In a prefixLM task, the sequence is divided into two parts. The first part is presented as context, and the model predicts the first token of the second part. Then that token is revealed, and the model predicts the second token, and so on. The loss function for the task is still typically the same. The T5 series of models are trained by prefixLM tasks.
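In practice, the autoregressive and prefixLM setups are often realized in a single parallel pass through the attention visibility pattern; note that this is attention masking, which (as cautioned below) is distinct from the token masking of "masked" tasks. A small illustrative sketch:

<syntaxhighlight lang="python">
import numpy as np

def causal_mask(n):
    """Autoregressive: position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n, prefix_len):
    """PrefixLM: the context (prefix) is fully visible to every position,
    while the continuation is still restricted causally."""
    mask = causal_mask(n)
    mask[:, :prefix_len] = True
    return mask

# 5 tokens, of which the first 2 form the context in the prefixLM case.
print(causal_mask(5).astype(int))
print(prefix_lm_mask(5, prefix_len=2).astype(int))
</syntaxhighlight>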
Note that "masked" as in "masked language modelling" is not "masked" as in " masked attention", and "prefixLM" (prefix language modeling) is not "prefixLM" (prefix language model).
Architecture
All transformers have the same primary components:
* Tokenizers, which convert text into tokens.
* Embedding layer, which converts tokens and positions of the tokens into vector representations.
* Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.
* Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
The following description follows exactly the Transformer as described in the original paper. There are variants, described in the following section.
By convention, we write all vectors as row vectors. This, for example, means that pushing a vector <math>x</math> through a linear layer with weight matrix <math>W</math> means multiplying it by the weight matrix on the right, as <math>xW</math>.
Tokenization
As the Transformer architecture natively processes numerical data, not text, there must be a translation between text and tokens. A token is an integer that represents a character, or a short segment of characters. On the input side, the input text is parsed into a token sequence. Similarly, on the output side, the output tokens are parsed back to text. The module doing the conversion between texts and token sequences is a tokenizer.
The set of all tokens is the vocabulary of the tokenizer, and its size is the ''vocabulary size''. When faced with tokens outside the vocabulary, typically a special token is used, written as "[UNK]" for "unknown".
Some commonly used tokenizers are byte pair encoding, WordPiece, and SentencePiece.
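As a toy illustration of the text-to-token round trip, the sketch below uses a hypothetical six-word vocabulary and a word-level encoder with an "[UNK]" fallback; real tokenizers such as byte pair encoding, WordPiece, and SentencePiece instead learn subword units from data:

<syntaxhighlight lang="python">
# Hypothetical toy vocabulary; subword tokenizers would be learned from a corpus.
vocab = {"[UNK]": 0, "the": 1, "transformer": 2, "is": 3, "a": 4, "model": 5}
inverse_vocab = {token_id: word for word, token_id in vocab.items()}

def encode(text):
    # Map each word to its integer token id, falling back to [UNK] for
    # out-of-vocabulary words.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

def decode(token_ids):
    return " ".join(inverse_vocab[t] for t in token_ids)

token_ids = encode("The transformer is a neural model")
print(token_ids)          # [1, 2, 3, 4, 0, 5]  ("neural" is out of vocabulary)
print(decode(token_ids))  # "the transformer is a [UNK] model"
</syntaxhighlight>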
Embedding
Each token is converted into an embedding vector via a lookup table. Equivalently stated, it multiplies a one-hot representation of the token by an embedding matrix <math>M</math>. For example, if the input token is the <math>i</math>-th token of the vocabulary, then its one-hot representation is a vector that is 1 at position <math>i</math> and 0 everywhere else, and its embedding vector is the <math>i</math>-th row of <math>M</math>.
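The equivalence between a table lookup and multiplication by a one-hot row vector can be checked directly; in the sketch below the vocabulary size, embedding width, and matrix values are arbitrary:

<syntaxhighlight lang="python">
import numpy as np

vocab_size, d_model = 6, 4
rng = np.random.default_rng(0)
M = rng.normal(size=(vocab_size, d_model))   # embedding matrix: one row per token

token_id = 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Multiplying the one-hot row vector by M on the right selects row `token_id`
# of M, which is exactly what a lookup-table implementation returns.
assert np.allclose(one_hot @ M, M[token_id])
print(M[token_id])
</syntaxhighlight>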