
A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. Transformers have the advantage of having no recurrent units, and therefore require less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM). Later variations have been widely adopted for training large language models (LLMs) on large (language) datasets, such as the Wikipedia corpus and Common Crawl.

Transformers were first developed as an improvement over previous architectures for machine translation, but have found many applications since then. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning, audio, multi-modal processing, robotics, and even playing chess. The architecture has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT (Bidirectional Encoder Representations from Transformers).
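
To make the mechanism concrete, the following is a minimal NumPy sketch of multi-head scaled dot-product attention over a sequence of embedded tokens. All dimensions, function names, and the random weights are illustrative assumptions, not a reference implementation.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
        """X: (seq_len, d_model); weights: (d_model, d_model). Shapes are illustrative."""
        seq_len, d_model = X.shape
        d_head = d_model // n_heads
        # Project tokens to queries, keys, and values, then split into heads.
        Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        # Attention weights: how strongly each token attends to every other token.
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
        weights = softmax(scores)
        # Weighted sum of values; heads are concatenated back together.
        out = (weights @ V).transpose(1, 0, 2).reshape(seq_len, d_model)
        return out @ Wo

    rng = np.random.default_rng(0)
    d_model, n_heads, seq_len = 64, 4, 10
    X = rng.normal(size=(seq_len, d_model))             # stand-in for embedded tokens
    Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
    print(multi_head_attention(X, *Ws, n_heads).shape)  # (10, 64)

Each head computes its own (seq, seq) table of attention weights over all unmasked tokens at once, which is what lets the layer amplify key tokens and diminish less important ones, as described above.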


History


Predecessors

For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

A key breakthrough was LSTM (1995), an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called ''multiplicative units''. Neural networks using multiplicative units were called ''sigma-pi networks'' or ''higher-order networks'', but they faced high computational complexity. LSTM became the standard architecture for long sequence modelling until the 2017 publication of Transformers.

However, LSTM still used sequential processing, like most other RNNs: they operate one token at a time from first to last and cannot operate in parallel over all tokens in a sequence. An early attempt to overcome this was the fast weight controller (1992), which computed the weight matrix for further processing depending on the input. It used the fast weights architecture (1987), in which one neural network outputs the weights of another neural network. It was later shown to be equivalent to the linear Transformer without normalization.
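
The fast-weight idea can be sketched as follows, under the assumption of the outer-product update form later identified with linear attention; the names and shapes are illustrative, not the 1992 formulation verbatim.

    import numpy as np

    def fast_weight_pass(keys, values, queries):
        """One sequential pass of a fast weight controller (illustrative form).
        keys, values, queries: (seq_len, d). The fast weight matrix W is
        reprogrammed at every step by an outer-product update, and the output
        at each step is W applied to the current query -- the update rule
        later shown to match the linear Transformer without normalization."""
        d = keys.shape[1]
        W = np.zeros((d, d))
        outputs = []
        for k, v, q in zip(keys, values, queries):
            W += np.outer(v, k)       # write: the weights change with the input
            outputs.append(W @ q)     # read: apply the current fast weights
        return np.stack(outputs)

    rng = np.random.default_rng(1)
    k, v, q = (rng.normal(size=(5, 8)) for _ in range(3))
    print(fast_weight_pass(k, v, q).shape)  # (5, 8)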


Attention with seq2seq

The idea of encoder-decoder sequence transduction had been developed in the early 2010s (see previous papers). The papers most commonly cited as the originators of seq2seq are two concurrently published papers from 2014. (Sutskever et al, 2014) was a 380M-parameter model for machine translation using two long short-term memory (LSTM) networks. The architecture consists of two parts: the ''encoder'' is an LSTM that takes in a sequence of tokens and turns it into a vector, and the ''decoder'' is another LSTM that converts the vector into a sequence of tokens. Similarly, (Cho et al, 2014) was a 130M-parameter model that used gated recurrent units (GRUs) instead of LSTM. Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq.

These early seq2seq models had no attention mechanism; the state vector is accessible only after the ''last'' word of the source text has been processed. Although in theory such a vector retains the information about the whole original sentence, in practice the information is poorly preserved, since the input is processed sequentially by one recurrent network into a ''fixed''-size output vector, which is then processed by another recurrent network into an output. If the input is long, the output vector cannot contain all relevant information, and the output quality degrades. As evidence, reversing the input sentence improved seq2seq translation.

(Bahdanau et al, 2014) introduced an attention mechanism to seq2seq for machine translation to solve the bottleneck problem, allowing the model to process long-distance dependencies more easily. They called their model ''RNNsearch'', as it "emulates searching through a source sentence during decoding a translation". (Luong et al, 2015) compared the relative performance of global (that of (Bahdanau et al, 2014)) and local (sliding window) attention model architectures for machine translation, finding that a mixed attention architecture had higher quality than global attention, while a local attention architecture reduced translation time.

In 2016, Google Translate was revamped to Google Neural Machine Translation, which replaced the previous model based on statistical machine translation. The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM. It took nine months to develop, and it achieved a higher level of performance than the statistical approach, which had taken ten years to develop. In the same year, self-attention ''avant la lettre'', originally called ''intra-attention'' or ''intra-sentence attention'', was proposed for LSTMs.
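
As a concrete picture of the bottleneck fix introduced by (Bahdanau et al, 2014), the following is a minimal NumPy sketch of additive ("Bahdanau-style") attention scoring; the function name, shapes, and random weights are illustrative assumptions, not the exact RNNsearch formulation.

    import numpy as np

    def additive_attention(decoder_state, encoder_states, Wa, Ua, va):
        """Score every encoder state against the current decoder state
        (additive form), then return the weighted context vector.
        Shapes are illustrative: decoder_state (d,), encoder_states (seq, d)."""
        # e_i = v^T tanh(W s + U h_i): one scalar score per source token.
        scores = np.tanh(decoder_state @ Wa + encoder_states @ Ua) @ va
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over source positions
        # The context is a weighted average of ALL encoder states, so the
        # decoder is no longer limited to a single fixed-size vector.
        return weights @ encoder_states, weights

    rng = np.random.default_rng(2)
    d, seq = 16, 7
    ctx, w = additive_attention(rng.normal(size=d), rng.normal(size=(seq, d)),
                                rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                                rng.normal(size=d))
    print(ctx.shape, w.sum())  # (16,) 1.0

Because the decoder can consult every source position directly, long-distance dependencies no longer have to survive a single fixed-size state vector.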


Parallelizing attention

Seq2seq models with attention (including self-attention) still suffered from the same issue with recurrent networks, which is that they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, ''decomposable attention'' applied a self-attention mechanism to feedforward networks, which are easy to parallelize.
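
The parallelization gain can be seen in a toy comparison (illustrative shapes and names): a recurrent network must loop over tokens in order, since step t needs the state from step t-1, while a self-attention layer touches all positions in one batched matrix product.

    import numpy as np

    rng = np.random.default_rng(3)
    seq, d = 128, 32
    X = rng.normal(size=(seq, d))
    Wh, Wx = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

    # RNN: an inherently sequential loop over the sequence.
    h = np.zeros(d)
    for x in X:
        h = np.tanh(h @ Wh + x @ Wx)

    # Self-attention: every pairwise interaction at once; the (seq, seq)
    # score matrix is a single matrix product, trivially parallel on a GPU.
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ X
    print(h.shape, out.shape)  # (32,) (128, 32)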