Generative Pre-trained Transformer 2 (GPT-2) is an open-source artificial intelligence created by OpenAI in February 2019. GPT-2 translates text, answers questions, summarizes passages, and generates text output on a level that, while sometimes indistinguishable from that of humans, can become repetitive or nonsensical when generating long passages. It is a general-purpose learner; it was not specifically trained to do any of these tasks, and its ability to perform them is an extension of its general ability to accurately synthesize the next item in an arbitrary sequence. GPT-2 was created as a "direct scale-up" of OpenAI's 2018 GPT model, with a ten-fold increase in both its parameter count and the size of its training dataset.
The GPT architecture implements a deep neural network, specifically a transformer model, which uses attention in place of previous recurrence- and convolution-based architectures. Attention mechanisms allow the model to selectively focus on segments of input text it predicts to be the most relevant. This model allows for greatly increased parallelization, and outperforms previous benchmarks for RNN/CNN/LSTM-based models.
OpenAI released the complete version of the GPT-2 language model (with 1.5 billion parameters) in November 2019. GPT-2 was to be followed by the 175-billion-parameter GPT-3, revealed to the public in 2020 (whose source code has never been made available). Access to GPT-3 is provided exclusively through APIs offered by OpenAI and Microsoft.
Artificial intelligence has long been an object of study; the "imitation game", postulated by Alan Turing in 1950 (and often called the "Turing test"), proposed to establish an electronic or mechanical system's capacity for intelligent action by an evaluator's ability to distinguish its behavior from that of a human. The term "machine learning" was first used to describe a possible approach to artificial intelligence as early as 1959 by IBM researcher Arthur Samuel; current use of the term encompasses a broad variety of statistical learning, data science and neural network approaches to computational problems (often falling under the aegis of artificial intelligence).
An early demonstration of machine translation took place in 1954, when sixty Russian sentences were translated into English (mostly by replacement of words with their English synonyms). The translations were often crude; the system had only six grammar rules and a 250-word vocabulary, and no attempt was made to analyze or translate syntactic structure. However, the experiment proved to the public that computers could interpret and process natural language, and secured CIA funding for further research. Direct substitution remains a standard against which machine translation programs are evaluated.
Systems for using natural language in human-computer interaction (HCI) also began to emerge in the mid-20th century.
SHRDLU, a program developed at MIT in 1968–1970, consisted of a virtual environment of several objects which a user interacted with through commands in natural language (e.g. "Find a block which is taller than the one you are holding and put it into the box").
ELIZA, a chatterbot written in 1966, analyzed a human interlocutor's text for keywords and provided conversationally appropriate responses. While many subjects claimed an inability to distinguish ELIZA's conversation from that of a human, the question of whether this constituted intelligence proved contentious (the most famous script parodied a psychotherapist by, largely, repeating what the user had said back to them).
While initial attempts at machine translation had been purely computational, by the 1950s the dominant approach to computational linguistics had come to emphasize Noam Chomsky's concept of universal grammar; NLP research in that era, accordingly, consisted largely of attempts to reduce statements in arbitrary languages to putative underlying language-agnostic logical structures. In the 1970s, semantic NLP systems would begin to eschew ''syntactic'' encodings in favor of more general ''semantic'' encodings. However, until the advent of neural networks, most systems continued to rely on large (and increasingly unwieldy) sets of manually programmed rules, which failed to scale up as initially predicted.
The field of artificial intelligence continued to develop in the late 20th century, but occasional periods of stagnation known as "AI winters" (periods of reduced funding and interest in AI research) interrupted its progress; Russell & Norvig in 2003 described one such winter as beginning soon after 1988.
Neural networks
An early concept in artificial intelligence, connectionism, sought to produce intelligent behavior through artificial neural networks designed to simulate the behavior of neurons in biological brains. The first example of an artificial neural network was the SNARC, built in 1951. The perceptron (a type of binary classifier) was introduced in 1957 by psychologist Frank Rosenblatt; his machine was designed for image recognition using 400 photocells connected to "neurons", with weightings determined by potentiometers (and adjusted with electric motors during its learning process). Perceptron systems became the subject of great interest; a ''New York Times'' article described the perceptron as "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence". Perceptron systems, however, fell out of favor for decades following a 1969 book by
Marvin Minsky and Seymour Papert, which highlighted shortcomings of the approach, including the inability of single-layer perceptrons to compute the exclusive or (XOR) function. The book was considered, at the time, to discredit the perceptron approach (as well as neural networks in general) as a promising area of research.
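The perceptron's learning rule can be illustrated with a minimal Python sketch (modern code standing in for the original electromechanical hardware; dataset and hyperparameters are illustrative):
<syntaxhighlight lang="python">
import numpy as np

def train_perceptron(inputs, labels, epochs=20, lr=0.1):
    """Learn weights for a binary classifier; labels are 0 or 1."""
    weights = np.zeros(inputs.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            prediction = 1 if np.dot(weights, x) + bias > 0 else 0
            error = y - prediction          # +1, 0, or -1
            weights += lr * error * x       # nudge weights toward the target
            bias += lr * error
    return weights, bias

# Learns the linearly separable AND function; the same single-layer model
# cannot learn XOR, the limitation highlighted by Minsky and Papert.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w, b = train_perceptron(X, np.array([0, 0, 0, 1]))
</syntaxhighlight>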
Neural networks become capable of classifying different inputs (i.e. sorting them into distinct categories) through a process known as "learning". This begins with the network's weights (the amount by which each neuron's "activation" influences the activation of each specific neuron in the subsequent layer) being initialized to random quantities; in this state, the output of the network is similarly random. An objective function, such as a loss function, is defined, which is capable of quantitatively measuring how close the output of the network is to its desired performance (for example, how often an input consisting of a handwritten number results in the sole activation of the output neuron corresponding to that number). From this, and from the performance of the network, the weights can be adjusted in order to improve its performance.
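As a minimal sketch of such an objective function (the mean-squared-error loss here and the ten-class digit setup are illustrative assumptions, not a specific system described above):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
output = rng.random(10)          # freshly initialized network: effectively random output
target = np.zeros(10)
target[3] = 1.0                  # desired behavior: only the "3" neuron is active

loss = np.mean((output - target) ** 2)   # lower values mean output is closer to the target
print(loss)
</syntaxhighlight>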
Backpropagation, a supervised algorithm first applied to machine learning systems in Paul Werbos' 1974 dissertation, efficiently calculates "gradients", which are vector fields describing the optimal adjustment of all weights in the entire network for a given input/output example. The use of these gradients to train neural networks, a practice known as gradient descent, enabled the creation of much more complex systems, and wide-scale application of neural networks to natural language processing would occur in the 1980s. In 1985, D.B. Parker would rediscover Werbos' method; in 1986, Rumelhart, Hinton and Williams would apply it to generate internal representations of incoming data in neural networks with hidden layers, referred to as "deep learning" networks; this research would later form the basis for recurrent neural networks.
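A minimal sketch of gradient descent on a single weight matrix follows; backpropagation generalizes the same idea by chaining gradients through every layer (the sizes and learning rate below are illustrative):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))       # weights from 4 inputs to 10 outputs
x = rng.random(4)                  # one training input
target = np.zeros(10)
target[3] = 1.0                    # desired output for this input

for step in range(100):
    output = W @ x                              # forward pass
    grad = np.outer(2 * (output - target), x)   # dLoss/dW for a squared-error loss
    W -= 0.1 * grad                             # step against the gradient
</syntaxhighlight>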
Traditional feed-forward neural networks (FFNNs) are so named because each layer takes in output from the previous layer, and feeds it into the next; a FFNN's structure contains no "cycles" where information flows backwards. In contrast, a recurrent neural network (RNN) has at least one cycle of activation flow. RNNs are often used for processing sequences of data (and predicting future sequence items), since the network can process each item using both the item itself and its own output from processing the previous item.
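A minimal sketch of this recurrence (a plain tanh cell; weight shapes are assumptions): each item is processed together with the hidden state carried over from the previous item, which is what forms the cycle over time.
<syntaxhighlight lang="python">
import numpy as np

def rnn_forward(sequence, W_in, W_rec, b):
    """Process a sequence one item at a time, reusing the previous hidden state."""
    hidden = np.zeros(W_rec.shape[0])
    for x in sequence:
        hidden = np.tanh(W_in @ x + W_rec @ hidden + b)   # depends on item AND prior output
    return hidden                                         # a summary of the whole sequence
</syntaxhighlight>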
The neocognitron, proposed by Kunihiko Fukushima in 1979 based on models of neural architecture in the mammalian visual cortex, provided the basis for convolutional neural networks (CNNs), often used in image processing. By "sliding" a small layer over a larger input, a CNN can perform deeper processing with less computation. For example, a 100×100 image has 10,000 pixels, which would require 10,000 weights to process with a fully connected layer; a convolutional layer consisting of a 5×5 "window" sliding over the image can perform edge detection using only 25 learnable parameters. Convolutional layers are combined by "pooling layers", and processed by "fully connected" layers (which are typically multilayer perceptrons).
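The parameter-count comparison above can be verified with a short PyTorch sketch (PyTorch is used here purely for illustration):
<syntaxhighlight lang="python">
import torch.nn as nn

fully_connected = nn.Linear(100 * 100, 1)     # one weight per pixel: 10,000 weights (+1 bias)
convolution = nn.Conv2d(1, 1, kernel_size=5)  # a 5x5 window reused everywhere: 25 weights (+1 bias)

print(sum(p.numel() for p in fully_connected.parameters()))  # 10001
print(sum(p.numel() for p in convolution.parameters()))      # 26
</syntaxhighlight>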
Machine learning for natural language processing
Due to their ability to process sequential information, recurrent neural networks have seen use in many NLP applications; unlike FFNNs, they are capable of encoding different weights (and giving different output) for identical items based on their surroundings in a sequence—that is to say, a RNN system that parsed one word at a time could still associate a "black dog" with fuzzy paws, a "corn dog" with ketchup, and a "sun dog" with refraction. Moreover, since the retention of information from previous sequence items can be performed recursively, RNN systems can be designed that recall items arbitrarily far back in a sequence: for example, being able to continue the sequences "Tom looked at the black dog", "Tom looked at the corn dog", and "Tom looked at the sun dog" with "fondly", "hungrily", and "indirectly", respectively.
While capable of impressive solutions, many-layered FFNNs and RNNs both proved vulnerable to the vanishing gradient problem: since gradients (encoded as finite-precision numbers) are required to backpropagate across all layers of a model, they can "vanish" to zero (or "explode" to infinity) over a sufficiently large number of layers. The long short-term memory network (LSTM), first proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1995–1997, sought to resolve this issue by introducing a novel architecture consisting of multiple distinct "cells" with "input", "output" and "forget" gates. In 2009, an LSTM-based model submitted by Alex Graves' team won the ICDAR competition for handwriting recognition; another was the most accurate model in the competition and a third was the fastest.
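A minimal sketch of a single LSTM step, showing the three gates named above (bias terms are omitted and the concatenated-input formulation is one common convention, not necessarily the exact original):
<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_i, W_f, W_o, W_c):
    z = np.concatenate([x, h_prev])
    i = sigmoid(W_i @ z)                    # input gate: how much new information to write
    f = sigmoid(W_f @ z)                    # forget gate: how much old information to keep
    o = sigmoid(W_o @ z)                    # output gate: how much of the cell to expose
    c = f * c_prev + i * np.tanh(W_c @ z)   # updated cell state
    h = o * np.tanh(c)                      # new hidden state
    return h, c
</syntaxhighlight>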
Another issue RNNs and LSTMs encounter is that they can only take into account the context of ''previous'' sequence items. This can create issues when parsing sentences like "Tom rode his bike to the store, put out the kickstand, and turned off the engine", in which the necessary context of the "bike" being a motorcycle is revealed only at the end. One method of solving problems like this is the bidirectional LSTM, which proceeds in both directions simultaneously, giving access to both "past" and "future" input features.
By the early 2010s, the best performance in neural machine translation was achieved with the encoder–decoder model, in which a RNN or LSTM "encoder network" encoded source sentences into vectors, and a "decoder network" of similar architecture processed these vectors into translated output. 2014 saw the introduction of significantly more complex "attention" mechanisms, which vastly augmented these models' performance. Attention mechanisms gave these models the ability to adaptively focus their decoder networks' "attention" on specific aspects of the source text, rather than forcing them to parse the entire text as one vector.
2017 then saw the introduction of "transformer" models, which went a step further by using attention mechanisms to replace the RNN/LSTM architecture entirely.
Attention mechanisms
One constraint of encoder–decoder models was the difficulty of compressing the encodings of larger sentences into fixed-length vectors; performance often deteriorated on larger inputs. In 2014, Bahdanau et al. introduced an extension to the encoder–decoder model that could "align and translate jointly". For each word of the source sentence that was translated, the Bahdanau model's encoder (a bidirectional RNN with 1000 hidden units in each direction) searched the entire rest of that sentence for the positions of relevant information. Rather than giving the decoder a fixed-length vector encoding of the entire input sequence (like previous models), it produced "context vectors", associated with those positions as well as previously generated target words. The decoder (which also had 1000 hidden units) then used these context vectors to decide where to focus its "attention".
Research into "attention" mechanisms was continued by Luong et al. in a 2015 paper. A "global" approach based on the Bahdanau paper was attempted, as well as a "local" approach wherein only a subset of source words were "considered" at a time; the local approach, while more architecturally complicated, was less computationally expensive and easier to train. It took 7–10 days to fully train an English–German translation model, which was specifically designed to be capable of translating 1,000 target words per second; its accuracy was tested against the 2014 ACL Workshop on Machine Translation (WMT'14) task for English–German sentence pairs, and achieved a result of 23.0 BLEU—a 2.1 BLEU improvement on the previous best result achieved by previous attempts, a phrase-based language model from Buck et al. 2014.
Transformers
While attention mechanisms were effective in improving performance when used to augment existing convolutional and recurrent neural network architectures, it was soon discovered that performant models could be built using attention mechanisms on their own, without anything else underlying them.
In June 2017, the transformer architecture was first introduced, in a paper released by researchers from Google Brain and the University of Toronto. Transformers are a type of model based solely on attention mechanisms, discarding convolution and recurrence altogether. Unlike previous RNN-based models, transformers can process sequential input without needing to perform computation on each item in sequence; this means they can be massively parallelized. On the WMT'14 English-to-French task, a specifically trained translation model using the transformer architecture was able to establish a new single-model benchmark of 41.8 BLEU. Since their introduction, transformers have seen use in many NLP applications.
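The core operation is scaled dot-product attention, in which every position attends to every other position at once, with no recurrence over the sequence; a minimal sketch:
<syntaxhighlight lang="python">
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over query, key and value matrices."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # all pairwise similarities in parallel
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # each output mixes all value vectors
</syntaxhighlight>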
Generative Pre-trained Transformer
On June 11, 2018, OpenAI released a paper entitled "Improving Language Understanding by Generative Pre-Training", in which they introduced the ''Generative Pre-trained Transformer'' (GPT). At this point, the best-performing neural NLP models primarily employed supervised learning from large amounts of manually labeled data. This reliance on supervised learning limited their use on datasets that were not well-annotated, in addition to making it prohibitively expensive and time-consuming to train extremely large models; many languages (such as Swahili or Haitian Creole) are difficult to translate and interpret using such models due to a lack of available text for corpus-building. In contrast, GPT's "semi-supervised" approach involved two stages: an unsupervised generative "pre-training" stage in which a language modeling objective was used to set initial parameters, and a supervised discriminative "fine-tuning" stage in which these parameters were adapted to a target task.
The use of a transformer architecture, as opposed to previous techniques involving attention-augmented RNNs, provided GPT with a more structured memory than could be achieved through recurrent mechanisms; this resulted in "robust transfer performance across diverse tasks".
The paper describes the transfer step as follows: "During transfer, we utilize task-specific input adaptations derived from traversal-style approaches, which process structured text input as a single contiguous sequence of tokens."
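A minimal sketch of this traversal-style input adaptation, flattening a structured task input (here, an entailment pair) into one contiguous token sequence; the delimiter strings are illustrative assumptions, not the exact tokens used:
<syntaxhighlight lang="python">
def serialize_entailment(premise, hypothesis, start="<s>", delim="$", end="<e>"):
    """Flatten a structured (premise, hypothesis) pair into a single text sequence."""
    return f"{start} {premise} {delim} {hypothesis} {end}"

print(serialize_entailment("A dog runs in the park.", "An animal is outside."))
</syntaxhighlight>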
Corpus
The unsupervised pre-training was performed using BooksCorpus, a dataset of over 7,000 unpublished fiction books from various genres; this dataset was chosen in part because its long passages of continuous text conditioned the model to handle long-range information. Other available datasets, while larger, were rejected on the basis that they lacked this long-range structure (being "shuffled" at a sentence level). The ''ftfy'' library was used to clean the BooksCorpus text (standardize punctuation and whitespace); it was tokenized using ''spaCy''.
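Both libraries are publicly available; a minimal sketch of the cleaning and tokenization steps (the spaCy pipeline name and the sample text are assumptions, and the exact settings OpenAI used are not reproduced here):
<syntaxhighlight lang="python">
import ftfy
import spacy

nlp = spacy.load("en_core_web_sm")       # assumed English pipeline; must be downloaded first

raw = "Dorothy lived in the midst of the great Kansas prairies…"
clean = ftfy.fix_text(raw)               # repair mojibake, normalize punctuation and whitespace
tokens = [token.text for token in nlp(clean)]
</syntaxhighlight>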
Architecture
GPT's architecture itself was a twelve-layer decoder-only transformer, using twelve masked self-attention heads, with 64-dimensional states each (for a total of 768). Rather than simple stochastic gradient descent, the Adam optimization algorithm was used; the learning rate was increased linearly from zero over the first 2,000 updates, to a maximum of 2.5×10⁻⁴, and annealed to 0 using a cosine schedule.
We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm is used extensively throughout the model, a simple weight initialization of N(0,0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in Loshchilov et al. 2017, with ''w = 0.01'' on all non bias or gain weights. [...]
We used learned position embeddings instead of the sinusoidal version proposed in the original work. [...] Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. ''λ'' was set to 0.5.
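The warmup-then-cosine learning-rate schedule described above can be sketched as a simple function of the step number (the total step count here is an assumption for illustration):
<syntaxhighlight lang="python">
import math

def learning_rate(step, warmup=2000, max_lr=2.5e-4, total_steps=100_000):
    if step < warmup:
        return max_lr * step / warmup                          # linear warmup from zero
    progress = (step - warmup) / (total_steps - warmup)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))   # cosine anneal to zero
</syntaxhighlight>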
While GPT's fine-tuning was adapted to specific tasks, its pre-training was not; to perform the various tasks, minimal changes were made to its underlying task-agnostic model architecture. Despite this, GPT still improved on previous benchmarks in several language processing tasks, outperforming discriminatively-trained models with task-oriented architectures on a number of diverse tasks.
Performance
On natural language inference (also known as ''textual entailment'') tasks, models are evaluated on their ability to interpret pairs of sentences from various datasets and classify the relationship between them as "entailment", "contradiction" or "neutral". Examples of such datasets include QNLI (Wikipedia articles) and MultiNLI (transcribed speech, popular fiction and government reports, among other sources); on these GPT achieved, respectively, a 5.8% and 1.5% improvement over previous best results. It similarly outperformed previous models on two tasks related to question answering and commonsense reasoning—by 5.7% on RACE, a dataset of written question–answer pairs from middle and high school exams, and by 8.9% on the Story Cloze Test.
Another task, ''semantic similarity'' (or ''paraphrase detection''), assesses whether a model can predict whether two sentences are paraphrases of one another; on the Quora Question Pairs (QQP) dataset, GPT improved on previous best-performing models by 4.2%. In a text classification task using the Corpus of Linguistic Acceptability (CoLA), GPT achieved a score of 45.4, versus a previous best of 35.0. Finally, on GLUE, a multi-task test, GPT achieved an overall score of 72.8 (compared to a previous record of 68.9).
Scale-up
GPT-2 was created as a direct scale-up of GPT, with both its parameter count and dataset size increased by a factor of 10. Both are unsupervised transformer models trained to generate text by predicting the next word in a sequence of tokens. The GPT-2 model has 1.5 billion parameters, and was trained on a dataset of 8 million web pages. While GPT-2 was reinforced on very simple criteria (interpreting a sequence of words in a text sample and predicting the most likely next word), it produces full sentences and paragraphs by continuing to predict additional words, generating fully comprehensible (and semantically meaningful) statements in natural language. Notably, GPT-2 was evaluated on its performance on tasks in a zero-shot setting.
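This next-word prediction loop can be demonstrated with the publicly released GPT-2 weights through the Hugging Face ''transformers'' library (a third-party interface, not OpenAI's original training code; the prompt and sampling settings are illustrative):
<syntaxhighlight lang="python">
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode a prompt, then repeatedly sample the most likely next tokens.
inputs = tokenizer("In a shocking finding, scientists discovered", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, top_k=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</syntaxhighlight>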
Training
Since the transformer architecture enabled massive parallelization, GPT-series models could be trained on larger corpora than previous NLP models. While the initial GPT model demonstrated that the approach was viable, GPT-2 would further explore the emergent properties of networks trained on extremely large corpora. ''CommonCrawl'', a large corpus produced by web crawling and previously used in training NLP systems, was considered due to its large size, but was rejected after further review revealed large amounts of unintelligible content. Instead, OpenAI developed a new corpus, known as ''WebText''; rather than scraping content indiscriminately from the World Wide Web, WebText was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017. The corpus was subsequently cleaned; HTML documents were parsed into plain text, duplicate pages were eliminated, and Wikipedia pages were removed (since their presence in many other datasets could have induced overfitting).
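A rough sketch of that filtering pipeline, under stated assumptions: the ''link'' objects and their fields are hypothetical stand-ins for scraped Reddit data, the de-duplication is deliberately crude, and only the steps named in the prose are modeled.
<syntaxhighlight lang="python">
from bs4 import BeautifulSoup

def build_corpus(reddit_links):
    """Keep pages from sufficiently upvoted posts, strip HTML, drop duplicates and Wikipedia."""
    seen, documents = set(), []
    for link in reddit_links:                          # hypothetical objects: .upvotes, .url, .html
        if link.upvotes < 3 or "wikipedia.org" in link.url:
            continue                                   # skip low-karma links and Wikipedia pages
        text = BeautifulSoup(link.html, "html.parser").get_text()
        if text not in seen:                           # crude exact-match de-duplication
            seen.add(text)
            documents.append(text)
    return documents
</syntaxhighlight>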
While the cost of training GPT-2 is known to have been $256 per hour, the number of hours it took to complete training is unknown; therefore, the overall training cost cannot be estimated accurately. However, comparable large language models using transformer architectures have had their costs documented in more detail; the training processes for BERT and XLNet consumed, respectively, $6,912 and $245,000 of resources.
Performance
Due to the broadness of its dataset, and the broadness of its approach, GPT-2 became capable of performing a diverse range of tasks beyond simple text generation: answering questions, summarizing, and even translating between languages in a variety of specific domains, without being instructed in anything beyond how to predict the next word in a sequence.
One example of generalized learning is GPT-2's ability to perform machine translation between French and English, for which task GPT-2's performance was assessed using WMT-14 translation tasks. GPT-2's training corpus included virtually no French text; non-English text was deliberately removed while cleaning the dataset prior to training, and as a consequence, only 10MB of French text (mostly from foreign-language quotations in English posts and articles) remained of the 40,000MB corpus for the model to learn from. Despite this, GPT-2 achieved 5 BLEU on the WMT-14 English-to-French test set (slightly below the score of a translation via word-for-word substitution). It was also able to outperform several contemporary (2017) unsupervised machine translation baselines on the French-to-English test set, where GPT-2 achieved 11.5 BLEU. This remained below the highest-performing contemporary unsupervised approach (2019), which had achieved 33.5 BLEU. However, other models used large amounts of French text to achieve these results; GPT-2 was estimated to have used a monolingual French corpus approximately 1/500 the size of comparable approaches.
Release
GPT-2 was first announced on 14 February 2019. A February 2019 article in ''The Verge'' by James Vincent said that, while "[the] writing it produces is usually easily identifiable as non-human", it remained "one of the most exciting examples yet" of language generation programs:
Give it a fake headline, and it’ll write the rest of the article, complete with fake quotations and statistics. Feed it the first line of a short story, and it’ll tell you what happens to your character next. It can even write fan fiction, given the right prompt.
''The Guardian'' described this output as "plausible newspaper prose"; Kelsey Piper of ''Vox'' said "one of the coolest AI systems I’ve ever seen may also be the one that will kick me out of my job". GPT-2's flexibility was described as "impressive" by ''The Verge''; specifically, its ability to translate text between languages, summarize long articles, and answer trivia questions was noted.
A study by the University of Amsterdam employing a modified Turing test found that at least in some scenarios, participants were unable to distinguish poems generated by GPT-2 from those written by humans.
Restrictions and partial release
While previous OpenAI models had been made immediately available to the public, OpenAI initially refused to make a public release of GPT-2's source code when announcing it in February, citing the risk of malicious use; limited access to the model (i.e. an interface that allowed input and provided output, not the source code itself) was allowed for selected press outlets on announcement. One commonly-cited justification was that, since generated text was usually completely novel, it could be used by spammers to evade automated filters; OpenAI demonstrated a version of GPT-2 fine-tuned to "generate infinite positive – or negative – reviews of products". Another was that GPT-2 could be used to generate text that was obscene or racist. Researchers such as Jeremy Howard warned of "the technology to totally fill Twitter, email, and the web up with reasonable-sounding, context-appropriate prose, which would drown out all other speech and be impossible to filter". The Allen Institute for Artificial Intelligence, in response to GPT-2, announced a tool to detect "neural fake news".
However, opinion was divided. A February 2019 article in ''The Verge'' argued that the threat posed by GPT-2 had been exaggerated; Anima Anandkumar, a professor at Caltech and director of machine learning research at Nvidia, said that there was no evidence that GPT-2 had the capabilities to pose the threats described by OpenAI, and that what they did was the "opposite of open", characterizing their refusal to release the full model as "malicious BS". ''The Gradient'' published an open letter to OpenAI requesting that they release the model publicly, comparing the threat posed by text-generation AI to the threat posed by the printing press, and giving Photoshop as an example of "a technology that has (thankfully) not destroyed modern society despite its potential for chaos":
Thirty years later, society has emerged relatively unscathed despite Photoshop being simple enough for high school students to use and ubiquitous enough to commandeer its own verb. Why? Precisely because everyone knows about Photoshop.
774M release
While OpenAI did not release the fully-trained model or the corpora it was trained on, description of their methods in prior publications (and the free availability of underlying technology) made it possible for GPT-2 to be replicated by others as free software; one such replication, OpenGPT-2, was released in August 2019, in conjunction with a freely licensed version of WebText called OpenWebText. The cloud compute costs for OpenGPT-2 were given as approximately $50,000.
On August 20, 2019, OpenAI released a partial version of GPT-2, with 774 million parameters (roughly half the size of the full 1.5 billion parameter model).
Full 1.5B release
Initial concerns that GPT-2 would lend itself to widespread misuse did not come to pass; ''The Verge'' said that "there are reasons to be skeptical about claims that AI technology will usher in some sort of ‘infopocalypse.’ For a start, we already have programs that can generate plausible text at high volume for little cost: humans." By November 2019, OpenAI said that they had "seen no strong evidence of misuse so far", and the full version, with 1.5 billion parameters, was released on November 5, 2019.
Limitations
While GPT-2's ability to generate plausible passages of natural language text was generally remarked on positively, its shortcomings were noted as well, especially when generating texts longer than a couple paragraphs; ''Vox'' said "the prose is pretty rough, there’s the occasional non-sequitur, and the articles get less coherent the longer they get". ''The Verge'' similarly noted that longer samples of GPT-2 writing tended to "stray off topic" and lack overall coherence; ''The Register'' opined that "a human reading it should, after a short while, realize something's up", and noted that "GPT-2 doesn't answer questions as well as other systems that rely on algorithms to extract and retrieve information."
GPT-2 deployment is resource-intensive; the full version of the model is larger than five gigabytes, making it difficult to embed locally into applications, and consumes large amounts of RAM. In addition, performing a single prediction "can occupy a CPU at 100% utilization for several minutes", and even with GPU processing, "a single prediction can take seconds". To alleviate these issues, the company Hugging Face created DistilGPT2, using knowledge distillation to produce a smaller model that "scores a few points lower on some quality benchmarks", but is "33% smaller and twice as fast".
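DistilGPT2 is available through the same public library interface; a minimal sketch of loading it (the prompt and length are illustrative):
<syntaxhighlight lang="python">
from transformers import pipeline

# Load the distilled model, a smaller and faster alternative to the full GPT-2 checkpoint.
generator = pipeline("text-generation", model="distilgpt2")
print(generator("Once upon a time,", max_length=30)[0]["generated_text"])
</syntaxhighlight>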
Implementations and subsequent research
Possible applications of GPT-2 described by journalists included aiding humans in writing text like news articles. Even before the release of the full version, GPT-2 was used for a variety of applications and services, as well as for entertainment. In June 2019, a subreddit named r/SubSimulatorGPT2 was created in which a variety of GPT-2 instances trained on different subreddits made posts and replied to each other's comments, creating a situation where one could observe "an AI personification of r/Bitcoin argue with the machine learning-derived spirit of r/ShittyFoodPorn"; by July of that year, a GPT-2-based software program released to autocomplete lines of code in a variety of programming languages was described by users as a "game-changer".
In 2019, AI Dungeon was launched, which used GPT-2 to generate dynamic text adventures based on user input. AI Dungeon now offers access to the largest release of the GPT-3 API as an optional paid upgrade; the free version of the site uses the second-largest release of GPT-3. Latitude, the company formed around AI Dungeon, raised $3.3 million in seed funding in 2021. Several websites host interactive demonstrations of different instances of GPT-2 and other transformer models.
In February 2021, a crisis center for troubled teens announced that they would begin using a GPT-2-derived chatbot to help train counselors by allowing them to have conversations with simulated teens (this use was purely for internal purposes, and did not involve having GPT-2 communicate with the teens themselves).