Generative Pre-trained Transformer 2 (GPT-2) is an open-source artificial intelligence model created by OpenAI in February 2019. GPT-2 translates text, answers questions, summarizes passages, and generates text output on a level that, while sometimes indistinguishable from that of humans, can become repetitive or nonsensical when generating long passages. It is a general-purpose learner: it was not specifically trained to do any of these tasks, and its ability to perform them is an extension of its general ability to accurately synthesize the next item in an arbitrary sequence. GPT-2 was created as a "direct scale-up" of OpenAI's 2018 GPT model, with a ten-fold increase in both its parameter count and the size of its training dataset. The GPT architecture implements a deep neural network, specifically a transformer model, which uses attention in place of earlier recurrence- and convolution-based architectures. Attention mechanisms allow the model to selectively focus on the segments of input text it predicts to be most relevant. This design allows for greatly increased parallelization, and transformer-based models outperform previous RNN-, CNN- and LSTM-based models on standard benchmarks. OpenAI released the complete version of the GPT-2 language model (with 1.5 billion parameters) in November 2019. GPT-2 was followed by the 175-billion-parameter GPT-3, revealed to the public in 2020, whose source code has never been made available; access to GPT-3 is provided exclusively through APIs offered by OpenAI and Microsoft.


Background

Since the origins of computing, artificial intelligence has been an object of study; the "imitation game", postulated by Alan Turing in 1950 (and often called the "Turing test"), proposed to establish an electronic or mechanical system's capacity for intelligent action by an evaluator's ability to distinguish its behavior from that of a human. The term "machine learning" was first used to describe a possible approach to artificial intelligence as early as 1959 by IBM researcher Arthur Samuel; current use of the term encompasses a broad variety of statistical learning, data science and neural network approaches to computational problems (often falling under the aegis of artificial intelligence).


Computational linguistics

Natural language processing using computers, a task originally conceived as a subfield of computational linguistics, was attempted as soon as computing hardware had the capacity; the first application of a dictionary look-up table was developed at Birkbeck College in London in 1948. The 1954 Georgetown Experiment was a demonstration of fully automated machine translation, in which sixty Russian sentences were translated into English (mostly by replacement of words with their English synonyms). The translations were often crude; the system had only 6 grammar rules and a 250-word vocabulary, and no attempt was made to analyze or translate syntactic structure. However, the experiment proved to the public that computers could interpret and process natural language, and secured CIA funding for further research. Direct substitution remains a standard against which machine translation programs are evaluated.

Systems for using natural language in human-computer interaction (HCI) also began to emerge in the mid-20th century. SHRDLU, a program developed at MIT in 1968–1970, consisted of a virtual environment of several objects which a user interacted with through commands in natural language (e.g. "Find a block which is taller than the one you are holding and put it into the box"). ELIZA, a chatterbot written in 1966, analyzed a human interlocutor's text for keywords and provided conversationally appropriate responses. While many subjects claimed an inability to distinguish ELIZA's conversation from that of a human, the question of whether this constituted intelligence proved contentious (the most famous script parodied a psychotherapist by, largely, repeating what the user had said back to them).

While initial attempts at machine translation had been purely computational, by the 1950s the dominant approach to computational linguistics had come to emphasize Noam Chomsky's concept of universal grammar; NLP research in that era, accordingly, consisted largely of attempts to reduce statements in arbitrary languages to putative underlying language-agnostic logical structures. In the 1970s, semantic NLP systems would begin to eschew ''syntactic'' encodings in favor of more general ''semantic'' encodings. However, until the advent of neural networks, most systems continued to rely on large (and increasingly unwieldy) sets of manually programmed rules, which failed to scale up as initially predicted. The field of artificial intelligence continued to develop in the late 20th century, despite occasional periods of stagnation known as "AI winters", in which funding and interest in artificial intelligence research declined.


Neural networks

An early concept in artificial intelligence, connectionism, sought to produce intelligent behavior through artificial neural networks designed to simulate the behavior of neurons in biological brains. The first example of an artificial neural network was the SNARC, built in 1951. The perceptron (a type of binary classifier) was introduced in 1957 by psychologist Frank Rosenblatt; his machine was designed for image recognition using 400 photocells connected to "neurons", with weightings determined by potentiometers (and adjusted with electric motors during its learning process). Perceptron systems became the subject of great interest; a ''New York Times'' article described the perceptron as "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence". Perceptron systems, however, fell out of favor for decades following a 1969 book by Marvin Minsky and Seymour Papert (''Perceptrons: An Introduction to Computational Geometry''), which pointed out several shortcomings of the then-current state of the art (single-layer perceptrons), including an inability to encode the exclusive or (XOR) function. The book was considered at the time to discredit the perceptron approach (as well as neural networks in general) as a promising area of research.

Neural networks become capable of classifying different inputs (i.e. sorting them into distinct categories) through a process known as "learning". This begins with the network's weights (the amount by which each neuron's "activation" influences the activation of each specific neuron in the subsequent layer) being initialized to random quantities; in this state, the output of the network is similarly random. An objective function, such as a loss function, is defined, which quantitatively measures how close the output of the network is to its desired performance (for example, how often an input consisting of a handwritten number results in the sole activation of the output neuron corresponding to that number). From this, and from the performance of the network, the weights can be adjusted to improve its performance. Backpropagation, a supervised algorithm first applied to machine learning systems in Paul Werbos' 1974 dissertation, efficiently calculates "gradients", which are vector fields describing the optimal adjustment of all weights in the entire network for a given input/output example. The use of these gradients to train neural networks, a practice known as gradient descent, enabled the creation of much more complex systems, and wide-scale application of neural networks to natural language processing would occur in the 1980s. In 1985, D. B. Parker would rediscover Werbos' method; in 1986, Rumelhart, Hinton and Williams would apply it to generate internal representations of incoming data in neural networks with hidden layers, referred to as "deep learning" networks; this research would later form the basis for recurrent neural networks.
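The learning loop described above (random initial weights, an objective function, gradients, and repeated weight adjustments) can be illustrated with a minimal sketch; the toy task and single-layer "network" below are hypothetical, use only NumPy, and are not any of the historical systems discussed:

```python
import numpy as np

# Toy task: learn the weights of y = 2*x1 - 3*x2 from examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
y = X @ np.array([2.0, -3.0])

w = rng.normal(size=2)              # weights start out random, so outputs are random too
learning_rate = 0.1

for step in range(200):
    predictions = X @ w             # forward pass through a single linear "layer"
    error = predictions - y
    loss = np.mean(error ** 2)      # objective (loss) function: mean squared error
    gradient = 2 * X.T @ error / len(X)   # how each weight should change to reduce the loss
    w -= learning_rate * gradient   # gradient descent: step against the gradient

print(w.round(3))                   # approaches [2, -3] as the loss shrinks toward zero
```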
Traditional feed-forward neural networks (FFNNs) are so named because each layer takes in output from the previous layer and feeds it into the next; a FFNN's structure contains no "cycles" where information flows backwards. In contrast, a recurrent neural network (RNN) has at least one cycle of activation flow. RNNs are often used for processing sequences of data (and predicting future sequence items), since the network can process each item using both the item itself and its own output from processing the previous item.
The neocognitron, proposed by Kunihiko Fukushima in 1979 based on models of neural architecture in the mammalian visual cortex, provided the basis for convolutional neural networks (CNNs), often used in image processing. By "sliding" a small layer over a larger input, a CNN can perform deeper processing with less computation. For example, a 100×100 image has 10,000 pixels, which would require 10,000 weights to process with a fully connected layer; a convolutional layer consisting of a 5×5 "window" sliding over the image can perform edge detection using only 25 learnable parameters. Convolutional layers are combined by "pooling layers" and processed by "fully connected" layers (which are typically multilayer perceptrons).
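The parameter saving can be checked directly. The sketch below assumes PyTorch (not mentioned in the text) and compares a fully connected layer reading a 100×100 image with a single 5×5 convolutional filter slid over the same image:

```python
import torch.nn as nn

# One output neuron densely connected to every pixel of a 100x100 image: 10,000 weights.
fully_connected = nn.Linear(100 * 100, 1, bias=False)

# One 5x5 filter slid across the same image: 25 weights, reused at every position.
convolutional = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, bias=False)

count = lambda layer: sum(p.numel() for p in layer.parameters())
print(count(fully_connected), count(convolutional))   # 10000 vs. 25
```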


Machine learning for natural language processing

Due to their ability to process sequential information, recurrent neural networks have seen use in many NLP applications; unlike FFNNs, they are capable of encoding different weights (and giving different output) for identical items based on their surroundings in a sequence—that is to say, an RNN system that parsed one word at a time could still associate a "black dog" with fuzzy paws, a "corn dog" with ketchup, and a "sun dog" with refraction. Moreover, since the retention of information from previous sequence items can be performed recursively, RNN systems can be designed that recall items arbitrarily far back in a sequence: for example, being able to continue the sequences "Tom looked at the black dog", "Tom looked at the corn dog", and "Tom looked at the sun dog" with "fondly", "hungrily", and "indirectly", respectively.

While capable of impressive solutions, many-layered FFNNs and RNNs both proved vulnerable to the vanishing gradient problem: since gradients (encoded as finite-precision numbers) are required to backpropagate across all layers of a model, they can "vanish" to zero (or "explode" to infinity) over a sufficiently large number of layers. The long short-term memory network (LSTM), first proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1995–1997, sought to resolve this issue by introducing a novel architecture consisting of multiple distinct "cells" with "input", "output" and "forget" gates. In 2009, an LSTM-based model submitted by Alex Graves' team won the ICDAR competition for handwriting recognition; another was the most accurate model in the competition and a third was the fastest.

Another issue RNNs and LSTMs encounter is that they can only take into account the context of ''previous'' sequence items. This can create issues when parsing sentences like "Tom rode his bike to the store, put out the kickstand, and turned off the engine", in which the necessary context of the "bike" being a motorcycle is revealed only at the end. One method of solving problems like this is the bidirectional LSTM, which proceeds in both directions simultaneously, giving access to both "past" and "future" input features. Conditional random fields (CRFs) use tags to connect inputs directly to outputs. There exist combinations of these approaches, like the LSTM-CRF network and the BI-LSTM-CRF network. Other improvements on the RNN model include neural Turing machines, adaptive computation time, neural programmers, and attention mechanisms, the latter of which form the basis for GPT-2 and related technologies.
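How a recurrent network carries context forward from earlier items can be sketched in a few lines; the dimensions and weights below are arbitrary illustrations, not any particular published model:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The new hidden state depends on the current item *and* on the state
    # carried over from every earlier item in the sequence.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
dim_in, dim_hidden = 8, 16
W_x = rng.normal(scale=0.1, size=(dim_hidden, dim_in))
W_h = rng.normal(scale=0.1, size=(dim_hidden, dim_hidden))
b = np.zeros(dim_hidden)

sequence = rng.normal(size=(5, dim_in))   # e.g. embeddings for "Tom looked at the dog"
h = np.zeros(dim_hidden)
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)     # by the last word, h still reflects the first

print(h.shape)                            # (16,)
```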


Selective focusing

By the early 2010s, the best performance in neural machine translation was achieved with the encoder–decoder model, in which an RNN or LSTM "encoder network" encoded source sentences into vectors, and a "decoder network" of similar architecture processed these vectors into translated output. 2014 saw the introduction of significantly more complex "attention" mechanisms, which vastly augmented these models' performance. Attention mechanisms gave these models the ability to adaptively focus their decoder networks' "attention" on specific aspects of the source text, rather than forcing them to parse the entire text as one vector. 2017 then saw the introduction of "transformer" models, which went a step further by using attention mechanisms to replace the RNN/LSTM architecture entirely.


Attention mechanisms

One constraint of encoder–decoder models was the difficulty of compressing the encodings of larger sentences into fixed-length vectors; performance often deteriorated on larger inputs. In 2014, Bahdanau et al. introduced an extension to the encoder–decoder model that could "align and translate jointly". For each word of the source sentence that was translated, the Bahdanau model's encoder (a bidirectional RNN with 1000 hidden units in each direction) searched the entire rest of that sentence for the positions of relevant information. Rather than giving the decoder a fixed-length vector encoding of the entire input sequence (as previous models had), it produced "context vectors" associated with those positions as well as with previously generated target words. The decoder (which also had 1000 hidden units) then used these context vectors to decide where to focus its "attention". Research into "attention" mechanisms was continued by Luong et al. in a 2015 paper. A "global" approach based on the Bahdanau paper was attempted, as well as a "local" approach wherein only a subset of source words were "considered" at a time; the local approach, while more architecturally complicated, was less computationally expensive and easier to train. An English–German translation model, specifically designed to be capable of translating 1,000 target words per second, took 7–10 days to fully train; its accuracy was tested against the 2014 ACL Workshop on Machine Translation (WMT'14) task for English–German sentence pairs, and achieved a result of 23.0 BLEU, a 2.1-BLEU improvement over the previous best result (a phrase-based language model from Buck et al. 2014).
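The idea can be shown schematically (this is not the Bahdanau model itself, which scores alignments with a small feed-forward network rather than the bilinear score used here): score every encoder position against the current decoder state, turn the scores into a distribution, and build the context vector as the weighted sum of encoder states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states, W):
    # Score each source position against the decoder state, then "focus"
    # by weighting the encoder states with the resulting distribution.
    scores = np.array([decoder_state @ W @ h for h in encoder_states])
    weights = softmax(scores)
    return weights @ encoder_states, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(7, 32))   # one vector per source-sentence position
decoder_state = rng.normal(size=32)
W = rng.normal(scale=0.1, size=(32, 32))    # illustrative bilinear scoring matrix

context, weights = attention_context(decoder_state, encoder_states, W)
print(weights.round(2), context.shape)      # distribution over 7 positions, (32,)
```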


Transformers

While attention mechanisms were effective in improving performance when used to augment existing convolutional and recurrent neural network architectures, it was soon discovered that performant models could be built using attention mechanisms on their own, without anything else underlying them. In June 2017, the transformer architecture was first introduced, in a paper released by researchers from Google Brain, Google Research and the University of Toronto. Transformers are a type of model based solely on attention mechanisms, discarding convolution and recurrence altogether. Unlike previous RNN-based models, transformers can process sequential input without needing to perform computation on each item in sequence, which means they can be massively parallelized. On the WMT'14 French–English task, a specifically trained French–English translation model using the transformer architecture was able to establish a new single-model benchmark of 41.8 BLEU. Since their introduction, transformers have seen use in many NLP applications.
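The core operation that replaces recurrence is scaled dot-product self-attention, which relates every position in a sequence to every other position in a single matrix computation; a minimal single-head sketch with arbitrary dimensions:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # Every token attends to every other token in one shot, so there is no
    # sequential dependency between positions that would prevent parallelization.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))                         # 10 tokens, 64-dim embeddings
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(64, 64)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)         # (10, 64)
```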


Generative Pre-trained Transformer

On June 11, 2018, OpenAI released a paper entitled "Improving Language Understanding by Generative Pre-Training", in which they introduced the ''Generative Pre-trained Transformer'' (GPT). At this point, the best-performing neural NLP models primarily employed supervised learning from large amounts of manually labeled data. This reliance on supervised learning limited their use on datasets that were not well-annotated, in addition to making it prohibitively expensive and time-consuming to train extremely large models; many languages (such as Swahili or Haitian Creole) are difficult to translate and interpret using such models due to a lack of available text for corpus-building. In contrast, GPT's "semi-supervised" approach involved two stages: an unsupervised generative "pre-training" stage in which a language modeling objective was used to set initial parameters, and a supervised discriminative "fine-tuning" stage in which these parameters were adapted to a target task. The use of a transformer architecture, as opposed to previous techniques involving attention-augmented RNNs, provided GPT with a more structured memory than could be achieved through recurrent mechanisms; this resulted in "robust transfer performance across diverse tasks".
During transfer, we utilize task-specific input adaptations derived from traversal-style approaches, which process structured text input as a single contiguous sequence of tokens.
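The "language modeling objective" of the pre-training stage is simply next-token prediction. The sketch below assumes PyTorch, with a hypothetical vocabulary size and random logits standing in for a real model's output; it shows only the loss that would be minimized, not OpenAI's training code:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50_000, 8
tokens = torch.randint(vocab_size, (1, seq_len))     # one training sequence of token ids

# Stand-in for model(tokens[:, :-1]): one score per vocabulary entry at each position.
logits = torch.randn(1, seq_len - 1, vocab_size, requires_grad=True)

# Cross-entropy between the predictions and the sequence shifted by one token:
# each position is trained to predict the *next* token.
lm_loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
lm_loss.backward()   # gradients from this loss set the initial parameters;
                     # supervised fine-tuning then adapts those same parameters to a task
print(lm_loss.item())
```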


Corpus

The unsupervised pre-training was performed using BooksCorpus, a dataset of over 7,000 unpublished fiction books from various genres; this dataset was chosen in part because its long passages of continuous text conditioned the model to handle long-range information. Other available datasets, while larger, were rejected on the basis that they lacked this long-range structure (being "shuffled" at a sentence level). The ''ftfy'' library was used to clean the BooksCorpus text (standardizing punctuation and whitespace), which was then tokenized using ''spaCy''.
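Both tools are publicly available, so the same kind of cleanup and tokenization can be reproduced in miniature; the sample string below is invented, not BooksCorpus text:

```python
import ftfy
import spacy

raw = "He said â€œhello worldâ€\x9d and left."       # classic mojibake for curly quotes
cleaned = ftfy.fix_text(raw)                          # repairs the mis-encoded punctuation

nlp = spacy.blank("en")                               # spaCy's English tokenizer, no model download needed
tokens = [token.text for token in nlp(cleaned)]
print(cleaned)
print(tokens)
```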


Architecture

GPT's architecture itself was a twelve-layer decoder-only transformer, using twelve masked self-attention heads with 64-dimensional states each (for a total of 768). Rather than simple stochastic gradient descent, the Adam optimization algorithm was used; the learning rate was increased linearly from zero over the first 2,000 updates to a maximum of 2.5×10⁻⁴, and annealed to 0 using a cosine schedule.
We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm is used extensively throughout the model, a simple weight initialization of N(0,0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in Loshchilov et al. 2017, with ''w = 0.01'' on all non bias or gain weights.
We used learned position embeddings instead of the sinusoidal version proposed in the original work.
Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. ''λ'' was set to 0.5.
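The described pre-training schedule (linear warmup over the first 2,000 updates to 2.5×10⁻⁴, then cosine annealing back to zero) can be written out directly; the total step count below is a placeholder, since the text does not state one:

```python
import math

def gpt_learning_rate(step, total_steps, warmup_steps=2000, max_lr=2.5e-4):
    # Linear warmup from zero, then cosine annealing back toward zero.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 100_000   # placeholder number of updates, for illustration only
print(gpt_learning_rate(1_000, total))   # halfway through warmup: 1.25e-4
print(gpt_learning_rate(2_000, total))   # peak: 2.5e-4
print(gpt_learning_rate(total, total))   # fully annealed: ~0.0
```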
While GPT's fine-tuning was adapted to specific tasks, its pre-training was not; to perform the various tasks, minimal changes were made to its underlying task-agnostic model architecture. Despite this, GPT still improved on previous benchmarks in several language processing tasks, outperforming discriminatively-trained models with task-oriented architectures on a number of diverse tasks.


Performance

On natural language inference (also known as ''textual entailment'') tasks, models are evaluated on their ability to interpret pairs of sentences from various datasets and classify the relationship between them as "entailment", "contradiction" or "neutral". Examples of such datasets include QNLI (Wikipedia articles) and MultiNLI (transcribed speech, popular fiction and government reports, among other sources); on these, GPT achieved, respectively, a 5.8% and 1.5% improvement over previous best results. It similarly outperformed previous models on two tasks related to question answering and commonsense reasoning: by 5.7% on RACE, a dataset of written question–answer pairs from middle and high school exams, and by 8.9% on the Story Cloze Test. Another task, ''semantic similarity'' (or ''paraphrase detection''), assesses whether a model can predict whether two sentences are paraphrases of one another; on the Quora Question Pairs (QQP) dataset, GPT improved on previous best-performing models by 4.2%. In a text classification task using the Corpus of Linguistic Acceptability (CoLA), GPT achieved a score of 45.4, versus a previous best of 35.0. Finally, on GLUE, a multi-task benchmark, GPT achieved an overall score of 72.8 (compared to a previous record of 68.9).


Scale-up

GPT-2 was created as a direct scale-up of GPT, with both its parameter count and dataset size increased by a factor of 10. Both are unsupervised transformer models trained to generate text by predicting the next word in a sequence of tokens. GPT-2 has 1.5 billion parameters and was trained on a dataset of 8 million web pages. While GPT-2 was trained on very simple criteria (interpreting a sequence of words in a text sample and predicting the most likely next word), it produces full sentences and paragraphs by continuing to predict additional words, generating fully comprehensible (and semantically meaningful) statements in natural language. Notably, GPT-2 was evaluated on its performance on tasks in a zero-shot setting.
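Because the released weights are public, the "predict the most likely next word, append it, repeat" loop can be reproduced with the Hugging Face ''transformers'' library (an assumption for illustration; this is not OpenAI's original code, and the prompt is arbitrary):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # the released GPT-2 weights
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("GPT-2 continues a prompt by", return_tensors="pt")
with torch.no_grad():
    for _ in range(20):                             # grow the sequence one token at a time
        logits = model(ids).logits
        next_id = logits[0, -1].argmax()            # greedy choice of the most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```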


Training

Since the transformer architecture enabled massive parallelization, GPT-series models could be trained on larger corpora than previous NLP models. While the initial GPT model demonstrated that the approach was viable, GPT-2 would further explore the emergent properties of networks trained on extremely large corpora. ''CommonCrawl'', a large corpus produced by web crawling and previously used in training NLP systems, was considered due to its large size, but was rejected after further review revealed large amounts of unintelligible content. Instead, OpenAI developed a new corpus known as ''WebText''; rather than scraping content indiscriminately from the World Wide Web, WebText was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017. The corpus was subsequently cleaned: HTML documents were parsed into plain text, duplicate pages were eliminated, and Wikipedia pages were removed (since their presence in many other datasets could have induced overfitting). While the cost of training GPT-2 is known to have been $256 per hour, the number of hours it took to complete training is unknown; therefore, the overall training cost cannot be estimated accurately. However, comparable large language models using transformer architectures have had their costs documented in more detail; the training processes for BERT and XLNet consumed, respectively, $6,912 and $245,000 of resources.
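A schematic of the stated selection and cleaning criteria might look as follows; the post dictionary, field names and helper functions are hypothetical, and this is not OpenAI's actual pipeline:

```python
from datetime import datetime

CUTOFF = datetime(2017, 12, 1)
seen_documents = set()

def keep_outbound_link(post):
    # WebText-style criterion: pages linked from Reddit posts with at least
    # three upvotes before the cutoff; Wikipedia pages are excluded.
    return (post["score"] >= 3
            and post["created"] < CUTOFF
            and "wikipedia.org" not in post["url"])

def keep_document(plain_text):
    # After HTML is parsed into plain text, drop exact duplicate pages.
    if plain_text in seen_documents:
        return False
    seen_documents.add(plain_text)
    return True

example_post = {"score": 5, "created": datetime(2017, 6, 1), "url": "https://example.com/article"}
print(keep_outbound_link(example_post), keep_document("some page text"))
```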


Performance

Due to the broadness of its dataset and the broadness of its approach, GPT-2 became capable of performing a diverse range of tasks beyond simple text generation: answering questions, summarizing, and even translating between languages in a variety of specific domains, without being instructed in anything beyond how to predict the next word in a sequence. One example of generalized learning is GPT-2's ability to perform machine translation between French and English, for which its performance was assessed using WMT-14 translation tasks. GPT-2's training corpus included virtually no French text; non-English text was deliberately removed while cleaning the dataset prior to training, and as a consequence only 10 MB of French (mostly from foreign-language quotations in English posts and articles) remained of the 40,000 MB corpus for the model to learn from. Despite this, GPT-2 achieved 5 BLEU on the WMT-14 English-to-French test set (slightly below the score of a translation via word-for-word substitution). It was also able to outperform several contemporary (2017) unsupervised machine translation baselines on the French-to-English test set, where GPT-2 achieved 11.5 BLEU. This remained below the highest-performing contemporary unsupervised approach (2019), which had achieved 33.5 BLEU. However, other models used large amounts of French text to achieve these results; GPT-2 was estimated to have used a monolingual French corpus approximately 1/500 the size of those used by comparable approaches.
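BLEU figures like these compare model output against reference translations; a minimal example of computing the metric, assuming the ''sacrebleu'' package (not named in the text) and invented toy sentences:

```python
import sacrebleu

hypotheses = ["the cat sits on the mat", "he went to the store"]            # model output
references = [["the cat is sitting on the mat", "he walked to the store"]]  # one reference stream

score = sacrebleu.corpus_bleu(hypotheses, references)
print(round(score.score, 1))   # corpus-level BLEU, the same metric quoted for GPT-2
```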


Release

GPT-2 was first announced on 14 February 2019. A February 2019 article in ''The Verge'' by James Vincent said that, while "[t]he writing it produces is usually easily identifiable as non-human", it remained "one of the most exciting examples yet" of language generation programs:
Give it a fake headline, and it’ll write the rest of the article, complete with fake quotations and statistics. Feed it the first line of a short story, and it’ll tell you what happens to your character next. It can even write fan fiction, given the right prompt.
''The Guardian'' described this output as "plausible newspaper prose"; Kelsey Piper of ''Vox'' said "one of the coolest AI systems I’ve ever seen may also be the one that will kick me out of my job". GPT-2's flexibility was described as "impressive" by ''The Verge''; specifically, its ability to translate text between languages, summarize long articles, and answer trivia questions was noted. A study by the University of Amsterdam employing a modified Turing test found that at least in some scenarios, participants were unable to distinguish poems generated by GPT-2 from those written by humans.


Restrictions and partial release

While previous OpenAI models had been made immediately available to the public, OpenAI initially refused to make a public release of GPT-2's source code when announcing it in February, citing the risk of malicious use; limited access to the model (i.e. an interface that allowed input and provided output, not the source code itself) was allowed for selected press outlets on announcement. One commonly-cited justification was that, since generated text was usually completely novel, it could be used by spammers to evade automated filters; OpenAI demonstrated a version of GPT-2 fine-tuned to "generate infinite positive – or negative – reviews of products". Another was that GPT-2 could be used to generate text that was obscene or racist. Researchers such as Jeremy Howard warned of "the technology to totally fill Twitter, email, and the web up with reasonable-sounding, context-appropriate prose, which would drown out all other speech and be impossible to filter". The Allen Institute for Artificial Intelligence, in response to GPT-2, announced a tool to detect "neural fake news". However, opinion was divided. A February 2019 article in ''The Verge'' argued that the threat posed by GPT-2 had been exaggerated; Anima Anandkumar, a professor at Caltech and director of machine learning research at Nvidia, said that there was no evidence that GPT-2 had the capabilities to pose the threats described by OpenAI, and that what they did was the "opposite of open", characterizing their refusal to release the full model as "malicious BS". ''The Gradient'' published an open letter to OpenAI requesting that they release the model publicly, comparing the threat posed by text-generation AI to the threat posed by the printing press, and giving Photoshop as an example of "a technology that has (thankfully) not destroyed modern society despite its potential for chaos":
Thirty years later, society has emerged relatively unscathed despite Photoshop being simple enough for high school students to use and ubiquitous enough to commandeer its own verb. Why? Precisely because everyone knows about Photoshop.


774M release

While OpenAI did not release the fully-trained model or the corpora it was trained on, description of their methods in prior publications (and the free availability of underlying technology) made it possible for GPT-2 to be replicated by others as free software; one such replication, OpenGPT-2, was released in August 2019, in conjunction with a freely licensed version of WebText called OpenWebText. The cloud compute costs for OpenGPT-2 were given as approximately $50,000. On August 20, 2019, OpenAI released a partial version of GPT-2, with 774 million parameters (roughly half the size of the full 1.5 billion parameter model).


Full 1.5B release

Initial concerns that GPT-2 would lend itself to widespread misuse did not come to pass; ''The Verge'' said that "there are reasons to be skeptical about claims that AI technology will usher in some sort of ‘infopocalypse.’ For a start, we already have programs that can generate plausible text at high volume for little cost: humans." By November 2019, OpenAI said that they had "seen no strong evidence of misuse so far", and the full version, with 1.5 billion parameters, was released on November 5, 2019.


Limitations

While GPT-2's ability to generate plausible passages of natural language text was generally remarked on positively, its shortcomings were noted as well, especially when generating texts longer than a couple of paragraphs; ''Vox'' said "the prose is pretty rough, there’s the occasional non-sequitur, and the articles get less coherent the longer they get". ''The Verge'' similarly noted that longer samples of GPT-2 writing tended to "stray off topic" and lack overall coherence; ''The Register'' opined that "a human reading it should, after a short while, realize something's up", and noted that "GPT-2 doesn't answer questions as well as other systems that rely on algorithms to extract and retrieve information."

GPT-2 deployment is resource-intensive; the full version of the model is larger than five gigabytes, making it difficult to embed locally into applications, and it consumes large amounts of RAM. In addition, performing a single prediction "can occupy a CPU at 100% utilization for several minutes", and even with GPU processing, "a single prediction can take seconds". To alleviate these issues, the company Hugging Face created DistilGPT2, using knowledge distillation to produce a smaller model that "scores a few points lower on some quality benchmarks", but is "33% smaller and twice as fast".
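A minimal usage sketch of the distilled model, assuming Hugging Face's ''transformers'' library and an arbitrary prompt:

```python
from transformers import pipeline

# Load the distilled model: smaller and faster than the full GPT-2 release.
generator = pipeline("text-generation", model="distilgpt2")
result = generator("The full GPT-2 model is large, but", max_length=40)
print(result[0]["generated_text"])
```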


Implementations and subsequent research

Possible applications of GPT-2 described by journalists included aiding humans in writing text such as news articles. Even before the release of the full version, GPT-2 was used for a variety of applications and services, as well as for entertainment. In June 2019, a subreddit named r/SubSimulatorGPT2 was created, in which a variety of GPT-2 instances trained on different subreddits made posts and replied to each other's comments, creating a situation where one could observe "an AI personification of r/Bitcoin argue with the machine learning-derived spirit of r/ShittyFoodPorn"; by July of that year, a GPT-2-based software program released to autocomplete lines of code in a variety of programming languages was described by users as a "game-changer". In 2019, ''AI Dungeon'' was launched, which used GPT-2 to generate dynamic text adventures based on user input. AI Dungeon now offers access to the largest release of GPT-3 through its API as an optional paid upgrade; the free version of the site uses the second-largest release of GPT-3. Latitude, the company formed around AI Dungeon, raised $3.3 million in seed funding in 2021. Several websites host interactive demonstrations of different instances of GPT-2 and other transformer models. In February 2021, a crisis center for troubled teens announced that it would begin using a GPT-2-derived chatbot to help train counselors by allowing them to have conversations with simulated teens (this use was purely for internal purposes, and did not involve having GPT-2 communicate with the teens themselves).

