BERT (language model)

Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. In 2019, Google announced that it had begun leveraging BERT in its search engine, and by late 2020 it was using BERT in almost every English-language query. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in NLP experiments", counting over 150 research publications analyzing and improving the model.

The original English-language BERT has two models: (1) BERTBASE, with 12 encoder layers and 12 bidirectional self-attention heads, and (2) BERTLARGE, with 24 encoder layers and 16 bidirectional self-attention heads. Both models are pre-trained on unlabeled data extracted from the BooksCorpus (800M words) and English Wikipedia (2,500M words).
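
The two published configurations can be written down concretely. Below is a minimal sketch using the Hugging Face `transformers` library (an assumption for illustration; the original release used Google's own TensorFlow code), with the layer counts, head counts, and hidden sizes reported in the BERT paper:

```python
# Hypothetical sketch: the two published BERT configurations, built with
# the Hugging Face `transformers` library rather than the original
# TensorFlow release.
from transformers import BertConfig, BertModel

# BERTBASE: 12 encoder layers, 12 self-attention heads, hidden size 768
base_config = BertConfig(
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
    intermediate_size=3072,  # feed-forward width, 4x the hidden size
)

# BERTLARGE: 24 encoder layers, 16 self-attention heads, hidden size 1024
large_config = BertConfig(
    num_hidden_layers=24,
    num_attention_heads=16,
    hidden_size=1024,
    intermediate_size=4096,
)

# Randomly initialized; pre-training on BooksCorpus- and Wikipedia-scale
# text would still be required to reproduce the published models.
model = BertModel(base_config)
print(sum(p.numel() for p in model.parameters()))  # roughly 110 million
```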


Architecture

BERT is at its core a transformer language model with a variable number of encoder layers and self-attention heads. The architecture is "almost identical" to the original transformer implementation in Vaswani et al. (2017). BERT was pre-trained on two tasks: ''masked language modeling'' (15% of tokens were masked, and BERT was trained to predict them from context) and ''next sentence prediction'' (BERT was trained to predict whether a candidate sentence was a probable continuation of the first sentence). As a result of the training process, BERT learns contextual embeddings for words. After pre-training, which is computationally expensive, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks.
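
The masked-token objective can be illustrated at inference time with a pre-trained checkpoint. The following is a hedged sketch using the Hugging Face `transformers` library and its `bert-base-uncased` checkpoint (both assumptions; neither is part of the original Google release):

```python
# Demonstrating the masked-language-modeling task: BERT fills in a
# masked token from bidirectional context, the same objective applied
# to 15% of tokens during pre-training.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
# The top prediction is expected to be "paris".
```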


Performance

When BERT was published, it achieved state-of-the-art performance on a number of natural language understanding tasks:

* GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
* SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0
* SWAG (Situations With Adversarial Generations)
* Sentiment analysis: sentiment classifiers based on BERT achieved remarkable performance in several languages (see the fine-tuning sketch after this list)
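
Results on tasks such as sentiment analysis are obtained by fine-tuning, as described above. The following is a minimal fine-tuning sketch using PyTorch and the Hugging Face `transformers` library (assumptions for illustration; the texts, labels, and learning rate are invented examples, not any benchmark's setup):

```python
# Hypothetical sketch: one gradient step of fine-tuning BERT for binary
# sentiment classification. A classification head is placed over the
# [CLS] token's final hidden state.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = negative, 1 = positive
)

texts = ["I loved this film.", "A complete waste of time."]  # invented data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # returns the classification loss
outputs.loss.backward()
optimizer.step()  # in practice, repeated over a labeled training set
```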


Analysis

The reasons for BERT's state-of-the-art performance on these natural language understanding tasks are not yet well understood. Current research has focused on investigating BERT's output behavior for carefully chosen input sequences, analyzing its internal vector representations through probing classifiers, and examining the relationships represented by attention weights.
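
One of these analysis techniques, inspecting attention weights, can be sketched as follows using the Hugging Face `transformers` library (an assumption; published analyses use a variety of tooling):

```python
# Extracting BERT's attention weights, the quantities examined in
# attention-based analyses of the model.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("He is running a marathon", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per encoder layer, each
# of shape (batch, num_heads, seq_len, seq_len).
print(len(outputs.attentions))      # 12 layers for BERTBASE
print(outputs.attentions[0].shape)  # torch.Size([1, 12, seq_len, seq_len])
```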


History

BERT has its origins in pre-training contextual representations, including semi-supervised sequence learning, generative pre-training, ELMo, and ULMFiT. Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, whereas BERT takes into account the context of each occurrence of a given word. For instance, "running" has the same word2vec vector representation in both "He is running a company" and "He is running a marathon", while BERT provides a contextualized embedding that differs according to the sentence.
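
The "running" example can be made concrete. Below is a hedged sketch using the Hugging Face `transformers` library (an assumption for illustration); a context-free model such as word2vec would return one identical vector for "running" in both sentences:

```python
# Comparing BERT's contextual embeddings of "running" in two sentences.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the final-layer hidden state for `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

company = embedding_of("He is running a company", "running")
marathon = embedding_of("He is running a marathon", "running")

# The two vectors differ, reflecting the two senses of "running";
# their cosine similarity is below 1.0.
print(torch.cosine_similarity(company, marathon, dim=0))
```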
On October 25, 2019, Google announced that it had started applying BERT models to English-language search queries within the US. On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages. By October 2020, almost every English-language query was being processed by BERT.


Recognition

The research paper describing BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).


See also

* Transformer (machine learning model)
* Word2vec
* Autoencoder
* Document-term matrix
* Feature extraction
* Feature learning
* Neural network language models
* Vector space model
* Thought vector
* fastText
* GloVe
* TensorFlow



External links


* Official GitHub repository
* BERT on Devopedia