Perceiver is a transformer adapted to process non-textual data, such as images, sounds and video, and spatial data. Transformers underlie other notable systems such as BERT and GPT-3, which preceded Perceiver.
It adopts an asymmetric attention mechanism to distill inputs into a latent bottleneck, allowing it to learn from large amounts of heterogeneous data. Perceiver matches or outperforms specialized models on classification tasks.
Perceiver was introduced in June 2021 by DeepMind.
It was followed by Perceiver IO in August 2021.
Design
Perceiver is designed without modality-specific elements. For example, it does not have elements specialized to handle images, text, or audio. Further, it can handle multiple correlated input streams of heterogeneous types. It uses a small set of latent units that forms an attention bottleneck through which the inputs must pass. One benefit is to eliminate the quadratic scaling problem found in early transformers. Earlier work used custom feature extractors for each modality.
It associates position and modality-specific features with every input element (e.g. every pixel, or audio sample). These features can be learned or constructed using high-fidelity Fourier features.
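The idea of constructed Fourier position features can be sketched as follows: each normalized coordinate is mapped to sine and cosine values at several frequencies, plus the raw coordinate. This is a minimal NumPy illustration; the number of bands, the maximum frequency, and the frequency spacing here are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def fourier_features(pos, num_bands=4, max_freq=10.0):
    """Map scalar positions in [-1, 1] to sin/cos features at several
    frequencies, concatenated with the raw position. Band count and
    frequency range are illustrative assumptions."""
    freqs = np.linspace(1.0, max_freq / 2, num_bands)    # (num_bands,)
    angles = np.pi * pos[:, None] * freqs[None, :]       # (n, num_bands)
    return np.concatenate(
        [np.sin(angles), np.cos(angles), pos[:, None]], axis=-1
    )

x = np.linspace(-1.0, 1.0, 5)   # e.g. normalized pixel x-coordinates
feats = fourier_features(x)
print(feats.shape)              # (5, 9): 4 sin + 4 cos bands + raw coordinate
```

In practice such features are computed per spatial dimension and concatenated onto each input element, which is what lets a modality-agnostic model recover positional structure.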
Perceiver uses cross-attention to keep each layer's cost linear in the input size and to detach network depth from input size. This decoupling allows deeper architectures.
Components
A cross-attention module maps a (larger) byte array (e.g., a pixel array) and a latent array (smaller) to another latent array,
reducing dimensionality. A transformer tower maps one latent array to another latent array, which is used to query the input again. The two components alternate. Both components use query-key-value (QKV) attention. QKV attention applies query, key, and value networks, which are typically
multilayer perceptrons – to each element of an input array, producing three arrays that preserve the index dimensionality (or sequence length) of their inputs.
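The cross-attention module described above can be sketched in a few lines of NumPy: queries come from the small latent array, while keys and values come from the large byte array, so the attention cost grows linearly with the input size rather than quadratically. The array sizes, single attention head, and plain matrix projections below are simplifications for illustration, not the paper's exact configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, byte_array, Wq, Wk, Wv):
    """Single-head QKV cross-attention: queries from the latent array,
    keys/values from the byte array. Cost is O(N * M), linear in the
    input length M because the latent size N is fixed and small."""
    Q = latent @ Wq                            # (N, d)
    K = byte_array @ Wk                        # (M, d)
    V = byte_array @ Wv                        # (M, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (N, M)
    return softmax(scores, axis=-1) @ V        # (N, d): new latent array

rng = np.random.default_rng(0)
d = 16
latent = rng.normal(size=(8, d))         # small latent array (N = 8)
byte_array = rng.normal(size=(5000, d))  # large input array (M = 5000)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(latent, byte_array, Wq, Wk, Wv)
print(out.shape)  # (8, 16): input distilled down to the latent size
```

Note how the output has the latent array's index dimension, not the input's: this is the dimensionality reduction that makes the subsequent latent transformer tower independent of input size.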
Perceiver IO
Perceiver IO can flexibly query the model's latent space to produce outputs of arbitrary size and semantics. It achieves results on tasks with structured output spaces, such as
natural language and visual understanding, ''StarCraft II
'', and multi-tasking. Perceiver IO matches a Transformer-based BERT baseline on the
GLUE language benchmark without the need for input
tokenization and achieves state-of-the-art performance on Sintel
optical flow estimation.
Outputs are produced by attending to the latent array using a specific output query associated with that particular output. For example, to predict optical flow on one pixel, a query would attend using the pixel's xy coordinates plus an optical flow task embedding to produce a single flow vector. It is a variation on the encoder/decoder architecture used in other designs.
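The decoding step can be sketched with the same QKV attention described under Components, with the roles reversed: queries come from an output query array (one query per desired output), while keys and values come from the latent array. The query construction and sizes below are illustrative assumptions, not the published configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode(output_queries, latent, Wq, Wk, Wv):
    """One output per query: each query (e.g. a pixel's position features
    plus a task embedding) attends over the small latent array, so the
    number of outputs is independent of both input and latent size."""
    Q = output_queries @ Wq                 # (P, d)
    K = latent @ Wk                         # (N, d)
    V = latent @ Wv                         # (N, d)
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V  # (P, d)

rng = np.random.default_rng(1)
d = 16
latent = rng.normal(size=(8, d))      # latent array produced by the encoder
# Hypothetical queries: one per output pixel, in practice built from
# xy position features concatenated with a task embedding.
queries = rng.normal(size=(100, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
flow_features = decode(queries, latent, Wq, Wk, Wv)
print(flow_features.shape)  # (100, 16): one feature vector per queried pixel
```

Because the output size is set entirely by the number of queries, the same latent space can be decoded into dense per-pixel maps, token sequences, or a single classification vector.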
Performance
Perceiver's performance is comparable to ResNet-50 and ViT on ImageNet without 2D convolutions. It attends to 50,000 pixels. It is competitive in all modalities in AudioSet.
See also
*
Convolutional neural network
*
Transformer (machine learning model)
References
External links
* {{YouTube|P_xeshTnPZg|Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)}}, with the Fourier features explained in more detail