Perceiver is a transformer adapted to process non-textual data, such as images, sounds and video, and spatial data. Transformers underlie other notable systems such as BERT and GPT-3, which preceded Perceiver.
It adopts an asymmetric attention mechanism to distill inputs into a latent bottleneck, allowing it to learn from large amounts of heterogeneous data. Perceiver matches or outperforms specialized models on classification tasks.
Perceiver was introduced in June 2021 by DeepMind.
It was followed by Perceiver IO in August 2021.
Design
Perceiver is designed without modality-specific elements. For example, it does not have elements specialized to handle images, text, or audio. Further, it can handle multiple correlated input streams of heterogeneous types. It uses a small set of latent units that forms an attention bottleneck through which the inputs must pass. One benefit is to eliminate the quadratic scaling problem found in early transformers. Earlier work used custom feature extractors for each modality.
It associates position and modality-specific features with every input element (e.g. every pixel, or audio sample). These features can be learned or constructed using high-fidelity Fourier features.
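The idea of constructed Fourier position features can be sketched as follows: each normalized coordinate is mapped to sine and cosine values at several frequencies, plus the raw coordinate. This is a minimal NumPy illustration; the number of bands, the maximum frequency, and the frequency spacing here are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def fourier_features(pos, num_bands=4, max_freq=10.0):
    """Map scalar positions in [-1, 1] to sin/cos features at several
    frequencies, concatenated with the raw position. Band count and
    frequency range are illustrative assumptions."""
    freqs = np.linspace(1.0, max_freq / 2, num_bands)    # (num_bands,)
    angles = np.pi * pos[:, None] * freqs[None, :]       # (n, num_bands)
    return np.concatenate(
        [np.sin(angles), np.cos(angles), pos[:, None]], axis=-1
    )

x = np.linspace(-1.0, 1.0, 5)   # e.g. normalized pixel x-coordinates
feats = fourier_features(x)
print(feats.shape)              # (5, 9): 4 sin + 4 cos bands + raw coordinate
```

In practice such features are computed per spatial dimension and concatenated onto each input element, which is what lets a modality-agnostic model recover positional structure.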
Perceiver uses cross-attention to keep each layer's cost linear in the input size and to detach network depth from input size. This decoupling allows deeper architectures.
Components
A cross-attention module maps a (larger) byte array (e.g., a pixel array) and a latent array (smaller) to another latent array,
reducing dimensionality. A transformer tower maps one latent array to another latent array, which is used to query the input again. The two components alternate. Both components use query-key-value (QKV) attention. QKV attention applies query, key, and value networks, which are typically
multilayer perceptrons – to each element of an input array, producing three arrays that preserve the index dimensionality (or sequence length) of their inputs.
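The cross-attention module described above can be sketched in a few lines of NumPy: queries come from the small latent array, while keys and values come from the large byte array, so the attention cost grows linearly with the input size rather than quadratically. The array sizes, single attention head, and plain matrix projections below are simplifications for illustration, not the paper's exact configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, byte_array, Wq, Wk, Wv):
    """Single-head QKV cross-attention: queries from the latent array,
    keys/values from the byte array. Cost is O(N * M), linear in the
    input length M because the latent size N is fixed and small."""
    Q = latent @ Wq                            # (N, d)
    K = byte_array @ Wk                        # (M, d)
    V = byte_array @ Wv                        # (M, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (N, M)
    return softmax(scores, axis=-1) @ V        # (N, d): new latent array

rng = np.random.default_rng(0)
d = 16
latent = rng.normal(size=(8, d))         # small latent array (N = 8)
byte_array = rng.normal(size=(5000, d))  # large input array (M = 5000)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(latent, byte_array, Wq, Wk, Wv)
print(out.shape)  # (8, 16): input distilled down to the latent size
```

Note how the output has the latent array's index dimension, not the input's: this is the dimensionality reduction that makes the subsequent latent transformer tower independent of input size.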
Perceiver IO
Perceiver IO can flexibly query the model's latent space to produce outputs of arbitrary size and semantics. It achieves results on tasks with structured output spaces, such as
natural language and visual understanding, ''StarCraft II
'', and multi-tasking. Perceiver IO matches a Transformer-based BERT baseline on the
GLUE language benchmark without the need for input
tokenization and achieves state-of-the-art performance on Sintel
optical flow estimation.
Outputs are produced by attending to the latent array using a specific output query associated with that particular output. For example, to predict optical flow on one pixel, a query would attend using the pixel's xy coordinates plus an optical flow task embedding to produce a single flow vector. It is a variation on the encoder/decoder architecture used in other designs.
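The decoding step can be sketched with the same QKV attention described under Components, with the roles reversed: queries come from an output query array (one query per desired output), while keys and values come from the latent array. The query construction and sizes below are illustrative assumptions, not the published configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode(output_queries, latent, Wq, Wk, Wv):
    """One output per query: each query (e.g. a pixel's position features
    plus a task embedding) attends over the small latent array, so the
    number of outputs is independent of both input and latent size."""
    Q = output_queries @ Wq                 # (P, d)
    K = latent @ Wk                         # (N, d)
    V = latent @ Wv                         # (N, d)
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V  # (P, d)

rng = np.random.default_rng(1)
d = 16
latent = rng.normal(size=(8, d))      # latent array produced by the encoder
# Hypothetical queries: one per output pixel, in practice built from
# xy position features concatenated with a task embedding.
queries = rng.normal(size=(100, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
flow_features = decode(queries, latent, Wq, Wk, Wv)
print(flow_features.shape)  # (100, 16): one feature vector per queried pixel
```

Because the output size is set entirely by the number of queries, the same latent space can be decoded into dense per-pixel maps, token sequences, or a single classification vector.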
Performance
Perceiver's performance is comparable to ResNet-50 and ViT on ImageNet without 2D convolutions. It attends to 50,000 pixels. It is competitive in all modalities in AudioSet.
See also
*
Convolutional neural network
*
Transformer (machine learning model)
References
External links
* {{YouTube|P_xeshTnPZg|Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)}}, with the Fourier features explained in more detail