Residual Neural Network

A residual neural network (ResNet) is an artificial neural network (ANN). It is a gateless or open-gated variant of the HighwayNet, the first working very deep feedforward neural network with hundreds of layers, much deeper than previous neural networks. ''Skip connections'' or ''shortcuts'' are used to jump over some layers (HighwayNets may also learn the skip weights themselves through an additional weight matrix for their gates). Typical ''ResNet'' models are implemented with double- or triple-layer skips that contain nonlinearities (ReLU) and batch normalization in between. Models with several parallel skips are referred to as ''DenseNets''. In the context of residual neural networks, a non-residual network may be described as a ''plain network''.

As in the case of Long Short-Term Memory (LSTM) recurrent neural networks, there are two main reasons to add skip connections: to avoid the problem of vanishing gradients, leading to easier-to-optimize networks in which the gating mechanisms facilitate information flow across many layers ("information highways"), or to mitigate the degradation (accuracy saturation) problem, where adding more layers to a suitably deep model leads to higher training error.

During training, the weights adapt to mute the upstream layer and amplify the previously-skipped layer. In the simplest case, only the weights for the adjacent layer's connection are adapted, with no explicit weights for the upstream layer. This works best when a single nonlinear layer is stepped over, or when the intermediate layers are all linear. If not, then an explicit weight matrix should be learned for the skipped connection (a HighwayNet should be used).

Skipping effectively simplifies the network, using fewer layers in the initial training stages. This speeds learning by reducing the impact of vanishing gradients, as there are fewer layers to propagate through. The network then gradually restores the skipped layers as it learns the feature space. Towards the end of training, when all layers are expanded, the network stays closer to the manifold and thus learns faster. A neural network without residual parts explores more of the feature space. This makes it more vulnerable to perturbations that cause it to leave the manifold, and necessitates extra training data to recover.

A residual neural network was used to win the ImageNet 2015 competition, and has become the most cited neural network of the 21st century.
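To make the double-layer skip concrete, the following is a minimal sketch of a residual block, written here in PyTorch as an assumed framework; the class name ResidualBlock and the channel count are illustrative choices, not taken from any particular reference implementation.

# A minimal sketch of a double-layer residual block (assumes PyTorch).
# The class name and channel count are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Computes ReLU(F(x) + x), where F is two conv layers with batch normalization."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))   # first skipped layer
        out = self.bn2(self.conv2(out))         # second skipped layer
        return F.relu(out + x)                  # identity shortcut jumps over both layers

# Usage: the output of the block has the same shape as its input.
x = torch.randn(1, 64, 32, 32)
y = ResidualBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 32, 32])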


Forward propagation

Given a weight matrix W^{\ell-1,\ell} for connection weights from layer \ell-1 to \ell, and a weight matrix W^{\ell-2,\ell} for connection weights from layer \ell-2 to \ell, the forward propagation through the activation function would be (aka ''HighwayNets'')

: \begin{align} a^\ell & := \mathbf{g}(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^\ell + W^{\ell-2,\ell} \cdot a^{\ell-2}) \\ & := \mathbf{g}(Z^\ell + W^{\ell-2,\ell} \cdot a^{\ell-2}) \end{align}

where

: a^\ell the activations (outputs) of neurons in layer \ell,
: \mathbf{g} the activation function for layer \ell,
: W^{\ell-1,\ell} the weight matrix for neurons between layer \ell-1 and \ell, and
: Z^\ell = W^{\ell-1,\ell} \cdot a^{\ell-1} + b^\ell.

If the number of vertices on layer \ell-2 equals the number of vertices on layer \ell and if W^{\ell-2,\ell} is the identity matrix, then forward propagation through the activation function simplifies to

: a^\ell := \mathbf{g}(Z^\ell + a^{\ell-2}).

In this case, the connection between layers \ell-2 and \ell is called an identity block.

In the cerebral cortex such forward skips are done for several layers. Usually all forward skips start from the same layer, and successively connect to later layers. In the general case this will be expressed as (aka ''DenseNets'')

: a^\ell := \mathbf{g}\left( Z^\ell + \sum_{k=2}^{K} W^{\ell-k,\ell} \cdot a^{\ell-k} \right).
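The following is a minimal NumPy sketch of the forward rules above; the layer width, random values, and variable names are illustrative assumptions. It contrasts the HighwayNet-style skip (explicit weight matrix on the skip path) with the ResNet identity block.

# Forward propagation with a skip connection (assumes NumPy).
# Layer width and values are illustrative.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
n = 8                                      # neurons per layer (illustrative)

a_prev2 = rng.standard_normal(n)           # a^{l-2}
a_prev1 = rng.standard_normal(n)           # a^{l-1}
W_main  = rng.standard_normal((n, n))      # W^{l-1,l}
b       = rng.standard_normal(n)           # b^l

Z = W_main @ a_prev1 + b                   # Z^l = W^{l-1,l} . a^{l-1} + b^l

# HighwayNet-style skip: an explicit (learnable) weight matrix on the skip path.
W_skip = rng.standard_normal((n, n))       # W^{l-2,l}
a_highway = relu(Z + W_skip @ a_prev2)

# ResNet identity block: W^{l-2,l} is the identity, so the earlier activations
# are simply added before the nonlinearity.
a_resnet = relu(Z + a_prev2)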


Backward propagation

During backpropagation learning for the normal path

: \Delta w^{\ell-1,\ell} := -\eta \frac{\partial E^\ell}{\partial w^{\ell-1,\ell}} = -\eta a^{\ell-1} \cdot \delta^\ell

and for the skip paths (nearly identical)

: \Delta w^{\ell-2,\ell} := -\eta \frac{\partial E^\ell}{\partial w^{\ell-2,\ell}} = -\eta a^{\ell-2} \cdot \delta^\ell.

In both cases

: \eta a learning rate (\eta > 0),
: \delta^\ell the error signal of neurons at layer \ell, and
: a^\ell the activation of neurons at layer \ell.

If the skip path has fixed weights (e.g. the identity matrix, as above), then they are not updated. If they can be updated, the rule is an ordinary backpropagation update rule. In the general case there can be K skip path weight matrices, thus

: \Delta w^{\ell-k,\ell} := -\eta \frac{\partial E^\ell}{\partial w^{\ell-k,\ell}} = -\eta a^{\ell-k} \cdot \delta^\ell.

As the learning rules are similar, the weight matrices can be merged and learned in the same step.
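To make the update rules concrete, here is a minimal NumPy sketch that assumes the error signal \delta^\ell has already been computed by ordinary backpropagation; the shapes, values, and variable names are illustrative.

# Weight updates for the normal path and a learnable skip path (assumes NumPy).
# delta is taken as given; all values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, eta = 8, 0.01                           # layer width and learning rate (eta > 0)

a_prev1 = rng.standard_normal(n)           # a^{l-1}, activations on the normal path
a_prev2 = rng.standard_normal(n)           # a^{l-2}, activations on the skip path
delta   = rng.standard_normal(n)           # delta^l, error signal at layer l

# Normal path: Delta w^{l-1,l} = -eta * a^{l-1} * delta^l, one entry per weight.
dW_main = -eta * np.outer(delta, a_prev1)

# Skip path with learnable weights: the rule is nearly identical.
dW_skip = -eta * np.outer(delta, a_prev2)

# Skip path with fixed identity weights (an identity block): no update is applied;
# the error signal simply flows through to the earlier layer unchanged.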

