ResNets
A residual neural network (ResNet) is an artificial neural network (ANN). It is a gateless or open-gated variant of the HighwayNet, the first working very deep feedforward neural network with hundreds of layers, much deeper than previous neural networks. ''Skip connections'' or ''shortcuts'' are used to jump over some layers (HighwayNets may also learn the skip weights themselves through an additional weight matrix for their gates). Typical ''ResNet'' models are implemented with double- or triple-layer skips that contain nonlinearities (ReLU) and batch normalization in between. Models with several parallel skips are referred to as ''DenseNets''. In the context of residual neural networks, a non-residual network may be described as a ''plain network''.

As in Long Short-Term Memory (LSTM) recurrent neural networks, there are two main reasons to add skip connections: to avoid the problem of vanishing gradients, which makes the network easier to optimize because the skip or gating mechanisms facilitate information flow across many layers ("information highways"); and to mitigate the degradation (accuracy saturation) problem, in which adding more layers to a suitably deep model leads to higher training error.

During training, the weights adapt to mute the upstream layer and amplify the previously skipped layer. In the simplest case, only the weights for the adjacent layer's connection are adapted, with no explicit weights for the upstream layer. This works best when a single nonlinear layer is stepped over, or when the intermediate layers are all linear. If that is not the case, an explicit weight matrix should be learned for the skipped connection (a HighwayNet should be used).

Skipping effectively simplifies the network, using fewer layers in the initial training stages. This speeds learning by reducing the impact of vanishing gradients, as there are fewer layers to propagate through. The network then gradually restores the skipped layers as it learns the feature space. Towards the end of training, when all layers are expanded, it stays closer to the manifold and thus learns faster. A neural network without residual parts explores more of the feature space; this makes it more vulnerable to perturbations that cause it to leave the manifold, and extra training data is needed to recover. A residual neural network was used to win the ImageNet 2015 competition, and has become the most cited neural network of the 21st century.
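The double-layer skip described above can be made concrete with a short code sketch. The following is a minimal PyTorch-style residual block, assuming equal input and output channel counts so the shortcut can be a plain identity; the class name ResidualBlock and the specific layer sizes are illustrative, not taken from any particular published implementation.

import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Double-layer skip: conv -> BN -> ReLU -> conv -> BN, plus an identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                          # the skip ("shortcut") path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)      # residual addition before the final ReLU

# Example usage: a block that preserves a 64-channel feature map.
block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))         # output shape (1, 64, 32, 32)

If the block changes the number of channels or the spatial resolution, the shortcut is typically replaced by a learned 1x1 convolution so the addition remains shape-compatible.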


Forward propagation

Given a weight matrix W^{\ell-1,\ell} for connection weights from layer \ell-1 to \ell, and a weight matrix W^{\ell-2,\ell} for connection weights from layer \ell-2 to \ell, the forward propagation through the activation function would be (aka ''HighwayNets'')

: \begin{align} a^\ell & := \mathbf{g}(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^\ell + W^{\ell-2,\ell} \cdot a^{\ell-2}) \\ & := \mathbf{g}(Z^\ell + W^{\ell-2,\ell} \cdot a^{\ell-2}) \end{align}

where

: a^\ell the activations (outputs) of neurons in layer \ell,
: \mathbf{g} the activation function for layer \ell,
: W^{\ell-1,\ell} the weight matrix for neurons between layer \ell-1 and \ell, and
: Z^\ell = W^{\ell-1,\ell} \cdot a^{\ell-1} + b^\ell.

If the number of vertices on layer \ell-2 equals the number of vertices on layer \ell and if W^{\ell-2,\ell} is the identity matrix, then forward propagation through the activation function simplifies to

: a^\ell := \mathbf{g}(Z^\ell + a^{\ell-2}).

In this case, the connection between layers \ell-2 and \ell is called an identity block.

In the cerebral cortex such forward skips are done for several layers. Usually all forward skips start from the same layer, and successively connect to later layers. In the general case this will be expressed as (aka ''DenseNets'')

: a^\ell := \mathbf{g}\left( Z^\ell + \sum_{k=2}^{K} W^{\ell-k,\ell} \cdot a^{\ell-k} \right).
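The identity-block case can be checked with a few lines of numpy. This is a minimal sketch under assumed conventions (ReLU as the activation g, equal layer widths, randomly drawn weights); the names a_prev2, a_prev1 and Z mirror a^{\ell-2}, a^{\ell-1} and Z^\ell above.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)              # activation function g

rng = np.random.default_rng(0)
n = 8                                      # layers ell-2, ell-1 and ell all have n neurons,
                                           # so the skip weight matrix can be the identity

a_prev2 = rng.standard_normal(n)           # a^{ell-2}
W1 = 0.1 * rng.standard_normal((n, n))     # W^{ell-2,ell-1}
b1 = np.zeros(n)
W2 = 0.1 * rng.standard_normal((n, n))     # W^{ell-1,ell}
b2 = np.zeros(n)

a_prev1 = relu(W1 @ a_prev2 + b1)          # a^{ell-1}
Z = W2 @ a_prev1 + b2                      # Z^ell = W^{ell-1,ell} . a^{ell-1} + b^ell

a_plain    = relu(Z)                       # plain network: a^ell = g(Z^ell)
a_identity = relu(Z + a_prev2)             # identity block: a^ell = g(Z^ell + a^{ell-2})

Replacing the identity with a learned matrix, relu(Z + W_skip @ a_prev2), would give the HighwayNet form with an explicit skip weight matrix.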


Backward propagation

During backpropagation learning, the update for the normal path is

: \Delta w^{\ell-1,\ell} := -\eta \frac{\partial E^\ell}{\partial w^{\ell-1,\ell}} = -\eta \, a^{\ell-1} \cdot \delta^\ell

and for the skip paths (nearly identical)

: \Delta w^{\ell-2,\ell} := -\eta \frac{\partial E^\ell}{\partial w^{\ell-2,\ell}} = -\eta \, a^{\ell-2} \cdot \delta^\ell.

In both cases

: \eta is the learning rate (\eta > 0),
: \delta^\ell is the error signal of neurons at layer \ell, and
: a^\ell is the activation of neurons at layer \ell.

If the skip path has fixed weights (e.g. the identity matrix, as above), then they are not updated. If they can be updated, the rule is an ordinary backpropagation update rule. In the general case there can be K skip-path weight matrices, thus

: \Delta w^{\ell-k,\ell} := -\eta \frac{\partial E^\ell}{\partial w^{\ell-k,\ell}} = -\eta \, a^{\ell-k} \cdot \delta^\ell.

As the learning rules are similar, the weight matrices can be merged and learned in the same step.
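As a sketch of these update rules (again in numpy; the error signal delta is simply drawn at random here, standing in for whatever backpropagation would deliver to layer \ell, and all names are illustrative), the per-weight rule \Delta w_{ij} = -\eta \, a_j \cdot \delta_i becomes an outer product in matrix form:

import numpy as np

rng = np.random.default_rng(0)
n = 8
eta = 0.01                                  # learning rate, eta > 0

a_prev1 = rng.standard_normal(n)            # a^{ell-1}: activations feeding the normal path
a_prev2 = rng.standard_normal(n)            # a^{ell-2}: activations feeding the skip path
delta   = rng.standard_normal(n)            # delta^ell: error signal at layer ell (assumed given)

# Normal path: Delta w^{ell-1,ell} = -eta * a^{ell-1} * delta^ell
dW_normal = -eta * np.outer(delta, a_prev1)

# Learnable skip path (HighwayNet style): same rule with a^{ell-2} in place of a^{ell-1}
dW_skip = -eta * np.outer(delta, a_prev2)

# Identity skip (ResNet style): the skip weights are fixed, so no update is computed.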

