Residual Neural Network

A residual neural network (ResNet) is an artificial neural network (ANN). It is a gateless or open-gated variant of the HighwayNet, the first working very deep feedforward neural network with hundreds of layers, much deeper than previous neural networks. ''Skip connections'' or ''shortcuts'' are used to jump over some layers (HighwayNets may also learn the skip weights themselves through an additional weight matrix for their gates). Typical ''ResNet'' models are implemented with double- or triple-layer skips that contain nonlinearities (ReLU) and batch normalization in between. Models with several parallel skips are referred to as ''DenseNets''. In the context of residual neural networks, a non-residual network may be described as a ''plain network''.

As in the case of Long Short-Term Memory (LSTM) recurrent neural networks, there are two main reasons to add skip connections: to avoid the problem of vanishing gradients, leading to easier-to-optimize networks in which the gating mechanisms facilitate information flow across many layers ("information highways"), or to mitigate the degradation (accuracy saturation) problem, where adding more layers to a suitably deep model leads to higher training error.

During training, the weights adapt to mute the upstream layer and amplify the previously-skipped layer. In the simplest case, only the weights for the adjacent layer's connection are adapted, with no explicit weights for the upstream layer. This works best when a single nonlinear layer is stepped over, or when the intermediate layers are all linear. If not, then an explicit weight matrix should be learned for the skipped connection (a HighwayNet should be used).

Skipping effectively simplifies the network, using fewer layers in the initial training stages. This speeds learning by reducing the impact of vanishing gradients, as there are fewer layers to propagate through. The network then gradually restores the skipped layers as it learns the feature space. Towards the end of training, when all layers are expanded, the network stays closer to the manifold and thus learns faster. A neural network without residual parts explores more of the feature space. This makes it more vulnerable to perturbations that cause it to leave the manifold, and necessitates extra training data to recover.

A residual neural network was used to win the ImageNet 2015 competition, and has become the most cited neural network of the 21st century.
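To make the double-layer skip concrete, the following is a minimal sketch of a residual block, written here in PyTorch as an assumed framework; the class name ResidualBlock and the channel count are illustrative choices, not taken from any particular reference implementation.

# A minimal sketch of a double-layer residual block (assumes PyTorch).
# The class name and channel count are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Computes ReLU(F(x) + x), where F is two conv layers with batch normalization."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))   # first skipped layer
        out = self.bn2(self.conv2(out))         # second skipped layer
        return F.relu(out + x)                  # identity shortcut jumps over both layers

# Usage: the output of the block has the same shape as its input.
x = torch.randn(1, 64, 32, 32)
y = ResidualBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 32, 32])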


Forward propagation

Given a weight matrix W^{\ell-1,\ell} for connection weights from layer \ell-1 to \ell, and a weight matrix W^{\ell-2,\ell} for connection weights from layer \ell-2 to \ell, the forward propagation through the activation function would be (aka ''HighwayNets'')

: \begin{align} a^\ell & := \mathbf{g}(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^\ell + W^{\ell-2,\ell} \cdot a^{\ell-2}) \\ & := \mathbf{g}(Z^\ell + W^{\ell-2,\ell} \cdot a^{\ell-2}) \end{align}

where

: a^\ell the activations (outputs) of neurons in layer \ell,
: \mathbf{g} the activation function for layer \ell,
: W^{\ell-1,\ell} the weight matrix for neurons between layer \ell-1 and \ell, and
: Z^\ell = W^{\ell-1,\ell} \cdot a^{\ell-1} + b^\ell.

If the number of vertices on layer \ell-2 equals the number of vertices on layer \ell and if W^{\ell-2,\ell} is the identity matrix, then forward propagation through the activation function simplifies to

: a^\ell := \mathbf{g}(Z^\ell + a^{\ell-2}).

In this case, the connection between layers \ell-2 and \ell is called an identity block.

In the cerebral cortex such forward skips are done for several layers. Usually all forward skips start from the same layer, and successively connect to later layers. In the general case this will be expressed as (aka ''DenseNets'')

: a^\ell := \mathbf{g}\left( Z^\ell + \sum_{k=2}^{K} W^{\ell-k,\ell} \cdot a^{\ell-k} \right).
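The following is a minimal NumPy sketch of the forward rules above; the layer width, random values, and variable names are illustrative assumptions. It contrasts the HighwayNet-style skip (explicit weight matrix on the skip path) with the ResNet identity block.

# Forward propagation with a skip connection (assumes NumPy).
# Layer width and values are illustrative.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
n = 8                                      # neurons per layer (illustrative)

a_prev2 = rng.standard_normal(n)           # a^{l-2}
a_prev1 = rng.standard_normal(n)           # a^{l-1}
W_main  = rng.standard_normal((n, n))      # W^{l-1,l}
b       = rng.standard_normal(n)           # b^l

Z = W_main @ a_prev1 + b                   # Z^l = W^{l-1,l} . a^{l-1} + b^l

# HighwayNet-style skip: an explicit (learnable) weight matrix on the skip path.
W_skip = rng.standard_normal((n, n))       # W^{l-2,l}
a_highway = relu(Z + W_skip @ a_prev2)

# ResNet identity block: W^{l-2,l} is the identity, so the earlier activations
# are simply added before the nonlinearity.
a_resnet = relu(Z + a_prev2)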


Backward propagation

During backpropagation learning for the normal path

: \Delta w^{\ell-1,\ell} := -\eta \frac{\partial E^\ell}{\partial w^{\ell-1,\ell}} = -\eta a^{\ell-1} \cdot \delta^\ell

and for the skip paths (nearly identical)

: \Delta w^{\ell-2,\ell} := -\eta \frac{\partial E^\ell}{\partial w^{\ell-2,\ell}} = -\eta a^{\ell-2} \cdot \delta^\ell.

In both cases

: \eta a learning rate (\eta > 0),
: \delta^\ell the error signal of neurons at layer \ell, and
: a^\ell the activation of neurons at layer \ell.

If the skip path has fixed weights (e.g. the identity matrix, as above), then they are not updated. If they can be updated, the rule is an ordinary backpropagation update rule. In the general case there can be K skip path weight matrices, thus

: \Delta w^{\ell-k,\ell} := -\eta \frac{\partial E^\ell}{\partial w^{\ell-k,\ell}} = -\eta a^{\ell-k} \cdot \delta^\ell.

As the learning rules are similar, the weight matrices can be merged and learned in the same step.
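To make the update rules concrete, here is a minimal NumPy sketch that assumes the error signal \delta^\ell has already been computed by ordinary backpropagation; the shapes, values, and variable names are illustrative.

# Weight updates for the normal path and a learnable skip path (assumes NumPy).
# delta is taken as given; all values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, eta = 8, 0.01                           # layer width and learning rate (eta > 0)

a_prev1 = rng.standard_normal(n)           # a^{l-1}, activations on the normal path
a_prev2 = rng.standard_normal(n)           # a^{l-2}, activations on the skip path
delta   = rng.standard_normal(n)           # delta^l, error signal at layer l

# Normal path: Delta w^{l-1,l} = -eta * a^{l-1} * delta^l, one entry per weight.
dW_main = -eta * np.outer(delta, a_prev1)

# Skip path with learnable weights: the rule is nearly identical.
dW_skip = -eta * np.outer(delta, a_prev2)

# Skip path with fixed identity weights (an identity block): no update is applied;
# the error signal simply flows through to the earlier layer unchanged.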

