Vanishing Gradient

In machine learning, the vanishing gradient problem is encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that in some cases the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training. As one example of the cause, traditional activation functions such as the hyperbolic tangent have gradients in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the early layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the early layers train very slowly.

Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure as the "vanishing gradient problem", which affects not only many-layered feedforward networks but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. (The combination of unfolding and backpropagation is termed backpropagation through time.)

When activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem.
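As a back-of-the-envelope illustration (added here; the pre-activation value 2.0 and the depths are arbitrary choices), the following Python snippet shows how quickly a chain of saturated tanh derivatives shrinks:

```python
import numpy as np

# Derivative of tanh at a moderately saturated pre-activation, and the factor it
# contributes when multiplied across n layers by the chain rule.
z = 2.0
d = 1.0 - np.tanh(z) ** 2          # ~0.07
for n in (5, 10, 20):
    print(f"n = {n:2d}   product of derivatives = {d ** n:.2e}")
```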


Prototypical models

This section is based on (Pascanu et al., 2013).


Recurrent network model

A generic recurrent network has hidden states h_1, h_2, \dots, inputs u_1, u_2, \dots, and outputs x_1, x_2, \dots. Let it be parametrized by \theta, so that the system evolves as

(h_t, x_t) = F(h_{t-1}, u_t, \theta).

Often, the output x_t is a function of h_t, as some x_t = G(h_t). The vanishing gradient problem already presents itself clearly when x_t = h_t, so we identify them. This gives a recurrent network with

x_t = F(x_{t-1}, u_t, \theta).

Now take its differential:

\begin{aligned}
dx_t &= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\, dx_{t-1} \\
&= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\left(\nabla_\theta F(x_{t-2}, u_{t-1}, \theta)\, d\theta + \nabla_x F(x_{t-2}, u_{t-1}, \theta)\, dx_{t-2}\right) \\
&= \cdots \\
&= \left(\nabla_\theta F(x_{t-1}, u_t, \theta) + \nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_\theta F(x_{t-2}, u_{t-1}, \theta) + \cdots \right) d\theta
\end{aligned}

Training the network requires a loss function L(x_T, u_1, \dots, u_T) to be minimized. Minimizing it by gradient descent gives

\Delta \theta = -\eta \left[\nabla_x L(x_T)\left(\nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots \right)\right]^T

where \eta is the learning rate. The vanishing/exploding gradient problem appears because the expansion contains repeated multiplications of the form

\nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_x F(x_{t-2}, u_{t-1}, \theta)\, \nabla_x F(x_{t-3}, u_{t-2}, \theta) \cdots
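As a minimal numerical illustration (added here, not part of the original analysis), the following Python sketch multiplies copies of a recurrent Jacobian \nabla_x F with a chosen spectral radius and prints the operator norm of the running product; the linear choice of F, the dimensions, and the two spectral radii are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_matrix_with_spectral_radius(n, rho):
    """Return an n x n matrix rescaled so its spectral radius is rho."""
    W = rng.standard_normal((n, n))
    return W * (rho / np.abs(np.linalg.eigvals(W)).max())

n, steps = 8, 50
for rho in (0.9, 1.1):                          # contracting vs. expanding recurrent Jacobian
    J = random_matrix_with_spectral_radius(n, rho)
    product = np.eye(n)
    for k in range(1, steps + 1):
        product = J @ product                   # repeated multiplication nabla_x F ... nabla_x F
        if k % 10 == 0:
            norm = np.linalg.norm(product, 2)   # operator (spectral) norm of the product
            print(f"rho={rho:.1f}  k={k:2d}  ||product||_2 = {norm:.3e}")
```

For the contracting case the norm decays toward zero (vanishing gradient), while for the expanding case it grows without bound (exploding gradient).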


Example: recurrent network with sigmoid activation

For a concrete example, consider a typical recurrent network defined by

x_t = F(x_{t-1}, u_t, \theta) = W_{rec}\, \sigma(x_{t-1}) + W_{in}\, u_t + b

where \theta = (W_{rec}, W_{in}) is the network parameter, \sigma is the sigmoid activation function, applied to each vector coordinate separately, and b is the bias vector. Then \nabla_x F(x_{t-1}, u_t, \theta) = W_{rec} \operatorname{diag}(\sigma'(x_{t-1})), and so

\nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_x F(x_{t-2}, u_{t-1}, \theta) \cdots \nabla_x F(x_{t-k}, u_{t-k+1}, \theta) = W_{rec} \operatorname{diag}(\sigma'(x_{t-1}))\, W_{rec} \operatorname{diag}(\sigma'(x_{t-2})) \cdots W_{rec} \operatorname{diag}(\sigma'(x_{t-k})).

Since |\sigma'| \leq 1, the operator norm of the above product is bounded above by \|W_{rec}\|^k. So if the spectral radius of W_{rec} is \gamma < 1, then at large k the above product has operator norm bounded above by \gamma^k \to 0. This is the prototypical vanishing gradient problem.

The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall the gradient expansion above:

\nabla_\theta L = \nabla_x L(x_T, u_1, \dots, u_T)\left(\nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots \right).

The components of \nabla_\theta F(x, u, \theta) are just components of \sigma(x) and u, so if u_t, u_{t-1}, \dots are bounded, then \|\nabla_\theta F(x_{t-k}, u_{t-k+1}, \theta)\| is also bounded by some M > 0, and so the terms in \nabla_\theta L decay as M\gamma^k. This means that, effectively, \nabla_\theta L is affected only by the first O(\gamma^{-1}) terms in the sum.

If \gamma \geq 1, the above analysis does not quite work. For the prototypical exploding gradient problem, the next model is clearer.
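A small numerical check of this bound (an illustration added here; the dimensions, the spectral radius 0.8, the zero inputs, and the zero bias are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, k, gamma = 6, 40, 0.8
W_rec = rng.standard_normal((n, n))
W_rec *= gamma / np.abs(np.linalg.eigvals(W_rec)).max()   # spectral radius gamma < 1

x = rng.standard_normal(n)
product = np.eye(n)
for step in range(1, k + 1):
    # Jacobian of F w.r.t. the previous state: W_rec diag(sigma'(x_{t-1}))
    jac = W_rec @ np.diag(sigmoid(x) * (1.0 - sigmoid(x)))
    product = jac @ product
    x = W_rec @ sigmoid(x)                                 # free-running update (u_t = 0, b = 0)
    if step % 10 == 0:
        print(f"k={step:2d}  ||product||_2 = {np.linalg.norm(product, 2):.2e}  "
              f"gamma^k = {gamma**step:.2e}")
```

In practice the product decays even faster than \gamma^k, since the sigmoid derivative is at most 0.25, so the bound is loose but the qualitative behaviour is the same.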


Dynamical systems model

Following (Doya, 1993), consider this one-neuron recurrent network with sigmoid activation:

x_{t+1} = (1-\epsilon) x_t + \epsilon\, \sigma(w x_t + b) + \epsilon w' u_t.

In the small-\epsilon limit, the dynamics of the network become

\frac{dx(t)}{dt} = -x(t) + \sigma(w x(t) + b) + w' u(t).

Consider first the autonomous case, with u = 0. Set w = 5.0 and vary b in [-3, -2]. As b decreases, the system first has one stable point, then has two stable points and one unstable point, and finally has one stable point again. Explicitly, the stable points satisfy

(x, b) = \left(x,\ \ln\left(\frac{x}{1-x}\right) - 5x\right).

Now consider \frac{\partial x(T)}{\partial x(0)} and \frac{\partial x(T)}{\partial b}, where T is large enough that the system has settled into one of the stable points. If (x(0), b) puts the system very close to an unstable point, then a tiny variation in x(0) or b would make x(T) move from one stable point to the other. This makes \frac{\partial x(T)}{\partial x(0)} and \frac{\partial x(T)}{\partial b} both very large, a case of the exploding gradient. If (x(0), b) puts the system far from an unstable point, then a small variation in x(0) would have no effect on x(T), making \frac{\partial x(T)}{\partial x(0)} = 0, a case of the vanishing gradient.

Note that in this case, \frac{\partial x(T)}{\partial b} \approx \frac{\partial x^*}{\partial b} = \left(\frac{1}{x(1-x)} - 5\right)^{-1} neither decays to zero nor blows up to infinity. Indeed, it is the only well-behaved gradient, which explains why early research focused on learning or designing recurrent systems that could perform long-range computations (such as outputting the first input they see at the very end of an episode) by shaping their stable attractors. For the general case, the intuition still holds; the situation is plotted in Figures 3, 4, and 5 of the cited analysis.
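A brief numerical sketch of this behaviour (an illustration added here; the Euler integration scheme, the horizon T, the step size, and the probe points x(0) = 0.5 and x(0) = 0.05 are choices made for the example, not part of the original analysis):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def x_T(x0, b, w=5.0, dt=0.01, T=50.0):
    """Integrate dx/dt = -x + sigmoid(w*x + b) by forward Euler and return x(T)."""
    x = x0
    for _ in range(int(T / dt)):
        x += dt * (-x + sigmoid(w * x + b))
    return x

# Finite-difference sensitivities of x(T) near vs. far from the unstable point at b = -2.5.
b, h = -2.5, 1e-4
for x0, label in [(0.50, "near the unstable point"), (0.05, "far from the unstable point")]:
    d_db  = (x_T(x0, b + h) - x_T(x0, b - h)) / (2 * h)   # sensitivity to the bias b
    d_dx0 = (x_T(x0 + h, b) - x_T(x0 - h, b)) / (2 * h)   # sensitivity to the initial state
    print(f"x(0)={x0:.2f} ({label}):  dx(T)/db = {d_db:.3e},  dx(T)/dx(0) = {d_dx0:.3e}")
```

Near the unstable point both sensitivities are huge (exploding gradient); far from it the sensitivity to x(0) is essentially zero (vanishing gradient) while the sensitivity to b stays moderate.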


Geometric model

Continuing with the above one-neuron network, fix w = 5, x(0) = 0.5, u(t) = 0, and consider the loss function L(x(T)) = (0.855 - x(T))^2. This produces a rather pathological loss landscape: as b approaches -2.5 from above, the loss approaches zero, but as soon as b crosses -2.5, the attractor basin changes and the loss jumps to 0.50. Consequently, attempting to train b by gradient descent would "hit a wall in the loss landscape" and cause an exploding gradient. A slightly more complex situation is plotted in Figure 6 of the cited analysis.
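The discontinuity can be traced numerically with the same toy integrator as above (an added illustration; the particular grid of b values is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def x_T(b, x0=0.5, w=5.0, dt=0.01, T=50.0):
    """Forward-Euler integration of dx/dt = -x + sigmoid(w*x + b), with u(t) = 0."""
    x = x0
    for _ in range(int(T / dt)):
        x += dt * (-x + sigmoid(w * x + b))
    return x

# Loss L(b) = (0.855 - x(T))^2: smooth descent toward zero for b > -2.5,
# then a jump to about 0.5 once b crosses -2.5 and the trajectory falls into the other basin.
for b in (-2.46, -2.48, -2.49, -2.499, -2.501, -2.51, -2.53):
    xT = x_T(b)
    print(f"b = {b:+.3f}   x(T) = {xT:.3f}   loss = {(0.855 - xT) ** 2:.3f}")
```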


Solutions

To overcome this problem, several methods have been proposed.


Batch normalization

Batch normalization is a standard method for mitigating both the exploding and the vanishing gradient problems. It normalizes each layer's inputs by re-centering and re-scaling them over the current mini-batch.
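A minimal sketch of the training-time forward pass (an illustration added here; the inference-time running averages, the backward pass, and any deep-learning framework API are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch, features), then re-scale and re-shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable re-scaling (gamma) and re-centering (beta)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(64, 5))       # poorly scaled activations
y = batch_norm_forward(x, gamma=np.ones(5), beta=np.zeros(5))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```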


Gradient clipping

One proposed remedy is to clip the norm of \nabla_\theta L at a threshold \epsilon:

g = \begin{cases} \nabla_\theta L & \text{if } \|\nabla_\theta L\| < \epsilon, \\ \epsilon \cdot \dfrac{\nabla_\theta L}{\|\nabla_\theta L\|} & \text{otherwise,} \end{cases}

where \epsilon is the "threshold" hyperparameter, the maximum norm that the gradient is allowed to reach. Simply clipping each entry of \nabla_\theta L separately at \epsilon also works well in practice. Clipping addresses the exploding gradient problem but does not solve the vanishing gradient problem.
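Both variants are one-liners in practice; the sketch below (an added illustration, not tied to any particular framework) implements them with NumPy:

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Rescale grad so its L2 norm never exceeds threshold (global norm clipping)."""
    norm = np.linalg.norm(grad)
    if norm < threshold:
        return grad
    return threshold * grad / norm

def clip_gradient_elementwise(grad, threshold):
    """Clip each entry of grad into [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

g = np.array([3.0, -4.0])                     # ||g|| = 5
print(clip_gradient_norm(g, threshold=1.0))   # [ 0.6 -0.8], norm 1
print(clip_gradient_elementwise(g, 1.0))      # [ 1. -1.]
```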


Multi-level hierarchy

One is Jürgen Schmidhuber's multi-level hierarchy of networks (1992), pre-trained one level at a time through unsupervised learning and fine-tuned through backpropagation (J. Schmidhuber, "Learning complex, extended sequences using the principle of history compression," Neural Computation, 4, pp. 234–242, 1992). Here each level learns a compressed representation of the observations that is fed to the next level.


Related approach

Similar ideas have been used in feed-forward neural networks for unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. Then the network is trained further by supervised backpropagation to classify labeled data. The deep belief network model by Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations. Hinton reported that his models are effective feature extractors over high-dimensional, structured data.
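A compact sketch of the greedy layer-wise idea (an illustration added here, assuming Bernoulli units trained with one step of contrastive divergence; the layer sizes, learning rate, and toy data are arbitrary, and the supervised fine-tuning stage is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    """Bernoulli restricted Boltzmann machine trained with CD-1."""

    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0):
        """One CD-1 gradient step on a mini-batch v0 of shape (batch, n_visible)."""
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)           # one reconstruction step
        h1 = self.hidden_probs(v1)
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / batch
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return ((v0 - v1) ** 2).mean()               # reconstruction error, for monitoring

# Greedy layer-wise pre-training: each RBM models the hidden activities of the layer below.
data = (rng.random((256, 64)) < 0.3).astype(float)   # toy binary data
layer_sizes = [64, 32, 16]
rbms, layer_input = [], data
for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
    rbm = RBM(n_vis, n_hid)
    for epoch in range(20):
        err = rbm.cd1_update(layer_input)
    rbms.append(rbm)
    layer_input = rbm.hidden_probs(layer_input)      # feed the representation to the next level
    print(f"trained RBM {n_vis}->{n_hid}, reconstruction error {err:.4f}")
```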


Long short-term memory

Another technique, used in particular for recurrent neural networks, is the long short-term memory (LSTM) network of 1997 by Hochreiter & Schmidhuber. In 2009, deep multidimensional LSTM networks demonstrated the power of deep learning with many nonlinear layers by winning three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned.
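A minimal sketch of a single LSTM cell step (an illustration added here; the dimensions, initialization, and variable names are arbitrary, and peephole connections and the unrolled training loop are omitted). The point relevant to the vanishing gradient problem is the additive, gated update of the cell state c, which is not repeatedly squashed through a saturating nonlinearity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state h and cell state c."""
    W, U, b = params                      # input weights, recurrent weights, biases
    z = W @ x + U @ h_prev + b            # all four gate pre-activations, stacked
    n = h_prev.size
    f = sigmoid(z[0 * n:1 * n])           # forget gate
    i = sigmoid(z[1 * n:2 * n])           # input gate
    o = sigmoid(z[2 * n:3 * n])           # output gate
    g = np.tanh(z[3 * n:4 * n])           # candidate cell update
    c = f * c_prev + i * g                # additive cell-state update ("constant error carousel")
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = (0.1 * rng.standard_normal((4 * n_hidden, n_in)),
          0.1 * rng.standard_normal((4 * n_hidden, n_hidden)),
          np.zeros(4 * n_hidden))
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(5):                        # run a short input sequence through the cell
    h, c = lstm_step(rng.standard_normal(n_in), h, c, params)
print(h.round(3), c.round(3))
```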


Faster hardware

Hardware advances meant that from 1991 to 2015, computer power (especially as delivered by GPUs) increased around a million-fold, making standard backpropagation feasible for networks several layers deeper than when the vanishing gradient problem was recognized. Schmidhuber notes that this "is basically what is winning many of the image recognition competitions now", but that it "does not really overcome the problem in a fundamental way", since the original models tackling the vanishing gradient problem by Hinton and others were trained on a Xeon processor, not GPUs.


Residual networks

One of the newest and most effective ways to resolve the vanishing gradient problem is with residual neural networks, or ResNets (not to be confused with recurrent neural networks). ResNets are neural networks in which skip connections, or residual connections, are part of the architecture. These skip connections allow gradient information to pass through the layers by creating "highways" of information, where the output of an earlier layer is added to the output of a deeper layer. This allows information from the earlier parts of the network to be passed to the deeper parts, helping maintain signal propagation even in deeper networks. Skip connections are a critical component of what allowed successful training of deeper neural networks. ResNets yielded lower training error (and test error) than their shallower counterparts simply by reintroducing outputs from shallower layers in the network to compensate for the vanishing gradient signal. Note that ResNets behave as an ensemble of relatively shallow networks and do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network; rather, they avoid the problem by constructing ensembles of many short networks together ("ensemble by construction").
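A schematic sketch of why the additive skip path helps (an illustration added here with arbitrary small random weights, showing the forward signal as a stand-in for the symmetric backward gradient path; this is the residual idea, not the original ResNet architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = x + F(x): the skip connection adds the input to the block's output."""
    return x + W2 @ relu(W1 @ x)

# Compare a deep plain stack with a residual stack of the same layers: the identity
# term keeps the signal from shrinking toward zero with depth.
n, depth = 16, 50
weights = [(0.1 * rng.standard_normal((n, n)), 0.1 * rng.standard_normal((n, n)))
           for _ in range(depth)]
x_plain = x_res = rng.standard_normal(n)
for W1, W2 in weights:
    x_plain = W2 @ relu(W1 @ x_plain)      # plain stack: repeated multiplication only
    x_res = residual_block(x_res, W1, W2)  # residual stack: identity path preserved
print(f"plain depth-{depth} output norm:    {np.linalg.norm(x_plain):.2e}")
print(f"residual depth-{depth} output norm: {np.linalg.norm(x_res):.2e}")
```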


Other activation functions

Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction: the derivative is exactly 1 for positive inputs, so it does not shrink the backpropagated signal there.
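A quick comparison of the two derivatives (an added illustration; the probe points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # at most 0.25, and ~0 for |z| large (saturates both ways)

def relu_derivative(z):
    return np.where(z > 0, 1.0, 0.0)   # exactly 1 for z > 0 (no shrinking), 0 otherwise

for z in (-6.0, -1.0, 1.0, 6.0):
    print(f"z = {z:+.1f}   sigmoid' = {sigmoid_derivative(z):.4f}   relu' = {float(relu_derivative(z)):.1f}")
```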


Weight initialization

Weight initialization is another approach that has been proposed to reduce the vanishing gradient problem in deep networks. Kumar suggested that the distribution of initial weights should vary according to the activation function used, and proposed initializing the weights of networks with the logistic activation function using a Gaussian distribution with zero mean and a standard deviation of 3.6/sqrt(N), where N is the number of neurons in a layer. More recently, Yilmaz and Poli performed a theoretical analysis of how gradients are affected by the mean of the initial weights in deep neural networks using the logistic activation function, and found that gradients do not vanish if the mean of the initial weights is set according to the formula max(−1, −8/N). This simple strategy allows networks with 10 or 15 hidden layers to be trained very efficiently and effectively using standard backpropagation.
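A hedged sketch of the two initialization rules (an illustration added here, interpreting N as the fan-in of the layer; the spread around the mean in the second scheme is an arbitrary choice, not part of the cited analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

def kumar_init(n_in, n_out):
    """Zero-mean Gaussian weights with std 3.6/sqrt(N), for logistic-activation networks."""
    return rng.normal(loc=0.0, scale=3.6 / np.sqrt(n_in), size=(n_out, n_in))

def yilmaz_poli_init(n_in, n_out, spread=0.1):
    """Weights whose *mean* is max(-1, -8/N); the spread is an assumption for this sketch."""
    mean = max(-1.0, -8.0 / n_in)
    return rng.normal(loc=mean, scale=spread, size=(n_out, n_in))

W1 = kumar_init(256, 128)
W2 = yilmaz_poli_init(256, 128)
print(f"Kumar init:       mean {W1.mean():+.3f}, std {W1.std():.3f} (target std {3.6/np.sqrt(256):.3f})")
print(f"Yilmaz-Poli init: mean {W2.mean():+.3f} (target mean {max(-1.0, -8.0/256):+.3f})")
```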


Other

Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid to solve problems like image reconstruction and face localization. Neural networks can also be optimized by using a universal search algorithm on the space of neural network weights, e.g., random guessing or, more systematically, a genetic algorithm. This approach is not based on gradients and avoids the vanishing gradient problem.
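A minimal sketch of the sign-based Rprop update rule (an illustration added here using commonly quoted default step factors and limits; this is not Behnke's exact training setup):

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
                 step_min=1e-6, step_max=50.0):
    """One Rprop step: only the sign of the gradient is used, not its magnitude."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(sign_change < 0, 0.0, grad)     # skip the update after a sign flip
    w = w - np.sign(grad) * step
    return w, grad, step

# Minimize f(w) = sum(w^2); even where the gradient is tiny, the step size does not depend on it.
w = np.array([1e-4, 2.0, -3.0])
grad_prev = np.zeros_like(w)
step = np.full_like(w, 0.1)
for _ in range(30):
    grad = 2 * w
    w, grad_prev, step = rprop_update(w, grad, grad_prev, step)
print(w.round(4))
```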


See also

* Spectral radius

