In machine learning, the vanishing gradient problem is encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight.
The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value.
In the worst case, this may completely stop the neural network from further training.
As one example of the problem's cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range $(0, 1]$, and backpropagation computes gradients by the chain rule. This has the effect of multiplying $n$ of these small numbers to compute gradients of the early layers in an $n$-layer network, meaning that the gradient (error signal) decreases exponentially with $n$ while the early layers train very slowly.
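To illustrate the effect numerically, the following sketch (an illustration added here, not part of the original text; the layer count and weight scale are arbitrary choices) backpropagates a unit error through a stack of tanh layers and prints how quickly the gradient norm shrinks with depth.

```python
import numpy as np

# Illustrative sketch (not from the article): forward pass through a stack of
# tanh layers with small random weights, then backpropagation of a unit error.
# The printed gradient norm shrinks roughly exponentially with depth.
rng = np.random.default_rng(0)
n_layers, width = 30, 50
weights = [rng.normal(scale=0.5 / np.sqrt(width), size=(width, width))
           for _ in range(n_layers)]

x = rng.normal(size=width)
activations = []
for W in weights:                                   # forward pass
    x = np.tanh(W @ x)
    activations.append(x)

grad = np.ones(width)                               # error signal at the output
for depth, (W, a) in enumerate(zip(reversed(weights), reversed(activations)), 1):
    grad = W.T @ (grad * (1.0 - a ** 2))            # chain rule: tanh' = 1 - tanh^2
    if depth % 10 == 0:
        print(f"{depth} layers back: gradient norm = {np.linalg.norm(grad):.2e}")
```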
Back-propagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem", which not only affects many-layered feedforward networks, but also recurrent networks.
The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. (The combination of unfolding and backpropagation is termed
backpropagation through time.)
When activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem.
Prototypical models
This section is based on.
Recurrent network model
A generic recurrent network has hidden states $h_1, h_2, \ldots$, inputs $u_1, u_2, \ldots$, and outputs $x_1, x_2, \ldots$. Let it be parametrized by $\theta$, so that the system evolves as
$$(h_t, x_t) = F(h_{t-1}, u_t, \theta).$$
Often, the output $x_t$ is a function of the hidden state $h_t$, as some $x_t = G(h_t)$. The vanishing gradient problem already presents itself clearly when $x_t = h_t$, so we identify them. This gives us a recurrent network with
$$x_t = F(x_{t-1}, u_t, \theta).$$
Now, take its differential:
$$\begin{aligned}
dx_t &= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\, dx_{t-1} \\
&= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\bigl(\nabla_\theta F(x_{t-2}, u_{t-1}, \theta)\, d\theta + \nabla_x F(x_{t-2}, u_{t-1}, \theta)\, dx_{t-2}\bigr) \\
&= \cdots
\end{aligned}$$
Training the network requires us to define a loss function to be minimized. Let it be $L(x_T, u_1, \ldots, u_T)$. Plugging the above differential into $dL = \nabla_x L\, dx_T$ gives
$$\nabla_\theta L = \nabla_x L(x_T, u_1, \ldots, u_T)\bigl(\nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\,\nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots\bigr), \qquad \text{(loss differential)}$$
and minimizing it by gradient descent gives
$$\Delta\theta = -\eta\,[\nabla_\theta L]^T,$$
where $\eta$ is the learning rate.
The vanishing/exploding gradient problem appears because there are repeated multiplications of the form
$$\nabla_x F(x_{t-1}, u_t, \theta)\,\nabla_x F(x_{t-2}, u_{t-1}, \theta)\,\nabla_x F(x_{t-3}, u_{t-2}, \theta)\cdots$$
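As a minimal illustration of this repeated multiplication (not part of the original derivation; the one-dimensional Jacobian values 0.9 and 1.1 are arbitrary demo choices), the following sketch shows how the product over many time steps either vanishes or explodes:

```python
# Illustrative only: repeated multiplication of a one-dimensional Jacobian
# dF/dx over 100 time steps. Values 0.9 and 1.1 are arbitrary demo choices.
for jacobian in (0.9, 1.1):
    product = 1.0
    for t in range(1, 101):
        product *= jacobian
        if t % 25 == 0:
            print(f"dF/dx = {jacobian}, after {t} steps: {product:.3e}")
```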
Example: recurrent network with sigmoid activation
For a concrete example, consider a typical recurrent network defined by
$$x_t = F(x_{t-1}, u_t, \theta) = W_{\text{rec}}\,\sigma(x_{t-1}) + W_{\text{in}} u_t + b,$$
where $\theta = (W_{\text{rec}}, W_{\text{in}})$ is the network parameter, $\sigma$ is the sigmoid activation function, applied to each vector coordinate separately, and $b$ is the bias vector.
Then $\nabla_x F(x_{t-1}, u_t, \theta) = W_{\text{rec}}\,\mathrm{diag}(\sigma'(x_{t-1}))$, and so
$$\nabla_x F(x_{t-1}, u_t, \theta)\,\nabla_x F(x_{t-2}, u_{t-1}, \theta)\cdots\nabla_x F(x_{t-k}, u_{t-k+1}, \theta) = W_{\text{rec}}\,\mathrm{diag}(\sigma'(x_{t-1}))\,W_{\text{rec}}\,\mathrm{diag}(\sigma'(x_{t-2}))\cdots W_{\text{rec}}\,\mathrm{diag}(\sigma'(x_{t-k})).$$
Since $|\sigma'| \le 1$, the operator norm of the above multiplication is bounded above by $\|W_{\text{rec}}\|^k$. So if the spectral radius of $W_{\text{rec}}$ is $\gamma < 1$, then at large $k$, the above multiplication has operator norm bounded above by $\gamma^k \to 0$. This is the prototypical vanishing gradient problem.
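This decay can be checked numerically. The sketch below (an illustration added here, with an arbitrary random $W_{\text{rec}}$ rescaled to spectral radius $\gamma = 0.9$ and arbitrary states $x_t$) multiplies the Jacobians $W_{\text{rec}}\,\mathrm{diag}(\sigma'(x_t))$ and compares the resulting operator norm with $\gamma^k$:

```python
import numpy as np

# Illustration: the product of Jacobians W_rec @ diag(sigma'(x_t)) has operator
# norm that stays below gamma**k when the spectral radius of W_rec is gamma < 1.
# W_rec and the states x_t are arbitrary choices for the demo.
rng = np.random.default_rng(1)
n, gamma = 20, 0.9
W = rng.normal(size=(n, n))
W_rec = W * (gamma / np.abs(np.linalg.eigvals(W)).max())  # rescale spectral radius to gamma

def sigmoid_prime(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                                  # sigma'(x) <= 1/4 <= 1

product = np.eye(n)
for k in range(1, 61):
    x_t = rng.normal(size=n)                              # arbitrary state
    product = product @ (W_rec @ np.diag(sigmoid_prime(x_t)))
    if k % 20 == 0:
        print(f"k={k:3d}  ||product||={np.linalg.norm(product, 2):.2e}  gamma^k={gamma**k:.2e}")
```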
The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall Equation (loss differential):
$$\nabla_\theta L = \nabla_x L(x_T, u_1, \ldots, u_T)\bigl(\nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\,\nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots\bigr).$$
The components of $\nabla_\theta F(x, u, \theta)$ are just components of $\sigma(x)$ and $u$, so if the inputs and activations are bounded, then the factors $\nabla_\theta F(x_{t-1}, u_t, \theta)$ are also bounded by some $M > 0$, and so the terms in $\nabla_\theta L$ decay as $M\gamma^{T-t}$. This means that, effectively, $\nabla_\theta L$ is affected only by roughly the first $O\!\bigl((1-\gamma)^{-1}\bigr)$ terms in the sum.
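To spell out the last step (an added bound, assuming each $\|\nabla_\theta F\| \le M$ and each Jacobian factor has operator norm at most $\gamma < 1$): the tail of the sum beyond its first $J$ terms satisfies
$$\sum_{j > J} M\gamma^{\,j-1} \;\le\; \frac{M\gamma^{J}}{1-\gamma},$$
which becomes negligible once $J$ is a modest multiple of $(1-\gamma)^{-1}$, so gradient contributions from time steps much further back are effectively lost.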
If $\gamma \ge 1$, the above analysis does not quite work. For the prototypical exploding gradient problem, the next model is clearer.
Dynamical systems model

Following (Doya, 1993), consider this one-neuron recurrent network with sigmoid activation:
$$x_{t+1} = (1-\epsilon)\,x_t + \epsilon\,\sigma(w x_t + b) + \epsilon\, w' u_t.$$
In the small-$\epsilon$ limit, the dynamics of the network become
$$\frac{dx(t)}{dt} = -x(t) + \sigma(w x(t) + b) + w' u(t).$$
Consider first the autonomous case, with $u = 0$. Set $w = 5.0$ and vary $b$ in $[-3, -2]$. As $b$ decreases, the system has one stable point, then has two stable points and one unstable point, and finally has one stable point again. Explicitly, the fixed points lie on the curve $b = \ln\!\bigl(\tfrac{x}{1-x}\bigr) - 5x$.
Now consider the quantities $\frac{\Delta x(T)}{\Delta x(0)}$ and $\frac{\Delta x(T)}{\Delta b}$, where $T$ is a time large enough that the system has settled into one of its stable points.
If $x(0)$ puts the system very close to an unstable point, then a tiny variation in $x(0)$ or $b$ would make $x(T)$ move from one stable point to the other. This makes $\frac{\Delta x(T)}{\Delta x(0)}$ and $\frac{\Delta x(T)}{\Delta b}$ both very large, a case of the exploding gradient.
If $x(0)$ puts the system far from an unstable point, then a small variation in $x(0)$ has no effect on $x(T)$, making $\frac{\Delta x(T)}{\Delta x(0)} = 0$, a case of the vanishing gradient.
Note that in this case, $\frac{\Delta x(T)}{\Delta b}$ neither decays to zero nor blows up to infinity. Indeed, it is the only well-behaved gradient, which explains why early research focused on learning or designing recurrent networks that could perform long-range computations (such as outputting the first input seen at the very end of an episode) by shaping their stable attractors.
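These two regimes can be seen in a direct simulation. The sketch below (an illustration added here; the Euler step, horizon, and starting points are arbitrary choices) integrates the one-neuron dynamics with $w = 5$, $b = -2.5$, $u = 0$, first from two nearby points straddling the unstable fixed point at $x = 0.5$, then from two nearby points far from it:

```python
import numpy as np

# Illustration (not from the article): simulate dx/dt = -x + sigmoid(w*x + b)
# with w = 5, b = -2.5, u = 0, using simple Euler steps. The unstable fixed
# point sits at x = 0.5; the stable points are near 0.145 and 0.855.
def simulate(x0, w=5.0, b=-2.5, dt=0.01, steps=5000):
    x = x0
    for _ in range(steps):
        x += dt * (-x + 1.0 / (1.0 + np.exp(-(w * x + b))))
    return x

# Near the unstable point: a tiny change in x(0) flips the final attractor.
print(simulate(0.49), simulate(0.51))    # ~0.145 vs ~0.855 (exploding sensitivity)

# Far from the unstable point: the same perturbation has almost no effect.
print(simulate(0.90), simulate(0.92))    # both ~0.855 (vanishing sensitivity)
```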
For the general case, the intuition still holds; the situation is plotted in Figures 3, 4, and 5 of the reference on which this section is based.
Geometric model
Continue using the above one-neuron network, fixing $w = 5$, $x(0) = 0.5$, and $u(t) = 0$, and consider a loss function defined by $L = (0.855 - x(T))^2$. This produces a rather pathological loss landscape: as $b$ approaches $-2.5$ from above, the loss approaches zero, but as soon as $b$ crosses $-2.5$, the attractor basin changes and the loss jumps to $0.50$.
Consequently, attempting to train $b$ by gradient descent would "hit a wall in the loss landscape" and cause an exploding gradient. A slightly more complex situation is plotted in Figure 6 of the same reference.
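A quick numerical sweep makes the wall visible. The sketch below (an illustration added here, with the same arbitrary Euler settings as above) evaluates the loss for values of $b$ just above and just below $-2.5$:

```python
import numpy as np

# Illustration: loss L = (0.855 - x(T))^2 for the one-neuron network with
# w = 5, x(0) = 0.5, u = 0, as b crosses -2.5. Euler settings are arbitrary.
def x_final(b, w=5.0, x0=0.5, dt=0.01, steps=5000):
    x = x0
    for _ in range(steps):
        x += dt * (-x + 1.0 / (1.0 + np.exp(-(w * x + b))))
    return x

for b in (-2.40, -2.45, -2.49, -2.501, -2.51):
    loss = (0.855 - x_final(b)) ** 2
    print(f"b = {b:+.3f}  loss = {loss:.4f}")   # small above -2.5, jumps to ~0.50 below
```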
Solutions
To overcome this problem, several methods were proposed.
Batch normalization
Batch normalization is a standard method for solving both the exploding and the vanishing gradient problems.
Gradient clipping
One common recommendation is to clip the norm of the gradient $g = \nabla_\theta L$ at a threshold $g_{\text{threshold}}$:
$$g \leftarrow \begin{cases} g & \text{if } \|g\| \le g_{\text{threshold}}, \\[4pt] \dfrac{g_{\text{threshold}}}{\|g\|}\, g & \text{otherwise}, \end{cases}$$
where $g_{\text{threshold}}$ is the "threshold" hyperparameter, the maximum norm that the gradient is allowed to reach. Simply clipping each entry of $g$ separately at the threshold works as well in practice.
This does not solve the vanishing gradient problem.
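As a minimal sketch (an illustration, not any particular library's API), clipping by the gradient norm can be written directly:

```python
import numpy as np

# Illustration: rescale the gradient so its norm never exceeds the threshold.
def clip_by_norm(grad, threshold):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = (threshold / norm) * grad
    return grad

g = np.array([3.0, 4.0])                 # norm 5
print(clip_by_norm(g, 1.0))              # rescaled to norm 1: [0.6, 0.8]
```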
Multi-level hierarchy
One is Jürgen Schmidhuber's multi-level hierarchy of networks (1992), pre-trained one level at a time through unsupervised learning and fine-tuned through backpropagation.[J. Schmidhuber, "Learning complex, extended sequences using the principle of history compression," ''Neural Computation'', 4, pp. 234–242, 1992.] Here each level learns a compressed representation of the observations that is fed to the next level.
Related approach
Similar ideas have been used in feed-forward neural networks for unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. Then the network is trained further by supervised backpropagation to classify labeled data. The deep belief network model by Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations.
Hinton reports that his models are effective feature extractors over high-dimensional, structured data.
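As a hedged sketch of the building block described above (not Hinton's code; the sizes, toy data, and learning rate are arbitrary), a single contrastive-divergence (CD-1) update for a binary restricted Boltzmann machine looks like this:

```python
import numpy as np

# Hedged sketch: one CD-1 update for a binary restricted Boltzmann machine,
# the unit used to pre-train one layer at a time. All sizes are arbitrary.
rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v0 = rng.integers(0, 2, size=n_visible).astype(float)   # a toy data vector
p_h0 = sigmoid(v0 @ W + b_h)                             # hidden probabilities
h0 = (rng.random(n_hidden) < p_h0).astype(float)         # sample hidden units
p_v1 = sigmoid(h0 @ W.T + b_v)                           # reconstruct visibles
p_h1 = sigmoid(p_v1 @ W + b_h)                           # re-infer hiddens

# CD-1 update: data correlations minus reconstruction correlations
W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
b_v += lr * (v0 - p_v1)
b_h += lr * (p_h0 - p_h1)
```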
Long short-term memory
Another technique, particularly used for recurrent neural networks, is the long short-term memory (LSTM) network of 1997 by Hochreiter & Schmidhuber. In 2009, deep multidimensional LSTM networks demonstrated the power of deep learning with many nonlinear layers by winning three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned.
Faster hardware
Hardware advances have meant that from 1991 to 2015, computer power (especially as delivered by