Backpropagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks. It can be used to train Elman networks. The algorithm was independently derived by numerous researchers.
Algorithm
The training data for a recurrent neural network is an ordered sequence of k input-output pairs, ⟨a_0, y_0⟩, ⟨a_1, y_1⟩, ..., ⟨a_{k−1}, y_{k−1}⟩. An initial value must be specified for the hidden state x_0. Typically, a vector of all zeros is used for this purpose.
BPTT begins by unfolding a recurrent neural network in time. The unfolded network contains k inputs and outputs, but every copy of the network shares the same parameters. Then the backpropagation algorithm is used to find the gradient of the cost with respect to all the network parameters.
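As a concrete illustration, the following Python/NumPy sketch unfolds an Elman-style recurrent layer for k steps, forward-propagates from a zero initial hidden state, and backpropagates the cost through every copy, summing the per-copy contributions into the gradients of the shared weights. This is a minimal example rather than anything prescribed by the article: the layer sizes, the tanh activation, the squared-error cost, and the choice to evaluate the cost only at the final step are all assumptions made for brevity.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, k = 3, 5, 4                          # input size, hidden size, unfold length

# shared parameters of the recurrent layer and the output layer
W_a = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_x = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden -> hidden (recurrent)
W_y = rng.normal(scale=0.1, size=(1, n_hid))      # hidden -> output

a = rng.normal(size=(k, n_in))                    # k inputs
y = rng.normal()                                  # target for the cost at the final step
x = [np.zeros(n_hid)]                             # x[0]: initial hidden state (all zeros)

# forward pass: every unfolded copy of the recurrent layer uses the same W_a, W_x
for t in range(k):
    x.append(np.tanh(W_a @ a[t] + W_x @ x[t]))
p = (W_y @ x[k]).item()                           # output layer
cost = 0.5 * (p - y) ** 2

# backward pass through the unfolded network
dW_a, dW_x = np.zeros_like(W_a), np.zeros_like(W_x)
dW_y = (p - y) * x[k][None, :]
dx = (p - y) * W_y.ravel()                        # gradient flowing into x[k]
for t in reversed(range(k)):
    dz = dx * (1.0 - x[t + 1] ** 2)               # through the tanh nonlinearity
    dW_a += np.outer(dz, a[t])                    # contributions from each unfolded copy
    dW_x += np.outer(dz, x[t])                    # are summed into the shared gradients
    dx = W_x.T @ dz                               # pass the gradient to the previous step
# dW_a, dW_x, dW_y now hold the gradient of the cost for this unfolded window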
Consider an example of a neural network that contains a recurrent layer f and a feedforward layer g. There are different ways to define the training cost, but the aggregated cost is always the average of the costs of each of the time steps. The cost of each time step can be computed separately. The figure above shows how the cost at time t + 3 can be computed, by unfolding the recurrent layer f for three time steps and adding the feedforward layer g. Each instance of f in the unfolded network shares the same parameters. Thus the weight updates in each instance (f_1, f_2, f_3) are summed together.
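In symbols (writing θ here for the shared parameters, a notation not used elsewhere in this article), the gradient of the cost C with respect to θ is the sum of the contributions computed at each unfolded instance:

    ∂C/∂θ = (∂C/∂θ from f_1) + (∂C/∂θ from f_2) + (∂C/∂θ from f_3)

so summing the weight updates over the instances is the same as taking the true gradient with respect to the shared parameters.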
Pseudocode
Pseudocode for a truncated version of BPTT, where the training data contains n input-output pairs, but the network is unfolded for k time steps:
Back_Propagation_Through_Time(a, y)   // a[t] is the input at time t. y[t] is the output
    Unfold the network to contain k instances of f
    do until stopping criterion is met:
        x := the zero-magnitude vector   // x is the current context
        for t from 0 to n − k do         // t is time. n is the length of the training sequence
            Set the network inputs to x, a[t], a[t+1], ..., a[t+k−1]
            p := forward-propagate the inputs over the whole unfolded network
            e := y[t+k] − p              // error = target − prediction
            Back-propagate the error, e, back across the whole unfolded network
            Sum the weight changes in the k instances of f together.
            Update all the weights in f and g.
            x := f(x, a[t])              // compute the context for the next time-step
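The loop above maps fairly directly onto a modern autodifferentiation framework. The following Python/PyTorch sketch is one possible reading of it; the RNNCell and Linear layers, the squared-error loss, the learning rate, and the data shapes are illustrative assumptions rather than part of the pseudocode. The shared parameters of f receive the summed weight changes automatically, because every copy in the unfolded window refers to the same tensors.

import torch

n, k, n_in, n_hid = 20, 3, 4, 8                # sequence length, unfold length, layer sizes
a = torch.randn(n, n_in)                       # a[t]: input at time t (synthetic data)
y = torch.randn(n, 1)                          # y[t]: target output at time t

f = torch.nn.RNNCell(n_in, n_hid)              # recurrent layer f (Elman cell)
g = torch.nn.Linear(n_hid, 1)                  # feedforward layer g
opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()), lr=0.01)

for epoch in range(100):                       # "do until stopping criterion is met"
    x = torch.zeros(n_hid)                     # x := the zero-magnitude vector (context)
    for t in range(n - k):
        h = x
        for i in range(k):                     # forward-propagate over the k unfolded copies of f
            h = f(a[t + i].unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        p = g(h)                               # prediction for time t + k
        loss = 0.5 * (y[t + k] - p).pow(2).sum()   # squared error in place of e = y[t+k] − p
        opt.zero_grad()
        loss.backward()                        # backpropagate through the unfolded window;
        opt.step()                             # gradients of f are summed over its k instances
        with torch.no_grad():
            x = f(a[t].unsqueeze(0), x.unsqueeze(0)).squeeze(0)   # x := f(x, a[t]) for the next window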
Advantages
BPTT tends to be significantly faster for training recurrent neural networks than general-purpose optimization techniques such as evolutionary optimization.
Disadvantages
BPTT has difficulty with local optima. With recurrent neural networks, local optima are a much more significant problem than with feed-forward neural networks.
The recurrent feedback in such networks tends to create chaotic responses in the error surface which cause local optima to occur frequently, and in poor locations on the error surface.
See also
* Backpropagation through structure