Temporal difference (TD) learning refers to a class of
model-free reinforcement learning
methods which learn by bootstrapping from the current estimate of the value function. These methods sample from the environment, like
Monte Carlo methods, and perform updates based on current estimates, like
dynamic programming methods.
While Monte Carlo methods only adjust their estimates once the final outcome is known, TD methods adjust predictions to match later, more accurate, predictions about the future before the final outcome is known.
This is a form of bootstrapping, as illustrated with the following example:
Suppose you wish to predict the weather for Saturday, and you have some model that predicts Saturday's weather, given the weather of each day in the week. In the standard case, you would wait until Saturday and then adjust all your models. However, when it is, for example, Friday, you should have a pretty good idea of what the weather would be on Saturday – and thus be able to change, say, Saturday's model before Saturday arrives.
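The weather example can be sketched in a few lines of code; the day-by-day rain probabilities and the step size below are made up purely for illustration:

```python
# A sketch of the weather example: each day carries a prediction of the chance
# of rain on Saturday (the numbers and step size below are hypothetical).
# A TD-style update moves each day's prediction toward the next day's
# prediction, without waiting for Saturday's actual outcome.
predictions = {"Mon": 0.50, "Tue": 0.55, "Wed": 0.70, "Thu": 0.65, "Fri": 0.90}
days = list(predictions)
step_size = 0.5  # how far each prediction moves toward its successor

for today, tomorrow in zip(days, days[1:]):
    td_error = predictions[tomorrow] - predictions[today]
    predictions[today] += step_size * td_error
    print(f"{today}: nudged toward {tomorrow}'s prediction -> {predictions[today]:.2f}")
```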
Temporal difference methods are related to the temporal difference model of
animal learning.
Mathematical formulation
The tabular TD(0) method is one of the simplest TD methods. It is a special case of more general stochastic approximation methods. It estimates the
state value function of a finite-state Markov decision process (MDP) under a policy $\pi$. Let $V^\pi$ denote the state value function of the MDP with states $(S_t)_{t \in \mathbb{N}}$, rewards $(R_t)_{t \in \mathbb{N}}$ and discount rate $\gamma$ under the policy $\pi$:

$$V^\pi(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s\right].$$

We drop the action from the notation for convenience. $V^\pi$ satisfies the Hamilton–Jacobi–Bellman equation

$$V^\pi(s) = \mathbb{E}_{\pi}\left[R_1 + \gamma V^\pi(S_1) \,\middle|\, S_0 = s\right],$$

so $R_1 + \gamma V^\pi(S_1)$ is an unbiased estimate for $V^\pi(s)$.
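For concreteness, here is a minimal numerical check of this relation on a hypothetical three-state chain; all transition probabilities and rewards are made up, and the policy is treated as already folded into the dynamics, matching the action-free notation:

```python
import numpy as np

# Hypothetical three-state Markov chain: P[s, s'] is a made-up transition
# probability and R[s] a made-up expected reward for leaving state s.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
R = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# Exact value function: V = R + gamma * P @ V, i.e. (I - gamma * P) V = R.
V = np.linalg.solve(np.eye(3) - gamma * P, R)

# Sampled one-step targets R_1 + gamma * V(S_1) should average to V(s).
rng = np.random.default_rng(0)
s = 0
samples = [R[s] + gamma * V[rng.choice(3, p=P[s])] for _ in range(100_000)]
print(V[s], np.mean(samples))  # the two numbers should be close
```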
This observation motivates the following algorithm for estimating $V^\pi$.

The algorithm starts by initializing a table $V(s)$ arbitrarily, with one value for each state of the MDP. A positive learning rate $\alpha$ is chosen. We then repeatedly evaluate the policy $\pi$, obtain a reward $R_{t+1}$ and update the value function for the current state $S_t$ using the rule

$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right],$$

where $S_{t+1}$ is the next state and the bracketed quantity is the TD error: the difference between the TD target $R_{t+1} + \gamma V(S_{t+1})$ and the current estimate $V(S_t)$.
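A tabular implementation of this update might look like the following sketch; the environment interface (`reset`, `policy`, `env_step`) is a made-up convention for illustration rather than a standard API:

```python
def td0_evaluate(reset, policy, env_step, num_states,
                 num_episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation, following the update rule above.

    `reset()` -> initial state, `policy(s)` -> action, and
    `env_step(s, a)` -> (reward, next_state, done) are assumed callables
    supplied by the caller; this interface is hypothetical.
    """
    V = [0.0] * num_states                  # table V, initialized arbitrarily
    for _ in range(num_episodes):
        s = reset()
        done = False
        while not done:
            r, s_next, done = env_step(s, policy(s))
            # TD target: observed reward plus discounted estimate of the next
            # state (a terminal state contributes no future value).
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])  # move V(S_t) toward the TD target
            s = s_next
    return V
```

Each update moves $V(S_t)$ a fraction $\alpha$ of the way toward the TD target, so the table is refined during the episode rather than only after the final outcome is known.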