Q-learning

''Q''-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations. For any finite Markov decision process (FMDP), ''Q''-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. ''Q''-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly random policy. "Q" refers to the function that the algorithm computes: the expected reward for an action taken in a given state.


Reinforcement learning

Reinforcement learning involves an agent, a set of ''states'' S, and a set of ''actions'' A per state. By performing an action a \in A, the agent transitions from state to state. Executing an action in a specific state provides the agent with a ''reward'' (a numerical score). The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state.

As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train door as soon as the doors open, minimizing your initial wait time. If the train is crowded, however, entry will be slow after that initial action, because other passengers are fighting their way off the train as you attempt to board. The total boarding time, or cost, is then:
* 0 seconds wait time + 15 seconds fight time
On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time, but less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now:
* 5 seconds wait time + 0 seconds fight time
Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than the forceful strategy, the overall cost is lower, revealing a more rewarding strategy.


Algorithm

After \Delta t steps into the future the agent will decide some next step. The weight for this step is calculated as \gamma^{\Delta t}, where \gamma (the ''discount factor'') is a number between 0 and 1 (0 \le \gamma \le 1) and has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). \gamma may also be interpreted as the probability to succeed (or survive) at every step \Delta t.

The algorithm, therefore, has a function that calculates the quality of a state–action combination:
:Q: S \times A \to \mathbb{R}.
Before learning begins, Q is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t the agent selects an action a_t, observes a reward r_t, enters a new state s_{t+1} (that may depend on both the previous state s_t and the selected action), and Q is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information:
:Q^{new}(s_t,a_t) \leftarrow \underbrace{Q(s_t,a_t)}_{\text{current value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \overbrace{\left( \underbrace{r_t}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(s_{t+1}, a)}_{\text{estimate of optimal future value}} - \underbrace{Q(s_t,a_t)}_{\text{old value}} \right)}^{\text{temporal difference}}
where r_t is the reward received when moving from the state s_t to the state s_{t+1}, and \alpha is the learning rate (0 < \alpha \le 1).

Note that Q^{new}(s_t,a_t) is the sum of three factors:
* (1 - \alpha)Q(s_t,a_t): the current value (weighted by one minus the learning rate)
* \alpha \, r_t: the reward r_t = r(s_t,a_t) obtained if action a_t is taken when in state s_t (weighted by the learning rate)
* \alpha \gamma \max_{a}Q(s_{t+1},a): the maximum reward that can be obtained from state s_{t+1} (weighted by the learning rate and discount factor)

An episode of the algorithm ends when state s_{t+1} is a final or ''terminal state''. However, ''Q''-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops. For all final states s_f, Q(s_f, a) is never updated, but is set to the reward value r observed for state s_f. In most cases, Q(s_f,a) can be taken to equal zero.
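The update rule above can be illustrated with a short Python sketch. This is a minimal example only: the tabular storage, the Gym-style reset/step environment interface, the epsilon-greedy exploration, and all hyperparameter values are assumptions made for illustration rather than part of the algorithm description itself.
<syntaxhighlight lang="python">
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); this interface is an
    assumption of the sketch, not part of Q-learning itself.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # arbitrary fixed initial values

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (a partly random policy)
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # core update: move Q(s, a) towards the temporal-difference target
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
</syntaxhighlight>
After training, a greedy policy can be read off the table as \operatorname{argmax}_a Q(s, a) for each state s.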


Influence of variables


Learning rate

The learning rate or ''step size'' determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of \alpha_t = 1 is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as \alpha_t = 0.1 for all t.
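The two regimes mentioned above can be written as simple schedules. The helper below is a hypothetical sketch; the specific constants are assumptions, and real implementations often decay the rate per state–action pair rather than per global step.
<syntaxhighlight lang="python">
def learning_rate(t, mode="constant"):
    """Illustrative learning-rate schedules (hypothetical helper).

    'constant' keeps alpha fixed, as is common in practice;
    'decaying' shrinks alpha towards zero over time, in the spirit of the
    conditions required for convergence in stochastic problems.
    """
    if mode == "constant":
        return 0.1
    # 1/(1+t)-style decay: the rates sum to infinity while their squares
    # sum to a finite value.
    return 1.0 / (1 + t)
</syntaxhighlight>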


Discount factor

The discount factor \gamma determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. r_t (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For \gamma = 1, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, ''Q''-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.
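A minimal sketch of such a schedule follows; the start value, final value, and ramp length are entirely assumed, and only the idea of raising the discount factor over training comes from the text above.
<syntaxhighlight lang="python">
def discount_schedule(step, gamma_start=0.9, gamma_final=0.99, ramp_steps=10_000):
    """Linearly increase gamma from gamma_start to gamma_final over ramp_steps.

    All numbers here are illustrative assumptions; the point is only that
    training begins with a lower discount factor and increases it over time.
    """
    frac = min(step / ramp_steps, 1.0)
    return gamma_start + frac * (gamma_final - gamma_start)
</syntaxhighlight>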


Initial conditions (''Q''_0)

Since ''Q''-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternatives, thus increasing their choice probability. The first reward r can be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value of Q. This allows immediate learning in case of fixed deterministic rewards. A model that incorporates ''reset of initial conditions'' (RIC) is expected to predict participants' behaviour better than a model that assumes any ''arbitrary initial condition'' (AIC). RIC seems to be consistent with human behaviour in repeated binary choice experiments.
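Both ideas can be sketched in a few lines. The helper names, the array shapes, the `visited` bookkeeping, and the value 10.0 are inventions for illustration, not part of the original description.
<syntaxhighlight lang="python">
import numpy as np

def init_q(n_states, n_actions, optimistic_value=10.0):
    """Optimistic initial conditions: start every Q(s, a) at a high value.

    The value 10.0 is an arbitrary assumption; it only needs to exceed the
    rewards actually attainable, so that untried actions look attractive.
    """
    return np.full((n_states, n_actions), optimistic_value)

def update_with_ric(Q, visited, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-update with 'reset of initial conditions' (RIC): the first time
    (s, a) is taken, the observed reward overwrites the arbitrary initial
    value instead of being blended into it."""
    if not visited[s, a]:
        Q[s, a] = r
        visited[s, a] = True
    else:
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
</syntaxhighlight>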


Implementation

''Q''-learning at its simplest stores data in tables. This approach falters with increasing numbers of states/actions since the likelihood of the agent visiting a particular state and performing a particular action is increasingly small.


Function approximation

''Q''-learning can be combined with function approximation. This makes it possible to apply the algorithm to larger problems, even when the state space is continuous. One solution is to use an (adapted) artificial neural network as a function approximator. Another possibility is to integrate Fuzzy Rule Interpolation (FRI) and use sparse fuzzy rule bases instead of discrete Q-tables or ANNs, which has the advantage of being a human-readable knowledge representation form. Function approximation may speed up learning in finite problems, because the algorithm can generalize earlier experiences to previously unseen states.
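As a hedged illustration, the simplest function approximator is a linear model over state features, Q(s, a) \approx w_a \cdot \phi(s). The feature mapping, shapes, and step sizes below are assumptions made for this sketch; in deep variants a neural network replaces the linear model.
<syntaxhighlight lang="python">
import numpy as np

class LinearQ:
    """Semi-gradient Q-learning with a linear function approximator.

    Q(s, a) is approximated as w[a] . phi(s), where phi maps a (possibly
    continuous) state to a feature vector. phi and all hyperparameters are
    assumptions of this sketch.
    """
    def __init__(self, n_features, n_actions, alpha=0.01, gamma=0.99):
        self.w = np.zeros((n_actions, n_features))
        self.alpha, self.gamma = alpha, gamma

    def q_values(self, phi_s):
        return self.w @ phi_s                      # one value per action

    def update(self, phi_s, a, r, phi_s_next, done):
        target = r if done else r + self.gamma * np.max(self.q_values(phi_s_next))
        td_error = target - self.w[a] @ phi_s
        # the gradient of the linear Q w.r.t. w[a] is just the feature vector
        self.w[a] += self.alpha * td_error * phi_s
</syntaxhighlight>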


Quantization

Another technique to decrease the state/action space quantizes possible values. Consider the example of learning to balance a stick on a finger. Describing a state at a certain point in time involves the position of the finger in space, its velocity, the angle of the stick and the angular velocity of the stick. This yields a four-element vector that describes one state, i.e. a snapshot of one state encoded into four values. The problem is that infinitely many possible states are present. To shrink the possible space, multiple values can be assigned to the same bucket. Rather than the exact distance of the finger from its starting position (negative infinity to infinity), the state records only whether it is far away or not (Near, Far).
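A minimal sketch of this bucketing is given below; the bin edges are chosen arbitrarily for illustration, and the four state variables follow the stick-balancing example above.
<syntaxhighlight lang="python">
import numpy as np

# Arbitrary bin edges for each of the four state variables:
# finger position, finger velocity, stick angle, stick angular velocity.
BINS = [
    np.array([-1.0, 0.0, 1.0]),        # position (m)
    np.array([-0.5, 0.0, 0.5]),        # velocity (m/s)
    np.array([-0.2, 0.0, 0.2]),        # angle (rad)
    np.array([-1.0, 0.0, 1.0]),        # angular velocity (rad/s)
]

def discretize(state):
    """Map a continuous 4-element state to a tuple of bucket indices,
    which can then be used as a key into a discrete Q-table."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(state, BINS))

# Example: a nearly upright, slow-moving configuration
print(discretize([0.1, -0.05, 0.01, 0.3]))   # -> (2, 1, 2, 2)
</syntaxhighlight>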


History

''Q''-learning was introduced by Chris Watkins in 1989. A convergence proof was presented by Watkins and Peter Dayan in 1992.

Watkins was addressing “Learning from delayed rewards”, the title of his PhD thesis. Eight years earlier, in 1981, the same problem, under the name of “Delayed reinforcement learning”, was solved by Bozinovski's Crossbar Adaptive Array (CAA). The memory matrix W = \|w(a,s)\| was the same as the Q-table of Q-learning eight years later. The architecture introduced the term “state evaluation” in reinforcement learning. The crossbar learning algorithm, written in mathematical pseudocode in the paper, performs the following computation in each iteration:
* In state s perform action a;
* Receive consequence state s';
* Compute state evaluation v(s');
* Update crossbar value w'(a,s) = w(a,s) + v(s').
The term “secondary reinforcement” is borrowed from animal learning theory, to model state values via backpropagation: the state value v(s') of the consequence situation is backpropagated to the previously encountered situations. CAA computes state values vertically and actions horizontally (the "crossbar"). Demonstration graphs showing delayed reinforcement learning contained states (desirable, undesirable, and neutral states), which were computed by the state evaluation function. This learning system was a forerunner of the Q-learning algorithm.

In 2014, Google DeepMind patented an application of Q-learning to deep learning, titled "deep reinforcement learning" or "deep Q-learning", that can play Atari 2600 games at expert human levels.


Variants


Deep Q-learning

The DeepMind system used a deep convolutional neural network, with layers of tiled convolutional filters to mimic the effects of receptive fields. Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy of the agent and the data distribution, and the correlations between Q and the target values. The method can be used for stochastic search in various domains and applications.

The technique used ''experience replay'', a biologically inspired mechanism that uses a random sample of prior actions instead of the most recent action to proceed. This removes correlations in the observation sequence and smooths changes in the data distribution. Iterative updates adjust Q towards target values that are only periodically updated, further reducing correlations with the target.
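The two stabilisation ideas above, experience replay and periodically updated target values, can be sketched without any particular deep-learning library. In this sketch the buffer size, batch size, sync period, and the linear stand-in for the Q-network are all assumptions; states are assumed to be feature vectors.
<syntaxhighlight lang="python">
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled uniformly at random."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

# A linear model stands in for the deep Q-network in this sketch.
n_features, n_actions, alpha, gamma = 8, 4, 0.01, 0.99
w_online = np.zeros((n_actions, n_features))   # updated every training step
w_target = w_online.copy()                     # synced only periodically

def train_step(buffer, step, sync_every=500, batch_size=32):
    global w_target
    if len(buffer.buffer) < batch_size:
        return
    for s, a, r, s_next, done in buffer.sample(batch_size):
        # target values come from the frozen copy, not the online weights
        target = r if done else r + gamma * np.max(w_target @ s_next)
        td_error = target - w_online[a] @ s
        w_online[a] += alpha * td_error * s
    if step % sync_every == 0:
        w_target = w_online.copy()             # periodic target update
</syntaxhighlight>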


Double Q-learning

Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as in the current action selection policy, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A variant called Double Q-learning was proposed to correct this. Double Q-learning is an off-policy reinforcement learning algorithm, where a different policy is used for value evaluation than what is used to select the next action.

In practice, two separate value functions, Q^A and Q^B, are trained in a mutually symmetric fashion using separate experiences. The double Q-learning update step is then as follows:
:Q^A_{t+1}(s_t, a_t) = Q^A_t(s_t, a_t) + \alpha_t(s_t, a_t) \left(r_t + \gamma \, Q^B_t\!\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^A_t(s_{t+1}, a)\right) - Q^A_t(s_t, a_t)\right),
and
:Q^B_{t+1}(s_t, a_t) = Q^B_t(s_t, a_t) + \alpha_t(s_t, a_t) \left(r_t + \gamma \, Q^A_t\!\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^B_t(s_{t+1}, a)\right) - Q^B_t(s_t, a_t)\right).

Now the estimated value of the discounted future is evaluated using a different policy, which solves the overestimation issue.

This algorithm was later modified in 2015 and combined with deep learning, as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm.
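A minimal tabular sketch of the update above follows. A random coin flip decides which table to update on each step; the table sizes and hyperparameters are assumptions for illustration.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, alpha, gamma = 16, 4, 0.1, 0.95
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s_next, done):
    """One double Q-learning step: pick which table to update at random,
    select the greedy next action with that table, but evaluate it with
    the other table."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(QA[s_next]))
        target = r if done else r + gamma * QB[s_next, a_star]
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        b_star = int(np.argmax(QB[s_next]))
        target = r if done else r + gamma * QA[s_next, b_star]
        QB[s, a] += alpha * (target - QB[s, a])
</syntaxhighlight>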


Others

Delayed Q-learning is an alternative implementation of the online ''Q''-learning algorithm, with probably approximately correct (PAC) learning. Greedy GQ is a variant of ''Q''-learning for use in combination with (linear) function approximation. The advantage of Greedy GQ is that convergence is guaranteed even when function approximation is used to estimate the action values. Distributional Q-learning is a variant of ''Q''-learning which seeks to model the distribution of returns rather than the expected return of each action. It has been observed to facilitate estimation by deep neural networks and can enable alternative control methods, such as risk-sensitive control.


Limitations

The standard Q-learning algorithm (using a Q table) applies only to discrete action and state spaces.
Discretization of these values leads to inefficient learning, largely due to the curse of dimensionality. However, there are adaptations of Q-learning that attempt to solve this problem, such as Wire-fitted Neural Network Q-Learning.


See also

* Reinforcement learning
* Temporal difference learning
* SARSA
* Iterated prisoner's dilemma
* Game theory


External links


* Watkins, C.J.C.H. (1989). ''Learning from Delayed Rewards''. PhD thesis, Cambridge University, Cambridge, England.
* Strehl, Li, Wiewiora, Langford, Littman (2006). PAC model-free reinforcement learning.
* ''Reinforcement Learning: An Introduction'' by Richard Sutton and Andrew G. Barto, an online textbook.
* Piqle: a Generic Java Platform for Reinforcement Learning
* Reinforcement Learning Maze, a demonstration of guiding an ant through a maze using ''Q''-learning
