Actor-critic algorithm

The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based RL algorithms, such as policy gradient methods, with value-based RL algorithms, such as value iteration, Q-learning, SARSA, and TD learning. An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function. Some AC algorithms are on-policy and some are off-policy; some apply to continuous action spaces, some to discrete action spaces, and some to both.


Overview

Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a baseline that reduces the variance of the gradient estimate.
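Subtracting a state-dependent baseline b(S_j) from the return leaves the gradient estimate unbiased, since for any state s, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \ln \pi_\theta(a \mid s)\, b(s)\right] = b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0 (with the sum replaced by an integral for continuous actions), while a well-chosen baseline, in actor-critic methods a value estimate supplied by the critic, can substantially reduce the variance of the estimate.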


Actor

The actor uses a policy function \pi(a \mid s), while the critic estimates either the value function V(s), the action-value Q-function Q(s, a), the advantage function A(s, a), or any combination thereof.

The actor is a parameterized function \pi_\theta, where \theta are the parameters of the actor. The actor takes as argument the state of the environment s and produces a probability distribution \pi_\theta(\cdot \mid s). If the action space is discrete, then \sum_{a} \pi_\theta(a \mid s) = 1. If the action space is continuous, then \int_{a} \pi_\theta(a \mid s)\, da = 1.

The goal of policy optimization is to improve the actor. That is, to find some \theta that maximizes the expected episodic reward J(\theta):

J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]

where \gamma is the discount factor, r_t is the reward at step t, and T is the time horizon (which can be infinite).
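As a minimal illustration of such a parameterized actor for a discrete action space (a sketch, not taken from the source; the class name, feature dimension, and action count are assumptions made here for illustration), the policy below is a softmax over scores that are linear in the state features, so its outputs form a valid probability distribution over actions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class LinearSoftmaxActor:
    """A toy actor pi_theta(a | s): softmax over scores linear in state features."""

    def __init__(self, n_features, n_actions):
        # theta: one row of weights per action (illustrative initialization).
        self.theta = rng.normal(scale=0.1, size=(n_actions, n_features))

    def action_probs(self, s):
        return softmax(self.theta @ s)   # non-negative, sums to 1 over actions

    def sample(self, s):
        p = self.action_probs(s)
        return rng.choice(len(p), p=p)   # draw an action from pi_theta(. | s)

actor = LinearSoftmaxActor(n_features=4, n_actions=2)
s = rng.normal(size=4)                   # a toy state feature vector
print(actor.action_probs(s).sum())       # ~1.0, as required of a distribution
print(actor.sample(s))                   # a sampled action index
```

A continuous-action actor would instead output the parameters of a density, for example the mean and standard deviation of a Gaussian, and in practice J(\theta) is estimated by rolling out episodes with the policy and averaging the discounted returns.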
The goal of the policy gradient method is to optimize J(\theta) by gradient ascent on the policy gradient \nabla_\theta J(\theta).

As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \le j \le T} \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \cdot \Psi_j \,\Big|\, S_0 = s_0\right]

where \Psi_j is a linear sum of the following:

* \sum_{0 \le i \le T} \gamma^i R_i.
* \gamma^j \sum_{j \le i \le T} \gamma^{i-j} R_i: the REINFORCE algorithm.
* \gamma^j \left(\sum_{j \le i \le T} \gamma^{i-j} R_i - b(S_j)\right): the REINFORCE with baseline algorithm. Here b is an arbitrary function.
* \gamma^j \left(R_j + \gamma V^{\pi_\theta}(S_{j+1}) - V^{\pi_\theta}(S_j)\right): TD(1) learning.
* \gamma^j Q^{\pi_\theta}(S_j, A_j).
* \gamma^j A^{\pi_\theta}(S_j, A_j): Advantage Actor-Critic (A2C).
* \gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}(S_{j+2}) - V^{\pi_\theta}(S_j)\right): TD(2) learning.
* \gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j)\right): TD(n) learning.
* \gamma^j \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j)\right): TD(\lambda) learning, also known as GAE (generalized advantage estimate). This is obtained by an exponentially decaying sum of the TD(n) learning terms.
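To illustrate how one of these estimators is used in practice, the following sketch (assuming PyTorch, with toy randomly generated transitions; the network sizes and variable names are illustrative) performs a single actor update with \Psi_j chosen as the TD(1)/advantage estimate R_j + \gamma V(S_{j+1}) - V(S_j). The critic's value estimate is treated as a constant with respect to the actor's parameters, and the \gamma^j weighting is omitted, as many implementations do:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99           # illustrative sizes

actor  = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# A toy batch of transitions (S_j, A_j, R_j, S_{j+1}); in practice these come
# from rolling out the current policy in the environment.
S  = torch.randn(8, obs_dim)
A  = torch.randint(n_actions, (8,))
R  = torch.randn(8)
S1 = torch.randn(8, obs_dim)

# Psi_j = R_j + gamma * V(S_{j+1}) - V(S_j), held constant w.r.t. the actor.
with torch.no_grad():
    psi = R + gamma * critic(S1).squeeze(-1) - critic(S).squeeze(-1)

# Gradient ascent on E[ log pi_theta(A_j | S_j) * Psi_j ] (minimize the negative).
log_pi = torch.distributions.Categorical(logits=actor(S)).log_prob(A)
actor_loss = -(log_pi * psi).mean()

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

How the critic that supplies V is itself trained is described in the next section.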


Critic

In the unbiased estimators given above, certain functions such as V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta} appear. These are approximated by the critic. Since these functions all depend on the actor, the critic must learn alongside the actor. The critic is learned by value-based RL algorithms.

For example, if the critic is estimating the state-value function V^{\pi_\theta}(s), then it can be learned by any value function approximation method. Let the critic be a function approximator V_\phi(s) with parameters \phi. The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error \delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i). The critic parameters are updated by gradient descent on the squared TD error:

\phi \leftarrow \phi - \alpha \nabla_\phi \tfrac{1}{2}(\delta_i)^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i)

where \alpha is the learning rate. Note that the gradient is taken with respect to the \phi in V_\phi(S_i) only, since the \phi in \gamma V_\phi(S_{i+1}) constitutes a moving target, and the gradient is not taken with respect to that. This is a common source of error in implementations that use automatic differentiation, and requires "stopping the gradient" at that point.
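A minimal sketch of this critic update, assuming PyTorch (any automatic-differentiation framework works similarly) and a single toy transition, where detach() plays the role of "stopping the gradient" on the bootstrap target:

```python
import torch
import torch.nn as nn

obs_dim, gamma, alpha = 4, 0.99, 1e-2            # illustrative values
V = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(V.parameters(), lr=alpha)

# One toy transition (S_i, R_i, S_{i+1}); in practice these come from rollouts.
S_i    = torch.randn(1, obs_dim)
R_i    = torch.tensor([0.5])
S_next = torch.randn(1, obs_dim)

# The bootstrap target R_i + gamma * V_phi(S_{i+1}) is held fixed via detach()
# ("stopping the gradient"), so only V_phi(S_i) is differentiated.
target = (R_i + gamma * V(S_next).squeeze(-1)).detach()
delta  = target - V(S_i).squeeze(-1)             # TD(1) error
loss   = 0.5 * delta.pow(2).mean()

opt.zero_grad()
loss.backward()   # gradient of the loss w.r.t. phi is -delta * grad V_phi(S_i)
opt.step()        # equivalent to phi <- phi + alpha * delta * grad V_phi(S_i)
```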
Similarly, if the critic is estimating the action-value function Q^{\pi_\theta}, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by \phi and denoted Q_\phi(s, a). The temporal difference error is calculated as \delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i, A_i), and the critic is then updated by

\phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i)

The advantage critic can be trained by training both a Q-function Q_\phi(s, a) and a state-value function V_\phi(s), then letting A_\phi(s, a) = Q_\phi(s, a) - V_\phi(s). However, it is more common to train just a state-value function V_\phi(s) and estimate the advantage by

A_\phi(S_i, A_i) \approx \sum_{k=0}^{n-1} \gamma^k R_{i+k} + \gamma^n V_\phi(S_{i+n}) - V_\phi(S_i)

Here, n is a positive integer. The higher n is, the lower the bias in the advantage estimate, but at the price of higher variance.

The Generalized Advantage Estimation (GAE) method introduces a hyperparameter \lambda that smoothly interpolates between Monte Carlo returns (\lambda = 1: high variance, no bias) and 1-step TD learning (\lambda = 0: low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns, with \lambda being the decay strength.
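A minimal sketch of computing GAE advantages for one trajectory segment, using the standard backward recursion over 1-step TD errors (NumPy here; the function name and the example values of \gamma and \lambda are illustrative, and episode termination inside the segment is not handled):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory segment.

    rewards:    R_0, ..., R_{T-1}
    values:     V_phi(S_0), ..., V_phi(S_{T-1})
    last_value: V_phi(S_T), the bootstrap value for the state after the segment
    """
    values = np.append(np.asarray(values, dtype=float), last_value)
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    # Backward recursion A_t = delta_t + gamma * lambda * A_{t+1}, where
    # delta_t = R_t + gamma * V(S_{t+1}) - V(S_t) is the 1-step TD error.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

# Toy usage with made-up numbers.
print(gae_advantages(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.6], last_value=0.3))
```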


Variants

* Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.
* Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.
* Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.


See also

* Reinforcement learning
* Policy gradient method
* Deep reinforcement learning

