Actor-critic algorithm

The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based RL algorithms, such as policy gradient methods, with value-based RL algorithms, such as value iteration, Q-learning, SARSA, and TD learning. An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function. Some AC algorithms are on-policy and some are off-policy; some apply to continuous action spaces, some to discrete action spaces, and some to both.


Overview

Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a baseline that reduces the variance of the gradient estimate.
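Subtracting a state-dependent baseline b(S_j) from the return leaves the gradient estimate unbiased, since for any state s, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \ln \pi_\theta(a \mid s)\, b(s)\right] = b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0 (with the sum replaced by an integral for continuous actions), while a well-chosen baseline, in actor-critic methods a value estimate supplied by the critic, can substantially reduce the variance of the estimate.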


Actor

The actor uses a policy function \pi(a \mid s), while the critic estimates either the value function V(s), the action-value Q-function Q(s, a), the advantage function A(s, a), or any combination thereof.

The actor is a parameterized function \pi_\theta, where \theta are the parameters of the actor. The actor takes as argument the state of the environment s and produces a probability distribution \pi_\theta(\cdot \mid s). If the action space is discrete, then \sum_{a} \pi_\theta(a \mid s) = 1. If the action space is continuous, then \int_{a} \pi_\theta(a \mid s)\, da = 1.

The goal of policy optimization is to improve the actor. That is, to find some \theta that maximizes the expected episodic reward J(\theta):

J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]

where \gamma is the discount factor, r_t is the reward at step t, and T is the time horizon (which can be infinite).
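As a minimal illustration of such a parameterized actor for a discrete action space (a sketch, not taken from the source; the class name, feature dimension, and action count are assumptions made here for illustration), the policy below is a softmax over scores that are linear in the state features, so its outputs form a valid probability distribution over actions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class LinearSoftmaxActor:
    """A toy actor pi_theta(a | s): softmax over scores linear in state features."""

    def __init__(self, n_features, n_actions):
        # theta: one row of weights per action (illustrative initialization).
        self.theta = rng.normal(scale=0.1, size=(n_actions, n_features))

    def action_probs(self, s):
        return softmax(self.theta @ s)   # non-negative, sums to 1 over actions

    def sample(self, s):
        p = self.action_probs(s)
        return rng.choice(len(p), p=p)   # draw an action from pi_theta(. | s)

actor = LinearSoftmaxActor(n_features=4, n_actions=2)
s = rng.normal(size=4)                   # a toy state feature vector
print(actor.action_probs(s).sum())       # ~1.0, as required of a distribution
print(actor.sample(s))                   # a sampled action index
```

A continuous-action actor would instead output the parameters of a density, for example the mean and standard deviation of a Gaussian, and in practice J(\theta) is estimated by rolling out episodes with the policy and averaging the discounted returns.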
The goal of the policy gradient method is to optimize J(\theta) by gradient ascent on the policy gradient \nabla_\theta J(\theta).

As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \le j \le T} \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \cdot \Psi_j \,\Big|\, S_0 = s_0\right]

where \Psi_j is a linear sum of the following:

* \sum_{0 \le i \le T} \gamma^i R_i.
* \gamma^j \sum_{j \le i \le T} \gamma^{i-j} R_i: the REINFORCE algorithm.
* \gamma^j \left(\sum_{j \le i \le T} \gamma^{i-j} R_i - b(S_j)\right): the REINFORCE with baseline algorithm. Here b is an arbitrary function.
* \gamma^j \left(R_j + \gamma V^{\pi_\theta}(S_{j+1}) - V^{\pi_\theta}(S_j)\right): TD(1) learning.
* \gamma^j Q^{\pi_\theta}(S_j, A_j).
* \gamma^j A^{\pi_\theta}(S_j, A_j): Advantage Actor-Critic (A2C).
* \gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}(S_{j+2}) - V^{\pi_\theta}(S_j)\right): TD(2) learning.
* \gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j)\right): TD(n) learning.
* \gamma^j \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j)\right): TD(\lambda) learning, also known as GAE (generalized advantage estimate). This is obtained by an exponentially decaying sum of the TD(n) learning terms.
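To illustrate how one of these estimators is used in practice, the following sketch (assuming PyTorch, with toy randomly generated transitions; the network sizes and variable names are illustrative) performs a single actor update with \Psi_j chosen as the TD(1)/advantage estimate R_j + \gamma V(S_{j+1}) - V(S_j). The critic's value estimate is treated as a constant with respect to the actor's parameters, and the \gamma^j weighting is omitted, as many implementations do:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99           # illustrative sizes

actor  = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# A toy batch of transitions (S_j, A_j, R_j, S_{j+1}); in practice these come
# from rolling out the current policy in the environment.
S  = torch.randn(8, obs_dim)
A  = torch.randint(n_actions, (8,))
R  = torch.randn(8)
S1 = torch.randn(8, obs_dim)

# Psi_j = R_j + gamma * V(S_{j+1}) - V(S_j), held constant w.r.t. the actor.
with torch.no_grad():
    psi = R + gamma * critic(S1).squeeze(-1) - critic(S).squeeze(-1)

# Gradient ascent on E[ log pi_theta(A_j | S_j) * Psi_j ] (minimize the negative).
log_pi = torch.distributions.Categorical(logits=actor(S)).log_prob(A)
actor_loss = -(log_pi * psi).mean()

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

How the critic that supplies V is itself trained is described in the next section.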


Critic

In the unbiased estimators given above, certain functions such as V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta} appear. These are approximated by the critic. Since these functions all depend on the actor, the critic must learn alongside the actor. The critic is learned by value-based RL algorithms.

For example, if the critic is estimating the state-value function V^{\pi_\theta}(s), then it can be learned by any value function approximation method. Let the critic be a function approximator V_\phi(s) with parameters \phi. The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error \delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i). The critic parameters are updated by gradient descent on the squared TD error:

\phi \leftarrow \phi - \alpha \nabla_\phi \tfrac{1}{2}(\delta_i)^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i)

where \alpha is the learning rate. Note that the gradient is taken with respect to the \phi in V_\phi(S_i) only, since the \phi in \gamma V_\phi(S_{i+1}) constitutes a moving target, and the gradient is not taken with respect to that. This is a common source of error in implementations that use automatic differentiation, and requires "stopping the gradient" at that point.
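A minimal sketch of this critic update, assuming PyTorch (any automatic-differentiation framework works similarly) and a single toy transition, where detach() plays the role of "stopping the gradient" on the bootstrap target:

```python
import torch
import torch.nn as nn

obs_dim, gamma, alpha = 4, 0.99, 1e-2            # illustrative values
V = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(V.parameters(), lr=alpha)

# One toy transition (S_i, R_i, S_{i+1}); in practice these come from rollouts.
S_i    = torch.randn(1, obs_dim)
R_i    = torch.tensor([0.5])
S_next = torch.randn(1, obs_dim)

# The bootstrap target R_i + gamma * V_phi(S_{i+1}) is held fixed via detach()
# ("stopping the gradient"), so only V_phi(S_i) is differentiated.
target = (R_i + gamma * V(S_next).squeeze(-1)).detach()
delta  = target - V(S_i).squeeze(-1)             # TD(1) error
loss   = 0.5 * delta.pow(2).mean()

opt.zero_grad()
loss.backward()   # gradient of the loss w.r.t. phi is -delta * grad V_phi(S_i)
opt.step()        # equivalent to phi <- phi + alpha * delta * grad V_phi(S_i)
```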
Similarly, if the critic is estimating the action-value function Q^{\pi_\theta}, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by \phi and denoted Q_\phi(s, a). The temporal difference error is calculated as \delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i, A_i), and the critic is then updated by

\phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i)

The advantage critic can be trained by training both a Q-function Q_\phi(s, a) and a state-value function V_\phi(s), then letting A_\phi(s, a) = Q_\phi(s, a) - V_\phi(s). However, it is more common to train just a state-value function V_\phi(s) and estimate the advantage by

A_\phi(S_i, A_i) \approx \sum_{k=0}^{n-1} \gamma^k R_{i+k} + \gamma^n V_\phi(S_{i+n}) - V_\phi(S_i)

Here, n is a positive integer. The higher n is, the lower the bias in the advantage estimate, but at the price of higher variance.

The Generalized Advantage Estimation (GAE) method introduces a hyperparameter \lambda that smoothly interpolates between Monte Carlo returns (\lambda = 1: high variance, no bias) and 1-step TD learning (\lambda = 0: low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns, with \lambda being the decay strength.
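A minimal sketch of computing GAE advantages for one trajectory segment, using the standard backward recursion over 1-step TD errors (NumPy here; the function name and the example values of \gamma and \lambda are illustrative, and episode termination inside the segment is not handled):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory segment.

    rewards:    R_0, ..., R_{T-1}
    values:     V_phi(S_0), ..., V_phi(S_{T-1})
    last_value: V_phi(S_T), the bootstrap value for the state after the segment
    """
    values = np.append(np.asarray(values, dtype=float), last_value)
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    # Backward recursion A_t = delta_t + gamma * lambda * A_{t+1}, where
    # delta_t = R_t + gamma * V(S_{t+1}) - V(S_t) is the 1-step TD error.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

# Toy usage with made-up numbers.
print(gae_advantages(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.6], last_value=0.3))
```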


Variants

* Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.
* Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.
* Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.


See also

* Reinforcement learning
* Policy gradient method
* Deep reinforcement learning

