State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note with the name "Modified Connectionist Q-Learning" (MCQ-L). The alternative name SARSA, proposed by Rich Sutton, was only mentioned as a footnote. This name reflects the fact that the main function for updating the Q-value depends on the current state of the agent "S₁", the action the agent chooses "A₁", the reward "R₂" the agent gets for choosing this action, the state "S₂" that the agent enters after taking that action, and finally the next action "A₂" the agent chooses in its new state. The acronym for the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) is SARSA. Some authors use a slightly different convention and write the quintuple (S_t, A_t, R_t, S_{t+1}, A_{t+1}), depending on which time step the reward is formally assigned. The rest of the article uses the former convention.

Algorithm

Q^{new}(S_t, A_t) \leftarrow (1 - \alpha) Q(S_t, A_t) + \alpha \, [R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1})]
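The update rule can be sketched in a few lines of Python. This is a minimal illustration, assuming a tabular Q function stored in a dictionary keyed by (state, action) pairs. The function name `sarsa_update`, the state and action labels, and the values chosen for α and γ are hypothetical and serve only as an example.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Apply one SARSA update to Q in place and return the new Q(s, a).

    Q maps (state, action) pairs to value estimates (an assumed layout).
    """
    # Target: observed reward plus discounted value of the next
    # state-action pair that the agent actually chose.
    target = r + gamma * Q[(s_next, a_next)]
    # Blend the old estimate with the target using the learning rate.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]


# Unvisited pairs default to 0.0 in this sketch.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0

new_value = sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="right")
print(new_value)
```

With these numbers the target is 1 + 0.9 · 2 = 2.8, and with α = 0.5 the new estimate lies halfway between the old value 0 and the target, at approximately 1.4.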

A SARSA agent interacts with the environment and updates its policy based on the actions it actually takes; hence it is known as an ''on-policy learning algorithm''. The Q value for a state-action pair is updated by an error term, scaled by the learning rate α. Q values represent the reward expected in the next time step for taking action ''a'' in state ''s'', plus the discounted future reward received from the next state-action observation.

Watkins's Q-learning updates an estimate of the optimal state-action value function Q^* based on the maximum estimated value over the actions available in the next state. While SARSA learns the Q values associated with the policy it follows itself, Watkins's Q-learning learns the Q values associated with the optimal policy while following an exploration/exploitation policy. Some optimizations of Watkins's Q-learning may be applied to SARSA.
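The difference between the two algorithms shows up in the update target. The following sketch compares them for a single transition, assuming a Q table stored as a nested dictionary (state, then action). The function names, the state label `"s1"`, and all numeric values are hypothetical.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually chosen next."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: bootstrap from the best-valued next action."""
    return r + gamma * max(Q[s_next].values())


# Assumed value estimates for the next state.
Q = {"s1": {"left": 0.0, "right": 10.0}}

# Suppose the behaviour policy explores and picks "left" in s1.
sarsa_t = sarsa_target(Q, r=1.0, s_next="s1", a_next="left", gamma=0.9)
q_t = q_learning_target(Q, r=1.0, s_next="s1", gamma=0.9)

print("SARSA target:     ", sarsa_t)
print("Q-learning target:", q_t)
```

Here the SARSA target reflects the exploratory action that was taken (1 + 0.9 · 0 = 1), whereas the Q-learning target ignores it and uses the greedy value (1 + 0.9 · 10 ≈ 10). When the chosen next action happens to be the greedy one, the two targets coincide.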

Hyperparameters

Learning rate (alpha)

The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information.
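The effect of the learning rate can be seen by isolating the blending step of the update. In this small sketch, `blend` is a hypothetical helper and the old estimate and target values are made up for illustration.

```python
def blend(old_estimate, target, alpha):
    """Move the old estimate toward the target by a fraction alpha."""
    return (1 - alpha) * old_estimate + alpha * target


old_estimate, target = 2.0, 10.0

# alpha = 0 keeps the old estimate; alpha = 1 replaces it with the target.
results = {alpha: blend(old_estimate, target, alpha) for alpha in (0.0, 0.1, 1.0)}
for alpha, value in results.items():
    print(f"alpha={alpha}: new estimate = {value:.2f}")
```

An intermediate value such as α = 0.1 moves the estimate only one tenth of the way toward the target, which smooths out noisy rewards at the cost of slower learning.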

Discount factor (gamma)

The discount factor determines the importance of future rewards. A discount factor of 0 makes the agent "opportunistic", or "myopic", i.e., it considers only current rewards, while a factor approaching 1 makes it strive for a long-term high reward. If the discount factor meets or exceeds 1, the Q values may diverge.
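A rough way to see these regimes is to compute the discounted sum of a constant reward stream for several discount factors. The sketch below assumes a reward of 1 at every step over a 50-step horizon; the helper name `discounted_return` and the chosen values are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward back."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0] * 50  # assumed: constant reward of 1 per step

returns = {
    gamma: discounted_return(rewards, gamma) for gamma in (0.0, 0.9, 1.0, 1.1)
}
for gamma, value in returns.items():
    print(f"gamma={gamma}: return over {len(rewards)} steps = {value:.3f}")
```

With γ = 0 only the first reward counts. With γ = 0.9 the sum stays below 1 / (1 − 0.9) = 10 however long the horizon is. With γ = 1 the sum grows linearly with the horizon, and with γ = 1.1 it grows geometrically, which is why value estimates may fail to converge in those cases.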

Initial conditions (Q(S_0, A_0))

Since SARSA is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. A high initial value, also known as "optimistic initial conditions", can encourage exploration: no matter what action takes place, the update rule causes it to have a lower value than the untried alternatives, thus increasing their choice probability. In 2013 it was suggested that the first reward r could be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value of Q. This allows immediate learning in case of fixed deterministic rewards. This resetting-of-initial-conditions (RIC) approach seems to be consistent with human behavior in repeated binary choice experiments.
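Both ideas can be sketched with a tabular Q function. The code below is one possible reading of the description above, not a reference implementation: the function names, the `seen` set used to detect a first visit, the initial value of 10, and the rewards are all assumptions made for illustration.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Standard tabular SARSA update; returns the new Q(s, a)."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]


def sarsa_update_ric(Q, seen, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA update with resetting of initial conditions (assumed form).

    On the first visit to (s, a), Q(s, a) is set to the observed reward.
    """
    if (s, a) not in seen:
        seen.add((s, a))
        Q[(s, a)] = r
        return Q[(s, a)]
    return sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma)


# Optimistic initial conditions: every pair starts at a high value.
Q_opt = defaultdict(lambda: 10.0)
tried = sarsa_update(Q_opt, "s", "left", r=0.0, s_next="s", a_next="left")
untried = Q_opt[("s", "right")]
print("tried:", tried, "untried:", untried)

# RIC: the first observed reward replaces the initial value directly.
Q_ric, seen = defaultdict(float), set()
first = sarsa_update_ric(Q_ric, seen, "s", "left", r=3.0, s_next="end", a_next="left")
print("after first visit:", first)
```

In the optimistic case the tried action drops below the untried one after a single unrewarded step, so a greedy agent would switch to the untried action next. In the RIC case a fixed deterministic reward of 3 is learned after one visit instead of being approached gradually from the initial value.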
