Basic Reinforcement Learning Tutorial 1

Background

Value Functions (state-action pair functions) estimate:
– how good a particular action will be in a given state
– what return is expected from taking that action.
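
To make this concrete, a tabular action-value function can be held in a plain lookup table keyed by (state, action) pairs. The short Python sketch below uses made-up states, actions, and values purely for illustration; none of them come from the text above.

# Hypothetical Q-table: maps (state, action) pairs to estimated returns.
# The states, actions, and numbers are placeholders, not real data.
Q = {
    ("room_0", "left"):  0.5,
    ("room_0", "right"): 0.0,
    ("room_1", "left"):  1.0,
}

# "How good is an action in a given state?" is a lookup:
print(Q[("room_0", "left")])   # estimated return 0.5

# "Which action looks best in room_0?" is a max over the available actions:
best = max(["left", "right"], key=lambda a: Q[("room_0", a)])
print(best)                    # "left"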

Q-Learning
– an off-policy algorithm (it can update the estimated value functions using hypothetical actions, i.e. actions that have not actually been tried) for temporal difference learning (a method for estimating value functions).
– it can be proven that, given sufficient training, Q-learning converges with probability 1 to a close approximation of the action-value function for an arbitrary target policy.
– learns the optimal policy even when actions are selected according to a more exploratory or even random policy.
– can be implemented with the following update rule:

Q(s, a) ← Q(s, a) + alpha * [r + gamma * max Q(s’, a’) - Q(s, a)]

where:
s: is the previous state
a: is the previous action
Q(): is the table of estimated Q-values (the action-value function being learned)
s’: is the current state
r: is the reward received for taking action a in state s
alpha: is the learning rate, generally set between 0 and 1. Setting it to 0 means that the Q-values are never updated, so nothing is learned. Setting alpha to a high value such as 0.9 means that learning can occur quickly.
gamma: is the discount factor, also set between 0 and 1. This models the fact that future rewards are worth less than immediate rewards.
max: is the maximum reward that is attainable in the state following the current one (the reward for taking the optimal action thereafter), i.e. the largest Q-value over the actions available in s’.
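
As a rough Python sketch of this update rule (assuming a tabular Q stored in a dictionary; the function name, default parameter values, and the actions argument are illustrative assumptions, not part of the original text):

from collections import defaultdict

# Tabular Q-values; unseen (state, action) pairs default to 0.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # alpha: learning rate (0 = never update; values near 1 = learn quickly)
    # gamma: discount factor (future rewards are worth less than immediate ones)
    # actions: the actions available in s_next, used for the max term
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = Q[(s, a)] + alpha * (r + gamma * best_next - Q[(s, a)])

A single call such as q_update("room_0", "left", 0.0, "room_1", ["left", "right"]) performs one temporal-difference step toward r + gamma * max Q(s’, a’); the state and action names here are again hypothetical.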

The algorithm can be interpreted as:

1. Initialize the Q-values table, Q(s, a).
2. Observe the current state, s.
3. Choose an action, a, for that state based on the selection policy.
4. Take the action, and observe the reward, r, as well as the new state, s’.
5. Update the Q-value for the state using the observed reward and the maximum reward possible for the next state.
6. Set the state to the new state, and repeat the process until a terminal state is reached.
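
Putting the six steps together, a minimal tabular Q-learning loop might look like the Python sketch below. The environment interface (env.reset(), env.step(), env.actions) and the epsilon-greedy selection policy are assumptions made for illustration, not something prescribed by the text above.

import random
from collections import defaultdict

def train(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # 1. Initialize the Q-values table, Q(s, a).
    Q = defaultdict(float)

    for _ in range(episodes):
        # 2. Observe the current state, s.
        s = env.reset()
        done = False
        while not done:
            # 3. Choose an action based on the selection policy
            #    (here: epsilon-greedy, mostly greedy with occasional random exploration).
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])

            # 4. Take the action, and observe the reward r as well as the new state s'.
            s_next, r, done = env.step(a)

            # 5. Update Q(s, a) using the observed reward and the best Q-value in the next state.
            best_next = max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

            # 6. Set the state to the new state, and repeat until a terminal state is reached.
            s = s_next
    return Q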
