Work in progress
In this repo, you'll find TensorFlow implementations of two methods from the paper Asynchronous Methods for Deep Reinforcement Learning by Mnih et al., 2016: Asynchronous 1-step Q-learning and Asynchronous 1-step SARSA. By default, they run on OpenAI's Gym environment, but you can easily play around with other examples through minor edits in
To get started, simply run python
In this method, each parallel worker (or thread) interacts with its own copy of the environment. At each step, a worker computes the gradient of the Q-learning loss and accumulates these gradients over multiple timesteps before applying them, which has an effect similar to using minibatches. Each worker is given a different exploration rate, which diversifies exploration and helps improve the robustness of the algorithm.
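To make the minibatch analogy concrete, here is a minimal sketch in plain Python rather than TensorFlow. It assumes a linear Q-function and treats each target as a constant; `td_gradient`, `accumulate_and_apply` and `sample_exploration_rate` are hypothetical helpers, not functions from this repo. The exploration rates and their probabilities are the ones reported in the paper and may differ from what this code base uses.

```python
import random

def td_gradient(theta, features, target):
    """Gradient of (target - Q)^2 w.r.t. theta, for a linear Q = theta . features."""
    q = sum(t * f for t, f in zip(theta, features))
    return [-2.0 * (target - q) * f for f in features]

def accumulate_and_apply(theta, transitions, learning_rate):
    """Sum the per-step gradients, then apply them in one update (like a minibatch)."""
    grad_sum = [0.0] * len(theta)
    for features, target in transitions:
        step_grad = td_gradient(theta, features, target)
        grad_sum = [g + d for g, d in zip(grad_sum, step_grad)]
    return [t - learning_rate * g for t, g in zip(theta, grad_sum)]

def sample_exploration_rate(rng):
    """Pick one worker's exploration rate (assumed values, taken from the paper)."""
    return rng.choices([0.1, 0.01, 0.5], weights=[0.4, 0.3, 0.3])[0]

# Two transitions are accumulated before the weights change once.
theta = accumulate_and_apply(
    [0.0, 0.0],
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],
    learning_rate=0.1,
)

# Each worker draws its own exploration rate.
worker_epsilons = [sample_exploration_rate(random.Random(seed)) for seed in range(4)]
```

Because every gradient in the batch is computed against the same weights, the single update is equivalent to one gradient step on the summed loss of those transitions.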
This method is very similar to 1-step Q-learning, except that it uses a different target value for Q(s,a). While Q-learning uses r + ɣ max_a' Q(s',a'; θ'), 1-step SARSA uses r + ɣQ(s',a'; θ'), where a' represents the action taken in s'.
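The difference between the two targets can be sketched as two small Python functions. This is an illustration only: `next_q_values` stands for the target network's outputs Q(s',·; θ') as a plain list, and the function names are hypothetical rather than taken from this repo.

```python
def q_learning_target(reward, next_q_values, gamma, terminal):
    """r + gamma * max over a' of Q(s', a'; theta'), or just r if s' is terminal."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, gamma, terminal):
    """r + gamma * Q(s', a'; theta'), where a' is the action actually taken in s'."""
    if terminal:
        return reward
    return reward + gamma * next_q_values[next_action]

next_q = [0.5, 2.0]  # example target-network outputs for two actions
y_q = q_learning_target(1.0, next_q, gamma=0.9, terminal=False)
y_sarsa = sarsa_target(1.0, next_q, next_action=0, gamma=0.9, terminal=False)
```

The two targets coincide whenever the action taken in s' is also the greedy one; they differ only on exploratory steps.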
// Algorithm for one worker.
// Assume global shared θ, θ', and the counter global_step = 0.
Initialize worker step counter local_step ← 0
Initialize target network weights θ' ← θ
Initialize network gradients dθ ← 0
Get initial state s
while global_step < global_max_steps do
    Take action a with ε-greedy policy based on Q(s,a;θ)
    Receive new state s' and reward
    if s' is terminal then
        y = reward
    else if Q-learning then
        y = reward + ɣ max_a' Q(s',a';θ')
    else if SARSA then
        y = reward + ɣQ(s',a';θ')
    end if
    Accumulate gradients wrt θ: dθ ← dθ + ∂(y−Q(s,a;θ))² / ∂θ
    s ← s'
    global_step ← global_step + 1
    local_step ← local_step + 1
    if global_step % target_network_update == 0 then
        Update the target network θ' ← θ
    end if
    if local_step % local_max_steps == 0 or s' is terminal then
        Perform asynchronous update of θ using dθ.
        Clear gradients dθ ← 0.
    end if
end while
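The worker loop above can be tried out with a small, single-threaded Python sketch. It is not this repo's implementation: a table of Q-values stands in for the network, `ChainEnv` is a hypothetical toy environment, there is only one worker (so nothing is actually asynchronous), and the environment reset after a terminal state is an added assumption. The hyperparameter values are arbitrary.

```python
import random

class ChainEnv:
    """Hypothetical toy task: walk right along a short chain to reach the goal."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left (clamped at state 0).
        self.state = max(0, self.state + (1 if action == 1 else -1))
        terminal = self.state == self.length - 1
        return self.state, (1.0 if terminal else 0.0), terminal

def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy one (ties at random)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

def run_worker(method="q", global_max_steps=5000, local_max_steps=5,
               target_network_update=100, gamma=0.9, epsilon=0.2,
               learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    theta = [[0.0, 0.0] for _ in range(env.length)]  # tabular stand-in for θ
    theta_target = [row[:] for row in theta]         # target weights θ'
    grads = {}                                       # accumulated dθ per (s, a)
    global_step = local_step = 0
    s, a = env.reset(), None
    while global_step < global_max_steps:
        if a is None:
            a = epsilon_greedy(theta[s], epsilon, rng)
        s_next, reward, terminal = env.step(a)
        a_next = None
        if terminal:
            y = reward
        elif method == "q":
            y = reward + gamma * max(theta_target[s_next])
        else:  # SARSA: bootstrap from the action that will be taken in s'
            a_next = epsilon_greedy(theta[s_next], epsilon, rng)
            y = reward + gamma * theta_target[s_next][a_next]
        # Gradient of (y - Q(s,a))^2 with respect to the table entry Q(s,a).
        grads[(s, a)] = grads.get((s, a), 0.0) - 2.0 * (y - theta[s][a])
        global_step += 1
        local_step += 1
        if global_step % target_network_update == 0:
            theta_target = [row[:] for row in theta]
        if local_step % local_max_steps == 0 or terminal:
            for (gs, ga), g in grads.items():
                theta[gs][ga] -= learning_rate * g
            grads.clear()
        s = env.reset() if terminal else s_next
        a = a_next  # None for Q-learning, so a fresh action is chosen next step
    return theta

q_table = run_worker(method="q")
sarsa_table = run_worker(method="sarsa")
```

In the state next to the goal, both learned tables should end up preferring the action that moves right, since only that action reaches the rewarded terminal state.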
- Name of the Atari game to play. Full list here.
- histrogram_summary - How many episodes to average histogram summary over.
- load_checkpoint - Whether to load from available checkpoints.
- save_checkpoint - Whether to save checkpoints when a break is triggered.
- save_stats - Whether to save stats for TensorBoard.
- random_seed - Sets the random seed.
- use_gpu - Whether TensorFlow operations should run on GPU rather than CPU.
- display - Whether to render the game.
- log - Enables a verbose log.
- Number of asynchronous agents (threads) to train with.
- global_max_steps - 80 000 000 - Maximum training steps.
- local_max_steps - Frequency with which each agent network is updated (I_AsyncUpdate).
- Frequency with which the shared target network is updated (I_target). Defaults to 10 000.
- How many frames to skip (or actions to repeat) for each step.
- Training algorithm to use [q, sarsa]. Defaults to Q-learning.
- gamma - Discount factor for rewards.
- epsilon_anneal - 1 000 000 - Number of steps to anneal epsilon.
- Which optimizer to use [adam, gradientdescent, rmsprop]. Defaults to rmsprop.
- RMSProp decay parameter.
- rms_epsilon - RMSProp epsilon parameter.
- learning_rate - Initial learning rate.
- anneal_learning_rate - Whether the learning rate should be annealed over global max steps.
- Whether to run continuous evaluation throughout the training session.
- evaluation_episodes - How many evaluation episodes to run (and average the evaluation over).
- evaluation_frequency - 100 000 - The frequency of evaluation runs.
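The epsilon_anneal and anneal_learning_rate options both describe a value that changes gradually over a number of steps. A minimal sketch of one way to do this is shown below. It assumes a linear schedule, which the options above do not confirm; `linear_anneal` is a hypothetical helper, and the start and end values are placeholders rather than this repo's defaults.

```python
def linear_anneal(start, end, step, anneal_steps):
    """Move linearly from start to end over anneal_steps, then stay at end."""
    fraction = min(max(step, 0), anneal_steps) / anneal_steps
    return start + (end - start) * fraction

# Epsilon annealed over epsilon_anneal = 1 000 000 steps
# (start 1.0 and end 0.1 are placeholder values).
epsilon_halfway = linear_anneal(1.0, 0.1, step=500_000, anneal_steps=1_000_000)
epsilon_late = linear_anneal(1.0, 0.1, step=5_000_000, anneal_steps=1_000_000)

# Learning rate annealed to zero over global_max_steps = 80 000 000 steps
# (the initial rate 0.0007 is a placeholder value).
lr_halfway = linear_anneal(0.0007, 0.0, step=40_000_000, anneal_steps=80_000_000)
```

Clamping the step keeps the value at its final level once the annealing period is over, so training can continue past that point without the schedule overshooting.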