Code accompanying a lecture series on Deep Reinforcement Learning at NTNU: http://www.talkingdrums.info/lecture/2017/11/21/drl-series/
- Get the Python 3.6 version of Anaconda.
- Run the following commands to set up the Python environment (Linux/macOS):
conda create --name <envname> python=3
source activate <envname>
conda install matplotlib
pip install gym
pip install --upgrade tensorflow
pip install keras
pip install h5py
- Run the following commands to set up the Python environment (Windows):
conda create --name <envname> python=3
activate <envname>
conda install matplotlib
pip install gym
pip install --upgrade tensorflow
pip install keras
pip install h5py
- Try another RL environment/problem with the DQN implementation in `dqn.py` (e.g. by changing the size of the state space for the catch problem), or try an environment from OpenAI Gym; a minimal Gym interaction loop is sketched below.
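If you go the Gym route, the classic Gym interaction loop looks roughly like this. `CartPole-v0` is only a placeholder environment, and the random action stands in for the agent's epsilon-greedy action:

```python
import gym

# Placeholder environment; swap in any task with a small discrete action space.
env = gym.make('CartPole-v0')

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # replace with the DQN's epsilon-greedy choice
    state, reward, done, info = env.step(action)
```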
- Play with the value network architecture (e.g. add or remove layers, or change layer sizes).
- Note that the basic implementation has only one network; let's call it the online network. It is used to collect data as well as to compute targets in the `compute_targets` function. For easy problems this can be fine, but for Atari it will cause the computed targets to move in detrimental ways. Can you implement a second network that is used only to compute targets? It should be a clone of the online net (same architecture); let's call it the target network. The parameters of the online and target networks should be synced at regular intervals, so the target network stays frozen between intervals. Here (slide 6) is some intuition on this -- accompanying video. Hint: use the Keras functions `set_weights()` and `get_weights()`. For example, the following could be set up: `target_network.set_weights(online_network.get_weights())`. More hints here, which should also help with task 6! A minimal sketch is given below.
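A minimal sketch of how this could look, assuming `online_network` is the compiled Keras model from the basic implementation. Using `clone_model` is just one option; you can also rebuild the network with the same code used for the online net. The sync interval is a placeholder:

```python
from keras.models import clone_model

def make_target_network(online_network):
    """Create a copy of the online network: same architecture, copied weights."""
    target_network = clone_model(online_network)
    target_network.set_weights(online_network.get_weights())
    return target_network

def sync_target_network(online_network, target_network, step, sync_interval=1000):
    """Copy online weights into the target network every `sync_interval` steps.

    Between syncs the target network stays frozen, which stabilises the targets."""
    if step % sync_interval == 0:
        target_network.set_weights(online_network.get_weights())
```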
- Try prioritised sampling from the replay buffer. Hint: you can plug in the prioritised experience replay code from OpenAI baselines. A rough sketch of the idea is given below.
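A rough sketch of proportional prioritisation, without the sum-tree that the baselines code uses for efficiency. `buffer` and `priorities` are assumed parallel lists (transitions and their |TD error|-based priorities), and the hyperparameters are placeholders:

```python
import numpy as np

EPS, ALPHA, BETA = 1e-6, 0.6, 0.4  # placeholder hyperparameters

def sample_prioritised(buffer, priorities, batch_size):
    """Sample transitions with probability proportional to priority**ALPHA."""
    probs = np.asarray(priorities) ** ALPHA
    probs /= probs.sum()
    idx = np.random.choice(len(buffer), batch_size, p=probs)
    # Importance-sampling weights correct the bias from non-uniform sampling.
    weights = (len(buffer) * probs[idx]) ** (-BETA)
    weights /= weights.max()
    batch = [buffer[i] for i in idx]
    return batch, idx, weights

def update_priorities(priorities, idx, td_errors):
    """After training on a batch, refresh priorities with the new TD errors."""
    for i, err in zip(idx, td_errors):
        priorities[i] = abs(err) + EPS
```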
- Try a dueling network architecture. An example implementation showing how to change the network architecture to make it dueling can be found here; a sketch is also given below.
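A sketch of a dueling head in the Keras functional API, with assumed state/action sizes. The idea is to split the top of the network into a state-value stream and an advantage stream, then recombine them into Q-values:

```python
from keras.layers import Input, Dense, Lambda
from keras.models import Model
import keras.backend as K

state_dim, n_actions = 4, 2  # placeholder sizes; match your environment

inp = Input(shape=(state_dim,))
h = Dense(64, activation='relu')(inp)

# Two streams: state value V(s) and action advantages A(s, a).
value = Dense(1)(h)
advantage = Dense(n_actions)(h)

# Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
q_values = Lambda(lambda x: x[0] + x[1] - K.mean(x[1], axis=1, keepdims=True))(
    [value, advantage])

model = Model(inputs=inp, outputs=q_values)
model.compile(optimizer='adam', loss='mse')
```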
- Try selecting the assumed-optimal (argmax) action using the online net and evaluating it using the target net when computing targets. This creates an ensembling effect and makes the target estimates better. It is also the idea behind double DQN; see the sketch below.
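A sketch of the double-DQN target computation, assuming `online_network` and `target_network` are the two Keras models and the batch arrays come from the replay buffer (`dones` is a float array of 0/1 terminal flags):

```python
import numpy as np

def double_dqn_targets(online_network, target_network,
                       rewards, next_states, dones, gamma=0.99):
    """Online net picks the greedy action, target net evaluates it."""
    best_actions = np.argmax(online_network.predict(next_states), axis=1)
    next_q = target_network.predict(next_states)
    evaluated = next_q[np.arange(len(next_states)), best_actions]
    # No bootstrap term for terminal transitions.
    return rewards + gamma * (1.0 - dones) * evaluated
```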
- Try putting prioritised sampling, dueling architecture, and double DQN learning together!
- Try another RL environment/problem, e.g. from OpenAI Gym, with the simple policy gradient algorithm implementation in `pg.py`.
- Play with the policy network architecture.
- Try returns with and without discounting. Do the gradients become noisier without discounting? Does it take longer to train to the same performance as with discounting? A helper for computing discounted returns is sketched below.
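A small helper for computing per-step discounted returns over one episode; setting `gamma=1.0` gives the undiscounted returns for comparison:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for every step."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```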
- Full-episode returns are called Monte Carlo returns. These have high variance; discounting helps to some extent, but the gradients are still noisy, since the returns modulate the gradients. Try including a baseline to reduce the variance in the episodic returns. This can be done (already set up for you in `pg_with_baseline_task.py`!) by setting up another network that outputs the value (expected return) of a state. Can you use it to compute advantages, as opposed to full returns? Then you can use the advantages to modulate the gradients! A solution is in `pg_with_baseline.py`, but try figuring it out yourself first to see if you get the concept of baselines and action advantages. A sketch of the advantage computation is given below.
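A sketch of the advantage computation with a separate value network as the baseline. The `value_network` name is a placeholder and not necessarily what the provided files use:

```python
import numpy as np

def compute_advantages(value_network, states, returns):
    """Advantage = Monte Carlo return - baseline V(s)."""
    baselines = value_network.predict(states).flatten()
    advantages = returns - baselines
    # Regress the baseline towards the observed returns for the next iteration.
    value_network.fit(states, returns, verbose=0)
    return advantages
```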
- Try using the same network for the policy and the value/baseline. Hint: the last-but-one layer (before the output) can have two heads, one giving the policy and the other giving the value. A single (summed) loss function for both can also be constructed, to compute gradients more efficiently. A sketch follows below.
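A sketch of a shared network with two heads in the Keras functional API, with placeholder sizes. The summed loss is expressed through `loss_weights`; the exact policy loss you plug in depends on how you implement the policy gradient:

```python
from keras.layers import Input, Dense
from keras.models import Model

state_dim, n_actions = 4, 2  # placeholder sizes; match your environment

inp = Input(shape=(state_dim,))
h = Dense(64, activation='relu')(inp)  # shared layers

policy_head = Dense(n_actions, activation='softmax', name='policy')(h)
value_head = Dense(1, name='value')(h)

model = Model(inputs=inp, outputs=[policy_head, value_head])
# One summed objective: a policy loss plus value regression, weighted and added.
model.compile(optimizer='adam',
              loss={'policy': 'categorical_crossentropy', 'value': 'mse'},
              loss_weights={'policy': 1.0, 'value': 0.5})
```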
- Try changing the loss/objective function to make the policy updates proximal. Proximality here means that the updates to the policy network should be such that the updated policy does not become very different from the policy before the update. One possible objective is sketched below.
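One well-known way to keep updates proximal is the clipped surrogate objective from PPO; a sketch of that objective in Keras backend ops, where all arguments are tensors for the taken actions and the clipping threshold is a placeholder:

```python
import keras.backend as K

def clipped_surrogate(new_probs, old_probs, advantages, epsilon=0.2):
    """PPO-style clipped surrogate objective, returned as a loss to minimise.

    `new_probs` are the current policy's probabilities of the taken actions,
    `old_probs` come from the policy before the update."""
    ratio = new_probs / (old_probs + 1e-10)
    unclipped = ratio * advantages
    # Clipping removes the incentive to move the ratio far from 1.
    clipped = K.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -K.mean(K.minimum(unclipped, clipped))
```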
- Try including bootstrapping (as opposed to Monte Carlo sampling) in the returns for each step of each episode. You can use the second network/value function to carry out the bootstrapping; in doing so, this network plays the role of a critic. See the sketch below.
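A sketch of one-step bootstrapped targets and the corresponding TD-style advantages, with the value network acting as the critic (all names are placeholders):

```python
import numpy as np

def bootstrapped_targets(value_network, states, rewards, next_states, dones, gamma=0.99):
    """One-step TD targets r_t + gamma * V(s_{t+1}) instead of full Monte Carlo returns."""
    next_values = value_network.predict(next_states).flatten()
    targets = rewards + gamma * (1.0 - dones) * next_values
    # TD-error style advantages for the policy update.
    advantages = targets - value_network.predict(states).flatten()
    return targets, advantages
```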
- Consider this an open-book challenge. Work your way through the tasks in any order you like.
- Team up with your neighbour or work through these by yourself. If something is not clear, ask me or ask on Slack.
- Basic implementations of value- and policy-based methods, and the environment setup instructions, are provided. Go through these and see if you understand everything well. If not, ask.
- Either build on top of these implementations as you go through the tasks, or implement your own from scratch!
- Refer to online lectures, blog posts, available code, etc. Or ask.
- Best practices from John Schulman (OpenAI) when working with deep RL: Video, Slides, Notes
- DQN intuition from Vlad Mnih (DeepMind)
- Policy gradients intuition from Andrej Karpathy (Tesla)
- Some online tutorials with code samples
- Deep RL Bootcamp -- a must-watch if you want to start working in the field!
- Full UCL course on RL by David Silver (DeepMind)
- Full UC Berkeley course on Deep RL
- Use Slack during/after the lecture to discuss issues and share thoughts on your implementations.
- Have fun!