This repo aims to implement various reinforcement learning agents using Keras (tf==2.2.0) and sklearn, for use with OpenAI Gym environments.
- Methods
- Off-policy
- Linear Q learning
- Deep Q learning
- Mountain car
- CartPole
- Pong
- Vizdoom
- GFootball
- Model extensions
- Replay buffer
- Unrolled Bellman
- Dueling architecture
- Multiple environments
- Double DQN
- Noisy network
- Policy gradient methods
- REINFORCE
- Mountain car
- CartPole
- Pong
- Actor-critic
- Mountain car
- CartPole
- Pong
- REINFORCE
- Off-policy
- Deep reinforcement learning hands-on, 2nd edition, Maxim Lapan
- The Lazy Programmers' courses:
- Lilian Weng's overviews of reinforcement learning. I try to use the same terminology as these posts.
- Multiple Github repos and Medium posts on individual techniques - these are cited in context.
git clone
cd reinforcement-learning-keras
pip install -r requirements.txt
PongNoFrameskip-v4 with various wrappers.
Model:
State -> action model -> [value for action 1, value for action 2]
A deep Q learning agent that uses a small neural network to approximate Q(s, a). It includes a replay buffer that allows for batched training updates. This is important for two reasons:
- As this method is off-policy (the action is selected as argmax(action values)), it can train on data collected during previous episodes. This reduces correlation in the training data.
- This is important for performance, especially when using a GPU. Calling multiple predict/train operations on single rows inside a loop is very inefficient.
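As an illustration of the two points above, here is a minimal replay buffer sketch using only the standard library. The class name `ReplayBuffer`, its method names, and the default size are hypothetical and are not taken from this repo; the agent's own buffer may be organised differently.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, max_size=10000, seed=None):
        # A deque with maxlen discards the oldest item once it is full.
        self._buffer = deque(maxlen=max_size)
        self._rng = random.Random(seed)

    def append(self, state, action, reward, next_state, done):
        self._buffer.append((state, action, reward, next_state, done))

    def __len__(self):
        return len(self._buffer)

    def sample(self, batch_size):
        """Return a uniformly sampled batch as five parallel lists."""
        batch = self._rng.sample(list(self._buffer), batch_size)
        states, actions, rewards, next_states, dones = map(list, zip(*batch))
        return states, actions, rewards, next_states, dones


# Toy usage: 8 transitions pushed into a buffer that only holds 5.
buffer = ReplayBuffer(max_size=5, seed=0)
for step in range(8):
    buffer.append([float(step)], step % 2, 1.0, [float(step + 1)], False)

states, actions, rewards, next_states, dones = buffer.sample(batch_size=3)
```

Sampling uniformly from the whole buffer mixes steps from different episodes into one batch, and the batch can then be passed to a single predict/train call rather than one call per row.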
This agent uses two copies of its model:
- One to predict the value of the next action, which is updated every episode step (with a batch sampled from the replay buffer).
- One to predict the value of the actions in the current and next states for calculating the discounted reward. This model is updated with the weights from the first model at the end of each episode.
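The role of the second model can be sketched as a target calculation. This is a simplified, framework-free illustration: the function name `td_targets`, the default `gamma`, and the toy numbers are assumptions, and the list of predictions stands in for what the second (value) model would output for each next state.

```python
def td_targets(rewards, dones, next_state_values, gamma=0.99):
    """Targets r + gamma * max_a' Q(s', a'), or just r when the episode ended.

    next_state_values holds, per sampled transition, the action values
    predicted for the next state by the slowly updated value model.
    """
    targets = []
    for reward, done, next_values in zip(rewards, dones, next_state_values):
        future = 0.0 if done else gamma * max(next_values)
        targets.append(reward + future)
    return targets


# Toy batch of three transitions; the last one ends the episode.
rewards = [1.0, 1.0, -1.0]
dones = [False, False, True]
value_model_predictions = [[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]]

targets = td_targets(rewards, dones, value_model_predictions, gamma=0.9)
```

Because the value model only changes at the end of an episode, the targets stay fixed while the first model is trained on them step by step. In Keras, weights can be copied between two models of identical architecture with `set_weights(other.get_weights())`; whether the agent does exactly this is an assumption.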
from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.atari.pong.pong_config import PongConfig
VirtualGPU(4096)
agent = DeepQAgent(**PongConfig('dqn').build())
agent.train(verbose=True, render=True, max_episode_steps=10000)
Using CartPole-v0 with the step limit increased from 200 to 500.
![Episode play example](images/DQNAgent.gif) ![Convergence](images/DQNAgent.png)
from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig
VirtualGPU(256)
agent = DeepQAgent(**CartPoleConfig('dqn').build())
agent.train(verbose=True, render=True)
from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.mountain_car.mountain_car_config import MountainCarConfig
VirtualGPU(256)
agent = DeepQAgent(**MountainCarConfig('dqn').build())
agent.train(verbose=True, render=True, max_episode_steps=1500)
![Episode play example](images/DuelingDQNAgent.gif) ![Convergence](images/DuelingDQNAgent.png)
The dueling version is exactly the same as the DQN, except with a slightly different model architecture. The second-to-last layer is split into two layers, one with units=1 and one with units=n_actions. The idea is that the model might learn the state value V(s) and the action advantages A(s, a) separately, which can speed up convergence.
The output of the network is still action values; however, the preceding layers are not fully connected. The values are now V(s) + A(s, a), and a subsequent Keras Lambda layer is used to calculate the action advantages.
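The way the two streams recombine into action values can be shown with plain numbers. The sketch below uses the mean-subtracted aggregation that is commonly used for dueling networks; this is an assumption about the general technique, and the Lambda layer in this repo may combine the streams differently. The function name `dueling_q_values` is hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (adv - mean_advantage) for adv in advantages]


# One state with V(s) = 2.0 and two actions with advantages 1.0 and 3.0.
q_values = dueling_q_values(state_value=2.0, advantages=[1.0, 3.0])
```

Subtracting the mean advantage keeps the split identifiable: without it, a constant could be moved freely between V(s) and A(s, a) without changing the output.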
from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig
VirtualGPU(256)
agent = DeepQAgent(**CartPoleConfig('dueling_dqn').build())
agent.train(verbose=True, render=True)
Model:
State -> model for action 1 -> value for action 1
State -> model for action 2 -> value for action 2
This agent is based on the implementation in The Lazy Programmer's 2nd reinforcement learning course. It uses a separate SGDRegressor model for each action to estimate Q(s, a). Each step, the model for the selected action is updated using .partial_fit. Action selection is off-policy and uses epsilon greedy; the selected action is either the argmax of the action values or a random action, depending on the current value of epsilon.
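Epsilon greedy selection itself is only a few lines. The following is a minimal standard-library sketch; the function name `epsilon_greedy` and its `rng` parameter are hypothetical and not the agent's actual interface.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


# epsilon=0.0 always exploits; epsilon=1.0 always explores.
greedy_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=1.0, rng=random.Random(0))
```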
Environment observations are preprocessed in an sklearn pipeline that clips, scales, and creates features using RBFSampler.
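A rough sketch of how such a pipeline and the per-action models might fit together is shown below. The clip limits, `gamma`, `n_components`, learning rate, the random stand-in observations, and the helper names (`clip_observation`, `action_values`, `update`) are all illustrative assumptions, not the values or names used by this agent.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def clip_observation(X):
    # Assumed clip range; real limits depend on the environment.
    return np.clip(X, -1.0, 1.0)


# Stand-in for observations sampled from an environment (2 features here).
rng = np.random.default_rng(0)
observation_samples = rng.uniform(-1.0, 1.0, size=(500, 2))

# Clip -> scale -> RBF features.
preprocessor = Pipeline([
    ("clip", FunctionTransformer(clip_observation)),
    ("scale", StandardScaler()),
    ("rbf", RBFSampler(gamma=1.0, n_components=50, random_state=0)),
])
preprocessor.fit(observation_samples)

# One linear model per action. A dummy partial_fit initialises each model so
# that predict() can be called before any real update.
n_actions = 3
first_features = preprocessor.transform(observation_samples[:1])
models = [SGDRegressor(learning_rate="constant", eta0=0.01) for _ in range(n_actions)]
for model in models:
    model.partial_fit(first_features, [0.0])


def action_values(observation):
    features = preprocessor.transform(np.atleast_2d(observation))
    return [float(model.predict(features)[0]) for model in models]


def update(observation, action, target):
    """Update only the model belonging to the action that was taken."""
    features = preprocessor.transform(np.atleast_2d(observation))
    models[action].partial_fit(features, [target])


# Repeatedly push action 1 towards a target of 5.0 for one observation.
obs = np.array([0.2, -0.4])
for _ in range(200):
    update(obs, action=1, target=5.0)
values = action_values(obs)
```

Only the model for action 1 moves towards the target; the models for the other actions are left unchanged, which is the behaviour described above for `.partial_fit` on the selected action.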
from rlk.agents.q_learning.linear_q_agent import LinearQAgent
from rlk.environments.mountain_car.mountain_car_config import MountainCarConfig
agent = LinearQAgent(**MountainCarConfig('linear_q').build())
agent.train(verbose=True, render=True, max_episode_steps=1500)
from rlk.agents.q_learning.linear_q_agent import LinearQAgent
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig
agent = LinearQAgent(**CartPoleConfig('linear_q').build())
agent.train(verbose=True, render=True)
![Episode play example](images/REINFORCEAgent.gif)
Model:
State -> model -> [probability of action 1, probability of action 2]
Refs:
https://github.com/Alexander-H-Liu/Policy-Gradient-and-Actor-Critic-Keras
Policy gradient models move the action selection policy into the model, rather than using argmax(action values). Model outputs are action probabilities rather than values (π(a|s), where π is the policy), making these methods inherently stochastic and removing the need for epsilon greedy action selection.
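To make the contrast with argmax selection concrete, here is a small sketch of drawing an action from the model's output probabilities. The function name `sample_action` and the example probabilities are hypothetical.

```python
import random


def sample_action(action_probabilities, rng=random):
    """Draw an action index according to the policy's output probabilities."""
    actions = range(len(action_probabilities))
    return rng.choices(actions, weights=action_probabilities, k=1)[0]


# With pi(a|s) = [0.9, 0.1], action 0 is chosen most of the time, not always.
rng = random.Random(0)
draws = [sample_action([0.9, 0.1], rng) for _ in range(1000)]
share_of_action_0 = draws.count(0) / len(draws)
```

Exploration comes from the sampling itself, so no separate epsilon schedule is needed.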
This agent uses a small neural network to predict action probabilities given a state. Updates are done in a Monte-Carlo fashion, i.e. using all steps from a single episode. This removes the need for a complex replay buffer (list.append() does the job). However, as the method is on-policy, it requires data from the current policy for training. This means training data can't be collected across episodes (assuming the policy is updated at the end of each). As a result, the training data in each batch (episode) is highly correlated, which slows convergence.
This model doesn't use any scaling or clipping for environment pre-processing. For some reason, using the same pre-processing as with the DQN models prevents it from converging. The cart-pole environment can potentially return very large values when sampling from the observation space, but these are rarely seen during training. It seems to be fine to pretend they don't exist, rather than scaling inputs based on environment samples, as is done in the other methods.
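The Monte-Carlo part of the update can be sketched as follows: rewards are appended to a list during the episode, then converted to discounted returns once it ends. The function name `discounted_returns` and the `gamma` values are assumptions, and whether the agent also normalises the returns is not stated here.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = []
    running = 0.0
    # Work backwards so each return reuses the one that follows it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# The "replay buffer" for a single episode is just a list.
episode_rewards = []
for reward in [1.0, 1.0, 1.0]:
    episode_rewards.append(reward)

returns = discounted_returns(episode_rewards, gamma=0.5)
```

Each return weights the log-probability of the action taken at that step, so earlier steps in a long episode receive larger updates than later ones.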
from rlk.agents.policy_gradient.reinforce_agent import ReinforceAgent
from tf2_vgpu import VirtualGPU
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig
VirtualGPU(256)
agent = ReinforceAgent(**CartPoleConfig('reinforce').build())
agent.train(verbose=True, render=True)
Install these two packages:
Additionally, to save monitor wrapper output, install the following packages:
sudo apt install libcanberra-gtk-module libcanberra-gtk3-module
from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.doom.vizdoom_basic_config import VizDoomBasicConfig
VirtualGPU(256)
agent = DeepQAgent(**VizDoomBasicConfig(agent_type='dqn', mode='stack').build())
agent.train(n_episodes=1000, max_episode_steps=10000, verbose=True, render=True)
The DQNs struggle to solve this environment on their own. See the scripts and readme in scripts/doom/ for an example of training with additional experience collected by (scripted) bots.
Work in progress. Involves pre-training the agent on historical data, and sampling experience from (policy) bots.
See notes in scripts/gfootball/readme.md