Reinforcement learning algorithms implemented in Keras (tensorflow==2.3) and sklearn

garethjns/reinforcement-learning-keras

Reinforcement learning in Keras

This repo aims to implement various reinforcement learning agents using Keras (tf==2.2.0) and sklearn, for use with OpenAI Gym environments.

Episode play example

Planned agents

General references

Set-up

git clone 
cd reinforcement-learning-keras
pip install -r requirements.txt

Implemented algorithms and environment examples

Deep Q learner

Pong

PongNoFrameskip-v4 with various wrappers.

Episode play example Convergence

Model:
State -> action model -> [value for action 1, value for action 2]

A deep Q learning agent that uses a small neural network to approximate Q(s, a). It includes a replay buffer that allows for batched training updates, which is important for two reasons:

  • As this method is off-policy (the action is selected as argmax(action values)), it can train on data collected during previous episodes. This reduces correlation in the training data.
  • This is important for performance, especially when using a GPU. Calling multiple predict/train operations on single rows inside a loop is very inefficient.
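The replay buffer described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repo's implementation: the `ReplayBuffer` class, its method names, and the tuple layout are all assumptions made for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, max_size=10_000, seed=None):
        # A bounded deque drops the oldest experience once max_size is reached.
        self._buffer = deque(maxlen=max_size)
        self._rng = random.Random(seed)

    def __len__(self):
        return len(self._buffer)

    def append(self, state, action, reward, next_state, done):
        self._buffer.append((state, action, reward, next_state, done))

    def sample_batch(self, batch_size):
        """Sample without replacement and return one list per field."""
        if not self._buffer:
            raise ValueError("Cannot sample from an empty buffer")
        n = min(batch_size, len(self._buffer))
        batch = self._rng.sample(list(self._buffer), n)
        states, actions, rewards, next_states, dones = zip(*batch)
        return list(states), list(actions), list(rewards), list(next_states), list(dones)


# Fill past capacity: only the three most recent steps are kept.
buffer = ReplayBuffer(max_size=3, seed=0)
for step in range(5):
    buffer.append(state=[step], action=step % 2, reward=1.0,
                  next_state=[step + 1], done=False)

states, actions, rewards, next_states, dones = buffer.sample_batch(batch_size=2)
```

Sampling at random from many past steps is what breaks up the correlation between consecutive observations, and returning one list per field makes it easy to pass a whole batch to a single predict/train call.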

This agent uses two copies of its model:

  • One to predict the value of the next action, which is updated every episode step (with a batch sampled from the replay buffer).
  • One to predict the value of the actions in the current and next state for calculating the discounted reward. This model is updated with the weights from the first model at the end of each episode.
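How the second (slowly updated) model feeds into training can be sketched as below. This assumes the standard DQN target, r + gamma * max Q_target(s', a); the `td_targets` function and its argument names are hypothetical and are not taken from the repo.

```python
def td_targets(rewards, dones, next_state_values, gamma=0.99):
    """Bootstrapped training targets for a batch of transitions.

    next_state_values holds one list of action values per transition, as
    predicted by the slowly updated copy of the model (an assumption here).
    Terminal transitions use the reward alone, with no bootstrapping.
    """
    targets = []
    for reward, done, next_values in zip(rewards, dones, next_state_values):
        bootstrap = 0.0 if done else gamma * max(next_values)
        targets.append(reward + bootstrap)
    return targets


targets = td_targets(
    rewards=[1.0, 1.0],
    dones=[False, True],
    next_state_values=[[0.5, 2.0], [3.0, 4.0]],
    gamma=0.5,
)
# First transition: 1.0 + 0.5 * 2.0; second is terminal, so just the reward.
```

In Keras, copying weights between two models with identical architecture is typically done with `target_model.set_weights(model.get_weights())`; whether the repo syncs its models exactly this way is an assumption.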

Run example

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.atari.pong.pong_config import PongConfig

VirtualGPU(4096) 
agent = DeepQAgent(**PongConfig('dqn').build())
agent.train(verbose=True, render=True, max_episode_steps=10000)

Cart-pole

Uses CartPole-v0 with the step limit increased from 200 to 500.

![Episode play example](images/DQNAgent.gif) ![Convergence](images/DQNAgent.png)

Run example

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig

VirtualGPU(256) 
agent = DeepQAgent(**CartPoleConfig('dqn').build())
agent.train(verbose=True, render=True)

MountainCar (not well tuned)

Episode play example Convergence

Run example

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.mountain_car.mountain_car_config import MountainCarConfig

VirtualGPU(256)
agent = DeepQAgent(**MountainCarConfig('dqn').build())
agent.train(verbose=True, render=True, max_episode_steps=1500)

Extensions

Dueling DQN

![Episode play example](images/DuelingDQNAgent.gif) ![Convergence](images/DuelingDQNAgent.png)

The dueling version is exactly the same as the DQN, except with a slightly different model architecture. The second-to-last layer is split into two layers, one with units=1 and one with units=n_actions. The idea is that the model might learn V(s) and the action advantages (A(s)) separately, which can speed up convergence.

The output of the network is still action values; however, the preceding layers are not fully connected. The values are now V(s) + A(s), and a subsequent Keras lambda layer is used to calculate the action advantages.
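The way the two streams recombine into action values can be sketched without Keras. The function below is a hypothetical illustration: `center=False` gives the plain V(s) + A(s) sum described above, while `center=True` subtracts the mean advantage first, a common variant from the dueling DQN literature. Which form the repo's lambda layer uses is not confirmed here.

```python
def dueling_q_values(state_value, advantages, center=True):
    """Combine a scalar V(s) and per-action advantages into action values.

    center=True subtracts the mean advantage before adding V(s); this is an
    assumed variant for illustration, not necessarily what the repo does.
    """
    if center:
        mean_advantage = sum(advantages) / len(advantages)
        advantages = [a - mean_advantage for a in advantages]
    return [state_value + a for a in advantages]


q_plain = dueling_q_values(1.0, [0.5, -0.5], center=False)
q_centered = dueling_q_values(1.0, [2.0, 0.0], center=True)
```

Centering leaves the ranking of actions unchanged, so the argmax action is the same either way; it only pins down how the total is shared between the value and advantage streams.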

Run example

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig

VirtualGPU(256) 
agent = DeepQAgent(**CartPoleConfig('dueling_dqn').build())
agent.train(verbose=True, render=True)

Linear Q learner

Mountain car

Episode play example Convergence

Model:
State -> model for action 1 -> value for action 1
State -> model for action 2 -> value for action 2

This agent is based on the Lazy Programmer's 2nd reinforcement learning course implementation. It uses a separate SGDRegressor model for each action to estimate Q(a|s). Each step, the model for the selected action is updated using .partial_fit. Action selection is off-policy and uses epsilon greedy; the selected action is either the argmax of the action values or a random action, depending on the current value of epsilon.
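Epsilon-greedy selection over per-action value estimates can be sketched as follows. The `epsilon_greedy` function and its arguments are hypothetical; in the agent, the action values would come from one model prediction per action.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: uniform choice over action indices.
        return rng.randrange(len(action_values))
    # Exploit: index of the largest estimated value (first one on ties).
    return max(range(len(action_values)), key=lambda a: action_values[a])


greedy_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=1.0, rng=random.Random(0))
```

With epsilon at 0 the choice is always the argmax, and with epsilon at 1 it is always random; training typically decays epsilon between those extremes.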

Environment observations are preprocessed in an sklearn pipeline that clips, scales, and creates features using RBFSampler.

from rlk.agents.q_learning.linear_q_agent import LinearQAgent
from rlk.environments.mountain_car.mountain_car_config import MountainCarConfig

agent = LinearQAgent(**MountainCarConfig('linear_q').build())
agent.train(verbose=True, render=True, max_episode_steps=1500)

CartPole

Episode play example Convergence

Run example

from rlk.agents.q_learning.linear_q_agent import LinearQAgent
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig 

agent = LinearQAgent(**CartPoleConfig('linear_q').build())
agent.train(verbose=True, render=True)

REINFORCE (policy gradient)

CartPole

![Episode play example](images/REINFORCEAgent.gif) Convergence

Model:
State -> model -> [probability of action 1, probability of action 2]
Refs:
https://github.com/Alexander-H-Liu/Policy-Gradient-and-Actor-Critic-Keras

Policy gradient models move the action selection policy into the model, rather than using argmax(action values). Model outputs are action probabilities rather than values (π(a|s), where π is the policy), making these methods inherently stochastic and removing the need for epsilon greedy action selection.
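Selecting an action from the policy's output then amounts to sampling from the predicted probabilities. A minimal sketch, assuming the model returns one probability per action (`sample_action` is a hypothetical helper, not part of the repo):

```python
import random


def sample_action(action_probs, rng=random):
    """Draw an action index in proportion to the predicted probabilities."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]


rng = random.Random(42)
certain_action = sample_action([0.0, 1.0], rng=rng)
sampled_actions = [sample_action([0.25, 0.75], rng=rng) for _ in range(1000)]
```

Exploration comes for free here: while the policy is still uncertain its probabilities are spread out, so less-favoured actions are still tried without any epsilon schedule.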

This agent uses a small neural network to predict action probabilities given a state. Updates are done in a Monte-Carlo fashion, i.e. using all steps from a single episode. This removes the need for a complex replay buffer (list.append() does the job). However, as the method is on-policy, it requires data from the current policy for training, so training data can't be collected across episodes (assuming the policy is updated at the end of each). As a result, the training data in each batch (episode) is highly correlated, which slows convergence.
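The Monte-Carlo update relies on the discounted return from each step to the end of the episode. A minimal sketch of that calculation, assuming rewards are gathered in a plain list during the episode (the `discounted_returns` name and `gamma` default are illustrative):

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Work backwards so each step reuses the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


episode_rewards = []
for _ in range(3):
    episode_rewards.append(1.0)  # list.append() is the whole "buffer"

returns = discounted_returns(episode_rewards, gamma=0.5)
```

These returns are what weight the log-probability of each action taken in a REINFORCE-style update; many implementations also standardise them within the episode to reduce variance, though that step is not shown and may or may not be used in this repo.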

This model doesn't use any scaling or clipping for environment pre-processing. For some reason, using the same pre-processing as with the DQN models prevents it from converging. The cart-pole environment can potentially return really huge values when sampling from the observation space, but these are rarely seen during training. It seems to be fine to pretend they don't exist, rather than scaling inputs based on environment samples, as is done in the other methods.

from rlk.agents.policy_gradient.reinforce_agent import ReinforceAgent
from tf2_vgpu import VirtualGPU
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig

VirtualGPU(256)
agent = ReinforceAgent(**CartPoleConfig('reinforce').build())
agent.train(verbose=True, render=True)

Doom

Set up

Install these two packages:

Additionally, to save monitor wrapper output, install the following packages:

sudo apt install libcanberra-gtk-module libcanberra-gtk3-module

VizdoomBasic-v0

DQN

Episode play example Convergence

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.doom.vizdoom_basic_config import VizDoomBasicConfig

VirtualGPU(256)
agent = DeepQAgent(**VizDoomBasicConfig(agent_type='dqn', mode='stack').build())
agent.train(n_episodes=1000, max_episode_steps=10000, verbose=True, render=True)

VizDoomCorridor-v0

Double dueling DQN

Episode play example Convergence

The DQNs struggle to solve this environment on their own. See the scripts and readme in scripts/doom/ for an example of training with additional experience collected by (scripted) bots.

GFootball

Work in progress. Involves pre-training the agent on historical data, and sampling experience from (policy) bots.

See notes in scripts/gfootball/readme.md
