Nonparametric Off-Policy Policy Gradient


Tosatto, S.; Carvalho, J.; Abdulsamad, H.; Peters, J. (2020). A Nonparametric Off-Policy Policy Gradient, Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS). https://arxiv.org/abs/2001.02435

Nonparametric Off-Policy Policy Gradient (NOPG) is a Reinforcement Learning algorithm that learns from off-policy datasets. The policy gradient estimate is computed in closed form by modelling the transition probabilities with Kernel Density Estimation (KDE) and the reward function with kernel regression.
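
As a rough illustration of the nonparametric models involved (a hedged sketch, not the repository's implementation; the Gaussian kernel, fixed bandwidth, and toy data shapes below are assumptions), the reward model amounts to a Nadaraya-Watson kernel regression over the observed state-action pairs:

import numpy as np

def gaussian_kernel(x, x_data, bandwidth=0.5):
    # Un-normalized Gaussian kernel weights of a query point against all data points.
    sq_dist = np.sum((x_data - x) ** 2, axis=1)
    return np.exp(-0.5 * sq_dist / bandwidth ** 2)

def kernel_regression_reward(query_sa, sa_data, r_data, bandwidth=0.5):
    # Nadaraya-Watson estimate of the reward at a query (state, action) pair.
    w = gaussian_kernel(query_sa, sa_data, bandwidth)
    return np.sum(w * r_data) / (np.sum(w) + 1e-12)

# Toy off-policy dataset: 100 transitions with a 3-d state and 1-d action.
rng = np.random.default_rng(0)
sa_data = rng.normal(size=(100, 4))   # concatenated (state, action) samples
r_data = rng.normal(size=100)         # observed rewards
print(kernel_regression_reward(np.zeros(4), sa_data, r_data))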

The current version of NOPG supports stochastic and deterministic policies, and works for continuous state and action spaces. An extension to discrete spaces will be made available in the near future.
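
For intuition, a deterministic policy over a continuous action space can be a small feed-forward network whose parameters the gradient is taken with respect to. The snippet below is only an assumed sketch in PyTorch; the class and layer sizes are illustrative and not the policy implementation shipped in this repository.

import torch
import torch.nn as nn

class DeterministicPolicy(nn.Module):
    # Illustrative policy network; the architecture is an assumption.
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        # Squash the output to the action range; PyTorch autograd provides
        # the gradient with respect to the policy parameters.
        return self.max_action * self.net(state)

policy = DeterministicPolicy(state_dim=3, action_dim=1, max_action=2.0)
action = policy(torch.zeros(1, 3))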

It supports environments with OpenAI Gym-like interfaces.
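
For example, an off-policy dataset with uniformly sampled actions can be collected from any Gym-style environment roughly as follows; the environment name, rollout length, and dictionary layout are assumptions for illustration, not the exact format the example scripts expect.

import gym
import numpy as np

env = gym.make("Pendulum-v0")
dataset = {"states": [], "actions": [], "rewards": [], "next_states": []}

state = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # uniform behavioural policy
    next_state, reward, done, _ = env.step(action)
    dataset["states"].append(state)
    dataset["actions"].append(action)
    dataset["rewards"].append(reward)
    dataset["next_states"].append(next_state)
    state = env.reset() if done else next_state

dataset = {k: np.asarray(v) for k, v in dataset.items()}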

Link to CartPole video: https://www.youtube.com/watch?v=LKtnzc4TV98

Install

The code was tested with Python 3.7.6 on a machine running Ubuntu 18.04 and uses PyTorch for automatic gradient computation. We recommend using a GPU and a large amount of RAM to speed up training.

We assume you have miniconda3 installed in /home/$USER/miniconda3.

Install all dependencies with

bash setup.sh

Run

The easiest way to create an experiment is to follow the template in examples/template.py or to look directly at the scripts in the examples directory.

Example

Swing-up Pendulum with a Uniformly Sampled Dataset and a Deterministic Policy

Activate the virtual environment, then run the code with

python examples/pendulum_nopg_d_uniform.py

You should obtain an undiscounted return of roughly -500.
