Donguk Ju's take-home project for the ml2 internship
In the paper, the authors propose an actor-critic, off-policy, model-free algorithm based on DPG (Deterministic Policy Gradient) for continuous action spaces, inspired by Deep Q-Learning (DQN).
- While DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces. Many tasks of interest, most notably physical control tasks, have continuous (real valued) and high dimensional action spaces.
- An obvious approach to adapting deep reinforcement learning methods such as DQN to continuous domains is to simply discretize the action space. However, this has many limitations, most notably the curse of dimensionality: the number of actions increases exponentially with the number of degrees of freedom.
- In this work the authors present a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces.
- A naive application of this actor-critic method with neural function approximators is unstable for challenging problems. Prior to DQN, it was generally believed that learning value functions using large, non-linear function approximators was difficult and unstable.
- Two innovations in this method:
- The network is trained off-policy with samples from a replay buffer to minimize correlations between samples.
- The network is trained with a target Q network to give consistent targets during temporal difference backups.
- The proposed model-free approach, called Deep DPG (DDPG), can learn competitive policies for all of the tasks using low-dimensional observations (e.g. Cartesian coordinates or joint angles) with the same hyper-parameters and network structure.
- Basic Components for RL
- Environment E
- timestep t
- observation x_t
- action a_t
- reward r_t
- assumption that the environment is fully observable s_t = x_t
- policy pi : S -> P(A)
- state space S, action space A
- initial state distribution p(s1)
- transition dynamics p(s_(t+1)|s_t,a_t)
- return (discounted future reward): R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i)
- discount factor gamma in [0,1]
- The goal is to maximize the expected return from the start distribution, J = E[R_1]. The discounted state visitation distribution for a policy pi is denoted rho^pi.
- Action Value Function (Q-function): Q^pi(s_t, a_t) = E[R_t | s_t, a_t]
- Recursive (Bellman) representation of the Q function: Q^pi(s_t, a_t) = E[r(s_t, a_t) + gamma * E_{a ~ pi}[Q^pi(s_(t+1), a)]]
- If the policy is deterministic (a = mu(s)), the inner expectation is removed: Q^mu(s_t, a_t) = E[r(s_t, a_t) + gamma * Q^mu(s_(t+1), mu(s_(t+1)))]
- The expectation then depends only on the environment. This means that it is possible to learn Q off-policy, using transitions which are generated from a different stochastic behavior policy.
- Loss for Q: L = E[(Q(s_t, a_t) - y_t)^2], where y_t = r(s_t, a_t) + gamma * Q(s_(t+1), mu(s_(t+1)))
- Replay buffer
- Target network for y_t
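A minimal PyTorch sketch of this critic update, combining the Q loss, a minibatch from the replay buffer, and target networks for the TD target. The names (`critic`, `critic_target`, `actor_target`, `critic_optim`) and the done-mask handling are my own placeholders, not code from the paper:

```python
import torch
import torch.nn.functional as F

def critic_update(batch, critic, critic_target, actor_target, critic_optim, gamma=0.99):
    """One Q-learning step on a minibatch sampled from the replay buffer."""
    # r and done are assumed to have shape (batch_size, 1)
    s, a, r, s_next, done = batch

    # Consistent TD target from the (slowly updated) target networks:
    # y_t = r(s_t, a_t) + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_target(s_next, actor_target(s_next))

    # L = E[(Q(s_t, a_t) - y_t)^2]
    loss = F.mse_loss(critic(s, a), y)

    critic_optim.zero_grad()
    loss.backward()
    critic_optim.step()
    return loss.item()
```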
DDPG is an actor-critic algorithm based on DPG.
The DPG algorithm maintains a parameterized actor function mu(s) which specifies the current policy by deterministically mapping states to a specific action. The critic Q(s,a) is learned using the Bellman equation as in Q-learning. The actor is updated by applying the chain rule to the expected return from the start distribution J with respect to the actor parameters.
The authors' contribution here is to provide modifications to DPG, inspired by the success of DQN, which allow it to use neural network function approximators to learn in large state and action spaces online.
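Below is a minimal sketch of the corresponding actor update: the chain rule through Q(s, mu(s)) is realized by minimizing -Q(s, mu(s)) with autograd. Variable names (`actor`, `critic`, `actor_optim`) are placeholders:

```python
def actor_update(states, actor, critic, actor_optim):
    """Deterministic policy gradient step:
    grad_theta J ~ E[ grad_a Q(s, a)|_{a=mu(s)} * grad_theta mu(s) ],
    obtained by backpropagating through Q(s, mu(s))."""
    actor_loss = -critic(states, actor(states)).mean()  # maximize Q by minimizing -Q

    actor_optim.zero_grad()
    actor_loss.backward()
    actor_optim.step()
    return actor_loss.item()
```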
- Most optimization algorithms assume that samples are independently and identically distributed, but this assumption does not hold in reinforcement learning, where samples are generated sequentially by interacting with an environment.
- Additionally, with mini-batches, we can make efficient use of hardware optimizations.
- To allow the algorithm to benefit from learning across a set of uncorrelated transitions, DDPG uses a replay buffer (a minimal sketch is given after this list).
- The replay buffer stores transition tuples (s_t, a_t, r_t, s_(t+1)).
- The buffer size can be large (~1e6).
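A minimal replay buffer sketch (a fixed-capacity deque with uniform random sampling); this reflects a typical implementation rather than the authors' exact code:

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size buffer of (s_t, a_t, r_t, s_{t+1}, done) transitions."""

    def __init__(self, capacity=int(1e6)):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling breaks the temporal correlation between transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```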
- Directly implementing Q-learning with neural networks proved to be unstable in many environments; the learned Q function is prone to divergence.
- The authors' solution is a soft target update (sketched in code below).
- They create a copy of the actor and critic networks that are used for calculating the target values. The weights of these target networks are then updated by having them slowly track the learned networks:
p' = tau*p + (1-tau)*p' with tau << 1
- This means that the target values are constrained to change slowly, greatly improving the stability of learning.
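A sketch of the soft target update in PyTorch (the helper name and the in-place parameter arithmetic are my own; the formula is the one above):

```python
import torch

def soft_update(net, target_net, tau=1e-3):
    """Slowly track the learned network: p' <- tau * p + (1 - tau) * p'."""
    with torch.no_grad():
        for p, p_target in zip(net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```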
- When learning from low dimensional feature vector observations, the different components of the observation may have different physical units (for example, positions versus velocities) and the ranges may vary across environments.
- The solution is Batch normalization.
- It maintains a running average of the mean and variance to use for normalization during testing.
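A small illustration of that behavior with `torch.nn.BatchNorm1d`: during training the layer normalizes with minibatch statistics and updates its running mean/variance, and in eval mode it uses the stored running statistics. The dimensions below are placeholder values:

```python
import torch
import torch.nn as nn

state_dim = 17  # placeholder: e.g. joint angles and velocities
normalize_obs = nn.BatchNorm1d(state_dim)  # per-feature normalization with running stats

# Training: statistics come from the minibatch and update the running averages.
normalize_obs.train()
_ = normalize_obs(torch.randn(64, state_dim))

# Testing: the stored running mean/variance are used instead of batch statistics.
normalize_obs.eval()
_ = normalize_obs(torch.randn(1, state_dim))
```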
- A major challenge of learning in continuous action spaces is exploration.
- An advantage of off-policy algorithms such as DDPG is that the problem of exploration can be treated independently from the learning algorithm.
- The authors used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) to generate temporally correlated exploration for exploration efficiency in physical control problems with inertia.
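A minimal sketch of the Ornstein-Uhlenbeck noise; the exploration policy adds a sample of this process to the actor's output, mu(s_t) + N_t. Theta and sigma follow the hyper-parameters listed below; the discretization (the dt term) is a common implementation choice, not something the paper specifies:

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(action_dim)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        # Reset the process state at the start of each episode
        self.x = np.copy(self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.x))
        self.x = self.x + dx
        return self.x
```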
- In order from the left: the cartpole swing-up task, a reaching task, a grasp and move task, a puck-hitting task, a monoped balancing task, two locomotion tasks, and TORCS (a driving simulator).
- In all tasks, they ran experiments using both a low-dimensional state description (such as joint angles and positions) and high-dimensional renderings of the environment.
- They used action repeats, as in DQN, for the high-dimensional renderings in order to make the problems approximately fully observable (a wrapper sketch is given below).
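A hedged sketch of such an action-repeat wrapper, assuming the classic Gym step API that returns (obs, reward, done, info); the repeat count of 3 follows the paper's description, but the wrapper itself is my own illustration:

```python
import gym

class ActionRepeat(gym.Wrapper):
    """Repeat each agent action for `k` environment steps and sum the rewards."""

    def __init__(self, env, k=3):
        super().__init__(env)
        self.k = k

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.k):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```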
- The paper compares the original DPG algorithm (minibatch NFQCA) with batch normalization (light grey), with target networks (dark grey), with target networks and batch normalization (green), and with target networks from pixel-only inputs (blue).
- In particular, learning without a target network, as in the original work with DPG, is very poor in many environments.
- Surprisingly, in some simpler tasks, learning policies from pixels is just as fast as learning using the low-dimensional state descriptor.
- The performance is normalized using two baselines.
- The first baseline is the mean return from a naive policy which samples actions from a uniform distribution over the valid action space.
- The second baseline is iLQG (Todorov & Li, 2005), a planning based solver with full access to the underlying physical model and its derivatives.
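A small sketch of the normalization, under the assumption (as in the paper) that the naive policy maps to a score of 0 and iLQG to a score of 1:

```python
def normalized_score(mean_return, random_return, ilqg_return):
    """Rescale a raw return so the uniform-random policy scores 0 and iLQG scores 1."""
    return (mean_return - random_return) / (ilqg_return - random_return)
```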
- It can be challenging to learn accurate value estimates. Q-learning, for example, is prone to overestimating values (Hasselt, 2010). This work was later extended to TD3, which explicitly addresses the overestimation issue.
- In simple tasks, DDPG estimates returns accurately without systematic biases. For harder tasks the Q estimates are worse, but DDPG is still able to learn good policies.
- Contribution: The work combines insights from recent advances in deep learning and reinforcement learning, resulting in an algorithm that robustly solves challenging problems across a variety of domains with continuous action spaces.
- Limitation: As with most model-free reinforcement learning approaches, DDPG requires a large number of training episodes to find solutions.
- Adam optimizer with learning rates of 10^-4 and 10^-3 for the actor and critic, respectively.
- discount factor gamma = 0.99
- soft target updates tau = 0.001
- final output layer of the actor was a tanh layer
- 2 hidden layers with 400 and 300 units
- Actions were included at the 2nd hidden layer of Q.
- The final layers of the actor and critic networks were initialized from a uniform distribution [-3x10^-3, 3x10^-3].
- minibatch size 64
- replay buffer size 10^6
- Ornstein-Uhlenbeck process noise with theta=0.15 and sigma=0.2
- L_2 weight decay of 10^-2 for Q
- Batch normalization
For a stable training process, I didn't include L_2 regularization or batch normalization in this implementation.
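A sketch of actor and critic networks following the hyper-parameters above (two hidden layers of 400 and 300 units, tanh output for the actor, actions entering the critic at the second hidden layer, final layers initialized from U[-3x10^-3, 3x10^-3]); batch normalization and weight decay are omitted, matching this implementation. Class and variable names are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_final_layer(layer, bound=3e-3):
    # Final layers are initialized from a uniform distribution [-3e-3, 3e-3]
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, action_dim)
        init_final_layer(self.fc3)
        self.max_action = max_action

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # tanh bounds the action; scale to the environment's action range
        return self.max_action * torch.tanh(self.fc3(x))

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)  # actions enter at the 2nd hidden layer
        self.fc3 = nn.Linear(300, 1)
        init_final_layer(self.fc3)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.fc3(x)
```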
The agent is trained on OpenAI Gym MuJoCo control tasks.
- Non-python
- openCV
- mujoco
- Python
- gym
- cv2
- torch
The script for DDPG is in the ddpg directory. By running the commands below, you can train an agent for a specific task. The script will create a directory for the records of the experiments (e.g. rewards, videos, graphs).
python ddpg.py --task=InvertedPendulum-v2
python ddpg.py --task=HalfCheetah-v2