Donguk Ju's take-home project for the ml2 internship
In the paper, the authors propose an actor-critic, off-policy, model-free algorithm based on DPG (Deterministic Policy Gradient) for continuous action spaces, inspired by Deep Q-Learning (DQN).
- While DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces. Many tasks of interest, most notably physical control tasks, have continuous (real valued) and high dimensional action spaces.
- An obvious approach to adapting deep reinforcement learning methods such as DQN to continuous domains is to simply discretize the action space. However, this has many limitations, most notably the curse of dimensionality: the number of actions increases exponentially with the number of degrees of freedom.
- In this work the authors present a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces.
- A naive application of this actor-critic method with neural function approximators is unstable for challenging problems. Prior to DQN, it was generally believed that learning value functions using large, non-linear function approximators was difficult and unstable.
- Two innovations in this method:
- The network is trained off-policy with samples from a replay buffer to minimize correlations between samples.
- The network is trained with a target Q network to give consistent targets during temporal difference backups.
- The proposed model-free approach, called Deep DPG (DDPG), can learn competitive policies for all of the tasks using low-dimensional observations (e.g. Cartesian coordinates or joint angles) with the same hyper-parameters and network structure.
- Basic Components for RL
- Environment E
- timestep t
- observation x_t
- action a_t
- reward r_t
- assumption that the environment is fully observable s_t = x_t
- policy pi : S -> P(A)
- state space S, action space A
- initial state distribution p(s1)
- transition dynamics p(s_(t+1)|s_t,a_t)
- return (discounted future reward): R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i)
- discount factor gamma in [0,1]
- The goal is to maximize the expected return from the start distribution, J = E[R_1]. The discounted state visitation distribution for a policy pi is denoted rho^pi.
- Action Value Function (Q-function): Q^pi(s_t, a_t) = E[R_t | s_t, a_t]
- Recursive (Bellman) representation of the Q function: Q^pi(s_t, a_t) = E[r(s_t, a_t) + gamma * E_{a ~ pi}[Q^pi(s_(t+1), a)]]
- If the policy is deterministic (a = mu(s)), the inner expectation is removed: Q^mu(s_t, a_t) = E[r(s_t, a_t) + gamma * Q^mu(s_(t+1), mu(s_(t+1)))]
- The expectation then depends only on the environment. This means that it is possible to learn Q off-policy, using transitions which are generated from a different stochastic behavior policy.
- Loss for Q: L = E[(Q(s_t, a_t) - y_t)^2], where y_t = r(s_t, a_t) + gamma * Q(s_(t+1), mu(s_(t+1)))
- Replay buffer
- Target network for y_t
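A minimal PyTorch sketch of this critic update, combining the Q loss, a minibatch from the replay buffer, and target networks for the TD target. The names (`critic`, `critic_target`, `actor_target`, `critic_optim`) and the done-mask handling are my own placeholders, not code from the paper:

```python
import torch
import torch.nn.functional as F

def critic_update(batch, critic, critic_target, actor_target, critic_optim, gamma=0.99):
    """One Q-learning step on a minibatch sampled from the replay buffer."""
    # r and done are assumed to have shape (batch_size, 1)
    s, a, r, s_next, done = batch

    # Consistent TD target from the (slowly updated) target networks:
    # y_t = r(s_t, a_t) + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_target(s_next, actor_target(s_next))

    # L = E[(Q(s_t, a_t) - y_t)^2]
    loss = F.mse_loss(critic(s, a), y)

    critic_optim.zero_grad()
    loss.backward()
    critic_optim.step()
    return loss.item()
```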
DDPG is an actor-critic algorithm based on DPG.
The DPG algorithm maintains a parameterized actor function mu(s) which specifies the current policy by deterministically mapping states to a specific action. The critic Q(s,a) is learned using the Bellman equation as in Q-learning. The actor is updated by applying the chain rule to the expected return from the start distribution J with respect to the actor parameters.
The authors' contribution here is to provide modifications to DPG, inspired by the success of DQN, which allow it to use neural network function approximators to learn in large state and action spaces online.
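Below is a minimal sketch of the corresponding actor update: the chain rule through Q(s, mu(s)) is realized by minimizing -Q(s, mu(s)) with autograd. Variable names (`actor`, `critic`, `actor_optim`) are placeholders:

```python
def actor_update(states, actor, critic, actor_optim):
    """Deterministic policy gradient step:
    grad_theta J ~ E[ grad_a Q(s, a)|_{a=mu(s)} * grad_theta mu(s) ],
    obtained by backpropagating through Q(s, mu(s))."""
    actor_loss = -critic(states, actor(states)).mean()  # maximize Q by minimizing -Q

    actor_optim.zero_grad()
    actor_loss.backward()
    actor_optim.step()
    return actor_loss.item()
```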
- Most optimization algorithms assume that samples are independently and identically distributed, but this assumption does not hold in reinforcement learning, where samples are generated sequentially by interacting with an environment.
- Additionally, with mini-batches, we can make efficient use of hardware optimizations.
- To allow the algorithm to benefit from learning across a set of uncorrelated transitions, DDPG uses a replay buffer (a minimal sketch is given after this list).
- The replay buffer stores transition tuples (s_t, a_t, r_t, s_(t+1)).
- The buffer size can be large (~1e6).
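A minimal replay buffer sketch (a fixed-capacity deque with uniform random sampling); this reflects a typical implementation rather than the authors' exact code:

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size buffer of (s_t, a_t, r_t, s_{t+1}, done) transitions."""

    def __init__(self, capacity=int(1e6)):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling breaks the temporal correlation between transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```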
- Directly implementing Q-learning with neural networks proved to be unstable in many environments; the learned Q function is prone to divergence.
- The authors' solution is a soft target update (sketched in code below).
- They create a copy of the actor and critic networks that are used for calculating the target values. The weights of these target networks are then updated by having them slowly track the learned networks:
p' = tau*p + (1-tau)*p' with tau << 1
- This means that the target values are constrained to change slowly, greatly improving the stability of learning.
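A sketch of the soft target update in PyTorch (the helper name and the in-place parameter arithmetic are my own; the formula is the one above):

```python
import torch

def soft_update(net, target_net, tau=1e-3):
    """Slowly track the learned network: p' <- tau * p + (1 - tau) * p'."""
    with torch.no_grad():
        for p, p_target in zip(net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```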
- When learning from low dimensional feature vector observations, the different components of the observation may have different physical units (for example, positions versus velocities) and the ranges may vary across environments.
- The solution is Batch normalization.
- It maintains a running average of the mean and variance to use for normalization during testing.
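A small illustration of that behavior with `torch.nn.BatchNorm1d`: during training the layer normalizes with minibatch statistics and updates its running mean/variance, and in eval mode it uses the stored running statistics. The dimensions below are placeholder values:

```python
import torch
import torch.nn as nn

state_dim = 17  # placeholder: e.g. joint angles and velocities
normalize_obs = nn.BatchNorm1d(state_dim)  # per-feature normalization with running stats

# Training: statistics come from the minibatch and update the running averages.
normalize_obs.train()
_ = normalize_obs(torch.randn(64, state_dim))

# Testing: the stored running mean/variance are used instead of batch statistics.
normalize_obs.eval()
_ = normalize_obs(torch.randn(1, state_dim))
```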
- A major challenge of learning in continuous action spaces is exploration.
- An advantage of off-policy algorithms such as DDPG is that the problem of exploration can be treated independently from the learning algorithm.
- The authors used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) to generate temporally correlated exploration for exploration efficiency in physical control problems with inertia.
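A minimal sketch of the Ornstein-Uhlenbeck noise; the exploration policy adds a sample of this process to the actor's output, mu(s_t) + N_t. Theta and sigma follow the hyper-parameters listed below; the discretization (the dt term) is a common implementation choice, not something the paper specifies:

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(action_dim)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        # Reset the process state at the start of each episode
        self.x = np.copy(self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.x))
        self.x = self.x + dx
        return self.x
```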
- In order from the left: the cartpole swing-up task, a reaching task, a grasp and move task, a puck-hitting task, a monoped balancing task, two locomotion tasks, and TORCS (a driving simulator).
- In all tasks, they ran experiments using both a low-dimensional state description (such as joint angles and positions) and high-dimensional renderings of the environment.
- They used action repeats, as in DQN, for the high-dimensional renderings in order to make the problems approximately fully observable (a wrapper sketch is given below).
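A hedged sketch of such an action-repeat wrapper, assuming the classic Gym step API that returns (obs, reward, done, info); the repeat count of 3 follows the paper's description, but the wrapper itself is my own illustration:

```python
import gym

class ActionRepeat(gym.Wrapper):
    """Repeat each agent action for `k` environment steps and sum the rewards."""

    def __init__(self, env, k=3):
        super().__init__(env)
        self.k = k

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.k):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```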
- The paper compares the original DPG algorithm (minibatch NFQCA) with batch normalization (light grey), with target networks (dark grey), with target networks and batch normalization (green), and with target networks from pixel-only inputs (blue).
- In particular, learning without a target network, as in the original work with DPG, is very poor in many environments.
- Surprisingly, in some simpler tasks, learning policies from pixels is just as fast as learning using the low-dimensional state descriptor.
- The performance is normalized using two baselines.
- The first baseline is the mean return from a naive policy which samples actions from a uniform distribution over the valid action space.
- The second baseline is iLQG (Todorov & Li, 2005), a planning based solver with full access to the underlying physical model and its derivatives.
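A small sketch of the normalization, under the assumption (as in the paper) that the naive policy maps to a score of 0 and iLQG to a score of 1:

```python
def normalized_score(mean_return, random_return, ilqg_return):
    """Rescale a raw return so the uniform-random policy scores 0 and iLQG scores 1."""
    return (mean_return - random_return) / (ilqg_return - random_return)
```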
- It can be challenging to learn accurate value estimates. Q-learning, for example, is prone to overestimating values (Hasselt, 2010). This work was later extended to TD3, which explicitly addresses the overestimation issue.
- In simple tasks, DDPG estimates returns accurately without systematic biases. For harder tasks the Q estimates are worse, but DDPG is still able to learn good policies.
- Contribution: The work combines insights from recent advances in deep learning and reinforcement learning, resulting in an algorithm that robustly solves challenging problems across a variety of domains with continuous action spaces.
- Limitation: As with most model-free reinforcement learning approaches, DDPG requires a large number of training episodes to find solutions.
- Adam optimizer with learning rates of 10^-4 and 10^-3 for the actor and critic, respectively.
- discount factor gamma = 0.99
- soft target updates tau = 0.001
- final output layer of the actor was a tanh layer
- 2 hidden layers with 400 and 300 units
- Actions were included at the 2nd hidden layer of Q.
- The final layers of the actor and critic networks were initialized from a uniform distribution [-3x10^-3, 3x10^-3].
- minibatch size 64
- replay buffer size 10^6
- Ornstein-Uhlenbeck process noise with theta=0.15 and sigma=0.2
- L_2 weight decay of 10^-2 for Q
- Batch normalization
For a stable training process, I didn't include L_2 regularization or batch normalization in this implementation.
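A sketch of actor and critic networks following the hyper-parameters above (two hidden layers of 400 and 300 units, tanh output for the actor, actions entering the critic at the second hidden layer, final layers initialized from U[-3x10^-3, 3x10^-3]); batch normalization and weight decay are omitted, matching this implementation. Class and variable names are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_final_layer(layer, bound=3e-3):
    # Final layers are initialized from a uniform distribution [-3e-3, 3e-3]
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, action_dim)
        init_final_layer(self.fc3)
        self.max_action = max_action

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # tanh bounds the action; scale to the environment's action range
        return self.max_action * torch.tanh(self.fc3(x))

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)  # actions enter at the 2nd hidden layer
        self.fc3 = nn.Linear(300, 1)
        init_final_layer(self.fc3)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.fc3(x)
```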
The agent is trained on OpenAI Gym MuJoCo control tasks.
- Non-python
- openCV
- mujoco
- Python
- gym
- cv2
- torch
The script for DDPG is in the ddpg directory. By running the commands below, you can train an agent for a specific task. The script will create a directory for the records of the experiments (e.g. rewards, videos, graphs).
python ddpg.py --task=InvertedPendulum-v2
python ddpg.py --task=HalfCheetah-v2