Deep Deterministic Policy Gradient (DDPG) algorithm on OpenAI's LunarLander

Summary

    The goal of this application is to implement the DDPG algorithm [paper] on the OpenAI LunarLanderContinuous environment.

LunarLander Gif001 LunarLander Gif300 LunarLander Gif700

DDPG: Episode 1 vs Episode 300 vs Episode 700

Environment

    LunarLanderContinuous is an OpenAI Box2D environment that corresponds to rocket trajectory optimization, a classic problem in optimal control. The environment contains the lander and a randomly generated terrain with a landing pad. The lander has three engines: left, right and bottom. The goal is to use these engines to land somewhere on the landing pad while using as little fuel as possible. The landing pad is always at coordinates (0,0), and those coordinates are the first two numbers in the state vector. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt.

LunarLander Environment

LunarLander Environment [Image source]

    The state consists of the horizontal coordinate, the vertical coordinate, the horizontal speed, the vertical speed, the angle, the angular speed, a flag that is 1 if the first leg has ground contact (else 0), and a flag that is 1 if the second leg has ground contact (else 0).

    The reward for moving from the top of the screen to the landing pad and coming to rest is about 100..140 points. If the lander moves away from the landing pad it loses reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg with ground contact is worth +10 points. Firing the main engine costs -0.3 points per frame, and firing a side engine costs -0.03 points per frame. The environment is considered solved at 200 points.

    The action is a vector of two real values, each in [-1, +1]. The first value controls the main engine: -1..0 is off, and 0..+1 maps to throttle from 50% to 100% power (the engine cannot run below 50% power). The second value controls the side engines: -1.0..-0.5 fires the left engine, +0.5..+1.0 fires the right engine, and -0.5..0.5 is off.

    The episode ends when the lander lands on the terrain or crashes. The goal is reached when the algorithm achieves a mean score of 200 or higher over the last 100 episodes (games).
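
    For reference, a minimal interaction loop with the environment looks like the sketch below. It assumes the gym package with the older 4-tuple step API and the LunarLanderContinuous-v2 environment id; the exact id and API may differ between gym versions.

import gym

env = gym.make("LunarLanderContinuous-v2")
state = env.reset()                      # 8-dimensional state vector described above
done = False
episode_reward = 0.0
while not done:
    action = env.action_space.sample()   # random action in [-1, +1]^2
    # state = [x, y, vx, vy, angle, angular_speed, left_leg_contact, right_leg_contact]
    state, reward, done, info = env.step(action)
    episode_reward += reward
print("Episode reward:", episode_reward)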

Deep Deterministic Policy Gradient

    Deep Deterministic Policy Gradient is a mixture of DQN and Actor-Critic algorithms. Since DQN picks actions by maximizing a Q-function over a discrete set of actions, it cannot be applied directly to environments with continuous actions. DDPG was introduced to address this, using a deterministic policy: instead of a stochastic policy that outputs a probability distribution from which we sample actions, we train a policy that outputs a single action for each state.
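
    As a rough illustration (assuming PyTorch; the layer sizes here are placeholders, not the ones used in this project), a deterministic actor simply maps a state to an action and squashes it into [-1, +1] with tanh:

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=8, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # bound actions to [-1, +1]
        )

    def forward(self, state):
        return self.net(state)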

    To ensure that the deterministic policy, which outputs an action given a state, is sufficiently exploratory, we add random noise to the action produced by the policy. Although the DDPG paper uses an Ornstein-Uhlenbeck process to generate noise, later papers showed that uncorrelated Gaussian noise is just as effective, so Gaussian noise is used here.
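
    A minimal sketch of this exploration scheme (sigma is an illustrative noise scale, not a value taken from this repository):

import numpy as np
import torch

def select_action(actor, state, sigma=0.1):
    # deterministic action from the policy, plus Gaussian exploration noise
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    noise = np.random.normal(0.0, sigma, size=action.shape)
    return np.clip(action + noise, -1.0, 1.0)   # keep the action in the valid range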

    There are a total of 4 neural networks in DDPG. DQN uses two NNs: a moving network (updated every step) and a target network (which slowly follows the moving network). Actor-Critic algorithms also use two NNs: an Actor (a policy network that outputs an action for a given state) and a Critic (a value-function network that estimates the value of a given state). DDPG therefore uses 4 NNs, a moving and a target network for both the Actor and the Critic, where both target networks follow their moving networks via Polyak averaging.

DDPG algorithm
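
    The target networks track the moving networks with Polyak averaging; a sketch of that update, assuming PyTorch and an averaging coefficient tau:

import torch

@torch.no_grad()
def polyak_update(moving_net, target_net, tau=0.005):
    # target <- tau * moving + (1 - tau) * target, applied parameter-wise
    for p_moving, p_target in zip(moving_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_moving)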

Improving DDPG

  • To improve exploration, random actions are taken for the first 10,000 steps [Source]
  • Treating the Adam optimizer's epsilon as a hyperparameter instead of keeping the default value of 1e-8
  • Decaying the learning rate to zero based on the number of steps taken
  • Decaying the noise standard deviation (and therefore the random exploration noise) to zero based on the number of steps taken [Source] (see the sketch below)
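
    The last two points can be implemented with a simple linear schedule; the sketch below uses placeholder values for the initial learning rate, the initial noise scale and the total step budget, not the exact hyperparameters of this project.

def linear_decay(initial_value, step, total_steps):
    # linearly anneal a quantity from its initial value down to zero
    fraction = min(step / total_steps, 1.0)
    return initial_value * (1.0 - fraction)

# illustrative values: both quantities reach zero after one million steps
for step in (0, 500_000, 1_000_000):
    lr = linear_decay(1e-3, step, total_steps=1_000_000)
    sigma = linear_decay(0.1, step, total_steps=1_000_000)
    print(step, lr, sigma)

    During training, the new learning rate would be written back into the optimizer each step (for a PyTorch optimizer, via its param_groups).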

Testing

    To get accurate results, the algorithm has an additional class (a test process) whose job is to periodically run 100 test episodes and compute the mean reward over those episodes. By the rules, if the test process gets a mean score of 200 or higher over the last 100 games, the goal is reached and training terminates; otherwise, the training process continues. Testing is triggered every 50,000 steps, or whenever the mean of the last 10 training returns is 200 or more.
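
    A sketch of such an evaluation pass (greedy actions with no exploration noise; PyTorch and the older gym step API are assumptions, and the goal check mirrors the 200-point rule above):

import numpy as np
import torch

def evaluate(actor, env, episodes=100):
    # run the deterministic policy without exploration noise and average the returns
    returns = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            with torch.no_grad():
                action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
            state, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# goal reached when evaluate(actor, env) >= 200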

Results

    One of the results can be seen in the graph below, where the X axis is the episode number and the Y axis shows the episode reward, the mean training return and the mean test return (return = mean episode reward over the last 100 episodes). Keep in mind that the goal is only reached once the mean test return hits 200.

Results graph

  • #33bbee Episode reward
  • #359a3c Mean training return
  • #ee3377 Mean test return
  • Across multiple runs the mean test return exceeds 200, so we can conclude that the goal is reached!

    Additional statistics

  • The fastest run reached the goal after 71,435 environment steps (194 episodes).
  • The highest reward achieved in a single episode is 315.3.

Rest of the data and TensorBoard

    If you wish to use the trained models, saved NN models are available in /models. Modify the PATH parameters in load.py and run the script to see the results of training.

    If you don't want to bother with running the script, you can head over to YouTube or watch the best recordings in /recordings.

    The rest of the training data can be found in /content/runs. If you wish to inspect it and compare it with the rest, I recommend using TensorBoard. After installation, point TensorBoard at the directory where the data is stored with the following command

tensorboard --logdir="full\path\to\data" --host=127.0.0.1

and open http://localhost:6006 in your browser. For installation information and further questions, visit the TensorBoard GitHub repository.