The purpose of this project was to become familiar with reinforcement learning libraries, algorithms, and hyperparameter tuning. To accomplish this I hand-coded the Double Deep Q-Learning and Proximal Policy Optimization algorithms for OpenAI Gym's Car Racing environment [1]. To avoid simply copying others' implementations, I gave myself a research question: can I obtain high scores at a significantly lower computational cost than others?
Below are some of the top implementations from OpenAI Gym's official leaderboard [2].
Model | Training Episodes | 100-Episode Average Score | Write-up |
---|---|---|---|
PPO | ≈10,000 (5×10⁶ environment steps) | 913 | writeup |
DDQN | 900 | 907 | writeup |
World Models | 10,000+ | 906 | writeup |
PPO | 2760 | 820 (*actual) | writeup |
Weight Agnostic Neural Networks | N/A | 893 | writeup |
PPO | ≈3000 (1.6×10⁶ environment steps) | 905 | writeup |
PPO | 5000 | 909 | writeup |
Poorer-performing implementations include PPO, DQN, and A3C models by Stanford students, which never surpassed a score of 600 [3], and a fully connected DQN that achieved a score of 350 [4].
All reinforcement learning algorithms center around a value estimate of a particular configuration; that is, how much discounted reward can I expect starting from a particular state?
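In symbols, this state value is the expected discounted return under the policy π:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right]$$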
Q-learning estimates the value of state-action pairs rather than states alone.
Figure 1: Visualization of Q-update (* denotes optimal actions).
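The standard tabular Q-learning update that the figure visualizes can be written as

$$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha\left[r_{t} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_{t}, a_{t})\right]$$

where α is the learning rate and the max term corresponds to the optimal next action.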
In Deep Q Networks, a neural network predicts the Q-values of state-action pairs rather than storing them in a lookup table.
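Since we hand-coded Double Deep Q-Learning, the learning target is the standard Double DQN target: the online network selects the next action and the target network evaluates it. A minimal PyTorch-style sketch (the network and tensor names are illustrative, not our exact code):

```python
import torch

def ddqn_targets(online_net, target_net, rewards, next_states, dones, gamma):
    """Double DQN target: the online network picks the greedy next action,
    the slower-moving target network evaluates that action."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # (B, 1)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # (B,)
        return rewards + gamma * next_q * (1.0 - dones.float())
```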
Policy Gradient methods instead sample actions according to a probability distribution output directly by the policy network.
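The underlying policy-gradient estimator (REINFORCE form) that these methods build on is

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[\nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, G_{t}\right]$$

so actions that led to high returns $G_t$ are made more probable.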
Proximal Policy Optimization is an actor-critic twist on the policy-gradient idea. The actor network outputs the parameters of the action distribution as discussed above. The critic network outputs its value estimate of each state, which is used in the update. Looking at the policy loss, we can see that we subtract the critic's value estimate from the experienced return.
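A minimal sketch of the clipped PPO policy loss in PyTorch, where `advantages` plays the role of the experienced return minus the critic's value estimate (variable names and the clip range are illustrative, not necessarily our exact settings):

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (negated for minimization)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```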
First, we greyscale the input and clip regions that have low correlation with performance or are overly complex. This increases training efficiency by focusing compute on the key areas. Second, we extract speed information by summing the pixels of the speed bar and normalizing. Extracting speed this way is more computationally efficient than passing stacked consecutive frames to the CNN, as others do. However, some information is lost, since the precision of the speed magnitude is limited by the image resolution. A sketch of this preprocessing follows the table below.
 | Raw Data | Processed |
---|---|---|
Image Simplification | (96,96,3) | (84,96,1) |
Speed Extraction | speed bar pixels | speed magnitude ∈ {0,1,2,3,4,5} |
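A rough sketch of this preprocessing, assuming NumPy and CarRacing's 96×96×3 frames; the crop rows and speed-bar location below are illustrative guesses, not the exact indices we used:

```python
import numpy as np

def preprocess(frame, n_levels=5):
    """Grey-scale the frame, drop the bottom HUD strip, and read a discrete
    speed value off the vertical speed bar. Indices are approximate."""
    grey = frame.mean(axis=2) / 255.0              # (96, 96), values in [0, 1]
    image = grey[:84, :, None]                     # crop HUD rows -> (84, 96, 1)
    bar = frame[85:, 12, :].mean(axis=1)           # assumed speed-bar column
    speed = int(round((bar > 100).sum() / len(bar) * n_levels))  # in {0,...,5}
    return image, speed
```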
We also limited the quantity of poor training data by terminating an episode, and returning a -100 reward, once a 0.1-second interval is spent on the grass.
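A sketch of this early-termination rule as a wrapper over the classic Gym step API; `on_grass` is a hypothetical helper that inspects the pixels around the car, and the step budget corresponding to 0.1 s depends on the frame rate:

```python
import gym

class GrassTermination(gym.Wrapper):
    """End the episode with a -100 reward once the car has spent a full
    0.1-second interval on the grass."""
    def __init__(self, env, grass_limit_steps, on_grass):
        super().__init__(env)
        self.grass_limit_steps = grass_limit_steps  # env steps per 0.1 s
        self.on_grass = on_grass                    # hypothetical pixel check
        self.grass_steps = 0

    def reset(self, **kwargs):
        self.grass_steps = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.grass_steps = self.grass_steps + 1 if self.on_grass(obs) else 0
        if self.grass_steps >= self.grass_limit_steps:
            return obs, -100.0, True, info
        return obs, reward, done, info
```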
For DQN we started with a reduced action space. Speed was set deterministically, so the agent only needed to control direction (see the mapping sketch after the table below).
 | Steering Angle | Throttle | Brake |
---|---|---|---|
Standard | ∈ [-1,1] | ∈ [0,1] | ∈ [0,1] |
Modified | ∈ {-0.3, 0, 0.3} | = 0.1 if speed < threshold, 0 if speed ≥ threshold | = 0 |
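The mapping from the three discrete DQN outputs to the environment's continuous [steer, gas, brake] action might look like the following sketch (the speed threshold is a tuned hyperparameter not shown here):

```python
STEERING = [-0.3, 0.0, 0.3]  # discrete steering choices

def to_env_action(action_index, speed, speed_threshold):
    """Convert a discrete steering index plus the deterministic speed rule
    into the continuous CarRacing action [steer, gas, brake]."""
    steer = STEERING[action_index]
    gas = 0.1 if speed < speed_threshold else 0.0
    return [steer, gas, 0.0]  # brake fixed at 0
```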
The DQN architecture was inspired by [5]. Our model is likely overparameterized, since we reduced the action space and no longer pass in stacked frames; however, we kept their structure for ease of comparison. A sketch of the network follows the table below.
Input = processed image + speed
Output = Q values of steering actions
Input Shape | Function |
---|---|
(96,84,1) | Conv2d + LeakyRelu + Batchnorm |
(20,23,8) | Max-Pooling |
(10,11,8) | Conv2d + LeakyRelu + Batchnorm |
(12,13,16) | Max-Pooling |
(6,6,16) | LeakyRelu + Flatten |
(576+1) | speed appended to tensor |
(256) | LeakyRelu |
(50) | LeakyRelu |
(3) | Identity |
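A PyTorch sketch matching the table above; the table lists only layer input shapes, so the kernel sizes, strides, and padding below were inferred to reproduce those shapes and may differ from our original code:

```python
import torch
import torch.nn as nn

class DQNNet(nn.Module):
    """Convolutional trunk plus fully connected head; outputs Q-values
    for the three steering actions."""
    def __init__(self, n_actions=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, stride=4), nn.LeakyReLU(), nn.BatchNorm2d(8),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=2), nn.LeakyReLU(), nn.BatchNorm2d(16),
            nn.MaxPool2d(2),
            nn.LeakyReLU(), nn.Flatten(),        # -> 6 * 6 * 16 = 576 features
        )
        self.head = nn.Sequential(
            nn.Linear(576 + 1, 256), nn.LeakyReLU(),
            nn.Linear(256, 50), nn.LeakyReLU(),
            nn.Linear(50, n_actions),            # identity output: one Q-value per action
        )

    def forward(self, image, speed):
        # image: (batch, 1, 84, 96) processed frame; speed: (batch,) scalar
        x = self.features(image)
        x = torch.cat([x, speed.unsqueeze(1)], dim=1)  # append speed -> (batch, 577)
        return self.head(x)
```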
For PPO we split the action space into two independently controlled tasks: one actor-critic network controls steering, and another identically structured network controls gas and braking. This is a novel technique we have not encountered elsewhere for the CarRacing environment. Training was done by first training the steering network, with speed controlled by the same deterministic rules used for DDQN. Then we fixed the steering network and trained the thrust network. Both networks used the beta distribution, which corresponds nicely to our one-dimensional bounded action spaces. In training, actions were sampled stochastically for exploration; in testing, actions were the mean of the distribution for stability.
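A minimal sketch of sampling a bounded 1-D action from the Beta distribution (PyTorch's `torch.distributions.Beta`; function and argument names are illustrative):

```python
import torch
from torch.distributions import Beta

def select_action(alpha, beta, low, high, training=True):
    """Sample on (0, 1) and rescale to the action bound [low, high].
    Training: stochastic sample for exploration; testing: distribution mean."""
    dist = Beta(alpha, beta)
    x = dist.sample() if training else dist.mean
    return low + (high - low) * x, dist.log_prob(x)
```

For steering, for example, `low, high = -1, 1`.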
Instead of the advantage function G_t − V(s_t), we used the generalized advantage estimate A^GAE. λ controls the depth of bootstrapping, with the infinite summation of its relative weights adding to 1; γ is the usual decay rate on future rewards.
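For reference, the standard form of the estimator, built from one-step TD errors, is

$$\delta_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t}), \qquad A_{t}^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}.$$

Viewed as an exponentially weighted average of k-step advantage estimates, the relative weights $(1-\lambda)\lambda^{k-1}$ sum to 1, which is the normalization referred to above.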
The PPO architecture was inspired by [5] and [6]. Our model is likely overparameterized, since we simplified the action space and no longer pass in stacked frames; however, we kept their structure for ease of comparison.
Input = processed image + speed + steering angle
Output = beta distribution parameters and state value
Note that the fully connected layers diverge into actor and critic heads (see the sketch after the table).
Input Shape | Function |
---|---|
(96,84,1) | Conv2d + LeakyRelu |
(20,23,8) | Max-Pooling |
(10,11,8) | Conv2d + LeakyRelu |
(12,13,16) | Max-Pooling |
(6,6,16) | LeakyRelu + Flatten |
(576+2) | speed, steering appended to tensor |
(577) | LeakyRelu |
(577), (1) | LeakyRelu, Identity |
(2) | Softplus |
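One possible reading of this table as a PyTorch module; the 577-wide hidden layers and the Softplus actor output follow the rows above, while kernel sizes and the exact point of divergence are assumptions:

```python
import torch
import torch.nn as nn

class PPOActorCritic(nn.Module):
    """One of the two identically structured PPO networks. The trunk mirrors
    the DQN features (without BatchNorm); the fully connected layers then
    diverge into an actor head (two Beta parameters) and a critic head."""
    def __init__(self, n_extra=2, hidden=577):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, stride=4), nn.LeakyReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=2), nn.LeakyReLU(),
            nn.MaxPool2d(2),
            nn.LeakyReLU(), nn.Flatten(),                  # -> 576 features
        )
        self.shared = nn.Sequential(nn.Linear(576 + n_extra, hidden), nn.LeakyReLU())
        self.actor = nn.Sequential(nn.Linear(hidden, hidden), nn.LeakyReLU(),
                                   nn.Linear(hidden, 2), nn.Softplus())  # Beta alpha, beta
        self.critic = nn.Linear(hidden, 1)                 # state value (identity)

    def forward(self, image, extras):
        # extras: appended scalars (speed, and steering angle where applicable)
        x = torch.cat([self.features(image), extras], dim=1)
        x = self.shared(x)
        return self.actor(x), self.critic(x)
```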
We successfully achieved our goal of becoming familiar with reinforcement learning algorithms and libraries. Our novel double-agent PPO was very successful, obtaining the highest test score and the second-lowest training cost among all encountered implementations that officially beat the environment. DQN did not officially beat the environment, but it also obtained high rewards at a low computational cost.
Our best DQN model averaged a reward of 850, short of the 900 needed to officially solve the environment. It was able to visit 97% of track tiles travelling at moderate speeds. We did not perform an in-depth hyperparameter search on this DQN, as it was clear that a more complex action space would be required to beat the environment.
First, we searched for an appropriate learning rate and epsilon-decay schedule under a self-imposed training cap of 360 episodes. All setups were tested using the control hyperparameters below.
Second, we compared performance across three other hyperparameters: travel speed, the reward-decay factor γ, and the time-discretization length. At moderate speed, a time-discretization length of 0.1 seconds, and γ = 0.92, the car almost always completed the track (97% of tiles visited), averaging a score of 850. At the same settings but a higher speed, the car was unable to navigate sharp turns, capping performance at 394. Finally, we tested at a critical speed we calculated as the bare minimum needed to obtain a score of 900 assuming all tiles are visited. Unfortunately, the car could not stay on the track at this critical speed (hyperparameters searched are shown below).
Scores | | Time for γ to decay to 20% | | |
---|---|---|---|---|
 | | 1.2 sec | 1.9 sec | 3 sec |
Time Discretization Length | 0.1 sec | 610 | 599 | 572 |
 | 0.06 sec | 211 (*possible mistake) | 569 | 375 |
Table 1. Hyperparameter search at the critical minimum speed required to beat the environment.
Two conclusions were drawn. First, it seems impossible to solve the environment with a simple speed policy; likely the environment's creators did this on purpose. Second, at the same speed setting, the simpler hyperparameters perform better. This corresponds to higher reward decays (the agent must only predict over a small time horizon) and longer discretization lengths (the agent must make fewer predictions per time period). It is possible that the action space is so simple that our agent does not need the greater expressive power of the more complex settings. It is also possible that our condensed training sessions (45 minutes / 360 episodes) were not long enough to capitalize on that expressive power. Below is our training curve at our best settings, compared to the top DQN implementation.
Figure ?: Our training curve (left) and the top DQN from OpenAI's leaderboard (right) [4]. We took approximately an order of magnitude fewer training steps (the dotted line marks our total timesteps).
DDQN_video.mp4
Our final PPO models obtained an average score of 917 over 100 test episodes (above the 900 needed to solve the environment), after 925 training episodes. The only group to beat the environment at a lower cost was a multi-university student research group [6]; no group obtained a higher test score.
As discussed in the methods section, we first trained the PPO steering model with the same speed control as DDQN. Next, we fixed the steering model and trained the thrust model. One interesting parameter we tuned was the constant added to the softplus output (the beta distribution parameters). We found that forcing concave-like shapes (+5) initially accelerated training by avoiding extreme actions, but the agent could not handle the broad uncertainty during fine-tuning. Allowing uni-modal convex shapes (+1) produced better fine-tuning results. Intuitively, we allowed the model to switch between hard acceleration/braking with great certainty in the sampling outcome, versus balancing a probability mass near the middle.
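As a toy illustration of this tuning knob (names are illustrative), the offset is simply added to the Softplus output before the Beta distribution is built:

```python
import torch.nn.functional as F

def beta_params(actor_output, offset):
    """offset = 5 keeps both Beta parameters well above 1, pushing probability
    mass away from the action bounds; offset = 1 lets a parameter approach 1 so
    mass can concentrate near hard acceleration or braking."""
    return F.softplus(actor_output) + offset
```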
A second hurdle was overcoming turn failures caused by excessive speed. Our thrust agent often produced high-speed, high-score episodes (930 reward), but every 10 episodes or so it would spin out and fail, dropping the average to 870. We overcame this by over/under-emphasizing training data: we simply skipped training on a fraction of high-reward episodes (placing more emphasis on failures) to reduce the urge to learn dangerous speeds.
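A sketch of this re-weighting, assuming an episodic training loop; the return threshold and skip probability are illustrative hyperparameters:

```python
import random

def keep_for_training(episode_return, high_return_threshold, skip_prob):
    """Skip a fraction of high-return episodes so that failures are
    over-represented in the training data."""
    if episode_return >= high_return_threshold and random.random() < skip_prob:
        return False
    return True
```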
Figure ?: Our training curve (top) and the previous top PPO from OpenAI's leaderboard (bottom) [7]. We required approximately 7 times fewer training steps (the dotted line marks our total timesteps).