Solving the OpenAI Gym CarRacing-v0 environment using Proximal Policy Optimization.
Read the full report.
See the full video demo on YouTube.
After 5000 training steps, the agent achieves a mean score of 909.48 ± 10.30 over 100 episodes. To reproduce the results, run the following commands:
```sh
mkdir logs
python demo.py --ckpt extra/final_weights.pt --delay_ms 0
```
Results from each episode will be saved to `logs/episode_rewards.csv`.
- A convolutional neural network jointly approximates the value function and the policy (see the actor-critic sketch after this list).
- Optimization is performed with Proximal Policy Optimization (PPO); the clipped surrogate loss is sketched below.
- The policy head outputs the parameters of a Beta distribution, which suits bounded continuous action spaces better than an unbounded Gaussian.
- Advantages are estimated with Generalized Advantage Estimation (GAE), also sketched below.
- The network input is a stack of 4 consecutive frames, with frame skipping optionally applied (see the wrapper sketch at the end).
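The following is a minimal sketch of the shared actor-critic network with a Beta policy head, in PyTorch. The layer sizes are assumptions for illustration, not the exact architecture from the report; the input is taken to be 4 stacked 96x96 grayscale frames and the action space 3-dimensional, as in CarRacing-v0.

```python
# Sketch of a shared actor-critic network with a Beta policy head.
# Layer sizes are illustrative, not the exact architecture from the report.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, n_actions: int = 3):
        super().__init__()
        # Shared convolutional trunk over the 4-frame input stack (4x96x96).
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 10 * 10, 256), nn.ReLU(),
        )
        # Policy head: one (alpha, beta) pair per action dimension.
        self.alpha_head = nn.Linear(256, n_actions)
        self.beta_head = nn.Linear(256, n_actions)
        # Value head: scalar state-value estimate.
        self.value_head = nn.Linear(256, 1)

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        # softplus(.) + 1 keeps both parameters > 1, giving a unimodal Beta.
        alpha = F.softplus(self.alpha_head(h)) + 1.0
        beta = F.softplus(self.beta_head(h)) + 1.0
        value = self.value_head(h)
        return alpha, beta, value
```

Samples from a Beta distribution lie in [0, 1], so at action-selection time each dimension is rescaled linearly to the environment's bounds (for example, steering to [-1, 1]).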
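Below is a sketch of the PPO objective in its standard clipped-surrogate form, assuming log-probabilities stored at rollout time and precomputed advantages and returns. `clip_eps` and `value_coef` are illustrative hyperparameters, and an entropy bonus is omitted for brevity.

```python
# Sketch of the PPO clipped surrogate loss plus a value-regression term.
import torch
import torch.nn.functional as F

def ppo_loss(new_logp, old_logp, advantages, values, returns,
             clip_eps: float = 0.1, value_coef: float = 0.5):
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(new_logp - old_logp)
    # Clipped surrogate: take the pessimistic minimum of the two terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Value loss regresses the value head toward the empirical returns.
    value_loss = F.mse_loss(values, returns)
    return policy_loss + value_coef * value_loss
```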
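The advantage estimates above come from GAE. A minimal sketch over a single rollout follows; it assumes `values` carries one extra bootstrap entry V(s_T) at the end, and `gamma` and `lam` are illustrative values.

```python
# Sketch of Generalized Advantage Estimation over one rollout of length T.
import numpy as np

def compute_gae(rewards, values, dones, gamma: float = 0.99, lam: float = 0.95):
    # values has length T + 1: V(s_0), ..., V(s_T) including the bootstrap.
    values = np.asarray(values, dtype=np.float32)
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future residuals.
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    # Returns used as regression targets for the value head.
    returns = advantages + values[:T]
    return advantages, returns
```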
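Finally, a sketch of the 4-frame stacking with optional frame skipping, written as a gym-style wrapper around the old 4-tuple `step` API. The class name and the `skip` default are assumptions, and grayscale conversion is taken to happen elsewhere.

```python
# Sketch of a frame-stacking wrapper with optional action repeat (frame skip).
from collections import deque
import numpy as np

class FrameStack:
    def __init__(self, env, n_frames: int = 4, skip: int = 1):
        self.env = env
        self.skip = skip
        self.frames = deque(maxlen=n_frames)

    def reset(self):
        obs = self.env.reset()
        # Fill the stack by repeating the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(obs)
        return np.stack(self.frames, axis=0)

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        # Repeat the action for `skip` frames, accumulating the reward;
        # only the last observed frame enters the stack.
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        self.frames.append(obs)
        return np.stack(self.frames, axis=0), total_reward, done, info
```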