Implement PPO-DNA algorithm for Atari #234

Status: Open. Merges 47 commits into base: master.

Commits (47)
- `dc7d1f2` First draft of PPO-DNA (jseppanen, Jul 12, 2022)
- `a9e50b8` Fix distillation learning rate decay (jseppanen, Jul 12, 2022)
- `b0c4c45` Add argument for envpool threads (jseppanen, Jul 12, 2022)
- `76b2943` Add exponential averaging to obs normalization (jseppanen, Jul 12, 2022)
- `0c5230c` Seed envpool environment explicitly (jseppanen, Jul 12, 2022)
- `ee4d7a7` Bump default number of environments to 128 (jseppanen, Jul 13, 2022)
- `b440b4f` Log gradients & upload final model to w&b (jseppanen, Jul 13, 2022)
- `22a16e2` Fix wandb logging to follow --track option (jseppanen, Jul 15, 2022)
- `b2486cc` Log environment step count (jseppanen, Jul 15, 2022)
- `eb279a9` Remove unused --capture-video option (jseppanen, Jul 15, 2022)
- `7550a5f` Fix distillation batch size argument (jseppanen, Jul 19, 2022)
- `bbbdf2e` Fix deprecation warning on np.bool_ (jseppanen, Jul 19, 2022)
- `e20da82` Use correct frame skip from env (jseppanen, Jul 19, 2022)
- `f384cbb` Add docs for PPO-DNA (jseppanen, Jul 19, 2022)
- `3d1c5ba` Blacken (jseppanen, Jul 19, 2022)
- `7a66401` build docs (vwxyzjn, Jul 19, 2022)
- `89c7f9d` format table (vwxyzjn, Jul 20, 2022)
- `981201f` Add a note on environment preprocessing (vwxyzjn, Jul 20, 2022)
- `0fe7b1f` Update hyperparam defaults to match paper (jseppanen, Jul 20, 2022)
- `e50bf9e` Change order of learning stages (jseppanen, Jul 20, 2022)
- `1b91a46` Revert entropy coefficient back to 0.01 (jseppanen, Jul 20, 2022)
- `cde9ada` Replace DIY obs. normalization with gym wrapper (jseppanen, Jul 20, 2022)
- `1bd2785` Disable TF32 multiplication on Ampere devices (jseppanen, Jul 20, 2022)
- `7bd5244` Remove reward clipping & add reward normalization (jseppanen, Jul 20, 2022)
- `d2d6c11` First pass to remove differences from baseline PPO (jseppanen, Jul 21, 2022)
- `87ca49d` Revert "Change order of learning stages" (jseppanen, Jul 21, 2022)
- `88464bb` Revert "Update hyperparam defaults to match paper" (jseppanen, Jul 21, 2022)
- `94fc331` Minimize differences to baseline PPO code (jseppanen, Jul 21, 2022)
- `e014b14` Re-run experiments with code from commit 94fc331 (jseppanen, Jul 31, 2022)
- `75a37c9` Remove main() function (jseppanen, Jul 31, 2022)
- `b76da96` Merge branch 'master' into ppo-dna (vwxyzjn, Aug 1, 2022)
- `710f85c` minor refactor (vwxyzjn, Aug 1, 2022)
- `4fb584c` Add benchmark script for ppo_dna_atari_envpool.py (jseppanen, Sep 25, 2022)
- `4b09c88` Fix typo (jseppanen, Sep 25, 2022)
- `d259368` Run value network in eval mode in distillation (jseppanen, Sep 27, 2022)
- `c071298` Do rollouts in eval mode (jseppanen, Sep 28, 2022)
- `5d913a6` Merge branch 'master' into ppo-dna (vwxyzjn, Nov 20, 2022)
- `c052f44` Remove duplicate adv calculation (see #287) (vwxyzjn, Nov 20, 2022)
- `f4501be` remove `OMP_NUM_THREADS=1` (vwxyzjn, Nov 20, 2022)
- `684001a` Merge branch 'master' into ppo-dna (vwxyzjn, Nov 20, 2022)
- `95bca3b` push changes (vwxyzjn, Nov 20, 2022)
- `3e35b8a` Try matching the env initializion in the paper (vwxyzjn, Nov 20, 2022)
- `b156f7d` revert change (vwxyzjn, Nov 20, 2022)
- `f89e68d` revert changes (vwxyzjn, Nov 20, 2022)
- `6595b4c` bug fix (vwxyzjn, Nov 22, 2022)
- `52c9c55` update script (vwxyzjn, Nov 22, 2022)
- `caabea4` Merge branch 'master' into ppo-dna (vwxyzjn, Jan 12, 2023)
468 changes: 468 additions & 0 deletions cleanrl/ppo_dna_atari_envpool.py

Large diffs are not rendered by default.

101 changes: 101 additions & 0 deletions docs/rl-algorithms/ppo_dna.md
@@ -0,0 +1,101 @@
# Proximal Policy Gradient with Dual Network Architecture (PPO-DNA)

## Overview

PPO-DNA is a more sample-efficient variant of PPO that uses separate networks, optimizers, and hyperparameters for the actor (policy) and the critic (value function), together with a distillation phase that transfers the critic's value estimates into the policy network (see the sketch below).

Original paper:

* [DNA: Proximal Policy Optimization with a Dual Network Architecture](https://arxiv.org/abs/2206.10027)
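
Below is a minimal sketch of the dual-network structure, assuming toy network sizes and hypothetical names (`PolicyNet`, `value_net`, `policy_opt`, `value_opt`); it illustrates the separate optimizers and the distillation step described in the paper, not the exact code in `ppo_dna_atari_envpool.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions = 8, 4  # toy sizes for illustration


class PolicyNet(nn.Module):
    """Policy network: actor head plus an auxiliary value head used for distillation."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.actor = nn.Linear(64, n_actions)
        self.aux_value = nn.Linear(64, 1)

    def forward(self, x):
        h = self.body(x)
        return self.actor(h), self.aux_value(h).squeeze(-1)


policy_net = PolicyNet()
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))  # critic

# Separate optimizers (and, in general, separate learning rates, batch sizes,
# and epoch counts) for the actor and the critic: the core idea behind DNA.
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=2.5e-4)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

obs = torch.randn(32, obs_dim)  # toy batch of observations
returns = torch.randn(32)       # placeholder for bootstrapped returns

# Value phase: the critic is updated on its own, with its own optimizer.
value_loss = F.mse_loss(value_net(obs).squeeze(-1), returns)
value_opt.zero_grad()
value_loss.backward()
value_opt.step()

# Distillation phase: train the policy network's auxiliary value head toward
# the critic's estimates while a KL penalty keeps the policy itself unchanged.
with torch.no_grad():
    old_logits, _ = policy_net(obs)
    target_values = value_net(obs).squeeze(-1)

logits, aux_values = policy_net(obs)
kl = F.kl_div(F.log_softmax(logits, -1), F.softmax(old_logits, -1), reduction="batchmean")
distill_loss = F.mse_loss(aux_values, target_values) + 1.0 * kl
policy_opt.zero_grad()
distill_loss.backward()
policy_opt.step()
```

In the full algorithm the policy phase additionally runs the standard PPO clipped-surrogate update on the policy network, with its own number of epochs and minibatch size.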

## Implemented Variants


| Variants Implemented | Description |
| ----------- | ----------- |
| :material-github: [`ppo_dna_atari_envpool.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py), :material-file-document: [docs](/rl-algorithms/ppo_dna/#ppo_dna_atari_envpoolpy) | Uses the blazing fast Envpool Atari vectorized environment. |

Below are our single-file implementations of PPO-DNA:

## `ppo_dna_atari_envpool.py`

[ppo_dna_atari_envpool.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py) has the following features:

* Uses the blazing fast [Envpool](https://github.com/sail-sg/envpool) vectorized environment.
* Designed for Atari games; uses convolutional layers and common Atari pre-processing techniques.
* Works with Atari's pixel `Box` observation space of shape `(210, 160, 3)`
* Works with the `Discrete` action space

???+ warning

    Note that `ppo_dna_atari_envpool.py` does not work on Windows :fontawesome-brands-windows: or macOS :fontawesome-brands-apple:. See envpool's built wheels here: [https://pypi.org/project/envpool/#files](https://pypi.org/project/envpool/#files)


### Usage

```bash
poetry install -E envpool
python cleanrl/ppo_dna_atari_envpool.py --help
python cleanrl/ppo_dna_atari_envpool.py --env-id Breakout-v5
```
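
As a hedged example, a longer tracked run might look like the following; the flag names follow CleanRL's common argument set, so check `--help` for the exact options supported by this script:

```bash
# Run on Pong with a fixed seed and log the run to Weights & Biases via --track
# (flag names assumed from CleanRL's common argument set; see --help).
python cleanrl/ppo_dna_atari_envpool.py \
    --env-id Pong-v5 \
    --seed 1 \
    --total-timesteps 3000000 \
    --track
```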

### Explanation of the logged metrics

See [related docs](/rl-algorithms/ppo/#explanation-of-the-logged-metrics) for `ppo.py`.

### Implementation details

[ppo_dna_atari_envpool.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py) uses a customized `RecordEpisodeStatistics` to work with envpool, but otherwise shares the implementation details of `ppo_atari.py` (see [related docs](/rl-algorithms/ppo/#implementation-details_1)). A simplified sketch of such a wrapper follows.
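
The sketch below shows what an episode-statistics wrapper for a batched (envpool-style) environment typically looks like; the wrapper shipped in the script may differ in details such as which `info` keys it reads or how it buffers completed episodes:

```python
import gym
import numpy as np


class RecordEpisodeStatistics(gym.Wrapper):
    """Track per-environment episodic return and length for a batched env."""

    def __init__(self, env):
        super().__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)
        self.episode_returns = None
        self.episode_lengths = None

    def reset(self, **kwargs):
        observations = super().reset(**kwargs)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        return observations

    def step(self, action):
        # envpool steps all sub-environments at once, so rewards/dones are arrays.
        observations, rewards, dones, infos = super().step(action)
        self.episode_returns += rewards
        self.episode_lengths += 1
        # Expose per-environment episodic stats, then reset the finished ones.
        infos["episode_return"] = self.episode_returns.copy()
        infos["episode_length"] = self.episode_lengths.copy()
        self.episode_returns *= 1.0 - dones.astype(np.float32)
        self.episode_lengths *= 1 - dones.astype(np.int32)
        return observations, rewards, dones, infos
```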

Note that the original DNA implementation uses the `StickyAction` environment pre-processing wrapper (Machado et al., 2018)[^1], but we did not implement it in [ppo_dna_atari_envpool.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py) because envpool does not currently support `StickyAction`.
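
For reference, sticky actions as described by Machado et al. (2018) simply repeat the previous action with some probability. A hypothetical single-environment gym wrapper (not part of this PR) would look roughly like this:

```python
import gym
import numpy as np


class StickyAction(gym.Wrapper):
    """With probability `repeat_prob`, execute the previous action instead of the
    one the agent selected (Machado et al., 2018)."""

    def __init__(self, env, repeat_prob=0.25):
        super().__init__(env)
        self.repeat_prob = repeat_prob
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return super().reset(**kwargs)

    def step(self, action):
        if np.random.random() < self.repeat_prob:
            action = self.last_action
        self.last_action = action
        return super().step(action)
```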


### Experiment results

Below are the average episodic returns for `ppo_dna_atari_envpool.py` compared to `ppo_atari_envpool.py`.


| Environment | `ppo_dna_atari_envpool.py` | `ppo_atari_envpool.py` |
| ----------- | ----------- | ----------- |
| BattleZone-v5 (40M steps) | 94800 ± 18300 | 28800 ± 6800 |
| BeamRider-v5 (10M steps) | 5470 ± 850 | 1990 ± 560 |
| Breakout-v5 (10M steps) | 321 ± 63 | 352 ± 52 |
| DoubleDunk-v5 (40M steps) | -4.9 ± 0.3 | -2.0 ± 0.8 |
| NameThisGame-v5 (40M steps) | 8500 ± 2600 | 4400 ± 1200 |
| Phoenix-v5 (45M steps) | 184000 ± 58000 | 10200 ± 2700 |
| Pong-v5 (3M steps) | 19.5 ± 1.1 | 16.6 ± 2.3 |
| Qbert-v5 (45M steps) | 12600 ± 4600 | 10800 ± 3300 |
| Tennis-v5 (10M steps) | 13.0 ± 2.3 | -12.4 ± 2.9 |

Learning curves:

<div class="grid-container">
<img src="../ppo_dna/BattleZone-v5-steps.png">
<img src="../ppo_dna/BattleZone-v5-time.png">
<img src="../ppo_dna/BeamRider-v5-steps.png">
<img src="../ppo_dna/BeamRider-v5-time.png">
<img src="../ppo_dna/Breakout-v5-steps.png">
<img src="../ppo_dna/Breakout-v5-time.png">
<img src="../ppo_dna/DoubleDunk-v5-steps.png">
<img src="../ppo_dna/DoubleDunk-v5-time.png">
<img src="../ppo_dna/NameThisGame-v5-steps.png">
<img src="../ppo_dna/NameThisGame-v5-time.png">
<img src="../ppo_dna/Phoenix-v5-steps.png">
<img src="../ppo_dna/Phoenix-v5-time.png">
<img src="../ppo_dna/Pong-v5-steps.png">
<img src="../ppo_dna/Pong-v5-time.png">
<img src="../ppo_dna/Qbert-v5-steps.png">
<img src="../ppo_dna/Qbert-v5-time.png">
<img src="../ppo_dna/Tennis-v5-steps.png">
<img src="../ppo_dna/Tennis-v5-time.png">
</div>


Tracked experiments:

<iframe src="https://wandb.ai/jseppanen/cleanrl/reports/PPO-DNA-vs-PPO-on-Atari-Envpool--VmlldzoyMzM5Mjcw" style="width:100%; height:500px" title="PPO-DNA vs PPO on Atari Envpool"></iframe>




[^1]: Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. "Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents." Journal of Artificial Intelligence Research 61 (2018): 523-562.
Binary file added docs/rl-algorithms/ppo_dna/BeamRider-v5-time.png
Binary file added docs/rl-algorithms/ppo_dna/Breakout-v5-steps.png
Binary file added docs/rl-algorithms/ppo_dna/Breakout-v5-time.png
Binary file added docs/rl-algorithms/ppo_dna/Phoenix-v5-steps.png
Binary file added docs/rl-algorithms/ppo_dna/Phoenix-v5-time.png
Binary file added docs/rl-algorithms/ppo_dna/Pong-v5-steps.png
Binary file added docs/rl-algorithms/ppo_dna/Pong-v5-time.png
Binary file added docs/rl-algorithms/ppo_dna/Qbert-v5-steps.png
Binary file added docs/rl-algorithms/ppo_dna/Qbert-v5-time.png
Binary file added docs/rl-algorithms/ppo_dna/Tennis-v5-steps.png
Binary file added docs/rl-algorithms/ppo_dna/Tennis-v5-time.png
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -85,6 +85,7 @@ nav:
- rl-algorithms/sac.md
- rl-algorithms/td3.md
- rl-algorithms/ppg.md
- rl-algorithms/ppo_dna.md
# - Open RL Benchmark: open-rl-benchmark.md
- Community:
- contribution.md