Add Parallel Q-Networks algorithm (PQN) #472
base: master
Conversation
Hey Roger, it's really cool to see you adding PQN to CleanRL! I've read the paper before, and I think your implementation is great. When it comes time to run benchmarks or add documentation, let's collaborate to see how we can best do it. Looking forward to seeing the completed PR! 🚀👍
I noticed that the epsilon-greedy implementation in our current setup differs from the official one: in the official implementation each environment independently performs epsilon-greedy exploration, whereas in ours all environments share a single random number. This might have an impact when running many environments in parallel. Of course, there could be other reasons for the performance differences too. Let's start by running some benchmark tests to see if performance also falls short in other environments. Looking forward to working through this together!
… some envs can explore and some exploit, like in the official implementation
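For context, here is a minimal sketch of per-environment epsilon-greedy as described above; the function name and tensor shapes are illustrative assumptions, not code from this PR:

```python
import torch


def epsilon_greedy_per_env(q_values: torch.Tensor, epsilon: float) -> torch.Tensor:
    """Each env draws its own random number, so some envs explore while others exploit.

    q_values: (num_envs, num_actions) Q-values for the current batch of observations.
    """
    num_envs, num_actions = q_values.shape
    greedy_actions = q_values.argmax(dim=1)
    random_actions = torch.randint(num_actions, (num_envs,), device=q_values.device)
    explore = torch.rand(num_envs, device=q_values.device) < epsilon  # one draw per env
    return torch.where(explore, random_actions, greedy_actions)
```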
Very nice catch! Let me try to set up the benchmark experiments :)
Here are some first results!
Been watching this from afar, very cool work!!
Nice job! Your results show it takes 25 minutes for 10 million frames, while the paper reports 200 million frames in an hour. No equivalent to
Updated results here. I wonder how I should generate the comparison between DQN/PQN with the

@pseudo-rnd-thoughts It is probably because
Maybe try
Hey! How do you think we should proceed? I believe it will be hard to match the speed of the JAX-based original implementation in this torch implementation, but at least it provides a Q-learning alternative with envpool that matches CleanRL's envpool PPO, which can already be very useful! :)
I realized I was re-computing the values for each state in the rollouts when computing
Also, I added
Please let me know how we should continue!
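For context on reusing values stored during the rollout, here is a minimal sketch of computing Q(λ)-style returns backwards over a rollout without extra forward passes; the variable names, the done convention, and the λ default are assumptions, not code taken from this PR:

```python
import torch


def q_lambda_returns(rewards, dones, q_values, next_obs_q, gamma=0.99, q_lambda=0.65):
    """rewards, dones: (num_steps, num_envs); dones[t] = 1 if the episode ended at step t.
    q_values: (num_steps, num_envs, num_actions) Q-values stored while acting.
    next_obs_q: (num_envs, num_actions) Q-values of the observation after the last step.
    """
    num_steps = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    # Bootstrap the last step from the observation that follows the rollout.
    returns[-1] = rewards[-1] + gamma * (1.0 - dones[-1]) * next_obs_q.max(dim=-1).values
    for t in reversed(range(num_steps - 1)):
        bootstrap = q_values[t + 1].max(dim=-1).values  # reuse stored values, no recomputation
        returns[t] = rewards[t] + gamma * (1.0 - dones[t]) * (
            q_lambda * returns[t + 1] + (1.0 - q_lambda) * bootstrap
        )
    return returns
```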
@roger-creus There is a larger issue with EnvPool rollouts and computing the loss function; see #475
Hey @roger-creus thanks for the ping and the nice PR! Could you add documentation like the other algorithms?
Added some comments on style. Overall, we should try to reduce the code differences against a reference file (e.g., pqn.py vs. ppo.py, or pqn_atari_envpool.py vs. ppo_atari_envpool.py).
cleanrl/pqn_atari_envpool.py
Outdated
"""Toggle learning rate annealing for policy and value networks""" | ||
gamma: float = 0.99 | ||
"""the discount factor gamma""" | ||
num_minibatches: int = 32 |
I took the PQN hyperparameters from the official implementation. Overall, I have found num_envs=128, num_steps=32, and num_minibatches=32 to be a great combination of both speed and policy rewards (8.2k FPS in my setup, compared to 2.7k FPS with the current ppo_atari_envpool).

I am running some experiments to evaluate PQN with the 2 different configurations: num_envs=128, num_steps=32, num_minibatches=32 vs. num_envs=8, num_steps=128, num_minibatches=4, and will update you with the results. Maybe in a future PR I can do the same for PPO?

The inconsistencies with pqn_lstm will be fixed!
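For reference, a quick sketch of how CleanRL-style scripts typically derive the batch and minibatch sizes from these settings (illustrative arithmetic only, not taken from this PR's diff):

```python
# Derived sizes for the num_envs=128, num_steps=32, num_minibatches=32 configuration.
num_envs, num_steps, num_minibatches = 128, 32, 32
batch_size = num_envs * num_steps               # 4096 transitions collected per rollout
minibatch_size = batch_size // num_minibatches  # 128 transitions per gradient step
print(batch_size, minibatch_size)               # 4096 128
```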
Actually, for now I will just leave it at the current defaults of CleanRL: num_envs=8, num_steps=128, num_minibatches=4, and once this PR is merged I can make a new one to decide the best hyperparameters for both PQN and PPO (envpool versions).
cleanrl/pqn_atari_envpool.py
Outdated
nn.Conv2d(4, 32, 8, stride=4),
nn.LayerNorm([32, 20, 20]),
nn.ReLU(),
nn.Conv2d(32, 64, 4, stride=2),
nn.LayerNorm([64, 9, 9]),
nn.ReLU(),
nn.Conv2d(64, 64, 3, stride=1),
nn.LayerNorm([64, 7, 7]),
nn.ReLU(),
nn.Flatten(),
nn.Linear(3136, 512),
nn.LayerNorm(512),
nn.ReLU(),
nn.Linear(512, env.single_action_space.n),
Maybe apply layer_init to PQN as well, like in PPO?
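For illustration, wrapping the layers with the layer_init helper used in CleanRL's PPO scripts (orthogonal weights, constant bias) could look like this; treat it as a sketch rather than the final code of this PR:

```python
import numpy as np
import torch
import torch.nn as nn


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight init + constant bias, as in CleanRL's PPO scripts.
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


q_network = nn.Sequential(
    layer_init(nn.Conv2d(4, 32, 8, stride=4)),
    nn.LayerNorm([32, 20, 20]),
    nn.ReLU(),
    layer_init(nn.Conv2d(32, 64, 4, stride=2)),
    nn.LayerNorm([64, 9, 9]),
    nn.ReLU(),
    layer_init(nn.Conv2d(64, 64, 3, stride=1)),
    nn.LayerNorm([64, 7, 7]),
    nn.ReLU(),
    nn.Flatten(),
    layer_init(nn.Linear(3136, 512)),
    nn.LayerNorm(512),
    nn.ReLU(),
    layer_init(nn.Linear(512, env.single_action_space.n)),  # env as in the surrounding diff
)
```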
… PQN Lstm. Ran validation experiments. Reverting undesired changes in DQN. WIP documentation
Thanks for your review @vwxyzjn !
Let me know how to proceed! :)
cleanrl/pqn.py
Outdated
@@ -0,0 +1,247 @@
# docs and experiment results can be found at https://docs.cleanrl.dev/rl-algorithms/dqn/#dqnpy |
This needs to be updated
Hi @roger-creus, sorry for the slow response. Lots going on lately. Could you clone the openrlbenchmark repo and run the following? Namely:
I can't do the plots because I don't have access to the cleanrl project. Could you make it public? Thanks.
I tested it with a new install and openrlbenchmark seems to work fine with the demo command: https://github.com/openrlbenchmark/openrlbenchmark?tab=readme-ov-file#get-started
Thanks for the reply @vwxyzjn! I have updated the documentation, added the plots, etc. I have checked the changes locally and I think it looks fine! All tests pass, etc. This PR is pretty much ready in my opinion, but let me know if you would like some additional details or experiments! :)
Description
Adding PQN from Simplifying Deep Temporal Difference Learning
I have implemented both pqn.py and pqn_atari_envpool.py. The results are promising for the CartPole version. Check them out here. I am now running some debugging experiments for the Atari version.

Some details about the implementations:
For both pqn.py and dqn.py in CartPole, I multiplied the rewards from the environment by 0.1, as done in the official implementation of PQN; performance increases for both algos.

Overall the implementation is similar to PPO with envpool (so very fast!) but with the sample efficiency of Q-learning! Nice algorithm! :)
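As an aside, here is one minimal way to apply that 0.1 reward scaling with a standard wrapper (gymnasium's TransformReward); the PR may implement the scaling differently:

```python
import gymnasium as gym

# Scale every reward by 0.1 before the agent sees it (CartPole example).
env = gym.make("CartPole-v1")
env = gym.wrappers.TransformReward(env, lambda r: 0.1 * r)
```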
Let me know how to proceed!
Types of changes
Checklist:
- pre-commit run --all-files passes (required).
- Documentation previewed with mkdocs serve.

If you need to run benchmark experiments for a performance-impacting change:
- Tracked experiments submitted, optionally with --capture_video.
- RLops run with python -m openrlbenchmark.rlops.
- Learning curves generated by the python -m openrlbenchmark.rlops utility added to the documentation.
- Links to the tracked experiments, generated by python -m openrlbenchmark.rlops ....your_args... --report, added to the documentation.