Add Parallel Q-Networks algorithm (PQN) #472
base: master
Conversation
Hey Roger, it's really cool to see you adding PQN to CleanRL! I've read the paper before, and I think your implementation is great. When it comes time to run benchmarks or add documentation, let's collaborate to see how we can best do it. Looking forward to seeing the completed PR! 🚀👍
I noticed that the epsilon-greedy implementation in our current setup differs from the official one: in the official implementation each environment independently performs epsilon-greedy exploration, whereas in ours all environments share a single random number. This might have an impact when running many environments in parallel. Of course, there could be other reasons for the performance differences too. Let's start by running some benchmark tests to see if performance also falls short in other environments. Looking forward to working through this together!
… some envs can explore and some exploit, like in the official implementation
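For context, here is a minimal sketch of per-environment epsilon-greedy as described above; the function name and tensor shapes are illustrative assumptions, not code from this PR:

```python
import torch


def epsilon_greedy_per_env(q_values: torch.Tensor, epsilon: float) -> torch.Tensor:
    """Each env draws its own random number, so some envs explore while others exploit.

    q_values: (num_envs, num_actions) Q-values for the current batch of observations.
    """
    num_envs, num_actions = q_values.shape
    greedy_actions = q_values.argmax(dim=1)
    random_actions = torch.randint(num_actions, (num_envs,), device=q_values.device)
    explore = torch.rand(num_envs, device=q_values.device) < epsilon  # one draw per env
    return torch.where(explore, random_actions, greedy_actions)
```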
Very nice catch! Let me try to set up the benchmark experiments :)
Here are some first results!
Been watching this from afar, very cool work!!
Nice job! Your results show it takes 25 minutes for 10 million frames, while the paper reports 200 million frames in an hour. No equivalent to
Updated results here. I wonder how I should generate the comparison between DQN/PQN with the

@pseudo-rnd-thoughts It is probably because
Maybe try
Hey! How do you think we should proceed? I believe it will be hard to match the speed of the JAX-based original implementation in this torch implementation, but at least it provides a Q-learning alternative with envpool that matches CleanRL's envpool PPO, which can already be very useful! :)
I realized I was re-computing the values for each state in the rollouts when computing
Also, I added
Please let me know how we should continue!
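For context on reusing values stored during the rollout, here is a minimal sketch of computing Q(λ)-style returns backwards over a rollout without extra forward passes; the variable names, the done convention, and the λ default are assumptions, not code taken from this PR:

```python
import torch


def q_lambda_returns(rewards, dones, q_values, next_obs_q, gamma=0.99, q_lambda=0.65):
    """rewards, dones: (num_steps, num_envs); dones[t] = 1 if the episode ended at step t.
    q_values: (num_steps, num_envs, num_actions) Q-values stored while acting.
    next_obs_q: (num_envs, num_actions) Q-values of the observation after the last step.
    """
    num_steps = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    # Bootstrap the last step from the observation that follows the rollout.
    returns[-1] = rewards[-1] + gamma * (1.0 - dones[-1]) * next_obs_q.max(dim=-1).values
    for t in reversed(range(num_steps - 1)):
        bootstrap = q_values[t + 1].max(dim=-1).values  # reuse stored values, no recomputation
        returns[t] = rewards[t] + gamma * (1.0 - dones[t]) * (
            q_lambda * returns[t + 1] + (1.0 - q_lambda) * bootstrap
        )
    return returns
```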
@roger-creus There is a larger issue with EnvPool rollouts and computing the loss function; see #475
Hey @roger-creus thanks for the ping and the nice PR! Could you add documentation like the other algorithms?
Added some comments on style. Overall, we should try to reduce the code differences against a reference file (e.g., pqn.py vs. ppo.py, or pqn_atari_envpool.py vs. ppo_atari_envpool.py).
cleanrl/pqn_atari_envpool.py
Outdated
"""Toggle learning rate annealing for policy and value networks""" | ||
gamma: float = 0.99 | ||
"""the discount factor gamma""" | ||
num_minibatches: int = 32 |
I took the PQN hyperparameters from the official implementation. Overall, I have found num_envs=128, num_steps=32, and num_minibatches=32 to be a great combination of both speed and policy rewards (8.2k FPS in my setup, compared to 2.7k FPS with the current ppo_atari_envpool).

I am running some experiments to evaluate PQN with the 2 different configurations: num_envs=128, num_steps=32, num_minibatches=32 vs. num_envs=8, num_steps=128, num_minibatches=4, and will update you with the results. Maybe in a future PR I can do the same for PPO?

The inconsistencies with pqn_lstm will be fixed!
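For reference, a quick sketch of how CleanRL-style scripts typically derive the batch and minibatch sizes from these settings (illustrative arithmetic only, not taken from this PR's diff):

```python
# Derived sizes for the num_envs=128, num_steps=32, num_minibatches=32 configuration.
num_envs, num_steps, num_minibatches = 128, 32, 32
batch_size = num_envs * num_steps               # 4096 transitions collected per rollout
minibatch_size = batch_size // num_minibatches  # 128 transitions per gradient step
print(batch_size, minibatch_size)               # 4096 128
```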
Actually, for now I will just leave it at the current defaults of CleanRL: num_envs=8, num_steps=128, num_minibatches=4, and once this PR is merged I can make a new one to decide the best hyperparameters for both PQN and PPO (envpool versions).
cleanrl/pqn_atari_envpool.py
Outdated
nn.Conv2d(4, 32, 8, stride=4),
nn.LayerNorm([32, 20, 20]),
nn.ReLU(),
nn.Conv2d(32, 64, 4, stride=2),
nn.LayerNorm([64, 9, 9]),
nn.ReLU(),
nn.Conv2d(64, 64, 3, stride=1),
nn.LayerNorm([64, 7, 7]),
nn.ReLU(),
nn.Flatten(),
nn.Linear(3136, 512),
nn.LayerNorm(512),
nn.ReLU(),
nn.Linear(512, env.single_action_space.n),
Maybe apply layer_init to PQN as well, like in PPO?
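For illustration, wrapping the layers with the layer_init helper used in CleanRL's PPO scripts (orthogonal weights, constant bias) could look like this; treat it as a sketch rather than the final code of this PR:

```python
import numpy as np
import torch
import torch.nn as nn


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight init + constant bias, as in CleanRL's PPO scripts.
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


q_network = nn.Sequential(
    layer_init(nn.Conv2d(4, 32, 8, stride=4)),
    nn.LayerNorm([32, 20, 20]),
    nn.ReLU(),
    layer_init(nn.Conv2d(32, 64, 4, stride=2)),
    nn.LayerNorm([64, 9, 9]),
    nn.ReLU(),
    layer_init(nn.Conv2d(64, 64, 3, stride=1)),
    nn.LayerNorm([64, 7, 7]),
    nn.ReLU(),
    nn.Flatten(),
    layer_init(nn.Linear(3136, 512)),
    nn.LayerNorm(512),
    nn.ReLU(),
    layer_init(nn.Linear(512, env.single_action_space.n)),  # env as in the surrounding diff
)
```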
… PQN Lstm. Ran validation experiments. Reverting undesired changes in DQN. WIP documentation
Thanks for your review @vwxyzjn !
Let me know how to proceed! :)
cleanrl/pqn.py
Outdated
@@ -0,0 +1,247 @@
# docs and experiment results can be found at https://docs.cleanrl.dev/rl-algorithms/dqn/#dqnpy |
This needs to be updated
Hi @roger-creus, sorry for the slow response. Lots going on lately. Could you clone the openrlbenchmark repo and run the following? Namely:
I can't do the plots because I don't have access to the cleanrl project. Could you make it public? Thanks.
I tested it with a new install and openrlbenchmark seems to work fine with the demo command: https://github.com/openrlbenchmark/openrlbenchmark?tab=readme-ov-file#get-started
Thanks for the reply @vwxyzjn! I have updated the documentation, added the plots, etc. I have checked the changes locally and I think it looks fine! All tests pass, etc. This PR is pretty much ready in my opinion, but let me know if you would like some additional details or experiments! :)
Description
Adding PQN from Simplifying Deep Temporal Difference Learning
I have implemented both pqn.py and pqn_atari_envpool.py. The results are promising for the CartPole version. Check them out here. I am now running some debugging experiments for the Atari version.

Some details about the implementations:
For both pqn.py and dqn.py in CartPole, I multiplied the rewards from the environment by 0.1, as done in the official implementation of PQN; performance increases for both algos.

Overall the implementation is similar to PPO with envpool (so very fast!) but with the sample efficiency of Q-learning! Nice algorithm! :)
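As an aside, here is one minimal way to apply that 0.1 reward scaling with a standard wrapper (gymnasium's TransformReward); the PR may implement the scaling differently:

```python
import gymnasium as gym

# Scale every reward by 0.1 before the agent sees it (CartPole example).
env = gym.make("CartPole-v1")
env = gym.wrappers.TransformReward(env, lambda r: 0.1 * r)
```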
Let me know how to proceed!
Types of changes
Checklist:
- pre-commit run --all-files passes (required).
- Documentation previewed with mkdocs serve.

If you need to run benchmark experiments for a performance-impacting change:
- Tracked experiments submitted, optionally with --capture_video.
- RLops run with python -m openrlbenchmark.rlops.
- Learning curves generated by the python -m openrlbenchmark.rlops utility added to the documentation.
- Links to the tracked experiments, generated by python -m openrlbenchmark.rlops ....your_args... --report, added to the documentation.