DDPG/TD3 target_actor output clip #196
Comments
Thanks for raising this issue. I think you are referring to https://github.com/sfujim/TD3/blob/385b33ac7de4767bab17eb02ade4a268d3e4e24f/TD3.py#L28? Yeah this is a bit of a problem. That said, it's fine in MuJoCo because |
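For context, a minimal sketch of the two actor formulations being discussed (illustrative names, not CleanRL's exact code): Fujimoto's actor scales the tanh output by max_action, so its output already respects the environment's bounds, whereas a plain tanh output clipped to [-1, 1] only matches the true bounds when max_action happens to be 1.

```python
import torch
import torch.nn as nn


class FujimotoStyleActor(nn.Module):
    """Fujimoto-style actor: the tanh output is scaled by max_action."""

    def __init__(self, obs_dim, act_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )
        self.max_action = max_action

    def forward(self, obs):
        # output already lies in [-max_action, max_action]
        return self.max_action * torch.tanh(self.net(obs))


# In contrast, a plain tanh actor followed by clipping to [-1, 1]:
#   action = torch.tanh(actor(obs)).clamp(-1, 1)
# only coincides with the true action bounds when max_action == 1
# (true for most MuJoCo tasks, but not all of them).
```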
Yep. I still highly suggest you adopt Fujimoto's implementation, because in some MuJoCo environments (like Humanoid-v2) and some other environments, the max_action may not be 1. I also highly suggest adding PyTorch model saving to each of your single files, and then writing a separate single file that loads the model and tests it in the corresponding environment. Thanks for your reply and your nice CleanRL code : ) |
Thanks for the suggestion! Model saving, on the other hand, is a bit more nuanced. For example, ppo_continuous_action.py technically also needs to save the running mean and standard deviation stored in the environment wrappers. Because of these complications, we have decided against model saving, to keep the core implementation as simple as possible. Additionally, most people save models only to load them later and see what the agent is actually doing. For this use case, we already include videos of the agents playing the game in the docs (e.g. link), further eliminating the use case for saving models. |
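To illustrate the complication mentioned above, here is a rough sketch of what a complete checkpoint for ppo_continuous_action.py would have to contain. It assumes gym's NormalizeObservation wrapper, which keeps its running statistics in an obs_rms object; the helper functions themselves are hypothetical, not part of CleanRL.

```python
import torch


def save_checkpoint(agent, norm_env, path="ppo_checkpoint.pt"):
    """Save the PPO agent together with the observation-normalization stats."""
    torch.save(
        {
            "agent_state_dict": agent.state_dict(),
            # without these, a reloaded agent would see un-normalized
            # observations and behave very differently at test time
            "obs_rms_mean": norm_env.obs_rms.mean,
            "obs_rms_var": norm_env.obs_rms.var,
        },
        path,
    )


def load_checkpoint(agent, norm_env, path="ppo_checkpoint.pt"):
    """Restore both the network weights and the wrapper statistics."""
    ckpt = torch.load(path)
    agent.load_state_dict(ckpt["agent_state_dict"])
    norm_env.obs_rms.mean = ckpt["obs_rms_mean"]
    norm_env.obs_rms.var = ckpt["obs_rms_var"]
```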
@dosssman This is indeed a much larger problem - I just tested it out and there are a number of environments that do not have 1 as the default max_action.
@huxiao09 would you be interested in submitting a fix for this and going through the process of re-running benchmarks? Otherwise, no worries - we will take care of this. |
Sure, it's my pleasure. But I am busy this week. I'm afraid I can only try my best to finish it in two weeks. Is that too late? |
Thanks for the heads up. Will look into this. |
Two weeks is actually a great timeline. Thank you @huxiao09 for your interest in working on this! The process would go like this
Btw, would this impact cleanrl/cleanrl/ppo_continuous_action.py, line 89 at commit 6387191?
|
SAC should be unaffected, as the actions are scaled to the min/max actions before being returned to the algorithm.
|
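For reference, a sketch of the rescaling pattern referred to above: a tanh-squashed action in [-1, 1] is mapped into the environment's [low, high] box. The helper and its names are illustrative, not CleanRL's exact SAC code.

```python
import torch


def rescale_action(squashed_action, action_low, action_high):
    """Map a tanh-squashed action in [-1, 1] to the env's [low, high] box."""
    low = torch.as_tensor(action_low, dtype=torch.float32)
    high = torch.as_tensor(action_high, dtype=torch.float32)
    action_scale = (high - low) / 2.0
    action_bias = (high + low) / 2.0
    return squashed_action * action_scale + action_bias


# e.g. for an env whose action box is [-0.4, 0.4] (roughly Humanoid-v2):
# env_action = rescale_action(torch.tanh(raw_actor_output),
#                             env.action_space.low, env.action_space.high)
```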
ok, got it :) |
Thank you so much. Would you mind registering a wandb account and giving me your username? Would you also mind adding me on Discord? Thanks |
Yes, I already have an account: https://wandb.ai/huxiao. But it seems that Discord can't be opened in my region. I'll figure out how to solve this problem tomorrow. |
Preliminary fix tracked here for TD3 and here for DDPG. The corresponding PR is #211 Overall, it does not seem to make that big of a difference compared to the previous version, but this version would be theoretically correct indeed. |
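For reference, a sketch of what a bound-aware target-action computation can look like; this is a hypothetical helper illustrating the idea, not necessarily the exact code in the PR. The noisy target action is clamped to the environment's actual bounds rather than to a hard-coded [-1, 1].

```python
import torch


def td3_target_action(target_actor, next_obs, action_low, action_high,
                      policy_noise=0.2, noise_clip=0.5):
    """Smoothed TD3 target action, clamped to the env's true action bounds."""
    low = torch.as_tensor(action_low, dtype=torch.float32)
    high = torch.as_tensor(action_high, dtype=torch.float32)
    target_a = target_actor(next_obs)
    # target policy smoothing: clipped Gaussian noise
    # (in practice the noise scale may also need scaling with the action range)
    noise = (torch.randn_like(target_a) * policy_noise).clamp(-noise_clip, noise_clip)
    # clamp to [low, high] instead of the hard-coded [-1, 1]
    return torch.min(torch.max(target_a + noise, low), high)
```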
I changed the run set name as follows (customizing the advanced legend). Could you do the following: |
Thank you so much! |
ok. @dosssman has actually done exactly what I was going to do. Sorry that my reply may not have reached you in time. |
Thank you very much for the feedback. Will fix and add re-runs just to be sure. |
Thank you Rousslan. Please remove the experiments or move them to the cleanrl-cache project - the stale experiments should not remain in openrlbenchmark/cleanrl.
|
No worries. Will do. Should get it done by next Tuesday or Wednesday. |
Let's just keep those old experiments to compare with the last variant until then. |
Re-runs are looking good so far; not much difference from the previous iterations. @vwxyzjn I do need some help removing these runs, as I am not their author. |
Problem Description
Hi! It seems that the output of the target_actor in DDPG/TD3 is directly clipped to fit the action range boundaries, without multiplying by max_action. But in Fujimoto's DDPG/TD3 code [1] and some other implementations, max_action is multiplied into the last tanh layer of the actor network, so they don't need the clip. Have you ever tried the second implementation?
[1] https://github.com/sfujim/TD3