Enhancing Termination and Truncation Handling in CleanRL's PPO Algorithm #448
base: master
Conversation
Hi @Josh00-Lu, thanks for the PR. Have you had cases where handling termination / truncation made a huge difference in performance?
Yes, especially for MuJoCo environments.
+1 on seeing cases where it makes a big difference. CleanRL consistently performed much worse than SB3 on one of my environments, but when I added the bootstrapping from terminal observation to CleanRL it achieved the same performance as SB3. See this comment in another thread on this topic.
+1 on massive improvements using the PR's changes. I have a custom quadrotor environment that works well with SB3, but the CleanRL implementation gave poor performance, including catastrophic forgetting of some progress on the best seeds (the seeds which gave the best performance on SB3). Using the PR's correct handling of truncation, I get matching performance w.r.t. SB3.
Thank you. It's great to see that!
Thank you for this PR! I had previously noticed the issue and made a simple modification to the CleanRL PPO by referencing the fix done in DLR-RM/stable-baselines3#658. Here is my code: https://gist.github.com/sdpkjc/71cbba3bd439d754ef9c026c31658ecd#file-ppo_continuous_action-py-L431-L437 It only requires adding a few lines to the existing code. I found that this modification offers limited performance improvements in the default MuJoCo environments, but it yields benefits in scenarios that frequently involve truncation. I also noticed that Costa @vwxyzjn seems to have been interested in this issue previously. In his blog post https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ he mentioned this problem with PPO, but opted not to fix it in his implementation for the sake of fidelity to the original algorithm. Whether or not to address this issue depends on whether CleanRL prioritizes higher performance or a more faithful implementation of the original algorithm; this is a topic worth discussing. As far as I know, SB3 has addressed this issue. Perhaps we could consider adding an argument to allow users to choose whether to handle this correctly.
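For readers skimming the thread, the fix referenced above (mirroring DLR-RM/stable-baselines3#658) amounts to bootstrapping the reward with the critic's value of the terminal observation whenever an episode is truncated by a time limit rather than truly terminated. The sketch below is only an illustration of that idea, not the linked gist verbatim; names such as `agent.get_value`, the `rewards` buffer, and the Gymnasium `"final_observation"` info key follow CleanRL's `ppo_continuous_action.py` rollout loop, and the helper function itself is hypothetical:

```python
import torch


def bootstrap_truncated_rewards(agent, rewards, step, terminations, truncations, infos, gamma, device):
    """Add gamma * V(final_observation) to the reward of every sub-environment
    that was truncated (time limit) but not terminated, so GAE later treats the
    transition as non-terminal. Sketch only; call it right after envs.step()."""
    for idx, truncated in enumerate(truncations):
        if truncated and not terminations[idx]:
            # Gymnasium vector envs expose the pre-reset observation here.
            final_obs = torch.as_tensor(
                infos["final_observation"][idx], dtype=torch.float32, device=device
            )
            with torch.no_grad():
                terminal_value = agent.get_value(final_obs.unsqueeze(0)).item()
            rewards[step, idx] += gamma * terminal_value
    return rewards
```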
Thanks all for the discussion. I had opted not to implement the correct behavior, also because at some point I did a comprehensive eval and it did not seem to matter. There are also issues with an efficient implementation: the current Gymnasium design is quite inefficient in that we need to do extra forward passes whenever there are terminal states.
Sorry for the late message. Personally, I would advocate for a separate file with this implementation if the original isn't meant to be changed. This would prevent the problem people report above, where they would need to make the change themselves. Therefore, my suggestion would be to reopen this PR and include it in a new file.
I have re-opened this PR and included the implementation in a new file.
Description
I have noticed that the issues regarding termination and truncation in the PPO (Proximal Policy Optimization) algorithm have not yet been addressed, so I have incorporated the relevant handling. Specifically, for a vector environment (vec env), when a termination or truncation is received at the current time step, the actual next state is provided in `"final_observation"`, and we need to record this information. Here, `done = termination or truncation`. When calculating delta and the advantage with GAE (Generalized Advantage Estimation), delta should be computed using `termination`, while the advantage recursion should be cut using `done`. For more details, please refer to the PR (Pull Request) code and comments I have provided.

Notes:

- Since the GAE computation needs `dones[t+1]` and `terminations[t+1]` when computing advantages, I have made slight code adjustments to simplify the code logic.
- The value bootstrap at the end of the rollout uses `next_terminations` and `next_obs`.
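To make the delta / done distinction above concrete, here is a minimal, self-contained sketch of GAE under this convention. It is not the PR's exact code: it assumes per-step arrays where `next_values[t]` is the value of the actual next state (i.e. of `"final_observation"` when the step ended the episode), so the indexing differs slightly from the `dones[t+1]` / `terminations[t+1]` form used in the PR.

```python
import numpy as np


def compute_gae(rewards, values, next_values, terminations, dones, gamma=0.99, gae_lambda=0.95):
    """GAE with separate handling of termination and truncation.

    rewards[t]      : reward received at step t
    values[t]       : V(s_t)
    next_values[t]  : V(s_{t+1}) for the *actual* next state (use the value of
                      "final_observation" when the episode ended at step t)
    terminations[t] : 1.0 if s_{t+1} is a true terminal state (no bootstrapping)
    dones[t]        : 1.0 if the episode ended at step t for any reason
                      (termination OR truncation) -> cuts the GAE recursion
    """
    advantages = np.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(rewards.shape[0])):
        # Bootstrap V(s_{t+1}) unless the next state was truly terminal.
        delta = rewards[t] + gamma * next_values[t] * (1.0 - terminations[t]) - values[t]
        # Stop the recursion at any episode boundary so advantages never
        # leak across a reset caused by truncation.
        lastgaelam = delta + gamma * gae_lambda * (1.0 - dones[t]) * lastgaelam
        advantages[t] = lastgaelam
    returns = advantages + values
    return advantages, returns
```

The arrays can be of shape `(num_steps,)` or `(num_steps, num_envs)`; the recursion is elementwise either way.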
.Types of changes
Checklist:

- I have ensured `pre-commit run --all-files` passes (required).
- I have updated the documentation and previewed the changes via `mkdocs serve`.

If you need to run benchmark experiments for a performance-impacting change:

- I have used the benchmark utility to submit the tracked experiments, optionally with `--capture_video`.
- I have performed RLops with `python -m openrlbenchmark.rlops`.
- I have added the learning curves generated by the `python -m openrlbenchmark.rlops` utility to the documentation.
- I have added links to the tracked experiments, generated by `python -m openrlbenchmark.rlops ....your_args... --report`, to the documentation.