Enhancing Termination and Truncation Handling in CleanRL's PPO Algorithm #448

Open · wants to merge 4 commits into base: master
Conversation

@Josh00-Lu commented Feb 29, 2024

Description

I have noticed that the issues regarding termination and truncation handling in the PPO (Proximal Policy Optimization) implementation have not yet been addressed. Therefore, I have incorporated the relevant handling. Specifically, in a vector environment (vec env), when termination or truncation occurs at the current time step, the actual next state is provided in "final_observation", and we need to record it. Here, done = termination or truncation.

When calculating the TD error (delta) and the advantage with GAE (Generalized Advantage Estimation), delta should be masked with termination only (so a truncated step still bootstraps from the value of the actual next state), while the advantage recursion should be cut on done. For more details, please refer to the code and comments in this PR.

Notes:

  • Since we only need dones[t+1] and terminations[t+1] when computing advantages with GAE, I have made slight code adjustments to simplify the logic.
  • We now need to store next_terminations and next_obs (see the sketch below).
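For illustration, here is a minimal, self-contained sketch of the GAE computation described above. The function name and the next_values / next_terminations / next_dones buffers are illustrative, not the exact variable names in the PR; the actual implementation lives in the PR code:

```python
import torch

def compute_gae(rewards, values, next_values, next_terminations, next_dones,
                gamma=0.99, gae_lambda=0.95):
    """Sketch of GAE with separate termination / truncation handling.

    All arguments are float tensors of shape (num_steps, num_envs).
    next_values[t] must be the value of the *actual* next state, i.e. the value
    of infos["final_observation"] when step t was truncated.
    """
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    lastgaelam = torch.zeros_like(rewards[0])
    for t in reversed(range(num_steps)):
        # Bootstrapping is cut only by true termination; a truncated step still
        # uses the value of its real next state.
        nextnonterminal = 1.0 - next_terminations[t]
        delta = rewards[t] + gamma * next_values[t] * nextnonterminal - values[t]
        # The advantage recursion is cut by done (termination OR truncation),
        # because the following step belongs to a new episode.
        nextnondone = 1.0 - next_dones[t]
        lastgaelam = delta + gamma * gae_lambda * nextnondone * lastgaelam
        advantages[t] = lastgaelam
    returns = advantages + values
    return advantages, returns
```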

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the tests accordingly (if applicable).
  • I have updated the documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers.

If you need to run benchmark experiments for a performance-impacting change:

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team.
  • I have used the benchmark utility to submit the tracked experiments to the openrlbenchmark/cleanrl W&B project, optionally with --capture_video.
  • I have performed RLops with python -m openrlbenchmark.rlops.
    • For new feature or bug fix:
      • I have used the RLops utility to understand the performance impact of the changes and confirmed there is no regression.
    • For new algorithm:
      • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves generated by the python -m openrlbenchmark.rlops utility to the documentation.
    • I have added links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.


@vwxyzjn (Owner) commented Apr 14, 2024

Hi @Josh00-Lu, thanks for the PR. Have you had cases where handling termination/truncation made a huge difference in performance?

@Josh00-Lu (Author) commented Apr 15, 2024

> Hi @Josh00-Lu, thanks for the PR. Have you had cases where handling termination/truncation made a huge difference in performance?

Yes, especially for MuJoCo environments like Swimmer, where truncations are encountered far more often than terminations. When termination is predominant, the problem is less obvious (but explicit handling does not hurt performance either). I think this improvement is intuitive: it lets GAE perform more reasonable temporal credit assignment in PPO, since it explicitly distinguishes termination from truncation. Admittedly, extensive experiments at scale may be needed to verify this.

@gmarkkula commented

+1 on seeing cases where it makes a big difference. CleanRL consistently performed much worse than SB3 on one of my environments, but when I added the bootstrapping from terminal observation to CleanRL it achieved the same performance as SB3. See this comment in another thread on this topic.

@varadVaidya commented

+1 on massive improvements using the PR's changes. I have a custom quadrotor environment that works well with SB3, but CleanRL's implementation gave poor performance, including catastrophic forgetting of progress on the best seeds (the seeds that gave the best performance with SB3). With the PR's correct handling of truncation, I get matching performance w.r.t. SB3.

@Josh00-Lu (Author) commented Apr 23, 2024

> +1 on massive improvements using the PR's changes. I have a custom quadrotor environment that works well with SB3, but CleanRL's implementation gave poor performance, including catastrophic forgetting of progress on the best seeds (the seeds that gave the best performance with SB3). With the PR's correct handling of truncation, I get matching performance w.r.t. SB3.

Thank you. It's great to see that!

@sdpkjc (Collaborator) commented Apr 23, 2024

Thank you for this PR! I had previously noticed the issue and made a simple modification to the CleanRL PPO by referencing the fix done in DLR-RM/stable-baselines3#658. Here is my code: https://gist.github.com/sdpkjc/71cbba3bd439d754ef9c026c31658ecd#file-ppo_continuous_action-py-L431-L437

It only requires adding a few lines to the existing code.
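For context, that style of fix amounts to a few lines in the rollout loop right after envs.step(...). The sketch below is only an approximation of the idea, not the exact content of the gist; agent.get_value, device, rewards, step, and args.gamma are assumed to follow CleanRL's PPO script:

```python
# Sketch: bootstrap the value of the final observation when an episode was truncated.
# Assumes: next_obs, reward, terminations, truncations, infos = envs.step(action)
if "final_observation" in infos:
    for idx, trunc in enumerate(truncations):
        final_obs = infos["final_observation"][idx]
        if trunc and final_obs is not None:
            with torch.no_grad():
                obs_tensor = torch.as_tensor(final_obs, dtype=torch.float32, device=device)
                terminal_value = agent.get_value(obs_tensor.unsqueeze(0)).item()
            # Fold the bootstrapped value into this step's reward so the stored
            # transition accounts for the return that the time limit cut off.
            rewards[step, idx] += args.gamma * terminal_value
```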

I found that this modification offers limited performance improvements in the default Mujoco environments, but it yields benefits in scenarios that frequently involve truncation.

I also noticed that Costa @vwxyzjn seems to have been interested in this issue previously. In his blog post https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/, he mentioned this problem with PPO. However, he opted not to fix it in his implementation for the sake of fidelity to the original algorithm.


Whether or not to address this issue depends on whether CleanRL prioritizes higher performance or a more faithful implementation of the original algorithm. This is a topic worthy of our discussion. As far as I know, sb3 has addressed this issue.

Perhaps we could consider adding an argument to allow users to choose whether to handle this correctly.

@vwxyzjn (Owner) commented Apr 23, 2024

Thanks all for the discussion. I had opted not to implement the correct behavior, partly because at some point I did a comprehensive evaluation and it did not seem to matter. There are also issues with implementing it efficiently: the current Gymnasium design is inefficient in that it requires extra forward passes whenever there are terminal states.

sail-sg/envpool#194 (comment)

@Josh00-Lu closed this on May 27, 2024
@pseudo-rnd-thoughts (Collaborator) commented May 28, 2024

Sorry for the late message. Personally, I would advocate for a separate file with this implementation if the original isn't meant to be changed. That would spare the people reporting the problem above from having to make the change themselves.
Second, @vwxyzjn, Gymnasium v1.0.0 updates the vector implementation to match EnvPool's autoreset style, i.e., there is no more final_observation, so most of CleanRL will need to be updated to reflect this change anyway. This means the change suggested here will not introduce the additional forward pass.

Therefore, my suggestion would be to reopen this PR, include it in a new file (ppo_truncted.py?), and update to Gymnasium v1.0.0 within it or afterwards.

@Josh00-Lu (Author) commented May 29, 2024

> Sorry for the late message. Personally, I would advocate for a separate file with this implementation if the original isn't meant to be changed. That would spare the people reporting the problem above from having to make the change themselves. Second, @vwxyzjn, Gymnasium v1.0.0 updates the vector implementation to match EnvPool's autoreset style, i.e., there is no more final_observation, so most of CleanRL will need to be updated to reflect this change anyway. This means the change suggested here will not introduce the additional forward pass.
>
> Therefore, my suggestion would be to reopen this PR, include it in a new file (ppo_truncted.py?), and update to Gymnasium v1.0.0 within it or afterwards.

I have re-opened this PR and put the changes in a new file called ppo_continuous_action_truncted.py, keeping ppo_continuous_action.py unchanged. If the other files in this repository are updated to Gymnasium v1.0.0, I will adjust this file accordingly.
