Add support for DPPO [WIP] #5065

Closed
catherinelee274 wants to merge 3 commits into huggingface:main from catherinelee274:clee_dppo

Conversation

@catherinelee274

What does this PR do?

Fixes #4998

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@catherinelee274 changed the title from "Add support for DPPO" to "Add support for DPPO [WIP]" on Feb 11, 2026
@@ -0,0 +1,164 @@
# DPPO Trainer

TRL supports the Decoupled Proximal Policy Optimization (DPPO) algorithm, which is a variant of PPO that decouples the optimization of the policy and value function for improved training stability. This implementation is based on the [Stable-RL](https://github.com/sail-sg/Stable-RL) paper.

It is "Divergence Proximal Policy Optimization".

TRL supports the Divergence Proximal Policy Optimization (DPPO) algorithm, which is a variant of PPO that substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL) for improved training efficiency and stability.
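To make the contrast in the suggested wording concrete, here is a minimal sketch, not taken from this PR or from the Stable-RL code. It compares the usual PPO ratio clip with a hypothetical divergence-gated term that keeps a sample only while an estimated Total Variation distance between the new and old policies stays under a threshold. The function names, the `delta` threshold, and the choice to zero out gated samples are all assumptions for illustration; the actual DPPO objective may differ.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)


def total_variation(p, q):
    """Total Variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def divergence_gated_term(probs_new, probs_old, action, advantage, delta=0.1):
    """Hypothetical divergence-gated surrogate (an assumption, not DPPO itself).

    The unclipped ratio term is kept only while the TV distance between the
    new and old action distributions is at most `delta`; otherwise the sample
    contributes zero.
    """
    if total_variation(probs_new, probs_old) > delta:
        return 0.0
    ratio = probs_new[action] / probs_old[action]
    return ratio * advantage


# Small policy shift: the gate stays open and the unclipped term is used.
small_shift = divergence_gated_term([0.52, 0.48], [0.5, 0.5], action=0, advantage=2.0)

# Large policy shift: the gate closes, whereas PPO would clip the ratio instead.
large_shift = divergence_gated_term([0.9, 0.1], [0.5, 0.5], action=0, advantage=1.0)
ppo_large = ppo_clip_term(math.log(0.9), math.log(0.5), advantage=1.0)
```

The point of the sketch is only that the gate depends on a distance between whole distributions rather than on the sampled action's probability ratio alone.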

@catherinelee274 (Author), Feb 27, 2026

TRL has a separate PR for this. I will be closing in favor of that.

@qgallouedec (Member)

Thanks for the contribution, @catherinelee274! This has been a WIP draft for a while without further updates and now conflicts heavily with main. Closing for housekeeping; please feel free to reopen with a fresh branch if you'd like to continue.

Development

Successfully merging this pull request may close these issues.

DPPO - Divergence Proximal Policy Optimization

3 participants