
Conversation

@lehaoqu

@lehaoqu lehaoqu commented Oct 26, 2025

Add Dual-Clip PPO, which uses clip_ratio_c to further clip the ratio when the advantage is negative.
The loss formula of Dual-Clip PPO is as follows:

$$ loss = \begin{cases} \min(\max(-A\times ratio, -A\times clip(ratio, 1-\epsilon_{low}, 1+\epsilon_{high})), -A\times \text{Clip Ratio C}), & A < 0 \\ \max(-A\times ratio, -A\times clip(ratio, 1-\epsilon_{low}, 1+\epsilon_{high})), & A \geq 0 \end{cases} $$
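
For illustration, a minimal PyTorch sketch of this loss (this is only a sketch, not the PR's DualClipPPOPolicyLossFn; the function name and default values are assumptions):

import torch

def dual_clip_ppo_loss(logprob, old_logprob, advantages,
                       eps_low=0.2, eps_high=0.2, clip_ratio_c=3.0):
    # Importance sampling ratio.
    ratio = torch.exp(logprob - old_logprob)
    # Standard PPO clipped objective: max(-A * ratio, -A * clip(ratio, 1-eps_low, 1+eps_high)).
    loss_unclipped = -advantages * ratio
    loss_clipped = -advantages * torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    ppo_loss = torch.maximum(loss_unclipped, loss_clipped)
    # Dual-clip bound, applied only where the advantage is negative:
    # min(ppo_loss, -A * clip_ratio_c).
    dual_clip_loss = torch.minimum(ppo_loss, -advantages * clip_ratio_c)
    return torch.where(advantages < 0, dual_clip_loss, ppo_loss)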

@gemini-code-assist
Contributor

Summary of Changes

Hello @Qwtdgh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Dual-Clip Proximal Policy Optimization (PPO) algorithm, an advanced variant of PPO designed to enhance policy stability, particularly when dealing with negative advantages. The core change involves a new policy loss function that applies an additional clipping mechanism, controlled by clip_ratio_c, to prevent policy gradient explosion in specific scenarios. The implementation is thoroughly integrated into the policy loss function framework and validated with a new unit test.

Highlights

  • Dual-Clip PPO Implementation: Introduced the Dual-Clip PPO algorithm, which modifies the standard PPO loss by incorporating clip_ratio_c for improved handling of negative advantages.
  • New Policy Loss Function: Added DualClipPPOPolicyLossFn to implement the specific loss calculation as described in the paper, including the conditional clipping logic.
  • Module Integration: Registered the new DualClipPPOPolicyLossFn within the existing POLICY_LOSS_FN registry and made it discoverable.
  • Unit Testing: Included a dedicated unit test, test_dcppo_policy_loss, to ensure the correct behavior and numerical stability of the Dual-Clip PPO loss function.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the Dual-Clip PPO policy loss function, a variant of PPO designed to improve stability when advantages are negative. The implementation correctly follows the logic described in the paper and the PR description. A corresponding unit test is also added. My review includes suggestions to refactor the __init__ method for better readability and type safety, which in turn allows for the removal of several type: ignore comments. I've also pointed out minor issues like a debug print statement in tests and stylistic improvements.

@garyzhang99
Collaborator

Hi, thank you for your contribution!

I noticed a typo in the PR description formula; it should be $-A\times \text{Clip Ratio C}$ instead of $-A\times ratio\times \text{Clip Ratio C}$ in the first case.

Also, I think the functionality of this PR overlaps with #334. The implementations should be equivalent when setting truncate_is_range_high == clip_ratio_c in #334, assuming 1.0 + clip_range_high < truncate_is_range_high.

Specifically:

When truncate_is_range_high == clip_ratio_c and the constraint above holds, both approaches achieve the same loss values.
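
For reference, a quick numerical sanity check of this claimed equivalence (just a sketch; it additionally assumes the low-side truncation of #334 is inactive, i.e. truncate_is_range_low = 0.0, which is a no-op since IS ratios are always positive):

import torch

eps_low, eps_high = 0.2, 0.2
trunc_low, trunc_high = 0.0, 3.0   # truncate_is_range_high == clip_ratio_c
clip_ratio_c = trunc_high          # and 1 + eps_high < trunc_high holds

adv = torch.randn(10000)
ratio = torch.exp(torch.randn(10000))  # positive IS ratios

# #334 style: truncate the ratio first, then apply the PPO clip.
ratio_t = torch.clamp(ratio, trunc_low, trunc_high)
loss_334 = torch.maximum(-adv * ratio_t,
                         -adv * torch.clamp(ratio_t, 1 - eps_low, 1 + eps_high))

# Dual-Clip PPO (this PR): PPO clip first, extra clip where the advantage is negative.
ppo = torch.maximum(-adv * ratio,
                    -adv * torch.clamp(ratio, 1 - eps_low, 1 + eps_high))
dual = torch.where(adv < 0, torch.minimum(ppo, -adv * clip_ratio_c), ppo)

assert torch.allclose(loss_334, dual)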

Should we consider consolidating these two PRs?

@lehaoqu
Author

lehaoqu commented Oct 27, 2025

Yes @garyzhang99. In addition, this PR decomposes the truncation of pg_loss along the advantage dimension while remaining compatible with #334.
I convert #334 into clipping followed by truncating, following verl/core_algos.py, and further truncate pg_loss based on the sign of the advantage.

#334 performs truncation and clipping in the following order:

  1. Truncate the IS ratio
  2. Clip the -A * IS ratio

However, looking at the implementation of the PPO loss in verl/core_algos.py below, we can see that it first clips -A * IS ratio and then truncates -A * IS ratio when A is negative, to prevent -A * IS ratio from becoming too large. The verl PPO loss is calculated in the following order:

  1. Clip the -A * IS ratio
  2. Truncate the -A * IS ratio
negative_approx_kl = log_prob - old_log_prob
# Clamp negative_approx_kl for stability
negative_approx_kl = torch.clamp(negative_approx_kl, min=-20.0, max=20.0)
ratio = torch.exp(negative_approx_kl)
ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)

#######################################
# 1. Clip the -A * ratio
pg_losses1 = -advantages * ratio
if cliprange_low is None:
    cliprange_low = cliprange
if cliprange_high is None:
    cliprange_high = cliprange
pg_losses2 = -advantages * torch.clamp(
    ratio, 1 - cliprange_low, 1 + cliprange_high
)  # - clip(ratio, 1-cliprange, 1+cliprange) * A
clip_pg_losses1 = torch.maximum(
    pg_losses1, pg_losses2
)  # max(-ratio * A, -clip(ratio, 1-cliprange, 1+cliprange) * A)
pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)

#######################################
# 2. Truncate the -A * ratio
pg_losses3 = -advantages * clip_ratio_c
clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
pg_clipfrac_lower = verl_F.masked_mean(
    torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask
)

pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
pg_loss = agg_loss(loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)

Therefore, to improve the readability of the implementation and to decompose the loss by the sign of the advantage, I convert #334 into clipping followed by truncating.

Notably, the formula of #334 is:

$$ \begin{aligned} ratio' &= clip(ratio, trunc_{low}, trunc_{high})\\ loss &= \max(-A\times ratio', -A\times clip(ratio', 1-\epsilon_{low}, 1+\epsilon_{high})) \end{aligned} $$

To convert it to clipping followed by truncating, first decompose the loss by the sign of the advantage:

$$ loss = \begin{cases} -A\times clip(ratio, trunc_{low}, \min(trunc_{high}, 1+\epsilon_{high})), & A>0 \\ -A\times clip(ratio, \max(trunc_{low}, 1-\epsilon_{low}), trunc_{high}), & A<0 \\ 0, & A=0 \end{cases} $$

It is intuitive that $trunc_{high} > 1+\epsilon_{high}$ and $trunc_{low} < 1-\epsilon_{low}$. The formula is then simplified as follows:

$$ loss = \begin{cases} -A\times clip(ratio, trunc_{low}, 1+\epsilon_{high}), & A>0 \\ -A\times clip(ratio, 1-\epsilon_{low}, trunc_{high}), & A<0 \\ 0, & A=0 \end{cases} $$

The above formula is equivalent to the following formula, which first clips -A * IS ratio based on $\epsilon$ and then truncates -A * IS ratio.

$$ loss = \begin{cases} \min(\max(-A\times ratio, -A\times clip(ratio, 1-\epsilon_{low}, 1+\epsilon_{high} )), -A\times trunc_{low}), & A>0 \\ \min(\max(-A\times ratio, -A\times clip(ratio, 1-\epsilon_{low}, 1+\epsilon_{high} )), -A\times trunc_{high}), & A<0 \\ 0, & A=0 \end{cases} $$

Thus, the loss in #334 depends only on $trunc_{low}$ and $\epsilon_{high}$ when the advantage is positive, and only on $trunc_{high}$ and $\epsilon_{low}$ when the advantage is negative. The following image shows the effective IS ratio curve.
IS ratio of #334
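
A quick numerical check of this conversion (a sketch only; the concrete bound values below are example numbers, not the PR's defaults):

import torch

eps_low, eps_high = 0.2, 0.2
trunc_low, trunc_high = 0.5, 3.0   # trunc_low < 1 - eps_low, trunc_high > 1 + eps_high

adv = torch.randn(10000)
ratio = torch.exp(torch.randn(10000))

# Original #334 form: truncate the ratio, then apply the PPO clip.
ratio_t = torch.clamp(ratio, trunc_low, trunc_high)
loss_truncate_then_clip = torch.maximum(
    -adv * ratio_t, -adv * torch.clamp(ratio_t, 1 - eps_low, 1 + eps_high)
)

# Converted form: PPO clip first, then truncate by the sign of the advantage.
ppo = torch.maximum(-adv * ratio, -adv * torch.clamp(ratio, 1 - eps_low, 1 + eps_high))
loss_clip_then_truncate = torch.where(
    adv > 0,
    torch.minimum(ppo, -adv * trunc_low),
    torch.where(adv < 0, torch.minimum(ppo, -adv * trunc_high), ppo),
)

assert torch.allclose(loss_truncate_then_clip, loss_clip_then_truncate)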

To further decompose along the advantage dimension, we need the following two arguments to replace truncate_large_is; they indicate whether to apply truncation when the advantage is positive and when it is negative:

  • truncate_adv_pos_is (bool)
  • truncate_adv_neg_is (bool)

Thus, when we set truncate_adv_pos_is=False, we do not need to care about the value of truncate_is_range_low; and when we set truncate_adv_neg_is=False, we do not need to care about the value of truncate_is_range_high.

For example:

truncate_adv_pos_is: false
truncate_adv_neg_is: true
truncate_is_range_high: 2.0
clip_is_range_low: 0.2
clip_is_range_high: 0.2

truncate_adv_pos_is: true
truncate_adv_neg_is: false
truncate_is_range_low: 0.0
clip_is_range_low: 0.2
clip_is_range_high: 0.2

I have rebased #344 into this PR. The main implementation of pg_loss for the different signs of the advantage is as follows:

# First clipping by clip_range, and calculate pg_clipfrac
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(
    ratio, 1.0 - self.clip_range_low, 1.0 + self.clip_range_high  # type: ignore
)
# Running per-token loss; further truncated below based on the advantage sign
pg_losses_trunc = torch.maximum(pg_losses1, pg_losses2)
pg_clipfrac = masked_mean(torch.gt(pg_losses2, pg_losses1).float(), action_mask)

# Add IS truncation for positive advantages
if self.truncate_adv_pos_is:
    pg_losses_pos_trunc = -advantages * self.truncate_is_range_low
    pg_truncfrac_pos = masked_mean(
        torch.lt(pg_losses_pos_trunc, pg_losses_trunc) * (advantages > 0).float(),
        action_mask,
    )
    pg_losses_pos = torch.minimum(pg_losses_trunc, pg_losses_pos_trunc)
    pg_losses_trunc = torch.where(advantages > 0, pg_losses_pos, pg_losses_trunc)

# Add IS truncation for negative advantages
if self.truncate_adv_neg_is:
    pg_losses_neg_trunc = -advantages * self.truncate_is_range_high
    pg_truncfrac_neg = masked_mean(
        torch.lt(pg_losses_neg_trunc, pg_losses_trunc) * (advantages < 0).float(),
        action_mask,
    )
    pg_losses_neg = torch.minimum(pg_losses_trunc, pg_losses_neg_trunc)
    pg_losses_trunc = torch.where(advantages < 0, pg_losses_neg, pg_losses_trunc)
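
For completeness, the excerpt ends before the final reduction. Presumably the per-token pg_losses_trunc is then aggregated with the same masked_mean helper used above; a self-contained sketch of that last step (the masked_mean definition here is illustrative, not Trinity's actual helper):

import torch

def masked_mean(values, mask):
    # Mean over the positions selected by the (0/1) mask.
    return (values * mask).sum() / mask.sum().clamp(min=1)

# Toy shapes: (batch, seq_len)
pg_losses_trunc = torch.randn(2, 4)
action_mask = torch.tensor([[1., 1., 1., 0.], [1., 1., 0., 0.]])
pg_loss = masked_mean(pg_losses_trunc, action_mask)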

@lehaoqu lehaoqu changed the title Add Dual-Clip PPO [Feature]Add Dual-Clip PPO Oct 27, 2025
truncate based on the sign of advantage after ratio clip
@lehaoqu lehaoqu changed the title [Feature]Add Dual-Clip PPO [Feature] Truncate based on the sign of advantage after ratio clip Oct 28, 2025
@lehaoqu lehaoqu changed the title [Feature] Truncate based on the sign of advantage after ratio clip [Feature] Truncate based on the sign of advantage after clipping Oct 28, 2025
Collaborator

@garyzhang99 garyzhang99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me. Let me actually run the algorithm for performance comparison before merging.

@lehaoqu lehaoqu requested a review from garyzhang99 October 28, 2025 03:48
@garyzhang99
Collaborator

/unittest-module-algorithm

@github-actions

Summary

Tests 📝: 15 · Passed ✅: 15 · Failed ❌: 0 · Skipped ⏭️: 0 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 4ms

Tests

Test Name Duration
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_std_grpo 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_step_wise_grpo_advantage 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_duplicate_grpo 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_advantage 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_correct_bias 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_reward_std 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_step_wise_grpo_advantage 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_dpo_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_gspo_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_mix_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_opmd_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss_with_truncate_adv_neg_is 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss_with_truncate_adv_pos_is 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_sft_policy_loss 1ms

Github Test Reporter by CTRF 💚

Collaborator

@yanxi-chen yanxi-chen left a comment


Please see the inline suggestions for improving code format and consistency. Also make sure to check code format by running pre-commit run --all-files.

assert (
    self.truncate_is_range_low >= 0.0
), "truncate_is_range_low must be non-negative."
assert (self.truncate_is_range_low < 1.0-self.clip_range_low

Suggested change
assert (self.truncate_is_range_low < 1.0-self.clip_range_low
assert (self.truncate_is_range_low < 1.0 - self.clip_range_low

    self.truncate_is_range_high is not None
), "truncate_is_range_high must be specified."
assert (
    self.truncate_is_range_high > 1.0+self.clip_range_high

Suggested change
self.truncate_is_range_high > 1.0+self.clip_range_high
self.truncate_is_range_high > 1.0 + self.clip_range_high

if self.truncate_adv_pos_is:
    pg_losses_pos_trunc = -advantages * self.truncate_is_range_low
    pg_truncfrac_pos = masked_mean(
        torch.lt(pg_losses_pos_trunc, pg_losses_trunc) * (advantages > 0).float(),

Suggested change
torch.lt(pg_losses_pos_trunc, pg_losses_trunc) * (advantages > 0).float(),
torch.lt(pg_losses_pos_trunc, pg_losses_trunc).float() * (advantages > 0),

if self.truncate_adv_neg_is:
    pg_losses_neg_trunc = -advantages * self.truncate_is_range_high
    pg_truncfrac_neg = masked_mean(
        torch.lt(pg_losses_neg_trunc, pg_losses_trunc) * (advantages < 0).float(),

Suggested change
torch.lt(pg_losses_neg_trunc, pg_losses_trunc) * (advantages < 0).float(),
torch.lt(pg_losses_neg_trunc, pg_losses_trunc).float() * (advantages < 0),

@yanxi-chen
Collaborator

Thanks for contributing @lehaoqu! Just want to say that we might need to take this PR slow and make sure everything is perfect, since ppo_policy_loss plays such a central role in the whole Trinity system :)
