
Conversation

@lehaoqu

@lehaoqu lehaoqu commented Oct 26, 2025

Add Dual-Clip PPO, which uses clip_ratio_c to further clip the ratio when the advantage is negative.
The loss formula of Dual-Clip PPO is as follows:

$$ loss = \begin{cases} \min(\max(-A\times ratio, -A\times clip(ratio, 1-\epsilon_{low}, 1+\epsilon_{high})), -A\times \text{Clip Ratio C}), & A < 0 \\ \max(-A\times ratio, -A\times clip(ratio, 1-\epsilon_{low}, 1+\epsilon_{high})), & A \geq 0 \end{cases} $$
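
For illustration, a minimal PyTorch sketch of this loss (this is only a sketch, not the PR's DualClipPPOPolicyLossFn; the function name and default values are assumptions):

import torch

def dual_clip_ppo_loss(logprob, old_logprob, advantages,
                       eps_low=0.2, eps_high=0.2, clip_ratio_c=3.0):
    # Importance sampling ratio.
    ratio = torch.exp(logprob - old_logprob)
    # Standard PPO clipped objective: max(-A * ratio, -A * clip(ratio, 1-eps_low, 1+eps_high)).
    loss_unclipped = -advantages * ratio
    loss_clipped = -advantages * torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    ppo_loss = torch.maximum(loss_unclipped, loss_clipped)
    # Dual-clip bound, applied only where the advantage is negative:
    # min(ppo_loss, -A * clip_ratio_c).
    dual_clip_loss = torch.minimum(ppo_loss, -advantages * clip_ratio_c)
    return torch.where(advantages < 0, dual_clip_loss, ppo_loss)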

@gemini-code-assist
Contributor

Summary of Changes

Hello @Qwtdgh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Dual-Clip Proximal Policy Optimization (PPO) algorithm, an advanced variant of PPO designed to enhance policy stability, particularly when dealing with negative advantages. The core change involves a new policy loss function that applies an additional clipping mechanism, controlled by clip_ratio_c, to prevent policy gradient explosion in specific scenarios. The implementation is thoroughly integrated into the policy loss function framework and validated with a new unit test.

Highlights

  • Dual-Clip PPO Implementation: Introduced the Dual-Clip PPO algorithm, which modifies the standard PPO loss by incorporating clip_ratio_c for improved handling of negative advantages.
  • New Policy Loss Function: Added DualClipPPOPolicyLossFn to implement the specific loss calculation as described in the paper, including the conditional clipping logic.
  • Module Integration: Registered the new DualClipPPOPolicyLossFn within the existing POLICY_LOSS_FN registry and made it discoverable.
  • Unit Testing: Included a dedicated unit test, test_dcppo_policy_loss, to ensure the correct behavior and numerical stability of the Dual-Clip PPO loss function.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the Dual-Clip PPO policy loss function, a variant of PPO designed to improve stability when advantages are negative. The implementation correctly follows the logic described in the paper and the PR description. A corresponding unit test is also added. My review includes suggestions to refactor the __init__ method for better readability and type safety, which in turn allows for the removal of several type: ignore comments. I've also pointed out minor issues like a debug print statement in tests and stylistic improvements.

@garyzhang99
Collaborator

Hi, thank you for your contribution!

I noticed a typo in the PR description formula; it should be $-A\times \text{Clip Ratio C}$ instead of $-A\times ratio\times \text{Clip Ratio C}$ in the first case.

Also, I think the functionality of this PR overlaps with #334. The implementations should be equivalent when setting truncate_is_range_high == clip_ratio_c in #334, assuming 1.0 + clip_range_high < truncate_is_range_high.

Specifically:

When truncate_is_range_high == clip_ratio_c and the constraint above holds, both approaches achieve the same loss values.
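
For reference, a quick numerical sanity check of this claimed equivalence (just a sketch; it additionally assumes the low-side truncation of #334 is inactive, i.e. truncate_is_range_low = 0.0, which is a no-op since IS ratios are always positive):

import torch

eps_low, eps_high = 0.2, 0.2
trunc_low, trunc_high = 0.0, 3.0   # truncate_is_range_high == clip_ratio_c
clip_ratio_c = trunc_high          # and 1 + eps_high < trunc_high holds

adv = torch.randn(10000)
ratio = torch.exp(torch.randn(10000))  # positive IS ratios

# #334 style: truncate the ratio first, then apply the PPO clip.
ratio_t = torch.clamp(ratio, trunc_low, trunc_high)
loss_334 = torch.maximum(-adv * ratio_t,
                         -adv * torch.clamp(ratio_t, 1 - eps_low, 1 + eps_high))

# Dual-Clip PPO (this PR): PPO clip first, extra clip where the advantage is negative.
ppo = torch.maximum(-adv * ratio,
                    -adv * torch.clamp(ratio, 1 - eps_low, 1 + eps_high))
dual = torch.where(adv < 0, torch.minimum(ppo, -adv * clip_ratio_c), ppo)

assert torch.allclose(loss_334, dual)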

Should we consider consolidating these two PRs?

@lehaoqu
Author

lehaoqu commented Oct 27, 2025

Yes @garyzhang99. In addition, this PR decomposes the truncation of pg_loss along the advantage dimension while remaining compatible with #334.
I convert #334 into clipping followed by truncating, following verl/core_algos.py, and further truncate pg_loss based on the sign of the advantage.

#334 performs truncation and clipping in the following order:

  1. Truncate the IS ratio
  2. Clip the -A * IS ratio

However, looking at the implementation of the PPO loss in verl/core_algos.py below, we can see that it first clips -A * IS ratio and then truncates -A * IS ratio when A is negative, to prevent -A * IS ratio from becoming too large. The verl PPO loss is calculated in the following order:

  1. Clip the -A * IS ratio
  2. Truncate the -A * IS ratio
negative_approx_kl = log_prob - old_log_prob
# Clamp negative_approx_kl for stability
negative_approx_kl = torch.clamp(negative_approx_kl, min=-20.0, max=20.0)
ratio = torch.exp(negative_approx_kl)
ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)

#######################################
# 1. Clip the -A * ratio
pg_losses1 = -advantages * ratio
if cliprange_low is None:
    cliprange_low = cliprange
if cliprange_high is None:
    cliprange_high = cliprange
pg_losses2 = -advantages * torch.clamp(
    ratio, 1 - cliprange_low, 1 + cliprange_high
)  # - clip(ratio, 1-cliprange, 1+cliprange) * A
clip_pg_losses1 = torch.maximum(
    pg_losses1, pg_losses2
)  # max(-ratio * A, -clip(ratio, 1-cliprange, 1+cliprange) * A)
pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)

#######################################
# 2. Truncate the -A * ratio
pg_losses3 = -advantages * clip_ratio_c
clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
pg_clipfrac_lower = verl_F.masked_mean(
    torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask
)

pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
pg_loss = agg_loss(loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)

Therefore, to improve the readability of the implementation and to decompose the loss by the sign of the advantage, I convert #334 into clipping followed by truncating.

Notably, the formula of #334 is:

$$ \begin{aligned} ratio' &= clip(ratio, trunc_{low}, trunc_{high})\\ loss &= \max(-A\times ratio', -A\times clip(ratio', 1-\epsilon_{low}, 1+\epsilon_{high})) \end{aligned} $$

To convert it to clipping followed by truncating, first decompose the loss by the sign of the advantage:

$$ loss = \begin{cases} -A\times clip(ratio, trunc_{low}, \min(trunc_{high}, 1+\epsilon_{high})), & A>0 \\ -A\times clip(ratio, \max(trunc_{low}, 1-\epsilon_{low}), trunc_{high}), & A<0 \\ 0, & A=0 \end{cases} $$

It is intuitive that $trunc_{high} > 1+\epsilon_{high}$ and $trunc_{low} < 1-\epsilon_{low}$. The formula is then simplified as follows:

$$ loss = \begin{cases} -A\times clip(ratio, trunc_{low}, 1+\epsilon_{high}), & A>0 \\ -A\times clip(ratio, 1-\epsilon_{low}, trunc_{high}), & A<0 \\ 0, & A=0 \end{cases} $$

The above formula is equivalent to the following formula, which first clips -A * IS ratio based on $\epsilon$ and then truncates -A * IS ratio.

$$ loss = \begin{cases} \min(\max(-A\times ratio, -A\times clip(ratio, 1-\epsilon_{low}, 1+\epsilon_{high} )), -A\times trunc_{low}), & A>0 \\ \min(\max(-A\times ratio, -A\times clip(ratio, 1-\epsilon_{low}, 1+\epsilon_{high} )), -A\times trunc_{high}), & A<0 \\ 0, & A=0 \end{cases} $$

Thus, the loss in #334 depends only on $trunc_{low}$ and $\epsilon_{high}$ when the advantage is positive, and only on $trunc_{high}$ and $\epsilon_{low}$ when the advantage is negative. The following image shows the effective IS ratio curve.
IS ratio of #334
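
A quick numerical check of this conversion (a sketch only; the concrete bound values below are example numbers, not the PR's defaults):

import torch

eps_low, eps_high = 0.2, 0.2
trunc_low, trunc_high = 0.5, 3.0   # trunc_low < 1 - eps_low, trunc_high > 1 + eps_high

adv = torch.randn(10000)
ratio = torch.exp(torch.randn(10000))

# Original #334 form: truncate the ratio, then apply the PPO clip.
ratio_t = torch.clamp(ratio, trunc_low, trunc_high)
loss_truncate_then_clip = torch.maximum(
    -adv * ratio_t, -adv * torch.clamp(ratio_t, 1 - eps_low, 1 + eps_high)
)

# Converted form: PPO clip first, then truncate by the sign of the advantage.
ppo = torch.maximum(-adv * ratio, -adv * torch.clamp(ratio, 1 - eps_low, 1 + eps_high))
loss_clip_then_truncate = torch.where(
    adv > 0,
    torch.minimum(ppo, -adv * trunc_low),
    torch.where(adv < 0, torch.minimum(ppo, -adv * trunc_high), ppo),
)

assert torch.allclose(loss_truncate_then_clip, loss_clip_then_truncate)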

To further decompose along the advantage dimension, we need the following two arguments to replace truncate_large_is; they indicate whether to apply truncation when the advantage is positive and when it is negative:

  • truncate_adv_pos_is (bool)
  • truncate_adv_neg_is (bool)

Thus, when we set truncate_adv_pos_is=False, we do not need to care about the value of truncate_is_range_low; and when we set truncate_adv_neg_is=False, we do not need to care about the value of truncate_is_range_high.

For example:

truncate_adv_pos_is: false
truncate_adv_neg_is: true
truncate_is_range_high: 2.0
clip_is_range_low: 0.2
clip_is_range_high: 0.2

truncate_adv_pos_is: true
truncate_adv_neg_is: false
truncate_is_range_low: 0.0
clip_is_range_low: 0.2
clip_is_range_high: 0.2

I have rebased #344 into this PR. The main implementation of pg_loss for the different signs of the advantage is as follows:

# First clipping by clip_range, and calculate pg_clipfrac
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(
    ratio, 1.0 - self.clip_range_low, 1.0 + self.clip_range_high  # type: ignore
)
# Running per-token loss; further truncated below based on the advantage sign
pg_losses_trunc = torch.maximum(pg_losses1, pg_losses2)
pg_clipfrac = masked_mean(torch.gt(pg_losses2, pg_losses1).float(), action_mask)

# Add IS truncation for positive advantages
if self.truncate_adv_pos_is:
    pg_losses_pos_trunc = -advantages * self.truncate_is_range_low
    pg_truncfrac_pos = masked_mean(
        torch.lt(pg_losses_pos_trunc, pg_losses_trunc) * (advantages > 0).float(),
        action_mask,
    )
    pg_losses_pos = torch.minimum(pg_losses_trunc, pg_losses_pos_trunc)
    pg_losses_trunc = torch.where(advantages > 0, pg_losses_pos, pg_losses_trunc)

# Add IS truncation for negative advantages
if self.truncate_adv_neg_is:
    pg_losses_neg_trunc = -advantages * self.truncate_is_range_high
    pg_truncfrac_neg = masked_mean(
        torch.lt(pg_losses_neg_trunc, pg_losses_trunc) * (advantages < 0).float(),
        action_mask,
    )
    pg_losses_neg = torch.minimum(pg_losses_trunc, pg_losses_neg_trunc)
    pg_losses_trunc = torch.where(advantages < 0, pg_losses_neg, pg_losses_trunc)
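
For completeness, the excerpt ends before the final reduction. Presumably the per-token pg_losses_trunc is then aggregated with the same masked_mean helper used above; a self-contained sketch of that last step (the masked_mean definition here is illustrative, not Trinity's actual helper):

import torch

def masked_mean(values, mask):
    # Mean over the positions selected by the (0/1) mask.
    return (values * mask).sum() / mask.sum().clamp(min=1)

# Toy shapes: (batch, seq_len)
pg_losses_trunc = torch.randn(2, 4)
action_mask = torch.tensor([[1., 1., 1., 0.], [1., 1., 0., 0.]])
pg_loss = masked_mean(pg_losses_trunc, action_mask)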

@lehaoqu lehaoqu changed the title Add Dual-Clip PPO [Feature]Add Dual-Clip PPO Oct 27, 2025
truncate based on the sign of advantage after ratio clip
@lehaoqu lehaoqu changed the title [Feature]Add Dual-Clip PPO [Feature] Truncate based on the sign of advantage after ratio clip Oct 28, 2025
@lehaoqu lehaoqu changed the title [Feature] Truncate based on the sign of advantage after ratio clip [Feature] Truncate based on the sign of advantage after clipping Oct 28, 2025
Collaborator

@garyzhang99 garyzhang99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me. Let me actually run the algorithm for performance comparison before merging.

@lehaoqu lehaoqu requested a review from garyzhang99 October 28, 2025 03:48
@garyzhang99
Collaborator

/unittest-module-algorithm

@github-actions

Summary

Tests 📝: 15 · Passed ✅: 15 · Failed ❌: 0 · Skipped ⏭️: 0 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 4ms

Tests

Test Name Duration
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_std_grpo 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_step_wise_grpo_advantage 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_duplicate_grpo 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_advantage 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_correct_bias 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_reward_std 1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_step_wise_grpo_advantage 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_dpo_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_gspo_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_mix_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_opmd_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss_with_truncate_adv_neg_is 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss_with_truncate_adv_pos_is 1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_sft_policy_loss 1ms

Github Test Reporter by CTRF 💚

Collaborator

@yanxi-chen yanxi-chen left a comment


Please see the inline suggestions for improving code format and consistency. Also make sure to check code format by running pre-commit run --all-files.

assert (
    self.truncate_is_range_low >= 0.0
), "truncate_is_range_low must be non-negative."
assert (self.truncate_is_range_low < 1.0-self.clip_range_low

Suggested change
assert (self.truncate_is_range_low < 1.0-self.clip_range_low
assert (self.truncate_is_range_low < 1.0 - self.clip_range_low

    self.truncate_is_range_high is not None
), "truncate_is_range_high must be specified."
assert (
    self.truncate_is_range_high > 1.0+self.clip_range_high

Suggested change
self.truncate_is_range_high > 1.0+self.clip_range_high
self.truncate_is_range_high > 1.0 + self.clip_range_high

if self.truncate_adv_pos_is:
    pg_losses_pos_trunc = -advantages * self.truncate_is_range_low
    pg_truncfrac_pos = masked_mean(
        torch.lt(pg_losses_pos_trunc, pg_losses_trunc) * (advantages > 0).float(),

Suggested change
torch.lt(pg_losses_pos_trunc, pg_losses_trunc) * (advantages > 0).float(),
torch.lt(pg_losses_pos_trunc, pg_losses_trunc).float() * (advantages > 0),

if self.truncate_adv_neg_is:
    pg_losses_neg_trunc = -advantages * self.truncate_is_range_high
    pg_truncfrac_neg = masked_mean(
        torch.lt(pg_losses_neg_trunc, pg_losses_trunc) * (advantages < 0).float(),

Suggested change
torch.lt(pg_losses_neg_trunc, pg_losses_trunc) * (advantages < 0).float(),
torch.lt(pg_losses_neg_trunc, pg_losses_trunc).float() * (advantages < 0),

@yanxi-chen
Collaborator

Thanks for contributing @lehaoqu! Just want to say that we might need to take this PR slow and make sure everything is perfect, since ppo_policy_loss plays such a central role in the whole Trinity system :)
