[algo] fix: Add seq mean mask denominator option #4510
tongyx361 merged 12 commits into verl-project:main
Conversation
Code Review
This pull request introduces a useful `exclude_fully_masked_seq` option to control how the denominator is calculated for sequence-mean loss aggregation, and fixes a pre-existing issue where global batch information was not correctly propagated in `dp_actor.py` and `megatron_actor.py`. The changes are well-structured and clearly described.
My main feedback is a critical issue in `verl/trainer/ppo/core_algos.py`, where the calculation of `global_batch_size` for non-fully-masked sequences assumes a uniform distribution across data-parallel workers. This can lead to gradient mismatches and cause distributed training to fail. I've provided suggestions to use `torch.distributed.all_reduce` for a robust and correct implementation.
Loss/gradient aggregation in DP training (token losses -> DP-rank losses -> global loss) is tricky, and it has not been thoroughly considered in the codebase so far.
For now, the `dp_size` is calculated as:
- Distributed training frameworks sometimes apply aggregation implicitly. For example, FSDP by default mean-reduces (it sums the loss/gradient and divides by the `reduce_scatter` world size), which makes the loss/gradient in this PR's implementation shrink by `dp_size` from the targeted ground truth. Setting aside the numerical scale, multiplying this loss by `dp_size` first is an acceptable workaround. Besides, as far as I know, for FSDP2 in `torch>=2.8.0` we can resolve this by calling `set_gradient_divide_factor(1)`, but I am not sure about other setups like FSDP1 and Megatron.
- The "mean" strategy assumes that the batch sizes are even between DP ranks, but this is not always the case: 1) if the `seq_mask` is valid, each DP rank's `sum` is very likely to be uneven between DP ranks; 2) DP balancing (`balance_batch`) might be optimized to allow dispatching with uneven `batch_size` for better workload balance in the future. So the implementation using `all_reduce(SUM)` suggested by Gemini is indeed more robust (but still problematic; see the next point).
- If Ulysses SP is enabled, the data will be all-gathered within each USP group of `sp_size`, which might cause the global `batch_size` to be multiplied by `sp_size` if it is simply summed up with `all_reduce(SUM)` as Gemini suggests (while the original implementation takes care of USP).
cc @wuxibin89, maybe we can further improve the aggregation logic in the future.
For this PR individually, I think it can be approved because it is at least no worse than the original implementation: it adds an option under which the `seq_mask` does not affect the aggregation, avoiding the uneven case.
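The uneven-`seq_mask` case above can be sketched without any distributed setup. The snippet below is a plain-Python simulation (all names are illustrative, not verl's code): each "rank" holds a per-sequence mask, and we compare the original assume-uniform estimate against the `all_reduce(SUM)`-style count, here stood in for by a simple `sum` over per-rank counts.

```python
def local_valid_seqs(seq_mask):
    """Number of sequences on this rank that are not fully masked."""
    return sum(1 for m in seq_mask if m)

# Two DP ranks with uneven numbers of valid sequences.
rank_masks = [
    [True, True, True, False],    # rank 0: 3 valid of 4
    [True, False, False, False],  # rank 1: 1 valid of 4
]
dp_size = len(rank_masks)

# Assuming uniformity: each rank extrapolates local_count * dp_size,
# so different ranks compute different denominators.
uniform_estimates = [local_valid_seqs(m) * dp_size for m in rank_masks]

# Robust: all_reduce(SUM) of local counts -- simulated here by sum() --
# gives every rank the same global count.
global_count = sum(local_valid_seqs(m) for m in rank_masks)

print(uniform_estimates)  # [6, 2] -- ranks disagree, gradients mismatch
print(global_count)       # 4 -- consistent across ranks
```

This is why the uniform-distribution assumption breaks gradient consistency across ranks, while a summed global count does not (modulo the Ulysses SP double-counting caveat noted above).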
```diff
 loss_mask: micro batch loss mask, (bs, response_length)
 loss_agg_mode: method to aggregate the loss matrix into a scalar
-dp_size: data parallel size
+dp_size: data parallel size. When applying manual aggregation,
```
is it possible that we remove the `dp_size` from the algorithm loss function?
`dp_size` was brought into the loss by @wuxibin89
This comment is kept unresolved to show why @wuxibin89 introduced the `dp_size` below.
For example, we have `dp_size=2` and `num_micro_batches=2` (the gradient is accumulated across 2 micro batches in each DP group, then averaged across the 2 DP groups).
Then for
@vermouth1992 created an example for explanation: https://gist.github.com/vermouth1992/6c273240765c4f223478081042bfcd4a
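The `dp_size=2`, `num_micro_batches=2` setup can be worked through with toy numbers (a sketch, not verl's code; all names below are illustrative). The key point is that with gradient accumulation per rank plus a mean reduction across DP ranks, each micro-batch loss must be divided by the *per-rank* sequence count, i.e. `global_batch_size / dp_size`, for the result to equal the true global sequence mean.

```python
dp_size = 2
# seq_losses[rank][micro_batch] -> per-sequence losses (2 seqs per micro batch)
seq_losses = [
    [[1.0, 2.0], [3.0, 4.0]],  # rank 0
    [[5.0, 6.0], [7.0, 8.0]],  # rank 1
]

# Target: the global seq-mean over all 8 sequences.
all_losses = [x for rank in seq_losses for mb in rank for x in mb]
target = sum(all_losses) / len(all_losses)  # 36 / 8 = 4.5

# Per-rank gradient accumulation: each rank sums its micro-batch losses,
# each divided by the per-rank sequence count (global_batch_size / dp_size).
global_batch_size = len(all_losses)                  # 8
per_rank_denominator = global_batch_size / dp_size   # 4
rank_losses = [
    sum(sum(mb) for mb in rank) / per_rank_denominator for rank in seq_losses
]

# DP then averages the rank losses (what a mean reduction across ranks does).
dp_mean = sum(rank_losses) / dp_size
print(dp_mean, target)  # 4.5 4.5 -- they match
```

With any other divisor (e.g. `global_batch_size` alone, or `global_batch_size * dp_size`), `dp_mean` would be off by a factor of `dp_size` from the target.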
Since the global loss aggregation logic is not compatible with the legacy model engine, which conducts the aggregation outside `agg_loss`, the comments above are either resolved or avoided. @wuxibin89 @ISEEKYAN
```python
elif loss_agg_mode.startswith("seq-mean"):
    # TODO: Correct and unify the denominator logic.
    if global_batch_size is not None:
        seq_denominator = global_batch_size * dp_size
```
The `seq_denominator` is not right; it should be `global_batch_size / dp_size`.
## Summary

Refactor the `agg_loss` function and fix entropy/KL loss scaling in distributed training.

**Changes:**

- **Refactor**: Unify `seq-mean-*` modes with shared denominator logic using `masked_sum`
- **Behavior change**: `seq-mean-token-sum-norm` now applies seq-mean division (denominator = `global_batch_size * dp_size` or `local_bsz`), matching the mode name
- **Simplification**: Remove fully-masked sequence exclusion from the denominator; use the total batch size consistently

NOTE: Since the global loss aggregation logic is not compatible with the legacy model engine, which conducts the aggregation outside `agg_loss` and is going to be deprecated, we keep this PR from modifying the legacy model engine.

⚠️ **Breaking**: `seq-mean-token-sum-norm` now divides by both `loss_scale_factor` AND `seq_denominator`. Previously it only divided by `loss_scale_factor`.

## Test plan

- [ ] Verify PPO training with `seq-mean-token-sum` mode
- [ ] Verify PPO training with `seq-mean-token-mean` mode
- [ ] Verify PPO training with `seq-mean-token-sum-norm` mode (note: behavior changed)
- [ ] Confirm entropy/KL loss values are correctly scaled in multi-GPU training

---

Co-authored-by: Shawn/Yuxuan Tong <tongyuxuan361@gmail.com>
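The shared-denominator idea behind the `seq-mean-*` unification can be sketched in plain Python (this is not verl's actual torch implementation; `masked_sum` and the mode names come from the summary, everything else is an assumption):

```python
def agg_seq_mean_loss(loss_mat, loss_mask, loss_agg_mode, seq_denominator):
    """Sketch of unified seq-mean aggregation with a shared denominator.

    seq_denominator would be global_batch_size * dp_size in a distributed
    run, or the local batch size in a single-process run.
    """
    seq_losses = []
    for row, mask in zip(loss_mat, loss_mask):
        s = sum(v for v, m in zip(row, mask) if m)  # masked_sum per sequence
        if loss_agg_mode == "seq-mean-token-mean":
            s /= max(sum(mask), 1)                  # per-sequence token mean
        elif loss_agg_mode != "seq-mean-token-sum":
            raise ValueError(f"unknown mode: {loss_agg_mode}")
        seq_losses.append(s)
    # Shared seq-mean division, regardless of the per-sequence reduction.
    return sum(seq_losses) / seq_denominator

loss_mat = [[1.0, 3.0], [2.0, 2.0]]
loss_mask = [[1, 1], [1, 0]]
print(agg_seq_mean_loss(loss_mat, loss_mask, "seq-mean-token-mean", 2))
# (2.0 + 2.0) / 2 = 2.0
```

Note how the fully-masked-sequence question only affects `seq_denominator`: with the simplification described above, the total batch size is used whether or not some rows of `loss_mask` are all zero.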