[BREAKING][skyrl-train] Implement loss reduction via advantage normalization and fix `token_mean` reduction strategy by justinvyu · Pull Request #1296 · NovaSky-AI/SkyRL

justinvyu · 2026-03-09T18:56:09Z

This is a breaking change for the default token_mean loss behavior, as well as observed policy_loss metrics! See the "Differences in reported loss metric magnitudes" section below.

Summary

Change reduce_loss() to always returns a simple masked sum ((loss * mask).sum()). To achieve different reduction strategies, we pre-scale the advantages before they enter the loss function, which also aligns with how Tinker's API handles it.
- Scales the loss by the DP size before calling backward() to counteract the default data parallel mean gradient all-reduce across workers to do a sum instead.
Fixes the token_mean loss reduction method to take a mean across all tokens in the minibatch rather than averaging across microbatches. Allows running with the old loss reduction with the token_mean_legacy config.

Loss reduction strategies

Option 1: token_mean
- Average loss per token across the entire mini-batch.
- This is the fixed version where the denominator is the total token count across the full mini-batch, so the gradient is independent of how the minibbatch is split into micro-batches.
Option 1b: token_mean_legacy
- Compute token-mean loss within each micro-batch, then average across micro-batches.
- This reproduces the token_mean behavior before this PR.
- The problem: if micro-batches have different token counts, the effective weighting differs from a true global token mean. This is also less usable since changing micro batch size affects the loss and the training dynamics.
- Kept as a fallback in case of performance regressions — we should remove this down the line.
Option 2: sequence_mean
- Compute per-token loss within each sequence, average across sequences.
- This is unchanged and is just implemented via advantage normalization instead.
Option 3: seq_mean_token_sum_norm
- Dr. GRPO style — normalize by a fixed constant to avoid any length-dependent weighting.
- This is unchanged and is just implemented via advantage normalization instead.

Mean all-reduce -> sum all-reduce

We need the loss to be summed across microbatches and data parallel workers:

DDP/FSDP defaults to a mean all-reduce for gradients across workers. This PR counteracts this by multiplying by the DP world size in order to keep the loss sum across data parallel groups.
Megatron also does a similar mean reduction across microbatches and workers, so we counteract this by multiplying by num microbatches and DP size to achieve the sum.

Difference in reported loss metric magnitudes

You will observe that the loss metric reported has a different magnitude compared to your older runs. This is beacuse the old token_mean implementation was somewhere between a true token mean and a sequence mean due to per-micro-batch normalization (ex: micro_batch_size=1 was equivalent to sequence mean).

The new token_mean is a proper minibatch token mean, while sequence_mean weights every sequence equally regardless of length. When comparing the loss produced by different reduction methods computed on the same advantages, from a real run:

  token_mean (new):  0.322   — every token weighted equally across the mini-batch
  token_mean (old):  0.065   — mean of per-micro-batch token means, where micro_batch_size=4
  sequence_mean:     0.00098 — every sequence weighted equally

The old token_mean gave each micro-batch equal weight rather than each token, so its scale depended on how advantages were distributed across micro-batches. The new implementation is invariant to micro-batch size.

Note that token_mean_legacy reports the old metrics still, and the sequence_mean and seq_mean_token_sum_norm modes also match exactly. See this comment for more details.

Tinker compatibility

Here was the first attempt at fixing the loss reduction across microbatches: #909

This method was to track total tokens and then do one big normalization at the optim_step in order to get an average per-token loss. But, we decided to align with Tinker's way of just summing up the loss at the end, and pushing any loss normalization to the user's advantage calculation.

The benefit is that users have full control of customizing their loss reduction strategy, rather than having it happen in our opaque forward_backward, optim_step implementation which would require some configuration argument that diverges from tinker's API. For example, we would need to add a config somewhere to determine how to average/sum the loss:

client.forward_backward(...)
client.optim_step(..., loss_reduction="token_mean")  # no longer tinker compatible

The current PR aligns with Tinker semantics:

Notice that for all objectives we sum the token-level losses over the sequence length unlike some other loss implementations. If you would like to explore different aggregation schemes, you can include that in the advantage tensor computation.

Example for loss_reduction="token_mean":

Move the 1/num_minibatch_tokens normalization into the advantage: loss = sum( -advantage_i * ratio_i for i in range(num_minibatch_tokens) ) / num_minibatch_tokens
-> sum( -(advantage_i / num_minibatch_tokens) * ratio_i for i in range(num_minibatch_tokens) )

Learning curve comparisons before/after the PR

FSDP (wandb)

Megatron (wandb)

1.7B:

30B lora:

master baseline:

token_mean_legacy + fixed token_mean:

… scale loss by dp_size for FSDP/Megatron parity Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…omparison Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…uction # Conflicts: # skyrl/backends/skyrl_train/utils/ppo_utils.py # skyrl/train/fully_async_trainer.py # skyrl/train/trainer.py # tests/backends/skyrl_train/gpu/test_grpo_sp_sanity.py

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

…ritic, rename token_mean_baseline to token_mean_legacy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

… add unit tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

erictang000

this looks almost good to merge, super clean thanks for adding the token_mean_legacy path

just want to check my understanding + 1 question about the metrics code that I think I probably wrote on the old PR...

… mini-batch reduction - Report unscaled loss metrics (remove * loss_scale / * dp_size) in both FSDP and Megatron workers - Rename reduce_metrics -> reduce_metrics_across_microbatches (sums _loss for gradient accumulation) - Add reduce_metrics_across_minibatches in trainer_utils (averages _loss for logging) - Use sum all-reduce for _loss keys across DP workers to reconstruct full mini-batch loss Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

justinvyu · 2026-03-25T23:01:06Z

+                "final_loss": unscaled_loss.detach().item(),
+                "policy_loss": policy_loss.detach().item(),


Metrics fix 1: remove dp_size multiplier in reported metrics, since there's no average that we need to correct for, since reduce_microbatch_metrics and all_reduce_metrics both do sums for *_loss metrics.

justinvyu · 2026-03-25T23:02:25Z

        # pop out loss_fn_outputs since it's not a scalar metric and to avoid logging it
        all_metrics.pop("loss_fn_outputs", None)
-        reduced_metrics = reduce_metrics(all_metrics)
+        reduced_metrics = reduce_metrics_across_minibatches(all_metrics)


Metrics fix 2: Take an average across minibatches instead of still summing. This is because the loss reduction normalization happens at the minibatch level. Across different minibatches we should just average, otherwise we'll increase the reported loss scale by ~num_minibatches

…e_metrics Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…uction # Conflicts: # skyrl/train/trainer.py

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu · 2026-03-27T20:23:13Z

            if resolved_loss_name == "cross_entropy":
-                loss = policy_loss
+                unscaled_loss = policy_loss
+                loss = unscaled_loss * grad_sum_correction_factor


Q: should this affect the SFT case? SFT doesn't look at the normalized advantages either, similar to the critic loss case.

Before the PR, the SFT case does a sum across the negative log likelihoods within a microbatch, but still averaged over microbatches and dp workers.
Now, we are summing negative log likelihood across the entire minibatch. What's the desired behavior here?

Reverting to the old behavior for now and we can tackle it in a follow-up. SFT loss reduction is broken already due to taking a sum within the microbatch but then a mean across microbatches/workers. The current behavior does not align with this comment: https://github.com/justinvyu/SkyRL/blob/c5feb83b38f4635c7fc705c2bb192a7d6ad16947/skyrl/backends/skyrl_train/utils/ppo_utils.py#L917

justinvyu · 2026-03-27T20:55:19Z

To sanity check the difference in loss metric magnitudes, I dumped the raw advantages on the first step and manually calculated the loss with the different reduction methods on the same dumped data.

Using dumped advantage tensors from a real GRPO run to compare old vs. new:

With micro_batch_size=4 (128 micro-batches), old and new differ by ~5x — matching what's observed in the actual run:

Average: old=0.065  new=0.322  ratio=4.92

With micro_batch_size=1, the old token_mean reduces to sequence_mean (each sequence weighted equally). The old values match sequence_mean exactly:

token_mean old:  Average=-0.024
sequence_mean:   Average=-0.024  ratio=1.0000

With micro_batch_size=512 (1 micro-batch = full mini-batch), old and new converge:

Average: old=0.322  new=0.322  ratio=1.0000

The new token_mean value (0.322) is the same regardless of micro_batch_size — which is the correct behavior. The old value varied between -0.024 (at micro_batch_size=1, i.e. sequence_mean) and 0.322 (at micro_batch_size=mini_batch_size=512, i.e. fixed token_mean) depending on how micro-batches were formed.

token_mean_legacy reproduces the old behavior. Runs using token_mean won't be directly comparable to before, but the difference is analogous to comparing token_mean vs. sequence_mean — a different weighting, not a bug.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

…py loss Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

devin-ai-integration

Devin Review found 1 new potential issue.

View 14 additional findings in Devin Review.

justinvyu · 2026-03-31T16:55:49Z

+        # TODO: SFT path still averages metrics across microbatches and workers.
+        # This needs to be unified with the RL path which sums.
+        resolved_loss_name = loss_fn or self.cfg.algorithm.policy_loss_type
+        sum_loss_metrics = resolved_loss_name != "cross_entropy"


need to followup to unify the codepaths -- SFT loss is wrong right now

justinvyu · 2026-03-31T17:01:13Z

+            grad_sum_correction_factor = self.mesh_rank.dp_size
+
+            # NOTE: The KL and entropy loss terms are not pre-scaled,
+            # so we just average them across microbatches and DP workers.
+            loss = policy_loss * grad_sum_correction_factor + (kl_loss_term - entropy_loss_term) * microbatch_weight
+            unscaled_loss = loss / grad_sum_correction_factor


This part is a bit complicated to maintain kl/entropy loss parity:

Previously, the kl/entropy terms are per-token averages within the microbatch (see the masked_mean above). Then, we took the average across microbatches and DP workers (same as the old loss).

We can't just sum them because they were computed on the worker and we didn't pre-scale them in the same way we scaled the advantages.

So, to maintain the average behavior, we divide the terms by the microbatch weight (1/num_microbatches), and we don't apply the grad sum correction factor to keep the all-reduce as a mean across DP workers.

justinvyu · 2026-03-31T17:03:21Z

+            grad_sum_correction_factor = num_microbatches * dp_size
+
+            # NOTE: The KL and entropy loss terms are not pre-scaled,
+            # so we just average them across microbatches and DP workers.
+            loss = policy_loss * grad_sum_correction_factor + kl_loss_term - entropy_loss_term
+            unscaled_loss = loss / grad_sum_correction_factor


This is similar to the FSDP case, except megatron already divides by num_microbatches and dp_size internally (so no need to divide by num_microbatches here).

erictang000

looks great! thanks for all the work getting this in

…ron (#1420) Fixes `test_megatron_train[tp2_cp2_policy_seq_packing_no_entropy_loss]` failing after #1296. ### Problem The loss refactor in #1296 introduced two CP-specific bugs: 1. **Metrics double-counted across CP ranks**: `all_reduce_metrics` used `get_data_parallel_group(with_context_parallel=True)`, which includes CP ranks in the reduction group. With `sum_loss_metrics=True`, this **sums** `policy_loss` across CP ranks. But since `postprocess_packed_seqs` already gathers logprobs across CP before computing the loss, all CP ranks produce identical metrics — so summing doubles the value. This caused the ~2x discrepancy (`-28.43` FSDP vs `-57.36` Megatron). 2. **Gradient correction factor ignores CP**: `grad_sum_correction_factor` used `get_data_parallel_world_size()` (without CP), but Megatron's `finalize_model_grads` averages gradients across the full DP+CP group. The correction was therefore `1/CP_size` too small. ### Fix - Use `get_data_parallel_group(with_context_parallel=False)` for the metrics all-reduce, since metrics are already complete on each CP rank. - Use `get_data_parallel_world_size(with_context_parallel=True)` for the gradient correction factor, matching the group that `finalize_model_grads` reduces over. Both changes are no-ops when CP=1.  --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1420" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a>

justinvyu and others added 3 commits March 9, 2026 11:51

Move loss reduction normalization to trainer-level advantage scaling,…

589c150

… scale loss by dp_size for FSDP/Megatron parity Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add token_mean_baseline loss reduction for mean-of-microbatch-means c…

333f31a

…omparison Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix assertion

aaaba4c

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu commented Mar 9, 2026

View reviewed changes

Comment thread skyrl/train/trainer.py Outdated

Comment thread skyrl/train/trainer.py Outdated

justinvyu and others added 7 commits March 9, 2026 18:27

Update tests for sum-based reduce_loss and dp_size scaling changes

a121360

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'upstream/main' into token_mean_loss_red…

15de89a

…uction # Conflicts: # skyrl/backends/skyrl_train/utils/ppo_utils.py # skyrl/train/fully_async_trainer.py # skyrl/train/trainer.py # tests/backends/skyrl_train/gpu/test_grpo_sp_sanity.py

lint

e3842c3

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix tests

13bfe80

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Refactor advantage normalization: fix z-score propagation, skip for c…

e76bece

…ritic, rename token_mean_baseline to token_mean_legacy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

token_mean_baseline -> token_mean_legacy

0192e8e

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Extract apply_loss_reduction_to_advantages_minibatch to ppo_utils and…

4ee0b31

… add unit tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

justinvyu marked this pull request as ready for review March 20, 2026 22:34

This comment was marked as resolved.

Sign in to view

devin-ai-integration Bot reviewed Mar 20, 2026

View reviewed changes

justinvyu changed the title ~~[wip] loss reduction~~ [skyrl-train] Implement loss reduction via advantage normalization and fix token_mean reduction strategy Mar 20, 2026

justinvyu mentioned this pull request Mar 20, 2026

[skyrl-train] Fix loss reduction by moving normalization to the advantage computation #925

Closed

erictang000 reviewed Mar 23, 2026

View reviewed changes

Comment thread skyrl/backends/skyrl_train/utils/ppo_utils.py

Comment thread skyrl/backends/skyrl_train/workers/megatron/megatron_model_wrapper.py Outdated

justinvyu commented Mar 25, 2026

View reviewed changes

justinvyu commented Mar 26, 2026

View reviewed changes

Comment thread skyrl/backends/skyrl_train/workers/worker.py Outdated

justinvyu and others added 8 commits March 27, 2026 11:53

Fix critic metric reporting: explicit sum_loss_metrics flag for reduc…

2c13315

…e_metrics Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove reduce_metrics_across_minibatches, reuse reduce_metrics

14ba02e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'upstream/main' into token_mean_loss_red…

0cfc95b

…uction # Conflicts: # skyrl/train/trainer.py

add some comments about sum metrics

717c3a7

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

add clarifying comments and rename loss_scale

661f5d8

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

no_grad for safety and make private

5cc95a1

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

remove outdated comments about loss reduction type in sapo tests

ce8f6aa

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix test

1a60bb5

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu commented Mar 27, 2026

View reviewed changes

fix test

c5feb83

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

This comment was marked as resolved.

Sign in to view

justinvyu added 6 commits March 30, 2026 10:59

fix kl, entropy loss terms

a599a4e

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

revert sft

971be5f

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

finish reverting sft + normalize by num microbatches for the kl/entro…

308be63

…py loss Signed-off-by: Justin Yu <justinvyu@anyscale.com>

revert sft case for megatron

ad54440

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix tests

3829da5

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

add some comments about the kl/entropy terms

a19ebca

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

devin-ai-integration Bot reviewed Mar 30, 2026

View reviewed changes

Comment thread skyrl/backends/skyrl_train/workers/worker.py

justinvyu commented Mar 31, 2026

View reviewed changes

erictang000 approved these changes Mar 31, 2026

View reviewed changes

erictang000 changed the title ~~[skyrl-train] Implement loss reduction via advantage normalization and fix token_mean reduction strategy~~ [BREAKING][skyrl-train] Implement loss reduction via advantage normalization and fix token_mean reduction strategy Mar 31, 2026

erictang000 merged commit bf243b8 into NovaSky-AI:main Mar 31, 2026
5 of 6 checks passed

erictang000 mentioned this pull request Mar 31, 2026

[megatron] Fix loss aggregation for context parallelism (CP) in Megatron #1420

Merged

erictang000 mentioned this pull request Apr 1, 2026

sum loss reduction #1424

Closed

erictang000 mentioned this pull request Apr 13, 2026

SFT loss aggregation #1504

Open

CharlieFRuan mentioned this pull request Apr 17, 2026

[fix][train] Prompt-based mini-batching for step-wise training #1529

Merged

		"final_loss": unscaled_loss.detach().item(),
		"policy_loss": policy_loss.detach().item(),

Conversation

justinvyu commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Loss reduction strategies

Mean all-reduce -> sum all-reduce

Difference in reported loss metric magnitudes

Tinker compatibility

Learning curve comparisons before/after the PR

FSDP (wandb)

Megatron (wandb)

1.7B:

30B lora:

Uh oh!

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

erictang000 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

justinvyu Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

justinvyu Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

justinvyu Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

justinvyu Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

justinvyu commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

justinvyu Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

justinvyu Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

justinvyu Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

erictang000 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

justinvyu commented Mar 9, 2026 •

edited

Loading

justinvyu Mar 25, 2026 •

edited

Loading

justinvyu commented Mar 27, 2026 •

edited

Loading