[megatron] Fix loss aggregation for context parallelism (CP) in Megatron #1420
Merged
erictang000 merged 2 commits into main on Mar 31, 2026
Conversation
Contributor
Code Review
This pull request updates the Megatron worker and model wrapper to correctly handle context parallel ranks in data parallel size calculations and metric reductions. A review comment points out that increasing the global absolute tolerance in the CI tests to 0.25 may be too permissive for metrics with small magnitudes, such as the learning rate, and suggests implementing per-metric or relative tolerance checks instead.
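As a hedged illustration of that suggestion (the helper and metric names here are hypothetical, not code from this PR), a per-metric relative-tolerance check could look like:

```python
import math

def assert_metrics_close(expected, actual, rel_tol=0.05, abs_tol=1e-8):
    """Compare metrics with a relative tolerance so small-magnitude values
    (e.g. the learning rate) are not masked by a large absolute tolerance."""
    for name, want in expected.items():
        got = actual[name]
        assert math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol), \
            f"{name}: expected {want}, got {got}"

# Example: a 0.25 absolute tolerance would accept a learning rate that is
# off by orders of magnitude; a relative check catches that.
assert_metrics_close({"policy_loss": -28.43, "lr": 3e-6},
                     {"policy_loss": -28.50, "lr": 3.1e-6})
```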
Fixes `test_megatron_train[tp2_cp2_policy_seq_packing_no_entropy_loss]` failing after #1296.

Problem

The loss refactor in #1296 introduced two CP-specific bugs:
- `all_reduce_metrics` used `get_data_parallel_group(with_context_parallel=True)`, which includes CP ranks in the reduction group. With `sum_loss_metrics=True`, this sums `policy_loss` across CP ranks. But since `postprocess_packed_seqs` already gathers logprobs across CP before computing the loss, all CP ranks produce identical metrics, so summing doubles the value. This caused the ~2x discrepancy (`-28.43` FSDP vs `-57.36` Megatron); the arithmetic sketch below illustrates the blow-up.
- `grad_sum_correction_factor` used `get_data_parallel_world_size()` (without CP), but Megatron's `finalize_model_grads` averages gradients across the full DP+CP group. The correction factor was therefore `1/CP_size` too small.
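Illustrative arithmetic only for the first bug (hypothetical CP=2 setup; the loss value is the one quoted above):

```python
# After postprocess_packed_seqs gathers logprobs across CP, every CP rank
# computes the same policy_loss, so a SUM over a group that contains CP
# ranks counts the identical value once per CP rank.
cp_size = 2
loss_on_each_cp_rank = -28.43

buggy = loss_on_each_cp_rank * cp_size  # -56.86: the ~2x blow-up seen in CI
fixed = loss_on_each_cp_rank            # -28.43: each loss counted once
print(buggy, fixed)
```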
Fix

- `get_data_parallel_group(with_context_parallel=False)` for the metrics all-reduce, since the metrics are already complete on each CP rank.
- `get_data_parallel_world_size(with_context_parallel=True)` for the gradient correction factor, matching the group that `finalize_model_grads` reduces over.

Both changes are no-ops when CP=1; a sketch of the changed calls follows.
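A minimal sketch of the two changed calls, assuming Megatron-core's `parallel_state` helpers (the wrapper functions here are hypothetical; the actual worker and model wrapper wire this differently):

```python
import torch
import torch.distributed as dist
from megatron.core import parallel_state as mpu

def all_reduce_loss_metrics(metrics: torch.Tensor) -> torch.Tensor:
    # Metrics are already complete on each CP rank (logprobs were gathered
    # across CP before the loss), so reduce over the DP group only.
    group = mpu.get_data_parallel_group(with_context_parallel=False)
    dist.all_reduce(metrics, op=dist.ReduceOp.SUM, group=group)
    return metrics

def grad_sum_correction_factor() -> int:
    # finalize_model_grads averages gradients over the full DP+CP group,
    # so the correction factor must match that group's world size.
    return mpu.get_data_parallel_world_size(with_context_parallel=True)
```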