
[megatron] Fix loss aggregation for context parallelism (CP) in Megatron #1420

Merged
erictang000 merged 2 commits into main from megatron_cp_loss on Mar 31, 2026

Conversation

@erictang000 (Collaborator) commented Mar 31, 2026

Fixes test_megatron_train[tp2_cp2_policy_seq_packing_no_entropy_loss] failing after #1296.

Problem

The loss refactor in #1296 introduced two CP-specific bugs:

  1. Metrics double-counted across CP ranks: all_reduce_metrics used get_data_parallel_group(with_context_parallel=True), which includes CP ranks in the reduction group. With sum_loss_metrics=True, this sums policy_loss across CP ranks. But postprocess_packed_seqs already gathers logprobs across CP before the loss is computed, so every CP rank produces identical metrics and the sum doubles the value (see the sketch after this list). This is the source of the ~2x discrepancy (-28.43 FSDP vs -57.36 Megatron).
  2. Gradient correction factor ignores CP: grad_sum_correction_factor used get_data_parallel_world_size() (without CP), but Megatron's finalize_model_grads averages gradients across the full DP+CP group. The correction factor was therefore too small by a factor of CP_size.
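A minimal sketch of the double-counting, assuming an initialized Megatron parallel state and CP=2; the tensor value and setup are illustrative, not the actual SkyRL code:

```python
import torch
import torch.distributed as dist
from megatron.core import parallel_state

# Illustration only: with CP=2, every CP rank holds the *same* policy_loss,
# because logprobs were already gathered across CP before the loss was computed.
policy_loss = torch.tensor(-28.43, device="cuda")

# Buggy reduction group: the DP group *including* CP ranks.
group = parallel_state.get_data_parallel_group(with_context_parallel=True)

# Summing two identical copies doubles the metric:
# -28.43 -> roughly -56.9, i.e. the observed ~2x discrepancy.
dist.all_reduce(policy_loss, op=dist.ReduceOp.SUM, group=group)
```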

Fix

  • Use get_data_parallel_group(with_context_parallel=False) for the metrics all-reduce, since metrics are already complete on each CP rank.
  • Use get_data_parallel_world_size(with_context_parallel=True) for the gradient correction factor, matching the group that finalize_model_grads reduces over.
  Both changes are no-ops when CP=1 (see the sketch below).
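A minimal sketch of the corrected calls, assuming Megatron-Core's parallel_state API; the variable names are illustrative and the exact correction-factor formula in SkyRL may differ:

```python
from megatron.core import parallel_state

# Metrics all-reduce: exclude CP ranks, since each CP rank already holds
# the complete metric once logprobs have been gathered across CP.
metrics_group = parallel_state.get_data_parallel_group(with_context_parallel=False)

# Gradient correction: base the factor on the DP+CP world size, matching
# the group that Megatron's finalize_model_grads averages gradients over.
dp_cp_world_size = parallel_state.get_data_parallel_world_size(with_context_parallel=True)

# When CP=1, the group and the world size match the previous behavior,
# so both changes are no-ops in that case.
```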


@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request updates the Megatron worker and model wrapper to correctly handle context parallel ranks in data parallel size calculations and metric reductions. A review comment points out that increasing the global absolute tolerance in the CI tests to 0.25 may be too permissive for metrics with small magnitudes, such as the learning rate, and suggests implementing per-metric or relative tolerance checks instead.
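A hedged sketch of what per-metric or relative tolerance checks could look like; the metric names and tolerance values here are illustrative, not taken from this PR:

```python
import math

# Illustrative per-metric tolerances: a relative tolerance for loss-scale
# metrics, and a tight absolute tolerance for small-magnitude metrics
# such as the learning rate.
TOLERANCES = {
    "policy_loss": {"rel_tol": 0.01, "abs_tol": 0.0},
    "lr": {"rel_tol": 0.0, "abs_tol": 1e-8},
}

def assert_metrics_close(fsdp_metrics: dict, megatron_metrics: dict) -> None:
    """Compare two metric dicts using per-metric tolerances."""
    for name, tol in TOLERANCES.items():
        a, b = fsdp_metrics[name], megatron_metrics[name]
        assert math.isclose(a, b, **tol), f"{name}: {a} vs {b} outside tolerance {tol}"
```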

Comment thread on tests/backends/skyrl_train/gpu/gpu_ci/test_megatron_worker.py: a comment from devin-ai-integration[bot] was marked as resolved.

erictang000 merged commit 8376455 into main on Mar 31, 2026.
5 of 6 checks passed.
erictang000 deleted the megatron_cp_loss branch on March 31, 2026 at 23:03.