Conversation
❌ Submodule Fast-Forward Check Failed

Check based on commit: e5d1ae9 (PR #1990)

❌ Submodules that need attention:
- Megatron-LM: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
❌ Submodule Fast-Forward Check Failed

Check based on commit: 4ba7f09 (PR #1990)

✅ Submodules that are properly updated:
- Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:
- Megatron-LM: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
terrykong left a comment
lgtm. @shanmugamr1992 can you take a pass over the mcore inf changes?
Signed-off-by: Ahmad Kiswani <kiswani.ahmad@gmail.com>
Force-pushed from 8a09bf8 to aedd445
❌ Submodule Fast-Forward Check Failed

Check based on commit: aedd445 (PR #1990)

✅ Submodules that are properly updated:
- Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:
- Megatron-LM: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
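The failing check above enforces the standard fast-forward invariant: a submodule pointer B is a fast-forward of main's pointer A exactly when A is an ancestor of B in the commit graph (with git itself, `git merge-base --is-ancestor A B`). A minimal sketch of that ancestor test over an in-memory commit DAG — the graph and commit names below are hypothetical, for illustration only:

```python
def is_fast_forward(parents, base, tip):
    """Return True if `tip` is a fast-forward of `base`, i.e. `base` is
    reachable from `tip` by walking parent links. Merge commits may have
    several parents, so this is a DAG traversal, not a simple chain walk."""
    stack, seen = [tip], set()
    while stack:
        commit = stack.pop()
        if commit == base:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False

# Hypothetical commit graph: main is at "m2"; one submodule branch
# extends it ("f1"), the other diverged from the older commit "m1".
parents = {"m2": ["m1"], "f1": ["m2"], "d1": ["m1"]}
print(is_fast_forward(parents, "m2", "f1"))  # True: fast-forward
print(is_fast_forward(parents, "m2", "d1"))  # False: diverged
```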
📝 Walkthrough

This PR updates Megatron-LM submodule references from a fork to NVIDIA's official repository, bumps dependencies (torch, transformer-engine, mlflow, flash-linear-attention, datasets), refactors inference configuration in megatron_policy_worker.py to use the new InferenceConfig and DynamicInferenceContext patterns, and reorganizes test actors by moving them to dedicated modules.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
♻️ Duplicate comments (1)
tests/unit/algorithms/sequence_packing_gradient_actor.py (1)
350-380: Test 3 is missing a final gradient assertion, unlike Test 2.

Test 2 ends with `torch.testing.assert_close(packed_grad, baseline_grad_store, ...)` (line 285), but Test 3 only prints gradient stats and differences without asserting correctness. This means Test 3 can silently pass even if gradients diverge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/algorithms/sequence_packing_gradient_actor.py` around lines 350 - 380, Test 3 currently only prints gradient stats but lacks a final correctness check; add a torch.testing.assert_close call to compare packed_grad and baseline_grad_store (same as Test 2) after the prints to fail the test on divergence — use the identical rtol/atol tolerances and message used by the existing torch.testing.assert_close in Test 2 so the comparison semantics match, and place this assertion immediately after the printed difference by token.
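The fix the comment asks for is a closeness assertion with the usual rtol/atol semantics: two tensors count as close element-wise when |actual − expected| ≤ atol + rtol·|expected|. A minimal pure-Python illustration of that rule — the values and tolerances here are hypothetical, not the ones Test 2 actually uses:

```python
def assert_close(actual, expected, rtol=1e-5, atol=1e-8):
    """Element-wise closeness check with torch.testing.assert_close-style
    semantics: |a - e| <= atol + rtol * |e| must hold for every pair."""
    for a, e in zip(actual, expected):
        tol = atol + rtol * abs(e)
        if abs(a - e) > tol:
            raise AssertionError(
                f"values {a} and {e} differ by {abs(a - e):.3e}, "
                f"allowed tolerance {tol:.3e}"
            )

# A small relative error passes; a large one fails loudly instead of
# being silently printed and ignored, which is the point of the review.
assert_close([1.0000001], [1.0], rtol=1e-5)
try:
    assert_close([1.1], [1.0], rtol=1e-5)
except AssertionError:
    print("diverged")  # the test now fails on divergence
```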
🧹 Nitpick comments (3)
tests/unit/algorithms/sequence_packing_gradient_actor.py (1)
197-216: `logits` parameter is unused — the function always reads `baseline_logits` from the enclosing scope.

`make_packed_logits(logits)` never references the `logits` parameter; it captures `baseline_logits` from the closure (lines 204, 209, 211). Since the only call sites pass `baseline_logits` anyway, behavior is correct, but the dead parameter is misleading.

Suggested fix
Either use the parameter:
```diff
-def make_packed_logits(logits):
+def make_packed_logits(logits_input):
     packed_logits = torch.zeros(
         1, packed_input_ids_cp.shape[1], vocab_size, device="cuda"
     )
     run_seq = 0
     for i, seq_len in enumerate(seq_lengths):
         padded_seqlen = cu_seqlens_padded[i + 1] - cu_seqlens_padded[i]
-        if padded_seqlen > baseline_logits.shape[1]:
+        if padded_seqlen > logits_input.shape[1]:
             tmp_logits = torch.zeros(
                 1, padded_seqlen, vocab_size, device="cuda"
             )
-            tmp_logits[:, :seq_len] = baseline_logits[i : i + 1, :seq_len]
+            tmp_logits[:, :seq_len] = logits_input[i : i + 1, :seq_len]
         else:
-            tmp_logits = baseline_logits[i : i + 1, :padded_seqlen]
+            tmp_logits = logits_input[i : i + 1, :padded_seqlen]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/algorithms/sequence_packing_gradient_actor.py` around lines 197 - 216, The make_packed_logits function declares a parameter logits but never uses it, instead closing over baseline_logits; update the function to remove the unused parameter or use the passed-in logits to avoid the misleading dead parameter: modify the signature of make_packed_logits (and all its call sites) to either take no arguments and rely on baseline_logits, or replace all internal references to baseline_logits inside make_packed_logits with the parameter logits so the function uses its argument consistently (ensure calls that currently pass baseline_logits still pass the right tensor).

nemo_rl/models/policy/workers/megatron_policy_worker.py (1)
721-724: Seed computation uses magic number 1024.

The formula `(node_idx * 1024) + local_rank` silently assumes ≤1024 GPUs per node. While safe for current hardware, consider extracting `1024` as a named constant for clarity, or at minimum adding a brief inline comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` around lines 721 - 724, The seed computation uses a hardcoded magic number 1024; update the computation in the block that sets local_rank, num_gpus_per_node, node_idx and model_config.inference_sampling_seed to replace the literal 1024 with a well-named constant (e.g., MAX_GPUS_PER_NODE or GPU_INDEX_OFFSET) declared near this module or as a module-level constant, and/or add a concise inline comment explaining why that value is chosen (to avoid collisions when shifting node index into seed space), so the intent is explicit and future reviewers know the assumption.

tests/unit/models/megatron/megatron_data_actors.py (1)
443-457: Unused `vocab_size` variable (flagged by static analysis).

Line 446 assigns `vocab_size = 100` but it's never referenced in `_test_context_parallel`. Either remove it or prefix with `_`.

Suggested fix
```diff
- vocab_size = 100
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/models/megatron/megatron_data_actors.py` around lines 443 - 457, Remove or rename the unused local variable `vocab_size` in the `_test_context_parallel` test: either delete the assignment `vocab_size = 100` or rename it to `_vocab_size = 100` so static analysis no longer flags it as unused; update the code block around the test parameters (the lines setting `batch_size`, `seq_len`, and `vocab_size`) in tests/unit/models/megatron/megatron_data_actors.py accordingly.
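The constant extraction suggested for the seed computation above can be sketched as follows. The constant name and surrounding variable names are illustrative, mirroring the review comment rather than the actual worker code:

```python
# Name the per-node rank stride instead of hard-coding 1024.
MAX_GPUS_PER_NODE = 1024  # seed stride; assumes <= 1024 ranks per node

def inference_sampling_seed(node_idx: int, local_rank: int) -> int:
    """Derive a per-GPU sampling seed that is unique across the cluster
    as long as local_rank < MAX_GPUS_PER_NODE, by shifting the node
    index into a disjoint range of seed space per node."""
    assert 0 <= local_rank < MAX_GPUS_PER_NODE
    return node_idx * MAX_GPUS_PER_NODE + local_rank

# Seeds are distinct for every (node, rank) pair within the bound.
seeds = {inference_sampling_seed(n, r) for n in range(4) for r in range(8)}
print(len(seeds))  # 32 distinct seeds
```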
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- .gitmodules
- 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
- 3rdparty/Megatron-Bridge-workspace/setup.py
- 3rdparty/Megatron-LM-workspace/Megatron-LM
- 3rdparty/Megatron-LM-workspace/setup.py
- nemo_rl/models/policy/workers/megatron_policy_worker.py
- pyproject.toml
- tests/unit/algorithms/sequence_packing_gradient_actor.py
- tests/unit/algorithms/test_sequence_packing_gradients.py
- tests/unit/models/megatron/megatron_data_actors.py
- tests/unit/models/megatron/test_megatron_data.py
What does this PR do?

This PR replaces the mcore fork created for RL with the upstream version.
Summary by CodeRabbit

Dependencies
- Updated Megatron-LM submodule references from a fork to NVIDIA's official repository.
- Bumped torch, transformer-engine, mlflow, flash-linear-attention, and datasets.

Infrastructure
- Refactored inference configuration in megatron_policy_worker.py to use the new InferenceConfig and DynamicInferenceContext patterns.

Tests
- Reorganized test actors into dedicated modules.