
chore: Switch to mcore upstream main#1990

Merged
terrykong merged 1 commit into main from ahmadki/mcore_main
Feb 25, 2026

Conversation

@ahmadki
Member

@ahmadki ahmadki commented Feb 19, 2026

What does this PR do ?

This PR replaces the Megatron-core (mcore) fork created for RL with the upstream version.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Dependencies

    • Updated minimum versions for torch (≥2.6.0), datasets (≥2.20.0), and adjusted transformer-engine compatibility range.
    • Added new runtime dependencies: mlflow (≥3.5.0), flask, hypercorn, and openai.
  • Infrastructure

    • Updated to use official NVIDIA Megatron-LM repository as the primary source.
    • Refined inference configuration and context management setup.
  • Tests

    • Enhanced test infrastructure organization and expanded coverage.

@ahmadki ahmadki changed the title from "Switch to mcore upstream main" to "feat: Switch to mcore upstream main" Feb 19, 2026
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: e5d1ae9 (PR #1990 from ahmadki/mcore_main)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/
CURRENT (PR #1990 from ahmadki/mcore_main): https://github.com/NVIDIA/Megatron-LM/commits/0d0943c6bfa9cbb30fcd62d40ce1792c4cb201e8/

Please ensure all submodule commits are fast-forwards of the main branch before merging.
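The fast-forward requirement flagged above can be verified locally before pushing. A minimal sketch using `git merge-base --is-ancestor`, demonstrated on a throwaway repo (in practice you would run the check inside the submodule checkout with the two real commit hashes):

```shell
# A commit B is a fast-forward of commit A iff A is an ancestor of B.
# git merge-base --is-ancestor exits 0 when that holds.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m base
base=$(git rev-parse HEAD)
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m next
tip=$(git rev-parse HEAD)
# For the submodule check: base = commit pinned on main, tip = commit pinned by the PR.
if git merge-base --is-ancestor "$base" "$tip"; then
  echo "fast-forward"
else
  echo "diverged"
fi
```

If the two commits have only a common ancestor but neither contains the other, the check reports divergence, which is exactly the state the bot comments above describe.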

@ahmadki ahmadki added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Feb 19, 2026
@ahmadki ahmadki force-pushed the ahmadki/mcore_main branch 2 times, most recently from 891b3c2 to c19ee75, on February 20, 2026 13:29
@ahmadki ahmadki changed the title from "feat: Switch to mcore upstream main" to "chore: Switch to mcore upstream main" Feb 20, 2026
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: c19ee75 (PR #1990 from ahmadki/mcore_main)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/b12071b947f9ee3c6616306662069fc4ca77be4c/
CURRENT (PR #1990 from ahmadki/mcore_main): https://github.com/NVIDIA/Megatron-LM/commits/0d0943c6bfa9cbb30fcd62d40ce1792c4cb201e8/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 891b3c2 (PR #1990 from ahmadki/mcore_main)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/b12071b947f9ee3c6616306662069fc4ca77be4c/
CURRENT (PR #1990 from ahmadki/mcore_main): https://github.com/NVIDIA/Megatron-LM/commits/0d0943c6bfa9cbb30fcd62d40ce1792c4cb201e8/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@ahmadki ahmadki added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Feb 20, 2026
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 6674072 (PR #1990 from ahmadki/mcore_main)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/b12071b947f9ee3c6616306662069fc4ca77be4c/
CURRENT (PR #1990 from ahmadki/mcore_main): https://github.com/NVIDIA/Megatron-LM/commits/0d0943c6bfa9cbb30fcd62d40ce1792c4cb201e8/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@ahmadki ahmadki added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Feb 22, 2026
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 924e32a (PR #1990 from ahmadki/mcore_main)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/b12071b947f9ee3c6616306662069fc4ca77be4c/
CURRENT (PR #1990 from ahmadki/mcore_main): https://github.com/NVIDIA/Megatron-LM/commits/0d0943c6bfa9cbb30fcd62d40ce1792c4cb201e8/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@ahmadki ahmadki added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Feb 22, 2026
@ahmadki ahmadki added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Feb 22, 2026
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 4ba7f09 (PR #1990 from ahmadki/mcore_main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/b12071b947f9ee3c6616306662069fc4ca77be4c/
CURRENT (PR #1990 from ahmadki/mcore_main): https://github.com/NVIDIA/Megatron-LM/commits/0d0943c6bfa9cbb30fcd62d40ce1792c4cb201e8/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@ahmadki ahmadki removed the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Feb 22, 2026
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 07f73ce (PR #1990 from ahmadki/mcore_main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/b12071b947f9ee3c6616306662069fc4ca77be4c/
CURRENT (PR #1990 from ahmadki/mcore_main): https://github.com/NVIDIA/Megatron-LM/commits/0d0943c6bfa9cbb30fcd62d40ce1792c4cb201e8/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@ahmadki ahmadki added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Feb 23, 2026
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 8a09bf8 (PR #1990 from ahmadki/mcore_main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/ahmadki/Megatron-LM/commits/b12071b947f9ee3c6616306662069fc4ca77be4c/
CURRENT (PR #1990 from ahmadki/mcore_main): https://github.com/ahmadki/Megatron-LM/commits/e0dd64251f2fde58606c5253280469b4dc81c75b/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Collaborator

@terrykong terrykong left a comment


lgtm. @shanmugamr1992 can you take a pass over the mcore inf changes?

shanmugamr1992 previously approved these changes Feb 23, 2026
Signed-off-by: Ahmad Kiswani <kiswani.ahmad@gmail.com>
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: aedd445 (PR #1990 from ahmadki/mcore_main)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/b12071b947f9ee3c6616306662069fc4ca77be4c/
CURRENT (PR #1990 from ahmadki/mcore_main): https://github.com/NVIDIA/Megatron-LM/commits/23dd639cf3de30f3b9d8d0fae71ee31180be9ddd/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@ahmadki ahmadki added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Feb 24, 2026
@ahmadki ahmadki marked this pull request as ready for review February 24, 2026 16:30
@ahmadki ahmadki requested review from a team as code owners February 24, 2026 16:30
@terrykong terrykong enabled auto-merge (squash) February 24, 2026 16:38
@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

This PR updates Megatron-LM submodule references from a fork to NVIDIA's official repository, bumps dependencies (torch, transformer-engine, mlflow, flash-linear-attention, datasets), refactors inference configuration in megatron_policy_worker.py to use new InferenceConfig and DynamicInferenceContext patterns, and reorganizes test actors by moving them to dedicated modules.

Changes

Cohort / File(s) Summary
Submodule & Repository Updates
.gitmodules, 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge, 3rdparty/Megatron-LM-workspace/Megatron-LM
Updated Megatron-LM repository URL from fork to official NVIDIA repository and changed branch to main. Updated submodule commit pointers for Megatron-Bridge and Megatron-LM.
Dependency Version Bumps
3rdparty/Megatron-Bridge-workspace/setup.py, 3rdparty/Megatron-LM-workspace/setup.py, pyproject.toml
Updated PyPI dependencies: torch to >=2.6.0, transformer-engine to >=2.9.0a0,<2.12.0 with core_cu13 extras, flash-linear-attention to ~=0.4.0, datasets to >=2.20.0. Added mlflow>=3.5.0. Updated pyproject.toml transformer-engine constraints across automodel/mcore groups.
Inference Configuration Refactoring
nemo_rl/models/policy/workers/megatron_policy_worker.py
Replaced InferenceWrapperConfig with InferenceConfig and DynamicInferenceContext-based setup. Removed legacy seed-based randomness and CUDA graph flags. Introduced node/rank-based inference_seed computation. Refactored dynamic engine initialization to use simplified construction pattern via InferenceConfig.
Test Actor Extraction
tests/unit/algorithms/sequence_packing_gradient_actor.py, tests/unit/algorithms/test_sequence_packing_gradients.py
Extracted SequencePackingGradientTestActor from inline test to dedicated module. Updated test file to import and register external actor via ACTOR_ENVIRONMENT_REGISTRY with Ray cluster lifecycle management.
Megatron Data Test Actors
tests/unit/models/megatron/megatron_data_actors.py, tests/unit/models/megatron/test_megatron_data.py
Created new module with PackSequencesTestActor and GetPackSequenceParametersTestActor Ray actors. Moved these actors from test_megatron_data.py to dedicated megatron_data_actors.py module with comprehensive test suites for packing and parameter validation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

super-v3

Suggested reviewers

  • terrykong
  • yaoyu-33
  • ashors1
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning — docstring coverage is 52.63%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Test Results For Major Changes: ⚠️ Warning — the PR contains major changes (mcore fork to upstream main switch, inference system refactoring, dependency updates) but the description lacks test results documentation. Resolution: add unit/functional test results, performance benchmarks, and convergence validation to the PR description before merging.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed — check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed — the title clearly and concisely describes the main objective: switching from a fork to the upstream mcore main branch.


Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
tests/unit/algorithms/sequence_packing_gradient_actor.py (1)

350-380: Test 3 is missing a final gradient assertion, unlike Test 2.

Test 2 ends with torch.testing.assert_close(packed_grad, baseline_grad_store, ...) (line 285), but Test 3 only prints gradient stats and differences without asserting correctness. This means Test 3 can silently pass even if gradients diverge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/algorithms/sequence_packing_gradient_actor.py` around lines 350 -
380, Test 3 currently only prints gradient stats but lacks a final correctness
check; add a torch.testing.assert_close call to compare packed_grad and
baseline_grad_store (same as Test 2) after the prints to fail the test on
divergence — use the identical rtol/atol tolerances and message used by the
existing torch.testing.assert_close in Test 2 so the comparison semantics match,
and place this assertion immediately after the printed difference by token.
🧹 Nitpick comments (3)
tests/unit/algorithms/sequence_packing_gradient_actor.py (1)

197-216: logits parameter is unused — function always reads baseline_logits from enclosing scope.

make_packed_logits(logits) never references the logits parameter; it captures baseline_logits from the closure (lines 204, 209, 211). Since the only call sites pass baseline_logits anyway, behavior is correct, but the dead parameter is misleading.

Suggested fix

Either use the parameter:

-        def make_packed_logits(logits):
+        def make_packed_logits(logits_input):
             packed_logits = torch.zeros(
                 1, packed_input_ids_cp.shape[1], vocab_size, device="cuda"
             )
             run_seq = 0
             for i, seq_len in enumerate(seq_lengths):
                 padded_seqlen = cu_seqlens_padded[i + 1] - cu_seqlens_padded[i]
-                if padded_seqlen > baseline_logits.shape[1]:
+                if padded_seqlen > logits_input.shape[1]:
                     tmp_logits = torch.zeros(
                         1, padded_seqlen, vocab_size, device="cuda"
                     )
-                    tmp_logits[:, :seq_len] = baseline_logits[i : i + 1, :seq_len]
+                    tmp_logits[:, :seq_len] = logits_input[i : i + 1, :seq_len]
                 else:
-                    tmp_logits = baseline_logits[i : i + 1, :padded_seqlen]
+                    tmp_logits = logits_input[i : i + 1, :padded_seqlen]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/algorithms/sequence_packing_gradient_actor.py` around lines 197 -
216, The make_packed_logits function declares a parameter logits but never uses
it, instead closing over baseline_logits; update the function to remove the
unused parameter or use the passed-in logits to avoid the misleading dead
parameter: modify the signature of make_packed_logits (and all its call sites)
to either take no arguments and rely on baseline_logits, or replace all internal
references to baseline_logits inside make_packed_logits with the parameter
logits so the function uses its argument consistently (ensure calls that
currently pass baseline_logits still pass the right tensor).
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)

721-724: Seed computation uses magic number 1024.

The formula (node_idx * 1024) + local_rank silently assumes ≤1024 GPUs per node. While safe for current hardware, consider extracting 1024 as a named constant for clarity, or at minimum adding a brief inline comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` around lines 721 -
724, The seed computation uses a hardcoded magic number 1024; update the
computation in the block that sets local_rank, num_gpus_per_node, node_idx and
model_config.inference_sampling_seed to replace the literal 1024 with a
well-named constant (e.g., MAX_GPUS_PER_NODE or GPU_INDEX_OFFSET) declared near
this module or as a module-level constant, and/or add a concise inline comment
explaining why that value is chosen (to avoid collisions when shifting node
index into seed space), so the intent is explicit and future reviewers know the
assumption.
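The nitpick above can be addressed with a module-level constant. A minimal sketch, where the names `MAX_GPUS_PER_NODE` and `compute_inference_seed` are illustrative and not the worker's actual API:

```python
# Shifting the node index by the maximum GPUs per node keeps seeds unique
# across ranks, provided local_rank < MAX_GPUS_PER_NODE.
MAX_GPUS_PER_NODE = 1024  # assumption from the review: <= 1024 GPUs per node

def compute_inference_seed(node_idx: int, local_rank: int) -> int:
    """Derive a per-rank inference seed that cannot collide across nodes."""
    assert 0 <= local_rank < MAX_GPUS_PER_NODE, "local_rank exceeds seed offset"
    return node_idx * MAX_GPUS_PER_NODE + local_rank

# Distinct (node, rank) pairs map to distinct seeds:
# compute_inference_seed(1, 0) == 1024
```

Naming the constant makes the ≤1024-GPUs-per-node assumption explicit, and the assertion fails loudly if hardware ever violates it.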
tests/unit/models/megatron/megatron_data_actors.py (1)

443-457: Unused vocab_size variable (flagged by static analysis).

Line 446 assigns vocab_size = 100 but it's never referenced in _test_context_parallel. Either remove it or prefix with _.

Suggested fix
-        vocab_size = 100
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/models/megatron/megatron_data_actors.py` around lines 443 - 457,
Remove or rename the unused local variable `vocab_size` in the
`_test_context_parallel` test: either delete the assignment `vocab_size = 100`
or rename it to `_vocab_size = 100` so static analysis no longer flags it as
unused; update the code block around the test parameters (the lines setting
`batch_size`, `seq_len`, and `vocab_size`) in
tests/unit/models/megatron/megatron_data_actors.py accordingly.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9148186 and aedd445.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • .gitmodules
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
  • 3rdparty/Megatron-Bridge-workspace/setup.py
  • 3rdparty/Megatron-LM-workspace/Megatron-LM
  • 3rdparty/Megatron-LM-workspace/setup.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
  • pyproject.toml
  • tests/unit/algorithms/sequence_packing_gradient_actor.py
  • tests/unit/algorithms/test_sequence_packing_gradients.py
  • tests/unit/models/megatron/megatron_data_actors.py
  • tests/unit/models/megatron/test_megatron_data.py

@terrykong terrykong merged commit 5bb8586 into main Feb 25, 2026
54 of 58 checks passed

Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests
Run CICD


3 participants