
[sync] Update RNG sharding to include EP rank#2092

Merged
ananthsub merged 4 commits into NVIDIA-NeMo:main from ananthsub:sync-2658
Jan 29, 2026

Conversation

@ananthsub
Contributor

@ananthsub ananthsub commented Jan 27, 2026

What does this PR do ?

Sync with changes from NVIDIA/Megatron-LM#2658 and NVIDIA/Megatron-LM#2641

Changelog

  • Update RNG sharding to include EP rank, and fix CUDA RNG tracker

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Improvements

    • Enhanced checkpoint handling for Expert Parallelism configurations in distributed training.
    • Implemented graph-safe CUDA RNG state management during checkpoint loading.
  • Tests

    • Added comprehensive test coverage for RNG state collection with varying parallelism configurations.


@copy-pr-bot

copy-pr-bot bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ananthsub
Contributor Author

/ok to test efa8548

@ananthsub ananthsub requested a review from yaoyu-33 January 29, 2026 10:19
@ananthsub
Contributor Author

/ok to test b3e472d

@ananthsub ananthsub marked this pull request as ready for review January 29, 2026 10:46
@coderabbitai
Contributor

coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

Walkthrough

Enhanced RNG state handling in checkpointing to support Expert Parallelism (EP): get_rng_state now accepts a ProcessGroupCollection parameter, RNG states are sharded across PP, TP, and DP when EP > 1, and CUDA RNG tracker states are converted to a graph-safe form before being applied during checkpoint loading.

Changes

  • RNG State Checkpointing — src/megatron/bridge/training/checkpointing.py
    Imports get_pg_size from megatron.core.utils. Updates the get_rng_state signature to accept a pg_collection parameter. Implements EP-aware RNG sharding logic: when EP > 1, shards across PP, TP, and DP; otherwise maintains the prior PP, TP sharding with DP as the replica_id. Adds graph-safe RNG state loading by acquiring the CUDA RNG tracker, determining graph safety, and converting rng_tracker_states via tensor_parallel.convert_cuda_rng_state() before application.
  • RNG State Checkpointing Tests — tests/unit_tests/training/test_checkpointing.py
    Adds unit tests for RNG state collection covering the EP scenarios: EP > 1 (sharded by PP, TP, DP) and EP = 1 (sharded by PP, TP). Includes a test for the EP group being None, and validates ShardedObject metadata (global_shape, global_offset, replica_id) and correct RNG state gathering. Tests verify that get_pg_size is invoked with the appropriate EP objects.
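To make the sharding decision above concrete, here is a hypothetical sketch. The names (shard_rng_state, ShardedRngState) and the simplified dataclass are stand-ins for illustration only, not Megatron's actual ShardedObject or the function in checkpointing.py; the point is the metadata split: with EP > 1 the DP rank becomes a sharding axis, otherwise it stays a replica index.

```python
# Hedged sketch of the EP-aware RNG sharding decision (not Megatron's real API).
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ShardedRngState:
    """Simplified stand-in for a ShardedObject's key metadata."""
    key: str
    global_shape: Tuple[int, ...]
    global_offset: Tuple[int, ...]
    replica_id: int


def shard_rng_state(ep_size: int, pp_rank: int, pp_size: int,
                    tp_rank: int, tp_size: int,
                    dp_rank: int, dp_size: int) -> ShardedRngState:
    """Shard RNG state across PP/TP, and additionally across DP when EP > 1."""
    if ep_size > 1:
        # With expert parallelism, DP ranks carry distinct expert-related RNG
        # streams, so DP must be a sharding axis, not a replica dimension.
        return ShardedRngState(
            key="rng_state",
            global_shape=(pp_size, tp_size, dp_size),
            global_offset=(pp_rank, tp_rank, dp_rank),
            replica_id=0,
        )
    # Without EP, every DP rank holds an identical RNG copy: shard by PP/TP
    # only, and mark the DP rank as the replica index.
    return ShardedRngState(
        key="rng_state",
        global_shape=(pp_size, tp_size),
        global_offset=(pp_rank, tp_rank),
        replica_id=dp_rank,
    )


ep = shard_rng_state(ep_size=2, pp_rank=1, pp_size=2,
                     tp_rank=0, tp_size=2, dp_rank=3, dp_size=4)
no_ep = shard_rng_state(ep_size=1, pp_rank=1, pp_size=2,
                        tp_rank=0, tp_size=2, dp_rank=3, dp_size=4)
print(ep.global_offset, ep.replica_id)        # (1, 0, 3) 0
print(no_ep.global_offset, no_ep.replica_id)  # (1, 0) 3
```

This mirrors the assertions the new unit tests are described as making on global_shape, global_offset, and replica_id for the EP > 1 and EP = 1 cases.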

Sequence Diagram

sequenceDiagram
    participant CL as Checkpoint Loader
    participant PGC as ProcessGroupCollection
    participant GSDD as Graph Safety Detector
    participant TP as tensor_parallel
    participant CRT as CUDA RNG Tracker
    
    CL->>PGC: Query EP size via get_pg_size
    PGC-->>CL: Return EP size
    
    alt EP > 1
        CL->>CL: Shard RNG states by PP, TP, DP
    else EP ≤ 1
        CL->>CL: Shard RNG states by PP, TP (DP as replica_id)
    end
    
    CL->>CRT: Acquire CUDA RNG tracker
    CRT-->>CL: Return tracker instance
    
    CL->>GSDD: Determine graph_safety status
    GSDD-->>CL: Return is_graph_safe flag
    
    CL->>TP: convert_cuda_rng_state(rng_tracker_states)
    TP-->>CL: Return converted states
    
    CL->>CRT: Set converted RNG states
    CRT-->>CL: States applied successfully
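The tail of the diagram (tracker acquisition, graph-safety check, conversion, application) can be sketched as a small stand-in. Everything here is hypothetical scaffolding: FakeRngTracker and fake_convert are stubs for Megatron's CUDA RNG tracker and tensor_parallel.convert_cuda_rng_state(), used only to show the order of operations.

```python
# Hedged, stubbed sketch of the graph-safe RNG load flow (not Megatron's real API).
class FakeRngTracker:
    """Stub tracker; the real one stores per-name CUDA generator states."""
    def __init__(self, is_graph_safe: bool):
        self.is_graph_safe = is_graph_safe
        self.states = None

    def set_states(self, states):
        self.states = states


def fake_convert(state, to_graph_safe: bool):
    # Stand-in for the conversion step: tag the payload with its target format.
    return {"payload": state, "graph_safe": to_graph_safe}


def load_rng_tracker_states(tracker, saved_states, convert_fn):
    # 1. Determine whether the live tracker is graph-safe.
    is_graph_safe = tracker.is_graph_safe
    # 2. Convert each saved state to the tracker's format before applying,
    #    since the checkpoint may have been written by the other kind.
    converted = {name: convert_fn(state, to_graph_safe=is_graph_safe)
                 for name, state in saved_states.items()}
    # 3. Apply the converted states to the tracker.
    tracker.set_states(converted)
    return converted


tracker = FakeRngTracker(is_graph_safe=True)
out = load_rng_tracker_states(tracker, {"model-parallel-rng": b"\x01"}, fake_convert)
print(out["model-parallel-rng"]["graph_safe"])  # True
```

The design point, per the walkthrough, is that conversion happens before application, so a checkpoint written by a non-graph-safe tracker can be restored into a graph-safe one (and vice versa).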

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Test Results For Major Changes ⚠️ Warning — The PR description lacks test results or testing information despite breaking changes to the get_rng_state() function signature and significant RNG state handling refactoring for Expert Parallelism support. Resolution: update the PR description to document test execution results, convergence validation, and performance impact assessment.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately describes the main change: updating RNG sharding to include the EP (expert parallel) rank, the primary focus of the changeset.
  • Docstring Coverage ✅ Passed — Docstring coverage is 100.00%, above the required threshold of 80.00%.


@ananthsub
Contributor Author

/ok to test 3bbb050

@ananthsub ananthsub enabled auto-merge (squash) January 29, 2026 18:50
@ananthsub ananthsub merged commit a44f04c into NVIDIA-NeMo:main Jan 29, 2026
83 of 85 checks passed
conver334 pushed a commit to conver334/Megatron-Bridge that referenced this pull request Jan 30, 2026