feat: Migration from NeMo Tron to Megatron Bridge #905
Conversation
❌ Submodule Fast-Forward Check Failed — check based on commit 07a505a (PR #905). Submodules that need attention:
- Megatron-LM: ❌ PR branch is BEHIND main branch
- NeMo: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
❌ Submodule Fast-Forward Check Failed — check based on commit 8ad8f88 (PR #905). Submodules that are properly updated:
- Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

Please ensure all submodule commits are fast-forwards of the main branch before merging.
LGTM now, I will run some tests.
So I tested this with some of the recipes added in #713, and not all models seem to be converging:
One more data point is I tested with DeepSeek-V2-Lite (using Megatron-Bridge branch |
This is fixed.
[note to others] I'm currently testing a rebased version of the latest commit on this branch, so any commits after this comment I'd have to pull in via #996.
Signed-off-by: Terry Kong <terryk@nvidia.com>
ℹ️ File Consistency Check — check based on commit 7f0f5d1 (PR #905). DTensor Policy Worker Synchronization Check: both DTensor policy worker files were modified in this PR.
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
❌ Submodule Fast-Forward Check Failed — check based on commit 7f0f5d1 (PR #905). Submodules that need attention:
- Automodel: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com>
Summary
This PR migrates the codebase from `nemo.tron` imports to the new `megatron.bridge` interface, with significant improvements to model weight refitting workflows. The migration streamlines integration with Megatron-LM while introducing a more robust and efficient model weight conversion system.

Key Changes
🔄 Import Migration
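Because the rename is mechanical, it can be scripted. A minimal codemod sketch — the path mapping follows this PR's description, but the helper itself is illustrative and not part of the PR:

```python
# Hypothetical codemod sketch for the nemo.tron -> megatron.bridge migration.
# The path mapping follows this PR's description; the helper is illustrative.
IMPORT_MAP = {
    "nemo.tron.state": "megatron.bridge.training.state",
    "nemo.tron.checkpointing": "megatron.bridge.training.checkpointing",
    "nemo.tron.config": "megatron.bridge.training.config",
    "nemo.tron.init": "megatron.bridge.training.initialize",
    "nemo.tron.optim": "megatron.bridge.training.optim",
    "nemo.tron.tokenizers": "megatron.bridge.training.tokenizers",
    # nemo.tron.utils is split across megatron.bridge.training.utils and
    # megatron.bridge.utils, so it needs case-by-case handling.
}

def rewrite_import(line: str) -> str:
    """Rewrite one source line using the first matching old module path."""
    for old, new in IMPORT_MAP.items():
        if old in line:
            return line.replace(old, new)
    return line
```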
Replaced `nemo.tron.*` imports with their `megatron.bridge.*` equivalents:
- `nemo.tron.state` → `megatron.bridge.training.state`
- `nemo.tron.checkpointing` → `megatron.bridge.training.checkpointing`
- `nemo.tron.config` → `megatron.bridge.training.config`
- `nemo.tron.init` → `megatron.bridge.training.initialize`
- `nemo.tron.optim` → `megatron.bridge.training.optim`
- `nemo.tron.tokenizers` → `megatron.bridge.training.tokenizers`
- `nemo.tron.utils` → `megatron.bridge.training.utils` / `megatron.bridge.utils`

🔧 Enhanced Model Refitting System
New AutoBridge Integration:
- Model loading now uses the `AutoBridge.from_hf_pretrained()` interface
- `self.megatron_bridge = AutoBridge.from_hf_pretrained(hf_model_name)` in policy worker initialization
- Replaces the previous `import_model_from_hf_name()` and `export_model_from_megatron()` helpers

Advanced Refit Parameter Management:
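The kind of bookkeeping a refit-info pass performs can be sketched in a few lines. All names below are hypothetical stand-ins for what the refit-info step computes from the bridge's conversion tasks, not the PR's actual data structures:

```python
from dataclasses import dataclass

# Illustrative refit-planning bookkeeping; names are hypothetical stand-ins.
@dataclass
class ParamInfo:
    name: str
    shape: tuple
    dtype_bytes: int  # bytes per element

    @property
    def size_bytes(self) -> int:
        n = 1
        for dim in self.shape:
            n *= dim
        return n * self.dtype_bytes

def total_refit_bytes(params) -> int:
    """Upper bound on memory needed to stage every converted parameter at once."""
    return sum(p.size_bytes for p in params)
```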
- `prepare_refit_info()`: calculates parameter conversion tasks and memory requirements
- `_calculate_refit_param_info()`: uses `self.megatron_bridge.get_conversion_tasks([self.model])`

Optimized Weight Transfer System:
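A sketch of size-threshold packing, assuming tensors below the `NEMO_RL_MEGATRON_IPC_TENSOR_PACKING_THRESHOLD` cutoff are batched into one contiguous buffer before IPC handles are created. The helper, the default value, and the exact semantics are assumptions, not the PR's implementation:

```python
import os

# Assumed default threshold in bytes; the real default is not stated in the PR.
DEFAULT_THRESHOLD = 64 * 1024 * 1024

def split_by_packing_threshold(sizes_bytes):
    """Partition tensor indices into 'pack together' vs 'transfer individually'."""
    threshold = int(
        os.environ.get("NEMO_RL_MEGATRON_IPC_TENSOR_PACKING_THRESHOLD", DEFAULT_THRESHOLD)
    )
    packed, individual = [], []
    for i, size in enumerate(sizes_bytes):
        (packed if size < threshold else individual).append(i)
    return packed, individual
```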
- `get_weights_ipc_handles()`: uses `self.megatron_bridge.export_hf_weights()` with conversion tasks
- Tensor packing for IPC transfer is controlled by `NEMO_RL_MEGATRON_IPC_TENSOR_PACKING_THRESHOLD`

Memory-Efficient Broadcasting:
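The streaming idea can be illustrated with plain generators: weights are produced and consumed one at a time, so peak memory stays at roughly one converted tensor instead of the whole exported state dict. All names below are hypothetical:

```python
# Sketch of generator-based streaming (names hypothetical, not the PR's code).
def export_weights():
    """Stand-in for a weight-export generator: yields one tensor at a time."""
    for name in ("embed", "layer0", "lm_head"):
        yield name, [0.0] * 4  # placeholder for a converted tensor

def broadcast_streaming(weights_iter, send):
    """Consume the generator item by item; each tensor can be freed after send."""
    for name, tensor in weights_iter:
        send(name, tensor)

sent = []
broadcast_streaming(export_weights(), lambda name, tensor: sent.append(name))
```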
- `broadcast_weights_for_collective()`: consumes the `export_hf_weights()` generator for memory efficiency

🗂️ Code Cleanup and Simplification
- Removed the entire `nemo_rl/models/megatron/converters/` directory
- Removed `nemo_rl/models/megatron/refit_utils.py` (replaced by bridge functionality)
- Removed `gather_params`, `get_local_key_to_global_keys`, `get_param_info`

Simplified Model Import:
🔧 Configuration Updates
- `cfg.dist.external_gpu_device_mapping = True` for proper GPU device handling
- Renamed `cfg.ft_config` → `cfg.ft`
- `init_checkpointing_context`, `maybe_finalize_async_save`

🆕 New Tooling
- `tools/refit_verifier.py`: comprehensive validation tool for comparing logprobs between Megatron and vLLM policies after model weight refitting, ensuring inference consistency across backends

🎯 Refit Performance Improvements
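One common way a ratio knob like this is applied is to size a staging buffer as a fraction of free device memory. A sketch using the `NRL_REFIT_BUFFER_MEMORY_RATIO` variable added in this PR — the default value and the helper are assumptions:

```python
import os

def refit_buffer_bytes(free_memory_bytes: int) -> int:
    """Size the refit staging buffer as a fraction of free memory.

    NRL_REFIT_BUFFER_MEMORY_RATIO is the knob from this PR; the 0.2 default
    and this helper are illustrative assumptions.
    """
    ratio = float(os.environ.get("NRL_REFIT_BUFFER_MEMORY_RATIO", "0.2"))
    return int(free_memory_bytes * ratio)
```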
- New `NRL_REFIT_BUFFER_MEMORY_RATIO` environment variable

🧪 Testing & Validation
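Conceptually, refit verification compares per-token logprobs from the training and inference backends after weights are refit. A toy version of that check — not the actual `tools/refit_verifier.py` implementation, and the tolerance is an assumption:

```python
def logprobs_match(megatron_lps, vllm_lps, atol=1e-2):
    """Return True if two logprob sequences agree elementwise within atol.

    Toy stand-in for a refit-verifier check; tolerance is an assumption.
    """
    if len(megatron_lps) != len(vllm_lps):
        return False
    return all(abs(a - b) <= atol for a, b in zip(megatron_lps, vllm_lps))
```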
📋 Files Modified
- `nemo_rl/models/megatron/common.py` - Updated imports
- `nemo_rl/models/megatron/community_import.py` - Simplified with AutoBridge
- `nemo_rl/models/policy/megatron_policy_worker.py` - Major refit system overhaul
- `tests/unit/distributed/test_virtual_cluster.py` - Updated imports
- `tools/refit_verifier.py` - Model consistency validation tool
- `nemo_rl/models/megatron/converters/` directory
- `nemo_rl/models/megatron/refit_utils.py`

Known issues
qwen3-30b-a3b is slower during refit: #1004
Results
dpo-llama3.1-8b-instruct-4n8g-megatron.v2
grpo-llama3.2-1b-instruct-1n8g-megatron
grpo-qwen2.5-7b-instruct-4n8g-megatron
grpo-qwen3-30ba3b-8n8g-megatron
sft-llama3.1-70b-8n8g-tp4pp2-long-megatron
sft-llama3.1-8b-1n8g-megatron
sft-llama3.1-8b-1n8g-megatron-seqpack