
feat: Migration from NeMo Tron to Megatron Bridge #905

Merged
terrykong merged 11 commits into main from yuya/adapt_megatron_bridge
Aug 30, 2025

Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Aug 12, 2025

Summary

This PR migrates the codebase from using nemo.tron imports to the new megatron.bridge interface, with significant improvements to model weight refitting workflows. The migration streamlines integration with Megatron-LM while introducing a more robust and efficient model weight conversion system.

Key Changes

🔄 Import Migration

  • Replaced nemo.tron.* imports with megatron.bridge.* equivalents:
    • nemo.tron.state → megatron.bridge.training.state
    • nemo.tron.checkpointing → megatron.bridge.training.checkpointing
    • nemo.tron.config → megatron.bridge.training.config
    • nemo.tron.init → megatron.bridge.training.initialize
    • nemo.tron.optim → megatron.bridge.training.optim
    • nemo.tron.tokenizers → megatron.bridge.training.tokenizers
    • nemo.tron.utils → megatron.bridge.training.utils / megatron.bridge.utils
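
The rename is mechanical. A hypothetical helper (names are illustrative, not part of the PR, which did the rewrite directly in the source files) sketches the mapping:

```python
# Illustrative sketch of the nemo.tron -> megatron.bridge import rename.
# nemo.tron.utils is omitted because it splits into two target modules
# and was handled case by case.
IMPORT_MAP = {
    "nemo.tron.state": "megatron.bridge.training.state",
    "nemo.tron.checkpointing": "megatron.bridge.training.checkpointing",
    "nemo.tron.config": "megatron.bridge.training.config",
    "nemo.tron.init": "megatron.bridge.training.initialize",
    "nemo.tron.optim": "megatron.bridge.training.optim",
    "nemo.tron.tokenizers": "megatron.bridge.training.tokenizers",
}

def migrate_import(line: str) -> str:
    """Rewrite a single import line from the old to the new namespace."""
    for old, new in IMPORT_MAP.items():
        if old in line:
            return line.replace(old, new)
    return line
```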

🔧 Enhanced Model Refitting System

New AutoBridge Integration:

  • Replaced manual converter classes with unified AutoBridge.from_hf_pretrained() interface
  • Added self.megatron_bridge = AutoBridge.from_hf_pretrained(hf_model_name) in policy worker initialization
  • Streamlined model conversion in import_model_from_hf_name() and export_model_from_megatron()

Advanced Refit Parameter Management:

  • New prepare_refit_info(): Calculates parameter conversion tasks and memory requirements
  • Enhanced _calculate_refit_param_info():
    • Generates conversion tasks using self.megatron_bridge.get_conversion_tasks([self.model])
    • Calculates precise memory requirements per parameter with dtype scaling
    • Returns parameter metadata for efficient memory allocation
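
The per-parameter memory accounting reduces to element count times per-element size. A simplified stand-in (the real `_calculate_refit_param_info()` works on the bridge's conversion tasks; the dtype table and function name below are assumptions for illustration):

```python
# Hypothetical sketch: bytes needed for one converted parameter,
# scaled by the element size of its target dtype.
DTYPE_SIZE = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

def param_memory_bytes(shape, dtype):
    """Element count of `shape` multiplied by the per-element size of `dtype`."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel * DTYPE_SIZE[dtype]
```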

Optimized Weight Transfer System:

  • Improved get_weights_ipc_handles():
    • Uses self.megatron_bridge.export_hf_weights() with conversion tasks
    • Implements intelligent tensor packing based on NEMO_RL_MEGATRON_IPC_TENSOR_PACKING_THRESHOLD
    • Consolidates tensors by dtype to reduce IPC overhead
    • Maintains tensor references to prevent garbage collection during transfer
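
The packing idea can be sketched as follows (a simplified illustration: `plan_packing` and the tuple format are hypothetical, and the real worker operates on CUDA tensors, consolidating small ones into one buffer per dtype so each group needs a single IPC handle):

```python
from collections import defaultdict

def plan_packing(tensors, threshold):
    """Group tensors smaller than `threshold` bytes by dtype for packing
    into one consolidated buffer per dtype; larger tensors keep their own
    IPC handle. `tensors` is a list of (name, dtype, nbytes) tuples."""
    packed = defaultdict(list)   # dtype -> names packed together
    individual = []              # large tensors transferred individually
    for name, dtype, nbytes in tensors:
        if nbytes < threshold:
            packed[dtype].append(name)
        else:
            individual.append(name)
    return dict(packed), individual
```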

Memory-Efficient Broadcasting:

  • Updated broadcast_weights_for_collective():
    • Leverages bridge's export_hf_weights() generator for memory efficiency
    • Streams weights without loading entire model into memory simultaneously
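
In sketch form (hypothetical names; the real worker iterates the bridge's `export_hf_weights()` generator and issues a collective broadcast per weight):

```python
def stream_broadcast(weight_iter, broadcast_fn):
    """Consume (name, tensor) pairs lazily and broadcast each one
    immediately, so only the currently converted tensor is live in
    memory rather than the whole exported model."""
    sent = []
    for name, tensor in weight_iter:
        broadcast_fn(name, tensor)
        sent.append(name)
    return sent
```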

🗂️ Code Cleanup and Simplification

  • Removed deprecated converter modules:
    • nemo_rl/models/megatron/converters/ entire directory
    • nemo_rl/models/megatron/refit_utils.py (replaced by bridge functionality)
    • Removed imports: gather_params, get_local_key_to_global_keys, get_param_info

Simplified Model Import:

```python
# Before: complex model-specific converters
if hf_config.model_type == "llama":
    importer = HFLlamaImporter(hf_model_name, output_path=output_path)
elif hf_config.model_type == "qwen2":
    importer = HFQwen2Importer(hf_model_name, output_path=output_path)
# ... many more model types

# After: unified interface
bridge = AutoBridge.from_hf_pretrained(hf_model_name)
megatron_model = bridge.to_megatron_model(wrap_with_ddp=False)
bridge.save_megatron_model(megatron_model, output_path)
```

🔧 Configuration Updates

  • Set cfg.dist.external_gpu_device_mapping = True for proper GPU device handling
  • Updated fault tolerance configuration pattern (cfg.ft_config → cfg.ft)
  • Enhanced checkpointing imports: init_checkpointing_context, maybe_finalize_async_save

🆕 New Tooling

  • Added tools/refit_verifier.py: Comprehensive validation tool for comparing logprobs between Megatron and vLLM policies after model weight refitting, ensuring inference consistency across backends
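
The core check such a verifier performs can be sketched as follows (a simplified stand-in for `tools/refit_verifier.py`; function names and the tolerance are illustrative, not the tool's actual API):

```python
def max_logprob_error(ref_logprobs, test_logprobs):
    """Maximum absolute per-token logprob difference between two backends."""
    assert len(ref_logprobs) == len(test_logprobs)
    return max(abs(a - b) for a, b in zip(ref_logprobs, test_logprobs))

def refit_consistent(ref_logprobs, test_logprobs, tol=1e-2):
    """Treat the refit as consistent if the worst-case error is within `tol`."""
    return max_logprob_error(ref_logprobs, test_logprobs) <= tol
```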

🎯 Refit Performance Improvements

  1. Reduced Memory Footprint: Bridge-based conversion eliminates intermediate tensor copies
  2. Faster Weight Transfer: Dtype-based tensor packing reduces IPC calls by up to 90%
  3. Better Memory Management: Configurable buffer ratios via NRL_REFIT_BUFFER_MEMORY_RATIO
  4. Streaming Conversion: Generator-based weight export prevents OOM on large models

🧪 Testing & Validation

  • All existing refit workflows maintained through bridge interface compatibility
  • New refit verifier enables precise logprob comparison validation
  • Enhanced error handling and memory monitoring during weight transfers

📋 Files Modified

  • nemo_rl/models/megatron/common.py - Updated imports
  • nemo_rl/models/megatron/community_import.py - Simplified with AutoBridge
  • nemo_rl/models/policy/megatron_policy_worker.py - Major refit system overhaul
  • tests/unit/distributed/test_virtual_cluster.py - Updated imports

📋 Files Added

  • tools/refit_verifier.py - Model consistency validation tool

📋 Files Removed

  • All files in nemo_rl/models/megatron/converters/ directory
  • nemo_rl/models/megatron/refit_utils.py

Known issues

qwen3-30b-a3b is slower during refit: #1004


Results

Convergence plots were attached for each of the following recipes:

  • dpo-llama3.1-8b-instruct-4n8g-megatron.v2
  • grpo-llama3.2-1b-instruct-1n8g-megatron
  • grpo-qwen2.5-7b-instruct-4n8g-megatron
  • grpo-qwen3-30ba3b-8n8g-megatron
  • sft-llama3.1-70b-8n8g-tp4pp2-long-megatron
  • sft-llama3.1-8b-1n8g-megatron
  • sft-llama3.1-8b-1n8g-megatron-seqpack

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 07a505a (PR #905 from yuya/adapt_megatron_bridge)

❌ Submodules that need attention:

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/terrykong/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/
CURRENT (PR #905 from yuya/adapt_megatron_bridge): https://github.com/terrykong/Megatron-LM/commits/fdd88b20910752ada68f21f31caff7bddd372bc1/

NeMo: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA/NeMo/commits/aaefedd1d13f4ccd5cd06a19e06f1df33589a235/
CURRENT (PR #905 from yuya/adapt_megatron_bridge): https://github.com/NVIDIA/NeMo/commits/024a7e65db629d0cfa6f1086f9298bdfee8cdaec/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@yaoyu-33 yaoyu-33 changed the title from "Adapt Megatron Bridge for Conversions and Refit" to "Migration from NeMo Tron to Megatron Bridge" Aug 12, 2025
@yaoyu-33 yaoyu-33 changed the title from "Migration from NeMo Tron to Megatron Bridge" to "feat: Migration from NeMo Tron to Megatron Bridge" Aug 12, 2025
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 8ad8f88 (PR #905 from yuya/adapt_megatron_bridge)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 0bed0c0 (PR #905 from yuya/adapt_megatron_bridge)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: d811f17 (PR #905 from yuya/adapt_megatron_bridge)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@guyueh1
Contributor

guyueh1 commented Aug 19, 2025

LGTM now, I will run some tests

@terrykong
Collaborator

So I tested this with some of the recipes added in #713, and not all models seem to be converging:

| recipe | tron | bridge |
| --- | --- | --- |
| dpo-llama3.1-8b-instruct-4n8g-megatron.v2 | Y | Y |
| grpo-llama3.2-1b-instruct-1n8g-megatron | Y | logprob error (bad) |
| grpo-qwen2.5-7b-instruct-4n8g-megatron | Y | logprob error (moderate) |
| grpo-qwen3-30ba3b-8n8g-megatron | Y | logprob error (moderate) |
| sft-llama3.1-70b-8n8g-tp4pp2-long-megatron | Y | OOM |
| sft-llama3.1-8b-1n8g-megatron | Y | Y |
| sft-llama3.1-8b-1n8g-megatron-seqpack | Y | Y |

@ashors1 ashors1 mentioned this pull request Aug 21, 2025
@yfw
Contributor

yfw commented Aug 21, 2025

One more data point: I tested with DeepSeek-V2-Lite (using Megatron-Bridge branch yifu/nemo-rl-ds) and also saw high logprob error (> 1.3). Not sure whether this is an issue in the converter; still debugging.

@guyueh1 guyueh1 linked an issue Aug 26, 2025 that may be closed by this pull request
@yaoyu-33
Contributor Author

> One more data point: I tested with DeepSeek-V2-Lite (using Megatron-Bridge branch yifu/nemo-rl-ds) and also saw high logprob error (> 1.3). Not sure whether this is an issue in the converter; still debugging.

This is fixed.

@terrykong
Collaborator

terrykong commented Aug 27, 2025

[note to others] I'm currently testing a rebased version of the latest commit on this branch, so any commits after this comment I'd have to pull in via #996.

@terrykong terrykong force-pushed the yuya/adapt_megatron_bridge branch from 1af5241 to 1c9e78a Compare August 27, 2025 23:38
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 1c9e78a (PR #905 from yuya/adapt_megatron_bridge)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ New submodule being added
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@terrykong terrykong force-pushed the yuya/adapt_megatron_bridge branch from 1c9e78a to 9088b0a Compare August 28, 2025 00:00
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 9088b0a (PR #905 from yuya/adapt_megatron_bridge)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ New submodule being added
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: cc19d01 (PR #905 from yuya/adapt_megatron_bridge)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ New submodule being added
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 39eac45 (PR #905 from yuya/adapt_megatron_bridge)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ New submodule being added
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@guyueh1 guyueh1 mentioned this pull request Aug 29, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 29, 2025
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 8244c0f (PR #905 from yuya/adapt_megatron_bridge)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ New submodule being added
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@terrykong terrykong enabled auto-merge August 29, 2025 21:05
terrykong
terrykong previously approved these changes Aug 29, 2025
@terrykong terrykong added this pull request to the merge queue Aug 29, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 30, 2025
@terrykong terrykong enabled auto-merge August 30, 2025 16:57
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 7f0f5d1 (PR #905 from yuya/adapt_megatron_bridge)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ New submodule being added
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@terrykong terrykong added this pull request to the merge queue Aug 30, 2025
Merged via the queue into main with commit c4fd5d3 Aug 30, 2025
21 checks passed
@terrykong terrykong deleted the yuya/adapt_megatron_bridge branch August 30, 2025 19:55
gshennvm pushed a commit that referenced this pull request Sep 3, 2025
@github-actions

github-actions bot commented Sep 8, 2025

ℹ️ File Consistency Check

Check based on commit: 7f0f5d1 (PR #905 from yuya/adapt_megatron_bridge)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

github-actions bot commented Sep 8, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 7f0f5d1 (PR #905 from yuya/adapt_megatron_bridge)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/71162c284d315193cbb4011081228da2ba943c27/
CURRENT (PR #905 from yuya/adapt_megatron_bridge): https://github.com/NVIDIA-NeMo/Automodel/commits/256f74c8d3bc12cc789488e72cf3b1d05601a955/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

ℹ️ File Consistency Check

Check based on commit: 7f0f5d1 (PR #905 from yuya/adapt_megatron_bridge)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 7f0f5d1 (PR #905 from yuya/adapt_megatron_bridge)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/7b55cabc0a3b1d8b03b6c1f680c030ea2c8eaa77/
CURRENT (PR #905 from yuya/adapt_megatron_bridge): https://github.com/NVIDIA-NeMo/Automodel/commits/256f74c8d3bc12cc789488e72cf3b1d05601a955/

Megatron-Bridge: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/abd52c89fe969869b8969acc181630c273cca4fd/
CURRENT (PR #905 from yuya/adapt_megatron_bridge): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/a1bbfc2429a23786a0a288ac55437fc931c567bd/

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/terrykong/Megatron-LM/commits/af73aa2cebf94a0bee5ea6dda2614ad989faffae/
CURRENT (PR #905 from yuya/adapt_megatron_bridge): https://github.com/terrykong/Megatron-LM/commits/e2d5bcd605108e2cf64fdb91fdfc669f10a57f56/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025


Development

Successfully merging this pull request may close these issues.

B200/GB200 on NGC pytorch

8 participants