feat: Migration from NeMo Tron to Megatron Bridge #905
Conversation
❌ Submodule Fast-Forward Check Failed — check based on commit 07a505a (PR #905). Submodules that need attention:
- Megatron-LM: ❌ PR branch is BEHIND main branch
- NeMo: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
❌ Submodule Fast-Forward Check Failed — check based on commit 8ad8f88 (PR #905). Submodules that are properly updated:
- Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

Please ensure all submodule commits are fast-forwards of the main branch before merging.
LGTM now, I will run some tests.
So I tested this with some of the recipes added in #713, and not all models seem to be converging:
One more data point is I tested with DeepSeek-V2-Lite (using Megatron-Bridge branch |
This is fixed.
[note to others] I'm currently testing a rebased version of the latest commit on this branch, so any commits after this comment I'd have to pull in via #996.
Signed-off-by: Terry Kong <terryk@nvidia.com>
ℹ️ File Consistency Check — check based on commit 7f0f5d1 (PR #905). DTensor Policy Worker Synchronization Check: both DTensor policy worker files were modified in this PR.
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
❌ Submodule Fast-Forward Check Failed — check based on commit 7f0f5d1 (PR #905). Submodules that need attention:
- Automodel: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com>
Summary
This PR migrates the codebase from `nemo.tron` imports to the new `megatron.bridge` interface, with significant improvements to model weight refitting workflows. The migration streamlines integration with Megatron-LM while introducing a more robust and efficient model weight conversion system.

Key Changes
🔄 Import Migration
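Because the rename is mechanical, it can be scripted. A minimal codemod sketch — the path mapping follows this PR's description, but the helper itself is illustrative and not part of the PR:

```python
# Hypothetical codemod sketch for the nemo.tron -> megatron.bridge migration.
# The path mapping follows this PR's description; the helper is illustrative.
IMPORT_MAP = {
    "nemo.tron.state": "megatron.bridge.training.state",
    "nemo.tron.checkpointing": "megatron.bridge.training.checkpointing",
    "nemo.tron.config": "megatron.bridge.training.config",
    "nemo.tron.init": "megatron.bridge.training.initialize",
    "nemo.tron.optim": "megatron.bridge.training.optim",
    "nemo.tron.tokenizers": "megatron.bridge.training.tokenizers",
    # nemo.tron.utils is split across megatron.bridge.training.utils and
    # megatron.bridge.utils, so it needs case-by-case handling.
}

def rewrite_import(line: str) -> str:
    """Rewrite one source line using the first matching old module path."""
    for old, new in IMPORT_MAP.items():
        if old in line:
            return line.replace(old, new)
    return line
```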
Replaced `nemo.tron.*` imports with their `megatron.bridge.*` equivalents:
- `nemo.tron.state` → `megatron.bridge.training.state`
- `nemo.tron.checkpointing` → `megatron.bridge.training.checkpointing`
- `nemo.tron.config` → `megatron.bridge.training.config`
- `nemo.tron.init` → `megatron.bridge.training.initialize`
- `nemo.tron.optim` → `megatron.bridge.training.optim`
- `nemo.tron.tokenizers` → `megatron.bridge.training.tokenizers`
- `nemo.tron.utils` → `megatron.bridge.training.utils` / `megatron.bridge.utils`

🔧 Enhanced Model Refitting System
New AutoBridge Integration:
- Model loading now uses the `AutoBridge.from_hf_pretrained()` interface
- `self.megatron_bridge = AutoBridge.from_hf_pretrained(hf_model_name)` in policy worker initialization
- Replaces the previous `import_model_from_hf_name()` and `export_model_from_megatron()` helpers

Advanced Refit Parameter Management:
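The kind of bookkeeping a refit-info pass performs can be sketched in a few lines. All names below are hypothetical stand-ins for what the refit-info step computes from the bridge's conversion tasks, not the PR's actual data structures:

```python
from dataclasses import dataclass

# Illustrative refit-planning bookkeeping; names are hypothetical stand-ins.
@dataclass
class ParamInfo:
    name: str
    shape: tuple
    dtype_bytes: int  # bytes per element

    @property
    def size_bytes(self) -> int:
        n = 1
        for dim in self.shape:
            n *= dim
        return n * self.dtype_bytes

def total_refit_bytes(params) -> int:
    """Upper bound on memory needed to stage every converted parameter at once."""
    return sum(p.size_bytes for p in params)
```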
- `prepare_refit_info()`: calculates parameter conversion tasks and memory requirements
- `_calculate_refit_param_info()`: uses `self.megatron_bridge.get_conversion_tasks([self.model])`

Optimized Weight Transfer System:
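A sketch of size-threshold packing, assuming tensors below the `NEMO_RL_MEGATRON_IPC_TENSOR_PACKING_THRESHOLD` cutoff are batched into one contiguous buffer before IPC handles are created. The helper, the default value, and the exact semantics are assumptions, not the PR's implementation:

```python
import os

# Assumed default threshold in bytes; the real default is not stated in the PR.
DEFAULT_THRESHOLD = 64 * 1024 * 1024

def split_by_packing_threshold(sizes_bytes):
    """Partition tensor indices into 'pack together' vs 'transfer individually'."""
    threshold = int(
        os.environ.get("NEMO_RL_MEGATRON_IPC_TENSOR_PACKING_THRESHOLD", DEFAULT_THRESHOLD)
    )
    packed, individual = [], []
    for i, size in enumerate(sizes_bytes):
        (packed if size < threshold else individual).append(i)
    return packed, individual
```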
- `get_weights_ipc_handles()`: uses `self.megatron_bridge.export_hf_weights()` with conversion tasks
- Tensor packing for IPC transfer is controlled by `NEMO_RL_MEGATRON_IPC_TENSOR_PACKING_THRESHOLD`

Memory-Efficient Broadcasting:
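The streaming idea can be illustrated with plain generators: weights are produced and consumed one at a time, so peak memory stays at roughly one converted tensor instead of the whole exported state dict. All names below are hypothetical:

```python
# Sketch of generator-based streaming (names hypothetical, not the PR's code).
def export_weights():
    """Stand-in for a weight-export generator: yields one tensor at a time."""
    for name in ("embed", "layer0", "lm_head"):
        yield name, [0.0] * 4  # placeholder for a converted tensor

def broadcast_streaming(weights_iter, send):
    """Consume the generator item by item; each tensor can be freed after send."""
    for name, tensor in weights_iter:
        send(name, tensor)

sent = []
broadcast_streaming(export_weights(), lambda name, tensor: sent.append(name))
```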
- `broadcast_weights_for_collective()`: consumes the `export_hf_weights()` generator for memory efficiency

🗂️ Code Cleanup and Simplification
- Removed the entire `nemo_rl/models/megatron/converters/` directory
- Removed `nemo_rl/models/megatron/refit_utils.py` (replaced by bridge functionality)
- Removed `gather_params`, `get_local_key_to_global_keys`, `get_param_info`

Simplified Model Import:
🔧 Configuration Updates
- `cfg.dist.external_gpu_device_mapping = True` for proper GPU device handling
- Renamed `cfg.ft_config` → `cfg.ft`
- `init_checkpointing_context`, `maybe_finalize_async_save`

🆕 New Tooling
- `tools/refit_verifier.py`: comprehensive validation tool for comparing logprobs between Megatron and vLLM policies after model weight refitting, ensuring inference consistency across backends

🎯 Refit Performance Improvements
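One common way a ratio knob like this is applied is to size a staging buffer as a fraction of free device memory. A sketch using the `NRL_REFIT_BUFFER_MEMORY_RATIO` variable added in this PR — the default value and the helper are assumptions:

```python
import os

def refit_buffer_bytes(free_memory_bytes: int) -> int:
    """Size the refit staging buffer as a fraction of free memory.

    NRL_REFIT_BUFFER_MEMORY_RATIO is the knob from this PR; the 0.2 default
    and this helper are illustrative assumptions.
    """
    ratio = float(os.environ.get("NRL_REFIT_BUFFER_MEMORY_RATIO", "0.2"))
    return int(free_memory_bytes * ratio)
```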
- New `NRL_REFIT_BUFFER_MEMORY_RATIO` environment variable

🧪 Testing & Validation
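Conceptually, refit verification compares per-token logprobs from the training and inference backends after weights are refit. A toy version of that check — not the actual `tools/refit_verifier.py` implementation, and the tolerance is an assumption:

```python
def logprobs_match(megatron_lps, vllm_lps, atol=1e-2):
    """Return True if two logprob sequences agree elementwise within atol.

    Toy stand-in for a refit-verifier check; tolerance is an assumption.
    """
    if len(megatron_lps) != len(vllm_lps):
        return False
    return all(abs(a - b) <= atol for a, b in zip(megatron_lps, vllm_lps))
```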
📋 Files Modified
- `nemo_rl/models/megatron/common.py` - Updated imports
- `nemo_rl/models/megatron/community_import.py` - Simplified with AutoBridge
- `nemo_rl/models/policy/megatron_policy_worker.py` - Major refit system overhaul
- `tests/unit/distributed/test_virtual_cluster.py` - Updated imports
- `tools/refit_verifier.py` - Model consistency validation tool
- `nemo_rl/models/megatron/converters/` directory
- `nemo_rl/models/megatron/refit_utils.py`

Known issues
qwen3-30b-a3b is slower during refit: #1004
Results
dpo-llama3.1-8b-instruct-4n8g-megatron.v2
grpo-llama3.2-1b-instruct-1n8g-megatron
grpo-qwen2.5-7b-instruct-4n8g-megatron
grpo-qwen3-30ba3b-8n8g-megatron
sft-llama3.1-70b-8n8g-tp4pp2-long-megatron
sft-llama3.1-8b-1n8g-megatron
sft-llama3.1-8b-1n8g-megatron-seqpack