chore: Update RL to use megatron-bridge tot#1358
Conversation
📝 Walkthrough

Adds a pre-finalize step to the Megatron community import. Introduces a CustomFloat16Module and mixed-precision-wrapper selection logic in the Megatron policy worker, including FSDP detection, vocab padding, and stricter config assertions. Updates refit_verifier to explicitly pass TP/PP/EP to vLLM, set temperatures, add train_iters, and fix boolean types.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant CommunityImport as Community Import
    participant Provider as Megatron Provider
    Caller->>CommunityImport: import_model_from_hf_name(...)
    CommunityImport->>Provider: finalize()
    Note over Provider: Prepares global state before parallel init
    CommunityImport->>Provider: initialize_model_parallel(...)
    Provider-->>Caller: Initialized model
```

```mermaid
sequenceDiagram
    autonumber
    participant Worker as PolicyWorker
    participant Tokenizer
    participant Vocab as VocabUtil
    participant Wrapper as MixedPrecisionWrapper
    participant Model as MegatronModel
    participant Router as MoE Routers
    participant FSDP as torch_FSDP (optional)
    Worker->>Tokenizer: load tokenizer
    Worker->>Vocab: calculate_padded_vocab_size(vocab_size)
    Vocab-->>Worker: final_padded_vocab_size
    Worker->>Wrapper: select wrapper (Float16 / CustomFloat16 / None)
    alt FSDP available
        Worker->>FSDP: set HAVE_FSDP2 flag
    end
    Worker->>Model: get_model(..., mixed_precision_wrapper=Wrapper, vocab_size=asserted)
    opt Using CustomFloat16
        Worker->>Wrapper: re_enable_float32_expert_bias()
        Wrapper->>Router: _maintain_float32_expert_bias()
    end
    Note over Worker,Model: Same flow applied to reference model (pre/post load)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (2)
nemo_rl/models/policy/megatron_policy_worker.py (1)

2026-2060: CustomFloat16Module correctly maintains float32 MoE router bias.

The new CustomFloat16Module class properly extends Float16Module to ensure the MoE router expert bias stays in float32 for numerical stability. The re_enable_float32_expert_bias() method correctly:

- Handles VLM models by unwrapping language_model
- Walks decoder layers to find routers
- Invokes _maintain_float32_expert_bias() when available

Consider adding defensive checks for robustness:

```diff
 def re_enable_float32_expert_bias(self) -> None:
     """Ensure MoE router expert bias stays in float32 for numerical stability.

     Walks the wrapped module to find MoE routers and invokes the
     `_maintain_float32_expert_bias()` helper which recreates or casts the
     expert bias tensors to float32 as required by Megatron-LM.
     """
     module = self.module
     # Handle VLM models where language model is nested
     if hasattr(module, "language_model"):
         module = module.language_model
-    if hasattr(module, "decoder") and hasattr(module.decoder, "layers"):
-        for layer in module.decoder.layers:
-            mlp = getattr(layer, "mlp", None)
-            router = getattr(mlp, "router", None) if mlp is not None else None
-            if router is not None and hasattr(
-                router, "_maintain_float32_expert_bias"
-            ):
-                router._maintain_float32_expert_bias()
+    # Only process if the model has the expected decoder structure
+    if not (hasattr(module, "decoder") and hasattr(module.decoder, "layers")):
+        return
+    for layer in module.decoder.layers:
+        mlp = getattr(layer, "mlp", None)
+        router = getattr(mlp, "router", None) if mlp is not None else None
+        if router is not None and hasattr(router, "_maintain_float32_expert_bias"):
+            router._maintain_float32_expert_bias()
```

nemo_rl/models/megatron/community_import.py (1)
72-72: Update docstring to document the finalize() call.

Add that model_provider.finalize() runs deferred post-init logic, validates the provider/config, and must be called after config modifications and before initialize_model_parallel().
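The ordering contract described above (finalize after config edits, before parallel init) can be sketched with a toy provider. The `ModelProvider` class and its methods here are illustrative stand-ins, not the Megatron-Bridge API:

```python
class ModelProvider:
    """Toy provider that defers validation until finalize().

    Hypothetical stand-in for the Megatron provider used to illustrate
    the call ordering only; none of these names are the real API.
    """

    def __init__(self) -> None:
        self.vocab_size = None
        self.finalized = False

    def finalize(self) -> None:
        # Deferred post-init: validate the config after all modifications.
        assert self.vocab_size is not None, "set vocab_size before finalize()"
        self.finalized = True

    def initialize_model_parallel(self) -> None:
        # Parallel init must only run on a finalized provider.
        assert self.finalized, "call finalize() before initialize_model_parallel()"


provider = ModelProvider()
provider.vocab_size = 151_936   # config modifications happen first
provider.finalize()             # then deferred validation
provider.initialize_model_parallel()
```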
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- nemo_rl/models/megatron/community_import.py (1 hunks)
- nemo_rl/models/policy/megatron_policy_worker.py (9 hunks)
- tools/refit_verifier.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:

- nemo_rl/models/megatron/community_import.py
- tools/refit_verifier.py
- nemo_rl/models/policy/megatron_policy_worker.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:

- nemo_rl/models/megatron/community_import.py
- nemo_rl/models/policy/megatron_policy_worker.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (9)
tools/refit_verifier.py (3)
161-161: LGTM! Consistent temperature configuration.

Adding temperature: 1.0 to the Megatron generation config ensures consistency with the vLLM configuration (line 271), which is important for accurate logprob comparison in this verification script.

216-217: LGTM! Required Megatron config additions.

Adding train_iters: 1 and bias_activation_fusion: False aligns with new Megatron configuration requirements. The assertion at line 651 in megatron_policy_worker.py confirms that train_iters is now mandatory.

277-279: LGTM! Explicit parallelism configuration.

Explicitly passing tensor_parallel_size, pipeline_parallel_size, and expert_parallel_size to the vLLM config (instead of computing them) improves clarity and aligns with the updated VllmGeneration expectations noted in the comment at line 262.

nemo_rl/models/policy/megatron_policy_worker.py (6)
134-139: LGTM! Proper FSDP2 feature detection.

The try/except block correctly detects FSDP2 availability without requiring a hard dependency. The ImportError-specific exception handling follows best practices and is used appropriately at lines 316 and 771 to conditionally adjust checkpoint loading behavior.
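The detection pattern generalizes to any optional dependency; a minimal sketch (the actual FSDP2 import path is omitted here since it varies across torch versions):

```python
import importlib


def have_module(module_name: str) -> bool:
    """Return True if `module_name` imports cleanly, else False.

    The try body stays minimal and only ImportError is caught, matching
    the guideline of limiting except clauses to specific exceptions.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    else:
        return True


# In the worker this pattern sets a module-level flag such as HAVE_FSDP2.
HAVE_JSON = have_module("json")
HAVE_MISSING = have_module("definitely_not_a_real_module_xyz")
```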
249-249: Stricter vocab_size validation.

The assertion now requires vocab_size to be explicitly specified in the model config, replacing any previous fallback behavior. This is a breaking change that ensures explicit configuration. Ensure all configuration files and documentation specify vocab_size explicitly.
814-818: LGTM! Explicit padded vocab size calculation.

Calculating final_padded_vocab_size using the imported calculate_padded_vocab_size utility ensures correct vocab padding for tensor parallelism. The calculated value is used at line 1466 for inference configuration.
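The padding rule such a utility typically applies — round the vocab up so every tensor-parallel rank owns an equal slice of the embedding — can be sketched as follows; the exact formula of the real calculate_padded_vocab_size may differ:

```python
def padded_vocab_size(
    vocab_size: int,
    make_vocab_size_divisible_by: int,
    tensor_model_parallel_size: int,
) -> int:
    """Round vocab_size up to a multiple of divisible_by * TP size.

    Sketch of Megatron-style vocab padding; the real utility may apply
    additional constraints.
    """
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    # Ceiling division, then scale back up to the nearest multiple.
    return ((vocab_size + multiple - 1) // multiple) * multiple


# 32000 tokens, divisible-by 128, TP=8 -> padded to a multiple of 1024
print(padded_vocab_size(32000, 128, 8))  # 32768
```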
254-276: Clarify precedence when both freeze_moe_router and defer_fp32_logits are enabled.

The logic sets mixed_precision_wrapper = CustomFloat16Module when freeze_moe_router is true (lines 271-272), then overrides it to None when defer_fp32_logits is enabled (lines 275-276). This means defer_fp32_logits takes precedence.

Verify that the precedence is intentional. If both options can be enabled simultaneously, consider adding a comment or assertion to clarify the expected behavior:

```python
# If deferring fp32 logits, disable the mixed-precision wrapper entirely.
# This takes precedence over freeze_moe_router, which also sets the wrapper.
if policy_cfg["megatron_cfg"].get("defer_fp32_logits", None):
    mixed_precision_wrapper = None
```
745-758: LGTM! Consistent wrapper configuration for reference model.

The reference model uses the same mixed precision wrapper selection logic as the main model (lines 254-276), ensuring consistency. The ref_mixed_precision_wrapper is correctly passed to get_model at line 758.
93-93: LGTM! Required import for CustomFloat16Module.

The TransformerConfig import is necessary for the new CustomFloat16Module class definition at line 2038.
@ZhiyuLi-Nvidia can you double check the fp32 expert bias change in this PR?

@yaoyu-33 could you help me run a test on moonshotai/Moonlight-16B-A3B-Instruct for verification? I think we are good to go if the experiment is successful.
❌ Submodule Fast-Forward Check Failed

Check based on commit: 0f93ad0 (PR #1358 from …)

❌ Submodules that need attention:
- Megatron-Bridge: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
❌ Submodule Fast-Forward Check Failed

Check based on commit: 371e458 (PR #1358 from …)

✅ Submodules that are properly updated:
- Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:
- Megatron-Bridge: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
ZhiyuLi-Nvidia left a comment:
Thank you @yaoyu-33. LGTM!
Sync up offline.
Correct convergence/logprob numbers on the Moonlight model should verify the effectiveness of re_enable_float32_expert_bias.
❌ Submodule Fast-Forward Check FailedCheck based on commit: d7a3e40 (PR #1358 from ✅ Submodules that are properly updated:Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward) ❌ Submodules that need attention:Megatron-Bridge: ❌ Commits have DIVERGED from a common ancestor Please ensure all submodule commits are fast-forwards of the main branch before merging. |
❌ Submodule Fast-Forward Check FailedCheck based on commit: 3674c3f (PR #1358 from ✅ Submodules that are properly updated:Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward) ❌ Submodules that need attention:Megatron-Bridge: ❌ Commits have DIVERGED from a common ancestor Please ensure all submodule commits are fast-forwards of the main branch before merging. |
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do?
As title
Tests
The 70b failure is a slight memory bump that exists in main.
The dpo failure dpo-llama3.1-8b-instruct-4n8g-megatron.v2 is due to the num_workers change, which caused the shuffling order to change since the default is now 1 instead of 0.

Issues
Usage
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Chores