Fix performance config scripts for parameterless recipe API#2201

Merged
yaoyu-33 merged 3 commits into main from replay/0909f9fd on Feb 5, 2026
Conversation


yaoyu-33 (Contributor) commented Feb 3, 2026

Summary

This PR fixes the performance config scripts to work with the new parameterless recipe API introduced in PR #2067.

Problem

After reverting PR #2067, the nemo-ci library and performance testing were broken because:

  1. The scripts/performance/configs/*/*_pretrain.py files import and instantiate recipes
  2. Recipes no longer accept kwargs (like mock, precision_config, dir, name)
  3. The performance scripts were still passing these arguments

Changes

scripts/performance/utils/utils.py

  • Updated get_library_recipe() to call recipes without arguments
  • Set output paths after instantiation:
    • cfg.checkpoint.save and cfg.checkpoint.load
    • cfg.logger.tensorboard_dir
    • cfg.logger.wandb_exp_name and cfg.logger.wandb_save_dir

Performance Config Files

Updated all pretrain config functions to:

  • Call base recipe functions without mock and precision_config arguments
  • Directly set cfg.mixed_precision after instantiation
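The before/after shape of each config function looks roughly like this; the recipe below is a stand-in, and the function and field names are illustrative:

```python
from types import SimpleNamespace

def llama3_8b_pretrain_config():
    """Hypothetical stand-in for the parameterless library recipe."""
    return SimpleNamespace(mixed_precision=None)

def get_llama3_8b_perf_config(precision_config):
    # Old call, no longer valid:
    #   llama3_8b_pretrain_config(mock=True, precision_config=precision_config)
    # New pattern: instantiate with no arguments, then set precision directly.
    cfg = llama3_8b_pretrain_config()
    cfg.mixed_precision = precision_config
    return cfg
```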

Files updated:

  • scripts/performance/configs/llama/llama3_llm_pretrain.py
  • scripts/performance/configs/llama/llama31_llm_pretrain.py
  • scripts/performance/configs/deepseek/deepseek_llm_pretrain.py
  • scripts/performance/configs/qwen/qwen3_llm_pretrain.py
  • scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py
  • scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py

DeepSeek V3 Layout Fix

  • Fixed DeepSeek V3 to recompute pipeline layout after updating PP/VP sizes
  • Import and call set_deepseek_v3_pipeline_model_parallel_layout() after changing parallelism settings
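The ordering constraint can be illustrated with a toy stand-in for the layout helper; the real set_deepseek_v3_pipeline_model_parallel_layout() computes an actual layer-to-stage assignment rather than the placeholder used here:

```python
from types import SimpleNamespace

def set_deepseek_v3_pipeline_model_parallel_layout(model):
    """Toy stand-in: one layout entry per (virtual) pipeline stage."""
    stages = model.pipeline_model_parallel_size * (
        model.virtual_pipeline_model_parallel_size or 1
    )
    model.pipeline_model_parallel_layout = [["layers"] for _ in range(stages)]

model = SimpleNamespace(
    pipeline_model_parallel_size=1,
    virtual_pipeline_model_parallel_size=None,
    pipeline_model_parallel_layout=None,
)

# Order matters: update PP/VP sizes first, then recompute the layout;
# otherwise the layout still reflects the old parallelism settings.
model.pipeline_model_parallel_size = 8
model.virtual_pipeline_model_parallel_size = 2
set_deepseek_v3_pipeline_model_parallel_layout(model)
```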

New Functional Test

  • Added tests/functional_tests/recipes/test_perf_config_integration.py
  • Verifies that performance configs can be instantiated correctly
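A test of this shape might look like the following sketch. The config functions here are stand-ins; the real test imports the actual performance config entry points:

```python
# Illustrative stand-ins for the perf config factories under test.
def llama3_8b_config():
    return {"mixed_precision": "bf16_mixed"}

def llama3_70b_config():
    return {"mixed_precision": "fp8_mixed"}

PERF_CONFIG_FUNCS = [llama3_8b_config, llama3_70b_config]

def test_perf_configs_instantiate():
    # Every perf config must be callable with no arguments and
    # come back with a precision already set.
    for fn in PERF_CONFIG_FUNCS:
        cfg = fn()
        assert cfg["mixed_precision"] is not None
```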

Testing

  • All pre-commit hooks pass
  • Functional test added for integration verification

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced simplified, parameter-free configuration functions for model pretraining across all supported architectures.
    • Added centralized pretraining configuration baseline for consistent setup patterns.
  • Improvements

    • Standardized API design for model recipe configurations, reducing parameter complexity.
    • Enhanced checkpoint and logging path handling in configuration utilities.
    • Improved test coverage for configuration integration and model variants.

This commit fixes the performance config scripts to work with the new
parameterless recipe API introduced in PR #2067.

Changes:
- Update get_library_recipe() to call recipes without arguments and set
  output paths (checkpoint, tensorboard, wandb) after instantiation
- Update all pretrain config functions to:
  - Call base recipe functions without mock/precision_config arguments
  - Directly set cfg.mixed_precision after instantiation
- Fix DeepSeek V3 to recompute pipeline layout after updating PP/VP sizes
  by calling set_deepseek_v3_pipeline_model_parallel_layout()
- Add functional test for performance config integration

Files modified:
- scripts/performance/utils/utils.py
- scripts/performance/configs/deepseek/deepseek_llm_pretrain.py
- scripts/performance/configs/llama/llama3_llm_pretrain.py
- scripts/performance/configs/llama/llama31_llm_pretrain.py
- scripts/performance/configs/qwen/qwen3_llm_pretrain.py
- scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py
- scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py
- tests/functional_tests/recipes/test_perf_config_integration.py
copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.



coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

This PR introduces a centralized _pretrain_common() helper function and refactors ~40 recipe configuration functions across multiple model families (Llama, DeepSeek, Qwen, Gemma, GPT, Nemotron, etc.) from parameterized APIs accepting user kwargs to parameterless factory functions returning pre-configured ConfigContainer objects. The refactoring standardizes pretraining setup through explicit, model-specific configuration assignments rather than parameter passing.
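The refactoring pattern can be sketched with a miniature hypothetical ConfigContainer; the field values below are illustrative, and the real _pretrain_common() also populates optimizer, scheduler, dataset, and logger defaults:

```python
from dataclasses import dataclass, field

# Hypothetical miniature of the real ConfigContainer.
@dataclass
class TrainConfig:
    train_iters: int = 1_000_000
    global_batch_size: int = 512

@dataclass
class ConfigContainer:
    train: TrainConfig = field(default_factory=TrainConfig)
    seq_length: int = 4096

def _pretrain_common() -> ConfigContainer:
    # Shared baseline for all pretrain recipes.
    return ConfigContainer()

def llama3_8b_pretrain_config() -> ConfigContainer:
    # Parameterless factory: model-specific values are assigned
    # explicitly rather than threaded through constructor kwargs.
    cfg = _pretrain_common()
    cfg.seq_length = 8192
    cfg.train.global_batch_size = 128
    return cfg
```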

Changes

  • Core pretraining common helper (src/megatron/bridge/recipes/common.py): Introduces the new _pretrain_common() private helper function, providing a standardized baseline pretraining ConfigContainer with default optimizer, scheduler, dataset, logger, and training settings.
  • Llama recipes (src/megatron/bridge/recipes/llama/llama2.py, llama3.py, llama3_8b_16k_...): Converts parameterized pretrain/finetune config functions to parameterless factories using the _pretrain_common() baseline. Expands Llama3 coverage with variants for multiple sequence lengths (16K, 64K, 128K) and precision modes.
  • DeepSeek recipes (src/megatron/bridge/recipes/deepseek/deepseek_v2.py, deepseek_v3.py): Replaces kwargs-based config builders with streamlined _pretrain_common() implementations. Introduces AutoBridge.from_hf_pretrained() for model loading and centralizes MoE/pipeline layout configuration.
  • Gemma recipes (src/megatron/bridge/recipes/gemma/gemma2.py, gemma3.py): Migrates from _gemma_common-based helpers to the _pretrain_common() pathway. Updates model instantiation via AutoBridge and removes GPTDatasetConfig imports in favor of centralized dataset handling.
  • Qwen recipes (src/megatron/bridge/recipes/qwen/qwen2.py, qwen3.py, qwen3_moe.py, qwen3_next.py): Converts multiple Qwen2/Qwen3 variants and MoE models from parameterized to parameterless APIs. Adds new Qwen3 235B-A22B pretrain/finetune configs and finetuning helpers.
  • GPT/GPT-OSS recipes (src/megatron/bridge/recipes/gpt/gpt3_175b.py, gpt_oss/gpt_oss.py): Replaces large multi-argument builders with concise _pretrain_common() implementations. Removes separate model_config() functions and consolidates configuration into single pretrain entry points.
  • Other model families (src/megatron/bridge/recipes/glm/glm45.py, kimi/kimi_k2.py, moonlight/moonlight_16b.py, nemotronh/*, olmoe/olmoe_7b.py): Applies the same refactoring pattern: removes per-model helper functions, introduces the _pretrain_common() baseline, adds pipeline layout helpers where needed, and updates tokenizer/dataset/MoE configuration paths.
  • Performance config scripts (scripts/performance/configs/deepseek/..., gpt_oss/..., llama/..., nemotronh/..., qwen/...): Updates all performance config functions to call parameterless recipe functions and assign precision/layout via post-instantiation mutations instead of constructor arguments.
  • Example and utility scripts (examples/quantization/pretrain_quantized_llama3_8b.py, scripts/performance/utils/utils.py, tests/functional_tests/recipes/utils.py): Migrates to the parameterless recipe API with inline model/config overrides. Updates get_library_recipe() to construct paths post-instantiation rather than via parameters.
  • Unit and functional tests (tests/unit_tests/recipes/*, tests/functional_tests/recipes/*): Updates test fixtures to call parameterless config functions, introduces a new integration test module (test_perf_config_integration.py), adjusts override logic to distinguish pretrain (empty overrides) from finetune (full overrides), and relaxes tokenizer assertions to accept either NullTokenizer or HuggingFaceTokenizer for pretrain paths.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #2067: Implements the same large-scale refactor introducing _pretrain_common() and converting recipe functions to parameterless ConfigContainer factories with AutoBridge HF handling.
  • PR #1914: Modifies Nemotron-3-Nano recipe surface, adding/altering nemotron_3_nano pretrain/finetune config functions and provider classes.

Suggested reviewers

  • erhoo82
  • malay-nagda
  • cuichenx
🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title clearly describes the main change: updating performance config scripts to be compatible with the parameterless recipe API introduced in another PR.
  • Docstring Coverage: Passed. Docstring coverage is 96.67%, above the required 80.00% threshold.
  • Test Results For Major Changes: Passed. The PR contains compatibility fixes for the parameterless recipe API with a comprehensive functional integration test covering 8+ model configs, precision variations, and path resolution.




coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
scripts/performance/configs/llama/llama3_llm_pretrain.py (2)

341-348: ⚠️ Potential issue | 🟡 Minor

Missing nvfp4 tp_comm_overlap check compared to other 8B configs.

Other 8B config functions (gb300, gb200, b300, b200) include a check to disable tp_comm_overlap when precision == "nvfp4", but the H100 variant does not. If nvfp4 is not supported on H100, consider adding a comment to clarify this; otherwise, this may be an oversight.

🔧 Suggested fix if nvfp4 check is needed
     cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=bool(cfg.model.tensor_model_parallel_size > 1))
+    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap

     if cfg.ddp.use_megatron_fsdp:

50-52: ⚠️ Potential issue | 🟡 Minor

Remove unused mock parameter from all config functions.

The mock parameter is declared in all 10 config function signatures (lines 50, 85, 120, 155, 190, 222, 247, 272, 297, 322) but never referenced in any function body. Tests explicitly pass mock=True when calling these functions, indicating this was likely a legacy parameter from an earlier API. Since it has no effect, either remove it entirely or document why it is retained for backward compatibility.

tests/unit_tests/recipes/gemma/test_gemma2_recipes.py (1)

117-154: ⚠️ Potential issue | 🟡 Minor

Add module-level pytest unit test categorization.

This test module lacks the required pytestmark declaration for test categorization. Add the following after the imports:

import pytest


+pytestmark = pytest.mark.unit

All tests in this module should be categorized as unit tests per the coding guidelines.

tests/unit_tests/recipes/test_gemma3_recipes.py (1)

128-179: ⚠️ Potential issue | 🟡 Minor

Add module-level pytest.mark.unit to categorize all tests in this file.

Since all test functions here are unit tests, use a module-level mark to avoid repeating the decorator on each function.

Proposed change
 import pytest
 
+pytestmark = pytest.mark.unit

As per coding guidelines: Use 'pytest.mark' to categorize tests (unit, integration, system).

🤖 Fix all issues with AI agents
In `@scripts/performance/configs/deepseek/deepseek_llm_pretrain.py`:
- Around line 142-151: The b300 pretrain config currently ignores
base_cfg.pp_layout by always calling
set_deepseek_v3_pipeline_model_parallel_layout(cfg.model); change it to follow
the gb300/gb200 pattern: if base_cfg.pp_layout is set, assign
cfg.model.pp_layout = base_cfg.pp_layout (or equivalent field) and do not
recompute, otherwise call
set_deepseek_v3_pipeline_model_parallel_layout(cfg.model) to compute the layout;
update the block that sets cfg.model.pipeline_model_parallel_size /
virtual_pipeline_model_parallel_size / moe_flex_dispatcher_backend to
conditionally respect base_cfg.pp_layout before recomputing the layout.

In `@src/megatron/bridge/recipes/gemma/gemma3.py`:
- Around line 128-132: Update the docstring for gemma3_1b_pretrain_config to use
Google-style formatting by adding a "Returns" section that documents the
returned type and meaning (e.g., "Returns: ConfigContainer: A pre-training
configuration for Gemma3 1B with default parallelism TP=1, PP=1 and
seq_length=32K"). Keep the existing short description and default-parallelism
note, and ensure the "Returns" section is placed after the function description.

In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py`:
- Around line 251-356: The gpt_oss_120b_pretrain_config function leaves
cfg.model.pipeline_dtype as None even though
cfg.model.pipeline_model_parallel_size is 4; update gpt_oss_120b_pretrain_config
to set cfg.model.pipeline_dtype = torch.bfloat16 whenever
cfg.model.pipeline_model_parallel_size > 1 (mirror the pattern used in other
recipes), and make the same change in the corresponding 20B pretrain config
function so pipeline parallelism uses torch.bfloat16 for safe pipeline
communication and gradient computation.
- Around line 132-248: In gpt_oss_20b_pretrain_config, explicitly set
cfg.model.pipeline_dtype to torch.bfloat16 when using
pipeline_model_parallel_size=4 (replace the current None); update the assignment
near the "Parallelism settings" block so cfg.model.pipeline_model_parallel_size
remains 4 and cfg.model.pipeline_dtype = torch.bfloat16 to follow the codebase
convention and avoid relying on finalization auto-population.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py`:
- Around line 91-97: Update the docstrings for the Nano v2 pretrain config
functions (e.g., nemotron_nano_9b_v2_pretrain_config and the other pretrain
config around lines 192-199) to use Google-style docstrings by adding a
"Returns" section that documents the return type and purpose (e.g., "Returns:
ConfigContainer: pre-training configuration for Nemotron Nano 9B v2" or
similar), ensuring the section is Sphinx/napoleon-parsable and consistent with
the existing top-level description.

In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py`:
- Around line 102-107: Add a Google-style "Returns" section to the docstring of
nemotronh_4b_pretrain_config and the other pretrain config functions in this
file (the ones noted in the review). Specifically, update each function
docstring to include a "Returns:" block that states the return type and brief
description (e.g., "Returns: ConfigContainer: A pre-training configuration for
NemotronH 4B."), following Google docstring formatting so Sphinx can parse it.

In `@tests/unit_tests/recipes/gemma/test_gemma2_recipes.py`:
- Around line 41-65: Update the _safe_overrides_for function docstring to
Google-style with an "Args" section describing the name: str parameter and a
"Returns" section describing the returned dict of overrides; keep the existing
short description about pretrain vs finetune behavior and briefly note that
finetune configs accept parameters while pretrain return an empty dict. Ensure
the docstring is triple-quoted and Sphinx-friendly for parsing.

In `@tests/unit_tests/recipes/test_gemma3_recipes.py`:
- Around line 37-62: Update the _safe_overrides_for function docstring to Google
style: add an "Args" section describing the name: str parameter and a "Returns"
section describing the returned dict of overrides; keep the existing brief
description about pretrain vs finetune behavior and ensure the docstring remains
Sphinx/Google-parseable (triple-quoted) for the function _safe_overrides_for.
🧹 Nitpick comments (12)
scripts/performance/configs/llama/llama3_llm_pretrain.py (1)

15-32: Import order does not follow coding guidelines.

Per coding guidelines, imports should be organized as: future imports, standard library, third-party, first-party (megatron.bridge.*), then local folder imports. Currently, local utils.* imports precede megatron.bridge.* imports.

♻️ Suggested import reordering
 import logging

+from megatron.bridge.recipes.llama import llama3_8b_pretrain_config, llama3_70b_pretrain_config
+from megatron.bridge.training.comm_overlap import (
+    CommOverlapConfig,
+    userbuffers_bf16_b200_h8192_tp2_mbs1_seqlen8192,
+    userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192,
+    userbuffers_fp8_b200_h8192_tp2_mbs1_seqlen8192,
+    userbuffers_fp8_h100_h8192_tp4_mbs1_seqlen8192,
+)
+from megatron.bridge.training.config import ConfigContainer
+
 from utils.overrides import set_workload_base_configs
 from utils.precision import get_precision_config
 from utils.utils import get_workload_base_config

-from megatron.bridge.recipes.llama import llama3_8b_pretrain_config, llama3_70b_pretrain_config
-from megatron.bridge.training.comm_overlap import (
-    CommOverlapConfig,
-    userbuffers_bf16_b200_h8192_tp2_mbs1_seqlen8192,
-    userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192,
-    userbuffers_fp8_b200_h8192_tp2_mbs1_seqlen8192,
-    userbuffers_fp8_h100_h8192_tp4_mbs1_seqlen8192,
-)
-from megatron.bridge.training.config import ConfigContainer
src/megatron/bridge/recipes/olmoe/olmoe_7b.py (1)

143-145: Simplify redundant list comprehension.

The outer list() call is unnecessary since the list comprehension already produces a list.

♻️ Suggested simplification
     if layout is not None:
-        layout = list([list(x) for x in layout])
+        layout = [list(x) for x in layout]
     return layout
src/megatron/bridge/recipes/kimi/kimi_k2.py (1)

42-44: Simplify redundant list comprehension.

Same issue as in olmoe_7b.py - the outer list() is redundant.

♻️ Suggested simplification
     if layout is not None:
-        layout = list([list(x) for x in layout])
+        layout = [list(x) for x in layout]
     return layout
tests/unit_tests/recipes/kimi/test_kimi_k2.py (1)

24-25: Consider adding @pytest.mark.unit decorator.

The test classes are missing the @pytest.mark.unit marker. Per coding guidelines, tests should use pytest.mark to categorize tests.

♻️ Add pytest markers
+@pytest.mark.unit
 class TestKimiK2PipelineLayout:
     """Test cases for _get_kimi_k2_pipeline_layout function."""
+@pytest.mark.unit
 class TestKimiK2PretrainConfig:
     """Test cases for kimi_k2_pretrain_config function."""

As per coding guidelines: "Use 'pytest.mark' to categorize tests (unit, integration, system)".

Also applies to: 50-51

src/megatron/bridge/recipes/deepseek/deepseek_v3.py (3)

87-93: Redundant assignment of rotary_base.

rotary_base is assigned on line 89, then immediately converted to float on line 91. The second assignment is redundant since line 89 already assigns a float literal.

♻️ Proposed fix
     # Model-specific settings
     cfg.model.init_method_std = 0.006
-    cfg.model.rotary_base = 10000.0
-    cfg.model.rotary_scaling_factor = 40
-    cfg.model.rotary_base = float(cfg.model.rotary_base)  # Ensure rotary_base is float
-    cfg.model.rotary_scaling_factor = int(cfg.model.rotary_scaling_factor)
+    cfg.model.rotary_base = 10000.0  # float literal
+    cfg.model.rotary_scaling_factor = 40  # int literal

198-334: Consider extracting common configuration into a helper function.

deepseek_v3_pretrain_config_32nodes() shares ~90% of its code with deepseek_v3_pretrain_config(). The differences are primarily in parallelism settings (PP=8 vs PP=16, EP=32 vs EP=64) and recompute configuration (full vs selective).

A shared helper would reduce maintenance burden and risk of divergence.

♻️ Suggested approach
def _deepseek_v3_common() -> ConfigContainer:
    """Common configuration for DeepSeek-V3 variants."""
    cfg = _pretrain_common()
    cfg.model = AutoBridge.from_hf_pretrained("deepseek-ai/DeepSeek-V3").to_megatron_provider(load_weights=False)
    # ... shared config settings ...
    return cfg

def deepseek_v3_pretrain_config() -> ConfigContainer:
    cfg = _deepseek_v3_common()
    cfg.model.pipeline_model_parallel_size = 16
    cfg.model.expert_model_parallel_size = 64
    cfg.model.recompute_granularity = "selective"
    # ... variant-specific settings ...
    return cfg

def deepseek_v3_pretrain_config_32nodes() -> ConfigContainer:
    cfg = _deepseek_v3_common()
    cfg.model.pipeline_model_parallel_size = 8
    cfg.model.expert_model_parallel_size = 32
    cfg.model.recompute_granularity = "full"
    cfg.model.recompute_method = "uniform"
    cfg.model.recompute_num_layers = 1
    # ... variant-specific settings ...
    return cfg

234-239: Same redundant pattern as the main config.

The same redundant assignment pattern for rotary_base and rotary_scaling_factor appears here.

♻️ Proposed fix
     # Model-specific settings
     cfg.model.init_method_std = 0.006
-    cfg.model.rotary_base = 10000.0
-    cfg.model.rotary_scaling_factor = 40
-    cfg.model.rotary_base = float(cfg.model.rotary_base)
-    cfg.model.rotary_scaling_factor = int(cfg.model.rotary_scaling_factor)
+    cfg.model.rotary_base = 10000.0
+    cfg.model.rotary_scaling_factor = 40
scripts/performance/configs/deepseek/deepseek_llm_pretrain.py (1)

48-49: Unused mock parameter across all config functions.

The mock parameter is declared but never used in any of the config functions (gb300, gb200, b300, b200, h100). This appears to be a leftover from the previous API.

Consider removing it if no longer needed, or add a comment explaining it's retained for backward compatibility.

Also applies to: 88-89, 128-129, 160-161, 192-193

tests/unit_tests/recipes/test_deepseek_recipes.py (1)

80-102: Consider adding @pytest.mark.unit decorator.

The test function lacks the @pytest.mark.unit marker. Per coding guidelines, tests should use pytest.mark to categorize tests (unit, integration, system).

♻️ Proposed fix
+@pytest.mark.unit
 @pytest.mark.parametrize("recipe_func", _DEEPSEEK_RECIPE_FUNCS)
 def test_each_deepseek_recipe_builds_config(recipe_func: Callable, monkeypatch: pytest.MonkeyPatch):

As per coding guidelines: "Use 'pytest.mark' to categorize tests (unit, integration, system)".

tests/functional_tests/recipes/test_perf_config_integration.py (2)

24-30: sys.path manipulation may be fragile for test discovery.

Using sys.path.insert at module level affects the entire test session. Consider using pytest's pythonpath configuration in pyproject.toml or a conftest.py fixture for cleaner path management.


33-35: Add pytest markers for test categorization.

Per coding guidelines, tests should use pytest.mark to categorize tests. Since these are functional/integration tests, they should be marked appropriately.

♻️ Proposed fix
+import pytest
+
+
+@pytest.mark.functional
 class TestPerfConfigIntegration:
     """Test class for performance config integration with library recipes."""

As per coding guidelines: "Use 'pytest.mark' to categorize tests (unit, integration, system)".

src/megatron/bridge/recipes/moonlight/moonlight_16b.py (1)

130-148: Add an explicit return type to _get_moonlight_pipeline_layout.
This keeps the API typed and aligned with the codebase’s typing conventions.

♻️ Suggested change
-def _get_moonlight_pipeline_layout(pp_size: int, vp_size: int):
+def _get_moonlight_pipeline_layout(pp_size: int, vp_size: int) -> list[list[str]] | None:

As per coding guidelines: Use type hints for function arguments and return types; Use built-in generics (list, dict, tuple) instead of typing equivalents; Use 'T | None' for nullable types instead of 'Optional[T]'.

Comment on lines +142 to +151
cfg = pretrain_config()
cfg.mixed_precision = precision_config

# Apply model-specific settings that were previously passed as constructor args
cfg.model.pipeline_model_parallel_size = base_cfg.pipeline_model_parallel_size
cfg.model.virtual_pipeline_model_parallel_size = base_cfg.virtual_pipeline_model_parallel_size
cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend
# Recompute layout based on updated PP/VP sizes
set_deepseek_v3_pipeline_model_parallel_layout(cfg.model)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, run the provided script to check pp_layout usage
rg -n "pp_layout" scripts/performance/ --glob "*.yaml" --glob "*.json" -A2 -B2 | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Also check Python files for pp_layout references
rg -n "pp_layout" scripts/performance/ --glob "*.py" -A2 -B2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2391


🏁 Script executed:

# Find gb300 and gb200 config files to compare with b300
fd -e py "gb300|gb200|b300" scripts/performance/configs/deepseek/

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Look at the actual deepseek_llm_pretrain.py file around lines 142-151
cat -n scripts/performance/configs/deepseek/deepseek_llm_pretrain.py | sed -n '130,160p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1373


🏁 Script executed:

# Look at the broader context to identify which functions have the conditional logic
cat -n scripts/performance/configs/deepseek/deepseek_llm_pretrain.py | sed -n '50,120p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3317


🏁 Script executed:

# Also check the b300 base config to verify pp_layout is set there
cat -n scripts/performance/configs/deepseek/deepseek_workload_base_configs.py | sed -n '40,55p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 698


B300 config should respect base_cfg.pp_layout like GB300/GB200 variants.

The b300 function unconditionally calls set_deepseek_v3_pipeline_model_parallel_layout() without checking base_cfg.pp_layout first, unlike the gb300 and gb200 variants. Since the B300 base config explicitly sets pp_layout="Et*4|(t*4|)*14tmL", it is being silently ignored. Update b300 to use the same conditional pattern as gb300/gb200: check if base_cfg.pp_layout exists, use it if present, otherwise compute the layout.
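The conditional pattern used by the gb300/gb200 variants can be sketched as follows; apply_pp_layout is a hypothetical helper for illustration, since the real config functions inline this logic:

```python
from types import SimpleNamespace

def set_deepseek_v3_pipeline_model_parallel_layout(model):
    """Stand-in: compute a default layout from current parallelism settings."""
    model.pipeline_model_parallel_layout = "computed-default"

def apply_pp_layout(cfg, base_cfg):
    # Respect an explicit layout from the base workload config; only
    # recompute when none was provided (the gb300/gb200 pattern).
    if getattr(base_cfg, "pp_layout", None):
        cfg.model.pipeline_model_parallel_layout = base_cfg.pp_layout
    else:
        set_deepseek_v3_pipeline_model_parallel_layout(cfg.model)

cfg = SimpleNamespace(model=SimpleNamespace(pipeline_model_parallel_layout=None))
base_cfg = SimpleNamespace(pp_layout="Et*4|(t*4|)*14tmL")
apply_pp_layout(cfg, base_cfg)
```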


Comment on lines +128 to 132
def gemma3_1b_pretrain_config() -> ConfigContainer:
"""Return a pre-training config for Gemma3 1B.

See `_gemma3_common` for the full list of parameters.
Default parallelism: TP=1, PP=1, seq_length=32K
"""

⚠️ Potential issue | 🟡 Minor

Add a Google-style Returns section to the pretrain config docstring.

This keeps the recipe docs consistent with the rest of the module’s Google-style docstrings.

Proposed docstring update
 def gemma3_1b_pretrain_config() -> ConfigContainer:
     """Return a pre-training config for Gemma3 1B.
 
     Default parallelism: TP=1, PP=1, seq_length=32K
+
+    Returns:
+        ConfigContainer: Pre-training configuration for Gemma3 1B.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.

Comment on lines +132 to +248

Removed (old parameterized helper):

def _gpt_oss_common(
    hf_path: str,
    dir: Optional[str] = None,
    name: str = "default",
    # Dataset configuration
    data_paths: Optional[List[str]] = None,
    data_args_path: Optional[str] = None,
    train_data_path: Optional[List[str]] = None,
    valid_data_path: Optional[List[str]] = None,
    test_data_path: Optional[List[str]] = None,
    per_split_data_args_path: Optional[str] = None,
    mock: bool = False,
    # Dataset override option
    dataset: Optional[Union[GPTDatasetConfig, FinetuningDatasetConfig, DatasetProvider]] = None,
    # Model configuration
    num_layers: int = None,  # for ci testing
    tensor_model_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
    pipeline_dtype: Optional[torch.dtype] = None,
    virtual_pipeline_model_parallel_size: Optional[int] = None,
    context_parallel_size: int = 1,
    expert_model_parallel_size: int = 1,
    sequence_parallel: bool = False,
    use_megatron_fsdp: bool = False,
    account_for_embedding_in_pipeline_split: bool = False,
    account_for_loss_in_pipeline_split: bool = False,
    cp_comm_type: Optional[str] = None,
    # Training hyperparameters
    train_iters: int = 1000000,
    global_batch_size: int = 512,
    micro_batch_size: int = 1,
    seq_length: int = 4096,
    lr: float = 3e-4,
    min_lr: float = 3e-5,
    lr_warmup_iters: int = 2000,
    lr_decay_iters: Optional[int] = None,
    eval_interval: int = 2000,
    save_interval: int = 500,
    use_null_tokenizer: bool = True,
    # Precision recipe
    precision_config: Optional[Union[MixedPrecisionConfig, str]] = "bf16_mixed",
    comm_overlap_config: Optional[CommOverlapConfig] = None,
    # Checkpointing
    pretrained_checkpoint: Optional[str] = None,
) -> ConfigContainer:
    """
    Create a pre-training configuration for GPT-OSS family models using a given HuggingFace path.
    Mirrors the structure used in llama recipes for consistency.
    Recommended parallelism: TP=2, PP=4, EP=4
    """

Added (parameterless replacement):

def gpt_oss_20b_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for GPT-OSS 20B variant."""
cfg = _pretrain_common()

# Model config
cfg.model = AutoBridge.from_hf_pretrained("openai/gpt-oss-20b").to_megatron_provider(load_weights=False)

# Tokenizer - uses NullTokenizer by default
cfg.tokenizer.tokenizer_type = "NullTokenizer"
cfg.tokenizer.tokenizer_model = None
cfg.tokenizer.vocab_size = DEFAULT_NULL_TOKENIZER_VOCAB_SIZE

# Dataset config - mock data by default
cfg.dataset.blend = None # Pass the path to the dataset here if not using mock data, along with weight. Ex: (["path/to/data1"], 0.2), [("path/to/data2", 0.8)]
cfg.dataset.seq_length = 4096
cfg.dataset.num_workers = 8

# Parallelism settings
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 4
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None
cfg.model.virtual_pipeline_model_parallel_size = None
cfg.model.context_parallel_size = 1
cfg.model.expert_model_parallel_size = 4
cfg.model.expert_tensor_parallel_size = 1
cfg.model.sequence_parallel = True
cfg.model.seq_length = 4096

# Pipeline split settings
cfg.model.account_for_embedding_in_pipeline_split = False
cfg.model.account_for_loss_in_pipeline_split = False

if cfg.model.context_parallel_size > 1:
cfg.model.calculate_per_token_loss = True
cfg.model.cp_comm_type = "a2a" # only a2a cp is supported for sink attention.

# MoE Token Dispatcher settings
cfg.model.moe_token_dispatcher_type = "alltoall" # Default
cfg.model.moe_flex_dispatcher_backend = "deepep" # Options: None, deepep, hybridep
cfg.model.moe_hybridep_num_sms = 16 # Number of SMs for hybridep backend

# Training config
cfg.train.train_iters = 1000000
cfg.train.global_batch_size = 512
cfg.train.micro_batch_size = 1
cfg.train.eval_interval = 2000
cfg.train.manual_gc = True
cfg.train.manual_gc_interval = 100

# Scheduler config
cfg.scheduler.lr_warmup_iters = 2000

# TE (Transformer Engine)
cfg.model.transformer_impl = "transformer_engine"

# CUDA Graph
cfg.model.cuda_graph_impl = "none"
cfg.model.cuda_graph_scope = "full"
cfg.model.cuda_graph_warmup_steps = 3

# Kernel selections
cfg.model.attention_backend = None
cfg.model.moe_router_fusion = False
cfg.model.moe_permute_fusion = True
cfg.model.moe_grouped_gemm = True
cfg.model.cross_entropy_loss_fusion = True
cfg.model.cross_entropy_fusion_impl = "native"

# Memory saving (recompute & offloading)
cfg.model.recompute_granularity = None
cfg.model.recompute_modules = None
cfg.model.fine_grained_activation_offloading = False
cfg.model.offload_modules = None

# Mixed precision - uses "bf16_mixed" from _pretrain_common
# FP8 settings (commented - enable if using FP8)
# cfg.mixed_precision.fp8_recipe = "tensorwise"
# cfg.mixed_precision.fp8 = None
# cfg.mixed_precision.fp8_param_gather = False
# cfg.mixed_precision.reuse_grad_buf_for_mxfp8_param_ag = False
cfg.model.moe_router_padding_for_fp8 = False # Pad router for FP8 alignment

# Optimizer precision settings
cfg.optimizer.use_precision_aware_optimizer = False
cfg.optimizer.main_grads_dtype = torch.float32
cfg.optimizer.main_params_dtype = torch.float32
cfg.optimizer.exp_avg_dtype = torch.float32
cfg.optimizer.exp_avg_sq_dtype = torch.float32

# Communication overlap (default None, can pass CommOverlapConfig for advanced overlap)
# cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=False) # Uncomment to enable
# cfg.comm_overlap.delay_wgrad_compute = False
# cfg.comm_overlap.overlap_moe_expert_parallel_comm = False
cfg.model.moe_shared_expert_overlap = False # GPT-OSS default

# Checkpoint config
# cfg.checkpoint.save = "path/to/save"
# cfg.checkpoint.load = "path/to/load"

# DDP config (matches _pretrain_common)
cfg.ddp.overlap_grad_reduce = True
cfg.ddp.overlap_param_gather = True
cfg.ddp.check_for_nan_in_grad = True
cfg.ddp.use_distributed_optimizer = True
cfg.ddp.use_megatron_fsdp = False
cfg.ddp.grad_reduce_in_fp32 = True
cfg.ddp.average_in_collective = cfg.model.context_parallel_size == 1
cfg.ddp.data_parallel_sharding_strategy = "no_shard"

# MoE Force Load Balancing
cfg.model.moe_router_force_load_balancing = False

base_output_dir = dir if dir is not None else os.path.join(os.getcwd(), "nemo_experiments")
run_output_dir = os.path.join(base_output_dir, name)
checkpoint_dir = os.path.join(run_output_dir, "checkpoints")
tensorboard_dir = os.path.join(run_output_dir, "tb_logs")

bridge = AutoBridge.from_hf_pretrained(hf_path)
model_cfg = bridge.to_megatron_provider(load_weights=False)
if num_layers is not None:
model_cfg.num_layers = num_layers
model_cfg.tensor_model_parallel_size = tensor_model_parallel_size
model_cfg.pipeline_model_parallel_size = pipeline_model_parallel_size
model_cfg.pipeline_dtype = pipeline_dtype
model_cfg.virtual_pipeline_model_parallel_size = virtual_pipeline_model_parallel_size
model_cfg.context_parallel_size = context_parallel_size
model_cfg.expert_model_parallel_size = expert_model_parallel_size
model_cfg.expert_tensor_parallel_size = 1
model_cfg.sequence_parallel = sequence_parallel
model_cfg.seq_length = seq_length
return cfg
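Because the recipe functions no longer accept `dir`/`name` kwargs, the performance scripts set output paths after instantiation instead. A minimal sketch of that pattern, assuming a config object with `checkpoint` and `logger` sections (the `apply_output_paths` helper and the `SimpleNamespace` stand-in below are illustrative, not the real `ConfigContainer`):

```python
import os
from types import SimpleNamespace

def apply_output_paths(cfg, base_dir, name):
    """Set checkpoint and logging paths after instantiation, mirroring
    what the updated get_library_recipe() does for parameterless recipes."""
    run_dir = os.path.join(base_dir, name)
    checkpoint_dir = os.path.join(run_dir, "checkpoints")
    cfg.checkpoint.save = checkpoint_dir
    cfg.checkpoint.load = checkpoint_dir
    cfg.logger.tensorboard_dir = os.path.join(run_dir, "tb_logs")
    return cfg

# Stand-in object; in the real scripts cfg comes from e.g.
# gpt_oss_20b_pretrain_config() called with no arguments.
cfg = SimpleNamespace(
    checkpoint=SimpleNamespace(save=None, load=None),
    logger=SimpleNamespace(tensorboard_dir=None),
)
cfg = apply_output_paths(cfg, "nemo_experiments", "perf_run")
print(cfg.checkpoint.save)  # nemo_experiments/perf_run/checkpoints
```

The point is only the ordering: instantiate first, then mutate fields, rather than threading kwargs through the recipe call.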

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for GPT-OSS provider and pipeline_dtype handling
rg -n "pipeline_dtype" src/megatron/bridge --type py -A 3 -B 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50383


🏁 Script executed:

# Look for GPTOSS/gpt_oss provider implementation
find src -type f -name "*.py" | xargs grep -l "gpt_oss\|GPTOSS" | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 399


🏁 Script executed:

# Check AutoBridge.from_hf_pretrained and how it sets pipeline_dtype
rg -n "from_hf_pretrained\|to_megatron_provider" src/megatron/bridge --type py -A 5 | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Search for pipeline_dtype defaults in config or model definitions
rg -n "pipeline_dtype\s*=" src/megatron/bridge --type py | head -30

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2898


🏁 Script executed:

# Check GPT-OSS provider for pipeline_dtype handling
cat src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3687


🏁 Script executed:

# Check if AutoBridge.from_hf_pretrained sets pipeline_dtype
rg -n "from_hf_pretrained\|to_megatron_provider" src/megatron/bridge/models --type py -A 10 | head -80

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Find and check GPTModelProvider for pipeline_dtype field
rg -n "class GPTModelProvider" src/megatron/bridge --type py -A 30 | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2876


🏁 Script executed:

# Also search for any pipeline_dtype default in model provider base classes
rg -n "pipeline_dtype" src/megatron/bridge/models --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 714


🏁 Script executed:

# Get more context around the transformer_config.py auto-setting logic
sed -n '80,170p' src/megatron/bridge/models/transformer_config.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3753


🏁 Script executed:

# Check when finalize() is called in the recipe flow
rg -n "finalize" src/megatron/bridge/recipes/gpt_oss --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check if to_megatron_provider calls finalize
rg -n "to_megatron_provider\|finalize" src/megatron/bridge/models --type py -B 3 -A 5 | head -80

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check the AutoBridge class to see how it handles finalize
rg -n "class AutoBridge\|def to_megatron_provider" src/megatron/bridge --type py -A 10 | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Search for AutoBridge definition
fd -n "auto_bridge" src/megatron/bridge --type f

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 301


🏁 Script executed:

# Search more broadly for to_megatron_provider implementation
rg -n "def to_megatron_provider" src/megatron/bridge --type py -A 15

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1778


🏁 Script executed:

# Look at the GPT-OSS recipe to see how config flows after to_megatron_provider
sed -n '130,170p' src/megatron/bridge/recipes/gpt_oss/gpt_oss.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1681


🏁 Script executed:

# Check when finalize() is called on the provider
rg -n "finalize" src/megatron/bridge --type py -B 5 -A 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50383


🏁 Script executed:

# Look at to_megatron_provider implementation more carefully
sed -n '829,900p' src/megatron/bridge/models/conversion/auto_bridge.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3717


Set explicit pipeline_dtype for PP=4 to match codebase convention.

With pipeline_model_parallel_size=4, set pipeline_dtype=torch.bfloat16 explicitly instead of None. While the framework will auto-populate this during finalization (when PP>1), all other recipes in the codebase explicitly set pipeline_dtype when using pipeline parallelism (e.g., qwen3_moe, olmoe_7b, kimi_k2). This explicit assignment aligns with the "Required for PP > 1" pattern documented elsewhere and avoids relying on implicit finalization behavior.

Recommended fix
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16
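The convention the review describes can be captured as a small guard. This is an illustrative sketch only, not code from the repo, and it uses a string placeholder for `torch.bfloat16` so it runs without torch installed:

```python
BF16 = "torch.bfloat16"  # placeholder for the real torch dtype

def resolve_pipeline_dtype(pp_size, pipeline_dtype=None):
    """When pipeline parallelism is enabled (PP > 1) and no dtype was
    given, pick bf16 explicitly instead of relying on finalize() to
    auto-populate the field."""
    if pp_size > 1 and pipeline_dtype is None:
        return BF16
    return pipeline_dtype

print(resolve_pipeline_dtype(4))  # torch.bfloat16
print(resolve_pipeline_dtype(1))  # None
```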
🧰 Tools
🪛 Ruff (0.14.14)

[error] 173-173: Possible hardcoded password assigned to: "moe_token_dispatcher_type"

(S105)

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py` around lines 132 - 248, In
gpt_oss_20b_pretrain_config, explicitly set cfg.model.pipeline_dtype to
torch.bfloat16 when using pipeline_model_parallel_size=4 (replace the current
None); update the assignment near the "Parallelism settings" block so
cfg.model.pipeline_model_parallel_size remains 4 and cfg.model.pipeline_dtype =
torch.bfloat16 to follow the codebase convention and avoid relying on
finalization auto-population.

Comment on lines +251 to 356
def gpt_oss_120b_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for GPT-OSS 120B variant.

    Recommended parallelism: TP=2, PP=4, EP=16
    """
    cfg = _pretrain_common()

    # Model config
    cfg.model = AutoBridge.from_hf_pretrained("openai/gpt-oss-120b").to_megatron_provider(load_weights=False)

    # Tokenizer - uses NullTokenizer by default
    cfg.tokenizer.tokenizer_type = "NullTokenizer"
    cfg.tokenizer.tokenizer_model = None
    cfg.tokenizer.vocab_size = DEFAULT_NULL_TOKENIZER_VOCAB_SIZE

    # Dataset config - mock data by default
    cfg.dataset.blend = None  # Pass the path to the dataset here if not using mock data, along with weight. Ex: (["path/to/data1"], 0.2), [("path/to/data2", 0.8)]
    cfg.dataset.seq_length = 4096
    cfg.dataset.num_workers = 8

    # Parallelism settings (MoE-specific)
    cfg.model.tensor_model_parallel_size = 2
    cfg.model.pipeline_model_parallel_size = 4
    cfg.model.pipeline_model_parallel_layout = None
    cfg.model.pipeline_dtype = None
    cfg.model.virtual_pipeline_model_parallel_size = None
    cfg.model.context_parallel_size = 1
    cfg.model.expert_model_parallel_size = 16  # Larger EP for 120B
    cfg.model.expert_tensor_parallel_size = 1
    cfg.model.sequence_parallel = True
    cfg.model.seq_length = 4096

    # Pipeline split settings
    cfg.model.account_for_embedding_in_pipeline_split = False
    cfg.model.account_for_loss_in_pipeline_split = False
    if cfg.model.context_parallel_size > 1:
        cfg.model.calculate_per_token_loss = True
        cfg.model.cp_comm_type = "a2a"  # only a2a cp is supported for sink attention.

    # MoE Token Dispatcher settings
    cfg.model.moe_token_dispatcher_type = "alltoall"
    cfg.model.moe_flex_dispatcher_backend = "deepep"
    cfg.model.moe_hybridep_num_sms = 16

    # Training config (DIFFERENT from _pretrain_common)
    cfg.train.train_iters = 1000000
    cfg.train.global_batch_size = 512
    cfg.train.micro_batch_size = 1
    cfg.train.eval_interval = 2000
    cfg.train.manual_gc = True
    cfg.train.manual_gc_interval = 100

    # Scheduler config
    cfg.scheduler.lr_warmup_iters = 2000

    # TE (Transformer Engine)
    cfg.model.transformer_impl = "transformer_engine"

    # CUDA Graph
    cfg.model.cuda_graph_impl = "none"
    cfg.model.cuda_graph_scope = "full"
    cfg.model.cuda_graph_warmup_steps = 3

    # Kernel selections
    cfg.model.attention_backend = None
    cfg.model.moe_router_fusion = False
    cfg.model.moe_permute_fusion = True
    cfg.model.moe_grouped_gemm = True
    cfg.model.cross_entropy_loss_fusion = True
    cfg.model.cross_entropy_fusion_impl = "native"  # GPT-OSS uses native

    # Memory saving
    cfg.model.recompute_granularity = None
    cfg.model.recompute_modules = None
    cfg.model.fine_grained_activation_offloading = False
    cfg.model.offload_modules = None

    # Mixed precision
    cfg.model.moe_router_padding_for_fp8 = False

    # Optimizer precision settings
    cfg.optimizer.use_precision_aware_optimizer = False
    cfg.optimizer.main_grads_dtype = torch.float32
    cfg.optimizer.main_params_dtype = torch.float32
    cfg.optimizer.exp_avg_dtype = torch.float32
    cfg.optimizer.exp_avg_sq_dtype = torch.float32

    # Communication overlap (default None, can pass CommOverlapConfig for advanced overlap)
    # cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=False)  # Uncomment to enable
    # cfg.comm_overlap.delay_wgrad_compute = False
    # cfg.comm_overlap.overlap_moe_expert_parallel_comm = False
    cfg.model.moe_shared_expert_overlap = False

    # DDP config
    cfg.ddp.overlap_grad_reduce = True
    cfg.ddp.overlap_param_gather = True
    cfg.ddp.check_for_nan_in_grad = True
    cfg.ddp.use_distributed_optimizer = True
    cfg.ddp.use_megatron_fsdp = False
    cfg.ddp.grad_reduce_in_fp32 = True
    cfg.ddp.average_in_collective = True
    cfg.ddp.data_parallel_sharding_strategy = "no_shard"

    # MoE Force Load Balancing
    cfg.model.moe_router_force_load_balancing = False

    return cfg


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for GPT-OSS provider/config defaults for pipeline_dtype
rg -n "gpt_oss|gpt-oss|GPTOSS" src/megatron/bridge -i -A 5 -B 5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 34016


🏁 Script executed:

#!/bin/bash
# Search for pipeline_dtype references and defaults
rg -n "pipeline_dtype" src/megatron/bridge -A 3 -B 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50383


🏁 Script executed:

#!/bin/bash
# Look for the AutoBridge.from_hf_pretrained implementation
rg -n "from_hf_pretrained" src/megatron/bridge -A 10 -B 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50384


🏁 Script executed:

#!/bin/bash
# Search for openai/gpt-oss references
rg -n "openai/gpt-oss" . -A 10 -B 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 14084


Set pipeline_dtype to torch.bfloat16 when pipeline_model_parallel_size > 1.

For PP=4, pipeline_dtype should not be None. Across the codebase (Qwen, Llama, OlMoE, Moonlight recipes), when pipeline parallelism is enabled, pipeline_dtype is explicitly set to torch.bfloat16. This prevents undefined behavior during pipeline communication and gradient computation. The same issue applies to the 20B pretrain config.

💡 Proposed fix
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def gpt_oss_120b_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for GPT-OSS 120B variant.

    Recommended parallelism: TP=2, PP=4, EP=16
    """
    cfg = _pretrain_common()

    # Model config
    cfg.model = AutoBridge.from_hf_pretrained("openai/gpt-oss-120b").to_megatron_provider(load_weights=False)

    # Tokenizer - uses NullTokenizer by default
    cfg.tokenizer.tokenizer_type = "NullTokenizer"
    cfg.tokenizer.tokenizer_model = None
    cfg.tokenizer.vocab_size = DEFAULT_NULL_TOKENIZER_VOCAB_SIZE

    # Dataset config - mock data by default
    cfg.dataset.blend = None  # Pass the path to the dataset here if not using mock data, along with weight. Ex: (["path/to/data1"], 0.2), [("path/to/data2", 0.8)]
    cfg.dataset.seq_length = 4096
    cfg.dataset.num_workers = 8

    # Parallelism settings (MoE-specific)
    cfg.model.tensor_model_parallel_size = 2
    cfg.model.pipeline_model_parallel_size = 4
    cfg.model.pipeline_model_parallel_layout = None
    cfg.model.pipeline_dtype = torch.bfloat16
    cfg.model.virtual_pipeline_model_parallel_size = None
    cfg.model.context_parallel_size = 1
    cfg.model.expert_model_parallel_size = 16  # Larger EP for 120B
    cfg.model.expert_tensor_parallel_size = 1
    cfg.model.sequence_parallel = True
    cfg.model.seq_length = 4096

    # Pipeline split settings
    cfg.model.account_for_embedding_in_pipeline_split = False
    cfg.model.account_for_loss_in_pipeline_split = False
    if cfg.model.context_parallel_size > 1:
        cfg.model.calculate_per_token_loss = True
        cfg.model.cp_comm_type = "a2a"  # only a2a cp is supported for sink attention.

    # MoE Token Dispatcher settings
    cfg.model.moe_token_dispatcher_type = "alltoall"
    cfg.model.moe_flex_dispatcher_backend = "deepep"
    cfg.model.moe_hybridep_num_sms = 16

    # Training config (DIFFERENT from _pretrain_common)
    cfg.train.train_iters = 1000000
    cfg.train.global_batch_size = 512
    cfg.train.micro_batch_size = 1
    cfg.train.eval_interval = 2000
    cfg.train.manual_gc = True
    cfg.train.manual_gc_interval = 100

    # Scheduler config
    cfg.scheduler.lr_warmup_iters = 2000

    # TE (Transformer Engine)
    cfg.model.transformer_impl = "transformer_engine"

    # CUDA Graph
    cfg.model.cuda_graph_impl = "none"
    cfg.model.cuda_graph_scope = "full"
    cfg.model.cuda_graph_warmup_steps = 3

    # Kernel selections
    cfg.model.attention_backend = None
    cfg.model.moe_router_fusion = False
    cfg.model.moe_permute_fusion = True
    cfg.model.moe_grouped_gemm = True
    cfg.model.cross_entropy_loss_fusion = True
    cfg.model.cross_entropy_fusion_impl = "native"  # GPT-OSS uses native

    # Memory saving
    cfg.model.recompute_granularity = None
    cfg.model.recompute_modules = None
    cfg.model.fine_grained_activation_offloading = False
    cfg.model.offload_modules = None

    # Mixed precision
    cfg.model.moe_router_padding_for_fp8 = False

    # Optimizer precision settings
    cfg.optimizer.use_precision_aware_optimizer = False
    cfg.optimizer.main_grads_dtype = torch.float32
    cfg.optimizer.main_params_dtype = torch.float32
    cfg.optimizer.exp_avg_dtype = torch.float32
    cfg.optimizer.exp_avg_sq_dtype = torch.float32

    # Communication overlap (default None, can pass CommOverlapConfig for advanced overlap)
    # cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=False)  # Uncomment to enable
    # cfg.comm_overlap.delay_wgrad_compute = False
    # cfg.comm_overlap.overlap_moe_expert_parallel_comm = False
    cfg.model.moe_shared_expert_overlap = False

    # DDP config
    cfg.ddp.overlap_grad_reduce = True
    cfg.ddp.overlap_param_gather = True
    cfg.ddp.check_for_nan_in_grad = True
    cfg.ddp.use_distributed_optimizer = True
    cfg.ddp.use_megatron_fsdp = False
    cfg.ddp.grad_reduce_in_fp32 = True
    cfg.ddp.average_in_collective = True
    cfg.ddp.data_parallel_sharding_strategy = "no_shard"

    # MoE Force Load Balancing
    cfg.model.moe_router_force_load_balancing = False
🧰 Tools
🪛 Ruff (0.14.14)

[error] 291-291: Possible hardcoded password assigned to: "moe_token_dispatcher_type"

(S105)

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py` around lines 251 - 356, The
gpt_oss_120b_pretrain_config function leaves cfg.model.pipeline_dtype as None
even though cfg.model.pipeline_model_parallel_size is 4; update
gpt_oss_120b_pretrain_config to set cfg.model.pipeline_dtype = torch.bfloat16
whenever cfg.model.pipeline_model_parallel_size > 1 (mirror the pattern used in
other recipes), and make the same change in the corresponding 20B pretrain
config function so pipeline parallelism uses torch.bfloat16 for safe pipeline
communication and gradient computation.

Comment on lines +91 to +97
def nemotron_nano_9b_v2_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for Nemotron Nano 9B v2.

    This recipe is designed for single-node training (1 node).
    Default parallelism: TP=2, PP=1, SP=True.
    """


def nemotron_nano_12b_v2_pretrain_config() -> ConfigContainer:
    cfg = _pretrain_common()

⚠️ Potential issue | 🟡 Minor

Add Google-style Returns sections to the Nano v2 pretrain config docstrings.

This keeps the public recipe docs consistent and Sphinx-parseable.

Example update (apply to both pretrain configs)
 def nemotron_nano_9b_v2_pretrain_config() -> ConfigContainer:
     """Return a pre-training config for Nemotron Nano 9B v2.
 
     This recipe is designed for single-node training (1 node).
     Default parallelism: TP=2, PP=1, SP=True.
+
+    Returns:
+        ConfigContainer: Pre-training configuration for Nemotron Nano 9B v2.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.

Also applies to: 192-199

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py` around lines 91 -
97, Update the docstrings for the Nano v2 pretrain config functions (e.g.,
nemotron_nano_9b_v2_pretrain_config and the other pretrain config around lines
192-199) to use Google-style docstrings by adding a "Returns" section that
documents the return type and purpose (e.g., "Returns: ConfigContainer:
pre-training configuration for Nemotron Nano 9B v2" or similar), ensuring the
section is Sphinx/napoleon-parsable and consistent with the existing top-level
description.

Comment on lines +102 to 107
def nemotronh_4b_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for NemotronH 4B.

    This recipe is designed for single-node training (1 node).
    Default parallelism: TP=1, PP=1, SP=False.
    """

⚠️ Potential issue | 🟡 Minor

Add Google-style Returns sections to NemotronH pretrain config docstrings.

Each pretrain config docstring omits the Returns: block. Please add it consistently across these functions.

Example update (apply to all pretrain configs)
 def nemotronh_4b_pretrain_config() -> ConfigContainer:
     """Return a pre-training config for NemotronH 4B.
 
     This recipe is designed for single-node training (1 node).
     Default parallelism: TP=1, PP=1, SP=False.
+
+    Returns:
+        ConfigContainer: Pre-training configuration for NemotronH 4B.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.

Also applies to: 203-208, 304-311, 407-414

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py` around lines 102 - 107,
Add a Google-style "Returns" section to the docstring of
nemotronh_4b_pretrain_config and the other pretrain config functions in this
file (the ones noted in the review). Specifically, update each function
docstring to include a "Returns:" block that states the return type and brief
description (e.g., "Returns: ConfigContainer: A pre-training configuration for
NemotronH 4B."), following Google docstring formatting so Sphinx can parse it.

Comment on lines 41 to 65
def _safe_overrides_for(name: str) -> dict:
    """Return overrides for recipe functions.

    Pretrain configs use the new parameterless API (return empty dict).
    Finetune configs still accept parameters.
    """
    is_finetune = "finetune" in name.lower()

    if is_finetune:
        # Finetuning-specific overrides - finetune configs still accept parameters
        overrides = {
            "name": f"unit_{name}",
            "dir": ".",
            "train_iters": 10,
            "micro_batch_size": 1,
            "seq_length": 64,
            "min_lr": 1e-5,
            "lr_warmup_iters": 2,
            "global_batch_size": 2,
            "finetune_lr": 1e-4,
        }
    else:
        # Pretrain configs use the new parameterless API
        overrides = {}


⚠️ Potential issue | 🟡 Minor

Make _safe_overrides_for docstring Google-style (Args/Returns).

This keeps helper docs consistent and Sphinx-parseable.

Proposed docstring update
 def _safe_overrides_for(name: str) -> dict:
     """Return overrides for recipe functions.
 
     Pretrain configs use the new parameterless API (return empty dict).
     Finetune configs still accept parameters.
+
+    Args:
+        name: Recipe function name.
+
+    Returns:
+        dict: Overrides to pass into the recipe function.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.
🤖 Prompt for AI Agents
In `@tests/unit_tests/recipes/gemma/test_gemma2_recipes.py` around lines 41 - 65,
update the _safe_overrides_for function docstring to Google-style with an "Args"
section describing the name: str parameter and a "Returns" section describing
the returned dict of overrides; keep the existing short description about
pretrain vs finetune behavior and briefly note that finetune configs accept
parameters while pretrain return an empty dict. Ensure the docstring is
triple-quoted and Sphinx-friendly for parsing.

Comment on lines 37 to 62
 def _safe_overrides_for(name: str) -> dict:
-    # Detect if this is a finetune recipe
-    is_finetune = "finetune" in name.lower()
-
-    overrides = {
-        "name": f"unit_{name}",
-        "dir": ".",
-        "train_iters": 10,
-        "global_batch_size": 2,
-        "micro_batch_size": 1,
-        "seq_length": 64,
-        "min_lr": 1e-5,
-        "lr_warmup_iters": 2,
-    }
+    """Return overrides for recipe functions.
+
+    Pretrain configs use the new parameterless API (return empty dict).
+    Finetune configs still accept parameters.
+    """
+    is_finetune = "finetune" in name.lower()
 
     if is_finetune:
-        # Finetuning-specific overrides
-        overrides.update(
-            {
-                "finetune_lr": 1e-4,
-                "pretrained_checkpoint": "/fake/checkpoint/path",
-            }
-        )
-        # Note: Finetuning recipes set parallelism internally based on PEFT vs full SFT
-        # Note: Finetuning always uses HF tokenizer, never null tokenizer
+        # Finetuning-specific overrides - finetune configs still accept parameters
+        overrides = {
+            "name": f"unit_{name}",
+            "dir": ".",
+            "train_iters": 10,
+            "global_batch_size": 2,
+            "micro_batch_size": 1,
+            "seq_length": 64,
+            "min_lr": 1e-5,
+            "lr_warmup_iters": 2,
+            "finetune_lr": 1e-4,
+            "pretrained_checkpoint": "/fake/checkpoint/path",
+        }
     else:
-        # Pretrain-specific overrides
-        overrides.update(
-            {
-                "mock": True,
-                "lr": 1e-4,
-                "tensor_model_parallel_size": 1,
-                "pipeline_model_parallel_size": 1,
-                "context_parallel_size": 1,
-                "use_null_tokenizer": True,
-            }
-        )
-
-        # Large models/variants may set additional flags in recipes; keep harmless defaults
-        lname = name.lower()
-        if "12b" in lname or "27b" in lname:
-            overrides.update(
-                {
-                    "virtual_pipeline_model_parallel_size": None,
-                    "sequence_parallel": True,
-                }
-            )
+        # Pretrain configs use the new parameterless API
+        overrides = {}


⚠️ Potential issue | 🟡 Minor

Make _safe_overrides_for docstring Google-style (Args/Returns).

This keeps helper docs consistent and Sphinx-parseable.

Proposed docstring update
 def _safe_overrides_for(name: str) -> dict:
     """Return overrides for recipe functions.
 
     Pretrain configs use the new parameterless API (return empty dict).
     Finetune configs still accept parameters.
+
+    Args:
+        name: Recipe function name.
+
+    Returns:
+        dict: Overrides to pass into the recipe function.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.
🤖 Prompt for AI Agents
In `@tests/unit_tests/recipes/test_gemma3_recipes.py` around lines 37 - 62, update
the _safe_overrides_for function docstring to Google style: add an "Args"
section describing the name: str parameter and a "Returns" section describing
the returned dict of overrides; keep the existing brief description about
pretrain vs finetune behavior and ensure the docstring remains
Sphinx/Google-parseable (triple-quoted) for the function _safe_overrides_for.

# Conflicts:
#	src/megatron/bridge/recipes/deepseek/deepseek_v2.py
#	src/megatron/bridge/recipes/deepseek/deepseek_v3.py
#	src/megatron/bridge/recipes/llama/llama2.py
#	src/megatron/bridge/recipes/moonlight/moonlight_16b.py
#	src/megatron/bridge/recipes/nemotronh/nemotronh.py
@yaoyu-33
Contributor Author

yaoyu-33 commented Feb 5, 2026

/ok to test 718dee4

@yaoyu-33 yaoyu-33 merged commit b19588a into main Feb 5, 2026
50 checks passed
@yaoyu-33 yaoyu-33 deleted the replay/0909f9fd branch February 5, 2026 20:53
rhmukundan pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
rhmukundan pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
sowmen pushed a commit to sowmen/Megatron-Bridge that referenced this pull request Feb 11, 2026
ko3n1g pushed a commit that referenced this pull request Feb 24, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g mentioned this pull request Feb 24, 2026
5 tasks

2 participants