Fix performance config scripts for parameterless recipe API#2201

Merged
yaoyu-33 merged 3 commits into main from replay/0909f9fd on Feb 5, 2026
Conversation


yaoyu-33 (Contributor) commented Feb 3, 2026

Summary

This PR fixes the performance config scripts to work with the new parameterless recipe API introduced in PR #2067.

Problem

After reverting PR #2067, the nemo-ci library and performance testing were broken because:

  1. The scripts/performance/configs/*/*_pretrain.py files import and instantiate recipes
  2. Recipes no longer accept kwargs (like mock, precision_config, dir, name)
  3. The performance scripts were still passing these arguments

Changes

scripts/performance/utils/utils.py

  • Updated get_library_recipe() to call recipes without arguments
  • Set output paths after instantiation:
    • cfg.checkpoint.save and cfg.checkpoint.load
    • cfg.logger.tensorboard_dir
    • cfg.logger.wandb_exp_name and cfg.logger.wandb_save_dir

Performance Config Files

Updated all pretrain config functions to:

  • Call base recipe functions without mock and precision_config arguments
  • Directly set cfg.mixed_precision after instantiation
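The before/after shape of each config function looks roughly like this; the recipe below is a stand-in, and the function and field names are illustrative:

```python
from types import SimpleNamespace

def llama3_8b_pretrain_config():
    """Hypothetical stand-in for the parameterless library recipe."""
    return SimpleNamespace(mixed_precision=None)

def get_llama3_8b_perf_config(precision_config):
    # Old call, no longer valid:
    #   llama3_8b_pretrain_config(mock=True, precision_config=precision_config)
    # New pattern: instantiate with no arguments, then set precision directly.
    cfg = llama3_8b_pretrain_config()
    cfg.mixed_precision = precision_config
    return cfg
```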

Files updated:

  • scripts/performance/configs/llama/llama3_llm_pretrain.py
  • scripts/performance/configs/llama/llama31_llm_pretrain.py
  • scripts/performance/configs/deepseek/deepseek_llm_pretrain.py
  • scripts/performance/configs/qwen/qwen3_llm_pretrain.py
  • scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py
  • scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py

DeepSeek V3 Layout Fix

  • Fixed DeepSeek V3 to recompute pipeline layout after updating PP/VP sizes
  • Import and call set_deepseek_v3_pipeline_model_parallel_layout() after changing parallelism settings
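The ordering constraint can be illustrated with a toy stand-in for the layout helper; the real set_deepseek_v3_pipeline_model_parallel_layout() computes an actual layer-to-stage assignment rather than the placeholder used here:

```python
from types import SimpleNamespace

def set_deepseek_v3_pipeline_model_parallel_layout(model):
    """Toy stand-in: one layout entry per (virtual) pipeline stage."""
    stages = model.pipeline_model_parallel_size * (
        model.virtual_pipeline_model_parallel_size or 1
    )
    model.pipeline_model_parallel_layout = [["layers"] for _ in range(stages)]

model = SimpleNamespace(
    pipeline_model_parallel_size=1,
    virtual_pipeline_model_parallel_size=None,
    pipeline_model_parallel_layout=None,
)

# Order matters: update PP/VP sizes first, then recompute the layout;
# otherwise the layout still reflects the old parallelism settings.
model.pipeline_model_parallel_size = 8
model.virtual_pipeline_model_parallel_size = 2
set_deepseek_v3_pipeline_model_parallel_layout(model)
```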

New Functional Test

  • Added tests/functional_tests/recipes/test_perf_config_integration.py
  • Verifies that performance configs can be instantiated correctly
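A test of this shape might look like the following sketch. The config functions here are stand-ins; the real test imports the actual performance config entry points:

```python
# Illustrative stand-ins for the perf config factories under test.
def llama3_8b_config():
    return {"mixed_precision": "bf16_mixed"}

def llama3_70b_config():
    return {"mixed_precision": "fp8_mixed"}

PERF_CONFIG_FUNCS = [llama3_8b_config, llama3_70b_config]

def test_perf_configs_instantiate():
    # Every perf config must be callable with no arguments and
    # come back with a precision already set.
    for fn in PERF_CONFIG_FUNCS:
        cfg = fn()
        assert cfg["mixed_precision"] is not None
```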

Testing

  • All pre-commit hooks pass
  • Functional test added for integration verification

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced simplified, parameter-free configuration functions for model pretraining across all supported architectures.
    • Added centralized pretraining configuration baseline for consistent setup patterns.
  • Improvements

    • Standardized API design for model recipe configurations, reducing parameter complexity.
    • Enhanced checkpoint and logging path handling in configuration utilities.
    • Improved test coverage for configuration integration and model variants.

This commit fixes the performance config scripts to work with the new
parameterless recipe API introduced in PR #2067.

Changes:
- Update get_library_recipe() to call recipes without arguments and set
  output paths (checkpoint, tensorboard, wandb) after instantiation
- Update all pretrain config functions to:
  - Call base recipe functions without mock/precision_config arguments
  - Directly set cfg.mixed_precision after instantiation
- Fix DeepSeek V3 to recompute pipeline layout after updating PP/VP sizes
  by calling set_deepseek_v3_pipeline_model_parallel_layout()
- Add functional test for performance config integration

Files modified:
- scripts/performance/utils/utils.py
- scripts/performance/configs/deepseek/deepseek_llm_pretrain.py
- scripts/performance/configs/llama/llama3_llm_pretrain.py
- scripts/performance/configs/llama/llama31_llm_pretrain.py
- scripts/performance/configs/qwen/qwen3_llm_pretrain.py
- scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py
- scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py
- tests/functional_tests/recipes/test_perf_config_integration.py
copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.



coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

This PR introduces a centralized _pretrain_common() helper function and refactors ~40 recipe configuration functions across multiple model families (Llama, DeepSeek, Qwen, Gemma, GPT, Nemotron, etc.) from parameterized APIs accepting user kwargs to parameterless factory functions returning pre-configured ConfigContainer objects. The refactoring standardizes pretraining setup through explicit, model-specific configuration assignments rather than parameter passing.
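The refactoring pattern can be sketched with a miniature hypothetical ConfigContainer; the field values below are illustrative, and the real _pretrain_common() also populates optimizer, scheduler, dataset, and logger defaults:

```python
from dataclasses import dataclass, field

# Hypothetical miniature of the real ConfigContainer.
@dataclass
class TrainConfig:
    train_iters: int = 1_000_000
    global_batch_size: int = 512

@dataclass
class ConfigContainer:
    train: TrainConfig = field(default_factory=TrainConfig)
    seq_length: int = 4096

def _pretrain_common() -> ConfigContainer:
    # Shared baseline for all pretrain recipes.
    return ConfigContainer()

def llama3_8b_pretrain_config() -> ConfigContainer:
    # Parameterless factory: model-specific values are assigned
    # explicitly rather than threaded through constructor kwargs.
    cfg = _pretrain_common()
    cfg.seq_length = 8192
    cfg.train.global_batch_size = 128
    return cfg
```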

Changes

  • Core pretraining common helper (src/megatron/bridge/recipes/common.py): Introduces the new _pretrain_common() private helper function, providing a standardized baseline pretraining ConfigContainer with default optimizer, scheduler, dataset, logger, and training settings.
  • Llama recipes (src/megatron/bridge/recipes/llama/llama2.py, llama3.py, llama3_8b_16k_...): Converts parameterized pretrain/finetune config functions to parameterless factories using the _pretrain_common() baseline. Expands Llama3 coverage with variants for multiple sequence lengths (16K, 64K, 128K) and precision modes.
  • DeepSeek recipes (src/megatron/bridge/recipes/deepseek/deepseek_v2.py, deepseek_v3.py): Replaces kwargs-based config builders with streamlined _pretrain_common() implementations. Introduces AutoBridge.from_hf_pretrained() for model loading and centralizes MoE/pipeline layout configuration.
  • Gemma recipes (src/megatron/bridge/recipes/gemma/gemma2.py, gemma3.py): Migrates from _gemma_common-based helpers to the _pretrain_common() pathway. Updates model instantiation via AutoBridge and removes GPTDatasetConfig imports in favor of centralized dataset handling.
  • Qwen recipes (src/megatron/bridge/recipes/qwen/qwen2.py, qwen3.py, qwen3_moe.py, qwen3_next.py): Converts multiple Qwen2/Qwen3 variants and MoE models from parameterized to parameterless APIs. Adds new Qwen3 235B-A22B pretrain/finetune configs and finetuning helpers.
  • GPT/GPT-OSS recipes (src/megatron/bridge/recipes/gpt/gpt3_175b.py, gpt_oss/gpt_oss.py): Replaces large multi-argument builders with concise _pretrain_common() implementations. Removes separate model_config() functions and consolidates configuration into single pretrain entry points.
  • Other model families (src/megatron/bridge/recipes/glm/glm45.py, kimi/kimi_k2.py, moonlight/moonlight_16b.py, nemotronh/*, olmoe/olmoe_7b.py): Applies the same refactoring pattern: removes per-model helper functions, introduces the _pretrain_common() baseline, adds pipeline layout helpers where needed, and updates tokenizer/dataset/MoE configuration paths.
  • Performance config scripts (scripts/performance/configs/deepseek/..., gpt_oss/..., llama/..., nemotronh/..., qwen/...): Updates all performance config functions to call parameterless recipe functions and assign precision/layout via post-instantiation mutations instead of constructor arguments.
  • Example and utility scripts (examples/quantization/pretrain_quantized_llama3_8b.py, scripts/performance/utils/utils.py, tests/functional_tests/recipes/utils.py): Migrates to the parameterless recipe API with inline model/config overrides. Updates get_library_recipe() to construct paths post-instantiation rather than via parameters.
  • Unit and functional tests (tests/unit_tests/recipes/*, tests/functional_tests/recipes/*): Updates test fixtures to call parameterless config functions, introduces a new integration test module (test_perf_config_integration.py), adjusts override logic to distinguish pretrain (empty overrides) from finetune (full overrides), and relaxes tokenizer assertions to accept either NullTokenizer or HuggingFaceTokenizer for pretrain paths.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #2067: Implements the same large-scale refactor introducing _pretrain_common() and converting recipe functions to parameterless ConfigContainer factories with AutoBridge HF handling.
  • PR #1914: Modifies Nemotron-3-Nano recipe surface, adding/altering nemotron_3_nano pretrain/finetune config functions and provider classes.

Suggested reviewers

  • erhoo82
  • malay-nagda
  • cuichenx
🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title clearly describes the main change: updating performance config scripts to be compatible with the parameterless recipe API introduced in another PR.
  • Docstring Coverage: Passed. Docstring coverage is 96.67%, above the required 80.00% threshold.
  • Test Results For Major Changes: Passed. The PR contains compatibility fixes for the parameterless recipe API with a comprehensive functional integration test covering 8+ model configs, precision variations, and path resolution.




coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
scripts/performance/configs/llama/llama3_llm_pretrain.py (2)

341-348: ⚠️ Potential issue | 🟡 Minor

Missing nvfp4 tp_comm_overlap check compared to other 8B configs.

Other 8B config functions (gb300, gb200, b300, b200) include a check to disable tp_comm_overlap when precision == "nvfp4", but the H100 variant does not. If nvfp4 is not supported on H100, consider adding a comment to clarify this; otherwise, this may be an oversight.

🔧 Suggested fix if nvfp4 check is needed
     cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=bool(cfg.model.tensor_model_parallel_size > 1))
+    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap

     if cfg.ddp.use_megatron_fsdp:

50-52: ⚠️ Potential issue | 🟡 Minor

Remove unused mock parameter from all config functions.

The mock parameter is declared in all 10 config function signatures (lines 50, 85, 120, 155, 190, 222, 247, 272, 297, 322) but never referenced in any function body. Tests explicitly pass mock=True when calling these functions, indicating this was likely a legacy parameter from an earlier API. Since it has no effect, either remove it entirely or document why it is retained for backward compatibility.

tests/unit_tests/recipes/gemma/test_gemma2_recipes.py (1)

117-154: ⚠️ Potential issue | 🟡 Minor

Add module-level pytest unit test categorization.

This test module lacks the required pytestmark declaration for test categorization. Add the following after the imports:

import pytest


+pytestmark = pytest.mark.unit

All tests in this module should be categorized as unit tests per the coding guidelines.

tests/unit_tests/recipes/test_gemma3_recipes.py (1)

128-179: ⚠️ Potential issue | 🟡 Minor

Add module-level pytest.mark.unit to categorize all tests in this file.

Since all test functions here are unit tests, use a module-level mark to avoid repeating the decorator on each function.

Proposed change
 import pytest
 
+pytestmark = pytest.mark.unit

As per coding guidelines: Use 'pytest.mark' to categorize tests (unit, integration, system).

🤖 Fix all issues with AI agents
In `@scripts/performance/configs/deepseek/deepseek_llm_pretrain.py`:
- Around line 142-151: The b300 pretrain config currently ignores
base_cfg.pp_layout by always calling
set_deepseek_v3_pipeline_model_parallel_layout(cfg.model); change it to follow
the gb300/gb200 pattern: if base_cfg.pp_layout is set, assign
cfg.model.pp_layout = base_cfg.pp_layout (or equivalent field) and do not
recompute, otherwise call
set_deepseek_v3_pipeline_model_parallel_layout(cfg.model) to compute the layout;
update the block that sets cfg.model.pipeline_model_parallel_size /
virtual_pipeline_model_parallel_size / moe_flex_dispatcher_backend to
conditionally respect base_cfg.pp_layout before recomputing the layout.

In `@src/megatron/bridge/recipes/gemma/gemma3.py`:
- Around line 128-132: Update the docstring for gemma3_1b_pretrain_config to use
Google-style formatting by adding a "Returns" section that documents the
returned type and meaning (e.g., "Returns: ConfigContainer: A pre-training
configuration for Gemma3 1B with default parallelism TP=1, PP=1 and
seq_length=32K"). Keep the existing short description and default-parallelism
note, and ensure the "Returns" section is placed after the function description.

In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py`:
- Around line 251-356: The gpt_oss_120b_pretrain_config function leaves
cfg.model.pipeline_dtype as None even though
cfg.model.pipeline_model_parallel_size is 4; update gpt_oss_120b_pretrain_config
to set cfg.model.pipeline_dtype = torch.bfloat16 whenever
cfg.model.pipeline_model_parallel_size > 1 (mirror the pattern used in other
recipes), and make the same change in the corresponding 20B pretrain config
function so pipeline parallelism uses torch.bfloat16 for safe pipeline
communication and gradient computation.
- Around line 132-248: In gpt_oss_20b_pretrain_config, explicitly set
cfg.model.pipeline_dtype to torch.bfloat16 when using
pipeline_model_parallel_size=4 (replace the current None); update the assignment
near the "Parallelism settings" block so cfg.model.pipeline_model_parallel_size
remains 4 and cfg.model.pipeline_dtype = torch.bfloat16 to follow the codebase
convention and avoid relying on finalization auto-population.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py`:
- Around line 91-97: Update the docstrings for the Nano v2 pretrain config
functions (e.g., nemotron_nano_9b_v2_pretrain_config and the other pretrain
config around lines 192-199) to use Google-style docstrings by adding a
"Returns" section that documents the return type and purpose (e.g., "Returns:
ConfigContainer: pre-training configuration for Nemotron Nano 9B v2" or
similar), ensuring the section is Sphinx/napoleon-parsable and consistent with
the existing top-level description.

In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py`:
- Around line 102-107: Add a Google-style "Returns" section to the docstring of
nemotronh_4b_pretrain_config and the other pretrain config functions in this
file (the ones noted in the review). Specifically, update each function
docstring to include a "Returns:" block that states the return type and brief
description (e.g., "Returns: ConfigContainer: A pre-training configuration for
NemotronH 4B."), following Google docstring formatting so Sphinx can parse it.

In `@tests/unit_tests/recipes/gemma/test_gemma2_recipes.py`:
- Around line 41-65: Update the _safe_overrides_for function docstring to
Google-style with an "Args" section describing the name: str parameter and a
"Returns" section describing the returned dict of overrides; keep the existing
short description about pretrain vs finetune behavior and briefly note that
finetune configs accept parameters while pretrain return an empty dict. Ensure
the docstring is triple-quoted and Sphinx-friendly for parsing.

In `@tests/unit_tests/recipes/test_gemma3_recipes.py`:
- Around line 37-62: Update the _safe_overrides_for function docstring to Google
style: add an "Args" section describing the name: str parameter and a "Returns"
section describing the returned dict of overrides; keep the existing brief
description about pretrain vs finetune behavior and ensure the docstring remains
Sphinx/Google-parseable (triple-quoted) for the function _safe_overrides_for.
🧹 Nitpick comments (12)
scripts/performance/configs/llama/llama3_llm_pretrain.py (1)

15-32: Import order does not follow coding guidelines.

Per coding guidelines, imports should be organized as: future imports, standard library, third-party, first-party (megatron.bridge.*), then local folder imports. Currently, local utils.* imports precede megatron.bridge.* imports.

♻️ Suggested import reordering
 import logging

+from megatron.bridge.recipes.llama import llama3_8b_pretrain_config, llama3_70b_pretrain_config
+from megatron.bridge.training.comm_overlap import (
+    CommOverlapConfig,
+    userbuffers_bf16_b200_h8192_tp2_mbs1_seqlen8192,
+    userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192,
+    userbuffers_fp8_b200_h8192_tp2_mbs1_seqlen8192,
+    userbuffers_fp8_h100_h8192_tp4_mbs1_seqlen8192,
+)
+from megatron.bridge.training.config import ConfigContainer
+
 from utils.overrides import set_workload_base_configs
 from utils.precision import get_precision_config
 from utils.utils import get_workload_base_config

-from megatron.bridge.recipes.llama import llama3_8b_pretrain_config, llama3_70b_pretrain_config
-from megatron.bridge.training.comm_overlap import (
-    CommOverlapConfig,
-    userbuffers_bf16_b200_h8192_tp2_mbs1_seqlen8192,
-    userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192,
-    userbuffers_fp8_b200_h8192_tp2_mbs1_seqlen8192,
-    userbuffers_fp8_h100_h8192_tp4_mbs1_seqlen8192,
-)
-from megatron.bridge.training.config import ConfigContainer
src/megatron/bridge/recipes/olmoe/olmoe_7b.py (1)

143-145: Simplify redundant list comprehension.

The outer list() call is unnecessary since the list comprehension already produces a list.

♻️ Suggested simplification
     if layout is not None:
-        layout = list([list(x) for x in layout])
+        layout = [list(x) for x in layout]
     return layout
src/megatron/bridge/recipes/kimi/kimi_k2.py (1)

42-44: Simplify redundant list comprehension.

Same issue as in olmoe_7b.py - the outer list() is redundant.

♻️ Suggested simplification
     if layout is not None:
-        layout = list([list(x) for x in layout])
+        layout = [list(x) for x in layout]
     return layout
tests/unit_tests/recipes/kimi/test_kimi_k2.py (1)

24-25: Consider adding @pytest.mark.unit decorator.

The test classes are missing the @pytest.mark.unit marker. Per coding guidelines, tests should use pytest.mark to categorize tests.

♻️ Add pytest markers
+@pytest.mark.unit
 class TestKimiK2PipelineLayout:
     """Test cases for _get_kimi_k2_pipeline_layout function."""
+@pytest.mark.unit
 class TestKimiK2PretrainConfig:
     """Test cases for kimi_k2_pretrain_config function."""

As per coding guidelines: "Use 'pytest.mark' to categorize tests (unit, integration, system)".

Also applies to: 50-51

src/megatron/bridge/recipes/deepseek/deepseek_v3.py (3)

87-93: Redundant assignment of rotary_base.

rotary_base is assigned on line 89, then immediately converted to float on line 91. The second assignment is redundant since line 89 already assigns a float literal.

♻️ Proposed fix
     # Model-specific settings
     cfg.model.init_method_std = 0.006
-    cfg.model.rotary_base = 10000.0
-    cfg.model.rotary_scaling_factor = 40
-    cfg.model.rotary_base = float(cfg.model.rotary_base)  # Ensure rotary_base is float
-    cfg.model.rotary_scaling_factor = int(cfg.model.rotary_scaling_factor)
+    cfg.model.rotary_base = 10000.0  # float literal
+    cfg.model.rotary_scaling_factor = 40  # int literal

198-334: Consider extracting common configuration into a helper function.

deepseek_v3_pretrain_config_32nodes() shares ~90% of its code with deepseek_v3_pretrain_config(). The differences are primarily in parallelism settings (PP=8 vs PP=16, EP=32 vs EP=64) and recompute configuration (full vs selective).

A shared helper would reduce maintenance burden and risk of divergence.

♻️ Suggested approach
def _deepseek_v3_common() -> ConfigContainer:
    """Common configuration for DeepSeek-V3 variants."""
    cfg = _pretrain_common()
    cfg.model = AutoBridge.from_hf_pretrained("deepseek-ai/DeepSeek-V3").to_megatron_provider(load_weights=False)
    # ... shared config settings ...
    return cfg

def deepseek_v3_pretrain_config() -> ConfigContainer:
    cfg = _deepseek_v3_common()
    cfg.model.pipeline_model_parallel_size = 16
    cfg.model.expert_model_parallel_size = 64
    cfg.model.recompute_granularity = "selective"
    # ... variant-specific settings ...
    return cfg

def deepseek_v3_pretrain_config_32nodes() -> ConfigContainer:
    cfg = _deepseek_v3_common()
    cfg.model.pipeline_model_parallel_size = 8
    cfg.model.expert_model_parallel_size = 32
    cfg.model.recompute_granularity = "full"
    cfg.model.recompute_method = "uniform"
    cfg.model.recompute_num_layers = 1
    # ... variant-specific settings ...
    return cfg

234-239: Same redundant pattern as the main config.

The same redundant assignment pattern for rotary_base and rotary_scaling_factor appears here.

♻️ Proposed fix
     # Model-specific settings
     cfg.model.init_method_std = 0.006
-    cfg.model.rotary_base = 10000.0
-    cfg.model.rotary_scaling_factor = 40
-    cfg.model.rotary_base = float(cfg.model.rotary_base)
-    cfg.model.rotary_scaling_factor = int(cfg.model.rotary_scaling_factor)
+    cfg.model.rotary_base = 10000.0
+    cfg.model.rotary_scaling_factor = 40
scripts/performance/configs/deepseek/deepseek_llm_pretrain.py (1)

48-49: Unused mock parameter across all config functions.

The mock parameter is declared but never used in any of the config functions (gb300, gb200, b300, b200, h100). This appears to be a leftover from the previous API.

Consider removing it if no longer needed, or add a comment explaining it's retained for backward compatibility.

Also applies to: 88-89, 128-129, 160-161, 192-193

tests/unit_tests/recipes/test_deepseek_recipes.py (1)

80-102: Consider adding @pytest.mark.unit decorator.

The test function lacks the @pytest.mark.unit marker. Per coding guidelines, tests should use pytest.mark to categorize tests (unit, integration, system).

♻️ Proposed fix
+@pytest.mark.unit
 @pytest.mark.parametrize("recipe_func", _DEEPSEEK_RECIPE_FUNCS)
 def test_each_deepseek_recipe_builds_config(recipe_func: Callable, monkeypatch: pytest.MonkeyPatch):

As per coding guidelines: "Use 'pytest.mark' to categorize tests (unit, integration, system)".

tests/functional_tests/recipes/test_perf_config_integration.py (2)

24-30: sys.path manipulation may be fragile for test discovery.

Using sys.path.insert at module level affects the entire test session. Consider using pytest's pythonpath configuration in pyproject.toml or a conftest.py fixture for cleaner path management.


33-35: Add pytest markers for test categorization.

Per coding guidelines, tests should use pytest.mark to categorize tests. Since these are functional/integration tests, they should be marked appropriately.

♻️ Proposed fix
+import pytest
+
+
+@pytest.mark.functional
 class TestPerfConfigIntegration:
     """Test class for performance config integration with library recipes."""

As per coding guidelines: "Use 'pytest.mark' to categorize tests (unit, integration, system)".

src/megatron/bridge/recipes/moonlight/moonlight_16b.py (1)

130-148: Add an explicit return type to _get_moonlight_pipeline_layout.
This keeps the API typed and aligned with the codebase’s typing conventions.

♻️ Suggested change
-def _get_moonlight_pipeline_layout(pp_size: int, vp_size: int):
+def _get_moonlight_pipeline_layout(pp_size: int, vp_size: int) -> list[list[str]] | None:

As per coding guidelines: Use type hints for function arguments and return types; Use built-in generics (list, dict, tuple) instead of typing equivalents; Use 'T | None' for nullable types instead of 'Optional[T]'.

Comment on lines +142 to +151
cfg = pretrain_config()
cfg.mixed_precision = precision_config

# Apply model-specific settings that were previously passed as constructor args
cfg.model.pipeline_model_parallel_size = base_cfg.pipeline_model_parallel_size
cfg.model.virtual_pipeline_model_parallel_size = base_cfg.virtual_pipeline_model_parallel_size
cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend
# Recompute layout based on updated PP/VP sizes
set_deepseek_v3_pipeline_model_parallel_layout(cfg.model)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, run the provided script to check pp_layout usage
rg -n "pp_layout" scripts/performance/ --glob "*.yaml" --glob "*.json" -A2 -B2 | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Also check Python files for pp_layout references
rg -n "pp_layout" scripts/performance/ --glob "*.py" -A2 -B2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2391


🏁 Script executed:

# Find gb300 and gb200 config files to compare with b300
fd -e py "gb300|gb200|b300" scripts/performance/configs/deepseek/

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Look at the actual deepseek_llm_pretrain.py file around lines 142-151
cat -n scripts/performance/configs/deepseek/deepseek_llm_pretrain.py | sed -n '130,160p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1373


🏁 Script executed:

# Look at the broader context to identify which functions have the conditional logic
cat -n scripts/performance/configs/deepseek/deepseek_llm_pretrain.py | sed -n '50,120p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3317


🏁 Script executed:

# Also check the b300 base config to verify pp_layout is set there
cat -n scripts/performance/configs/deepseek/deepseek_workload_base_configs.py | sed -n '40,55p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 698


B300 config should respect base_cfg.pp_layout like GB300/GB200 variants.

The b300 function unconditionally calls set_deepseek_v3_pipeline_model_parallel_layout() without checking base_cfg.pp_layout first, unlike the gb300 and gb200 variants. Since the B300 base config explicitly sets pp_layout="Et*4|(t*4|)*14tmL", it is being silently ignored. Update b300 to use the same conditional pattern as gb300/gb200: check if base_cfg.pp_layout exists, use it if present, otherwise compute the layout.
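The conditional pattern used by the gb300/gb200 variants can be sketched as follows; apply_pp_layout is a hypothetical helper for illustration, since the real config functions inline this logic:

```python
from types import SimpleNamespace

def set_deepseek_v3_pipeline_model_parallel_layout(model):
    """Stand-in: compute a default layout from current parallelism settings."""
    model.pipeline_model_parallel_layout = "computed-default"

def apply_pp_layout(cfg, base_cfg):
    # Respect an explicit layout from the base workload config; only
    # recompute when none was provided (the gb300/gb200 pattern).
    if getattr(base_cfg, "pp_layout", None):
        cfg.model.pipeline_model_parallel_layout = base_cfg.pp_layout
    else:
        set_deepseek_v3_pipeline_model_parallel_layout(cfg.model)

cfg = SimpleNamespace(model=SimpleNamespace(pipeline_model_parallel_layout=None))
base_cfg = SimpleNamespace(pp_layout="Et*4|(t*4|)*14tmL")
apply_pp_layout(cfg, base_cfg)
```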


Comment on lines +128 to 132
def gemma3_1b_pretrain_config() -> ConfigContainer:
"""Return a pre-training config for Gemma3 1B.

See `_gemma3_common` for the full list of parameters.
Default parallelism: TP=1, PP=1, seq_length=32K
"""

⚠️ Potential issue | 🟡 Minor

Add a Google-style Returns section to the pretrain config docstring.

This keeps the recipe docs consistent with the rest of the module’s Google-style docstrings.

Proposed docstring update
 def gemma3_1b_pretrain_config() -> ConfigContainer:
     """Return a pre-training config for Gemma3 1B.
 
     Default parallelism: TP=1, PP=1, seq_length=32K
+
+    Returns:
+        ConfigContainer: Pre-training configuration for Gemma3 1B.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.

Comment on lines +132 to +248

Removed (old parameterized helper):

def _gpt_oss_common(
    hf_path: str,
    dir: Optional[str] = None,
    name: str = "default",
    # Dataset configuration
    data_paths: Optional[List[str]] = None,
    data_args_path: Optional[str] = None,
    train_data_path: Optional[List[str]] = None,
    valid_data_path: Optional[List[str]] = None,
    test_data_path: Optional[List[str]] = None,
    per_split_data_args_path: Optional[str] = None,
    mock: bool = False,
    # Dataset override option
    dataset: Optional[Union[GPTDatasetConfig, FinetuningDatasetConfig, DatasetProvider]] = None,
    # Model configuration
    num_layers: int = None,  # for ci testing
    tensor_model_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
    pipeline_dtype: Optional[torch.dtype] = None,
    virtual_pipeline_model_parallel_size: Optional[int] = None,
    context_parallel_size: int = 1,
    expert_model_parallel_size: int = 1,
    sequence_parallel: bool = False,
    use_megatron_fsdp: bool = False,
    account_for_embedding_in_pipeline_split: bool = False,
    account_for_loss_in_pipeline_split: bool = False,
    cp_comm_type: Optional[str] = None,
    # Training hyperparameters
    train_iters: int = 1000000,
    global_batch_size: int = 512,
    micro_batch_size: int = 1,
    seq_length: int = 4096,
    lr: float = 3e-4,
    min_lr: float = 3e-5,
    lr_warmup_iters: int = 2000,
    lr_decay_iters: Optional[int] = None,
    eval_interval: int = 2000,
    save_interval: int = 500,
    use_null_tokenizer: bool = True,
    # Precision recipe
    precision_config: Optional[Union[MixedPrecisionConfig, str]] = "bf16_mixed",
    comm_overlap_config: Optional[CommOverlapConfig] = None,
    # Checkpointing
    pretrained_checkpoint: Optional[str] = None,
) -> ConfigContainer:
    """
    Create a pre-training configuration for GPT-OSS family models using a given HuggingFace path.
    Mirrors the structure used in llama recipes for consistency.
    Recommended parallelism: TP=2, PP=4, EP=4
    """

Added (parameterless replacement):

def gpt_oss_20b_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for GPT-OSS 20B variant."""
cfg = _pretrain_common()

# Model config
cfg.model = AutoBridge.from_hf_pretrained("openai/gpt-oss-20b").to_megatron_provider(load_weights=False)

# Tokenizer - uses NullTokenizer by default
cfg.tokenizer.tokenizer_type = "NullTokenizer"
cfg.tokenizer.tokenizer_model = None
cfg.tokenizer.vocab_size = DEFAULT_NULL_TOKENIZER_VOCAB_SIZE

# Dataset config - mock data by default
cfg.dataset.blend = None # Pass the path to the dataset here if not using mock data, along with weight. Ex: (["path/to/data1"], 0.2), [("path/to/data2", 0.8)]
cfg.dataset.seq_length = 4096
cfg.dataset.num_workers = 8

# Parallelism settings
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 4
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None
cfg.model.virtual_pipeline_model_parallel_size = None
cfg.model.context_parallel_size = 1
cfg.model.expert_model_parallel_size = 4
cfg.model.expert_tensor_parallel_size = 1
cfg.model.sequence_parallel = True
cfg.model.seq_length = 4096

# Pipeline split settings
cfg.model.account_for_embedding_in_pipeline_split = False
cfg.model.account_for_loss_in_pipeline_split = False

if cfg.model.context_parallel_size > 1:
cfg.model.calculate_per_token_loss = True
cfg.model.cp_comm_type = "a2a" # only a2a cp is supported for sink attention.

# MoE Token Dispatcher settings
cfg.model.moe_token_dispatcher_type = "alltoall" # Default
cfg.model.moe_flex_dispatcher_backend = "deepep" # Options: None, deepep, hybridep
cfg.model.moe_hybridep_num_sms = 16 # Number of SMs for hybridep backend

# Training config
cfg.train.train_iters = 1000000
cfg.train.global_batch_size = 512
cfg.train.micro_batch_size = 1
cfg.train.eval_interval = 2000
cfg.train.manual_gc = True
cfg.train.manual_gc_interval = 100

# Scheduler config
cfg.scheduler.lr_warmup_iters = 2000

# TE (Transformer Engine)
cfg.model.transformer_impl = "transformer_engine"

# CUDA Graph
cfg.model.cuda_graph_impl = "none"
cfg.model.cuda_graph_scope = "full"
cfg.model.cuda_graph_warmup_steps = 3

# Kernel selections
cfg.model.attention_backend = None
cfg.model.moe_router_fusion = False
cfg.model.moe_permute_fusion = True
cfg.model.moe_grouped_gemm = True
cfg.model.cross_entropy_loss_fusion = True
cfg.model.cross_entropy_fusion_impl = "native"

# Memory saving (recompute & offloading)
cfg.model.recompute_granularity = None
cfg.model.recompute_modules = None
cfg.model.fine_grained_activation_offloading = False
cfg.model.offload_modules = None

# Mixed precision - uses "bf16_mixed" from _pretrain_common
# FP8 settings (commented - enable if using FP8)
# cfg.mixed_precision.fp8_recipe = "tensorwise"
# cfg.mixed_precision.fp8 = None
# cfg.mixed_precision.fp8_param_gather = False
# cfg.mixed_precision.reuse_grad_buf_for_mxfp8_param_ag = False
cfg.model.moe_router_padding_for_fp8 = False # Pad router for FP8 alignment

# Optimizer precision settings
cfg.optimizer.use_precision_aware_optimizer = False
cfg.optimizer.main_grads_dtype = torch.float32
cfg.optimizer.main_params_dtype = torch.float32
cfg.optimizer.exp_avg_dtype = torch.float32
cfg.optimizer.exp_avg_sq_dtype = torch.float32

# Communication overlap (default None, can pass CommOverlapConfig for advanced overlap)
# cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=False) # Uncomment to enable
# cfg.comm_overlap.delay_wgrad_compute = False
# cfg.comm_overlap.overlap_moe_expert_parallel_comm = False
cfg.model.moe_shared_expert_overlap = False # GPT-OSS default

# Checkpoint config
# cfg.checkpoint.save = "path/to/save"
# cfg.checkpoint.load = "path/to/load"

# DDP config (matches _pretrain_common)
cfg.ddp.overlap_grad_reduce = True
cfg.ddp.overlap_param_gather = True
cfg.ddp.check_for_nan_in_grad = True
cfg.ddp.use_distributed_optimizer = True
cfg.ddp.use_megatron_fsdp = False
cfg.ddp.grad_reduce_in_fp32 = True
cfg.ddp.average_in_collective = cfg.model.context_parallel_size == 1
cfg.ddp.data_parallel_sharding_strategy = "no_shard"

# MoE Force Load Balancing
cfg.model.moe_router_force_load_balancing = False

base_output_dir = dir if dir is not None else os.path.join(os.getcwd(), "nemo_experiments")
run_output_dir = os.path.join(base_output_dir, name)
checkpoint_dir = os.path.join(run_output_dir, "checkpoints")
tensorboard_dir = os.path.join(run_output_dir, "tb_logs")

bridge = AutoBridge.from_hf_pretrained(hf_path)
model_cfg = bridge.to_megatron_provider(load_weights=False)
if num_layers is not None:
model_cfg.num_layers = num_layers
model_cfg.tensor_model_parallel_size = tensor_model_parallel_size
model_cfg.pipeline_model_parallel_size = pipeline_model_parallel_size
model_cfg.pipeline_dtype = pipeline_dtype
model_cfg.virtual_pipeline_model_parallel_size = virtual_pipeline_model_parallel_size
model_cfg.context_parallel_size = context_parallel_size
model_cfg.expert_model_parallel_size = expert_model_parallel_size
model_cfg.expert_tensor_parallel_size = 1
model_cfg.sequence_parallel = sequence_parallel
model_cfg.seq_length = seq_length
return cfg
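Because the recipe functions no longer accept `dir`/`name` kwargs, the performance scripts set output paths after instantiation instead. A minimal sketch of that pattern, assuming a config object with `checkpoint` and `logger` sections (the `apply_output_paths` helper and the `SimpleNamespace` stand-in below are illustrative, not the real `ConfigContainer`):

```python
import os
from types import SimpleNamespace

def apply_output_paths(cfg, base_dir, name):
    """Set checkpoint and logging paths after instantiation, mirroring
    what the updated get_library_recipe() does for parameterless recipes."""
    run_dir = os.path.join(base_dir, name)
    checkpoint_dir = os.path.join(run_dir, "checkpoints")
    cfg.checkpoint.save = checkpoint_dir
    cfg.checkpoint.load = checkpoint_dir
    cfg.logger.tensorboard_dir = os.path.join(run_dir, "tb_logs")
    return cfg

# Stand-in object; in the real scripts cfg comes from e.g.
# gpt_oss_20b_pretrain_config() called with no arguments.
cfg = SimpleNamespace(
    checkpoint=SimpleNamespace(save=None, load=None),
    logger=SimpleNamespace(tensorboard_dir=None),
)
cfg = apply_output_paths(cfg, "nemo_experiments", "perf_run")
print(cfg.checkpoint.save)  # nemo_experiments/perf_run/checkpoints
```

The point is only the ordering: instantiate first, then mutate fields, rather than threading kwargs through the recipe call.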

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for GPT-OSS provider and pipeline_dtype handling
rg -n "pipeline_dtype" src/megatron/bridge --type py -A 3 -B 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50383


🏁 Script executed:

# Look for GPTOSS/gpt_oss provider implementation
find src -type f -name "*.py" | xargs grep -l "gpt_oss\|GPTOSS" | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 399


🏁 Script executed:

# Check AutoBridge.from_hf_pretrained and how it sets pipeline_dtype
rg -n "from_hf_pretrained\|to_megatron_provider" src/megatron/bridge --type py -A 5 | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Search for pipeline_dtype defaults in config or model definitions
rg -n "pipeline_dtype\s*=" src/megatron/bridge --type py | head -30

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2898


🏁 Script executed:

# Check GPT-OSS provider for pipeline_dtype handling
cat src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3687


🏁 Script executed:

# Check if AutoBridge.from_hf_pretrained sets pipeline_dtype
rg -n "from_hf_pretrained\|to_megatron_provider" src/megatron/bridge/models --type py -A 10 | head -80

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Find and check GPTModelProvider for pipeline_dtype field
rg -n "class GPTModelProvider" src/megatron/bridge --type py -A 30 | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2876


🏁 Script executed:

# Also search for any pipeline_dtype default in model provider base classes
rg -n "pipeline_dtype" src/megatron/bridge/models --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 714


🏁 Script executed:

# Get more context around the transformer_config.py auto-setting logic
sed -n '80,170p' src/megatron/bridge/models/transformer_config.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3753


🏁 Script executed:

# Check when finalize() is called in the recipe flow
rg -n "finalize" src/megatron/bridge/recipes/gpt_oss --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check if to_megatron_provider calls finalize
rg -n "to_megatron_provider\|finalize" src/megatron/bridge/models --type py -B 3 -A 5 | head -80

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check the AutoBridge class to see how it handles finalize
rg -n "class AutoBridge\|def to_megatron_provider" src/megatron/bridge --type py -A 10 | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Search for AutoBridge definition
fd -n "auto_bridge" src/megatron/bridge --type f

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 301


🏁 Script executed:

# Search more broadly for to_megatron_provider implementation
rg -n "def to_megatron_provider" src/megatron/bridge --type py -A 15

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1778


🏁 Script executed:

# Look at the GPT-OSS recipe to see how config flows after to_megatron_provider
sed -n '130,170p' src/megatron/bridge/recipes/gpt_oss/gpt_oss.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1681


🏁 Script executed:

# Check when finalize() is called on the provider
rg -n "finalize" src/megatron/bridge --type py -B 5 -A 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50383


🏁 Script executed:

# Look at to_megatron_provider implementation more carefully
sed -n '829,900p' src/megatron/bridge/models/conversion/auto_bridge.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3717


Set explicit pipeline_dtype for PP=4 to match codebase convention.

With pipeline_model_parallel_size=4, set pipeline_dtype=torch.bfloat16 explicitly instead of None. While the framework will auto-populate this during finalization (when PP>1), all other recipes in the codebase explicitly set pipeline_dtype when using pipeline parallelism (e.g., qwen3_moe, olmoe_7b, kimi_k2). This explicit assignment aligns with the "Required for PP > 1" pattern documented elsewhere and avoids relying on implicit finalization behavior.

Recommended fix
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16
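The convention the review describes can be captured as a small guard. This is an illustrative sketch only, not code from the repo, and it uses a string placeholder for `torch.bfloat16` so it runs without torch installed:

```python
BF16 = "torch.bfloat16"  # placeholder for the real torch dtype

def resolve_pipeline_dtype(pp_size, pipeline_dtype=None):
    """When pipeline parallelism is enabled (PP > 1) and no dtype was
    given, pick bf16 explicitly instead of relying on finalize() to
    auto-populate the field."""
    if pp_size > 1 and pipeline_dtype is None:
        return BF16
    return pipeline_dtype

print(resolve_pipeline_dtype(4))  # torch.bfloat16
print(resolve_pipeline_dtype(1))  # None
```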
🧰 Tools
🪛 Ruff (0.14.14)

[error] 173-173: Possible hardcoded password assigned to: "moe_token_dispatcher_type"

(S105)

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py` around lines 132 - 248, In
gpt_oss_20b_pretrain_config, explicitly set cfg.model.pipeline_dtype to
torch.bfloat16 when using pipeline_model_parallel_size=4 (replace the current
None); update the assignment near the "Parallelism settings" block so
cfg.model.pipeline_model_parallel_size remains 4 and cfg.model.pipeline_dtype =
torch.bfloat16 to follow the codebase convention and avoid relying on
finalization auto-population.

Comment on lines +251 to 356
def gpt_oss_120b_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for GPT-OSS 120B variant.

    Recommended parallelism: TP=2, PP=4, EP=16
    """
    cfg = _pretrain_common()

    # Model config
    cfg.model = AutoBridge.from_hf_pretrained("openai/gpt-oss-120b").to_megatron_provider(load_weights=False)

    # Tokenizer - uses NullTokenizer by default
    cfg.tokenizer.tokenizer_type = "NullTokenizer"
    cfg.tokenizer.tokenizer_model = None
    cfg.tokenizer.vocab_size = DEFAULT_NULL_TOKENIZER_VOCAB_SIZE

    # Dataset config - mock data by default
    cfg.dataset.blend = None  # Pass the path to the dataset here if not using mock data, along with weight. Ex: (["path/to/data1"], 0.2), [("path/to/data2", 0.8)]
    cfg.dataset.seq_length = 4096
    cfg.dataset.num_workers = 8

    # Parallelism settings (MoE-specific)
    cfg.model.tensor_model_parallel_size = 2
    cfg.model.pipeline_model_parallel_size = 4
    cfg.model.pipeline_model_parallel_layout = None
    cfg.model.pipeline_dtype = None
    cfg.model.virtual_pipeline_model_parallel_size = None
    cfg.model.context_parallel_size = 1
    cfg.model.expert_model_parallel_size = 16  # Larger EP for 120B
    cfg.model.expert_tensor_parallel_size = 1
    cfg.model.sequence_parallel = True
    cfg.model.seq_length = 4096

    # Pipeline split settings
    cfg.model.account_for_embedding_in_pipeline_split = False
    cfg.model.account_for_loss_in_pipeline_split = False
    if cfg.model.context_parallel_size > 1:
        cfg.model.calculate_per_token_loss = True
        cfg.model.cp_comm_type = "a2a"  # only a2a cp is supported for sink attention.

    # MoE Token Dispatcher settings
    cfg.model.moe_token_dispatcher_type = "alltoall"
    cfg.model.moe_flex_dispatcher_backend = "deepep"
    cfg.model.moe_hybridep_num_sms = 16

    # Training config (DIFFERENT from _pretrain_common)
    cfg.train.train_iters = 1000000
    cfg.train.global_batch_size = 512
    cfg.train.micro_batch_size = 1
    cfg.train.eval_interval = 2000
    cfg.train.manual_gc = True
    cfg.train.manual_gc_interval = 100

    # Scheduler config
    cfg.scheduler.lr_warmup_iters = 2000

    # TE (Transformer Engine)
    cfg.model.transformer_impl = "transformer_engine"

    # CUDA Graph
    cfg.model.cuda_graph_impl = "none"
    cfg.model.cuda_graph_scope = "full"
    cfg.model.cuda_graph_warmup_steps = 3

    # Kernel selections
    cfg.model.attention_backend = None
    cfg.model.moe_router_fusion = False
    cfg.model.moe_permute_fusion = True
    cfg.model.moe_grouped_gemm = True
    cfg.model.cross_entropy_loss_fusion = True
    cfg.model.cross_entropy_fusion_impl = "native"  # GPT-OSS uses native

    # Memory saving
    cfg.model.recompute_granularity = None
    cfg.model.recompute_modules = None
    cfg.model.fine_grained_activation_offloading = False
    cfg.model.offload_modules = None

    # Mixed precision
    cfg.model.moe_router_padding_for_fp8 = False

    # Optimizer precision settings
    cfg.optimizer.use_precision_aware_optimizer = False
    cfg.optimizer.main_grads_dtype = torch.float32
    cfg.optimizer.main_params_dtype = torch.float32
    cfg.optimizer.exp_avg_dtype = torch.float32
    cfg.optimizer.exp_avg_sq_dtype = torch.float32

    # Communication overlap (default None, can pass CommOverlapConfig for advanced overlap)
    # cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=False)  # Uncomment to enable
    # cfg.comm_overlap.delay_wgrad_compute = False
    # cfg.comm_overlap.overlap_moe_expert_parallel_comm = False
    cfg.model.moe_shared_expert_overlap = False

    # DDP config
    cfg.ddp.overlap_grad_reduce = True
    cfg.ddp.overlap_param_gather = True
    cfg.ddp.check_for_nan_in_grad = True
    cfg.ddp.use_distributed_optimizer = True
    cfg.ddp.use_megatron_fsdp = False
    cfg.ddp.grad_reduce_in_fp32 = True
    cfg.ddp.average_in_collective = True
    cfg.ddp.data_parallel_sharding_strategy = "no_shard"

    # MoE Force Load Balancing
    cfg.model.moe_router_force_load_balancing = False

    return cfg


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for GPT-OSS provider/config defaults for pipeline_dtype
rg -n "gpt_oss|gpt-oss|GPTOSS" src/megatron/bridge -i -A 5 -B 5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 34016


🏁 Script executed:

#!/bin/bash
# Search for pipeline_dtype references and defaults
rg -n "pipeline_dtype" src/megatron/bridge -A 3 -B 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50383


🏁 Script executed:

#!/bin/bash
# Look for the AutoBridge.from_hf_pretrained implementation
rg -n "from_hf_pretrained" src/megatron/bridge -A 10 -B 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50384


🏁 Script executed:

#!/bin/bash
# Search for openai/gpt-oss references
rg -n "openai/gpt-oss" . -A 10 -B 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 14084


Set pipeline_dtype to torch.bfloat16 when pipeline_model_parallel_size > 1.

For PP=4, pipeline_dtype should not be None. Across the codebase (Qwen, Llama, OlMoE, Moonlight recipes), when pipeline parallelism is enabled, pipeline_dtype is explicitly set to torch.bfloat16. This prevents undefined behavior during pipeline communication and gradient computation. The same issue applies to the 20B pretrain config.

💡 Proposed fix
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def gpt_oss_120b_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for GPT-OSS 120B variant.

    Recommended parallelism: TP=2, PP=4, EP=16
    """
    cfg = _pretrain_common()

    # Model config
    cfg.model = AutoBridge.from_hf_pretrained("openai/gpt-oss-120b").to_megatron_provider(load_weights=False)

    # Tokenizer - uses NullTokenizer by default
    cfg.tokenizer.tokenizer_type = "NullTokenizer"
    cfg.tokenizer.tokenizer_model = None
    cfg.tokenizer.vocab_size = DEFAULT_NULL_TOKENIZER_VOCAB_SIZE

    # Dataset config - mock data by default
    cfg.dataset.blend = None  # Pass the path to the dataset here if not using mock data, along with weight. Ex: (["path/to/data1"], 0.2), [("path/to/data2", 0.8)]
    cfg.dataset.seq_length = 4096
    cfg.dataset.num_workers = 8

    # Parallelism settings (MoE-specific)
    cfg.model.tensor_model_parallel_size = 2
    cfg.model.pipeline_model_parallel_size = 4
    cfg.model.pipeline_model_parallel_layout = None
    cfg.model.pipeline_dtype = torch.bfloat16
    cfg.model.virtual_pipeline_model_parallel_size = None
    cfg.model.context_parallel_size = 1
    cfg.model.expert_model_parallel_size = 16  # Larger EP for 120B
    cfg.model.expert_tensor_parallel_size = 1
    cfg.model.sequence_parallel = True
    cfg.model.seq_length = 4096

    # Pipeline split settings
    cfg.model.account_for_embedding_in_pipeline_split = False
    cfg.model.account_for_loss_in_pipeline_split = False
    if cfg.model.context_parallel_size > 1:
        cfg.model.calculate_per_token_loss = True
        cfg.model.cp_comm_type = "a2a"  # only a2a cp is supported for sink attention.

    # MoE Token Dispatcher settings
    cfg.model.moe_token_dispatcher_type = "alltoall"
    cfg.model.moe_flex_dispatcher_backend = "deepep"
    cfg.model.moe_hybridep_num_sms = 16

    # Training config (DIFFERENT from _pretrain_common)
    cfg.train.train_iters = 1000000
    cfg.train.global_batch_size = 512
    cfg.train.micro_batch_size = 1
    cfg.train.eval_interval = 2000
    cfg.train.manual_gc = True
    cfg.train.manual_gc_interval = 100

    # Scheduler config
    cfg.scheduler.lr_warmup_iters = 2000

    # TE (Transformer Engine)
    cfg.model.transformer_impl = "transformer_engine"

    # CUDA Graph
    cfg.model.cuda_graph_impl = "none"
    cfg.model.cuda_graph_scope = "full"
    cfg.model.cuda_graph_warmup_steps = 3

    # Kernel selections
    cfg.model.attention_backend = None
    cfg.model.moe_router_fusion = False
    cfg.model.moe_permute_fusion = True
    cfg.model.moe_grouped_gemm = True
    cfg.model.cross_entropy_loss_fusion = True
    cfg.model.cross_entropy_fusion_impl = "native"  # GPT-OSS uses native

    # Memory saving
    cfg.model.recompute_granularity = None
    cfg.model.recompute_modules = None
    cfg.model.fine_grained_activation_offloading = False
    cfg.model.offload_modules = None

    # Mixed precision
    cfg.model.moe_router_padding_for_fp8 = False

    # Optimizer precision settings
    cfg.optimizer.use_precision_aware_optimizer = False
    cfg.optimizer.main_grads_dtype = torch.float32
    cfg.optimizer.main_params_dtype = torch.float32
    cfg.optimizer.exp_avg_dtype = torch.float32
    cfg.optimizer.exp_avg_sq_dtype = torch.float32

    # Communication overlap (default None, can pass CommOverlapConfig for advanced overlap)
    # cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=False)  # Uncomment to enable
    # cfg.comm_overlap.delay_wgrad_compute = False
    # cfg.comm_overlap.overlap_moe_expert_parallel_comm = False
    cfg.model.moe_shared_expert_overlap = False

    # DDP config
    cfg.ddp.overlap_grad_reduce = True
    cfg.ddp.overlap_param_gather = True
    cfg.ddp.check_for_nan_in_grad = True
    cfg.ddp.use_distributed_optimizer = True
    cfg.ddp.use_megatron_fsdp = False
    cfg.ddp.grad_reduce_in_fp32 = True
    cfg.ddp.average_in_collective = True
    cfg.ddp.data_parallel_sharding_strategy = "no_shard"

    # MoE Force Load Balancing
    cfg.model.moe_router_force_load_balancing = False
🧰 Tools
🪛 Ruff (0.14.14)

[error] 291-291: Possible hardcoded password assigned to: "moe_token_dispatcher_type"

(S105)

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py` around lines 251 - 356, The
gpt_oss_120b_pretrain_config function leaves cfg.model.pipeline_dtype as None
even though cfg.model.pipeline_model_parallel_size is 4; update
gpt_oss_120b_pretrain_config to set cfg.model.pipeline_dtype = torch.bfloat16
whenever cfg.model.pipeline_model_parallel_size > 1 (mirror the pattern used in
other recipes), and make the same change in the corresponding 20B pretrain
config function so pipeline parallelism uses torch.bfloat16 for safe pipeline
communication and gradient computation.

Comment on lines +91 to +97
def nemotron_nano_9b_v2_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for Nemotron Nano 9B v2.

    This recipe is designed for single-node training (1 node).
    Default parallelism: TP=2, PP=1, SP=True.
    """


def nemotron_nano_12b_v2_pretrain_config() -> ConfigContainer:
    cfg = _pretrain_common()

⚠️ Potential issue | 🟡 Minor

Add Google-style Returns sections to the Nano v2 pretrain config docstrings.

This keeps the public recipe docs consistent and Sphinx-parseable.

Example update (apply to both pretrain configs)
 def nemotron_nano_9b_v2_pretrain_config() -> ConfigContainer:
     """Return a pre-training config for Nemotron Nano 9B v2.
 
     This recipe is designed for single-node training (1 node).
     Default parallelism: TP=2, PP=1, SP=True.
+
+    Returns:
+        ConfigContainer: Pre-training configuration for Nemotron Nano 9B v2.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.

Also applies to: 192-199

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py` around lines 91 -
97, Update the docstrings for the Nano v2 pretrain config functions (e.g.,
nemotron_nano_9b_v2_pretrain_config and the other pretrain config around lines
192-199) to use Google-style docstrings by adding a "Returns" section that
documents the return type and purpose (e.g., "Returns: ConfigContainer:
pre-training configuration for Nemotron Nano 9B v2" or similar), ensuring the
section is Sphinx/napoleon-parsable and consistent with the existing top-level
description.

Comment on lines +102 to 107
def nemotronh_4b_pretrain_config() -> ConfigContainer:
    """Return a pre-training config for NemotronH 4B.

    This recipe is designed for single-node training (1 node).
    Default parallelism: TP=1, PP=1, SP=False.
    """

⚠️ Potential issue | 🟡 Minor

Add Google-style Returns sections to NemotronH pretrain config docstrings.

Each pretrain config docstring omits the Returns: block. Please add it consistently across these functions.

Example update (apply to all pretrain configs)
 def nemotronh_4b_pretrain_config() -> ConfigContainer:
     """Return a pre-training config for NemotronH 4B.
 
     This recipe is designed for single-node training (1 node).
     Default parallelism: TP=1, PP=1, SP=False.
+
+    Returns:
+        ConfigContainer: Pre-training configuration for NemotronH 4B.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.

Also applies to: 203-208, 304-311, 407-414

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py` around lines 102 - 107,
Add a Google-style "Returns" section to the docstring of
nemotronh_4b_pretrain_config and the other pretrain config functions in this
file (the ones noted in the review). Specifically, update each function
docstring to include a "Returns:" block that states the return type and brief
description (e.g., "Returns: ConfigContainer: A pre-training configuration for
NemotronH 4B."), following Google docstring formatting so Sphinx can parse it.

Comment on lines 41 to 65
def _safe_overrides_for(name: str) -> dict:
    """Return overrides for recipe functions.

    Pretrain configs use the new parameterless API (return empty dict).
    Finetune configs still accept parameters.
    """
    is_finetune = "finetune" in name.lower()

    if is_finetune:
        # Finetuning-specific overrides - finetune configs still accept parameters
        overrides = {
            "name": f"unit_{name}",
            "dir": ".",
            "train_iters": 10,
            "micro_batch_size": 1,
            "seq_length": 64,
            "min_lr": 1e-5,
            "lr_warmup_iters": 2,
            "global_batch_size": 2,
            "finetune_lr": 1e-4,
        }
    else:
        # Pretrain configs use the new parameterless API
        overrides = {}


⚠️ Potential issue | 🟡 Minor

Make _safe_overrides_for docstring Google-style (Args/Returns).

This keeps helper docs consistent and Sphinx-parseable.

Proposed docstring update
 def _safe_overrides_for(name: str) -> dict:
     """Return overrides for recipe functions.
 
     Pretrain configs use the new parameterless API (return empty dict).
     Finetune configs still accept parameters.
+
+    Args:
+        name: Recipe function name.
+
+    Returns:
+        dict: Overrides to pass into the recipe function.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.
🤖 Prompt for AI Agents
In `@tests/unit_tests/recipes/gemma/test_gemma2_recipes.py` around lines 41 - 65,
update the _safe_overrides_for function docstring to Google-style with an "Args"
section describing the name: str parameter and a "Returns" section describing
the returned dict of overrides; keep the existing short description about
pretrain vs finetune behavior and briefly note that finetune configs accept
parameters while pretrain return an empty dict. Ensure the docstring is
triple-quoted and Sphinx-friendly for parsing.

Comment on lines 37 to 62
 def _safe_overrides_for(name: str) -> dict:
-    # Detect if this is a finetune recipe
-    is_finetune = "finetune" in name.lower()
-
-    overrides = {
-        "name": f"unit_{name}",
-        "dir": ".",
-        "train_iters": 10,
-        "global_batch_size": 2,
-        "micro_batch_size": 1,
-        "seq_length": 64,
-        "min_lr": 1e-5,
-        "lr_warmup_iters": 2,
-    }
+    """Return overrides for recipe functions.
+
+    Pretrain configs use the new parameterless API (return empty dict).
+    Finetune configs still accept parameters.
+    """
+    is_finetune = "finetune" in name.lower()
 
     if is_finetune:
-        # Finetuning-specific overrides
-        overrides.update(
-            {
-                "finetune_lr": 1e-4,
-                "pretrained_checkpoint": "/fake/checkpoint/path",
-            }
-        )
-        # Note: Finetuning recipes set parallelism internally based on PEFT vs full SFT
-        # Note: Finetuning always uses HF tokenizer, never null tokenizer
+        # Finetuning-specific overrides - finetune configs still accept parameters
+        overrides = {
+            "name": f"unit_{name}",
+            "dir": ".",
+            "train_iters": 10,
+            "global_batch_size": 2,
+            "micro_batch_size": 1,
+            "seq_length": 64,
+            "min_lr": 1e-5,
+            "lr_warmup_iters": 2,
+            "finetune_lr": 1e-4,
+            "pretrained_checkpoint": "/fake/checkpoint/path",
+        }
     else:
-        # Pretrain-specific overrides
-        overrides.update(
-            {
-                "mock": True,
-                "lr": 1e-4,
-                "tensor_model_parallel_size": 1,
-                "pipeline_model_parallel_size": 1,
-                "context_parallel_size": 1,
-                "use_null_tokenizer": True,
-            }
-        )
-
-        # Large models/variants may set additional flags in recipes; keep harmless defaults
-        lname = name.lower()
-        if "12b" in lname or "27b" in lname:
-            overrides.update(
-                {
-                    "virtual_pipeline_model_parallel_size": None,
-                    "sequence_parallel": True,
-                }
-            )
+        # Pretrain configs use the new parameterless API
+        overrides = {}


⚠️ Potential issue | 🟡 Minor

Make _safe_overrides_for docstring Google-style (Args/Returns).

This keeps helper docs consistent and Sphinx-parseable.

Proposed docstring update
 def _safe_overrides_for(name: str) -> dict:
     """Return overrides for recipe functions.
 
     Pretrain configs use the new parameterless API (return empty dict).
     Finetune configs still accept parameters.
+
+    Args:
+        name: Recipe function name.
+
+    Returns:
+        dict: Overrides to pass into the recipe function.
     """
As per coding guidelines: Use Google style docstrings (parseable by Sphinx) for classes and functions.
🤖 Prompt for AI Agents
In `@tests/unit_tests/recipes/test_gemma3_recipes.py` around lines 37 - 62, update
the _safe_overrides_for function docstring to Google style: add an "Args"
section describing the name: str parameter and a "Returns" section describing
the returned dict of overrides; keep the existing brief description about
pretrain vs finetune behavior and ensure the docstring remains
Sphinx/Google-parseable (triple-quoted) for the function _safe_overrides_for.

# Conflicts:
#	src/megatron/bridge/recipes/deepseek/deepseek_v2.py
#	src/megatron/bridge/recipes/deepseek/deepseek_v3.py
#	src/megatron/bridge/recipes/llama/llama2.py
#	src/megatron/bridge/recipes/moonlight/moonlight_16b.py
#	src/megatron/bridge/recipes/nemotronh/nemotronh.py
@yaoyu-33
Contributor Author

yaoyu-33 commented Feb 5, 2026

/ok to test 718dee4

@yaoyu-33 yaoyu-33 merged commit b19588a into main Feb 5, 2026
50 checks passed
@yaoyu-33 yaoyu-33 deleted the replay/0909f9fd branch February 5, 2026 20:53
rhmukundan pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
rhmukundan pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
sowmen pushed a commit to sowmen/Megatron-Bridge that referenced this pull request Feb 11, 2026
ko3n1g pushed a commit that referenced this pull request Feb 24, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g mentioned this pull request Feb 24, 2026
5 tasks

2 participants