
260201: Cherrypick various changes #2509

Merged
ko3n1g merged 31 commits into r0.3.0 from ko3n1g/cp/260201 on Mar 3, 2026
Conversation


@ko3n1g ko3n1g commented Feb 24, 2026

What does this PR do ?

Includes:

Does not include:

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added comprehensive HuggingFace ↔ Megatron configuration translation with activation function mapping support.
    • Introduced Kimi K2 and Nemotron 3 Nano model support with dedicated pretraining configurations.
    • Enhanced model bridge registration with explicit provider and model type metadata.
  • Improvements

    • Simplified recipe APIs to parameterless functions for cleaner configuration.
    • Added rope scaling support for improved model compatibility.
    • Improved optimizer defaults (Adam epsilon adjusted for better numerical stability).
    • Improved performance for SFT.
    • Improved performance for Qwen 30B_a3B on B200.
  • Bug Fixes

    • Removed incompatible generation_config propagation from providers.
    • Fixed configuration overrides for DeepSeek and Qwen models.

… in base class (#2052)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…config logging (#2184)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
… in base class for Nemotron models (#2225)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…retrain recipes (#2288)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: malay-nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@coderabbitai

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

Walkthrough

This PR refactors Megatron Bridge recipes and model providers to use parameterless APIs with post-instantiation configuration mutation. Major changes include centralizing HF↔Megatron config translation on MegatronModelBridge, updating all model bridges to use super().provider_bridge() with in-place provider configuration, removing generation_config propagation, adding rope_scaling support, introducing new Kimi K2 and Nemotron 3 Nano configurations, and updating optimizer epsilon defaults from 1e-5 to 1e-8.

Changes

Cohort / File(s) Summary
Model Bridge Core Architecture
src/megatron/bridge/models/conversion/model_bridge.py
Introduces comprehensive HF↔Megatron config translation infrastructure: adds CONFIG_MAPPING, YARN_ROPE_SCALING_MAPPING, MLA_ROPE_SCALING_MAPPING, ACTIVATION_MAPPING; implements hf_config_to_provider_kwargs and megatron_to_hf_config methods; adds PROVIDER_CLASS, SOURCE_NAME, MODEL_TYPE attributes; extends register_bridge and create_bridge_decorator to accept provider and model_type parameters.
Individual Model Bridges
src/megatron/bridge/models/.../...bridge.py
Updates ~15 bridge files (Gemma, Llama, DeepSeek, Qwen, GLM, Nemotron, NemotronH, OlMoE, MistralMLA, GPT-OSS) to: register with provider and model_type; replace explicit provider construction with super().provider_bridge() + in-place attribute assignment; change return types to GPTModelProvider or MLAModelProvider where applicable.
Model Providers
src/megatron/bridge/models/...provider.py
Updates provider classes (GPTModelProvider adds rope_scaling/rope_scaling_factor/rotary_scaling_factor; DeepSeekProvider shifts to MLAModelProvider; Llama providers remove manual RoPE scaling; removes generation_config from multiple providers); introduces new MLAModelProvider.
Recipe Common Base
src/megatron/bridge/recipes/common.py
Adds new _pretrain_common() helper that returns a ConfigContainer with comprehensive default training configurations (optimizer, scheduler, DDP, logger, etc.), serving as a shared base for all pretrain recipes.
Recipe Implementations
src/megatron/bridge/recipes/.../...py (~15 files)
Replaces parameterized pretrain config functions with parameterless factories that use _pretrain_common() and explicitly assign model, tokenizer, dataset, parallelism, MoE, training, and precision fields post-instantiation; removes TypedDict-based kwargs and internal helper functions.
Performance Configuration Scripts
scripts/performance/configs/.../...llm_pretrain.py
Updates config builders (Llama, DeepSeek, GPT-OSS, Qwen, Nemotron, Kimi) to call pretrain_config() without arguments and then assign cfg.mixed_precision, cfg.model fields via post-creation mutation instead of constructor parameters.
Kimi K2 New Configurations
scripts/performance/configs/kimi/*
Adds new Kimi K2 performance preset module with __init__.py (conditional exports), kimi_llm_pretrain.py (5 GPU-variant pretrain functions), and kimi_workload_base_configs.py (base config and 13 preset variants); includes _get_kimi_k2_pipeline_layout helper.
Nemotron 3 Nano New Configurations
scripts/performance/configs/nemotronh/nemotron_3_nano*
Adds Nemotron 3 Nano performance module with __init__.py (conditional exports), nemotron_3_nano_llm_pretrain.py (5 GPU-variant functions), and nemotron_3_nano_workload_base_configs.py (11 preset variants); supports BF16 and FP8 precision variants.
Optimizer & Config Utilities
src/megatron/bridge/recipes/utils/optimizer_utils.py, src/megatron/bridge/training/config.py, src/megatron/bridge/training/setup.py
Updates Adam epsilon default from 1e-5 to 1e-8; adds log_non_default_values() method to TrainingConfig for logging config deviations.
Overrides & Path Handling
scripts/performance/utils/overrides.py, scripts/performance/utils/utils.py
Adds Kimi K2 pipeline layout inference in user overrides; updates get_library_recipe to use parameterless recipe functions and configure output directories post-instantiation.
Example & Test Updates
examples/quantization/pretrain_quantized_llama3_8b.py, tests/functional_tests/...*, tests/unit_tests/...*.py
Updates example to use parameterless pretrain_config() with post-creation HF path injection; removes provider-mapping validation tests; adds perf config integration tests; updates numerous test assertions to reflect new provider types (GPTModelProvider vs model-specific), parameterless APIs, and removal of generation_config expectations.
Removed Test Modules
tests/functional_tests/models/.../test_*_provider.py (7 files)
Removes provider equivalence tests that validated bridge-generated configs against predefined provider classes for DeepSeek, Gemma, GLM, Llama Nemotron, NemotronH, and Qwen variants.
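The `log_non_default_values()` helper mentioned in the table can be approximated with dataclass introspection. This is a sketch under assumptions: the method name comes from the summary above, but the real implementation logs deviations rather than returning them, and handles nested configs:

```python
from dataclasses import dataclass, fields

# Hypothetical sketch of a log_non_default_values()-style helper; the
# real TrainingConfig method logs rather than returns, per the summary.

@dataclass
class TrainingConfig:
    train_iters: int = 1000
    global_batch_size: int = 32
    adam_eps: float = 1e-8

    def non_default_values(self) -> dict[str, tuple[object, object]]:
        """Return {field: (default, current)} for every deviating field."""
        diffs = {}
        for f in fields(self):
            current = getattr(self, f.name)
            if current != f.default:
                diffs[f.name] = (f.default, current)
        return diffs

cfg = TrainingConfig(train_iters=5000)
print(cfg.non_default_values())  # only train_iters deviates from defaults
```

Logging only the deltas keeps run logs short while still making post-instantiation mutations (the new recipe style) auditable.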

Sequence Diagrams

sequenceDiagram
    participant Client as Config Builder
    participant Recipe as pretrain_config()
    participant Common as _pretrain_common()
    participant HFBridge as AutoBridge
    participant Provider as GPTModelProvider
    participant Config as ConfigContainer

    Client->>Recipe: Call pretrain_config() [no args]
    Recipe->>Common: cfg = _pretrain_common()
    Common->>Config: Create ConfigContainer w/ defaults
    Common-->>Recipe: Return cfg
    Recipe->>HFBridge: Load HF model (if hf_path provided)
    HFBridge->>Provider: Convert to Megatron provider
    Provider-->>Recipe: Return provider
    Recipe->>Config: cfg.model = provider
    Recipe->>Config: cfg.mixed_precision = precision_config
    Recipe->>Config: cfg.dataset = {...}
    Recipe->>Config: cfg.train = {...}
    Recipe-->>Client: Return fully configured cfg
sequenceDiagram
    participant HFModel as HF Pretrained
    participant Bridge as Model Bridge
    participant SuperClass as MegatronModelBridge
    participant Provider as GPTModelProvider
    participant Translator as Config Translator

    HFModel->>Bridge: Load model & config
    Bridge->>SuperClass: Call super().provider_bridge()
    SuperClass->>Translator: Apply hf_config_to_provider_kwargs()
    Translator->>Translator: Map CONFIG_MAPPING
    Translator->>Translator: Apply ACTIVATION_MAPPING
    Translator->>Translator: Handle rope_scaling variants
    Translator-->>SuperClass: Return provider_kwargs dict
    SuperClass->>Provider: Instantiate with base defaults
    Provider-->>Bridge: Return provider
    Bridge->>Provider: Set rope_scaling = ...
    Bridge->>Provider: Set activation_func = ...
    Bridge->>Provider: Set model-specific fields
    Bridge-->>HFModel: Return configured provider

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #2052: Implements identical provider_bridge refactor with CONFIG_MAPPING/PROVIDER_CLASS/activation mappings and shift to GPTModelProvider/MLAModelProvider with in-place configuration across all bridges.
  • PR #2067: Introduces _pretrain_common helper and converts multiple recipe files to parameterless ConfigContainer-returning factories using the same pattern.
  • PR #2097: Adds Kimi K2 performance config modules (scripts/performance/configs/kimi/*) with identical structure and _get_kimi_k2_pipeline_layout integration.

Suggested labels

r0.3.0, Run CICD, cherry-pick

Suggested reviewers

  • malay-nagda
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes (⚠️ Warning): PR with major breaking API changes (recipe signature refactoring, provider implementations, optimizer epsilon modification) lacks documented test results or validation evidence. Resolution: add test results, CI validation, convergence analysis for the adam_eps change (1e-5 → 1e-8), and regression-testing documentation to the PR description.
✅ Passed checks (3 passed)
  • Title check (✅ Passed): The PR title '260201: Cherrypick various changes' is descriptive and specific about the nature of the work (cherry-picking multiple upstream changes), though it does not enumerate all affected areas.
  • Docstring Coverage (✅ Passed): Docstring coverage is 87.68%, which meets the required threshold of 80.00%.
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (14)
tests/unit_tests/recipes/test_deepseek_recipes.py (1)

87-98: ⚠️ Potential issue | 🟡 Minor

Add a unit category marker for this test.

The updated test should be tagged to satisfy test categorization requirements.

✅ Suggested change
 @pytest.mark.parametrize("recipe_func", _DEEPSEEK_RECIPE_FUNCS)
+@pytest.mark.unit
 def test_each_deepseek_recipe_builds_config(recipe_func: Callable, monkeypatch: pytest.MonkeyPatch):

As per coding guidelines, "Use pytest.mark to categorize tests (unit, integration, system)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/test_deepseek_recipes.py` around lines 87 - 98, This
test needs a pytest unit marker: add `@pytest.mark.unit` above the test function
that contains the code using cfg and _assert_basic_config (the test in
tests/unit_tests/recipes/test_deepseek_recipes.py that asserts tokenizer
behavior), and ensure pytest is imported (import pytest) at the top of the file
if not already present; keep the existing assertions and logic unchanged.
tests/functional_tests/models/deepseek/test_deepseek_conversion.py (5)

96-97: ⚠️ Potential issue | 🟡 Minor

Remove the duplicate save_pretrained call.

model.save_pretrained(model_dir, safe_serialization=True) is called identically on two consecutive lines — the second write is redundant and re-serialises all weights for no benefit.

🐛 Proposed fix
-        model.save_pretrained(model_dir, safe_serialization=True)
         model.save_pretrained(model_dir, safe_serialization=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/deepseek/test_deepseek_conversion.py` around
lines 96 - 97, The test contains a duplicated call to
model.save_pretrained(model_dir, safe_serialization=True): remove the redundant
second invocation so the model is only serialized once; keep the remaining call
to model.save_pretrained with safe_serialization=True to preserve the intended
behavior.


92-93: ⚠️ Potential issue | 🟡 Minor

Narrow the except clause — swallowing all exceptions hides real failures.

Per coding guidelines, the except clause should be limited to the smallest set of expected errors. A bare except Exception: pass will silently absorb OSError, ConnectionError, permission errors, etc., making CI failures hard to diagnose.

♻️ Suggested fix
-        except Exception:
-            pass
+        except OSError:
+            # Non-fatal: fallback tokenizer is optional for conversion tests
+            pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/deepseek/test_deepseek_conversion.py` around
lines 92 - 93, The bare "except Exception: pass" in the test should be narrowed:
replace it with explicit expected exceptions (for example FileNotFoundError,
OSError, ConnectionError or the specific exception your try block can raise) and
handle them (or log them) instead of silently swallowing; for anything truly
unexpected either re-raise or log and fail the test. Locate the try/... except
block that currently uses "except Exception: pass" and change it to "except
<ExpectedError1>, <ExpectedError2> as e: ..." (or catch Exception as e and
re-raise unless isinstance(e, (ExpectedError1, ...))) so only anticipated errors
are suppressed and unexpected failures surface.


69-80: ⚠️ Potential issue | 🟡 Minor

Replace deprecated use_auth_token with token parameter.

The use_auth_token parameter is deprecated in transformers (removed in v5.0.0). Replace with token on line 76:

token=None,

This parameter appears in multiple test files and should be updated consistently across the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/deepseek/test_deepseek_conversion.py` around
lines 69 - 80, Update the deprecated parameter in the
get_class_from_dynamic_module call: replace the removed use_auth_token argument
with the new token argument (i.e., change use_auth_token=None to token=None) in
the invocation of get_class_from_dynamic_module so the call signature matches
transformers v5+; search other test files for additional uses of use_auth_token
and update them similarly to token to keep the codebase consistent.
tests/unit_tests/models/qwen/test_qwen3_moe_bridge.py (1)

118-126: ⚠️ Potential issue | 🟡 Minor

Add pytest.mark category to these unit tests.

Please mark these tests as unit tests to align with repo standards.

✅ Suggested change
+@pytest.mark.unit
 class TestQwen3MoEBridge:

As per coding guidelines: Use 'pytest.mark' to categorize tests (unit, integration, system).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/models/qwen/test_qwen3_moe_bridge.py` around lines 118 -
126, Mark the test function test_provider_bridge_basic with the unit category
using pytest.mark: add `@pytest.mark.unit` above the test definition, ensure
pytest is imported in the test module, and apply the same decoration to other
tests in this file that are unit tests (e.g., any tests exercising
Qwen3MoEBridge.provider_bridge) so they comply with the repo's pytest.mark
categorization.
tests/unit_tests/recipes/utils/test_optimizer_utils.py (1)

15-15: ⚠️ Potential issue | 🟡 Minor

Typo in module docstring: "optinizer_utils" → "optimizer_utils".

✏️ Fix typo
-"""Tests for optinizer_utils module."""
+"""Tests for optimizer_utils module."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/utils/test_optimizer_utils.py` at line 15, Fix the
typo in the module docstring of the test file test_optimizer_utils.py: change
the word "optinizer_utils" to "optimizer_utils" in the top-level docstring so
the module name is spelled correctly.
src/megatron/bridge/recipes/qwen/qwen2.py (1)

16-16: 🛠️ Refactor suggestion | 🟠 Major

Use modern union syntax per coding guidelines.

This file uses Optional[T], Union[X, Y], and List[T] from typing throughout (e.g., Qwen2CommonKwargs, Qwen2FinetuneKwargs, _qwen2_finetune_common signature), while llama3.py in the same PR correctly uses T | None, X | Y, and list[T]. The coding guidelines require the modern syntax.

♻️ Example fix for the import and TypedDict
-from typing import List, Optional, Union
+# No typing import needed for these; use built-in generics and | syntax
 class Qwen2CommonKwargs(TypedDict, total=False):
-    dir: Optional[str]
+    dir: str | None
-    data_paths: Optional[List[str]]
+    data_paths: list[str] | None
-    pipeline_dtype: Optional[torch.dtype]
+    pipeline_dtype: torch.dtype | None
-    precision_config: Optional[Union[MixedPrecisionConfig, str]]
+    precision_config: MixedPrecisionConfig | str | None
     # ... apply same pattern throughout

As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'", "Use 'X | Y' for union types instead of 'Union[X, Y]'", and "Use built-in generics (list, dict, tuple) instead of typing equivalents".

Also applies to: 40-78

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen/qwen2.py` at line 16, Replace usages of
typing generics with modern built-in and PEP 604 union syntax across this
module: remove List, Optional, Union imports and use list[T], T | None, and X |
Y respectively. Update the TypedDicts Qwen2CommonKwargs and Qwen2FinetuneKwargs
and any function signatures such as _qwen2_finetune_common to use list[...]
instead of List[...], T | None instead of Optional[T], and A | B instead of
Union[A, B]; ensure any default None annotations remain compatible. Keep only
necessary imports from typing (e.g., TypedDict) and run a quick typecheck to
confirm no remaining typing.List/Optional/Union usages.
tests/unit_tests/models/llama/test_llama_bridge.py (1)

37-122: ⚠️ Potential issue | 🟡 Minor

Add type hints to fixtures and test methods

The new fixtures/tests omit argument and return type hints. Please annotate fixtures with concrete return types and tests with -> None (and type fixture parameters) to satisfy typing rules; apply this consistently across the module.

💡 Example (apply broadly)
-    def llama_3_2_1b_config_dict(self):
+    def llama_3_2_1b_config_dict(self) -> dict[str, object]:
         """Create a sample Llama 3.2 1B configuration with RoPE scaling."""
         return {
             "architectures": ["LlamaForCausalLM"],
             ...
         }

-    def test_provider_bridge_returns_gpt_provider(self, mock_pretrained_llama_2):
+    def test_provider_bridge_returns_gpt_provider(
+        self, mock_pretrained_llama_2: PreTrainedCausalLM
+    ) -> None:
         """Test that provider_bridge returns a GPTModelProvider for Llama 2."""
         bridge = LlamaBridge()

As per coding guidelines, Use type hints for function arguments and return types.

Also applies to: 123-341, 343-458, 461-515, 535-598, 601-638, 706-855

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/models/llama/test_llama_bridge.py` around lines 37 - 122,
Several fixtures and test functions in this module (e.g.,
llama_3_2_1b_config_dict, llama_2_7b_config_dict, llama_config, llama_2_config,
mock_pretrained_llama, mock_pretrained_llama_2 and the test functions referenced
later) are missing type hints; update each fixture signature to include
parameter types for injected fixtures and an explicit return type (e.g., ->
Dict[str, Any] or -> LlamaConfig or -> Mock[PreTrainedCausalLM] as appropriate),
and annotate all test functions with -> None; ensure imports for typing (Dict,
Any, Optional, Iterator, etc.) are present if needed and apply the same pattern
consistently across the module.
src/megatron/bridge/recipes/kimi/kimi_k2.py (1)

25-46: ⚠️ Potential issue | 🟡 Minor

Fix _get_kimi_k2_pipeline_layout typing and docstring

vp_size is treated as nullable, but the signature doesn’t reflect that, and the return type is missing. Also, the docstring should follow Google style.

💡 Suggested change
-def _get_kimi_k2_pipeline_layout(pp_size: int, vp_size: int):
-    """Get pipeline layout for Kimi-K2 based on PP and VP size."""
+def _get_kimi_k2_pipeline_layout(pp_size: int, vp_size: int | None) -> list[list[str]] | None:
+    """Get pipeline layout for Kimi-K2 based on PP and VP size.
+
+    Args:
+        pp_size: Pipeline parallel size.
+        vp_size: Virtual pipeline size (defaults to 1 when None).
+
+    Returns:
+        Pipeline layout per stage, or None when no layout is required.
+    """

As per coding guidelines, Use type hints for function arguments and return types; Use 'T | None' for nullable types instead of 'Optional[T]'; Use Google style docstrings (parseable by Sphinx) for classes and functions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/kimi/kimi_k2.py` around lines 25 - 46, Update the
_get_kimi_k2_pipeline_layout signature to use explicit nullable typing for
vp_size (use int | None) and add a return type of list[list[str]] | None; update
the docstring to Google style with a brief description, Args (pp_size, vp_size),
and Returns (pipeline layout or None). Keep the existing logic that normalizes
vp_size to 1 when None (vp_size = 1 if vp_size is None else vp_size) and ensure
the map and returned layout types match the annotated return type. Use the 'T |
None' form (not Optional[T]) and ensure the function name
_get_kimi_k2_pipeline_layout is the one you modify.
src/megatron/bridge/models/llama/llama_bridge.py (1)

15-31: ⚠️ Potential issue | 🟡 Minor

Rename module logger to comply with global naming rule

Module-level globals should be upper snake case with a G_ prefix.

Suggested fix
-logger = logging.getLogger(__name__)
+G_LOGGER = logging.getLogger(__name__)

As per coding guidelines: Use upper snake_case and prefix 'G' for global variables (e.g., G_MY_GLOBAL).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/llama/llama_bridge.py` around lines 15 - 31, The
module-level logger variable `logger` violates the global naming rule; rename it
to an upper snake-case global prefixed with G (e.g., `G_LOGGER`) and update all
references in this module (e.g., any use of `logger` in this file) to the new
name; ensure you only change the variable name and not the logging import or
behavior so existing logging calls (info/debug/error) continue to work with
`G_LOGGER`.
src/megatron/bridge/models/deepseek/deepseek_provider.py (1)

16-109: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 unions and built‑in generics in type hints

Replace Optional, Union, and List with T | None, X | Y, and list[T] respectively, and remove unused typing imports.

Suggested fix
-from typing import Callable, List, Optional, Union
+from typing import Callable
@@
-    q_lora_rank: Optional[int] = 1536
+    q_lora_rank: int | None = 1536
@@
-    moe_layer_freq: Union[int, List[int]] = field(default_factory=lambda: [0] + [1] * 59)  # first layer is dense
+    moe_layer_freq: int | list[int] = field(default_factory=lambda: [0] + [1] * 59)  # first layer is dense
@@
-    q_lora_rank: Optional[int] = None
+    q_lora_rank: int | None = None
@@
-    moe_layer_freq: Union[int, List[int]] = field(default_factory=lambda: [0] + [1] * 26)  # first layer is dense
+    moe_layer_freq: int | list[int] = field(default_factory=lambda: [0] + [1] * 26)  # first layer is dense
@@
-    moe_layer_freq: Union[int, List[int]] = field(
+    moe_layer_freq: int | list[int] = field(
         default_factory=lambda: [0] * 3 + [1] * 58
     )  # first three layers are dense
@@
-    moe_layer_freq: Union[int, List[int]] = field(default_factory=lambda: [0] * 1 + [1] * 26)  # first layer is dense
+    moe_layer_freq: int | list[int] = field(default_factory=lambda: [0] * 1 + [1] * 26)  # first layer is dense
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/deepseek/deepseek_provider.py` around lines 16 -
109, Update type hints to modern PEP 604 and builtin generics and remove unused
typing imports: replace Optional[int] with int | None for fields like
DeepSeekModelProvider.q_lora_rank and DeepSeekV2LiteModelProvider.q_lora_rank,
replace Union[int, List[int]] with int | list[int] for
DeepSeekV2ModelProvider.moe_layer_freq, and replace List[int] with list[int]
wherever used (e.g., the lambda default factories). Remove unused imports
Optional, Union, List from the typing import list and tidy up the imports
accordingly.
src/megatron/bridge/models/qwen/qwen2_bridge.py (1)

28-57: ⚠️ Potential issue | 🟡 Minor

Add type hints and Google-style docstring to provider_bridge

The method currently lacks type annotations and a complete docstring. Per coding guidelines, all public methods must include type hints for arguments and return types, plus Google-style docstrings with Args and Returns sections.

Suggested fix
+from megatron.bridge.models.gpt_provider import GPTModelProvider
+from megatron.bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM
 from megatron.bridge.models.conversion.mapping_registry import MegatronMappingRegistry
 from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge
 from megatron.bridge.models.conversion.param_mapping import (
     AutoMapping,
     GatedMLPMapping,
     QKVMapping,
 )
@@
-    def provider_bridge(self, hf_pretrained):
-        """Convert HuggingFace Qwen2 config to GPTModelProvider."""
+    def provider_bridge(self, hf_pretrained: PreTrainedCausalLM) -> GPTModelProvider:
+        """Convert HuggingFace Qwen2 config to GPTModelProvider.
+
+        Args:
+            hf_pretrained: HuggingFace PreTrainedCausalLM instance.
+
+        Returns:
+            GPTModelProvider configured for Qwen2.
+        """
         provider = super().provider_bridge(hf_pretrained)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen/qwen2_bridge.py` around lines 28 - 57, The
provider_bridge method in Qwen2Bridge lacks type annotations and a Google-style
docstring; update the signature to include type hints (e.g., def
provider_bridge(self, hf_pretrained: Union[str, "PreTrainedModel"]) ->
"GPTModelProvider") and add a Google-style docstring with Args and Returns
sections describing hf_pretrained and the returned GPTModelProvider, and
import/forward-reference typing symbols (Union, PreTrainedModel,
GPTModelProvider) as needed to satisfy annotations.
♻️ Duplicate comments (7)
scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py (1)

34-35: Same unused mock parameter as in the llama31 configs.

Same issue as flagged in llama31_llm_pretrain.py — the mock: bool = True parameter is declared but unused in all five function signatures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py` around lines
34 - 35, The parameter mock: bool = True is declared but never used; remove the
unused mock parameter from the function signature(s) (e.g.,
nemotronh_56b_pretrain_config_gb300) and any other pretrain config functions in
this file so signatures no longer include mock, or alternatively wire the mock
flag into the function logic if a mocked behavior is intended—update all
referenced function definitions and any callers to match the new signature
(adjust imports/usage accordingly).
src/megatron/bridge/recipes/qwen/qwen3.py (5)

257-335: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 257 - 335, The
qwen3_4b_pretrain_config function contains lines longer than the 119-character
limit (e.g., the AutoBridge.from_hf_pretrained assignment, long
comment/documentation strings, and some cfg assignments like
cfg.tokenizer.tokenizer_model and dataset.blend); shorten them by breaking long
string literals and expressions across multiple lines (use implicit continuation
with parentheses or assign parts to intermediate variables), wrap long comments
to multiple shorter lines, and split chained attribute assignments into multiple
statements where needed so every line ≤119 chars while keeping the same
identifiers (qwen3_4b_pretrain_config, AutoBridge.from_hf_pretrained,
cfg.tokenizer.tokenizer_model, cfg.dataset.blend, etc.).

500-570: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 500 - 570, The
qwen3_32b_pretrain_config function contains several lines exceeding the
119-character limit (e.g., the long assignment/comment lines such as cfg.model =
AutoBridge.from_hf_pretrained("Qwen/Qwen3-32B").to_megatron_provider(load_weights=False),
the cfg.dataset.blend explanatory comment, and the checkpoint override comment
block); wrap these long expressions and comments across multiple shorter
strings/lines and/or split chained calls and long comments into concatenated
pieces so each source line stays <=119 characters while preserving the same
values and wording (edit the qwen3_32b_pretrain_config function, focusing on the
long lines noted above and similar long comments/assignments).

419-497: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 419 - 497, The
function qwen3_14b_pretrain_config contains multiple lines exceeding the
119-char limit (e.g., the AutoBridge.from_hf_pretrained assignment, long comment
lines, and long cfg.* assignment/comments). Fix by wrapping long string literals
and long chained assignments across lines (use implicit continuation with
parentheses for the AutoBridge.from_hf_pretrained call and split long comments
into multiple shorter comment lines), and break any long cfg.* comments or
example lines into shorter lines; target symbols to edit include
qwen3_14b_pretrain_config, cfg.model =
AutoBridge.from_hf_pretrained("Qwen/Qwen3-14B"), cfg.tokenizer.tokenizer_model,
and long comment lines shown in the block so every line <= 119 chars.

338-416: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 338 - 416, The
qwen3_8b_pretrain_config function contains lines exceeding the 119-char limit
(repeat violation). Fix by breaking long comment strings and long assignments
inside qwen3_8b_pretrain_config: wrap/line-break long comments into multiple
shorter comment lines, split long expressions (e.g., the cfg.dataset.blend
example tuple, long inline comments for FP8/mixed_precision) into multiple
concatenated strings or separate local variables, and prefer short helper
variables for long values (e.g., model_id = "Qwen/Qwen3-8B") used in cfg.model
and cfg.tokenizer assignments so no single source line exceeds 119 chars; update
any other long cfg.* assignment lines similarly. Ensure each modified line is
<=119 chars and preserves the same values and behavior.

176-255: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 176 - 255, Several
lines in qwen3_1p7b_pretrain_config exceed the 119-character limit; locate and
break up long statements such as the chained call cfg.model =
AutoBridge.from_hf_pretrained("Qwen/Qwen3-1.7B").to_megatron_provider(load_weights=False),
the long comment/assignment for cfg.dataset.blend, and the long checkpoint
example comment lines so each physical line is <=119 chars. Fix by splitting
long strings/comments across multiple concatenated literals or by introducing a
short local variable (e.g., pretrained_id = "Qwen/Qwen3-1.7B") and using that in
calls, and wrap long comments into multiple lines; ensure no functional change
to qwen3_1p7b_pretrain_config, cfg.dataset.blend, or the checkpoint comment
examples.
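All five qwen3 findings above prescribe the same wrapping techniques. A standalone sketch of those techniques, with placeholder paths and messages rather than the actual recipe values:

```python
# Sketch (not from the repo): ways to keep lines under the 119-char ruff
# limit without changing behavior. model_id, blend_paths, and msg are
# illustrative placeholders for the values in the qwen3 recipes.

MAX_LEN = 119

# 1) A short local variable replaces one long chained call on a single line.
model_id = "Qwen/Qwen3-8B"

# 2) Implicit continuation inside parentheses splits a long tuple/expression.
blend_paths = (
    "/data/prefix_a_text_document",
    "/data/prefix_b_text_document",
)

# 3) Adjacent string literals concatenate implicitly, so a long message can
#    span several short source lines while producing the same value.
msg = (
    "Checkpoint overrides: set cfg.checkpoint.load to resume from a prior "
    "run; leave it unset to start from scratch."
)

source_lines = [
    f'model_id = "{model_id}"',
    *[f'    "{p}",' for p in blend_paths],
    msg,
]
assert all(len(line) <= MAX_LEN for line in source_lines)
```

The same three moves (local variable, parenthesized continuation, implicit string concatenation) cover every long-line finding flagged in this file.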
src/megatron/bridge/models/deepseek/deepseek_v3_bridge.py (1)

112-113: Same docstring-style guidance as above applies here.

As per coding guidelines, "Use Google style docstrings (parseable by Sphinx) for classes and functions".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/deepseek/deepseek_v3_bridge.py` around lines 112 -
113, The function-level docstring above the line "global_name =
task.global_param_name" is not Google-style; update the docstring to a parseable
Google style (short description, blank line, Args: with parameter descriptions,
Returns: with return type/meaning, and Raises: if applicable) for the function
in deepseek_v3_bridge.py that handles adding the rotary embedding inverse
frequency parameter (the routine that references task.global_param_name / local
variable global_name). Ensure the docstring explains the purpose, documents
parameters (e.g., task, any flags), describes the return value (if any), and
notes exceptions so Sphinx can parse it.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c106d06 and cb434ff.

📒 Files selected for processing (129)
  • examples/quantization/pretrain_quantized_llama3_8b.py
  • scripts/performance/configs/deepseek/deepseek_llm_pretrain.py
  • scripts/performance/configs/deepseek/deepseek_workload_base_configs.py
  • scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py
  • scripts/performance/configs/kimi/__init__.py
  • scripts/performance/configs/kimi/kimi_llm_pretrain.py
  • scripts/performance/configs/kimi/kimi_workload_base_configs.py
  • scripts/performance/configs/llama/llama31_llm_pretrain.py
  • scripts/performance/configs/llama/llama3_llm_pretrain.py
  • scripts/performance/configs/nemotronh/__init__.py
  • scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py
  • scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py
  • scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py
  • scripts/performance/configs/qwen/qwen3_llm_pretrain.py
  • scripts/performance/configs/qwen/qwen3_workload_base_configs.py
  • scripts/performance/utils/overrides.py
  • scripts/performance/utils/utils.py
  • src/megatron/bridge/models/conversion/model_bridge.py
  • src/megatron/bridge/models/deepseek/common.py
  • src/megatron/bridge/models/deepseek/deepseek_provider.py
  • src/megatron/bridge/models/deepseek/deepseek_v2_bridge.py
  • src/megatron/bridge/models/deepseek/deepseek_v3_bridge.py
  • src/megatron/bridge/models/gemma/gemma2_bridge.py
  • src/megatron/bridge/models/gemma/gemma3_bridge.py
  • src/megatron/bridge/models/gemma/gemma_bridge.py
  • src/megatron/bridge/models/glm/glm45_bridge.py
  • src/megatron/bridge/models/gpt_oss/gpt_oss_bridge.py
  • src/megatron/bridge/models/gpt_provider.py
  • src/megatron/bridge/models/llama/llama_bridge.py
  • src/megatron/bridge/models/llama/llama_provider.py
  • src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py
  • src/megatron/bridge/models/mistral/mistral_bridge.py
  • src/megatron/bridge/models/mla_provider.py
  • src/megatron/bridge/models/nemotron/nemotron_bridge.py
  • src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py
  • src/megatron/bridge/models/nemotronh/__init__.py
  • src/megatron/bridge/models/nemotronh/nemotron_h_bridge.py
  • src/megatron/bridge/models/olmoe/olmoe_bridge.py
  • src/megatron/bridge/models/qwen/qwen2_bridge.py
  • src/megatron/bridge/models/qwen/qwen3_bridge.py
  • src/megatron/bridge/models/qwen/qwen3_moe_bridge.py
  • src/megatron/bridge/models/qwen/qwen3_next_bridge.py
  • src/megatron/bridge/models/qwen_vl/qwen25_vl_bridge.py
  • src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py
  • src/megatron/bridge/recipes/common.py
  • src/megatron/bridge/recipes/deepseek/deepseek_v2.py
  • src/megatron/bridge/recipes/deepseek/deepseek_v3.py
  • src/megatron/bridge/recipes/gemma/gemma2.py
  • src/megatron/bridge/recipes/gemma/gemma3.py
  • src/megatron/bridge/recipes/glm/glm45.py
  • src/megatron/bridge/recipes/gpt/gpt3_175b.py
  • src/megatron/bridge/recipes/gpt_oss/gpt_oss.py
  • src/megatron/bridge/recipes/kimi/kimi_k2.py
  • src/megatron/bridge/recipes/llama/llama2.py
  • src/megatron/bridge/recipes/llama/llama3.py
  • src/megatron/bridge/recipes/moonlight/moonlight_16b.py
  • src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
  • src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py
  • src/megatron/bridge/recipes/nemotronh/nemotronh.py
  • src/megatron/bridge/recipes/olmoe/olmoe_7b.py
  • src/megatron/bridge/recipes/qwen/qwen2.py
  • src/megatron/bridge/recipes/qwen/qwen3.py
  • src/megatron/bridge/recipes/qwen/qwen3_moe.py
  • src/megatron/bridge/recipes/qwen/qwen3_next.py
  • src/megatron/bridge/recipes/utils/optimizer_utils.py
  • src/megatron/bridge/training/config.py
  • src/megatron/bridge/training/setup.py
  • tests/functional_tests/models/deepseek/test_deepseek_conversion.py
  • tests/functional_tests/models/deepseek/test_deepseek_provider_mapping.py
  • tests/functional_tests/models/gemma/test_gemma2_provider.py
  • tests/functional_tests/models/gemma/test_gemma3_provider.py
  • tests/functional_tests/models/gemma/test_gemma_provider.py
  • tests/functional_tests/models/glm/test_glm45_provider.py
  • tests/functional_tests/models/llama_nemotron/test_llama_nemotron_provider.py
  • tests/functional_tests/models/nemotronh/test_nemotron_h_provider.py
  • tests/functional_tests/models/qwen/test_qwen2_provider.py
  • tests/functional_tests/models/qwen/test_qwen3_moe_provider.py
  • tests/functional_tests/models/qwen/test_qwen3_provider.py
  • tests/functional_tests/quantization/test_qat_workflow.py
  • tests/functional_tests/recipes/test_llama_recipes_distill_3b-1b.py
  • tests/functional_tests/recipes/test_perf_config_integration.py
  • tests/functional_tests/recipes/utils.py
  • tests/functional_tests/training/test_callbacks.py
  • tests/functional_tests/training/test_decentralized_pg.py
  • tests/functional_tests/training/test_finetune_dora.py
  • tests/functional_tests/training/test_finetune_lora.py
  • tests/functional_tests/training/test_inprocess_restart.py
  • tests/functional_tests/training/test_nvrx_straggler.py
  • tests/functional_tests/training/test_pretrain.py
  • tests/functional_tests/training/test_pretrain_resume.py
  • tests/functional_tests/training/test_sft.py
  • tests/functional_tests/training/test_tensor_inspect.py
  • tests/unit_tests/models/deepseek/test_deepseek_bridges.py
  • tests/unit_tests/models/deepseek/test_deepseek_provider.py
  • tests/unit_tests/models/gemma/test_gemma2_bridge.py
  • tests/unit_tests/models/gemma/test_gemma3_bridge.py
  • tests/unit_tests/models/gemma/test_gemma_bridge.py
  • tests/unit_tests/models/glm/test_glm45_bridge.py
  • tests/unit_tests/models/gpt_oss/test_gpt_oss_bridges.py
  • tests/unit_tests/models/llama/test_llama_bridge.py
  • tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
  • tests/unit_tests/models/mistral/test_mistral_model_bridge.py
  • tests/unit_tests/models/nemotron/test_nemotron_bridge.py
  • tests/unit_tests/models/nemotronh/test_nemotron_h_bridge.py
  • tests/unit_tests/models/olmoe/test_olmoe_bridge.py
  • tests/unit_tests/models/qwen/test_qwen3_bridge.py
  • tests/unit_tests/models/qwen/test_qwen3_moe_bridge.py
  • tests/unit_tests/models/qwen/test_qwen3_next_bridge.py
  • tests/unit_tests/models/qwen_vl/test_qwen25_vl_bridge.py
  • tests/unit_tests/models/test_gpt_provider.py
  • tests/unit_tests/recipes/gemma/test_gemma2_recipes.py
  • tests/unit_tests/recipes/gpt/test_gpt3_175b.py
  • tests/unit_tests/recipes/kimi/test_kimi_k2.py
  • tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py
  • tests/unit_tests/recipes/nemotronh/test_nemotron_nano_v2.py
  • tests/unit_tests/recipes/nemotronh/test_nemotronh.py
  • tests/unit_tests/recipes/qwen/test_qwen2_recipes.py
  • tests/unit_tests/recipes/test_deepseek_recipes.py
  • tests/unit_tests/recipes/test_gemma3_recipes.py
  • tests/unit_tests/recipes/test_glm45_recipes.py
  • tests/unit_tests/recipes/test_llama_recipes.py
  • tests/unit_tests/recipes/test_moonlight_recipes.py
  • tests/unit_tests/recipes/test_nemotronh_recipes.py
  • tests/unit_tests/recipes/test_olmoe_recipes.py
  • tests/unit_tests/recipes/test_qwen_recipes.py
  • tests/unit_tests/recipes/test_run_plugins.py
  • tests/unit_tests/recipes/utils/test_optimizer_utils.py
  • tests/unit_tests/training/test_log_non_default_values.py
  • tests/unit_tests/training/test_setup.py
💤 Files with no reviewable changes (23)
  • src/megatron/bridge/models/qwen/qwen3_next_bridge.py
  • src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py
  • tests/unit_tests/models/gemma/test_gemma2_bridge.py
  • tests/functional_tests/models/glm/test_glm45_provider.py
  • src/megatron/bridge/models/mistral/mistral_bridge.py
  • tests/functional_tests/models/qwen/test_qwen3_moe_provider.py
  • src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py
  • tests/functional_tests/models/gemma/test_gemma3_provider.py
  • tests/unit_tests/models/gemma/test_gemma3_bridge.py
  • tests/functional_tests/models/qwen/test_qwen2_provider.py
  • tests/functional_tests/models/llama_nemotron/test_llama_nemotron_provider.py
  • tests/functional_tests/models/gemma/test_gemma_provider.py
  • src/megatron/bridge/models/qwen_vl/qwen25_vl_bridge.py
  • tests/unit_tests/models/test_gpt_provider.py
  • tests/functional_tests/models/deepseek/test_deepseek_provider_mapping.py
  • src/megatron/bridge/models/deepseek/common.py
  • tests/functional_tests/models/nemotronh/test_nemotron_h_provider.py
  • tests/unit_tests/models/mistral/test_mistral_model_bridge.py
  • tests/functional_tests/models/gemma/test_gemma2_provider.py
  • tests/functional_tests/models/qwen/test_qwen3_provider.py
  • tests/unit_tests/models/gemma/test_gemma_bridge.py
  • tests/unit_tests/models/qwen_vl/test_qwen25_vl_bridge.py
  • tests/unit_tests/models/qwen/test_qwen3_next_bridge.py

Comment on lines 160 to +165
if args.hf_path:
logger.info(f"Using custom HuggingFace path: {args.hf_path}")
recipe_kwargs["hf_path"] = args.hf_path
# Import AutoBridge to create a new model provider with the custom HF path
from megatron.bridge.models import AutoBridge

cfg: ConfigContainer = pretrain_config(**recipe_kwargs)
logger.info("Loaded base configuration")
cfg.model = AutoBridge.from_hf_pretrained(args.hf_path).to_megatron_provider(load_weights=False)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for AutoBridge class and its methods
find . -type f -name "*.py" -path "*/megatron/bridge/models/*" | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1120


🏁 Script executed:

#!/bin/bash
# Look for AutoBridge definition and from_hf_config method
rg "class AutoBridge|def from_hf_config|def from_hf_pretrained" --type=py -n

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1977


🏁 Script executed:

#!/bin/bash
# Check the specific file for import placement
cat -n examples/quantization/pretrain_quantized_llama3_8b.py | sed -n '150,170p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1095


🏁 Script executed:

#!/bin/bash
# Examine the from_hf_pretrained and from_hf_config implementations
sed -n '148,230p' src/megatron/bridge/models/conversion/auto_bridge.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3593


🏁 Script executed:

#!/bin/bash
# Also check the to_megatron_provider implementation
rg "def to_megatron_provider" -A 20 --type=py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 32005


🏁 Script executed:

#!/bin/bash
# Check what PreTrainedCausalLM.from_pretrained does and if it loads weights
rg "class AutoBridge" -A 50 src/megatron/bridge/models/conversion/auto_bridge.py | head -80

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2443


🏁 Script executed:

#!/bin/bash
# Get the full implementation of from_hf_pretrained
sed -n '194,250p' src/megatron/bridge/models/conversion/auto_bridge.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2566


🏁 Script executed:

#!/bin/bash
# Check if there's any rank guard usage in the AutoBridge or related code
rg "get_rank|rank_zero|dist.get_rank" src/megatron/bridge/models/conversion/ -A 2 -B 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1787


🏁 Script executed:

#!/bin/bash
# Get the full to_megatron_provider implementation
sed -n '238,320p' src/megatron/bridge/models/conversion/auto_bridge.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3635


🏁 Script executed:

#!/bin/bash
# Check if there are any attribute accesses on hf_pretrained that would fail if it's just a config
rg "self.hf_pretrained\." src/megatron/bridge/models/conversion/auto_bridge.py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 528


🏁 Script executed:

#!/bin/bash
# Search for to_megatron_provider in AutoBridge specifically
rg "def to_megatron_provider" src/megatron/bridge/models/conversion/auto_bridge.py -A 80

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4006


🏁 Script executed:

#!/bin/bash
# Check the config property and how hf_pretrained.config is accessed
rg "\.config" src/megatron/bridge/models/conversion/auto_bridge.py -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 506


🏁 Script executed:

#!/bin/bash
# Check the beginning of the file to see how imports are organized
head -30 examples/quantization/pretrain_quantized_llama3_8b.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1327


🏁 Script executed:

#!/bin/bash
# Verify if AutoBridge is already imported at the top
rg "^import|^from" examples/quantization/pretrain_quantized_llama3_8b.py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 684


All distributed ranks load the full HF model unnecessarily — OOM risk.

AutoBridge.from_hf_pretrained(args.hf_path) loads the entire model with all weights via PreTrainedCausalLM.from_pretrained(), and because there is no rank guard, every torchrun worker independently loads the full model in parallel. For LLaMA 3-8B that is ~16 GB per rank; with 8 ranks this consumes ~128 GB total VRAM and will cause OOM before pretraining starts.

Use AutoBridge.from_hf_config() instead to load only the configuration, then call to_megatron_provider(load_weights=False) to defer weight loading until after distributed initialization:

🛡️ Proposed fix
-        # Import AutoBridge to create a new model provider with the custom HF path
-        from megatron.bridge.models import AutoBridge
-
-        cfg.model = AutoBridge.from_hf_pretrained(args.hf_path).to_megatron_provider(load_weights=False)
+        # Import AutoBridge to create a new model provider with the custom HF path
+        from megatron.bridge.models import AutoBridge
+        from transformers import AutoConfig
+
+        hf_cfg = AutoConfig.from_pretrained(args.hf_path)
+        cfg.model = AutoBridge.from_hf_config(hf_cfg).to_megatron_provider(load_weights=False)

Also move the AutoBridge import to the top of the file with other megatron.bridge.* imports to follow the import organization guideline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/quantization/pretrain_quantized_llama3_8b.py` around lines 160 -
165, Replace the blocking call to AutoBridge.from_hf_pretrained(args.hf_path)
(which loads full weights on every rank) with a config-only load: build the HF
config via AutoConfig.from_pretrained(args.hf_path), pass it to
AutoBridge.from_hf_config(...), and then call
.to_megatron_provider(load_weights=False) to defer weight loading until after
distributed init; also move the `from megatron.bridge.models import AutoBridge`
import to the top of the file with the other megatron.bridge imports to follow
import organization.
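The memory figures in the finding can be checked with back-of-envelope arithmetic (assuming bf16 weights, 2 bytes per parameter, and one full copy per torchrun worker, as the finding describes):

```python
# Back-of-envelope check of the OOM claim: an 8B-parameter model in bf16
# (2 bytes/param), duplicated independently across 8 torchrun workers.
params = 8e9
bytes_per_param = 2  # bf16

per_rank_gb = params * bytes_per_param / 1e9
total_gb = per_rank_gb * 8  # every rank loads its own full copy

assert per_rank_gb == 16.0   # ~16 GB per rank
assert total_gb == 128.0     # ~128 GB across 8 ranks
```

This is why deferring weight loading with `load_weights=False` until after distributed initialization matters: the per-rank copies never materialize.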

Comment on lines +45 to +83
def kimi_k2_pretrain_config_gb300(
precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
) -> ConfigContainer:
"""GB300, baseline config."""
base_cfg = get_workload_base_config(
model_family_name="kimi",
model_recipe_name="kimi_k2",
gpu="gb300",
compute_dtype=precision.upper(),
task="pretrain",
config_variant=config_variant,
)

cfg = pretrain_config()
precision_config = get_precision_config(precision)
cfg.mixed_precision = precision_config

if base_cfg.moe_flex_dispatcher_backend is not None:
cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend

if base_cfg.pp_layout:
cfg.model.pipeline_model_parallel_layout = base_cfg.pp_layout
else:
# Recompute layout based on updated PP/VP sizes
pp_size = base_cfg.pipeline_model_parallel_size
vp_size = base_cfg.virtual_pipeline_model_parallel_size
layout = _get_kimi_k2_pipeline_layout(pp_size, vp_size)
cfg.model.pipeline_model_parallel_layout = layout

set_kimi_k2_common_configs(cfg)
set_workload_base_configs(cfg, base_cfg)

cfg.comm_overlap.overlap_grad_reduce = True

# Setting num_workers and pin_memory to 0 and False respectively gives better performance.
# we are debugging this and might change this in the future.
cfg.dataset.num_workers = 0
cfg.dataset.pin_memory = False


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the file and search for pretrain_config() definition and comm_overlap usage
find . -name "kimi_llm_pretrain.py" -type f | head -5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 126


🏁 Script executed:

#!/bin/bash
# Search for pretrain_config function definition
rg "def pretrain_config" -A 10 -B 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Check for comm_overlap initialization in the file
rg "comm_overlap" -n scripts/performance/configs/kimi/kimi_llm_pretrain.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 278


🏁 Script executed:

#!/bin/bash
# Check all 4 function signatures to see if mock is used
rg "def kimi_k2_pretrain_config" -A 40 scripts/performance/configs/kimi/kimi_llm_pretrain.py | head -150

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5316


🏁 Script executed:

#!/bin/bash
# Search where pretrain_config is defined - likely in another module
rg "def pretrain_config\(" --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Get the top of the file to see imports
head -50 scripts/performance/configs/kimi/kimi_llm_pretrain.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1890


🏁 Script executed:

#!/bin/bash
# Check if mock is used in any of the functions
rg "mock" scripts/performance/configs/kimi/kimi_llm_pretrain.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 370


🏁 Script executed:

#!/bin/bash
# Find pretrain_config definition in the entire codebase
rg "def pretrain_config\(" -B 2 -A 5 --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Check what module pretrain_config likely comes from
rg "pretrain_config" -B 3 scripts/performance/configs/kimi/kimi_llm_pretrain.py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 494


🏁 Script executed:

#!/bin/bash
# Find the kimi_k2.py file
find . -name "kimi_k2.py" -type f

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 116


🏁 Script executed:

#!/bin/bash
# Search for kimi_k2_pretrain_config definition
rg "def kimi_k2_pretrain_config\(" -A 20 --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1734


🏁 Script executed:

#!/bin/bash
# Check ConfigContainer for comm_overlap field
rg "comm_overlap" -B 2 -A 2 --type py | head -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4103


🏁 Script executed:

#!/bin/bash
# Find _pretrain_common definition
rg "def _pretrain_common\(" -A 30 --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2472


🏁 Script executed:

#!/bin/bash
# Get full content of kimi_k2.py to understand the config flow
wc -l src/megatron/bridge/recipes/kimi/kimi_k2.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 118


🏁 Script executed:

#!/bin/bash
# Check what _pretrain_common returns and if it initializes comm_overlap
rg "_pretrain_common" -B 2 -A 2 --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50383


🏁 Script executed:

#!/bin/bash
# Search for comm_overlap initialization in common.py
rg "comm_overlap" src/megatron/bridge/recipes/common.py -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 251


🏁 Script executed:

#!/bin/bash
# Get the full ConfigContainer initialization in _pretrain_common
sed -n '/def _pretrain_common/,/return cfg/p' src/megatron/bridge/recipes/common.py | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3456


🏁 Script executed:

#!/bin/bash
# Check the entire file structure of kimi_llm_pretrain.py to see all 4 functions
cat scripts/performance/configs/kimi/kimi_llm_pretrain.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 7142


Guard comm_overlap access and remove unused mock parameter

cfg.comm_overlap.overlap_grad_reduce = ... will raise AttributeError since _pretrain_common() initializes comm_overlap=None. Guard with a null check before assignment. Also, mock is unused across all four config builder functions—remove the parameter to avoid a misleading API signature.

Suggested fix (apply to each config builder)
-    cfg.comm_overlap.overlap_grad_reduce = True
+    if cfg.comm_overlap is not None:
+        cfg.comm_overlap.overlap_grad_reduce = True

Also applies to lines 87-125 (gb200), 129-163 (b200), 166-200 (h100). Remove mock: bool = True parameter from all four functions.

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 46-46: Unused function argument: mock

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/kimi/kimi_llm_pretrain.py` around lines 45 - 83,
The cfg.comm_overlap assignment can raise AttributeError because
_pretrain_common() may set comm_overlap=None; update the config builders (e.g.,
kimi_k2_pretrain_config_gb300) to check if cfg.comm_overlap is not None before
setting cfg.comm_overlap.overlap_grad_reduce, and remove the unused mock: bool =
True parameter from this function and the other three config builders (the
GB200, B200, H100 variants) so the signatures no longer include mock and callers
reflect the updated signature.
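The null-guard the fix proposes can be sketched in isolation; `Cfg` and `CommOverlapConfig` below are pared-down stand-ins for the real config classes, not the actual Megatron-Bridge types:

```python
# Minimal sketch of the comm_overlap null-guard; these dataclasses are
# illustrative stand-ins for ConfigContainer and its comm-overlap config.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommOverlapConfig:
    overlap_grad_reduce: bool = False


@dataclass
class Cfg:
    comm_overlap: Optional[CommOverlapConfig] = None


cfg = Cfg()  # mirrors _pretrain_common() leaving comm_overlap as None
if cfg.comm_overlap is not None:  # guard prevents AttributeError
    cfg.comm_overlap.overlap_grad_reduce = True
assert cfg.comm_overlap is None  # guarded path is a safe no-op

cfg2 = Cfg(comm_overlap=CommOverlapConfig())
if cfg2.comm_overlap is not None:
    cfg2.comm_overlap.overlap_grad_reduce = True
assert cfg2.comm_overlap.overlap_grad_reduce is True
```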

moe_flex_dispatcher_backend="hybridep",
moe_a2a_overlap=False,
cuda_graph_impl="transformer_engine",
cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n --type=py "cuda_graph_scope" -C 3 -g '!*test*'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 50383


🏁 Script executed:

# Check the WorkloadBaseConfig definition
rg -A 20 "class WorkloadBaseConfig" scripts/performance/utils/utils.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1046


🏁 Script executed:

# Get more lines from WorkloadBaseConfig to see cuda_graph_scope definition
sed -n '40,70p' scripts/performance/utils/utils.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1065


🏁 Script executed:

# Check the exact class definition and decorator
sed -n '30,60p' scripts/performance/utils/utils.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1386


🏁 Script executed:

# Check lines before the class definition to see if it's a dataclass
sed -n '25,35p' scripts/performance/utils/utils.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 622


Fix type annotation for cuda_graph_scope in WorkloadBaseConfig.

The cuda_graph_scope field is declared as Optional[str] = None in scripts/performance/utils/utils.py line 55, but lines 37 and 57 of this file assign list values like ["attn", "moe_router", "moe_preprocess"]. The downstream override handler in scripts/performance/utils/overrides.py accepts Optional[str | List[str]] and converts strings to lists, indicating the field should support both types. Update the type annotation in WorkloadBaseConfig to Optional[str | List[str]] = None to reflect actual usage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/kimi/kimi_workload_base_configs.py` at line 37,
The WorkloadBaseConfig field cuda_graph_scope is declared as Optional[str] but
is assigned lists (e.g. ["attn", "moe_router", "moe_preprocess"]) and must
accept both strings and lists; update the type annotation on the
cuda_graph_scope attribute in the WorkloadBaseConfig class to Optional[str |
List[str]] = None so it matches usage and the downstream override handler that
accepts Optional[str | List[str]].
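The suggested annotation and the downstream normalization can be sketched together; `WorkloadBaseConfig` here is a pared-down stand-in for the class in scripts/performance/utils/utils.py, and `normalize_scope` is a hypothetical mirror of the override handler's string-to-list conversion:

```python
# Sketch of the suggested annotation change plus the override handler's
# normalization; both names are stand-ins, not the repository code.
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class WorkloadBaseConfig:
    # Accept either a single scope name or a list of scopes, matching how
    # the configs assign values like ["attn", "moe_router", "moe_preprocess"].
    cuda_graph_scope: Optional[Union[str, List[str]]] = None


def normalize_scope(cfg: WorkloadBaseConfig) -> Optional[List[str]]:
    """Normalize like the override handler: a bare string becomes a list."""
    if cfg.cuda_graph_scope is None:
        return None
    if isinstance(cfg.cuda_graph_scope, str):
        return [cfg.cuda_graph_scope]
    return cfg.cuda_graph_scope


assert normalize_scope(WorkloadBaseConfig()) is None
assert normalize_scope(WorkloadBaseConfig(cuda_graph_scope="attn")) == ["attn"]
assert normalize_scope(
    WorkloadBaseConfig(cuda_graph_scope=["attn", "moe_router", "moe_preprocess"])
) == ["attn", "moe_router", "moe_preprocess"]
```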

Comment on lines +371 to +396
is_mla_provider = self.PROVIDER_CLASS is not None and issubclass(self.PROVIDER_CLASS, MLAModelProvider)
rope_scaling = getattr(hf_config, "rope_scaling", None)

if rope_scaling is not None and isinstance(rope_scaling, dict):
rope_type = rope_scaling.get("type") or rope_scaling.get("rope_type")
if rope_type == "yarn":
# Check if this is an MLA provider (uses direct field names)
# or a GPT provider (uses yarn_ prefixed field names)
if is_mla_provider:
# MLA models: use direct field names (mscale, rotary_scaling_factor, etc.)
mla_params = {}
for hf_key, megatron_key in self.MLA_ROPE_SCALING_MAPPING:
value = rope_scaling.get(hf_key)
if value is not None:
mla_params[megatron_key] = value
if mla_params:
provider_kwargs["_mla_rope_params"] = mla_params
else:
# GPT models: use yarn_ prefixed field names
yarn_params = {"position_embedding_type": "yarn"}
for hf_key, megatron_key in self.YARN_ROPE_SCALING_MAPPING:
yarn_params[megatron_key] = rope_scaling.get(hf_key)
if "truncate" in rope_scaling:
yarn_params["yarn_correction_range_round_to_int"] = rope_scaling["truncate"]
if yarn_params:
provider_kwargs["_yarn_params"] = yarn_params

⚠️ Potential issue | 🟠 Major

Skip None values when building YARN params.

Right now missing YARN keys get set to None, which can overwrite provider defaults and break config validation downstream.

🐛 Proposed fix

```diff
-                    for hf_key, megatron_key in self.YARN_ROPE_SCALING_MAPPING:
-                        yarn_params[megatron_key] = rope_scaling.get(hf_key)
+                    for hf_key, megatron_key in self.YARN_ROPE_SCALING_MAPPING:
+                        value = rope_scaling.get(hf_key)
+                        if value is not None:
+                            yarn_params[megatron_key] = value
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/conversion/model_bridge.py` around lines 371 -
396: the YARN param builder currently assigns None for missing keys in the loop
over YARN_ROPE_SCALING_MAPPING, which can overwrite provider defaults. Change
the loop so it only sets yarn_params[megatron_key] = value when value is not
None (mirroring the MLA branch), and only add provider_kwargs["_yarn_params"]
when yarn_params contains actual scaling entries beyond the initial
"position_embedding_type" or when a non-None truncate is present, so missing HF
keys do not produce None-valued overrides.
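The None-skipping pattern the fix asks for can be shown in isolation. This is a minimal, self-contained sketch, not the real bridge code: the mapping contents and the `build_yarn_params` helper are illustrative stand-ins for the class attributes described in the review.

```python
# Hypothetical subset of a HF-key -> Megatron-key mapping; the real class
# attribute may contain different entries.
YARN_ROPE_SCALING_MAPPING = [
    ("factor", "yarn_rotary_scaling_factor"),
    ("beta_fast", "yarn_beta_fast"),
    ("beta_slow", "yarn_beta_slow"),
]


def build_yarn_params(rope_scaling: dict) -> dict:
    yarn_params = {"position_embedding_type": "yarn"}
    for hf_key, megatron_key in YARN_ROPE_SCALING_MAPPING:
        value = rope_scaling.get(hf_key)
        if value is not None:  # skip missing keys instead of writing None
            yarn_params[megatron_key] = value
    if "truncate" in rope_scaling:
        yarn_params["yarn_correction_range_round_to_int"] = rope_scaling["truncate"]
    return yarn_params


# Only "factor" is present, so the beta_* keys are omitted entirely rather
# than being emitted as None and clobbering provider defaults.
print(build_yarn_params({"type": "yarn", "factor": 4.0}))
# → {'position_embedding_type': 'yarn', 'yarn_rotary_scaling_factor': 4.0}
```

With the original code, the same input would have produced `yarn_beta_fast=None` and `yarn_beta_slow=None`, silently overriding whatever defaults the provider dataclass declares.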

Comment on lines +504 to +519
```python
# Handle YARN rope scaling: check if provider has yarn_* params and build rope_scaling dict
yarn_rotary_scaling_factor = getattr(provider, "yarn_rotary_scaling_factor", None)
if yarn_rotary_scaling_factor is not None:
    if "rope_scaling" not in hf_config:
        hf_config["rope_scaling"] = {}
    hf_config["rope_scaling"]["rope_type"] = "yarn"

    for hf_key, megatron_key in cls.YARN_ROPE_SCALING_MAPPING:
        value = getattr(provider, megatron_key, None)
        if value is not None:
            hf_config["rope_scaling"][hf_key] = value

    yarn_correction_range_round_to_int = getattr(provider, "yarn_correction_range_round_to_int", None)
    if yarn_correction_range_round_to_int is not None:
        hf_config["rope_scaling"]["truncate"] = yarn_correction_range_round_to_int
```


⚠️ Potential issue | 🟠 Major

Export MLA rope scaling to HF configs.

The new MLA rope-scaling fields are handled when importing HF configs, but they are not emitted on export. That loses critical settings for MLA models.

🐛 Proposed fix

```diff
+        from megatron.bridge.models.mla_provider import MLAModelProvider
+
+        if isinstance(provider, MLAModelProvider):
+            mla_rope_scaling: dict[str, float] = {}
+            for hf_key, megatron_key in cls.MLA_ROPE_SCALING_MAPPING:
+                value = getattr(provider, megatron_key, None)
+                if value is not None:
+                    mla_rope_scaling[hf_key] = value
+            if mla_rope_scaling:
+                hf_config["rope_scaling"] = {"rope_type": "yarn", **mla_rope_scaling}
+        else:
             yarn_rotary_scaling_factor = getattr(provider, "yarn_rotary_scaling_factor", None)
             if yarn_rotary_scaling_factor is not None:
                 if "rope_scaling" not in hf_config:
                     hf_config["rope_scaling"] = {}
                 hf_config["rope_scaling"]["rope_type"] = "yarn"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/conversion/model_bridge.py` around lines 504 -
519: export currently doesn't emit the MLA rope-scaling fields, so those
rope_scaling settings are lost on conversion back to HF. In the export path
that builds hf_config (the same code area that uses
cls.YARN_ROPE_SCALING_MAPPING and provider attributes), detect the
MLA/yarn-related provider attributes (yarn_rotary_scaling_factor,
yarn_correction_range_round_to_int, and the keys listed in
cls.YARN_ROPE_SCALING_MAPPING) and populate hf_config["rope_scaling"] with
rope_type="yarn", the mapped HF keys, and the truncate flag when present, so
the export logic mirrors the import handling shown here.
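The export direction of the round trip can be sketched the same way as the import side. This is an illustrative stand-in, not the real `model_bridge.py` code: the mapping is a hypothetical subset, and `SimpleNamespace` fills in for the actual provider dataclass.

```python
from types import SimpleNamespace

# Hypothetical subset of the HF-key -> Megatron-key mapping.
YARN_ROPE_SCALING_MAPPING = [
    ("factor", "yarn_rotary_scaling_factor"),
    ("beta_fast", "yarn_beta_fast"),
]


def export_rope_scaling(provider) -> dict:
    """Mirror the import path: read yarn_* attrs and emit an HF rope_scaling dict."""
    hf_config: dict = {}
    if getattr(provider, "yarn_rotary_scaling_factor", None) is not None:
        rope_scaling = {"rope_type": "yarn"}
        for hf_key, megatron_key in YARN_ROPE_SCALING_MAPPING:
            value = getattr(provider, megatron_key, None)
            if value is not None:  # unset fields never appear in the HF config
                rope_scaling[hf_key] = value
        truncate = getattr(provider, "yarn_correction_range_round_to_int", None)
        if truncate is not None:
            rope_scaling["truncate"] = truncate
        hf_config["rope_scaling"] = rope_scaling
    return hf_config


provider = SimpleNamespace(yarn_rotary_scaling_factor=4.0, yarn_beta_fast=32)
print(export_rope_scaling(provider))
# → {'rope_scaling': {'rope_type': 'yarn', 'factor': 4.0, 'beta_fast': 32}}
```

The review's point is that an equivalent branch reading the MLA mapping's direct field names is needed as well, so that MLA providers survive an HF → Megatron → HF round trip without dropping rope_scaling.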

Comment on lines +272 to +276
```python
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 4
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None
cfg.model.virtual_pipeline_model_parallel_size = None
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: NVIDIA-NeMo/Megatron-Bridge):

```shell
# Locate the target file
find . -type f -name "gpt_oss.py" | head -20

# Read lines 272-276 in context
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '260,290p'

# Search for 20B config and other pipeline_dtype references in the file
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | grep -E "(20B|pipeline_dtype)"

# Check the 20B pre-training config settings
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '133,170p'

# Check the 120B config context and finetuning configs
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '250,290p'

# Check finetuning configs
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '361,420p'

# Continue reading the _gpt_oss_finetune_common function
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '404,460p'

# List other recipe files
find ./src/megatron/bridge/recipes -type f -name "*.py" | head -10

# Search for pipeline_dtype in other recipe files
rg "pipeline_dtype" ./src/megatron/bridge/recipes --type py -B 2 -A 2
```


Set pipeline_dtype to torch.bfloat16 for PP=4 configs.

Both the 20B and 120B pre-training configurations use pipeline parallelism (PP=4) but have pipeline_dtype = None, which violates the pattern established across other recipe files. When pipeline model parallelism > 1, pipeline_dtype must be explicitly set to torch.bfloat16.

🔧 Fix for both configs

```diff
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16
```

Apply this fix to both:

  • Line 156 (20B pre-training config)
  • Line 275 (120B pre-training config)
📝 Committable suggestion

‼️ IMPORTANT: carefully review the code before committing. Ensure it accurately replaces the highlighted code, contains no missing lines, has no indentation issues, and is tested and benchmarked against the requirements.

Suggested change

```diff
 cfg.model.tensor_model_parallel_size = 2
 cfg.model.pipeline_model_parallel_size = 4
 cfg.model.pipeline_model_parallel_layout = None
-cfg.model.pipeline_dtype = None
+cfg.model.pipeline_dtype = torch.bfloat16
 cfg.model.virtual_pipeline_model_parallel_size = None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py` around lines 272 - 276, The
pipeline_dtype is left as None for configs where
cfg.model.pipeline_model_parallel_size == 4; update those occurrences in
src/megatron/bridge/recipes/gpt_oss/gpt_oss.py by setting
cfg.model.pipeline_dtype = torch.bfloat16 (ensure torch is imported/available in
the module) for both the 20B and 120B pre-training config blocks where
cfg.model.pipeline_model_parallel_size is 4 so pipeline_dtype is explicitly set
whenever pipeline parallelism > 1.
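The invariant the review enforces can be expressed as a small validation helper. This is a hedged sketch under assumed names: `ModelConfig` and `validate_pipeline_dtype` are hypothetical stand-ins, not the real Megatron-Bridge classes, and the dtype is shown as a string rather than a `torch.dtype` to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    """Hypothetical stand-in for the recipe's cfg.model."""

    pipeline_model_parallel_size: int = 1
    pipeline_dtype: Optional[str] = None  # e.g. "bfloat16"


def validate_pipeline_dtype(model: ModelConfig) -> None:
    # The pattern across the other recipes: PP > 1 requires an explicit dtype.
    if model.pipeline_model_parallel_size > 1 and model.pipeline_dtype is None:
        raise ValueError(
            "pipeline_dtype must be set explicitly when "
            f"pipeline_model_parallel_size={model.pipeline_model_parallel_size} > 1"
        )


# PP=4 with an explicit dtype passes; PP=4 with dtype=None is rejected.
validate_pipeline_dtype(ModelConfig(pipeline_model_parallel_size=4, pipeline_dtype="bfloat16"))
try:
    validate_pipeline_dtype(ModelConfig(pipeline_model_parallel_size=4))
except ValueError as exc:
    print("rejected:", exc)
```

A check like this at config-build time would have caught both the 20B and 120B configs before they reached training.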

```python
# Optimizer hyperparameters
cfg.optimizer.lr = 1.6e-3
cfg.optimizer.weight_decay = 0.1
cfg.scheduler.min_lr = 1.6e-5
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

```shell
# Check whether min_lr lives on optimizer or scheduler
rg -n "\.(optimizer|scheduler)\.min_lr" --type=py -g 'src/megatron/bridge/recipes/**'

# Look at the context around line 131 in nemotron_3_nano.py
sed -n '120,140p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py

# Search for scheduler and optimizer config definitions and their min_lr usage
rg -n "min_lr" --type=py -B2 -A2 src/megatron/bridge/recipes/gpt/gpt3_175b.py | head -30
```


Change cfg.scheduler.min_lr to cfg.optimizer.min_lr at line 131.

All other recipes (GPT 175B, LLaMA3, GLM45, Nemotron Nano V2 VL) set cfg.optimizer.min_lr. This file is the only outlier using cfg.scheduler.min_lr, which is inconsistent with the established pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py` at line 131,
Replace the inconsistent assignment of cfg.scheduler.min_lr with
cfg.optimizer.min_lr in the Nemotron 3 Nano recipe: locate the line setting
cfg.scheduler.min_lr (currently 1.6e-5) and change it to assign that value to
cfg.optimizer.min_lr instead, and scan the file for any other uses of
cfg.scheduler.min_lr to align them with the optimizer config pattern used in
other recipes.

```python
cfg.optimizer.lr = 1.6e-3
cfg.optimizer.weight_decay = 0.1
cfg.scheduler.min_lr = 1.6e-5
cfg.scheduler.warmup_iters = 333
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

```shell
# Find SchedulerConfig class definition
find . -name "*.py" -type f | xargs grep -l "class SchedulerConfig" | head -5

# Check warmup_iters vs lr_warmup_iters usage in recipes
rg -n "scheduler\.(lr_)?warmup_iters" --type=py

# Look at the specific file to confirm line 132
sed -n '125,140p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py

# Search for _pretrain_common to understand what it sets
find . -name "*.py" -type f | xargs grep -l "_pretrain_common" | head -5

# View SchedulerConfig class and its attributes
sed -n '1600,1650p' src/megatron/bridge/training/config.py | grep -A 50 "class SchedulerConfig"

# Look for attribute definitions in SchedulerConfig
rg -A 30 "class SchedulerConfig" src/megatron/bridge/training/config.py | head -60
```


Fix incorrect scheduler attribute name on line 132.

Change cfg.scheduler.warmup_iters = 333 to cfg.scheduler.lr_warmup_iters = 333. The SchedulerConfig class defines lr_warmup_iters as the valid attribute (default: 0), not warmup_iters. Using the wrong name silently creates a new attribute without actually setting the scheduler's warmup iterations, leaving it at the default value of 0 instead of the intended 333. All other recipes correctly use cfg.scheduler.lr_warmup_iters.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py` at line 132, The
line sets a non-existent scheduler attribute (cfg.scheduler.warmup_iters) so the
scheduler's warmup stays at default; change that assignment to use the
SchedulerConfig's real field name by replacing cfg.scheduler.warmup_iters = 333
with cfg.scheduler.lr_warmup_iters = 333 so the scheduler uses 333 warmup
iterations (refer to cfg.scheduler and SchedulerConfig's lr_warmup_iters).
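The failure mode behind this bug is worth seeing directly: assigning to a misspelled attribute on an ordinary Python object silently creates a new attribute instead of raising. This sketch uses simplified stand-in classes (not the real `SchedulerConfig`) to show both the silent typo and how `__slots__` would surface it.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    """Simplified stand-in: only the field relevant to the bug."""

    lr_warmup_iters: int = 0


cfg = SchedulerConfig()
cfg.warmup_iters = 333          # typo: silently creates a brand-new attribute
print(cfg.lr_warmup_iters)      # → 0 (the intended 333 was never applied)


@dataclass
class StrictSchedulerConfig:
    """With __slots__, the same typo raises instead of passing silently."""

    __slots__ = ("lr_warmup_iters",)
    lr_warmup_iters: int


strict = StrictSchedulerConfig(lr_warmup_iters=0)
try:
    strict.warmup_iters = 333
except AttributeError as exc:
    print("caught:", exc)
```

Slotted configs (or `@dataclass(slots=True)` on Python 3.10+) are one way recipe configs could reject misspelled overrides like this at assignment time rather than letting them vanish.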

…rate Moonlight recipe to AutoBridge (#2428)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…2571)

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g
Contributor Author

ko3n1g commented Mar 3, 2026

/ok to test e39b6da

Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g
Contributor Author

ko3n1g commented Mar 3, 2026

/ok to test 8fe60aa

Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g
Contributor Author

ko3n1g commented Mar 3, 2026

/ok to test 44468af

Signed-off-by: oliver könig <okoenig@nvidia.com>
Labels: None yet

Projects: None yet

8 participants