Conversation
… in base class (#2052) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…config logging (#2184) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
… in base class for Nemotron models (#2225) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…retrain recipes (#2288) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com> Signed-off-by: malay-nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
📝 Walkthrough

This PR refactors Megatron Bridge recipes and model providers to use parameterless APIs with post-instantiation configuration mutation. Major changes include centralizing HF↔Megatron config translation on MegatronModelBridge, updating all model bridges to use super().provider_bridge() with in-place provider configuration, removing generation_config propagation, adding rope_scaling support, introducing new Kimi K2 and Nemotron 3 Nano configurations, and updating optimizer epsilon defaults from 1e-5 to 1e-8.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Client as Config Builder
    participant Recipe as pretrain_config()
    participant Common as _pretrain_common()
    participant HFBridge as AutoBridge
    participant Provider as GPTModelProvider
    participant Config as ConfigContainer
    Client->>Recipe: Call pretrain_config() [no args]
    Recipe->>Common: cfg = _pretrain_common()
    Common->>Config: Create ConfigContainer w/ defaults
    Common-->>Recipe: Return cfg
    Recipe->>HFBridge: Load HF model (if hf_path provided)
    HFBridge->>Provider: Convert to Megatron provider
    Provider-->>Recipe: Return provider
    Recipe->>Config: cfg.model = provider
    Recipe->>Config: cfg.mixed_precision = precision_config
    Recipe->>Config: cfg.dataset = {...}
    Recipe->>Config: cfg.train = {...}
    Recipe-->>Client: Return fully configured cfg
```
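For orientation, here is a minimal sketch of the post-instantiation pattern the diagram describes. It assumes the parameterless builder is importable directly from the recipe module (as with `qwen3_8b_pretrain_config` in this PR); the dataset tweaks at the end are illustrative, not required:

```python
from megatron.bridge.models import AutoBridge
from megatron.bridge.recipes.qwen.qwen3 import qwen3_8b_pretrain_config

# Parameterless builder: returns a ConfigContainer pre-populated with defaults.
cfg = qwen3_8b_pretrain_config()

# Post-instantiation mutation replaces the old keyword-argument API.
cfg.model = AutoBridge.from_hf_pretrained("Qwen/Qwen3-8B").to_megatron_provider(load_weights=False)
cfg.dataset.num_workers = 0
cfg.dataset.pin_memory = False
```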
```mermaid
sequenceDiagram
    participant HFModel as HF Pretrained
    participant Bridge as Model Bridge
    participant SuperClass as MegatronModelBridge
    participant Provider as GPTModelProvider
    participant Translator as Config Translator
    HFModel->>Bridge: Load model & config
    Bridge->>SuperClass: Call super().provider_bridge()
    SuperClass->>Translator: Apply hf_config_to_provider_kwargs()
    Translator->>Translator: Map CONFIG_MAPPING
    Translator->>Translator: Apply ACTIVATION_MAPPING
    Translator->>Translator: Handle rope_scaling variants
    Translator-->>SuperClass: Return provider_kwargs dict
    SuperClass->>Provider: Instantiate with base defaults
    Provider-->>Bridge: Return provider
    Bridge->>Provider: Set rope_scaling = ...
    Bridge->>Provider: Set activation_func = ...
    Bridge->>Provider: Set model-specific fields
    Bridge-->>HFModel: Return configured provider
```
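A corresponding sketch of the bridge side; the subclass name and the two provider fields are placeholders for illustration and do not come from this PR:

```python
from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge


class ExampleBridge(MegatronModelBridge):  # hypothetical bridge subclass
    def provider_bridge(self, hf_pretrained):
        # The base class now owns HF -> Megatron config translation and returns
        # a provider already built from the translated kwargs.
        provider = super().provider_bridge(hf_pretrained)
        # Model-specific adjustments happen in place rather than via constructor kwargs.
        provider.add_qkv_bias = True   # illustrative field
        provider.qk_layernorm = False  # illustrative field
        return provider
```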
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 8
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (14)
tests/unit_tests/recipes/test_deepseek_recipes.py (1)
87-98: ⚠️ Potential issue | 🟡 Minor — Add a unit category marker for this test.
The updated test should be tagged to satisfy test categorization requirements.
✅ Suggested change
```diff
 @pytest.mark.parametrize("recipe_func", _DEEPSEEK_RECIPE_FUNCS)
+@pytest.mark.unit
 def test_each_deepseek_recipe_builds_config(recipe_func: Callable, monkeypatch: pytest.MonkeyPatch):
```

As per coding guidelines, "Use pytest.mark to categorize tests (unit, integration, system)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/recipes/test_deepseek_recipes.py` around lines 87 - 98, This test needs a pytest unit marker: add `@pytest.mark.unit` above the test function that contains the code using cfg and _assert_basic_config (the test in tests/unit_tests/recipes/test_deepseek_recipes.py that asserts tokenizer behavior), and ensure pytest is imported (import pytest) at the top of the file if not already present; keep the existing assertions and logic unchanged.

tests/functional_tests/models/deepseek/test_deepseek_conversion.py (5)
96-97: ⚠️ Potential issue | 🟡 Minor — Remove the duplicate `save_pretrained` call.

`model.save_pretrained(model_dir, safe_serialization=True)` is called identically on two consecutive lines — the second write is redundant and re-serialises all weights for no benefit.

🐛 Proposed fix
```diff
-model.save_pretrained(model_dir, safe_serialization=True)
 model.save_pretrained(model_dir, safe_serialization=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/deepseek/test_deepseek_conversion.py` around lines 96 - 97, The test contains a duplicated call to model.save_pretrained(model_dir, safe_serialization=True): remove the redundant second invocation so the model is only serialized once; keep the remaining call to model.save_pretrained with safe_serialization=True to preserve the intended behavior.
96-97: ⚠️ Potential issue | 🟡 Minor — Remove the duplicate `save_pretrained` call.

`model.save_pretrained(model_dir, safe_serialization=True)` is called identically on two consecutive lines. The second call writes the same files again, wasting I/O and potentially overwriting any intermediate state.

🐛 Proposed fix
```diff
-model.save_pretrained(model_dir, safe_serialization=True)
 model.save_pretrained(model_dir, safe_serialization=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/deepseek/test_deepseek_conversion.py` around lines 96 - 97, The test contains a duplicate call to model.save_pretrained(model_dir, safe_serialization=True); remove the redundant second invocation so the model is only saved once (keep the first call that uses model.save_pretrained with model_dir and safe_serialization=True) to avoid unnecessary I/O and potential overwrites.
92-93: ⚠️ Potential issue | 🟡 Minor — Narrow the `except` clause — swallowing all exceptions hides real failures.

Per coding guidelines, the `except` clause should be limited to the smallest set of expected errors. A bare `except Exception: pass` will silently absorb `OSError`, `ConnectionError`, permission errors, etc., making CI failures hard to diagnose.

♻️ Suggested fix
```diff
-    except Exception:
-        pass
+    except OSError:
+        # Non-fatal: fallback tokenizer is optional for conversion tests
+        pass
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/deepseek/test_deepseek_conversion.py` around lines 92 - 93, The bare "except Exception: pass" in the test should be narrowed: replace it with explicit expected exceptions (for example FileNotFoundError, OSError, ConnectionError or the specific exception your try block can raise) and handle them (or log them) instead of silently swallowing; for anything truly unexpected either re-raise or log and fail the test. Locate the try/... except block that currently uses "except Exception: pass" and change it to "except <ExpectedError1>, <ExpectedError2> as e: ..." (or catch Exception as e and re-raise unless isinstance(e, (ExpectedError1, ...))) so only anticipated errors are suppressed and unexpected failures surface.
92-93: ⚠️ Potential issue | 🟡 Minor — Narrow the `except` clause — a bare `except Exception: pass` silently swallows all failures.

This violates the project guideline: "limit the except clause to the smallest set of errors possible." A broad silent swallow hides connectivity issues, permission errors, etc. At minimum, catch only the expected failure mode and log it.
♻️ Suggested fix
```diff
-    except Exception:
-        pass
+    except OSError:
+        # Network or file-system error; tokenizer is optional for conversion tests
+        pass
```
69-80: ⚠️ Potential issue | 🟡 Minor — Replace deprecated `use_auth_token` with the `token` parameter.

The `use_auth_token` parameter is deprecated in transformers (removed in v5.0.0). Replace with `token` on line 76 (`token=None,`). This parameter appears in multiple test files and should be updated consistently across the codebase.
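For illustration only — the class reference and directory below are hypothetical — the call-site change looks roughly like this:

```python
from transformers.dynamic_module_utils import get_class_from_dynamic_module

# Before: get_class_from_dynamic_module(..., use_auth_token=None)  # deprecated kwarg
# After: pass the credential via the `token` keyword instead.
model_cls = get_class_from_dynamic_module(
    "modeling_deepseek.DeepseekV2ForCausalLM",  # hypothetical class reference
    "/tmp/deepseek-checkpoint",                 # hypothetical model directory
    token=None,
)
```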
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/deepseek/test_deepseek_conversion.py` around lines 69 - 80, Update the deprecated parameter in the get_class_from_dynamic_module call: replace the removed use_auth_token argument with the new token argument (i.e., change use_auth_token=None to token=None) in the invocation of get_class_from_dynamic_module so the call signature matches transformers v5+; search other test files for additional uses of use_auth_token and update them similarly to token to keep the codebase consistent.

tests/unit_tests/models/qwen/test_qwen3_moe_bridge.py (1)
118-126: ⚠️ Potential issue | 🟡 Minor — Add a pytest.mark category to these unit tests.
Please mark these tests as unit tests to align with repo standards.
✅ Suggested change
```diff
+@pytest.mark.unit
 class TestQwen3MoEBridge:
```

As per coding guidelines: Use 'pytest.mark' to categorize tests (unit, integration, system).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/models/qwen/test_qwen3_moe_bridge.py` around lines 118 - 126, Mark the test function test_provider_bridge_basic with the unit category using pytest.mark: add `@pytest.mark.unit` above the test definition, ensure pytest is imported in the test module, and apply the same decoration to other tests in this file that are unit tests (e.g., any tests exercising Qwen3MoEBridge.provider_bridge) so they comply with the repo's pytest.mark categorization.

tests/unit_tests/recipes/utils/test_optimizer_utils.py (1)
15-15: ⚠️ Potential issue | 🟡 Minor — Typo in module docstring: "optinizer_utils" → "optimizer_utils".
✏️ Fix typo
-"""Tests for optinizer_utils module.""" +"""Tests for optimizer_utils module."""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/recipes/utils/test_optimizer_utils.py` at line 15, Fix the typo in the module docstring of the test file test_optimizer_utils.py: change the word "optinizer_utils" to "optimizer_utils" in the top-level docstring so the module name is spelled correctly.

src/megatron/bridge/recipes/qwen/qwen2.py (1)
16-16: 🛠️ Refactor suggestion | 🟠 Major — Use modern union syntax per coding guidelines.
This file uses `Optional[T]`, `Union[X, Y]`, and `List[T]` from `typing` throughout (e.g., `Qwen2CommonKwargs`, `Qwen2FinetuneKwargs`, the `_qwen2_finetune_common` signature), while `llama3.py` in the same PR correctly uses `T | None`, `X | Y`, and `list[T]`. The coding guidelines require the modern syntax.

♻️ Example fix for the import and TypedDict
```diff
-from typing import List, Optional, Union
+# No typing import needed for these; use built-in generics and | syntax

 class Qwen2CommonKwargs(TypedDict, total=False):
-    dir: Optional[str]
+    dir: str | None
-    data_paths: Optional[List[str]]
+    data_paths: list[str] | None
-    pipeline_dtype: Optional[torch.dtype]
+    pipeline_dtype: torch.dtype | None
-    precision_config: Optional[Union[MixedPrecisionConfig, str]]
+    precision_config: MixedPrecisionConfig | str | None
     # ... apply same pattern throughout
```

As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'", "Use 'X | Y' for union types instead of 'Union[X, Y]'", and "Use built-in generics (list, dict, tuple) instead of typing equivalents".
Also applies to: 40-78
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen/qwen2.py` at line 16, Replace usages of typing generics with modern built-in and PEP 604 union syntax across this module: remove List, Optional, Union imports and use list[T], T | None, and X | Y respectively. Update the TypedDicts Qwen2CommonKwargs and Qwen2FinetuneKwargs and any function signatures such as _qwen2_finetune_common to use list[...] instead of List[...], T | None instead of Optional[T], and A | B instead of Union[A, B]; ensure any default None annotations remain compatible. Keep only necessary imports from typing (e.g., TypedDict) and run a quick typecheck to confirm no remaining typing.List/Optional/Union usages.

tests/unit_tests/models/llama/test_llama_bridge.py (1)
37-122: ⚠️ Potential issue | 🟡 Minor — Add type hints to fixtures and test methods.
The new fixtures/tests omit argument and return type hints. Please annotate fixtures with concrete return types and tests with
`-> None` (and type fixture parameters) to satisfy typing rules; apply this consistently across the module.

💡 Example (apply broadly)

```diff
-    def llama_3_2_1b_config_dict(self):
+    def llama_3_2_1b_config_dict(self) -> dict[str, object]:
         """Create a sample Llama 3.2 1B configuration with RoPE scaling."""
         return {
             "architectures": ["LlamaForCausalLM"],
             ...
         }

-    def test_provider_bridge_returns_gpt_provider(self, mock_pretrained_llama_2):
+    def test_provider_bridge_returns_gpt_provider(
+        self, mock_pretrained_llama_2: PreTrainedCausalLM
+    ) -> None:
         """Test that provider_bridge returns a GPTModelProvider for Llama 2."""
         bridge = LlamaBridge()
```

As per coding guidelines, use type hints for function arguments and return types.
Also applies to: 123-341, 343-458, 461-515, 535-598, 601-638, 706-855
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/models/llama/test_llama_bridge.py` around lines 37 - 122, Several fixtures and test functions in this module (e.g., llama_3_2_1b_config_dict, llama_2_7b_config_dict, llama_config, llama_2_config, mock_pretrained_llama, mock_pretrained_llama_2 and the test functions referenced later) are missing type hints; update each fixture signature to include parameter types for injected fixtures and an explicit return type (e.g., -> Dict[str, Any] or -> LlamaConfig or -> Mock[PreTrainedCausalLM] as appropriate), and annotate all test functions with -> None; ensure imports for typing (Dict, Any, Optional, Iterator, etc.) are present if needed and apply the same pattern consistently across the module.

src/megatron/bridge/recipes/kimi/kimi_k2.py (1)
25-46: ⚠️ Potential issue | 🟡 Minor — Fix `_get_kimi_k2_pipeline_layout` typing and docstring.

`vp_size` is treated as nullable, but the signature doesn't reflect that, and the return type is missing. Also, the docstring should follow Google style.

💡 Suggested change
```diff
-def _get_kimi_k2_pipeline_layout(pp_size: int, vp_size: int):
-    """Get pipeline layout for Kimi-K2 based on PP and VP size."""
+def _get_kimi_k2_pipeline_layout(pp_size: int, vp_size: int | None) -> list[list[str]] | None:
+    """Get pipeline layout for Kimi-K2 based on PP and VP size.
+
+    Args:
+        pp_size: Pipeline parallel size.
+        vp_size: Virtual pipeline size (defaults to 1 when None).
+
+    Returns:
+        Pipeline layout per stage, or None when no layout is required.
+    """
```

As per coding guidelines, use type hints for function arguments and return types; use 'T | None' for nullable types instead of 'Optional[T]'; use Google style docstrings (parseable by Sphinx) for classes and functions.
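A minimal body sketch showing how the new annotation interacts with the `None` handling; the layout table here is an empty placeholder, not the real Kimi-K2 stage layout:

```python
def _get_kimi_k2_pipeline_layout(pp_size: int, vp_size: int | None) -> list[list[str]] | None:
    """Get pipeline layout for Kimi-K2 based on PP and VP size."""
    vp_size = 1 if vp_size is None else vp_size
    # Placeholder table keyed by (pp, vp); the real recipe maps these to per-stage layouts.
    layouts: dict[tuple[int, int], list[list[str]]] = {}
    return layouts.get((pp_size, vp_size))
```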
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/kimi/kimi_k2.py` around lines 25 - 46, Update the _get_kimi_k2_pipeline_layout signature to use explicit nullable typing for vp_size (use int | None) and add a return type of list[list[str]] | None; update the docstring to Google style with a brief description, Args (pp_size, vp_size), and Returns (pipeline layout or None). Keep the existing logic that normalizes vp_size to 1 when None (vp_size = 1 if vp_size is None else vp_size) and ensure the map and returned layout types match the annotated return type. Use the 'T | None' form (not Optional[T]) and ensure the function name _get_kimi_k2_pipeline_layout is the one you modify.

src/megatron/bridge/models/llama/llama_bridge.py (1)
15-31: ⚠️ Potential issue | 🟡 Minor — Rename module logger to comply with global naming rule.

Module-level globals should be upper snake case with a `G_` prefix.

Suggested fix
```diff
-logger = logging.getLogger(__name__)
+G_LOGGER = logging.getLogger(__name__)
```

As per coding guidelines: Use upper snake_case and prefix 'G' for global variables (e.g., G_MY_GLOBAL).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/llama/llama_bridge.py` around lines 15 - 31, The module-level logger variable `logger` violates the global naming rule; rename it to an upper snake-case global prefixed with G (e.g., `G_LOGGER`) and update all references in this module (e.g., any use of `logger` in this file) to the new name; ensure you only change the variable name and not the logging import or behavior so existing logging calls (info/debug/error) continue to work with `G_LOGGER`.

src/megatron/bridge/models/deepseek/deepseek_provider.py (1)
16-109: ⚠️ Potential issue | 🟡 Minor — Use PEP 604 unions and built-in generics in type hints.

Replace `Optional`, `Union`, and `List` with `T | None`, `X | Y`, and `list[T]` respectively, and remove unused typing imports.

Suggested fix
```diff
-from typing import Callable, List, Optional, Union
+from typing import Callable
@@
-    q_lora_rank: Optional[int] = 1536
+    q_lora_rank: int | None = 1536
@@
-    moe_layer_freq: Union[int, List[int]] = field(default_factory=lambda: [0] + [1] * 59)  # first layer is dense
+    moe_layer_freq: int | list[int] = field(default_factory=lambda: [0] + [1] * 59)  # first layer is dense
@@
-    q_lora_rank: Optional[int] = None
+    q_lora_rank: int | None = None
@@
-    moe_layer_freq: Union[int, List[int]] = field(default_factory=lambda: [0] + [1] * 26)  # first layer is dense
+    moe_layer_freq: int | list[int] = field(default_factory=lambda: [0] + [1] * 26)  # first layer is dense
@@
-    moe_layer_freq: Union[int, List[int]] = field(
+    moe_layer_freq: int | list[int] = field(
         default_factory=lambda: [0] * 3 + [1] * 58
     )  # first three layers are dense
@@
-    moe_layer_freq: Union[int, List[int]] = field(default_factory=lambda: [0] * 1 + [1] * 26)  # first layer is dense
+    moe_layer_freq: int | list[int] = field(default_factory=lambda: [0] * 1 + [1] * 26)  # first layer is dense
```
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/deepseek/deepseek_provider.py` around lines 16 - 109, Update type hints to modern PEP 604 and builtin generics and remove unused typing imports: replace Optional[int] with int | None for fields like DeepSeekModelProvider.q_lora_rank and DeepSeekV2LiteModelProvider.q_lora_rank, replace Union[int, List[int]] with int | list[int] for DeepSeekV2ModelProvider.moe_layer_freq, and replace List[int] with list[int] wherever used (e.g., the lambda default factories). Remove unused imports Optional, Union, List from the typing import list and tidy up the imports accordingly.

src/megatron/bridge/models/qwen/qwen2_bridge.py (1)
28-57: ⚠️ Potential issue | 🟡 Minor — Add type hints and a Google-style docstring to `provider_bridge`.

The method currently lacks type annotations and a complete docstring. Per coding guidelines, all public methods must include type hints for arguments and return types, plus Google-style docstrings with Args and Returns sections.
Suggested fix
```diff
+from megatron.bridge.models.gpt_provider import GPTModelProvider
+from megatron.bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM
 from megatron.bridge.models.conversion.mapping_registry import MegatronMappingRegistry
 from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge
 from megatron.bridge.models.conversion.param_mapping import (
     AutoMapping,
     GatedMLPMapping,
     QKVMapping,
 )
@@
-    def provider_bridge(self, hf_pretrained):
-        """Convert HuggingFace Qwen2 config to GPTModelProvider."""
+    def provider_bridge(self, hf_pretrained: PreTrainedCausalLM) -> GPTModelProvider:
+        """Convert HuggingFace Qwen2 config to GPTModelProvider.
+
+        Args:
+            hf_pretrained: HuggingFace PreTrainedCausalLM instance.
+
+        Returns:
+            GPTModelProvider configured for Qwen2.
+        """
         provider = super().provider_bridge(hf_pretrained)
```
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen/qwen2_bridge.py` around lines 28 - 57, The provider_bridge method in Qwen2Bridge lacks type annotations and a Google-style docstring; update the signature to include type hints (e.g., def provider_bridge(self, hf_pretrained: Union[str, "PreTrainedModel"]) -> "GPTModelProvider") and add a Google-style docstring with Args and Returns sections describing hf_pretrained and the returned GPTModelProvider, and import/forward-reference typing symbols (Union, PreTrainedModel, GPTModelProvider) as needed to satisfy annotations.
♻️ Duplicate comments (7)
scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py (1)
34-35: Same unused `mock` parameter as in the llama31 configs.

Same issue as flagged in `llama31_llm_pretrain.py` — the `mock: bool = True` parameter is declared but unused in all five function signatures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py` around lines 34 - 35, The parameter mock: bool = True is declared but never used; remove the unused mock parameter from the function signature(s) (e.g., nemotronh_56b_pretrain_config_gb300) and any other pretrain config functions in this file so signatures no longer include mock, or alternatively wire the mock flag into the function logic if a mocked behavior is intended—update all referenced function definitions and any callers to match the new signature (adjust imports/usage accordingly).

src/megatron/bridge/recipes/qwen/qwen3.py (5)
257-335: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 257 - 335, The qwen3_4b_pretrain_config function contains lines longer than the 119-character limit (e.g., the AutoBridge.from_hf_pretrained assignment, long comment/documentation strings, and some cfg assignments like cfg.tokenizer.tokenizer_model and dataset.blend); shorten them by breaking long string literals and expressions across multiple lines (use implicit continuation with parentheses or assign parts to intermediate variables), wrap long comments to multiple shorter lines, and split chained attribute assignments into multiple statements where needed so every line ≤119 chars while keeping the same identifiers (qwen3_4b_pretrain_config, AutoBridge.from_hf_pretrained, cfg.tokenizer.tokenizer_model, cfg.dataset.blend, etc.).
500-570: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 500 - 570, The qwen3_32b_pretrain_config function contains several lines exceeding the 119-character limit (e.g., the long assignment/comment lines such as cfg.model = AutoBridge.from_hf_pretrained("Qwen/Qwen3-32B").to_megatron_provider(load_weights=False), the cfg.dataset.blend explanatory comment, and the checkpoint override comment block); wrap these long expressions and comments across multiple shorter strings/lines and/or split chained calls and long comments into concatenated pieces so each source line stays <=119 characters while preserving the same values and wording (edit the qwen3_32b_pretrain_config function, focusing on the long lines noted above and similar long comments/assignments).
419-497: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 419 - 497, The function qwen3_14b_pretrain_config contains multiple lines exceeding the 119-char limit (e.g., the AutoBridge.from_hf_pretrained assignment, long comment lines, and long cfg.* assignment/comments). Fix by wrapping long string literals and long chained assignments across lines (use implicit continuation with parentheses for the AutoBridge.from_hf_pretrained call and split long comments into multiple shorter comment lines), and break any long cfg.* comments or example lines into shorter lines; target symbols to edit include qwen3_14b_pretrain_config, cfg.model = AutoBridge.from_hf_pretrained("Qwen/Qwen3-14B"), cfg.tokenizer.tokenizer_model, and long comment lines shown in the block so every line <= 119 chars.
338-416: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 338 - 416, The qwen3_8b_pretrain_config function contains lines exceeding the 119-char limit (repeat violation). Fix by breaking long comment strings and long assignments inside qwen3_8b_pretrain_config: wrap/line-break long comments into multiple shorter comment lines, split long expressions (e.g., the cfg.dataset.blend example tuple, long inline comments for FP8/mixed_precision) into multiple concatenated strings or separate local variables, and prefer short helper variables for long values (e.g., model_id = "Qwen/Qwen3-8B") used in cfg.model and cfg.tokenizer assignments so no single source line exceeds 119 chars; update any other long cfg.* assignment lines similarly. Ensure each modified line is <=119 chars and preserves the same values and behavior.
176-255: Same long-line issue repeats in this recipe block.

As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 176 - 255, Several lines in qwen3_1p7b_pretrain_config exceed the 119-character limit; locate and break up long statements such as the chained call cfg.model = AutoBridge.from_hf_pretrained("Qwen/Qwen3-1.7B").to_megatron_provider(load_weights=False), the long comment/assignment for cfg.dataset.blend, and the long checkpoint example comment lines so each physical line is <=119 chars. Fix by splitting long strings/comments across multiple concatenated literals or by introducing a short local variable (e.g., pretrained_id = "Qwen/Qwen3-1.7B") and using that in calls, and wrap long comments into multiple lines; ensure no functional change to qwen3_1p7b_pretrain_config, cfg.dataset.blend, or the checkpoint comment examples.

src/megatron/bridge/models/deepseek/deepseek_v3_bridge.py (1)
112-113: Same docstring-style guidance as above applies here.

As per coding guidelines, "Use Google style docstrings (parseable by Sphinx) for classes and functions".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/deepseek/deepseek_v3_bridge.py` around lines 112 - 113, The function-level docstring above the line "global_name = task.global_param_name" is not Google-style; update the docstring to a parseable Google style (short description, blank line, Args: with parameter descriptions, Returns: with return type/meaning, and Raises: if applicable) for the function in deepseek_v3_bridge.py that handles adding the rotary embedding inverse frequency parameter (the routine that references task.global_param_name / local variable global_name). Ensure the docstring explains the purpose, documents parameters (e.g., task, any flags), describes the return value (if any), and notes exceptions so Sphinx can parse it.
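As a generic illustration of the requested format — the function name and arguments here are hypothetical, not the actual deepseek_v3_bridge helper:

```python
def register_inv_freq_param(task, include_bias: bool = False) -> None:
    """Register the rotary-embedding inverse-frequency parameter for conversion.

    Args:
        task: Conversion task exposing ``global_param_name`` for the weight.
        include_bias: Whether an accompanying bias parameter should also be registered.

    Returns:
        None. The task is updated in place.
    """
```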
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (129)
- examples/quantization/pretrain_quantized_llama3_8b.py
- scripts/performance/configs/deepseek/deepseek_llm_pretrain.py
- scripts/performance/configs/deepseek/deepseek_workload_base_configs.py
- scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py
- scripts/performance/configs/kimi/__init__.py
- scripts/performance/configs/kimi/kimi_llm_pretrain.py
- scripts/performance/configs/kimi/kimi_workload_base_configs.py
- scripts/performance/configs/llama/llama31_llm_pretrain.py
- scripts/performance/configs/llama/llama3_llm_pretrain.py
- scripts/performance/configs/nemotronh/__init__.py
- scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py
- scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py
- scripts/performance/configs/nemotronh/nemotronh_llm_pretrain.py
- scripts/performance/configs/qwen/qwen3_llm_pretrain.py
- scripts/performance/configs/qwen/qwen3_workload_base_configs.py
- scripts/performance/utils/overrides.py
- scripts/performance/utils/utils.py
- src/megatron/bridge/models/conversion/model_bridge.py
- src/megatron/bridge/models/deepseek/common.py
- src/megatron/bridge/models/deepseek/deepseek_provider.py
- src/megatron/bridge/models/deepseek/deepseek_v2_bridge.py
- src/megatron/bridge/models/deepseek/deepseek_v3_bridge.py
- src/megatron/bridge/models/gemma/gemma2_bridge.py
- src/megatron/bridge/models/gemma/gemma3_bridge.py
- src/megatron/bridge/models/gemma/gemma_bridge.py
- src/megatron/bridge/models/glm/glm45_bridge.py
- src/megatron/bridge/models/gpt_oss/gpt_oss_bridge.py
- src/megatron/bridge/models/gpt_provider.py
- src/megatron/bridge/models/llama/llama_bridge.py
- src/megatron/bridge/models/llama/llama_provider.py
- src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py
- src/megatron/bridge/models/mistral/mistral_bridge.py
- src/megatron/bridge/models/mla_provider.py
- src/megatron/bridge/models/nemotron/nemotron_bridge.py
- src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py
- src/megatron/bridge/models/nemotronh/__init__.py
- src/megatron/bridge/models/nemotronh/nemotron_h_bridge.py
- src/megatron/bridge/models/olmoe/olmoe_bridge.py
- src/megatron/bridge/models/qwen/qwen2_bridge.py
- src/megatron/bridge/models/qwen/qwen3_bridge.py
- src/megatron/bridge/models/qwen/qwen3_moe_bridge.py
- src/megatron/bridge/models/qwen/qwen3_next_bridge.py
- src/megatron/bridge/models/qwen_vl/qwen25_vl_bridge.py
- src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py
- src/megatron/bridge/recipes/common.py
- src/megatron/bridge/recipes/deepseek/deepseek_v2.py
- src/megatron/bridge/recipes/deepseek/deepseek_v3.py
- src/megatron/bridge/recipes/gemma/gemma2.py
- src/megatron/bridge/recipes/gemma/gemma3.py
- src/megatron/bridge/recipes/glm/glm45.py
- src/megatron/bridge/recipes/gpt/gpt3_175b.py
- src/megatron/bridge/recipes/gpt_oss/gpt_oss.py
- src/megatron/bridge/recipes/kimi/kimi_k2.py
- src/megatron/bridge/recipes/llama/llama2.py
- src/megatron/bridge/recipes/llama/llama3.py
- src/megatron/bridge/recipes/moonlight/moonlight_16b.py
- src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
- src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py
- src/megatron/bridge/recipes/nemotronh/nemotronh.py
- src/megatron/bridge/recipes/olmoe/olmoe_7b.py
- src/megatron/bridge/recipes/qwen/qwen2.py
- src/megatron/bridge/recipes/qwen/qwen3.py
- src/megatron/bridge/recipes/qwen/qwen3_moe.py
- src/megatron/bridge/recipes/qwen/qwen3_next.py
- src/megatron/bridge/recipes/utils/optimizer_utils.py
- src/megatron/bridge/training/config.py
- src/megatron/bridge/training/setup.py
- tests/functional_tests/models/deepseek/test_deepseek_conversion.py
- tests/functional_tests/models/deepseek/test_deepseek_provider_mapping.py
- tests/functional_tests/models/gemma/test_gemma2_provider.py
- tests/functional_tests/models/gemma/test_gemma3_provider.py
- tests/functional_tests/models/gemma/test_gemma_provider.py
- tests/functional_tests/models/glm/test_glm45_provider.py
- tests/functional_tests/models/llama_nemotron/test_llama_nemotron_provider.py
- tests/functional_tests/models/nemotronh/test_nemotron_h_provider.py
- tests/functional_tests/models/qwen/test_qwen2_provider.py
- tests/functional_tests/models/qwen/test_qwen3_moe_provider.py
- tests/functional_tests/models/qwen/test_qwen3_provider.py
- tests/functional_tests/quantization/test_qat_workflow.py
- tests/functional_tests/recipes/test_llama_recipes_distill_3b-1b.py
- tests/functional_tests/recipes/test_perf_config_integration.py
- tests/functional_tests/recipes/utils.py
- tests/functional_tests/training/test_callbacks.py
- tests/functional_tests/training/test_decentralized_pg.py
- tests/functional_tests/training/test_finetune_dora.py
- tests/functional_tests/training/test_finetune_lora.py
- tests/functional_tests/training/test_inprocess_restart.py
- tests/functional_tests/training/test_nvrx_straggler.py
- tests/functional_tests/training/test_pretrain.py
- tests/functional_tests/training/test_pretrain_resume.py
- tests/functional_tests/training/test_sft.py
- tests/functional_tests/training/test_tensor_inspect.py
- tests/unit_tests/models/deepseek/test_deepseek_bridges.py
- tests/unit_tests/models/deepseek/test_deepseek_provider.py
- tests/unit_tests/models/gemma/test_gemma2_bridge.py
- tests/unit_tests/models/gemma/test_gemma3_bridge.py
- tests/unit_tests/models/gemma/test_gemma_bridge.py
- tests/unit_tests/models/glm/test_glm45_bridge.py
- tests/unit_tests/models/gpt_oss/test_gpt_oss_bridges.py
- tests/unit_tests/models/llama/test_llama_bridge.py
- tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
- tests/unit_tests/models/mistral/test_mistral_model_bridge.py
- tests/unit_tests/models/nemotron/test_nemotron_bridge.py
- tests/unit_tests/models/nemotronh/test_nemotron_h_bridge.py
- tests/unit_tests/models/olmoe/test_olmoe_bridge.py
- tests/unit_tests/models/qwen/test_qwen3_bridge.py
- tests/unit_tests/models/qwen/test_qwen3_moe_bridge.py
- tests/unit_tests/models/qwen/test_qwen3_next_bridge.py
- tests/unit_tests/models/qwen_vl/test_qwen25_vl_bridge.py
- tests/unit_tests/models/test_gpt_provider.py
- tests/unit_tests/recipes/gemma/test_gemma2_recipes.py
- tests/unit_tests/recipes/gpt/test_gpt3_175b.py
- tests/unit_tests/recipes/kimi/test_kimi_k2.py
- tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py
- tests/unit_tests/recipes/nemotronh/test_nemotron_nano_v2.py
- tests/unit_tests/recipes/nemotronh/test_nemotronh.py
- tests/unit_tests/recipes/qwen/test_qwen2_recipes.py
- tests/unit_tests/recipes/test_deepseek_recipes.py
- tests/unit_tests/recipes/test_gemma3_recipes.py
- tests/unit_tests/recipes/test_glm45_recipes.py
- tests/unit_tests/recipes/test_llama_recipes.py
- tests/unit_tests/recipes/test_moonlight_recipes.py
- tests/unit_tests/recipes/test_nemotronh_recipes.py
- tests/unit_tests/recipes/test_olmoe_recipes.py
- tests/unit_tests/recipes/test_qwen_recipes.py
- tests/unit_tests/recipes/test_run_plugins.py
- tests/unit_tests/recipes/utils/test_optimizer_utils.py
- tests/unit_tests/training/test_log_non_default_values.py
- tests/unit_tests/training/test_setup.py
💤 Files with no reviewable changes (23)
- src/megatron/bridge/models/qwen/qwen3_next_bridge.py
- src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py
- tests/unit_tests/models/gemma/test_gemma2_bridge.py
- tests/functional_tests/models/glm/test_glm45_provider.py
- src/megatron/bridge/models/mistral/mistral_bridge.py
- tests/functional_tests/models/qwen/test_qwen3_moe_provider.py
- src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py
- tests/functional_tests/models/gemma/test_gemma3_provider.py
- tests/unit_tests/models/gemma/test_gemma3_bridge.py
- tests/functional_tests/models/qwen/test_qwen2_provider.py
- tests/functional_tests/models/llama_nemotron/test_llama_nemotron_provider.py
- tests/functional_tests/models/gemma/test_gemma_provider.py
- src/megatron/bridge/models/qwen_vl/qwen25_vl_bridge.py
- tests/unit_tests/models/test_gpt_provider.py
- tests/functional_tests/models/deepseek/test_deepseek_provider_mapping.py
- src/megatron/bridge/models/deepseek/common.py
- tests/functional_tests/models/nemotronh/test_nemotron_h_provider.py
- tests/unit_tests/models/mistral/test_mistral_model_bridge.py
- tests/functional_tests/models/gemma/test_gemma2_provider.py
- tests/functional_tests/models/qwen/test_qwen3_provider.py
- tests/unit_tests/models/gemma/test_gemma_bridge.py
- tests/unit_tests/models/qwen_vl/test_qwen25_vl_bridge.py
- tests/unit_tests/models/qwen/test_qwen3_next_bridge.py
examples/quantization/pretrain_quantized_llama3_8b.py (around lines 160-165, excerpt under review):

```python
    if args.hf_path:
        logger.info(f"Using custom HuggingFace path: {args.hf_path}")
        recipe_kwargs["hf_path"] = args.hf_path
        # Import AutoBridge to create a new model provider with the custom HF path
        from megatron.bridge.models import AutoBridge

    cfg: ConfigContainer = pretrain_config(**recipe_kwargs)
    logger.info("Loaded base configuration")
    cfg.model = AutoBridge.from_hf_pretrained(args.hf_path).to_megatron_provider(load_weights=False)
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for AutoBridge class and its methods
find . -type f -name "*.py" -path "*/megatron/bridge/models/*" | head -20
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1120
🏁 Script executed:
#!/bin/bash
# Look for AutoBridge definition and from_hf_config method
rg "class AutoBridge|def from_hf_config|def from_hf_pretrained" --type=py -nRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1977
🏁 Script executed:
#!/bin/bash
# Check the specific file for import placement
cat -n examples/quantization/pretrain_quantized_llama3_8b.py | sed -n '150,170p'
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1095
🏁 Script executed:
#!/bin/bash
# Examine the from_hf_pretrained and from_hf_config implementations
sed -n '148,230p' src/megatron/bridge/models/conversion/auto_bridge.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3593
🏁 Script executed:
#!/bin/bash
# Also check the to_megatron_provider implementation
rg "def to_megatron_provider" -A 20 --type=pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 32005
🏁 Script executed:
#!/bin/bash
# Check what PreTrainedCausalLM.from_pretrained does and if it loads weights
rg "class AutoBridge" -A 50 src/megatron/bridge/models/conversion/auto_bridge.py | head -80Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2443
🏁 Script executed:
#!/bin/bash
# Get the full implementation of from_hf_pretrained
sed -n '194,250p' src/megatron/bridge/models/conversion/auto_bridge.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2566
🏁 Script executed:
#!/bin/bash
# Check if there's any rank guard usage in the AutoBridge or related code
rg "get_rank|rank_zero|dist.get_rank" src/megatron/bridge/models/conversion/ -A 2 -B 2Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1787
🏁 Script executed:
#!/bin/bash
# Get the full to_megatron_provider implementation
sed -n '238,320p' src/megatron/bridge/models/conversion/auto_bridge.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3635
🏁 Script executed:
#!/bin/bash
# Check if there are any attribute accesses on hf_pretrained that would fail if it's just a config
rg "self.hf_pretrained\." src/megatron/bridge/models/conversion/auto_bridge.py | head -20Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 528
🏁 Script executed:
#!/bin/bash
# Search for to_megatron_provider in AutoBridge specifically
rg "def to_megatron_provider" src/megatron/bridge/models/conversion/auto_bridge.py -A 80Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 4006
🏁 Script executed:
#!/bin/bash
# Check the config property and how hf_pretrained.config is accessed
rg "\.config" src/megatron/bridge/models/conversion/auto_bridge.py -B 2 -A 2Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 506
🏁 Script executed:
#!/bin/bash
# Check the beginning of the file to see how imports are organized
head -30 examples/quantization/pretrain_quantized_llama3_8b.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1327
🏁 Script executed:
#!/bin/bash
# Verify if AutoBridge is already imported at the top
rg "^import|^from" examples/quantization/pretrain_quantized_llama3_8b.py | head -20Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 684
All distributed ranks load the full HF model unnecessarily — OOM risk.
AutoBridge.from_hf_pretrained(args.hf_path) loads the entire model with all weights via PreTrainedCausalLM.from_pretrained(), and because there is no rank guard, every torchrun worker independently loads the full model in parallel. For LLaMA 3-8B that is ~16 GB per rank; with 8 ranks this consumes ~128 GB total VRAM and will cause OOM before pretraining starts.
Use AutoBridge.from_hf_config() instead to load only the configuration, then call to_megatron_provider(load_weights=False) to defer weight loading until after distributed initialization:
🛡️ Proposed fix
- # Import AutoBridge to create a new model provider with the custom HF path
- from megatron.bridge.models import AutoBridge
-
- cfg.model = AutoBridge.from_hf_pretrained(args.hf_path).to_megatron_provider(load_weights=False)
+ # Import AutoBridge to create a new model provider with the custom HF path
+ from megatron.bridge.models import AutoBridge
+ from transformers import AutoConfig
+
+ hf_cfg = AutoConfig.from_pretrained(args.hf_path)
+ cfg.model = AutoBridge.from_hf_config(hf_cfg).to_megatron_provider(load_weights=False)

Also move the AutoBridge import to the top of the file with other megatron.bridge.* imports to follow the import organization guideline.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/quantization/pretrain_quantized_llama3_8b.py` around lines 160 -
165, Replace the blocking call to AutoBridge.from_hf_pretrained(args.hf_path)
(which loads full weights on every rank) with
AutoBridge.from_hf_config(args.hf_path) so only the model config is read and
then call .to_megatron_provider(load_weights=False) to defer weight loading
until after distributed init; update the code that currently does
AutoBridge.from_hf_pretrained(...).to_megatron_provider(load_weights=False) to
use AutoBridge.from_hf_config(...) instead, and move the `from
megatron.bridge.models import AutoBridge` import to the top with other
megatron.bridge imports to follow import organization.
scripts/performance/configs/kimi/kimi_llm_pretrain.py (lines 45-83, excerpt under review):

```python
def kimi_k2_pretrain_config_gb300(
    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
) -> ConfigContainer:
    """GB300, baseline config."""
    base_cfg = get_workload_base_config(
        model_family_name="kimi",
        model_recipe_name="kimi_k2",
        gpu="gb300",
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )

    cfg = pretrain_config()
    precision_config = get_precision_config(precision)
    cfg.mixed_precision = precision_config

    if base_cfg.moe_flex_dispatcher_backend is not None:
        cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend

    if base_cfg.pp_layout:
        cfg.model.pipeline_model_parallel_layout = base_cfg.pp_layout
    else:
        # Recompute layout based on updated PP/VP sizes
        pp_size = base_cfg.pipeline_model_parallel_size
        vp_size = base_cfg.virtual_pipeline_model_parallel_size
        layout = _get_kimi_k2_pipeline_layout(pp_size, vp_size)
        cfg.model.pipeline_model_parallel_layout = layout

    set_kimi_k2_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)

    cfg.comm_overlap.overlap_grad_reduce = True

    # Setting num_workers and pin_memory to 0 and False respectively gives better performance.
    # we are debugging this and might change this in the future.
    cfg.dataset.num_workers = 0
    cfg.dataset.pin_memory = False
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find the file and search for pretrain_config() definition and comm_overlap usage
find . -name "kimi_llm_pretrain.py" -type f | head -5Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 126
🏁 Script executed:
#!/bin/bash
# Search for pretrain_config function definition
rg "def pretrain_config" -A 10 -B 2Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 53
🏁 Script executed:
#!/bin/bash
# Check for comm_overlap initialization in the file
rg "comm_overlap" -n scripts/performance/configs/kimi/kimi_llm_pretrain.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 278
🏁 Script executed:
#!/bin/bash
# Check all 4 function signatures to see if mock is used
rg "def kimi_k2_pretrain_config" -A 40 scripts/performance/configs/kimi/kimi_llm_pretrain.py | head -150Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 5316
🏁 Script executed:
#!/bin/bash
# Search where pretrain_config is defined - likely in another module
rg "def pretrain_config\(" --type pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 53
🏁 Script executed:
#!/bin/bash
# Get the top of the file to see imports
head -50 scripts/performance/configs/kimi/kimi_llm_pretrain.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1890
🏁 Script executed:
#!/bin/bash
# Check if mock is used in any of the functions
rg "mock" scripts/performance/configs/kimi/kimi_llm_pretrain.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 370
🏁 Script executed:
#!/bin/bash
# Find pretrain_config definition in the entire codebase
rg "def pretrain_config\(" -B 2 -A 5 --type pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 53
🏁 Script executed:
#!/bin/bash
# Check what module pretrain_config likely comes from
rg "pretrain_config" -B 3 scripts/performance/configs/kimi/kimi_llm_pretrain.py | head -20Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 494
🏁 Script executed:
#!/bin/bash
# Find the kimi_k2.py file
find . -name "kimi_k2.py" -type fRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 116
🏁 Script executed:
#!/bin/bash
# Search for kimi_k2_pretrain_config definition
rg "def kimi_k2_pretrain_config\(" -A 20 --type pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1734
🏁 Script executed:
#!/bin/bash
# Check ConfigContainer for comm_overlap field
rg "comm_overlap" -B 2 -A 2 --type py | head -50Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 4103
🏁 Script executed:
#!/bin/bash
# Find _pretrain_common definition
rg "def _pretrain_common\(" -A 30 --type pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2472
🏁 Script executed:
#!/bin/bash
# Get full content of kimi_k2.py to understand the config flow
wc -l src/megatron/bridge/recipes/kimi/kimi_k2.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 118
🏁 Script executed:
#!/bin/bash
# Check what _pretrain_common returns and if it initializes comm_overlap
rg "_pretrain_common" -B 2 -A 2 --type pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 50383
🏁 Script executed:
#!/bin/bash
# Search for comm_overlap initialization in common.py
rg "comm_overlap" src/megatron/bridge/recipes/common.py -B 2 -A 2Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 251
🏁 Script executed:
#!/bin/bash
# Get the full ConfigContainer initialization in _pretrain_common
sed -n '/def _pretrain_common/,/return cfg/p' src/megatron/bridge/recipes/common.py | head -100
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3456
🏁 Script executed:
#!/bin/bash
# Check the entire file structure of kimi_llm_pretrain.py to see all 4 functions
cat scripts/performance/configs/kimi/kimi_llm_pretrain.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 7142
Guard comm_overlap access and remove unused mock parameter
cfg.comm_overlap.overlap_grad_reduce = ... will raise AttributeError since _pretrain_common() initializes comm_overlap=None. Guard with a null check before assignment. Also, mock is unused across all four config builder functions—remove the parameter to avoid a misleading API signature.
Suggested fix (apply to each config builder)
- cfg.comm_overlap.overlap_grad_reduce = True
+ if cfg.comm_overlap is not None:
+     cfg.comm_overlap.overlap_grad_reduce = True

Also applies to lines 87-125 (gb200), 129-163 (b200), 166-200 (h100). Remove the mock: bool = True parameter from all four functions.
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 46-46: Unused function argument: mock
(ARG001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/configs/kimi/kimi_llm_pretrain.py` around lines 45 - 83,
The cfg.comm_overlap assignment can raise AttributeError because
_pretrain_common() may set comm_overlap=None; update the config builders (e.g.,
kimi_k2_pretrain_config_gb300) to check if cfg.comm_overlap is not None before
setting cfg.comm_overlap.overlap_grad_reduce, and remove the unused mock: bool =
True parameter from this function and the other three config builders (the
GB200, B200, H100 variants) so the signatures no longer include mock and callers
reflect the updated signature.
| moe_flex_dispatcher_backend="hybridep", | ||
| moe_a2a_overlap=False, | ||
| cuda_graph_impl="transformer_engine", | ||
| cuda_graph_scope=["attn", "moe_router", "moe_preprocess"], |
🧩 Analysis chain
🏁 Script executed:
rg -n --type=py "cuda_graph_scope" -C 3 -g '!*test*'
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 50383
🏁 Script executed:
# Check the WorkloadBaseConfig definition
rg -A 20 "class WorkloadBaseConfig" scripts/performance/utils/utils.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1046
🏁 Script executed:
# Get more lines from WorkloadBaseConfig to see cuda_graph_scope definition
sed -n '40,70p' scripts/performance/utils/utils.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1065
🏁 Script executed:
# Check the exact class definition and decorator
sed -n '30,60p' scripts/performance/utils/utils.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1386
🏁 Script executed:
# Check lines before the class definition to see if it's a dataclass
sed -n '25,35p' scripts/performance/utils/utils.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 622
Fix type annotation for cuda_graph_scope in WorkloadBaseConfig.
The cuda_graph_scope field is declared as Optional[str] = None in scripts/performance/utils/utils.py line 55, but lines 37 and 57 of this file assign list values like ["attn", "moe_router", "moe_preprocess"]. The downstream override handler in scripts/performance/utils/overrides.py accepts Optional[str | List[str]] and converts strings to lists, indicating the field should support both types. Update the type annotation in WorkloadBaseConfig to Optional[str | List[str]] = None to reflect actual usage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/configs/kimi/kimi_workload_base_configs.py` at line 37,
The WorkloadBaseConfig field cuda_graph_scope is declared as Optional[str] but
is assigned lists (e.g. ["attn", "moe_router", "moe_preprocess"]) and must
accept both strings and lists; update the type annotation on the
cuda_graph_scope attribute in the WorkloadBaseConfig class to Optional[str |
List[str]] = None so it matches usage and the downstream override handler that
accepts Optional[str | List[str]].
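A sketch of the widened field, written in the PEP 604 style preferred elsewhere in this review; the dataclass is shown in isolation for illustration:

```python
from dataclasses import dataclass


@dataclass
class WorkloadBaseConfig:  # excerpt for illustration; other fields omitted
    # Accept either a single scope name or a list of scopes, matching how the
    # kimi/qwen workload configs and the override handler use the field.
    cuda_graph_scope: str | list[str] | None = None
```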
src/megatron/bridge/models/conversion/model_bridge.py (lines 371-396, excerpt under review):

```python
is_mla_provider = self.PROVIDER_CLASS is not None and issubclass(self.PROVIDER_CLASS, MLAModelProvider)
rope_scaling = getattr(hf_config, "rope_scaling", None)

if rope_scaling is not None and isinstance(rope_scaling, dict):
    rope_type = rope_scaling.get("type") or rope_scaling.get("rope_type")
    if rope_type == "yarn":
        # Check if this is an MLA provider (uses direct field names)
        # or a GPT provider (uses yarn_ prefixed field names)
        if is_mla_provider:
            # MLA models: use direct field names (mscale, rotary_scaling_factor, etc.)
            mla_params = {}
            for hf_key, megatron_key in self.MLA_ROPE_SCALING_MAPPING:
                value = rope_scaling.get(hf_key)
                if value is not None:
                    mla_params[megatron_key] = value
            if mla_params:
                provider_kwargs["_mla_rope_params"] = mla_params
        else:
            # GPT models: use yarn_ prefixed field names
            yarn_params = {"position_embedding_type": "yarn"}
            for hf_key, megatron_key in self.YARN_ROPE_SCALING_MAPPING:
                yarn_params[megatron_key] = rope_scaling.get(hf_key)
            if "truncate" in rope_scaling:
                yarn_params["yarn_correction_range_round_to_int"] = rope_scaling["truncate"]
            if yarn_params:
                provider_kwargs["_yarn_params"] = yarn_params
```
Skip None values when building YARN params.
Right now missing YARN keys get set to None, which can overwrite provider defaults and break config validation downstream.
🐛 Proposed fix
- for hf_key, megatron_key in self.YARN_ROPE_SCALING_MAPPING:
- yarn_params[megatron_key] = rope_scaling.get(hf_key)
+ for hf_key, megatron_key in self.YARN_ROPE_SCALING_MAPPING:
+ value = rope_scaling.get(hf_key)
+ if value is not None:
+ yarn_params[megatron_key] = value

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/conversion/model_bridge.py` around lines 371 -
396, The YARN param builder currently assigns None for missing keys (in the loop
over YARN_ROPE_SCALING_MAPPING), which can overwrite provider defaults; change
the loop in model_bridge.py that iterates self.YARN_ROPE_SCALING_MAPPING to only
set yarn_params[megatron_key] = value when value is not None (mirroring the MLA
branch), and only add provider_kwargs["_yarn_params"] when yarn_params contains
actual scaling entries (i.e., beyond the initial "position_embedding_type") or
when a non-None truncate is present, so missing HF keys do not produce
None-valued overrides.
src/megatron/bridge/models/conversion/model_bridge.py (lines 504-519, excerpt under review):

```python
# Handle YARN rope scaling: check if provider has yarn_* params and build rope_scaling dict
yarn_rotary_scaling_factor = getattr(provider, "yarn_rotary_scaling_factor", None)
if yarn_rotary_scaling_factor is not None:
    if "rope_scaling" not in hf_config:
        hf_config["rope_scaling"] = {}
    hf_config["rope_scaling"]["rope_type"] = "yarn"

    for hf_key, megatron_key in cls.YARN_ROPE_SCALING_MAPPING:
        value = getattr(provider, megatron_key, None)
        if value is not None:
            hf_config["rope_scaling"][hf_key] = value

    yarn_correction_range_round_to_int = getattr(provider, "yarn_correction_range_round_to_int", None)
    if yarn_correction_range_round_to_int is not None:
        hf_config["rope_scaling"]["truncate"] = yarn_correction_range_round_to_int
```
Export MLA rope scaling to HF configs.
The new MLA rope-scaling fields are handled when importing HF configs, but they are not emitted on export. That loses critical settings for MLA models.
🐛 Proposed fix
+ from megatron.bridge.models.mla_provider import MLAModelProvider
+
+ if isinstance(provider, MLAModelProvider):
+ mla_rope_scaling: dict[str, float] = {}
+ for hf_key, megatron_key in cls.MLA_ROPE_SCALING_MAPPING:
+ value = getattr(provider, megatron_key, None)
+ if value is not None:
+ mla_rope_scaling[hf_key] = value
+ if mla_rope_scaling:
+ hf_config["rope_scaling"] = {"rope_type": "yarn", **mla_rope_scaling}
+ else:
yarn_rotary_scaling_factor = getattr(provider, "yarn_rotary_scaling_factor", None)
if yarn_rotary_scaling_factor is not None:
if "rope_scaling" not in hf_config:
hf_config["rope_scaling"] = {}
hf_config["rope_scaling"]["rope_type"] = "yarn"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/conversion/model_bridge.py` around lines 504 -
519, Export currently doesn't emit MLA rope-scaling fields so rope_scaling
settings are lost; in the export path that builds hf_config (the same code area
using cls.YARN_ROPE_SCALING_MAPPING and provider attributes), detect
MLA/yarn-related provider attributes (yarn_rotary_scaling_factor,
yarn_correction_range_round_to_int and the keys listed in
cls.YARN_ROPE_SCALING_MAPPING) and populate hf_config["rope_scaling"] with
rope_type="yarn" and the mapped hf keys and truncate flag when present; update
the export logic that reads from provider into hf_config so it mirrors the
import handling shown here (use the provider attribute names from the mapping
and yarn_* attributes to set hf_config["rope_scaling"]).
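Along the same lines, a rough export-side sketch that mirrors those import-side guards; the attribute names and mapping here are assumptions borrowed from the import mapping above, not the repository's actual export code:

```python
# Sketch under assumed attribute names: emit rope_scaling only from provider
# attributes that are actually set, so the HF config never carries None values.
ROPE_SCALING_EXPORT_MAPPING = [  # placeholder (hf_key, provider_attr) pairs
    ("factor", "yarn_rotary_scaling_factor"),
    ("beta_fast", "yarn_beta_fast"),
    ("beta_slow", "yarn_beta_slow"),
]


def export_rope_scaling(provider, hf_config: dict) -> None:
    rope_scaling = {}
    for hf_key, provider_attr in ROPE_SCALING_EXPORT_MAPPING:
        value = getattr(provider, provider_attr, None)
        if value is not None:
            rope_scaling[hf_key] = value
    if rope_scaling:  # only write the section when there is something to export
        hf_config["rope_scaling"] = {"rope_type": "yarn", **rope_scaling}
```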
```python
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 4
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None
cfg.model.virtual_pipeline_model_parallel_size = None
```
🧩 Analysis chain

🏁 Scripts executed:

```bash
# First, let's find and examine the target file
find . -type f -name "gpt_oss.py" | head -20

# Read the file to see lines 272-276 in context
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '260,290p'

# Search for 20B config and other pipeline_dtype references in the file
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | grep -E "(20B|pipeline_dtype)"

# Check the 20B pre-training config settings
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '133,170p'

# Check the 120B config context and finetuning configs
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '250,290p'

# Check finetuning configs
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '361,420p'

# Continue reading the _gpt_oss_finetune_common function
cat -n ./src/megatron/bridge/recipes/gpt_oss/gpt_oss.py | sed -n '404,460p'

# Search for other recipe files to see how they handle pipeline_dtype
find ./src/megatron/bridge/recipes -type f -name "*.py" | head -10

# Search for pipeline_dtype in other recipe files
rg "pipeline_dtype" ./src/megatron/bridge/recipes --type py -B 2 -A 2
```
Set pipeline_dtype to torch.bfloat16 for PP=4 configs.
Both the 20B and 120B pre-training configurations use pipeline parallelism (PP=4) but have pipeline_dtype = None, which violates the pattern established across other recipe files. When pipeline model parallelism > 1, pipeline_dtype must be explicitly set to torch.bfloat16.
🔧 Fix for both configs

```diff
- cfg.model.pipeline_dtype = None
+ cfg.model.pipeline_dtype = torch.bfloat16
```

Apply this fix to both:

- Line 156 (20B pre-training config)
- Line 275 (120B pre-training config)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
  cfg.model.tensor_model_parallel_size = 2
  cfg.model.pipeline_model_parallel_size = 4
  cfg.model.pipeline_model_parallel_layout = None
- cfg.model.pipeline_dtype = None
+ cfg.model.pipeline_dtype = torch.bfloat16
  cfg.model.virtual_pipeline_model_parallel_size = None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py` around lines 272 - 276, The
pipeline_dtype is left as None for configs where
cfg.model.pipeline_model_parallel_size == 4; update those occurrences in
src/megatron/bridge/recipes/gpt_oss/gpt_oss.py by setting
cfg.model.pipeline_dtype = torch.bfloat16 (ensure torch is imported/available in
the module) for both the 20B and 120B pre-training config blocks where
cfg.model.pipeline_model_parallel_size is 4 so pipeline_dtype is explicitly set
whenever pipeline parallelism > 1.
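As a side note, the convention the reviewer points to could also be enforced with a small post-build guard; this is only a sketch, assuming the config object exposes the model fields used in the recipes above:

```python
import torch


def ensure_pipeline_dtype(cfg) -> None:
    """Sketch of a guard: when pipeline parallelism is enabled, require an
    explicit pipeline_dtype rather than silently leaving it as None."""
    pp = getattr(cfg.model, "pipeline_model_parallel_size", 1) or 1
    if pp > 1 and getattr(cfg.model, "pipeline_dtype", None) is None:
        # Default to bf16, matching the other recipe files in this repository.
        cfg.model.pipeline_dtype = torch.bfloat16
```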
```python
# Optimizer hyperparameters
cfg.optimizer.lr = 1.6e-3
cfg.optimizer.weight_decay = 0.1
cfg.scheduler.min_lr = 1.6e-5
```
🧩 Analysis chain

🏁 Scripts executed:

```bash
# Check whether min_lr lives on optimizer or scheduler
rg -n "\.(optimizer|scheduler)\.min_lr" --type=py -g 'src/megatron/bridge/recipes/**'

# Look at the context around line 131 in nemotron_3_nano.py
sed -n '120,140p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py

# Search for scheduler and optimizer config definitions and their min_lr usage
rg -n "min_lr" --type=py -B2 -A2 src/megatron/bridge/recipes/gpt/gpt3_175b.py | head -30
```
Change cfg.scheduler.min_lr to cfg.optimizer.min_lr at line 131.
All other recipes (GPT 175B, LLaMA3, GLM45, Nemotron Nano V2 VL) set cfg.optimizer.min_lr. This file is the only outlier using cfg.scheduler.min_lr, which is inconsistent with the established pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py` at line 131,
Replace the inconsistent assignment of cfg.scheduler.min_lr with
cfg.optimizer.min_lr in the Nemotron 3 Nano recipe: locate the line setting
cfg.scheduler.min_lr (currently 1.6e-5) and change it to assign that value to
cfg.optimizer.min_lr instead, and scan the file for any other uses of
cfg.scheduler.min_lr to align them with the optimizer config pattern used in
other recipes.
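Spelled out, the consistent form of that block would look like the following (values taken from the snippet quoted below; only the attribute holding min_lr changes):

```python
# Optimizer hyperparameters, following the pattern used by the other recipes
cfg.optimizer.lr = 1.6e-3
cfg.optimizer.weight_decay = 0.1
cfg.optimizer.min_lr = 1.6e-5  # on the optimizer config, not the scheduler
```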
```python
cfg.optimizer.lr = 1.6e-3
cfg.optimizer.weight_decay = 0.1
cfg.scheduler.min_lr = 1.6e-5
cfg.scheduler.warmup_iters = 333
```
🧩 Analysis chain

🏁 Scripts executed:

```bash
# Find SchedulerConfig class definition
find . -name "*.py" -type f | xargs grep -l "class SchedulerConfig" | head -5

# Check warmup_iters vs lr_warmup_iters usage in recipes
rg -n "scheduler\.(lr_)?warmup_iters" --type=py

# Look at the specific file to confirm line 132
sed -n '125,140p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py

# Search for _pretrain_common to understand what it sets
find . -name "*.py" -type f | xargs grep -l "_pretrain_common" | head -5

# View SchedulerConfig class and its attributes
sed -n '1600,1650p' src/megatron/bridge/training/config.py | grep -A 50 "class SchedulerConfig"

# Look for attribute definitions in SchedulerConfig
rg -A 30 "class SchedulerConfig" src/megatron/bridge/training/config.py | head -60
```
Fix incorrect scheduler attribute name on line 132.
Change cfg.scheduler.warmup_iters = 333 to cfg.scheduler.lr_warmup_iters = 333. The SchedulerConfig class defines lr_warmup_iters as the valid attribute (default: 0), not warmup_iters. Using the wrong name silently creates a new attribute without actually setting the scheduler's warmup iterations, leaving it at the default value of 0 instead of the intended 333. All other recipes correctly use cfg.scheduler.lr_warmup_iters.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py` at line 132, The
line sets a non-existent scheduler attribute (cfg.scheduler.warmup_iters) so the
scheduler's warmup stays at default; change that assignment to use the
SchedulerConfig's real field name by replacing cfg.scheduler.warmup_iters = 333
with cfg.scheduler.lr_warmup_iters = 333 so the scheduler uses 333 warmup
iterations (refer to cfg.scheduler and SchedulerConfig's lr_warmup_iters).
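To make the silent-attribute failure mode concrete, here is a tiny standalone demonstration (not code from this repository; SchedulerConfigSketch is a hypothetical stand-in for the real SchedulerConfig):

```python
from dataclasses import dataclass, fields


@dataclass
class SchedulerConfigSketch:  # hypothetical stand-in for the real SchedulerConfig
    lr_warmup_iters: int = 0


cfg = SchedulerConfigSketch()
cfg.warmup_iters = 333      # silently creates a new, unused attribute
print(cfg.lr_warmup_iters)  # -> 0: the real field is still at its default

# A cheap guard: flag instance attributes that are not declared dataclass fields.
declared = {f.name for f in fields(cfg)}
unexpected = set(vars(cfg)) - declared
print(unexpected)           # -> {'warmup_iters'}
```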
…rate Moonlight recipe to AutoBridge (#2428) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…2571) Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com> Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: oliver könig <okoenig@nvidia.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test e39b6da

/ok to test 8fe60aa

Signed-off-by: oliver könig <okoenig@nvidia.com>

/ok to test 44468af
What does this PR do?
Includes:
- Bump NVRX #2563

Does not include:

- feat: default packed_sequence to True in all finetune recipes #2284 (will be handled separately)
- Fix LLAMA3 LoRa TFLOPs Formula #2416 (not ready/reverted)

Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
New Features
Improvements
Bug Fixes