Skip to content

[training,ci] fix: guard get_mup_config_overrides import for mcore main compat#2846

Merged
yaoyu-33 merged 4 commits intomainfrom
yuya/mcore-compat-guard-mup-overrides
Mar 18, 2026
Merged

[training,ci] fix: guard get_mup_config_overrides import for mcore main compat#2846
yaoyu-33 merged 4 commits intomainfrom
yuya/mcore-compat-guard-mup-overrides

Conversation

@yaoyu-33
Copy link
Contributor

@yaoyu-33 yaoyu-33 commented Mar 17, 2026

Problem

Launch_Unit_Tests_Core on the mcore bump PR #2829 fails with:

ImportError: cannot import name 'get_mup_config_overrides' from 'megatron.core.optimizer'

get_mup_config_overrides was added by the MuP scaling feature (#2666) and lives in mcore dev, but has not yet landed in mcore main (the branch tracked by the submodule).

Fix

Guard the import in optim.py with try/except ImportError and a _HAS_MUP_CONFIG_OVERRIDES flag. The μP optimizer path is silently skipped when the symbol is unavailable (safe, since μP is opt-in via use_mup=True on the model config).

Both the import guard and usage site are marked with TODO: Remove once get_mup_config_overrides lands in mcore main.

Also adds .claude/skills/mcore-compat/SKILL.md documenting this guard pattern for future main/dev divergence cases.

TODO

Remove the guard in a follow-up PR after the next mcore bump that includes get_mup_config_overrides in main.

Summary by CodeRabbit

  • Documentation

    • Added developer documentation for mcore compatibility guard patterns and repository architecture guidance.
  • Chores

    • Increased CI/CD unit test timeout for improved pipeline reliability.
    • Improved optimizer configuration robustness to gracefully handle optional feature availability across environments.
    • Extended test infrastructure with additional model provider configurations for enhanced testing coverage.

@yaoyu-33 yaoyu-33 requested a review from a team as a code owner March 17, 2026 20:14
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33
Copy link
Contributor Author

/ok to test cd5fecb

@yaoyu-33 yaoyu-33 force-pushed the yuya/mcore-compat-guard-mup-overrides branch from cd5fecb to 08cc75e Compare March 17, 2026 20:16
@yaoyu-33
Copy link
Contributor Author

/ok to test 08cc75e

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

This pull request adds documentation on guard patterns for handling optional mcore features, implements a conditional import guard for μP functionality, introduces a new test model provider for μP testing, updates CI/CD timeout configuration, and provides a comprehensive guide for Claude Code within the repository.

Changes

Cohort / File(s) Summary
Documentation & Guidance
.claude/skills/mcore-compat/SKILL.md, CLAUDE.md
New documentation outlining mcore-compat guard patterns with import/usage guards and removal criteria; comprehensive guide for Claude Code including architecture, conventions, dependencies, and available skills.
CI/CD Configuration
.github/workflows/cicd-main.yml
Increased unit test run timeout from 15 to 18 minutes (3-minute extension).
μP Feature Guards
src/megatron/bridge/training/optim.py
Added try/except import guard for get_mup_config_overrides with _HAS_MUP_CONFIG_OVERRIDES flag; conditioned setup_optimizer to apply μP overrides only when the symbol is available.
Test Model Provider
tests/functional_tests/training/test_pretrain.py
Introduced Llama32ModelProvider1B class extending GPTModelProvider with detailed Llama 3.2 configuration (normalization, activation, ROPE, layers, sizes, embeddings) for μP testing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: guarding the get_mup_config_overrides import for mcore main compatibility. It is concise, clear, and specifies the key issue being addressed.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed PR contains minor changes: documentation additions, CI adjustment, defensive compatibility guard for optional μP functionality, and test infrastructure. No breaking changes, public API modifications, or significant functional impact.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch yuya/mcore-compat-guard-mup-overrides
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.

Change the reviews.profile setting to assertive to make CodeRabbit's nitpick more issues in your PRs.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/functional_tests/training/test_pretrain.py (1)

252-266: ⚠️ Potential issue | 🟠 Major

Add skip guard for test when μP override API is unavailable.

This test asserts that a specific log message appears when μP is enabled (lines 315–321), but that log is only generated when get_mup_config_overrides is available in the mcore variant. On compatible mcore versions lacking this API, the test will fail with "Expected μP optimizer override log message but found none" even though the code gracefully skips μP (as designed by the try/except guard in optim.py). Skipping the test in such environments is the correct behavior.

Proposed fix
     def test_pretrain_with_mup(self, tmp_path, caplog):
         """
         Test end to end training with μP (Maximal Update Parameterization) enabled.
         ...
         """
+        try:
+            from megatron.core.optimizer import get_mup_config_overrides  # noqa: F401
+        except ImportError:
+            pytest.skip("μP overrides API is unavailable in current mcore variant")
+
         initialize_distributed()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_pretrain.py` around lines 252 - 266, The
test assumes μP override API exists; add a pre-check at the start of the test to
skip when the μP override function is unavailable (e.g., test should detect
presence of get_mup_config_overrides or the relevant symbol in the mcore.optim
module) so it only asserts the μP optimizer override log when that API exists;
locate the test function in tests/functional_tests/training/test_pretrain.py
that constructs Llama32ModelProvider1B and examines logs for the μP override
message, add a short guard that calls getattr/membership check for
get_mup_config_overrides (or equivalent) and uses the test framework's skip
mechanism if missing, then proceed with the existing log assertions otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CLAUDE.md`:
- Around line 116-117: The sentence in CLAUDE.md claiming the CI uses the
`mcore_variant` input is stale; update the description to reflect that
`cicd-main.yml` now accepts `mcore_ref` and `mcore_repo` (not `mcore_variant`)
when selecting the Megatron-LM submodule, and ensure the guidance about the
`3rdparty/Megatron-LM/` submodule tracking `main` remains accurate—search for
the current sentence referencing `mcore_variant` and replace it with text that
cites `mcore_ref`/`mcore_repo` and clarifies that `dev` is only used via CI
inputs as appropriate.

In `@src/megatron/bridge/training/optim.py`:
- Around line 82-83: The code currently checks _HAS_MUP_CONFIG_OVERRIDES before
honoring model_config.use_mup which silently skips μP when the helper is
missing; change the logic in the block around _HAS_MUP_CONFIG_OVERRIDES and
get_mup_config_overrides so that you first check getattr(model_config,
"use_mup", False) and if True verify that get_mup_config_overrides is available
(i.e., _HAS_MUP_CONFIG_OVERRIDES is True); if it is not available raise a
RuntimeError with a clear message indicating μP was requested but
get_mup_config_overrides is missing, otherwise call get_mup_config_overrides to
populate mup_overrides and continue as before.

---

Outside diff comments:
In `@tests/functional_tests/training/test_pretrain.py`:
- Around line 252-266: The test assumes μP override API exists; add a pre-check
at the start of the test to skip when the μP override function is unavailable
(e.g., test should detect presence of get_mup_config_overrides or the relevant
symbol in the mcore.optim module) so it only asserts the μP optimizer override
log when that API exists; locate the test function in
tests/functional_tests/training/test_pretrain.py that constructs
Llama32ModelProvider1B and examines logs for the μP override message, add a
short guard that calls getattr/membership check for get_mup_config_overrides (or
equivalent) and uses the test framework's skip mechanism if missing, then
proceed with the existing log assertions otherwise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bfd070d5-cd5d-4565-a39c-284506785e08

📥 Commits

Reviewing files that changed from the base of the PR and between f689296 and cd5fecb.

📒 Files selected for processing (5)
  • .claude/skills/mcore-compat/SKILL.md
  • .github/workflows/cicd-main.yml
  • CLAUDE.md
  • src/megatron/bridge/training/optim.py
  • tests/functional_tests/training/test_pretrain.py

CLAUDE.md Outdated
Comment on lines +116 to +117
The `3rdparty/Megatron-LM/` submodule pointer **must always track the `main` commit** (not `dev`). The `dev` variant is only used in CI via `mcore_variant` in `.github/workflows/cicd-main.yml`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix stale CI input reference (mcore_variant).

cicd-main.yml in this PR uses mcore_ref/mcore_repo inputs, not mcore_variant. This sentence is currently misleading for contributors.

📝 Proposed doc fix
-The `dev` variant is only used in CI via `mcore_variant` in `.github/workflows/cicd-main.yml`.
+Main/dev compatibility testing in CI is driven by `workflow_dispatch` inputs like
+`mcore_ref` (and optionally `mcore_repo`) in `.github/workflows/cicd-main.yml`.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
The `3rdparty/Megatron-LM/` submodule pointer **must always track the `main` commit** (not `dev`). The `dev` variant is only used in CI via `mcore_variant` in `.github/workflows/cicd-main.yml`.
The `3rdparty/Megatron-LM/` submodule pointer **must always track the `main` commit** (not `dev`). Main/dev compatibility testing in CI is driven by `workflow_dispatch` inputs like `mcore_ref` (and optionally `mcore_repo`) in `.github/workflows/cicd-main.yml`.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~116-~116: The official name of this software platform is spelled with a capital “H”.
Context: ... only used in CI via mcore_variant in .github/workflows/cicd-main.yml. ## Available...

(GITHUB)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 116 - 117, The sentence in CLAUDE.md claiming the CI
uses the `mcore_variant` input is stale; update the description to reflect that
`cicd-main.yml` now accepts `mcore_ref` and `mcore_repo` (not `mcore_variant`)
when selecting the Megatron-LM submodule, and ensure the guidance about the
`3rdparty/Megatron-LM/` submodule tracking `main` remains accurate—search for
the current sentence referencing `mcore_variant` and replace it with text that
cites `mcore_ref`/`mcore_repo` and clarifies that `dev` is only used via CI
inputs as appropriate.

Comment on lines 82 to 83
if _HAS_MUP_CONFIG_OVERRIDES and getattr(model_config, "use_mup", False):
mup_overrides = get_mup_config_overrides(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether any checked-in configs/tests explicitly enable use_mup
# (helps assess immediate impact of fail-fast behavior).
rg -nP '\buse_mup\s*=\s*True\b'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 944


🏁 Script executed:

# First, let's see the current implementation in optim.py around lines 82-83
sed -n '70,100p' src/megatron/bridge/training/optim.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1524


🏁 Script executed:

# Check what _HAS_MUP_CONFIG_OVERRIDES is and how it's defined
rg -n '_HAS_MUP_CONFIG_OVERRIDES' src/megatron/bridge/training/optim.py -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 623


🏁 Script executed:

# Check the full context of the function containing lines 82-83
sed -n '1,150p' src/megatron/bridge/training/optim.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5993


🏁 Script executed:

# Look at the test that expects use_mup=True to work
sed -n '65,110p' tests/unit_tests/training/test_optim.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2330


🏁 Script executed:

# Check the functional test that verifies use_mup=True flows through
sed -n '195,250p' tests/functional_tests/training/test_pretrain.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2224


🏁 Script executed:

# Check if there are any environment-specific skips or version checks in functional tests
rg -n 'skip|version|_HAS_MUP' tests/functional_tests/training/test_pretrain.py -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check if there are any other fallback/graceful-degradation patterns for optional features
rg -n '_HAS_|ImportError|try.*except' src/megatron/bridge/training/optim.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 337


Raise an error when μP is explicitly enabled but unavailable.

When use_mup=True is set on the model config but get_mup_config_overrides is unavailable in the current Megatron-Core version, the code silently skips μP setup. This masks misconfiguration and produces unintended training dynamics where the user expects μP to be applied but it is not.

The proposed fix correctly changes the logic to fail fast: first check if μP is requested, then explicitly verify that the required function is available before proceeding. If the function is unavailable, raise a clear RuntimeError instead of silently degrading.

Proposed fix
-    if _HAS_MUP_CONFIG_OVERRIDES and getattr(model_config, "use_mup", False):
-        mup_overrides = get_mup_config_overrides(
-            config=optimizer_config,
-            mup_width_mult=model_config.mup_width_mult,
-            optimizer_type=optimizer_config.optimizer,
-        )
-        if mup_overrides:
-            config_overrides = {**(config_overrides or {}), **mup_overrides}
-            G_LOGGER.info(
-                f"μP enabled (width_mult={model_config.mup_width_mult:.4g}): "
-                f"applied {len(mup_overrides)} optimizer param-group override(s)."
-            )
+    if getattr(model_config, "use_mup", False):
+        if not _HAS_MUP_CONFIG_OVERRIDES:
+            raise RuntimeError(
+                "μP is enabled (`use_mup=True`) but `get_mup_config_overrides` is unavailable "
+                "in the current Megatron-Core version."
+            )
+        mup_overrides = get_mup_config_overrides(
+            config=optimizer_config,
+            mup_width_mult=model_config.mup_width_mult,
+            optimizer_type=optimizer_config.optimizer,
+        )
+        if mup_overrides:
+            config_overrides = {**(config_overrides or {}), **mup_overrides}
+            G_LOGGER.info(
+                f"μP enabled (width_mult={model_config.mup_width_mult:.4g}): "
+                f"applied {len(mup_overrides)} optimizer param-group override(s)."
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/optim.py` around lines 82 - 83, The code
currently checks _HAS_MUP_CONFIG_OVERRIDES before honoring model_config.use_mup
which silently skips μP when the helper is missing; change the logic in the
block around _HAS_MUP_CONFIG_OVERRIDES and get_mup_config_overrides so that you
first check getattr(model_config, "use_mup", False) and if True verify that
get_mup_config_overrides is available (i.e., _HAS_MUP_CONFIG_OVERRIDES is True);
if it is not available raise a RuntimeError with a clear message indicating μP
was requested but get_mup_config_overrides is missing, otherwise call
get_mup_config_overrides to populate mup_overrides and continue as before.

@yaoyu-33
Copy link
Contributor Author

/ok to test 33481d7

…in compat

`get_mup_config_overrides` exists in mcore dev but not yet in the main
branch tracked by the submodule. Guard the import with try/except so
unit tests pass against the submodule mcore (fixes bump PR #2829 CI).

Also adds the `mcore-compat` skill documenting the pattern for future
main/dev divergence cases.

TODO: Remove the guard once `get_mup_config_overrides` lands in mcore main.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Copy link
Contributor Author

/ok to test b7cc766

@yaoyu-33
Copy link
Contributor Author

/claude review

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
@yaoyu-33 yaoyu-33 merged commit 31d0ade into main Mar 18, 2026
68 of 71 checks passed
@yaoyu-33 yaoyu-33 deleted the yuya/mcore-compat-guard-mup-overrides branch March 18, 2026 18:17
@yaoyu-33
Copy link
Contributor Author

copy-pr-bot bot pushed a commit that referenced this pull request Mar 19, 2026
…in compat (#2846)

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-more-tests Requires additional L0 and L1 test coverage before merge ready-to-merge PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants