[training,ci] fix: guard get_mup_config_overrides import for mcore main compat by yaoyu-33 · Pull Request #2846 · NVIDIA-NeMo/Megatron-Bridge

yaoyu-33 · 2026-03-17T20:14:10Z

Problem

Launch_Unit_Tests_Core on the mcore bump PR #2829 fails with:

ImportError: cannot import name 'get_mup_config_overrides' from 'megatron.core.optimizer'

get_mup_config_overrides was added by the MuP scaling feature (#2666) and lives in mcore dev, but has not yet landed in mcore main (the branch tracked by the submodule).

Fix

Guard the import in optim.py with try/except ImportError and a _HAS_MUP_CONFIG_OVERRIDES flag. The μP optimizer path is silently skipped when the symbol is unavailable (safe, since μP is opt-in via use_mup=True on the model config).

Both the import guard and usage site are marked with TODO: Remove once get_mup_config_overrides lands in mcore main.

Also adds .claude/skills/mcore-compat/SKILL.md documenting this guard pattern for future main/dev divergence cases.

TODO

Remove the guard in a follow-up PR after the next mcore bump that includes get_mup_config_overrides in main.

Summary by CodeRabbit

Documentation
- Added developer documentation for mcore compatibility guard patterns and repository architecture guidance.
Chores
- Increased CI/CD unit test timeout for improved pipeline reliability.
- Improved optimizer configuration robustness to gracefully handle optional feature availability across environments.
- Extended test infrastructure with additional model provider configurations for enhanced testing coverage.

copy-pr-bot · 2026-03-17T20:14:13Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 · 2026-03-17T20:14:14Z

/ok to test cd5fecb

yaoyu-33 · 2026-03-17T20:16:51Z

/ok to test 08cc75e

coderabbitai · 2026-03-17T20:23:57Z

📝 Walkthrough

Walkthrough

This pull request adds documentation on guard patterns for handling optional mcore features, implements a conditional import guard for μP functionality, introduces a new test model provider for μP testing, updates CI/CD timeout configuration, and provides a comprehensive guide for Claude Code within the repository.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Guidance: `.claude/skills/mcore-compat/SKILL.md`, `CLAUDE.md` | New documentation outlining mcore-compat guard patterns with import/usage guards and removal criteria; comprehensive guide for Claude Code including architecture, conventions, dependencies, and available skills. |
| CI/CD Configuration: `.github/workflows/cicd-main.yml` | Increased unit test run timeout from 15 to 18 minutes (3-minute extension). |
| μP Feature Guards: `src/megatron/bridge/training/optim.py` | Added try/except import guard for `get_mup_config_overrides` with `_HAS_MUP_CONFIG_OVERRIDES` flag; conditioned `setup_optimizer` to apply μP overrides only when the symbol is available. |
| Test Model Provider: `tests/functional_tests/training/test_pretrain.py` | Introduced `Llama32ModelProvider1B` class extending `GPTModelProvider` with detailed Llama 3.2 configuration (normalization, activation, ROPE, layers, sizes, embeddings) for μP testing. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: guarding the get_mup_config_overrides import for mcore main compatibility. It is concise, clear, and specifies the key issue being addressed. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | PR contains minor changes: documentation additions, CI adjustment, defensive compatibility guard for optional μP functionality, and test infrastructure. No breaking changes, public API modifications, or significant functional impact. |

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/functional_tests/training/test_pretrain.py (1)
252-266: ⚠️ Potential issue | 🟠 Major

Add skip guard for test when μP override API is unavailable.

This test asserts that a specific log message appears when μP is enabled (lines 315–321), but that log is only generated when get_mup_config_overrides is available in the mcore variant. On compatible mcore versions lacking this API, the test will fail with "Expected μP optimizer override log message but found none" even though the code gracefully skips μP (as designed by the try/except guard in optim.py). Skipping the test in such environments is the correct behavior.
Proposed fix
     def test_pretrain_with_mup(self, tmp_path, caplog):
         """
         Test end to end training with μP (Maximal Update Parameterization) enabled.
         ...
         """
+        try:
+            from megatron.core.optimizer import get_mup_config_overrides  # noqa: F401
+        except ImportError:
+            pytest.skip("μP overrides API is unavailable in current mcore variant")
+
         initialize_distributed()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_pretrain.py` around lines 252 - 266, The
test assumes μP override API exists; add a pre-check at the start of the test to
skip when the μP override function is unavailable (e.g., test should detect
presence of get_mup_config_overrides or the relevant symbol in the mcore.optim
module) so it only asserts the μP optimizer override log when that API exists;
locate the test function in tests/functional_tests/training/test_pretrain.py
that constructs Llama32ModelProvider1B and examines logs for the μP override
message, add a short guard that calls getattr/membership check for
get_mup_config_overrides (or equivalent) and uses the test framework's skip
mechanism if missing, then proceed with the existing log assertions otherwise.
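The getattr/membership check mentioned above could be sketched as follows. This is an illustrative, standard-library-only helper: `has_symbol` and `MUP_API_AVAILABLE` are hypothetical names, not part of the repository, and the pytest usage appears only in a comment on the assumption that pytest is installed where the tests run.

```python
import importlib


def has_symbol(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


# True only when the installed mcore exposes the μP override helper.
MUP_API_AVAILABLE = has_symbol("megatron.core.optimizer", "get_mup_config_overrides")

# Possible use in the test module (pytest assumed to be installed there):
#
#   @pytest.mark.skipif(not MUP_API_AVAILABLE, reason="μP overrides API unavailable")
#   def test_pretrain_with_mup(self, tmp_path, caplog): ...
```

Compared with the inline try/except in the proposed fix, a module-level flag keeps the test body unchanged and lets several tests share one check.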

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CLAUDE.md`:
- Around line 116-117: The sentence in CLAUDE.md claiming the CI uses the
`mcore_variant` input is stale; update the description to reflect that
`cicd-main.yml` now accepts `mcore_ref` and `mcore_repo` (not `mcore_variant`)
when selecting the Megatron-LM submodule, and ensure the guidance about the
`3rdparty/Megatron-LM/` submodule tracking `main` remains accurate—search for
the current sentence referencing `mcore_variant` and replace it with text that
cites `mcore_ref`/`mcore_repo` and clarifies that `dev` is only used via CI
inputs as appropriate.

In `@src/megatron/bridge/training/optim.py`:
- Around line 82-83: The code currently checks _HAS_MUP_CONFIG_OVERRIDES before
honoring model_config.use_mup which silently skips μP when the helper is
missing; change the logic in the block around _HAS_MUP_CONFIG_OVERRIDES and
get_mup_config_overrides so that you first check getattr(model_config,
"use_mup", False) and if True verify that get_mup_config_overrides is available
(i.e., _HAS_MUP_CONFIG_OVERRIDES is True); if it is not available raise a
RuntimeError with a clear message indicating μP was requested but
get_mup_config_overrides is missing, otherwise call get_mup_config_overrides to
populate mup_overrides and continue as before.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bfd070d5-cd5d-4565-a39c-284506785e08

📥 Commits

Reviewing files that changed from the base of the PR and between f689296 and cd5fecb.

📒 Files selected for processing (5)

.claude/skills/mcore-compat/SKILL.md
.github/workflows/cicd-main.yml
CLAUDE.md
src/megatron/bridge/training/optim.py
tests/functional_tests/training/test_pretrain.py

coderabbitai · 2026-03-17T20:24:00Z

CLAUDE.md

+The `3rdparty/Megatron-LM/` submodule pointer **must always track the `main` commit** (not `dev`). The `dev` variant is only used in CI via `mcore_variant` in `.github/workflows/cicd-main.yml`.
+


⚠️ Potential issue | 🟡 Minor

Fix stale CI input reference (mcore_variant).

cicd-main.yml in this PR uses mcore_ref/mcore_repo inputs, not mcore_variant. This sentence is currently misleading for contributors.

📝 Proposed doc fix

-The `dev` variant is only used in CI via `mcore_variant` in `.github/workflows/cicd-main.yml`.
+Main/dev compatibility testing in CI is driven by `workflow_dispatch` inputs like
+`mcore_ref` (and optionally `mcore_repo`) in `.github/workflows/cicd-main.yml`.

Suggested change

-The `3rdparty/Megatron-LM/` submodule pointer **must always track the `main` commit** (not `dev`). The `dev` variant is only used in CI via `mcore_variant` in `.github/workflows/cicd-main.yml`.
+The `3rdparty/Megatron-LM/` submodule pointer **must always track the `main` commit** (not `dev`). Main/dev compatibility testing in CI is driven by `workflow_dispatch` inputs like `mcore_ref` (and optionally `mcore_repo`) in `.github/workflows/cicd-main.yml`.

🧰 Tools

🪛 LanguageTool

[uncategorized] ~116-~116: The official name of this software platform is spelled with a capital “H”.
Context: ... only used in CI via mcore_variant in .github/workflows/cicd-main.yml. ## Available...

(GITHUB)

coderabbitai · 2026-03-17T20:24:00Z

src/megatron/bridge/training/optim.py

+    if _HAS_MUP_CONFIG_OVERRIDES and getattr(model_config, "use_mup", False):
        mup_overrides = get_mup_config_overrides(


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether any checked-in configs/tests explicitly enable use_mup
# (helps assess immediate impact of fail-fast behavior).
rg -nP '\buse_mup\s*=\s*True\b'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 944

🏁 Script executed:

# First, let's see the current implementation in optim.py around lines 82-83
sed -n '70,100p' src/megatron/bridge/training/optim.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1524

🏁 Script executed:

# Check what _HAS_MUP_CONFIG_OVERRIDES is and how it's defined
rg -n '_HAS_MUP_CONFIG_OVERRIDES' src/megatron/bridge/training/optim.py -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 623

🏁 Script executed:

# Check the full context of the function containing lines 82-83
sed -n '1,150p' src/megatron/bridge/training/optim.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5993

🏁 Script executed:

# Look at the test that expects use_mup=True to work
sed -n '65,110p' tests/unit_tests/training/test_optim.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2330

🏁 Script executed:

# Check the functional test that verifies use_mup=True flows through
sed -n '195,250p' tests/functional_tests/training/test_pretrain.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2224

🏁 Script executed:

# Check if there are any environment-specific skips or version checks in functional tests
rg -n 'skip|version|_HAS_MUP' tests/functional_tests/training/test_pretrain.py -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53

🏁 Script executed:

# Check if there are any other fallback/graceful-degradation patterns for optional features
rg -n '_HAS_|ImportError|try.*except' src/megatron/bridge/training/optim.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 337

Raise an error when μP is explicitly enabled but unavailable.

When use_mup=True is set on the model config but get_mup_config_overrides is unavailable in the current Megatron-Core version, the code silently skips μP setup. This masks misconfiguration and produces unintended training dynamics where the user expects μP to be applied but it is not.

The proposed fix correctly changes the logic to fail fast: first check if μP is requested, then explicitly verify that the required function is available before proceeding. If the function is unavailable, raise a clear RuntimeError instead of silently degrading.

Proposed fix

-    if _HAS_MUP_CONFIG_OVERRIDES and getattr(model_config, "use_mup", False):
-        mup_overrides = get_mup_config_overrides(
-            config=optimizer_config,
-            mup_width_mult=model_config.mup_width_mult,
-            optimizer_type=optimizer_config.optimizer,
-        )
-        if mup_overrides:
-            config_overrides = {**(config_overrides or {}), **mup_overrides}
-            G_LOGGER.info(
-                f"μP enabled (width_mult={model_config.mup_width_mult:.4g}): "
-                f"applied {len(mup_overrides)} optimizer param-group override(s)."
-            )
+    if getattr(model_config, "use_mup", False):
+        if not _HAS_MUP_CONFIG_OVERRIDES:
+            raise RuntimeError(
+                "μP is enabled (`use_mup=True`) but `get_mup_config_overrides` is unavailable "
+                "in the current Megatron-Core version."
+            )
+        mup_overrides = get_mup_config_overrides(
+            config=optimizer_config,
+            mup_width_mult=model_config.mup_width_mult,
+            optimizer_type=optimizer_config.optimizer,
+        )
+        if mup_overrides:
+            config_overrides = {**(config_overrides or {}), **mup_overrides}
+            G_LOGGER.info(
+                f"μP enabled (width_mult={model_config.mup_width_mult:.4g}): "
+                f"applied {len(mup_overrides)} optimizer param-group override(s)."
+            )
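To see the fail-fast behavior in isolation, here is a self-contained sketch that does not need Megatron-Core installed. `resolve_mup_overrides` and `fake_get_overrides` are hypothetical names introduced for illustration, and the dictionary returned by the stand-in is made up; the real helper's return shape is not shown in this PR.

```python
from types import SimpleNamespace


def resolve_mup_overrides(model_config, optimizer_config, get_overrides=None):
    """Hypothetical helper. `get_overrides` is the imported function, or None
    when the guarded import failed."""
    if not getattr(model_config, "use_mup", False):
        return {}  # μP not requested: nothing to do on any mcore version.
    if get_overrides is None:
        # μP was requested but cannot be honored: fail instead of skipping.
        raise RuntimeError(
            "μP is enabled (`use_mup=True`) but `get_mup_config_overrides` is "
            "unavailable in the current Megatron-Core version."
        )
    return get_overrides(
        config=optimizer_config,
        mup_width_mult=model_config.mup_width_mult,
        optimizer_type=optimizer_config.optimizer,
    ) or {}


def fake_get_overrides(config, mup_width_mult, optimizer_type):
    # Stand-in for the real helper; the returned shape is invented for this demo.
    return {"example_group": {"lr_mult": 1.0 / mup_width_mult}}


model_config = SimpleNamespace(use_mup=True, mup_width_mult=4.0)
optimizer_config = SimpleNamespace(optimizer="adam")

applied = resolve_mup_overrides(model_config, optimizer_config, fake_get_overrides)

try:
    resolve_mup_overrides(model_config, optimizer_config, None)
    failed_fast = False
except RuntimeError:
    failed_fast = True
```

The trade-off is that a fail-fast check turns a silent skip into a hard error on mcore main for any config that sets `use_mup=True`, so it would need to be paired with the test skip discussed in the outside-diff comment.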

yaoyu-33 · 2026-03-17T21:02:32Z

/ok to test 33481d7

…in compat `get_mup_config_overrides` exists in mcore dev but not yet in the main branch tracked by the submodule. Guard the import with try/except so unit tests pass against the submodule mcore (fixes bump PR #2829 CI). Also adds the `mcore-compat` skill documenting the pattern for future main/dev divergence cases. TODO: Remove the guard once `get_mup_config_overrides` lands in mcore main. Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

…rd-mup-overrides

yaoyu-33 · 2026-03-17T21:03:52Z

/ok to test b7cc766

yaoyu-33 · 2026-03-18T01:38:40Z

/claude review

src/megatron/bridge/training/optim.py

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

yaoyu-33 · 2026-03-18T18:17:28Z

Successful dev run: https://github.com/NVIDIA-NeMo/Megatron-Bridge/pull/2829/changes

…in compat (#2846) Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

yaoyu-33 requested a review from a team as a code owner March 17, 2026 20:14

copy-pr-bot bot had a problem deploying to test March 17, 2026 20:15 Error

yaoyu-33 force-pushed the yuya/mcore-compat-guard-mup-overrides branch from cd5fecb to 08cc75e Compare March 17, 2026 20:16

coderabbitai bot reviewed Mar 17, 2026

copy-pr-bot bot had a problem deploying to test March 17, 2026 21:03 Error

yaoyu-33 added 2 commits March 17, 2026 14:03

Merge remote-tracking branch 'origin/main' into yuya/mcore-compat-gua…

b7cc766

…rd-mup-overrides

yaoyu-33 force-pushed the yuya/mcore-compat-guard-mup-overrides branch from 33481d7 to b7cc766 Compare March 17, 2026 21:03

copy-pr-bot bot temporarily deployed to test March 17, 2026 21:04 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 17, 2026 21:08 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci March 17, 2026 21:16 Failure

copy-pr-bot bot temporarily deployed to nemo-ci March 17, 2026 21:33 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 17, 2026 21:39 Inactive

claude bot reviewed Mar 18, 2026

Update src/megatron/bridge/training/optim.py

1601d56

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

copy-pr-bot bot temporarily deployed to nemo-ci March 18, 2026 15:51 Inactive

cuichenx approved these changes Mar 18, 2026

yaoyu-33 merged commit 31d0ade into main Mar 18, 2026
68 of 71 checks passed

yaoyu-33 deleted the yuya/mcore-compat-guard-mup-overrides branch March 18, 2026 18:17
