[training,ci] fix: guard get_mup_config_overrides import for mcore main compat#2846
[training,ci] fix: guard get_mup_config_overrides import for mcore main compat#2846
Conversation
|
/ok to test cd5fecb |
cd5fecb to
08cc75e
Compare
|
/ok to test 08cc75e |
📝 WalkthroughWalkthroughThis pull request adds documentation on guard patterns for handling optional mcore features, implements a conditional import guard for μP functionality, introduces a new test model provider for μP testing, updates CI/CD timeout configuration, and provides a comprehensive guide for Claude Code within the repository. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.Change the |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/functional_tests/training/test_pretrain.py (1)
252-266:⚠️ Potential issue | 🟠 MajorAdd skip guard for test when μP override API is unavailable.
This test asserts that a specific log message appears when μP is enabled (lines 315–321), but that log is only generated when
get_mup_config_overridesis available in the mcore variant. On compatible mcore versions lacking this API, the test will fail with "Expected μP optimizer override log message but found none" even though the code gracefully skips μP (as designed by the try/except guard in optim.py). Skipping the test in such environments is the correct behavior.Proposed fix
def test_pretrain_with_mup(self, tmp_path, caplog): """ Test end to end training with μP (Maximal Update Parameterization) enabled. ... """ + try: + from megatron.core.optimizer import get_mup_config_overrides # noqa: F401 + except ImportError: + pytest.skip("μP overrides API is unavailable in current mcore variant") + initialize_distributed()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/training/test_pretrain.py` around lines 252 - 266, The test assumes μP override API exists; add a pre-check at the start of the test to skip when the μP override function is unavailable (e.g., test should detect presence of get_mup_config_overrides or the relevant symbol in the mcore.optim module) so it only asserts the μP optimizer override log when that API exists; locate the test function in tests/functional_tests/training/test_pretrain.py that constructs Llama32ModelProvider1B and examines logs for the μP override message, add a short guard that calls getattr/membership check for get_mup_config_overrides (or equivalent) and uses the test framework's skip mechanism if missing, then proceed with the existing log assertions otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CLAUDE.md`:
- Around line 116-117: The sentence in CLAUDE.md claiming the CI uses the
`mcore_variant` input is stale; update the description to reflect that
`cicd-main.yml` now accepts `mcore_ref` and `mcore_repo` (not `mcore_variant`)
when selecting the Megatron-LM submodule, and ensure the guidance about the
`3rdparty/Megatron-LM/` submodule tracking `main` remains accurate—search for
the current sentence referencing `mcore_variant` and replace it with text that
cites `mcore_ref`/`mcore_repo` and clarifies that `dev` is only used via CI
inputs as appropriate.
In `@src/megatron/bridge/training/optim.py`:
- Around line 82-83: The code currently checks _HAS_MUP_CONFIG_OVERRIDES before
honoring model_config.use_mup which silently skips μP when the helper is
missing; change the logic in the block around _HAS_MUP_CONFIG_OVERRIDES and
get_mup_config_overrides so that you first check getattr(model_config,
"use_mup", False) and if True verify that get_mup_config_overrides is available
(i.e., _HAS_MUP_CONFIG_OVERRIDES is True); if it is not available raise a
RuntimeError with a clear message indicating μP was requested but
get_mup_config_overrides is missing, otherwise call get_mup_config_overrides to
populate mup_overrides and continue as before.
---
Outside diff comments:
In `@tests/functional_tests/training/test_pretrain.py`:
- Around line 252-266: The test assumes μP override API exists; add a pre-check
at the start of the test to skip when the μP override function is unavailable
(e.g., test should detect presence of get_mup_config_overrides or the relevant
symbol in the mcore.optim module) so it only asserts the μP optimizer override
log when that API exists; locate the test function in
tests/functional_tests/training/test_pretrain.py that constructs
Llama32ModelProvider1B and examines logs for the μP override message, add a
short guard that calls getattr/membership check for get_mup_config_overrides (or
equivalent) and uses the test framework's skip mechanism if missing, then
proceed with the existing log assertions otherwise.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bfd070d5-cd5d-4565-a39c-284506785e08
📒 Files selected for processing (5)
.claude/skills/mcore-compat/SKILL.md.github/workflows/cicd-main.ymlCLAUDE.mdsrc/megatron/bridge/training/optim.pytests/functional_tests/training/test_pretrain.py
CLAUDE.md
Outdated
| The `3rdparty/Megatron-LM/` submodule pointer **must always track the `main` commit** (not `dev`). The `dev` variant is only used in CI via `mcore_variant` in `.github/workflows/cicd-main.yml`. | ||
|
|
There was a problem hiding this comment.
Fix stale CI input reference (mcore_variant).
cicd-main.yml in this PR uses mcore_ref/mcore_repo inputs, not mcore_variant. This sentence is currently misleading for contributors.
📝 Proposed doc fix
-The `dev` variant is only used in CI via `mcore_variant` in `.github/workflows/cicd-main.yml`.
+Main/dev compatibility testing in CI is driven by `workflow_dispatch` inputs like
+`mcore_ref` (and optionally `mcore_repo`) in `.github/workflows/cicd-main.yml`.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| The `3rdparty/Megatron-LM/` submodule pointer **must always track the `main` commit** (not `dev`). The `dev` variant is only used in CI via `mcore_variant` in `.github/workflows/cicd-main.yml`. | |
| The `3rdparty/Megatron-LM/` submodule pointer **must always track the `main` commit** (not `dev`). Main/dev compatibility testing in CI is driven by `workflow_dispatch` inputs like `mcore_ref` (and optionally `mcore_repo`) in `.github/workflows/cicd-main.yml`. |
🧰 Tools
🪛 LanguageTool
[uncategorized] ~116-~116: The official name of this software platform is spelled with a capital “H”.
Context: ... only used in CI via mcore_variant in .github/workflows/cicd-main.yml. ## Available...
(GITHUB)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@CLAUDE.md` around lines 116 - 117, The sentence in CLAUDE.md claiming the CI
uses the `mcore_variant` input is stale; update the description to reflect that
`cicd-main.yml` now accepts `mcore_ref` and `mcore_repo` (not `mcore_variant`)
when selecting the Megatron-LM submodule, and ensure the guidance about the
`3rdparty/Megatron-LM/` submodule tracking `main` remains accurate—search for
the current sentence referencing `mcore_variant` and replace it with text that
cites `mcore_ref`/`mcore_repo` and clarifies that `dev` is only used via CI
inputs as appropriate.
| if _HAS_MUP_CONFIG_OVERRIDES and getattr(model_config, "use_mup", False): | ||
| mup_overrides = get_mup_config_overrides( |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether any checked-in configs/tests explicitly enable use_mup
# (helps assess immediate impact of fail-fast behavior).
rg -nP '\buse_mup\s*=\s*True\b'Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 944
🏁 Script executed:
# First, let's see the current implementation in optim.py around lines 82-83
sed -n '70,100p' src/megatron/bridge/training/optim.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1524
🏁 Script executed:
# Check what _HAS_MUP_CONFIG_OVERRIDES is and how it's defined
rg -n '_HAS_MUP_CONFIG_OVERRIDES' src/megatron/bridge/training/optim.py -B 2 -A 2Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 623
🏁 Script executed:
# Check the full context of the function containing lines 82-83
sed -n '1,150p' src/megatron/bridge/training/optim.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 5993
🏁 Script executed:
# Look at the test that expects use_mup=True to work
sed -n '65,110p' tests/unit_tests/training/test_optim.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2330
🏁 Script executed:
# Check the functional test that verifies use_mup=True flows through
sed -n '195,250p' tests/functional_tests/training/test_pretrain.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2224
🏁 Script executed:
# Check if there are any environment-specific skips or version checks in functional tests
rg -n 'skip|version|_HAS_MUP' tests/functional_tests/training/test_pretrain.py -B 2 -A 2Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 53
🏁 Script executed:
# Check if there are any other fallback/graceful-degradation patterns for optional features
rg -n '_HAS_|ImportError|try.*except' src/megatron/bridge/training/optim.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 337
Raise an error when μP is explicitly enabled but unavailable.
When use_mup=True is set on the model config but get_mup_config_overrides is unavailable in the current Megatron-Core version, the code silently skips μP setup. This masks misconfiguration and produces unintended training dynamics where the user expects μP to be applied but it is not.
The proposed fix correctly changes the logic to fail fast: first check if μP is requested, then explicitly verify that the required function is available before proceeding. If the function is unavailable, raise a clear RuntimeError instead of silently degrading.
Proposed fix
- if _HAS_MUP_CONFIG_OVERRIDES and getattr(model_config, "use_mup", False):
- mup_overrides = get_mup_config_overrides(
- config=optimizer_config,
- mup_width_mult=model_config.mup_width_mult,
- optimizer_type=optimizer_config.optimizer,
- )
- if mup_overrides:
- config_overrides = {**(config_overrides or {}), **mup_overrides}
- G_LOGGER.info(
- f"μP enabled (width_mult={model_config.mup_width_mult:.4g}): "
- f"applied {len(mup_overrides)} optimizer param-group override(s)."
- )
+ if getattr(model_config, "use_mup", False):
+ if not _HAS_MUP_CONFIG_OVERRIDES:
+ raise RuntimeError(
+ "μP is enabled (`use_mup=True`) but `get_mup_config_overrides` is unavailable "
+ "in the current Megatron-Core version."
+ )
+ mup_overrides = get_mup_config_overrides(
+ config=optimizer_config,
+ mup_width_mult=model_config.mup_width_mult,
+ optimizer_type=optimizer_config.optimizer,
+ )
+ if mup_overrides:
+ config_overrides = {**(config_overrides or {}), **mup_overrides}
+ G_LOGGER.info(
+ f"μP enabled (width_mult={model_config.mup_width_mult:.4g}): "
+ f"applied {len(mup_overrides)} optimizer param-group override(s)."
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/training/optim.py` around lines 82 - 83, The code
currently checks _HAS_MUP_CONFIG_OVERRIDES before honoring model_config.use_mup
which silently skips μP when the helper is missing; change the logic in the
block around _HAS_MUP_CONFIG_OVERRIDES and get_mup_config_overrides so that you
first check getattr(model_config, "use_mup", False) and if True verify that
get_mup_config_overrides is available (i.e., _HAS_MUP_CONFIG_OVERRIDES is True);
if it is not available raise a RuntimeError with a clear message indicating μP
was requested but get_mup_config_overrides is missing, otherwise call
get_mup_config_overrides to populate mup_overrides and continue as before.
|
/ok to test 33481d7 |
…in compat `get_mup_config_overrides` exists in mcore dev but not yet in the main branch tracked by the submodule. Guard the import with try/except so unit tests pass against the submodule mcore (fixes bump PR #2829 CI). Also adds the `mcore-compat` skill documenting the pattern for future main/dev divergence cases. TODO: Remove the guard once `get_mup_config_overrides` lands in mcore main. Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
|
/ok to test b7cc766 |
33481d7 to
b7cc766
Compare
|
/claude review |
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
…in compat (#2846) Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Problem
Launch_Unit_Tests_Coreon the mcore bump PR #2829 fails with:get_mup_config_overrideswas added by the MuP scaling feature (#2666) and lives in mcore dev, but has not yet landed in mcore main (the branch tracked by the submodule).Fix
Guard the import in
optim.pywithtry/except ImportErrorand a_HAS_MUP_CONFIG_OVERRIDESflag. The μP optimizer path is silently skipped when the symbol is unavailable (safe, since μP is opt-in viause_mup=Trueon the model config).Both the import guard and usage site are marked with
TODO: Remove once get_mup_config_overrides lands in mcore main.Also adds
.claude/skills/mcore-compat/SKILL.mddocumenting this guard pattern for future main/dev divergence cases.TODO
Remove the guard in a follow-up PR after the next mcore bump that includes
get_mup_config_overridesin main.Summary by CodeRabbit
Documentation
Chores