Conversation
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
📝 WalkthroughWalkthroughThis PR introduces functional test infrastructure for checkpoint workflows across three model variants (LLaMA-32 1B, NemotronH 4B, and Qwen3 4B), adding both bidirectional test launcher scripts (MLM-to-MBridge and MBridge-to-MLM) and corresponding Python test modules, while updating the CI/CD pipeline to execute these tests. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (9)
tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_nemotronh_4b.sh (1)
23-29: Wrap all Python tooling withuv runhere too.Line 23, Line 26, and Line 29 are direct tool invocations; using
uv runconsistently avoids environment drift between commands.As per coding guidelines, "
{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_nemotronh_4b.sh` around lines 23 - 29, The three direct tool invocations (the pytest call on the first line, the coverage combine and the final pytest call) should be run via uv to avoid environment drift; wrap each invocation with "uv run" so the pytest commands (the one invoking torch.distributed.run and the two single-process pytest calls) and the "coverage combine -q" call are executed through uv instead of being called directly, ensuring consistent environment activation for commands referenced in this diff.tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh (1)
24-28: Applyuv runconsistently tocoverage/pytestcommands.Line 24, Line 26, and Line 28 should be wrapped with
uv runfor consistent dependency/environment resolution.As per coding guidelines, "
{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh` around lines 24 - 28, The script invokes "coverage combine" and two "pytest ..." commands directly; update each invocation (the "coverage combine" line and both pytest command lines) to be executed via the project runner by prefixing them with "uv run" so they use the repository's pinned environment and dependencies (i.e., replace plain "coverage combine -q" and the two "pytest -o log_cli=..." lines with "uv run coverage combine -q" and "uv run pytest -o log_cli=..." respectively).tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_llama32_1b.sh (1)
23-28: Run all Python tooling throughuv runin this launcher.Line 23, Line 26, and Line 28 invoke Python entrypoints directly (
pytest/coverage). Useuv runconsistently to keep resolver/environment behavior uniform in CI.♻️ Suggested update
-pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core +uv run --no-sync pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core @@ -coverage combine -q +uv run --no-sync coverage combine -q @@ -pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts +uv run --no-sync pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifactsAs per coding guidelines, "
{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly". Based on learnings, "prefer including --no-sync ... in containers or CI pipelines where dependencies are already installed."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_llama32_1b.sh` around lines 23 - 28, Change the direct Python tooling calls to run under uv: replace the standalone "pytest -o log_cli=..." invocations (test run lines invoking pytest) with "uv run --no-sync pytest -o log_cli=..."; also run coverage via uv by replacing "coverage combine -q" with "uv run --no-sync coverage combine -q". Update the launcher script so all three tooling commands (the two pytest calls and the coverage combine) use "uv run --no-sync" to ensure consistent resolver/environment behavior.tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py (2)
39-40: Rename test methods to snake_case.
test_llama32_1B_ckpt_mbridgeandtest_llama32_1B_ckpt_coreuse an uppercase segment (1B). Rename to snake_case (e.g.,..._1b_...) to match repo convention.As per coding guidelines, "
**/*.py: Use snake_case for function and method names".Also applies to: 63-64
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines 39 - 40, Rename the two test methods that contain an uppercase "1B" to snake_case: change test_llama32_1B_ckpt_mbridge -> test_llama32_1b_ckpt_mbridge and test_llama32_1B_ckpt_core -> test_llama32_1b_ckpt_core; update any local references (calls, decorators or skips) to these exact function names so test discovery and references remain correct, and run the test suite to ensure no other references break.
74-129: Prefer subprocess-based process isolation for this functional test.Instead of patching
sys.argvand callingtorchrun_main()in-process, invoke torchrun via subprocess for better isolation.As per coding guidelines, "
tests/functional_tests/**/*.py: Use subprocess for functional tests that require process isolation".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines 74 - 129, The test currently injects arguments via monkeypatch.setattr(sys, "argv", ...) and calls torchrun_main() in-process; replace this with launching torchrun as an external process using subprocess.run (or subprocess.check_call) to get proper process isolation. Build the command list using the same flags currently passed to monkeypatch (including load_dir, MCORE_CKPT, TB_DIR, train_iters, etc.), call subprocess.run([...], env=os.environ.copy(), cwd=<appropriate working dir>, stdout=PIPE, stderr=PIPE), and assert the process returncode is zero (or raise on non-zero) and capture logs as needed; remove the monkeypatch.setattr(sys, "argv", ...) and the torchrun_main() invocation. Ensure you reference the same symbols from the diff (torchrun_main, monkeypatch.setattr, sys.argv, load_dir, MCORE_CKPT, TB_DIR) when making the change.tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh (1)
24-28: Useuv runfor remaining Python tool invocations.Line 24, Line 26, and Line 28 currently execute Python tooling directly. Wrap them in
uv runfor consistent environment behavior.As per coding guidelines, "
{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh` around lines 24 - 28, Replace direct Python tool invocations with "uv run" wrappers so they run in the unified environment: change the three commands that currently call "coverage combine -q" and the two "pytest ..." invocations to be executed via "uv run" (i.e., "uv run coverage combine -q" and "uv run pytest ..." for the pytest calls) so the script follows the "{**/*.sh,examples/**/*.py}" guideline; update the occurrences of the exact strings "coverage combine -q" and the pytest command lines in the file accordingly.tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py (1)
74-132: Usesubprocessfor this torchrun invocation.This test currently mutates global
sys.argvand callstorchrun_main()in-process. For functional tests requiring process isolation, switch tosubprocess.run([...], check=True).As per coding guidelines, "
tests/functional_tests/**/*.py: Use subprocess for functional tests that require process isolation".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py` around lines 74 - 132, The test mutates sys.argv via monkeypatch.setattr and calls torchrun_main() in-process (monkeypatch.setattr on sys, sys.argv and torchrun_main()), which breaks process isolation; change it to invoke the same command with subprocess.run([...], check=True) instead: build the full argv list currently passed to sys.argv (starting with "torchrun", "--nproc_per_node=2", ..., "--no-load-rng") as a command array (or prefix with sys.executable and "-m", e.g. [sys.executable, "-m", "torch.distributed.run", ...] or "torchrun" if available), call subprocess.run(cmd, check=True) and remove the in-process torchrun_main() call and the sys.argv monkeypatch; ensure any environment needed (monkeypatching env vars) is passed via subprocess.run(env=...) if required.tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b.sh (1)
23-28: Standardize this launcher onuv runforpytestandcoverage.Line 23, Line 26, and Line 28 should use
uv runas well, so all Python tooling runs under the same managed environment.As per coding guidelines, "
{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b.sh` around lines 23 - 28, The three standalone tooling invocations (the first pytest command "pytest -o log_cli=true ... test_qwen3_4b_ckpt_mcore", the coverage combine call "coverage combine -q", and the last pytest command "pytest -o log_cli=true ... test_remove_artifacts") must be prefixed with "uv run" so they run under the same managed environment as the middle launcher; update those three command lines to use "uv run pytest ..." and "uv run coverage combine -q" respectively.tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py (1)
75-135: Prefer subprocess execution over in-processtorchrun_main().This functional test mutates global argv and executes torchrun in-process. Use a subprocess call for isolation and fewer side effects.
As per coding guidelines, "
tests/functional_tests/**/*.py: Use subprocess for functional tests that require process isolation".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py` around lines 75 - 135, The test currently mutates sys.argv via monkeypatch.setattr(...) and calls torchrun_main(), causing global state leakage; instead replace the in-process invocation with a subprocess call: construct the same command list currently assigned to sys.argv (including the "torchrun" entry and all flags like "--nproc_per_node=2", "--load" load_dir, "--save" MCORE_CKPT, TB_DIR, etc.), remove the monkeypatch.setattr(sys, "argv", ...) and the torchrun_main() call, and invoke subprocess.run(command_list, check=True) (or subprocess.run with stdout/stderr capture as needed) to run the job in a separate process, ensuring process isolation and no global argv mutation. Use the symbols monkeypatch.setattr, torchrun_main, and the command list entries to locate and replace the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/cicd-main.yml:
- Around line 383-385: The three new matrix entries
L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh,
L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh, and
L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh include a redundant “.sh” suffix
which the test-template action will append again and cause *.sh.sh failures;
remove the “.sh” suffix from each of these entries so they become
L2_Launch_ckpts_mbridge_to_mlm_llama32_1b,
L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b, and
L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b respectively to match existing
extensionless matrix items.
In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py`:
- Around line 131-134: The test_remove_artifacts teardown currently calls
shutil.rmtree(BASE_DIR) unconditionally which can raise if BASE_DIR is missing;
update test_remove_artifacts to check for existence before removal by using
os.path.exists(BASE_DIR) (or pathlib.Path(BASE_DIR).exists()) and only call
shutil.rmtree(BASE_DIR) when present, or wrap the call in a try/except catching
FileNotFoundError to silently ignore missing directories so the test doesn't
fail when BASE_DIR is absent.
- Around line 66-67: Your test builds conflicting CLI args by unconditionally
appending a hardcoded --load/--save pair and later appending another pair based
on load_dir, and it emits --load even when load_dir is None; update the sys.argv
construction so that you only add a single --load/--save pair: check the
load_dir variable (derived from MBRIDGE_CKPT) and only append "--load", load_dir
when load_dir is truthy, then append a single matching "--save", save_dir (or
use the same save variable) in the same place instead of adding a second
hardcoded pair; ensure any previous duplicate entries are removed or never added
so the CLI sees one consistent source and target checkpoint.
In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py`:
- Around line 137-140: The test_remove_artifacts cleanup currently calls
shutil.rmtree(BASE_DIR) which will raise if BASE_DIR is missing; make it safe by
checking existence or using shutil.rmtree with ignore_errors=True so the test
won't error when artifacts are absent—update the test_remove_artifacts function
to guard removal (e.g., use os.path.exists(BASE_DIR) before calling
shutil.rmtree or pass ignore_errors=True) and add an import for os if you choose
the existence-check approach.
- Around line 67-68: The test currently computes load_dir (load_dir =
MBRIDGE_CKPT if os.path.exists(MBRIDGE_CKPT) else None) but always appends the
--load flag later, breaking first-run bootstrap; update the command construction
so the --load flag and its value are only added when load_dir is truthy (e.g. if
load_dir: args.extend(['--load', load_dir]) or similar). Keep the existing
train_iters logic (train_iters = 10 if load_dir else 5) unchanged.
In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py`:
- Around line 66-67: The test currently sets load_dir = MBRIDGE_CKPT if the path
exists otherwise None, but the command-line args still unconditionally append
"--load" which breaks bootstrap when no checkpoint exists; update the argument
construction to only append the "--load" flag and its value when load_dir is
truthy (non-None/exists), i.e., check load_dir before adding "--load" (and
before using train_iters logic that depends on it) so the test passes clean runs
without a checkpoint.
- Around line 133-136: The test_remove_artifacts cleanup currently calls
shutil.rmtree(BASE_DIR) unconditionally which raises if BASE_DIR doesn't exist;
make it idempotent by checking for existence or using the rmtree ignore option.
Update test_remove_artifacts to either wrap the removal in an existence check
(os.path.exists(BASE_DIR) before calling shutil.rmtree) or call
shutil.rmtree(BASE_DIR, ignore_errors=True); ensure any new os import is added
if you choose the existence check.
In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh`:
- Around line 24-28: Replace direct invocations of coverage and pytest with the
project's wrapper by prefixing those commands with "uv run" so they run under
the standard runtime environment; specifically update the script commands that
run "coverage combine -q" and the two "pytest ..." lines in
tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh to use "uv
run coverage combine -q" and "uv run pytest ..." respectively, keeping the same
pytest flags and test targets (the TestLlama32Ckpt::test_llama32_1B_ckpt_core
and TestLlama32Ckpt::test_remove_artifacts invocations) so behavior is unchanged
but execution complies with the guideline.
---
Nitpick comments:
In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py`:
- Around line 39-40: Rename the two test methods that contain an uppercase "1B"
to snake_case: change test_llama32_1B_ckpt_mbridge ->
test_llama32_1b_ckpt_mbridge and test_llama32_1B_ckpt_core ->
test_llama32_1b_ckpt_core; update any local references (calls, decorators or
skips) to these exact function names so test discovery and references remain
correct, and run the test suite to ensure no other references break.
- Around line 74-129: The test currently injects arguments via
monkeypatch.setattr(sys, "argv", ...) and calls torchrun_main() in-process;
replace this with launching torchrun as an external process using subprocess.run
(or subprocess.check_call) to get proper process isolation. Build the command
list using the same flags currently passed to monkeypatch (including load_dir,
MCORE_CKPT, TB_DIR, train_iters, etc.), call subprocess.run([...],
env=os.environ.copy(), cwd=<appropriate working dir>, stdout=PIPE, stderr=PIPE),
and assert the process returncode is zero (or raise on non-zero) and capture
logs as needed; remove the monkeypatch.setattr(sys, "argv", ...) and the
torchrun_main() invocation. Ensure you reference the same symbols from the diff
(torchrun_main, monkeypatch.setattr, sys.argv, load_dir, MCORE_CKPT, TB_DIR)
when making the change.
In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py`:
- Around line 75-135: The test currently mutates sys.argv via
monkeypatch.setattr(...) and calls torchrun_main(), causing global state
leakage; instead replace the in-process invocation with a subprocess call:
construct the same command list currently assigned to sys.argv (including the
"torchrun" entry and all flags like "--nproc_per_node=2", "--load" load_dir,
"--save" MCORE_CKPT, TB_DIR, etc.), remove the monkeypatch.setattr(sys, "argv",
...) and the torchrun_main() call, and invoke subprocess.run(command_list,
check=True) (or subprocess.run with stdout/stderr capture as needed) to run the
job in a separate process, ensuring process isolation and no global argv
mutation. Use the symbols monkeypatch.setattr, torchrun_main, and the command
list entries to locate and replace the code.
In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py`:
- Around line 74-132: The test mutates sys.argv via monkeypatch.setattr and
calls torchrun_main() in-process (monkeypatch.setattr on sys, sys.argv and
torchrun_main()), which breaks process isolation; change it to invoke the same
command with subprocess.run([...], check=True) instead: build the full argv list
currently passed to sys.argv (starting with "torchrun", "--nproc_per_node=2",
..., "--no-load-rng") as a command array (or prefix with sys.executable and
"-m", e.g. [sys.executable, "-m", "torch.distributed.run", ...] or "torchrun" if
available), call subprocess.run(cmd, check=True) and remove the in-process
torchrun_main() call and the sys.argv monkeypatch; ensure any environment needed
(monkeypatching env vars) is passed via subprocess.run(env=...) if required.
In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh`:
- Around line 24-28: The script invokes "coverage combine" and two "pytest ..."
commands directly; update each invocation (the "coverage combine" line and both
pytest command lines) to be executed via the project runner by prefixing them
with "uv run" so they use the repository's pinned environment and dependencies
(i.e., replace plain "coverage combine -q" and the two "pytest -o log_cli=..."
lines with "uv run coverage combine -q" and "uv run pytest -o log_cli=..."
respectively).
In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh`:
- Around line 24-28: Replace direct Python tool invocations with "uv run"
wrappers so they run in the unified environment: change the three commands that
currently call "coverage combine -q" and the two "pytest ..." invocations to be
executed via "uv run" (i.e., "uv run coverage combine -q" and "uv run pytest
..." for the pytest calls) so the script follows the
"{**/*.sh,examples/**/*.py}" guideline; update the occurrences of the exact
strings "coverage combine -q" and the pytest command lines in the file
accordingly.
In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_llama32_1b.sh`:
- Around line 23-28: Change the direct Python tooling calls to run under uv:
replace the standalone "pytest -o log_cli=..." invocations (test run lines
invoking pytest) with "uv run --no-sync pytest -o log_cli=..."; also run
coverage via uv by replacing "coverage combine -q" with "uv run --no-sync
coverage combine -q". Update the launcher script so all three tooling commands
(the two pytest calls and the coverage combine) use "uv run --no-sync" to ensure
consistent resolver/environment behavior.
In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_nemotronh_4b.sh`:
- Around line 23-29: The three direct tool invocations (the pytest call on the
first line, the coverage combine and the final pytest call) should be run via uv
to avoid environment drift; wrap each invocation with "uv run" so the pytest
commands (the one invoking torch.distributed.run and the two single-process
pytest calls) and the "coverage combine -q" call are executed through uv instead
of being called directly, ensuring consistent environment activation for
commands referenced in this diff.
In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b.sh`:
- Around line 23-28: The three standalone tooling invocations (the first pytest
command "pytest -o log_cli=true ... test_qwen3_4b_ckpt_mcore", the coverage
combine call "coverage combine -q", and the last pytest command "pytest -o
log_cli=true ... test_remove_artifacts") must be prefixed with "uv run" so they
run under the same managed environment as the middle launcher; update those
three command lines to use "uv run pytest ..." and "uv run coverage combine -q"
respectively.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
.github/workflows/cicd-main.ymltests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.shtests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.shtests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.shtests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_llama32_1b.shtests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_nemotronh_4b.shtests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b.shtests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.pytests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.pytests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py
.github/workflows/cicd-main.yml
Outdated
| - script: L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh | ||
| - script: L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh | ||
| - script: L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's see the context around lines 383-385 in cicd-main.yml
echo "=== Lines 383-385 and surrounding context ==="
sed -n '375,395p' .github/workflows/cicd-main.yml
echo
echo "=== All matrix.include entries to compare ==="
rg -n 'script:' .github/workflows/cicd-main.yml | head -30
echo
echo "=== Check test-template action to understand script handling ==="
cat .github/actions/test-template/action.yml
echo
echo "=== Look for .sh suffix appending logic in test-template ==="
find .github/actions/test-template -type f -exec cat {} \;Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 21379
Remove .sh suffix from new launcher script entries to prevent *.sh.sh runtime failure.
The test-template action appends .sh to all script inputs (line in action.yml: bash tests/.../${{ inputs.script }}.sh). All existing 30+ matrix entries use extensionless names (e.g., L2_Launch_training). These three new entries incorrectly include .sh, which will result in ...sh.sh and fail at runtime.
Fix
- - script: L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh
- - script: L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh
- - script: L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh
+ - script: L2_Launch_ckpts_mbridge_to_mlm_llama32_1b
+ - script: L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b
+ - script: L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/cicd-main.yml around lines 383 - 385, The three new matrix
entries L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh,
L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh, and
L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh include a redundant “.sh” suffix
which the test-template action will append again and cause *.sh.sh failures;
remove the “.sh” suffix from each of these entries so they become
L2_Launch_ckpts_mbridge_to_mlm_llama32_1b,
L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b, and
L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b respectively to match existing
extensionless matrix items.
| load_dir = MBRIDGE_CKPT if os.path.exists(MBRIDGE_CKPT) else None | ||
| train_iters = 10 if load_dir else 5 |
There was a problem hiding this comment.
Fix conflicting and invalid checkpoint CLI arguments in sys.argv.
Line 81-82 hardcodes one load/save path pair, then Line 113-114 adds a second pair. Also load_dir can be absent, but --load is still emitted. This can select the wrong checkpoint source or fail bootstrap runs.
🐛 Suggested fix
- "--load", "/workspace/test_ckpts/llama32_1b_mbridge",
- "--save", "/workspace/test_ckpts/llama32_1b_mcore",
@@
- "--load", load_dir,
- "--save", MCORE_CKPT,
+ "--save", MCORE_CKPT,
@@
- ],
- )
+ ]
+ if load_dir:
+ argv.extend(["--load", load_dir])
+
+ monkeypatch.setattr(sys, "argv", argv)Also applies to: 81-82, 113-114
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines
66 - 67, Your test builds conflicting CLI args by unconditionally appending a
hardcoded --load/--save pair and later appending another pair based on load_dir,
and it emits --load even when load_dir is None; update the sys.argv construction
so that you only add a single --load/--save pair: check the load_dir variable
(derived from MBRIDGE_CKPT) and only append "--load", load_dir when load_dir is
truthy, then append a single matching "--save", save_dir (or use the same save
variable) in the same place instead of adding a second hardcoded pair; ensure
any previous duplicate entries are removed or never added so the CLI sees one
consistent source and target checkpoint.
| def test_remove_artifacts(self): | ||
| """Removes model artifacts""" | ||
| shutil.rmtree(BASE_DIR) | ||
|
|
There was a problem hiding this comment.
Harden artifact cleanup for missing directories.
Line 133 unconditionally calls shutil.rmtree(BASE_DIR). Guarding existence prevents unrelated failures during cleanup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines
131 - 134, The test_remove_artifacts teardown currently calls
shutil.rmtree(BASE_DIR) unconditionally which can raise if BASE_DIR is missing;
update test_remove_artifacts to check for existence before removal by using
os.path.exists(BASE_DIR) (or pathlib.Path(BASE_DIR).exists()) and only call
shutil.rmtree(BASE_DIR) when present, or wrap the call in a try/except catching
FileNotFoundError to silently ignore missing directories so the test doesn't
fail when BASE_DIR is absent.
| load_dir = MBRIDGE_CKPT if os.path.exists(MBRIDGE_CKPT) else None | ||
| train_iters = 10 if load_dir else 5 |
There was a problem hiding this comment.
Guard --load so first-run execution remains valid.
If MBRIDGE_CKPT does not exist, load_dir is None; Line 118 still appends --load, which breaks the bootstrap path.
🐛 Suggested fix
- monkeypatch.setattr(
- sys,
- "argv",
- [
+ argv = [
"torchrun",
@@
- "--eval-iters", "4",
- "--load", load_dir,
+ "--eval-iters", "4",
"--save", MCORE_CKPT,
@@
- "--no-load-rng",
- ],
- )
+ "--no-load-rng",
+ ]
+ if load_dir:
+ argv.extend(["--load", load_dir])
+
+ monkeypatch.setattr(sys, "argv", argv)Also applies to: 118-119
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py` around
lines 67 - 68, The test currently computes load_dir (load_dir = MBRIDGE_CKPT if
os.path.exists(MBRIDGE_CKPT) else None) but always appends the --load flag
later, breaking first-run bootstrap; update the command construction so the
--load flag and its value are only added when load_dir is truthy (e.g. if
load_dir: args.extend(['--load', load_dir]) or similar). Keep the existing
train_iters logic (train_iters = 10 if load_dir else 5) unchanged.
| def test_remove_artifacts(self): | ||
| """Removes model artifacts""" | ||
| shutil.rmtree(BASE_DIR) | ||
|
|
There was a problem hiding this comment.
Make cleanup safe when artifacts are absent.
Line 139 can raise if BASE_DIR does not exist; guard removal to keep cleanup deterministic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py` around
lines 137 - 140, The test_remove_artifacts cleanup currently calls
shutil.rmtree(BASE_DIR) which will raise if BASE_DIR is missing; make it safe by
checking existence or using shutil.rmtree with ignore_errors=True so the test
won't error when artifacts are absent—update the test_remove_artifacts function
to guard removal (e.g., use os.path.exists(BASE_DIR) before calling
shutil.rmtree or pass ignore_errors=True) and add an import for os if you choose
the existence-check approach.
| load_dir = MBRIDGE_CKPT if os.path.exists(MBRIDGE_CKPT) else None | ||
| train_iters = 10 if load_dir else 5 |
There was a problem hiding this comment.
Avoid emitting --load when no checkpoint exists.
On a clean run, load_dir is None but Line 114 still appends --load. This breaks the no-checkpoint bootstrap path.
🐛 Suggested fix
- monkeypatch.setattr(
- sys,
- "argv",
- [
+ argv = [
"torchrun",
"--nproc_per_node=2",
"/opt/Megatron-Bridge/3rdparty/Megatron-LM/pretrain_gpt.py",
@@
- "--eval-iters", "4",
- "--load", load_dir,
+ "--eval-iters", "4",
"--save", MCORE_CKPT,
"--ckpt-format", "torch_dist",
@@
- "--no-load-rng",
- ],
- )
+ "--no-load-rng",
+ ]
+ if load_dir:
+ argv.extend(["--load", load_dir])
+
+ monkeypatch.setattr(sys, "argv", argv)Also applies to: 114-115
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py` around lines 66
- 67, The test currently sets load_dir = MBRIDGE_CKPT if the path exists
otherwise None, but the command-line args still unconditionally append "--load"
which breaks bootstrap when no checkpoint exists; update the argument
construction to only append the "--load" flag and its value when load_dir is
truthy (non-None/exists), i.e., check load_dir before adding "--load" (and
before using train_iters logic that depends on it) so the test passes clean runs
without a checkpoint.
| def test_remove_artifacts(self): | ||
| """Removes model artifacts""" | ||
| shutil.rmtree(BASE_DIR) | ||
|
|
There was a problem hiding this comment.
Make artifact cleanup idempotent.
Line 135 unconditionally removes BASE_DIR; if it was never created, this fails the cleanup test for the wrong reason.
🧹 Suggested fix
- shutil.rmtree(BASE_DIR)
+ if os.path.exists(BASE_DIR):
+ shutil.rmtree(BASE_DIR)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py` around lines 133
- 136, The test_remove_artifacts cleanup currently calls shutil.rmtree(BASE_DIR)
unconditionally which raises if BASE_DIR doesn't exist; make it idempotent by
checking for existence or using the rmtree ignore option. Update
test_remove_artifacts to either wrap the removal in an existence check
(os.path.exists(BASE_DIR) before calling shutil.rmtree) or call
shutil.rmtree(BASE_DIR, ignore_errors=True); ensure any new os import is added
if you choose the existence check.
| coverage combine -q | ||
|
|
||
| pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core | ||
|
|
||
| pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts |
There was a problem hiding this comment.
Run coverage and pytest via uv run for guideline compliance.
Line 24, Line 26, and Line 28 execute Python tooling directly. This should go through uv run in shell scripts.
Suggested fix
-coverage combine -q
+uv run coverage combine -q
-pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core
+uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core
-pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
+uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| coverage combine -q | |
| pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core | |
| pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts | |
| uv run coverage combine -q | |
| uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core | |
| uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh` around
lines 24 - 28, Replace direct invocations of coverage and pytest with the
project's wrapper by prefixing those commands with "uv run" so they run under
the standard runtime environment; specifically update the script commands that
run "coverage combine -q" and the two "pytest ..." lines in
tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh to use "uv
run coverage combine -q" and "uv run pytest ..." respectively, keeping the same
pytest flags and test targets (the TestLlama32Ckpt::test_llama32_1B_ckpt_core
and TestLlama32Ckpt::test_remove_artifacts invocations) so behavior is unchanged
but execution complies with the guideline.
|
/ok to test 890333e |
Signed-off-by: dimapihtar <dpihtar@gmail.com>
|
/ok to test 846de61 |
Signed-off-by: dimapihtar <dpihtar@gmail.com>
|
/ok to test 80dcef0 |
Signed-off-by: dimapihtar <dpihtar@gmail.com>
|
/ok to test 038f18d |
|
/ok to test b96fda1 |
Signed-off-by: dimapihtar <dpihtar@gmail.com>
|
/ok to test 4ea3be7 |
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit