Skip to content

add ckpt tests#2496

Merged
dimapihtar merged 14 commits intomainfrom
dpykhtar/add_ckpt_tests
Mar 4, 2026
Merged

add ckpt tests#2496
dimapihtar merged 14 commits intomainfrom
dpykhtar/add_ckpt_tests

Conversation

@dimapihtar
Copy link
Contributor

@dimapihtar dimapihtar commented Feb 23, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Tests
    • Added new functional test suites for checkpoint conversion testing across multiple models (LLaMA-32 1B, NemotronH 4B, Qwen3 4B)
    • Tests validate both MBridge and MCore checkpoint pathways
    • Expanded CI/CD test matrix with additional test scenarios

Signed-off-by: dimapihtar <dpihtar@gmail.com>
@copy-pr-bot
Copy link

copy-pr-bot bot commented Feb 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

dimapihtar and others added 6 commits February 26, 2026 11:14
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
@dimapihtar dimapihtar marked this pull request as ready for review March 2, 2026 21:17
@dimapihtar dimapihtar requested a review from a team as a code owner March 2, 2026 21:17
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 2, 2026

📝 Walkthrough

Walkthrough

This PR introduces functional test infrastructure for checkpoint workflows across three model variants (LLaMA-32 1B, NemotronH 4B, and Qwen3 4B), adding both bidirectional test launcher scripts (MLM-to-MBridge and MBridge-to-MLM) and corresponding Python test modules, while updating the CI/CD pipeline to execute these tests.

Changes

Cohort / File(s) Summary
CI/CD Configuration
.github/workflows/cicd-main.yml
Added three new functional test matrix entries for mbridge-to-mlm checkpoint workflows (llama32_1b, qwen3_4b, nemotronh_4b).
MBridge-to-MLM Test Launchers
tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_*.sh (llama32_1b, qwen3_4b, nemotronh_4b)
Three shell scripts executing distributed 2-GPU PyTorch test runs with coverage collection, pytest invocations with logging/verbosity options, and artifact cleanup for each model variant.
MLM-to-MBridge Test Launchers
tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_*.sh (llama32_1b, nemotronh_4b, qwen3_4b)
Three shell scripts orchestrating sequential local and distributed PyTorch test execution with coverage aggregation and artifact removal for each model checkpoint workflow.
LLaMA-32 1B Functional Tests
tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py
Introduces TestLlama32Ckpt class with methods for MBridge checkpoint loading, MCore checkpoint training via torchrun, and artifact cleanup.
NemotronH 4B Functional Tests
tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py
Introduces TestNemotronhCkpt class with methods for MBridge pretrain configuration, MCore training execution via torchrun with customized MLM arguments, and artifact cleanup.
Qwen3 4B Functional Tests
tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py
Introduces TestQwen3Ckpt class with methods for MBridge checkpoint configuration, MCore training via torchrun launcher, and artifact removal.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs

Suggested labels

Run CICD, community-request

Suggested reviewers

  • cuichenx
  • ko3n1g
  • yaoyu-33
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'add ckpt tests' is vague and overly generic, using abbreviations without clarity and lacking specificity about what checkpoint tests are being added. Consider a more descriptive title that specifies the checkpoint types or models being tested, such as 'add checkpoint tests for llama32, qwen3, and nemotronh models' or 'add functional tests for mbridge and mcore checkpoint workflows'.
Test Results For Major Changes ❓ Inconclusive PR adds ~600 lines of test infrastructure (6 shell scripts, 4 Python test modules) but is Draft status with incomplete description lacking any test execution results or validation evidence. Complete PR description with test execution results, checkpoint conversion validation evidence, or confirm these are minor test additions without core functionality impact.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dpykhtar/add_ckpt_tests

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (9)
tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_nemotronh_4b.sh (1)

23-29: Wrap all Python tooling with uv run here too.

Line 23, Line 26, and Line 29 are direct tool invocations; using uv run consistently avoids environment drift between commands.

As per coding guidelines, "{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_nemotronh_4b.sh` around
lines 23 - 29, The three direct tool invocations (the pytest call on the first
line, the coverage combine and the final pytest call) should be run via uv to
avoid environment drift; wrap each invocation with "uv run" so the pytest
commands (the one invoking torch.distributed.run and the two single-process
pytest calls) and the "coverage combine -q" call are executed through uv instead
of being called directly, ensuring consistent environment activation for
commands referenced in this diff.
tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh (1)

24-28: Apply uv run consistently to coverage/pytest commands.

Line 24, Line 26, and Line 28 should be wrapped with uv run for consistent dependency/environment resolution.

As per coding guidelines, "{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh` around
lines 24 - 28, The script invokes "coverage combine" and two "pytest ..."
commands directly; update each invocation (the "coverage combine" line and both
pytest command lines) to be executed via the project runner by prefixing them
with "uv run" so they use the repository's pinned environment and dependencies
(i.e., replace plain "coverage combine -q" and the two "pytest -o log_cli=..."
lines with "uv run coverage combine -q" and "uv run pytest -o log_cli=..."
respectively).
tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_llama32_1b.sh (1)

23-28: Run all Python tooling through uv run in this launcher.

Line 23, Line 26, and Line 28 invoke Python entrypoints directly (pytest / coverage). Use uv run consistently to keep resolver/environment behavior uniform in CI.

♻️ Suggested update
-pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core
+uv run --no-sync pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core
@@
-coverage combine -q
+uv run --no-sync coverage combine -q
@@
-pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
+uv run --no-sync pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts

As per coding guidelines, "{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly". Based on learnings, "prefer including --no-sync ... in containers or CI pipelines where dependencies are already installed."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_llama32_1b.sh` around
lines 23 - 28, Change the direct Python tooling calls to run under uv: replace
the standalone "pytest -o log_cli=..." invocations (test run lines invoking
pytest) with "uv run --no-sync pytest -o log_cli=..."; also run coverage via uv
by replacing "coverage combine -q" with "uv run --no-sync coverage combine -q".
Update the launcher script so all three tooling commands (the two pytest calls
and the coverage combine) use "uv run --no-sync" to ensure consistent
resolver/environment behavior.
tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py (2)

39-40: Rename test methods to snake_case.

test_llama32_1B_ckpt_mbridge and test_llama32_1B_ckpt_core use an uppercase segment (1B). Rename to snake_case (e.g., ..._1b_...) to match repo convention.

As per coding guidelines, "**/*.py: Use snake_case for function and method names".

Also applies to: 63-64

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines
39 - 40, Rename the two test methods that contain an uppercase "1B" to
snake_case: change test_llama32_1B_ckpt_mbridge -> test_llama32_1b_ckpt_mbridge
and test_llama32_1B_ckpt_core -> test_llama32_1b_ckpt_core; update any local
references (calls, decorators or skips) to these exact function names so test
discovery and references remain correct, and run the test suite to ensure no
other references break.

74-129: Prefer subprocess-based process isolation for this functional test.

Instead of patching sys.argv and calling torchrun_main() in-process, invoke torchrun via subprocess for better isolation.

As per coding guidelines, "tests/functional_tests/**/*.py: Use subprocess for functional tests that require process isolation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines
74 - 129, The test currently injects arguments via monkeypatch.setattr(sys,
"argv", ...) and calls torchrun_main() in-process; replace this with launching
torchrun as an external process using subprocess.run (or subprocess.check_call)
to get proper process isolation. Build the command list using the same flags
currently passed to monkeypatch (including load_dir, MCORE_CKPT, TB_DIR,
train_iters, etc.), call subprocess.run([...], env=os.environ.copy(),
cwd=<appropriate working dir>, stdout=PIPE, stderr=PIPE), and assert the process
returncode is zero (or raise on non-zero) and capture logs as needed; remove the
monkeypatch.setattr(sys, "argv", ...) and the torchrun_main() invocation. Ensure
you reference the same symbols from the diff (torchrun_main,
monkeypatch.setattr, sys.argv, load_dir, MCORE_CKPT, TB_DIR) when making the
change.
tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh (1)

24-28: Use uv run for remaining Python tool invocations.

Line 24, Line 26, and Line 28 currently execute Python tooling directly. Wrap them in uv run for consistent environment behavior.

As per coding guidelines, "{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh` around
lines 24 - 28, Replace direct Python tool invocations with "uv run" wrappers so
they run in the unified environment: change the three commands that currently
call "coverage combine -q" and the two "pytest ..." invocations to be executed
via "uv run" (i.e., "uv run coverage combine -q" and "uv run pytest ..." for the
pytest calls) so the script follows the "{**/*.sh,examples/**/*.py}" guideline;
update the occurrences of the exact strings "coverage combine -q" and the pytest
command lines in the file accordingly.
tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py (1)

74-132: Use subprocess for this torchrun invocation.

This test currently mutates global sys.argv and calls torchrun_main() in-process. For functional tests requiring process isolation, switch to subprocess.run([...], check=True).

As per coding guidelines, "tests/functional_tests/**/*.py: Use subprocess for functional tests that require process isolation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py` around lines 74
- 132, The test mutates sys.argv via monkeypatch.setattr and calls
torchrun_main() in-process (monkeypatch.setattr on sys, sys.argv and
torchrun_main()), which breaks process isolation; change it to invoke the same
command with subprocess.run([...], check=True) instead: build the full argv list
currently passed to sys.argv (starting with "torchrun", "--nproc_per_node=2",
..., "--no-load-rng") as a command array (or prefix with sys.executable and
"-m", e.g. [sys.executable, "-m", "torch.distributed.run", ...] or "torchrun" if
available), call subprocess.run(cmd, check=True) and remove the in-process
torchrun_main() call and the sys.argv monkeypatch; ensure any environment needed
(monkeypatching env vars) is passed via subprocess.run(env=...) if required.
tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b.sh (1)

23-28: Standardize this launcher on uv run for pytest and coverage.

Line 23, Line 26, and Line 28 should use uv run as well, so all Python tooling runs under the same managed environment.

As per coding guidelines, "{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b.sh` around
lines 23 - 28, The three standalone tooling invocations (the first pytest
command "pytest -o log_cli=true ... test_qwen3_4b_ckpt_mcore", the coverage
combine call "coverage combine -q", and the last pytest command "pytest -o
log_cli=true ... test_remove_artifacts") must be prefixed with "uv run" so they
run under the same managed environment as the middle launcher; update those
three command lines to use "uv run pytest ..." and "uv run coverage combine -q"
respectively.
tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py (1)

75-135: Prefer subprocess execution over in-process torchrun_main().

This functional test mutates global argv and executes torchrun in-process. Use a subprocess call for isolation and fewer side effects.

As per coding guidelines, "tests/functional_tests/**/*.py: Use subprocess for functional tests that require process isolation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py` around
lines 75 - 135, The test currently mutates sys.argv via monkeypatch.setattr(...)
and calls torchrun_main(), causing global state leakage; instead replace the
in-process invocation with a subprocess call: construct the same command list
currently assigned to sys.argv (including the "torchrun" entry and all flags
like "--nproc_per_node=2", "--load" load_dir, "--save" MCORE_CKPT, TB_DIR,
etc.), remove the monkeypatch.setattr(sys, "argv", ...) and the torchrun_main()
call, and invoke subprocess.run(command_list, check=True) (or subprocess.run
with stdout/stderr capture as needed) to run the job in a separate process,
ensuring process isolation and no global argv mutation. Use the symbols
monkeypatch.setattr, torchrun_main, and the command list entries to locate and
replace the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/cicd-main.yml:
- Around line 383-385: The three new matrix entries
L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh,
L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh, and
L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh include a redundant “.sh” suffix
which the test-template action will append again and cause *.sh.sh failures;
remove the “.sh” suffix from each of these entries so they become
L2_Launch_ckpts_mbridge_to_mlm_llama32_1b,
L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b, and
L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b respectively to match existing
extensionless matrix items.

In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py`:
- Around line 131-134: The test_remove_artifacts teardown currently calls
shutil.rmtree(BASE_DIR) unconditionally which can raise if BASE_DIR is missing;
update test_remove_artifacts to check for existence before removal by using
os.path.exists(BASE_DIR) (or pathlib.Path(BASE_DIR).exists()) and only call
shutil.rmtree(BASE_DIR) when present, or wrap the call in a try/except catching
FileNotFoundError to silently ignore missing directories so the test doesn't
fail when BASE_DIR is absent.
- Around line 66-67: Your test builds conflicting CLI args by unconditionally
appending a hardcoded --load/--save pair and later appending another pair based
on load_dir, and it emits --load even when load_dir is None; update the sys.argv
construction so that you only add a single --load/--save pair: check the
load_dir variable (derived from MBRIDGE_CKPT) and only append "--load", load_dir
when load_dir is truthy, then append a single matching "--save", save_dir (or
use the same save variable) in the same place instead of adding a second
hardcoded pair; ensure any previous duplicate entries are removed or never added
so the CLI sees one consistent source and target checkpoint.

In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py`:
- Around line 137-140: The test_remove_artifacts cleanup currently calls
shutil.rmtree(BASE_DIR) which will raise if BASE_DIR is missing; make it safe by
checking existence or using shutil.rmtree with ignore_errors=True so the test
won't error when artifacts are absent—update the test_remove_artifacts function
to guard removal (e.g., use os.path.exists(BASE_DIR) before calling
shutil.rmtree or pass ignore_errors=True) and add an import for os if you choose
the existence-check approach.
- Around line 67-68: The test currently computes load_dir (load_dir =
MBRIDGE_CKPT if os.path.exists(MBRIDGE_CKPT) else None) but always appends the
--load flag later, breaking first-run bootstrap; update the command construction
so the --load flag and its value are only added when load_dir is truthy (e.g. if
load_dir: args.extend(['--load', load_dir]) or similar). Keep the existing
train_iters logic (train_iters = 10 if load_dir else 5) unchanged.

In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py`:
- Around line 66-67: The test currently sets load_dir = MBRIDGE_CKPT if the path
exists otherwise None, but the command-line args still unconditionally append
"--load" which breaks bootstrap when no checkpoint exists; update the argument
construction to only append the "--load" flag and its value when load_dir is
truthy (non-None/exists), i.e., check load_dir before adding "--load" (and
before using train_iters logic that depends on it) so the test passes clean runs
without a checkpoint.
- Around line 133-136: The test_remove_artifacts cleanup currently calls
shutil.rmtree(BASE_DIR) unconditionally which raises if BASE_DIR doesn't exist;
make it idempotent by checking for existence or using the rmtree ignore option.
Update test_remove_artifacts to either wrap the removal in an existence check
(os.path.exists(BASE_DIR) before calling shutil.rmtree) or call
shutil.rmtree(BASE_DIR, ignore_errors=True); ensure any new os import is added
if you choose the existence check.

In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh`:
- Around line 24-28: Replace direct invocations of coverage and pytest with the
project's wrapper by prefixing those commands with "uv run" so they run under
the standard runtime environment; specifically update the script commands that
run "coverage combine -q" and the two "pytest ..." lines in
tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh to use "uv
run coverage combine -q" and "uv run pytest ..." respectively, keeping the same
pytest flags and test targets (the TestLlama32Ckpt::test_llama32_1B_ckpt_core
and TestLlama32Ckpt::test_remove_artifacts invocations) so behavior is unchanged
but execution complies with the guideline.

---

Nitpick comments:
In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py`:
- Around line 39-40: Rename the two test methods that contain an uppercase "1B"
to snake_case: change test_llama32_1B_ckpt_mbridge ->
test_llama32_1b_ckpt_mbridge and test_llama32_1B_ckpt_core ->
test_llama32_1b_ckpt_core; update any local references (calls, decorators or
skips) to these exact function names so test discovery and references remain
correct, and run the test suite to ensure no other references break.
- Around line 74-129: The test currently injects arguments via
monkeypatch.setattr(sys, "argv", ...) and calls torchrun_main() in-process;
replace this with launching torchrun as an external process using subprocess.run
(or subprocess.check_call) to get proper process isolation. Build the command
list using the same flags currently passed to monkeypatch (including load_dir,
MCORE_CKPT, TB_DIR, train_iters, etc.), call subprocess.run([...],
env=os.environ.copy(), cwd=<appropriate working dir>, stdout=PIPE, stderr=PIPE),
and assert the process returncode is zero (or raise on non-zero) and capture
logs as needed; remove the monkeypatch.setattr(sys, "argv", ...) and the
torchrun_main() invocation. Ensure you reference the same symbols from the diff
(torchrun_main, monkeypatch.setattr, sys.argv, load_dir, MCORE_CKPT, TB_DIR)
when making the change.

In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py`:
- Around line 75-135: The test currently mutates sys.argv via
monkeypatch.setattr(...) and calls torchrun_main(), causing global state
leakage; instead replace the in-process invocation with a subprocess call:
construct the same command list currently assigned to sys.argv (including the
"torchrun" entry and all flags like "--nproc_per_node=2", "--load" load_dir,
"--save" MCORE_CKPT, TB_DIR, etc.), remove the monkeypatch.setattr(sys, "argv",
...) and the torchrun_main() call, and invoke subprocess.run(command_list,
check=True) (or subprocess.run with stdout/stderr capture as needed) to run the
job in a separate process, ensuring process isolation and no global argv
mutation. Use the symbols monkeypatch.setattr, torchrun_main, and the command
list entries to locate and replace the code.

In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py`:
- Around line 74-132: The test mutates sys.argv via monkeypatch.setattr and
calls torchrun_main() in-process (monkeypatch.setattr on sys, sys.argv and
torchrun_main()), which breaks process isolation; change it to invoke the same
command with subprocess.run([...], check=True) instead: build the full argv list
currently passed to sys.argv (starting with "torchrun", "--nproc_per_node=2",
..., "--no-load-rng") as a command array (or prefix with sys.executable and
"-m", e.g. [sys.executable, "-m", "torch.distributed.run", ...] or "torchrun" if
available), call subprocess.run(cmd, check=True) and remove the in-process
torchrun_main() call and the sys.argv monkeypatch; ensure any environment needed
(monkeypatching env vars) is passed via subprocess.run(env=...) if required.

In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh`:
- Around line 24-28: The script invokes "coverage combine" and two "pytest ..."
commands directly; update each invocation (the "coverage combine" line and both
pytest command lines) to be executed via the project runner by prefixing them
with "uv run" so they use the repository's pinned environment and dependencies
(i.e., replace plain "coverage combine -q" and the two "pytest -o log_cli=..."
lines with "uv run coverage combine -q" and "uv run pytest -o log_cli=..."
respectively).

In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh`:
- Around line 24-28: Replace direct Python tool invocations with "uv run"
wrappers so they run in the unified environment: change the three commands that
currently call "coverage combine -q" and the two "pytest ..." invocations to be
executed via "uv run" (i.e., "uv run coverage combine -q" and "uv run pytest
..." for the pytest calls) so the script follows the
"{**/*.sh,examples/**/*.py}" guideline; update the occurrences of the exact
strings "coverage combine -q" and the pytest command lines in the file
accordingly.

In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_llama32_1b.sh`:
- Around line 23-28: Change the direct Python tooling calls to run under uv:
replace the standalone "pytest -o log_cli=..." invocations (test run lines
invoking pytest) with "uv run --no-sync pytest -o log_cli=..."; also run
coverage via uv by replacing "coverage combine -q" with "uv run --no-sync
coverage combine -q". Update the launcher script so all three tooling commands
(the two pytest calls and the coverage combine) use "uv run --no-sync" to ensure
consistent resolver/environment behavior.

In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_nemotronh_4b.sh`:
- Around line 23-29: The three direct tool invocations (the pytest call on the
first line, the coverage combine and the final pytest call) should be run via uv
to avoid environment drift; wrap each invocation with "uv run" so the pytest
commands (the one invoking torch.distributed.run and the two single-process
pytest calls) and the "coverage combine -q" call are executed through uv instead
of being called directly, ensuring consistent environment activation for
commands referenced in this diff.

In `@tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b.sh`:
- Around line 23-28: The three standalone tooling invocations (the first pytest
command "pytest -o log_cli=true ... test_qwen3_4b_ckpt_mcore", the coverage
combine call "coverage combine -q", and the last pytest command "pytest -o
log_cli=true ... test_remove_artifacts") must be prefixed with "uv run" so they
run under the same managed environment as the middle launcher; update those
three command lines to use "uv run pytest ..." and "uv run coverage combine -q"
respectively.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f1fb06a and 890333e.

📒 Files selected for processing (10)
  • .github/workflows/cicd-main.yml
  • tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh
  • tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh
  • tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh
  • tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_llama32_1b.sh
  • tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_nemotronh_4b.sh
  • tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b.sh
  • tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py
  • tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py
  • tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py

Comment on lines +383 to +385
- script: L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh
- script: L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh
- script: L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# First, let's see the context around lines 383-385 in cicd-main.yml
echo "=== Lines 383-385 and surrounding context ==="
sed -n '375,395p' .github/workflows/cicd-main.yml

echo
echo "=== All matrix.include entries to compare ==="
rg -n 'script:' .github/workflows/cicd-main.yml | head -30

echo
echo "=== Check test-template action to understand script handling ==="
cat .github/actions/test-template/action.yml

echo
echo "=== Look for .sh suffix appending logic in test-template ==="
find .github/actions/test-template -type f -exec cat {} \;

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 21379


Remove .sh suffix from new launcher script entries to prevent *.sh.sh runtime failure.

The test-template action appends .sh to all script inputs (line in action.yml: bash tests/.../${{ inputs.script }}.sh). All existing 30+ matrix entries use extensionless names (e.g., L2_Launch_training). These three new entries incorrectly include .sh, which will result in ...sh.sh and fail at runtime.

Fix
-          - script: L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh
-          - script: L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh 
-          - script: L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh
+          - script: L2_Launch_ckpts_mbridge_to_mlm_llama32_1b
+          - script: L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b
+          - script: L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/cicd-main.yml around lines 383 - 385, The three new matrix
entries L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh,
L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh, and
L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b.sh include a redundant “.sh” suffix
which the test-template action will append again and cause *.sh.sh failures;
remove the “.sh” suffix from each of these entries so they become
L2_Launch_ckpts_mbridge_to_mlm_llama32_1b,
L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b, and
L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b respectively to match existing
extensionless matrix items.

Comment on lines +66 to +67
load_dir = MBRIDGE_CKPT if os.path.exists(MBRIDGE_CKPT) else None
train_iters = 10 if load_dir else 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix conflicting and invalid checkpoint CLI arguments in sys.argv.

Line 81-82 hardcodes one load/save path pair, then Line 113-114 adds a second pair. Also load_dir can be absent, but --load is still emitted. This can select the wrong checkpoint source or fail bootstrap runs.

🐛 Suggested fix
-                "--load", "/workspace/test_ckpts/llama32_1b_mbridge",
-                "--save", "/workspace/test_ckpts/llama32_1b_mcore",
@@
-                "--load", load_dir,
-                "--save", MCORE_CKPT,
+                "--save", MCORE_CKPT,
@@
-            ],
-        )
+            ]
+        if load_dir:
+            argv.extend(["--load", load_dir])
+
+        monkeypatch.setattr(sys, "argv", argv)

Also applies to: 81-82, 113-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines
66 - 67, Your test builds conflicting CLI args by unconditionally appending a
hardcoded --load/--save pair and later appending another pair based on load_dir,
and it emits --load even when load_dir is None; update the sys.argv construction
so that you only add a single --load/--save pair: check the load_dir variable
(derived from MBRIDGE_CKPT) and only append "--load", load_dir when load_dir is
truthy, then append a single matching "--save", save_dir (or use the same save
variable) in the same place instead of adding a second hardcoded pair; ensure
any previous duplicate entries are removed or never added so the CLI sees one
consistent source and target checkpoint.

Comment on lines +131 to +134
def test_remove_artifacts(self):
"""Removes model artifacts"""
shutil.rmtree(BASE_DIR)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Harden artifact cleanup for missing directories.

Line 133 unconditionally calls shutil.rmtree(BASE_DIR). Guarding existence prevents unrelated failures during cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines
131 - 134, The test_remove_artifacts teardown currently calls
shutil.rmtree(BASE_DIR) unconditionally which can raise if BASE_DIR is missing;
update test_remove_artifacts to check for existence before removal by using
os.path.exists(BASE_DIR) (or pathlib.Path(BASE_DIR).exists()) and only call
shutil.rmtree(BASE_DIR) when present, or wrap the call in a try/except catching
FileNotFoundError to silently ignore missing directories so the test doesn't
fail when BASE_DIR is absent.

Comment on lines +67 to +68
load_dir = MBRIDGE_CKPT if os.path.exists(MBRIDGE_CKPT) else None
train_iters = 10 if load_dir else 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Guard --load so first-run execution remains valid.

If MBRIDGE_CKPT does not exist, load_dir is None; Line 118 still appends --load, which breaks the bootstrap path.

🐛 Suggested fix
-        monkeypatch.setattr(
-            sys,
-            "argv",
-            [
+        argv = [
                 "torchrun",
@@
-                "--eval-iters", "4",
-                "--load", load_dir,
+                "--eval-iters", "4",
                 "--save", MCORE_CKPT,
@@
-                "--no-load-rng",
-            ],
-        )
+                "--no-load-rng",
+            ]
+        if load_dir:
+            argv.extend(["--load", load_dir])
+
+        monkeypatch.setattr(sys, "argv", argv)

Also applies to: 118-119

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py` around
lines 67 - 68, The test currently computes load_dir (load_dir = MBRIDGE_CKPT if
os.path.exists(MBRIDGE_CKPT) else None) but always appends the --load flag
later, breaking first-run bootstrap; update the command construction so the
--load flag and its value are only added when load_dir is truthy (e.g. if
load_dir: args.extend(['--load', load_dir]) or similar). Keep the existing
train_iters logic (train_iters = 10 if load_dir else 5) unchanged.

Comment on lines +137 to +140
def test_remove_artifacts(self):
"""Removes model artifacts"""
shutil.rmtree(BASE_DIR)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Make cleanup safe when artifacts are absent.

Line 139 can raise if BASE_DIR does not exist; guard removal to keep cleanup deterministic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py` around
lines 137 - 140, The test_remove_artifacts cleanup currently calls
shutil.rmtree(BASE_DIR) which will raise if BASE_DIR is missing; make it safe by
checking existence or using shutil.rmtree with ignore_errors=True so the test
won't error when artifacts are absent—update the test_remove_artifacts function
to guard removal (e.g., use os.path.exists(BASE_DIR) before calling
shutil.rmtree or pass ignore_errors=True) and add an import for os if you choose
the existence-check approach.

Comment on lines +66 to +67
load_dir = MBRIDGE_CKPT if os.path.exists(MBRIDGE_CKPT) else None
train_iters = 10 if load_dir else 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Avoid emitting --load when no checkpoint exists.

On a clean run, load_dir is None but Line 114 still appends --load. This breaks the no-checkpoint bootstrap path.

🐛 Suggested fix
-        monkeypatch.setattr(
-            sys,
-            "argv",
-            [
+        argv = [
                 "torchrun",
                 "--nproc_per_node=2",
                 "/opt/Megatron-Bridge/3rdparty/Megatron-LM/pretrain_gpt.py",
@@
-                "--eval-iters", "4",
-                "--load", load_dir,
+                "--eval-iters", "4",
                 "--save", MCORE_CKPT,
                 "--ckpt-format", "torch_dist",
@@
-                "--no-load-rng",
-            ],
-        )
+                "--no-load-rng",
+            ]
+        if load_dir:
+            argv.extend(["--load", load_dir])
+
+        monkeypatch.setattr(sys, "argv", argv)

Also applies to: 114-115

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py` around lines 66
- 67, The test currently sets load_dir = MBRIDGE_CKPT if the path exists
otherwise None, but the command-line args still unconditionally append "--load"
which breaks bootstrap when no checkpoint exists; update the argument
construction to only append the "--load" flag and its value when load_dir is
truthy (non-None/exists), i.e., check load_dir before adding "--load" (and
before using train_iters logic that depends on it) so the test passes clean runs
without a checkpoint.

Comment on lines +133 to +136
def test_remove_artifacts(self):
"""Removes model artifacts"""
shutil.rmtree(BASE_DIR)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Make artifact cleanup idempotent.

Line 135 unconditionally removes BASE_DIR; if it was never created, this fails the cleanup test for the wrong reason.

🧹 Suggested fix
-        shutil.rmtree(BASE_DIR)
+        if os.path.exists(BASE_DIR):
+            shutil.rmtree(BASE_DIR)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py` around lines 133
- 136, The test_remove_artifacts cleanup currently calls shutil.rmtree(BASE_DIR)
unconditionally which raises if BASE_DIR doesn't exist; make it idempotent by
checking for existence or using the rmtree ignore option. Update
test_remove_artifacts to either wrap the removal in an existence check
(os.path.exists(BASE_DIR) before calling shutil.rmtree) or call
shutil.rmtree(BASE_DIR, ignore_errors=True); ensure any new os import is added
if you choose the existence check.

Comment on lines +24 to +28
coverage combine -q

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Run coverage and pytest via uv run for guideline compliance.

Line 24, Line 26, and Line 28 execute Python tooling directly. This should go through uv run in shell scripts.

Suggested fix
-coverage combine -q
+uv run coverage combine -q

-pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core
+uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core

-pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
+uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
As per coding guidelines: "`{**/*.sh,examples/**/*.py}`: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
coverage combine -q
pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core
pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
uv run coverage combine -q
uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core
uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh` around
lines 24 - 28, Replace direct invocations of coverage and pytest with the
project's wrapper by prefixing those commands with "uv run" so they run under
the standard runtime environment; specifically update the script commands that
run "coverage combine -q" and the two "pytest ..." lines in
tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh to use "uv
run coverage combine -q" and "uv run pytest ..." respectively, keeping the same
pytest flags and test targets (the TestLlama32Ckpt::test_llama32_1B_ckpt_core
and TestLlama32Ckpt::test_remove_artifacts invocations) so behavior is unchanged
but execution complies with the guideline.

@dimapihtar
Copy link
Contributor Author

/ok to test 890333e

Signed-off-by: dimapihtar <dpihtar@gmail.com>
@dimapihtar
Copy link
Contributor Author

/ok to test 846de61

Signed-off-by: dimapihtar <dpihtar@gmail.com>
@dimapihtar
Copy link
Contributor Author

/ok to test 80dcef0

Signed-off-by: dimapihtar <dpihtar@gmail.com>
@dimapihtar
Copy link
Contributor Author

/ok to test 038f18d

Signed-off-by: dimapihtar <dpihtar@gmail.com>
@dimapihtar
Copy link
Contributor Author

/ok to test b96fda1

dimapihtar and others added 2 commits March 4, 2026 09:02
@dimapihtar
Copy link
Contributor Author

/ok to test 4ea3be7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants