Merged
6 changes: 6 additions & 0 deletions .github/workflows/cicd-main.yml
Original file line number Diff line number Diff line change
@@ -380,6 +380,12 @@ jobs:
# - script: L2_Launch_quantization_export
- script: L2_Launch_recipes_llama_cuda_graphs
- script: L2_Launch_utils
- script: L2_Launch_ckpts_mbridge_to_mlm_llama32_1b
- script: L2_Launch_ckpts_mlm_to_mbridge_llama32_1b
- script: L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b
- script: L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b
- script: L2_Launch_ckpts_mbridge_to_mlm_nemotronh_4b
- script: L2_Launch_ckpts_mlm_to_mbridge_nemotronh_4b
needs: [pre-flight, cicd-unit-tests]
runs-on: ${{ needs.pre-flight.outputs.runner_prefix }}-gpu-x2
if: |
@@ -0,0 +1,28 @@
#!/bin/bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -xeuo pipefail # Print each command; exit on error, unset variable, or pipeline failure

export CUDA_VISIBLE_DEVICES="0,1"

# Run checkpoint interoperability tests on 2 GPUs
# This script trains Llama 3.2 1B with Megatron Bridge, saves a checkpoint,
# and then verifies that Megatron-LM can load it and continue training
uv run python -m torch.distributed.run --nproc_per_node=2 --nnodes=1 -m coverage run --data-file=/opt/Megatron-Bridge/.coverage --source=/opt/Megatron-Bridge/ --parallel-mode -m pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_mbridge
coverage combine -q

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
Comment on lines +24 to +28
⚠️ Potential issue | 🟠 Major

Run coverage and pytest via uv run for guideline compliance.

Line 24, Line 26, and Line 28 execute Python tooling directly; in shell scripts these commands should go through uv run.

Suggested fix
-coverage combine -q
+uv run coverage combine -q

-pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core
+uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core

-pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
+uv run pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
As per coding guidelines: "`{**/*.sh,examples/**/*.py}`: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh` around
lines 24 - 28, Replace direct invocations of coverage and pytest with the
project's wrapper by prefixing those commands with "uv run" so they run under
the standard runtime environment; specifically update the script commands that
run "coverage combine -q" and the two "pytest ..." lines in
tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_llama32_1b.sh to use "uv
run coverage combine -q" and "uv run pytest ..." respectively, keeping the same
pytest flags and test targets (the TestLlama32Ckpt::test_llama32_1B_ckpt_core
and TestLlama32Ckpt::test_remove_artifacts invocations) so behavior is unchanged
but execution complies with the guideline.

@@ -0,0 +1,28 @@
#!/bin/bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -xeuo pipefail # Print each command; exit on error, unset variable, or pipeline failure

export CUDA_VISIBLE_DEVICES="0,1"

# Run checkpoint interoperability tests on 2 GPUs
# This script trains Nemotron-H 4B with Megatron Bridge, saves a checkpoint,
# and then verifies that Megatron-LM can load it and continue training
uv run python -m torch.distributed.run --nproc_per_node=2 --nnodes=1 -m coverage run --data-file=/opt/Megatron-Bridge/.coverage --source=/opt/Megatron-Bridge/ --parallel-mode -m pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py::TestNemotronhCkpt::test_nemotronh_4b_ckpt_mbridge
coverage combine -q

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py::TestNemotronhCkpt::test_nemotronh_4b_ckpt_mcore

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py::TestNemotronhCkpt::test_remove_artifacts
28 changes: 28 additions & 0 deletions tests/functional_tests/L2_Launch_ckpts_mbridge_to_mlm_qwen3_4b.sh
@@ -0,0 +1,28 @@
#!/bin/bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -xeuo pipefail # Print each command; exit on error, unset variable, or pipeline failure

export CUDA_VISIBLE_DEVICES="0,1"

# Run checkpoint interoperability tests on 2 GPUs
# This script trains Qwen3 4B with Megatron Bridge, saves a checkpoint,
# and then verifies that Megatron-LM can load it and continue training
uv run python -m torch.distributed.run --nproc_per_node=2 --nnodes=1 -m coverage run --data-file=/opt/Megatron-Bridge/.coverage --source=/opt/Megatron-Bridge/ --parallel-mode -m pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py::TestQwen3Ckpt::test_qwen3_4b_ckpt_mbridge
coverage combine -q

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py::TestQwen3Ckpt::test_qwen3_4b_ckpt_mcore

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py::TestQwen3Ckpt::test_remove_artifacts
@@ -0,0 +1,28 @@
#!/bin/bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -xeuo pipefail # Print each command; exit on error, unset variable, or pipeline failure

export CUDA_VISIBLE_DEVICES="0,1"

# Run checkpoint interoperability tests on 2 GPUs
# This script trains Llama 3.2 1B with Megatron-LM, saves a checkpoint,
# and then verifies that Megatron Bridge can load it and continue training
pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_core

uv run python -m torch.distributed.run --nproc_per_node=2 --nnodes=1 -m coverage run --data-file=/opt/Megatron-Bridge/.coverage --source=/opt/Megatron-Bridge/ --parallel-mode -m pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_llama32_1B_ckpt_mbridge
coverage combine -q

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py::TestLlama32Ckpt::test_remove_artifacts
@@ -0,0 +1,29 @@
#!/bin/bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -xeuo pipefail # Print each command; exit on error, unset variable, or pipeline failure

export CUDA_VISIBLE_DEVICES="0,1"

# Run checkpoint interoperability tests on 2 GPUs
# This script trains Nemotron-H 4B with Megatron-LM, saves a checkpoint,
# and then verifies that Megatron Bridge can load it and continue training
pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py::TestNemotronhCkpt::test_nemotronh_4b_ckpt_mcore

uv run python -m torch.distributed.run --nproc_per_node=2 --nnodes=1 -m coverage run --data-file=/opt/Megatron-Bridge/.coverage --source=/opt/Megatron-Bridge/ --parallel-mode -m pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py::TestNemotronhCkpt::test_nemotronh_4b_ckpt_mbridge
coverage combine -q


pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/nemotronh_4b/test_nemotronh_4b_ckpt.py::TestNemotronhCkpt::test_remove_artifacts
28 changes: 28 additions & 0 deletions tests/functional_tests/L2_Launch_ckpts_mlm_to_mbridge_qwen3_4b.sh
@@ -0,0 +1,28 @@
#!/bin/bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -xeuo pipefail # Print each command; exit on error, unset variable, or pipeline failure

export CUDA_VISIBLE_DEVICES="0,1"

# Run checkpoint interoperability tests on 2 GPUs
# This script trains Qwen3 4B with Megatron-LM, saves a checkpoint,
# and then verifies that Megatron Bridge can load it and continue training
pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py::TestQwen3Ckpt::test_qwen3_4b_ckpt_mcore

uv run python -m torch.distributed.run --nproc_per_node=2 --nnodes=1 -m coverage run --data-file=/opt/Megatron-Bridge/.coverage --source=/opt/Megatron-Bridge/ --parallel-mode -m pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py::TestQwen3Ckpt::test_qwen3_4b_ckpt_mbridge
coverage combine -q

pytest -o log_cli=true -o log_cli_level=INFO -v -s -x -m "not pleasefixme" --tb=short -rA tests/functional_tests/ckpts/qwen3_4b/test_qwen3_4b_ckpt.py::TestQwen3Ckpt::test_remove_artifacts
169 changes: 169 additions & 0 deletions tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py
@@ -0,0 +1,169 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Functional smoke tests for LLaMA checkpointing."""

import os
import shutil
import sys

import pytest
from torch.distributed.run import main as torchrun_main

from megatron.bridge.recipes.llama import llama32_1b_pretrain_config
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain


BASE_DIR = "/workspace/test_ckpts/llama32_1b"
MBRIDGE_CKPT = f"{BASE_DIR}/mbridge"
MCORE_CKPT = f"{BASE_DIR}/mcore"
TB_DIR = f"{BASE_DIR}/tb"


class TestLlama32Ckpt:
"""Test class for Llama checkpoint functional tests."""

@pytest.mark.run_only_on("GPU")
def test_llama32_1B_ckpt_mbridge(self):
"""Functional test for Llama MBridge checkpoint."""

config = llama32_1b_pretrain_config()

config.checkpoint.save = MBRIDGE_CKPT
config.checkpoint.load = MCORE_CKPT if os.path.exists(MCORE_CKPT) else None
config.checkpoint.load_optim = False

config.model.seq_length = 8192

config.train.train_iters = 10 if config.checkpoint.load else 5
config.train.eval_iters = 5
config.train.save_interval = 5
config.train.global_batch_size = 8
config.train.micro_batch_size = 1

config.scheduler.lr_warmup_iters = 2

config.logger.log_interval = 1

pretrain(config=config, forward_step_func=forward_step)

@pytest.mark.run_only_on("GPU")
def test_llama32_1B_ckpt_core(self, monkeypatch):
"""Functional test for Llama MCore checkpoint."""

load_dir = MBRIDGE_CKPT if os.path.exists(MBRIDGE_CKPT) else None
train_iters = 10 if load_dir else 5
Comment on lines +66 to +67

⚠️ Potential issue | 🔴 Critical

Fix conflicting and invalid checkpoint CLI arguments in sys.argv.

Lines 81-82 hardcode one load/save path pair, then lines 113-114 add a second pair. Also, load_dir can be None, yet --load is still emitted. This can select the wrong checkpoint source or fail bootstrap runs.

🐛 Suggested fix
-                "--load", "/workspace/test_ckpts/llama32_1b_mbridge",
-                "--save", "/workspace/test_ckpts/llama32_1b_mcore",
@@
-                "--load", load_dir,
-                "--save", MCORE_CKPT,
+                "--save", MCORE_CKPT,
@@
-            ],
-        )
+            ]
+        if load_dir:
+            argv.extend(["--load", load_dir])
+
+        monkeypatch.setattr(sys, "argv", argv)

Also applies to: 81-82, 113-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines
66 - 67, Your test builds conflicting CLI args by unconditionally appending a
hardcoded --load/--save pair and later appending another pair based on load_dir,
and it emits --load even when load_dir is None; update the sys.argv construction
so that you only add a single --load/--save pair: check the load_dir variable
(derived from MBRIDGE_CKPT) and only append "--load", load_dir when load_dir is
truthy, then append a single matching "--save", save_dir (or use the same save
variable) in the same place instead of adding a second hardcoded pair; ensure
any previous duplicate entries are removed or never added so the CLI sees one
consistent source and target checkpoint.


# Set environment variables
monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0,1")
monkeypatch.setenv("CUDA_DEVICE_MAX_CONNECTIONS", "1")

# Set MLM script
monkeypatch.setattr(
sys,
"argv",
[
"torchrun",
"--nproc-per-node=2",
"/opt/Megatron-Bridge/3rdparty/Megatron-LM/pretrain_gpt.py",
"--load",
"/workspace/test_ckpts/llama32_1b_mbridge",
"--save",
"/workspace/test_ckpts/llama32_1b_mcore",
"--init-method-std",
"0.014",
"--disable-bias-linear",
"--use-rope-scaling",
"--swiglu",
"--use-rotary-position-embeddings",
"--num-layers",
"16",
"--hidden-size",
"2048",
"--num-attention-heads",
"32",
"--ffn-hidden-size",
"8192",
"--kv-channels",
"64",
"--group-query-attention",
"--position-embedding-type",
"rope",
"--attention-backend",
"fused",
"--num-query-groups",
"8",
"--normalization",
"RMSNorm",
"--attention-dropout",
"0.0",
"--hidden-dropout",
"0.0",
"--tensor-model-parallel-size",
"1",
"--pipeline-model-parallel-size",
"1",
"--seq-length",
"8192",
"--max-position-embeddings",
"8192",
"--micro-batch-size",
"1",
"--global-batch-size",
"8",
"--mock-data",
"--tokenizer-type",
"NullTokenizer",
"--vocab-size",
"131072",
"--train-iters",
f"{train_iters}",
"--save-interval",
"5",
"--eval-interval",
"5",
"--eval-iters",
"5",
"--load",
load_dir,
"--save",
MCORE_CKPT,
"--ckpt-format",
"torch_dist",
"--log-progress",
"--bf16",
"--lr",
"4.5e-4",
"--min-lr",
"4.5e-5",
"--num-workers",
"2",
"--tensorboard-dir",
TB_DIR,
"--log-interval",
"1",
"--log-throughput",
"--no-load-optim",
],
)

# Run MLM script
torchrun_main()

def test_remove_artifacts(self):
"""Removes model artifacts"""
shutil.rmtree(BASE_DIR)

Comment on lines +165 to +168

⚠️ Potential issue | 🟡 Minor

Harden artifact cleanup for missing directories.

Line 133 unconditionally calls shutil.rmtree(BASE_DIR); checking for the directory first prevents unrelated failures during cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/ckpts/llama32_1b/test_llama32_1b_ckpt.py` around lines
131 - 134, The test_remove_artifacts teardown currently calls
shutil.rmtree(BASE_DIR) unconditionally which can raise if BASE_DIR is missing;
update test_remove_artifacts to check for existence before removal by using
os.path.exists(BASE_DIR) (or pathlib.Path(BASE_DIR).exists()) and only call
shutil.rmtree(BASE_DIR) when present, or wrap the call in a try/except catching
FileNotFoundError to silently ignore missing directories so the test doesn't
fail when BASE_DIR is absent.

assert not os.path.exists(BASE_DIR)
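
The guarded cleanup the last comment asks for is a one-line change; a minimal standalone sketch, using a temp directory in place of the test's BASE_DIR:

```python
import os
import shutil
import tempfile


def remove_artifacts(base_dir: str) -> None:
    """Remove a checkpoint directory, tolerating one that is already gone."""
    if os.path.exists(base_dir):
        shutil.rmtree(base_dir)
    # Equivalent alternative: shutil.rmtree(base_dir, ignore_errors=True)


base = tempfile.mkdtemp()
remove_artifacts(base)  # removes the directory
remove_artifacts(base)  # second call is a no-op instead of FileNotFoundError
```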