Conversation
#2151)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Ao Tang <aot@nvidia.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Co-authored-by: Dingqing Yang <dingqingy@nvidia.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ko3n1g <16716991+ko3n1g@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: malay-nagda <malayn@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
/ok to test 17a2560
📝 Walkthrough

This PR consolidates Vision-Language Model documentation by simplifying docs with references to example directories, adds comprehensive GLM-4.5V example scripts for conversion/inference/finetuning workflows, introduces PEFT support to the GLM-4.5V recipe configuration with updated default parallelism settings, and updates tests to match the new configuration defaults.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit_tests/recipes/test_glm_45v_recipes.py (1)

**368-455: ⚠️ Potential issue | 🟡 Minor — Add test markers and return type hints for the new pipeline-layout tests.**

This keeps unit tests categorized and aligns with the project's Python typing rules.

Suggested change:

```diff
-def test_glm_45v_pipeline_layout_pp4():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp4() -> None:
@@
-def test_glm_45v_pipeline_layout_pp8():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp8() -> None:
@@
-def test_glm_45v_pipeline_layout_pp16():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp16() -> None:
@@
-def test_glm_45v_pipeline_layout_pp8_peft():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp8_peft() -> None:
@@
-def test_glm_45v_pipeline_layout_pp16_peft():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp16_peft() -> None:
```

As per coding guidelines, `tests/**/*.py`: Use `pytest.mark` to categorize tests (unit, integration, system); `**/*.py`: Use type hints for function arguments and return types.
🤖 Fix all issues with AI agents
In `@examples/models/vlm/glm_45v/inference.sh`:
- Around line 1-54: Enable strict bash mode and quote all expansions of
WORKSPACE and other variables in inference.sh to prevent silent failures and
word-splitting: add a strict-mode header (set -euo pipefail and optionally
IFS=$'\n\t') at the top of the script, and update usages like
${WORKSPACE}/models/GLM-4.5V/iter_0000000 and
${WORKSPACE}/models/GLM-4.5V-hf-export to use quoted expansions
("${WORKSPACE}/...") as well as any other variable usages in the uv run command
lines to be quoted; keep the existing default assignment
WORKSPACE=${WORKSPACE:-/workspace} intact.
In `@examples/models/vlm/glm_45v/slurm_sft.sh`:
- Around line 44-73: The script currently unconditionally resets CONTAINER_IMAGE
and CONTAINER_MOUNTS to empty strings, clobbering any exported environment
values; update those lines to preserve environment overrides using parameter
expansion (e.g., set CONTAINER_IMAGE="${CONTAINER_IMAGE:-}" and
CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-}" or simply remove the explicit empty
assignment) and ensure any usages of these variables are quoted (e.g., use
"$CONTAINER_IMAGE" and "$CONTAINER_MOUNTS") so path values with spaces are
handled correctly; refer to the CONTAINER_IMAGE and CONTAINER_MOUNTS symbols in
the slurm_sft.sh snippet when making the change.
In `@src/megatron/bridge/recipes/__init__.py`:
- Around line 24-25: The star-import lines in
megatron.bridge.recipes.__init__.py ("from megatron.bridge.recipes.glm import *"
and "from megatron.bridge.recipes.glm_vl import *") trigger Ruff/Flake8
F401/F403; silence these warnings by appending an explicit noqa for those codes
to the import lines (e.g. add "# noqa: F401,F403" to each star-import) so the
re-exports remain intentional and linting will pass.
In `@src/megatron/bridge/recipes/glm_vl/glm_45v.py`:
- Line 264: The call to set_glm_45v_pipeline_model_parallel_layout uses
is_peft=peft is not None which treats the string "none" as PEFT; change the
predicate so "none" is treated as no-PEFT (e.g., compute is_peft = peft is not
None and peft != "none" or equivalent) and pass that boolean to
set_glm_45v_pipeline_model_parallel_layout to ensure PEFT layouts are only
selected for actual PEFT values.
- Around line 80-88: The (16,1) entry in layout_map under-allocates decoder
layers (totals 45) for the 46-layer model; update that entry so the sum of
decoder occurrences equals 46 — e.g., change the repeated block [["decoder"] *
3] * 14 to [["decoder"] * 3] * 15 (or otherwise increment the decoder count in
the (16,1) branch) so layout_map[(16,1)] plus the final ["decoder"] * 3 +
last_layer yields 46 decoders in total; adjust only the (16,1) list construction
referencing layout_map and last_layer.
🧹 Nitpick comments (7)
examples/models/vlm/glm_45v/README.md (1)

**24-27: Consider using `uv run` in example commands for consistency.**

The actual `conversion.sh` script uses `uv run python`, but the example commands here use `python` directly. This inconsistency might confuse users who follow the README versus run the scripts.

Suggested change:

```diff
-python examples/conversion/convert_checkpoints.py import \
+uv run python examples/conversion/convert_checkpoints.py import \
   --hf-model zai-org/GLM-4.5V \
   --megatron-path /models/GLM-4.5V
```

examples/models/vlm/gemma3_vl/README.md (1)

**24-27: Consider using `uv run` in example commands for consistency.**

Same as the GLM-4.5V README: consider using `uv run python` to match the actual shell scripts.

Suggested change:

```diff
-python examples/conversion/convert_checkpoints.py import \
+uv run python examples/conversion/convert_checkpoints.py import \
   --hf-model google/gemma-3-4b-it \
   --megatron-path /models/gemma-3-4b-it
```

examples/models/vlm/glm_45v/conversion.sh (2)
**1-17: Consider adding `set -euo pipefail` for robust error handling.**

Per the Google Shell Style Guide, adding error handling at the start of the script helps catch failures early.

Suggested change:

```diff
 #!/usr/bin/env bash
 # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # ...
 # limitations under the License.

+set -euo pipefail
+
 # Workspace directory for checkpoints and results
 WORKSPACE=${WORKSPACE:-/workspace}
```

**20-28: Quote variable expansions to handle paths with spaces.**

Suggested change:

```diff
 # Import HF → Megatron
 uv run python examples/conversion/convert_checkpoints.py import \
   --hf-model zai-org/GLM-4.5V \
-  --megatron-path ${WORKSPACE}/models/GLM-4.5V
+  --megatron-path "${WORKSPACE}/models/GLM-4.5V"

 # Export Megatron → HF
 uv run python examples/conversion/convert_checkpoints.py export \
   --hf-model zai-org/GLM-4.5V \
-  --megatron-path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
-  --hf-path ${WORKSPACE}/models/GLM-4.5V-hf-export
+  --megatron-path "${WORKSPACE}/models/GLM-4.5V/iter_0000000" \
+  --hf-path "${WORKSPACE}/models/GLM-4.5V-hf-export"
```

examples/models/vlm/glm_45v/slurm_peft.sh (2)
**136-138: Fragile synchronization using `sleep 2`.**

The `sleep 2` workaround for waiting on `uv sync` is brittle and may not be sufficient if sync takes longer. Consider a more robust synchronization mechanism, such as a file-based lock or barrier.

Alternative approach using a file-based barrier:

```diff
-# Only local rank 0 on each node runs uv sync, then all ranks run with --no-sync
-CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 2; fi && "
+# Only local rank 0 on each node runs uv sync with file-based barrier
+CMD="SYNC_DONE=/tmp/uv_sync_done_\${SLURM_NODEID}; "
+CMD="\$CMD if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync && touch \$SYNC_DONE; else while [ ! -f \$SYNC_DONE ]; do sleep 0.5; done; fi && "
```

Alternatively, pre-sync the UV cache before job submission as documented in the README (lines 145-173) to avoid runtime synchronization entirely.

**156-160: Mount handling may fail with paths containing spaces.**

Iterating over space-separated `CONTAINER_MOUNTS` will break if any path contains spaces.

Consider using an array or comma-separated format:

```diff
 # Container mounts (optional, space-separated)
-CONTAINER_MOUNTS=""
-# CONTAINER_MOUNTS="/data:/data /workspace:/workspace"
+CONTAINER_MOUNTS=()
+# CONTAINER_MOUNTS=("/data:/data" "/workspace:/workspace")
 ...
 # Add container mounts
-if [ -n "$CONTAINER_MOUNTS" ]; then
-  for mount in $CONTAINER_MOUNTS; do
+if [ ${#CONTAINER_MOUNTS[@]} -gt 0 ]; then
+  for mount in "${CONTAINER_MOUNTS[@]}"; do
     SRUN_CMD="$SRUN_CMD --container-mounts=$mount"
   done
 fi
```

src/megatron/bridge/recipes/glm_vl/glm_45v.py (1)
**46-48: Prefer PEP 604 union syntax for Python 3.10+.**

This aligns with the repository typing guidelines.

Suggested change:

```diff
 def set_glm_45v_pipeline_model_parallel_layout(
-    model_cfg: GPTModelProvider, layout: Optional[Union[str, List[List[str]]]] = None, is_peft: bool = False
+    model_cfg: GPTModelProvider, layout: str | list[list[str]] | None = None, is_peft: bool = False
 ) -> None:
```

As per coding guidelines, use `T | None` for nullable types instead of `Optional[T]` and `X | Y` for union types instead of `Union[X, Y]`.
examples/models/vlm/glm_45v/inference.sh

```bash
#!/usr/bin/env bash
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# GLM-4.5V is a large MoE model (106B parameters)
# Using TP=1, PP=4, EP=2 for inference (8 GPUs minimum)

# Inference with Hugging Face checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
    --hf_model_path zai-org/GLM-4.5V \
    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
    --prompt "Describe this image." \
    --max_new_tokens 50 \
    --tp 1 \
    --pp 4 \
    --ep 2 \
    --trust_remote_code

# Inference with imported Megatron checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
    --hf_model_path zai-org/GLM-4.5V \
    --megatron_model_path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
    --prompt "Describe this image." \
    --max_new_tokens 50 \
    --tp 1 \
    --pp 2 \
    --ep 4 \
    --trust_remote_code

# Inference with exported HF checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
    --hf_model_path ${WORKSPACE}/models/GLM-4.5V-hf-export \
    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
    --prompt "Describe this image." \
    --max_new_tokens 50 \
    --tp 1 \
    --pp 2 \
    --ep 4 \
    --trust_remote_code
```
**Harden the script with strict mode and quoted path expansions.**

This avoids silent failures and word-splitting when WORKSPACE is customized.

Suggested change:

```diff
 #!/usr/bin/env bash
+set -euo pipefail
@@
-WORKSPACE=${WORKSPACE:-/workspace}
+WORKSPACE="${WORKSPACE:-/workspace}"
@@
-    --megatron_model_path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
+    --megatron_model_path "${WORKSPACE}/models/GLM-4.5V/iter_0000000" \
@@
-    --hf_model_path ${WORKSPACE}/models/GLM-4.5V-hf-export \
+    --hf_model_path "${WORKSPACE}/models/GLM-4.5V-hf-export" \
```

As per coding guidelines, `**/*.sh`: Follow Google Shell Style Guide.
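To see why the quoting matters, here is a minimal sketch of the failure mode (the workspace path is hypothetical, chosen only because it contains a space):

```shell
#!/usr/bin/env bash
set -euo pipefail

# A workspace path containing a space, as a user might legitimately export it.
WORKSPACE="/tmp/my workspace"

# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted expansion word-splits the path into two arguments.
unquoted=$(count_args ${WORKSPACE}/models)
# Quoted expansion keeps the path intact as a single argument.
quoted=$(count_args "${WORKSPACE}/models")

echo "unquoted args: ${unquoted}"   # 2 -- the path was split at the space
echo "quoted args: ${quoted}"       # 1 -- the path survives intact
```

With `set -u`, a typo'd variable name would additionally abort the script instead of silently expanding to an empty string.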
examples/models/vlm/glm_45v/slurm_sft.sh (excerpt)

```bash
# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# Model and training configurations
PRETRAINED_CHECKPOINT=${WORKSPACE}/models/GLM-4.5V
MODEL_NAME=glm_45v
DATASET_NAME=cord_v2
SEQ_LENGTH=8192
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=64
MICRO_BATCH_SIZE=1
EVAL_ITERS=10
LR=0.000005
MIN_LR=0.0000005
LR_WARMUP_ITERS=10
LOG_INTERVAL=1
WANDB_PROJECT=megatron-bridge-${DATASET_NAME}

# Parallelism configuration
TP=1
PP=8
EP=16

# Container image (required)
CONTAINER_IMAGE=""
# CONTAINER_IMAGE="/path/to/container.sqsh"

# Container mounts (optional, space-separated)
CONTAINER_MOUNTS=""
# CONTAINER_MOUNTS="/data:/data /workspace:/workspace"
```
**Respect environment overrides for container settings (and quote paths).**

Currently CONTAINER_IMAGE/CONTAINER_MOUNTS are reset to empty even if exported before submission; parameter expansion keeps env overrides while preserving defaults.

Suggested change:

```diff
-WORKSPACE=${WORKSPACE:-/workspace}
+WORKSPACE="${WORKSPACE:-/workspace}"
@@
-CONTAINER_IMAGE=""
+CONTAINER_IMAGE="${CONTAINER_IMAGE:-}"
@@
-CONTAINER_MOUNTS=""
+CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-}"
```

As per coding guidelines, `**/*.sh`: Follow Google Shell Style Guide.
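The `${VAR:-default}` expansion the review recommends keeps an exported value while still providing a fallback. A quick self-contained sketch (the image path is a made-up example):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Start from a clean slate so the demo is deterministic.
unset CONTAINER_MOUNTS

# Simulate a user exporting an override before submitting the job.
export CONTAINER_IMAGE="/images/custom.sqsh"

# Parameter expansion: use the exported value if set, otherwise the default.
CONTAINER_IMAGE="${CONTAINER_IMAGE:-}"    # keeps "/images/custom.sqsh"
CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-}"  # unset in the environment -> ""

echo "image: ${CONTAINER_IMAGE}"
echo "mounts: '${CONTAINER_MOUNTS}'"
```

Note that `CONTAINER_IMAGE=""` (the current code) would discard the exported value, which is exactly the clobbering the review describes.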
src/megatron/bridge/recipes/__init__.py

```python
from megatron.bridge.recipes.glm import *
from megatron.bridge.recipes.glm_vl import *
```
**Silence F401/F403 for the new re-exports.**

Ruff/Flake8 will flag these star imports unless re-exports are explicitly allowed.

Suggested change:

```diff
-from megatron.bridge.recipes.glm import *
-from megatron.bridge.recipes.glm_vl import *
+from megatron.bridge.recipes.glm import *  # noqa: F401,F403
+from megatron.bridge.recipes.glm_vl import *  # noqa: F401,F403
```

🧰 Tools

🪛 Flake8 (7.3.0)
- [error] 24-24: 'megatron.bridge.recipes.glm.*' imported but unused (F401)
- [error] 25-25: 'megatron.bridge.recipes.glm_vl.*' imported but unused (F401)

🪛 Ruff (0.14.14)
- [error] 24-24: `from megatron.bridge.recipes.glm import *` used; unable to detect undefined names (F403)
- [error] 25-25: `from megatron.bridge.recipes.glm_vl import *` used; unable to detect undefined names (F403)
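An alternative to the `noqa` comments is to declare the public API via `__all__` in each submodule, which star-imports then respect. A self-contained illustration (the module and function names here are synthetic, not the actual recipes package):

```python
import sys
import types

# Build a throwaway module that declares its public API via __all__.
mod = types.ModuleType("fake_recipes")
exec(
    "__all__ = ['glm_45v_pretrain_config']\n"
    "def glm_45v_pretrain_config():\n"
    "    return 'config'\n"
    "def _private_helper():\n"
    "    return 'hidden'\n",
    mod.__dict__,
)
sys.modules["fake_recipes"] = mod

# Star-import pulls in only the names listed in __all__.
ns: dict = {}
exec("from fake_recipes import *", ns)

print("glm_45v_pretrain_config" in ns)  # True
print("_private_helper" in ns)          # False
```

Either approach makes the re-export intent explicit; `__all__` has the advantage of also documenting the package surface for readers and tooling.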
src/megatron/bridge/recipes/glm_vl/glm_45v.py

```python
layout_map = {
    (4, 1): [
        ["embedding"] + ["decoder"] * 11,
        ["decoder"] * 12,
        ["decoder"] * 12,
        ["decoder"] * 11 + last_layer,
    ],
    (8, 1): [["embedding"] + ["decoder"]] + [["decoder"] * 7] * 6 + [["decoder"] * 3 + last_layer],
    (16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer],
```
**PP=16 non-PEFT layout totals 45 decoders, but the model is described as 46-layer.**

This appears to under-allocate one decoder layer for the PP=16 full-SFT layout. Please confirm the intended distribution and adjust to sum to 46.

Suggested change (one possible fix):

```diff
-    (16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer],
+    (16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 4 + last_layer],
```

🧰 Tools

🪛 Ruff (0.14.14)
- [warning] 87-87: Consider `["embedding", "decoder"]` instead of concatenation (RUF005)
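The arithmetic behind this finding is easy to check directly. The sketch below rebuilds the two PP=16 variants and counts decoder stages, assuming (consistent with the reviewer's count of 45) that `last_layer` contributes no `"decoder"` entries; the `["loss"]` placeholder stands in for whatever the real last-stage suffix is:

```python
def count_decoders(layout: list[list[str]]) -> int:
    """Sum the "decoder" occurrences across all pipeline stages."""
    return sum(stage.count("decoder") for stage in layout)

last_layer = ["loss"]  # placeholder: assumed to hold no "decoder" entries

# Current (16, 1) layout: 14 stages of 3 decoders plus a 3-decoder tail.
current = [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer]
# Reviewer's proposed fix: widen the tail stage to 4 decoders.
fixed = [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 4 + last_layer]

print(count_decoders(current))  # 45 -- one short of the 46-layer model
print(count_decoders(fixed))    # 46
```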
src/megatron/bridge/recipes/glm_vl/glm_45v.py

```diff
     # Set pipeline model parallel layout for asymmetric stages
-    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout)
+    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=peft is not None)
```
**`peft="none"` still selects the PEFT layout.**

`peft="none"` is treated as full SFT elsewhere, but `is_peft=peft is not None` flips to PEFT layouts, which is inconsistent and can misconfigure pipeline splits.

Suggested change:

```diff
-    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=peft is not None)
+    is_peft_layout = not (peft is None or (isinstance(peft, str) and peft.lower() == "none"))
+    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=is_peft_layout)
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    is_peft_layout = not (peft is None or (isinstance(peft, str) and peft.lower() == "none"))
    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=is_peft_layout)
```
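The proposed predicate can be verified in isolation; this sketch is independent of the recipe code and only exercises the boolean logic:

```python
def is_peft_layout(peft) -> bool:
    """True only for genuine PEFT values; None and the "none" sentinel mean full SFT."""
    return not (peft is None or (isinstance(peft, str) and peft.lower() == "none"))


def buggy(peft) -> bool:
    """The current predicate: anything non-None counts as PEFT."""
    return peft is not None


# The sentinel string is the divergence point between the two predicates.
for value, expected in [(None, False), ("none", False), ("NONE", False), ("lora", True)]:
    assert is_peft_layout(value) is expected

print(buggy("none"))           # True  -- the bug: "none" selects a PEFT layout
print(is_peft_layout("none"))  # False -- treated as full SFT, as intended
```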
beep boop [🤖]: Hi @yaoyu-33 👋,
Summary by CodeRabbit
Release Notes
Documentation
New Features