
cp: [doc, model] feat: Add GLM-4.5V VL examples and update Gemma 3 VL docs (#2151) into r0.3.0 (#2226)

Merged
yaoyu-33 merged 1 commit into r0.3.0 from cherry-pick-2151-r0.3.0 on Feb 5, 2026

Conversation

Contributor

@ko3n1g ko3n1g commented Feb 5, 2026

beep boop [🤖]: Hi @yaoyu-33 👋,

we've cherry-picked #2151 into r0.3.0 for you! 🚀

Please review and approve this cherry pick at your convenience!

Summary by CodeRabbit

Release Notes

  • Documentation

    • Reorganized Gemma 3 VL and GLM-4.5V documentation with consolidated, example-focused references.
    • Updated GLM-4.5V hardware requirements: higher GPU allocations for LoRA/DoRA and Full SFT training.
  • New Features

    • Added comprehensive GLM-4.5V example scripts for checkpoint conversion, inference, and fine-tuning workflows.
    • Expanded Gemma 3 VL examples with detailed configurations, training recipes, and step-by-step guidance.


Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Ao Tang <aot@nvidia.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Co-authored-by: Dingqing Yang <dingqingy@nvidia.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ko3n1g <16716991+ko3n1g@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: malay-nagda <malayn@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Contributor Author

ko3n1g commented Feb 5, 2026

/ok to test 17a2560


copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

This PR consolidates Vision-Language Model documentation by simplifying docs with references to example directories, adds comprehensive GLM-4.5V example scripts for conversion/inference/finetuning workflows, introduces PEFT support to GLM-4.5V recipe configuration with updated default parallelism settings, and updates tests to match new configuration defaults.

Changes

Cohort / File(s) Summary
Documentation Consolidation
docs/models/llm/gemma3.md, docs/models/vlm/gemma3-vl.md, docs/models/vlm/glm-45v.md
Updated model documentation to remove extensive detailed sections (conversion steps, finetuning recipes, configurations), replacing them with consolidated references that direct readers to the example directories. Updated hardware recommendations for GLM-4.5V from 4/16 nodes to 32/64 nodes for LoRA and Full SFT respectively.
Gemma3-VL Examples
examples/models/vlm/gemma3_vl/README.md
Expanded README from high-level overview to detailed example guide with explicit workspace setup, checkpoint conversion subsections, inference examples, finetuning recipes, recommended configurations table, and updated placeholders for coming features.
GLM-4.5V Examples
examples/models/vlm/glm_45v/README.md, examples/models/vlm/glm_45v/conversion.sh, examples/models/vlm/glm_45v/inference.sh, examples/models/vlm/glm_45v/slurm_peft.sh, examples/models/vlm/glm_45v/slurm_sft.sh
Added comprehensive README documenting checkpoint conversion (HF ↔ Megatron), inference workflows, and finetuning recipes; introduced conversion.sh for round-trip model conversion validation; inference.sh for distributed inference across three checkpoint formats; slurm_peft.sh and slurm_sft.sh for parameter-efficient and full supervised fine-tuning with SLURM job orchestration.
Recipe Configuration & Exports
src/megatron/bridge/recipes/__init__.py, src/megatron/bridge/recipes/glm_vl/glm_45v.py
Extended module exports to include GLM and GLM-VL recipe configurations; introduced is_peft parameter to set_glm_45v_pipeline_model_parallel_layout with PEFT-aware layout selection; increased default sequence length from 4096 to 8192; updated pipeline/expert parallelism defaults (PP 4→8, EP 2→4 for non-full-SFT); increased full SFT global batch size from 32 to 64.
Test Updates
tests/unit_tests/recipes/test_glm_45v_recipes.py
Updated parallelism expectations to reflect new defaults; expanded pipeline layout tests to cover additional PP values (4, 8, 16) and PEFT variants; updated call sites to pass is_peft flag; adjusted layout composition assertions for stages, embedding presence, and decoder counts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

r0.3.0

Suggested reviewers

  • yaoyu-33
  • ananthsub
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning: the PR includes significant default-parameter changes and new logic without test results or performance validation demonstrating impact. Resolution: add convergence curves, performance benchmarks, and regression analysis confirming the modified defaults do not cause training issues.

✅ Passed checks (3)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the PR's main objectives (adding GLM-4.5V VL examples and updating Gemma 3 VL docs), matching the changeset across documentation and code files.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, above the required threshold of 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit_tests/recipes/test_glm_45v_recipes.py (1)

368-455: ⚠️ Potential issue | 🟡 Minor

Add test markers and return type hints for the new pipeline-layout tests.

This keeps unit tests categorized and aligns with the project’s Python typing rules.

Suggested change
- def test_glm_45v_pipeline_layout_pp4():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp4() -> None:
@@
- def test_glm_45v_pipeline_layout_pp8():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp8() -> None:
@@
- def test_glm_45v_pipeline_layout_pp16():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp16() -> None:
@@
- def test_glm_45v_pipeline_layout_pp8_peft():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp8_peft() -> None:
@@
- def test_glm_45v_pipeline_layout_pp16_peft():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp16_peft() -> None:

As per coding guidelines, tests/**/*.py: Use 'pytest.mark' to categorize tests (unit, integration, system); **/*.py: Use type hints for function arguments and return types.

🤖 Fix all issues with AI agents
In `@examples/models/vlm/glm_45v/inference.sh`:
- Around line 1-54: Enable strict bash mode and quote all expansions of
WORKSPACE and other variables in inference.sh to prevent silent failures and
word-splitting: add a strict-mode header (set -euo pipefail and optionally
IFS=$'\n\t') at the top of the script, and update usages like
${WORKSPACE}/models/GLM-4.5V/iter_0000000 and
${WORKSPACE}/models/GLM-4.5V-hf-export to use quoted expansions
("${WORKSPACE}/...") as well as any other variable usages in the uv run command
lines to be quoted; keep the existing default assignment
WORKSPACE=${WORKSPACE:-/workspace} intact.

In `@examples/models/vlm/glm_45v/slurm_sft.sh`:
- Around line 44-73: The script currently unconditionally resets CONTAINER_IMAGE
and CONTAINER_MOUNTS to empty strings, clobbering any exported environment
values; update those lines to preserve environment overrides using parameter
expansion (e.g., set CONTAINER_IMAGE="${CONTAINER_IMAGE:-}" and
CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-}" or simply remove the explicit empty
assignment) and ensure any usages of these variables are quoted (e.g., use
"$CONTAINER_IMAGE" and "$CONTAINER_MOUNTS") so path values with spaces are
handled correctly; refer to the CONTAINER_IMAGE and CONTAINER_MOUNTS symbols in
the slurm_sft.sh snippet when making the change.

In `@src/megatron/bridge/recipes/__init__.py`:
- Around line 24-25: The star-import lines in
megatron.bridge.recipes.__init__.py ("from megatron.bridge.recipes.glm import *"
and "from megatron.bridge.recipes.glm_vl import *") trigger Ruff/Flake8
F401/F403; silence these warnings by appending an explicit noqa for those codes
to the import lines (e.g. add "# noqa: F401,F403" to each star-import) so the
re-exports remain intentional and linting will pass.

In `@src/megatron/bridge/recipes/glm_vl/glm_45v.py`:
- Line 264: The call to set_glm_45v_pipeline_model_parallel_layout uses
is_peft=peft is not None which treats the string "none" as PEFT; change the
predicate so "none" is treated as no-PEFT (e.g., compute is_peft = peft is not
None and peft != "none" or equivalent) and pass that boolean to
set_glm_45v_pipeline_model_parallel_layout to ensure PEFT layouts are only
selected for actual PEFT values.
- Around line 80-88: The (16,1) entry in layout_map under-allocates decoder
layers (totals 45) for the 46-layer model; update that entry so the sum of
decoder occurrences equals 46 — e.g., change the repeated block [["decoder"] *
3] * 14 to [["decoder"] * 3] * 15 (or otherwise increment the decoder count in
the (16,1) branch) so layout_map[(16,1)] plus the final ["decoder"] * 3 +
last_layer yields 46 decoders in total; adjust only the (16,1) list construction
referencing layout_map and last_layer.
🧹 Nitpick comments (7)
examples/models/vlm/glm_45v/README.md (1)

24-27: Consider using uv run in example commands for consistency.

The actual conversion.sh script uses uv run python, but the example commands here use python directly. This inconsistency might confuse users who follow the README versus run the scripts.

Suggested change
-python examples/conversion/convert_checkpoints.py import \
+uv run python examples/conversion/convert_checkpoints.py import \
 --hf-model zai-org/GLM-4.5V \
 --megatron-path /models/GLM-4.5V
examples/models/vlm/gemma3_vl/README.md (1)

24-27: Consider using uv run in example commands for consistency.

Same as the GLM-4.5V README - consider using uv run python to match the actual shell scripts.

Suggested change
-python examples/conversion/convert_checkpoints.py import \
+uv run python examples/conversion/convert_checkpoints.py import \
 --hf-model google/gemma-3-4b-it \
 --megatron-path /models/gemma-3-4b-it
examples/models/vlm/glm_45v/conversion.sh (2)

1-17: Consider adding set -euo pipefail for robust error handling.

Per Google Shell Style Guide, adding error handling at the start of the script helps catch failures early.

Suggested change
 #!/usr/bin/env bash
 # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # ...
 # limitations under the License.

+set -euo pipefail
+
 # Workspace directory for checkpoints and results
 WORKSPACE=${WORKSPACE:-/workspace}

20-28: Quote variable expansions to handle paths with spaces.

Suggested change
 # Import HF → Megatron
 uv run python examples/conversion/convert_checkpoints.py import \
     --hf-model zai-org/GLM-4.5V \
-    --megatron-path ${WORKSPACE}/models/GLM-4.5V
+    --megatron-path "${WORKSPACE}/models/GLM-4.5V"

 # Export Megatron → HF
 uv run python examples/conversion/convert_checkpoints.py export \
     --hf-model zai-org/GLM-4.5V \
-    --megatron-path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
-    --hf-path ${WORKSPACE}/models/GLM-4.5V-hf-export
+    --megatron-path "${WORKSPACE}/models/GLM-4.5V/iter_0000000" \
+    --hf-path "${WORKSPACE}/models/GLM-4.5V-hf-export"
examples/models/vlm/glm_45v/slurm_peft.sh (2)

136-138: Fragile synchronization using sleep 2.

The sleep 2 workaround for waiting on uv sync is brittle and may not be sufficient if sync takes longer. Consider using a more robust synchronization mechanism like a file-based lock or barrier.

Alternative approach using file-based barrier
-# Only local rank 0 on each node runs uv sync, then all ranks run with --no-sync
-CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 2; fi && "
+# Only local rank 0 on each node runs uv sync with file-based barrier
+CMD="SYNC_DONE=/tmp/uv_sync_done_\${SLURM_NODEID}; "
+CMD="${CMD}if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync && touch \"\$SYNC_DONE\"; else while [ ! -f \"\$SYNC_DONE\" ]; do sleep 0.5; done; fi && "

Alternatively, pre-sync the UV cache before job submission as documented in the README (Lines 145-173) to avoid runtime synchronization entirely.


156-160: Mount handling may fail with paths containing spaces.

Iterating over space-separated CONTAINER_MOUNTS will break if any path contains spaces.

Consider using an array or comma-separated format
 # Container mounts (optional, space-separated)
-CONTAINER_MOUNTS=""
-# CONTAINER_MOUNTS="/data:/data /workspace:/workspace"
+CONTAINER_MOUNTS=()
+# CONTAINER_MOUNTS=("/data:/data" "/workspace:/workspace")

 ...

 # Add container mounts
-if [ -n "$CONTAINER_MOUNTS" ]; then
-    for mount in $CONTAINER_MOUNTS; do
+if [ "${#CONTAINER_MOUNTS[@]}" -gt 0 ]; then
+    for mount in "${CONTAINER_MOUNTS[@]}"; do
         SRUN_CMD="$SRUN_CMD --container-mounts=$mount"
     done
 fi
src/megatron/bridge/recipes/glm_vl/glm_45v.py (1)

46-48: Prefer PEP 604 union syntax for Python 3.10+.

This aligns with the repository typing guidelines.

Suggested change
 def set_glm_45v_pipeline_model_parallel_layout(
-    model_cfg: GPTModelProvider, layout: Optional[Union[str, List[List[str]]]] = None, is_peft: bool = False
+    model_cfg: GPTModelProvider, layout: str | list[list[str]] | None = None, is_peft: bool = False
 ) -> None:

As per coding guidelines, use 'T | None' for nullable types instead of 'Optional[T]' and 'X | Y' for union types instead of 'Union[X, Y]'.

Comment on lines +1 to +54
#!/usr/bin/env bash
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# GLM-4.5V is a large MoE model (106B parameters)
# Using TP=1, PP=4, EP=2 for inference (8 GPUs minimum)

# Inference with Hugging Face checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path zai-org/GLM-4.5V \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 50 \
--tp 1 \
--pp 4 \
--ep 2 \
--trust_remote_code

# Inference with imported Megatron checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path zai-org/GLM-4.5V \
--megatron_model_path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 50 \
--tp 1 \
--pp 2 \
--ep 4 \
--trust_remote_code

# Inference with exported HF checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path ${WORKSPACE}/models/GLM-4.5V-hf-export \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 50 \
--tp 1 \
--pp 2 \
--ep 4 \
--trust_remote_code
Contributor

⚠️ Potential issue | 🟡 Minor

Harden the script with strict mode and quoted path expansions.

This avoids silent failures and word-splitting when WORKSPACE is customized.

Suggested change
 #!/usr/bin/env bash
+set -euo pipefail
@@
-WORKSPACE=${WORKSPACE:-/workspace}
+WORKSPACE="${WORKSPACE:-/workspace}"
@@
-    --megatron_model_path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
+    --megatron_model_path "${WORKSPACE}/models/GLM-4.5V/iter_0000000" \
@@
-    --hf_model_path ${WORKSPACE}/models/GLM-4.5V-hf-export \
+    --hf_model_path "${WORKSPACE}/models/GLM-4.5V-hf-export" \

As per coding guidelines, **/*.sh: Follow Google Shell Style Guide.


Comment on lines +44 to +73
# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# Model and training configurations
PRETRAINED_CHECKPOINT=${WORKSPACE}/models/GLM-4.5V
MODEL_NAME=glm_45v
DATASET_NAME=cord_v2
SEQ_LENGTH=8192
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=64
MICRO_BATCH_SIZE=1
EVAL_ITERS=10
LR=0.000005
MIN_LR=0.0000005
LR_WARMUP_ITERS=10
LOG_INTERVAL=1
WANDB_PROJECT=megatron-bridge-${DATASET_NAME}

# Parallelism configuration
TP=1
PP=8
EP=16

# Container image (required)
CONTAINER_IMAGE=""
# CONTAINER_IMAGE="/path/to/container.sqsh"

# Container mounts (optional, space-separated)
CONTAINER_MOUNTS=""
# CONTAINER_MOUNTS="/data:/data /workspace:/workspace"
Contributor

⚠️ Potential issue | 🟡 Minor

Respect environment overrides for container settings (and quote paths).

Currently CONTAINER_IMAGE/CONTAINER_MOUNTS are reset to empty even if exported before submission; parameter expansion keeps env overrides while preserving defaults.

Suggested change
-WORKSPACE=${WORKSPACE:-/workspace}
+WORKSPACE="${WORKSPACE:-/workspace}"
@@
-CONTAINER_IMAGE=""
+CONTAINER_IMAGE="${CONTAINER_IMAGE:-}"
@@
-CONTAINER_MOUNTS=""
+CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-}"

As per coding guidelines, **/*.sh: Follow Google Shell Style Guide.


Comment on lines +24 to +25
from megatron.bridge.recipes.glm import *
from megatron.bridge.recipes.glm_vl import *
Contributor

⚠️ Potential issue | 🟡 Minor

Silence F401/F403 for the new re-exports.

Ruff/Flake8 will flag these star imports unless re-exports are explicitly allowed.

Suggested change
-from megatron.bridge.recipes.glm import *
-from megatron.bridge.recipes.glm_vl import *
+from megatron.bridge.recipes.glm import *  # noqa: F401,F403
+from megatron.bridge.recipes.glm_vl import *  # noqa: F401,F403
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 24-24: 'megatron.bridge.recipes.glm.*' imported but unused

(F401)


[error] 25-25: 'megatron.bridge.recipes.glm_vl.*' imported but unused

(F401)

🪛 Ruff (0.14.14)

[error] 24-24: from megatron.bridge.recipes.glm import * used; unable to detect undefined names

(F403)


[error] 25-25: from megatron.bridge.recipes.glm_vl import * used; unable to detect undefined names

(F403)


Comment on lines +80 to +88
layout_map = {
(4, 1): [
["embedding"] + ["decoder"] * 11,
["decoder"] * 12,
["decoder"] * 12,
["decoder"] * 11 + last_layer,
],
(8, 1): [["embedding"] + ["decoder"]] + [["decoder"] * 7] * 6 + [["decoder"] * 3 + last_layer],
(16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer],
Contributor

⚠️ Potential issue | 🟠 Major

PP=16 non‑PEFT layout totals 45 decoders, but the model is described as 46‑layer.

This appears to under‑allocate one decoder layer for the PP=16 full‑SFT layout. Please confirm the intended distribution and adjust to sum to 46.

Suggested change (one possible fix)
-            (16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer],
+            (16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 4 + last_layer],
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 87-87: Consider ["embedding", "decoder"] instead of concatenation

Replace with ["embedding", "decoder"]

(RUF005)

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/glm_vl/glm_45v.py` around lines 80 - 88, The
(16,1) entry in layout_map under-allocates decoder layers (totals 45) for the
46-layer model; update that entry so the sum of decoder occurrences equals 46 —
e.g., change the repeated block [["decoder"] * 3] * 14 to [["decoder"] * 3] * 15
(or otherwise increment the decoder count in the (16,1) branch) so
layout_map[(16,1)] plus the final ["decoder"] * 3 + last_layer yields 46
decoders in total; adjust only the (16,1) list construction referencing
layout_map and last_layer.
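The arithmetic behind this finding can be checked in a few lines of standalone Python. This is a sketch, not repository code: the layout lists are copied from the diff, and the contents of `last_layer` are assumed here to be a single extra module.

```python
# Reproduce the PP=16 non-PEFT layout from the diff and count decoder layers.
last_layer = ["mtp"]  # assumption: the final stage carries one extra non-decoder module

# Layout as written in the diff: 1 embedding stage, 14 stages of 3 decoders,
# and a final stage of 3 decoders plus the last layer.
layout_pp16 = [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer]
total = sum(stage.count("decoder") for stage in layout_pp16)
print(total)  # 45 — one short of the 46-layer model, matching the review finding

# The suggested fix gives the last stage 4 decoders instead of 3.
fixed = [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 4 + last_layer]
print(sum(stage.count("decoder") for stage in fixed))  # 46
```

Either moving the extra decoder into the final stage (as above) or adding a fifteenth 3-decoder stage reaches 46; which distribution balances stage compute best depends on the model.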


 # Set pipeline model parallel layout for asymmetric stages
-set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout)
+set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=peft is not None)
Contributor

⚠️ Potential issue | 🟠 Major

peft="none" still selects the PEFT layout.

peft="none" is treated as full SFT elsewhere, but is_peft=peft is not None flips to PEFT layouts, which is inconsistent and can misconfigure pipeline splits.

Suggested change
-    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=peft is not None)
+    is_peft_layout = not (peft is None or (isinstance(peft, str) and peft.lower() == "none"))
+    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=is_peft_layout)
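The behavioral difference is easy to see in isolation. A minimal sketch of the corrected predicate — the helper name `is_peft_layout` is illustrative, not a function in the codebase:

```python
# Treat both None and the string "none" (case-insensitive) as "no PEFT",
# so full-SFT runs configured with peft="none" don't get a PEFT pipeline layout.
def is_peft_layout(peft) -> bool:
    return not (peft is None or (isinstance(peft, str) and peft.lower() == "none"))

print(is_peft_layout(None))    # False
print(is_peft_layout("none"))  # False — the original `peft is not None` check wrongly gave True here
print(is_peft_layout("lora"))  # True
```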

Contributor Author

@ko3n1g ko3n1g left a comment

@yaoyu-33 Do we have failing tests that we're fixing with this PR? If not, I'd recommend splitting this into docs-only vs. fixes. We can merge the fixes post-major.

@ko3n1g ko3n1g marked this pull request as draft February 5, 2026 09:59
@yaoyu-33 yaoyu-33 marked this pull request as ready for review February 5, 2026 21:53
@yaoyu-33 yaoyu-33 merged commit 3cb4dee into r0.3.0 Feb 5, 2026
49 checks passed
@yaoyu-33 yaoyu-33 deleted the cherry-pick-2151-r0.3.0 branch February 5, 2026 21:53