Conversation
#2151)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Ao Tang <aot@nvidia.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Co-authored-by: Dingqing Yang <dingqingy@nvidia.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ko3n1g <16716991+ko3n1g@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: malay-nagda <malayn@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
/ok to test 17a2560
📝 Walkthrough

This PR consolidates Vision-Language Model documentation by simplifying docs with references to example directories, adds comprehensive GLM-4.5V example scripts for conversion/inference/finetuning workflows, introduces PEFT support to the GLM-4.5V recipe configuration with updated default parallelism settings, and updates tests to match the new configuration defaults.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit_tests/recipes/test_glm_45v_recipes.py (1)

**368-455: ⚠️ Potential issue | 🟡 Minor — Add test markers and return type hints for the new pipeline-layout tests.**

This keeps unit tests categorized and aligns with the project's Python typing rules.

Suggested change:

```diff
-def test_glm_45v_pipeline_layout_pp4():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp4() -> None:
@@
-def test_glm_45v_pipeline_layout_pp8():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp8() -> None:
@@
-def test_glm_45v_pipeline_layout_pp16():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp16() -> None:
@@
-def test_glm_45v_pipeline_layout_pp8_peft():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp8_peft() -> None:
@@
-def test_glm_45v_pipeline_layout_pp16_peft():
+@pytest.mark.unit
+def test_glm_45v_pipeline_layout_pp16_peft() -> None:
```

As per coding guidelines, `tests/**/*.py`: Use `pytest.mark` to categorize tests (unit, integration, system); `**/*.py`: Use type hints for function arguments and return types.
🤖 Fix all issues with AI agents
In `@examples/models/vlm/glm_45v/inference.sh`:
- Around line 1-54: Enable strict bash mode and quote all expansions of
WORKSPACE and other variables in inference.sh to prevent silent failures and
word-splitting: add a strict-mode header (set -euo pipefail and optionally
IFS=$'\n\t') at the top of the script, and update usages like
${WORKSPACE}/models/GLM-4.5V/iter_0000000 and
${WORKSPACE}/models/GLM-4.5V-hf-export to use quoted expansions
("${WORKSPACE}/...") as well as any other variable usages in the uv run command
lines to be quoted; keep the existing default assignment
WORKSPACE=${WORKSPACE:-/workspace} intact.
In `@examples/models/vlm/glm_45v/slurm_sft.sh`:
- Around line 44-73: The script currently unconditionally resets CONTAINER_IMAGE
and CONTAINER_MOUNTS to empty strings, clobbering any exported environment
values; update those lines to preserve environment overrides using parameter
expansion (e.g., set CONTAINER_IMAGE="${CONTAINER_IMAGE:-}" and
CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-}" or simply remove the explicit empty
assignment) and ensure any usages of these variables are quoted (e.g., use
"$CONTAINER_IMAGE" and "$CONTAINER_MOUNTS") so path values with spaces are
handled correctly; refer to the CONTAINER_IMAGE and CONTAINER_MOUNTS symbols in
the slurm_sft.sh snippet when making the change.
In `@src/megatron/bridge/recipes/__init__.py`:
- Around line 24-25: The star-import lines in
megatron.bridge.recipes.__init__.py ("from megatron.bridge.recipes.glm import *"
and "from megatron.bridge.recipes.glm_vl import *") trigger Ruff/Flake8
F401/F403; silence these warnings by appending an explicit noqa for those codes
to the import lines (e.g. add "# noqa: F401,F403" to each star-import) so the
re-exports remain intentional and linting will pass.
In `@src/megatron/bridge/recipes/glm_vl/glm_45v.py`:
- Line 264: The call to set_glm_45v_pipeline_model_parallel_layout uses
is_peft=peft is not None which treats the string "none" as PEFT; change the
predicate so "none" is treated as no-PEFT (e.g., compute is_peft = peft is not
None and peft != "none" or equivalent) and pass that boolean to
set_glm_45v_pipeline_model_parallel_layout to ensure PEFT layouts are only
selected for actual PEFT values.
- Around line 80-88: The (16,1) entry in layout_map under-allocates decoder
layers (totals 45) for the 46-layer model; update that entry so the sum of
decoder occurrences equals 46 — e.g., change the repeated block [["decoder"] *
3] * 14 to [["decoder"] * 3] * 15 (or otherwise increment the decoder count in
the (16,1) branch) so layout_map[(16,1)] plus the final ["decoder"] * 3 +
last_layer yields 46 decoders in total; adjust only the (16,1) list construction
referencing layout_map and last_layer.
🧹 Nitpick comments (7)
examples/models/vlm/glm_45v/README.md (1)

**24-27: Consider using `uv run` in example commands for consistency.**

The actual `conversion.sh` script uses `uv run python`, but the example commands here use `python` directly. This inconsistency might confuse users who follow the README versus run the scripts.

Suggested change:

```diff
-python examples/conversion/convert_checkpoints.py import \
+uv run python examples/conversion/convert_checkpoints.py import \
   --hf-model zai-org/GLM-4.5V \
   --megatron-path /models/GLM-4.5V
```

examples/models/vlm/gemma3_vl/README.md (1)

**24-27: Consider using `uv run` in example commands for consistency.**

Same as the GLM-4.5V README: consider using `uv run python` to match the actual shell scripts.

Suggested change:

```diff
-python examples/conversion/convert_checkpoints.py import \
+uv run python examples/conversion/convert_checkpoints.py import \
   --hf-model google/gemma-3-4b-it \
   --megatron-path /models/gemma-3-4b-it
```

examples/models/vlm/glm_45v/conversion.sh (2)
**1-17: Consider adding `set -euo pipefail` for robust error handling.**

Per the Google Shell Style Guide, adding error handling at the start of the script helps catch failures early.

Suggested change:

```diff
 #!/usr/bin/env bash
 # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # ...
 # limitations under the License.

+set -euo pipefail
+
 # Workspace directory for checkpoints and results
 WORKSPACE=${WORKSPACE:-/workspace}
```

**20-28: Quote variable expansions to handle paths with spaces.**

Suggested change:

```diff
 # Import HF → Megatron
 uv run python examples/conversion/convert_checkpoints.py import \
   --hf-model zai-org/GLM-4.5V \
-  --megatron-path ${WORKSPACE}/models/GLM-4.5V
+  --megatron-path "${WORKSPACE}/models/GLM-4.5V"

 # Export Megatron → HF
 uv run python examples/conversion/convert_checkpoints.py export \
   --hf-model zai-org/GLM-4.5V \
-  --megatron-path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
-  --hf-path ${WORKSPACE}/models/GLM-4.5V-hf-export
+  --megatron-path "${WORKSPACE}/models/GLM-4.5V/iter_0000000" \
+  --hf-path "${WORKSPACE}/models/GLM-4.5V-hf-export"
```

examples/models/vlm/glm_45v/slurm_peft.sh (2)
**136-138: Fragile synchronization using `sleep 2`.**

The `sleep 2` workaround for waiting on `uv sync` is brittle and may not be sufficient if sync takes longer. Consider a more robust synchronization mechanism, such as a file-based lock or barrier.

Alternative approach using a file-based barrier:

```diff
-# Only local rank 0 on each node runs uv sync, then all ranks run with --no-sync
-CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 2; fi && "
+# Only local rank 0 on each node runs uv sync with file-based barrier
+CMD="SYNC_DONE=/tmp/uv_sync_done_\${SLURM_NODEID}; "
+CMD="\$CMD if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync && touch \$SYNC_DONE; else while [ ! -f \$SYNC_DONE ]; do sleep 0.5; done; fi && "
```

Alternatively, pre-sync the UV cache before job submission as documented in the README (lines 145-173) to avoid runtime synchronization entirely.

**156-160: Mount handling may fail with paths containing spaces.**

Iterating over space-separated `CONTAINER_MOUNTS` will break if any path contains spaces.

Consider using an array or comma-separated format:

```diff
 # Container mounts (optional, space-separated)
-CONTAINER_MOUNTS=""
-# CONTAINER_MOUNTS="/data:/data /workspace:/workspace"
+CONTAINER_MOUNTS=()
+# CONTAINER_MOUNTS=("/data:/data" "/workspace:/workspace")
 ...
 # Add container mounts
-if [ -n "$CONTAINER_MOUNTS" ]; then
-  for mount in $CONTAINER_MOUNTS; do
+if [ ${#CONTAINER_MOUNTS[@]} -gt 0 ]; then
+  for mount in "${CONTAINER_MOUNTS[@]}"; do
     SRUN_CMD="$SRUN_CMD --container-mounts=$mount"
   done
 fi
```

src/megatron/bridge/recipes/glm_vl/glm_45v.py (1)
**46-48: Prefer PEP 604 union syntax for Python 3.10+.**

This aligns with the repository typing guidelines.

Suggested change:

```diff
 def set_glm_45v_pipeline_model_parallel_layout(
-    model_cfg: GPTModelProvider, layout: Optional[Union[str, List[List[str]]]] = None, is_peft: bool = False
+    model_cfg: GPTModelProvider, layout: str | list[list[str]] | None = None, is_peft: bool = False
 ) -> None:
```

As per coding guidelines, use `T | None` for nullable types instead of `Optional[T]` and `X | Y` for union types instead of `Union[X, Y]`.
examples/models/vlm/glm_45v/inference.sh

```bash
#!/usr/bin/env bash
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# GLM-4.5V is a large MoE model (106B parameters)
# Using TP=1, PP=4, EP=2 for inference (8 GPUs minimum)

# Inference with Hugging Face checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
    --hf_model_path zai-org/GLM-4.5V \
    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
    --prompt "Describe this image." \
    --max_new_tokens 50 \
    --tp 1 \
    --pp 4 \
    --ep 2 \
    --trust_remote_code

# Inference with imported Megatron checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
    --hf_model_path zai-org/GLM-4.5V \
    --megatron_model_path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
    --prompt "Describe this image." \
    --max_new_tokens 50 \
    --tp 1 \
    --pp 2 \
    --ep 4 \
    --trust_remote_code

# Inference with exported HF checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
    --hf_model_path ${WORKSPACE}/models/GLM-4.5V-hf-export \
    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
    --prompt "Describe this image." \
    --max_new_tokens 50 \
    --tp 1 \
    --pp 2 \
    --ep 4 \
    --trust_remote_code
```
**Harden the script with strict mode and quoted path expansions.**

This avoids silent failures and word-splitting when WORKSPACE is customized.

Suggested change:

```diff
 #!/usr/bin/env bash
+set -euo pipefail
@@
-WORKSPACE=${WORKSPACE:-/workspace}
+WORKSPACE="${WORKSPACE:-/workspace}"
@@
-    --megatron_model_path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
+    --megatron_model_path "${WORKSPACE}/models/GLM-4.5V/iter_0000000" \
@@
-    --hf_model_path ${WORKSPACE}/models/GLM-4.5V-hf-export \
+    --hf_model_path "${WORKSPACE}/models/GLM-4.5V-hf-export" \
```

As per coding guidelines, `**/*.sh`: Follow Google Shell Style Guide.
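To see why the quoting matters, here is a minimal sketch of the failure mode (the workspace path is hypothetical, chosen only because it contains a space):

```shell
#!/usr/bin/env bash
set -euo pipefail

# A workspace path containing a space, as a user might legitimately export it.
WORKSPACE="/tmp/my workspace"

# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted expansion word-splits the path into two arguments.
unquoted=$(count_args ${WORKSPACE}/models)
# Quoted expansion keeps the path intact as a single argument.
quoted=$(count_args "${WORKSPACE}/models")

echo "unquoted args: ${unquoted}"   # 2 -- the path was split at the space
echo "quoted args: ${quoted}"       # 1 -- the path survives intact
```

With `set -u`, a typo'd variable name would additionally abort the script instead of silently expanding to an empty string.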
examples/models/vlm/glm_45v/slurm_sft.sh (excerpt)

```bash
# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# Model and training configurations
PRETRAINED_CHECKPOINT=${WORKSPACE}/models/GLM-4.5V
MODEL_NAME=glm_45v
DATASET_NAME=cord_v2
SEQ_LENGTH=8192
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=64
MICRO_BATCH_SIZE=1
EVAL_ITERS=10
LR=0.000005
MIN_LR=0.0000005
LR_WARMUP_ITERS=10
LOG_INTERVAL=1
WANDB_PROJECT=megatron-bridge-${DATASET_NAME}

# Parallelism configuration
TP=1
PP=8
EP=16

# Container image (required)
CONTAINER_IMAGE=""
# CONTAINER_IMAGE="/path/to/container.sqsh"

# Container mounts (optional, space-separated)
CONTAINER_MOUNTS=""
# CONTAINER_MOUNTS="/data:/data /workspace:/workspace"
```
**Respect environment overrides for container settings (and quote paths).**

Currently CONTAINER_IMAGE/CONTAINER_MOUNTS are reset to empty even if exported before submission; parameter expansion keeps env overrides while preserving defaults.

Suggested change:

```diff
-WORKSPACE=${WORKSPACE:-/workspace}
+WORKSPACE="${WORKSPACE:-/workspace}"
@@
-CONTAINER_IMAGE=""
+CONTAINER_IMAGE="${CONTAINER_IMAGE:-}"
@@
-CONTAINER_MOUNTS=""
+CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-}"
```

As per coding guidelines, `**/*.sh`: Follow Google Shell Style Guide.
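The `${VAR:-default}` expansion the review recommends keeps an exported value while still providing a fallback. A quick self-contained sketch (the image path is a made-up example):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Start from a clean slate so the demo is deterministic.
unset CONTAINER_MOUNTS

# Simulate a user exporting an override before submitting the job.
export CONTAINER_IMAGE="/images/custom.sqsh"

# Parameter expansion: use the exported value if set, otherwise the default.
CONTAINER_IMAGE="${CONTAINER_IMAGE:-}"    # keeps "/images/custom.sqsh"
CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-}"  # unset in the environment -> ""

echo "image: ${CONTAINER_IMAGE}"
echo "mounts: '${CONTAINER_MOUNTS}'"
```

Note that `CONTAINER_IMAGE=""` (the current code) would discard the exported value, which is exactly the clobbering the review describes.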
src/megatron/bridge/recipes/__init__.py

```python
from megatron.bridge.recipes.glm import *
from megatron.bridge.recipes.glm_vl import *
```
**Silence F401/F403 for the new re-exports.**

Ruff/Flake8 will flag these star imports unless re-exports are explicitly allowed.

Suggested change:

```diff
-from megatron.bridge.recipes.glm import *
-from megatron.bridge.recipes.glm_vl import *
+from megatron.bridge.recipes.glm import *  # noqa: F401,F403
+from megatron.bridge.recipes.glm_vl import *  # noqa: F401,F403
```

🧰 Tools

🪛 Flake8 (7.3.0)
- [error] 24-24: 'megatron.bridge.recipes.glm.*' imported but unused (F401)
- [error] 25-25: 'megatron.bridge.recipes.glm_vl.*' imported but unused (F401)

🪛 Ruff (0.14.14)
- [error] 24-24: `from megatron.bridge.recipes.glm import *` used; unable to detect undefined names (F403)
- [error] 25-25: `from megatron.bridge.recipes.glm_vl import *` used; unable to detect undefined names (F403)
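An alternative to the `noqa` comments is to declare the public API via `__all__` in each submodule, which star-imports then respect. A self-contained illustration (the module and function names here are synthetic, not the actual recipes package):

```python
import sys
import types

# Build a throwaway module that declares its public API via __all__.
mod = types.ModuleType("fake_recipes")
exec(
    "__all__ = ['glm_45v_pretrain_config']\n"
    "def glm_45v_pretrain_config():\n"
    "    return 'config'\n"
    "def _private_helper():\n"
    "    return 'hidden'\n",
    mod.__dict__,
)
sys.modules["fake_recipes"] = mod

# Star-import pulls in only the names listed in __all__.
ns: dict = {}
exec("from fake_recipes import *", ns)

print("glm_45v_pretrain_config" in ns)  # True
print("_private_helper" in ns)          # False
```

Either approach makes the re-export intent explicit; `__all__` has the advantage of also documenting the package surface for readers and tooling.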
src/megatron/bridge/recipes/glm_vl/glm_45v.py

```python
layout_map = {
    (4, 1): [
        ["embedding"] + ["decoder"] * 11,
        ["decoder"] * 12,
        ["decoder"] * 12,
        ["decoder"] * 11 + last_layer,
    ],
    (8, 1): [["embedding"] + ["decoder"]] + [["decoder"] * 7] * 6 + [["decoder"] * 3 + last_layer],
    (16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer],
```
**PP=16 non-PEFT layout totals 45 decoders, but the model is described as 46-layer.**

This appears to under-allocate one decoder layer for the PP=16 full-SFT layout. Please confirm the intended distribution and adjust to sum to 46.

Suggested change (one possible fix):

```diff
-    (16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer],
+    (16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 4 + last_layer],
```

🧰 Tools

🪛 Ruff (0.14.14)
- [warning] 87-87: Consider `["embedding", "decoder"]` instead of concatenation (RUF005)
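The arithmetic behind this finding is easy to check directly. The sketch below rebuilds the two PP=16 variants and counts decoder stages, assuming (consistent with the reviewer's count of 45) that `last_layer` contributes no `"decoder"` entries; the `["loss"]` placeholder stands in for whatever the real last-stage suffix is:

```python
def count_decoders(layout: list[list[str]]) -> int:
    """Sum the "decoder" occurrences across all pipeline stages."""
    return sum(stage.count("decoder") for stage in layout)

last_layer = ["loss"]  # placeholder: assumed to hold no "decoder" entries

# Current (16, 1) layout: 14 stages of 3 decoders plus a 3-decoder tail.
current = [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer]
# Reviewer's proposed fix: widen the tail stage to 4 decoders.
fixed = [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 4 + last_layer]

print(count_decoders(current))  # 45 -- one short of the 46-layer model
print(count_decoders(fixed))    # 46
```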
src/megatron/bridge/recipes/glm_vl/glm_45v.py

```diff
     # Set pipeline model parallel layout for asymmetric stages
-    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout)
+    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=peft is not None)
```
**`peft="none"` still selects the PEFT layout.**

`peft="none"` is treated as full SFT elsewhere, but `is_peft=peft is not None` flips to PEFT layouts, which is inconsistent and can misconfigure pipeline splits.

Suggested change:

```diff
-    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=peft is not None)
+    is_peft_layout = not (peft is None or (isinstance(peft, str) and peft.lower() == "none"))
+    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=is_peft_layout)
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    is_peft_layout = not (peft is None or (isinstance(peft, str) and peft.lower() == "none"))
    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=is_peft_layout)
```
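The proposed predicate can be verified in isolation; this sketch is independent of the recipe code and only exercises the boolean logic:

```python
def is_peft_layout(peft) -> bool:
    """True only for genuine PEFT values; None and the "none" sentinel mean full SFT."""
    return not (peft is None or (isinstance(peft, str) and peft.lower() == "none"))


def buggy(peft) -> bool:
    """The current predicate: anything non-None counts as PEFT."""
    return peft is not None


# The sentinel string is the divergence point between the two predicates.
for value, expected in [(None, False), ("none", False), ("NONE", False), ("lora", True)]:
    assert is_peft_layout(value) is expected

print(buggy("none"))           # True  -- the bug: "none" selects a PEFT layout
print(is_peft_layout("none"))  # False -- treated as full SFT, as intended
```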
beep boop [🤖]: Hi @yaoyu-33 👋,
Summary by CodeRabbit
Release Notes
Documentation
New Features