[doc, model] feat: Add GLM-4.5V VL examples and update Gemma 3 VL docs#2151
Conversation
… distillation, decentralized_pg Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Add complete GLM-4.5V VLM example folder:
- README.md: Model documentation, architecture details, usage instructions
- slurm_sft.sh: Slurm job script for full SFT (16 nodes, 128 GPUs, TP=1/PP=2/EP=16)
- slurm_peft.sh: Slurm job script for LoRA (4 nodes, 32 GPUs, TP=1/PP=2/EP=4)
- conversion.sh: Checkpoint conversion scripts (HF <-> Megatron)
- inference.sh: Inference examples

Update Gemma 3 VL example scripts:
- README.md and peft.sh updates

GLM-4.5V is a 106B-parameter MoE vision-language model based on GLM-4.5 Air.
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Update PP values from PP=2 to PP=8 for both SFT and PEFT configurations to match the actual slurm scripts.
📝 Walkthrough
This pull request restructures script and documentation references, moving examples from the

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
examples/models/vlm/qwen_vl/data/convert_to_qwenvl_wds.py (1)

20-24: ⚠️ Potential issue | 🟡 Minor

Use `uv run` in the example command. The usage block currently invokes the script with bare `python`.

Suggested fix:

```diff
- python examples/models/vlm/qwen_vl/data/convert_to_qwenvl_wds.py \
+ uv run python examples/models/vlm/qwen_vl/data/convert_to_qwenvl_wds.py \
```

As per coding guidelines, "{**/*.sh,examples/**/*.py}: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".
docs/models/vlm/nemotron-nano-v2-vl.md (1)

129-140: ⚠️ Potential issue | 🟡 Minor

Fix the LoRA flag typo in the PEFT command. `—-lora-on-vision-model` uses an em dash; the CLI expects `--lora-on-vision-model`, so this command will fail as written.

💡 Suggested fix:

```diff
-—-lora-on-vision-model \
+--lora-on-vision-model \
```
🤖 Fix all issues with AI agents
In `@docs/models/vlm/index.md`:
- Around line 5-10: In the toctree block, change the entry
'../../../../examples/models/vlm/gemma3_vl/README.md' to
'../../../examples/models/vlm/gemma3_vl/README.md' so the relative path matches
the one used in README.md and no longer resolves outside the repository.
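For reference, the corrected entry in context might look like this (a sketch assuming docs/models/vlm/index.md uses MyST's `{toctree}` directive; the directive options shown are illustrative, not taken from the actual file):

````markdown
```{toctree}
:hidden:

../../../examples/models/vlm/gemma3_vl/README.md
```
````

Three `../` segments climb from docs/models/vlm/ to the repository root, so the path lands inside examples/ rather than outside the checkout.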
In `@docs/models/vlm/README.md`:
- Around line 9-12: The relative link
"../../../examples/models/vlm/gemma3_vl/README.md" in docs/models/vlm/README.md
will break on the published docs site; update the table row for "Gemma 3 VL"
(and similarly check "Nemotron Nano V2 VL") to use a stable target: either
replace the relative path with the absolute GitHub URL to the examples repo
README or create a docs-local wrapper page (e.g., a small Markdown file inside
docs/models/vlm/ that redirects/links to the example) and point the table link
to that wrapper; ensure the link text and the table entry for the
Model/Documentation remain unchanged.
In `@examples/models/vlm/gemma3_vl/README.md`:
- Around line 173-175: Replace the bare URLs for the Gemma model cards with
inline Markdown links to satisfy MD034; for each line like "Gemma 3 VL 4B:
https://huggingface.co/google/gemma-3-4b-it" change it to use an inline link
format (e.g., "Gemma 3 VL 4B: [Gemma 3 VL
4B](https://huggingface.co/google/gemma-3-4b-it)"), and do the same for the
"Gemma 3 VL 12B" and "Gemma 3 VL 27B" entries so the model labels are linked
rather than presenting bare URLs.
- Around line 156-163: The markdown table's separator row is misaligned with the
header spacing and fails MD060; update the separator row under the header "|
Model | Mode | TP | PP | Global Batch Size | Learning Rate | Hardware |" so each
column divider and surrounding spaces match the header exactly (e.g., use
"|-------|------|----|----|-------------------|---------------|----------|" with
spacing consistent for each column), ensuring the separator columns align with
the header text for the table containing "Gemma 3 VL 4B/12B/27B" entries.
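A separator row aligned with that header might look like the sketch below (the cell values are placeholders, not the README's actual numbers):

```markdown
| Model          | Mode | TP | PP | Global Batch Size | Learning Rate | Hardware |
|----------------|------|----|----|-------------------|---------------|----------|
| Gemma 3 VL 4B  | SFT  | …  | …  | …                 | …             | …        |
```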
In `@examples/models/vlm/glm_45v/inference.sh`:
- Around line 19-21: The header comment currently specifies "TP=1, PP=4, EP=2"
but the later inference invocation blocks use "PP=2, EP=4", causing confusion;
either make the header match the later blocks or update the script to
consistently use one configuration and/or add a clarifying comment explaining
both valid PP/EP permutations for different inference variants. Locate the
TP/PP/EP flag mentions (search for "TP=", "PP=", "EP=" and the string "TP=1,
PP=4, EP=2" and the blocks using "PP=2, EP=4") and then (a) update the header
comment to list both variants and when to use each, or (b) change the later
blocks to match the header so all PP/EP values are consistent across the script.
In `@examples/models/vlm/glm_45v/README.md`:
- Around line 170-172: Replace the bare URL in the README entry "GLM-4.5V:
https://huggingface.co/zai-org/GLM-4.5V" with a Markdown link (e.g., GLM-4.5V:
[GLM-4.5V](https://huggingface.co/zai-org/GLM-4.5V)) so the URL is not displayed
raw and the file satisfies MD034; update the line containing "GLM-4.5V"
accordingly.
- Around line 164-165: Typo in the LoRA/DoRA note: replace the incorrect
fragment "allowing fo fewer GPUs" with the correct phrase "allowing for fewer
GPUs" in the README sentence that reads "**Note:** LoRA/DoRA significantly
reduces memory requirements, allowing fo fewer GPUs. Expert parallelism (EP) is
essential for efficient training of this MoE model." to fix the spelling error.
- Around line 123-126: Make the wording consistent by updating the heading or
the sentence so both use the same form; e.g., change the heading "Pretrain" to
"Pretraining" (or alter the sentence "Pretraining is not verified for this
model." to "Pretrain is not verified for this model.") so the heading text
"Pretrain" and the following sentence use the identical term.
- Around line 92-109: The fenced code block under the "Expected output:" section
in examples/models/vlm/glm_45v/README.md is missing a language tag; update the
opening triple-backtick for that block to include the language "text" (i.e.,
change ``` to ```text) so the block passes the MD040 lint rule and renders
correctly.
In `@examples/models/vlm/glm_45v/slurm_peft.sh`:
- Around line 23-26: Add a pre-submit guard to slurm_peft.sh that creates the
logs/ directory when the script is run locally and then submits itself with
sbatch; specifically, detect absence of SLURM_JOB_ID (e.g., if [ -z
"$SLURM_JOB_ID" ]; then mkdir -p logs && sbatch "$0" && exit; fi), so the logs/
directory exists before Slurm opens the --output/--error files; place this
immediately after the `#SBATCH` directives (sbatch stops parsing directives at
the first executable line) and guard it so it won't run inside the allocated
Slurm job.
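The guard described above can be sketched as follows (a minimal illustration; the function name and the demo are assumptions, and the `echo` stands in for the real `sbatch "$0"` submission):

```shell
#!/bin/bash
# Hedged sketch of the pre-submit guard suggested for slurm_peft.sh.

ensure_logs_and_submit() {
    # On a login node SLURM_JOB_ID is unset: create logs/ so Slurm can open
    # the --output/--error files, then (in the real script) resubmit via sbatch.
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        mkdir -p logs
        echo "submitting: sbatch $1"    # real script: sbatch "$1"; exit
        return 1                        # caller exits after submission
    fi
    return 0                            # already inside the allocation
}

# Demo of both paths:
unset SLURM_JOB_ID
ensure_logs_and_submit "slurm_peft.sh" || echo "login node: submitted, would exit"
SLURM_JOB_ID=123
ensure_logs_and_submit "slurm_peft.sh" && echo "inside Slurm job $SLURM_JOB_ID"
```

Because sbatch opens the output files before any script line runs, creating logs/ at submission time (not inside the job) is what prevents the failure.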
- Around line 19-22: Header guidance ("Recommended: TP=1, PP=2, EP=4") conflicts
with the runtime config (PP=8). Update either the header or the configuration so
they match: either change the header recommendation to "TP=1, PP=8, EP=4" to
reflect the script's PP=8 setting, or change the configured value PP=8 to PP=2
to match the header; ensure the related mentions (the same header text and the
occurrence of PP in the script) are updated consistently.
In `@examples/models/vlm/glm_45v/slurm_sft.sh`:
- Around line 23-26: Add a pre-submit step to ensure the logs/ directory exists
to prevent Slurm from failing when it opens --output/--error files; update the
top of slurm_sft.sh to run a safe command like mkdir -p logs (or an equivalent
check) before any SBATCH directives or before submission, and apply the same
precaution for the other output/error paths referenced around the
CONTAINER_IMAGE usage and lines 36-37 so all log paths are created prior to
Slurm opening them.
- Around line 19-21: Update the mismatch between the header guidance and the
actual parallelism configuration by either changing the header comment or the
variables so they match; specifically reconcile the PP value referenced in the
header ("PP=2") with the PP variable used in the script (PP=8) and ensure TP and
EP recommendations (TP=1, EP=16) in the header match the TP and EP values set in
the script (refer to the PP, TP, and EP variable assignments in slurm_sft.sh);
pick one source of truth (header or variables), make them consistent, and update
any other related header lines (63-65) to reflect the chosen configuration.
In `@src/megatron/bridge/recipes/__init__.py`:
- Around line 21-26: Static analysis is failing due to star imports from glm and
glm_vl in megatron.bridge.recipes.__init__ (star imports from deepseek, gemma,
gemma3_vl, glm, glm_vl, gpt); fix by either adding "# noqa: F401,F403" to the
glm and glm_vl import lines to silence Ruff/Flake8, or replace the star imports
with explicit exported symbols and an __all__ list (e.g., import the specific
functions/classes from megatron.bridge.recipes.glm and
megatron.bridge.recipes.glm_vl and add them to __all__) so the lint tools no
longer report F401/F403.
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: ko3n1g <16716991+ko3n1g@users.noreply.github.com>
…)" This reverts commit bfbc759.
Signed-off-by: Chen Cui <chcui@nvidia.com>
…P8-CS (#2175) Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
- Split docs into model introduction (docs/) and examples (examples/)
- docs/models/vlm/: Model overview and architecture details
- examples/models/vlm/: Training scripts, conversion, and step-by-step guides
- Update GLM-4.5V pipeline layout for better vision encoder balance
- Update hardware requirements: GLM-4.5V SFT 64 nodes, LoRA 32 nodes
- Add multi-node uv cache setup instructions
- Update recommended configurations with actual script values
Signed-off-by: Ao Tang <aot@nvidia.com>
/ok to test b5030db
/ok to test bb011ea
Summary
This PR adds GLM-4.5V Vision-Language model examples and updates Gemma 3 VL documentation.
Changes
New: GLM-4.5V VL Examples
Updated: Gemma 3 VL Examples