[doc, model] feat: Add GLM-4.5V VL examples and update Gemma 3 VL docs #2151

Merged
yaoyu-33 merged 29 commits into main from add-glm45v-gemma3vl-examples
Feb 5, 2026
Conversation

yaoyu-33 (Contributor) commented Jan 30, 2026

Summary

This PR adds GLM-4.5V Vision-Language model examples and updates Gemma 3 VL documentation.

Changes

New: GLM-4.5V VL Examples

  • Add comprehensive README with model details, conversion, inference, and training instructions
  • Add Slurm scripts for full SFT training (16 nodes, 128 GPUs, TP=1, PP=8, EP=16)
  • Add Slurm scripts for LoRA/PEFT fine-tuning (4 nodes, 32 GPUs, TP=1, PP=8, EP=4)
  • Add checkpoint conversion and inference example scripts
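The quoted SFT figures are mutually consistent, which can be checked with a little arithmetic. A minimal sketch (GPUS_PER_NODE=8 is our assumption; the PR states only nodes and total GPUs):

```shell
# Sanity-check the SFT layout above: 16 nodes, 128 GPUs, TP=1, PP=8, EP=16.
NODES=16
GPUS_PER_NODE=8
TP=1; PP=8; EP=16

WORLD_SIZE=$((NODES * GPUS_PER_NODE))   # total GPUs in the job
MODEL_PARALLEL=$((TP * PP))             # GPUs holding one model replica
DP=$((WORLD_SIZE / MODEL_PARALLEL))     # data-parallel replicas

echo "world=${WORLD_SIZE} model_parallel=${MODEL_PARALLEL} dp=${DP}"
```

With these values the world size is 128 and the data-parallel dimension is 16, which the expert-parallel size (EP=16) divides evenly, as an MoE layout requires.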

Updated: Gemma 3 VL Examples

  • Expanded README with more detailed documentation

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive GLM-4.5V Vision-Language model documentation with inference, fine-tuning, and checkpoint conversion workflows.
    • Expanded Gemma 3 VL documentation with detailed model architecture features and recommended configurations.
  • Documentation

    • Reorganized documentation structure for improved clarity and navigation.
    • Updated example references throughout to reflect consolidated directory structure.
  • Configuration

    • Updated default training settings for GLM-4.5V models.


… distillation, decentralized_pg

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Add complete GLM-4.5V VLM example folder:
- README.md: Model documentation, architecture details, usage instructions
- slurm_sft.sh: Slurm job script for full SFT (16 nodes, 128 GPUs, TP=1/PP=2/EP=16)
- slurm_peft.sh: Slurm job script for LoRA (4 nodes, 32 GPUs, TP=1/PP=2/EP=4)
- conversion.sh: Checkpoint conversion scripts (HF <-> Megatron)
- inference.sh: Inference examples

Update Gemma 3 VL example scripts:
- README.md and peft.sh updates

GLM-4.5V is a 106B parameter MoE vision-language model based on GLM-4.5 Air.
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Update PP values from PP=2 to PP=8 for both SFT and PEFT
configurations to match the actual slurm scripts.
copy-pr-bot bot commented Jan 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Jan 30, 2026

Walkthrough

This pull request restructures script and documentation references, moving examples from the examples/recipes/ directory to examples/models/ and related locations. It adds comprehensive documentation and deployment scripts for GLM-4.5V Vision-Language models, updates configuration defaults for GLM-4.5V training, and removes outdated documentation files while redirecting references to new locations.

Changes

Cohort / File(s) Summary
Documentation Path Reorganization
README.md, docs/megatron-lm-to-megatron-bridge.md, docs/models/llm/gemma3.md, docs/models/llm/nemotron3.md, docs/models/llm/nemotronh.md, docs/models/vlm/README.md, docs/models/vlm/index.md, docs/models/vlm/ministral3.md, docs/models/vlm/nemotron-nano-v2-vl.md, docs/models/vlm/qwen2.5-vl.md, docs/models/vlm/qwen3-vl.md, docs/recipe-usage.md, docs/training/distillation.md
Updated hyperlinks and command paths from examples/recipes/ directory references to examples/models/ and other target locations. Adjusted internal documentation links to reflect the new file structure for Gemma3 VL and other models.
Example Script Path Updates
examples/decentralized_pg/README.md, examples/decentralized_pg/pretrain_qwen3_simple.py, examples/decentralized_pg/pretrain_qwen3_vl_simple.py, examples/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py, examples/distillation/llama/distill_llama32_3b-1b.py, examples/models/qwen3_next/finetune_qwen3_next_80b_a3b.py, examples/models/vlm/qwen_vl/finetune_qwen_vl.py, examples/models/vlm/qwen_vl/data/convert_to_qwenvl_wds.py
Updated command examples and internal documentation paths to reference scripts in their new locations, removing recipes/ path segments and updating directory prefixes accordingly.
GLM-4.5V Model Documentation
examples/models/vlm/glm_45v/README.md
Added comprehensive README covering model overview, available model variants (4B, 12B, 27B), architecture features, workspace configuration, checkpoint conversion workflows (HF ↔ Megatron), inference guidance, fine-tuning recipes (SFT and PEFT/LoRA), recommended configurations table, and related documentation references.
GLM-4.5V Deployment Scripts
examples/models/vlm/glm_45v/conversion.sh, examples/models/vlm/glm_45v/inference.sh, examples/models/vlm/glm_45v/slurm_peft.sh, examples/models/vlm/glm_45v/slurm_sft.sh
Added shell scripts for checkpoint conversion workflow, three-variant inference execution, SLURM-based LoRA PEFT fine-tuning, and SLURM-based full supervised fine-tuning with environment setup, parallelism configuration, and containerized execution support.
Gemma3 VL Documentation Migration
docs/models/vlm/gemma3-vl.md, examples/models/vlm/gemma3_vl/README.md
Removed the legacy documentation file and added expanded README at the new location within examples directory, with detailed model specifications, architecture details, conversion workflows, and fine-tuning guidance.
Recipe Configuration Updates
src/megatron/bridge/recipes/__init__.py, src/megatron/bridge/recipes/glm_vl/glm_45v.py
Added imports for GLM and GLM_VL recipe families to module exports. Updated default parameters in GLM-4.5V configuration: increased seq_length from 4096 to 8192 and pipeline_model_parallel_size from 4 to 8 in finetune path.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • cuichenx
  • ananthsub
Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning: the PR introduces major changes, including GLM-4.5V VLM support and recipe parameter modifications (seq_length 4096→8192, PP 4→8), without documenting test results or validation. Resolution: add testing documentation demonstrating that the parameter changes do not impact convergence, plus validation results for the new GLM-4.5V example scripts.

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the main changes: adding GLM-4.5V VL examples and updating Gemma 3 VL docs. It is specific, concise, and directly reflects the primary objectives of the PR.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, above the required threshold of 80.00%.



coderabbitai bot (Contributor) left a comment

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/models/vlm/qwen_vl/data/convert_to_qwenvl_wds.py (1)

20-24: ⚠️ Potential issue | 🟡 Minor

Use uv run in the example command. The usage block currently invokes the script with bare python.

Suggested fix
- python examples/models/vlm/qwen_vl/data/convert_to_qwenvl_wds.py \
+ uv run python examples/models/vlm/qwen_vl/data/convert_to_qwenvl_wds.py \

As per coding guidelines, "`{**/*.sh,examples/**/*.py}`: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".

docs/models/vlm/nemotron-nano-v2-vl.md (1)

129-140: ⚠️ Potential issue | 🟡 Minor

Fix the LoRA flag typo in the PEFT command.

—-lora-on-vision-model uses an em dash; the CLI expects --lora-on-vision-model, so this command will fail as written.

💡 Suggested fix
-—-lora-on-vision-model \
+--lora-on-vision-model \
🤖 Fix all issues with AI agents
In `@docs/models/vlm/index.md`:
- Around line 5-10: Update the toctree entry string
'../../../../examples/models/vlm/gemma3_vl/README.md' to
'../../../examples/models/vlm/gemma3_vl/README.md' in the toctree block so the
relative path matches the one used in README.md and no longer resolves outside
the repository; locate the toctree block containing the existing
'../../../../examples/models/vlm/gemma3_vl/README.md' entry and change that path
to '../../../examples/models/vlm/gemma3_vl/README.md'.
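Assuming docs/models/vlm/index.md sits three directory levels below the repository root, the corrected entry would read roughly as follows (a sketch of the final state only; the `:hidden:` option and surrounding block are our guess, not taken from the PR):

```rst
.. toctree::
   :hidden:

   ../../../examples/models/vlm/gemma3_vl/README.md
```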

In `@docs/models/vlm/README.md`:
- Around line 9-12: The relative link
"../../../examples/models/vlm/gemma3_vl/README.md" in docs/models/vlm/README.md
will break on the published docs site; update the table row for "Gemma 3 VL"
(and similarly check "Nemotron Nano V2 VL") to use a stable target: either
replace the relative path with the absolute GitHub URL to the examples repo
README or create a docs-local wrapper page (e.g., a small Markdown file inside
docs/models/vlm/ that redirects/links to the example) and point the table link
to that wrapper; ensure the link text and the table entry for the
Model/Documentation remain unchanged.

In `@examples/models/vlm/gemma3_vl/README.md`:
- Around line 173-175: Replace the bare URLs for the Gemma model cards with
inline Markdown links to satisfy MD034; for each line like "Gemma 3 VL 4B:
https://huggingface.co/google/gemma-3-4b-it" change it to use an inline link
format (e.g., "Gemma 3 VL 4B: [Gemma 3 VL
4B](https://huggingface.co/google/gemma-3-4b-it)"), and do the same for the
"Gemma 3 VL 12B" and "Gemma 3 VL 27B" entries so the model labels are linked
rather than presenting bare URLs.
- Around line 156-163: The markdown table's separator row is misaligned with the
header spacing and fails MD060; update the separator row under the header "|
Model | Mode | TP | PP | Global Batch Size | Learning Rate | Hardware |" so each
column divider and surrounding spaces match the header exactly (e.g., use
"|-------|------|----|----|-------------------|---------------|----------|" with
spacing consistent for each column), ensuring the separator columns align with
the header text for the table containing "Gemma 3 VL 4B/12B/27B" entries.
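A separator row aligned with that header would look as follows (a sketch; only the header is quoted in the comment, so the data rows are omitted here):

```markdown
| Model | Mode | TP | PP | Global Batch Size | Learning Rate | Hardware |
|-------|------|----|----|-------------------|---------------|----------|
```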

In `@examples/models/vlm/glm_45v/inference.sh`:
- Around line 19-21: The header comment currently specifies "TP=1, PP=4, EP=2"
but the later inference invocation blocks use "PP=2, EP=4", causing confusion;
either make the header match the later blocks or update the script to
consistently use one configuration and/or add a clarifying comment explaining
both valid PP/EP permutations for different inference variants. Locate the
TP/PP/EP flag mentions (search for "TP=", "PP=", "EP=" and the string "TP=1,
PP=4, EP=2" and the blocks using "PP=2, EP=4") and then (a) update the header
comment to list both variants and when to use each, or (b) change the later
blocks to match the header so all PP/EP values are consistent across the script.

In `@examples/models/vlm/glm_45v/README.md`:
- Around line 170-172: Replace the bare URL in the README entry "GLM-4.5V:
https://huggingface.co/zai-org/GLM-4.5V" with a Markdown link (e.g., GLM-4.5V:
[GLM-4.5V](https://huggingface.co/zai-org/GLM-4.5V)) so the URL is not displayed
raw and the file satisfies MD034; update the line containing "GLM-4.5V"
accordingly.
- Around line 164-165: Typo in the LoRA/DoRA note: replace the incorrect
fragment "allowing fo fewer GPUs" with the correct phrase "allowing for fewer
GPUs" in the README sentence that reads "**Note:** LoRA/DoRA significantly
reduces memory requirements, allowing fo fewer GPUs. Expert parallelism (EP) is
essential for efficient training of this MoE model." to fix the spelling error.
- Around line 123-126: Make the wording consistent by updating the heading or
the sentence so both use the same form; e.g., change the heading "Pretrain" to
"Pretraining" (or alter the sentence "Pretraining is not verified for this
model." to "Pretrain is not verified for this model.") so the heading text
"Pretrain" and the following sentence use the identical term.
- Around line 92-109: The fenced code block under the "Expected output:" section
in examples/models/vlm/glm_45v/README.md is missing a language tag; update the
opening triple-backtick for that block to include the language "text" (i.e.,
change ``` to ```text) so the block passes the MD040 lint rule and renders
correctly.

In `@examples/models/vlm/glm_45v/slurm_peft.sh`:
- Around line 23-26: Add a pre-submit guard at the top of slurm_peft.sh that
creates the logs/ directory when the script is run locally and then submits
itself with sbatch; specifically, detect absence of SLURM_JOB_ID (e.g., if [ -z
"$SLURM_JOB_ID" ]; then mkdir -p logs && sbatch "$0" && exit; fi), so the logs/
directory exists before Slurm opens --output/--error files; place this near the
top before any `#SBATCH` directives or the submission path logic and ensure it
won’t run inside the allocated Slurm job.
- Around line 19-22: Header guidance ("Recommended: TP=1, PP=2, EP=4") conflicts
with the runtime config (PP=8). Update either the header or the configuration so
they match: either change the header recommendation to "TP=1, PP=8, EP=4" to
reflect the script's PP=8 setting, or change the configured value PP=8 to PP=2
to match the header; ensure the related mentions (the same header text and the
occurrence of PP in the script) are updated consistently.
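The pre-submit guard suggested for slurm_peft.sh above can be sketched as follows (function and message names are ours; the real script would call `sbatch "$0"` and exit where indicated):

```shell
#!/bin/bash
# Pre-submit guard: when run outside a Slurm allocation, create logs/
# so the #SBATCH --output/--error paths exist before Slurm opens them,
# then hand the script back to sbatch.
submit_if_needed() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        mkdir -p logs        # must exist before Slurm opens the log files
        echo "submitting via: sbatch $0"
        return 0             # caller would then run: sbatch "$0"; exit
    fi
    return 1                 # already inside the allocation; continue
}

submit_if_needed || echo "running inside Slurm job ${SLURM_JOB_ID}"
```

Placing this before any training logic means a bare `./slurm_peft.sh` both creates the log directory and submits the job, while the re-executed copy inside the allocation falls through to the training commands.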

In `@examples/models/vlm/glm_45v/slurm_sft.sh`:
- Around line 23-26: Add a pre-submit step to ensure the logs/ directory exists
to prevent Slurm from failing when it opens --output/--error files; update the
top of slurm_sft.sh to run a safe command like mkdir -p logs (or an equivalent
check) before any SBATCH directives or before submission, and apply the same
precaution for the other output/error paths referenced around the
CONTAINER_IMAGE usage and lines 36-37 so all log paths are created prior to
Slurm opening them.
- Around line 19-21: Update the mismatch between the header guidance and the
actual parallelism configuration by either changing the header comment or the
variables so they match; specifically reconcile the PP value referenced in the
header ("PP=2") with the PP variable used in the script (PP=8) and ensure TP and
EP recommendations (TP=1, EP=16) in the header match the TP and EP values set in
the script (refer to the PP, TP, and EP variable assignments in slurm_sft.sh);
pick one source of truth (header or variables), make them consistent, and update
any other related header lines (63-65) to reflect the chosen configuration.

In `@src/megatron/bridge/recipes/__init__.py`:
- Around line 21-26: Static analysis is failing due to star imports from glm and
glm_vl in megatron.bridge.recipes.__init__ (star imports from deepseek, gemma,
gemma3_vl, glm, glm_vl, gpt); fix by either adding "# noqa: F401,F403" to the
glm and glm_vl import lines to silence Ruff/Flake8, or replace the star imports
with explicit exported symbols and an __all__ list (e.g., import the specific
functions/classes from megatron.bridge.recipes.glm and
megatron.bridge.recipes.glm_vl and add them to __all__) so the lint tools no
longer report F401/F403.

yaoyu-33 and others added 2 commits January 30, 2026 12:25
@cuichenx cuichenx self-requested a review February 2, 2026 18:35
@cuichenx cuichenx added the r0.3.0 Cherry-pick label for r0.3.0 release branch label Feb 2, 2026
cuichenx previously approved these changes Feb 2, 2026
cuichenx (Contributor) left a comment:

LGTM

suiyoubi and others added 14 commits February 3, 2026 10:25
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ko3n1g <16716991+ko3n1g@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
…P8-CS (#2175)

Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
- Split docs into model introduction (docs/) and examples (examples/)
- docs/models/vlm/: Model overview and architecture details
- examples/models/vlm/: Training scripts, conversion, and step-by-step guides
- Update GLM-4.5V pipeline layout for better vision encoder balance
- Update hardware requirements: GLM-4.5V SFT 64 nodes, LoRA 32 nodes
- Add multi-node uv cache setup instructions
- Update recommended configurations with actual script values
@yaoyu-33 yaoyu-33 requested a review from a team as a code owner February 3, 2026 22:37
Signed-off-by: Ao Tang <aot@nvidia.com>
suiyoubi (Contributor) commented Feb 4, 2026

/ok to test b5030db

Signed-off-by: Ao Tang <aot@nvidia.com>
suiyoubi (Contributor) commented Feb 4, 2026

/ok to test bb011ea


Labels

r0.3.0 Cherry-pick label for r0.3.0 release branch


9 participants