
cp: [model] fix: correct GLM-4.5V inference parallelism for 46-layer model (2322) into r0.3.0 #2336

Merged
ko3n1g merged 1 commit into r0.3.0 from cherry-pick-2322-r0.3.0 on Feb 12, 2026

Conversation

@ko3n1g (Contributor) commented on Feb 11, 2026

beep boop [🤖]: Hi @yaoyu-33 👋,

we've cherry-picked #2322 into r0.3.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Bug Fixes
    • Applied runtime hardening to model initialization in both text and visual language model conversion examples.
  • Configuration Updates
    • Optimized distributed inference parameters for visual language model processing on multi-GPU setups.


Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
ko3n1g (Contributor, Author) commented on Feb 11, 2026

/ok to test 0cd4f86

copy-pr-bot (bot) commented on Feb 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai (bot) commented on Feb 11, 2026

📝 Walkthrough

Three example and configuration files are modified to disable the mtp_num_layers parameter in sub-model configurations and adjust pipeline/expert parallelism settings for GLM-4.5V inference. The changes add post-initialization steps clearing mtp_num_layers and update tensor parallelism distribution parameters.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Conversion script hardening: `examples/conversion/hf_to_megatron_generate_text.py`, `examples/conversion/hf_to_megatron_generate_vlm.py` | Adds loops that clear `mtp_num_layers` by setting `m.config.mtp_num_layers = None` for each sub-model immediately after model loading and before device allocation. |
| Inference configuration: `examples/models/vlm/glm_45v/inference.sh` | Updates pipeline parallelism (PP) from 4 to 2 and expert parallelism (EP) from 2 to 4 in the Hugging Face inference command for 8-GPU execution. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes
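A plausible rationale for the PP change, inferred from the PR title rather than stated in the diff: with 46 transformer layers, the layer count must divide evenly across pipeline stages (assuming no uneven pipeline split is configured), which rules out PP=4 but allows PP=2 with 23 layers per stage; raising EP from 2 to 4 keeps PP × EP = 8 so all 8 GPUs stay occupied. A quick check of which PP values partition 46 layers evenly:

```python
# Find (PP, EP) pairs for a 46-layer model on 8 GPUs where the layer
# count divides evenly across pipeline stages and PP * EP == 8.
NUM_LAYERS = 46
NUM_GPUS = 8

def valid_pp_sizes(num_layers, num_gpus):
    """Return (pp, ep) pairs with an even layer split per stage."""
    return [(pp, num_gpus // pp)
            for pp in (1, 2, 4, 8)
            if num_layers % pp == 0]

print(valid_pp_sizes(NUM_LAYERS, NUM_GPUS))
```

PP=4 leaves a remainder (46 % 4 == 2), as does PP=8, so PP=2 / EP=4 is the largest valid pipeline split on this GPU count under the even-division assumption.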

Possibly related PRs

Suggested labels

r0.3.0

Suggested reviewers

  • cuichenx
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ⚠️ Warning | PR contains non-trivial changes affecting model inference parallelism and parameter configuration, but lacks documented test results, performance benchmarks, or regression testing evidence. | Include test results, performance benchmarks comparing old vs. new configuration, regression testing evidence, and justification for chosen parallelism values (PP=2, EP=4). |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly describes the main change: fixing GLM-4.5V inference parallelism configuration for a 46-layer model, with a reference to the original PR (#2322) and target branch (r0.3.0). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
`examples/conversion/hf_to_megatron_generate_text.py` (1)

Lines 170-172: Consider adding a tracking reference for this temp fix.

The TEMP FIX comment is helpful, but it would be good to include a link to an issue or TODO ticket so this workaround doesn't become permanent. This makes it easier to track when the proper fix lands upstream.

```diff
-    # TEMP FIX for inference failure when mtp_num_layers is not None
+    # TEMP FIX for inference failure when mtp_num_layers is not None
+    # TODO: Remove once MTP inference is properly supported (see PR `#2322`)
```



2 participants