[Model] Nemotron-H 3.5: quantized LM head and MTP compressed-tensors fix by askliar · Pull Request #43342 · vllm-project/vllm

askliar · 2026-05-21T17:47:31Z

Purpose

Two model-side fixes needed to run Nemotron-H 3.5 on the released NVFP4 checkpoints.

Part 2/3 of splitting #43333 (the combined B12x + Nemotron-3.5 + Qwen3.5 PR). Logically depends on the LM-head quant routing in part 1/3 (#43341): only once that routing exists does passing quant_config=self.quant_config to ParallelLMHead take effect.

Changes

NemotronHForCausalLM, NemotronHMTP: pass quant_config to ParallelLMHead so quantized LM heads load correctly.
NemotronHMTP: extend compressed_tensors_config.ignore with the per-expert MTP linears (gate_proj / up_proj / down_proj) before constructing self.layers. Those linears are BF16 in the released nvidia/Nemotron-H-3.5-* compressed-tensors checkpoints; without this, the FusedMoE quant-method dispatch fails with KeyError during weight load. This is the same failure mode that #41994 ([Bugfix] Extend compressed-tensors ignore for Qwen3.5 MTP experts) addresses for Qwen3.5 MTP, in a different model.
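A minimal sketch of the ignore-list extension described above, not the code from this PR. It assumes a stub config that models only an `ignore` list; `CompressedTensorsConfigStub`, `extend_ignore_for_mtp_experts`, and the `{prefix}.experts.{i}.{proj}` naming scheme are hypothetical, and the real module paths in vLLM may differ.

```python
from dataclasses import dataclass, field

# Projection names as listed in the PR description.
EXPERT_PROJS = ("gate_proj", "up_proj", "down_proj")


@dataclass
class CompressedTensorsConfigStub:
    """Hypothetical stand-in for a compressed-tensors config; only `ignore` is modeled."""
    ignore: list = field(default_factory=list)


def extend_ignore_for_mtp_experts(config, layer_prefixes, num_experts):
    """Append per-expert MTP linear names to config.ignore, skipping duplicates.

    The name layout "<prefix>.experts.<id>.<proj>" is an assumption made
    for illustration only.
    """
    seen = set(config.ignore)
    for prefix in layer_prefixes:
        for expert_id in range(num_experts):
            for proj in EXPERT_PROJS:
                name = f"{prefix}.experts.{expert_id}.{proj}"
                if name not in seen:
                    config.ignore.append(name)
                    seen.add(name)
    return config


if __name__ == "__main__":
    cfg = CompressedTensorsConfigStub(ignore=["lm_head"])
    extend_ignore_for_mtp_experts(cfg, ["mtp.layers.0"], num_experts=2)
    for entry in cfg.ignore:
        print(entry)
```

As the description notes, the extension has to happen before self.layers is constructed, so that the layers see the updated ignore list when their quant method is chosen. Making the helper idempotent keeps a repeated call from duplicating entries.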

Relation to existing open PRs (overlap disclosure)

#42124 (Add LM head quantization support for ModelOpt) already includes the one-line quant_config=self.quant_config for NemotronHForCausalLM.lm_head. This PR includes the same one-liner plus the matching one for NemotronHMTP.lm_head. If #42124 lands first, drop the NemotronHForCausalLM hunk on rebase.
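The shape of that one-liner can be illustrated with a toy model, assuming stand-in classes rather than vLLM's own. `QuantConfigStub`, `LMHeadStub`, and `MTPModelStub` are hypothetical; the sketch shows only that a head built without the config falls back to an unquantized path, and it does not reproduce how ParallelLMHead selects a quant method.

```python
class QuantConfigStub:
    """Hypothetical quant config carrying just a scheme name."""

    def __init__(self, name):
        self.name = name


class LMHeadStub:
    """Hypothetical stand-in for an LM head that accepts an optional quant config."""

    def __init__(self, vocab_size, hidden_size, quant_config=None, prefix=""):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.prefix = prefix
        # Without a config the head can only assume unquantized weights.
        self.quant_method = "unquantized" if quant_config is None else quant_config.name


class MTPModelStub:
    """Hypothetical model wrapper showing the plumbing of quant_config to the head."""

    def __init__(self, vocab_size, hidden_size, quant_config=None, pass_to_head=True):
        self.quant_config = quant_config
        self.lm_head = LMHeadStub(
            vocab_size,
            hidden_size,
            # The one-line change: forward the model's quant_config to the head.
            quant_config=self.quant_config if pass_to_head else None,
            prefix="lm_head",
        )


if __name__ == "__main__":
    cfg = QuantConfigStub("nvfp4")
    print(MTPModelStub(32, 8, cfg, pass_to_head=False).lm_head.quant_method)
    print(MTPModelStub(32, 8, cfg, pass_to_head=True).lm_head.quant_method)
```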

Unique to this PR:

NemotronHMTP quant_config plumbing for lm_head (#42124 does not touch nemotron_h_mtp.py).
NemotronHMTP compressed-tensors ignore extension for per-expert MTP linears: analogous to #41994 ([Bugfix] Extend compressed-tensors ignore for Qwen3.5 MTP experts) but for Nemotron-H. I found no open PR that covers this for Nemotron.

Test Plan

End-to-end serve Nemotron-H 3.5 (modelopt NVFP4 W4A4) — exercises quantized LM head + MTP.
End-to-end serve Nemotron-H 3.5 (compressed-tensors nvfp4-pack-quantized W4A16) — exercises MTP ignore-list extension.

Fill in actual test results before review per AGENTS.md.

AI assistance

This change was developed with AI assistance (Claude). Every changed line was reviewed locally by the submitter.

🤖 Generated with Claude Code

- NemotronHForCausalLM, NemotronHMTP: pass quant_config to ParallelLMHead so quantized LM heads load correctly.
- NemotronHMTP: extend compressed_tensors_config.ignore with the per-expert MTP linears (gate_proj / up_proj / down_proj); those are BF16 in the released checkpoints and would otherwise fail the quant-method dispatch.

Test plan
- End-to-end serve Nemotron-H 3.5 (modelopt NVFP4 W4A4): verifies the quantized LM head loads.
- End-to-end serve Nemotron-H 3.5 (compressed-tensors `nvfp4-pack-quantized` W4A16): verifies the MTP ignore-list extension.

AI assistance was used for drafting; every changed line was reviewed locally by the submitter.

Co-authored-by: Claude
Signed-off-by: Andrii Skliar <askliar@nvidia.com>

gemini-code-assist

Code Review

This pull request updates the NemotronH and NemotronH-MTP models to support quantization configurations by passing the quantization config to the language model head. Additionally, it implements logic in the NemotronH-MTP model to automatically exclude specific per-expert linear layers from compressed-tensors quantization when they are expected to remain in BF16 format. I have no feedback to provide.

askliar requested a review from tomeras91 as a code owner May 21, 2026 17:47

gemini-code-assist Bot reviewed May 21, 2026

MerkyorLynn mentioned this pull request Jun 5, 2026

Add ModelOpt W4A16 lm_head regression tests #44671

Open

askliar wants to merge 1 commit into vllm-project:main from askliar:feat/nemotron_3_5
