[Model] Nemotron-H 3.5: quantized LM head and MTP compressed-tensors fix #43342

Open
askliar wants to merge 1 commit into vllm-project:main from askliar:feat/nemotron_3_5

@askliar askliar commented May 21, 2026

Purpose

Two model-side fixes needed to run Nemotron-H 3.5 on the released NVFP4 checkpoints.

Part 2/3 of splitting #43333 (the combined B12x + Nemotron-3.5 + Qwen3.5 PR). Logically depends on the LM-head quant routing in part 1/3 (#43341): until that routing exists, passing quant_config=self.quant_config to ParallelLMHead has no useful effect.
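The dependency on part 1/3 can be illustrated with a toy model of the dispatch. This is only a sketch: `ToyQuantConfig`, `ToyLMHead`, and the method names `"nvfp4"` / `"unquantized"` are hypothetical stand-ins, not vLLM classes or the actual routing code in #43341.

```python
class ToyQuantConfig:
    """Hypothetical config that maps a layer prefix to a quant method name."""

    def __init__(self, ignore=()):
        self.ignore = set(ignore)

    def get_quant_method(self, prefix):
        # Layers listed in `ignore` stay unquantized; everything else is NVFP4.
        return "unquantized" if prefix in self.ignore else "nvfp4"


class ToyLMHead:
    """Hypothetical LM head that consults quant_config only when it is passed."""

    def __init__(self, quant_config=None, prefix="lm_head"):
        if quant_config is None:
            # No config passed: the head can only assume unquantized weights.
            self.method = "unquantized"
        else:
            self.method = quant_config.get_quant_method(prefix)


head_without = ToyLMHead()
head_with = ToyLMHead(quant_config=ToyQuantConfig())
print(head_without.method, head_with.method)
```

In this toy, the head built without a config never sees the quantized path, which mirrors why the model has to forward its `quant_config` for a quantized LM head to load.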

Changes

  • NemotronHForCausalLM, NemotronHMTP: pass quant_config to ParallelLMHead so quantized LM heads load correctly.
  • NemotronHMTP: extend compressed_tensors_config.ignore with the per-expert MTP linears (gate_proj / up_proj / down_proj) before constructing self.layers. Those linears are BF16 in the released nvidia/Nemotron-H-3.5-* compressed-tensors checkpoints; without this, the FusedMoE quant-method dispatch fails with KeyError during weight load (same failure mode as [Bugfix] Extend compressed-tensors ignore for Qwen3.5 MTP experts #41994 for Qwen3.5 MTP, different model).
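A minimal sketch of the ignore-list extension described above, written against a stand-in config rather than vLLM itself. `FakeCompressedTensorsConfig`, `extend_ignore_with_mtp_experts`, and the module naming scheme (`{prefix}.layers.{i}.mixer.experts.{j}.{proj}`) are assumptions for illustration; the real checkpoint keys and the code in this PR may differ.

```python
from dataclasses import dataclass, field

# The three per-expert linears named in the PR description.
EXPERT_PROJECTIONS = ("gate_proj", "up_proj", "down_proj")


@dataclass
class FakeCompressedTensorsConfig:
    """Hypothetical stand-in; only the `ignore` list matters here."""

    ignore: list = field(default_factory=list)


def extend_ignore_with_mtp_experts(config, prefix, num_layers, num_experts):
    """Append per-expert MTP linear names to config.ignore, skipping duplicates."""
    seen = set(config.ignore)
    for layer in range(num_layers):
        for expert in range(num_experts):
            for proj in EXPERT_PROJECTIONS:
                # Assumed naming scheme, not taken from the checkpoint.
                name = f"{prefix}.layers.{layer}.mixer.experts.{expert}.{proj}"
                if name not in seen:
                    config.ignore.append(name)
                    seen.add(name)
    return config


cfg = FakeCompressedTensorsConfig(ignore=["lm_head"])
# Must run before the layers are constructed, so the dispatch sees the new entries.
extend_ignore_with_mtp_experts(cfg, prefix="mtp", num_layers=1, num_experts=2)
print(len(cfg.ignore))
```

The duplicate check keeps the helper idempotent, so calling it more than once on the same config does not grow the list.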

Relation to existing open PRs (overlap disclosure)

Unique to this PR:

Test Plan

  • End-to-end serve Nemotron-H 3.5 (modelopt NVFP4 W4A4) — exercises quantized LM head + MTP.
  • End-to-end serve Nemotron-H 3.5 (compressed-tensors nvfp4-pack-quantized W4A16) — exercises MTP ignore-list extension.

Fill in actual test results before review per AGENTS.md.

AI assistance

This change was developed with AI assistance (Claude). Every changed line was reviewed locally by the submitter.

🤖 Generated with Claude Code

- NemotronHForCausalLM, NemotronHMTP: pass quant_config to ParallelLMHead
  so quantized LM heads load correctly.
- NemotronHMTP: extend compressed_tensors_config.ignore with the per-expert
  MTP linears (gate_proj / up_proj / down_proj) — those are BF16 in the
  released checkpoints and would otherwise fail the quant-method dispatch.

Test plan
- End-to-end serve Nemotron-H 3.5 (modelopt NVFP4 W4A4) — verifies the
  quantized LM head loads.
- End-to-end serve Nemotron-H 3.5 (compressed-tensors `nvfp4-pack-quantized`
  W4A16) — verifies the MTP ignore-list extension.

AI assistance was used for drafting; every changed line was reviewed
locally by the submitter.

Co-authored-by: Claude
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
@askliar askliar requested a review from tomeras91 as a code owner May 21, 2026 17:47
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the NemotronH and NemotronH-MTP models to support quantization configurations by passing the quantization config to the language model head. Additionally, it implements logic in the NemotronH-MTP model to automatically exclude specific per-expert linear layers from compressed-tensors quantization when they are expected to remain in BF16 format. I have no feedback to provide.
