[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 by vadiklyutiy · Pull Request #38832 · vllm-project/vllm

vadiklyutiy · 2026-04-02T17:20:51Z

Description

Fix AssertionError when loading nvidia/Qwen3.5-397B-A17B-NVFP4 with method="mtp".

The NVFP4 checkpoint stores the entire MTP branch in BF16, but hf_quant_config.json excludes only mtp.layers.0* and omits mtp.fc. As a result, the ColumnParallelLinear for mtp.fc is created with NVFP4 quantization (packed uint8, half the input dimension) and crashes during weight loading because the BF16 checkpoint weight has a different shape.
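The failure mode can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes the exclusion list is matched with glob patterns, that packed 4-bit weights store two values per uint8 along the input dimension, and it uses a hypothetical module name (`mtp.layers.0.self_attn.q_proj`) and hypothetical dimensions. None of the function names come from vLLM.

```python
from fnmatch import fnmatchcase

# Exclusion pattern quoted in the description; glob matching is an assumption.
EXCLUDE_PATTERNS = ["mtp.layers.0*"]


def is_excluded(module_name: str, patterns: list[str]) -> bool:
    """Return True if any glob pattern matches the module name."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


def expected_weight_shape(out_features: int, in_features: int, quantized: bool) -> tuple[int, int]:
    """BF16 keeps one element per value; packed 4-bit halves the input dim."""
    return (out_features, in_features // 2 if quantized else in_features)


# Hypothetical dimensions, chosen only to make the mismatch visible.
OUT_FEATURES, IN_FEATURES = 1024, 2048

results = {}
for name in ("mtp.layers.0.self_attn.q_proj", "mtp.fc"):
    quantized = not is_excluded(name, EXCLUDE_PATTERNS)
    param_shape = expected_weight_shape(OUT_FEATURES, IN_FEATURES, quantized)
    # The checkpoint stores the whole MTP branch in BF16.
    checkpoint_shape = expected_weight_shape(OUT_FEATURES, IN_FEATURES, quantized=False)
    results[name] = (param_shape, checkpoint_shape)
    print(name, param_shape, checkpoint_shape, param_shape == checkpoint_shape)
```

In this sketch only `mtp.fc` ends up with a parameter shape that differs from the checkpoint shape, which corresponds to the shape check that fails at load time.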

Fix: force quant_config=None for mtp.fc when the quantization method is modelopt_fp4.

This is a temporary workaround until NVIDIA/Model-Optimizer#1124 is merged and the checkpoint is re-exported with the corrected exclude_modules.
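A minimal sketch of the selection logic behind this workaround follows. It is not the code from the PR: `StubQuantConfig` and `quant_config_for_mtp_fc` are hypothetical stand-ins, and the only detail taken from the description is that `mtp.fc` gets `quant_config=None` when the quantization method is `modelopt_fp4`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StubQuantConfig:
    """Hypothetical stand-in for a quantization config; only the name is used."""
    name: str


def quant_config_for_mtp_fc(
    quant_config: Optional[StubQuantConfig],
) -> Optional[StubQuantConfig]:
    """Return the config to use when building mtp.fc.

    The checkpoint stores mtp.fc in BF16, so for modelopt_fp4 the layer is
    built unquantized (None). Every other config is passed through unchanged.
    """
    if quant_config is not None and quant_config.name == "modelopt_fp4":
        return None
    return quant_config


fp4 = StubQuantConfig("modelopt_fp4")
other = StubQuantConfig("fp8")
print(quant_config_for_mtp_fc(fp4), quant_config_for_mtp_fc(other))
```

Keeping the override limited to one layer and one quantization method means it becomes a no-op candidate for removal once the checkpoint's exclusion list is corrected.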

Test

2x B200, TP=2:

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 2 \
  --language-model-only \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --max-model-len 1024

Before: AssertionError at parameter.py:153 during MTP weight loading.

After: Server starts, inference works:

{"prompt": "What is 2+2?", "max_tokens": 32}
-> "The sum of 2 and 2 is 4..."
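For reference, the request above could be issued from Python roughly as follows. This is a sketch under assumptions: it assumes the server listens on `http://localhost:8000` and exposes the OpenAI-compatible `/v1/completions` route, and it adds a `model` field that the original payload does not show. The helper names are illustrative.

```python
import json
import urllib.request

# Assumptions: default local address and the OpenAI-compatible completions route.
BASE_URL = "http://localhost:8000"
MODEL = "nvidia/Qwen3.5-397B-A17B-NVFP4"


def build_completion_request(prompt: str, max_tokens: int) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the test payload."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request("What is 2+2?", 32)
```

Calling `send(request)` against a running server should return the generated text; the call is left out of the block so that it can be run without a server.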

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

gemini-code-assist

Code Review

This pull request implements a workaround for Qwen 3.5 MTP models by forcing the fc layer to remain unquantized when the modelopt_fp4 quantization configuration is used. This addresses an issue where the layer is stored as BF16 in checkpoints but missing from the exclusion list in the quantization configuration. I have no feedback to provide.

…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Rishi Puri <riship@nvidia.com>

Compressed-tensors NVFP4 Qwen3.5 MoE checkpoints store the MTP layer's fused expert weights as BF16 unquantized tensors (`mtp.layers.X.mlp.experts.{down,gate_up}_proj`, shape `[num_experts, ...]`) while the rest of the model is NVFP4-quantized per-expert per-projection. However the per-expert MTP linears are not listed in the compressed-tensors `quantization_config.ignore` field. vLLM ends up constructing the MTP `FusedMoE` quantized (registering `w13_weight_packed` / `w2_weight_packed`), and weight loading fails:

    KeyError: 'layers.0.mlp.experts.w2_weight' in Qwen3_5MultiTokenPredictor.load_fused_expert_weights

This mirrors the existing `mtp.fc` workaround (PR vllm-project#38832) but for the experts. We extend the active CT `ignore` list with every per-expert MTP linear before constructing `self.layers`, so the FusedMoE picks `UnquantizedFusedMoEMethod` and registers BF16 `w13_weight` / `w2_weight` matching the checkpoint.

Note: this is complementary to (not duplicative of) PR vllm-project#27608, which fixes the orthogonal CT-loader bug that `get_quant_method` doesn't honor `ignore` for FusedMoE. Even with vllm-project#27608 landed, an affected checkpoint would still crash because its `ignore` list is missing the per-expert MTP entries entirely. Once both vllm-project#27608 and corrected checkpoint metadata are in place, this workaround can be removed.

Repro / impact (DGX Spark, GB10, BS=1, concurrency=1, prefix=32768, ISL=2048, OSL=1024):

| K | Before patch | After patch (output throughput) |
|---|--------------|---------------------------------|
| 0 | 63.08 t/s    | 63.08 t/s (unaffected)          |
| 1 | crash        | 71.08 t/s                       |
| 3 | crash        | 84.81 t/s                       |
| 5 | crash        | 87.76 t/s                       |

Spec config also requires `moe_backend in {triton, flashinfer_trtllm, flashinfer_cutlass, aiter}` for the unquantized MTP MoE; `marlin` rejects unquantized FusedMoE. This is unrelated and not changed here.

Drive-by: update stale PR reference in the existing mtp.fc workaround comment (vllm-project#38650 was closed unmerged; vllm-project#38832 is the merged fix).

Assisted-by: Claude
Signed-off-by: Serge Panev <spanev@nvidia.com>
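The ignore-list extension described in this commit message can be sketched as below. This is not the code from that change: the helper names are hypothetical, and the per-expert naming scheme (`mtp.layers.<layer>.mlp.experts.<expert>.<proj>` with `gate_proj`, `up_proj`, `down_proj`) is an assumption, since the message only names the fused `{down,gate_up}_proj` tensors.

```python
from typing import Iterable

# Assumed per-expert projection names (illustrative, not confirmed by the source).
PER_EXPERT_PROJECTIONS = ("gate_proj", "up_proj", "down_proj")


def mtp_expert_linear_names(
    num_mtp_layers: int, num_experts: int, prefix: str = "mtp"
) -> list[str]:
    """Enumerate every per-expert MTP linear under the assumed naming scheme."""
    return [
        f"{prefix}.layers.{layer}.mlp.experts.{expert}.{proj}"
        for layer in range(num_mtp_layers)
        for expert in range(num_experts)
        for proj in PER_EXPERT_PROJECTIONS
    ]


def extend_ignore(ignore: list[str], extra: Iterable[str]) -> list[str]:
    """Append missing names in place, preserving order and skipping duplicates."""
    seen = set(ignore)
    for name in extra:
        if name not in seen:
            ignore.append(name)
            seen.add(name)
    return ignore


ignore = ["lm_head"]  # hypothetical pre-existing entry
extend_ignore(ignore, mtp_expert_linear_names(num_mtp_layers=1, num_experts=2))
print(len(ignore))  # 1 existing entry + 1 layer * 2 experts * 3 projections = 7
```

Skipping names that are already present keeps the extension idempotent, so it stays harmless if a corrected checkpoint later lists the same entries itself.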

Carries upstream PR vllm-project#41994 (vllm-project/vllm) onto the fiosco release line as a single net-change commit. Compressed-tensors NVFP4 Qwen3.5 MoE checkpoints store MTP-layer fused experts as BF16 unquantized but omit them from the compressed-tensors ignore list. Without this fix, vLLM builds a quantized FusedMoE for the MTP layer and weight loading fails (KeyError: layers.0.mlp.experts.w2_weight). Mirrors the existing mtp.fc workaround (PR vllm-project#38832) for the per-expert MTP linears. Signed-off-by: Serge Panev <spanev@nvidia.com> Signed-off-by: Bryan Farrell <12701870+bryanfarrell@users.noreply.github.com>

…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5

11b9c12

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

vadiklyutiy requested a review from sighingnow as a code owner April 2, 2026 17:20

mergify Bot added the qwen (Related to Qwen models) and bug (Something isn't working) labels Apr 2, 2026

gemini-code-assist Bot reviewed Apr 2, 2026

vadiklyutiy mentioned this pull request Apr 2, 2026

[Bugfix] Enable MTP for the official Qwen3.5 NVFP4 checkpoint #38650

Closed

ZJY0516 approved these changes Apr 2, 2026

ZJY0516 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Apr 2, 2026

Merge branch 'main' into qwen35-fp4-mtp.fc

a07aebe

vadiklyutiy merged commit 771913e into vllm-project:main Apr 3, 2026
57 checks passed

HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Apr 6, 2026

[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 (v…

d981820

…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 (v…

e50c906

…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

Kh4L mentioned this pull request May 7, 2026

[Bugfix] Extend compressed-tensors ignore for Qwen3.5 MTP experts #41994

Open

2 tasks

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 (v…

fd58034

…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 (v…

5818099

…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 (v…

e420a47

…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 (v…

b3b1525

…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

vadiklyutiy merged 2 commits into vllm-project:main from vadiklyutiy:qwen35-fp4-mtp.fc
