[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 #38832

Merged: vadiklyutiy merged 2 commits into vllm-project:main from vadiklyutiy:qwen35-fp4-mtp.fc on Apr 3, 2026
@vadiklyutiy (Member)

Description

Fix AssertionError when loading nvidia/Qwen3.5-397B-A17B-NVFP4 with method="mtp".

The NVFP4 checkpoint stores the entire MTP branch in BF16, but hf_quant_config.json excludes only mtp.layers.0* and omits mtp.fc. As a result, the ColumnParallelLinear for mtp.fc is created with NVFP4 quantization (packed uint8 weights, with the input dimension halved), and weight loading then crashes because the BF16 checkpoint weight has a different shape.
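
To make the failure mode concrete, here is a minimal sketch of why mtp.fc slips past the exclusion list and why the shapes then disagree. It assumes the exclusion patterns are matched glob-style (modelled here with Python's `fnmatchcase`), and the layer sizes and the `mtp.layers.0.mlp.gate` module name are hypothetical; the helper names are illustrative and are not vLLM or Model-Optimizer APIs.

```python
from fnmatch import fnmatchcase

# Exclusion pattern as described in this PR; glob-style matching is an assumption.
EXCLUDE_MODULES = ["mtp.layers.0*"]


def is_excluded(module_name: str, patterns: list[str]) -> bool:
    """Return True if any exclusion pattern matches the module name."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


def expected_weight_shape(out_features: int, in_features: int, quantized: bool) -> tuple[int, int]:
    """Shape the layer expects for its weight parameter.

    NVFP4 packs two 4-bit values into one uint8, so the stored input
    dimension is halved for a quantized layer.
    """
    if quantized:
        return (out_features, in_features // 2)
    return (out_features, in_features)


# Hypothetical layer sizes, for illustration only.
OUT_FEATURES, IN_FEATURES = 4096, 8192

for name in ("mtp.layers.0.mlp.gate", "mtp.fc"):
    quantized = not is_excluded(name, EXCLUDE_MODULES)
    layer_shape = expected_weight_shape(OUT_FEATURES, IN_FEATURES, quantized)
    checkpoint_shape = (OUT_FEATURES, IN_FEATURES)  # MTP weights are stored in BF16
    print(name, "quantized:", quantized, "shapes match:", layer_shape == checkpoint_shape)
```

Under these assumptions only mtp.fc is treated as quantized, so it is the only module whose expected shape differs from the BF16 tensor in the checkpoint.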

Fix: Force quant_config=None for mtp.fc when the quantization method is modelopt_fp4.

This is a temporary workaround until NVIDIA/Model-Optimizer#1124 is merged and the checkpoint is re-exported with the corrected exclude_modules.
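
The workaround can be sketched as a small selection step applied before mtp.fc is constructed. This is an illustration rather than the merged diff: `QuantConfigStub` and `quant_config_for_mtp_fc` are hypothetical names, and the stub only imitates a config object that reports its method name through `get_name()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantConfigStub:
    """Hypothetical stand-in for a quantization config object."""

    name: str

    def get_name(self) -> str:
        return self.name


def quant_config_for_mtp_fc(quant_config: Optional[QuantConfigStub]) -> Optional[QuantConfigStub]:
    """Pick the quant config to pass when building mtp.fc.

    For modelopt_fp4 the checkpoint keeps mtp.fc in BF16, so the layer is
    forced to be unquantized by returning None. Any other config is passed
    through unchanged.
    """
    if quant_config is not None and quant_config.get_name() == "modelopt_fp4":
        return None
    return quant_config


# Usage: only the modelopt_fp4 case is overridden.
print(quant_config_for_mtp_fc(QuantConfigStub("modelopt_fp4")))
print(quant_config_for_mtp_fc(QuantConfigStub("fp8")))
```

Scoping the override to a single method name keeps the change narrow, which matters for a workaround that is meant to be removed once the checkpoint metadata is corrected.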

Related:

Test

2x B200, TP=2:

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 2 \
  --language-model-only \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --max-model-len 1024

Before: AssertionError at parameter.py:153 during MTP weight loading.

After: Server starts, inference works:

{"prompt": "What is 2+2?", "max_tokens": 32}
-> "The sum of 2 and 2 is 4..."
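
For anyone reproducing the check, the request above can be sent with a short script. This is a sketch under assumptions not stated in the PR: that the server listens on `http://localhost:8000` and that the prompt is posted to the OpenAI-compatible `/v1/completions` route. The function names are illustrative.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str, max_tokens: int) -> urllib.request.Request:
    """Build a POST request for a completions endpoint (URL layout is assumed)."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response. Requires a running server."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Build the request from the test above; call send(request) once the server is up.
request = build_completion_request(
    base_url="http://localhost:8000",  # assumed host and port
    model="nvidia/Qwen3.5-397B-A17B-NVFP4",
    prompt="What is 2+2?",
    max_tokens=32,
)
```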

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
vadiklyutiy requested a review from sighingnow as a code owner on April 2, 2026 17:20
mergify (bot) added the labels qwen (Related to Qwen models) and bug (Something isn't working) on Apr 2, 2026

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request implements a workaround for Qwen 3.5 MTP models by forcing the fc layer to remain unquantized when the modelopt_fp4 quantization configuration is used. This addresses an issue where the layer is stored as BF16 in checkpoints but missing from the exclusion list in the quantization configuration. I have no feedback to provide.

ZJY0516 added the label ready (ONLY add when PR is ready to merge/full CI is needed) on Apr 2, 2026
vadiklyutiy merged commit 771913e into vllm-project:main on Apr 3, 2026
57 checks passed
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Apr 6, 2026
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…llm-project#38832)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
Kh4L added a commit to Kh4L/vllm that referenced this pull request May 7, 2026
Compressed-tensors NVFP4 Qwen3.5 MoE checkpoints store the MTP layer's
fused expert weights as BF16 unquantized tensors
(`mtp.layers.X.mlp.experts.{down,gate_up}_proj`, shape `[num_experts, ...]`)
while the rest of the model is NVFP4-quantized per-expert per-projection.
However the per-expert MTP linears are not listed in the
compressed-tensors `quantization_config.ignore` field. vLLM ends up
constructing the MTP `FusedMoE` quantized (registering `w13_weight_packed`
/ `w2_weight_packed`), and weight loading fails:

  KeyError: 'layers.0.mlp.experts.w2_weight'
    in Qwen3_5MultiTokenPredictor.load_fused_expert_weights

This mirrors the existing `mtp.fc` workaround (PR vllm-project#38832) but for the
experts. We extend the active CT `ignore` list with every per-expert MTP
linear before constructing `self.layers`, so the FusedMoE picks
`UnquantizedFusedMoEMethod` and registers BF16 `w13_weight` /
`w2_weight` matching the checkpoint.
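
A rough sketch of the ignore-list extension described above. The per-expert module naming (`<prefix>.layers.<i>.mlp.experts.<e>.<proj>`), the projection names, and both helper functions are assumptions for illustration and are not taken from the referenced commit.

```python
def mtp_expert_ignore_entries(
    prefix: str,
    num_layers: int,
    num_experts: int,
    projections: tuple[str, ...] = ("gate_proj", "up_proj", "down_proj"),
) -> list[str]:
    """Enumerate per-expert MTP linear names (naming scheme is hypothetical)."""
    return [
        f"{prefix}.layers.{layer}.mlp.experts.{expert}.{proj}"
        for layer in range(num_layers)
        for expert in range(num_experts)
        for proj in projections
    ]


def extend_ignore(ignore: list[str], extra: list[str]) -> list[str]:
    """Append names that are not already ignored, preserving order."""
    seen = set(ignore)
    for name in extra:
        if name not in seen:
            ignore.append(name)
            seen.add(name)
    return ignore


# Usage with small, made-up sizes.
ignore = ["lm_head"]
extend_ignore(ignore, mtp_expert_ignore_entries("mtp", num_layers=1, num_experts=2))
print(len(ignore))
```

Doing this before the layers are constructed matters because the choice between a quantized and an unquantized method is made at construction time.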

Note: this is complementary to (not duplicative of) PR vllm-project#27608, which
fixes the orthogonal CT-loader bug that `get_quant_method` doesn't
honor `ignore` for FusedMoE. Even with vllm-project#27608 landed, an affected
checkpoint would still crash because its `ignore` list is missing the
per-expert MTP entries entirely. Once both vllm-project#27608 and corrected
checkpoint metadata are in place, this workaround can be removed.

Repro / impact (DGX Spark, GB10, BS=1, concurrency=1, prefix=32768,
ISL=2048, OSL=1024):

  | K | Before patch  | After patch (out tput) |
  |---|---------------|------------------------|
  | 0 | 63.08 t/s     | 63.08 t/s (unaffected) |
  | 1 | crash         | 71.08 t/s              |
  | 3 | crash         | 84.81 t/s              |
  | 5 | crash         | 87.76 t/s              |

Spec config also requires `moe_backend in {triton, flashinfer_trtllm,
flashinfer_cutlass, aiter}` for the unquantized MTP MoE; `marlin`
rejects unquantized FusedMoE. This is unrelated and not changed here.

Drive-by: update stale PR reference in the existing mtp.fc workaround
comment (vllm-project#38650 was closed unmerged; vllm-project#38832 is the merged fix).

Assisted-by: Claude
Signed-off-by: Serge Panev <spanev@nvidia.com>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
…llm-project#38832)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…llm-project#38832)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
…llm-project#38832)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
bryanfarrell pushed a commit to fiosco/vllm that referenced this pull request May 24, 2026
Carries upstream PR vllm-project#41994 (vllm-project/vllm) onto the fiosco release line as a single net-change commit.

Compressed-tensors NVFP4 Qwen3.5 MoE checkpoints store MTP-layer fused experts as BF16 unquantized but omit them from the compressed-tensors ignore list. Without this fix, vLLM builds a quantized FusedMoE for the MTP layer and weight loading fails (KeyError: layers.0.mlp.experts.w2_weight). Mirrors the existing mtp.fc workaround (PR vllm-project#38832) for the per-expert MTP linears.

Signed-off-by: Serge Panev <spanev@nvidia.com>

Signed-off-by: Bryan Farrell <12701870+bryanfarrell@users.noreply.github.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…llm-project#38832)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>