[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 #38832
Merged
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Code Review
This pull request implements a workaround for Qwen 3.5 MTP models by forcing the fc layer to remain unquantized when the modelopt_fp4 quantization configuration is used. This addresses an issue where the layer is stored as BF16 in checkpoints but missing from the exclusion list in the quantization configuration. I have no feedback to provide.
ZJY0516
approved these changes
Apr 2, 2026
HenryTangDev
pushed a commit
to HenryTangMain/vllm
that referenced
this pull request
Apr 6, 2026
…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
puririshi98
pushed a commit
to puririshi98/vllm
that referenced
this pull request
Apr 7, 2026
…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet
pushed a commit
to blackfuel-ai/vllm
that referenced
this pull request
Apr 9, 2026
…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2 tasks
Kh4L
added a commit
to Kh4L/vllm
that referenced
this pull request
May 7, 2026
Compressed-tensors NVFP4 Qwen3.5 MoE checkpoints store the MTP layer's
fused expert weights as BF16 unquantized tensors
(`mtp.layers.X.mlp.experts.{down,gate_up}_proj`, shape `[num_experts, ...]`)
while the rest of the model is NVFP4-quantized per-expert per-projection.
However the per-expert MTP linears are not listed in the
compressed-tensors `quantization_config.ignore` field. vLLM ends up
constructing the MTP `FusedMoE` quantized (registering `w13_weight_packed`
/ `w2_weight_packed`), and weight loading fails:
KeyError: 'layers.0.mlp.experts.w2_weight'
in Qwen3_5MultiTokenPredictor.load_fused_expert_weights
This mirrors the existing `mtp.fc` workaround (PR vllm-project#38832) but for the
experts. We extend the active CT `ignore` list with every per-expert MTP
linear before constructing `self.layers`, so the FusedMoE picks
`UnquantizedFusedMoEMethod` and registers BF16 `w13_weight` /
`w2_weight` matching the checkpoint.
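The ignore-list extension described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper names, the `mtp.layers` prefix default, and the per-expert projection names (`gate_proj`, `up_proj`, `down_proj`) are assumptions made for the example.

```python
def mtp_expert_ignore_entries(num_mtp_layers, num_experts, prefix="mtp.layers"):
    """Build per-expert module names for the MTP MoE layers.

    The naming scheme here is hypothetical; a real checkpoint may use
    different projection names.
    """
    projections = ("gate_proj", "up_proj", "down_proj")  # assumed names
    return [
        f"{prefix}.{layer}.mlp.experts.{expert}.{proj}"
        for layer in range(num_mtp_layers)
        for expert in range(num_experts)
        for proj in projections
    ]


def extend_ignore(ignore, extra):
    """Return a new ignore list with `extra` appended, skipping duplicates."""
    seen = set(ignore)
    merged = list(ignore)
    for name in extra:
        if name not in seen:
            merged.append(name)
            seen.add(name)
    return merged


# Example: one MTP layer with two experts, starting from an existing ignore list.
ignore = ["lm_head"]
ignore = extend_ignore(ignore, mtp_expert_ignore_entries(1, 2))
```

Deduplicating keeps the extension idempotent, so applying it to a checkpoint whose metadata has already been corrected would leave the list unchanged.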
Note: this is complementary to (not duplicative of) PR vllm-project#27608, which
fixes the orthogonal CT-loader bug that `get_quant_method` doesn't
honor `ignore` for FusedMoE. Even with vllm-project#27608 landed, an affected
checkpoint would still crash because its `ignore` list is missing the
per-expert MTP entries entirely. Once both vllm-project#27608 and corrected
checkpoint metadata are in place, this workaround can be removed.
Repro / impact (DGX Spark, GB10, BS=1, concurrency=1, prefix=32768,
ISL=2048, OSL=1024):
| K | Before patch | After patch (output throughput) |
|---|--------------|---------------------------------|
| 0 | 63.08 t/s | 63.08 t/s (unaffected) |
| 1 | crash | 71.08 t/s |
| 3 | crash | 84.81 t/s |
| 5 | crash | 87.76 t/s |
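For reference, the relative gain over the K=0 baseline can be worked out directly from the table. The snippet below only restates the reported numbers; the variable names are illustrative.

```python
# Throughput figures copied from the table above (tokens per second).
baseline_tps = 63.08
after_patch_tps = {1: 71.08, 3: 84.81, 5: 87.76}

# Ratio of each patched configuration to the K=0 baseline, rounded to 2 places.
speedups = {k: round(tps / baseline_tps, 2) for k, tps in after_patch_tps.items()}

for k, ratio in speedups.items():
    print(f"K={k}: {ratio:.2f}x over K=0")
# K=1: 1.13x, K=3: 1.34x, K=5: 1.39x
```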
Spec config also requires `moe_backend in {triton, flashinfer_trtllm,
flashinfer_cutlass, aiter}` for the unquantized MTP MoE; `marlin`
rejects unquantized FusedMoE. This is unrelated and not changed here.
Drive-by: update stale PR reference in the existing mtp.fc workaround
comment (vllm-project#38650 was closed unmerged; vllm-project#38832 is the merged fix).
Assisted-by: Claude
Signed-off-by: Serge Panev <spanev@nvidia.com>
mystous
pushed a commit
to mystous/vllm_hybrid
that referenced
this pull request
May 10, 2026
…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
my-other-github-account
pushed a commit
to my-other-github-account/vllm
that referenced
this pull request
May 15, 2026
…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
jhu960213
pushed a commit
to jhu960213/vllm
that referenced
this pull request
May 20, 2026
…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
bryanfarrell
pushed a commit
to fiosco/vllm
that referenced
this pull request
May 24, 2026
Carries upstream PR vllm-project#41994 (vllm-project/vllm) onto the fiosco release line as a single net-change commit.
Compressed-tensors NVFP4 Qwen3.5 MoE checkpoints store MTP-layer fused experts as BF16 unquantized but omit them from the compressed-tensors ignore list. Without this fix, vLLM builds a quantized FusedMoE for the MTP layer and weight loading fails (KeyError: layers.0.mlp.experts.w2_weight). Mirrors the existing mtp.fc workaround (PR vllm-project#38832) for the per-expert MTP linears.
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Bryan Farrell <12701870+bryanfarrell@users.noreply.github.com>
mvanhorn
pushed a commit
to mvanhorn/vllm
that referenced
this pull request
Jun 4, 2026
…llm-project#38832) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Description
Fix `AssertionError` when loading `nvidia/Qwen3.5-397B-A17B-NVFP4` with `method="mtp"`.
The NVFP4 checkpoint stores the entire MTP branch in BF16, but `hf_quant_config.json` only excludes `mtp.layers.0*` and misses `mtp.fc`. This causes `ColumnParallelLinear` for `mtp.fc` to be created with NVFP4 quantization (packed uint8, half the input dimension), which then crashes at weight loading because the BF16 checkpoint weight shape doesn't match.
Fix: Force `quant_config=None` for `mtp.fc` when the quantization method is `modelopt_fp4`.
This is a temporary workaround until NVIDIA/Model-Optimizer#1124 is merged and the checkpoint is re-exported with the corrected `exclude_modules`.
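A minimal sketch of the workaround and of the shape mismatch it avoids. The helper names, the plain-string quantization name, and the layer sizes are assumptions for illustration; the actual vLLM code paths and signatures are not reproduced here.

```python
def fc_quant_config(quant_config, quant_name):
    """Pick the quantization config to use when building mtp.fc.

    Hypothetical helper: returns None (unquantized) for modelopt_fp4,
    otherwise passes the model-wide config through unchanged.
    """
    if quant_name == "modelopt_fp4":
        return None
    return quant_config


def packed_fp4_weight_shape(out_features, in_features):
    """Shape of a weight that packs two 4-bit values into each uint8 byte."""
    if in_features % 2:
        raise ValueError("in_features must be even to pack two FP4 values per byte")
    return (out_features, in_features // 2)


# Illustrative sizes only; not the real mtp.fc dimensions.
checkpoint_shape = (1024, 2048)  # BF16 weight as stored in the checkpoint
quantized_param_shape = packed_fp4_weight_shape(*checkpoint_shape)
shapes_match = quantized_param_shape == checkpoint_shape  # False: loading would fail

placeholder_config = {"method": "modelopt_fp4"}  # stand-in for a real config object
chosen = fc_quant_config(placeholder_config, "modelopt_fp4")  # None, so mtp.fc stays BF16
```

Keying the override on the quantization name keeps other quantization methods on their normal path, which matches the narrow scope described for this workaround.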
Test
2x B200, TP=2:
Before: `AssertionError` at `parameter.py:153` during MTP weight loading.
After: Server starts, inference works:
{"prompt": "What is 2+2?", "max_tokens": 32} -> "The sum of 2 and 2 is 4..."
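The smoke test above could be reproduced with a request along these lines. This is a sketch under assumptions: it presumes the server exposes an OpenAI-compatible `/v1/completions` endpoint at `http://localhost:8000`, and the helper name is made up for the example. Only the prompt, `max_tokens`, and model name come from the description.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens):
    """Build (but do not send) a POST request for a text completion."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8000",  # assumed host and port
    "nvidia/Qwen3.5-397B-A17B-NVFP4",
    "What is 2+2?",
    32,
)
```

With a server running, `urllib.request.urlopen(req)` would send the request; the sketch stops short of that so it can be run without one.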