[Bug Fix] Qwen3.5-nvfp4 MTP Speculative Decoding Weight Shape Mismatch by nguyen599 · Pull Request #35675 · vllm-project/vllm

nguyen599 · 2026-03-01T20:13:27Z

With Qwen3.5-nvfp4, when launching with --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}', the engine crashes during drafter weight loading:

(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self._init_executor()
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.driver_worker.load_model()
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 336, in load_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.model_runner.load_model(load_dummy_weights=dummy_weights)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4227, in load_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.drafter.load_model(self.model)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py", line 1298, in load_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.model = self._get_model()
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                  ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py", line 1277, in _get_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     model = get_model(
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]             ^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 136, in get_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return loader.load_model(
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 62, in load_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 290, in load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5_mtp.py", line 439, in load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return loader.load_weights(remap_weight_names(weights))
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/reload/torchao_decorator.py", line 50, in patched_model_load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return original_load_weights(self, weights, *args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 340, in load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 287, in _load_module
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     yield from self._load_module(
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 260, in _load_module
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5_mtp.py", line 319, in load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     weight_loader(param, loaded_weight)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 566, in weight_loader_v2
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     param.load_column_parallel_weight(loaded_weight=loaded_weight)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/parameter.py", line 153, in load_column_parallel_weight
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     assert self.data.shape == loaded_weight.shape
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100] AssertionError

This PR addresses the weight shape mismatch issue in MTP speculative decoding.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: nguyen599 <nguyenmanh599123@gmail.com>

github-actions · 2026-03-01T20:13:36Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request addresses a crash during weight loading for Qwen3.5-nvfp4 models when using MTP speculative decoding. The root cause appears to be a shape mismatch for the fc layer's weights in the Qwen3_5MultiTokenPredictor. The fix involves changing this layer from a ColumnParallelLinear to a ReplicatedLinear, which avoids tensor-parallel sharding for this specific weight and aligns with the expected weight format. Consequentially, quantization is disabled for this layer, and the forward pass is updated to correctly handle the layer's output. The changes are logical, well-contained, and directly resolve the issue described.

Cherry-pick upstream fixes for GB10 Spark (SM121): - PR vllm-project#35568: Recognize SM121 as SM120 family for Marlin/CUTLASS FP8 kernels (generate_kernels.py, ops.cu, scaled_mm*.cuh, marlin_utils.py) - PR vllm-project#35675: Fix Qwen3.5 MTP fc layer weight shape mismatch with NVFP4 by using ReplicatedLinear with quant_config=None - PR vllm-project#35833: FP8 KV cache for Triton MLA decode on Blackwell — adds on-the-fly FP8 dequantization in Triton kernels - PR vllm-project#35936: tool_choice="required" falls back to tool_parser for non-JSON (XML) tool calls from Qwen3 models Local patches: - Patch FlashInfer TRTLLM JIT to compile for SM12x (supported_major_versions=[10] → [10, 12]) - Skip VLLM_TEST_FORCE_FP8_MARLIN for NVFP4 MoE (not SM121-ready)

voipmonitor · 2026-03-04T09:42:50Z

@nguyen599 isnt this already fixed in the latest branch? can you doublecheck?

nguyen599 · 2026-03-04T10:09:20Z

@nguyen599 isnt this already fixed in the latest branch? can you doublecheck?

@voipmonitor I just check vllm at commit 6f0dd93. It still show error, reproduce error command:

python -m vllm.entrypoints.openai.api_server --model '/path/to/model' \
        --served-model-name 'qwen3-5-nvfp4' \
        --gpu-memory-utilization '0.96' \
        --host '0.0.0.0' --port '8000' \
        --quantization modelopt --max-model-len 40960 \
        --speculative-config '{"method":"mtp", "num_speculative_tokens": 2}'

I used 1xH100 and model at txn545/Qwen3.5-35B-A3B-NVFP4.

Cherry-pick upstream fixes for GB10 Spark (SM121): - PR vllm-project#35568: Recognize SM121 as SM120 family for Marlin/CUTLASS FP8 kernels (generate_kernels.py, ops.cu, scaled_mm*.cuh, marlin_utils.py) - PR vllm-project#35675: Fix Qwen3.5 MTP fc layer weight shape mismatch with NVFP4 by using ReplicatedLinear with quant_config=None - PR vllm-project#35833: FP8 KV cache for Triton MLA decode on Blackwell — adds on-the-fly FP8 dequantization in Triton kernels - PR vllm-project#35936: tool_choice="required" falls back to tool_parser for non-JSON (XML) tool calls from Qwen3 models Local patches: - Patch FlashInfer TRTLLM JIT to compile for SM12x (supported_major_versions=[10] → [10, 12]) - Skip VLLM_TEST_FORCE_FP8_MARLIN for NVFP4 MoE (not SM121-ready)

PR vllm-project#35675 equivalent (MTP fc layer fix) Updated qwen3_5_mtp.py Switched import from ColumnParallelLinear to ReplicatedLinear Changed FC construction from self.fc = ColumnParallelLinear(...) to self.fc = ReplicatedLinear(...) Removed TP-only args (gather_output, return_bias) Set quant_config=None for this layer Updated call site to unpack tuple: hidden_states, _ = self.fc(hidden_states) PR vllm-project#35936 equivalent (tool_choice="required" fallback) Updated engine/serving.py Replaced JSON parse suppress-block at elif request.tool_choice == "required": New flow: First try TypeAdapter(...).validate_json(content) On ValidationError or JSON decode error, fallback to configured tool parser when available Convert parsed tool calls into FunctionCall(...) entries Removed now-unused contextlib import Signed-off-by: ec-jt <james.trappett@elementalcompute.com>

[Bug Fix] Qwen3.5-nvfp4 MTP Speculative Decoding Weight Shape Mismatch

52160b9

Signed-off-by: nguyen599 <nguyenmanh599123@gmail.com>

nguyen599 requested a review from sighingnow as a code owner March 1, 2026 20:13

mergify bot added qwen Related to Qwen models bug Something isn't working labels Mar 1, 2026

gemini-code-assist bot reviewed Mar 1, 2026

View reviewed changes

nguyen599 mentioned this pull request Mar 9, 2026

[Bug]: Qwen3.5 NVFP4 Checkpoint has poor accuracy #36094

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug Fix] Qwen3.5-nvfp4 MTP Speculative Decoding Weight Shape Mismatch#35675

[Bug Fix] Qwen3.5-nvfp4 MTP Speculative Decoding Weight Shape Mismatch#35675
nguyen599 wants to merge 1 commit intovllm-project:mainfrom
nguyen599:qwen3-5/nvfp4-weight-shape-mismatch

nguyen599 commented Mar 1, 2026 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Mar 1, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

voipmonitor commented Mar 4, 2026

Uh oh!

nguyen599 commented Mar 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

nguyen599 commented Mar 1, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 1, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

voipmonitor commented Mar 4, 2026

Uh oh!

nguyen599 commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nguyen599 commented Mar 1, 2026 •

edited by github-actions bot

Loading

nguyen599 commented Mar 4, 2026 •

edited

Loading