[Bug Fix] Qwen3.5-nvfp4 MTP Speculative Decoding Weight Shape Mismatch#35675

Open
nguyen599 wants to merge 1 commit intovllm-project:mainfrom
nguyen599:qwen3-5/nvfp4-weight-shape-mismatch

Conversation

@nguyen599

@nguyen599 nguyen599 commented Mar 1, 2026

With Qwen3.5-nvfp4, when launching with --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}', the engine crashes during drafter weight loading:

(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self._init_executor()
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.driver_worker.load_model()
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 336, in load_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.model_runner.load_model(load_dummy_weights=dummy_weights)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4227, in load_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.drafter.load_model(self.model)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py", line 1298, in load_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.model = self._get_model()
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                  ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py", line 1277, in _get_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     model = get_model(
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]             ^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 136, in get_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return loader.load_model(
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 62, in load_model
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 290, in load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5_mtp.py", line 439, in load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return loader.load_weights(remap_weight_names(weights))
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/reload/torchao_decorator.py", line 50, in patched_model_load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     return original_load_weights(self, weights, *args, **kwargs)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 340, in load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 287, in _load_module
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     yield from self._load_module(
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 260, in _load_module
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5_mtp.py", line 319, in load_weights
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     weight_loader(param, loaded_weight)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 566, in weight_loader_v2
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     param.load_column_parallel_weight(loaded_weight=loaded_weight)
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/parameter.py", line 153, in load_column_parallel_weight
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]     assert self.data.shape == loaded_weight.shape
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8329) ERROR 03-01 20:10:41 [core.py:1100] AssertionError

This PR fixes the weight shape mismatch in MTP speculative decoding by loading the drafter's fc layer without tensor-parallel sharding or quantization.



Signed-off-by: nguyen599 <nguyenmanh599123@gmail.com>
@nguyen599 nguyen599 requested a review from sighingnow as a code owner March 1, 2026 20:13
@github-actions

github-actions bot commented Mar 1, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the labels qwen (Related to Qwen models) and bug (Something isn't working) on Mar 1, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a crash during weight loading for Qwen3.5-nvfp4 models when using MTP speculative decoding. The root cause appears to be a shape mismatch for the fc layer's weights in the Qwen3_5MultiTokenPredictor. The fix changes this layer from a ColumnParallelLinear to a ReplicatedLinear, which avoids tensor-parallel sharding for this specific weight and aligns with the expected weight format. Consequently, quantization is disabled for this layer, and the forward pass is updated to correctly handle the layer's output. The changes are logical, well-contained, and directly resolve the issue described.
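The shape mismatch described above can be illustrated with a toy sketch (plain Python, not vLLM's actual classes; the dimensions are illustrative, not the model's real sizes). NVFP4 packs two 4-bit values per byte, so a quantized parameter's input dimension is half that of the unquantized checkpoint tensor, which trips the `assert self.data.shape == loaded_weight.shape` seen in the traceback:

```python
# Toy illustration of the failing assertion in parameter.py.
# With an NVFP4-quantized linear layer, two FP4 values are packed per
# byte, so the allocated parameter's last dim is halved, while the MTP
# drafter's fc weight in the checkpoint keeps the full width.
in_features, out_features = 8192, 4096  # illustrative sizes only

# Shape the quantized (packed) parameter allocates on the GPU.
quantized_param_shape = (out_features, in_features // 2)

# Shape of the unquantized fc weight as stored in the checkpoint.
checkpoint_weight_shape = (out_features, in_features)

# load_column_parallel_weight asserts these are equal -> AssertionError.
assert quantized_param_shape != checkpoint_weight_shape

# With ReplicatedLinear and quant_config=None, the parameter keeps the
# full unpacked shape on every rank, so the checkpoint tensor loads.
replicated_param_shape = (out_features, in_features)
assert replicated_param_shape == checkpoint_weight_shape
```

This is only a model of the shape arithmetic; the actual fix lives in qwen3_5_mtp.py and swaps the layer class rather than reshaping weights.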

scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 4, 2026
Cherry-pick upstream fixes for GB10 Spark (SM121):

- PR vllm-project#35568: Recognize SM121 as SM120 family for Marlin/CUTLASS FP8
  kernels (generate_kernels.py, ops.cu, scaled_mm*.cuh, marlin_utils.py)
- PR vllm-project#35675: Fix Qwen3.5 MTP fc layer weight shape mismatch with NVFP4
  by using ReplicatedLinear with quant_config=None
- PR vllm-project#35833: FP8 KV cache for Triton MLA decode on Blackwell — adds
  on-the-fly FP8 dequantization in Triton kernels
- PR vllm-project#35936: tool_choice="required" falls back to tool_parser for
  non-JSON (XML) tool calls from Qwen3 models

Local patches:
- Patch FlashInfer TRTLLM JIT to compile for SM12x
  (supported_major_versions=[10] → [10, 12])
- Skip VLLM_TEST_FORCE_FP8_MARLIN for NVFP4 MoE (not SM121-ready)
@voipmonitor

@nguyen599 isn't this already fixed in the latest branch? Can you double-check?

@nguyen599

nguyen599 commented Mar 4, 2026

@nguyen599 isn't this already fixed in the latest branch? Can you double-check?

@voipmonitor I just checked vLLM at commit 6f0dd93. It still shows the error. Command to reproduce:

python -m vllm.entrypoints.openai.api_server --model '/path/to/model' \
        --served-model-name 'qwen3-5-nvfp4' \
        --gpu-memory-utilization '0.96' \
        --host '0.0.0.0' --port '8000' \
        --quantization modelopt --max-model-len 40960 \
        --speculative-config '{"method":"mtp", "num_speculative_tokens": 2}'

I used 1×H100 with the model txn545/Qwen3.5-35B-A3B-NVFP4.

ec-jt added a commit to ec-jt/vllm that referenced this pull request Mar 22, 2026

PR vllm-project#35675 equivalent (MTP fc layer fix):
- Updated qwen3_5_mtp.py
- Switched the import from ColumnParallelLinear to ReplicatedLinear
- Changed FC construction from self.fc = ColumnParallelLinear(...) to self.fc = ReplicatedLinear(...)
- Removed TP-only args (gather_output, return_bias)
- Set quant_config=None for this layer
- Updated the call site to unpack the tuple: hidden_states, _ = self.fc(hidden_states)

PR vllm-project#35936 equivalent (tool_choice="required" fallback):
- Updated engine/serving.py
- Replaced the JSON-parse suppress block at elif request.tool_choice == "required":
- New flow: first try TypeAdapter(...).validate_json(content); on ValidationError or
  JSON decode error, fall back to the configured tool parser when available; convert
  parsed tool calls into FunctionCall(...) entries
- Removed the now-unused contextlib import

Signed-off-by: ec-jt <james.trappett@elementalcompute.com>
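The tool_choice="required" fallback flow described in the commit above can be sketched as follows. This is a hypothetical simplification: the stdlib json module stands in for pydantic's TypeAdapter(...).validate_json, and parse_tool_calls and tool_parser are placeholder names, not vLLM's actual API.

```python
# Sketch of "try strict JSON first, fall back to a tool parser" for
# tool_choice="required". Placeholder names; not vLLM's real interface.
import json

def parse_tool_calls(content, tool_parser=None):
    """Return structured tool calls from model output.

    Tries strict JSON first (analogous to TypeAdapter.validate_json);
    on failure, falls back to a configured parser that can handle
    non-JSON formats such as XML-style <tool_call> tags.
    """
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        if tool_parser is not None:
            return tool_parser(content)  # e.g. an XML tool-call parser
        raise

# Strict JSON path succeeds directly.
calls = parse_tool_calls('[{"name": "get_weather"}]')
assert calls == [{"name": "get_weather"}]

# Non-JSON (XML-style) output from Qwen3 models falls back to the parser.
xml = "<tool_call>get_weather</tool_call>"
calls = parse_tool_calls(xml, tool_parser=lambda s: [{"name": "get_weather"}])
assert calls == [{"name": "get_weather"}]
```

The design point is that the exception path is the dispatch mechanism: JSON-native models pay no extra cost, while XML-emitting models still produce usable FunctionCall entries instead of a hard 400 error.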