
[Bugfix] Fix DSV32 weight loading#38870

Merged
njhill merged 8 commits into vllm-project:main from zyongye:dsv32_fp8_weight_loading_fix
Apr 4, 2026

Conversation

@zyongye
Member

@zyongye zyongye commented Apr 3, 2026

Purpose

#38684 introduced a bug with the FP8 checkpoint:

KeyError on indexer.wk_weights_proj.weight_scale_inv during model loading
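For context, the failure comes from the stacked-parameter name remapping in load_weights: with the fused wk_weights_proj mapping active, the FP8 checkpoint's weight_scale_inv tensor for wk gets remapped onto a parameter name the fused module never registers. A minimal, self-contained sketch of that pattern (names and mapping entries are illustrative, not the exact vLLM code touched by this PR):

stacked_params_mapping = [
    # (fused param name, checkpoint shard name, shard id)
    ("indexer.wk_weights_proj", "indexer.wk", 0),
    ("indexer.wk_weights_proj", "indexer.weights_proj", 1),
]

# The fused module registers a fused weight but no FP8 block scale for wk.
params_dict = {"indexer.wk_weights_proj.weight": object()}

checkpoint_name = "indexer.wk.weight_scale_inv"  # present in the FP8 checkpoint
try:
    for fused_name, shard_name, shard_id in stacked_params_mapping:
        if shard_name in checkpoint_name:
            remapped = checkpoint_name.replace(shard_name, fused_name)
            param = params_dict[remapped]  # KeyError on ...weight_scale_inv
            break
except KeyError as exc:
    print(f"KeyError: {exc}")  # reproduces the failure mode described above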

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2  \
  --port 8001 \
  -tp 8 \
  --kv-cache-dtype fp8  \
  --max_model_len auto \
  --reasoning-parser deepseek_v3 \
  --speculative-config.method deepseek_mtp \
  --speculative-config.num_speculative_tokens 1

gsm8k score

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9500|±  | 0.006|
|     |       |strict-match    |     5|exact_match|↑  |0.9507|±  | 0.006|
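(The flexible-extract/strict-match filters above match lm-eval-harness output; a command along the following lines, run against the served endpoint, would reproduce the score. The exact flags and endpoint path are an assumption and are not taken from the PR.)

lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V3.2,base_url=http://localhost:8001/v1/completions \
  --tasks gsm8k \
  --num_fewshot 5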

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye requested a review from luccafong as a code owner April 3, 2026 02:53
@mergify mergify Bot added the deepseek (Related to DeepSeek models) and bug (Something isn't working) labels Apr 3, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces FP8 quantization support for DeepSeek MTP and V2 models by conditionally de-fusing the wk_weights_proj layer into separate components and adding checks for missing parameters during weight loading. The review feedback identifies several critical issues, including an incorrect initialization of quant_config from the Hugging Face configuration dictionary instead of the vLLM configuration object, and multiple instances where accessing get_name() lacks null-safety for unquantized models. Additionally, a redundant initialization of k_norm was found in the DeepSeek V2 model which leads to unnecessary memory allocation.

def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
self.config = vllm_config.model_config.hf_config
self.quant_config = self.config.quantization_config
Contributor

critical

The quant_config should be initialized from vllm_config.quant_config instead of self.config.quantization_config. The latter is a dictionary from the Hugging Face configuration and does not possess the get_name() method required later in load_weights. This will cause an AttributeError during model loading.

Suggested change
self.quant_config = self.config.quantization_config
self.quant_config = vllm_config.quant_config

("wk_weights_proj", "weights_proj", 1),
]

if self.quant_config.get_name() != "fp8":
Contributor

high

Accessing self.quant_config.get_name() is unsafe if self.quant_config is None (which occurs for unquantized models). A check for None should be added to prevent an AttributeError.

Suggested change
if self.quant_config.get_name() != "fp8":
if self.quant_config is None or self.quant_config.get_name() != "fp8":

disable_tp=True,
prefix=f"{prefix}.wk_weights_proj",
)
if self.quant_config.get_name() == "fp8":
Contributor

high

Accessing self.quant_config.get_name() is unsafe if self.quant_config is None. This will cause a crash when running the model without quantization.

Suggested change
if self.quant_config.get_name() == "fp8":
if self.quant_config is not None and self.quant_config.get_name() == "fp8":

quant_config=quant_config,
prefix=f"{prefix}.wk",
)
self.k_norm = LayerNorm(self.head_dim, eps=1e-6)
Contributor

high

This initialization of self.k_norm is redundant as it is unconditionally initialized again at line 674. This results in unnecessary memory allocation for duplicate parameters.

kw, _ = self.wk_weights_proj(hidden_states)
k = kw[:, : self.head_dim]
weights_raw = kw[:, self.head_dim :]
if self.quant_config.get_name() == "fp8":
Contributor

high

Accessing self.quant_config.get_name() is unsafe if self.quant_config is None. This will cause a crash during the forward pass for unquantized models.

Suggested change
if self.quant_config.get_name() == "fp8":
if self.quant_config is not None and self.quant_config.get_name() == "fp8":

("wk_weights_proj", "weights_proj", 1),
]
stacked_params_mapping.extend(indexer_fused_mapping)
if self.quant_config.get_name() != "fp8":
Contributor

high

Accessing self.quant_config.get_name() is unsafe if self.quant_config is None. This will cause a crash during weight loading for unquantized models.

Suggested change
if self.quant_config.get_name() != "fp8":
if self.quant_config is None or self.quant_config.get_name() != "fp8":

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye added the ready (ONLY add when PR is ready to merge/full CI is needed) label Apr 3, 2026
@zyongye
Member Author

zyongye commented Apr 3, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces conditional handling for FP8 quantization in DeepSeek models, specifically by separating the previously fused 'wk' and 'weights_proj' layers into individual ReplicatedLinear layers when FP8 is active. This change is reflected in both the model initialization and the weight loading logic. A critical logic error was identified in the weight loading section of deepseek_v2.py, where the condition for using fused mappings was inverted, potentially causing loading failures for both quantized and non-quantized models.

("wk_weights_proj", "weights_proj", 1),
]
stacked_params_mapping.extend(indexer_fused_mapping)
if self.quant_config is not None and self.quant_config.get_name() == "fp8":
Contributor

critical

The condition for extending stacked_params_mapping with indexer_fused_mapping is inverted. Based on the Indexer class implementation and the logic in vllm/model_executor/models/deepseek_mtp.py, the fused mapping should be used when the model is not using FP8 quantization. In the FP8 case, the Indexer uses separate wk and weights_proj layers, so they should be loaded individually by name rather than through a fused mapping. This inversion will cause loading failures for both FP8 and non-FP8 models.

Suggested change
if self.quant_config is not None and self.quant_config.get_name() == "fp8":
if self.quant_config is None or self.quant_config.get_name() != "fp8":

Member

@tlrmchlsmth tlrmchlsmth left a comment

Thanks for taking on this fix! I left a couple of comments on making the guards a bit less brittle

("wk_weights_proj", "wk", 0),
("wk_weights_proj", "weights_proj", 1),
]
stacked_params_mapping.extend(indexer_fused_mapping)
Member

It looks like this guards against cases where we know wk_weights_proj.wk and wk_weights_proj.weights_proj don't exist. Would it be less brittle to only put them in the mapping when we know they do exist instead?
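(A rough sketch of that alternative, with a hypothetical helper name and not the code that was ultimately merged: extend the mapping only when the fused module is actually present on the model, rather than keying off the quant method name.)

def maybe_extend_indexer_mapping(model, stacked_params_mapping):
    # Hypothetical sketch of the alternative above: add the fused-indexer
    # entries only when a wk_weights_proj submodule actually exists.
    fused_exists = any(
        name.endswith("wk_weights_proj") for name, _ in model.named_modules()
    )
    if fused_exists:
        stacked_params_mapping.extend([
            ("wk_weights_proj", "wk", 0),
            ("wk_weights_proj", "weights_proj", 1),
        ])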

Member Author

I think the problem is that wk and weights_proj always exist. But in the FP8 checkpoint one of them is FP8-quantized and the other is not, so it's not good to fuse them. On the other hand, on NVFP4 they are both unquantized, so we can fuse them into a single GEMM. This just keeps separate fused and un-fused paths.
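(To make the two paths concrete, here is a hedged sketch of the construction logic being described. The layer classes and the disable_tp/prefix arguments follow the snippets quoted in this review; the helper itself, dimensions, and bias are placeholders, not the merged code.)

from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                               ReplicatedLinear)

def build_indexer_projections(hidden_size, head_dim, n_heads, quant_config, prefix):
    # Hedged sketch of the two paths described above; not the merged code.
    if quant_config is not None and quant_config.get_name() == "fp8":
        # FP8 checkpoint: wk is block-quantized but weights_proj is not,
        # so keep them as two separate GEMMs.
        return {
            "wk": ReplicatedLinear(hidden_size, head_dim, bias=False,
                                   quant_config=quant_config,
                                   prefix=f"{prefix}.wk"),
            "weights_proj": ReplicatedLinear(hidden_size, n_heads, bias=False,
                                             quant_config=None,
                                             prefix=f"{prefix}.weights_proj"),
        }
    # NVFP4 / unquantized checkpoints: both halves are unquantized,
    # so fuse them into a single GEMM.
    return {
        "wk_weights_proj": MergedColumnParallelLinear(
            hidden_size, [head_dim, n_heads], bias=False,
            quant_config=None, disable_tp=True,
            prefix=f"{prefix}.wk_weights_proj"),
    }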

Member

Makes sense, thanks for the explanation

# weights_proj does not get quantized,
# so we run both with quant_config=None
# wk may be upcasted from the default quant;
# experiments show fusion is always aster unless WK proj is in FP4,
Member

nit: aster -> faster

disable_tp=True,
prefix=f"{prefix}.wk_weights_proj",
)
if self.quant_config is not None and self.quant_config.get_name() == "fp8":
Member

There are a few similar checks in this file and in vllm/model_executor/models/deepseek_mtp.py. The code would be simpler and easier to reason about if they all used the exact same logic; it might be good to factor the check out into a standalone function.
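(One possible shape for such a shared helper, with a hypothetical name and placement and not the code that was merged, importable from both deepseek_v2.py and deepseek_mtp.py:)

from typing import Optional

from vllm.model_executor.layers.quantization.base_config import QuantizationConfig

def is_fp8_quantized(quant_config: Optional[QuantizationConfig]) -> bool:
    # None-safe check used in place of the repeated inline guards;
    # hypothetical helper, not the code that was ultimately merged.
    return quant_config is not None and quant_config.get_name() == "fp8"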

Member

This check in particular seems brittle and I wonder if we can make it more robust

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye
Member Author

zyongye commented Apr 3, 2026

Refactored the guard to be self.is_fp4_ckpt and added the assignment in each init.

zyongye added 4 commits April 3, 2026 16:48
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@benchislett
Collaborator

See also #38928

@njhill njhill merged commit 8617f86 into vllm-project:main Apr 4, 2026
58 checks passed
ChuanLi1101 added a commit to ChuanLi1101/vllm that referenced this pull request Apr 5, 2026
GLM-5-FP8 checkpoints quantize the fused wk_weights_proj tensor with
FP8 block quantization (weight + weight_scale_inv).  Resolve merge
conflict with upstream indexer refactor (vllm-project#38684/vllm-project#38870) by always using
fused MergedColumnParallelLinear with quant_config:
- FP4: quant_config=None (weights_proj should not be quantized)
- Non-FP4: quant_config=quant_config (supports FP8 weight_scale_inv)

Add fallback in load_weights to handle both fused and separate checkpoint
formats gracefully via stacked_params_mapping.

Also reverts glm_moe_dsa from _DEEPSEEK_V3_FAMILY_MODEL_TYPES per
review feedback (will be submitted as a standalone PR).

Co-authored-by: Claude
Signed-off-by: Chuan Li <Chuan.Li2@amd.com>
Made-with: Cursor
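(A rough illustration of the load_weights fallback described in that commit message, simplified and not the actual code in that branch: prefer the stacked-mapping remap and fall back to the checkpoint's original name when the remapped parameter is not registered.)

def resolve_param(name, params_dict, stacked_params_mapping):
    # Illustrative sketch only: handle both fused and separate checkpoint
    # layouts by preferring the remapped (fused) name and falling back to
    # the original (separate) name.
    for fused_name, shard_name, shard_id in stacked_params_mapping:
        if shard_name in name:
            remapped = name.replace(shard_name, fused_name)
            if remapped in params_dict:
                return params_dict[remapped], shard_id  # fused layout
    if name in params_dict:
        return params_dict[name], None  # separate (un-fused) layout
    raise KeyError(name)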
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Apr 6, 2026
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

Labels

bug (Something isn't working) · deepseek (Related to DeepSeek models) · ready (ONLY add when PR is ready to merge/full CI is needed)
