
[Bugfix] Fix DSV32 weight loading#38870

Merged
njhill merged 8 commits into vllm-project:main from zyongye:dsv32_fp8_weight_loading_fix
Apr 4, 2026

Conversation

@zyongye
Member

@zyongye zyongye commented Apr 3, 2026

Purpose

#38684 introduced a bug with the FP8 checkpoint:

KeyError on indexer.wk_weights_proj.weight_scale_inv during model loading
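For context, the failure comes from the stacked-parameter name remapping in load_weights: with the fused wk_weights_proj mapping active, the FP8 checkpoint's weight_scale_inv tensor for wk gets remapped onto a parameter name the fused module never registers. A minimal, self-contained sketch of that pattern (names and mapping entries are illustrative, not the exact vLLM code touched by this PR):

stacked_params_mapping = [
    # (fused param name, checkpoint shard name, shard id)
    ("indexer.wk_weights_proj", "indexer.wk", 0),
    ("indexer.wk_weights_proj", "indexer.weights_proj", 1),
]

# The fused module registers a fused weight but no FP8 block scale for wk.
params_dict = {"indexer.wk_weights_proj.weight": object()}

checkpoint_name = "indexer.wk.weight_scale_inv"  # present in the FP8 checkpoint
try:
    for fused_name, shard_name, shard_id in stacked_params_mapping:
        if shard_name in checkpoint_name:
            remapped = checkpoint_name.replace(shard_name, fused_name)
            param = params_dict[remapped]  # KeyError on ...weight_scale_inv
            break
except KeyError as exc:
    print(f"KeyError: {exc}")  # reproduces the failure mode described above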

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2  \
  --port 8001 \
  -tp 8 \
  --kv-cache-dtype fp8  \
  --max_model_len auto \
  --reasoning-parser deepseek_v3 \
  --speculative-config.method deepseek_mtp \
  --speculative-config.num_speculative_tokens 1

gsm8k score

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9500|±  | 0.006|
|     |       |strict-match    |     5|exact_match|↑  |0.9507|±  | 0.006|
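(The flexible-extract/strict-match filters above match lm-eval-harness output; a command along the following lines, run against the served endpoint, would reproduce the score. The exact flags and endpoint path are an assumption and are not taken from the PR.)

lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V3.2,base_url=http://localhost:8001/v1/completions \
  --tasks gsm8k \
  --num_fewshot 5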

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye requested a review from luccafong as a code owner April 3, 2026 02:53
@mergify mergify Bot added the deepseek (Related to DeepSeek models) and bug (Something isn't working) labels Apr 3, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces FP8 quantization support for DeepSeek MTP and V2 models by conditionally de-fusing the wk_weights_proj layer into separate components and adding checks for missing parameters during weight loading. The review feedback identifies several critical issues, including an incorrect initialization of quant_config from the Hugging Face configuration dictionary instead of the vLLM configuration object, and multiple instances where accessing get_name() lacks null-safety for unquantized models. Additionally, a redundant initialization of k_norm was found in the DeepSeek V2 model which leads to unnecessary memory allocation.

def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
self.config = vllm_config.model_config.hf_config
self.quant_config = self.config.quantization_config
Contributor

critical

The quant_config should be initialized from vllm_config.quant_config instead of self.config.quantization_config. The latter is a dictionary from the Hugging Face configuration and does not possess the get_name() method required later in load_weights. This will cause an AttributeError during model loading.

Suggested change
self.quant_config = self.config.quantization_config
self.quant_config = vllm_config.quant_config

("wk_weights_proj", "weights_proj", 1),
]

if self.quant_config.get_name() != "fp8":
Contributor

high

Accessing self.quant_config.get_name() is unsafe if self.quant_config is None (which occurs for unquantized models). A check for None should be added to prevent an AttributeError.

Suggested change
if self.quant_config.get_name() != "fp8":
if self.quant_config is None or self.quant_config.get_name() != "fp8":

disable_tp=True,
prefix=f"{prefix}.wk_weights_proj",
)
if self.quant_config.get_name() == "fp8":
Contributor

high

Accessing self.quant_config.get_name() is unsafe if self.quant_config is None. This will cause a crash when running the model without quantization.

Suggested change
if self.quant_config.get_name() == "fp8":
if self.quant_config is not None and self.quant_config.get_name() == "fp8":

quant_config=quant_config,
prefix=f"{prefix}.wk",
)
self.k_norm = LayerNorm(self.head_dim, eps=1e-6)
Contributor

high

This initialization of self.k_norm is redundant as it is unconditionally initialized again at line 674. This results in unnecessary memory allocation for duplicate parameters.

kw, _ = self.wk_weights_proj(hidden_states)
k = kw[:, : self.head_dim]
weights_raw = kw[:, self.head_dim :]
if self.quant_config.get_name() == "fp8":
Contributor

high

Accessing self.quant_config.get_name() is unsafe if self.quant_config is None. This will cause a crash during the forward pass for unquantized models.

Suggested change
if self.quant_config.get_name() == "fp8":
if self.quant_config is not None and self.quant_config.get_name() == "fp8":

("wk_weights_proj", "weights_proj", 1),
]
stacked_params_mapping.extend(indexer_fused_mapping)
if self.quant_config.get_name() != "fp8":
Contributor

high

Accessing self.quant_config.get_name() is unsafe if self.quant_config is None. This will cause a crash during weight loading for unquantized models.

Suggested change
if self.quant_config.get_name() != "fp8":
if self.quant_config is None or self.quant_config.get_name() != "fp8":

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye added the ready (ONLY add when PR is ready to merge/full CI is needed) label Apr 3, 2026
@zyongye
Member Author

zyongye commented Apr 3, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces conditional handling for FP8 quantization in DeepSeek models, specifically by separating the previously fused 'wk' and 'weights_proj' layers into individual ReplicatedLinear layers when FP8 is active. This change is reflected in both the model initialization and the weight loading logic. A critical logic error was identified in the weight loading section of deepseek_v2.py, where the condition for using fused mappings was inverted, potentially causing loading failures for both quantized and non-quantized models.

("wk_weights_proj", "weights_proj", 1),
]
stacked_params_mapping.extend(indexer_fused_mapping)
if self.quant_config is not None and self.quant_config.get_name() == "fp8":
Contributor

critical

The condition for extending stacked_params_mapping with indexer_fused_mapping is inverted. Based on the Indexer class implementation and the logic in vllm/model_executor/models/deepseek_mtp.py, the fused mapping should be used when the model is not using FP8 quantization. In the FP8 case, the Indexer uses separate wk and weights_proj layers, so they should be loaded individually by name rather than through a fused mapping. This inversion will cause loading failures for both FP8 and non-FP8 models.

Suggested change
if self.quant_config is not None and self.quant_config.get_name() == "fp8":
if self.quant_config is None or self.quant_config.get_name() != "fp8":

Member

@tlrmchlsmth tlrmchlsmth left a comment

Thanks for taking on this fix! I left a couple of comments on making the guards a bit less brittle

("wk_weights_proj", "wk", 0),
("wk_weights_proj", "weights_proj", 1),
]
stacked_params_mapping.extend(indexer_fused_mapping)
Member

It looks like this guards against cases where we know wk_weights_proj.wk and wk_weights_proj.weights_proj don't exist. Would it be less brittle to only put them in the mapping when we know they do exist instead?
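(A rough sketch of that alternative, with a hypothetical helper name and not the code that was ultimately merged: extend the mapping only when the fused module is actually present on the model, rather than keying off the quant method name.)

def maybe_extend_indexer_mapping(model, stacked_params_mapping):
    # Hypothetical sketch of the alternative above: add the fused-indexer
    # entries only when a wk_weights_proj submodule actually exists.
    fused_exists = any(
        name.endswith("wk_weights_proj") for name, _ in model.named_modules()
    )
    if fused_exists:
        stacked_params_mapping.extend([
            ("wk_weights_proj", "wk", 0),
            ("wk_weights_proj", "weights_proj", 1),
        ])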

Member Author

I think the problem is that wk and weights_proj always exist. But in the FP8 checkpoint one of them is FP8-quantized and the other is not, so it's not good to fuse them. On the other hand, on NVFP4 they are both unquantized, so we can fuse them into a single GEMM. This just keeps separate fused and un-fused paths.
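(To make the two paths concrete, here is a hedged sketch of the construction logic being described. The layer classes and the disable_tp/prefix arguments follow the snippets quoted in this review; the helper itself, dimensions, and bias are placeholders, not the merged code.)

from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                               ReplicatedLinear)

def build_indexer_projections(hidden_size, head_dim, n_heads, quant_config, prefix):
    # Hedged sketch of the two paths described above; not the merged code.
    if quant_config is not None and quant_config.get_name() == "fp8":
        # FP8 checkpoint: wk is block-quantized but weights_proj is not,
        # so keep them as two separate GEMMs.
        return {
            "wk": ReplicatedLinear(hidden_size, head_dim, bias=False,
                                   quant_config=quant_config,
                                   prefix=f"{prefix}.wk"),
            "weights_proj": ReplicatedLinear(hidden_size, n_heads, bias=False,
                                             quant_config=None,
                                             prefix=f"{prefix}.weights_proj"),
        }
    # NVFP4 / unquantized checkpoints: both halves are unquantized,
    # so fuse them into a single GEMM.
    return {
        "wk_weights_proj": MergedColumnParallelLinear(
            hidden_size, [head_dim, n_heads], bias=False,
            quant_config=None, disable_tp=True,
            prefix=f"{prefix}.wk_weights_proj"),
    }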

Member

Makes sense, thanks for the explanation

# weights_proj does not get quantized,
# so we run both with quant_config=None
# wk may be upcasted from the default quant;
# experiments show fusion is always aster unless WK proj is in FP4,
Member

nit: aster -> faster

disable_tp=True,
prefix=f"{prefix}.wk_weights_proj",
)
if self.quant_config is not None and self.quant_config.get_name() == "fp8":
Member

There are a few similar checks in this file and in vllm/model_executor/models/deepseek_mtp.py. The code would be simpler and easier to reason about if they all used the exact same logic; it might be good to factor the check out into a standalone function.
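(One possible shape for such a shared helper, with a hypothetical name and placement and not the code that was merged, importable from both deepseek_v2.py and deepseek_mtp.py:)

from typing import Optional

from vllm.model_executor.layers.quantization.base_config import QuantizationConfig

def is_fp8_quantized(quant_config: Optional[QuantizationConfig]) -> bool:
    # None-safe check used in place of the repeated inline guards;
    # hypothetical helper, not the code that was ultimately merged.
    return quant_config is not None and quant_config.get_name() == "fp8"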

Member

This check in particular seems brittle and I wonder if we can make it more robust

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye
Member Author

zyongye commented Apr 3, 2026

Refactored the guard to be self.is_fp4_ckpt and added the assignment in each init.

zyongye added 4 commits April 3, 2026 16:48
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@benchislett
Collaborator

See also #38928

@njhill njhill merged commit 8617f86 into vllm-project:main Apr 4, 2026
58 checks passed
ChuanLi1101 added a commit to ChuanLi1101/vllm that referenced this pull request Apr 5, 2026
GLM-5-FP8 checkpoints quantize the fused wk_weights_proj tensor with
FP8 block quantization (weight + weight_scale_inv).  Resolve merge
conflict with upstream indexer refactor (vllm-project#38684/vllm-project#38870) by always using
fused MergedColumnParallelLinear with quant_config:
- FP4: quant_config=None (weights_proj should not be quantized)
- Non-FP4: quant_config=quant_config (supports FP8 weight_scale_inv)

Add fallback in load_weights to handle both fused and separate checkpoint
formats gracefully via stacked_params_mapping.

Also reverts glm_moe_dsa from _DEEPSEEK_V3_FAMILY_MODEL_TYPES per
review feedback (will be submitted as a standalone PR).

Co-authored-by: Claude
Signed-off-by: Chuan Li <Chuan.Li2@amd.com>
Made-with: Cursor
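(A rough illustration of the load_weights fallback described in that commit message, simplified and not the actual code in that branch: prefer the stacked-mapping remap and fall back to the checkpoint's original name when the remapped parameter is not registered.)

def resolve_param(name, params_dict, stacked_params_mapping):
    # Illustrative sketch only: handle both fused and separate checkpoint
    # layouts by preferring the remapped (fused) name and falling back to
    # the original (separate) name.
    for fused_name, shard_name, shard_id in stacked_params_mapping:
        if shard_name in name:
            remapped = name.replace(shard_name, fused_name)
            if remapped in params_dict:
                return params_dict[remapped], shard_id  # fused layout
    if name in params_dict:
        return params_dict[name], None  # separate (un-fused) layout
    raise KeyError(name)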
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Apr 6, 2026
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

Labels

bug (Something isn't working) · deepseek (Related to DeepSeek models) · ready (ONLY add when PR is ready to merge/full CI is needed)
