[Bugfix] Fix DSV32 weight loading #38870
Conversation
Code Review
This pull request introduces FP8 quantization support for DeepSeek MTP and V2 models by conditionally de-fusing the wk_weights_proj layer into separate components and adding checks for missing parameters during weight loading. The review feedback identifies several critical issues, including an incorrect initialization of quant_config from the Hugging Face configuration dictionary instead of the vLLM configuration object, and multiple instances where accessing get_name() lacks null-safety for unquantized models. Additionally, a redundant initialization of k_norm was found in the DeepSeek V2 model, which leads to unnecessary memory allocation.
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
    super().__init__()
    self.config = vllm_config.model_config.hf_config
    self.quant_config = self.config.quantization_config
The quant_config should be initialized from vllm_config.quant_config instead of self.config.quantization_config. The latter is a dictionary from the Hugging Face configuration and does not possess the get_name() method required later in load_weights. This will cause an AttributeError during model loading.
- self.quant_config = self.config.quantization_config
+ self.quant_config = vllm_config.quant_config
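For context, a minimal sketch of why the suggestion matters, assuming a typical vLLM QuantizationConfig object (the helper name below is hypothetical, not from the PR):

def resolve_quant_config(vllm_config):
    # Correct source: the vLLM-level QuantizationConfig (or None for
    # unquantized models); it provides get_name().
    return vllm_config.quant_config

# Incorrect source: vllm_config.model_config.hf_config.quantization_config
# is a plain dict parsed from config.json (e.g. {"quant_method": "fp8", ...}),
# so a later self.quant_config.get_name() call raises AttributeError.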
| ("wk_weights_proj", "weights_proj", 1), | ||
| ] | ||
|
|
||
| if self.quant_config.get_name() != "fp8": |
Accessing self.quant_config.get_name() is unsafe if self.quant_config is None (which occurs for unquantized models). A check for None should be added to prevent an AttributeError.
| if self.quant_config.get_name() != "fp8": | |
| if self.quant_config is None or self.quant_config.get_name() != "fp8": |
    disable_tp=True,
    prefix=f"{prefix}.wk_weights_proj",
)
if self.quant_config.get_name() == "fp8":
    quant_config=quant_config,
    prefix=f"{prefix}.wk",
)
self.k_norm = LayerNorm(self.head_dim, eps=1e-6)
kw, _ = self.wk_weights_proj(hidden_states)
k = kw[:, : self.head_dim]
weights_raw = kw[:, self.head_dim :]
if self.quant_config.get_name() == "fp8":
| ("wk_weights_proj", "weights_proj", 1), | ||
| ] | ||
| stacked_params_mapping.extend(indexer_fused_mapping) | ||
| if self.quant_config.get_name() != "fp8": |
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
/gemini review
Code Review
This pull request introduces conditional handling for FP8 quantization in DeepSeek models, specifically by separating the previously fused 'wk' and 'weights_proj' layers into individual ReplicatedLinear layers when FP8 is active. This change is reflected in both the model initialization and the weight loading logic. A critical logic error was identified in the weight loading section of deepseek_v2.py, where the condition for using fused mappings was inverted, potentially causing loading failures for both quantized and non-quantized models.
| ("wk_weights_proj", "weights_proj", 1), | ||
| ] | ||
| stacked_params_mapping.extend(indexer_fused_mapping) | ||
| if self.quant_config is not None and self.quant_config.get_name() == "fp8": |
The condition for extending stacked_params_mapping with indexer_fused_mapping is inverted. Based on the Indexer class implementation and the logic in vllm/model_executor/models/deepseek_mtp.py, the fused mapping should be used when the model is not using FP8 quantization. In the FP8 case, the Indexer uses separate wk and weights_proj layers, so they should be loaded individually by name rather than through a fused mapping. This inversion will cause loading failures for both FP8 and non-FP8 models.
- if self.quant_config is not None and self.quant_config.get_name() == "fp8":
+ if self.quant_config is None or self.quant_config.get_name() != "fp8":
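Putting the fix together, the corrected weight-loading branch would look roughly like this (a sketch using the names from the diff above; the rest of the load_weights body is omitted):

indexer_fused_mapping = [
    # (merged param name, checkpoint shard name, shard id)
    ("wk_weights_proj", "wk", 0),
    ("wk_weights_proj", "weights_proj", 1),
]
if self.quant_config is None or self.quant_config.get_name() != "fp8":
    # Non-FP8 checkpoints fuse wk and weights_proj into a single GEMM,
    # so both shards map onto the merged wk_weights_proj parameter.
    stacked_params_mapping.extend(indexer_fused_mapping)
# FP8 checkpoints keep wk and weights_proj as separate layers, so their
# weights load by name and need no fused mapping.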
tlrmchlsmth left a comment
Thanks for taking on this fix! I left a couple of comments on making the guards a bit less brittle
| ("wk_weights_proj", "wk", 0), | ||
| ("wk_weights_proj", "weights_proj", 1), | ||
| ] | ||
| stacked_params_mapping.extend(indexer_fused_mapping) |
It looks like this guards against cases where we know the wk_weights_proj.wk and wk_weights_proj.weights_proj don't exist. Would it be more brittle to only put them in the mapping when we know they do exist instead?
I think the problem is that wk and weights_proj always exist. But in an FP8 checkpoint one of them is FP8-quantized and the other is not, so it's not good to fuse them. On the other hand, with NVFP4 they are both unquantized, so we can fuse them into a single GEMM. This is just separating the fused and un-fused paths.
Makes sense, thanks for the explanation
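Concretely, the fuse / un-fuse construction paths described above would look something like the sketch below (layer dimensions, bias handling, and exact constructor signatures are assumptions for illustration, not the final implementation):

use_fp8 = self.quant_config is not None and self.quant_config.get_name() == "fp8"
if use_fp8:
    # FP8 checkpoints quantize wk but leave weights_proj unquantized,
    # so keep the two projections separate with their own quant configs.
    self.wk = ReplicatedLinear(self.dim, self.head_dim,
                               quant_config=quant_config,
                               prefix=f"{prefix}.wk")
    self.weights_proj = ReplicatedLinear(self.dim, self.n_heads,
                                         quant_config=None,
                                         prefix=f"{prefix}.weights_proj")
else:
    # Otherwise both halves are unquantized and can run as one fused GEMM,
    # loaded via the indexer_fused_mapping entries in load_weights.
    self.wk_weights_proj = MergedColumnParallelLinear(
        self.dim, [self.head_dim, self.n_heads],
        quant_config=None,
        disable_tp=True,
        prefix=f"{prefix}.wk_weights_proj",
    )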
# weights_proj does not get quantized,
# so we run both with quant_config=None
# wk may be upcasted from the default quant;
# experiments show fusion is always faster unless WK proj is in FP4,
    disable_tp=True,
    prefix=f"{prefix}.wk_weights_proj",
)
if self.quant_config is not None and self.quant_config.get_name() == "fp8":
There are a few similar checks in this file and in vllm/model_executor/models/deepseek_mtp.py. The code would be simpler and easier to reason about if they all used the exact same logic, and it might be good to factor the check out into a standalone function.
This check in particular seems brittle and I wonder if we can make it more robust
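One possible shape for such a shared guard, as suggested above (the helper name and its placement are hypothetical, not from the PR):

def indexer_is_defused(quant_config) -> bool:
    # True when the indexer should keep wk and weights_proj as separate
    # (un-fused) layers, i.e. for FP8-quantized checkpoints.
    return quant_config is not None and quant_config.get_name() == "fp8"

Both deepseek_v2.py and deepseek_mtp.py could then branch on the same predicate instead of repeating the inline check.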
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
refactor the guard to be
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
See also #38928
GLM-5-FP8 checkpoints quantize the fused wk_weights_proj tensor with FP8 block quantization (weight + weight_scale_inv). Resolve merge conflict with upstream indexer refactor (vllm-project#38684/vllm-project#38870) by always using fused MergedColumnParallelLinear with quant_config:
- FP4: quant_config=None (weights_proj should not be quantized)
- Non-FP4: quant_config=quant_config (supports FP8 weight_scale_inv)
Add fallback in load_weights to handle both fused and separate checkpoint formats gracefully via stacked_params_mapping. Also reverts glm_moe_dsa from _DEEPSEEK_V3_FAMILY_MODEL_TYPES per review feedback (will be submitted as a standalone PR).
Co-authored-by: Claude
Signed-off-by: Chuan Li <Chuan.Li2@amd.com>
Made-with: Cursor
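A rough sketch of the always-fused approach this commit describes (the FP4 detection and layer dimensions here are illustrative assumptions):

is_fp4 = quant_config is not None and "fp4" in quant_config.get_name()
self.wk_weights_proj = MergedColumnParallelLinear(
    self.dim, [self.head_dim, self.n_heads],
    # FP4: weights_proj must stay unquantized, so drop the quant config;
    # otherwise pass it through so the FP8 block scales (weight_scale_inv)
    # of the fused tensor are created and loaded.
    quant_config=None if is_fp4 else quant_config,
    disable_tp=True,
    prefix=f"{prefix}.wk_weights_proj",
)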
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Purpose
#38684 introduced a bug when loading FP8 checkpoints.
Test Plan
gsm8k score
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.