
[Perf] DSV3.2 Indexer Fused Weights Projection#38684

Merged
robertgshaw2-redhat merged 3 commits into vllm-project:main from CentML:perf/deepseek-fused-wk
Apr 2, 2026
Conversation

@benchislett
Collaborator

Purpose

Fuse the WK and Weights_Proj projections in the DSV3.2 Indexer. This is an alternative optimization to #35968, which overlaps the projections instead of fusing them. Doing the fusion provides a greater speedup:

Benchmark timings for DSV3.2 NVFP4 on 8xB200 (TP8, No Specdec)

BS128 8k/1k

Fused WK+WeightsProj (Decode):  21.6 ms
Baseline             (Decode):  22.3 ms

BS1 8k/1k

Fused WK+WeightsProj (Decode):  9.5 ms
Overlap              (Decode):  9.8 ms
Baseline             (Decode): 10.2 ms

Fused WK+WeightsProj (TTFT):  339 ms
Overlap              (TTFT):  340 ms
Baseline             (TTFT):  341 ms
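The fusion being benchmarked can be sketched numerically (NumPy stand-in with illustrative shapes, not the vLLM implementation): concatenating the `wk` and `weights_proj` weight matrices along the output dimension turns two small, bandwidth-bound GEMMs into a single GEMM, and slicing the fused output recovers both results exactly.

```python
import numpy as np

hidden_size, head_dim, n_head = 16, 8, 4
rng = np.random.default_rng(0)

x = rng.standard_normal((2, hidden_size))           # [tokens, hidden]
w_k = rng.standard_normal((head_dim, hidden_size))  # stand-in for wk weight
w_p = rng.standard_normal((n_head, hidden_size))    # stand-in for weights_proj weight

# Baseline: two separate projections (two small GEMMs).
k_ref = x @ w_k.T
w_ref = x @ w_p.T

# Fused: one GEMM over the concatenated weight, then slice the output.
w_fused = np.concatenate([w_k, w_p], axis=0)        # [head_dim + n_head, hidden]
kw = x @ w_fused.T
k, weights_raw = kw[:, :head_dim], kw[:, head_dim:]

# The fused path is numerically identical to the two separate projections.
assert np.allclose(k, k_ref) and np.allclose(weights_raw, w_ref)
```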

Testing

GSM8k shows a slight decrease, within the reported stderr.

PR (2 runs):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9500|±  | 0.006|
|     |       |strict-match    |     5|exact_match|↑  |0.9492|±  | 0.006|

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9553|±  |0.0057|
|     |       |strict-match    |     5|exact_match|↑  |0.9530|±  |0.0058|

Main (2 runs):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.9545|±  |0.0057|

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9462|±  |0.0062|
|     |       |strict-match    |     5|exact_match|↑  |0.9462|±  |0.0062|

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett benchislett requested a review from luccafong as a code owner April 1, 2026 03:59
@mergify mergify Bot added the deepseek Related to DeepSeek models label Apr 1, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request fuses the `wk` and `weights_proj` layers into a single `MergedColumnParallelLinear` named `wk_weights_proj` within the DeepSeek-V2 and MTP models to optimize performance. Several critical issues were identified: forcing `quant_config=None` may lead to loading errors with quantized checkpoints; the tensor slicing in the forward pass assumes a 2D input, which could fail with 3D tensors (a flattening suggestion was provided); and the weight-loading logic is susceptible to name corruption and potential crashes due to substring matching and missing guard conditions.

Comment on lines +646 to 653
```python
self.wk_weights_proj = MergedColumnParallelLinear(
    hidden_size,
    [self.head_dim, self.n_head],
    bias=False,
    quant_config=None,
    disable_tp=True,
    prefix=f"{prefix}.wk_weights_proj",
)
self.k_norm = LayerNorm(self.head_dim, eps=1e-6)
```
Contributor


high

Forcing quant_config=None for wk_weights_proj will cause correctness issues when loading from quantized checkpoints (e.g., FP8). Since wk is typically quantized in DeepSeek-V3/V3.2 checkpoints, the weight_loader will attempt to bit-copy quantized weights into a non-quantized parameter without applying the necessary scales. The current weight_loader for MergedColumnParallelLinear does not handle on-the-fly dequantization. If fusion is required for performance, you must implement a custom weight loader that can dequantize wk during the loading process or ensure that the quantization configuration is correctly propagated.
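The dequantize-on-load idea can be illustrated with a hedged sketch (NumPy, hypothetical `dequantize_blockwise` helper; this is not vLLM's loader API): block-quantized checkpoints store a low-precision weight plus per-block scales (e.g. a `weight_scale_inv` tensor), so copying into an unquantized parameter requires multiplying the scales back in first.

```python
import numpy as np

def dequantize_blockwise(q_weight, scale, block):
    """Recover a full-precision weight from per-block quantized values.

    q_weight: quantized weight matrix; scale: one scale per (block x block)
    tile; block: tile edge length. Purely illustrative of the scheme.
    """
    w = q_weight.astype(np.float32)
    for bi in range(scale.shape[0]):
        for bj in range(scale.shape[1]):
            w[bi * block:(bi + 1) * block, bj * block:(bj + 1) * block] *= scale[bi, bj]
    return w

# Round-trip check on synthetic data: "quantize" with known scales, then recover.
rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 4)).astype(np.float32)
scale = np.full((2, 2), 0.5, dtype=np.float32)
q = ref / 0.5                       # fake quantized values for the demo
out = dequantize_blockwise(q, scale, block=2)
assert np.allclose(out, ref)
```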

Collaborator Author


I can easily run DSR1 NVFP4 on this branch, which notably does not have the indexer module. I don't expect any DeepSeek checkpoints will have these weights pre-fused. If the bot has a more nuanced point here, I don't see it.

Member


This actually makes sense: one projection has FP8 quantization and the other is unquantized, so it is not wise to fuse all of them. We should make a special case for the original checkpoint.

Comment on lines +696 to +698
```python
kw, _ = self.wk_weights_proj(hidden_states)
k = kw[:, : self.head_dim]
weights_raw = kw[:, self.head_dim :]
```
Contributor


high

The slicing kw[:, : self.head_dim] assumes that kw is a 2D tensor. However, when using torch.compile or in certain prefill paths, hidden_states (and thus kw) can be a 3D tensor. In such cases, this slicing will produce incorrect results. It is safer to flatten hidden_states to a token-based representation before the projection, which also ensures consistency for the subsequent indexer_op call on line 733.

Suggested change:

```diff
-kw, _ = self.wk_weights_proj(hidden_states)
-k = kw[:, : self.head_dim]
-weights_raw = kw[:, self.head_dim :]
+hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
+kw, _ = self.wk_weights_proj(hidden_states)
+k = kw[:, : self.head_dim]
+weights_raw = kw[:, self.head_dim :]
```
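The hazard the suggestion guards against can be shown with a small NumPy stand-in (illustrative shapes, not the vLLM tensors): on a `[batch, seq, out]` tensor, a dim-1 slice cuts the sequence axis rather than the feature axis, and flattening to `[tokens, out]` first makes the slice correct.

```python
import numpy as np

head_dim, n_head = 8, 4
out_dim = head_dim + n_head
# A 3D "projection output" as might appear in some prefill paths.
kw_3d = np.arange(2 * 3 * out_dim, dtype=float).reshape(2, 3, out_dim)

# Dim-1 slice on the 3D tensor cuts the sequence axis: silently wrong shape.
wrong = kw_3d[:, :head_dim]                   # shape (2, 3, 12), not a feature slice
# Flattening to [tokens, out] first makes the dim-1 slice hit the feature axis.
kw_2d = kw_3d.reshape(-1, kw_3d.shape[-1])    # shape (6, 12)
k = kw_2d[:, :head_dim]                       # shape (6, 8): correct

assert wrong.shape == (2, 3, out_dim)
assert k.shape == (6, head_dim)
```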

Comment on lines +1443 to +1447
```python
indexer_fused_mapping = [
    ("wk_weights_proj", "wk", 0),
    ("wk_weights_proj", "weights_proj", 1),
]
stacked_params_mapping.extend(indexer_fused_mapping)
```
Contributor


high

The substring replacement logic in load_weights is susceptible to a bug where wk matches wk_weights_proj. If a checkpoint already contains fused weights (e.g., from a previous save), name.replace("wk", "wk_weights_proj") will corrupt the parameter name (e.g., resulting in ...wk_weights_proj_weights_proj). Additionally, this mapping is added unconditionally, which may cause crashes in non-V3.2 models where wk_weights_proj is not defined. Please add wk_weights_proj to the guard condition in the weight loading loop (around line 1503) to ensure it only attempts to map if the parameter exists in params_dict.
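The guard the reviewer asks for can be sketched as follows (hedged sketch; `map_indexer_name` is a hypothetical helper simplified from vLLM's `load_weights` loop, though the mapping tuples mirror the PR's): only rewrite a checkpoint name if it is not already fused, the match is on a dotted component so `wk` cannot match inside `wk_weights_proj`, and the fused parameter actually exists.

```python
def map_indexer_name(name, params_dict):
    """Map a checkpoint weight name onto the fused parameter, defensively."""
    indexer_fused_mapping = [
        ("wk_weights_proj", "wk", 0),
        ("wk_weights_proj", "weights_proj", 1),
    ]
    for fused, part, shard_id in indexer_fused_mapping:
        if fused in name:            # already-fused checkpoint: leave name alone
            break
        if f".{part}." in name:      # dotted match avoids substring collisions
            candidate = name.replace(part, fused)
            if candidate in params_dict:   # guard: fused param must exist
                return candidate, shard_id
    return name, None

# Usage: fused param present -> remapped with its shard id;
# unrelated or non-V3.2 names -> returned untouched.
params = {"model.layers.0.indexer.wk_weights_proj.weight": object()}
print(map_indexer_name("model.layers.0.indexer.wk.weight", params))
```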

Comment on lines +245 to +246
```python
("wk_weights_proj", "wk", 0),
("wk_weights_proj", "weights_proj", 1),
```
Contributor


high

Similar to the issue in deepseek_v2.py, the wk substring match can corrupt weight names if the checkpoint is already fused. Ensure that wk_weights_proj is included in the guard condition within the load_weights loop (around line 292) to prevent incorrect mapping and potential KeyError when the parameter is missing in certain model configurations.

Contributor

@LopezCastroRoberto LopezCastroRoberto left a comment


LGTM. Projections are tiny, bandwidth-bound, and same precision. Fusion wins :)

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) April 1, 2026 16:22
@robertgshaw2-redhat
Collaborator

great job ben, love picking the low hanging fruit

@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 1, 2026
@robertgshaw2-redhat robertgshaw2-redhat merged commit 5f96f9a into vllm-project:main Apr 2, 2026
58 checks passed
@benchislett benchislett deleted the perf/deepseek-fused-wk branch April 2, 2026 03:49
vllm-agent pushed a commit to vllm-agent/vllm that referenced this pull request Apr 2, 2026
@zyongye
Member

zyongye commented Apr 2, 2026

@benchislett This change is breaking the fp8 checkpoint. Can you look into it? Thanks

@zyongye
Member

zyongye commented Apr 3, 2026

@robertgshaw2-redhat @benchislett can you revert this? This is breaking the fp8 model.

@zyongye zyongye mentioned this pull request Apr 3, 2026
5 tasks
@zyongye
Member

zyongye commented Apr 3, 2026

fixed by #38870

ChuanLi1101 added a commit to ChuanLi1101/vllm that referenced this pull request Apr 5, 2026
GLM-5-FP8 checkpoints quantize the fused wk_weights_proj tensor with
FP8 block quantization (weight + weight_scale_inv).  Resolve merge
conflict with upstream indexer refactor (vllm-project#38684/vllm-project#38870) by always using
fused MergedColumnParallelLinear with quant_config:
- FP4: quant_config=None (weights_proj should not be quantized)
- Non-FP4: quant_config=quant_config (supports FP8 weight_scale_inv)

Add fallback in load_weights to handle both fused and separate checkpoint
formats gracefully via stacked_params_mapping.

Also reverts glm_moe_dsa from _DEEPSEEK_V3_FAMILY_MODEL_TYPES per
review feedback (will be submitted as a standalone PR).

Co-authored-by: Claude
Signed-off-by: Chuan Li <Chuan.Li2@amd.com>
Made-with: Cursor
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Apr 6, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

Labels

deepseek Related to DeepSeek models ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants