[Bugfix][Perf] Indexer upcast WK to BF16 for fusion #38928

Merged
mgoin merged 5 commits into vllm-project:main from CentML:fix-wk-fusion-fp8
Apr 15, 2026

Conversation

@benchislett
Collaborator

Purpose

An alternative fix to #38870, one that maintains the fusion.

Performance:

Upcast+Fused WK: (Decode): 11.90 ms
Upcast+Fused WK: (TTFT):   375.0 ms

Multi-Stream:    (Decode): 12.74 ms
Multi-Stream:    (TTFT):   376.5 ms

Separate:        (Decode): 12.59 ms
Separate:        (TTFT):   378.2 ms

Testing

My setup is broken: GSM8k is giving 0.00 for me even with #38870. I'll try to fix my setup and rerun, but I have moderate confidence in this fix. It would be handy if someone else could try running this in the meantime.

I'm okay with merging #38870 if we need a fix ASAP this weekend.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett benchislett requested a review from luccafong as a code owner April 3, 2026 18:06
@mergify mergify Bot added the deepseek (Related to DeepSeek models) and bug (Something isn't working) labels Apr 3, 2026
@zyongye
Member

zyongye commented Apr 3, 2026

I would certainly want to run the OG ckpt as it is with multi-stream.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a mechanism to handle DeepSeek-V2 checkpoints where 'wk' weights are stored in FP8 with separate scales. By dequantizing these weights to BF16 during the loading process, the model maintains fusion with 'weights_proj'. The implementation includes a new helper function, '_try_load_fp8_indexer_wk', and a buffering system to synchronize weights and scales. Feedback was provided regarding the potential for memory leaks in the buffering logic if the loading process is interrupted or if keys are missing from the checkpoint.

Comment on lines +756 to +757
if "weight" not in entry or "scale" not in entry:
return True # still waiting for the other param
Contributor

Severity: high

The logic to buffer tensors until both weight and scale are present is susceptible to memory leaks if the checkpoint loading process is interrupted or if certain keys are missing from the checkpoint. Consider adding a timeout or a mechanism to clear the buffer if loading fails.

Collaborator Author

They would go out of scope once the weight loading function returns. The references are not stored on self.
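For readers following along, here is a minimal sketch of the buffer-until-complete pattern being discussed. The name `_try_load_fp8_indexer_wk` comes from the review summary above; the `pending` dict, the key convention, and the signature are illustrative assumptions, not the PR's exact code:

```python
import torch

def _dequant_wk_to_bf16(weight_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Upcast the FP8 'wk' weight and fold in its scale, producing a plain
    # BF16 tensor that can stay fused with weights_proj.
    return (weight_fp8.to(torch.float32) * scale).to(torch.bfloat16)

def _try_load_fp8_indexer_wk(pending: dict, name: str, tensor: torch.Tensor,
                             param: torch.Tensor) -> bool:
    # Buffer the weight and its scale until both have arrived from the
    # checkpoint, then dequantize once and write the result into `param`.
    key = "scale" if name.endswith("scale") else "weight"
    entry = pending.setdefault("wk", {})
    entry[key] = tensor
    if "weight" not in entry or "scale" not in entry:
        return True  # still waiting for the other param
    param.data.copy_(_dequant_wk_to_bf16(entry["weight"], entry["scale"]))
    del pending["wk"]
    return False
```

Since `pending` here is a local owned by the weight-loading routine rather than an attribute on `self`, anything still buffered is dropped when loading returns, which is the point made in the reply above.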

@benchislett
Collaborator Author

> I would certainly want to run the OG ckpt as it is with multi-stream.

This is the multi-stream baseline I compared against. It seems like multi-streaming with the FP8 kernels is disabling torch.compile and undoing one of the elementwise op fusions, causing a slowdown. Could you explain what you want to run that I have not compared here, and why?

@zyongye
Member

zyongye commented Apr 3, 2026

> > I would certainly want to run the OG ckpt as it is with multi-stream.
>
> This is the multi-stream baseline I compared against. It seems like multi-streaming with the FP8 kernels is disabling torch.compile and undoing one of the elementwise op fusions, causing a slowdown. Could you explain what you want to run that I have not compared here, and why?

I am hoping that for the original model ckpt we let it run at its original precision, at least on one platform. Changing the precision could cause unknown problems later on that are hard to trace back.

@mergify
Contributor

mergify Bot commented Apr 4, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 4, 2026
@benchislett
Collaborator Author

I'm of the opinion that upcasting for the fusion is worth the trouble. The perf improvement seems nontrivial, and it helps avoid having to support both multi-stream and fusion for this op. @robertgshaw2-redhat to weigh in.

@zyongye
Member

zyongye commented Apr 6, 2026

If they really care about perf, they should run the NVFP4 checkpoint. I feel like we want an accuracy guarantee before perf on the OG checkpoint.

@benchislett
Collaborator Author

There are many cases where we still care about perf for non-NVFP4 checkpoints. And as I understand it, upcasting from FP8 to BF16 shouldn't hurt accuracy if done properly.
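A quick standalone check of that claim (my snippet, not from the PR): every finite float8_e4m3fn value is exactly representable in BF16, since BF16 carries 7 mantissa bits to FP8-E4M3's 3, so the upcast itself is lossless and only the subsequent scale multiply can round:

```python
import torch

# Enumerate all 256 FP8-E4M3 bit patterns and round-trip them through BF16.
bits = torch.arange(256, dtype=torch.uint8)
fp8 = bits.view(torch.float8_e4m3fn)
as_f32 = fp8.to(torch.float32)
finite = torch.isfinite(as_f32)  # masks out the two NaN encodings
roundtrip = fp8.to(torch.bfloat16).to(torch.float32)
assert torch.equal(as_f32[finite], roundtrip[finite])  # the upcast is exact
```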

Member

@mgoin mgoin left a comment

I think it is fair enough to support this, as we did for MLA upconvert before we had fp8_bmm available. Specific numerics may change, but accuracy should certainly not be worse.

@mgoin mgoin added the performance (Performance-related issues) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Apr 14, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett
Collaborator Author

Reran on 4xGB200: same perf results, and GSM8k is good.

Details

4xGB200, TP4, DSV3.2 FP8, Fused WK

GSM8k:

Results:
Accuracy: 0.955
Invalid responses: 0.000
Total latency: 236.064 s
Questions per second: 5.587
Total output tokens: 120664
Output tokens per second: 511.148

1k1k BS1:

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  135.16    
Total input tokens:                      9704      
Total generated tokens:                  11079     
Request throughput (req/s):              0.07      
Output token throughput (tok/s):         81.97     
Peak output token throughput (tok/s):    86.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          153.76    
---------------Time to First Token----------------
Mean TTFT (ms):                          407.79    
Median TTFT (ms):                        158.75    
P99 TTFT (ms):                           2362.70   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.84     
Median TPOT (ms):                        11.80     
P99 TPOT (ms):                           11.98     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.84     
Median ITL (ms):                         11.81     
P99 ITL (ms):                            12.47     
==================================================

1k1k BS8:

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  193.45    
Total input tokens:                      82493     
Total generated tokens:                  82702     
Request throughput (req/s):              0.41      
Output token throughput (tok/s):         427.50    
Peak output token throughput (tok/s):    496.00    
Peak concurrent requests:                11.00     
Total token throughput (tok/s):          853.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          224.54    
Median TTFT (ms):                        187.26    
P99 TTFT (ms):                           574.97    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.96     
Median TPOT (ms):                        18.10     
P99 TPOT (ms):                           18.59     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.96     
Median ITL (ms):                         17.14     
P99 ITL (ms):                            18.11     
==================================================

4xGB200, TP4, DSV3.2 FP8, No-Fused-WK

GSM8k:

Results:
Accuracy: 0.956
Invalid responses: 0.000
Total latency: 238.044 s
Questions per second: 5.541
Total output tokens: 120447
Output tokens per second: 505.987

1k1k BS1:

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  139.43    
Total input tokens:                      9704      
Total generated tokens:                  11079     
Request throughput (req/s):              0.07      
Output token throughput (tok/s):         79.46     
Peak output token throughput (tok/s):    82.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          149.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          178.87    
Median TTFT (ms):                        174.89    
P99 TTFT (ms):                           211.54    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.43     
Median TPOT (ms):                        12.39     
P99 TPOT (ms):                           12.57     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.43     
Median ITL (ms):                         12.40     
P99 ITL (ms):                            13.07     
==================================================

1k1k BS8:

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  200.50    
Total input tokens:                      82493     
Total generated tokens:                  82702     
Request throughput (req/s):              0.40      
Output token throughput (tok/s):         412.48    
Peak output token throughput (tok/s):    472.00    
Peak concurrent requests:                11.00     
Total token throughput (tok/s):          823.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          233.02    
Median TTFT (ms):                        196.72    
P99 TTFT (ms):                           586.79    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.61     
Median TPOT (ms):                        18.71     
P99 TPOT (ms):                           19.20     
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.62     
Median ITL (ms):                         17.74     
P99 ITL (ms):                            18.77     
==================================================

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett
Collaborator Author

I also manually inspected a few of the FP8 weight scales for the WK layers. They're all fairly small factors (e.g. 0.0002), so we should be staying comfortably in the range where BF16 is well-represented.
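Back-of-envelope arithmetic on why that is comfortable (my numbers, not from the PR): FP8-E4M3 maxes out at 448, so a scale around 2e-4 caps the dequantized weights near 0.09, far inside BF16's representable range:

```python
import torch

scale = 2e-4  # example scale magnitude mentioned above
fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0
bf16 = torch.finfo(torch.bfloat16)
print(fp8_max * scale)      # ~0.0896: ceiling on the dequantized |w|
print(bf16.max, bf16.tiny)  # ~3.39e38 and ~1.18e-38: ample headroom
```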

@mgoin mgoin enabled auto-merge (squash) April 15, 2026 19:26
@mergify mergify Bot removed the needs-rebase label Apr 15, 2026
@mgoin mgoin merged commit ac3dac5 into vllm-project:main Apr 15, 2026
59 checks passed
@benchislett benchislett deleted the fix-wk-fusion-fp8 branch April 16, 2026 16:47
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Apr 23, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 29, 2026
**Commit range:** `6f786f2`..`d886c26`

1. Fix 'DPMetadata' object has no attribute 'max_tokens_across_dp_cpu' by vllm-project/vllm#39107
2. Fix 'Indexer' object has no attribute 'wk' by vllm-project/vllm#38928
3. Fix 'float' object has no attribute 'language_model' by vllm-project/vllm#39240

- vLLM version: v0.19.0
- vLLM main: vllm-project/vllm@6f786f2

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026