[Bugfix][Perf] Indexer upcast WK to BF16 for fusion #38928

Merged
mgoin merged 5 commits into vllm-project:main from CentML:fix-wk-fusion-fp8
Apr 15, 2026

Conversation

@benchislett
Collaborator

Purpose

An alternative fix to #38870, one that maintains the fusion.

Performance:

Upcast+Fused WK: (Decode): 11.90 ms
Upcast+Fused WK: (TTFT):   375.0 ms

Multi-Stream:    (Decode): 12.74 ms
Multi-Stream:    (TTFT):   376.5 ms

Separate:        (Decode): 12.59 ms
Separate:        (TTFT):   378.2 ms

Testing

My setup is broken: GSM8k is giving 0.00 for me even with #38870. I'll try to fix my setup and rerun, but I have moderate confidence in this fix. It would be handy if someone else could try running this in the meantime.

I'm okay with merging #38870 if we need a fix ASAP this weekend.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett benchislett requested a review from luccafong as a code owner April 3, 2026 18:06
@mergify mergify Bot added the deepseek (Related to DeepSeek models) and bug (Something isn't working) labels Apr 3, 2026
@zyongye
Member

zyongye commented Apr 3, 2026

I would certainly want to run the OG ckpt as it is with multi-stream.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a mechanism to handle DeepSeek-V2 checkpoints where 'wk' weights are stored in FP8 with separate scales. By dequantizing these weights to BF16 during the loading process, the model maintains fusion with 'weights_proj'. The implementation includes a new helper function, '_try_load_fp8_indexer_wk', and a buffering system to synchronize weights and scales. Feedback was provided regarding the potential for memory leaks in the buffering logic if the loading process is interrupted or if keys are missing from the checkpoint.

Comment on lines +756 to +757
if "weight" not in entry or "scale" not in entry:
return True # still waiting for the other param
Contributor

Severity: high

The logic to buffer tensors until both weight and scale are present is susceptible to memory leaks if the checkpoint loading process is interrupted or if certain keys are missing from the checkpoint. Consider adding a timeout or a mechanism to clear the buffer if loading fails.

Collaborator Author

They would go out of scope once the weight loading function returns. The references are not stored on self.
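For readers following along, here is a minimal sketch of the buffer-until-complete pattern being discussed. The name `_try_load_fp8_indexer_wk` comes from the review summary above; the `pending` dict, the key convention, and the signature are illustrative assumptions, not the PR's exact code:

```python
import torch

def _dequant_wk_to_bf16(weight_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Upcast the FP8 'wk' weight and fold in its scale, producing a plain
    # BF16 tensor that can stay fused with weights_proj.
    return (weight_fp8.to(torch.float32) * scale).to(torch.bfloat16)

def _try_load_fp8_indexer_wk(pending: dict, name: str, tensor: torch.Tensor,
                             param: torch.Tensor) -> bool:
    # Buffer the weight and its scale until both have arrived from the
    # checkpoint, then dequantize once and write the result into `param`.
    key = "scale" if name.endswith("scale") else "weight"
    entry = pending.setdefault("wk", {})
    entry[key] = tensor
    if "weight" not in entry or "scale" not in entry:
        return True  # still waiting for the other param
    param.data.copy_(_dequant_wk_to_bf16(entry["weight"], entry["scale"]))
    del pending["wk"]
    return False
```

Since `pending` here is a local owned by the weight-loading routine rather than an attribute on `self`, anything still buffered is dropped when loading returns, which is the point made in the reply above.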

@benchislett
Collaborator Author

> I would certainly want to run the OG ckpt as it is with multi-stream.

This is the multi-stream baseline I compared against. It seems like multi-streaming with the FP8 kernels is disabling torch.compile and undoing one of the elementwise op fusions, causing a slowdown. Could you explain what you want to run that I have not compared here, and why?

@zyongye
Member

zyongye commented Apr 3, 2026

> > I would certainly want to run the OG ckpt as it is with multi-stream.
>
> This is the multi-stream baseline I compared against. It seems like multi-streaming with the FP8 kernels is disabling torch.compile and undoing one of the elementwise op fusions, causing a slowdown. Could you explain what you want to run that I have not compared here, and why?

I am hoping that for the original model ckpt we let it run at its original precision, at least on one platform. Changing the precision could cause unknown problems later on that are hard to trace back.

@mergify
Contributor

mergify Bot commented Apr 4, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 4, 2026
@benchislett
Collaborator Author

I'm of the opinion that upcasting for the fusion is worth the trouble. The perf improvement seems nontrivial, and it helps avoid having to support both multi-stream and fusion for this op. @robertgshaw2-redhat to weigh in.

@zyongye
Member

zyongye commented Apr 6, 2026

If they really care about perf, they should run the NVFP4 checkpoint. I feel like we want an accuracy guarantee before perf on the OG checkpoint.

@benchislett
Collaborator Author

There are many cases where we still care about perf for non-NVFP4 checkpoints. And as I understand it, upcasting from FP8 to BF16 shouldn't hurt accuracy if done properly.
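A quick standalone check of that claim (my snippet, not from the PR): every finite float8_e4m3fn value is exactly representable in BF16, since BF16 carries 7 mantissa bits to FP8-E4M3's 3, so the upcast itself is lossless and only the subsequent scale multiply can round:

```python
import torch

# Enumerate all 256 FP8-E4M3 bit patterns and round-trip them through BF16.
bits = torch.arange(256, dtype=torch.uint8)
fp8 = bits.view(torch.float8_e4m3fn)
as_f32 = fp8.to(torch.float32)
finite = torch.isfinite(as_f32)  # masks out the two NaN encodings
roundtrip = fp8.to(torch.bfloat16).to(torch.float32)
assert torch.equal(as_f32[finite], roundtrip[finite])  # the upcast is exact
```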

Member

@mgoin mgoin left a comment

I think it is fair enough to support this, as we did for MLA upconvert before we had fp8_bmm available. Specific numerics may change, but accuracy should certainly not be worse.

@mgoin mgoin added the performance (Performance-related issues) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Apr 14, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett
Collaborator Author

Reran on 4xGB200: same perf results, and GSM8k is good.

Details

4xGB200, TP4, DSV3.2 FP8, Fused WK

GSM8k:

Results:
Accuracy: 0.955
Invalid responses: 0.000
Total latency: 236.064 s
Questions per second: 5.587
Total output tokens: 120664
Output tokens per second: 511.148

1k1k BS1:

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  135.16    
Total input tokens:                      9704      
Total generated tokens:                  11079     
Request throughput (req/s):              0.07      
Output token throughput (tok/s):         81.97     
Peak output token throughput (tok/s):    86.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          153.76    
---------------Time to First Token----------------
Mean TTFT (ms):                          407.79    
Median TTFT (ms):                        158.75    
P99 TTFT (ms):                           2362.70   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.84     
Median TPOT (ms):                        11.80     
P99 TPOT (ms):                           11.98     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.84     
Median ITL (ms):                         11.81     
P99 ITL (ms):                            12.47     
==================================================

1k1k BS8:

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  193.45    
Total input tokens:                      82493     
Total generated tokens:                  82702     
Request throughput (req/s):              0.41      
Output token throughput (tok/s):         427.50    
Peak output token throughput (tok/s):    496.00    
Peak concurrent requests:                11.00     
Total token throughput (tok/s):          853.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          224.54    
Median TTFT (ms):                        187.26    
P99 TTFT (ms):                           574.97    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.96     
Median TPOT (ms):                        18.10     
P99 TPOT (ms):                           18.59     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.96     
Median ITL (ms):                         17.14     
P99 ITL (ms):                            18.11     
==================================================

4xGB200, TP4, DSV3.2 FP8, No-Fused-WK

GSM8k:

Results:
Accuracy: 0.956
Invalid responses: 0.000
Total latency: 238.044 s
Questions per second: 5.541
Total output tokens: 120447
Output tokens per second: 505.987

1k1k BS1:

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  139.43    
Total input tokens:                      9704      
Total generated tokens:                  11079     
Request throughput (req/s):              0.07      
Output token throughput (tok/s):         79.46     
Peak output token throughput (tok/s):    82.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          149.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          178.87    
Median TTFT (ms):                        174.89    
P99 TTFT (ms):                           211.54    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.43     
Median TPOT (ms):                        12.39     
P99 TPOT (ms):                           12.57     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.43     
Median ITL (ms):                         12.40     
P99 ITL (ms):                            13.07     
==================================================

1k1k BS8:

============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  200.50    
Total input tokens:                      82493     
Total generated tokens:                  82702     
Request throughput (req/s):              0.40      
Output token throughput (tok/s):         412.48    
Peak output token throughput (tok/s):    472.00    
Peak concurrent requests:                11.00     
Total token throughput (tok/s):          823.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          233.02    
Median TTFT (ms):                        196.72    
P99 TTFT (ms):                           586.79    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.61     
Median TPOT (ms):                        18.71     
P99 TPOT (ms):                           19.20     
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.62     
Median ITL (ms):                         17.74     
P99 ITL (ms):                            18.77     
==================================================

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett
Collaborator Author

I also manually inspected a few of the FP8 weight scales for the WK layers. They're all fairly small factors (e.g. 0.0002), so we should be staying comfortably in the range where BF16 is well-represented.
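Back-of-envelope arithmetic on why that is comfortable (my numbers, not from the PR): FP8-E4M3 maxes out at 448, so a scale around 2e-4 caps the dequantized weights near 0.09, far inside BF16's representable range:

```python
import torch

scale = 2e-4  # example scale magnitude mentioned above
fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0
bf16 = torch.finfo(torch.bfloat16)
print(fp8_max * scale)      # ~0.0896: ceiling on the dequantized |w|
print(bf16.max, bf16.tiny)  # ~3.39e38 and ~1.18e-38: ample headroom
```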

@mgoin mgoin enabled auto-merge (squash) April 15, 2026 19:26
@mergify mergify Bot removed the needs-rebase label Apr 15, 2026
@mgoin mgoin merged commit ac3dac5 into vllm-project:main Apr 15, 2026
59 checks passed
@benchislett benchislett deleted the fix-wk-fusion-fp8 branch April 16, 2026 16:47
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Apr 23, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 29, 2026
**Commit range:** `6f786f2`..`d886c26`

1. Fix 'DPMetadata' object has no attribute 'max_tokens_across_dp_cpu' by vllm-project/vllm#39107
2. Fix 'Indexer' object has no attribute 'wk' by vllm-project/vllm#38928
3. Fix 'float' object has no attribute 'language_model' by vllm-project/vllm#39240

- vLLM version: v0.19.0
- vLLM main: vllm-project/vllm@6f786f2

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026