
[Attention][Perf][Kernel] Improve topKperRow for large context decode path - DeepSeek-V3.2 sparse attention#34265

Closed
LopezCastroRoberto wants to merge 16 commits into vllm-project:main from LopezCastroRoberto:perf/topKperRow-FI

Conversation

Contributor

@LopezCastroRoberto LopezCastroRoberto commented Feb 10, 2026

Summary

This PR integrates FlashInfer's radix-based top-k kernel as an alternative implementation for the large context top-k operation in the sparse attention indexer, specifically for DeepSeek-V3.2 models.

Kernel adapted from: flashinfer-ai/flashinfer#2215
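For intuition, here is a pure-Python sketch of the radix-select idea behind this class of kernel: rather than fully sorting each row, keys are bucketed digit by digit starting from the most significant bits; buckets entirely above the k-th element are accepted wholesale, and only the bucket straddling the boundary is refined in the next pass. This is illustrative only, not the FlashInfer CUDA code, which parallelizes the passes across GPU threads on packed integer keys:

```python
import struct

def float_to_ordered_u32(x: float) -> int:
    # Map a float32 bit pattern to a uint32 whose integer order matches
    # the float order (a standard trick in radix-select kernels).
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    return (u ^ 0xFFFFFFFF) if (u >> 31) else (u | 0x80000000)

def radix_topk_indices(scores, k):
    # MSB-first radix select over 8-bit digits. Whole buckets above the
    # running threshold are accepted; only the straddling bucket is refined.
    keys = [float_to_ordered_u32(s) for s in scores]
    candidates = list(range(len(scores)))
    selected = []
    for shift in (24, 16, 8, 0):
        buckets = [[] for _ in range(256)]
        for i in candidates:
            buckets[(keys[i] >> shift) & 0xFF].append(i)
        for d in range(255, -1, -1):
            need = k - len(selected)
            if need == 0:
                break
            if len(buckets[d]) <= need:
                selected.extend(buckets[d])   # whole bucket is in the top-k
            else:
                candidates = buckets[d]       # straddler: refine next pass
                break
        if len(selected) == k:
            break
    else:
        # After the last pass, all remaining candidate keys are equal (ties).
        selected.extend(candidates[: k - len(selected)])
    return selected
```

The payoff over a full per-row sort is that most elements are discarded after inspecting only a few high-order digits, which is what makes the approach attractive at very long context lengths.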

Microbenchmark study

[Figure: microbenchmark overview]

E2E results

Example on NVIDIA B200:

vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 4
vllm bench serve --backend vllm --model nvidia/DeepSeek-V3.2-NVFP4 --input-len 128000 --output-len 4096 --num-prompts 1

MAIN:

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  59.15     
Total input tokens:                      128000    
Total generated tokens:                  4096      
Request throughput (req/s):              0.02      
Output token throughput (tok/s):         69.24     
Peak output token throughput (tok/s):    71.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          2233.14   
---------------Time to First Token----------------
Mean TTFT (ms):                          717.80    
Median TTFT (ms):                        717.80    
P99 TTFT (ms):                           717.80    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.27     
Median TPOT (ms):                        14.27     
P99 TPOT (ms):                           14.27     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.27     
Median ITL (ms):                         14.27     
P99 ITL (ms):                            14.57     
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  53.62     
Total input tokens:                      128000    
Total generated tokens:                  4096      
Request throughput (req/s):              0.02      
Output token throughput (tok/s):         76.39     
Peak output token throughput (tok/s):    80.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          2463.46   
---------------Time to First Token----------------
Mean TTFT (ms):                          732.15    
Median TTFT (ms):                        732.15    
P99 TTFT (ms):                           732.15    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.92     
Median TPOT (ms):                        12.92     
P99 TPOT (ms):                           12.92     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.92     
Median ITL (ms):                         12.72     
P99 ITL (ms):                            13.81     
==================================================

In this example, the PR improves output token throughput over MAIN by ~10% (69.24 → 76.39 tok/s).

Here is a more general analysis across different sequence lengths on NVIDIA B300:

vllm bench serve --backend vllm --model nvidia/DeepSeek-V3.2-NVFP4 --input-len seq_len --output-len 4096 --num-prompts 1
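A sweep like the one above can be scripted, for example (illustrative only: the set of lengths and the `sweep` helper are not part of vLLM; `dry_run=False` actually launches the benchmarks):

```python
import shlex
import subprocess

CMD = ("vllm bench serve --backend vllm --model nvidia/DeepSeek-V3.2-NVFP4 "
       "--input-len {seq_len} --output-len 4096 --num-prompts 1")

def sweep(seq_lens, dry_run=True):
    # Build one benchmark command per input length; run them when requested.
    cmds = [CMD.format(seq_len=n) for n in seq_lens]
    if not dry_run:
        for c in cmds:
            subprocess.run(shlex.split(c), check=True)
    return cmds

for c in sweep([8192, 16384, 32768, 65536, 131072]):
    print(c)
```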

[Figure: output token throughput vs. input sequence length]

Accuracy

python tests/evals/gsm8k/gsm8k_eval.py

MAIN:

Results:
Accuracy: 0.926
Invalid responses: 0.000
Total latency: 54.086 s
Questions per second: 24.387
Total output tokens: 121416
Output tokens per second: 2244.889

PR:

Results:
Accuracy: 0.929
Invalid responses: 0.000
Total latency: 52.035 s
Questions per second: 25.348
Total output tokens: 121881
Output tokens per second: 2342.299

@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft February 10, 2026 18:47
@LopezCastroRoberto LopezCastroRoberto changed the title Add FlashInfer top-k support to large context decode path [Perf] Add FlashInfer top-k support to large context decode path Feb 10, 2026
@mergify mergify bot added rocm Related to AMD ROCm v1 labels Feb 10, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 10, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request replaces the custom large_context_topk kernel with flashinfer.top_k_ragged_transform for handling top-k operations in the large context decode path. The changes primarily involve updating sparse_attn_indexer.py to use the FlashInfer function and passing a new offsets_buffer. Corresponding changes are made for API compatibility in the ROCm path. The tests are also updated to validate the new implementation. My review found a critical issue in the test file where a new test function shadows an existing one due to having the same name, and also misuses a pytest parameter. I've provided a suggestion to fix this.

@LopezCastroRoberto LopezCastroRoberto changed the title [Perf] Add FlashInfer top-k support to large context decode path [Perf] Add FlashInfer top-k support to large context decode path - DeepSeek-V3.2 sparse attention Feb 10, 2026
@mergify mergify bot added the deepseek Related to DeepSeek models label Feb 10, 2026
@mergify mergify bot added the nvidia label Feb 12, 2026
@LopezCastroRoberto LopezCastroRoberto changed the title [Perf] Add FlashInfer top-k support to large context decode path - DeepSeek-V3.2 sparse attention [Perf][Kernel] Improve topKperRow routine for large context decode path - DeepSeek-V3.2 sparse attention Feb 12, 2026
@LopezCastroRoberto LopezCastroRoberto changed the title [Perf][Kernel] Improve topKperRow routine for large context decode path - DeepSeek-V3.2 sparse attention [Perf][Kernel] Improve topKperRow for large context decode path - DeepSeek-V3.2 sparse attention Feb 12, 2026
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review February 12, 2026 18:49

mergify bot commented Feb 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LopezCastroRoberto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 13, 2026
@mergify mergify bot added the performance Performance-related issues label Feb 16, 2026
Member

mgoin commented Feb 16, 2026

Interesting that the gsm8k eval is ~15% faster even though it is only ~2k context length and has especially short prefills (due to high prefix cache hit rate)

Contributor Author

LopezCastroRoberto commented Feb 16, 2026

Interesting that the gsm8k eval is ~15% faster even though it is only ~2k context length and has especially short prefills (due to high prefix cache hit rate)

@mgoin Yeah, there isn’t actually a remarkable speedup on GSM8K. My goal wasn’t to demonstrate performance improvements on this test, but simply to verify that accuracy was preserved. It looks like I probably just copied one of the later runs (second/third/fourth) from several consecutive executions I did for this test, rather than the initial run.

e.g.,
python tests/evals/gsm8k/gsm8k_eval.py

First execution:

Results:
Accuracy: 0.929
Invalid responses: 0.000
Total latency: 52.035 s
Questions per second: 25.348
Total output tokens: 121881
Output tokens per second: 2342.299

Second execution:

Accuracy: 0.929
Invalid responses: 0.000
Total latency: 46.138 s
Questions per second: 28.588
Total output tokens: 122155
Output tokens per second: 2647.619

I agree this can be confusing. I will update it in the PR description.

# See: https://github.com/vllm-project/vllm/pull/34265
max_seq_len = common_attn_metadata.max_seq_len
use_radix_topk = max_seq_len >= 65536
use_large_context_topk = max_seq_len == 2048 or (8192 < max_seq_len < 65536)
Contributor Author


However, there may be a minor improvement for 2k contexts due to the updated heuristic introduced in this PR. Specifically, large_context_topk (integrated in a previous PR) is selected instead of top_k_per_row_decode. This choice is based on the microbenchmark results described above.
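Restated as a standalone function (an illustrative mirror of the snippet above, not the vLLM code; `select_topk_impl` and the returned names are hypothetical labels for the kernels discussed in this PR):

```python
def select_topk_impl(max_seq_len: int) -> str:
    # Thresholds come from the microbenchmarks in the PR description.
    if max_seq_len >= 65536:
        return "radix_topk"            # FlashInfer radix-based kernel (this PR)
    if max_seq_len == 2048 or 8192 < max_seq_len < 65536:
        return "large_context_topk"    # custom kernel from a previous PR
    return "top_k_per_row_decode"      # default decode-path kernel
```

Note the boundary behavior: exactly 8192 still uses the default decode kernel, while 2048 is carved out as a special case for `large_context_topk`.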

@LopezCastroRoberto LopezCastroRoberto changed the title [Perf][Kernel] Improve topKperRow for large context decode path - DeepSeek-V3.2 sparse attention [Attention][Perf][Kernel] Improve topKperRow for large context decode path - DeepSeek-V3.2 sparse attention Feb 18, 2026
Contributor Author

LopezCastroRoberto commented Mar 2, 2026

This could be fairly expensive, any way to avoid? I noticed it isn't included in the benchmarking code. Maybe the kernel itself could zero only the required sections just for RadixRowState?
cc: @mgoin

I’ve fused the topk_workspace.zero_() op directly into the kernel. At the kernel level, there’s no remarkable overhead from this change. I also added a few additional test points to further validate that the heuristic behaves as expected.
[Figure: top-k decode microbenchmark]
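Conceptually, the fusion works like this: rather than launching a separate `topk_workspace.zero_()` over the whole buffer before the kernel, each block clears only its own per-row state as its first step inside the kernel. A pure-Python sketch of that idea (sizes, names, and the histogram body are illustrative, not the CUDA implementation):

```python
ROWS, STATE_WORDS = 4, 8          # illustrative sizes

def topk_kernel_row(workspace, row, digits):
    # Stand-in for one GPU block processing one row.
    base = row * STATE_WORDS
    # Fused init: clear only this row's state, so no separate
    # full-buffer zeroing launch is needed between calls.
    for j in range(base, base + STATE_WORDS):
        workspace[j] = 0
    for d in digits:              # e.g. accumulate a digit histogram
        workspace[base + d] += 1

workspace = [7] * (ROWS * STATE_WORDS)   # stale data from a previous launch
for row in range(ROWS):
    topk_kernel_row(workspace, row, digits=[1, 1, 5])
```

Since each row's state is overwritten before it is read, stale contents from earlier launches are harmless and the extra memset launch disappears.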

Collaborator

@LucasWilkinson LucasWilkinson left a comment


Do we know how much perf the cudagraph specialization brings e2e? It adds quite a bit of complexity, so I'm just wondering if it's worth it. How hard would it be to optimize the topk for the shorter contexts?

@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft March 17, 2026 10:33
LopezCastroRoberto and others added 13 commits March 17, 2026 14:50
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review March 17, 2026 15:13
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft March 17, 2026 15:21
@mergify mergify bot removed the needs-rebase label Mar 17, 2026
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>

Labels

deepseek (Related to DeepSeek models), nvidia, performance (Performance-related issues), rocm (Related to AMD ROCm), v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants