[Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode#37421
Conversation
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Code Review
This pull request introduces a unified persistent TopK scheduler for DSA, a significant improvement for CUDAGraph safety and host-side simplicity. The new approach dispatches each row to the appropriate TopK kernel variant at runtime based on its sequence length, replacing the previous host-side kernel selection. The changes span the CUDA kernel implementations, the Python bindings, and their integration into the model executor and attention backend. Several hardcoded constants are introduced in the new CUDA files; while constexpr is beneficial for performance, these values should be justified and documented. Naming RADIX_TOPK_WORKSPACE_SIZE as a Python constant is a good step toward readability and maintainability.
csrc/persistent_topk_medium.cuh (outdated)

```cpp
    int* __restrict__ output_indices,
    int logits_offset,
    int seq_len) {
  alignas(128) __shared__ int shared_histogram[2][RADIX + 128];
```
csrc/persistent_topk_medium.cuh (outdated)

```cpp
    int seq_len) {
  alignas(128) __shared__ int shared_histogram[2][RADIX + 128];
  alignas(128) __shared__ int shared_output_count;
  alignas(128) __shared__ int shared_threshold_bin;
```
csrc/topk.cuh (outdated)

```cpp
// Returns 1, 2, 4, or 8
template <typename DType>
constexpr int ComputeFilteredTopKVecSize(uint32_t max_len) {
  constexpr int MAX_VEC = 16 / sizeof(DType);  // 4 for float32, 8 for fp16/bf16
```
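For readers unfamiliar with the vectorized-load sizing idiom above, here is a hypothetical Python mirror of what a function like `ComputeFilteredTopKVecSize` plausibly does (the real selection logic lives in `csrc/topk.cuh`; the divisibility rule below is an assumption for illustration):

```python
def compute_filtered_topk_vec_size(max_len: int, dtype_bytes: int) -> int:
    """Sketch of per-dtype vector-width selection for 16-byte loads.

    Assumption: the kernel picks the widest power-of-two vector width
    (up to 16 bytes per load) that still evenly divides the row length.
    """
    max_vec = 16 // dtype_bytes  # 4 for float32, 8 for fp16/bf16
    vec = max_vec
    while vec > 1 and max_len % vec != 0:
        vec //= 2
    return vec  # one of 1, 2, 4, or 8
```

Wider vectors amortize memory-transaction overhead, which is why the constant is derived from `16 / sizeof(DType)` rather than fixed per kernel.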
tests/kernels/test_top_k_per_row.py (outdated)

```python
lengths = (seq_lens.unsqueeze(1) - next_n + 1 + offsets).flatten()

if kernel_name == "large_context_topk":
    workspace = torch.empty(1024 * 1024, dtype=torch.uint8, device="cuda")

max_non_topk = non_topk_vals.max()

# Allow small tolerance for floating point errors
assert min_cuda_val >= max_non_topk - 1e-4, (
assert torch.allclose(
    cuda_vals.sort(descending=True)[0],
    torch_vals.sort(descending=True)[0],
    rtol=1e-4,
    atol=1e-4,
```
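The test's core invariant is that every value the kernel selected must be at least as large as every value it rejected, up to float tolerance. A standalone NumPy sketch of that check (not the actual vLLM test, which runs against torch tensors on CUDA):

```python
import numpy as np

def check_topk(logits: np.ndarray, topk_idx: np.ndarray,
               atol: float = 1e-4) -> bool:
    """Return True iff min(selected) >= max(non-selected) - atol."""
    mask = np.zeros(logits.shape[0], dtype=bool)
    mask[topk_idx] = True
    non_topk_vals = logits[~mask]
    if non_topk_vals.size == 0:
        return True  # everything was selected; trivially valid
    return logits[mask].min() >= non_topk_vals.max() - atol
```

This set-based check is robust to ties and to the kernel returning indices in any order, which an exact index comparison against `torch.topk` would not be.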
Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
LucasWilkinson left a comment
This is really awesome! Thanks for all the hard work!
One nit: instead of threading the topk_workspace through the whole model definition, can we just use current_workspace_manager().get_simultaneous(...)?
Summary
Redesigns the persistent TopK kernel used by DSA as a true persistent scheduler with dynamic per-row path selection.
This supersedes and closes #34265, which took a CUDAGraph-specialization approach. Instead, this PR follows a persistent scheduler pattern where a single fixed-grid kernel dynamically dispatches each row to the appropriate path at runtime.
Problem
As #34265 demonstrated, there are four topK-per-row kernel variants, each optimal for a different sequence-length regime. This is not an implementation artifact; it reflects fundamental algorithmic trade-offs between the variants.
Since max_seq_len changes at runtime (batches mix short decode sequences with long-context prefills), the initial approach in #34265 handled kernel selection via CUDAGraph specialization. However, this added complexity on the host side and required multiple graph variants. This PR simplifies the problem with a persistent scheduler that handles dispatch on-the-fly inside a single kernel.
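The old approach can be sketched as a host-side selector keyed on the batch-wide maximum length; under CUDAGraph capture this choice is frozen, which is what forced one captured graph per variant. Kernel names and thresholds below are illustrative assumptions, not the real cut-offs (those live in the CUDA sources):

```python
# Hypothetical regime boundaries, for illustration only.
SMALL_MAX, MEDIUM_MAX, LARGE_MAX = 4096, 32768, 131072

def select_topk_variant(max_seq_len: int) -> str:
    """Host-side, batch-wide dispatch as in the superseded approach.

    The whole batch gets one kernel chosen from max_seq_len, so short
    decode rows mixed with a long prefill all pay the long-context cost.
    """
    if max_seq_len <= SMALL_MAX:
        return "small_topk"
    if max_seq_len <= MEDIUM_MAX:
        return "medium_topk"
    if max_seq_len <= LARGE_MAX:
        return "large_context_topk"
    return "filtered_topk"
```

Because `max_seq_len` is a runtime value, every branch here is a different kernel launch, and CUDAGraph must capture each one separately.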
Approach
Single persistent kernel, fixed grid, dynamic dispatch.
This is CUDAGraph-safe by construction: the grid shape never changes, and the captured kernel handles all sequence lengths.
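The scheduler pattern can be illustrated on the CPU with a fixed pool of workers standing in for the fixed CUDA grid; each worker pulls rows and picks the per-row path from that row's own length, never from the batch maximum. Path names and thresholds are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4  # stands in for the fixed, capture-time grid size

def topk_row(row, seq_len, k):
    # Per-row dynamic dispatch: the path depends only on this row's
    # length (illustrative thresholds), not on the batch-wide max.
    if seq_len <= 4096:
        path = "small"
    elif seq_len <= 32768:
        path = "medium"
    else:
        path = "large"
    return path, sorted(row[:seq_len], reverse=True)[:k]

def persistent_topk(rows, seq_lens, k):
    """CPU sketch of the persistent scheduler: a fixed worker pool
    drains all rows, so the "launch shape" never changes with the
    input -- the property that makes the kernel CUDAGraph-safe."""
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return list(pool.map(lambda a: topk_row(a[0], a[1], k),
                             zip(rows, seq_lens)))
```

On the GPU the same idea is a grid-stride loop over rows inside one captured kernel, with a branch per row choosing the small/medium/large code path.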
Microbenchmarking

E2E results (NVIDIA B200)

Example (MAIN vs. PR): this PR improves throughput over MAIN by up to ~17%.

Examples (DP4)

Accuracy