
[Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode#37421

Open
LopezCastroRoberto wants to merge 30 commits into vllm-project:main from LopezCastroRoberto:perf/topK_persistent_scheduler

Conversation


@LopezCastroRoberto LopezCastroRoberto commented Mar 18, 2026

Summary

Redesigns the persistent TopK kernel used by DSA as a true persistent scheduler with dynamic per-row path selection.

This supersedes and closes #34265, which took a CUDAGraph-specialization approach. Instead, this PR follows a persistent scheduler pattern where a single fixed-grid kernel dynamically dispatches each row to the appropriate path at runtime.

Problem

As #34265 demonstrated, there are four different topK-per-row kernel variants, each optimal for a different sequence length regime. This isn't an implementation artifact — it reflects fundamental algorithmic trade-offs:

  • Short sequences (≤ 8K) benefit from fine-grained histograms (2048 bins from FP16) that resolve the top-k in a single pass, since with only ~4 elements per bin on average, the threshold bin is small enough that refinement is rarely needed.
  • Medium sequences (8K–64K) use coarser 256-bin histograms with multi-pass FP32 radix refinement. The coarser initial pass reduces shared memory pressure and atomic contention, which matters more as element count grows.
  • Large sequences (> 64K) exceed a single CTA's shared memory capacity, requiring cooperative multi-CTA radix select with inter-CTA synchronization via global memory barriers.
  • Trivial sequences (≤ TopK) just copy all indices directly.
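
The regime boundaries above can be sketched as a simple per-row selection function (a minimal sketch in Python; the function name and `top_k` parameter are illustrative, with the thresholds taken from the list above):

```python
def select_path(seq_len: int, top_k: int) -> str:
    """Pick the top-k path for one row, mirroring the four regimes."""
    if seq_len <= top_k:
        return "trivial"  # no more candidates than top_k: copy all indices
    if seq_len <= 8192:
        return "decode"   # fine-grained 2048-bin histogram, single pass
    if seq_len <= 65536:
        return "medium"   # coarser 256-bin histogram + FP32 radix refinement
    return "large"        # multi-CTA cooperative radix select
```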

Since max_seq_len changes at runtime (batches mix short decode sequences with long-context prefills), the initial approach in #34265 handled kernel selection via CUDAGraph specialization. However, this added complexity on the host side and required multiple graph variants. This PR simplifies the problem with a persistent scheduler that handles dispatch on-the-fly inside a single kernel.

Approach

Single persistent kernel, fixed grid, dynamic dispatch:

  • The grid is always configured for the worst case (large path: multiple CTAs per row group)
  • Inside the row loop, each row dynamically selects its path based on actual seq_len:
    • Trivial (≤ TopK): direct index copy
    • Decode (≤ 8192): 2048-bin FP16 histogram + FP32 radix refinement
    • Medium (≤ 64K): 256-bin FP16 histogram + FP32 radix refinement
    • Large (> 64K): multi-CTA cooperative radix select
  • For non-large rows, only CTA 0 of each group does work — other CTAs skip with no barrier overhead
  • The kernel self-initializes and self-cleans state (no cudaMemsetAsync needed, which avoids an extra CUDAGraph node per step)

This is CUDAGraph-safe by construction: the grid shape never changes, and the captured kernel handles all sequence lengths.
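
As a host-side sketch, the key property is that the launch configuration never depends on the batch's sequence lengths (names and the CTA-group size are illustrative, not the PR's actual values):

```python
def persistent_grid(num_rows: int, ctas_per_row_group: int = 4) -> int:
    """Grid sized for the worst case: every row could take the large path."""
    return num_rows * ctas_per_row_group

# The grid is a function of the row count only, so the same captured
# CUDAGraph replays correctly whether the batch holds 4K- or 160K-token rows.
grid_for_short_batch = persistent_grid(8)
grid_for_long_batch = persistent_grid(8)
assert grid_for_short_batch == grid_for_long_batch == 32
```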

Microbenchmarking

┌─────┬─────────┬─────────────────────┬──────────────────────┬─────────┐
│ BS  │ seq_len │      MAIN topk (μs) │ persistent_topk (μs) │ speedup │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 1   │ 4096    │                6.49 │                 5.08 │   1.28x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 1   │ 8192    │                8.99 │                 9.41 │   0.95x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 1   │ 16384   │               11.31 │                11.68 │   0.97x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 1   │ 32768   │               17.52 │                17.75 │   0.99x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 1   │ 65536   │               28.48 │                19.09 │   1.49x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 1   │ 96000   │               39.13 │                18.59 │   2.10x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 1   │ 128000  │               46.98 │                19.45 │   2.42x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 1   │ 163840  │               54.10 │                20.97 │   2.58x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│     │         │                     │                      │         │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 2   │ 4096    │                7.20 │                 7.96 │   0.90x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 2   │ 8192    │                9.02 │                 9.47 │   0.95x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 2   │ 16384   │               12.67 │                13.05 │   0.97x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 2   │ 32768   │               17.04 │                18.16 │   0.94x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 2   │ 65536   │               31.03 │                21.87 │   1.42x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 2   │ 96000   │               41.19 │                21.19 │   1.94x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 2   │ 128000  │               47.41 │                22.34 │   2.12x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 2   │ 163840  │               63.95 │                21.77 │   2.94x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│     │         │                     │                      │         │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 4   │ 4096    │                7.07 │                 7.79 │   0.91x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 4   │ 8192    │                8.85 │                 9.43 │   0.94x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 4   │ 16384   │               12.35 │                13.48 │   0.92x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 4   │ 32768   │               18.00 │                19.02 │   0.95x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 4   │ 65536   │               32.65 │                21.85 │   1.49x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 4   │ 96000   │               40.30 │                21.75 │   1.85x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 4   │ 128000  │               48.91 │                21.80 │   2.24x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 4   │ 163840  │               64.44 │                22.09 │   2.92x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│     │         │                     │                      │         │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 8   │ 4096    │                6.78 │                 7.90 │   0.86x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 8   │ 8192    │                8.99 │                 9.48 │   0.95x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 8   │ 16384   │               12.39 │                13.74 │   0.90x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 8   │ 32768   │               17.48 │                18.88 │   0.93x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 8   │ 65536   │               30.90 │                22.28 │   1.39x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 8   │ 96000   │               39.20 │                22.07 │   1.78x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 8   │ 128000  │               49.66 │                22.66 │   2.19x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 8   │ 163840  │               63.20 │                22.51 │   2.81x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│     │         │                     │                      │         │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 16  │ 4096    │                7.09 │                 7.90 │   0.90x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 16  │ 8192    │                9.24 │                 9.62 │   0.96x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 16  │ 16384   │               12.67 │                14.82 │   0.85x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 16  │ 32768   │               17.43 │                21.04 │   0.83x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 16  │ 65536   │               32.01 │                31.73 │   1.01x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 16  │ 96000   │               40.09 │                37.71 │   1.06x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 16  │ 128000  │               51.65 │                34.91 │   1.48x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 16  │ 163840  │               63.78 │                39.40 │   1.62x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│     │         │                     │                      │         │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 32  │ 4096    │                7.32 │                 8.05 │   0.91x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 32  │ 8192    │                9.42 │                 9.69 │   0.97x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 32  │ 16384   │               12.69 │                14.01 │   0.91x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 32  │ 32768   │               17.11 │                20.72 │   0.83x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 32  │ 65536   │               31.71 │                31.90 │   0.99x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 32  │ 96000   │               41.66 │                37.79 │   1.10x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 32  │ 128000  │               52.12 │                35.14 │   1.48x │
├─────┼─────────┼─────────────────────┼──────────────────────┼─────────┤
│ 32  │ 163840  │               64.20 │                39.95 │   1.61x │
└─────┴─────────┴─────────────────────┴──────────────────────┴─────────┘
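
For reference, the speedup column is MAIN time divided by persistent-kernel time, e.g. for the BS=1, seq_len=163840 row:

```python
main_us, persistent_us = 54.10, 20.97  # BS=1, seq_len=163840 from the table
speedup = main_us / persistent_us
assert f"{speedup:.2f}x" == "2.58x"
```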

E2E results (NVIDIA B200)

vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 4
vllm bench serve --backend vllm --model nvidia/DeepSeek-V3.2-NVFP4 --input-len seq_len --output-len 4096 --num-prompts 1

[figure: topk_e2e_serving_benchmark]

Example

MAIN

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  52.09     
Total input tokens:                      128000    
Total generated tokens:                  4096      
Request throughput (req/s):              0.02      
Output token throughput (tok/s):         78.64     
Peak output token throughput (tok/s):    80.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          2536.02   
---------------Time to First Token----------------
Mean TTFT (ms):                          497.61    
Median TTFT (ms):                        497.61    
P99 TTFT (ms):                           497.61    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.60     
Median TPOT (ms):                        12.60     
P99 TPOT (ms):                           12.60     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.61     
Median ITL (ms):                         12.61     
P99 ITL (ms):                            12.97     
==================================================

PR

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  44.58     
Total input tokens:                      128000    
Total generated tokens:                  4096      
Request throughput (req/s):              0.02      
Output token throughput (tok/s):         91.87     
Peak output token throughput (tok/s):    93.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          2962.82   
---------------Time to First Token----------------
Mean TTFT (ms):                          501.63    
Median TTFT (ms):                        501.63    
P99 TTFT (ms):                           501.63    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.76     
Median TPOT (ms):                        10.76     
P99 TPOT (ms):                           10.76     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.78     
Median ITL (ms):                         10.78     
P99 ITL (ms):                            11.28     
==================================================

i.e., this PR improves output token throughput over MAIN by ~17% (91.87 vs 78.64 tok/s) at 128K input length

Examples DP4

vllm serve nvidia/DeepSeek-V3.2-NVFP4 -dp 4 --enable-expert-parallel --kv-cache-dtype fp8

[figure: main_vs_pr_tpot]

[figure: v31_vs_v32_throughput]

Accuracy

python tests/evals/gsm8k/gsm8k_eval.py  --port 8001                                                                  
Results:
Accuracy: 0.954
Invalid responses: 0.000
Total latency: 41.613 s
Questions per second: 31.697
Total output tokens: 119943
Output tokens per second: 2882.335

LopezCastroRoberto and others added 17 commits March 17, 2026 14:50
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft March 18, 2026 11:47
@mergify mergify bot added deepseek Related to DeepSeek models performance Performance-related issues nvidia v1 labels Mar 18, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a unified persistent TopK scheduler for DSA, which is a significant improvement for CUDAGraph safety and host-side simplification. The new approach dynamically dispatches to different TopK kernel variants based on sequence length, replacing the previous specialized kernel selection. This refactoring streamlines the TopK implementation and its integration within the system. The changes are comprehensive, spanning CUDA kernel implementations, Python bindings, and integration into the model executor and attention backend. Several hardcoded constants are introduced in the new CUDA files; while constexpr is beneficial for performance, it's important to ensure these values are well-justified and documented. The addition of RADIX_TOPK_WORKSPACE_SIZE as a named constant in Python is a good step towards improving readability and maintainability.

int* __restrict__ output_indices,
int logits_offset,
int seq_len) {
alignas(128) __shared__ int shared_histogram[2][RADIX + 128];

high

The magic number 128 used in the alignas specifier and RADIX + 128 should be replaced with a named constant or explained with a comment. This improves readability and makes the intent clearer.

int seq_len) {
alignas(128) __shared__ int shared_histogram[2][RADIX + 128];
alignas(128) __shared__ int shared_output_count;
alignas(128) __shared__ int shared_threshold_bin;

high

The magic number 128 used in the alignas specifier should be replaced with a named constant or explained with a comment.

csrc/topk.cuh Outdated
// Returns 1, 2, 4, or 8
template <typename DType>
constexpr int ComputeFilteredTopKVecSize(uint32_t max_len) {
constexpr int MAX_VEC = 16 / sizeof(DType); // 4 for float32, 8 for fp16/bf16

high

The magic number 16 used in MAX_VEC calculation should be replaced with a named constant or explained with a comment.

lengths = (seq_lens.unsqueeze(1) - next_n + 1 + offsets).flatten()

if kernel_name == "large_context_topk":
workspace = torch.empty(1024 * 1024, dtype=torch.uint8, device="cuda")

high

The magic number 1024 * 1024 (1MB) for the workspace size should be replaced with the RADIX_TOPK_WORKSPACE_SIZE constant defined in vllm/model_executor/layers/sparse_attn_indexer.py to ensure consistency and avoid duplication.
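
A minimal sketch of the suggested fix (the constant name and value follow the review comment; `make_workspace` is a hypothetical CPU stand-in for the test's `torch.empty(..., dtype=torch.uint8, device="cuda")` call):

```python
# In vllm/model_executor/layers/sparse_attn_indexer.py (per the review):
RADIX_TOPK_WORKSPACE_SIZE = 1024 * 1024  # 1 MiB scratch for large-context radix top-k

def make_workspace(nbytes: int = RADIX_TOPK_WORKSPACE_SIZE) -> bytearray:
    # Hypothetical stand-in; the test would allocate this on the GPU instead.
    return bytearray(nbytes)
```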

max_non_topk = non_topk_vals.max()

# Allow small tolerance for floating point errors
assert min_cuda_val >= max_non_topk - 1e-4, (

high

The magic number 1e-4 used as a tolerance for floating point errors should be replaced with a named constant. This improves readability and makes it easier to adjust the tolerance if needed.

Comment on lines +637 to +641
assert torch.allclose(
cuda_vals.sort(descending=True)[0],
torch_vals.sort(descending=True)[0],
rtol=1e-4,
atol=1e-4,

high

The magic number 1e-4 used as rtol and atol for torch.allclose should be replaced with a named constant. This improves readability and makes it easier to adjust the tolerance if needed.

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review March 18, 2026 13:16
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@LopezCastroRoberto LopezCastroRoberto changed the title [Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA [Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode Mar 18, 2026
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>

mergify bot commented Mar 20, 2026

Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
mergify bot commented Mar 24, 2026

(Same pre-commit failure notice as the Mar 20 comment above.)


mergify bot commented Mar 25, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LopezCastroRoberto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 25, 2026
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@LopezCastroRoberto
Contributor Author

  ┌─────┬─────────┬────────────────┬─────────────────────────┬────────────┬─────────────────────────┬────────────┐
  │ BS  │ seq_len │ MAIN topk (μs) │ persistent_topk_v1 (μs) │ speedup_v1 │ persistent_topk_v2 (μs) │ speedup_v2 │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   1 │    4096 │           6.49 │                    5.08 │      1.28x │                    6.77 │      0.96x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   1 │    8192 │           8.99 │                    9.41 │      0.95x │                    6.08 │      1.48x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   1 │   16384 │          11.31 │                   11.68 │      0.97x │                   12.75 │      0.89x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   1 │   32768 │          17.52 │                   17.75 │      0.99x │                   16.87 │      1.04x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   1 │   65536 │          28.48 │                   19.09 │      1.49x │                   18.87 │      1.51x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   1 │   96000 │          39.13 │                   18.59 │      2.10x │                   20.83 │      1.88x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   1 │  128000 │          46.98 │                   19.45 │      2.42x │                   20.12 │      2.33x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   1 │  163840 │          54.10 │                   20.97 │      2.58x │                   20.56 │      2.63x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │     │         │                │                         │            │                         │            │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   2 │    4096 │           7.20 │                    7.96 │      0.90x │                    6.76 │      1.07x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   2 │    8192 │           9.02 │                    9.47 │      0.95x │                    9.29 │      0.97x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   2 │   16384 │          12.67 │                   13.05 │      0.97x │                   12.87 │      0.98x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   2 │   32768 │          17.04 │                   18.16 │      0.94x │                   17.35 │      0.98x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   2 │   65536 │          31.03 │                   21.87 │      1.42x │                   20.68 │      1.50x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   2 │   96000 │          41.19 │                   21.19 │      1.94x │                   20.50 │      2.01x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   2 │  128000 │          47.41 │                   22.34 │      2.12x │                   19.76 │      2.40x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   2 │  163840 │          63.95 │                   21.77 │      2.94x │                   20.14 │      3.17x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │     │         │                │                         │            │                         │            │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   4 │    4096 │           7.07 │                    7.79 │      0.91x │                    7.65 │      0.92x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   4 │    8192 │           8.85 │                    9.43 │      0.94x │                    9.23 │      0.96x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   4 │   16384 │          12.35 │                   13.48 │      0.92x │                   14.04 │      0.88x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   4 │   32768 │          18.00 │                   19.02 │      0.95x │                   17.79 │      1.01x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   4 │   65536 │          32.65 │                   21.85 │      1.49x │                   20.89 │      1.56x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   4 │   96000 │          40.30 │                   21.75 │      1.85x │                   20.64 │      1.95x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   4 │  128000 │          48.91 │                   21.80 │      2.24x │                   20.68 │      2.36x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   4 │  163840 │          64.44 │                   22.09 │      2.92x │                   20.45 │      3.15x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │     │         │                │                         │            │                         │            │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   8 │    4096 │           6.78 │                    7.90 │      0.86x │                    7.61 │      0.89x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   8 │    8192 │           8.99 │                    9.48 │      0.95x │                   10.16 │      0.88x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   8 │   16384 │          12.39 │                   13.74 │      0.90x │                   12.90 │      0.96x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   8 │   32768 │          17.48 │                   18.88 │      0.93x │                   17.79 │      0.98x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   8 │   65536 │          30.90 │                   22.28 │      1.39x │                   20.43 │      1.51x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   8 │   96000 │          39.20 │                   22.07 │      1.78x │                   21.62 │      1.81x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   8 │  128000 │          49.66 │                   22.66 │      2.19x │                   22.48 │      2.21x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │   8 │  163840 │          63.20 │                   22.51 │      2.81x │                   22.90 │      2.76x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │     │         │                │                         │            │                         │            │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  16 │    4096 │           7.09 │                    7.90 │      0.90x │                    7.70 │      0.92x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  16 │    8192 │           9.24 │                    9.62 │      0.96x │                    9.29 │      0.99x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  16 │   16384 │          12.67 │                   14.82 │      0.85x │                   13.35 │      0.95x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  16 │   32768 │          17.43 │                   21.04 │      0.83x │                   20.01 │      0.87x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  16 │   65536 │          32.01 │                   31.73 │      1.01x │                   32.74 │      0.98x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  16 │   96000 │          40.09 │                   37.71 │      1.06x │                   38.99 │      1.03x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  16 │  128000 │          51.65 │                   34.91 │      1.48x │                   36.03 │      1.43x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  16 │  163840 │          63.78 │                   39.40 │      1.62x │                   40.80 │      1.56x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │     │         │                │                         │            │                         │            │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  32 │    4096 │           7.32 │                    8.05 │      0.91x │                    7.79 │      0.94x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  32 │    8192 │           9.42 │                    9.69 │      0.97x │                   10.03 │      0.94x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  32 │   16384 │          12.69 │                   14.01 │      0.91x │                   14.05 │      0.90x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  32 │   32768 │          17.11 │                   20.72 │      0.83x │                   19.94 │      0.86x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  32 │   65536 │          31.71 │                   31.90 │      0.99x │                   32.61 │      0.97x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  32 │   96000 │          41.66 │                   37.79 │      1.10x │                   38.91 │      1.07x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  32 │  128000 │          52.12 │                   35.14 │      1.48x │                   36.16 │      1.44x │
  ├─────┼─────────┼────────────────┼─────────────────────────┼────────────┼─────────────────────────┼────────────┤
  │  32 │  163840 │          64.20 │                   39.95 │      1.61x │                   41.09 │      1.56x │
  └─────┴─────────┴────────────────┴─────────────────────────┴────────────┴─────────────────────────┴────────────┘

Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>

@LucasWilkinson LucasWilkinson left a comment


This is really awesome! Thanks for all the hard work!

One nit: instead of threading the topk_workspace through the whole model definition, can we just use current_workspace_manager().get_simultaneous(...)?
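The pattern being suggested can be illustrated with a toy sketch: layers fetch a shared, named workspace from an ambient manager on demand rather than receiving it as a constructor/forward argument. Note that the `WorkspaceManager` class below, the `get_simultaneous` behavior, and the buffer shapes are hypothetical stand-ins for illustration, not vLLM's actual API:

```python
import numpy as np


class WorkspaceManager:
    """Toy stand-in for a context-scoped workspace manager (hypothetical API)."""

    def __init__(self):
        self._buffers = {}

    def get_simultaneous(self, name, shape, dtype=np.float32):
        # Reuse one persistent allocation per (name, shape, dtype) instead of
        # threading a workspace argument through every layer of the model.
        key = (name, tuple(shape), np.dtype(dtype))
        if key not in self._buffers:
            self._buffers[key] = np.zeros(shape, dtype=dtype)
        return self._buffers[key]


_MANAGER = WorkspaceManager()


def current_workspace_manager():
    # Ambient accessor: any layer can reach the shared manager without
    # the model definition plumbing a workspace object down to it.
    return _MANAGER


# Two call sites asking for the same named workspace get the same allocation:
ws_a = current_workspace_manager().get_simultaneous("topk", (32, 2048))
ws_b = current_workspace_manager().get_simultaneous("topk", (32, 2048))
assert ws_a is ws_b  # same backing buffer, no argument threading
```

The trade-off is the usual one for ambient state: less plumbing in the model definition, at the cost of the dependency being implicit rather than visible in layer signatures.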

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Apr 1, 2026
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 1, 2026
@LucasWilkinson LucasWilkinson enabled auto-merge (squash) April 1, 2026 17:10