
[Bugfix] Fix condition to clear persistent topk so that it can be captured regardless#41665

Merged
ywang96 merged 6 commits into vllm-project:main from
zyongye:bug/clear_persistent_topk_scratchpad
May 6, 2026

Conversation

@zyongye
Member

@zyongye zyongye commented May 4, 2026

Purpose

The external memset was introduced in #41444, but only when needs_cooperative is true. needs_cooperative triggers only when max_seq_len exceeds a certain threshold, which means the memset never fires at CUDA graph capture time.

This PR removes the needs_cooperative guard so that the memset kernel is always issued at capture time.

Test Plan

gsm8k v4-pro DEP4 on B300 MTP2

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9522|±  |0.0059|
|     |       |strict-match    |     5|exact_match|↑  |0.9522|±  |0.0059|

Speedbench 8k/1k result

================= Serving Benchmark Result =================
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  33.69     
Total input tokens:                      81920     
Total generated tokens:                  4385      
Request throughput (req/s):              0.30      
Output token throughput (tok/s):         130.14    
Peak output token throughput (tok/s):    65.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          2561.38   
--------------------Time to First Token---------------------
Mean TTFT (ms):                          592.15    
Median TTFT (ms):                        591.43    
P99 TTFT (ms):                           603.63    
P90 TTFT (ms):                           593.99    
----------Time per Output Token (excl. 1st token)-----------
Mean TPOT (ms):                          6.53      
Median TPOT (ms):                        6.43      
P99 TPOT (ms):                           7.75      
P90 TPOT (ms):                           6.89      
--------------------Inter-token Latency---------------------
Mean ITL (ms):                           15.55     
Median ITL (ms):                         15.44     
P99 ITL (ms):                            25.05     
P90 ITL (ms):                            15.67     
---------------------End-to-end Latency---------------------
Mean E2EL (ms):                          3369.39   
Median E2EL (ms):                        3154.05   
P99 E2EL (ms):                           6512.02   
P90 E2EL (ms):                           5079.81   
---------------Speculative Decoding---------------
Acceptance rate (%):                     73.34     
Acceptance length:                       2.47      
Drafts:                                  1776      
Draft tokens:                            3552      
Accepted tokens:                         2605      
Per-position acceptance (%):
  Position 0:                            88.29     
  Position 1:                            58.39     
============================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify
Contributor

mergify Bot commented May 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added needs-rebase bug Something isn't working labels May 4, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an unconditional cudaMemsetAsync to zero the RadixRowState workspace before the persistent_topk_kernel launch, ensuring correct state initialization for CUDA graph replays and inter-CTA synchronization. Feedback identifies a potential race condition where the existing in-kernel initialization of the same state might conflict with the new host-side memset, and it is recommended to remove the redundant in-kernel reset to ensure correctness.

Comment thread csrc/topk.cu
Comment on lines +181 to +186
{
cudaError_t mz_err = cudaMemsetAsync(workspace.data_ptr<uint8_t>(), 0,
state_bytes, stream);
TORCH_CHECK(mz_err == cudaSuccess,
"row_states memset failed: ", cudaGetErrorString(mz_err));
}
Contributor


high

The addition of cudaMemsetAsync correctly ensures that the RadixRowState is zeroed and globally visible before the kernel launch, which is necessary for CUDA graph capture and correct inter-CTA synchronization. However, the persistent_topk_kernel in csrc/persistent_topk.cuh (lines 894-909) still contains an in-kernel initialization block that resets arrival_counter and other fields.

Because there is no global barrier between that in-kernel initialization and the subsequent work loop, a race condition exists: other CTAs (e.g., cta_in_group > 0) may start executing radix_topk, incrementing the arrival_counter, only to have cta_in_group == 0 reset it to zero. This will lead to synchronization failures or hangs.

Now that cudaMemsetAsync is used, the initialization block in csrc/persistent_topk.cuh should be removed to ensure correctness and avoid this race condition.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye force-pushed the bug/clear_persistent_topk_scratchpad branch from aee0c46 to 2e4b843 Compare May 4, 2026 21:15
@mergify mergify Bot removed the needs-rebase label May 4, 2026
@zyongye zyongye added the deepseek Related to DeepSeek models label May 4, 2026
@zyongye zyongye linked an issue May 4, 2026 that may be closed by this pull request
1 task
@zyongye zyongye added the ready ONLY add when PR is ready to merge/full CI is needed label May 4, 2026
@zyongye
Member Author

zyongye commented May 4, 2026

@claude review


@claude claude Bot left a comment


LGTM — minimal, well-justified fix to ensure the workspace memset is captured as a graph node regardless of capture-time max_seq_len.


Overview

Single-line change in csrc/topk.cu that removes a host-side if (needs_cooperative) guard around cudaMemsetAsync. The memset zeroes the per-group RadixRowState workspace before kernel launch. The guard caused the memset to be skipped at CUDA-graph capture time when capture-time max_seq_len <= RADIX_THRESHOLD, so the zeroing was never recorded into the captured graph, and stale workspace state could persist across replays if the runtime data later took the radix path.

Security risks

None. This is a CUDA kernel launch correctness fix in the DeepSeek V3 sparse attention indexer. No auth, crypto, network I/O, deserialization, or user-input handling is touched.

Level of scrutiny

Low. The change is mechanical (drop a conditional, keep the memset body), the rationale is fully captured in the new comment, the cost is negligible (~3 KB per group, sub-microsecond), and the author included gsm8k accuracy and serving benchmark numbers showing correct behavior on B300 MTP2.

Other factors

The Gemini bot flagged a potential in-kernel-init vs memset race, but that is stale: csrc/persistent_topk.cuh no longer has an in-kernel initialization block for arrival_counter (the comment at lines 890-895 explicitly notes the init was moved host-side for exactly this happens-before reason, and a grep for arrival_counter shows only red_release/wait_ge uses inside the kernel). Mergify also flagged conflicts, but rebasing is the author's responsibility and orthogonal to review correctness.

@zyongye zyongye enabled auto-merge (squash) May 6, 2026 02:53
Comment thread csrc/topk.cu
Comment on lines +158 to +170
// Issued UNCONDITIONALLY so the memset is captured as its own node in
// the cudagraph (a separate cudaMemsetAsync node, sequenced before the
// persistent_topk_kernel launch on the same stream). The previous
// host-side guard `if (needs_cooperative)` was evaluated at capture time;
// when capture-time max_seq_len <= RADIX_THRESHOLD (always true under
// FULL_DECODE_ONLY with max_model_len < 32 K) the memset would NOT be
// captured, leaving the workspace state to accumulate across replays.
// That's a latent correctness bug if the runtime data ever takes the
// radix path, and removes one variable while debugging hangs in the
// decode/medium paths.
//
// Cost is sub-microsecond: state_bytes = num_groups * sizeof(RadixRowState)
// is ~3 KB per group, ~100 KB for the largest grids on this hardware.
Collaborator


I don't think we need quite so much context in the comment.

Suggested change
// Issued UNCONDITIONALLY so the memset is captured as its own node in
// the cudagraph (a separate cudaMemsetAsync node, sequenced before the
// persistent_topk_kernel launch on the same stream). The previous
// host-side guard `if (needs_cooperative)` was evaluated at capture time;
// when capture-time max_seq_len <= RADIX_THRESHOLD (always true under
// FULL_DECODE_ONLY with max_model_len < 32 K) the memset would NOT be
// captured, leaving the workspace state to accumulate across replays.
// That's a latent correctness bug if the runtime data ever takes the
// radix path, and removes one variable while debugging hangs in the
// decode/medium paths.
//
// Cost is sub-microsecond: state_bytes = num_groups * sizeof(RadixRowState)
// is ~3 KB per group, ~100 KB for the largest grids on this hardware.
// Issued UNCONDITIONALLY so the memset is captured as its own node in
// the cudagraph (a separate cudaMemsetAsync node, sequenced before the
// persistent_topk_kernel launch on the same stream).

@ywang96 ywang96 disabled auto-merge May 6, 2026 23:17
@ywang96 ywang96 merged commit 80d5e7d into vllm-project:main May 6, 2026
149 of 152 checks passed

Labels

bug Something isn't working deepseek Related to DeepSeek models ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: h200 deepseekv4 pro mtp

3 participants