
[Bugfix] Fix condition to clear persistent topk so that it can be captured regardless#41665

Merged
ywang96 merged 6 commits into vllm-project:main from
zyongye:bug/clear_persistent_topk_scratchpad
May 6, 2026

Conversation

@zyongye
Member

@zyongye zyongye commented May 4, 2026

Purpose

The external memset was introduced in #41444, but only when needs_cooperative is true. needs_cooperative triggers only when max_seq_len exceeds a certain threshold, which means the memset never fires at CUDA graph capture time.

This PR removes the needs_cooperative guard so that the memset kernel is always issued at capture time.

Test Plan

gsm8k v4-pro DEP4 on B300 MTP2

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9522|±  |0.0059|
|     |       |strict-match    |     5|exact_match|↑  |0.9522|±  |0.0059|

Speedbench 8k/1k result

================= Serving Benchmark Result =================
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  33.69     
Total input tokens:                      81920     
Total generated tokens:                  4385      
Request throughput (req/s):              0.30      
Output token throughput (tok/s):         130.14    
Peak output token throughput (tok/s):    65.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          2561.38   
--------------------Time to First Token---------------------
Mean TTFT (ms):                          592.15    
Median TTFT (ms):                        591.43    
P99 TTFT (ms):                           603.63    
P90 TTFT (ms):                           593.99    
----------Time per Output Token (excl. 1st token)-----------
Mean TPOT (ms):                          6.53      
Median TPOT (ms):                        6.43      
P99 TPOT (ms):                           7.75      
P90 TPOT (ms):                           6.89      
--------------------Inter-token Latency---------------------
Mean ITL (ms):                           15.55     
Median ITL (ms):                         15.44     
P99 ITL (ms):                            25.05     
P90 ITL (ms):                            15.67     
---------------------End-to-end Latency---------------------
Mean E2EL (ms):                          3369.39   
Median E2EL (ms):                        3154.05   
P99 E2EL (ms):                           6512.02   
P90 E2EL (ms):                           5079.81   
---------------Speculative Decoding---------------
Acceptance rate (%):                     73.34     
Acceptance length:                       2.47      
Drafts:                                  1776      
Draft tokens:                            3552      
Accepted tokens:                         2605      
Per-position acceptance (%):
  Position 0:                            88.29     
  Position 1:                            58.39     
============================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify
Contributor

mergify Bot commented May 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added needs-rebase bug Something isn't working labels May 4, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an unconditional cudaMemsetAsync to zero the RadixRowState workspace before the persistent_topk_kernel launch, ensuring correct state initialization for CUDA graph replays and inter-CTA synchronization. Feedback identifies a potential race condition where the existing in-kernel initialization of the same state might conflict with the new host-side memset, and it is recommended to remove the redundant in-kernel reset to ensure correctness.

Comment thread csrc/topk.cu
Comment on lines +181 to +186
{
cudaError_t mz_err = cudaMemsetAsync(workspace.data_ptr<uint8_t>(), 0,
state_bytes, stream);
TORCH_CHECK(mz_err == cudaSuccess,
"row_states memset failed: ", cudaGetErrorString(mz_err));
}
Contributor


high

The addition of cudaMemsetAsync correctly ensures that the RadixRowState is zeroed and globally visible before the kernel launch, which is necessary for CUDA graph capture and correct inter-CTA synchronization. However, the persistent_topk_kernel in csrc/persistent_topk.cuh (lines 894-909) still contains an in-kernel initialization block that resets arrival_counter and other fields.

Because there is no global barrier between that in-kernel initialization and the subsequent work loop, a race condition exists: other CTAs (e.g., cta_in_group > 0) may start executing radix_topk, incrementing the arrival_counter, only to have cta_in_group == 0 reset it to zero. This will lead to synchronization failures or hangs.

Now that cudaMemsetAsync is used, the initialization block in csrc/persistent_topk.cuh should be removed to ensure correctness and avoid this race condition.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye force-pushed the bug/clear_persistent_topk_scratchpad branch from aee0c46 to 2e4b843 Compare May 4, 2026 21:15
@mergify mergify Bot removed the needs-rebase label May 4, 2026
@zyongye zyongye added the deepseek Related to DeepSeek models label May 4, 2026
@zyongye zyongye linked an issue May 4, 2026 that may be closed by this pull request
1 task
@zyongye zyongye added the ready ONLY add when PR is ready to merge/full CI is needed label May 4, 2026
@zyongye
Member Author

zyongye commented May 4, 2026

@claude review


@claude claude Bot left a comment


LGTM — minimal, well-justified fix to ensure the workspace memset is captured as a graph node regardless of capture-time max_seq_len.


Overview

Single-line change in csrc/topk.cu that removes a host-side if (needs_cooperative) guard around cudaMemsetAsync. The memset zeroes the per-group RadixRowState workspace before kernel launch. The guard caused the memset to be skipped at CUDA-graph capture time when capture-time max_seq_len <= RADIX_THRESHOLD, so the zeroing was never recorded into the captured graph, and stale workspace state could persist across replays if the runtime data later took the radix path.

Security risks

None. This is a CUDA kernel launch correctness fix in the DeepSeek V3 sparse attention indexer. No auth, crypto, network I/O, deserialization, or user-input handling is touched.

Level of scrutiny

Low. The change is mechanical (drop a conditional, keep the memset body), the rationale is fully captured in the new comment, the cost is negligible (~3 KB per group, sub-microsecond), and the author included gsm8k accuracy and serving benchmark numbers showing correct behavior on B300 MTP2.

Other factors

The Gemini bot flagged a potential in-kernel-init vs memset race, but that is stale: csrc/persistent_topk.cuh no longer has an in-kernel initialization block for arrival_counter (the comment at lines 890-895 explicitly notes the init was moved host-side for exactly this happens-before reason, and a grep for arrival_counter shows only red_release/wait_ge uses inside the kernel). Mergify also flagged conflicts, but rebasing is the author's responsibility and orthogonal to review correctness.

@zyongye zyongye enabled auto-merge (squash) May 6, 2026 02:53
Comment thread csrc/topk.cu
Comment on lines +158 to +170
// Issued UNCONDITIONALLY so the memset is captured as its own node in
// the cudagraph (a separate cudaMemsetAsync node, sequenced before the
// persistent_topk_kernel launch on the same stream). The previous
// host-side guard `if (needs_cooperative)` was evaluated at capture time;
// when capture-time max_seq_len <= RADIX_THRESHOLD (always true under
// FULL_DECODE_ONLY with max_model_len < 32 K) the memset would NOT be
// captured, leaving the workspace state to accumulate across replays.
// That's a latent correctness bug if the runtime data ever takes the
// radix path, and removes one variable while debugging hangs in the
// decode/medium paths.
//
// Cost is sub-microsecond: state_bytes = num_groups * sizeof(RadixRowState)
// is ~3 KB per group, ~100 KB for the largest grids on this hardware.
Collaborator


I don't think we need quite so much context in the comment.

Suggested change
// Issued UNCONDITIONALLY so the memset is captured as its own node in
// the cudagraph (a separate cudaMemsetAsync node, sequenced before the
// persistent_topk_kernel launch on the same stream). The previous
// host-side guard `if (needs_cooperative)` was evaluated at capture time;
// when capture-time max_seq_len <= RADIX_THRESHOLD (always true under
// FULL_DECODE_ONLY with max_model_len < 32 K) the memset would NOT be
// captured, leaving the workspace state to accumulate across replays.
// That's a latent correctness bug if the runtime data ever takes the
// radix path, and removes one variable while debugging hangs in the
// decode/medium paths.
//
// Cost is sub-microsecond: state_bytes = num_groups * sizeof(RadixRowState)
// is ~3 KB per group, ~100 KB for the largest grids on this hardware.
// Issued UNCONDITIONALLY so the memset is captured as its own node in
// the cudagraph (a separate cudaMemsetAsync node, sequenced before the
// persistent_topk_kernel launch on the same stream).

@ywang96 ywang96 disabled auto-merge May 6, 2026 23:17
@ywang96 ywang96 merged commit 80d5e7d into vllm-project:main May 6, 2026
149 of 152 checks passed

Labels

bug Something isn't working deepseek Related to DeepSeek models ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: h200 deepseekv4 pro mtp

3 participants