Temporary disable persistent topk #41442

Merged
ywang96 merged 1 commit into vllm-project:releases/v0.20.1 from zyongye:persistent_topk_disable
May 1, 2026
Conversation

@zyongye (Member) commented May 1, 2026

No description provided.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>

@claude (bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request modifies the sparse_attn_indexer to remove the 1024 token size from the persistent top-k optimization on CUDA platforms. Feedback suggests that if the goal is to disable this optimization due to stability or correctness issues, it should likely be disabled for all supported sizes (512 and 2048) rather than just 1024 to ensure consistency with the PR's objective.

  topk_indices = topk_indices_buffer[:num_padded_tokens, :topk_tokens]

- if current_platform.is_cuda() and topk_tokens in (512, 1024, 2048):
+ if current_platform.is_cuda() and topk_tokens in (512, 2048):
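For context, the gate above can be sketched as a standalone predicate. This is a hypothetical, self-contained version for illustration only: `PERSISTENT_TOPK_SIZES` and `use_persistent_topk` are invented names, and vLLM's real code inlines this check in `sparse_attn_indexer` using `current_platform.is_cuda()`.

```python
# Hypothetical sketch of the size gate after this PR: 1024 is no longer
# routed to the persistent top-k kernel; only 512 and 2048 remain.
PERSISTENT_TOPK_SIZES = frozenset({512, 2048})


def use_persistent_topk(is_cuda: bool, topk_tokens: int) -> bool:
    """Return True when the persistent top-k kernel should handle this size."""
    return is_cuda and topk_tokens in PERSISTENT_TOPK_SIZES


print(use_persistent_topk(True, 1024))  # False: the 1024 case now falls
                                        # through to the generic top-k path
print(use_persistent_topk(True, 2048))  # True: still on the fast path
```

Any size outside the set, and any non-CUDA platform, falls through to the generic top-k implementation.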
Contributor


Severity: high

The PR title 'Temporary disable persistent topk' suggests an intent to disable the persistent topk optimization entirely. However, the current implementation only removes the 1024 case, leaving it enabled for 512 and 2048. If the kernel is being disabled due to a general issue (e.g., stability or correctness), it should likely be disabled for all supported sizes to ensure the workaround is effective across all configurations.

@ywang96 ywang96 added this to the v0.20.1 milestone May 1, 2026
@ywang96 ywang96 merged commit 05ebca5 into vllm-project:releases/v0.20.1 May 1, 2026
6 of 7 checks passed
@khluu khluu mentioned this pull request May 1, 2026
khluu added a commit that referenced this pull request May 2, 2026
zixi-qi added a commit to zixi-qi/vllm that referenced this pull request May 4, 2026
Keep `topk_tokens == 1024` on the persistent_topk path on Blackwell
(SM10x), but disable it on Hopper and other CUDA archs so the original
revert (vllm-project#41442) behavior is preserved there.

Co-authored-by: Claude <noreply@anthropic.com>
zixi-qi added a commit to zixi-qi/vllm that referenced this pull request May 4, 2026
This reverts commit a4debbd.

Signed-off-by: zixi-qi <zixi@inferact.ai>
zixi-qi added a commit to zixi-qi/vllm that referenced this pull request May 4, 2026
Keep `topk_tokens == 1024` on the persistent_topk path on Blackwell
(SM10x), but disable it on Hopper and other CUDA archs so the original
revert (vllm-project#41442) behavior is preserved there.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: zixi-qi <zixi@inferact.ai>
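The follow-up commits above re-enable 1024 only on Blackwell while preserving this PR's revert elsewhere. A minimal sketch of that arch gate, assuming the check keys off the CUDA compute-capability major version (Hopper is SM90, Blackwell is SM10x); `allow_persistent_topk` and `sm_major` are hypothetical names, not vLLM's actual API:

```python
# Hypothetical sketch of the arch-gated dispatch from the follow-up
# commits: keep topk_tokens == 1024 on the persistent path only on
# Blackwell (SM10x); 512 and 2048 stay enabled on all CUDA archs.
def allow_persistent_topk(sm_major: int, topk_tokens: int) -> bool:
    """Decide per-arch whether a size may use the persistent top-k kernel."""
    if topk_tokens == 1024:
        return sm_major == 10          # Blackwell (SM10x) only
    return topk_tokens in (512, 2048)  # unchanged for other sizes


print(allow_persistent_topk(10, 1024))  # True: Blackwell keeps 1024
print(allow_persistent_topk(9, 1024))   # False: Hopper keeps the revert
```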

2 participants