Temporary disable persistent topk #41442
ywang96 merged 1 commit into vllm-project:releases/v0.20.1
Conversation
Code Review
This pull request modifies the sparse_attn_indexer to remove the 1024 token size from the persistent top-k optimization on CUDA platforms. If the goal is to disable this optimization because of stability or correctness issues, it should likely be disabled for all supported sizes (512 and 2048 as well, not just 1024) so the workaround is consistent with the PR's objective.
```diff
 topk_indices = topk_indices_buffer[:num_padded_tokens, :topk_tokens]

-if current_platform.is_cuda() and topk_tokens in (512, 1024, 2048):
+if current_platform.is_cuda() and topk_tokens in (512, 2048):
```
The PR title 'Temporary disable persistent topk' suggests an intent to disable the persistent topk optimization entirely. However, the current implementation only removes the 1024 case, leaving it enabled for 512 and 2048. If the kernel is being disabled due to a general issue (e.g., stability or correctness), it should likely be disabled for all supported sizes to ensure the workaround is effective across all configurations.
This reverts commit a4debbd.

Signed-off-by: zixi-qi <zixi@inferact.ai>

Keep `topk_tokens == 1024` on the persistent_topk path on Blackwell (SM10x), but disable it on Hopper and other CUDA archs so the original revert (vllm-project#41442) behavior is preserved there.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: zixi-qi <zixi@inferact.ai>
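The commit message above describes an arch-gated dispatch: 1024 stays on the persistent top-k path only on Blackwell (SM 10.x), while 512 and 2048 remain enabled on all CUDA archs. A minimal sketch of that selection logic is below; `use_persistent_topk` and its `sm_major` parameter are illustrative stand-ins, not vLLM's actual API, which would query the device's compute capability via the platform layer.

```python
def use_persistent_topk(is_cuda: bool, sm_major: int, topk_tokens: int) -> bool:
    """Return True when the persistent top-k kernel should be taken.

    Hypothetical helper sketching the gating described in the commit
    message; sm_major is the CUDA compute-capability major version
    (9 = Hopper, 10 = Blackwell).
    """
    if not is_cuda:
        return False
    if topk_tokens in (512, 2048):
        # These sizes stay on the persistent path on every CUDA arch.
        return True
    # 1024 is kept only on Blackwell (SM 10.x); Hopper and other archs
    # fall back per the original revert (vllm-project#41442).
    return topk_tokens == 1024 and sm_major == 10

print(use_persistent_topk(True, 9, 1024))   # False: Hopper falls back
print(use_persistent_topk(True, 10, 1024))  # True: Blackwell keeps 1024
print(use_persistent_topk(True, 9, 512))    # True: 512 enabled everywhere
```

The design keeps the size check cheap and centralized, so reverting the Blackwell exception later only touches the final condition.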
No description provided.