Temporary disable persistent topk for Hopper #41605
ywang96 merged 2 commits into vllm-project:releases/v0.20.1
This reverts commit a4debbd.
Signed-off-by: zixi-qi <zixi@inferact.ai>
Keep `topk_tokens == 1024` on the persistent_topk path on Blackwell (SM10x), but disable it on Hopper and other CUDA archs so the original revert (vllm-project#41442) behavior is preserved there.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: zixi-qi <zixi@inferact.ai>
1e042d7 to d92dd26
@claude review
Straightforward release-branch backport of the existing Blackwell gating pattern from main (d92dd26); safe to land.
Extended reasoning...
Overview
This PR modifies a single conditional in `vllm/model_executor/layers/sparse_attn_indexer.py` (5 lines added, 1 removed) that gates whether the persistent topk kernel is used. After the change, `topk_tokens == 1024` only takes the persistent path on Blackwell (`is_device_capability_family(100)`); Hopper and other non-Blackwell CUDA devices fall back to the per-row decode path. This is a backport of the same gating that already lives on main (commit d92dd26 "Gate persistent topk for topk_tokens=1024 to Blackwell only").
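For readers without the diff at hand, the dispatch described above can be sketched roughly as follows. This is an illustrative sketch, not the code in `sparse_attn_indexer.py`: the function name `use_persistent_topk` and the `is_blackwell` parameter are hypothetical, and treating 512 and 2048 as still eligible for the persistent path is an assumption read off the `topk_tokens in (512, 1024, 2048)` gate mentioned in the PR summary.

```python
def use_persistent_topk(topk_tokens: int, is_blackwell: bool) -> bool:
    """Hypothetical helper: decide whether the persistent topk path is taken.

    Not the vLLM implementation; a standalone sketch of the gating logic
    described in the review.
    """
    # Assumption: 512 and 2048 remain on the persistent path on all archs.
    if topk_tokens in (512, 2048):
        return True
    # Stated in the PR: 1024 takes the persistent path only on Blackwell;
    # Hopper and other CUDA archs fall back to the per-row decode path.
    if topk_tokens == 1024:
        return is_blackwell
    # Any other value never used the persistent path in this sketch.
    return False
```

In the real code the Blackwell check would come from the platform query (`is_device_capability_family(100)`) rather than a boolean argument; passing it in here just keeps the sketch runnable without a GPU.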
Security risks
None. The change is a pure dispatch gate based on device compute capability — no auth, no user-input handling, no crypto, no IO. Worst case is a perf regression on Blackwell or a missed hang fix on Hopper.
Level of scrutiny
Low. The PR is a release-branch revert-of-revert restoring an upstream-validated pattern. `is_device_capability_family` is verified to exist at `vllm/platforms/interface.py:348` and correctly checks the major version of the compute capability (100 → 10.x). The author also reports concrete validation: MTP=1 no longer hangs on Hopper for DeepSeek v4 Pro, with gsm8k flexible-extract 0.9522 / strict-match 0.9530.
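The "100 → 10.x" mapping can be illustrated with a small standalone function. This is a sketch under assumptions, not the code at `vllm/platforms/interface.py`: the name `capability_in_family` is hypothetical, and it assumes a compute capability is encoded as `major * 10 + minor` (so SM 9.0 is 90 and SM 10.3 is 103) and that a family matches when the values agree after integer division by 10.

```python
def capability_in_family(major: int, minor: int, family: int) -> bool:
    """Hypothetical check: does compute capability major.minor belong to `family`?

    Assumes the integer encoding major * 10 + minor; two capabilities are in
    the same family when their encodings share the same major part.
    """
    encoded = major * 10 + minor  # e.g. 10.3 -> 103, 9.0 -> 90
    return encoded // 10 == family // 10


# Blackwell (SM 10.x) matches family 100; Hopper (SM 9.0) does not.
blackwell_matches = capability_in_family(10, 0, 100)
hopper_matches = capability_in_family(9, 0, 100)
```

Under these assumptions any 10.x device is treated as Blackwell for gating purposes, while Hopper (9.0) and other major versions take the fallback path.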
Other factors
The diff is small and self-contained, the bug hunting system found no issues, and the equivalent change has already been merged to main. The author has explicitly noted that AI assistance was used and they reviewed every changed line.
Summary
releases/v0.20.1, which had reverted #41442 ("Temporary disable persistent topk").
`topk_tokens in (512, 1024, 2048)` gate in `vllm/model_executor/layers/sparse_attn_indexer.py` so the persistent topk path is re-disabled for `topk_tokens == 1024` on this release branch.
Test plan
AI assistance was used to prepare this revert PR (Claude Code). The change is a pure revert of a single-line gate; the submitter has reviewed every changed line.
Co-authored-by: Claude <noreply@anthropic.com>