Skip to content

Temporary disable persistent topk for Hopper#41605

Merged
ywang96 merged 2 commits intovllm-project:releases/v0.20.1from
zixi-qi:revert-persistent-topk-v0.20.1
May 4, 2026
Merged

Temporary disable persistent topk for Hopper#41605
ywang96 merged 2 commits intovllm-project:releases/v0.20.1from
zixi-qi:revert-persistent-topk-v0.20.1

Conversation

@zixi-qi
Copy link
Copy Markdown
Collaborator

@zixi-qi zixi-qi commented May 4, 2026

Summary

  • Reverts commit a4debbd on releases/v0.20.1, which had reverted Temporary disable persistent topk #41442 ("Temporary disable persistent topk").
  • Restores the topk_tokens in (512, 1024, 2048) gate in vllm/model_executor/layers/sparse_attn_indexer.py so the persistent topk path is re-disabled for topk_tokens == 1024 on this release branch.

Test plan

  • Verified MTP=1 no longer hangs on hopper for deepseek v4 pro and gsm8k score as expected

✓ task=gsm8k | flexible-extract=0.9522 | strict-match=0.9530


AI assistance was used to prepare this revert PR (Claude Code). The change is a pure revert of a single-line gate; the submitter has reviewed every changed line.

Co-authored-by: Claude noreply@anthropic.com

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the sparse_attn_indexer function in vllm/model_executor/layers/sparse_attn_indexer.py to remove 1024 from the supported topk_tokens values when executing on CUDA platforms. I have no feedback to provide.

zixi-qi and others added 2 commits May 4, 2026 04:46
This reverts commit a4debbd.

Signed-off-by: zixi-qi <zixi@inferact.ai>
Keep `topk_tokens == 1024` on the persistent_topk path on Blackwell
(SM10x), but disable it on Hopper and other CUDA archs so the original
revert (vllm-project#41442) behavior is preserved there.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: zixi-qi <zixi@inferact.ai>
@zixi-qi zixi-qi force-pushed the revert-persistent-topk-v0.20.1 branch from 1e042d7 to d92dd26 Compare May 4, 2026 04:46
@zixi-qi zixi-qi changed the title Revert "Revert \"Temporary disable persistent topk (#41442)\"" Temporary disable persistent topk for Hopper May 4, 2026
@zixi-qi zixi-qi marked this pull request as draft May 4, 2026 04:52
@zixi-qi
Copy link
Copy Markdown
Collaborator Author

zixi-qi commented May 4, 2026

@claude review

@zixi-qi zixi-qi marked this pull request as ready for review May 4, 2026 05:02
@zixi-qi zixi-qi requested review from ywang96 and zyongye May 4, 2026 05:03
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Straightforward release-branch backport of the existing Blackwell gating pattern from main (d92dd26); safe to land.

Extended reasoning...

Overview

This PR modifies a single conditional in vllm/model_executor/layers/sparse_attn_indexer.py (5 lines added, 1 removed) that gates whether the persistent topk kernel is used. After the change, topk_tokens == 1024 only takes the persistent path on Blackwell (is_device_capability_family(100)); Hopper and other non-Blackwell CUDA devices fall back to the per-row decode path. This is a backport of the same gating that already lives on main (commit d92dd26 "Gate persistent topk for topk_tokens=1024 to Blackwell only").

Security risks

None. The change is a pure dispatch gate based on device compute capability — no auth, no user-input handling, no crypto, no IO. Worst case is a perf regression on Blackwell or a missed hang fix on Hopper.

Level of scrutiny

Low. The PR is a release-branch revert-of-revert restoring an upstream-validated pattern. is_device_capability_family is verified to exist at vllm/platforms/interface.py:348 and correctly checks the major version of the compute capability (100 → 10.x). The author also reports concrete validation: MTP=1 no longer hangs on Hopper for DeepSeek v4 Pro, with gsm8k flexible-extract 0.9522 / strict-match 0.9530.

Other factors

The diff is small and self-contained, the bug hunting system found no issues, and the equivalent change has already been merged to main. The author has explicitly noted that AI assistance was used and they reviewed every changed line.

@ywang96 ywang96 merged commit 43a21e6 into vllm-project:releases/v0.20.1 May 4, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants