[Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 #35923
Merged: DarkLight1337 merged 16 commits into vllm-project:main from JartX:bugfix/qwen35_rocm_attn on Mar 11, 2026
Commits
90e602f add 1056 block_size to triton fallback (JartX)
2dbcd4b precommit (JartX)
bd4b501 precommit doc (JartX)
6212617 Clarify comment on non-standard model block sizes (JartX)
9c161ae qwen3.5 27b (JartX)
5848396 qwen3.5 27b (JartX)
1bbc515 allow multiple of 16 via triton path (JartX)
c12fe6f Merge branch 'main' into bugfix/qwen35_rocm_attn (JartX)
cd8be20 control blocks (JartX)
bffc779 Merge branch 'main' into bugfix/qwen35_rocm_attn (JartX)
3e3f037 Merge remote-tracking branch 'upstream/main' into bugfix/qwen35_rocm_… (JartX)
c832a9a Merge branch 'main' into bugfix/qwen35_rocm_attn (JartX)
d987032 Merge branch 'main' into bugfix/qwen35_rocm_attn (JartX)
95ad792 Merge branch 'main' into bugfix/qwen35_rocm_attn (tjtanaa)
cc9db38 remove redundant code (JartX)
eb6d6f5 Merge branch 'main' into bugfix/qwen35_rocm_attn (DarkLight1337)
While this change to allow any block size that is a multiple of 16 is correct for supporting models like Qwen3.5, it introduces a potential failure for other models.

The dispatch logic in `do_kv_cache_update` (lines 450-480) uses `is_pow2` to decide whether to use the native C++ kernel or the Triton fallback. The native C++ kernel, as noted in the comments and confirmed in `csrc/rocm/attention.cu`, only supports block sizes of 16 and 32.

With this PR, a model using a block size that is a power of two but not 16 or 32 (e.g., 64) will be incorrectly routed to the native C++ kernel, which will then raise an error.

To fix this, the condition in `do_kv_cache_update` should be changed from `if is_pow2:` to `if block_size in (16, 32):`. This will ensure that only the explicitly supported block sizes are routed to the native kernel, and all others (including other powers of two) use the Triton fallback.
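To make the difference between the two conditions concrete, here is a minimal standalone sketch. It is not the vLLM code: `select_path_pow2`, `select_path_explicit`, and `NATIVE_BLOCK_SIZES` are hypothetical names used only for illustration, and the sketch assumes, as the comment above states, that the native kernel handles only block sizes 16 and 32.

```python
# Hypothetical illustration of the dispatch rule discussed above.
# None of these names come from vLLM; they only mirror the review comment.

# Assumption taken from the review comment: the native kernel supports 16 and 32 only.
NATIVE_BLOCK_SIZES = (16, 32)


def is_pow2(n: int) -> bool:
    """Return True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def select_path_pow2(block_size: int) -> str:
    """Power-of-two check: sends every power of two to the native kernel."""
    return "native" if is_pow2(block_size) else "triton"


def select_path_explicit(block_size: int) -> str:
    """Explicit membership check, as suggested in the review comment."""
    if block_size <= 0 or block_size % 16 != 0:
        raise ValueError(f"block_size must be a positive multiple of 16, got {block_size}")
    return "native" if block_size in NATIVE_BLOCK_SIZES else "triton"


if __name__ == "__main__":
    for bs in (16, 32, 64, 1056):
        print(bs, select_path_pow2(bs), select_path_explicit(bs))
```

In this sketch the two checks agree for 16, 32, and 1056, and differ only for powers of two outside the supported set such as 64, which is the case the comment flags.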