[Bugfix] Add multiple-of-16 block_size to Triton fallback in ROCm attention to support qwen3_5 #35923
Conversation
Signed-off-by: JartX <sagformas@epdcenter.es>
Code Review
This pull request correctly adds support for the Qwen3.5 model on ROCm by including its non-standard block size of 1056 in the RocmAttentionBackend. This change is a simple and effective fix for the reported issue, allowing the model to use the appropriate Triton kernel fallback. The implementation is correct and I have no further suggestions for improvement.
Hi @JartX, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: JartX <sagformas@epdcenter.es>
Documentation preview: https://vllm--35923.org.readthedocs.build/en/35923/
Signed-off-by: JartX <sagformas@epdcenter.es>
Force-pushed from c07fc35 to 6212617
Signed-off-by: JartX <sagformas@epdcenter.es>
/gemini review
Code Review
This pull request addresses a bug where Qwen3.5 models produced incorrect outputs on the ROCm backend. The fix correctly identifies that the non-standard block size was the issue and generalizes the supported block sizes for the ROCM_ATTN backend to any multiple of 16. This is a good change that improves robustness for future models. The corresponding documentation has also been updated. However, I've found a critical issue related to this change that could cause failures for other models.
def get_supported_kernel_block_sizes() -> list[int | MultipleOf]:
-    # ROCM paged attention kernel only supports block sizes 16 and 32
-    # However, The limitations in [16, 32] are reasonable for a native C++ kernel,
-    # but vLLM should allow support for non-standard sizes via the Triton path,
-    # as addressed in this PR: https://github.com/vllm-project/vllm/pull/31380,
-    # where the Triton kernel under rocm_atten does not support inference
-    # for a non-standard qwen3-next model with a block_size of 544.
-    # We have fixed the Triton kernel so that the standard model uses the original
-    # bit-addressing logic, while the non-standard model
-    # uses our optimized kernel logic.
-    return [16, 32, 544]
+    # ROCM paged attention native C++ kernel only supports block sizes 16 and 32
+    # due to shared memory (LDS) constraints on AMD GPUs.
+    # See csrc/rocm/attention.cu CALL_CUSTOM_LAUNCHER_BLK macro.
+    # However, vLLM allows support for any multiple of 16 via the Triton path.
+    # As addressed in PR: https://github.com/vllm-project/vllm/pull/31380,
+    # non-standard models (like qwen3-next with block_size 544, or qwen3_5
+    # with 784 and 1056) are dynamically routed to our optimized Triton kernel
+    # in `do_kv_cache_update`.
+    return [MultipleOf(16)]
While this change to allow any block size that is a multiple of 16 is correct for supporting models like Qwen3.5, it introduces a potential failure for other models.
The dispatch logic in do_kv_cache_update (lines 450-480) uses is_pow2 to decide whether to use the native C++ kernel or the Triton fallback. The native C++ kernel, as noted in the comments and confirmed in csrc/rocm/attention.cu, only supports block sizes of 16 and 32.
With this PR, a model using a block size that is a power of two but not 16 or 32 (e.g., 64) will be incorrectly routed to the native C++ kernel, which will then raise an error.
To fix this, the condition in do_kv_cache_update should be changed from if is_pow2: to if block_size in (16, 32):. This will ensure that only the explicitly supported block sizes are routed to the native kernel, and all others (including other powers of two) use the Triton fallback.
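The reviewer's suggested dispatch fix can be sketched as a standalone routine. This is a hedged illustration, not vLLM's actual code: the function name `choose_kernel` and the string return values are invented for the example; only the `block_size in (16, 32)` membership test mirrors the suggested change.

```python
# Illustrative sketch of the suggested dispatch fix (not vLLM's real code).
NATIVE_BLOCK_SIZES = (16, 32)  # only sizes the ROCm native C++ kernel supports


def choose_kernel(block_size: int) -> str:
    """Route a KV-cache update to the native C++ kernel or the Triton fallback."""
    if block_size % 16 != 0:
        raise ValueError(f"block_size {block_size} is not a multiple of 16")
    # Before the fix, an `is_pow2` check would wrongly send e.g. 64 to the
    # native kernel. An explicit membership test avoids that failure mode.
    if block_size in NATIVE_BLOCK_SIZES:
        return "native_cpp"
    return "triton_fallback"
```

Under this sketch, 64 (a power of two but unsupported by the C++ kernel) and 1056 (Qwen3.5's block size) both fall through to the Triton path, while 16 and 32 keep using the native kernel.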
This pull request has merge conflicts that must be resolved before it can be merged.
/gemini review
@tjtanaa the test failed with "no such container" in the AMD entrypoint
@Rohan138
@JartX can you rebase your branch? This test group should be green as of yesterday.
Signed-off-by: JartX <sagformas@epdcenter.es> Co-authored-by: akaratza <akaratza@amd.com>
Force-pushed from 4f16fcd to cd8be20
@AndreasKaratzas all tests passed :)
That's great :) Unfortunately, even though my tag says "member", my approval won't turn your PR green (I only have read permissions 😅). I have forwarded your PR to the right channels.
@AndreasKaratzas many thanks! Hahah :)
@tjtanaa Please check this out when you can :)
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: JartX <sagformas@epdcenter.es>
Head branch was pushed to by a user without write access
…ention to support qwen3_5 (vllm-project#35923) Signed-off-by: JartX <sagformas@epdcenter.es> Co-authored-by: akaratza <akaratza@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
This PR adds `MultipleOf(16)` to the list of supported kernel block sizes in `RocmAttentionBackend`.
When running Qwen3.5 models with the ROCM_ATTN backend, the model produces broken, nonsensical outputs (e.g., repeating exclamation marks like !!!!!!!!!!). This happens because Qwen3.5 uses a non-standard block size of 1056. Since this size was not explicitly permitted, the value_cache could not be routed through the optimized Triton kernel fallback (triton_reshape_and_cache_flash).
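To make the "multiple of 16" membership idea concrete, here is a minimal sketch. Note this is an assumption-laden stand-in: vLLM defines its own `MultipleOf` class, and the helper `is_supported` is invented for this example; only the modulo check reflects what the PR actually enables.

```python
# Stand-in for vLLM's MultipleOf marker; the real class may differ.
from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleOf:
    base: int

    def contains(self, n: int) -> bool:
        # A block size matches if it is a positive multiple of `base`.
        return n > 0 and n % self.base == 0


def is_supported(block_size: int, supported: list) -> bool:
    """Hypothetical helper: check a block size against a mixed list of
    exact sizes (int) and MultipleOf markers."""
    for entry in supported:
        if isinstance(entry, MultipleOf):
            if entry.contains(block_size):
                return True
        elif entry == block_size:
            return True
    return False
```

With `[MultipleOf(16)]` as the supported list, the non-standard sizes mentioned in this PR (544, 784, 1056) all pass, whereas they would have been rejected by the old explicit list `[16, 32, 544]` (except 544).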