[ROCm] Validate block_size for explicitly selected attention backends #36846
gshtras merged 4 commits into vllm-project:main
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request modifies the get_attn_backend_cls method in vllm/platforms/rocm.py to validate the block_size for explicitly selected attention backends. A new check is added to verify that the selected backend supports the specified block_size before this parameter is stripped from the configuration. This fixes a bug where this validation was being bypassed. The change is targeted and does not affect the automatic backend selection path. I have reviewed the changes and found no issues.
vllm/platforms/rocm.py
                f"{backend_class.get_supported_kernel_block_sizes()}."
            )

        attn_selector_config = attn_selector_config._replace(block_size=None)
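The shape of the check discussed here can be sketched in isolation. This is a minimal illustration, not vLLM's actual code: `AttnSelectorConfig`, `FakeBackend`, and `validate_and_strip` are hypothetical stand-ins; only `get_supported_kernel_block_sizes()` and the `_replace(block_size=None)` strip mirror names from the diff above.

```python
from typing import NamedTuple, Optional


class AttnSelectorConfig(NamedTuple):
    # Hypothetical stand-in for the selector config passed around in vLLM.
    backend_name: Optional[str]
    block_size: Optional[int]


class FakeBackend:
    """Illustrative stand-in for an attention backend class."""

    @staticmethod
    def get_supported_kernel_block_sizes():
        return [16, 32]


def validate_and_strip(config, backend=FakeBackend):
    # Validate an explicitly requested block_size *before* it is stripped,
    # which is the ordering fix this PR makes.
    supported = backend.get_supported_kernel_block_sizes()
    if config.block_size is not None and config.block_size not in supported:
        raise ValueError(
            f"Backend does not support block_size={config.block_size}; "
            f"supported sizes: {supported}."
        )
    # Strip block_size only after validation, mirroring the _replace() call
    # in the diff above.
    return config._replace(block_size=None)


cfg = AttnSelectorConfig(backend_name="ROCM_ATTN", block_size=16)
print(validate_and_strip(cfg))  # block_size is cleared after passing the check
```

An unsupported size (e.g. `block_size=7` here) now raises before the information is lost, instead of being silently discarded.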
@AndreasKaratzas We should also guard this block size in the auto-selection path, because there is a use case where the user does not specify a backend but does pass --block-size explicitly.
#36274 looks more like a hotpatch.
We should look into supporting the correct size through get_supported_kernel_block_sizes of the attention backend class.
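The suggestion above can be sketched as guarding both paths through `get_supported_kernel_block_sizes()`: an explicitly selected backend rejects an unsupported --block-size, while auto-selection skips backends that cannot serve it. All class and function names below are hypothetical illustrations, not vLLM's actual selection code.

```python
class TritonBackend:
    @staticmethod
    def get_supported_kernel_block_sizes():
        return [16, 32]


class RocmAttnBackend:
    @staticmethod
    def get_supported_kernel_block_sizes():
        return [64]


def select_backend(block_size, explicit_backend=None,
                   candidates=(TritonBackend, RocmAttnBackend)):
    if explicit_backend is not None:
        # Explicit path: an unsupported size is a user error, so raise.
        supported = explicit_backend.get_supported_kernel_block_sizes()
        if block_size is not None and block_size not in supported:
            raise ValueError(
                f"{explicit_backend.__name__} does not support "
                f"block_size={block_size}; supported: {supported}")
        return explicit_backend
    # Auto-selection path: fall through to the first backend that
    # supports the requested size instead of failing outright.
    for backend in candidates:
        if (block_size is None
                or block_size in backend.get_supported_kernel_block_sizes()):
            return backend
    raise ValueError(f"no backend supports block_size={block_size}")


print(select_backend(64).__name__)  # auto-selection skips Triton, picks RocmAttnBackend
```

The design point is that both paths consult the same per-backend capability list, rather than stripping `block_size` before anyone can check it.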
@AndreasKaratzas Let's try to fix get_supported_kernel_block_sizes of the attention backend in this PR. Otherwise Meta will encounter issues running Qwen3.5 after this PR.
@tjtanaa But Qwen works fine. Are you referring to:
vllm bench throughput --model Qwen/Qwen3-Next-80B-A3B-Instruct --kv-cache-dtype auto --load-format dummy --input-len 1024 --output-len 1024 --num-prompts 128 --tensor-parallel-size 8 --dtype float16
Throughput: 3.61 requests/s, 4157.86 total tokens/s, 461.98 output tokens/s
Total num prompt tokens: 131072
Total num output tokens: 16384

vllm bench throughput --model Qwen/Qwen3-Next-80B-A3B-Instruct --kv-cache-dtype auto --load-format dummy --input-len 1024 --output-len 1024 --num-prompts 128 --tensor-parallel-size 8 --dtype float16 --attention-backend ROCM_ATTN
Throughput: 8.01 requests/s, 9228.46 total tokens/s, 1025.38 output tokens/s
Total num prompt tokens: 131072
Total num output tokens: 16384

The above are with this PR.
Tried Qwen 3.5 today as well.
vllm bench throughput --model Qwen/Qwen3.5-35B-A3B --load-format dummy --input-len 1024 --output-len 1024 --num-prompts 128 --tensor-parallel-size 8 --dtype float16
Throughput: 5.53 requests/s, 6367.64 total tokens/s, 707.52 output tokens/s
Total num prompt tokens: 131072
Total num output tokens: 16384

vllm bench throughput --model Qwen/Qwen3.5-35B-A3B --load-format dummy --input-len 1024 --output-len 1024 --num-prompts 128 --tensor-parallel-size 8 --dtype float16 --attention-backend ROCM_ATTN
Throughput: 8.34 requests/s, 9602.49 total tokens/s, 1066.94 output tokens/s
Total num prompt tokens: 131072
Total num output tokens: 16384
There is a code refactor effort on the CUDA side (#35122) that probably fixed the previous issue.
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
cc @Rohan138
…vllm-project#36846) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
#36274 stripped block_size from attn_selector_config before backend validation in get_attn_backend_cls, which was correct for auto-selection (block_size may not be finalized at that point). However, this also bypassed block_size validation for explicitly user-selected backends, breaking the contract established in #36292.

- Adds a supports_block_size check for the selected-backend path, before the strip.
- block_size is still stripped there.
- test_mla_backend_selection[env_vars1-TRITON_MLA-1-None-True]

cc @kenroche