[Platform] Refactor Platform attention backend selection to avoid breakpoint for OOT platform#30212
Conversation
Code Review
This pull request refactors the attention backend selection mechanism by introducing an AttentionSelectorConfig NamedTuple. This new configuration object encapsulates various attention parameters such as head size, data type, KV cache data type, block size, MLA usage, sink token presence, sparse attention usage, and attention type. The get_attn_backend and _cached_get_attn_backend functions in vllm/attention/selector.py are updated to create and pass this single config object instead of multiple individual arguments. This change propagates throughout the platform-specific attention backend selection logic in vllm/platforms/cpu.py, vllm/platforms/cuda.py, and vllm/platforms/interface.py, where relevant methods like get_attn_backend_cls and get_valid_backends are modified to accept and utilize the AttentionSelectorConfig object, simplifying their signatures and improving parameter management. Additionally, logging for attention configurations is updated to use the __repr__ method of the new config object.
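For orientation, here is a minimal sketch of such a config object. The field names are taken from the AttentionSelectorConfig repr that appears in an error log later in this thread; the types and defaults are assumptions, not vLLM's actual definition:

```python
from typing import NamedTuple, Optional

import torch


class AttentionSelectorConfig(NamedTuple):
    # Field names match the repr shown in the error log below;
    # types and defaults are assumed for illustration only.
    head_size: int
    dtype: torch.dtype
    kv_cache_dtype: str = "auto"
    block_size: Optional[int] = None
    use_mla: bool = False
    has_sink: bool = False
    use_sparse: bool = False
    use_mm_prefix: bool = False
    attn_type: str = "decoder"


cfg = AttentionSelectorConfig(head_size=64, dtype=torch.bfloat16)
```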
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
/gemini review
Code Review
This pull request refactors the attention backend selection by introducing AttentionSelectorConfig to encapsulate the configuration parameters. This is a great improvement as it simplifies the function signatures in platform-specific modules and makes the interface more stable for out-of-tree platforms. The changes are applied consistently across all relevant files. I've found one minor issue with an incomplete __repr__ implementation which could affect debugging.
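To illustrate the `__repr__` point: for a NamedTuple, a repr that iterates over `_fields` cannot silently omit a field. A minimal sketch with an abridged field list (this is not the PR's actual code):

```python
from typing import NamedTuple


class AttentionSelectorConfig(NamedTuple):
    head_size: int
    use_mla: bool = False
    # ...remaining fields as in the sketch above

    def __repr__(self) -> str:
        # Join every declared field so none can be left out of the output.
        fields = ", ".join(f"{name}={getattr(self, name)}" for name in self._fields)
        return f"AttentionSelectorConfig({fields})"


print(AttentionSelectorConfig(head_size=64))
# -> AttentionSelectorConfig(head_size=64, use_mla=False)
```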
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Hi @Isotr0py, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
DarkLight1337 left a comment
LGTM, thanks for cleaning this up! cc @tjtanaa
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
head_size,
dtype,
kv_cache_dtype,
None,
Just found that block_size is set to None here to bypass attention backend validation for state space models; otherwise the validation fails:
[2025-12-15T14:12:44Z] models/language/generation/test_hybrid.py::test_models[5-64-hmellor/tiny-random-BambaForCausalLM] The fast path for Bamba will be used when running the model on a GPU
...
[2025-12-15T14:13:30Z] (EngineCore_DP0 pid=2108) ERROR 12-15 06:13:30 [core.py:866] ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=64, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=48, use_mla=False, has_sink=False, use_sparse=False, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN: [block_size not supported], FLASHINFER: [block_size not supported], TRITON_ATTN: [block_size not supported], FLEX_ATTENTION: [block_size not supported]}.
Perhaps we need to update the FlashAttention backend's get_supported_kernel_block_sizes for state space models? @tdoublep
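For clarity, a hypothetical sketch of why passing None works, assuming the validator treats an unset block_size as "skip the check" (an illustration of the behavior described above, not vLLM's actual validation code):

```python
def block_size_supported(block_size, supported_block_sizes):
    # An unset block_size bypasses the kernel block-size check entirely,
    # which is why the state-space-model call site above passes None.
    if block_size is None:
        return True
    return block_size in supported_block_sizes


assert block_size_supported(None, [16, 32, 64])      # bypassed
assert not block_size_supported(48, [16, 32, 64])    # the failure in the log
```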
[Platform] Refactor Platform attention backend selection to avoid breakpoint for OOT platform (vllm-project#30212)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Joachim Studnia <joachim@mistral.ai>
…os_emb (#725)
Fix for vllm-project/vllm#30212 + cherry pick #724
Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai>
### What this PR does / why we need it?
Upstream vLLM PR vllm-project/vllm#30212 refactored the attention backend selection interface. This PR adapts vllm-ascend's get_attn_backend_cls to align with the new upstream standard, ensuring compatibility and reducing maintenance overhead.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Co-author: leo-pony <nengjunma@outlook.com>
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: zxwang <1476209578@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Purpose
Currently, many individual arguments are passed to each platform's get_attn_backend_cls, while not all platforms use all of them. It also easily breaks OOT platforms whenever an attention feature introduces a new argument like use_mla or use_sink. This PR wraps these arguments into a single AttentionSelectorConfig, so that platforms can use them on demand and no longer need to update the interface for each feature update (a before/after sketch follows this section).
Test Plan
Test Result
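To make the Purpose concrete, a hypothetical before/after of an out-of-tree platform hook (the class names and backend paths are made up, and the real vLLM signatures may differ):

```python
# Before: a positional-argument interface; every new feature flag upstream
# (use_mla, use_sink, ...) changes the signature and breaks OOT overrides.
class MyOOTPlatformBefore:
    @classmethod
    def get_attn_backend_cls(cls, selected_backend, head_size, dtype,
                             kv_cache_dtype, block_size, use_mla):
        ...


# After: a single config object; each platform reads only the fields it
# needs, and new fields no longer change the method signature.
class MyOOTPlatformAfter:
    @classmethod
    def get_attn_backend_cls(cls, selected_backend, attn_selector_config):
        if attn_selector_config.use_mla:
            return "my_oot_platform.attention.MLABackend"
        return "my_oot_platform.attention.DefaultBackend"
```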