[Attn,KV-cache] Use per-head scales in the attention selector#34281
Conversation
Code Review
This pull request refactors the check for per-head attention quantization scales support. The changes introduce a mechanism to filter attention backends early during selection based on this capability. This is achieved by adding a requires_per_head_quant_scales parameter that is propagated down to the attention backend selector. The AttentionBackend class is updated with a supports_per_head_quant_scales method to facilitate this check, with FlashAttentionBackend correctly implementing it based on the FlashAttention version. The previous runtime assertion is removed, enabling earlier failure for unsupported configurations, which improves user experience. The changes are well-structured and correctly implemented.
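Based on the summary above, the capability check might be sketched like this. The class and method names follow the summary; the actual vLLM implementation may differ, and the FlashAttention-version cutoff is an assumption for illustration:

```python
class AttentionBackend:
    """Base attention backend interface (simplified sketch)."""

    @classmethod
    def supports_per_head_quant_scales(cls) -> bool:
        # Conservative default: a backend must explicitly opt in to
        # per-head attention quantization scales.
        return False


class FlashAttentionBackend(AttentionBackend):
    # Assumption for illustration: FlashAttention 3 supports per-head
    # quantization scales, while earlier versions do not.
    fa_version: int = 3

    @classmethod
    def supports_per_head_quant_scales(cls) -> bool:
        return cls.fa_version >= 3
```

The selector can then call `supports_per_head_quant_scales()` on each candidate backend class and drop those that return `False`, before any model weights are loaded.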
Force-pushed from b3cf85b to 656470c
MatthewBonanni
left a comment
LGTM, maybe we could add a case to the attention selector test to verify this is working?
Tests added, let me know if this is what you had in mind @MatthewBonanni
Hi @eldarkurtic, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
MatthewBonanni
left a comment
Thanks for adding the test! Just one comment
with (
    set_current_vllm_config(vllm_config),
    patch("vllm.platforms.current_platform", CudaPlatform()),
    patch(supports_attr, return_value=supports_per_head),
By patching this, I think the test isn't actually exercising the backend support; it'll always pass as long as supports_per_head matches should_succeed in the test case. Can you get rid of this patch?
    patch(supports_attr, return_value=supports_per_head),
],
)
def test_per_head_quant_scales_backend_selection(
    backend_name: str, supports_per_head: bool, should_succeed: bool
see below
Suggested change:
-    backend_name: str, supports_per_head: bool, should_succeed: bool
+    backend_name: str, should_succeed: bool
@MatthewBonanni great point, thanks a lot! I wanted to use that to distinguish between FA2 and FA3, but just found out that I can simply pass
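The reviewer's point can be illustrated with a sketch: rather than patching `supports_per_head_quant_scales` (which makes the test tautological), drive the real capability path, for example by selecting the FlashAttention version per case. These are hypothetical stand-ins for the real vLLM test and backend code, with an assumed FA3-only capability:

```python
class FlashAttentionBackend:
    """Hypothetical stand-in for the real backend class."""

    def __init__(self, fa_version: int):
        self.fa_version = fa_version

    def supports_per_head_quant_scales(self) -> bool:
        # Assumption: FA3 supports per-head scales, FA2 does not.
        return self.fa_version >= 3


def check_selection(fa_version: int, requires_per_head: bool) -> bool:
    """Return True if backend selection would succeed for this case."""
    backend = FlashAttentionBackend(fa_version)
    return (not requires_per_head) or backend.supports_per_head_quant_scales()


# Parametrized cases: (fa_version, should_succeed), per-head scales required.
# The expected outcome now follows from the real capability method, not a patch.
for fa_version, should_succeed in [(3, True), (2, False)]:
    assert check_selection(fa_version, requires_per_head=True) == should_succeed
```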
Force-pushed from 06b6a84 to 0ec4a22
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 0ec4a22 to a027512
Hi @eldarkurtic, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Your Name <you@example.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
As requested by @MatthewBonanni and @LucasWilkinson in #30141, attention backends should be filtered during backend selection based on whether they support per-head attention quantization scales.
This enables early failure when a user attempts to load a model that requires per-head scales but no compatible attention backend is available.