[Attn,KV-cache] Use per-head scales in the attention selector#34281
Conversation
Code Review
This pull request refactors the check for per-head attention quantization scales support. The changes introduce a mechanism to filter attention backends early during selection based on this capability. This is achieved by adding a requires_per_head_quant_scales parameter that is propagated down to the attention backend selector. The AttentionBackend class is updated with a supports_per_head_quant_scales method to facilitate this check, with FlashAttentionBackend correctly implementing it based on the FlashAttention version. The previous runtime assertion is removed, enabling earlier failure for unsupported configurations, which improves user experience. The changes are well-structured and correctly implemented.
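Based on the summary above, the capability check might be sketched like this. The class and method names follow the summary; the actual vLLM implementation may differ, and the FlashAttention-version cutoff is an assumption for illustration:

```python
class AttentionBackend:
    """Base attention backend interface (simplified sketch)."""

    @classmethod
    def supports_per_head_quant_scales(cls) -> bool:
        # Conservative default: a backend must explicitly opt in to
        # per-head attention quantization scales.
        return False


class FlashAttentionBackend(AttentionBackend):
    # Assumption for illustration: FlashAttention 3 supports per-head
    # quantization scales, while earlier versions do not.
    fa_version: int = 3

    @classmethod
    def supports_per_head_quant_scales(cls) -> bool:
        return cls.fa_version >= 3
```

The selector can then call `supports_per_head_quant_scales()` on each candidate backend class and drop those that return `False`, before any model weights are loaded.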
Force-pushed from b3cf85b to 656470c
MatthewBonanni
left a comment
LGTM, maybe we could add a case to the attention selector test to verify this is working?
Tests added, let me know if this is what you had in mind @MatthewBonanni
Hi @eldarkurtic, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
MatthewBonanni
left a comment
Thanks for adding the test! Just one comment
with (
    set_current_vllm_config(vllm_config),
    patch("vllm.platforms.current_platform", CudaPlatform()),
    patch(supports_attr, return_value=supports_per_head),
By patching this, I think the test isn't actually exercising the backend support; it'll always pass as long as supports_per_head matches should_succeed in the test case. Can you get rid of this patch?
    patch(supports_attr, return_value=supports_per_head),
],
)
def test_per_head_quant_scales_backend_selection(
    backend_name: str, supports_per_head: bool, should_succeed: bool
see below
Suggested change:
-    backend_name: str, supports_per_head: bool, should_succeed: bool
+    backend_name: str, should_succeed: bool
@MatthewBonanni great point, thanks a lot! I wanted to use that to distinguish between FA2 and FA3, but just found out that I can simply pass
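The reviewer's point can be illustrated with a sketch: rather than patching `supports_per_head_quant_scales` (which makes the test tautological), drive the real capability path, for example by selecting the FlashAttention version per case. These are hypothetical stand-ins for the real vLLM test and backend code, with an assumed FA3-only capability:

```python
class FlashAttentionBackend:
    """Hypothetical stand-in for the real backend class."""

    def __init__(self, fa_version: int):
        self.fa_version = fa_version

    def supports_per_head_quant_scales(self) -> bool:
        # Assumption: FA3 supports per-head scales, FA2 does not.
        return self.fa_version >= 3


def check_selection(fa_version: int, requires_per_head: bool) -> bool:
    """Return True if backend selection would succeed for this case."""
    backend = FlashAttentionBackend(fa_version)
    return (not requires_per_head) or backend.supports_per_head_quant_scales()


# Parametrized cases: (fa_version, should_succeed), per-head scales required.
# The expected outcome now follows from the real capability method, not a patch.
for fa_version, should_succeed in [(3, True), (2, False)]:
    assert check_selection(fa_version, requires_per_head=True) == should_succeed
```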
Force-pushed from 06b6a84 to 0ec4a22
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 0ec4a22 to a027512
Hi @eldarkurtic, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Your Name <you@example.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
As requested by @MatthewBonanni and @LucasWilkinson in #30141, attention backends should be filtered during backend selection based on whether they support per-head attention quantization scales.
This enables early failure when a user attempts to load a model that requires per-head scales but no compatible attention backend is available.