Conversation

@sywangyi
Contributor

use the enable_gqa param in torch.nn.functional.scaled_dot_product_attention
GQA can be accelerated in torch.nn.functional.scaled_dot_product_attention: this PyTorch API offers an enable_gqa parameter. See https://docs.pytorch.org/docs/2.7/generated/torch.nn.functional.scaled_dot_product_attention.html#torch-nn-functional-scaled-dot-product-attention
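For context, a minimal sketch of the kwarg in isolation (shapes are illustrative; assumes a PyTorch build recent enough to expose enable_gqa):

```python
import torch
import torch.nn.functional as F

# 32 query heads share 8 key/value heads (GQA ratio 4); shapes are illustrative.
query = torch.randn(2, 32, 128, 64)
key = torch.randn(2, 8, 128, 64)
value = torch.randn(2, 8, 128, 64)

# Without enable_gqa=True, key/value would first have to be repeated to 32 heads
# before calling SDPA.
out = F.scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)
print(out.shape)  # torch.Size([2, 32, 128, 64])
```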

Signed-off-by: Wang, Yi A <[email protected]>
@liangan1

@LuFinch please help review this PR.

@sywangyi
Contributor Author

FAILED tests/models/nougat/test_image_processing_nougat.py::NougatImageProcessingTest::test_slow_fast_equivalence_batched - AssertionError: 0.005013074725866318 not less than or equal to 0.005. This failure has nothing to do with the PR; the failing test does not call SDPA attention.

@vasqu
Contributor

vasqu commented Jul 15, 2025

Please see #35235 (comment)

The enable_gqa kwarg is pretty restrictive and would need proper checks around it (version, mask) to ensure we do not fall back to the math kernel / use unsupported features of older torch.

Signed-off-by: Wang, Yi A <[email protected]>
@sywangyi
Contributor Author

sywangyi commented Jul 15, 2025

> Please see #35235 (comment)
>
> The enable_gqa kwarg is pretty restrictive and would need proper checks around it (version, mask) to ensure we do not fall back to the math kernel / use unsupported features of older torch.

Thanks for the review. I added a check so that GQA in SDPA is only enabled for non-CUDA devices.
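Roughly, the kind of guard being discussed looks like this; an illustrative sketch with assumed helper and flag names, not the exact code in the PR:

```python
import torch
from packaging import version

# Assumed version flag; transformers keeps similar private flags internally.
_TORCH_GREATER_OR_EQUAL_2_5 = version.parse(torch.__version__) >= version.parse("2.5")

def use_gqa_in_sdpa(attention_mask, key) -> bool:
    # Require a recent torch, and skip under torch.fx tracing (symbolic proxies).
    if not _TORCH_GREATER_OR_EQUAL_2_5 or isinstance(key, torch.fx.Proxy):
        return False
    # Conservative: only non-CUDA devices, and only when no explicit mask is passed,
    # since the CUDA flash kernel rejects a mask and SDPA would fall back to math.
    return key.device.type != "cuda" and attention_mask is None
```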

Signed-off-by: Wang, Yi A <[email protected]>
Contributor

@vasqu vasqu left a comment

Added some comments on the checks; we have to be a bit strict here, especially to avoid performance issues.

It would be very helpful to have a test that checks that we don't fall back to the math backend when calling SDPA (with grouped query).
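One way such a test could look, as a sketch only (assumes a CUDA device and the torch.nn.attention.sdpa_kernel context manager; this is not the test that was actually added):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def test_sdpa_gqa_does_not_need_math_backend():
    q = torch.randn(1, 16, 64, 64, device="cuda", dtype=torch.float16)
    k = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16)
    v = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16)
    # Only the flash backend is allowed inside this context; if SDPA needed the
    # math backend for grouped-query attention, the call would raise instead of
    # silently falling back.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
    assert out.shape == q.shape
```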

@vasqu
Contributor

vasqu commented Jul 16, 2025

And don't worry about the CI, the tests that failed were/are either flaky ones or not related to this PR.

@vasqu vasqu mentioned this pull request Jul 17, 2025
Signed-off-by: Wang, Yi A <[email protected]>
Contributor

@vasqu vasqu left a comment

Thank you, more or less nits left

cc @ArthurZucker @Cyrilvallez for core maintainer + attention

Signed-off-by: Wang, Yi A <[email protected]>
Member

@Cyrilvallez Cyrilvallez left a comment

Nice! Happy to use the native sdpa option whenever possible! Just a final comment

Member

@Cyrilvallez Cyrilvallez left a comment

Perfect! Thanks a lot for enabling this! 🤗

@Cyrilvallez Cyrilvallez merged commit 9323d08 into huggingface:main Jul 21, 2025
23 checks passed
@LuFinch

LuFinch commented Jul 22, 2025

It is true that the flash attention backend of PyTorch SDPA doesn't support an attention mask and that it falls back to the math backend when the user passes an attn_mask. Ref: https://github.com/pytorch/pytorch/blob/d984143a74e5e726e2be35f6531582aab45bcf4c/aten/src/ATen/native/transformers/sdp_utils_cpp.h#L259

However, this check should not be a condition for GQA. An attn_mask together with enable_gqa=True can still dispatch to optimized kernels in other SDPA backends, such as cudnn_attention, efficient_attention and the overrideable backends.

Could you remove the attention_mask is None condition from use_gqa_in_sdpa, or refine the code further?

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jul 22, 2025
use the enable_gqa param in torch.nn.functional.scaled_dot_product_attention (huggingface#39412)

* use the enable_gqa param in torch.nn.functional.scaled_dot_product_attention

Signed-off-by: Wang, Yi A <[email protected]>

* ci failure fix

Signed-off-by: Wang, Yi A <[email protected]>

* add check

Signed-off-by: Wang, Yi A <[email protected]>

* fix ci failure

Signed-off-by: Wang, Yi A <[email protected]>

* refine code, extend to cuda

Signed-off-by: Wang, Yi A <[email protected]>

* refine code

Signed-off-by: Wang, Yi A <[email protected]>

* fix review comments

Signed-off-by: Wang, Yi A <[email protected]>

* refine the PR

Signed-off-by: Wang, Yi A <[email protected]>

---------

Signed-off-by: Wang, Yi A <[email protected]>
Co-authored-by: Cyril Vallez <[email protected]>
@vasqu
Contributor

vasqu commented Jul 22, 2025

According to the docs (https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)

"Grouped Query Attention (GQA) is an experimental feature. It currently works only for Flash_attention and math kernel on CUDA tensor, and does not support Nested tensor."

That means either the docs are wrong or we only have the fa and math backend available when using the gqa kwargs. The memory efficient variant is not mentioned so it would need some code sample imo to show that it can work alongside. For cudnn, that's also a rather new and experimental backend so I'm not sure what the conditions are there; iirc it was also not turned on by default as backend option.
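A probe along these lines could serve as that sample (assumes a CUDA build; it only reports whether the memory-efficient backend accepts the combination rather than asserting that it does):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 16, 64, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16)
mask = torch.ones(1, 1, 64, 64, device="cuda", dtype=torch.bool).tril()

# Restrict SDPA to the memory-efficient backend only; if it cannot take an
# attn_mask together with enable_gqa=True, the call raises instead of silently
# falling back to the math backend.
try:
    with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, enable_gqa=True)
    print("efficient_attention handled attn_mask + enable_gqa:", out.shape)
except RuntimeError as err:
    print("efficient_attention rejected attn_mask + enable_gqa:", err)
```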

@liangan1

> According to the docs (https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
>
> "Grouped Query Attention (GQA) is an experimental feature. It currently works only for Flash_attention and math kernel on CUDA tensor, and does not support Nested tensor."
>
> That means either the docs are wrong or we only have the fa and math backend available when using the gqa kwargs. The memory efficient variant is not mentioned so it would need some code sample imo to show that it can work alongside. For cudnn, that's also a rather new and experimental backend so I'm not sure what the conditions are there; iirc it was also not turned on by default as backend option.

Yes, you are right, and the doc is also right for CUDA. The GQA optimization for XPU has been enabled since torch 2.8; maybe we can add a different path in the use_gqa_in_sdpa function for different devices? The recommended condition for XPU would be _is_torch_greater_or_equal_than_2_8 and not isinstance(key, torch.fx.Proxy).
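A sketch of what that per-device branching might look like; the XPU condition mirrors the one quoted above, while the torch 2.5 flag and the exact structure are assumptions, not the code that was merged:

```python
import torch
from packaging import version

_torch_version = version.parse(torch.__version__)
_is_torch_greater_or_equal_than_2_8 = _torch_version >= version.parse("2.8")
_is_torch_greater_or_equal_than_2_5 = _torch_version >= version.parse("2.5")

def use_gqa_in_sdpa(attention_mask, key) -> bool:
    # enable_gqa is not usable under torch.fx tracing.
    if isinstance(key, torch.fx.Proxy):
        return False
    if key.device.type == "xpu":
        # GQA in SDPA is optimized on XPU starting with torch 2.8.
        return _is_torch_greater_or_equal_than_2_8
    # Elsewhere, keep the stricter condition (recent torch, no explicit mask)
    # until the non-flash backends are verified to handle a mask with enable_gqa.
    return _is_torch_greater_or_equal_than_2_5 and attention_mask is None
```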

@vasqu
Contributor

vasqu commented Jul 22, 2025

Interestingly enough, I just checked the cudnn backend and it seems to work with the mask + gqa on first glance - would need to look deeper into this

> Yes, you are right, and the doc is also right for CUDA. The GQA optimization for XPU has been enabled since torch 2.8; maybe we can add a different path in the use_gqa_in_sdpa function for different devices? The recommended condition for XPU would be _is_torch_greater_or_equal_than_2_8 and not isinstance(key, torch.fx.Proxy).

That sounds good to me, can you open a PR? It's hard to keep up what's possible where :D
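For reference, the cudnn check mentioned above can be reproduced with a probe like this (a sketch assuming a CUDA build where the cudnn backend is available):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 16, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 4, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 4, 128, 64, device="cuda", dtype=torch.bfloat16)
mask = torch.ones(1, 1, 128, 128, device="cuda", dtype=torch.bool).tril()

# Allow only the cudnn backend; success means no math fallback was needed for
# an explicit mask combined with enable_gqa=True.
try:
    with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, enable_gqa=True)
    print("cudnn_attention handled attn_mask + enable_gqa:", out.shape)
except RuntimeError as err:
    print("cudnn_attention rejected attn_mask + enable_gqa:", err)
```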

@liangan1

Thanks. I suppose @sywangyi, who is the owner of this part, will submit a PR to fix it.

zaristei pushed commits to zaristei/transformers that referenced this pull request Sep 9, 2025
use the enable_gqa param in torch.nn.functional.scaled_dot_product_attention (huggingface#39412)
@sywangyi sywangyi deleted the yi_sdpa branch November 19, 2025 04:42