
Conversation


@tianleiwu tianleiwu commented Jul 17, 2025

Description

Update Flash Attention to support softmax sink in GQA.

Changes:

  • Update flash attention to support head_sink
  • Add test_gqa.py to test CUDA, and remove test_gqa_cuda.py.

Note that the sink is treated as already scaled, while the elements produced by the QK GEMMs are not. The sink value needs neither scaling nor softcap; it joins the softmax together with the scaled or soft-capped values (the combined softmax is written out after the list below). There are two ways to add the sink to the softmax:

  • One way is to [patch normalize_softmax_lse](https://github.com/microsoft/onnxruntime/blob/1cf1aa786f6e7f7e6abd6fba8b8aea2e7a43092c/onnxruntime/contrib_ops/cuda/bert/flash_attention/softmax.h#L143-L178) to use the sink to update the max and sum. Pro: the change is confined to one function. Con: the logic is a little complex, since row_max is unscaled while row_sum is scaled.
  • Another way is to change softmax_rescale_o to handle the sink directly in the first block of the online softmax, using an unscaled sink value. This is a robust way to keep the core algorithm consistent. Con: it requires changes in multiple places, and it is harder to combine with softcap.
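
For reference, both approaches compute the same quantity: a softmax in which the sink joins the denominator unscaled. With scale $t$ applied to the raw scores $s_j$ (the notation here is illustrative, not from the PR):

$$p_j = \frac{e^{t\,s_j}}{e^{\mathrm{sink}} + \sum_k e^{t\,s_k}}$$

The sink contributes only to the denominator, which is why merging it into the max/sum bookkeeping is sufficient.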

This PR uses the first approach for easy integration.
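
Below is a minimal NumPy sketch of the first approach under the formula above; the function and variable names are illustrative, not the actual kernel code in softmax.h:

```python
import numpy as np

def normalize_with_sink(o_acc, row_max, row_sum, sink):
    """Fold a sink logit into the final softmax normalization (approach 1).

    o_acc:   unnormalized output, sum_j exp(t*s_j - row_max) * v_j
    row_max: per-row running max of the scaled scores, shape (rows, 1)
    row_sum: per-row sum_j exp(t*s_j - row_max), shape (rows, 1)
    sink:    sink logit, used as-is (no scaling, no softcap)
    """
    new_max = np.maximum(row_max, sink)        # the sink may raise the max
    correction = np.exp(row_max - new_max)     # rebase old terms onto new_max
    new_sum = row_sum * correction + np.exp(sink - new_max)
    return o_acc * correction / new_sum

# Self-check against a direct softmax that includes the sink term.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 8))               # already-scaled QK scores
v = rng.normal(size=(8, 16))
sink = 0.5

row_max = scores.max(axis=-1, keepdims=True)
row_sum = np.exp(scores - row_max).sum(axis=-1, keepdims=True)
o_acc = np.exp(scores - row_max) @ v

expected = np.exp(scores) @ v / (np.exp(scores).sum(-1, keepdims=True) + np.exp(sink))
assert np.allclose(normalize_with_sink(o_acc, row_max, row_sum, sink), expected)
```

This mirrors what updating the max and sum in normalize_softmax_lse has to do: the row max may move once the sink is taken into account, so the running sum and the accumulated output are rescaled by the same correction factor before the final division.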

Note: the memory efficient attention change will be in a separate PR.

Motivation and Context

#25269

@tianleiwu tianleiwu merged commit e6c84b8 into main Jul 17, 2025
88 of 90 checks passed
@tianleiwu tianleiwu deleted the tlwu/gqa_head_sink_cuda branch July 17, 2025 22:52
qti-yuduo pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Aug 8, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025