Conversation


@tianleiwu (Contributor) commented Jul 10, 2025

Description

Update Flash Attention to support head_sink for smooth softmax in GQA.

Changes:

  • Update flash attention to support head_sink
  • Add test_gqa.py to test it
  • Remove test_gqa_cuda.py

Note: The memory efficient attention change will be in a separate PR.
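
For context, here is a minimal PyTorch sketch of the smooth-softmax-with-sink computation this targets, assuming head_sink supplies one learnable logit per head that joins the softmax normalization but whose probability mass is discarded (names and shapes below are illustrative, not the actual kernel or test code):

```python
import torch


def smooth_softmax_with_head_sink(scores: torch.Tensor, head_sink: torch.Tensor) -> torch.Tensor:
    """Reference sketch: softmax over attention scores plus a per-head sink logit.

    scores:    (batch, num_heads, q_len, kv_len) raw attention logits
    head_sink: (num_heads,) learnable sink logit per head
    """
    # Broadcast the sink logit to one extra "virtual key" column per query row.
    sink = head_sink.view(1, -1, 1, 1).expand(*scores.shape[:3], 1)
    logits = torch.cat([scores, sink], dim=-1)
    probs = torch.softmax(logits, dim=-1)
    # Drop the sink column: it only absorbs probability mass in the denominator.
    return probs[..., :-1]
```

With the sink in the denominator, a query row whose real keys are all masked still normalizes to finite weights (zero on the real keys) instead of producing NaNs.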

Motivation and Context

#25269

@tianleiwu marked this pull request as draft July 10, 2025 18:26

@github-actions bot left a comment


You can commit the suggested changes from lintrunner.


# Triton-based implementation for CUDA
def rotary_embedding_cuda(*args, **kwargs):
    from rotary_flash import apply_rotary_emb

Check notice

Code scanning / lintrunner

RUFF/PLC0415 (Note, test)

import should be at the top-level of a file.
See https://docs.astral.sh/ruff/rules/import-outside-top-level
            return rotary_embedding_cuda(x, cos, sin, seqlen_offsets=pos, interleaved=interleaved)
        except ImportError:
            print("WARNING: Triton-based rotary embedding not found. Falling back to PyTorch version.")
            use_cuda_triton = False

Check notice

Code scanning / CodeQL

Unused local variable (Note, test)

Variable use_cuda_triton is not used.

Copilot Autofix

AI 7 months ago

To fix the issue, the unused assignment to use_cuda_triton on line 194 should be removed. This will eliminate the redundant code while preserving the function's behavior. Since the variable is already used earlier in the function (line 188), its presence is still meaningful for the initial conditional logic. No additional changes are required.


Suggested changeset 1
onnxruntime/test/python/transformers/test_gqa.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/onnxruntime/test/python/transformers/test_gqa.py b/onnxruntime/test/python/transformers/test_gqa.py
--- a/onnxruntime/test/python/transformers/test_gqa.py
+++ b/onnxruntime/test/python/transformers/test_gqa.py
@@ -193,3 +193,2 @@
             print("WARNING: Triton-based rotary embedding not found. Falling back to PyTorch version.")
-            use_cuda_triton = False
 
EOF
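As a sketch of one possible restructuring that would address both the lint finding and the unused-variable finding, assuming the flag is only consumed (not reassigned) elsewhere in test_gqa.py:

```python
# Guarded import at module level, per RUFF/PLC0415, setting the fallback flag once
# instead of reassigning it inside the function's except block.
try:
    from rotary_flash import apply_rotary_emb  # Triton-based rotary embedding, used by the CUDA path

    use_cuda_triton = True
except ImportError:
    print("WARNING: Triton-based rotary embedding not found. Falling back to PyTorch version.")
    use_cuda_triton = False
```

Whether this fits depends on how the module is laid out; the autofix above takes the narrower route of simply dropping the dead assignment.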

    # Fill NaNs with 0
    if window_size[0] >= 0 or window_size[1] >= 0:
        attention = attention.masked_fill(torch.all(local_mask, dim=-1, keepdim=True), 0.0)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable (Error, test)

Local variable 'local_mask' may be used before it is initialized.

Copilot Autofix

AI 7 months ago

To fix the issue, we need to ensure that local_mask is always initialized before it is used. The best approach is to initialize local_mask to None at the start of the function. Then, before using local_mask on line 748, we can check if it is None and handle that case appropriately. This ensures that the variable is always defined, regardless of the conditions.


Suggested changeset 1
onnxruntime/test/python/transformers/test_gqa.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/onnxruntime/test/python/transformers/test_gqa.py b/onnxruntime/test/python/transformers/test_gqa.py
--- a/onnxruntime/test/python/transformers/test_gqa.py
+++ b/onnxruntime/test/python/transformers/test_gqa.py
@@ -730,2 +730,3 @@
 
+    local_mask = None
     if window_size[0] >= 0 or window_size[1] >= 0:
@@ -746,3 +747,3 @@
     # Fill NaNs with 0
-    if window_size[0] >= 0 or window_size[1] >= 0:
+    if window_size[0] >= 0 or window_size[1] >= 0 and local_mask is not None:
         attention = attention.masked_fill(torch.all(local_mask, dim=-1, keepdim=True), 0.0)
EOF
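One nuance in the suggested condition: Python's `and` binds tighter than `or`, so `window_size[0] >= 0 or window_size[1] >= 0 and local_mask is not None` parses as `window_size[0] >= 0 or (window_size[1] >= 0 and local_mask is not None)`, and the None check does not guard the `window_size[0]` branch. A sketch of the fully guarded form (same variables as the patch above) uses explicit parentheses:

```python
# Fill NaNs with 0 only when a sliding-window mask was actually built.
if (window_size[0] >= 0 or window_size[1] >= 0) and local_mask is not None:
    attention = attention.masked_fill(torch.all(local_mask, dim=-1, keepdim=True), 0.0)
```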
@tianleiwu closed this Jul 11, 2025