webgpu: support head_sink in flash attention#27410

Merged
guschmue merged 5 commits into main from gs/wgpu-fa-head-sink on Feb 25, 2026

Conversation

@guschmue
Contributor

This enables flash attention for gpt-oss, which requires head_sink support.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Feb 20, 2026
@guschmue guschmue marked this pull request as ready for review February 20, 2026 18:26

Copilot AI left a comment


Pull request overview

This PR enables flash attention support for head_sink in WebGPU, specifically to support gpt-oss models. The changes remove the restriction that prevented flash attention from being used when head_sink is present, and thread the head_sink parameter through the entire flash attention call chain.
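
For context (the PR itself doesn't restate the math, but this is the standard attention-sink formulation used by gpt-oss): head_sink supplies one learned logit $s_h$ per head that joins the softmax normalization without contributing a value vector. With attention logits $\ell_i$ (the scaled query–key dot products), the attention weights become

$$
m = \max\big(s_h,\ \max_i \ell_i\big), \qquad \alpha_i = \frac{e^{\ell_i - m}}{e^{s_h - m} + \sum_j e^{\ell_j - m}}.
$$

In a streaming flash-attention kernel this means the running max can be seeded with $s_h$ and the running sum with the sink's own term, which matches the shader changes listed below.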

Changes:

  • Removed the head_sink nullptr check that was blocking flash attention usage with head_sink
  • Added a head_sink parameter throughout the flash attention implementation, with a backward-compatible default value
  • Updated the WGSL shader templates to handle the head_sink logic for correct attention computation (see the sketch after this list)
  • Fixed a numerical precision issue with explicit f32 casting in the exponential calculations
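
A minimal WGSL sketch of the shader-side points, assuming hypothetical bindings (head_sink, qk, out_sum) and uniforms (num_heads, seq_len) rather than the template's actual identifiers: the running max is seeded from the per-head sink so the sink's exp term is carried through the online softmax, and exp arguments are kept in f32 so f16 builds don't lose precision.

```wgsl
// Hedged sketch of head_sink-aware online softmax; binding and uniform
// names are illustrative, not the template's actual identifiers.
enable f16;

@group(0) @binding(0) var<storage, read> head_sink : array<f32>;
@group(0) @binding(1) var<storage, read> qk : array<f16>;  // attention logits
@group(0) @binding(2) var<storage, read_write> out_sum : array<f32>;

struct Uniforms { num_heads : u32, seq_len : u32 }
@group(0) @binding(3) var<uniform> uniforms : Uniforms;

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let batch_head_idx = gid.x;
  let head_idx = batch_head_idx % uniforms.num_heads;

  // With a sink, the running max starts at the sink logit instead of -inf,
  // and the running sum starts at the sink's own softmax term, exp(0) = 1.
  var previous_max : f32 = head_sink[head_idx];
  var previous_sum : f32 = 1.0;

  for (var k = 0u; k < uniforms.seq_len; k++) {
    // Explicit f32 cast so exp() doesn't run on a low-precision f16 value.
    let logit = f32(qk[batch_head_idx * uniforms.seq_len + k]);
    let new_max = max(previous_max, logit);
    // Rescaling by exp(previous_max - new_max) keeps the sink's term (and
    // all earlier terms) correctly normalized as the max grows.
    previous_sum = previous_sum * exp(previous_max - new_max) + exp(logit - new_max);
    previous_max = new_max;
  }
  out_sum[batch_head_idx] = previous_sum;
}
```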

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc | Removed the head_sink nullptr check blocking flash attention; added the head_sink parameter to ApplyFlashAttention calls |
| onnxruntime/contrib_ops/webgpu/bert/flash_attention.h | Added the head_sink parameter to the ApplyFlashAttention signature and FlashAttentionDecodeSplitVxProgram; added a num_heads uniform variable |
| onnxruntime/contrib_ops/webgpu/bert/flash_attention.cc | Threaded the head_sink parameter through the flash attention implementation; added shader input and uniform handling for head_sink |
| onnxruntime/contrib_ops/webgpu/bert/flash_attention.wgsl.template | Added head_sink support with conditional initialization of previous_max; fixed numerical precision with explicit f32 casting |
| onnxruntime/contrib_ops/webgpu/bert/flash_attention_decode_split_vx.wgsl.template | Added head_sink support in the global max/sum calculation, using head_idx derived from batch_head_idx (see the sketch below) |
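
For the decode split-Vx pass, here is a hedged sketch of how the global max/sum calculation can fold the sink in exactly once per (batch, head); again, the binding names and the num_splits uniform are assumptions for illustration, not the template's actual code.

```wgsl
// Hedged sketch of the split-Vx global reduction with head_sink; names are
// illustrative. One invocation handles one flattened (batch, head) index.
@group(0) @binding(0) var<storage, read> head_sink : array<f32>;
@group(0) @binding(1) var<storage, read> split_max : array<f32>;  // per-split running max
@group(0) @binding(2) var<storage, read> split_sum : array<f32>;  // per-split running sum
@group(0) @binding(3) var<storage, read_write> out_sum : array<f32>;

struct Uniforms { num_heads : u32, num_splits : u32 }
@group(0) @binding(4) var<uniform> uniforms : Uniforms;

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let batch_head_idx = gid.x;
  // head_sink is indexed per head, so recover the head from the flattened
  // batch * num_heads + head index.
  let head_idx = batch_head_idx % uniforms.num_heads;
  let base = batch_head_idx * uniforms.num_splits;

  // The sink participates in the global max across all splits.
  var g_max = head_sink[head_idx];
  for (var s = 0u; s < uniforms.num_splits; s++) {
    g_max = max(g_max, split_max[base + s]);
  }
  // The sink contributes exp(sink - g_max) to the denominator, but no value.
  var g_sum = exp(head_sink[head_idx] - g_max);
  for (var s = 0u; s < uniforms.num_splits; s++) {
    g_sum += split_sum[base + s] * exp(split_max[base + s] - g_max);
  }
  out_sum[batch_head_idx] = g_sum;
}
```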


fs-eire previously approved these changes Feb 25, 2026
@guschmue guschmue enabled auto-merge (squash) February 25, 2026 00:16
@guschmue guschmue merged commit bb3866c into main Feb 25, 2026
94 of 95 checks passed
@guschmue guschmue deleted the gs/wgpu-fa-head-sink branch February 25, 2026 08:05
