[webgpu] support smooth softmax for non-FA GQA implementation #25285
Conversation
Pull Request Overview
Adds support for smooth softmax, attention bias, and a per-head “sink” value in the non-FlashAttention path of GroupQueryAttention.
- Introduce three new optional inputs (`position_ids`, `attention_bias`, `head_sink`) in both the WebGPU and CPU GQA kernels
- Wire `attention_bias` and `head_sink` through `ComputeInternal`, `ApplyAttention`, and the softmax shader
- Update program constructors and shader generation to respect the `use_smooth_softmax`, `has_seqlen_k`, and `has_head_sink` flags (a reference sketch of the softmax math follows this list)
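For orientation, here is a minimal standalone sketch of the softmax variants being wired up, assuming the common formulation in which smooth softmax adds an implicit zero logit to the denominator and `head_sink` supplies a learned per-head value in its place; this is only an illustration of the math, not the ORT kernel code, and `SoftmaxRow` is a hypothetical helper.

```cpp
// Reference sketch (illustrative only, not ORT kernel code).
// Assumption: smooth softmax adds an implicit zero logit to the denominator,
// and a per-head sink value replaces that implicit zero when head_sink is present.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> SoftmaxRow(const std::vector<float>& scores,   // non-empty row of logits
                              bool use_smooth_softmax,
                              const float* head_sink /* nullptr if absent */) {
  const bool has_extra = use_smooth_softmax || head_sink != nullptr;
  const float extra_logit = head_sink != nullptr ? *head_sink : 0.0f;

  // Stabilize with a max shift; the sink/zero logit participates in the max too.
  float max_val = *std::max_element(scores.begin(), scores.end());
  if (has_extra) max_val = std::max(max_val, extra_logit);

  std::vector<float> probs(scores.size());
  float denom = has_extra ? std::exp(extra_logit - max_val) : 0.0f;  // sink term
  for (std::size_t i = 0; i < scores.size(); ++i) {
    probs[i] = std::exp(scores[i] - max_val);
    denom += probs[i];
  }
  for (float& p : probs) {
    p /= denom;  // with a sink term, the row probabilities sum to less than 1
  }
  return probs;
}
```

With a sink term in the denominator, the per-row attention weights sum to less than 1, which is the behavior the head sink relies on.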
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc | Added new inputs and custom-input checks; passed `attention_bias` and `head_sink` into `ComputeInternal` and `ApplyAttention` |
| onnxruntime/contrib_ops/webgpu/bert/flash_attention.cc | Noted TODO for smooth softmax / `head_sink` support in FlashAttention |
| onnxruntime/contrib_ops/webgpu/bert/attention_common.h | Extended the `ApplyAttention` signature to include `head_sink` |
| onnxruntime/contrib_ops/webgpu/bert/attention.h | Modified program constructors to take the `use_smooth_softmax`, `has_seqlen_k`, and `has_head_sink` flags |
| onnxruntime/contrib_ops/webgpu/bert/attention.cc | Enhanced `InPlaceSoftmaxProgram` generation to handle the new flags in the shader |
| onnxruntime/contrib_ops/cpu/bert/group_query_attention_helper.h | Added validation logic for the new `head_sink` tensor |
| onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc | Loaded the `head_sink` input and invoked validation in the CPU kernel |
Comments suppressed due to low confidence (3)
onnxruntime/contrib_ops/webgpu/bert/attention.cc:226
- The new `smooth_softmax` and `head_sink` branches in the shader code introduce distinct behavior; please add unit or integration tests to cover both code paths in the WebGPU and CPU implementations.
`Status InPlaceSoftmaxProgram::GenerateShaderCode(ShaderHelper& shader) const {`
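A natural home for such coverage would be ORT's existing GQA test suites; as a sketch only, the standalone check below shows the kind of property a test could assert for the smooth-softmax path (the function name and tolerance are made up for illustration).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Standalone sketch of the check (not an ORT unit test): with identical scores,
// the smooth-softmax probabilities must sum to strictly less than 1, because the
// implicit sink logit absorbs part of the probability mass.
void CheckSmoothSoftmaxLeavesMassForSink() {
  const std::vector<float> scores = {0.5f, 1.5f, -0.25f};

  float standard_denom = 0.0f;
  for (float s : scores) standard_denom += std::exp(s);
  const float smooth_denom = standard_denom + 1.0f;  // + exp(0) for the implicit sink logit

  float standard_sum = 0.0f;
  float smooth_sum = 0.0f;
  for (float s : scores) {
    standard_sum += std::exp(s) / standard_denom;
    smooth_sum += std::exp(s) / smooth_denom;
  }

  assert(std::abs(standard_sum - 1.0f) < 1e-5f);  // standard softmax normalizes to 1
  assert(smooth_sum < 1.0f);                      // smooth softmax leaves mass for the sink
}
```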
onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc:155
- The operator schema and user-facing documentation need updating to expose the new optional inputs (`position_ids`, `attention_bias`, and `head_sink`) and explain their semantics.
`const Tensor* position_ids = context.Input<Tensor>(9);  // TODO: support sliding window`
onnxruntime/contrib_ops/webgpu/bert/attention.h:72
- [nitpick] The term `head_sink` may be unfamiliar to new readers; consider renaming it to something more descriptive or adding a comment in the header to clarify its intended effect on the softmax.
`InPlaceSoftmaxProgram(int work_group_size, int components, bool use_smooth_softmax, bool has_seqlen_k, bool has_head_sink)`
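To make the flags concrete, here is a hedged sketch of how they could gate the generated shader text; it does not use ORT's `ShaderHelper` API, and the WGSL variable names (`x`, `sum`, `max_value`, `head_sink`, `head_idx`, `seq_len`) are hypothetical placeholders, not the names used in the PR.

```cpp
#include <sstream>
#include <string>

// Illustration only: the real code generates WGSL through ORT's ShaderHelper.
// This sketch assumes earlier shader code already computed max_value and
// accumulated sum = sum_i exp(x[i] - max_value); the epilogue adds the sink
// term to the denominator and normalizes in place.
std::string SoftmaxEpilogueWgsl(bool use_smooth_softmax, bool has_head_sink) {
  std::ostringstream wgsl;
  if (has_head_sink) {
    // Per-head sink: its exp() joins the denominator but produces no output weight.
    wgsl << "sum += exp(head_sink[head_idx] - max_value);\n";
  } else if (use_smooth_softmax) {
    // Plain smooth softmax: an implicit zero logit, i.e. exp(0 - max_value).
    wgsl << "sum += exp(-max_value);\n";
  }
  wgsl << "for (var i = 0u; i < seq_len; i++) {\n"
          "  x[i] = exp(x[i] - max_value) / sum;\n"
          "}\n";
  return wgsl.str();
}
```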
Description

Support smooth softmax for the non-FA GQA implementation.

This change depends on:
- microsoft#25269

Work items:
- [x] support smooth softmax
- [x] support bias
- [x] support head sink (per-head smooth softmax)

The following will not be included in this PR:
- support for FlashAttention
- support sliding window
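For readers unfamiliar with the terms, the formulas below spell out one common reading of "smooth softmax" and "head sink" (an interpretation, not text taken from this PR): with logits $x_j$, a stabilizing shift $m$, and a per-head sink value $s_h$,

$$
\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}, \qquad
\mathrm{smooth}(x)_i = \frac{e^{x_i - m}}{e^{-m} + \sum_j e^{x_j - m}}, \qquad
\mathrm{sink}(x)_i = \frac{e^{x_i - m}}{e^{s_h - m} + \sum_j e^{x_j - m}}.
$$

Under this reading, the smooth variant is simply the head-sink variant with $s_h = 0$ for every head.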