webgpu support for LinearAttention and CausalConvWithState #27896
Conversation
| kv_num_heads_ = static_cast<int>(info.GetAttr<int64_t>("kv_num_heads")); | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
| const Tensor* past_state = context.Input(3); // optional | ||
| const Tensor* decay = context.Input(4); // optional | ||
| const Tensor* beta = context.Input(5); // optional |
There was a problem hiding this comment.
| const Tensor* past_state = context.Input(3); // optional | |
| const Tensor* decay = context.Input(4); // optional | |
| const Tensor* beta = context.Input(5); // optional | |
| const Tensor* past_state = context.Input(3); // optional | |
| const Tensor* decay = context.Input(4); // optional | |
| const Tensor* beta = context.Input(5); // optional |
| auto& query_shape = getInputShape(ctx, 0); | ||
| auto& value_shape = getInputShape(ctx, 2); | ||
| TensorShapeProto state_shape; | ||
| *state_shape.add_dim() = query_shape.dim(0); // B |
There was a problem hiding this comment.
| *state_shape.add_dim() = query_shape.dim(0); // B | |
| *state_shape.add_dim() = query_shape.dim(0); // B |
| } | ||
| })); | ||
|
|
||
|
|
There was a problem hiding this comment.
| #define WEBGPU_CONCAT_VERSIONED_KERNEL(start, end) \ | ||
| ONNX_OPERATOR_VERSIONED_KERNEL_EX( \ | ||
| Concat, \ | ||
| kOnnxDomain, \ | ||
| start, \ | ||
| end, \ | ||
| kWebGpuExecutionProvider, \ | ||
| (*KernelDefBuilder::Create()) \ |
There was a problem hiding this comment.
| #define WEBGPU_CONCAT_VERSIONED_KERNEL(start, end) \ | |
| ONNX_OPERATOR_VERSIONED_KERNEL_EX( \ | |
| Concat, \ | |
| kOnnxDomain, \ | |
| start, \ | |
| end, \ | |
| kWebGpuExecutionProvider, \ | |
| (*KernelDefBuilder::Create()) \ | |
| #define WEBGPU_CONCAT_VERSIONED_KERNEL(start, end) \ | |
| ONNX_OPERATOR_VERSIONED_KERNEL_EX( \ | |
| Concat, \ | |
| kOnnxDomain, \ | |
| start, \ | |
| end, \ | |
| kWebGpuExecutionProvider, \ | |
| (*KernelDefBuilder::Create()) \ |
|
|
||
| using namespace onnxruntime::test; | ||
|
|
||
|
|
There was a problem hiding this comment.
| const std::vector<float>* decay, | ||
| const std::vector<float>* beta, | ||
| std::vector<float>& output, | ||
| std::vector<float>& final_state) { |
There was a problem hiding this comment.
| std::vector<float>& final_state) { | |
| std::vector<float>& final_state) { | |
| int bht = batch_size * num_heads * seq_length; | |
| bool decay_broadcast_dk = (decay != nullptr && static_cast<int>(decay->size()) == bht); |
| int bht = batch_size * num_heads * seq_length; | ||
| bool decay_broadcast_dk = (decay != nullptr && static_cast<int>(decay->size()) == bht); | ||
|
|
||
| // State: (B, H, dk, dv) |
There was a problem hiding this comment.
| int bht = batch_size * num_heads * seq_length; | |
| bool decay_broadcast_dk = (decay != nullptr && static_cast<int>(decay->size()) == bht); | |
| // State: (B, H, dk, dv) | |
| // State: (B, H, dk, dv) |
|
|
||
| // Convert data from 4D (B,H,T,D) layout to 3D packed (B,T,H*D) layout | ||
| std::vector<float> PackBHTD_to_BTHD(const std::vector<float>& data_4d, | ||
| int B, int H, int T, int D) { |
There was a problem hiding this comment.
| int B, int H, int T, int D) { | |
| int B, int H, int T, int D) { |
|
|
||
| // Convert decay/beta from (B,H,T) layout to (B,T,H) layout | ||
| std::vector<float> TransposeBHT_to_BTH(const std::vector<float>& data, | ||
| int B, int H, int T) { |
There was a problem hiding this comment.
| int B, int H, int T) { | |
| int B, int H, int T) { |
There was a problem hiding this comment.
Pull request overview
Adds WebGPU execution provider coverage for new/updated LLM building blocks (notably LinearAttention and CausalConvWithState) and wires up several ONNX-domain LLM ops needed for Qwen3.5-style graphs, alongside new reference-based tests.
Changes:
- Add WebGPU contrib kernels for `LinearAttention` and `CausalConvWithState`, plus schema registration in the MS opset.
- Add WebGPU kernels for ONNX-domain `Attention`, `RotaryEmbedding`, and `RMSNormalization`, and update WebGPU kernel registrations for newer `Reshape`/`Transpose` opsets.
- Add extensive correctness tests (reference implementations) for `LinearAttention` and `CausalConvWithState`, plus an `int64` `Concat` test.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/tensor/concat_op_test.cc | Adds an int64 Concat test case. |
| onnxruntime/test/contrib_ops/linear_attention_op_test.cc | New reference-based test suite for LinearAttention across update rules and shapes. |
| onnxruntime/test/contrib_ops/causal_conv_with_state_op_test.cc | New reference-based test suite for CausalConvWithState (fp32/fp16, state continuity, etc.). |
| onnxruntime/core/providers/webgpu/webgpu_supported_types.h | Adds an additional supported-type list including int64/uint64 (currently unused). |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc | Registers new kernels and updates versioned kernel declarations/registrations for opset changes. |
| onnxruntime/core/providers/webgpu/tensor/transpose.cc | Updates WebGPU Transpose kernel registration for opset 23/24 split. |
| onnxruntime/core/providers/webgpu/tensor/reshape.cc | Updates WebGPU Reshape kernel registration for opset 21–25. |
| onnxruntime/core/providers/webgpu/tensor/concat.cc | Formatting-only macro alignment / namespace close fix. |
| onnxruntime/core/providers/webgpu/nn/rms_norm.h | Declares WebGPU RMSNorm kernel wrapper. |
| onnxruntime/core/providers/webgpu/nn/rms_norm.cc | Implements WebGPU RMSNormalization via LayerNormProgram in simplified mode. |
| onnxruntime/core/providers/webgpu/llm/rotary_embedding.h | Declares ONNX-domain WebGPU RotaryEmbedding kernel wrapper. |
| onnxruntime/core/providers/webgpu/llm/rotary_embedding.cc | Implements ONNX-domain RotaryEmbedding using existing contrib shader programs. |
| onnxruntime/core/providers/webgpu/llm/attention.h | Declares ONNX-domain WebGPU Attention kernel wrapper. |
| onnxruntime/core/providers/webgpu/llm/attention.cc | Implements ONNX-domain Attention on top of existing contrib WebGPU attention kernels. |
| onnxruntime/core/graph/contrib_ops/ms_opset.h | Registers new MS-domain schemas in opset v1 list. |
| onnxruntime/core/graph/contrib_ops/bert_defs.cc | Adds MS-domain schemas + shape inference for LinearAttention and CausalConvWithState. |
| onnxruntime/contrib_ops/webgpu/webgpu_contrib_kernels.cc | Registers WebGPU contrib kernels for LinearAttention and CausalConvWithState. |
| onnxruntime/contrib_ops/webgpu/bert/linear_attention.h | Declares LinearAttentionProgram and WebGPU kernel wrapper. |
| onnxruntime/contrib_ops/webgpu/bert/linear_attention.cc | Implements LinearAttention WGSL generation + kernel host-side validation/dispatch. |
| onnxruntime/contrib_ops/webgpu/bert/causal_conv_with_state.h | Declares CausalConvWithStateProgram and WebGPU kernel wrapper. |
| onnxruntime/contrib_ops/webgpu/bert/causal_conv_with_state.cc | Implements CausalConvWithState WGSL generation + kernel host-side validation/dispatch. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| TensorShapeVector inv_std_dev_dim; | ||
| for (size_t i = 0; i < x_shape.NumDimensions(); ++i) { | ||
| if (i < axis) { | ||
| inv_std_dev_dim.push_back(x_shape[i]); | ||
| } else { | ||
| inv_std_dev_dim.push_back(1); | ||
| } | ||
| } | ||
| TensorShape inv_std_dev_shape(inv_std_dev_dim); | ||
| auto* inv_std_dev = context.Output(1, inv_std_dev_shape); |
There was a problem hiding this comment.
inv_std_dev output is optional for RMSNormalization, but this code unconditionally calls context.Output(1, ...). If the node only has 1 output, this will be out-of-range and can crash/fail. Please guard with if (context.OutputCount() > 1) before requesting/producing output 1.
| TensorShapeVector inv_std_dev_dim; | |
| for (size_t i = 0; i < x_shape.NumDimensions(); ++i) { | |
| if (i < axis) { | |
| inv_std_dev_dim.push_back(x_shape[i]); | |
| } else { | |
| inv_std_dev_dim.push_back(1); | |
| } | |
| } | |
| TensorShape inv_std_dev_shape(inv_std_dev_dim); | |
| auto* inv_std_dev = context.Output(1, inv_std_dev_shape); | |
| Tensor* inv_std_dev = nullptr; | |
| if (context.OutputCount() > 1) { | |
| TensorShapeVector inv_std_dev_dim; | |
| for (size_t i = 0; i < x_shape.NumDimensions(); ++i) { | |
| if (i < axis) { | |
| inv_std_dev_dim.push_back(x_shape[i]); | |
| } else { | |
| inv_std_dev_dim.push_back(1); | |
| } | |
| } | |
| TensorShape inv_std_dev_shape(inv_std_dev_dim); | |
| inv_std_dev = context.Output(1, inv_std_dev_shape); | |
| } |
| update_rule_ == LinearAttentionUpdateRule::GatedDelta); | ||
| ORT_RETURN_IF(needs_decay && decay == nullptr, "decay input required for gated/gated_delta update rules"); | ||
| ORT_RETURN_IF(needs_beta && beta == nullptr, "beta input required for delta/gated_delta update rules"); | ||
|
|
There was a problem hiding this comment.
The implementation doesn't validate decay/beta shapes beyond nullptr checks, but the shader assumes specific packed layouts. In particular, the schema allows beta to be (B, T, 1) but the shader reads it as (B, T, H), which would cause out-of-bounds reads. Please either implement broadcasting for the (B,T,1) case (and validate dimensions), or explicitly reject unsupported shapes with a clear error.
| // Validate decay/beta shapes. The shader expects (B, T, H) where H == num_heads. | |
| if (needs_decay && decay != nullptr) { | |
| const auto& decay_shape = decay->Shape(); | |
| ORT_RETURN_IF(decay_shape.NumDimensions() != 3, | |
| "decay must have shape (batch_size, seq_length, num_heads); ", | |
| "broadcasted form (B, T, 1) is not currently supported"); | |
| ORT_RETURN_IF(static_cast<int64_t>(batch_size) != decay_shape[0] || | |
| static_cast<int64_t>(seq_length) != decay_shape[1], | |
| "decay shape mismatch: expected batch_size=", batch_size, | |
| " and seq_length=", seq_length, | |
| " but got (", decay_shape[0], ", ", decay_shape[1], ", ", decay_shape[2], ")"); | |
| ORT_RETURN_IF(static_cast<int64_t>(num_heads) != decay_shape[2], | |
| "decay last dimension must equal num_heads (", num_heads, | |
| "); broadcasted form (B, T, 1) is not currently supported, got ", | |
| decay_shape[2]); | |
| } | |
| if (needs_beta && beta != nullptr) { | |
| const auto& beta_shape = beta->Shape(); | |
| ORT_RETURN_IF(beta_shape.NumDimensions() != 3, | |
| "beta must have shape (batch_size, seq_length, num_heads); ", | |
| "broadcasted form (B, T, 1) is not currently supported"); | |
| ORT_RETURN_IF(static_cast<int64_t>(batch_size) != beta_shape[0] || | |
| static_cast<int64_t>(seq_length) != beta_shape[1], | |
| "beta shape mismatch: expected batch_size=", batch_size, | |
| " and seq_length=", seq_length, | |
| " but got (", beta_shape[0], ", ", beta_shape[1], ", ", beta_shape[2], ")"); | |
| ORT_RETURN_IF(static_cast<int64_t>(num_heads) != beta_shape[2], | |
| "beta last dimension must equal num_heads (", num_heads, | |
| "); broadcasted form (B, T, 1) is not currently supported, got ", | |
| beta_shape[2]); | |
| } |
| // Allocate outputs | ||
| // Output 0: (B, D, L) | ||
| Tensor* output = context.Output(0, input_shape); | ||
|
|
||
| // Output 1: present_state (B, D, K-1) | ||
| std::vector<int64_t> state_dims{batch_size, channels, state_length}; | ||
| Tensor* present_state = context.Output(1, TensorShape(state_dims)); | ||
|
|
||
| if (input_length == 0) { | ||
| return Status::OK(); | ||
| } |
There was a problem hiding this comment.
When input_length == 0, the code returns early after allocating present_state but never writes/initializes it. present_state should still be well-defined (typically equal to past_state if provided, otherwise zeros) even for an empty input. Please handle the zero-length case by populating present_state appropriately before returning.
| 1. Input parsing: Handles both 3D (B, S, hidden) and 4D (B, N, S, H) input formats per the ONNX spec | ||
| 2. MHA vs GQA: Detects whether q_num_heads == kv_num_heads (MHA) or q_num_heads > kv_num_heads (GQA) and configures WebgpuAttentionParameters accordingly | ||
| 3. Flash attention: Used when available (no output_qk needed, subgroups feature present, no bias) | ||
| 4. 3D→BNSH conversion: For 3D inputs, uses TransferBSDToBNSH to convert to the BNSH format expected by the attention kernels | ||
| 5. 4D output: Computes in BSD layout (as the shader outputs), then transposes back to BNSH for 4D output format | ||
| 6. Attention mask: Reshapes 2D/3D masks to 4D for the shader's broadcasting logic; boolean masks return NOT_SUPPORTED | ||
|
|
||
| Remaining failures fall into known limitation categories: | ||
| Boolean masks (2) — not yet supported on WebGPU | ||
| SoftCap (2) — not yet wired through to the shader | ||
| GQA output (3) — output stride mismatch for GQA with different kv_num_heads | ||
| QK matmul output (5) — the output_qk output needs additional work | ||
| Present without past (2) — present key/value output without past input needs handling | ||
| is_causal (1) — causal masking interaction | ||
|
|
||
| [ PASSED ] 24 tests. | ||
| [ FAILED ] 15 tests, listed below: | ||
| [ FAILED ] AttentionTest.Attention4DAttnMaskBoolAllFalse | ||
| [ FAILED ] AttentionTest.Attention4DAttnMaskBoolAllFalseDecodeWithPast | ||
| [ FAILED ] AttentionTest.Attention4DSoftCap | ||
| [ FAILED ] AttentionTest.Attention4DSoftCapFloat16 | ||
| [ FAILED ] AttentionTest.Attention4DAttnMaskBool | ||
| [ FAILED ] AttentionTest.Attention4DAttnIsCausal | ||
| [ FAILED ] AttentionTest.Attention3DGqaAttn | ||
| [ FAILED ] AttentionTest.Attention3DGqaSelfAttnCausal | ||
| [ FAILED ] AttentionTest.Attention4DGqaAttnMask | ||
| [ FAILED ] AttentionTest.Attention4DWithPastAndPresentQkMatmul | ||
| [ FAILED ] AttentionTest.Attention3DWithPastAndPresentQkMatmul | ||
| [ FAILED ] AttentionTest.Attention4DWithMask3DPastAndPresentQkMatmul | ||
| [ FAILED ] AttentionTest.Attention4DWithMask3DPastAndPresentQkMatmulCausal | ||
| [ FAILED ] AttentionTest.TestAttention4DWithPastAndPresentQkMatmulBias4DMaskCausal | ||
| [ FAILED ] AttentionTest.AttentionNoPastWithPresentOutput |
There was a problem hiding this comment.
This large block comment includes a snapshot of passing/failing test names and counts. It will go stale quickly and makes the production kernel harder to maintain. Consider moving this information to the PR description or a tracking issue (and keep only a brief comment describing current limitations in code).
| 1. Input parsing: Handles both 3D (B, S, hidden) and 4D (B, N, S, H) input formats per the ONNX spec | |
| 2. MHA vs GQA: Detects whether q_num_heads == kv_num_heads (MHA) or q_num_heads > kv_num_heads (GQA) and configures WebgpuAttentionParameters accordingly | |
| 3. Flash attention: Used when available (no output_qk needed, subgroups feature present, no bias) | |
| 4. 3D→BNSH conversion: For 3D inputs, uses TransferBSDToBNSH to convert to the BNSH format expected by the attention kernels | |
| 5. 4D output: Computes in BSD layout (as the shader outputs), then transposes back to BNSH for 4D output format | |
| 6. Attention mask: Reshapes 2D/3D masks to 4D for the shader's broadcasting logic; boolean masks return NOT_SUPPORTED | |
| Remaining failures fall into known limitation categories: | |
| Boolean masks (2) — not yet supported on WebGPU | |
| SoftCap (2) — not yet wired through to the shader | |
| GQA output (3) — output stride mismatch for GQA with different kv_num_heads | |
| QK matmul output (5) — the output_qk output needs additional work | |
| Present without past (2) — present key/value output without past input needs handling | |
| is_causal (1) — causal masking interaction | |
| [ PASSED ] 24 tests. | |
| [ FAILED ] 15 tests, listed below: | |
| [ FAILED ] AttentionTest.Attention4DAttnMaskBoolAllFalse | |
| [ FAILED ] AttentionTest.Attention4DAttnMaskBoolAllFalseDecodeWithPast | |
| [ FAILED ] AttentionTest.Attention4DSoftCap | |
| [ FAILED ] AttentionTest.Attention4DSoftCapFloat16 | |
| [ FAILED ] AttentionTest.Attention4DAttnMaskBool | |
| [ FAILED ] AttentionTest.Attention4DAttnIsCausal | |
| [ FAILED ] AttentionTest.Attention3DGqaAttn | |
| [ FAILED ] AttentionTest.Attention3DGqaSelfAttnCausal | |
| [ FAILED ] AttentionTest.Attention4DGqaAttnMask | |
| [ FAILED ] AttentionTest.Attention4DWithPastAndPresentQkMatmul | |
| [ FAILED ] AttentionTest.Attention3DWithPastAndPresentQkMatmul | |
| [ FAILED ] AttentionTest.Attention4DWithMask3DPastAndPresentQkMatmul | |
| [ FAILED ] AttentionTest.Attention4DWithMask3DPastAndPresentQkMatmulCausal | |
| [ FAILED ] AttentionTest.TestAttention4DWithPastAndPresentQkMatmulBias4DMaskCausal | |
| [ FAILED ] AttentionTest.AttentionNoPastWithPresentOutput | |
| 1. Input parsing: Handles both 3D (B, S, hidden) and 4D (B, N, S, H) input formats per the ONNX spec. | |
| 2. MHA vs GQA: Detects whether q_num_heads == kv_num_heads (MHA) or q_num_heads > kv_num_heads (GQA) and configures | |
| WebgpuAttentionParameters accordingly. | |
| 3. Flash attention: Used when available (no output_qk needed, subgroups feature present, no bias). | |
| 4. 3D→BNSH conversion: For 3D inputs, uses TransferBSDToBNSH to convert to the BNSH format expected by the attention kernels. | |
| 5. 4D output: Computes in BSD layout (as the shader outputs), then transposes back to BNSH for 4D output format. | |
| 6. Attention mask: Reshapes 2D/3D masks to 4D for the shader's broadcasting logic; boolean masks currently return | |
| NOT_SUPPORTED. | |
| Known current limitations (see associated PR or tracking issue for detailed test coverage status): | |
| - Boolean attention masks on WebGPU are not yet supported. | |
| - SoftCap is not yet wired through to the shader. | |
| - GQA output with differing q_num_heads and kv_num_heads has an output stride/layout mismatch. | |
| - The optional QK matmul (output_qk) output path requires additional work. | |
| - Present-only key/value outputs (present without past) are not fully handled. | |
| - Some is_causal configurations require additional handling of causal masking interactions. |
| TEST(ConcatOpTest, Concat1D_int64) { | ||
| // webgpu ep will fail for 0x1122334455667788 | ||
| const int64_t val = 0x11223344; | ||
| OpTester test("Concat"); |
There was a problem hiding this comment.
This int64 Concat test uses a 32-bit-sized constant (0x11223344), so it doesn't actually exercise 64-bit value handling. If WebGPU currently fails for larger int64 values, it's better to keep a true 64-bit test value (e.g. > 2^32) and explicitly exclude the WebGPU EP for this test until fixed, rather than weakening the test coverage.
| *state_shape.add_dim() = query_shape.dim(0); // B | ||
| state_shape.add_dim()->set_dim_value(kv_num_heads); // H_kv | ||
| // d_k = query.dim(2) / q_num_heads | ||
| if (query_shape.dim(2).has_dim_value()) { | ||
| state_shape.add_dim()->set_dim_value(query_shape.dim(2).dim_value() / q_num_heads); | ||
| } else { | ||
| state_shape.add_dim(); | ||
| } | ||
| // d_v = value.dim(2) / kv_num_heads | ||
| if (value_shape.dim(2).has_dim_value()) { | ||
| state_shape.add_dim()->set_dim_value(value_shape.dim(2).dim_value() / kv_num_heads); | ||
| } else { |
There was a problem hiding this comment.
Similarly, present_state shape inference computes d_k = query.dim(2) / q_num_heads and d_v = value.dim(2) / kv_num_heads without verifying divisibility. Please guard these divisions (mod == 0) to avoid emitting incorrect concrete dimension values in the inferred shape.
|
|
||
| static size_t NormalizeAxis(int64_t axis, size_t tensor_rank) { | ||
| int64_t rank = static_cast<int64_t>(tensor_rank); | ||
| if (axis < -rank && axis >= rank) { |
There was a problem hiding this comment.
NormalizeAxis range check is incorrect: axis < -rank && axis >= rank can never be true, so invalid axis values won't be rejected and may lead to overflow in the normalization/casts. This should be axis < -rank || axis >= rank (same logic as elsewhere in the codebase).
| if (axis < -rank && axis >= rank) { | |
| if (axis < -rank || axis >= rank) { |
| const Tensor* query = context.Input(0); | ||
| const Tensor* key = context.Input(1); | ||
| const Tensor* value = context.Input(2); | ||
| const Tensor* past_state = context.Input(3); // optional | ||
| const Tensor* decay = context.Input(4); // optional | ||
| const Tensor* beta = context.Input(5); // optional | ||
|
|
||
| // Validate 3D packed inputs | ||
| const auto& q_shape = query->Shape(); | ||
| ORT_RETURN_IF(q_shape.NumDimensions() != 3, "query must be 3D (B, T, H_q*d_k)"); | ||
|
|
||
| const int batch_size = static_cast<int>(q_shape[0]); | ||
| const int seq_length = static_cast<int>(q_shape[1]); | ||
| const int q_packed_dim = static_cast<int>(q_shape[2]); | ||
| const int num_heads = kv_num_heads_; | ||
|
|
||
| ORT_RETURN_IF(q_num_heads_ != kv_num_heads_, | ||
| "GQA (q_num_heads != kv_num_heads) is not yet supported"); | ||
|
|
||
| const int head_dim_k = q_packed_dim / q_num_heads_; | ||
| ORT_RETURN_IF(q_packed_dim != head_dim_k * q_num_heads_, | ||
| "query packed dim must be divisible by q_num_heads"); | ||
|
|
||
| const int v_packed_dim = static_cast<int>(value->Shape()[2]); | ||
| const int head_dim_v = v_packed_dim / kv_num_heads_; | ||
| ORT_RETURN_IF(v_packed_dim != head_dim_v * kv_num_heads_, | ||
| "value packed dim must be divisible by kv_num_heads"); |
There was a problem hiding this comment.
ComputeInternal derives head_dim_k from query but never validates that key has the expected 3D packed shape (B, T, H*dk) or that its last dimension matches q_num_heads_*head_dim_k. The shader indexes key using packed_dk based on the query shape, so a mismatched key shape can lead to out-of-bounds reads and incorrect results. Add explicit checks for key rank and dimensions (and batch/seq match) before launching the program.
| // Workgroup size = head_dim_k (one thread per dk row) | ||
| // Ensure it's a power of 2 for tree reduction (round up) | ||
| uint32_t workgroup_size = 1; | ||
| while (workgroup_size < static_cast<uint32_t>(head_dim_k)) { | ||
| workgroup_size *= 2; | ||
| } | ||
| // Cap at GPU limits | ||
| workgroup_size = std::min(workgroup_size, static_cast<uint32_t>(256)); |
There was a problem hiding this comment.
Workgroup size is capped to 256 even when head_dim_k is larger. The shader maps local_idx to a dk row and reductions assume full dk coverage, so capping below head_dim_k silently drops rows and produces incorrect results. Either enforce/validate head_dim_k <= max_workgroup_size_x (and return NOT_IMPLEMENTED/INVALID_ARGUMENT when exceeded) or redesign the algorithm to handle dk > workgroup_size via tiling/multiple workgroups.
| // Workgroup size = head_dim_k (one thread per dk row) | |
| // Ensure it's a power of 2 for tree reduction (round up) | |
| uint32_t workgroup_size = 1; | |
| while (workgroup_size < static_cast<uint32_t>(head_dim_k)) { | |
| workgroup_size *= 2; | |
| } | |
| // Cap at GPU limits | |
| workgroup_size = std::min(workgroup_size, static_cast<uint32_t>(256)); | |
| // Validate that head_dim_k does not exceed the maximum supported workgroup size. | |
| // The shader maps one thread to each dk row and relies on full dk coverage. | |
| const uint32_t kMaxWorkgroupSizeX = 256; | |
| if (static_cast<uint32_t>(head_dim_k) > kMaxWorkgroupSizeX) { | |
| return ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED, | |
| "WebGPU LinearAttention currently requires head_dim_k <= ", | |
| kMaxWorkgroupSizeX, | |
| "; got head_dim_k = ", | |
| head_dim_k, | |
| ". Consider reducing head_dim_k or updating the kernel implementation."); | |
| } | |
| // Workgroup size = head_dim_k (one thread per dk row) | |
| // Ensure it's a power of 2 for tree reduction (round up) | |
| uint32_t workgroup_size = 1; | |
| while (workgroup_size < static_cast<uint32_t>(head_dim_k)) { | |
| workgroup_size *= 2; | |
| } | |
| // Cap at GPU limits (head_dim_k is already validated to be <= kMaxWorkgroupSizeX) | |
| workgroup_size = std::min(workgroup_size, kMaxWorkgroupSizeX); |
| CausalConvWithState); | ||
|
|
||
| CausalConvWithState::CausalConvWithState(const OpKernelInfo& info) | ||
| : WebGpuKernel(info) { |
There was a problem hiding this comment.
The op schema defines an ndim attribute (default 1) but the WebGPU kernel ignores it entirely (constructor only reads activation). Please read/validate ndim and return a clear NOT_IMPLEMENTED/INVALID_ARGUMENT error for ndim != 1 (or implement higher dimensions), otherwise models exporting ndim=2/3 will silently run with incorrect semantics.
| : WebGpuKernel(info) { | |
| : WebGpuKernel(info) { | |
| // Validate supported dimensionality. | |
| const int64_t ndim = info.GetAttrOrDefault<int64_t>("ndim", 1); | |
| if (ndim != 1) { | |
| ORT_THROW("CausalConvWithState WebGPU kernel only supports ndim=1, but got ndim=", ndim); | |
| } |
| const std::vector<float>& query, // (B, q_num_heads, T, dk) | ||
| const std::vector<float>& key, // (B, n_k_heads, T, dk) | ||
| const std::vector<float>& value, // (B, kv_num_heads, T, dv) | ||
| const std::vector<float>* initial_state, // (B, kv_num_heads, dk, dv) | ||
| const std::vector<float>* decay, // (B, kv_num_heads, T[, dk]) | ||
| const std::vector<float>* beta, // (B, kv_num_heads, T) | ||
| std::vector<float>& output, // (B, kv_num_heads, T, dv) | ||
| std::vector<float>& final_state) { // (B, kv_num_heads, dk, dv) |
There was a problem hiding this comment.
| const std::vector<float>& query, // (B, q_num_heads, T, dk) | |
| const std::vector<float>& key, // (B, n_k_heads, T, dk) | |
| const std::vector<float>& value, // (B, kv_num_heads, T, dv) | |
| const std::vector<float>* initial_state, // (B, kv_num_heads, dk, dv) | |
| const std::vector<float>* decay, // (B, kv_num_heads, T[, dk]) | |
| const std::vector<float>* beta, // (B, kv_num_heads, T) | |
| std::vector<float>& output, // (B, kv_num_heads, T, dv) | |
| std::vector<float>& final_state) { // (B, kv_num_heads, dk, dv) | |
| const std::vector<float>& query, // (B, q_num_heads, T, dk) | |
| const std::vector<float>& key, // (B, n_k_heads, T, dk) | |
| const std::vector<float>& value, // (B, kv_num_heads, T, dv) | |
| const std::vector<float>* initial_state, // (B, kv_num_heads, dk, dv) | |
| const std::vector<float>* decay, // (B, kv_num_heads, T[, dk]) | |
| const std::vector<float>* beta, // (B, kv_num_heads, T) | |
| std::vector<float>& output, // (B, kv_num_heads, T, dv) | |
| std::vector<float>& final_state) { // (B, kv_num_heads, dk, dv) |
moved to here: #27996