
linear attention signature #27842

Merged
justinchuby merged 10 commits into main from gs/linear-attention-signature on Apr 6, 2026

Conversation

@guschmue (Contributor) commented on Mar 25, 2026:

Proposal for the CausalConvWithState and LinearAttention onnxruntime custom operators.
This follows the proposal in onnx/onnx#7767.

@guschmue changed the title from "Gs/linear attention signature" to "linear attention signature" on Mar 25, 2026
@guschmue (Contributor, Author) commented:

A working end-to-end implementation for WebGPU with this signature can be found here:
https://github.com/microsoft/onnxruntime/tree/gs/wgpu-lattn

Possible changes at this point:

  1. Maybe the inputs should be transposed; otherwise the model will have Transpose operators in front of LinearAttention (sketched below).
  2. Maybe CausalConv1DWithState should be CausalConvWithState.
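For point 1, a minimal numpy sketch of the layout conversion involved (shapes assumed from the packed 3D signature discussed in the reviews below, not taken from the actual ORT kernel):

```python
# Hypothetical illustration: a model that computes per-head tensors of
# shape [B, H, T, D] must insert Transpose + Reshape in front of
# LinearAttention to reach the packed [B, T, H*D] layout the proposed
# signature expects.
import numpy as np

B, H, T, D = 2, 4, 8, 16
q_per_head = np.random.randn(B, H, T, D).astype(np.float32)

# Transpose [B, H, T, D] -> [B, T, H, D], then flatten heads to [B, T, H*D]
q_packed = q_per_head.transpose(0, 2, 1, 3).reshape(B, T, H * D)
assert q_packed.shape == (B, T, H * D)
```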

@github-actions bot left a comment:

You can commit the suggested changes from lintrunner.

guschmue added a commit that referenced this pull request Mar 31, 2026
@justinchuby (Contributor) commented:

Cross-referencing with the ONNX proposal (onnx/onnx#7767) — review by AI agent team

I compared the LinearAttention and CausalConvWithState schemas between this ORT contrib PR and the ONNX proposal in onnx/onnx#7767. Summary below.


LinearAttention — comparison

✅ Matches

| Item | Both PRs |
|------|----------|
| Input names (0–5) | query, key, value, past_state, decay, beta — identical order and names |
| Output names (0–1) | output, present_state — identical |
| Input shapes | All match: 3D packed [B,T,H*D] for Q/K/V/decay/beta; 4D [B,H_kv,d_k,d_v] for state |
| Attribute names | q_num_heads, kv_num_heads, update_rule, scale, chunk_size — all five present |
| Attribute types & defaults | update_rule="gated_delta", scale=0.0, chunk_size=64; both head counts required |
| TypeConstraint T | {tensor(float), tensor(float16), tensor(bfloat16)} — identical |
| Mathematical semantics | linear, gated, delta, gated_delta update rules described identically (sketched below) |
| Optional input semantics | past_state defaults to zeros; decay/beta required by their respective modes |
| Namespace | ORT uses com.microsoft, ONNX proposal uses ai.onnx — expected and correct per the 3-phase adoption path |
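For reference, a per-head, per-token numpy sketch of the four update rules named in the table above, based on the standard linear-attention, gated, and delta-rule formulations from the literature; the normative pseudocode is in onnx/onnx#7767, and exact scaling and ordering there may differ (the scale attribute is omitted here):

```python
import numpy as np

def step(S, q, k, v, rule, decay=None, beta=None):
    """One recurrent step. S: (d_k, d_v) state; q, k: (d_k,); v: (d_v,)."""
    if rule in ("gated", "gated_delta"):
        S = decay * S                       # decay the state first
    if rule in ("delta", "gated_delta"):
        v = v - S.T @ k                     # delta rule: write only the residual
        S = S + beta * np.outer(k, v)
    else:
        S = S + np.outer(k, v)              # plain accumulation
    return S.T @ q, S                       # output (d_v,), updated state

# Example: one gated step with scalar decay
S = np.zeros((4, 4))
out, S = step(S, np.ones(4), np.ones(4), np.ones(4), "gated", decay=0.9)
```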

❌ Key mismatch — state type precision (S vs T)

The ONNX proposal introduces a second type parameter S for the recurrent state tensors, following the stash_type convention used by LayerNormalization and GroupNormalization:

  • past_state → type S (not T)
  • present_state → type S (not T)
  • S is constrained to float32, or the same as T

This ORT PR uses a single type T for all inputs/outputs including state. This means:

  • In this ORT schema, running with T=float16 forces the recurrent state to also be float16
  • The ONNX proposal explicitly supports T=float16, S=float32 — float16 activations with float32 state accumulation — which is important for numerical stability during long-sequence decoding (the state matrix is accumulated over hundreds or thousands of tokens)
  • The ONNX proposal notes: "Using S = float32 with T = float16/bfloat16 is the recommended configuration for long sequences; runtimes handle any necessary casting internally"

Recommendation: Consider adding a stash_type attribute (integer, default 1 = float32) analogous to LayerNormalization, so callers can opt into float32 state accumulation independently of the activation dtype. This would align with the ONNX proposal and avoid a breaking schema change later.
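A small numpy illustration of the stability point (magnitudes and shapes are made up for the demo): accumulating many small outer-product updates into a float16 state drifts away from the same accumulation done in float32, which is what a stash_type-style attribute would let callers avoid:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, steps = 64, 64, 4096
S16 = np.zeros((d_k, d_v), dtype=np.float16)
S32 = np.zeros((d_k, d_v), dtype=np.float32)
for _ in range(steps):
    k = np.abs(rng.standard_normal(d_k)).astype(np.float16) * np.float16(0.01)
    v = np.abs(rng.standard_normal(d_v)).astype(np.float16) * np.float16(0.01)
    upd = np.outer(k, v)                 # fp16 update, as with a single-T schema
    S16 += upd                           # fp16 state: updates round away as S grows
    S32 += upd.astype(np.float32)        # fp32 state: what stash_type=1 would express
print(np.abs(S16.astype(np.float32) - S32).max())  # nonzero drift
```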

⚠️ Minor gap — input validation rules not in doc string

The ONNX proposal has an explicit table of required/forbidden optional inputs per update_rule:

| update_rule | decay | beta |
|-------------|-------|------|
| "linear" | must be omitted | must be omitted |
| "gated" | required | must be omitted |
| "delta" | must be omitted | required |
| "gated_delta" | required | required |

The ORT doc string describes the semantics but doesn't explicitly state that providing a forbidden input is a model validation error. Worth adding to the doc for clarity — helps implementors know they should validate at model-load time, not silently ignore extra inputs.
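A sketch of what that model-load-time check could look like (the function name and error type are illustrative, not part of either PR):

```python
# update_rule -> (decay required?, beta required?), per the table above
RULES = {
    "linear":      (False, False),
    "gated":       (True,  False),
    "delta":       (False, True),
    "gated_delta": (True,  True),
}

def validate_linear_attention(update_rule: str, has_decay: bool, has_beta: bool) -> None:
    needs_decay, needs_beta = RULES[update_rule]
    if has_decay != needs_decay:
        raise ValueError(f"update_rule={update_rule!r}: decay must be "
                         f"{'provided' if needs_decay else 'omitted'}")
    if has_beta != needs_beta:
        raise ValueError(f"update_rule={update_rule!r}: beta must be "
                         f"{'provided' if needs_beta else 'omitted'}")
```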


CausalConvWithState — comparison

✅ Full match

All inputs (input, weight, bias, past_state), outputs (output, present_state), attributes (ndim, activation), type constraints, and shapes are identical between the two PRs. No differences found.
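For readers without the schema open, a hedged numpy sketch of the ndim=1 semantics as described here: a depthwise causal conv where past_state carries the last k-1 input columns between calls (activation omitted; the normative definition is in bert_defs.cc and onnx/onnx#7767):

```python
import numpy as np

def causal_conv1d_with_state(x, w, b, past_state):
    """x: (B, C, T), w: (C, 1, k) depthwise, b: (C,), past_state: (B, C, k-1)."""
    B, C, T = x.shape
    k = w.shape[-1]
    assert k > 1, "state is only meaningful for k > 1"
    ctx = np.concatenate([past_state, x], axis=-1)   # (B, C, T + k - 1)
    out = np.empty_like(x)
    for t in range(T):
        # each output step sees exactly the k most recent inputs (causal)
        out[:, :, t] = np.einsum("bck,ck->bc", ctx[:, :, t:t + k], w[:, 0, :])
    out += b[None, :, None]
    present_state = ctx[:, :, -(k - 1):]             # last k-1 columns for next call
    return out, present_state
```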


Summary

| Op | Status |
|----|--------|
| LinearAttention | Mostly aligned — 1 structural gap (state type S vs T), 1 minor doc gap |
| CausalConvWithState | Fully aligned |

The state precision gap is the only item that would cause a future breaking schema change if not addressed here. Everything else looks well-aligned with the ONNX proposal.

@justinchuby (Contributor) commented:

Cross-referencing with the ONNX proposal (onnx/onnx#7767) — review by AI agent team

I systematically compared the LinearAttention and CausalConvWithState op schemas in this PR against the formal schema defined in onnx/onnx#7767. The namespace difference (com.microsoft here vs ai.onnx in the proposal) is expected per the contrib-first adoption path and is not flagged below.


LinearAttention

✅ Matches

| Element | ORT | ONNX proposal |
|---------|-----|---------------|
| Input names | query, key, value, past_state, decay, beta | same |
| Input order | 0..5 | same |
| Input optionality | past_state, decay, beta optional | same |
| Input shapes (3D packed) | (B, T, H*D) for Q/K/V; (B, H_kv, d_k, d_v) for state | same |
| Decay shape variants | (B, T, H_kv*d_k) or (B, T, H_kv) | same |
| Output names | output, present_state | same |
| Output shapes | (B, T, H_q*d_v) and (B, H_kv, d_k, d_v) | same |
| Attribute names | update_rule, scale, q_num_heads, kv_num_heads, chunk_size | same |
| Attribute types | all match | same |
| Attribute defaults | update_rule="gated_delta", scale=0.0, chunk_size=64 | same |
| q_num_heads / kv_num_heads | required, no default value | same |
| Activation dtype constraint | float, float16, bfloat16 | same |
| Zero-initialized state semantics | implied (optional input) | explicitly stated |
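On the decay shape variants row: a tiny sketch of how the per-head layout relates to the per-channel one, assuming the packed channel order groups the d_k entries of each head contiguously (an assumption here; the proposal defines the exact packing):

```python
import numpy as np

B, T, H_kv, d_k = 2, 8, 4, 16
decay_per_head = np.random.rand(B, T, H_kv).astype(np.float32)   # (B, T, H_kv)
decay_per_channel = np.repeat(decay_per_head, d_k, axis=-1)      # (B, T, H_kv*d_k)
assert decay_per_channel.shape == (B, T, H_kv * d_k)
```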

❌ Mismatch: State type parameter (S vs single T)

This is the most significant schema difference.

The ONNX proposal defines two separate type parameters:

  • T — activation dtype for query, key, value, decay, beta, output
  • S — state dtype for past_state and present_state; must be float32 or same as T

This allows the recommended configuration of fp32 state + fp16/bf16 activations — important for numerical accuracy in long sequences where state accumulation in fp16 diverges. The ONNX proposal explicitly calls this out: "Using S = float32 with T = float16/bfloat16 is the recommended configuration for long sequences."

This PR uses a single T type for all inputs including state, which prevents expressing this mixed-precision configuration in the op signature.

Suggestion: Consider adding a second type constraint S for past_state/present_state, or at minimum add an attribute state_dtype that allows the runtime to accumulate state in float32 even when activations are fp16.
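To make the suggestion concrete, a sketch using onnx.helper (which exists in the onnx package) of a node carrying this PR's attribute set; the T/S split described above is the ONNX-proposal behavior and is exactly what the current single-T schema cannot express:

```python
from onnx import helper

node = helper.make_node(
    "LinearAttention",
    inputs=["query", "key", "value", "past_state", "decay", "beta"],
    outputs=["output", "present_state"],
    domain="com.microsoft",        # contrib namespace used by this PR
    update_rule="gated_delta",
    q_num_heads=8,
    kv_num_heads=8,
    chunk_size=64,
)
# Under the ONNX proposal: query/key/value/decay/beta/output would be
# FLOAT16 (T) while past_state/present_state are FLOAT (S); with a single
# T constraint, all eight tensors must share one dtype.
```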

⚠️ Minor: Input combination validation not documented in schema

The ONNX proposal formally specifies that the presence/absence of decay and beta depends on update_rule and must be validated at model-load time (e.g., "linear" requires neither, "gated_delta" requires both; providing a forbidden input is a schema error).

This PR marks both as Optional in the schema without documenting this constraint. The kernel may still validate at runtime, but adding a note to the docstring would make the contract explicit to model builders.


CausalConvWithState

✅ Matches (essentially identical)

| Element | ORT | ONNX proposal |
|---------|-----|---------------|
| Input names | input, weight, bias, past_state | same |
| Input shapes | (B, C, ...) input, (C, 1, k, ...) weight, (C,) bias, (B, C, k-1) state | same |
| Input optionality | bias and past_state optional | same |
| Output names | output, present_state | same |
| Output shapes | same as input; same as past_state | same |
| Attributes | activation (default "none"), ndim (default 1) | same |
| Activation values | "silu", "swish", "none" ("silu" and "swish" are aliases) | same |
| Type constraint | single T: float, float16, bfloat16 | same |

No divergence found for CausalConvWithState. The single-type constraint is appropriate here since the conv state dtype naturally matches the input dtype.
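On the activation-values row, a one-line sketch of the alias behavior (illustrative only), which could be appended to the conv sketch above:

```python
import numpy as np

def apply_activation(y, activation="none"):
    if activation in ("silu", "swish"):        # aliases for x * sigmoid(x)
        return y * (1.0 / (1.0 + np.exp(-y)))
    return y                                   # "none": identity
```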


Summary

The two schemas are structurally aligned. The one actionable difference is the missing S state-dtype type parameter in LinearAttention. Everything else matches: all 5 attributes, all 6 inputs, both outputs, all defaults, and the full CausalConvWithState schema.

The ONNX proposal is at onnx/onnx#7767 if you want to review the reference-level pseudocode and the formal input-combination validation table.

justinchuby previously approved these changes Mar 31, 2026

@justinchuby (Contributor) left a comment:

LGTM w/ agreement w/ the AI comments

justinchuby previously approved these changes Apr 2, 2026
@justinchuby merged commit e532c21 into main on Apr 6, 2026 (98 of 99 checks passed)
@justinchuby deleted the gs/linear-attention-signature branch on April 6, 2026 at 18:23
sanaa-hamel-microsoft pushed a commit that referenced this pull request Apr 21, 2026
Proposal for CausalConvWithState and LinearAttention onnxruntime custom
operator.
This follows the proposal in onnx/onnx#7767.
sanaa-hamel-microsoft added a commit that referenced this pull request Apr 24, 2026
Version bump to 1.25.1.

This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|--------------|
| e532c21 | #27842 | linear attention signature |
| 410f5a8 | #27752 | +rotemb, +rmsnorm, reshape->opset-25, transpose->opset-24 |
| 0fedb26 | #27907 | Add LinearAttention and CausalConvState ops for Qwen3.5 |
| 3ac6040 | #27996 | webgpu support for qwen3.5 |
| c36c422 | #27998 | [WebGPU EP] Fuse QMoE 1-token decode path to reduce GPU dispatches |
| 94f32ec | #27289 | [CORE]: Improve filesystem error messages during Linux device discovery |
| dce77a3 | #28118 | Fix lack of auth on python packaging |

---------

Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: eserscor <erscor@microsoft.com>
Co-authored-by: Sanaa Hamel <sanaahamel@microsoft.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Stephan Seitz <sseitz@nvidia.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>