Add CPU kernels for linear attention contrib ops #27835
OmarAzizi wants to merge 1 commit into microsoft:rama/linear-attn
Conversation
Signed-off-by: OmarAzizi <oalazizi75@gmail.com>
@microsoft-github-policy-service agree
While implementing this, we are discussing some changes to the op signatures to make them more practical.
The signatures we expect for WebGPU are here; we're trying to see if we can finalize the signature today: https://github.com/microsoft/onnxruntime/blob/gs/wgpu-lattn/onnxruntime/core/graph/contrib_ops/bert_defs.cc#L2236
We are working on the signature here:
@guschmue Thanks for the heads up! I'll hold off on any further changes until the signatures are finalized. I'm happy to update the CPU kernels to match the new signatures.
OK, I updated the contrib ops PR to the latest signature.
A working fp16/q4 model is here:
A working implementation for WebGPU is in this PR:
Hi @guschmue, thanks a lot for the update and for sharing the latest changes and tests. I've been a bit busy over the past few days, but I'll review the updated signatures and update my PR this week. Since I'm still relatively new to the codebase and the contrib-ops workflow, I'd really appreciate any notes or tips I should keep in mind.
Description
Implemented CPU execution provider kernels for the three linear attention contrib ops that implement the linear attention / recurrent state-update mechanisms used by modern hybrid LLMs (Qwen3.5, Jamba, RWKV-6, FalconMamba, etc.):
- `LinearAttentionRecurrent`: Single-token recurrent decode step supporting the linear, gated, delta, and gated_delta update rules. Computes the full state update (decay, retrieve, delta, write, readout) in float32 for numerical stability across all sequence lengths; see the sketch after this list.
- `LinearAttentionChunkParallel`: Prefill kernel that processes a full input sequence by running the recurrent step sequentially over all T tokens. The CUDA chunk-parallel WY decomposition is not used on the CPU.
- `CausalConv1DWithState`: Depthwise causal 1D convolution with carry state and optional SiLU activation.
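For orientation, here is a minimal sketch of one decode step under the gated_delta rule, following the decay → retrieve → delta/write → readout order listed above. The [d_k, d_v] row-major state layout and the names `g` (decay gate) and `beta` (write strength) are illustrative assumptions, not the actual kernel's I/O contract:

```cpp
#include <cstddef>
#include <vector>

// One gated_delta decode step on a [d_k, d_v] row-major state S.
// Hypothetical sketch: names and layout are assumptions, not the ORT kernel.
void gated_delta_step(std::vector<float>& S,           // state, d_k * d_v
                      const float* q, const float* k,  // [d_k]
                      const float* v,                  // [d_v]
                      float g,                         // decay gate in (0, 1]
                      float beta,                      // delta-rule write strength
                      float* out,                      // [d_v]
                      std::size_t d_k, std::size_t d_v) {
  // decay: S <- g * S
  for (auto& s : S) s *= g;
  // retrieve: r = S^T k (what the state currently predicts for this key)
  std::vector<float> r(d_v, 0.0f);
  for (std::size_t i = 0; i < d_k; ++i)
    for (std::size_t j = 0; j < d_v; ++j)
      r[j] += S[i * d_v + j] * k[i];
  // delta + write: S <- S + beta * k (v - r)^T
  for (std::size_t i = 0; i < d_k; ++i)
    for (std::size_t j = 0; j < d_v; ++j)
      S[i * d_v + j] += beta * k[i] * (v[j] - r[j]);
  // readout: out = S^T q
  for (std::size_t j = 0; j < d_v; ++j) out[j] = 0.0f;
  for (std::size_t i = 0; i < d_k; ++i)
    for (std::size_t j = 0; j < d_v; ++j)
      out[j] += S[i * d_v + j] * q[i];
}
```

The plain linear rule is the special case g = 1 with the retrieve/delta terms dropped, and the prefill kernel simply loops this step over the T tokens of the sequence.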
All ops support `float32`, `float16`, and `bfloat16`; fp16/bf16 inputs are upcast to `float32` internally for accumulation, matching the precision behavior of the CUDA kernels (a sketch of this pattern follows).
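As a concrete illustration of that upcast-and-accumulate pattern (the bit manipulation below is standard bfloat16 behavior, not code from this PR):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// bfloat16 is the top 16 bits of an IEEE-754 float32, so widening is a shift.
static float bf16_to_fp32(uint16_t b) {
  uint32_t bits = static_cast<uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Dot product of two bf16 vectors with a float32 accumulator; the caller
// narrows the result back to the output dtype at the end.
float dot_bf16(const uint16_t* a, const uint16_t* b, std::size_t n) {
  float acc = 0.0f;  // accumulate in float32 for numerical stability
  for (std::size_t i = 0; i < n; ++i)
    acc += bf16_to_fp32(a[i]) * bf16_to_fp32(b[i]);
  return acc;
}
```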
Note: The kernels compile, and the kernel symbols are present in the built binary. However, end-to-end Python testing with ONNX Runtime is currently blocked: the ops do not register at runtime despite the kernels being linked. The root cause appears to be in the schema registration in `bert_defs.cc`, which affects both the CPU and CUDA kernels; any input on how this could be fixed would be appreciated. For context, the registration pieces that have to agree with the schema are sketched below.
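These are the three places a CPU contrib kernel normally has to appear, sketched from the standard ORT pattern rather than copied from this PR; a runtime "op not registered" failure usually means one of them disagrees with the `bert_defs.cc` schema on op name, domain, or since-version:

```cpp
// Illustrative sketch of the usual CPU contrib-op registration pattern
// (assumed names/versions; not the exact code in this PR).

// 1) In the kernel's .cc file: define the kernel and bind its type constraint.
ONNX_OPERATOR_TYPED_KERNEL_EX(
    LinearAttentionRecurrent, kMSDomain, 1, float, kCpuExecutionProvider,
    KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    LinearAttentionRecurrent<float>);

// 2) In cpu_contrib_kernels.cc: forward-declare the generated kernel class...
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(
    kCpuExecutionProvider, kMSDomain, 1, float, LinearAttentionRecurrent);

// 3) ...and list it in the registration table so the registry can create it.
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(
    kCpuExecutionProvider, kMSDomain, 1, float, LinearAttentionRecurrent)>,
```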
Motivation and Context
The CUDA kernels for these ops were added in commit `3966afb` without CPU counterparts, which would cause inference to fail for models like Qwen3.5 and Jamba on CPU-only machines.
Ref: onnx/onnx#7689