
Conversation

@JonathanC-ARM
Contributor

Description

Implementation of a special sgemm path which uses GEMV kernels in cases where M or N is 1.
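The dispatch idea can be sketched as a simple shape check (a minimal sketch; `SelectSgemmPath` and the enum are illustrative names, not the actual MLAS/KleidiAI symbols):

```cpp
#include <cstddef>

// A general sgemm call degenerates to a matrix-vector product (GEMV)
// when either output dimension is 1, so a dedicated GEMV kernel can be
// dispatched instead of the full GEMM path.
enum class SgemmPath { Gemm, GemvMIs1, GemvNIs1 };

SgemmPath SelectSgemmPath(std::size_t M, std::size_t N) {
    if (M == 1) return SgemmPath::GemvMIs1;  // C (1xN) = A (1xK) * B (KxN)
    if (N == 1) return SgemmPath::GemvNIs1;  // C (Mx1) = A (MxK) * B (Kx1)
    return SgemmPath::Gemm;
}
```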

Additionally, this PR introduces a microkernel interface that uses the typedefs provided by KleidiAI, which lets us simplify the code and remove things such as ternary operations for selecting between SME1 and SME2 kernels.
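The interface idea can be sketched as resolving the kernel variant once into a struct of function pointers, so call sites no longer branch per call (a hedged sketch with made-up names and signatures; the real KleidiAI typedefs and kernel symbols differ):

```cpp
#include <cstddef>

// Illustrative function-pointer type for a GEMV microkernel entry point.
using GemvKernelFn = void (*)(std::size_t n, std::size_t k, const float* lhs,
                              const float* rhs, float* dst);

// Instead of writing (sme2 ? kai_run_..._sme2(...) : kai_run_..._sme(...))
// at every call site, the selected variant is captured once.
struct GemvMicrokernel {
    std::size_t (*get_n_step)();
    GemvKernelFn run;
};

// Stand-ins for the SME1/SME2 kernel variants (bodies elided).
void gemv_sme1(std::size_t, std::size_t, const float*, const float*, float*) {}
void gemv_sme2(std::size_t, std::size_t, const float*, const float*, float*) {}
std::size_t n_step_sme1() { return 16; }
std::size_t n_step_sme2() { return 32; }

// Resolved once at dispatch time; call sites just use ukernel.run(...).
GemvMicrokernel SelectGemvKernel(bool sme2_available) {
    if (sme2_available) return {n_step_sme2, gemv_sme2};
    return {n_step_sme1, gemv_sme1};
}
```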

Indicative Performance

In lieu of any production model where GEMV is a large contributor to the network, I opted to create a mini model for testing which contains thousands of randomized matmul variants, with a distribution of GEMV cases throughout.

Using onnxruntime_perf_test, I was able to halve the total inference time vs MLAS with this model.
[chart: onnxruntime_perf_test comparison, no-GEMV build vs GEMV build]

More benchmarks to come shortly.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 2 times, most recently from a3f4f5b to e8ab1b1 on October 24, 2025 15:46
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 3 times, most recently from 4afc95c to 1d9b7c8 on November 11, 2025 15:53
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Could you please rebase this, @JonathanC-ARM?

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from 1d9b7c8 to c9a507f on November 14, 2025 12:08
@JonathanC-ARM
Contributor Author

Hi @hariharans29, I've updated the branch now, thanks!

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from c9a507f to 1ead7ca on November 14, 2025 12:19
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Signed-off-by: Jonathan Clohessy <[email protected]>
@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from f45ca9a to 060fdc6 on November 25, 2025 15:20
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Can you please resolve conflicts?

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 2 times, most recently from d7661a4 to 3ab115c on December 10, 2025 15:42
@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from 3ab115c to 25a0a27 on December 10, 2025 15:43
@JonathanC-ARM
Contributor Author

Hi @hariharans29, I've removed all extraneous changes from the latest commits, so it's just GEMV now and should be easier to review. I've also addressed the fill_n comment; you were right that it wasn't required. Thanks for the comments!

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Apart from the comment from Copilot, the rest of the PR looks good to me. Thanks.

@JonathanC-ARM
Contributor Author

Hi @hariharans29, I did make a change around the Copilot suggestion, though I don't think it was actually that big of a deal. Technically Copilot was correct in isolation, but within ORT, the testing fixtures, and MLAS this path wasn't actually reachable. I've made the change on the off chance this could somehow be called in isolation, and added tests that exercise this path.

ORT/MLAS always use minimal leading dimensions: lda = (TransA ? M : K) and ldb = (TransB ? K : N).

  • For the degenerate GEMV shapes:
    • If M == 1 and TransA == CblasTrans, then lda == 1, so A’s LHS row is contiguous.
    • If N == 1 and TransB == CblasNoTrans, then ldb == 1, so B’s LHS row is contiguous.
  • Since the LHS vector is always contiguous under these rules, the gather is never required, so the path flagged in the original issue could never be taken.

This could only occur if someone called MLAS with padded/non-minimal lda/ldb, which ORT doesn't do today. But as it stands now, if that were ever to happen, we should be OK as well.
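The contiguity reasoning above can be sketched as a predicate (a minimal sketch with illustrative names; `LhsVectorIsContiguous` is not an actual MLAS function, and it assumes row-major BLAS-style storage):

```cpp
#include <cstddef>

// For the degenerate GEMV shapes, decide whether the LHS vector is
// contiguous in memory. With minimal leading dimensions
// (lda = TransA ? M : K, ldb = TransB ? K : N) this is always true,
// so the strided gather is never needed; only a padded, non-minimal
// lda/ldb could hit the strided path.
bool LhsVectorIsContiguous(bool trans_a, bool trans_b,
                           std::size_t M, std::size_t N,
                           std::size_t lda, std::size_t ldb) {
    if (M == 1) {
        // A holds a length-K vector. NoTrans: A is a single 1xK row,
        // contiguous for any lda. Trans: A is stored Kx1 with row stride
        // lda, so elements are contiguous only when lda == 1 (== M).
        return trans_a ? (lda == 1) : true;
    }
    if (N == 1) {
        // B holds a length-K vector. NoTrans: B is stored Kx1 with row
        // stride ldb, contiguous only when ldb == 1 (== N). Trans: B is a
        // single 1xK row, contiguous for any ldb.
        return trans_b ? true : (ldb == 1);
    }
    return false;  // not a degenerate GEMV shape
}
```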

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@edgchen1
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29 hariharans29 merged commit 92c1ed2 into microsoft:main Dec 14, 2025
89 checks passed
Sumit2318 pushed a commit that referenced this pull request Jan 6, 2026
