Implement FP32 kleidiai Gemv #26302
Conversation
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Force-pushed from a3f4f5b to e8ab1b1
Force-pushed from 4afc95c to 1d9b7c8
Could you please rebase this @JonathanC-ARM?
Signed-off-by: Jonathan Clohessy <[email protected]>
Force-pushed from 1d9b7c8 to c9a507f
Hi @hariharans29, I've updated the branch now, thanks!
Force-pushed from c9a507f to 1ead7ca
Force-pushed from f45ca9a to 060fdc6
Can you please resolve conflicts?
Force-pushed from d7661a4 to 3ab115c
Force-pushed from 3ab115c to 25a0a27
Hi @hariharans29, I've removed all extraneous changes from the latest commits, so it's just GEMV now and should be easier to review. I've also addressed the fill_n issue; you were right, it wasn't required. Thanks for the comments!
Apart from the comment from Copilot, the rest of the PR looks good to me. Thanks.
Hi @hariharans29, I made a change around the Copilot suggestion, though I don't think it was a big deal in practice. Copilot was technically correct in isolation, but within ORT, the testing fixtures, and MLAS the scenario isn't really reachable. I've made the change on the off chance this could be called in isolation and added tests that exercise this path. ORT/MLAS always use minimal leading dimensions, so this could only occur if someone called MLAS directly with padded/non-minimal leading dimensions.
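To illustrate the leading-dimension point above, here is a minimal, hypothetical sketch (the function name and shape are mine, not the actual MLAS packing code): with a minimal leading dimension a row-major buffer is dense and can be copied as one block, but a padded `lda` forces a row-by-row copy that skips the per-row padding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical packing helper, assuming a row-major M x K matrix A with
// leading dimension lda. ORT/MLAS always pass the minimal lda == K; a
// padded lda > K can only come from a caller using MLAS in isolation.
std::vector<float> PackRowMajor(const float* A, std::size_t M,
                                std::size_t K, std::size_t lda) {
    assert(lda >= K);  // lda == K is the minimal (dense) case
    std::vector<float> packed(M * K);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            packed[i * K + k] = A[i * lda + k];  // skip per-row padding
    return packed;
}
```

Treating a padded buffer as dense would pull the padding values into the packed data, which is exactly the corner case the added tests cover.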
Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: Copilot <[email protected]>
Description
Implements a special SGEMM path that uses GEMV kernels when M or N is 1.
Additionally, this PR introduces a microkernel interface built on the typedefs provided by KleidiAI, which simplifies the code and removes constructs such as ternary selection between SME1 and SME2 kernels.
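The dispatch idea can be sketched as follows. This is a minimal, self-contained illustration of the degenerate-GEMM observation, not the actual MLAS/KleidiAI code; `TryGemvPath` and the row-major/minimal-leading-dimension layout are my assumptions for the sketch.

```cpp
#include <cstddef>

// Hypothetical sketch: when M == 1 or N == 1, the SGEMM C = A * B
// degenerates to a matrix-vector product, so a GEMV kernel can replace
// the full GEMM tiling. A is M x K, B is K x N, C is M x N, all
// row-major with minimal leading dimensions.
bool TryGemvPath(std::size_t M, std::size_t N, std::size_t K,
                 const float* A, const float* B, float* C) {
    if (N == 1) {
        // C (M x 1) = A (M x K) * b (K x 1): one dot product per row of A.
        for (std::size_t i = 0; i < M; ++i) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k];
            C[i] = acc;
        }
        return true;
    }
    if (M == 1) {
        // C (1 x N) = a (1 x K) * B (K x N): accumulate scaled rows of B,
        // which keeps the traversal of B contiguous in memory.
        for (std::size_t j = 0; j < N; ++j) C[j] = 0.0f;
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                C[j] += A[k] * B[k * N + j];
        return true;
    }
    return false;  // general case: fall through to the full GEMM path
}
```

In the real implementation the inner loops would be the KleidiAI SME GEMV microkernels selected through the typedef'd interface; the point of the sketch is only the early dispatch on the degenerate shapes.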
Indicative Performance
In lieu of any production models where GEMV is a large contributor to the network, I created a mini model for testing that contains thousands of randomized MatMul variants, with GEMV cases distributed throughout:

Using the onnxruntime perf test, I was able to halve the total inference time versus MLAS with this model:

More Benchmarks to come shortly