
Conversation

@JonathanC-ARM
Contributor

Description

Implementation of a special sgemm path which uses GEMV kernels in cases where M or N is 1.
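The dispatch idea can be sketched as a simple shape check (a minimal sketch; `SelectSgemmPath` and the enum are illustrative names, not the actual MLAS/KleidiAI symbols):

```cpp
#include <cstddef>

// A general sgemm call degenerates to a matrix-vector product (GEMV)
// when either output dimension is 1, so a dedicated GEMV kernel can be
// dispatched instead of the full GEMM path.
enum class SgemmPath { Gemm, GemvMIs1, GemvNIs1 };

SgemmPath SelectSgemmPath(std::size_t M, std::size_t N) {
    if (M == 1) return SgemmPath::GemvMIs1;  // C (1xN) = A (1xK) * B (KxN)
    if (N == 1) return SgemmPath::GemvNIs1;  // C (Mx1) = A (MxK) * B (Kx1)
    return SgemmPath::Gemm;
}
```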

Additionally, this PR introduces a microkernel interface that uses the typedefs provided by KleidiAI, which lets us simplify the code and remove things such as ternary operations for selecting between SME1 and SME2 kernels.
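The interface idea can be sketched as resolving the kernel variant once into a struct of function pointers, so call sites no longer branch per call (a hedged sketch with made-up names and signatures; the real KleidiAI typedefs and kernel symbols differ):

```cpp
#include <cstddef>

// Illustrative function-pointer type for a GEMV microkernel entry point.
using GemvKernelFn = void (*)(std::size_t n, std::size_t k, const float* lhs,
                              const float* rhs, float* dst);

// Instead of writing (sme2 ? kai_run_..._sme2(...) : kai_run_..._sme(...))
// at every call site, the selected variant is captured once.
struct GemvMicrokernel {
    std::size_t (*get_n_step)();
    GemvKernelFn run;
};

// Stand-ins for the SME1/SME2 kernel variants (bodies elided).
void gemv_sme1(std::size_t, std::size_t, const float*, const float*, float*) {}
void gemv_sme2(std::size_t, std::size_t, const float*, const float*, float*) {}
std::size_t n_step_sme1() { return 16; }
std::size_t n_step_sme2() { return 32; }

// Resolved once at dispatch time; call sites just use ukernel.run(...).
GemvMicrokernel SelectGemvKernel(bool sme2_available) {
    if (sme2_available) return {n_step_sme2, gemv_sme2};
    return {n_step_sme1, gemv_sme1};
}
```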

Indicative Performance

In lieu of any production model where GEMV is a large contributor to the network, I opted to create a mini model for testing which contains thousands of randomized matmul variants, with a distribution of GEMV cases throughout.

Using onnxruntime_perf_test, I was able to halve the total inference time vs MLAS with this model.
[chart: onnxruntime_perf_test comparison, no-GEMV build vs GEMV build]

More benchmarks to come shortly.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 2 times, most recently from a3f4f5b to e8ab1b1 on October 24, 2025 15:46
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 3 times, most recently from 4afc95c to 1d9b7c8 on November 11, 2025 15:53
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Could you please rebase this, @JonathanC-ARM?

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from 1d9b7c8 to c9a507f on November 14, 2025 12:08
@JonathanC-ARM
Contributor Author

Hi @hariharans29, I've updated the branch now, thanks!

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from c9a507f to 1ead7ca on November 14, 2025 12:19
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Signed-off-by: Jonathan Clohessy <[email protected]>
@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from f45ca9a to 060fdc6 on November 25, 2025 15:20
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Can you please resolve conflicts?

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 2 times, most recently from d7661a4 to 3ab115c on December 10, 2025 15:42
@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from 3ab115c to 25a0a27 on December 10, 2025 15:43
@JonathanC-ARM
Contributor Author

Hi @hariharans29, I've removed all extraneous changes from the latest commits, so it's just GEMV now and should be easier to review. I've also addressed the fill_n comment; you were right that it wasn't required. Thanks for the comments!

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Apart from the comment from Copilot, the rest of the PR looks good to me. Thanks.

@JonathanC-ARM
Contributor Author

Hi @hariharans29, I did make a change around the Copilot suggestion, though I don't think it was actually that big of a deal. Technically Copilot was correct in isolation, but within ORT, the testing fixtures, and MLAS this path wasn't actually reachable. I've made the change on the off chance this could somehow be called in isolation, and added tests that exercise this path.

ORT/MLAS always use minimal leading dimensions: lda = (TransA ? M : K) and ldb = (TransB ? K : N).

  • For the degenerate GEMV shapes:
    • If M == 1 and TransA == CblasTrans, then lda == 1, so A’s LHS row is contiguous.
    • If N == 1 and TransB == CblasNoTrans, then ldb == 1, so B’s LHS row is contiguous.
  • Since the LHS vector is always contiguous under these rules, the gather is never required, so the path flagged in the original issue could never be taken.

This could only occur if someone called MLAS with padded/non-minimal lda/ldb, which ORT doesn't do today. But as it stands now, if that were ever to happen, we should be OK as well.
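The contiguity reasoning above can be sketched as a predicate (a minimal sketch with illustrative names; `LhsVectorIsContiguous` is not an actual MLAS function, and it assumes row-major BLAS-style storage):

```cpp
#include <cstddef>

// For the degenerate GEMV shapes, decide whether the LHS vector is
// contiguous in memory. With minimal leading dimensions
// (lda = TransA ? M : K, ldb = TransB ? K : N) this is always true,
// so the strided gather is never needed; only a padded, non-minimal
// lda/ldb could hit the strided path.
bool LhsVectorIsContiguous(bool trans_a, bool trans_b,
                           std::size_t M, std::size_t N,
                           std::size_t lda, std::size_t ldb) {
    if (M == 1) {
        // A holds a length-K vector. NoTrans: A is a single 1xK row,
        // contiguous for any lda. Trans: A is stored Kx1 with row stride
        // lda, so elements are contiguous only when lda == 1 (== M).
        return trans_a ? (lda == 1) : true;
    }
    if (N == 1) {
        // B holds a length-K vector. NoTrans: B is stored Kx1 with row
        // stride ldb, contiguous only when ldb == 1 (== N). Trans: B is a
        // single 1xK row, contiguous for any ldb.
        return trans_b ? true : (ldb == 1);
    }
    return false;  // not a degenerate GEMV shape
}
```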

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@edgchen1
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29 hariharans29 merged commit 92c1ed2 into microsoft:main Dec 14, 2025
89 checks passed
Sumit2318 pushed a commit that referenced this pull request Jan 6, 2026
