Enable SME for sgemm and sbgemm through KleidiAI #24346

MichaelTylerArm · 2025-04-08T14:18:01Z

Description

Enables Arm® KleidiAI™ SME kernels for MLAS sgemm and sbgemm functions.

Motivation and Context

These kernels provide performance improvements on SME-enabled devices.
We see a performance improvement of 1.2x-1.8x on onnxruntime_perf_test for the following Geekbench models on M4:

Model                           Speedup
deeplabv3_mobilenetv2_f16.onnx    1.79x
bert_tiny_f16.onnx                1.47x
deeplabv3_mobilenetv2_f32.onnx    1.43x
mobilenetv1_ssd_f16.onnx          1.29x
mobilenet_v1_f32.onnx             1.28x
mobilenetv1_ssd_f32.onnx          1.26x
de_efficientnetlitev3_f16.onnx    1.25x
mobilenet_v1_f16.onnx             1.23x

Signed-off-by: Michael Tyler <[email protected]>

MichaelTylerArm · 2025-04-08T14:31:46Z

Can workflows be approved please?


MichaelTylerArm · 2025-04-10T20:18:55Z

Can workflows be approved please?

edgchen1 · 2025-04-10T21:09:15Z

Would you mind sharing some measurements to give an idea of how much these changes improve the performance?

edgchen1 · 2025-04-10T21:56:49Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,
Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

azure-pipelines · 2025-04-10T21:57:10Z

Azure Pipelines successfully started running 5 pipeline(s).


yihonglyu

Could you include the benchmark improvement (e.g., onnxruntime_perf_test) on existing hardware (e.g., M4) in the git log message description?


MichaelTylerArm · 2025-05-12T10:30:31Z

I've added performance figures to the PR description.

hariharans29 · 2025-05-29T00:40:52Z

onnxruntime/core/mlas/inc/mlas.h

 size_t
 MLASCALL
 MlasGemmPackBSize(
+    CBLAS_TRANSPOSE TransA,


Just a thought: do we need to know whether A is going to be transposed in order to determine the size of packed B? It seems to bloat the API unnecessarily. The same applies to the MlasGemmPackB(...) routine. The parameter's only real use seems to be checking whether the Kleidi library can be used. Can this be determined separately and passed to the packing routines? The issue with adding this to the API is that other, non-Kleidi paths must then handle it appropriately.

hariharans29 · 2025-05-29T00:41:40Z

onnxruntime/core/mlas/lib/sgemm.cpp

+        const size_t BlockedN = (N + n_step - 1) / n_step;
+        const size_t AlignedN = BlockedN * n_step;
+
+        return AlignedN > 64 && K > 1;


Maybe add some documentation on why these are the criteria to be met before using the Kleidi micro-kernels?
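
As a rough sketch of what such documentation could look like, the check from the diff is repeated below as a standalone function with comments. The function name `ShouldUseKleidiSgemm` and the `n_step` parameter are hypothetical (in the PR, `n_step` would come from the selected micro-kernel), and the stated rationale is an assumption based only on the commit title "Disable KleidiAI for small sgemm shapes".

```cpp
#include <cstddef>

// Hypothetical helper mirroring the shape check in the diff above.
// n_step is assumed to be the step size of the kernel along N; it is a
// plain parameter here so the sketch does not depend on KleidiAI headers.
inline bool ShouldUseKleidiSgemm(std::size_t N, std::size_t K, std::size_t n_step)
{
    // Round N up to the next multiple of n_step.
    const std::size_t BlockedN = (N + n_step - 1) / n_step;
    const std::size_t AlignedN = BlockedN * n_step;

    // Assumed rationale: small shapes stay on the existing MLAS path.
    // The thresholds (64 and 1) are copied from the diff, not derived here.
    return AlignedN > 64 && K > 1;
}
```

With `n_step = 16`, for example, `N = 64` rounds to 64 and is rejected, while `N = 65` rounds to 80 and is accepted as long as `K > 1`.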

hariharans29 · 2025-05-29T00:42:57Z

onnxruntime/core/mlas/lib/sgemm.cpp

-
-    size_t CountN;
+#if !(defined(USE_KLEIDIAI) && !defined(_MSVC_LANG))
+    MLAS_UNREFERENCED_PARAMETER(TransB);


Ignoring it for non-Kleidi enabled builds seems like a bad idea in general as the caller might be oblivious to its non-usage.

hariharans29 · 2025-05-29T00:44:12Z

onnxruntime/core/mlas/lib/sgemm.cpp

-        CountN = std::min(RangeCountN - n, size_t(MLAS_SGEMM_PACKED_STRIDEN));
+        kai_run_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa(M, RangeCountN, K,
+            A, rhs_ptr, C, dst_stride, sizeof(float),
+            -std::numeric_limits<float>::max(), std::numeric_limits<float>::max());


Very minor ignorable nit: Maybe use std::numeric_limits<float>::lowest() instead of -std::numeric_limits<float>::max()?
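
For reference, a minimal sketch of the two spellings, assuming IEEE 754 `float` (the constant names `kClampMin` and `kClampMax` are made up for illustration). Both give the same value, so the change would be about readability rather than behavior:

```cpp
#include <limits>

// Illustrative names only; the PR passes these values inline to the kernel.
constexpr float kClampMin = std::numeric_limits<float>::lowest();
constexpr float kClampMax = std::numeric_limits<float>::max();

// On IEEE 754 platforms, lowest() is exactly the negation of max().
static_assert(kClampMin == -kClampMax, "lowest() should equal -max() for float");

// min() is the smallest positive normal value, not the most negative one,
// which is why lowest() is the clearer name for a lower clamp bound.
static_assert(std::numeric_limits<float>::min() > 0.0f, "min() is positive for float");
```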

jywu-msft · 2025-06-04T15:30:07Z

@MichaelTylerArm your branch has conflicts. Can it be updated? Thanks!

damdoo01-arm · 2025-06-05T11:16:52Z

Hi George, Hariharan, apologies for the delay in responding to the above. There have been a few developments since this PR was reviewed. We have a new merge candidate under a proposed MLAS architectural change that was communicated to Microsoft. Following that initial communication there has been additional feedback, with a proposal to create a struct with function pointers that may help alleviate the MlasGemmPackB bloat concern described above. I understand Ronan on our side is looking to start a discussion to firm up this proposal for both ARM and MSFT. We will take on board the additional comments provided above and work them into our new branch. In the meantime, I propose we close this PR pending the new PR reflecting the agreed changes.
Thank you both.
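
To make the function-pointer idea concrete, here is one possible shape it could take. This is only a sketch under stated assumptions: every name below (`GemmPackOverrides`, `DefaultPackBSize`, `PackBSizeWithOverrides`) is hypothetical and does not exist in MLAS, and the agreed design may look quite different. The point is that a backend can decide for itself whether it handles a shape, so the public packing signature does not need extra parameters.

```cpp
#include <cstddef>

// Hypothetical dispatch table; none of these names exist in MLAS.
struct GemmPackOverrides {
    // Returns the packed-B size in bytes, or 0 if the backend declines the shape.
    std::size_t (*PackBSize)(std::size_t N, std::size_t K) = nullptr;
};

// Placeholder for the existing path; the real MLAS size computation differs.
inline std::size_t DefaultPackBSize(std::size_t N, std::size_t K)
{
    return N * K * sizeof(float);
}

// Try the override first and fall back to the default path otherwise.
inline std::size_t PackBSizeWithOverrides(const GemmPackOverrides& overrides,
                                          std::size_t N, std::size_t K)
{
    if (overrides.PackBSize != nullptr) {
        const std::size_t size = overrides.PackBSize(N, K);
        if (size != 0) {
            return size;
        }
    }
    return DefaultPackBSize(N, K);
}
```

A build without the optional backend would simply leave the pointer null, so callers on other paths see no change in behavior.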


hariharans29 · 2025-08-11T23:26:42Z

Assuming this PR is no longer relevant after #25187, so closing this for now. If it is still relevant, please re-open. Thanks.

Use KleidiAI for sgemm and sbgemm

824c725

Signed-off-by: Michael Tyler <[email protected]>

MichaelTylerArm requested a review from a team as a code owner April 8, 2025 14:18

jywu-msft requested review from edgchen1, fajin-corp and liqunfu April 10, 2025 03:22

MichaelTylerArm added 2 commits April 10, 2025 10:22

Fix N step selection

7e345e0

Signed-off-by: Michael Tyler <[email protected]>

Add missing parameters

7a3fa64

Signed-off-by: Michael Tyler <[email protected]>

Add guards for SME on MSVC

2d51b7d

Signed-off-by: Michael Tyler <[email protected]>

jywu-msft requested a review from yihonglyu April 16, 2025 23:20

MichaelTylerArm added 2 commits April 25, 2025 14:09

Disable KleidiAI for small sgemm shapes

69dec43

Signed-off-by: Michael Tyler <[email protected]>

Don't call SME function if not present

9969b36

Signed-off-by: Michael Tyler <[email protected]>

yihonglyu reviewed May 6, 2025


Revert qnbitgemm changes

7c6514e

Signed-off-by: Michael Tyler <[email protected]>

hariharans29 reviewed May 29, 2025


MichaelTylerArm closed this Jun 17, 2025

MichaelTylerArm reopened this Jun 17, 2025

Use BF16 for sgemm

62f0b67

Signed-off-by: Michael Tyler <[email protected]>

hariharans29 closed this Aug 11, 2025

patryk-kaiser-ARM mentioned this pull request Dec 11, 2025

[MLAS] Integrate KleidiAI BF16 SME2 Kernel Through Mlas SBGEMM Path #26773

Open
