Skip to content

Conversation

@MichaelTylerArm
Copy link
Contributor

@MichaelTylerArm MichaelTylerArm commented Apr 8, 2025

Description

Enables Arm® KleidiAI™ SME kernels for MLAS sgemm and sbgemm functions.

Motivation and Context

These kernels provide performance improvements on SME-enabled devices.
We see a performance improvement of 1.2x-1.8x on onnxruntime_perf_test for the following Geekbench models on M4:

Model                           Speedup
deeplabv3_mobilenetv2_f16.onnx    1.79x
bert_tiny_f16.onnx                1.47x
deeplabv3_mobilenetv2_f32.onnx    1.43x
mobilenetv1_ssd_f16.onnx          1.29x
mobilenet_v1_f32.onnx             1.28x
mobilenetv1_ssd_f32.onnx          1.26x
de_efficientnetlitev3_f16.onnx    1.25x
mobilenet_v1_f16.onnx             1.23x

@MichaelTylerArm MichaelTylerArm requested a review from a team as a code owner April 8, 2025 14:18
@MichaelTylerArm
Copy link
Contributor Author

Can workflows be approved please?

Signed-off-by: Michael Tyler <[email protected]>
Signed-off-by: Michael Tyler <[email protected]>
@MichaelTylerArm
Copy link
Contributor Author

Can workflows be approved please?

@edgchen1
Copy link
Contributor

Would you mind sharing some measurements to give an idea of how much these changes improve the performance?

@edgchen1
Copy link
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,
Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 5 pipeline(s).

Signed-off-by: Michael Tyler <[email protected]>
@jywu-msft jywu-msft requested a review from yihonglyu April 16, 2025 23:20
Copy link
Contributor

@yihonglyu yihonglyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you include the benchmark improvement (e.g., onnxruntime_perf_test) on existing hardware (e.g., M4) in the git log message description?

Signed-off-by: Michael Tyler <[email protected]>
@MichaelTylerArm
Copy link
Contributor Author

I've added performance figures to the PR description.

size_t
MLASCALL
MlasGemmPackBSize(
CBLAS_TRANSPOSE TransA,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought - Do we need data to see if A is going to be transposed to determine the size of packed B ? It seems like it bloats the API unnecessarily. The same thoughts for the MlasGemmPackB(...) routine. It seems like its only real usage is to see if the Kleidi library can be used. Can this be determined separately and sent to the packing routines ? The issue with introducing this to the API is that other non-Kleidi paths must handle them appropriately if exposed in the API ?

const size_t BlockedN = (N + n_step - 1) / n_step;
const size_t AlignedN = BlockedN * n_step;

return AlignedN > 64 && K > 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some documentation on why this is the criteria to be met before using the Kleidi micro-kernels ?


size_t CountN;
#if !(defined(USE_KLEIDIAI) && !defined(_MSVC_LANG))
MLAS_UNREFERENCED_PARAMETER(TransB);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring it for non-Kleidi enabled builds seems like a bad idea in general as the caller might be oblivious to its non-usage.

CountN = std::min(RangeCountN - n, size_t(MLAS_SGEMM_PACKED_STRIDEN));
kai_run_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa(M, RangeCountN, K,
A, rhs_ptr, C, dst_stride, sizeof(float),
-std::numeric_limits<float>::max(), std::numeric_limits<float>::max());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor ignorable nit: Maybe use std::numeric_limits::lowest() instead of -std::numeric_limits::max() ?

@jywu-msft
Copy link
Member

@MichaelTylerArm your branch has conflicts. can it be updated? Thanks!

@damdoo01-arm
Copy link
Contributor

damdoo01-arm commented Jun 5, 2025

Hi George, Hariharan, apologies for the delay in responding to the above. So a few developments since this PR was reviewed. We have a new merge candidate under a proposed MLAS architectural change that was communicated to Microsoft. Following that initial communication there has additional feedback with a proposal to create a struct with function pointers that may help alleviate the MlasGemmPackB bloat concern described above. I understand Ronan on our side is looking to create a discussion to firm up on this proposal for both ARM and MSFT. We will take on board the additional comments provided the above and work them into our new branch. In the meantime, I propose we close this PR pending the new PR reflecting the agreed changes.
Thank you both.

Signed-off-by: Michael Tyler <[email protected]>
@hariharans29
Copy link
Member

Assuming this PR is not relevant any more after #25187 ? So, closing this for now. If relevant, please re-open. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants