-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Enable SME for sgemm and sbgemm through KleidiAI #24346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Michael Tyler <[email protected]>
|
Can workflows be approved please? |
Signed-off-by: Michael Tyler <[email protected]>
Signed-off-by: Michael Tyler <[email protected]>
|
Can workflows be approved please? |
|
Would you mind sharing some measurements to give an idea of how much these changes improve the performance? |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline, |
|
Azure Pipelines successfully started running 5 pipeline(s). |
Signed-off-by: Michael Tyler <[email protected]>
Signed-off-by: Michael Tyler <[email protected]>
Signed-off-by: Michael Tyler <[email protected]>
yihonglyu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you include the benchmark improvement (e.g., onnxruntime_perf_test) on existing hardware (e.g., M4) in the git log message description?
Signed-off-by: Michael Tyler <[email protected]>
|
I've added performance figures to the PR description. |
| size_t | ||
| MLASCALL | ||
| MlasGemmPackBSize( | ||
| CBLAS_TRANSPOSE TransA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought - Do we need data to see if A is going to be transposed to determine the size of packed B ? It seems like it bloats the API unnecessarily. The same thoughts for the MlasGemmPackB(...) routine. It seems like its only real usage is to see if the Kleidi library can be used. Can this be determined separately and sent to the packing routines ? The issue with introducing this to the API is that other non-Kleidi paths must handle them appropriately if exposed in the API ?
| const size_t BlockedN = (N + n_step - 1) / n_step; | ||
| const size_t AlignedN = BlockedN * n_step; | ||
|
|
||
| return AlignedN > 64 && K > 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some documentation on why this is the criteria to be met before using the Kleidi micro-kernels ?
|
|
||
| size_t CountN; | ||
| #if !(defined(USE_KLEIDIAI) && !defined(_MSVC_LANG)) | ||
| MLAS_UNREFERENCED_PARAMETER(TransB); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignoring it for non-Kleidi enabled builds seems like a bad idea in general as the caller might be oblivious to its non-usage.
| CountN = std::min(RangeCountN - n, size_t(MLAS_SGEMM_PACKED_STRIDEN)); | ||
| kai_run_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa(M, RangeCountN, K, | ||
| A, rhs_ptr, C, dst_stride, sizeof(float), | ||
| -std::numeric_limits<float>::max(), std::numeric_limits<float>::max()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor ignorable nit: Maybe use std::numeric_limits::lowest() instead of -std::numeric_limits::max() ?
|
@MichaelTylerArm your branch has conflicts. can it be updated? Thanks! |
|
Hi George, Hariharan, apologies for the delay in responding to the above. So a few developments since this PR was reviewed. We have a new merge candidate under a proposed MLAS architectural change that was communicated to Microsoft. Following that initial communication there has additional feedback with a proposal to create a struct with function pointers that may help alleviate the MlasGemmPackB bloat concern described above. I understand Ronan on our side is looking to create a discussion to firm up on this proposal for both ARM and MSFT. We will take on board the additional comments provided the above and work them into our new branch. In the meantime, I propose we close this PR pending the new PR reflecting the agreed changes. |
Signed-off-by: Michael Tyler <[email protected]>
|
Assuming this PR is not relevant any more after #25187 ? So, closing this for now. If relevant, please re-open. Thanks. |
Description
Enables Arm® KleidiAI™ SME kernels for MLAS sgemm and sbgemm functions.
Motivation and Context
These kernels provide performance improvements on SME-enabled devices.
We see a performance improvement of 1.2x-1.8x on onnxruntime_perf_test for the following Geekbench models on M4: