Replace vmlaq_f32 with vfmaq_f32 (fused multiply-add) #25669
Conversation
@skottmckay @snnn @yufenglee, appreciate it if this tiny PR could be reviewed & CI triggered. Thanks!
@hariharans29 could you review this PR & run CI? Thanks!
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 5 pipeline(s).
Looks like the Windows x64 build timed out after 2 hours.
/azp run Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
The Windows x64 QNN CI Pipeline timed out again. The logs suggest that the build was still in progress rather than hung. What could be the issue?
I think you may have to wait for this: #25864
Thanks @hariharans29. It looks like #25864 has been closed, as the updated VM SKU seems to be running better. Do you mind re-triggering the CI for this PR? Additionally, could you also review (or request someone to review) #25580? Thanks!
/azp run Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
Thanks for the contribution and patience. I am trying to find someone to review it (and a few other MLAS PRs). It may take some time. I will get back on that. Thanks. |
Thanks @hariharans29! The CI for this PR unfortunately timed out again 🤔
/azp run Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
Thanks @hariharans29! |
Description
The vfmaq_f32 intrinsic compiles to the FMLA instruction, which is more performant than the separate fmul + fadd instructions that vmlaq_f32 compiles to on the latest GCC versions: https://godbolt.org/z/aYc9as5Wh
Note that this is not a breaking change: vmlaq_f32 already compiles to FMLA instructions on the latest clang compilers (which are already the default for macOS ORT builds).
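For illustration, below is a minimal standalone sketch of the swap. The helper names are hypothetical stand-ins rather than the actual MLAS code, but both intrinsics take the accumulator first followed by the two multiplicands, so the change is a drop-in replacement at each call site.

```c
#include <arm_neon.h>

// Before: vmlaq_f32 computes accumulator + (a * b); recent GCC versions may
// lower it to separate FMUL + FADD instructions with an intermediate rounding step.
static inline float32x4_t MultiplyAddBefore(float32x4_t a, float32x4_t b, float32x4_t accumulator)
{
    return vmlaq_f32(accumulator, a, b);
}

// After: vfmaq_f32 computes the same accumulator + (a * b), but maps to a single
// fused FMLA instruction with one rounding step.
static inline float32x4_t MultiplyAddAfter(float32x4_t a, float32x4_t b, float32x4_t accumulator)
{
    return vfmaq_f32(accumulator, a, b);
}
```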
Motivation and Context
With this change, the NEON version of MlasMultiplyAddFloat32x4 achieves parity with the x86 version, which uses _mm_fmadd_ps. It also achieves up to ~15% speedups compared to the current vmlaq_f32 implementation when tested on top of #25580.
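For reference, the x86 side relies on _mm_fmadd_ps for the same fused operation. The snippet below is an illustrative sketch only (the function name is hypothetical, and FMA hardware support is assumed, e.g. building with -mfma), not the actual MLAS implementation.

```c
#include <immintrin.h>

// _mm_fmadd_ps(a, b, c) computes (a * b) + c as a single fused operation,
// matching the behavior of vfmaq_f32 on NEON.
static inline __m128 MultiplyAddFloat32x4_X86(__m128 a, __m128 b, __m128 accumulator)
{
    return _mm_fmadd_ps(a, b, accumulator);
}
```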