Performance regression in version 0.3.28 on Graviton4 #4939
Recently I reported a performance regression at a4e56e0. There is some glitch in GitHub which prevents me from posting a follow-up there, so I am opening this ticket.

In my case the app makes calls to DGEMM with 86100 different inputs. I picked some of what appears to be the most common calls: [list of representative calls lost in formatting]
Comments
Hi @dnoan, that is unfortunate. What's interesting is that many of these calls are forwarded to GEMV before the small-matrix path is ever reached (interface/gemm.c, lines 501 to 545 in 8483a71), whilst the patch you're indicating is triggered later, after checking the small-GEMM permit (lines 551 to 571 in 8483a71).

This would indicate that only the calls which reach that last stage are affected, which leaves the remaining GEMM calls to investigate. Could you also let me know what compiler you're using?
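A minimal sketch of that dispatch order, with invented names and threshold (not the literal interface/gemm.c code), to make the point concrete:

```c
#include <stdio.h>

/* Stub standing in for the real small-GEMM permit (discussed later in
 * the thread); the point here is only the ORDER of the checks. */
static int small_permit_stub(long m, long n, long k) {
    return m * n * k <= 64L * 64 * 64;  /* invented threshold */
}

/* Sketch of the dispatch order described above: GEMV forwarding happens
 * before the small-matrix permit, so calls with m == 1 or n == 1 never
 * reach the small-GEMM path at all. */
static const char *gemm_dispatch_sketch(long m, long n, long k) {
    if (m == 1 || n == 1)             /* step 1: degenerate shape   */
        return "forwarded to GEMV";
    if (small_permit_stub(m, n, k))   /* step 2: small-GEMM permit  */
        return "small-GEMM kernel";
    return "regular blocked GEMM";    /* step 3: default path       */
}

int main(void) {
    printf("%s\n", gemm_dispatch_sketch(1, 500, 500));   /* GEMV        */
    printf("%s\n", gemm_dispatch_sketch(32, 32, 32));    /* small GEMM  */
    printf("%s\n", gemm_dispatch_sketch(512, 512, 512)); /* blocked     */
    return 0;
}
```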
This could indeed be an unintentional downgrade caused by switching from SVE GEMM to NEON GEMV (the SVE implementation of the latter - from #4803 - only being available on A64FX right now).
It might be worthwhile to copy kernel/arm64/KERNEL.A64FX to kernel/arm64/KERNEL.NEOVERSEV2 and rebuild/retest.
Copying KERNEL.A64FX to KERNEL.NEOVERSEV2 didn't improve performance. Actually, the workload ran marginally slower. @Mousius, can you provide a diff so that I don't mess things up?
Guess that would be something like the fragment below.
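The diff originally posted here did not survive formatting; a guess at its shape, assuming the SVE GEMV kernels that #4803 added for A64FX are the relevant entries:

```make
# Hypothetical lines for kernel/arm64/KERNEL.NEOVERSEV2, copied from the
# corresponding entries in kernel/arm64/KERNEL.A64FX (a guess, not the
# diff that was actually posted):
SGEMVNKERNEL = gemv_n_sve.c
SGEMVTKERNEL = gemv_t_sve.c
DGEMVNKERNEL = gemv_n_sve.c
DGEMVTKERNEL = gemv_t_sve.c
```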
I tried with that diff as well, without improvement.
Thank you. But are you certain that it was a4e56e0 that brought the slowdown, or is this just an educated guess? (Does speed return to normal when you simply revert it?)
I bisected the commits between 0.3.27 and 0.3.28. The following had no effect: [change lost in formatting]

The following brought performance back to normal: [change lost in formatting]
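For reference, the standard bisect workflow between the two release tags looks like this (illustrative commands, assuming the v0.3.27/v0.3.28 tags):

```sh
# Bisect the regression between the two releases (bad first, then good):
git bisect start v0.3.28 v0.3.27
# At each step: build OpenBLAS, run the workload, then mark the commit:
git bisect good    # or `git bisect bad`, until the culprit is isolated
git bisect reset   # return to the original checkout when done
```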
Thanks - this is surprising, I would have expected the two to be equivalent (unless I'm missing something - the check for "small size" itself can't be that expensive?)
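For a sense of scale, a permit check of this kind is just a handful of scalar comparisons; a minimal sketch (signature and threshold are assumptions, not the actual OpenBLAS implementation):

```c
/* Sketch of a gemm_small_kernel_permit-style gate: a multiply chain and
 * a compare, cheap in isolation -- which is why the slowdown reported
 * here is surprising. Not the actual OpenBLAS code. */
static int dgemm_small_permit_sketch(int transa, int transb,
                                     long m, long n, long k,
                                     double alpha, double beta) {
    (void)transa; (void)transb; (void)alpha; (void)beta;  /* unused here */
    return m * n * k <= 64L * 64 * 64;  /* invented size threshold */
}
```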
I removed the check completely and it brought the worst result ever: [timing lost in formatting]
By the way, I don't know if the performance issue is in DGEMM. Could it be that it comes from somewhere else?
Hmmm. For that to be the worst result, your code must be doing lots of SGEMM calls as well (?), and that simple call to gemm_small_kernel_permit must be causing a cache flush or something.
...definitely no other code paths are affected besides GEMM (and the directly related GEMM_BATCH). Also, "small matrix" is checked only after the possible branching into GEMV, so most of the representative DGEMM calls you picked out should never even get that far down into interface/gemm.c (as mentioned by Mousius in his very first comment).
Maybe I missed it, but how much worse is the performance with the small-matrix check in place?
I rebuilt and reran everything. The example workload makes mostly LAPACK calls but also some BLAS calls, including DGEMM. Testing was done on c8g.metal-24xl. I deleted the entire project after each build and test run and then started from scratch.

Version 0.3.27 - 250 seconds
Version 0.3.28 - returning [variant and timing lost in formatting]
Version 0.3.28 - returning [variant and timing lost in formatting]
Thanks, @dnoan. This is very helpful. It's interesting to see the ~5% drop when just enabling the small matrix optimisation. When calling dgemm, what are your transpose settings? I'd like to quickly write an alternative kernel for you to try 😸
In this case transpose is TN and NN.
Thanks! I've put some experimental ASIMD kernels in #4963. Can you give them a try and see whether they perform better? I hypothesise that using ASIMD may be slightly more efficient than SVE for the 128-bit vector length for such small sizes.
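A minimal sketch of what such a fixed-width ASIMD micro-kernel could look like (names, tile size, and packing layout are assumptions for illustration; this is not the code in #4963):

```c
#include <arm_neon.h>

/* Hypothetical 2x4 double-precision micro-kernel using 128-bit ASIMD.
 * Computes C(2x4) += A(2xK, packed column-major) * B(Kx4, packed
 * row-major), with C column-major and leading dimension ldc. */
static void dgemm_ukernel_2x4_asimd(long K, const double *A,
                                    const double *B,
                                    double *C, long ldc) {
    float64x2_t c0 = vld1q_f64(&C[0 * ldc]);
    float64x2_t c1 = vld1q_f64(&C[1 * ldc]);
    float64x2_t c2 = vld1q_f64(&C[2 * ldc]);
    float64x2_t c3 = vld1q_f64(&C[3 * ldc]);
    for (long k = 0; k < K; k++) {
        float64x2_t a   = vld1q_f64(&A[2 * k]);     /* column k of A */
        float64x2_t b01 = vld1q_f64(&B[4 * k]);     /* B[k][0..1]    */
        float64x2_t b23 = vld1q_f64(&B[4 * k + 2]); /* B[k][2..3]    */
        /* Fixed-lane FMAs; no per-iteration predicate setup as in SVE. */
        c0 = vfmaq_laneq_f64(c0, a, b01, 0);
        c1 = vfmaq_laneq_f64(c1, a, b01, 1);
        c2 = vfmaq_laneq_f64(c2, a, b23, 0);
        c3 = vfmaq_laneq_f64(c3, a, b23, 1);
    }
    vst1q_f64(&C[0 * ldc], c0);
    vst1q_f64(&C[1 * ldc], c1);
    vst1q_f64(&C[2 * ldc], c2);
    vst1q_f64(&C[3 * ldc], c3);
}
```

With a known 128-bit vector length, lane-indexed FMAs over a fully unrolled tile avoid SVE predicate management entirely, which is precisely the efficiency question the experiment is testing.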
There is no difference in performance compared to 0.3.28. I don't know how OpenBLAS works internally, but I can't see anywhere in the diff where we switch from the SVE to the ASIMD kernels.
I've commented out the SVE kernels and added ASIMD ones for the small-matrix kernels, and added them explicitly to the Neoverse kernel lists. There is potential I got this wrong; it seems odd you're seeing no performance change 🤔
I'm beginning to suspect a problem with cpu autodetection - @dnoan, I assume you are building with DYNAMIC_ARCH=1, or not?
Yes, my settings are: [build settings lost in formatting]
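The exact settings were lost in formatting above; for context, a typical runtime-dispatch build on AArch64 looks something like the following (an illustration, not the reporter's actual configuration):

```sh
# Example only -- a common AArch64 build with runtime CPU detection:
make DYNAMIC_ARCH=1 TARGET=ARMV8 USE_OPENMP=1 CC=gcc FC=gfortran
```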
@dnoan, what compiler are you using? If it's an older compiler or mis-detected, it might not enable the SVE platforms.
I am compiling with GCC 13.2. |