Poor DGEMM performance for armsve build on Neoverse N2 #641
Comments
I think, of the currently-available configs, that ThunderX2 should perform best on N2. The SVE kernels are tuned for 256+ bit so I think you really want a neon kernel. A "real" Neoverse N1 kernel/configuration should be in |
Good to hear about the N1 kernel coming to master. I also suggest building a 4x128 NEON kernel on the Neoverse V1 (AWS Graviton3). For GEMM, I don't see SVE128 having a significant advantage over NEON128. If you build a kernel that can feed four NEON SIMD units it should run very well on all known Arm server-class CPUs, even if they don't have wide SVE units. |
FWIW, the existing Neon kernel we have was designed originally for dual
issue Neon, but manages to issue to four 128-bit Neon units on the Apple
M1 series cores and achieves over 99% of peak flops. It seems that
cores designed for issuing to 4 pipes scale up older code pretty well,
although I haven't tested a G3 yet. :)
…On 7/8/22 2:01 PM, John C. Linford wrote:
Good to hear about the N1 kernel coming to master. I also suggest
building a 4x128 NEON kernel on the Neoverse V1 (AWS Graviton3). For
GEMM, I don't see SVE128 having a significant advantage over NEON128.
If you build a kernel that can feed four NEON SIMD units it should run
very well on all known Arm server-class CPUs, even if they don't have
wide SVE units.
|
Apologies for the late response. For Graviton3, 2xSVE256 does better than 4xNEON by about 2%.
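
For readers following the comparison, a back-of-the-envelope sketch of why 4x128-bit NEON and 2x256-bit SVE land so close together. It assumes each SIMD pipe retires one double-precision fused multiply-add per cycle; the helper name `peak_dp_flops_per_cycle` is made up for illustration, and the pipe counts are the ones quoted in this thread rather than independently verified figures.

```python
def peak_dp_flops_per_cycle(pipes: int, vector_bits: int) -> int:
    """Peak double-precision flops per cycle for one core.

    Assumption: every pipe can issue one FMA per cycle.
    """
    lanes = vector_bits // 64  # number of 64-bit doubles in one vector
    return pipes * lanes * 2   # an FMA counts as two flops (multiply + add)


# Configurations as described in the thread (pipes, vector width in bits).
configs = {
    "4 x NEON128": (4, 128),
    "2 x SVE256": (2, 256),
    "2 x SVE128": (2, 128),
}

for name, (pipes, bits) in configs.items():
    print(f"{name}: {peak_dp_flops_per_cycle(pipes, bits)} flops/cycle")
```

Under that assumption the first two configurations have the same theoretical peak, so a difference of a few percent would come from kernel details rather than raw vector width.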
|
Hi.
Whilst doing some comparative benchmarking on the Alibaba Cloud g8m instances I've run into some BLIS performance issues. g8m is based on Arm's Neoverse N2 technology and has 2x128-bit SVE vectors.
When I build for the target "armsve", I get a peak performance of between 5 and 6 GFLOPs on a single core, rather than the 20 GFLOPs I get from the Neon implementation.
There seems to be an awful lot of time spent in the function "bli_dpackm_mrxk_armsve_ref" which makes me think it is packing incorrectly for the 128-bit vector length. Running on AWS Graviton3 instances (with a 256-bit vector length) does not show these issues.
Thanks.
Chris
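
For context on the packing step mentioned above, here is a minimal sketch of what packing a matrix into MR x k micro-panels generally looks like. It is not the BLIS implementation of "bli_dpackm_mrxk_armsve_ref"; the function `pack_mr_by_k` and its parameters are hypothetical, and the input is a plain row-major list of lists for readability.

```python
def pack_mr_by_k(a, m, k, mr):
    """Pack an m x k matrix into micro-panels of mr rows each.

    Within a panel, values are stored column by column, so the mr values
    a micro-kernel loads for one column are contiguous. The last panel is
    zero-padded when m is not a multiple of mr.
    """
    packed = []
    for i0 in range(0, m, mr):           # one micro-panel per block of mr rows
        for p in range(k):               # walk the columns of the panel
            for i in range(i0, i0 + mr): # mr contiguous values for column p
                packed.append(a[i][p] if i < m else 0.0)
    return packed


a = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
print(pack_mr_by_k(a, m=3, k=2, mr=2))
# [1.0, 3.0, 2.0, 4.0, 5.0, 0.0, 6.0, 0.0]
```

A scalar element-by-element loop like this is cheap to write but slow; if the packing routine in use falls back to something of this shape, it could plausibly dominate the profile in the way described.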