1xK @ KxN matrix multiplication using GEMM significantly slower than grouped version #4890
Comments
I have not tried to reproduce this yet, but a possible explanation could be that by grouping you keep the individual problem size small enough to run in a single thread, while in the other case the multithreading overhead may outweigh the performance advantage of having (just) two threads. (And I assume you are actually talking about versions 0.3.27/0.3.28, not 0.3.7/0.3.8 from about five years ago.)
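One quick way to check that hypothesis (this is not from the original thread, only a hedged sketch) is to force OpenBLAS onto a single thread, either by running with OPENBLAS_NUM_THREADS=1 in the environment or by calling OpenBLAS's openblas_set_num_threads() extension before the timed calls:

```c
/* Sketch (not from the original thread): pin OpenBLAS to one thread so any
 * grouped-vs-direct timing difference cannot come from multithreading
 * overhead. Equivalent to running the binary with OPENBLAS_NUM_THREADS=1. */
#include <cblas.h>

/* OpenBLAS extension; its own cblas.h declares this, the prototype is
 * repeated here only so the sketch also builds against a generic cblas.h. */
extern void openblas_set_num_threads(int num_threads);

int main(void) {
    openblas_set_num_threads(1);  /* all subsequent BLAS calls use one thread */
    /* ... run the grouped and ungrouped cblas_sgemm timings here ... */
    return 0;
}
```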
Thank you for your response :D Yes, I was referring to openblas versions 0.3.27/0.3.28; 0.3.7/0.3.8 was a typo. To eliminate the impact of multi-threading, I tested on a single-threaded ARM Debian system using a non-threaded openblas (0.3.28). The output is included below.

# ARM debian with single thread
root@am62xx-evm:/path$ ./a.out # input shape: 128x512x512, groups: 1
Input shape : (128, 512, 512)
Output shape: (1, 512, 512)
Threads: 1
Groups: 1
Elapsed time with group: 99.8057 ms
Elapsed time without group: 100.5576 ms
root@am62xx-evm:/path$ ./a.out 2048 # groups: 2048
Input shape : (128, 512, 512)
Output shape: (1, 512, 512)
Threads: 1
Groups: 2048
Elapsed time with group: 46.5642 ms
Elapsed time without group: 100.7267 ms

The results still showed a speed improvement, which means the grouped computation is also faster in a single-threaded context. Furthermore, since the improvement with two threads was minimal, I also ran the test code on a twenty-thread Intel WSL machine and got a faster speed there too.

# Intel PC with twenty threads
root@am62xx-evm:/path$ ./a.out # input shape: 128x512x512, groups: 1
Input shape : (128, 512, 512)
Output shape: (1, 512, 512)
Threads: 20
Groups: 1
Elapsed time with group: 67.3847 ms
Elapsed time without group: 49.0649 ms
root@am62xx-evm:/path$ ./a.out 2048 # groups: 2048
Input shape : (128, 512, 512)
Output shape: (1, 512, 512)
Threads: 20
Groups: 2048
Elapsed time with group: 26.4648 ms
Elapsed time without group: 53.7776 ms

In summary, the speed improvements from grouping are evident on both single-threaded and multi-threaded platforms, so it doesn't seem to be just about the threads. If you're up for trying it out, just modify

CPU Info:

Single Thread ARM Debian
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
Stepping: r3p1
BogoMIPS: 50.00
NUMA node0 CPU(s): 0
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrc pc dcpop asimddp ssbs

Twenty threads WSL
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Virtualization features:
  Virtualization: VT-x
  Hypervisor vendor: Microsoft
  Virtualization type: full
Caches (sum of all):
  L1d: 480 KiB (10 instances)
  L1i: 320 KiB (10 instances)
  L2: 20 MiB (10 instances)
  L3: 24 MiB (1 instance)
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Retbleed: Mitigation; Enhanced IBRS
  Spec rstack overflow: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds: Not affected
  Tsx async abort: Not affected
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i5-13600K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 1
BogoMIPS: 6988.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Hmm. Sadly I think the explanation is simply that your example is flawed - if you compare C1 and C2 in the end, they will be different for any value of "groups" that is sufficiently large to give your code an advantage. E.g. in the extreme case of your 2048-group run, dividing A_col by count will yield a K and LDA of zero.
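A tiny illustration of that integer-division pitfall (an editorial sketch, with variable names assumed rather than taken from the original test code):

```c
/* Illustration only: names are assumed, not copied from the test code. */
#include <stdio.h>

int main(void) {
    int A_col = 128;                  /* K of the full problem               */
    int groups = 2048;                /* group count from the 2048-group run */
    int k_per_group = A_col / groups; /* integer division truncates to 0     */
    /* With K (and hence LDA) of zero, each grouped cblas_sgemm call does no
     * work, so the grouped result never accumulates the full product and the
     * apparent "speedup" is largely skipped computation. */
    printf("k_per_group = %d\n", k_per_group); /* prints 0 */
    return 0;
}
```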
Yeah, I later tried implementing and optimizing gemm myself, and then I realized the issue with the "groups" approach. Performing gemm in groups this way is inherently flawed and can produce incorrect results. The fastest and safest solution is just using
I'm performing the matrix multiplication A@B=C using cblas_sgemm, where matrix A has size 1x72 and matrix B has size 72x(height*width), with height = width = 96 in my work. The computation takes 5ms when performing cblas_sgemm directly, but 1.1ms when done in a grouped way - almost 5 times faster. Is there a reason why the grouped computation gets such a considerable performance benefit over the direct call?

Also, since matrix B is in row-major order in my work, I can not slice B's columns by B[g * (B_col / groups)] and so can not speed up my computation with this grouped method. How should I improve the computation speed in this case? (A sketch of one possible approach appears at the end of this post.)

I found that openblas 0.3.8 was released recently while writing this issue, and its improvements include GEMM with 1xN or Mx1 matrices. I built this version and confirmed that the speed indeed improved a lot compared to openblas 0.3.7, but the grouped method is still faster: for an input shape of 8x512x512, grouped computation improves performance from 235ms to 53ms with openblas 0.3.7, and from 117ms to 51ms with openblas 0.3.8. All files were compiled with -O3.
The detailed outputs I got are shown as follows.
And here is the test code
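(The original attachment is not reproduced in this extract. Purely as an illustration of the kind of comparison discussed in this thread, and not the author's actual code, here is a minimal sketch that times one direct cblas_sgemm call against a loop over column blocks of B. All sizes and names are assumed; it splits the output columns (N) rather than A_col/K, so the grouped result stays correct, and it also shows that a column block of a row-major B can be addressed simply as B + g * nc with the leading dimension left at N, which is one answer to the row-major question earlier in this post.)

```c
/* Illustrative sketch only, NOT the original test code.
 * Build (assumed): gcc -O3 sketch.c -lopenblas */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(int argc, char **argv) {
    const int K = 128, N = 512 * 512;            /* assumed: "input shape 128x512x512" */
    const int groups = (argc > 1) ? atoi(argv[1]) : 1;
    const int nc = N / groups;                   /* assumes N is divisible by groups */
    float *A  = calloc(K, sizeof(float));            /* 1 x K row vector            */
    float *B  = calloc((size_t)K * N, sizeof(float)); /* K x N, row-major            */
    float *C1 = calloc(N, sizeof(float));
    float *C2 = calloc(N, sizeof(float));

    /* Direct call: C1 = A * B in one GEMM. */
    double t0 = now_ms();
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                1, N, K, 1.0f, A, K, B, N, 0.0f, C1, N);
    double t1 = now_ms();

    /* Grouped along the output columns: block g of row-major B starts at
     * B + g*nc; consecutive rows are still N elements apart, so ldb stays N. */
    for (int g = 0; g < groups; ++g)
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    1, nc, K, 1.0f, A, K, B + g * nc, N, 0.0f, C2 + g * nc, N);
    double t2 = now_ms();

    printf("Elapsed time without group: %.4f ms\n", t1 - t0);
    printf("Elapsed time with group:    %.4f ms\n", t2 - t1);
    free(A); free(B); free(C1); free(C2);
    return 0;
}
```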