1xK @ KxN matrix multiplication using GEMM significantly slower than grouped version #4890
Comments
I have not tried to reproduce this yet, but a possible explanation could be that by grouping you keep the individual problem size small enough to run in a single thread, while in the other case the multithreading overhead may outweigh the performance advantage of having (just) two threads. (And I assume you are actually talking about versions 0.3.27/0.3.28, not 0.3.7/0.3.8 from about five years ago.)
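One quick way to check that hypothesis (this is not from the original thread, only a hedged sketch) is to force OpenBLAS onto a single thread, either by running with OPENBLAS_NUM_THREADS=1 in the environment or by calling OpenBLAS's openblas_set_num_threads() extension before the timed calls:

```c
/* Sketch (not from the original thread): pin OpenBLAS to one thread so any
 * grouped-vs-direct timing difference cannot come from multithreading
 * overhead. Equivalent to running the binary with OPENBLAS_NUM_THREADS=1. */
#include <cblas.h>

/* OpenBLAS extension; its own cblas.h declares this, the prototype is
 * repeated here only so the sketch also builds against a generic cblas.h. */
extern void openblas_set_num_threads(int num_threads);

int main(void) {
    openblas_set_num_threads(1);  /* all subsequent BLAS calls use one thread */
    /* ... run the grouped and ungrouped cblas_sgemm timings here ... */
    return 0;
}
```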
Thank you for your response :D Yes, I was referring to openblas versions 0.3.27/0.3.28; 0.3.7/0.3.8 was a typo. To eliminate the impact of multi-threading, I tested on a single-threaded ARM Debian system using a non-threaded openblas (0.3.28). The output is included below.

# ARM debian with single thread
root@am62xx-evm:/path$ ./a.out # input shape: 128x512x512, groups: 1
Input shape : (128, 512, 512)
Output shape: (1, 512, 512)
Threads: 1
Groups: 1
Elapsed time with group: 99.8057 ms
Elapsed time without group: 100.5576 ms
root@am62xx-evm:/path$ ./a.out 2048 # groups: 2048
Input shape : (128, 512, 512)
Output shape: (1, 512, 512)
Threads: 1
Groups: 2048
Elapsed time with group: 46.5642 ms
Elapsed time without group: 100.7267 ms

The results still showed a speed improvement, which means the grouped computation is also faster in a single-threaded context. Furthermore, since the improvement with two threads was minimal, I also ran the test code on a twenty-thread Intel WSL machine and got a faster speed there too.

# Intel PC with twenty threads
root@am62xx-evm:/path$ ./a.out # input shape: 128x512x512, groups: 1
Input shape : (128, 512, 512)
Output shape: (1, 512, 512)
Threads: 20
Groups: 1
Elapsed time with group: 67.3847 ms
Elapsed time without group: 49.0649 ms
root@am62xx-evm:/path$ ./a.out 2048 # groups: 2048
Input shape : (128, 512, 512)
Output shape: (1, 512, 512)
Threads: 20
Groups: 2048
Elapsed time with group: 26.4648 ms
Elapsed time without group: 53.7776 ms

In summary, the speed improvements from grouping are evident on both single-threaded and multi-threaded platforms, so it doesn't seem to be just about the threads. If you're up for trying it out, just modify

CPU Info:

Single Thread ARM Debian
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
Stepping: r3p1
BogoMIPS: 50.00
NUMA node0 CPU(s): 0
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrc pc dcpop asimddp ssbs

Twenty threads WSL
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Virtualization features:
  Virtualization: VT-x
  Hypervisor vendor: Microsoft
  Virtualization type: full
Caches (sum of all):
  L1d: 480 KiB (10 instances)
  L1i: 320 KiB (10 instances)
  L2: 20 MiB (10 instances)
  L3: 24 MiB (1 instance)
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Retbleed: Mitigation; Enhanced IBRS
  Spec rstack overflow: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds: Not affected
  Tsx async abort: Not affected
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i5-13600K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 1
BogoMIPS: 6988.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Hmm. Sadly I think the explanation is simply that your example is flawed - if you compare C1 and C2 in the end, they will be different for any value of "groups" that is sufficiently large to give your code an advantage. E.g. in the extreme case of your 2048-group run, dividing A_col by count will yield a K and LDA of zero.
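A tiny illustration of that integer-division pitfall (an editorial sketch, with variable names assumed rather than taken from the original test code):

```c
/* Illustration only: names are assumed, not copied from the test code. */
#include <stdio.h>

int main(void) {
    int A_col = 128;                  /* K of the full problem               */
    int groups = 2048;                /* group count from the 2048-group run */
    int k_per_group = A_col / groups; /* integer division truncates to 0     */
    /* With K (and hence LDA) of zero, each grouped cblas_sgemm call does no
     * work, so the grouped result never accumulates the full product and the
     * apparent "speedup" is largely skipped computation. */
    printf("k_per_group = %d\n", k_per_group); /* prints 0 */
    return 0;
}
```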
Yeah, I later tried implementing and optimizing gemm myself, and then I realized the issue with the "groups" approach. Performing gemm in groups this way is inherently flawed and can produce incorrect results. The fastest and safest solution is just using
I'm performing the matrix multiplication A@B=C using cblas_sgemm, where matrix A has size 1x72 and matrix B has size 72x(height*width), with height = width = 96 in my work. The computation takes 5ms when performing cblas_sgemm directly, but 1.1ms when done in a grouped way - almost 5 times faster. Is there a reason why the grouped computation gets such a considerable performance benefit over the direct call?

Also, since matrix B is in row-major order in my work, I can not slice B's columns by B[g * (B_col / groups)] and so can not speed up my computation with this grouped method. How should I improve the computation speed in this case? (A sketch of one possible approach appears at the end of this post.)

I found that openblas 0.3.8 was released recently while writing this issue, and its improvements include GEMM with 1xN or Mx1 matrices. I built this version and confirmed that the speed indeed improved a lot compared to openblas 0.3.7, but the grouped method is still faster: for an input shape of 8x512x512, grouped computation improves performance from 235ms to 53ms with openblas 0.3.7, and from 117ms to 51ms with openblas 0.3.8. All files were compiled with -O3.
The detailed outputs I got are shown as follows.
And here is the test code
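(The original attachment is not reproduced in this extract. Purely as an illustration of the kind of comparison discussed in this thread, and not the author's actual code, here is a minimal sketch that times one direct cblas_sgemm call against a loop over column blocks of B. All sizes and names are assumed; it splits the output columns (N) rather than A_col/K, so the grouped result stays correct, and it also shows that a column block of a row-major B can be addressed simply as B + g * nc with the leading dimension left at N, which is one answer to the row-major question earlier in this post.)

```c
/* Illustrative sketch only, NOT the original test code.
 * Build (assumed): gcc -O3 sketch.c -lopenblas */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(int argc, char **argv) {
    const int K = 128, N = 512 * 512;            /* assumed: "input shape 128x512x512" */
    const int groups = (argc > 1) ? atoi(argv[1]) : 1;
    const int nc = N / groups;                   /* assumes N is divisible by groups */
    float *A  = calloc(K, sizeof(float));            /* 1 x K row vector            */
    float *B  = calloc((size_t)K * N, sizeof(float)); /* K x N, row-major            */
    float *C1 = calloc(N, sizeof(float));
    float *C2 = calloc(N, sizeof(float));

    /* Direct call: C1 = A * B in one GEMM. */
    double t0 = now_ms();
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                1, N, K, 1.0f, A, K, B, N, 0.0f, C1, N);
    double t1 = now_ms();

    /* Grouped along the output columns: block g of row-major B starts at
     * B + g*nc; consecutive rows are still N elements apart, so ldb stays N. */
    for (int g = 0; g < groups; ++g)
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    1, nc, K, 1.0f, A, K, B + g * nc, N, 0.0f, C2 + g * nc, N);
    double t2 = now_ms();

    printf("Elapsed time without group: %.4f ms\n", t1 - t0);
    printf("Elapsed time with group:    %.4f ms\n", t2 - t1);
    free(A); free(B); free(C1); free(C2);
    return 0;
}
```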