Optimize AVX2 SGEMM & STRMM #2361

wjc404 · 2020-01-06T10:04:51Z

Replace KERNEL_16x6 with KERNEL_8x12 to slow down reading of the packed matrix A (in L3 cache), as mentioned in issue #2210. The performance catches up with MKL 2019 after the change.
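A rough way to see why the tile shape matters, assuming the usual GEMM micro-kernel structure and that the first number in KERNEL_MxN is the extent along packed A (both are assumptions, not taken from the kernel source): per k-step an m_r x n_r tile loads m_r floats of A and n_r floats of B and issues m_r * n_r multiply-adds. The helper name tile_stats is hypothetical.

```python
from fractions import Fraction

def tile_stats(m_r, n_r, floats_per_ymm=8):
    """Per-k-step load/compute ratios for an m_r x n_r SGEMM micro-kernel tile."""
    fma_per_step = m_r * n_r  # one multiply-add per element of the C tile
    return {
        # 256-bit YMM registers needed to hold the C tile as accumulators
        "ymm_accumulators": fma_per_step // floats_per_ymm,
        # floats of packed A / packed B loaded per multiply-add
        "a_floats_per_fma": Fraction(m_r, fma_per_step),
        "b_floats_per_fma": Fraction(n_r, fma_per_step),
    }

old = tile_stats(16, 6)   # KERNEL_16x6 (assumed: 16 along A, 6 along B)
new = tile_stats(8, 12)   # KERNEL_8x12 (assumed: 8 along A, 12 along B)

for name, stats in (("16x6", old), ("8x12", new)):
    print(name, stats)
```

Under these assumptions both shapes use the same number of accumulator registers, while the 8x12 tile reads packed A at half the rate per multiply-add and reads packed B twice as often.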

wjc404 · 2020-01-06T10:10:48Z

1) 1-thread SGEMM test with m=n=k=7023, transa=N and transb=T, on i9 9900K at 4.4 GHz, theoretical 140.8 GFLOPS:

old kernel:

new kernel:

2) 4-thread SGEMM test with m=n=k=10000, transa=T and transb=N, on the same CPU at 4.2 GHz, theoretical 538 GFLOPS:

old kernel:

new kernel:

3) 8-thread test with m=n=k=20000, transa=transb=N, on the same CPU at 4.1 GHz, theoretical 1050 GFLOPS:

old kernel:

new kernel:

4) 1-thread STRMM test with m=n=6971, side=L, uplo=U, transa=N and diag=N, on the same CPU at 4.4 GHz, theoretical 140.8 GFLOPS:

old kernel:

new kernel:

wjc404 · 2020-01-06T10:54:48Z

1-thread SGEMM test with m=n=k=5999, transa=N and transb=N, on Ryzen 7 3700X at 3.6 GHz, theoretical 115.2 GFLOPS:

old kernel:

new kernel:

4-thread test with m=n=k=10000, transa=transb=T, on the same CPU at 3.6 GHz, theoretical 461 GFLOPS:

old kernel:

new kernel:
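The theoretical figures quoted in the tests above are consistent with a peak of 32 single-precision FLOPs per cycle per core, i.e. two 256-bit FMA units, 8 floats per vector and 2 FLOPs per fused multiply-add. A small sketch of that arithmetic (the function name peak_gflops is hypothetical, and the per-core figure is an assumption about these CPUs rather than something stated in the thread):

```python
# Assumed AVX2 peak per core: 2 FMA units x 8 floats per YMM x 2 FLOPs per FMA
FLOPS_PER_CYCLE_PER_CORE = 2 * 8 * 2

def peak_gflops(threads, ghz):
    """Theoretical single-precision peak in GFLOPS for `threads` cores at `ghz`."""
    return threads * ghz * FLOPS_PER_CYCLE_PER_CORE

# (threads, clock in GHz) for the test configurations listed in the thread
cases = [(1, 4.4), (4, 4.2), (8, 4.1), (1, 3.6), (4, 3.6)]
for threads, ghz in cases:
    print(f"{threads} thread(s) at {ghz} GHz: {peak_gflops(threads, ghz):.1f} GFLOPS")
```

Rounded to the precision used in the comments, these reproduce the quoted 140.8, 538, 1050, 115.2 and 461 GFLOPS.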

wjc404 · 2020-01-06T11:50:05Z

The STRMM kernel passed the 1-thread reliability test.

Test code:
strmm_compare_test.zip
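The contents of strmm_compare_test.zip are not reproduced in this thread. Purely as an illustration of what a comparison-style check can look like, the sketch below gives a naive reference for the STRMM case benchmarked earlier (side=L, uplo=U, transa=N, diag=N, i.e. alpha * A * B with A upper triangular) together with a maximum-absolute-difference helper. The names strmm_lunn_ref and max_abs_diff are hypothetical, and unlike BLAS STRMM the reference returns a new matrix instead of overwriting B.

```python
def strmm_lunn_ref(alpha, a, b):
    """Naive reference: alpha * triu(A) * B for m x m A and m x n B (lists of rows)."""
    m, n = len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # uplo=U: only entries a[i][k] with k >= i are referenced
            out[i][j] = alpha * sum(a[i][k] * b[k][j] for k in range(i, m))
    return out

def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two equally shaped matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))

a = [[1.0, 2.0],
     [99.0, 3.0]]   # the 99.0 below the diagonal must be ignored
b = [[1.0, 0.0],
     [0.0, 1.0]]
result = strmm_lunn_ref(2.0, a, b)
print(result)
```

In a real check, the output of the optimized kernel would take the place of one argument to max_abs_diff, and the difference would be compared against a tolerance suited to single precision.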

wjc404 · 2020-01-08T06:44:19Z

The SGEMM kernel passed the 1-thread reliability test. Ready for merging.

martin-frbg · 2020-01-08T15:20:20Z

Great, thanks a lot. Still fascinating to see how much performance can be improved by making things slower.

I partially reverted the changes in OpenMathLib#2361 and I received the following speed up on:

./xsl3blastst -R gemm -N 2048 2048 1 -a 5 1 1 1 1 1

AMD Ryzen 7 2700X (Zen+): 61400 to 63300 MFlops
AMD EPYC 7742 (Zen v2): 91400 to 94500 MFlops

These numbers are single-threaded performance.

wjc404 added 5 commits January 6, 2020 12:07

optimize AVX2 SGEMM

eb3c9f1

optimize AVX2 SGEMM

b73bf01

optimize AVX2 SGEMM

92b1021

optimize AVX2 SGEMM

b7b408a

Update CONTRIBUTORS.md

9f5cdc4

wjc404 added 2 commits January 6, 2020 20:11

Update sgemm_kernel_8x4_haswell.c

9dc9b7b

Update sgemm_kernel_8x4_haswell.c

bd4c032

martin-frbg merged commit 38742d5 into OpenMathLib:develop Jan 8, 2020

martin-frbg added this to the 0.3.8 milestone Jan 8, 2020

wjc404 mentioned this pull request Jan 9, 2020

Fix SKYLAKEX STRMM issues #2365

Merged

marxin mentioned this pull request Feb 13, 2020

Test and tune for Zen 2 #2180

Open

wjc404 mentioned this pull request Feb 17, 2020

Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM #2422

Merged

marxin mentioned this pull request Feb 18, 2020

Restore ZEN SGEMM speed after #2361. #2430

Closed

martin-frbg mentioned this pull request Aug 20, 2020

ILP64 OpenBLAS gives different result from regular OpenBLAS #2779

Closed
