Optimize AVX2 SGEMM & STRMM #2361

Merged: 7 commits, Jan 8, 2020

wjc404
Contributor

@wjc404 wjc404 commented Jan 6, 2020

Replace KERNEL_16x6 with KERNEL_8x12 to slow down reads of the packed matrix A (held in L3 cache), as mentioned in issue #2210. Performance catches up with MKL 2019 after the change.
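
The effect of the tile change on reads of A can be sketched with simple counting. This is an illustrative estimate, not the kernel itself: it assumes the names mean an M x N register tile with M as the vectorized dimension, 8 floats per AVX2 register, one broadcast per B element and one vector FMA per (A vector, B element) pair. The function name `tile_traffic` is hypothetical.

```python
def tile_traffic(m_r, n_r, lanes=8):
    """Per-k-step counts for an m_r x n_r register tile (assumed layout)."""
    a_vector_loads = m_r // lanes      # contiguous vector loads from packed A
    b_broadcasts = n_r                 # one scalar broadcast per tile column
    vector_fmas = a_vector_loads * n_r # one FMA per (A vector, B scalar) pair
    flops = 2 * m_r * n_r              # multiply + add per tile element
    return {
        "a_floats": m_r,
        "a_vector_loads": a_vector_loads,
        "b_broadcasts": b_broadcasts,
        "vector_fmas": vector_fmas,
        "accumulators": vector_fmas,   # one accumulator register per FMA
        "flops": flops,
        "a_floats_per_flop": m_r / flops,
    }


if __name__ == "__main__":
    for shape in ((16, 6), (8, 12)):
        print(shape, tile_traffic(*shape))
```

Under these assumptions both tiles issue 12 vector FMAs and 192 FLOPs per k-step, but the 8x12 tile reads 8 floats of A per step instead of 16, so A is consumed at half the rate for the same amount of arithmetic.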

@wjc404
Contributor Author

wjc404 commented Jan 6, 2020

1) 1-thread SGEMM test with m=n=k=7023, transa=N and transb=T, on i9-9900K at 4.4 GHz, theoretical 140.8 GFLOPS:

old kernel:
Screenshot from 2020-01-06 18-09-02

new kernel:
Screenshot from 2020-01-06 18-09-09

2) 4-thread SGEMM test with m=n=k=10000, transa=T and transb=N, on the same CPU at 4.2 GHz, theoretical 538 GFLOPS:

old kernel:
Screenshot from 2020-01-06 18-14-08

new kernel:
Screenshot from 2020-01-06 18-14-12

3) 8-thread SGEMM test with m=n=k=20000, transa=transb=N, on the same CPU at 4.1 GHz, theoretical 1050 GFLOPS:

old kernel:
Screenshot from 2020-01-08 11-01-15

new kernel:
Screenshot from 2020-01-08 11-01-25

4) 1-thread STRMM test with m=n=6971, side=L, uplo=U, transa=N and diag=N, on the same CPU at 4.4 GHz, theoretical 140.8 GFLOPS:

old kernel:
Screenshot from 2020-01-06 18-21-11

new kernel:
Screenshot from 2020-01-06 18-21-16
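
The theoretical figures quoted above are consistent with a simple peak estimate. The sketch below assumes 2 FMA units per core, 8 single-precision lanes per 256-bit register, and 2 FLOPs per FMA (32 FLOPs per cycle per core), with every thread on its own core at the stated clock. The helper name `peak_gflops` is hypothetical.

```python
def peak_gflops(ghz, threads=1, fma_units=2, lanes=8):
    """Theoretical single-precision peak in GFLOPS (assumed core model)."""
    flops_per_cycle = fma_units * lanes * 2  # each FMA is a multiply and an add
    return ghz * threads * flops_per_cycle


if __name__ == "__main__":
    for ghz, threads in ((4.4, 1), (4.2, 4), (4.1, 8)):
        print(f"{threads} thread(s) at {ghz} GHz: {peak_gflops(ghz, threads):.1f} GFLOPS")
```

With these assumptions the three clock and thread combinations give 140.8, 537.6 and 1049.6 GFLOPS, which round to the 538 and 1050 quoted for the multi-threaded runs.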

@wjc404
Contributor Author

wjc404 commented Jan 6, 2020

1-thread SGEMM test with m=n=k=5999, transa=N and transb=N, on Ryzen 7 3700X at 3.6 GHz, theoretical 115.2 GFLOPS:

old kernel:
Screenshot from 2020-01-06 18-51-59

new kernel:
Screenshot from 2020-01-06 18-52-06

4-thread SGEMM test with m=n=k=10000, transa=transb=T, on the same CPU at 3.6 GHz, theoretical 461 GFLOPS:

old kernel:
Screenshot from 2020-01-06 18-53-23

new kernel:
Screenshot from 2020-01-06 18-53-30

@wjc404
Contributor Author

wjc404 commented Jan 6, 2020

The STRMM kernel passed the 1-thread reliability test.
Screenshot from 2020-01-06 19-47-19
Screenshot from 2020-01-08 13-18-47

Test code:
strmm_compare_test.zip

@wjc404
Contributor Author

wjc404 commented Jan 8, 2020

The SGEMM kernel passed the 1-thread reliability test. Ready for merging.
Screenshot from 2020-01-08 14-43-05

@martin-frbg
Collaborator

Great, thanks a lot. Still fascinating to see how much performance can be improved by making things slower.

@martin-frbg martin-frbg merged commit 38742d5 into OpenMathLib:develop Jan 8, 2020
@martin-frbg martin-frbg added this to the 0.3.8 milestone Jan 8, 2020
@wjc404 wjc404 mentioned this pull request Jan 9, 2020
marxin added a commit to marxin/OpenBLAS that referenced this pull request Feb 18, 2020
I partially reverted the changes in OpenMathLib#2361 and measured the following
speedup on:
./xsl3blastst -R gemm -N 2048 2048 1 -a 5 1 1 1 1 1

AMD Ryzen 7 2700X (Zen+): 61400 to 63300 MFlops
AMD EPYC 7742 (Zen v2): 91400 to 94500 MFlops

These numbers are single-threaded performance.