When could you support AMD Zen4 arch? #770

Open
ltjsjyyy opened this issue Sep 4, 2023 · 7 comments

Comments

@ltjsjyyy

ltjsjyyy commented Sep 4, 2023

No description provided.

@devinamatthews
Member

Zen4 is already supported in AMD's fork of BLIS. We're in contact with AMD on coordinating how best to back-port these changes to BLIS master.

@AngryLoki
Contributor

AngryLoki commented Nov 25, 2023

Hi. I've conducted some experiments using scripts from https://github.com/flame/blis/blob/master/docs/Performance.md and AMD's fork of BLIS. I tested only GEMM and only in multithreaded mode, because the output of https://github.com/amd/blis/tree/master/test/3 is not compatible with that of https://github.com/flame/blis/tree/master/test/3 , but this test was enough for initial needs.

My setup:

  • Processor model: AMD Ryzen 9 7950X3D (Zen4)
  • Core topology: one socket, 16 cores per socket, 16 cores total (32 hardware threads with SMT)
  • SMT status: enabled, used
  • OS: Gentoo
  • Compiler: Clang 17.0.3 (CC="clang" CXX="clang++" AR="llvm-ar" RANLIB="llvm-ranlib" ./configure -t openmp zen4)
  • Stock BLIS was compiled with zen3 kernels. In general, all libraries were compiled with flags native to Zen4.
  • Versions:
    ** AMD/blis master a5a3c8b Mon Aug 7 13:48:54 2023
    ** flame/blis master f7ce54a Fri Nov 3 15:52:57 2023
    ** sci-libs/mkl-2023.0.0.25398
    ** sci-libs/openblas-0.3.23

Commands executed:

BLIS_NUM_THREADS=32     ./test_sgemm_5120_asm_blis_st.x  # amd-blis
BLIS_NUM_THREADS=32     ./test_gemm_blis_mt.x     -d s -c nn   -i native -p "256 5120 128" -r 3 -v
MKL_NUM_THREADS=32      ./test_gemm_vendor_mt.x   -d s -c nn   -i native -p "256 5120 128" -r 3 -v
OPENBLAS_NUM_THREADS=32 ./test_gemm_openblas_mt.x -d s -c nn   -i native -p "256 5120 128" -r 3 -v

BLIS_NUM_THREADS=32     ./test_dgemm_5120_asm_blis_st.x  # amd-blis
BLIS_NUM_THREADS=32     ./test_gemm_blis_mt.x     -d d -c nn   -i native -p "256 5120 128" -r 3 -v
MKL_NUM_THREADS=32      ./test_gemm_vendor_mt.x   -d d -c nn   -i native -p "256 5120 128" -r 3 -v
OPENBLAS_NUM_THREADS=32 ./test_gemm_openblas_mt.x -d d -c nn   -i native -p "256 5120 128" -r 3 -v

Results:
[image: results]

Comments: The AMD fork of BLIS, with its Zen4 kernels, significantly outperforms all other libraries on the AMD Ryzen 9 7950X3D (by up to 2x). Vanilla BLIS is on par with OpenBLAS, but slower than MKL. There is a performance drop in the MKL library for some sizes, but it looks like a fluke (it disappears for larger sizes). When checking gemm for larger matrices (like 6000×6000), performance was the same for all 4 libraries (presumably due to a memory bottleneck on my system).
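For anyone who wants to reproduce the comparison from raw timings, here is a minimal sketch of how GEMM throughput is usually computed. It assumes the standard 2·m·n·k flop count for real-valued GEMM, and it assumes that the three numbers in `-p "256 5120 128"` are the first size, last size, and step of a sweep over square problems; that reading of the option is an assumption, not something confirmed in this thread. The function names and the 0.25 s timing are made up for illustration.

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """GFLOPS for one real-valued GEMM, using the usual 2*m*n*k flop count."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e9


def problem_sizes(start: int, stop: int, step: int) -> list[int]:
    """Square problem sizes from start to stop inclusive.

    Assumed meaning of -p "256 5120 128" (hypothetical helper, not part of BLIS).
    """
    return list(range(start, stop + 1, step))


sizes = problem_sizes(256, 5120, 128)
print(f"{len(sizes)} sizes, from {sizes[0]} to {sizes[-1]}")

# 0.25 s is an invented timing, used only to show the arithmetic.
print(f"{gemm_gflops(5120, 5120, 5120, 0.25):.1f} GFLOPS")
```

Taking the best of several repetitions per size, as `-r 3` suggests, keeps one-off noise out of a figure like this.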

@fgvanzee
Member

@AngryLoki Thank you for taking the time to gather, visualize, and share these performance results! Don't worry; a proper zen4 subconfiguration will be added to vanilla BLIS in the future. We are just overwhelmed with work these days relative to our resources. Thanks for your patience in the meantime. ❤️

PS: Please feel free to keep up with us in our Discord server, if you haven't already joined! 😄

@HaukurPall

@AngryLoki thank you for this information.

I am curious, did you also test AMD/blis compiled with AOCC? I've been experimenting with it on my system (Gentoo AMD 7840U) and it's performing well on certain tasks.

@AngryLoki
Contributor

AngryLoki commented Feb 3, 2024

@HaukurPall, I checked sgemm (M=N=K) with gcc 13.2.1 (+full lto), clang 17.0.6, AOCC and rocm-llvm-alt. Results are almost the same.
[image: compilers]

I checked the code of AOCC and unfortunately I don't see any specific optimizations... AMD just shipped vanilla precompiled Clang and included some ROCm-related fixes (to make it work, not for optimization). They also added ROCm/llvm-project@0272bec: if you attempt to use -famd-opt, it tries to use the proprietary version of Clang, rocm-llvm-alt, which actually has some interesting optimizations. However, even after installing rocm-llvm-alt I was not able to increase performance for AOCL-BLAS. Anyway, ICX, AOCC and rocm-llvm-alt are basically Clang. With -flto they produce LLVM bitcode, which contains mostly x86-64 assembly of kernels, because Clang can't deconstruct inline asm back to optimizable LLVM representation.

Regarding my previous tests, I checked my approach more carefully and found a few mistakes on my side:

  • Specifying 32 threads on 16 cores (32 vCPUs) was a mistake. While performance seemed to be the same, the standard deviation was too big. After setting the thread count to 16, there is almost no variance (see image above).
  • Now tested with trunk OpenBLAS, trunk BLIS, trunk amd/blis, and MKL 2024.0.
  • I checked why MKL is so slow and, guess what, Intel did it again (as they always do): they shipped the equivalent of "if cpu = zen: use slow code", adding extra megabytes specifically to degrade AMD performance. I followed https://documentation.sigma2.no/jobs/mkl.html#forcing-mkl-to-use-best-performing-routines and it made MKL 2 times faster.
  • Updated results are in the image below; everything was compiled with Clang and launched with OMP_NUM_THREADS=16 GOMP_CPU_AFFINITY=0-15.
    [image: results]
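The thread-count fix above amounts to running one thread per physical core instead of one per SMT sibling. A minimal sketch of how to derive such a setting on Linux follows, assuming the kernel exposes SMT groups in `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list`. The helper names are hypothetical, and the result still needs checking against the actual machine topology.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a Linux CPU list such as "0,16" or "0-1" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def one_cpu_per_core(sibling_lists: list[str]) -> list[int]:
    """Pick the lowest-numbered logical CPU from each group of SMT siblings."""
    return sorted({min(parse_cpu_list(s)) for s in sibling_lists if s.strip()})


def read_sibling_lists() -> list[str]:
    """Read SMT sibling groups from sysfs (Linux only; empty list elsewhere)."""
    base = Path("/sys/devices/system/cpu")
    lists = []
    for path in base.glob("cpu[0-9]*/topology/thread_siblings_list"):
        try:
            lists.append(path.read_text())
        except OSError:
            continue  # skip CPUs whose topology cannot be read
    return lists


cpus = one_cpu_per_core(read_sibling_lists())
if cpus:
    affinity = ",".join(str(c) for c in cpus)
    print(f"OMP_NUM_THREADS={len(cpus)} GOMP_CPU_AFFINITY={affinity}")
```

On a system where siblings are numbered i and i+16, this should print the equivalent of the `OMP_NUM_THREADS=16 GOMP_CPU_AFFINITY=0-15` setting used for the updated results.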

@devinamatthews
Member

BLIS is usually pretty insensitive to compiler since most of the work happens in the inline assembly kernels.

> With -flto they produce LLVM bitcode, which contains mostly x86-64 assembly of kernels, because Clang can't deconstruct inline asm back to optimizable LLVM representation.

I consider this a good thing since LLVM (and, to be fair, other compilers too) really makes a hash of C or intrinsics kernels due to a combination of poor register allocation and instruction ordering.

Glad to see that AOCL-BLIS is performing well for you though. As we work with AMD to backport their changes, BLIS will catch up.

@HaukurPall

@AngryLoki thank you so much for this, this answers a lot of questions!

5 participants