
Include a larger AVX512 GEMM kernel for (server) CPUs with 2 FMA units #17

Open
robertknight opened this issue Dec 30, 2023 · 1 comment
Labels
performance Issues that affect model inference or loading performance

Comments

@robertknight
Owner

I optimized the initial AVX-512 GEMM kernel based on what works best on my 2020 Intel MacBook Pro (i5). This is an Ice Lake client system with a single 512-bit FMA unit. When testing on a c6i.xlarge instance in AWS (Intel Xeon, Ice Lake server), I found that doubling the Avx512Kernel::MR constant from 6 to 12 gave a substantial boost, and going to 14 is better still. Anything larger is slower (we run out of zmm registers). The server CPU has two 512-bit FMA units.
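
For a rough sense of why MR tops out around 14, here is a back-of-the-envelope zmm register budget. It assumes the kernel keeps an MR x 32 f32 accumulator tile (two zmm vectors per row) plus a couple of registers for B loads and the broadcast A element; the real kernel's register allocation may differ.

```rust
// Assumed layout: NR = 32 f32 columns, i.e. two zmm vectors per accumulator
// row, plus registers for the current B row and a broadcast A element.
// This is a sketch, not the kernel's actual register allocation.
const ZMM_REGS: usize = 32; // zmm0..zmm31 with AVX-512
const NR_VECS: usize = 2; // assumed: 2 x 16 f32 columns per row

fn regs_used(mr: usize) -> usize {
    let accumulators = mr * NR_VECS; // C tile kept in registers
    let b_loads = NR_VECS; // current row of packed B
    let a_broadcast = 1; // broadcast element of packed A
    accumulators + b_loads + a_broadcast
}

fn main() {
    for mr in [6, 12, 14, 16] {
        let used = regs_used(mr);
        let fits = if used <= ZMM_REGS { "fits" } else { "spills" };
        println!("MR={mr:2}: {used} zmm registers ({fits})");
    }
}
```

Under these assumptions MR=14 uses 31 of the 32 zmm registers, which matches the observation that going any larger starts spilling.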

This means that for optimal performance across a range of systems, multiple sizes of AVX-512 kernel are needed, along with some mechanism to choose between them. Section 18.1 "Servers with a single FMA unit" in the Intel Optimization Manual has some code to detect the FMA unit count, but it relies on a microbenchmark. Google's cpu_features library instead relies on detecting specific CPU models.
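
As a sketch of the model-detection route (the function name and model list below are assumptions, not code from cpu_features): read the display family/model via CPUID and special-case server parts known to have two FMA units. Ice Lake server always has two; Skylake-SP varies by SKU, which is why the Intel manual falls back to a microbenchmark there.

```rust
// Sketch of the cpu_features-style approach: derive the display family/model
// from CPUID leaf 1 and check it against known server parts. The model list
// here is illustrative and incomplete.
#[cfg(target_arch = "x86_64")]
fn probably_has_two_fma_units() -> bool {
    use core::arch::x86_64::__cpuid;

    // SAFETY: CPUID leaf 1 is available on every x86_64 CPU.
    let eax = unsafe { __cpuid(1) }.eax;

    let family = (eax >> 8) & 0xf;
    let ext_family = (eax >> 20) & 0xff;
    let model = (eax >> 4) & 0xf;
    let ext_model = (eax >> 16) & 0xf;

    let display_family = if family == 0xf { family + ext_family } else { family };
    let display_model = if family == 0x6 || family == 0xf {
        (ext_model << 4) + model
    } else {
        model
    };

    // 0x6a / 0x6c: Ice Lake server, which has two 512-bit FMA units.
    // Skylake-SP (0x55) is ambiguous -- some SKUs have only one unit -- so a
    // microbenchmark (per the Intel manual) is needed there.
    display_family == 0x6 && matches!(display_model, 0x6a | 0x6c)
}
```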

In addition to increasing the tile size, the output prefetching logic probably also needs adjusting to work with a larger kernel. Currently we prefetch all rows of the output tile in a single loop before the final outer product, which becomes inefficient as MR gets large; the prefetches should instead be interleaved with the computation.
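
A minimal sketch of what that interleaving could look like, assuming illustrative names (c_tile, ldc, k_tail) rather than the kernel's actual structure:

```rust
// Schematic only: spread output-tile prefetches across the tail iterations
// of the K loop instead of issuing all MR of them in one burst at the end.
// `c_tile`, `ldc`, `k_tail` and `MR` are placeholders, not the real kernel.
#[cfg(target_arch = "x86_64")]
unsafe fn kernel_tail_sketch(c_tile: *const f32, ldc: usize, k_tail: usize) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    const MR: usize = 14;

    for k in 0..k_tail {
        // ... one outer-product update of the MR x NR accumulator tile ...

        // Issue at most one prefetch per iteration so the requests overlap
        // with the FMA work rather than stalling before the writeback.
        if k < MR {
            _mm_prefetch::<{ _MM_HINT_T0 }>(c_tile.add(k * ldc) as *const i8);
        }
    }
}
```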

Here are some concrete numbers from a benchmark, using:

cargo +nightly test -p wasnn --features avx512 --release bench_gemm -- --nocapture --ignored

and taking the number for M=N=K=1024.

  • Baseline performance of the AVX-512 kernel (with MR=6) on c6i.xlarge: ~180 GFLOPS
  • Intel MKL performance with 2-4 threads (using gemm-benchmark): ~308 GFLOPS
  • BLIS performance: ~260 GFLOPS
  • This library's AVX-512 kernel with MR=14: ~232 GFLOPS
  • As above, + compile with RUSTFLAGS="-C target-cpu=native": ~239 GFLOPS

Looking at a report generated by perf, the A-block packing code shows up as expensive (~14% of runtime with the default target CPU, ~12% with target-cpu=native). This is not surprising, since it logically involves reading MRxMR-sized blocks from A, transposing them and writing them to the packing buffer. I looked at this in #16 and didn't find an easy win with the smaller MR, but it is perhaps worth revisiting for larger MR.
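
For reference, a scalar sketch of the packing step being described (the function name, arguments and row-major layout of `a` are assumptions; the real routine is vectorized):

```rust
// Scalar sketch of packing one MR-row panel of A: elements are written
// column-by-column within the panel, which is effectively a transpose of the
// row-major source and is why this step is comparatively expensive.
fn pack_a_panel(
    packed: &mut Vec<f32>,
    a: &[f32],
    a_cols: usize,
    panel_rows: std::ops::Range<usize>,
    mr: usize,
) {
    for col in 0..a_cols {
        for row in panel_rows.clone() {
            packed.push(a[row * a_cols + col]);
        }
        // Pad a partial final panel up to MR rows so the kernel always
        // reads a full MR-element column.
        for _ in panel_rows.len()..mr {
            packed.push(0.0);
        }
    }
}
```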

@robertknight robertknight added the performance Issues that affect model inference or loading performance label Dec 30, 2023
@robertknight
Owner Author

The c6i instances also support the AVX-512 VNNI extension (a.k.a. "Deep Learning Boost"). Ultimately, being able to exploit that would get the most out of them.
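
For concreteness, a minimal illustration of what exploiting VNNI means at the instruction level, assuming an int8 kernel (the function name is hypothetical; this is just the intrinsic in isolation, not a kernel design):

```rust
// One VNNI instruction replaces the multiply / widen / add sequence an int8
// kernel would otherwise need: each i32 lane of `acc` gains the dot product
// of 4 u8 bytes of `a_u8` with the corresponding 4 i8 bytes of `b_i8`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f,avx512vnni")]
unsafe fn vnni_dot_accumulate(
    acc: core::arch::x86_64::__m512i,
    a_u8: core::arch::x86_64::__m512i,
    b_i8: core::arch::x86_64::__m512i,
) -> core::arch::x86_64::__m512i {
    core::arch::x86_64::_mm512_dpbusd_epi32(acc, a_u8, b_i8)
}
```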
