
Include a larger AVX512 GEMM kernel for (server) CPUs with 2 FMA units #17

Open
robertknight opened this issue Dec 30, 2023 · 1 comment
Labels
performance Issues that affect model inference or loading performance

Comments

@robertknight
Owner

I optimized the initial AVX-512 GEMM kernel based on what works best on my 2020 Intel MacBook Pro (i5). This is an Ice Lake client system with a single 512-bit FMA unit. When testing on a c6i.xlarge instance in AWS (Intel Xeon, Ice Lake server), I found that doubling the Avx512Kernel::MR constant from 6 to 12 gave a substantial boost, and going to 14 is better still. Anything larger is slower (we run out of zmm registers). The server CPU has two 512-bit FMA units.
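
For a rough sense of why MR tops out around 14, here is a back-of-the-envelope zmm register budget. It assumes the kernel keeps an MR x 32 f32 accumulator tile (two zmm vectors per row) plus a couple of registers for B loads and the broadcast A element; the real kernel's register allocation may differ.

```rust
// Assumed layout: NR = 32 f32 columns, i.e. two zmm vectors per accumulator
// row, plus registers for the current B row and a broadcast A element.
// This is a sketch, not the kernel's actual register allocation.
const ZMM_REGS: usize = 32; // zmm0..zmm31 with AVX-512
const NR_VECS: usize = 2; // assumed: 2 x 16 f32 columns per row

fn regs_used(mr: usize) -> usize {
    let accumulators = mr * NR_VECS; // C tile kept in registers
    let b_loads = NR_VECS; // current row of packed B
    let a_broadcast = 1; // broadcast element of packed A
    accumulators + b_loads + a_broadcast
}

fn main() {
    for mr in [6, 12, 14, 16] {
        let used = regs_used(mr);
        let fits = if used <= ZMM_REGS { "fits" } else { "spills" };
        println!("MR={mr:2}: {used} zmm registers ({fits})");
    }
}
```

Under these assumptions MR=14 uses 31 of the 32 zmm registers, which matches the observation that going any larger starts spilling.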

This means that for optimal performance across a range of systems, multiple sizes of AVX-512 kernel are needed, along with some mechanism to choose between them. Section 18.1 "Servers with a single FMA unit" in the Intel Optimization Manual has some code to detect the FMA unit count, but it relies on a microbenchmark. Google's cpu_features library instead relies on detecting specific CPU models.
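
As a sketch of the model-detection route (the function name and model list below are assumptions, not code from cpu_features): read the display family/model via CPUID and special-case server parts known to have two FMA units. Ice Lake server always has two; Skylake-SP varies by SKU, which is why the Intel manual falls back to a microbenchmark there.

```rust
// Sketch of the cpu_features-style approach: derive the display family/model
// from CPUID leaf 1 and check it against known server parts. The model list
// here is illustrative and incomplete.
#[cfg(target_arch = "x86_64")]
fn probably_has_two_fma_units() -> bool {
    use core::arch::x86_64::__cpuid;

    // SAFETY: CPUID leaf 1 is available on every x86_64 CPU.
    let eax = unsafe { __cpuid(1) }.eax;

    let family = (eax >> 8) & 0xf;
    let ext_family = (eax >> 20) & 0xff;
    let model = (eax >> 4) & 0xf;
    let ext_model = (eax >> 16) & 0xf;

    let display_family = if family == 0xf { family + ext_family } else { family };
    let display_model = if family == 0x6 || family == 0xf {
        (ext_model << 4) + model
    } else {
        model
    };

    // 0x6a / 0x6c: Ice Lake server, which has two 512-bit FMA units.
    // Skylake-SP (0x55) is ambiguous -- some SKUs have only one unit -- so a
    // microbenchmark (per the Intel manual) is needed there.
    display_family == 0x6 && matches!(display_model, 0x6a | 0x6c)
}
```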

In addition to increasing the tile size, the output prefetching logic probably also needs adjusting to work with a larger kernel. Currently we prefetch all rows of the output tile in a single loop before the final outer product, which becomes inefficient as MR gets large; the prefetches should instead be interleaved with the computation.
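
A minimal sketch of what that interleaving could look like, assuming illustrative names (c_tile, ldc, k_tail) rather than the kernel's actual structure:

```rust
// Schematic only: spread output-tile prefetches across the tail iterations
// of the K loop instead of issuing all MR of them in one burst at the end.
// `c_tile`, `ldc`, `k_tail` and `MR` are placeholders, not the real kernel.
#[cfg(target_arch = "x86_64")]
unsafe fn kernel_tail_sketch(c_tile: *const f32, ldc: usize, k_tail: usize) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    const MR: usize = 14;

    for k in 0..k_tail {
        // ... one outer-product update of the MR x NR accumulator tile ...

        // Issue at most one prefetch per iteration so the requests overlap
        // with the FMA work rather than stalling before the writeback.
        if k < MR {
            _mm_prefetch::<{ _MM_HINT_T0 }>(c_tile.add(k * ldc) as *const i8);
        }
    }
}
```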

Here are some concrete numbers from a benchmark, using:

cargo +nightly test -p wasnn --features avx512 --release bench_gemm -- --nocapture --ignored

and taking the number for M=N=K=1024.

  • Baseline performance of the AVX-512 kernel (with MR=6) on c6i.xlarge: ~180 GFLOPS
  • Intel MKL performance with 2-4 threads (using gemm-benchmark): ~308 GFLOPS
  • BLIS performance: ~260 GFLOPS
  • This library's AVX-512 kernel with MR=14: ~232 GFLOPS
  • As above, + compile with RUSTFLAGS="-C target-cpu=native": ~239 GFLOPS

Looking at a report generated by perf, the A-block packing code shows up as expensive (~14% of runtime with the default target CPU, ~12% with target-cpu=native). This is not surprising, since it logically involves reading MRxMR-sized blocks from A, transposing them and writing them to the packing buffer. I looked at this in #16 and didn't find an easy win with the smaller MR, but it is perhaps worth revisiting for larger MR.
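
For reference, a scalar sketch of the packing step being described (the function name, arguments and row-major layout of `a` are assumptions; the real routine is vectorized):

```rust
// Scalar sketch of packing one MR-row panel of A: elements are written
// column-by-column within the panel, which is effectively a transpose of the
// row-major source and is why this step is comparatively expensive.
fn pack_a_panel(
    packed: &mut Vec<f32>,
    a: &[f32],
    a_cols: usize,
    panel_rows: std::ops::Range<usize>,
    mr: usize,
) {
    for col in 0..a_cols {
        for row in panel_rows.clone() {
            packed.push(a[row * a_cols + col]);
        }
        // Pad a partial final panel up to MR rows so the kernel always
        // reads a full MR-element column.
        for _ in panel_rows.len()..mr {
            packed.push(0.0);
        }
    }
}
```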

@robertknight robertknight added the performance Issues that affect model inference or loading performance label Dec 30, 2023
@robertknight
Owner Author

The c6i instances also support the AVX-512 VNNI extension (a.k.a. "Deep Learning Boost"). Ultimately, being able to exploit that would get the most out of them.
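
For concreteness, a minimal illustration of what exploiting VNNI means at the instruction level, assuming an int8 kernel (the function name is hypothetical; this is just the intrinsic in isolation, not a kernel design):

```rust
// One VNNI instruction replaces the multiply / widen / add sequence an int8
// kernel would otherwise need: each i32 lane of `acc` gains the dot product
// of 4 u8 bytes of `a_u8` with the corresponding 4 i8 bytes of `b_i8`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f,avx512vnni")]
unsafe fn vnni_dot_accumulate(
    acc: core::arch::x86_64::__m512i,
    a_u8: core::arch::x86_64::__m512i,
    b_i8: core::arch::x86_64::__m512i,
) -> core::arch::x86_64::__m512i {
    core::arch::x86_64::_mm512_dpbusd_epi32(acc, a_u8, b_i8)
}
```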
