Welcome to Awesome GEMM!
A curated and continually evolving list of frameworks, libraries, tutorials, and tools for optimizing General Matrix Multiply (GEMM) operations. Whether you're a beginner eager to learn the fundamentals, a developer optimizing performance-critical code, or a researcher pushing the limits of hardware, this repository is your launchpad to mastery.
General Matrix Multiply is at the core of a wide range of computational tasks: from scientific simulations and signal processing to modern AI workloads like neural network training and inference. Efficiently implementing and optimizing GEMM can lead to dramatic performance improvements across entire systems.
This repository is a comprehensive resource for:
- Students & Beginners: Learn the basics and theory of matrix multiplication.
- Engineers & Developers: Discover frameworks, libraries, and tools to optimize GEMM on CPUs, GPUs, and specialized hardware.
- Researchers & Performance Experts: Explore cutting-edge techniques, research papers, and advanced optimization strategies.
If you're new and just want to dive in, start here:
For Beginners:
- NumPy (CPU, Python) - The go-to library for basic matrix operations.
- How To Optimize GEMM - A step-by-step guide to improving performance from a naive implementation.
For GPU Developers:
- NVIDIA cuBLAS - Highly optimized BLAS for NVIDIA GPUs.
- NVIDIA CUTLASS - Templates and building blocks to write your own CUDA GEMM kernels.
For Low-Precision & AI Workloads:
- FBGEMM - Meta's low-precision CPU GEMM library for server-side inference.
- gemmlowp - Google's low-precision GEMM library.
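To make the low-precision idea concrete: libraries such as gemmlowp and FBGEMM replace float32 GEMM with int8 inputs and int32 accumulation, then rescale the result. The NumPy sketch below shows only that idea; the function names and the symmetric per-tensor scaling are illustrative simplifications, not any library's actual API.

```python
import numpy as np

def quantize_int8(X):
    """Symmetric per-tensor quantization: float32 -> (int8 values, scale)."""
    scale = np.abs(X).max() / 127.0
    q = np.clip(np.round(X / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_gemm(A, B):
    """Approximate A @ B using int8 inputs and int32 accumulation."""
    qA, sA = quantize_int8(A)
    qB, sB = quantize_int8(B)
    acc = qA.astype(np.int32) @ qB.astype(np.int32)   # integer accumulate
    return acc.astype(np.float32) * (sA * sB)          # dequantize back to float

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 256)).astype(np.float32)
B = rng.standard_normal((256, 64)).astype(np.float32)
err = np.abs(int8_gemm(A, B) - A @ B).max() / np.abs(A @ B).max()
print(f"max relative error vs float32 GEMM: {err:.3%}")
```

Real libraries add per-channel scales, zero points, and saturating vectorized kernels; this sketch only illustrates why int8 GEMM trades a small accuracy loss for much higher throughput.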
Contents:
- Fundamental Theories and Concepts
- General Optimization Techniques
- Frameworks and Development Tools
- Libraries
- Debugging and Profiling Tools
- Learning Resources
- Example Implementations
- Contributions
- License
Fundamental Theories and Concepts
What is GEMM?
- General Matrix Multiply (Intel) - Intro from Intel.
- Spatial-lang GEMM - High-level overview.
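For readers who want the operation spelled out before following the links above: GEMM computes C = alpha * A * B + beta * C. Below is a minimal, illustrative NumPy sketch (the `naive_gemm` name is ours, not from any library) that implements that definition with three nested loops and checks it against `numpy.matmul`.

```python
import numpy as np

def naive_gemm(alpha, A, B, beta, C):
    """Reference GEMM: returns alpha * A @ B + beta * C, one scalar at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and C.shape == (M, N)
    out = beta * C                      # scale the existing C (no in-place update)
    for i in range(M):                  # rows of A / C
        for j in range(N):              # columns of B / C
            acc = 0.0
            for k in range(K):          # reduction dimension
                acc += A[i, k] * B[k, j]
            out[i, j] += alpha * acc
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 32))
B = rng.standard_normal((32, 48))
C = rng.standard_normal((64, 48))
assert np.allclose(naive_gemm(2.0, A, B, 0.5, C), 2.0 * (A @ B) + 0.5 * C)
```

Every optimized library below computes exactly this result; the differences are all in how the loops are blocked, vectorized, and parallelized.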
Matrix Multiplication Algorithms:
- Strassen's Algorithm - Faster asymptotic complexity (roughly O(n^2.81)) for large matrices; a recursion sketch follows this list.
- Winograd's Algorithm - Reduced multiplication count for improved performance.
- How To Optimize GEMM - Hands-on optimization guide.
- GEMM: From Pure C to SSE Optimized Micro Kernels - Detailed tutorial on going from naive to vectorized implementations.
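As a companion to the Strassen's Algorithm link above, here is a hedged NumPy sketch of the recursion for square matrices whose size is a power of two. It forms seven recursive block products instead of the eight that a naive 2x2 block decomposition needs, which is where the roughly O(n^2.81) bound comes from; the `strassen` name and the fallback cutoff are illustrative choices, not taken from any of the linked resources.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen multiply for n x n matrices with n a power of two."""
    n = A.shape[0]
    if n <= cutoff:                      # fall back to the library GEMM below the cutoff
        return A @ B
    h = n // 2                           # split each matrix into four h x h blocks
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
assert np.allclose(strassen(X, Y), X @ Y)
```

In practice the extra additions and reduced numerical stability mean Strassen only pays off for fairly large matrices, which is why the production libraries below stick to highly tuned classical GEMM kernels.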
Frameworks and Development Tools
- BLIS - A modular framework for building high-performance BLAS-like libraries.
- BLISlab - Educational framework for experimenting with BLIS-like GEMM algorithms.
- Tensile - AMD ROCm JIT compiler for GPU kernels, specializing in GEMM and tensor contractions.
Libraries
CPU Libraries:
- BLASFEO: Optimized for small- to medium-sized dense matrices (BSD-2-Clause)
- blis_apple: BLIS optimized for Apple M1 (BSD-3-Clause)
- FBGEMM: Meta's CPU GEMM for optimized server inference (BSD-3-Clause)
- gemmlowp: Google's low-precision GEMM library (Apache-2.0)
- Intel MKL: Highly optimized math routines for Intel CPUs (Intel Proprietary)
- libFLAME: High-performance dense linear algebra library (BSD-3-Clause)
- LIBXSMM: Specializing in small/micro GEMM kernels (BSD-3-Clause)
- OpenBLAS: Optimized BLAS implementation based on GotoBLAS2 (BSD-3-Clause)
GPU Libraries:
- BitBLAS: Mixed-precision BLAS operations on GPUs (MIT)
- clBLAS: BLAS functions on OpenCL for portability (Apache-2.0)
- CLBlast: Tuned OpenCL BLAS library (Apache-2.0)
- hipBLAS: BLAS for AMD GPU platforms (ROCm) (MIT)
- hipBLASLt: Lightweight BLAS library on ROCm (MIT)
- NVIDIA cuBLAS: Highly tuned BLAS for NVIDIA GPUs (NVIDIA License)
- NVIDIA cuDNN: Deep learning primitives, including GEMM (NVIDIA License)
- NVIDIA cuSPARSE: Sparse matrix computations on NVIDIA GPUs (NVIDIA License)
- NVIDIA CUTLASS: Template library for CUDA GEMM kernels (BSD-3-Clause)
- TiledCUDA: Kernel template library designed to elevate CUDA C's level of abstraction for processing tiles
- TileFusion: Simplifying Kernel Fusion with Tile Processing (MIT)
Cross-Platform Libraries:
- ARM Compute Library: Optimized for ARM platforms (Apache-2.0/MIT)
- CUSP: C++ templates for sparse linear algebra (Apache-2.0)
- CUV: C++/Python for CUDA-based vector/matrix ops
- Ginkgo: High-performance linear algebra on many-core systems (BSD-3-Clause)
- LAPACK: Foundational linear algebra routines (BSD-3-Clause)
- MAGMA: High-performance linear algebra on GPUs and multicore CPUs (BSD-3-Clause)
- oneDNN (MKL-DNN): Cross-platform deep learning primitives with optimized GEMM (Apache-2.0)
- viennacl-dev: OpenCL-based linear algebra library (MIT)
Language-Specific Libraries:
Python:
- JAX (Apache-2.0)
- NumPy (BSD-3-Clause)
- PyTorch (BSD-3-Clause)
- SciPy (BSD-3-Clause)
- TensorFlow (Apache-2.0) & XLA
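All of the Python libraries above ultimately route dense matmul to an optimized GEMM backend. As a small, hedged illustration (shapes and scalars are arbitrary): NumPy's `@` dispatches to whatever BLAS NumPy was built against, while SciPy's `scipy.linalg.blas.dgemm` exposes the alpha/beta GEMM interface directly.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 128))
B = rng.standard_normal((128, 64))
C = rng.standard_normal((256, 64))

# High-level matmul: NumPy forwards this to its BLAS backend.
D1 = 2.0 * (A @ B) + 0.5 * C

# Explicit GEMM: SciPy wraps the underlying BLAS routine, alpha/beta included.
D2 = blas.dgemm(alpha=2.0, a=A, b=B, beta=0.5, c=C)

assert np.allclose(D1, D2)
```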
C++:
- Armadillo (Apache-2.0/MIT)
- Blaze (BSD-3-Clause)
- Boost uBlas (Boost License)
- Eigen (MPL2)
Julia:
- BLIS.jl (BSD-3-Clause)
- GemmKernels.jl (BSD-3-Clause)
Debugging and Profiling Tools
Intel Tools:
- Intel VTune Profiler
- Intel Advisor
NVIDIA Tools:
- NVIDIA Nsight Compute
- NVIDIA Nsight Systems
ROCm Tools:
- ROCm Profiler (rocprof)
- Omniperf
Others:
- Extrae
- FPChecker
- gprof
- gprofng
- HPCToolkit
- LIKWID
- MegPeak
- Perf (Linux)
- TAU
- VAMPIR
- Valgrind (Memcheck)
Learning Resources
- CUDATutorial
- GPU MODE YouTube Channel
- HLS Tutorial & Deep Learning Accelerator Lab1
- HPC Garage
- MIT OCW: 6.172 Performance Engineering
- MIT: Optimizing Matrix Multiplication (6.172 Lecture Notes)
- NJIT: Optimize Matrix Multiplication
- Optimizing Matrix Multiplication using SIMD and Parallelization
- ORNL: CUDA C++ Exercise: Basic Linear Algebra Kernels: GEMM Optimization Strategies
- Purdue: Optimizing Matrix Multiplication
- Stanford: BLAS-level CPU Performance in 100 Lines of C
- UC Berkeley: CS267 Parallel Computing
- UCSB CS 240A: Applied Parallel Computing
- UT Austin: LAFF-On Programming for High Performance
Research Papers:
- BLIS: A Framework for Rapidly Instantiating BLAS Functionality (2015)
- Anatomy of High-Performance Many-Threaded Matrix Multiplication (2014)
- Model-driven BLAS Performance on Loongson (2012)
- High-performance Implementation of the Level-3 BLAS (2008)
- Anatomy of High-Performance Matrix Multiplication (2008)
Blog Posts & Articles:
- A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
- Building a FAST Matrix Multiplication Algorithm
- CUDA GEMM Optimization
- CUDA Learn Notes
- CUTLASS Tutorial: Efficient GEMM kernel designs with Pipelining
- CUTLASS Tutorial: Fast Matrix-Multiplication with WGMMA on NVIDIA Hopper GPUs
- CUTLASS Tutorial: Persistent Kernels and Stream-K
- Deep Dive on CUTLASS Ping-Pong GEMM Kernel
- Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS
- Distributed GEMM - A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems
- Epilogue Fusion in CUTLASS with Epilogue Visitor Trees
- Fast Multidimensional Matrix Multiplication on CPU from Scratch (a cache-blocking sketch follows this list)
- Matrix Multiplication Background Guide (NVIDIA)
- Matrix Multiplication on CPU
- Matrix-Matrix Product Experiments with BLAZE
- Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
- Mixed-input matrix multiplication performance optimizations
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: A Worklog
- Outperforming cuBLAS on H100: a Worklog
- Optimizing Matrix Multiplication
- Optimizing Matrix Multiplication: Cache + OpenMP
- perf-book by Denis Bakhvalov
- Tuning Matrix Multiplication (GEMM) for Intel GPUs
- Why GEMM is at the heart of deep learning
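Several of the CPU-focused posts above (for example the from-scratch CPU article and the cache + OpenMP write-up) start from the same first optimization: blocking the loops so that tiles of A, B, and C stay resident in cache. The NumPy sketch below shows only that blocking structure; the tile size and function name are illustrative, and real kernels add packing, vectorization, and threading on top.

```python
import numpy as np

def blocked_gemm(A, B, tile=64):
    """C = A @ B computed tile by tile so each working set fits in cache."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):            # tiles of rows of C
        for j0 in range(0, N, tile):        # tiles of columns of C
            for k0 in range(0, K, tile):    # tiles along the reduction dimension
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

rng = np.random.default_rng(3)
A, B = rng.standard_normal((200, 300)), rng.standard_normal((300, 150))
assert np.allclose(blocked_gemm(A, B), A @ B)
```

NumPy slicing clips at array bounds, so the sketch also handles sizes that are not multiples of the tile; a production kernel would instead pack edge tiles explicitly.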
Example Implementations
- chgemm: Int8 GEMM implementations
- CoralGemm: AMD high-performance GEMM implementations (MIT)
- CUDA-INT8-GEMM
- cuda-sgemm
- cute_gemm
- Cute-Learning (MIT)
- CUTLASS-based Grouped GEMM: Efficient grouped GEMM operations (Apache-2.0)
- CUTLASS GEMM (BSD-3-Clause)
- DeepBench (Apache-2.0)
- how-to-optimize-gemm (row-major matmul) (GPLv3)
- NVIDIA_SGEMM_PRACTICE: Step-by-step optimization of CUDA SGEMM
- Optimizing-SGEMM-on-NVIDIA-Turing-GPUs (GPLv3)
- SGEMM_CUDA: Step-by-Step Optimization (MIT)
- simple-gemm (MIT)
- TK-GEMM: a Triton FP8 GEMM kernel using SplitK parallelization
- Toy HGEMM (Tensor Cores with MMA/WMMA) (GPLv3)
- xGeMM: Accelerated General (FP32) Matrix Multiplication (MIT)
Contributions
We welcome and encourage contributions! You can help by:
- Adding new libraries, tools, or tutorials.
- Submitting performance benchmarks or example implementations.
- Improving documentation or correcting errors.
Submit a pull request or open an issue to get started!
License
This repository is licensed under the MIT License.
By maintaining this curated list, we hope to empower the community to learn, implement, and optimize GEMM efficiently. Thanks for visiting, and happy computing!