Welcome to Awesome GEMM!
A curated and continually evolving list of frameworks, libraries, tutorials, and tools for optimizing General Matrix Multiply (GEMM) operations. Whether you're a beginner eager to learn the fundamentals, a developer optimizing performance-critical code, or a researcher pushing the limits of hardware, this repository is your launchpad to mastery.
General Matrix Multiply is at the core of a wide range of computational tasks: from scientific simulations and signal processing to modern AI workloads like neural network training and inference. Efficiently implementing and optimizing GEMM can lead to dramatic performance improvements across entire systems.
This repository is a comprehensive resource for:
- Students & Beginners: Learn the basics and theory of matrix multiplication.
- Engineers & Developers: Discover frameworks, libraries, and tools to optimize GEMM on CPUs, GPUs, and specialized hardware.
- Researchers & Performance Experts: Explore cutting-edge techniques, research papers, and advanced optimization strategies.
If you're new and just want to dive in, start here:
For Beginners:
- NumPy (CPU, Python) - The go-to library for basic matrix operations.
- How To Optimize GEMM - A step-by-step guide to improving performance from a naive implementation.
For GPU Developers:
- NVIDIA cuBLAS - Highly optimized BLAS for NVIDIA GPUs.
- NVIDIA CUTLASS - Templates and building blocks to write your own CUDA GEMM kernels.
For Low-Precision & AI Workloads:
- FBGEMM - Meta's low-precision CPU GEMM library for server-side inference.
- gemmlowp - Google's low-precision GEMM library.
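To make the low-precision idea concrete: libraries such as gemmlowp and FBGEMM replace float32 GEMM with int8 inputs and int32 accumulation, then rescale the result. The NumPy sketch below shows only that idea; the function names and the symmetric per-tensor scaling are illustrative simplifications, not any library's actual API.

```python
import numpy as np

def quantize_int8(X):
    """Symmetric per-tensor quantization: float32 -> (int8 values, scale)."""
    scale = np.abs(X).max() / 127.0
    q = np.clip(np.round(X / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_gemm(A, B):
    """Approximate A @ B using int8 inputs and int32 accumulation."""
    qA, sA = quantize_int8(A)
    qB, sB = quantize_int8(B)
    acc = qA.astype(np.int32) @ qB.astype(np.int32)   # integer accumulate
    return acc.astype(np.float32) * (sA * sB)          # dequantize back to float

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 256)).astype(np.float32)
B = rng.standard_normal((256, 64)).astype(np.float32)
err = np.abs(int8_gemm(A, B) - A @ B).max() / np.abs(A @ B).max()
print(f"max relative error vs float32 GEMM: {err:.3%}")
```

Real libraries add per-channel scales, zero points, and saturating vectorized kernels; this sketch only illustrates why int8 GEMM trades a small accuracy loss for much higher throughput.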
Contents:
- Fundamental Theories and Concepts
- General Optimization Techniques
- Frameworks and Development Tools
- Libraries
- Debugging and Profiling Tools
- Learning Resources
- Example Implementations
- Contributions
- License
Fundamental Theories and Concepts
What is GEMM?
- General Matrix Multiply (Intel) - Intro from Intel.
- Spatial-lang GEMM - High-level overview.
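For readers who want the operation spelled out before following the links above: GEMM computes C = alpha * A * B + beta * C. Below is a minimal, illustrative NumPy sketch (the `naive_gemm` name is ours, not from any library) that implements that definition with three nested loops and checks it against `numpy.matmul`.

```python
import numpy as np

def naive_gemm(alpha, A, B, beta, C):
    """Reference GEMM: returns alpha * A @ B + beta * C, one scalar at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and C.shape == (M, N)
    out = beta * C                      # scale the existing C (no in-place update)
    for i in range(M):                  # rows of A / C
        for j in range(N):              # columns of B / C
            acc = 0.0
            for k in range(K):          # reduction dimension
                acc += A[i, k] * B[k, j]
            out[i, j] += alpha * acc
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 32))
B = rng.standard_normal((32, 48))
C = rng.standard_normal((64, 48))
assert np.allclose(naive_gemm(2.0, A, B, 0.5, C), 2.0 * (A @ B) + 0.5 * C)
```

Every optimized library below computes exactly this result; the differences are all in how the loops are blocked, vectorized, and parallelized.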
Matrix Multiplication Algorithms:
- Strassen's Algorithm - Faster asymptotic complexity (roughly O(n^2.81)) for large matrices; a recursion sketch follows this list.
- Winograd's Algorithm - Reduced multiplication count for improved performance.
- How To Optimize GEMM - Hands-on optimization guide.
- GEMM: From Pure C to SSE Optimized Micro Kernels - Detailed tutorial on going from naive to vectorized implementations.
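As a companion to the Strassen's Algorithm link above, here is a hedged NumPy sketch of the recursion for square matrices whose size is a power of two. It forms seven recursive block products instead of the eight that a naive 2x2 block decomposition needs, which is where the roughly O(n^2.81) bound comes from; the `strassen` name and the fallback cutoff are illustrative choices, not taken from any of the linked resources.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen multiply for n x n matrices with n a power of two."""
    n = A.shape[0]
    if n <= cutoff:                      # fall back to the library GEMM below the cutoff
        return A @ B
    h = n // 2                           # split each matrix into four h x h blocks
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
assert np.allclose(strassen(X, Y), X @ Y)
```

In practice the extra additions and reduced numerical stability mean Strassen only pays off for fairly large matrices, which is why the production libraries below stick to highly tuned classical GEMM kernels.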
Frameworks and Development Tools
- BLIS - A modular framework for building high-performance BLAS-like libraries.
- BLISlab - Educational framework for experimenting with BLIS-like GEMM algorithms.
- Tensile - AMD ROCm JIT compiler for GPU kernels, specializing in GEMM and tensor contractions.
Libraries
CPU Libraries:
- BLASFEO: Optimized for small- to medium-sized dense matrices (BSD-2-Clause)
- blis_apple: BLIS optimized for Apple M1 (BSD-3-Clause)
- FBGEMM: Meta's CPU GEMM for optimized server inference (BSD-3-Clause)
- gemmlowp: Google's low-precision GEMM library (Apache-2.0)
- Intel MKL: Highly optimized math routines for Intel CPUs (Intel Proprietary)
- libFLAME: High-performance dense linear algebra library (BSD-3-Clause)
- LIBXSMM: Specializing in small/micro GEMM kernels (BSD-3-Clause)
- OpenBLAS: Optimized BLAS implementation based on GotoBLAS2 (BSD-3-Clause)
GPU Libraries:
- BitBLAS: Mixed-precision BLAS operations on GPUs (MIT)
- clBLAS: BLAS functions on OpenCL for portability (Apache-2.0)
- CLBlast: Tuned OpenCL BLAS library (Apache-2.0)
- hipBLAS: BLAS for AMD GPU platforms (ROCm) (MIT)
- hipBLASLt: Lightweight BLAS library on ROCm (MIT)
- NVIDIA cuBLAS: Highly tuned BLAS for NVIDIA GPUs (NVIDIA License)
- NVIDIA cuDNN: Deep learning primitives, including GEMM (NVIDIA License)
- NVIDIA cuSPARSE: Sparse matrix computations on NVIDIA GPUs (NVIDIA License)
- NVIDIA CUTLASS: Template library for CUDA GEMM kernels (BSD-3-Clause)
- TiledCUDA: Kernel template library designed to elevate CUDA C's level of abstraction for processing tiles
- TileFusion: Simplifying Kernel Fusion with Tile Processing (MIT)
Cross-Platform Libraries:
- ARM Compute Library: Optimized for ARM platforms (Apache-2.0/MIT)
- CUSP: C++ templates for sparse linear algebra (Apache-2.0)
- CUV: C++/Python for CUDA-based vector/matrix ops
- Ginkgo: High-performance linear algebra on many-core systems (BSD-3-Clause)
- LAPACK: Foundational linear algebra routines (BSD-3-Clause)
- MAGMA: High-performance linear algebra on GPUs and multicore CPUs (BSD-3-Clause)
- oneDNN (MKL-DNN): Cross-platform deep learning primitives with optimized GEMM (Apache-2.0)
- viennacl-dev: OpenCL-based linear algebra library (MIT)
Language-Specific Libraries:
Python:
- JAX (Apache-2.0)
- NumPy (BSD-3-Clause)
- PyTorch (BSD-3-Clause)
- SciPy (BSD-3-Clause)
- TensorFlow (Apache-2.0) & XLA
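All of the Python libraries above ultimately route dense matmul to an optimized GEMM backend. As a small, hedged illustration (shapes and scalars are arbitrary): NumPy's `@` dispatches to whatever BLAS NumPy was built against, while SciPy's `scipy.linalg.blas.dgemm` exposes the alpha/beta GEMM interface directly.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 128))
B = rng.standard_normal((128, 64))
C = rng.standard_normal((256, 64))

# High-level matmul: NumPy forwards this to its BLAS backend.
D1 = 2.0 * (A @ B) + 0.5 * C

# Explicit GEMM: SciPy wraps the underlying BLAS routine, alpha/beta included.
D2 = blas.dgemm(alpha=2.0, a=A, b=B, beta=0.5, c=C)

assert np.allclose(D1, D2)
```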
C++:
- Armadillo (Apache-2.0/MIT)
- Blaze (BSD-3-Clause)
- Boost uBlas (Boost License)
- Eigen (MPL2)
Julia:
- BLIS.jl (BSD-3-Clause)
- GemmKernels.jl (BSD-3-Clause)
Debugging and Profiling Tools
Intel Tools:
- Intel VTune Profiler
- Intel Advisor
NVIDIA Tools:
- NVIDIA Nsight Compute
- NVIDIA Nsight Systems
ROCm Tools:
- ROCm Profiler (rocprof)
- Omniperf
Others:
- Extrae
- FPChecker
- gprof
- gprofng
- HPCToolkit
- LIKWID
- MegPeak
- Perf (Linux)
- TAU
- VAMPIR
- Valgrind (Memcheck)
Learning Resources
- CUDATutorial
- GPU MODE YouTube Channel
- HLS Tutorial & Deep Learning Accelerator Lab1
- HPC Garage
- MIT OCW: 6.172 Performance Engineering
- MIT: Optimizing Matrix Multiplication (6.172 Lecture Notes)
- NJIT: Optimize Matrix Multiplication
- Optimizing Matrix Multiplication using SIMD and Parallelization
- ORNL: CUDA C++ Exercise: Basic Linear Algebra Kernels: GEMM Optimization Strategies
- Purdue: Optimizing Matrix Multiplication
- Stanford: BLAS-level CPU Performance in 100 Lines of C
- UC Berkeley: CS267 Parallel Computing
- UCSB CS 240A: Applied Parallel Computing
- UT Austin: LAFF-On Programming for High Performance
Research Papers:
- BLIS: A Framework for Rapidly Instantiating BLAS Functionality (2015)
- Anatomy of High-Performance Many-Threaded Matrix Multiplication (2014)
- Model-driven BLAS Performance on Loongson (2012)
- High-performance Implementation of the Level-3 BLAS (2008)
- Anatomy of High-Performance Matrix Multiplication (2008)
Blog Posts & Articles:
- A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
- Building a FAST Matrix Multiplication Algorithm
- CUDA GEMM Optimization
- CUDA Learn Notes
- CUTLASS Tutorial: Efficient GEMM kernel designs with Pipelining
- CUTLASS Tutorial: Fast Matrix-Multiplication with WGMMA on NVIDIA Hopper GPUs
- CUTLASS Tutorial: Persistent Kernels and Stream-K
- Deep Dive on CUTLASS Ping-Pong GEMM Kernel
- Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS
- Distributed GEMM - A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems
- Epilogue Fusion in CUTLASS with Epilogue Visitor Trees
- Fast Multidimensional Matrix Multiplication on CPU from Scratch (a cache-blocking sketch follows this list)
- Matrix Multiplication Background Guide (NVIDIA)
- Matrix Multiplication on CPU
- Matrix-Matrix Product Experiments with BLAZE
- Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
- Mixed-input matrix multiplication performance optimizations
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: A Worklog
- Outperforming cuBLAS on H100: a Worklog
- Optimizing Matrix Multiplication
- Optimizing Matrix Multiplication: Cache + OpenMP
- perf-book by Denis Bakhvalov
- Tuning Matrix Multiplication (GEMM) for Intel GPUs
- Why GEMM is at the heart of deep learning
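Several of the CPU-focused posts above (for example the from-scratch CPU article and the cache + OpenMP write-up) start from the same first optimization: blocking the loops so that tiles of A, B, and C stay resident in cache. The NumPy sketch below shows only that blocking structure; the tile size and function name are illustrative, and real kernels add packing, vectorization, and threading on top.

```python
import numpy as np

def blocked_gemm(A, B, tile=64):
    """C = A @ B computed tile by tile so each working set fits in cache."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):            # tiles of rows of C
        for j0 in range(0, N, tile):        # tiles of columns of C
            for k0 in range(0, K, tile):    # tiles along the reduction dimension
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

rng = np.random.default_rng(3)
A, B = rng.standard_normal((200, 300)), rng.standard_normal((300, 150))
assert np.allclose(blocked_gemm(A, B), A @ B)
```

NumPy slicing clips at array bounds, so the sketch also handles sizes that are not multiples of the tile; a production kernel would instead pack edge tiles explicitly.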
Example Implementations
- chgemm: Int8 GEMM implementations
- CoralGemm: AMD high-performance GEMM implementations (MIT)
- CUDA-INT8-GEMM
- cuda-sgemm
- cute_gemm
- Cute-Learning (MIT)
- CUTLASS-based Grouped GEMM: Efficient grouped GEMM operations (Apache-2.0)
- CUTLASS GEMM (BSD-3-Clause)
- DeepBench (Apache-2.0)
- how-to-optimize-gemm (row-major matmul) (GPLv3)
- NVIDIA_SGEMM_PRACTICE: Step-by-step optimization of CUDA SGEMM
- Optimizing-SGEMM-on-NVIDIA-Turing-GPUs (GPLv3)
- SGEMM_CUDA: Step-by-Step Optimization (MIT)
- simple-gemm (MIT)
- TK-GEMM: a Triton FP8 GEMM kernel using SplitK parallelization
- Toy HGEMM (Tensor Cores with MMA/WMMA) (GPLv3)
- xGeMM: Accelerated General (FP32) Matrix Multiplication (MIT)
Contributions
We welcome and encourage contributions! You can help by:
- Adding new libraries, tools, or tutorials.
- Submitting performance benchmarks or example implementations.
- Improving documentation or correcting errors.
Submit a pull request or open an issue to get started!
License
This repository is licensed under the MIT License.
By maintaining this curated list, we hope to empower the community to learn, implement, and optimize GEMM efficiently. Thanks for visiting, and happy computing!