GEMM Kernel Microbenchmark

This repo provides a microbenchmark for GEMM kernels on NVIDIA GPUs with the Ampere architecture (sm_80). It includes two benchmarks: one that runs the CUDA kernel directly and one that runs it through a Python extension.

Requirements

NVIDIA GPU with the Ampere architecture (sm_80)
CUDA 12.2

Getting Started

CUDA Kernel Benchmark

Build the project:

$ make

Run a benchmark with specific parameters:

$ ./csrc/bench/main --groups=16 --m=64 --n=64 --k=768 --iterations=3

Where:

--groups: Number of groups
--m, --n, --k: GEMM problem dimensions (A is m x k, B is k x n, C is m x n)
--iterations: Number of benchmark iterations
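
When comparing runs, it can help to convert a measured time into throughput. The sketch below is not part of this repo. It assumes that `--groups` is the number of independent GEMM problems, that every group has the same m, n, k, and that one multiply-add counts as 2 FLOPs. The function names `grouped_gemm_flops` and `tflops` are hypothetical helpers introduced only for illustration.

```python
def grouped_gemm_flops(groups: int, m: int, n: int, k: int) -> int:
    """FLOPs for `groups` independent (m x k) @ (k x n) products.

    Assumption: all groups share the same shape, and one
    multiply-add is counted as 2 floating-point operations.
    """
    return 2 * groups * m * n * k


def tflops(flops: int, seconds: float) -> float:
    """Convert a FLOP count and an elapsed time in seconds to TFLOP/s."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flops / seconds / 1e12


# Parameters taken from the example command above.
flops = grouped_gemm_flops(groups=16, m=64, n=64, k=768)
print(flops)  # 100663296
```

Pass the per-iteration time reported by your run to `tflops` to get a throughput figure; if the benchmark reports a total over all iterations, divide by the iteration count first.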

For more information on available options:

$ ./csrc/bench/main --help
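
To benchmark several problem sizes in one go, the command line can be generated from a script. The following is a minimal sketch, not part of this repo. It uses only the flags shown above and assumes the binary lives at `./csrc/bench/main` relative to the working directory. The helper names `build_command`, `sweep_commands`, and `run_sweep` are hypothetical, and the output of the binary is captured as raw text because its format is not documented here.

```python
import itertools
import subprocess

# Path taken from the example command above; adjust if you build elsewhere.
BENCH = "./csrc/bench/main"


def build_command(groups, m, n, k, iterations=3):
    """Build the argv list for one benchmark run using the documented flags."""
    return [
        BENCH,
        f"--groups={groups}",
        f"--m={m}",
        f"--n={n}",
        f"--k={k}",
        f"--iterations={iterations}",
    ]


def sweep_commands(groups_list, sizes, iterations=3):
    """Return one command per (groups, (m, n, k)) combination."""
    return [
        build_command(g, m, n, k, iterations)
        for g, (m, n, k) in itertools.product(groups_list, sizes)
    ]


def run_sweep(commands):
    """Run each command and return (command, stdout) pairs.

    The stdout text is returned unparsed; no output format is assumed.
    """
    results = []
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
        results.append((cmd, proc.stdout))
    return results
```

For example, `run_sweep(sweep_commands([8, 16], [(64, 64, 768), (128, 128, 768)]))` would launch four runs after `make` has produced the binary.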

Python Extension Benchmark

Export the CUDA kernel as a Python extension:

$ python ./python/testbed/lib.py
$ cd out && TORCH_CUDA_ARCH_LIST="8.0" python setup.py install --user

Run the benchmark:

$ python ./python/testbed/multi_gemm.py > perf.txt

References

CUTLASS Examples
- "02_pytorch_extension_grouped_gemm" Notebook: A guide to implementing grouped GEMM operations as PyTorch extensions.
- "gemm_grouped" CUDA Example: Example code and documentation for grouped GEMM operations in CUDA.
