chanzhennan/cuda_gemm_benchmark

Introduction

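This repository collects several CUDA implementations of GEMM (general matrix multiplication), each applying a different optimization, and benchmarks them against one another and against cuBLAS.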

C = alpha * A * B + beta * C
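As a point of reference for the formula above, here is a minimal CPU sketch of the same operation. It is not taken from this repository: the function name `gemm_reference`, the row-major layout, and the shapes (A is M x K, B is K x N, C is M x N) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Reference (CPU) GEMM: C = alpha * A * B + beta * C.
// Assumption: row-major storage; A is M x K, B is K x N, C is M x N.
// gemm_reference is a hypothetical name, not a function from this repository.
void gemm_reference(std::size_t M, std::size_t N, std::size_t K,
                    float alpha, const std::vector<float>& A,
                    const std::vector<float>& B,
                    float beta, std::vector<float>& C) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // Dot product of row i of A with column j of B.
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            // Scale the product and blend it with the existing value of C.
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

A loop nest like this is useful mainly as a correctness oracle: the output of an optimized kernel can be compared element by element against it on small inputs.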

Matrix Multiplication Algorithm Implementations

Installation

  • Edit build.sh and set the cmake options for your system:
    • cmake -DCUDA_ARCH=/your/cuda/arch -DCUDA_TOOLKIT_ROOT_DIR=/local/cuda/path
  • Run bash build.sh

Performance

Run on RTX 4070 Ti | Theoretical Performance: FP32 (float) 40.09 TFLOPS

| Benchmark | Time | CPU | Iterations | UserCounters |
|---|---|---|---|---|
| Naive/Gemm_float/5120/4096/4096 | 1731 ms | 1731 ms | 1 | TFlops=0.099244/s, operation=171.799G |
| Blocker/Gemm_float/5120/4096/4096 | 103 ms | 103 ms | 6 | TFlops=1.66191/s, operation=1030.79G |
| Strider/Gemm_float/5120/4096/4096 | 19.9 ms | 19.9 ms | 30 | TFlops=8.62941/s, operation=5.15396T |
| Aligner/Gemm_float/5120/4096/4096 | 17.3 ms | 17.3 ms | 33 | TFlops=9.93519/s, operation=5.66936T |
| MultiLoader/Gemm_float/5120/4096/4096 | 19.8 ms | 19.8 ms | 31 | TFlops=8.67294/s, operation=5.32576T |
| BcAvoider/Gemm_float/5120/4096/4096 | 24.2 ms | 24.2 ms | 26 | TFlops=7.10627/s, operation=4.46677T |
| PpBuffer/Gemm_float/5120/4096/4096 | 20.9 ms | 20.9 ms | 28 | TFlops=8.2018/s, operation=4.81036T |
| Dense/Gemm_float/5120/4096/4096 | 11.0 ms | 11.0 ms | 61 | TFlops=15.5654/s, operation=10.4797T |
| Cublas/Gemm_float/5120/4096/4096 | 5.95 ms | 5.95 ms | 115 | TFlops=28.8656/s, operation=19.7568T |
| Yzaiustc/Gemm_float/5120/4096/4096 | 7.23 ms | 7.23 ms | 93 | TFlops=23.765/s, operation=15.9773T |
| Yhs/Gemm_float/5120/4096/4096 | 6.78 ms | 6.78 ms | 100 | TFlops=25.3418/s, operation=17.1799T |

Todo

  • Fix the bug that causes a segmentation fault in MatrixMulCUDA7.
  • Fix CUDA implementations 0 to 6, which cannot handle the case m = 8, n = 4096, k = 4096.
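The cause of the m = 8 failure is not stated here. One common source of this kind of problem in tiled kernels is a tile size that does not divide the matrix dimension, so that a tile runs past the edge of the matrix. Assuming that is the case (an unconfirmed guess), the usual remedy is to round the tile count up and clamp every tile to the matrix bounds. The host-side sketch below illustrates the idea; the names `ceil_div` and `gemm_tiled_guarded`, the tile parameter, and the row-major layout are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of tiles needed to cover `extent` elements (rounds up).
std::size_t ceil_div(std::size_t extent, std::size_t tile) {
    return (extent + tile - 1) / tile;
}

// Tiled C = alpha * A * B + beta * C over an M x N output grid, row-major.
// Each tile is clamped to the matrix edge, so M and N need not be
// multiples of `tile` (for example M = 8 with tile = 32).
// Hypothetical illustration; not the repository's kernel code.
void gemm_tiled_guarded(std::size_t M, std::size_t N, std::size_t K,
                        std::size_t tile, float alpha,
                        const std::vector<float>& A,
                        const std::vector<float>& B,
                        float beta, std::vector<float>& C) {
    for (std::size_t ti = 0; ti < ceil_div(M, tile); ++ti) {
        for (std::size_t tj = 0; tj < ceil_div(N, tile); ++tj) {
            // Clamp the tile's end to the matrix bounds (the "guard").
            const std::size_t i_end = std::min(M, (ti + 1) * tile);
            const std::size_t j_end = std::min(N, (tj + 1) * tile);
            for (std::size_t i = ti * tile; i < i_end; ++i) {
                for (std::size_t j = tj * tile; j < j_end; ++j) {
                    float acc = 0.0f;
                    for (std::size_t k = 0; k < K; ++k) {
                        acc += A[i * K + k] * B[k * N + j];
                    }
                    C[i * N + j] = alpha * acc + beta * C[i * N + j];
                }
            }
        }
    }
}
```

In a CUDA kernel the outer two loops would typically correspond to the block grid, with `ceil_div` used to size the grid and the clamp expressed as a per-thread bounds check before loads and stores.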