GitHub - Cjkkkk/CUDA_gemm: A simple high performance CUDA GEMM implementation.

introduction

A simple, high-performance CUDA implementation of GEMM, block sparse GEMM, and non-uniform quantized GEMM.

C = alpha * A * B + beta * C
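As a point of reference for the formula above, here is a minimal sketch of a naive kernel in which each thread computes one element of C. It assumes row-major float matrices with A of size M x K, B of size K x N, and C of size M x N. The names `gemm_naive`, `gemm_cpu`, and `naive_gemm_max_error` are illustrative and are not taken from this repository.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative naive SGEMM (not the repository's kernel): one thread per element of C.
// Row-major storage: A is M x K, B is K x N, C is M x N.
__global__ void gemm_naive(int M, int N, int K, float alpha, const float* A,
                           const float* B, float beta, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // adjacent threads read adjacent columns of B
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        acc += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

// CPU reference used only to check the kernel.
void gemm_cpu(int M, int N, int K, float alpha, const float* A, const float* B,
              float beta, float* C) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}

// Runs the kernel on deterministic test data and returns the largest
// absolute difference from the CPU reference.
float naive_gemm_max_error(int M, int N, int K, float alpha, float beta) {
    std::vector<float> A(M * K), B(K * N), C(M * N);
    for (size_t i = 0; i < A.size(); ++i) A[i] = 0.25f * static_cast<float>(i % 7);
    for (size_t i = 0; i < B.size(); ++i) B[i] = 0.5f * static_cast<float>(i % 5);
    for (size_t i = 0; i < C.size(); ++i) C[i] = static_cast<float>(i % 3);
    std::vector<float> ref = C;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    gemm_naive<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);
    // The blocking copy below also waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    gemm_cpu(M, N, K, alpha, A.data(), B.data(), beta, ref.data());
    float err = 0.0f;
    for (size_t i = 0; i < C.size(); ++i) err = fmaxf(err, fabsf(C[i] - ref[i]));
    return err;
}

int main() {
    printf("max abs error: %g\n", naive_gemm_max_error(64, 48, 32, 1.5f, 0.5f));
    return 0;
}
```

Mapping `threadIdx.x` to the column index means the threads of a warp read consecutive addresses of B and write consecutive addresses of C, which is what allows those accesses to coalesce.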

algorithm

The kernels are located in src/cuda/.

MatrixMulCUDA
- each thread computes one element of C
- coalesced global memory access to B
MatrixMulCUDA1
- texture loads
MatrixMulCUDA2
- each thread computes a 4 x 4 tile of C
MatrixMulCUDA3
- vectorized loads of A and B
MatrixMulCUDA4
- vectorized stores to C
MatrixMulCUDA5
- block sparse version
MatrixMulCUDA6
- vectorized and coalesced loads of A and B
MatrixMulCUDA7
- warp shuffle to enable coalesced stores to C
MatrixMulCUDAQuantize8bit
- 8-bit non-uniform quantized matmul
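Two of the ideas in the list above, a 4 x 4 tile of C per thread and vectorized access, can be sketched together as follows. This is a simplified illustration rather than the repository's kernels: it uses no shared memory, vectorizes only the loads of B and the accesses to C, and assumes row-major matrices with M and N divisible by 4 and buffers from `cudaMalloc` (so every `float4` access is 16-byte aligned). The names `gemm_tile4x4` and `tile4x4_max_error` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative sketch: each thread computes a 4 x 4 tile of C.
// Assumes M % 4 == 0, N % 4 == 0, and 16-byte aligned row-major buffers.
__global__ void gemm_tile4x4(int M, int N, int K, float alpha, const float* A,
                             const float* B, float beta, float* C) {
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * 4;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (row0 >= M || col0 >= N) return;

    float acc[4][4] = {};  // 16 accumulators, intended to stay in registers
    for (int k = 0; k < K; ++k) {
        // One 128-bit load fetches four consecutive elements of row k of B.
        float4 b = *reinterpret_cast<const float4*>(&B[k * N + col0]);
        float bv[4] = {b.x, b.y, b.z, b.w};
#pragma unroll
        for (int i = 0; i < 4; ++i) {
            float a = A[(row0 + i) * K + k];
#pragma unroll
            for (int j = 0; j < 4; ++j) acc[i][j] += a * bv[j];
        }
    }
#pragma unroll
    for (int i = 0; i < 4; ++i) {
        // One 128-bit load and one 128-bit store per row of the tile.
        float4* out = reinterpret_cast<float4*>(&C[(row0 + i) * N + col0]);
        float4 c = *out;
        c.x = alpha * acc[i][0] + beta * c.x;
        c.y = alpha * acc[i][1] + beta * c.y;
        c.z = alpha * acc[i][2] + beta * c.z;
        c.w = alpha * acc[i][3] + beta * c.w;
        *out = c;
    }
}

// Runs the kernel on deterministic data and returns the largest absolute
// difference from a straightforward CPU computation.
float tile4x4_max_error(int M, int N, int K, float alpha, float beta) {
    std::vector<float> A(M * K), B(K * N), C(M * N);
    for (size_t i = 0; i < A.size(); ++i) A[i] = 0.25f * static_cast<float>(i % 7);
    for (size_t i = 0; i < B.size(); ++i) B[i] = 0.5f * static_cast<float>(i % 5);
    for (size_t i = 0; i < C.size(); ++i) C[i] = static_cast<float>(i % 3);
    std::vector<float> ref = C;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // each block covers a 64 x 64 region of C
    dim3 grid((N / 4 + block.x - 1) / block.x, (M / 4 + block.y - 1) / block.y);
    gemm_tile4x4<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    float err = 0.0f;
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            float expected = alpha * acc + beta * ref[i * N + j];
            err = fmaxf(err, fabsf(C[i * N + j] - expected));
        }
    }
    return err;
}

int main() {
    printf("max abs error: %g\n", tile4x4_max_error(64, 48, 32, 1.5f, 0.5f));
    return 0;
}
```

Compared with one element per thread, each value loaded from A is reused four times and each value loaded from B is reused four times, so the kernel issues fewer memory instructions per multiply-add.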

experiments

The benchmarks are located in benchmark/.

benchmark_dense
- compares my GEMM with cuBLAS
benchmark_sparse
- compares my block sparse GEMM with cuSPARSE
benchmark_quantization_8bit
- compares my GEMM with cuBLAS
benchmark_quantization
- compares my GEMM with my non-uniform 8-bit quantized GEMM
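For context on the cuBLAS comparisons above, the following is a rough sketch of how a cuBLAS SGEMM baseline can be timed with CUDA events. It is not the code in benchmark/; the function names, the matrix size, and the iteration count are assumptions chosen for illustration. Because cuBLAS expects column-major matrices, the row-major product C = A * B is obtained by asking cuBLAS for C^T = B^T * A^T, which only requires swapping the operands and dimensions.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative helper (hypothetical name): average time in milliseconds of one
// row-major SGEMM, C (M x N) = A (M x K) * B (K x N), computed by cuBLAS.
float time_cublas_sgemm(cublasHandle_t handle, int M, int N, int K,
                        const float* dA, const float* dB, float* dC, int iters) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so one-time initialization is not included in the timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                &alpha, dB, N, dA, K, &beta, dC, N);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        // Column-major view: C^T (N x M) = B^T (N x K) * A^T (K x M).
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                    &alpha, dB, N, dA, K, &beta, dC, N);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// Allocates zero-filled matrices, times cuBLAS, and returns the average
// time per call in milliseconds.
float benchmark_cublas(int M, int N, int K, int iters) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemset(dA, 0, sizeof(float) * M * K);
    cudaMemset(dB, 0, sizeof(float) * K * N);
    cudaMemset(dC, 0, sizeof(float) * M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    float ms = time_cublas_sgemm(handle, M, N, K, dA, dB, dC, iters);
    cublasDestroy(handle);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ms;
}

int main() {
    const int M = 1024, N = 1024, K = 1024;
    float ms = benchmark_cublas(M, N, K, 20);
    // A GEMM performs 2 * M * N * K floating point operations.
    double gflops = 2.0 * M * N * K / (ms * 1e6);
    printf("cuBLAS SGEMM: %.3f ms per call, %.1f GFLOP/s\n", ms, gflops);
    return 0;
}
```

This sketch needs to be linked against cuBLAS, for example with `nvcc file.cu -lcublas`. Timing a custom kernel with the same event pattern gives numbers that can be compared directly.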

TODO

(MatrixMulCUDA7) when writing back to the C matrix, use warp shuffle so that global memory stores are coalesced
(MatrixMulCUDA8) double buffering

run

mkdir builds
make benchmark_[experiment name]
bash scripts/benchmark_[experiment name].sh

Note

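When sparsity is about 1%, cuSPARSE can outperform cuBLAS.
Allocate registers carefully and, wherever possible, let parameters be determined at compile time to save computation and reduce register usage.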
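One common way to fix parameters at compile time in CUDA is to pass them as template arguments. The sketch below is an illustration under that assumption and is not the repository's code: the tile size `BLOCK` is a template parameter, so the shared memory arrays have a static size and the inner loop has a constant trip count that the compiler can unroll. The names `gemm_tiled` and `tiled_max_error` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative shared-memory tiled SGEMM with a compile-time tile size.
// Must be launched with a BLOCK x BLOCK thread block. Row-major storage.
template <int BLOCK>
__global__ void gemm_tiled(int M, int N, int K, float alpha, const float* A,
                           const float* B, float beta, float* C) {
    __shared__ float As[BLOCK][BLOCK];  // size known at compile time
    __shared__ float Bs[BLOCK][BLOCK];
    int row = blockIdx.y * BLOCK + threadIdx.y;
    int col = blockIdx.x * BLOCK + threadIdx.x;
    float acc = 0.0f;
    for (int k0 = 0; k0 < K; k0 += BLOCK) {
        int ka = k0 + threadIdx.x;
        int kb = k0 + threadIdx.y;
        // Out-of-range elements are loaded as zero so every thread reaches the barriers.
        As[threadIdx.y][threadIdx.x] = (row < M && ka < K) ? A[row * K + ka] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (kb < K && col < N) ? B[kb * N + col] : 0.0f;
        __syncthreads();
#pragma unroll
        for (int k = 0; k < BLOCK; ++k) {  // constant trip count, fully unrollable
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }
    if (row < M && col < N) {
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

// Runs gemm_tiled<BLOCK> on deterministic data and returns the largest
// absolute difference from a straightforward CPU computation.
template <int BLOCK>
float tiled_max_error(int M, int N, int K, float alpha, float beta) {
    std::vector<float> A(M * K), B(K * N), C(M * N);
    for (size_t i = 0; i < A.size(); ++i) A[i] = 0.25f * static_cast<float>(i % 7);
    for (size_t i = 0; i < B.size(); ++i) B[i] = 0.5f * static_cast<float>(i % 5);
    for (size_t i = 0; i < C.size(); ++i) C[i] = static_cast<float>(i % 3);
    std::vector<float> ref = C;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(BLOCK, BLOCK);
    dim3 grid((N + BLOCK - 1) / BLOCK, (M + BLOCK - 1) / BLOCK);
    gemm_tiled<BLOCK><<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    float err = 0.0f;
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            float expected = alpha * acc + beta * ref[i * N + j];
            err = fmaxf(err, fabsf(C[i * N + j] - expected));
        }
    }
    return err;
}

int main() {
    printf("BLOCK=16 max abs error: %g\n", tiled_max_error<16>(50, 70, 33, 1.5f, 0.5f));
    printf("BLOCK=8  max abs error: %g\n", tiled_max_error<8>(50, 70, 33, 1.5f, 0.5f));
    return 0;
}
```

Each value of `BLOCK` produces a separately compiled kernel, so different tile sizes can be tried without adding runtime branches or dynamically sized shared memory.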

