CUDA Practice Mini-Project: Parallel Prefix Sum (Scan)
My implementation of parallel exclusive scan in CUDA, following this NVIDIA paper.
- Block scanning
- Full scan for large vectors (supports multi-level scans)
- Bank conflict avoidance optimization (BCAO)
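The first and third items above can be illustrated with a minimal sketch of a single-block exclusive scan. It assumes the work-efficient up-sweep/down-sweep formulation with padded shared-memory indices, which may differ from the code in this repository. The names `block_scan_bcao`, `exclusive_scan_block` and `CONFLICT_FREE_OFFSET` are chosen for illustration only, and the input length is assumed to be a power of two that fits in one block. Error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumption: 32 shared-memory banks, so one padding slot per 32 elements.
#define LOG_NUM_BANKS 5
#define CONFLICT_FREE_OFFSET(i) ((i) >> LOG_NUM_BANKS)

// Exclusive scan of n ints by ONE block of n/2 threads (n is a power of two).
__global__ void block_scan_bcao(const int *in, int *out, int n) {
    extern __shared__ int temp[];
    int tid = threadIdx.x;

    // Each thread loads two elements, half an array apart, into padded slots.
    int ai = tid;
    int bi = tid + n / 2;
    int pad_a = CONFLICT_FREE_OFFSET(ai);
    int pad_b = CONFLICT_FREE_OFFSET(bi);
    temp[ai + pad_a] = in[ai];
    temp[bi + pad_b] = in[bi];

    // Up-sweep: build partial sums in place, as in a reduction tree.
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int i = offset * (2 * tid + 1) - 1;
            int j = offset * (2 * tid + 2) - 1;
            i += CONFLICT_FREE_OFFSET(i);
            j += CONFLICT_FREE_OFFSET(j);
            temp[j] += temp[i];
        }
        offset *= 2;
    }

    // Clear the root so the result is an exclusive scan.
    if (tid == 0) {
        int last = n - 1;
        temp[last + CONFLICT_FREE_OFFSET(last)] = 0;
    }

    // Down-sweep: push the partial sums back down the tree.
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int i = offset * (2 * tid + 1) - 1;
            int j = offset * (2 * tid + 2) - 1;
            i += CONFLICT_FREE_OFFSET(i);
            j += CONFLICT_FREE_OFFSET(j);
            int t = temp[i];
            temp[i] = temp[j];
            temp[j] += t;
        }
    }
    __syncthreads();

    out[ai] = temp[ai + pad_a];
    out[bi] = temp[bi + pad_b];
}

// Hypothetical host wrapper: copies the input to the GPU, scans it, copies back.
std::vector<int> exclusive_scan_block(const std::vector<int> &h_in) {
    int n = static_cast<int>(h_in.size());
    // Assumed limits: power of two, at most 2 elements per thread, 1024 threads.
    assert(n >= 2 && n <= 2048 && (n & (n - 1)) == 0);

    std::vector<int> h_out(n);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Shared memory holds n elements plus the padding slots.
    size_t smem = (n + CONFLICT_FREE_OFFSET(n)) * sizeof(int);
    block_scan_bcao<<<1, n / 2, smem>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main() {
    std::vector<int> in = {3, 1, 7, 0, 4, 1, 6, 3};
    std::vector<int> out = exclusive_scan_block(in);
    for (int v : out) printf("%d ", v);
    printf("\n");
    return 0;
}
```

For the input `3 1 7 0 4 1 6 3`, the exclusive scan is `0 3 4 11 11 15 16 22`. A scan over a larger vector is commonly built on top of a kernel like this one: each block scans its own chunk and writes its total to an auxiliary array, that array is scanned in turn (recursively if it is itself too large for one block), and the scanned totals are added back to each chunk. Whether this repository structures its multi-level scan exactly that way is an assumption.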
Benchmark results on an i5-11400 @ 2.60GHz + GeForce RTX 2060 Super:
-------------------------- N = 1000 --------------------------
scan_cpu - total: 0.00147 ms
sequential_scan_gpu - total: 0.16394 ms kernel: 0.05120 ms
parallel_block_scan_gpu - total: 0.12372 ms kernel: 0.01190 ms
parallel_block_scan_gpu with bcao - total: 0.11636 ms kernel: 0.01027 ms
parallel_large_scan_gpu - total: 0.13093 ms kernel: 0.02202 ms
parallel_large_scan_gpu with bcao - total: 0.12363 ms kernel: 0.01696 ms
-------------------------- N = 2048 --------------------------
scan_cpu - total: 0.00403 ms
sequential_scan_gpu - total: 0.20442 ms kernel: 0.09626 ms
parallel_block_scan_gpu - total: 0.12439 ms kernel: 0.01395 ms
parallel_block_scan_gpu with bcao - total: 0.12183 ms kernel: 0.01360 ms
parallel_large_scan_gpu - total: 0.12436 ms kernel: 0.02048 ms
parallel_large_scan_gpu with bcao - total: 0.12137 ms kernel: 0.01638 ms
-------------------------- N = 100000 --------------------------
scan_cpu - total: 0.25345 ms
sequential_scan_gpu - total: 4.93468 ms kernel: 4.60474 ms
parallel_large_scan_gpu - total: 0.30275 ms kernel: 0.05891 ms
parallel_large_scan_gpu with bcao - total: 0.26996 ms kernel: 0.04157 ms
-------------------------- N = 10000000 --------------------------
scan_cpu - total: 27.09050 ms
sequential_scan_gpu - total: 484.60391 ms kernel: 469.34097 ms
parallel_large_scan_gpu - total: 10.31124 ms kernel: 1.15578 ms
parallel_large_scan_gpu with bcao - total: 9.54029 ms kernel: 0.89962 ms
https://github.com/mattdean1/cuda
https://github.com/TVycas/CUDA-Parallel-Prefix-Sum