li199603/parallel_prefix_sum

Parallel Prefix Sum (Scan) with CUDA

A small CUDA practice project: Parallel Prefix Sum (Scan).
My implementation of a parallel exclusive scan in CUDA, following this NVIDIA paper.
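
For reference, an exclusive scan writes to each output position the sum of all inputs strictly before it, so the first output is always 0. A minimal sequential sketch of that definition follows; the name `exclusive_scan_cpu` is hypothetical and is not taken from this repository.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not the repository's code).
// out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<int> exclusive_scan_cpu(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything strictly before position i
        running += in[i];
    }
    return out;
}
```

For the input `{3, 1, 7, 0, 4, 1, 6, 3}` this returns `{0, 3, 4, 11, 11, 15, 16, 22}`.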

Achievements

  • Block scanning
  • Full scan for large vectors (with support for multi-level scans)
  • Bank conflict avoidance optimization (BCAO)
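
The three items above can be sketched on the host in plain C++. This is only an illustration and not the repository's kernels. It assumes the work-efficient up-sweep/down-sweep scheme commonly used for CUDA scans, 32 shared-memory banks, and a padding rule of one extra slot per 32 elements. The names `conflict_free_offset`, `padded`, `block_scan`, and `large_scan` are hypothetical. Each inner loop iteration stands in for one GPU thread, and on the CPU the padding only demonstrates the index remapping, with no performance effect.

```cpp
#include <cstddef>
#include <vector>

// Assumption: 32 banks, so skip one slot for every 32 logical elements.
constexpr std::size_t kLogNumBanks = 5;
inline std::size_t conflict_free_offset(std::size_t i) { return i >> kLogNumBanks; }
inline std::size_t padded(std::size_t i) { return i + conflict_free_offset(i); }

// In-place exclusive scan of one block of n elements (n a power of two, n >= 1).
// Returns the block total, which the multi-level scan needs.
int block_scan(int* data, std::size_t n) {
    std::vector<int> temp(padded(n - 1) + 1, 0);  // stands in for shared memory
    for (std::size_t i = 0; i < n; ++i) temp[padded(i)] = data[i];

    // Up-sweep: build partial sums in a balanced tree.
    for (std::size_t stride = 1; stride < n; stride *= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            temp[padded(i)] += temp[padded(i - stride)];

    int total = temp[padded(n - 1)];
    temp[padded(n - 1)] = 0;  // clear the root before the down-sweep

    // Down-sweep: push prefix sums back down the tree.
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = temp[padded(i - stride)];
            temp[padded(i - stride)] = temp[padded(i)];
            temp[padded(i)] += left;
        }

    for (std::size_t i = 0; i < n; ++i) data[i] = temp[padded(i)];
    return total;
}

// Multi-level scan for any length: scan fixed-size blocks, scan the block
// totals (recursing while they do not fit in one block), then add each
// block's offset to its elements. block must be a power of two >= 2.
std::vector<int> large_scan(const std::vector<int>& in, std::size_t block = 64) {
    if (in.empty()) return {};
    std::size_t num_blocks = (in.size() + block - 1) / block;
    std::vector<int> buf(num_blocks * block, 0);  // zero-pad the last block
    for (std::size_t i = 0; i < in.size(); ++i) buf[i] = in[i];

    std::vector<int> sums(num_blocks);
    for (std::size_t b = 0; b < num_blocks; ++b)
        sums[b] = block_scan(&buf[b * block], block);

    if (num_blocks > 1) {
        std::vector<int> offsets = large_scan(sums, block);  // next level
        for (std::size_t b = 0; b < num_blocks; ++b)
            for (std::size_t j = 0; j < block; ++j)
                buf[b * block + j] += offsets[b];
    }
    buf.resize(in.size());  // drop the zero padding
    return buf;
}
```

Calling `large_scan(v)` on a vector of any length yields the same values as a sequential exclusive scan. In a real kernel the block size would typically be tied to the thread-block size, and the padded buffer would live in shared memory.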

Results

Test platform: Intel Core i5-11400 @ 2.60GHz + GeForce RTX 2060 Super

-------------------------- N = 1000 --------------------------
                           scan_cpu - total:    0.00147 ms
                sequential_scan_gpu - total:    0.16394 ms    kernel:    0.05120 ms
            parallel_block_scan_gpu - total:    0.12372 ms    kernel:    0.01190 ms
  parallel_block_scan_gpu with bcao - total:    0.11636 ms    kernel:    0.01027 ms
            parallel_large_scan_gpu - total:    0.13093 ms    kernel:    0.02202 ms
  parallel_large_scan_gpu with bcao - total:    0.12363 ms    kernel:    0.01696 ms

-------------------------- N = 2048 --------------------------
                           scan_cpu - total:    0.00403 ms
                sequential_scan_gpu - total:    0.20442 ms    kernel:    0.09626 ms
            parallel_block_scan_gpu - total:    0.12439 ms    kernel:    0.01395 ms
  parallel_block_scan_gpu with bcao - total:    0.12183 ms    kernel:    0.01360 ms
            parallel_large_scan_gpu - total:    0.12436 ms    kernel:    0.02048 ms
  parallel_large_scan_gpu with bcao - total:    0.12137 ms    kernel:    0.01638 ms

-------------------------- N = 100000 --------------------------
                           scan_cpu - total:    0.25345 ms
                sequential_scan_gpu - total:    4.93468 ms    kernel:    4.60474 ms
            parallel_large_scan_gpu - total:    0.30275 ms    kernel:    0.05891 ms
  parallel_large_scan_gpu with bcao - total:    0.26996 ms    kernel:    0.04157 ms

-------------------------- N = 10000000 --------------------------
                           scan_cpu - total:   27.09050 ms
                sequential_scan_gpu - total:  484.60391 ms    kernel:  469.34097 ms
            parallel_large_scan_gpu - total:   10.31124 ms    kernel:    1.15578 ms
  parallel_large_scan_gpu with bcao - total:    9.54029 ms    kernel:    0.89962 ms

Acknowledgements

https://github.com/mattdean1/cuda
https://github.com/TVycas/CUDA-Parallel-Prefix-Sum
