CUDA Programming
Baiju Meswani edited this page Feb 15, 2023 · 3 revisions
- Understand the hardware
  - Architecture Generations
    - P100: Pascal / sm_60
    - V100: Volta / sm_70
    - A100: Ampere / sm_80
  - CUDA Core vs. Tensor Core
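
To check which generation a machine actually has, the runtime can be queried for each device's compute capability. The following is a minimal sketch; the helper names `sm_version` and `print_devices` are made up for illustration, and only `cudaGetDeviceCount` and `cudaGetDeviceProperties` come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns major * 10 + minor for a device
// (60 on a P100, 70 on a V100, 80 on an A100), or -1 if the query fails.
int sm_version(int device) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
  return prop.major * 10 + prop.minor;
}

// Hypothetical helper: prints one line per visible device and returns
// the device count (0 if the runtime reports an error).
int print_devices() {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;
  for (int d = 0; d < count; ++d) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
    std::printf("device %d: %s, sm_%d%d, %d SMs\n", d, prop.name,
                prop.major, prop.minor, prop.multiProcessorCount);
  }
  return count;
}
```

The same number is what `nvcc` expects in its architecture flag, for example `nvcc -arch=sm_70` when targeting a V100.
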
- Programming model
  - Thread
  - Block
  - Grid
  - Stream
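
The four terms above fit together as follows: a kernel launch creates a grid of blocks, each block holds a fixed number of threads, and the launch is enqueued on a stream. A minimal sketch, assuming a one-dimensional problem; `add_one` and `add_one_on_gpu` are illustrative names, not part of any library.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element. Its global index is built from the
// block index, the block size, and the thread's index within the block.
__global__ void add_one(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;  // guard: the last block may be only partly used
}

// Hypothetical helper: adds 1 to every element on the GPU.
// Returns true if all CUDA calls succeeded.
bool add_one_on_gpu(std::vector<float>& host) {
  if (host.empty()) return true;  // a launch with zero blocks is invalid
  int n = static_cast<int>(host.size());
  size_t bytes = n * sizeof(float);

  cudaStream_t stream;
  if (cudaStreamCreate(&stream) != cudaSuccess) return false;
  float* dev = nullptr;
  if (cudaMalloc(&dev, bytes) != cudaSuccess) {
    cudaStreamDestroy(stream);
    return false;
  }

  // Work on one stream runs in issue order: copy in, compute, copy out.
  cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, stream);
  int threads = 256;                         // threads per block
  int blocks = (n + threads - 1) / threads;  // blocks in the grid, rounded up
  add_one<<<blocks, threads, 0, stream>>>(dev, n);
  cudaError_t launch_err = cudaGetLastError();
  cudaMemcpyAsync(host.data(), dev, bytes, cudaMemcpyDeviceToHost, stream);
  cudaError_t sync_err = cudaStreamSynchronize(stream);

  cudaFree(dev);
  cudaStreamDestroy(stream);
  return launch_err == cudaSuccess && sync_err == cudaSuccess;
}
```

The block size of 256 is an arbitrary choice for the sketch; a good value depends on the kernel and the device.
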
- Must-know functions
  - `cudaMalloc()` vs. `cudaFree()`
  - `cudaMemcpy()` vs. `cudaMemcpyAsync()`
  - `cudaMemset()` vs. `cudaMemsetAsync()`
  - `cudaStreamSynchronize()` vs. `cudaDeviceSynchronize()`
  - `cudaEventRecord()` vs. `cudaStreamWaitEvent()`
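
One way to see how these calls relate is to use them together. The sketch below fills a buffer on one stream and copies it back on another, with an event ordering the two. The function name `two_stream_demo`, the stream names, and the `TRY` macro are assumptions made for the example, and the error handling is deliberately minimal.

```cuda
#include <cuda_runtime.h>

// Sketch-level error handling: return false on the first failing call.
// Resources allocated before the failure are leaked on that path.
#define TRY(call) do { if ((call) != cudaSuccess) return false; } while (0)

bool two_stream_demo() {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(int);
  int* dev = nullptr;
  int* host = nullptr;
  cudaStream_t producer, consumer;
  cudaEvent_t filled;

  TRY(cudaMalloc(&dev, bytes));       // device memory, released with cudaFree
  TRY(cudaMallocHost(&host, bytes));  // pinned host memory for asynchronous copies
  TRY(cudaStreamCreate(&producer));
  TRY(cudaStreamCreate(&consumer));
  TRY(cudaEventCreateWithFlags(&filled, cudaEventDisableTiming));

  // The Async variants enqueue work on a stream and return right away.
  TRY(cudaMemsetAsync(dev, 0xFF, bytes, producer));  // all bytes 0xFF, so each int is -1
  TRY(cudaEventRecord(filled, producer));            // marks "memset finished" in producer

  // consumer waits for the event on the GPU; the CPU thread is not blocked.
  TRY(cudaStreamWaitEvent(consumer, filled, 0));
  TRY(cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, consumer));

  // Blocks the CPU until this one stream has drained.
  // cudaDeviceSynchronize() would wait for every stream on the device.
  TRY(cudaStreamSynchronize(consumer));

  bool ok = (host[0] == -1) && (host[n - 1] == -1);

  cudaEventDestroy(filled);
  cudaStreamDestroy(producer);
  cudaStreamDestroy(consumer);
  cudaFreeHost(host);
  cudaFree(dev);
  return ok;
}
```

Without the `cudaStreamWaitEvent` call the copy on `consumer` could start before the memset on `producer` has finished, because separate streams are not ordered with respect to each other.
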
- Avoid memcpy
- Avoid unnecessary synchronization
- Preprocess data on the CPU
- When to use `#pragma unroll`?
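
As one data point for the unrolling question, `#pragma unroll` is most clearly useful on short loops whose trip count is known at compile time. A minimal sketch, assuming a row-major matrix with a fixed, small number of rows; `kRows`, `column_sum_kernel`, and `column_sums` are names invented for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kRows = 4;  // hypothetical compile-time row count

// One thread per column. The loop bound is a compile-time constant, so
// #pragma unroll lets the compiler replace the loop with straight-line code.
// Neighbouring threads read neighbouring addresses in every iteration.
__global__ void column_sum_kernel(const float* in, float* out, int cols) {
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  if (c >= cols) return;
  float acc = 0.0f;
#pragma unroll
  for (int r = 0; r < kRows; ++r) {
    acc += in[r * cols + c];
  }
  out[c] = acc;
}

// Hypothetical host wrapper: `in` holds kRows x cols values in row-major order.
// Returns false on bad input or if a CUDA call fails.
bool column_sums(const std::vector<float>& in, std::vector<float>& out) {
  if (in.empty() || in.size() % kRows != 0) return false;
  int cols = static_cast<int>(in.size() / kRows);
  out.assign(cols, 0.0f);

  float* d_in = nullptr;
  float* d_out = nullptr;
  if (cudaMalloc(&d_in, in.size() * sizeof(float)) != cudaSuccess) return false;
  if (cudaMalloc(&d_out, cols * sizeof(float)) != cudaSuccess) {
    cudaFree(d_in);
    return false;
  }

  cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
  int threads = 128;
  column_sum_kernel<<<(cols + threads - 1) / threads, threads>>>(d_in, d_out, cols);
  // The blocking copy also waits for the kernel to finish.
  cudaError_t err = cudaMemcpy(out.data(), d_out, cols * sizeof(float),
                               cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return err == cudaSuccess;
}
```

Whether unrolling helps in a given kernel is best confirmed by profiling, since it can raise register pressure.
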
- Easy: Dropout/DropGrad
- Medium: SoftmaxCrossEntropyLoss(Grad)
- Hard: LayerNormalization, ReduceSum, GatherGrad
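
For orientation, the easy end of that list is an element-wise kernel. The sketch below shows a generic dropout gradient, assuming the common "inverted dropout" convention in which kept elements are scaled by 1 / (1 - ratio). It is not ORT's implementation, and `dropout_grad_kernel` and `dropout_grad` are illustrative names.

```cuda
#include <cuda_runtime.h>

// dx[i] = dy[i] * scale where the forward pass kept element i, otherwise 0.
__global__ void dropout_grad_kernel(const float* dy, const bool* mask,
                                    float scale, float* dx, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dx[i] = mask[i] ? dy[i] * scale : 0.0f;
}

// Hypothetical host wrapper working on plain host arrays of length n.
// `ratio` is the drop probability and must be in [0, 1).
bool dropout_grad(const float* dy, const bool* mask, float ratio,
                  float* dx, int n) {
  if (n <= 0 || ratio < 0.0f || ratio >= 1.0f) return false;
  float* d_dy = nullptr;
  float* d_dx = nullptr;
  bool* d_mask = nullptr;
  bool ok = cudaMalloc(&d_dy, n * sizeof(float)) == cudaSuccess &&
            cudaMalloc(&d_dx, n * sizeof(float)) == cudaSuccess &&
            cudaMalloc(&d_mask, n * sizeof(bool)) == cudaSuccess;
  if (ok) {
    cudaMemcpy(d_dy, dy, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_mask, mask, n * sizeof(bool), cudaMemcpyHostToDevice);
    int threads = 256;
    float scale = 1.0f / (1.0f - ratio);
    dropout_grad_kernel<<<(n + threads - 1) / threads, threads>>>(
        d_dy, d_mask, scale, d_dx, n);
    // The blocking copy waits for the kernel and reports its errors.
    ok = cudaMemcpy(dx, d_dx, n * sizeof(float),
                    cudaMemcpyDeviceToHost) == cudaSuccess;
  }
  cudaFree(d_dy);    // cudaFree(nullptr) is a no-op, so this is safe
  cudaFree(d_dx);    // even if an allocation above failed
  cudaFree(d_mask);
  return ok;
}
```

The medium and hard entries differ mainly in that they need reductions across threads, which an element-wise kernel like this one avoids.
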
- `printf()` works inside CUDA code
- Memcpy data to CPU for inspection?
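
Both debugging approaches can be combined in a few lines. The following is a minimal sketch with invented names (`square`, `debug_square`): the kernel prints from a single thread, and the host then copies the result back and prints it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  data[i] = data[i] * data[i];
  // Print from one thread only; printing from every thread floods the output.
  if (i == 0) {
    printf("[kernel] block %d thread %d: data[0] = %f\n",
           blockIdx.x, threadIdx.x, data[0]);
  }
}

// Hypothetical helper: squares `host` in place on the GPU, then prints
// the values on the CPU. Returns true if all CUDA calls succeeded.
bool debug_square(float* host, int n) {
  if (n <= 0) return false;
  float* dev = nullptr;
  if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
  cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

  square<<<(n + 255) / 256, 256>>>(dev, n);
  // Device-side printf output is buffered and flushed at synchronization points.
  cudaError_t err = cudaDeviceSynchronize();

  // Copy the data back so it can be inspected with ordinary host tools.
  if (err == cudaSuccess) {
    err = cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
  }
  if (err == cudaSuccess) {
    for (int i = 0; i < n; ++i) std::printf("[host] data[%d] = %f\n", i, host[i]);
  }
  cudaFree(dev);
  return err == cudaSuccess;
}
```

Device-side `printf` suits quick checks on a few threads, while copying to the CPU is the more practical route when whole tensors need to be compared.
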
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.