- C++ Core Guidelines - Performance
- CPU Caches and Why You Care
- Optimizing Software in C++
- CUDA C++ Best Practices Guide
- Prefer larger block sizes when not using barriers
  - Prefer doing more work per block element (HIP / CUDA thread);
  - Avoid bijective mappings where each element in the execution domain performs only one computation on a single element from the data domain, as these do not amortise setup / tear-down costs (a sketch contrasting the two formulations follows this list);
  - See the following for an argument in favour of this being desirable on the GPU as well: TODO
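
As an illustration, the following is a minimal sketch contrasting a bijective, one-element-per-thread SAXPY with a grid-stride formulation that does more work per block element; the kernel names are hypothetical and not part of any library.

```cpp
#include <hip/hip_runtime.h>

// Bijective mapping: each block element touches exactly one data element,
// so the per-element work cannot amortise block setup / tear-down costs.
__global__ void saxpy_one_per_thread(float a, const float* x, float* y, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Grid-stride loop: each block element processes many data elements, so the
// fixed costs are amortised and the grid can use fewer, larger blocks.
__global__ void saxpy_grid_stride(float a, const float* x, float* y, int n)
{
    const int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        y[i] = a * x[i] + y[i];
    }
}
```

With the grid-stride form, the launch can use a fixed, modest number of blocks, e.g. `hipLaunchKernelGGL(saxpy_grid_stride, dim3(64), dim3(256), 0, 0, a, x, y, n)`, independently of `n`.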
- Barriers
  - Due to their reliance on `O(block_size)` fiber switches, functions that have barrier semantics, such as `__syncthreads()`, induce significant slowdown;
  - It is preferable to avoid using barriers if possible, especially since, unlike on GPUs, `__shared__` memory does not provide performance benefits (see the sketch after this list);
  - If you must use barriers, prefer smaller block sizes such as `8` or `16`.
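
To make the barrier cost concrete, here is a minimal sketch of a 3-point averaging kernel written in the GPU-idiomatic style, with `__shared__` staging and `__syncthreads()`, next to a barrier-free version that simply reads global memory; the kernel names and the 256-element tile are assumptions for illustration, not part of the runtime.

```cpp
#include <hip/hip_runtime.h>

// GPU-style formulation: stages a tile in __shared__ memory and synchronises.
// On the HIP CPU Runtime each __syncthreads() costs O(block_size) fiber
// switches, and the staging buys nothing since everything is already cached.
__global__ void blur3_with_barrier(const float* in, float* out, int n)
{
    __shared__ float tile[256];            // assumes blockDim.x <= 256
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) tile[threadIdx.x] = in[gid];
    __syncthreads();                       // expensive on the CPU runtime

    // Halo handling omitted for brevity: interior elements of the tile only.
    if (threadIdx.x > 0 && threadIdx.x + 1 < blockDim.x && gid + 1 < n) {
        out[gid] = (tile[threadIdx.x - 1] + tile[threadIdx.x] +
                    tile[threadIdx.x + 1]) / 3.0f;
    }
}

// Barrier-free formulation: reads global memory directly, which is both
// simpler and faster when the target is the CPU.
__global__ void blur3_no_barrier(const float* in, float* out, int n)
{
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid > 0 && gid + 1 < n) {
        out[gid] = (in[gid - 1] + in[gid] + in[gid + 1]) / 3.0f;
    }
}
```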
- FP16 i.e. `__half`
  - The HIP CPU Runtime provides correct but low-performance support for FP16 computation, in order to ensure that code which uses `__half` or `__half2` is portable;
  - It is preferable to avoid using `__half` or `__half2` arithmetic (see the sketch after this list).
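
The following minimal sketch contrasts doing the arithmetic directly in `__half` with converting to `float` for the computation and back for storage; it assumes the usual `<hip/hip_fp16.h>` conversion intrinsics and uses hypothetical kernel names.

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

// Portable but slow on the HIP CPU Runtime: every __half operation takes the
// correct-but-unoptimised emulation path.
__global__ void scale_in_half(__half* x, __half a, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = __hmul(a, x[i]);
}

// Preferred: keep compact __half storage if required, but compute in float.
__global__ void scale_via_float(__half* x, float a, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = __float2half(a * __half2float(x[i]));
}
```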
- Passing large arguments by-value to `__global__` functions
  - This is a commonly encountered anti-pattern, stemming from the presence of dedicated memory for arguments on some (generally older) GPUs;
  - It is generally disadvantageous when targeting AMD GPUs as well, thus the guidance below also applies to them and leads to performance-portable code;
  - If `sizeof(T) > 32` for a type `T` that is the type of an argument passed to a function, strongly prefer pass-by-pointer / pass-by-reference (see the sketch after this list).
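
As a sketch of the guideline, consider a hypothetical parameter block whose size exceeds the 32-byte threshold; the struct and kernel names are illustrative only.

```cpp
#include <hip/hip_runtime.h>

struct Params {        // hypothetical parameter block, sizeof(Params) > 32
    float scale[16];
    float offset[16];
};

// Anti-pattern: the entire Params object is copied by value for the launch.
__global__ void apply_by_value(Params p, float* x, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = p.scale[i % 16] * x[i] + p.offset[i % 16];
}

// Preferred: pass a pointer (or reference) to memory holding the parameters.
__global__ void apply_by_pointer(const Params* p, float* x, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = p->scale[i % 16] * x[i] + p->offset[i % 16];
}
```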
- Excessive unrolling via `#pragma unroll`
  - This is a commonly encountered anti-pattern, stemming from historical weaknesses in GPU compiler optimisation, which are no longer present in modern toolchains;
  - It can be extremely harmful due to thrashing the I$ (instruction cache);
  - See mixbench for an example;
  - Strongly prefer deferring to the compiler on matters of unrolling (see the sketch after this list).
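
A minimal sketch of the difference, using a hypothetical per-thread accumulation loop: forcing a large unroll factor bloats the emitted code, whereas the plain loop lets the compiler decide whether and how far to unroll or vectorise.

```cpp
#include <hip/hip_runtime.h>

// Anti-pattern: a forced, large unroll factor inflates code size and can
// thrash the instruction cache (I$).
__global__ void sum_forced_unroll(const float* x, float* out, int n)
{
    float acc = 0.0f;
    #pragma unroll 64
    for (int i = threadIdx.x; i < n; i += blockDim.x) acc += x[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Preferred: write the plain loop and defer unrolling to the compiler.
__global__ void sum_plain(const float* x, float* out, int n)
{
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) acc += x[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```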
- Composition with MPI
  - The underlying implementation of the C++ Standard Library, and more specifically its Parallel Algorithms component, can and will interact in opaque ways with any MPI-driven scheduling;
  - Experiment with the pinning / affinity of MPI tasks if performance is low in such cases.