
HIP CPU Runtime Performance Considerations

Recommended Reading

General Considerations

  1. Prefer larger block sizes when not using barriers
  2. Prefer doing more work per block element (HIP / CUDA thread); a sketch of both recommendations follows this list
  3. TODO
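
The sketch below is not taken from the HIP CPU Runtime documentation; the saxpy kernel, the sizes and the launch configuration are illustrative. It shows both recommendations at once: a grid-stride loop gives each thread several elements to process, and since the kernel contains no barriers a comparatively large block size of 256 is used.

```cpp
#include <hip/hip_runtime.h>

#include <vector>

// Each thread walks the input with a grid-stride loop, so it processes many
// elements rather than one; this amortises per-fiber overhead on the CPU.
__global__ void saxpy(float a, const float* x, float* y, unsigned int n)
{
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

int main()
{
    constexpr auto n = 1u << 20;
    // The HIP CPU Runtime executes kernels on the host, so plain host
    // allocations are directly accessible from the kernel.
    std::vector<float> x(n, 1.0f);
    std::vector<float> y(n, 2.0f);

    // No barriers are used, hence a larger block size (256) is preferred.
    hipLaunchKernelGGL(saxpy, dim3(128), dim3(256), 0, nullptr,
                       2.0f, x.data(), y.data(), n);
    hipDeviceSynchronize();
}
```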

Pitfalls

  1. Barriers
    • Due to their reliance on O(block_size) fiber switches, functions that have barrier semantics, such as __syncthreads(), induce significant slowdown;
    • It is preferable to avoid using barriers if possible, especially since, unlike on GPUs, __shared__ memory does not provide performance benefits;
    • If you must use barriers, prefer smaller block sizes such as 8 or 16; the barrier sketch after this list contrasts a barrier-based and a barrier-free formulation.
  2. FP16 i.e. __half
    • The HIP CPU Runtime provides correct but low-performance support for FP16 computation, in order to ensure that code which uses __half or __half2 is portable;
    • It is preferable to avoid using __half or __half2 arithmetic; the FP16 sketch after this list treats __half as a storage-only format.
  3. Passing large arguments by-value to __global__ functions
    • This is a commonly encountered anti-pattern, stemming from the presence of dedicated memory for arguments on some (generally older) GPUs;
      • It is generally disadvantageous when targeting AMD GPUs as well, thus the guidance below applies to them too and leads to performance-portable code;
    • If sizeof(T) > 32 for a type T that is the type of an argument passed to a function, strongly prefer pass-by-pointer / pass-by-reference; the argument-passing sketch after this list shows both forms.
  4. Excessive unrolling via #pragma unroll
    • This is a commonly encountered anti-pattern, stemming from historical weaknesses in GPU compiler optimisation, which are no longer present in modern toolchains;
    • It can be extremely harmful because it thrashes the instruction cache (I$);
    • Strongly prefer deferring to the compiler on matters of unrolling; the unrolling sketch after this list shows the preferred form.
  5. Composition with MPI
    • The underlying implementation of the C++ Standard Library, and more specifically of its Parallel Algorithms component, can and will interact in opaque ways with any MPI-driven scheduling;
    • Experiment with pinning / affinity of MPI tasks if performance is low in such cases.
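
Relating to pitfall 1, the following is a hedged sketch: the kernels and the reduction itself are illustrative, and it assumes the standard HIP atomicAdd overload for float. The first kernel relies on __shared__ memory and __syncthreads(), so it should only be launched with a small block such as 16; the second removes the barrier altogether, which is the preferable formulation on the CPU back-end.

```cpp
#include <hip/hip_runtime.h>

// Barrier-based block reduction; assumes it is launched with blockDim.x == 16
// (a power of two). Every __syncthreads() call switches through all fibers of
// the block, so keeping the block small limits the damage.
__global__ void reduce_with_barriers(const float* in, float* out, unsigned int n)
{
    __shared__ float tmp[16];

    const unsigned int tid = threadIdx.x;
    float acc = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += blockDim.x * gridDim.x) {
        acc += in[i];
    }
    tmp[tid] = acc;

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        __syncthreads();  // O(block_size) fiber switches on every call
        if (tid < stride) tmp[tid] += tmp[tid + stride];
    }
    if (tid == 0) atomicAdd(out, tmp[0]);
}

// Barrier-free alternative: each thread reduces its own strided slice and the
// partial sums are combined with atomicAdd, so __shared__ memory,
// __syncthreads() and the small-block restriction all disappear.
__global__ void reduce_without_barriers(const float* in, float* out, unsigned int n)
{
    float acc = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        acc += in[i];
    }
    atomicAdd(out, acc);
}
```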
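
Relating to pitfall 2, a sketch of the storage-only approach, assuming the usual hip/hip_fp16.h header and the __half2float / __float2half conversions from HIP's FP16 support (the kernel itself is illustrative): __half remains the in-memory format, but all arithmetic is done in float.

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

// __half is used purely for storage; arithmetic happens in float, avoiding
// the slow emulated FP16 operations of the CPU back-end.
__global__ void scale_fp16(const __half* in, __half* out, float a, unsigned int n)
{
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        const float v = __half2float(in[i]);  // widen on load
        out[i] = __float2half(a * v);         // compute in float, narrow on store
    }
}
```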
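
Relating to pitfall 3, a sketch of the argument-passing guidance (the Params struct and the kernels are hypothetical): the parameter block is 64 bytes, well above the 32-byte threshold, so it is passed by pointer rather than by value.

```cpp
#include <hip/hip_runtime.h>

// 64 bytes of kernel parameters, above the 32-byte threshold given above.
struct Params {
    double coefficients[8];
};

// Anti-pattern (declaration only): Params is copied by value on every launch.
__global__ void apply_by_value(Params p, float* data, unsigned int n);

// Preferred: the parameter block is passed by pointer; only a pointer travels
// with the launch, and the data is read in place.
__global__ void apply_by_pointer(const Params* p, float* data, unsigned int n)
{
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        data[i] *= static_cast<float>(p->coefficients[i % 8]);
    }
}
```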
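
Relating to pitfall 4, a minimal sketch (the kernels are illustrative): the first loop carries no unrolling directive and lets the compiler choose, while the second forces a large unroll factor of the kind that bloats generated code and thrashes the instruction cache.

```cpp
#include <hip/hip_runtime.h>

// Preferred: no unrolling pragma; the compiler picks the unroll factor.
__global__ void sum_rows(const float* m, float* out, unsigned int rows,
                         unsigned int cols)
{
    const unsigned int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;

    float acc = 0.0f;
    for (unsigned int j = 0; j != cols; ++j) acc += m[r * cols + j];
    out[r] = acc;
}

// Anti-pattern: a forced, large unroll factor.
__global__ void sum_rows_unrolled(const float* m, float* out, unsigned int rows,
                                  unsigned int cols)
{
    const unsigned int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;

    float acc = 0.0f;
    #pragma unroll 16
    for (unsigned int j = 0; j != cols; ++j) acc += m[r * cols + j];
    out[r] = acc;
}
```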