The CUDA Coarsening Compiler (CCC) is an automatic implementation of thread coarsening using CUDA and LLVM.
Thread coarsening is loop unrolling applied to GPU kernels: kernel unrolling. GPU kernels execute within implied loops over thread blocks and threads. Thread coarsening unrolls one (or several) of these loops by a coarsening factor, duplicating the kernel body and reducing the iteration space accordingly.
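To make the "implied loops" concrete, here is a minimal CPU-side sketch in C++ that writes the block and thread loops out explicitly and then unrolls the thread loop by a factor of 2. The function names, the doubling kernel body, and the index mapping of the duplicated body are illustrative assumptions, not the code CCC generates.

```cpp
#include <cassert>
#include <vector>

// CPU model of a GPU launch: the loops over blocks and threads that the
// hardware normally provides are written out explicitly.
std::vector<int> run_original(const std::vector<int>& in, int block_dim) {
    const int n = static_cast<int>(in.size());
    assert(block_dim > 0 && n % block_dim == 0);
    std::vector<int> out(in.size());
    const int grid_dim = n / block_dim;
    for (int block = 0; block < grid_dim; ++block) {
        for (int thread = 0; thread < block_dim; ++thread) {
            const int i = block * block_dim + thread;
            out[i] = 2 * in[i];  // kernel body: one element per thread
        }
    }
    return out;
}

// Thread-level coarsening by factor 2 (hypothetical mapping): the kernel
// body is duplicated and the thread loop runs half as many iterations.
std::vector<int> run_coarsened(const std::vector<int>& in, int block_dim) {
    const int factor = 2;
    const int n = static_cast<int>(in.size());
    assert(block_dim > 0 && block_dim % factor == 0 && n % block_dim == 0);
    std::vector<int> out(in.size());
    const int grid_dim = n / block_dim;
    const int coarse_block_dim = block_dim / factor;  // reduced iteration space
    for (int block = 0; block < grid_dim; ++block) {
        for (int thread = 0; thread < coarse_block_dim; ++thread) {
            const int i0 = block * block_dim + thread;  // first copy of the body
            const int i1 = i0 + coarse_block_dim;       // second copy of the body
            out[i0] = 2 * in[i0];
            out[i1] = 2 * in[i1];
        }
    }
    return out;
}
```

Both functions compute the same result; the coarsened version simply does twice the work per "thread" with half as many threads per block.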
The coarsening compilation happens in two stages. First, the device code is coarsened by applying code duplication to GPU kernel bodies, with respect to a set of coarsening parameters. Second, the host code is checked for kernel invocations, and the iteration space is reduced accordingly. Both are implemented as LLVM passes.
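The host-side half of the transformation can be sketched as a small adjustment of the launch dimensions. The snippet below is only a model in plain C++: `LaunchDims`, `Mode`, and `coarsen_launch` are hypothetical names, only the x dimension is shown, and it assumes the factor divides the affected dimension evenly. How CCC treats remainders is not covered here.

```cpp
#include <cassert>

// Hypothetical stand-in for the x components of a kernel launch configuration.
struct LaunchDims {
    unsigned grid_x;   // number of thread blocks
    unsigned block_x;  // threads per block
};

enum class Mode { Thread, Block };

// Shrink the iteration space to match a kernel coarsened by `factor`.
// Block mode reduces the number of blocks; thread mode reduces the
// number of threads per block.
LaunchDims coarsen_launch(LaunchDims dims, Mode mode, unsigned factor) {
    assert(factor > 0);
    if (mode == Mode::Block) {
        assert(dims.grid_x % factor == 0);   // assumption: even division
        dims.grid_x /= factor;
    } else {
        assert(dims.block_x % factor == 0);  // assumption: even division
        dims.block_x /= factor;
    }
    return dims;
}
```

For example, a launch of 128 blocks of 256 threads coarsened by 4 in block mode becomes 32 blocks of 256 threads, while the same factor in thread mode gives 128 blocks of 64 threads.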
The CUDA Coarsening Compiler can operate in static and dynamic mode. Static mode coarsens a given GPU kernel with respect to one specific configuration. Dynamic mode, on the other hand, creates a whole bunch of differently coarsened kernels and links them all into the binary. You almost always want to be using dynamic mode (which is the default), as it allows you to select the most suitable kernel config at runtime, once the problem size is known.
Building can be done by simply invoking cmake followed by make, which should be sufficient when working on CCC itself.
If you want to coarsen programs in your own CMake projects, you can include CCC as a CMake dependency. This section will walk you through this step by step. First, let's build CCC:
cmake /path/to/cuda-coarsening-compiler -DCMAKE_INSTALL_PREFIX=/path/to/install-dir
make install
The CMAKE_INSTALL_PREFIX can be omitted, but is useful for installing into user space.
Next, set up your own CMakeLists.txt as follows:
cmake_minimum_required(VERSION 3.17)
project(MyProject LANGUAGES C CXX CUDA)
find_package(cuda-coarsening-compiler REQUIRED)
coarsening_compile(TARGET my_program SOURCES my_program.cu)
You can add any include directories in the usual way before invoking coarsening_compile(), which will then perform the compilation for you.
The following arguments can be provided to the build system:
- CCC_DEVICE_ARCH expects a CUDA device architecture, for example sm_52.
- CCC_COMPUTE_ARCH expects a CUDA compute architecture, for instance compute_52.
- AOPT expects any optimization flags. Defaults to -O3 and should not be changed.
- CUDA_PATH is used to locate CUDA_PATH/lib64. Defaults to /usr/local/cuda.
- GENERATE_LL_FILES can be set to produce human-readable versions of the bitcode files.
Now it's time to build your project. If you have specified a CMAKE_INSTALL_PREFIX path before, you will need to specify a CMAKE_PREFIX_PATH (note the appended /lib):
cmake /path/to/my-project -DCMAKE_PREFIX_PATH=/path/to/install-dir/lib -DCCC_DEVICE_ARCH=sm_52 -DCCC_COMPUTE_ARCH=compute_52
make
Although dynamic mode is recommended, the coarsening_compile() function also supports static mode compilation. Here is the full definition:
coarsening_compile(TARGET target
SOURCES src1 [src2...]
[STATIC
KERNEL <specific-kernel-name|all>
MODE <thread|block>
DIMENSION <x|y|z>
STRIDE <number>
])
In order to apply coarsening to kernels within this project, uncomment the following line in CMakeLists.txt:
include("${CMAKE_SOURCE_DIR}/cmake/cuda-coarsening-compiler-config.cmake.in")
Adjust your build commands as described above.
The CUDA Coarsening Compiler was tested with LLVM 9.0 and CUDA 10.1.
The coarsening_compile() CMake function requires tools from both LLVM and CUDA, which need to be visible in your $PATH:
- From LLVM: opt, llc, clang and clang++.
- From CUDA: ptxas and fatbinary.
By default, the CUDA Coarsening Compiler operates in dynamic mode. This means that a bunch of differently coarsened kernels are compiled and linked into the target binary. By setting the RPC_CONFIG environment variable, you can then select at runtime which version of the kernel to execute. In the future, this choice may be made automatically.
The RPC_CONFIG environment variable expects the following format: <kernelname>,<dim>,<block/thread>,<factor>,<stride>. For instance, a dynamically coarsened program can be invoked as follows:
RPC_CONFIG=my_kernel,x,block,4,1 ./my-program
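To illustrate the five comma-separated fields, the following C++ sketch splits and validates an RPC_CONFIG value. It is not CCC's actual parser: `CoarseningConfig` and `parse_rpc_config` are hypothetical names, and the validation rules (exactly five fields, factor and stride at least 1) are assumptions made for the example.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of one RPC_CONFIG value.
struct CoarseningConfig {
    std::string kernel;  // <kernelname>
    char dim;            // <dim>: 'x', 'y' or 'z'
    bool block_mode;     // <block/thread>: true for "block"
    int factor;          // <factor>
    int stride;          // <stride>
};

// Returns true and fills `out` if `value` matches
// <kernelname>,<dim>,<block/thread>,<factor>,<stride>.
bool parse_rpc_config(const std::string& value, CoarseningConfig& out) {
    std::vector<std::string> fields;
    std::stringstream stream(value);
    std::string field;
    while (std::getline(stream, field, ',')) {
        fields.push_back(field);
    }
    if (fields.size() != 5 || fields[0].empty()) return false;
    if (fields[1] != "x" && fields[1] != "y" && fields[1] != "z") return false;
    if (fields[2] != "block" && fields[2] != "thread") return false;
    try {
        const int factor = std::stoi(fields[3]);
        const int stride = std::stoi(fields[4]);
        if (factor < 1 || stride < 1) return false;  // assumed lower bounds
        out.kernel = fields[0];
        out.dim = fields[1][0];
        out.block_mode = (fields[2] == "block");
        out.factor = factor;
        out.stride = stride;
        return true;
    } catch (const std::exception&) {
        return false;  // non-numeric or out-of-range factor/stride
    }
}
```

With the invocation above, the value my_kernel,x,block,4,1 would select kernel my_kernel, coarsened along x in block mode with factor 4 and stride 1.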