# 2021-09-30, Release 1

## API Changes

* Added an option to mark arguments in NFI kernel signatures as `const`
  * The effect is the same as marking them as `in` in the NIDL syntax
  * The corresponding arguments in the CUDA kernel do not strictly need to be marked `const`, although that is recommended
  * Marking arguments as `const` or `in` lets the async scheduler overlap kernels that use the same read-only arguments, as in the sketch below
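
A minimal sketch of the two signature styles, written in GraalPython. It assumes the `CU` root namespace and the `buildkernel` built-in described in the GrCUDA documentation; the exact signature grammar may differ slightly from what is shown here.

```python
# Minimal sketch (GraalPython); the `CU` namespace and the `buildkernel` built-in
# follow the GrCUDA docs, but the exact signature grammar may differ slightly.
import polyglot

cu = polyglot.eval(language="grcuda", string="CU")

code = """
__global__ void axpy(float alpha, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}
"""

# NFI-style signature: `x` is marked `const`, so the scheduler treats it as read-only
axpy_nfi = cu.buildkernel(code, "axpy", "float, const pointer, pointer, sint32")

# Equivalent NIDL-style signature, marking `x` as `in`
axpy_nidl = cu.buildkernel(
    code, "axpy(alpha: float, x: in pointer float, y: inout pointer float, n: sint32)")
```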

## New asynchronous scheduler

* Added a new asynchronous scheduler for GrCUDA; enable it with `--experimental-options --grcuda.ExecutionPolicy=async` (see the sketch below)
  * With this scheduler, GPU kernels are executed asynchronously: once they are launched, host execution resumes immediately
  * The computation is synchronized (i.e. the host thread stalls and waits for the kernel to finish) only when GPU data are accessed by the host thread
  * Execution of multiple kernels operating on different data (e.g. distinct DeviceArrays) is overlapped using different streams
  * Data transfer and execution on different data (e.g. distinct DeviceArrays) are also overlapped using different streams
  * The scheduler supports several options; see `README.md` for the full list
  * It is the scheduler presented in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
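
A short usage sketch of the async scheduler's behavior, in GraalPython. The kernel, array names, and sizes are illustrative; the `CU` namespace, `DeviceArray`, `buildkernel`, and the two-step kernel launch follow the GrCUDA documentation.

```python
# Illustrative script; launch it e.g. with:
#   graalpython --jvm --polyglot --experimental-options --grcuda.ExecutionPolicy=async example.py
import polyglot

cu = polyglot.eval(language="grcuda", string="CU")

n = 1000
x = cu.DeviceArray("float", n)
y = cu.DeviceArray("float", n)
for i in range(n):
    x[i] = 1.0
    y[i] = 2.0

square = cu.buildkernel("""
__global__ void square(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * a[i];
}""", "square", "pointer, sint32")

# The two launches touch distinct DeviceArrays, so the async scheduler runs them
# on different streams and the host thread is not blocked here
square(32, 32)(x, n)
square(32, 32)(y, n)

# Accessing GPU data from the host synchronizes with the kernels that produced it
print(x[0], y[0])  # 1.0 4.0
```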

* Enabled partial support for cuBLAS and cuML in the async scheduler
  * **Known limitation:** functions in these libraries work with the async scheduler, although they still run on the default stream (i.e. they are not asynchronous)
  * They do benefit from prefetching
* Set TensorRT support to experimental
  * TensorRT is currently not supported on CUDA 11.4, making it impossible to use alongside a recent version of cuML
  * **Known limitation:** due to this incompatibility, TensorRT is currently not available with the async scheduler

## New features

* Added a generic AbstractArray data structure, which is extended by DeviceArray, MultiDimDeviceArray and MultiDimDeviceArrayView, and provides high-level array interfaces
* Added an API for prefetching
  * If enabled (on GPUs with Pascal architecture or newer), it prefetches data to the GPU before executing a kernel, instead of relying on page faults for data transfer. It can greatly improve performance
* Added an API for stream attachment
  * Always enabled on GPUs with architectures older than Pascal when the async scheduler is active; with the sync scheduler, it can be enabled manually
  * It restricts the visibility of GPU data to the specified stream
  * On Pascal or newer architectures it can provide a small performance benefit
* Added `copyTo`/`copyFrom` functions on generic arrays (Truffle interoperable objects that expose the array API); see the sketch after this list
  * Internally, the copy is implemented as a for loop, instead of using CUDA's `memcpy`
  * In many cases it is still faster than copying with loops in the host language, especially if the host code is not JIT-compiled
  * It is also used for copying data to/from DeviceArrays with column-major layout, as `memcpy` cannot copy non-contiguous data
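
A sketch of the `copyTo`/`copyFrom` API on a DeviceArray, in GraalPython. The element-count argument and the option names in the launch comment (`grcuda.InputPrefetch`, `grcuda.ForceStreamAttach`) are assumptions made for illustration; `README.md` lists the actual options for prefetching and stream attachment.

```python
# Illustrative launch with prefetching and stream attachment enabled
# (the option names are assumptions; check README.md for the real ones):
#   graalpython --jvm --polyglot --experimental-options \
#       --grcuda.ExecutionPolicy=async --grcuda.InputPrefetch=true \
#       --grcuda.ForceStreamAttach=true copy_example.py
import polyglot

cu = polyglot.eval(language="grcuda", string="CU")

n = 1000
host_data = [float(i) for i in range(n)]
device_array = cu.DeviceArray("float", n)

# Bulk copy from a host array (any Truffle-interoperable array) into the DeviceArray;
# internally a loop, but typically faster than an element-wise loop in the host language
device_array.copyFrom(host_data, n)

# ... launch kernels that read or write device_array here ...

# Bulk copy back into a host array
result = [0.0] * n
device_array.copyTo(result, n)
print(result[10])  # 10.0
```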

## Demos, benchmarks and code samples

* Added the demo used at SeptembeRSE 2021 (`demos/image_pipeline_local` and `demos/image_pipeline_web`)
  * It shows an image processing pipeline that applies a retro look to images. We have a local version and a web version that displays results in a web page
* Added the benchmark suite written in GraalPython, used in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
  * It is a collection of complex multi-kernel benchmarks meant to show the benefits of asynchronous scheduling

## Miscellaneous

* Added a dependency on the `grcuda-data` submodule, used to store data, results, and plots used in publications and demos
* Updated the name "grCUDA" to "GrCUDA". It looks better, doesn't it?
* Added support for Java 11 along with Java 8
* Added options to specify the location of cuBLAS and cuML with the environment variables `LIBCUBLAS_DIR` and `LIBCUML_DIR`
* Refactored the package hierarchy to reflect changes to current GrCUDA (e.g. `gpu -> runtime`)
* Added basic support for TruffleLogger
* Removed a number of existing deprecation warnings
* Added around 800 unit tests, with support for extensive parametrized testing and GPU mocking
* Updated documentation
  * Bumped the GraalVM version to 21.2
  * Added scripts to set up a new machine from scratch (e.g. on OCI), plus other OCI-specific utility scripts (see `oci_setup/`)
  * Added documentation on setting up IntelliJ IDEA for GrCUDA development
  * Added documentation about the Python benchmark suite
  * Added documentation on asynchronous scheduler options