AWS OFI NCCL v1.8.0
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL v2.19.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).
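The version requirements above can be checked before deployment. The following is a minimal sketch, not part of the plugin: the names `parse_version`, `is_supported`, `MIN_LIBFABRIC`, and `MIN_NCCL` are hypothetical, and the version strings are assumed to be supplied by the caller rather than queried from the installed libraries.

```python
import re

# Minimum versions stated in these release notes.
MIN_LIBFABRIC = (1, 18, 0)
MIN_NCCL = (2, 4, 8)


def parse_version(text):
    """Turn 'v2.19.4-1' or '1.18.0' into a tuple of ints.

    A leading 'v' and any trailing '-N' packaging suffix are ignored.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_supported(libfabric_version, nccl_version):
    """Return True if both versions meet the minimums for this release."""
    return (
        parse_version(libfabric_version) >= MIN_LIBFABRIC
        and parse_version(nccl_version) >= MIN_NCCL
    )
```

For example, `is_supported("v1.18.0", "v2.19.4-1")` evaluates to `True`, while a Libfabric version of `1.17.2` would be rejected.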
New Features:
- A tuner component for the plugin that picks the optimal NCCL algorithm and protocol at a given scale and message size.
- Improved communicator and memory region identifier management.
- Migrated from the CUDA Runtime API to functionally equivalent calls in the CUDA Driver API in preparation for dma-buf support for memory registration. With this change, the plugin uses the same mechanism as NCCL to interact with the CUDA subsystem.
- No longer forcing a flush operation for network operations when running with H100 GPUs, even when running with older NCCL versions (< v2.19.1).
- Improvements to internal device-agnostic APIs.
- Support for the v7 ext-net plugin interface introduced in NCCL v2.19.3.
- Support for the Ubuntu 22.04 LTS distribution.
Bug Fixes:
- Cap the NVLS tree chunk size at 512KiB to recover from a performance regression introduced in NCCL v2.19.4, using a parameter introduced in NCCL v2.20.3.
- Prevent possible invocation of CUDA calls in Libfabric by requiring Libfabric v1.18.0 or newer.
- Fix debug prints that reported incorrect device IDs during initialization.
- Fixes to the MAX_COMM computation.
- Better handling of NVLS enablement when NCCL is statically linked to applications.
- Fixes to internal API return codes.
- Configuration system fixes for Neuron builds.
- Fixes to make plugin environment parsing case insensitive.
- Miscellaneous fixes that address memory leaks, NULL dereferences, and compiler warnings.
- Updates and improvements to the project documentation.
Testing:
This release has been tested extensively with NCCL v2.19.4-1 for functionality and performance. It has also been lightly tested with NCCL v2.20.3-1, which was released earlier this week. Testing covered Libfabric versions up to v1.19.0.
Checksum (sha512) for the release tarball:
7bad7995e99649dc3ae4c46b2b0011225134703050ae83ab837cd46a7ff979079809cbd117e50cf5169428dd397ab099fea6249d12f891bff94b2d5579b0c0d9 aws-ofi-nccl-1.8.0-aws.tar.gz
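To check a downloaded tarball against the checksum above, something like the following sketch could be used. It relies only on the Python standard library; the helper names `sha512_of_file` and `tarball_matches` are hypothetical, and the tarball is assumed to be in the current directory under the file name listed above.

```python
import hashlib
import hmac

# SHA-512 digest published in these release notes.
EXPECTED_SHA512 = "7bad7995e99649dc3ae4c46b2b0011225134703050ae83ab837cd46a7ff979079809cbd117e50cf5169428dd397ab099fea6249d12f891bff94b2d5579b0c0d9"


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Compute the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tarball_matches(path, expected=EXPECTED_SHA512):
    """Return True if the file's SHA-512 digest equals the expected value."""
    return hmac.compare_digest(sha512_of_file(path), expected.strip().lower())
```

Calling `tarball_matches("aws-ofi-nccl-1.8.0-aws.tar.gz")` returns `True` only when the file's digest is identical to the published one; any mismatch suggests a corrupted or altered download.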