AWS OFI NCCL v1.9.0
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later. It supports NCCL 2.21.5-1 and remains backward compatible with older NCCL versions (NCCL v2.4.8 and later).
New Features:
- Support the v8 plugin interface introduced with NCCL 2.20. This enables the use of the user memory registration feature recently introduced in NCCL.
- Update the tuner component to support the v2 ext-tuner interface introduced with NCCL 2.21.
- Relax ordering constraints for control messages to reduce head-of-line blocking under congestion.
Bug Fixes:
- Increase the maximum number of communicators to 256K (from 4K), supporting larger all-to-all groups.
- Improve logging in some corner-case error conditions.
The plugin has been tested with the following Libfabric providers, using the tests bundled in the source code and the nccl-tests suite:
- efa
Checksum (sha512) for the release tarball:
7c86650f2f275b97bd08ff66b24ae8fef593269c068ec543259903d0eec80a0fe4153a3f171700e7e3dcb3b809a1d6aba82d5e7dc52ec138eacd7353629d1bc0 aws-ofi-nccl-1.9.0-aws.tar.gz
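
The checksum above can be checked against a downloaded tarball before building. The following is a minimal sketch in Python using only the standard library `hashlib` module. The helper names `sha512_of_file` and `verify_tarball` are hypothetical and chosen for illustration, and the sketch assumes the tarball has already been downloaded to a local path.

```python
import hashlib

# Expected SHA-512 digest, copied from the release notes above.
EXPECTED_SHA512 = (
    "7c86650f2f275b97bd08ff66b24ae8fef593269c068ec543259903d0eec80a0f"
    "e4153a3f171700e7e3dcb3b809a1d6aba82d5e7dc52ec138eacd7353629d1bc0"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`.

    The file is read in chunks so a large tarball is never held
    fully in memory.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected=EXPECTED_SHA512):
    """Return True if the file's SHA-512 digest matches `expected`."""
    return sha512_of_file(path) == expected.strip().lower()
```

Calling `verify_tarball("aws-ofi-nccl-1.9.0-aws.tar.gz")` from the download directory should return `True` for an intact tarball. A `False` result suggests a corrupted or modified download.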