Releases: aws/aws-ofi-nccl
AWS OFI NCCL v1.7.4
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:
New Features:
- Hard fail if GPUDirect RDMA initialization fails on an EC2 instance that should support GPUDirect RDMA (such as P4d.24xlarge or P5.48xlarge), rather than falling back to host copy buffers at significantly reduced performance. Setting the environment variable OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=1 will disable this behavior.
- Change the threshold at which the RDMA transport switches from round robin to striping from 8 KiB to 256 KiB, improving the efficiency of large message transfers.
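The round robin versus striping threshold can be illustrated with a small toy model. This is only a sketch and not the plugin's actual scheduler: the class name RailScheduler, the inclusive comparison against the threshold, and the even split across rails are all assumptions made for illustration.

```python
STRIPING_THRESHOLD = 256 * 1024  # 256 KiB, the new threshold named in the release note


class RailScheduler:
    """Toy model: small messages rotate across rails, large ones are split."""

    def __init__(self, num_rails, threshold=STRIPING_THRESHOLD):
        self.num_rails = num_rails
        self.threshold = threshold
        self._next_rail = 0  # rail that the next small message will use

    def schedule(self, size):
        """Return a list of (rail, offset, length) tuples for one message."""
        if size <= self.threshold:  # assumption: the threshold is inclusive
            # Round robin: the whole message goes to a single rail.
            rail = self._next_rail
            self._next_rail = (self._next_rail + 1) % self.num_rails
            return [(rail, 0, size)]
        # Striping: split the message as evenly as possible across all rails.
        base, extra = divmod(size, self.num_rails)
        chunks, offset = [], 0
        for rail in range(self.num_rails):
            length = base + (1 if rail < extra else 0)
            chunks.append((rail, offset, length))
            offset += length
        return chunks
```

With four rails, two consecutive 8 KiB messages land whole on rails 0 and 1, while a 1 MiB message is split into four 256 KiB stripes, one per rail. Raising the threshold keeps mid-sized messages on a single rail and reserves the per-stripe overhead for messages large enough to benefit.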
Bug Fixes:
- Fixed debugging output in some initialization failure cases.
- Request FI_LOCAL_COMM feature from Libfabric, as flush and eager copies are both implemented via local communication.
- Fix initialization when using the Libfabric TCP provider.
- Improve documentation on using the plugin with AWS's Elastic Fabric Adapter (EFA).
- Improve handling of Neuron device detection when the plugin is used with Trainium instances.
- Fix segfault in error case of freelist memory growth.
- The test programs that only support 2 ranks now fail with a useful error message if run with another number of ranks.
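The rank-count check described in the last fix can be sketched as follows. The bundled test programs are not written in Python, and the function name and message wording here are hypothetical; the point is only that the check runs up front and names both the expected and the actual rank count.

```python
def check_rank_count(world_size, required=2):
    """Return None if the rank count is acceptable, else a descriptive message."""
    if world_size == required:
        return None
    return (
        f"Expected exactly {required} ranks, but got {world_size}. "
        f"Re-run the test with {required} ranks."
    )
```

A launcher would call this before any communication is set up and exit with the returned message if it is not None.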
This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.
AWS OFI NCCL v1.7.3
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:
- Do not disable LL and LL128 protocols on P5 instances.
- Add support for g5.48xlarge instance types.
- Fix a block in use leak in the freelist implementation.
- For NCCL 2.18.5 or later, don't disable NVLS support.
- Fix bug in handling retry error issues from Libfabric in the RDMA transport (P5 instance types).
This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.
AWS OFI NCCL v1.7.2
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:
- Fix compilation against CUDA versions prior to 11.3.
- Fix allocation of free lists to avoid accidentally registering user data, which can cause corruption on fork() with older Linux kernels.
- Fix memory leak with registered bounce buffers.
- Fix improper usage of optlen in call to fi_getopt().
- Numerous memory cleanup fixes.
This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.
AWS OFI NCCL v1.7.1
This release is part of enabling AWS's P5 platform. It is not recommended for other platforms at this time; we will release a general 1.7.x series in the near future.
This release removes the direct dependency on libcudart.so and dynamically loads the shared library at runtime, similar to the behaviors of NCCL and Libfabric.
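Loading a shared library at runtime instead of linking against it can be illustrated with a rough Python analogue. The plugin itself is C code, and this note does not describe how it performs the load; the function name load_cuda_runtime and the candidate list are assumptions for illustration.

```python
import ctypes


def load_cuda_runtime(candidates=("libcudart.so",)):
    """Try each library name in turn; return the loaded handle, or None."""
    for name in candidates:
        try:
            # ctypes.CDLL asks the dynamic loader to open the library now,
            # so a missing library is detected here rather than at link time.
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None
```

A caller can then test the result (for example, `if load_cuda_runtime() is None:`) and report a clear error or take another path, rather than having the whole program fail to start because a link-time dependency is missing.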
This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.
AWS OFI NCCL v1.7.0
This release is part of enabling AWS's P5 instance type. It has no useful features for other platforms.
This release requires Libfabric v1.11.0 or later and supports NCCL v2.17.1-1 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.17.1.
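A deployment script might check these version requirements before loading the plugin. The following is a minimal sketch, not part of the plugin: the helper names are hypothetical, v2.4.8 is treated as the oldest supported NCCL release, and the packaging suffix (such as the "-1" in v2.17.1-1) is simply dropped before comparing.

```python
def parse_version(text):
    """Turn 'v1.11.0' or '2.17.1-1' into a tuple of ints for comparison."""
    core = text.lstrip("v").split("-")[0]  # drop leading 'v' and any '-N' suffix
    return tuple(int(part) for part in core.split("."))


MIN_LIBFABRIC = parse_version("v1.11.0")  # minimum stated in the release note
MIN_NCCL = parse_version("v2.4.8")  # assumed oldest supported NCCL release


def meets_requirements(libfabric_version, nccl_version):
    """Return True if both versions are at or above the stated minimums."""
    return (
        parse_version(libfabric_version) >= MIN_LIBFABRIC
        and parse_version(nccl_version) >= MIN_NCCL
    )
```

Comparing integer tuples rather than strings matters here: as strings, "1.9.0" sorts after "1.11.0", which would wrongly accept a Libfabric release that is too old.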
The plugin has been tested with the following Libfabric providers using the unit tests bundled in the source code and the nccl-tests test suite:
- efa
- tcp
AWS OFI NCCL v1.7.0rc1-aws
Pre-release of the next 1.7.0 release series, which will (initially) target only the AWS EFA platform.
AWS OFI NCCL v1.6.0
This release requires Libfabric v1.11.0 or later and supports NCCL v2.17.1-1 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.17.1.
The plugin has been tested with the following Libfabric providers using the unit tests bundled in the source code and the nccl-tests test suite:
- efa
- tcp;ofi_rxm
[aws] AWS OFI NCCL v1.5.0
This release requires Libfabric v1.11.0 or later and supports NCCL v2.16.2 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.16.1.
The plugin has been tested with the following Libfabric providers using the unit tests bundled in the source code:
- efa
[aws] AWS OFI NCCL v1.4.0
This release requires Libfabric v1.11.0 or later and supports NCCL v2.12.12 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.15.1.
The plugin has been tested with the following Libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
AWS OFI NCCL v1.4.0
This release requires Libfabric v1.11.0 or later and supports NCCL v2.12.12 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.15.1.
The plugin has been tested with the following Libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa