
Releases: aws/aws-ofi-nccl

AWS OFI NCCL v1.7.4

04 Dec 20:44
v1.7.4-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:

New Features:

  • Hard fail if GPUDirect RDMA initialization fails on an EC2 instance that should support GPUDirect RDMA (such as P4d.24xlarge or P5.48xlarge), rather than falling back to host copy buffers at significantly reduced performance. Setting the environment variable OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=1 disables this behavior.
  • Change the threshold at which the rdma transport switches from round robin to striping from 8 KiB to 256 KiB, improving the efficiency of large message transfers.
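As an illustration of the OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK setting described above, the following minimal Python sketch builds an environment for a launched job with the check disabled. Only the variable name and the value 1 come from these release notes; the helper name `env_without_gdr_check` is hypothetical, and how the job is actually launched is left to the reader.

```python
import os


def env_without_gdr_check(base_env=None):
    """Return a copy of an environment with the GDR-required check disabled.

    Hypothetical helper: only the variable name and the value "1" come from
    the release notes. The caller's environment is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Opt out of the hard failure; the plugin may then fall back to
    # host copy buffers at significantly reduced performance.
    env["OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK"] = "1"
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the training job. Disabling the check trades the early failure for the slower host-copy path, so it is probably best reserved for debugging.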

Bug Fixes:

  • Fix debugging output in some initialization failure cases.
  • Request FI_LOCAL_COMM feature from Libfabric, as flush and eager copies are both implemented via local communication.
  • Fix initialization when using the Libfabric TCP provider.
  • Improve documentation on using the plugin with AWS's Elastic Fabric Adapter (EFA).
  • Improve handling of Neuron device detection when the plugin is used with Trainium instances.
  • Fix segfault in error case of freelist memory growth.
  • The test programs that only support 2 ranks now fail with a useful error message if run with another number of ranks.

This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.

AWS OFI NCCL v1.7.3

05 Oct 19:26
v1.7.3-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:

  • Do not disable LL and LL128 protocols on P5 instances.
  • Add support for g5.48xlarge instance types.
  • Fix a block-in-use leak in the freelist implementation.
  • For NCCL 2.18.5 or later, don't disable NVLS support.
  • Fix bug in handling retry error issues from Libfabric in the RDMA transport (P5 instance types).

This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.

AWS OFI NCCL v1.7.2

25 Aug 17:58
v1.7.2-aws
a463b88

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:

  • Fix compilation against CUDA versions prior to 11.3.
  • Fix allocation of free lists to avoid accidentally registering user data, which can cause corruption on fork() with older Linux kernels.
  • Fix memory leak with registered bounce buffers.
  • Fix improper usage of optlen in call to fi_getopt().
  • Numerous memory cleanup fixes.

This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.

AWS OFI NCCL v1.7.1

29 Jul 00:31
v1.7.1-aws
8a79a34

This release is part of enabling AWS's P5 platform. It is not recommended for other platforms at this time; we will release a general 1.7.x series in the near future.

This release removes the direct dependency on libcudart.so and dynamically loads the shared library at runtime, similar to the behaviors of NCCL and Libfabric.
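The idea of resolving libcudart.so at runtime instead of linking against it can be sketched in Python with `ctypes`. This is only an analogy: the notes above do not describe how the plugin performs the load, and the helper names `load_cudart` and `cuda_runtime_version` are hypothetical. The bare library name `libcudart.so` is also an assumption; some systems install only a versioned file name.

```python
import ctypes


def load_cudart(name="libcudart.so"):
    """Try to load the CUDA runtime at call time; return None if unavailable."""
    try:
        return ctypes.CDLL(name)
    except OSError:
        # The library is missing or cannot be loaded; the caller decides
        # whether that is fatal.
        return None


def cuda_runtime_version(lib):
    """Query cudaRuntimeGetVersion through a handle from load_cudart()."""
    if lib is None:
        return None
    version = ctypes.c_int(0)
    status = lib.cudaRuntimeGetVersion(ctypes.byref(version))
    # A status of 0 is cudaSuccess.
    return version.value if status == 0 else None
```

Because the lookup happens at call time, a program built this way can start on a machine without the CUDA runtime and report the missing library itself rather than failing in the dynamic linker.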

This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.

AWS OFI NCCL v1.7.0

25 Jul 22:27
v1.7.0-aws
69f2292

This release is part of enabling AWS's P5 instance type. It has no useful features for other platforms.

This release requires Libfabric v1.11.0 or later and supports NCCL v2.17.1-1 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.17.1.
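A deployment script might check the Libfabric requirement before loading the plugin. The sketch below is one way to do that in Python, assuming the installed version is already available as a string such as "1.17.1"; the function names are hypothetical, and only the v1.11.0 minimum comes from these notes. Versions are compared as integer tuples because plain string comparison would rank "1.9.0" above "1.11.0".

```python
import re

# Minimum Libfabric version stated in the release notes.
MIN_LIBFABRIC = (1, 11, 0)


def parse_version(text):
    """Turn 'v1.17.1' or '1.11' into a tuple of ints such as (1, 17, 1)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def libfabric_is_supported(version_text):
    """Return True if the version meets the v1.11.0 minimum."""
    return parse_version(version_text) >= MIN_LIBFABRIC
```

For example, `libfabric_is_supported("v1.17.1")` is true, while `libfabric_is_supported("1.9.0")` is false.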

The plugin has been tested with the following Libfabric providers using unit tests bundled in the source code and the nccl-tests test suite:

  • efa
  • tcp

AWS OFI NCCL v1.7.0rc1-aws

21 Jul 22:54
v1.7.0rc1-aws
d520367
Pre-release

Pre-release of the next 1.7.0 release series, which will (initially) target only the AWS EFA platform.

AWS OFI NCCL v1.6.0

22 Apr 01:29

This release requires Libfabric v1.11.0 or later and supports NCCL v2.17.1-1 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.17.1.

The plugin has been tested with the following Libfabric providers using unit tests bundled in the source code and the nccl-tests test suite:

  • efa
  • tcp;ofi_rxm

[aws] AWS OFI NCCL v1.5.0

26 Jan 21:59

This release requires Libfabric v1.11.0 or later and supports NCCL v2.16.2 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.16.1.

The plugin has been tested with the following Libfabric providers using unit tests bundled in the source code:

  • efa

[aws] AWS OFI NCCL v1.4.0

14 Jul 02:28

This release requires Libfabric v1.11.0 or later and supports NCCL v2.12.12 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.15.1.

The plugin has been tested with the following Libfabric providers using unit tests
bundled in the source code:

  • tcp;ofi_rxm
  • sockets
  • efa

AWS OFI NCCL v1.4.0

14 Jul 02:27

This release requires Libfabric v1.11.0 or later and supports NCCL v2.12.12 while maintaining backward compatibility with older NCCL versions (back to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.15.1.

The plugin has been tested with the following Libfabric providers using unit tests
bundled in the source code:

  • tcp;ofi_rxm
  • sockets
  • efa