Releases: aws/aws-ofi-nccl

AWS OFI NCCL v1.11.0

19 Aug 20:28
v1.11.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.22.3-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
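The version requirements above can be checked before deployment. The following is a minimal sketch, not part of the plugin: `parse_version` and `meets_minimum` are hypothetical helper names, and the version strings are assumed to follow the `vMAJOR.MINOR.PATCH[-N]` shape used in these notes.

```python
import re


def parse_version(text, width=3):
    """Turn a string such as 'v2.22.3-1' into a comparable tuple (2, 22, 3).

    Hypothetical helper: only the leading dotted numeric part is used;
    any suffix such as '-1' is ignored.
    """
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    parts = [int(part) for part in match.group(1).split(".")]
    # Pad short versions so that '1.18' compares equal to '1.18.0'.
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Return True if version `found` is at least version `minimum`."""
    return parse_version(found) >= parse_version(minimum)
```

For example, `meets_minimum("1.19.0", "v1.18.0")` is true for the Libfabric requirement, while `meets_minimum("2.4.8", "v2.17.1")` is false for the NCCL one. How the installed versions are discovered is left to the caller.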

New Features:

  • Autogenerate topology file on P5 by default, with detected topology, instead of using a static file
  • Support for AWS P5e instance type

Bug fixes:

  • Fixed segfault for platform-aws builds for instance types not explicitly configured
  • Fixed failure in mr cache in SENDRECV protocol for providers that don't require memory registration
  • Re-enabled WRITE_IN_ORDER_ALIGNED_128_BYTES setting and check on P5.
  • Added a check that raises an error when the old blocking connect_v4/accept_v4 interfaces are used with the RDMA protocol. The previous release changed connection establishment such that these interfaces cause a deadlock.

The plugin has been tested with the following Libfabric providers using tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

17063f1e10a885fe6cd48e275c9a0d5748b73d04d6514103a5e9a0f28dff604c1766f8a85a55e89ad5691830c54199936d88442d28c65180c2f79be939f0b208  aws-ofi-nccl-1.11.0-aws.tar.gz
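A downloaded tarball can be compared against the sha512 checksum published for its release. This is a minimal sketch using Python's standard `hashlib`; the function names and the local file path are assumptions for illustration, not part of the project.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the sha512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected_hex):
    """Return True if the file's sha512 matches the published checksum."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

Assuming the tarball was saved as `aws-ofi-nccl-1.11.0-aws.tar.gz` in the working directory, `verify_tarball("aws-ofi-nccl-1.11.0-aws.tar.gz", "<checksum from the release notes>")` should return `True` for an intact download.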

AWS OFI NCCL v1.10.0

06 Aug 21:38
v1.10.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

New Features:

  • Replaced the model-based tuner with one based on regions derived from experimental evaluations.
  • Changed properties reported to NCCL to signal that registered MRs are global, in order to support user buffer registrations.
  • Added the option to use different endpoints for receive communicators connected to the same source endpoint, while using a shared completion queue.
  • Updated plugin to use the zero-copy path in the EFA provider for fi_send/fi_recv operations.
  • Shrank the control message to 32 bytes to fit in inline data for EFA.

Bug Fixes:

  • Disabled Libfabric shared memory when possible.
  • Disabled RDMA eager messages on Neuron by default for better performance.
  • Ensured plugin's multi-rail protocol consistently sorts rails in order of VF index for better performance.

The plugin has been tested with the following Libfabric providers using tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

fa296339a7e40fa420e2934c3a44f9a18ad3a9d798b7f129b35f46892f76532b70996fe36f309e3dedd2823ed9a819a4578f7c8241d8549805c49811b38ae14f  aws-ofi-nccl-1.10.0-aws.tar.gz

AWS OFI NCCL v1.9.2

17 Jun 23:33
v1.9.2-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This is a bugfix release which requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

Bug Fixes:

  • Improved tuner model to make better decisions on P5 instances.
  • Added support in the RDMA protocol for truncation when the size in the isend call is greater than the size in the corresponding irecv.
  • Fixed bug that prevented the tuner from getting loaded with NCCL 2.19 and 2.20.
  • Fixed the logging statement that reports whether a domain is created per thread or per process.
  • Updated plugin to not advertise global MR support, to avoid a performance regression in user-registered buffers.

The plugin has been tested with the following Libfabric providers using tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

1e344f38baa1080c04d2c99a1390f51e2a9ce2a57d69c7494061bf4e5da5a4310328bafc323cb36f43b5fcd0d330bd1bd5eec257596de2125aa5c38096b78a01  aws-ofi-nccl-1.9.2-aws.tar.gz

AWS OFI NCCL v1.9.1

15 Apr 21:45
v1.9.1-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This is a bugfix release which requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

Bug Fixes:

  • Fix release distribution generation to include missing headers introduced in v1.9.0. This fixes issue #382.
  • Restrict libcuda link-time dependency to builds with testing enabled
  • Build fixes to explicitly link against libm and libpthread used by the plugin

The plugin has been tested with the following Libfabric providers using tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

77e44dcdb77e6b25cae882d2124b6d9a2a66f2b85321ae827ec7e3fd88bacd214a537a2490a578af44b7457cc655b2e382fc148b6ed8594a68a30d145f3ce70e  aws-ofi-nccl-1.9.1-aws.tar.gz

AWS OFI NCCL v1.9.0

05 Apr 22:07

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

New Features:

  • Support v8 plugin interface introduced with NCCL 2.20. This enables the use of the user memory registration feature recently introduced in NCCL.
  • Update the tuner component to support v2 ext-tuner interface introduced with NCCL 2.21.
  • Reduce ordering constraints for control messages, to reduce head of line blocking under congestion.

Bug Fixes:

  • Increase the number of communicators to 256K (from 4K), supporting larger all-to-all groups.
  • Improve logging in some corner case error conditions.

The plugin has been tested with the following Libfabric providers using tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

7c86650f2f275b97bd08ff66b24ae8fef593269c068ec543259903d0eec80a0fe4153a3f171700e7e3dcb3b809a1d6aba82d5e7dc52ec138eacd7353629d1bc0  aws-ofi-nccl-1.9.0-aws.tar.gz

AWS OFI NCCL v1.8.1

25 Feb 21:40
v1.8.1-aws

This is a bugfix release that requires Libfabric v1.18.0 or later and supports NCCL v2.19.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

Bug Fixes:

  • Fix an issue with the ID pool's reference counting and allocation
  • Improved error propagation for failed NCCL requests, allowing applications to fail early instead of blocking on requests that can never be completed.

The plugin has been tested with the following Libfabric providers using tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

4ee21380176d5a76e4af0233ac44d1d46f92fd34941ecfaa104b7567a16cc84503c0abe59e540d36d79675bb3cc443979ed319f39582e301814d0653ea184508  aws-ofi-nccl-1.8.1-aws.tar.gz

AWS OFI NCCL v1.8.0

19 Feb 17:48
v1.8.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL v2.19.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

New Features:

  • A tuner component for the plugin that picks the optimal NCCL algorithm and protocol at a given scale and message size.
  • Improved communicator and memory region identifier management.
  • Migrated from CUDA Runtime API to functional equivalents in CUDA Driver API in preparation for dma-buf support for memory registration. With this change, the plugin uses the same mechanism as NCCL to interact with the CUDA subsystem.
  • No longer forcing a flush operation for network operations when running with H100 GPUs, even when running with older NCCL versions (< v2.19.1).
  • Improvements to internal device-agnostic APIs.
  • Support for NCCL v7 ext-net plugin interface introduced in NCCL v2.19.3.
  • Support for Ubuntu 22.04 LTS distribution.

Bug Fixes:

  • Set the maximum NVLS tree chunk size used to 512KiB to recover from a performance regression introduced in NCCL v2.19.4, using a parameter introduced in NCCL v2.20.3.
  • Prevent possible invocation of CUDA calls in libfabric by requiring a libfabric version of v1.18.0 or newer.
  • Fix debug prints that reported incorrect device IDs during initialization
  • Fixes to MAX_COMM computation.
  • Better handling of NVLS enablement when NCCL is statically linked to applications
  • Fixes to internal API return codes
  • Configuration system fixes for Neuron builds
  • Fixes to plugin environment parsing to be case insensitive
  • Miscellaneous fixes that address memory leaks, NULL dereferences, and compiler warnings.
  • Updates and improvements to the project documentation.

Testing:

This release has been tested extensively with NCCL v2.19.4-1 for functionality and performance. It has also been lightly tested with NCCL v2.20.3-1, which was released earlier this week. Testing covered Libfabric versions up to v1.19.0.

Checksum (sha512) for the release tarball:

7bad7995e99649dc3ae4c46b2b0011225134703050ae83ab837cd46a7ff979079809cbd117e50cf5169428dd397ab099fea6249d12f891bff94b2d5579b0c0d9  aws-ofi-nccl-1.8.0-aws.tar.gz

AWS OFI NCCL v1.7.4

04 Dec 20:44
v1.7.4-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:

New Features:

  • Hard fail if GPUDirect RDMA initialization fails on an EC2 instance that should support GPUDirect RDMA (such as P4d.24xlarge or P5.48xlarge), rather than fall back to host copy buffers at significantly reduced performance. Setting the environment variable OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=1 will disable this behavior.
  • Change the threshold at which the rdma transport switches from round robin to striping from 8 KiB to 256 KiB, improving the efficiency of large message transfers.
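The GPUDirect RDMA override mentioned above is an ordinary environment variable, so it can be set for a single job instead of system-wide. This is a minimal sketch; `env_with_gdr_check_disabled` is a hypothetical helper name, and only the variable name and value come from these notes.

```python
import os


def env_with_gdr_check_disabled(base_env=None):
    """Return a copy of the environment with the GDR-required check disabled.

    Hypothetical helper: the caller's environment mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Variable name and value as documented in the release notes.
    env["OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK"] = "1"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the job. Since the fallback to host copy buffers comes with significantly reduced performance, this override is probably best kept to debugging runs.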

Bug Fixes:

  • Fixed debugging output in some initialization failure cases.
  • Request FI_LOCAL_COMM feature from Libfabric, as flush and eager copies are both implemented via local communication.
  • Fix initialization when using the Libfabric TCP provider.
  • Improve documentation on using the plugin with AWS's Elastic Fabric Adapter (EFA).
  • Improve handling of Neuron device detection when the plugin is used with Trainium instances.
  • Fix segfault in error case of freelist memory growth.
  • The test programs that only support 2 ranks now fail with a useful error message if run with another number of ranks.

This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.

AWS OFI NCCL v1.7.3

05 Oct 19:26
v1.7.3-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:

  • Do not disable LL and LL128 protocols on P5 instances.
  • Add support for g5.48xlarge instance types.
  • Fix a block in use leak in the freelist implementation.
  • For NCCL 2.18.5 or later, don't disable NVLS support.
  • Fix bug in handling retry error issues from Libfabric in the RDMA transport (P5 instance types).

This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.

AWS OFI NCCL v1.7.2

25 Aug 17:58
v1.7.2-aws
a463b88

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:

  • Fix compilation against CUDA versions prior to 11.3.
  • Fix allocation of free lists to avoid accidentally registering user data, which can cause corruption on fork() with older Linux kernels.
  • Fix memory leak with registered bounce buffers.
  • Fix improper usage of optlen in call to fi_getopt().
  • Numerous memory cleanup fixes.

This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.