AWS OFI NCCL v1.8.0

@rajachan rajachan released this 19 Feb 17:48
v1.8.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL v2.19.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).
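The version requirements above can be checked before deployment. The following is a minimal sketch, not part of the plugin: the helper names `parse_version` and `meets_requirements` are hypothetical, and the version strings are assumed to be supplied by the caller (for example, copied from package metadata) rather than queried from Libfabric or NCCL directly.

```python
import re

# Minimum versions stated in these release notes.
MIN_LIBFABRIC = (1, 18, 0)
MIN_NCCL = (2, 4, 8)


def parse_version(text):
    """Turn 'v1.18.0' or '2.19.4-1' into a tuple of ints such as (1, 18, 0).

    Hypothetical helper: only the leading major.minor.patch triple is used;
    a packaging suffix such as '-1' is ignored.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_requirements(libfabric_version, nccl_version):
    """Return True if both versions satisfy the minimums listed above."""
    return (
        parse_version(libfabric_version) >= MIN_LIBFABRIC
        and parse_version(nccl_version) >= MIN_NCCL
    )
```

Comparing tuples of integers avoids the pitfall of string comparison, where "1.9.0" would sort after "1.18.0".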

New Features:

  • A tuner component for the plugin that picks the optimal NCCL algorithm and protocol at a given scale and message size.
  • Improved communicator and memory region identifier management.
  • Migrated from CUDA Runtime API to functional equivalents in CUDA Driver API in preparation for dma-buf support for memory registration. With this change, the plugin uses the same mechanism as NCCL to interact with the CUDA subsystem.
  • No longer forcing a flush operation for network operations when running with H100 GPUs, even when running with older NCCL versions (< v2.19.1).
  • Improvements to internal device-agnostic APIs.
  • Support for NCCL v7 ext-net plugin interface introduced in NCCL v2.19.3.
  • Support for Ubuntu 22.04 LTS distribution.

Bug Fixes:

  • Set the maximum NVLS tree chunk size to 512 KiB, using a parameter introduced in NCCL v2.20.3, to recover from a performance regression introduced in NCCL v2.19.4.
  • Prevent possible invocation of CUDA calls in libfabric by requiring a libfabric version of v1.18.0 or newer.
  • Fix debug prints that reported incorrect device IDs during initialization.
  • Fixes to MAX_COMM computation.
  • Better handling of NVLS enablement when NCCL is statically linked to applications.
  • Fixes to internal API return codes.
  • Configuration system fixes for Neuron builds.
  • Make plugin environment parsing case-insensitive.
  • Miscellaneous fixes that address memory leaks, NULL dereferences, and compiler warnings.
  • Updates and improvements to the project documentation.

Testing:

This release has been tested extensively with NCCL v2.19.4-1 for functionality and performance. It has also been lightly tested with NCCL v2.20.3-1, which was released earlier this week. Testing covered Libfabric versions up to v1.19.0.

Checksum (sha512) for the release tarball:

7bad7995e99649dc3ae4c46b2b0011225134703050ae83ab837cd46a7ff979079809cbd117e50cf5169428dd397ab099fea6249d12f891bff94b2d5579b0c0d9  aws-ofi-nccl-1.8.0-aws.tar.gz
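
The checksum above can be verified after download. The following is a minimal sketch using Python's standard `hashlib`. The helper names `sha512_of_file` and `verify_tarball` are hypothetical, and the tarball is assumed to be in the current working directory.

```python
import hashlib
import hmac

# SHA-512 checksum published in these release notes.
EXPECTED_SHA512 = (
    "7bad7995e99649dc3ae4c46b2b0011225134703050ae83ab837cd46a7ff979079809cbd117e50cf5169428dd397ab099fea6249d12f891bff94b2d5579b0c0d9"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so the tarball is never fully in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected=EXPECTED_SHA512):
    """Return True if the file's SHA-512 digest matches the expected hex string."""
    return hmac.compare_digest(sha512_of_file(path), expected.lower())


if __name__ == "__main__":
    # Assumed file location; adjust to wherever the tarball was downloaded.
    ok = verify_tarball("aws-ofi-nccl-1.8.0-aws.tar.gz")
    print("checksum OK" if ok else "checksum MISMATCH")
```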