v0.6.0
New features:
- Support for Send, Recv, and SendRecv in the NCCL backend.
- Add initial support for Gather, Scatter, and Alltoall to the NCCL backend.
- Initial support for vector collectives in the NCCL and MPI backends: Allgatherv, Alltoallv, Gatherv, Scatterv, and Reduce_scatterv.
- Added new benchmarks for all supported operations.
- Improved performance and correctness of the spin-wait kernel used in the host-transfer backend.
- Improved progress engine binding logic. Related environment variables have been removed. Failing to bind no longer throws an exception.
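
For readers new to vector collectives, the sketch below illustrates the data movement of an Allgatherv, where each rank may contribute a different number of elements. It is not the library's API: the function name simulate_allgatherv is hypothetical, the "ranks" are plain in-memory vectors, and no communication backend is involved. It only shows how per-rank counts and displacements determine the layout of the receive buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of any library): simulates the result of an
// Allgatherv over in-memory "ranks". send_bufs[r] is rank r's contribution,
// and its size plays the role of that rank's count.
// Returns the receive buffer that every rank would hold afterwards.
std::vector<int> simulate_allgatherv(
    const std::vector<std::vector<int>>& send_bufs) {
  // displs[r] is the offset of rank r's block in the receive buffer,
  // computed as the exclusive prefix sum of the per-rank counts.
  std::vector<std::size_t> displs(send_bufs.size(), 0);
  std::size_t total = 0;
  for (std::size_t r = 0; r < send_bufs.size(); ++r) {
    displs[r] = total;
    total += send_bufs[r].size();
  }

  // Copy each rank's contribution to its displacement.
  std::vector<int> recv(total);
  for (std::size_t r = 0; r < send_bufs.size(); ++r) {
    std::copy(send_bufs[r].begin(), send_bufs[r].end(),
              recv.begin() + static_cast<std::ptrdiff_t>(displs[r]));
  }
  return recv;
}
```

In this sketch, ranks that contribute zero elements are valid and simply occupy no space in the result. The other vector collectives listed above follow the same idea of per-rank counts and displacements, applied to their respective communication patterns.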
Other changes:
- Various code cleanups and enhancements.
- The pairwise-exchange/ring allreduce algorithm has been removed from the MPI backend.
- An internal CUB memory pool is now used for temporary GPU memory allocations.