v1.4.0
This release addresses various issues and adds a new MultiSendRecv
operation.
- The default internal stream pool size has changed to 1. This mitigates issues on ROCm platforms; no performance impact was observed on other platforms.
- Fix a compilation error when building on CUDA 12 platforms.
- On ROCm platforms only: zero-size RCCL `Send`, `Recv`, and `Sendrecv` messages are skipped. This is to work around apparent hangs in RCCL with such messages and will be removed once the issue is fixed upstream.
- Fix a memory copy issue in the host-transfer `Alltoallv`.
- Updated to cxxopts 3.
- Added a compile-time traits API for describing what operations, types, etc. are supported by each backend.
- Added the `MultiSendRecv` operation, which supports an arbitrary sequence of sends and receives among ranks as a single operation.
- Various internal reorganizations for the test and benchmark code.
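
To illustrate the general idea behind a compile-time traits API for backend capabilities, here is a minimal C++ sketch. Every name in it (`HostBackend`, `GpuBackend`, `AllreduceOp`, `MultiSendRecvOp`, `IsOpSupported`, `IsTypeSupported`, `can_run`) is hypothetical, and the support matrix is invented for illustration; it is not the library's actual traits interface.

```cpp
#include <type_traits>

// Hypothetical backend tags (assumptions, not the library's real types).
struct HostBackend {};
struct GpuBackend {};

// Hypothetical operation tags (assumptions).
struct AllreduceOp {};
struct MultiSendRecvOp {};

// Primary template: an operation is unsupported unless specialized.
template <typename Op, typename Backend>
struct IsOpSupported : std::false_type {};

// Invented support matrix, for illustration only.
template <> struct IsOpSupported<AllreduceOp, HostBackend> : std::true_type {};
template <> struct IsOpSupported<AllreduceOp, GpuBackend> : std::true_type {};
template <> struct IsOpSupported<MultiSendRecvOp, GpuBackend> : std::true_type {};

// Same pattern for element types: unsupported unless specialized.
template <typename T, typename Backend>
struct IsTypeSupported : std::false_type {};

template <> struct IsTypeSupported<float, HostBackend> : std::true_type {};
template <> struct IsTypeSupported<float, GpuBackend> : std::true_type {};
template <> struct IsTypeSupported<double, HostBackend> : std::true_type {};

// Combines both traits into a single compile-time query.
template <typename Op, typename T, typename Backend>
constexpr bool can_run() {
  return IsOpSupported<Op, Backend>::value &&
         IsTypeSupported<T, Backend>::value;
}

// Callers can reject unsupported combinations at compile time.
static_assert(can_run<AllreduceOp, float, HostBackend>(),
              "float allreduce is marked supported on the host backend");
static_assert(!can_run<MultiSendRecvOp, float, HostBackend>(),
              "MultiSendRecv is not marked supported on the host backend");
```

With traits of this shape, tests and benchmarks can skip unsupported operation and type combinations at compile time instead of failing at run time.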