Releases: LLNL/Aluminum
v1.0.0
Aluminum is now officially stable.
Changes since v0.7.0:
- Aluminum communicators have been refactored and now always behave like objects (as opposed to handles). All communicators have a stream interface.
- Added a `Barrier` operation to all backends.
- Added support for vector collectives in the host-transfer backend.
- Fixed a bug in the NCCL `Reduce_scatterv` operation (#110).
- Various other code cleanups and bug fixes.
v0.7.0
The testing and benchmarking infrastructure has been entirely rewritten to be significantly more comprehensive and cleaner. There are also now scripts for nicely plotting benchmark results.
Numerous bugfixes and similar improvements:
- Aluminum no longer attempts to use bitwise reductions for `long double`.
- Fixed a bug in the host-transfer `Allreduce` on one processor.
- Fixed in-place bugs in the NCCL `Gather`, `Gatherv`, `Scatter`, and `Scatterv` operations.
- Fixed the MPI type for `long int`.
- The `throw_al_exception` macro now works outside of the `Al` namespace.
- Added a check for mismatches between the HWLOC version Aluminum was compiled with and the version used at runtime.
- All internal Aluminum headers are now included with the `aluminum/` prefix to avoid conflicts with other projects.
v0.6.0
New features:
- Support for `Send`, `Recv`, and `SendRecv` in the NCCL backend.
- Added initial support for `Gather`, `Scatter`, and `Alltoall` to the NCCL backend.
- Initial support for vector collectives in the NCCL and MPI backends: `Allgatherv`, `Alltoallv`, `Gatherv`, `Scatterv`, and `Reduce_scatterv`.
- Added new benchmarks for all supported operations.
- Improved performance and correctness of the spin-wait kernel used in the host-transfer backend.
- Improved progress-engine binding logic; the related environment variables have been removed, and failing to bind no longer throws an exception.
Other changes:
- Various code cleanups and enhancements.
- The pairwise-exchange/ring allreduce algorithm has been removed from the MPI backend.
- An internal CUB memory pool is now used for temporary GPU memory allocations.
v0.5.0
v0.4.0
v0.3.3
v0.3.2
v0.2.1-1
v0.2.1
v0.2
New features/changes:
- Host-transfer implementations of standard collectives in the MPI-CUDA backend: AllGather, AllToAll, Broadcast, Gather, Reduce, ReduceScatter, and Scatter.
- The progress engine is now aware of separate compute streams, enabling better scheduling of non-interfering operations.
- Experimental RMA Put/Get operations.
- Improved Aluminum algorithm specification.
- Non-blocking point-to-point operations.
- Improved testing and benchmarks.
- Bugfixes and performance improvements.