NCCL inplace reduce-scatterv trashes rank 0 buffer #110

ndryden · 2021-02-16T15:23:35Z

We currently implement the NCCL reduce-scatterv as a reduce to rank 0 followed by a scatterv. For an in-place operation, the reduce is done in place on the input sendbuf, so it writes to portions of sendbuf on rank 0 that lie outside the region where the final scattered value will be placed.
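To make the failure mode concrete, here is a minimal sketch that simulates the ranks as plain Python lists, with no NCCL or MPI involved. The function name `inplace_reduce_then_scatterv` is hypothetical, the reduction is assumed to be a sum, and the result region is assumed to be the first `counts[r]` elements of each rank's buffer (the MPI in-place convention); the library's actual layout may differ.

```python
def inplace_reduce_then_scatterv(bufs, counts):
    """Simulate reduce-to-rank-0 followed by scatterv, both in place.

    bufs[r] is rank r's buffer, of length sum(counts).
    Assumption: rank r's result belongs in bufs[r][:counts[r]].
    """
    total = sum(counts)
    displs = [sum(counts[:r]) for r in range(len(counts))]

    # Step 1: in-place sum-reduce to rank 0. This overwrites ALL of
    # rank 0's buffer, not just the part that will hold its result.
    for i in range(total):
        bufs[0][i] = sum(buf[i] for buf in bufs)

    # Step 2: scatterv from rank 0. Rank r receives the slice starting
    # at displs[r]; for rank 0 this copy is a no-op.
    for r, (d, c) in enumerate(zip(displs, counts)):
        bufs[r][:c] = bufs[0][d:d + c]
    return bufs


bufs = inplace_reduce_then_scatterv(
    [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], counts=[1, 2, 1])
print(bufs[0])  # [6, 6, 6, 6]: elements 1..3 were clobbered
print(bufs[1])  # [6, 6, 2, 2]: untouched outside its result region
```

Only rank 0 loses data outside its result region; the other ranks keep the tail of their buffers intact.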

I can't find anything in the MPI standard that explicitly prohibits this, but:

- It's a bit aesthetically displeasing.
- In other cases, such as an MPI_Recv with a buffer/count larger than the actual message length, MPI does guarantee that no more memory will be touched than the message actually needs.
- Avoiding it shouldn't add much overhead if we use a memory pool.
- A better, direct implementation can probably avoid it.
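One way to avoid the extra writes, sketched below under the same simulated setup (ranks as plain Python lists, sum reduction, result region assumed to be the first `counts[r]` elements), is to reduce into a scratch buffer on rank 0 and scatter from there. The function name `reduce_scatterv_with_scratch` is hypothetical, and this is not necessarily how the eventual fix works; in a real implementation the scratch buffer would presumably come from the memory pool mentioned above.

```python
def reduce_scatterv_with_scratch(bufs, counts):
    """Simulate reduce-to-rank-0 into a temporary, then scatterv.

    bufs[r] is rank r's buffer, of length sum(counts).
    Assumption: rank r's result belongs in bufs[r][:counts[r]].
    """
    total = sum(counts)
    displs = [sum(counts[:r]) for r in range(len(counts))]

    # Reduce into a scratch buffer instead of rank 0's own buffer
    # (stand-in for an allocation from a memory pool).
    scratch = [sum(buf[i] for buf in bufs) for i in range(total)]

    # Scatter from scratch; each rank, including rank 0, only has its
    # result region written.
    for r, (d, c) in enumerate(zip(displs, counts)):
        bufs[r][:c] = scratch[d:d + c]
    return bufs


bufs = reduce_scatterv_with_scratch(
    [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], counts=[1, 2, 1])
print(bufs[0])  # [6, 1, 1, 1]: only the result region changed
```

The cost is one extra buffer of `sum(counts)` elements on rank 0, which is why a pooled allocation keeps the overhead small.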


ndryden added the bug label Feb 16, 2021

ndryden added a commit that referenced this issue Feb 26, 2021

Fix NCCL reduce-scatterv. Closes #110.

41590b4

ndryden mentioned this issue Feb 26, 2021

Fix NCCL reduce-scatterv #121

Merged

ndryden added a commit that referenced this issue Mar 4, 2021

Fix NCCL reduce-scatterv. Closes #110.

b6b6726

ndryden added a commit that referenced this issue Mar 5, 2021

Fix NCCL reduce-scatterv. Closes #110.

f865c07

ndryden closed this as completed in #121 Mar 5, 2021
