v5.0.x - OSHMEM/MCA/SPML/UCX: implement put_signal and put_signal_nbi #13631
janjust merged 2 commits into open-mpi:v5.0.x
CI fails due to #13623
    uint64_t dummy_prev, dummy_fetch;

    if (sig_op == SHMEM_SIGNAL_SET) {
        return MCA_ATOMIC_CALL(swap_nb(ctx, &dummy_fetch, (void*)sig_addr, (void*)&dummy_prev,
This is unsafe. swap_nb is nonblocking (I assume) and dummy_fetch/dummy_prev will go out of scope before it completes. There is a good chance the network will randomly overwrite stack values. A dedicated heap memory location for dummy values like this is needed.
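To illustrate the lifetime concern, here is a minimal sketch in plain C. It does not use the real `swap_nb` or any UCX call: `fake_swap_nb`, `fake_progress`, `g_dummy_fetch`, and `signal_set_nb` are hypothetical names, and the "network" is modeled as a single pending request that completes only when progress is driven. A static scratch word is used here; a heap allocation owned by the context would serve the same purpose.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a nonblocking swap: posting only records the
 * operation; the result is written back later, at progress time. */
typedef struct {
    uint64_t *fetch_dst; /* where the previous value lands on completion */
    uint64_t *target;    /* signal word being updated */
    uint64_t  new_value; /* operand, copied when the op is posted */
    int       pending;
} fake_swap_req_t;

static fake_swap_req_t g_req;

/* Scratch word that outlives every call; nobody reads the result. */
static uint64_t g_dummy_fetch;

static void fake_swap_nb(uint64_t *fetch_dst, uint64_t *target, uint64_t value)
{
    assert(!g_req.pending); /* this sketch supports one outstanding op */
    g_req.fetch_dst = fetch_dst;
    g_req.target    = target;
    g_req.new_value = value;
    g_req.pending   = 1;
}

/* Completes the outstanding op, possibly long after the poster returned. */
static void fake_progress(void)
{
    if (g_req.pending) {
        *g_req.fetch_dst = *g_req.target; /* deferred write to fetch_dst */
        *g_req.target    = g_req.new_value;
        g_req.pending    = 0;
    }
}

/* Signal-set path: the fetch buffer is long-lived storage, not a local,
 * so the deferred write cannot land on a dead stack frame. */
static void signal_set_nb(uint64_t *sig_addr, uint64_t signal)
{
    fake_swap_nb(&g_dummy_fetch, sig_addr, signal);
}
```

Had `signal_set_nb` passed the address of a local instead of `&g_dummy_fetch`, the write in `fake_progress` would target a stack slot that no longer belongs to the caller.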
Good point, thanks!
Fixed in the original PR:
https://github.com/open-mpi/ompi/pull/13568/files
Changing this one to draft until its merge
51dde73 to 9af0ce9
@devreal please re-review: the original PR to main was fixed, approved, and merged.
devreal left a comment
Sorry, I think there are more issues here. Or maybe my reading of the UCX is off? Please tag me on the PR to main if more changes are needed.
                              &mca_spml_ucp_request_params[size >> 3]);
    res = opal_common_ucx_wait_request(status_ptr, ucx_ctx->ucp_worker[0],
                                       "ucp_atomic_op_nbx post");
    res = UCS_PTR_IS_ERR(status_ptr) ? OSHMEM_ERROR : OSHMEM_SUCCESS;
This change breaks the call to ucp_atomic_op_nbx: the &value parameter is a pointer to a local variable. Before this change, we waited for the op to complete (opal_common_ucx_wait_request). Now we don't wait anymore, so value may go out of scope before the op is complete and we may read garbage.
And what happens to the request that is returned by ucp_atomic_op_nbx? We used to wait for its completion. Shouldn't that be at least released to avoid a memory leak?
Hi, we figured that non-fetching atomics should always be non-blocking (as the SHMEM spec also says): they don't retrieve anything, so there's no point in waiting for completion. This change is intentional.
The problem now is not that we're potentially overwriting memory, but that the atomic op reads a random value if the value is not read immediately during ucp_atomic_op_nbx and the operation is instead deferred until after the function has returned. Then &value points to memory that we no longer control. Consequently, the signal may not be what the user said it should be.
Regarding completion: the request does not need to be completed/released?
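The difference between reading the operand at post time and reading it later can be sketched as follows. This is a model only, not UCX code: `fake_amo_t`, `fake_amo_post`, and `fake_amo_progress` are hypothetical, and the `pack` argument stands in for whatever mechanism copies the operand into the request when the op is posted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a nonblocking atomic add. */
typedef struct {
    uint64_t       *target;
    const uint64_t *operand_ptr; /* deferred mode: read at progress time */
    uint64_t        packed;      /* packed mode: copied at post time */
    int             is_packed;
} fake_amo_t;

static void fake_amo_post(fake_amo_t *op, uint64_t *target,
                          const uint64_t *operand, int pack)
{
    assert(target != NULL && operand != NULL);
    op->target      = target;
    op->operand_ptr = operand;
    op->is_packed   = pack;
    if (pack) {
        op->packed = *operand; /* snapshot taken before the caller returns */
    }
}

/* Applies the add; in deferred mode this dereferences the caller's pointer,
 * which is only valid if the caller kept the operand alive and unchanged. */
static void fake_amo_progress(fake_amo_t *op)
{
    *op->target += op->is_packed ? op->packed : *op->operand_ptr;
}
```

In the deferred mode, changing the operand between post and progress (a well-defined stand-in for the stack slot being reused) changes what is added to the target; in the packed mode it does not.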
Love it. Is that true for all UCX versions we support or do we need a fallback / test for UCP_REQUEST_FLAG_PROTO_AMO_PACKED?
It seems this is the first PR in UCX in which the atomic value is packed:
https://github.com/openucx/ucx/pull/8547/files
which is v1.20.0-rc1
We can check for the presence of the UCP_REQUEST_FLAG_PROTO_AMO_PACKED flag, which was introduced in that PR, and fall back to blocking behaviour if it's not there
I'll open a PR fixing it in master and then I'll cherry-pick it here
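A rough sketch of that fallback, under stated assumptions: the thread does not show how the presence check is performed, so the macro `HAVE_DECL_UCP_REQUEST_FLAG_PROTO_AMO_PACKED` below is a hypothetical configure-time result, and `fake_post`, `fake_wait`, and `amo_add_signal` are stand-ins rather than real UCX or OSHMEM calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a configure-time check defines this to 1 when the UCX headers
 * declare UCP_REQUEST_FLAG_PROTO_AMO_PACKED. The macro name is hypothetical. */
#ifndef HAVE_DECL_UCP_REQUEST_FLAG_PROTO_AMO_PACKED
#define HAVE_DECL_UCP_REQUEST_FLAG_PROTO_AMO_PACKED 0
#endif

typedef struct {
    uint64_t *target;
    uint64_t  operand;
    int       pending;
} fake_req_t;

/* Hypothetical stand-in for posting a nonblocking atomic add. */
static void fake_post(fake_req_t *r, uint64_t *target, uint64_t operand)
{
    assert(r != NULL && target != NULL);
    r->target  = target;
    r->operand = operand;
    r->pending = 1;
}

/* Hypothetical stand-in for waiting until the request completes. */
static void fake_wait(fake_req_t *r)
{
    if (r->pending) {
        *r->target += r->operand;
        r->pending = 0;
    }
}

/* Posts the op; without operand packing, falls back to blocking.
 * Returns nonzero if the op is still in flight on return. */
static int amo_add_signal(fake_req_t *r, uint64_t *target, uint64_t operand)
{
    fake_post(r, target, operand);
#if !HAVE_DECL_UCP_REQUEST_FLAG_PROTO_AMO_PACKED
    fake_wait(r); /* older UCX: complete before the operand goes out of scope */
#endif
    return r->pending;
}
```

With the macro left at 0, `amo_add_signal` returns only after the update is applied; with it set to 1, the op may still be pending when the function returns.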
Is that documented? Coding against code instead of docs is a bit concerning imo.
We are planning on updating the documentation as well
9af0ce9 to 6b014f2
Hello! The Git Commit Checker CI bot found a few problems with this PR: 51dde73: OSHMEM/MCA/SPML/UCX: implement put_signal and put_...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
(cherry picked from commit 1afefcf) Signed-off-by: Roie Danino <rdanino@nvidia.com>
…ue packing
Signed-off-by: Roie Danino <rdanino@nvidia.com>
OSHMEM/MCA/ATOMIC/UCX: added doc' explaining why there's no need to wait for completion / free status_ptr
Signed-off-by: Roie Danino <rdanino@nvidia.com>
(cherry picked from commit 3d25da6)
6b014f2 to 5bbf441
All comments were addressed in PR #13665

(cherry picked from commit 1afefcf)
(cherry picked from commit 3d25da6)