Skip to content

v5.0.x - OSHMEM/MCA/SPML/UCX: implement put_signal and put_signal_nbi#13631

Merged
janjust merged 2 commits intoopen-mpi:v5.0.xfrom
roiedanino:port-put-signal-v5.0.x
Jan 20, 2026
Merged

v5.0.x - OSHMEM/MCA/SPML/UCX: implement put_signal and put_signal_nbi#13631
janjust merged 2 commits intoopen-mpi:v5.0.xfrom
roiedanino:port-put-signal-v5.0.x

Conversation

@roiedanino
Copy link
Contributor

@roiedanino roiedanino commented Jan 12, 2026

(cherry picked from commit 1afefcf)
(cherry picked from commit 3d25da6)

@github-actions github-actions bot added this to the v5.0.10 milestone Jan 12, 2026
@roiedanino roiedanino changed the title OSHMEM/MCA/SPML/UCX: implement put_signal and put_signal_nbi v5.0.x - OSHMEM/MCA/SPML/UCX: implement put_signal and put_signal_nbi Jan 12, 2026
@roiedanino
Copy link
Contributor Author

CI Fails due to #13623

uint64_t dummy_prev, dummy_fetch;

if (sig_op == SHMEM_SIGNAL_SET) {
return MCA_ATOMIC_CALL(swap_nb(ctx, &dummy_fetch, (void*)sig_addr, (void*)&dummy_prev,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unsafe. swap_nb is nonblocking (I assume) and dummy_fetch/dummy_prev will go out of scope before it completes. There is a good chance the network will randomly overwrite stack values. A dedicated heap memory location for dummy values like this is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the original PR:
https://github.com/open-mpi/ompi/pull/13568/files

Changing this one to draft until its merge

@roiedanino roiedanino marked this pull request as draft January 13, 2026 08:45
@janjust janjust force-pushed the port-put-signal-v5.0.x branch from 51dde73 to 9af0ce9 Compare January 16, 2026 18:49
@roiedanino roiedanino marked this pull request as ready for review January 16, 2026 18:58
@janjust janjust requested a review from devreal January 16, 2026 22:23
janjust
janjust previously approved these changes Jan 16, 2026
@janjust
Copy link
Contributor

janjust commented Jan 16, 2026

@devreal please rereview - the main original was fixed, approved, merged

Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think there are more issues here. Or maybe my reading of the UCX is off? Please tag me on the PR to main if more changes are needed.

&mca_spml_ucp_request_params[size >> 3]);
res = opal_common_ucx_wait_request(status_ptr, ucx_ctx->ucp_worker[0],
"ucp_atomic_op_nbx post");
res = UCS_PTR_IS_ERR(status_ptr) ? OSHMEM_ERROR : OSHMEM_SUCCESS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change breaks the call to ucp_atomic_op_nbx: the &value parameter is a pointer to a local variable. Before this change, we waited for the op to complete (opal_common_ucx_wait_request). Now we don't wait anymore, so value may go out of scope before the op is complete and we may read garbage.

And what happens to the request that is returned by ucp_atomic_op_nbx? We used to wait for its completion. Shouldn't that be at least released to avoid a memory leak?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roiedanino ^ do we need to re-address main?

Copy link
Contributor Author

@roiedanino roiedanino Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, we figured those non-fetching atomics should be always non-blocking (as it also being said in the SHMEM spec) as they are not retrieving anything, so there’s no point at waiting for the completion, this change is intentional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the original PR btw:
#13568

Copy link
Contributor

@devreal devreal Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem now is not that the we're potentially overwriting memory but that the atomic op reads a random value if the value is not read immediately during ucp_atomic_op_nbx and instead the operation is deferred until after the function returned. Then &value points to some memory that we don't control anymore. Consequently, the signal may not be what the user said it should be.

Regarding completion: the request does not need to be completed/released?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it. Is that true for all UCX versions we support or do we need a fallback / test for UCP_REQUEST_FLAG_PROTO_AMO_PACKED?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this is the first PR in UCX in which the atomic value is packed:
https://github.com/openucx/ucx/pull/8547/files

which is v1.20.0-rc1

We can check for the UCP_REQUEST_FLAG_PROTO_AMO_PACKED flag presence, which was introduced in that PR, and in case it's not there, fallback to blocking behaviour

I'll open a PR fixing it in master and then I'll cherry-pick it here

Copy link
Contributor Author

@roiedanino roiedanino Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok apparently we don't even need to free the status_ptr, those are last lines of ucp_atomic_op_nbx:
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that documented? Coding against code instead of docs is a bit concerning imo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are planning on updating the documentation as well

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

51dde73: OSHMEM/MCA/SPML/UCX: implement put_signal and put_...

  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: 26fa31a

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@roiedanino roiedanino requested a review from devreal January 20, 2026 14:33
@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

51dde73: OSHMEM/MCA/SPML/UCX: implement put_signal and put_...

  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: 26fa31a

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

(cherry picked from commit 1afefcf)
Signed-off-by: Roie Danino <rdanino@nvidia.com>
…ue packing

Signed-off-by: Roie Danino <rdanino@nvidia.com>

OSHMEM/MCA/ATOMIC/UCX: added doc' explaining why there's no need to wait for completion / free status_ptr

Signed-off-by: Roie Danino <rdanino@nvidia.com>
(cherry picked from commit 3d25da6)
@roiedanino roiedanino force-pushed the port-put-signal-v5.0.x branch from 6b014f2 to 5bbf441 Compare January 20, 2026 14:39
@janjust janjust dismissed devreal’s stale review January 20, 2026 20:08

All comments were addressed in PR #13665

@janjust janjust merged commit 1e40352 into open-mpi:v5.0.x Jan 20, 2026
17 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants