Skip to content

Conversation

@rhc54
Copy link
Contributor

@rhc54 rhc54 commented Jan 9, 2026

The PMIx_Fence_nb function can return PMIX_OPERATION_SUCCEEDED to indicate
that the function was executed atomically and the callback function will
therefore not be called. The PMIx Standard lists a few reasons why this
can happen, but the point here was to fix usage to properly handle
that possibility.

@rhc54
Copy link
Contributor Author

rhc54 commented Jan 9, 2026

@janjust This is timing out in some test NVIDIA runs, but I have no idea why and no way to test it myself (my Mac doesn't support UCX). I can try to do it in some kind of container, but would need advice on how. Otherwise, I'd just consider this as a suggestion PR and close it.

@janjust
Copy link
Contributor

janjust commented Jan 9, 2026

@rhc54 let me give it a try

@janjust
Copy link
Contributor

janjust commented Jan 9, 2026

I'm guessing update pmix sha is intended with this PR

@rhc54
Copy link
Contributor Author

rhc54 commented Jan 9, 2026

I removed the changes to common_ucx.c and now it passes. So there is something odd going on down there, but I don't have the ability to test it. Perhaps what is needed is to set the active flag to "false" in the case of a singleton, or when the fence isn't going to issue a callback? Might need some experimenting.

@janjust
Copy link
Contributor

janjust commented Jan 9, 2026

yeah it now passes, because the initial PR, you didn't set the fenced in case of atomic completion so when it came to tear down it was stuck in the if(!fenced)

the _nb code worked fine.

@rhc54
Copy link
Contributor Author

rhc54 commented Jan 9, 2026

See the other comment thread on #13626 (comment) - there is a necessary change to fix what @hppritcha observed. Your tests don't seem to hit that use-case. I'm just not including it here.

I updated PMIx here because it was changed to strictly conform to the Standard - which it wasn't before. Hence, the spml/ucx component will now fail with these changes when running as a singleton.

@hppritcha
Copy link
Member

Please remove the PMIx sha update and include in a separate PR.

Copy link
Member

@hppritcha hppritcha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the pmix sha removal from this pr and open a separate one to do that.

The PMIx_Fence_nb function can return PMIX_OPERATION_SUCCEEDED to indicate
that the function was executed atomically and the callback function will
therefore not be called. The PMIx Standard lists a few reasons why this
can happen, but the point here was to fix usage to properly handle
that possibility.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants