Removed deadlock in find dynamic region #10473
Can one of the admins verify this patch?
@awlauria could you check this new PR.
ok to test
Thanks @jotabf ! Looks good, you just need to add the signoff message.
@jotabf It looks like there's 2 extra commits on this PR: there's 2 commits with the title "Removed deadlock in find dynamic region" and then another merge commit. Can you squash it all down to 1 signed-off commit? That will make the Git commit checker CI happy. Thanks!
Perfect - thanks @jotabf!
@jotabf It also fixes part of #10244, specifically the hang with … I think you need to adjust the commit message:
Just browsing the code, it seems there are multiple paths to get to this function. One path is:
This path - to my eye - does not acquire the shared lock before trying to acquire the exclusive lock. In fact, in the cases of accumulate and swap it seems to acquire the exclusive lock after the call to … in osc_rdma_accumulate.c. Should that lock acquisition be moved up with this change? Another path is:
which looks to me like it doesn't acquire any lock. I'm not familiar enough with this code to know if these are ok or not - but @devreal is it ok to access this block with no lock, as the paths above could in theory do? There's probably one or two other paths as well, but if a lock isn't required on this code block it's moot.
    if (!ompi_osc_rdma_peer_local_state (peer)) {
        ret = ompi_osc_rdma_refresh_dynamic_region (module, dy_peer);
        if (OMPI_SUCCESS != ret) {
            ompi_osc_rdma_lock_release_exclusive (module, peer, offsetof (ompi_osc_rdma_state_t, regions_lock));
There's an OPAL_THREAD_LOCK() acquired on line 470 above..should that get released here?
Technically outside the scope of this PR, but something @gpaulsen and I noticed.
I agree, this should be fixed in this PR as well (while we're at it)
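To illustrate the concern in this review thread: when a function takes a mutex and then returns early on an error, the mutex stays held unless the error path releases it. The sketch below is not the Open MPI code. `toy_mutex_t`, `toy_refresh_regions`, `toy_lookup_leaky`, and `toy_lookup_fixed` are hypothetical names, and the toy mutex only records whether it is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a thread mutex: it only tracks whether it is held. */
typedef struct { bool held; } toy_mutex_t;

static void toy_mutex_lock(toy_mutex_t *m)   { assert(!m->held); m->held = true; }
static void toy_mutex_unlock(toy_mutex_t *m) { assert(m->held);  m->held = false; }

/* Hypothetical refresh step: returns -1 when asked to fail, 0 otherwise. */
static int toy_refresh_regions(bool fail) { return fail ? -1 : 0; }

/* Leaky variant: the early return on error skips the unlock. */
static int toy_lookup_leaky(toy_mutex_t *m, bool fail)
{
    toy_mutex_lock(m);
    int ret = toy_refresh_regions(fail);
    if (0 != ret) {
        return ret;              /* bug: m is still held here */
    }
    toy_mutex_unlock(m);
    return 0;
}

/* Fixed variant: every exit passes through the same unlock. */
static int toy_lookup_fixed(toy_mutex_t *m, bool fail)
{
    toy_mutex_lock(m);
    int ret = toy_refresh_regions(fail);
    /* ... use the refreshed data only when ret == 0 ... */
    toy_mutex_unlock(m);
    return ret;
}
```

With a fresh mutex, `toy_lookup_leaky(&m, true)` returns -1 and leaves `m.held` set, while `toy_lookup_fixed(&m, true)` returns -1 with the mutex released.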
osc/rdma: removed deadlock in find dynamic region Signed-off-by: João Batista Fernandes <[email protected]>
Sorry for the delay, this went under my radar. The problem is the … The lock you pointed out @awlauria is the …
Makes sense.. @jotabf can you fixup your commit message? Looks like it got swapped for a merge commit message.
osc/rdma: removed deadlock in find dynamic region Signed-off-by: João Batista Fernandes <[email protected]>
@jotabf thanks for the patch. Would you mind cherry-picking the commit to the 5.0.x and 4.1.x branches?
@devreal I saw that this pull request was already merged and closed. To confirm, are you asking me to repeat this same commit and pull request to the 5.0.x and 4.1.x branches?
Correct. The workflow is typically to create a new branch from the release branch and …
osc/rdma: removed deadlock in find dynamic region Signed-off-by: João Batista Fernandes <[email protected]> (cherry picked from commit 9779b99)
Fixing the PR #10413
This is my solution to issue #10328. The problem with this code is that inside of ompi_osc_rdma_refresh_dynamic_region there is an ompi_osc_rdma_lock_acquire_shared on the same lock already taken by ompi_osc_rdma_lock_acquire_exclusive, which generates a deadlock. My solution was simply to remove the outermost lock. I don't know if this is the best solution, but with these changes my test codes ran as I expected.
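The deadlock described above can be reproduced in miniature. The following is a rough sketch and not the osc/rdma implementation: the lock word layout and all `toy_*` names are assumptions made for illustration. It uses try-style acquires, so the case that would block forever shows up as a -1 return instead of a hang.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 0 marks an exclusive holder, higher bits count shared holders. */
#define TOY_EXCLUSIVE  ((uint64_t) 1)
#define TOY_SHARED_ONE ((uint64_t) 2)

typedef struct { _Atomic uint64_t word; } toy_lock_t;

/* Succeeds only when nobody holds the lock in any mode. */
static bool toy_try_exclusive(toy_lock_t *l)
{
    uint64_t expected = 0;
    return atomic_compare_exchange_strong(&l->word, &expected, TOY_EXCLUSIVE);
}

static void toy_release_exclusive(toy_lock_t *l)
{
    uint64_t prev = atomic_fetch_sub(&l->word, TOY_EXCLUSIVE);
    assert(prev & TOY_EXCLUSIVE);
}

/* Registers as a shared holder, then backs out if an exclusive holder exists. */
static bool toy_try_shared(toy_lock_t *l)
{
    uint64_t prev = atomic_fetch_add(&l->word, TOY_SHARED_ONE);
    if (prev & TOY_EXCLUSIVE) {
        atomic_fetch_sub(&l->word, TOY_SHARED_ONE);
        return false;
    }
    return true;
}

static void toy_release_shared(toy_lock_t *l)
{
    atomic_fetch_sub(&l->word, TOY_SHARED_ONE);
}

/* Stand-in for the refresh routine: it takes the shared lock itself.
 * Returns -1 where a blocking acquire would wait forever. */
static int toy_refresh(toy_lock_t *l)
{
    if (!toy_try_shared(l)) {
        return -1;
    }
    /* ... read the region table ... */
    toy_release_shared(l);
    return 0;
}

/* Before the patch: the caller wraps the refresh in the exclusive lock. */
static int toy_find_region_before(toy_lock_t *l)
{
    if (!toy_try_exclusive(l)) {
        return -1;
    }
    int ret = toy_refresh(l);   /* inner shared acquire can never succeed */
    toy_release_exclusive(l);
    return ret;
}

/* After the patch: the outer lock is gone, the refresh locks for itself. */
static int toy_find_region_after(toy_lock_t *l)
{
    return toy_refresh(l);
}
```

On a fresh lock, `toy_find_region_before` returns -1 because the shared acquire runs while the same caller still holds the exclusive bit, and `toy_find_region_after` returns 0.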