Skip to content

Conversation

@joaobfernandes0
Copy link
Contributor

Fixing the PR #10413

This is my solution to issue #10328. The problem with this code is that inside of ompi_osc_rdma_refresh_dynamic_region there is a ompi_osc_rdma_lock_acquire_shared to the same lock used in ompi_osc_rdma_lock_acquire_exclusive generation a deadlock. My solution was simply to remove the outermost lock. However, I don't know if this is the better solution, but, with these changes, my test codes ran as I expected.

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@joaobfernandes0
Copy link
Contributor Author

joaobfernandes0 commented Jun 15, 2022

@awlauria could you check this new PR.

@awlauria
Copy link
Contributor

ok to test

@awlauria
Copy link
Contributor

Thanks @jotabf ! Looks good, you just need to add the signoff message.

git commit --amend --signoff + a force-push should do the trick.

@jsquyres
Copy link
Member

@jotabf It looks like there's 2 extra commits on this PR: there's 2 commits with the title "Removed deadlock in find dynamic region" and then another merge commit.

Can you squash it all down to 1 signed-off commit? That will make the Git commit checker CI happy.

Thanks!

@awlauria awlauria marked this pull request as ready for review June 15, 2022 21:19
@awlauria
Copy link
Contributor

Perfect - thanks @jotabf!

@wzamazon
Copy link
Contributor

@jotabf

It also fix part of #10244, specifically the hang with c_get_dynamic_self and c_put_dynamic_self.

I think you need adjust the commit message:

  1. add osc/rdma to the title of your commit. like: "osc/rdma: removed deadlock in find dynamic region"
  2. break the body of the message into multiple lines.

@awlauria
Copy link
Contributor

Just browsing the code, it seems there are multiple paths to get to this function ompi_osc_rdma_find_dynamic_region.

One path is:

  1. ompi_osc_rdma_compare_and_swap/ompi_osc_rdma_accumulate
  2. osc_rdma_get_remote_segment
  3. ompi_osc_rdma_find_dynamic_region

This path - to my eye - does not acquire the shared lock before trying to acquire the exclusive lock. In fact, in the cases of accumulate and swap it seems to acquire the exclusive lock after the call to get_remote_segment, here:

   ret = osc_rdma_get_remote_segment (module, peer, target_disp, true_lb+true_extent, &target_address, &target_handle);
    if (OPAL_UNLIKELY(OPAL_SUCCESS != ret)) {
        return ret;
    }

    /* to ensure order wait until the previous accumulate completes */
    while (!ompi_osc_rdma_peer_test_set_flag (peer, OMPI_OSC_RDMA_PEER_ACCUMULATING)) {
        ompi_osc_rdma_progress (module);
    }

    /* get an exclusive lock on the peer */
    if (!ompi_osc_rdma_peer_is_exclusive (peer) && !(module->acc_single_intrinsic || win->w_acc_ops <= OMPI_WIN_ACCUMULATE_OPS_SAME_OP)) {
        (void) ompi_osc_rdma_lock_acquire_exclusive (module, peer, offsetof (ompi_osc_rdma_state_t, accumulate_lock));
        lock_acquired = true;
    }

in osc_rdma_accumulate.c. Should that lock acquisition be moved up with this change?

Another path, is:

  1. ompi_osc_rdma_put
  2. ompi_osc_rdma_put_w_req
  3. osc_rdma_get_remote_segment
  4. ompi_osc_rdma_find_dynamic_region

which looks to me like it doesn't acquire any lock. I'm not familiar with this code enough to know if these are ok or not - but @devreal is it ok to access this block with no lock, as the paths above could in theory do? There's probably one or two other paths as well, but if a lock isn't required on this code block it's moot.

if (!ompi_osc_rdma_peer_local_state (peer)) {
ret = ompi_osc_rdma_refresh_dynamic_region (module, dy_peer);
if (OMPI_SUCCESS != ret) {
ompi_osc_rdma_lock_release_exclusive (module, peer, offsetof (ompi_osc_rdma_state_t, regions_lock));
Copy link
Contributor

@awlauria awlauria Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an OPAL_THREAD_LOCK() acquired on line 470 above..should that get released here?

Technically outside the scope of this PR, but something @gpaulsen and I noticed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this should be fixed in this PR as well (while we're at it)

joaobfernandes0 added a commit to joaobfernandes0/ompi that referenced this pull request Jun 17, 2022
osc/rdma: removed deadlock in find dynamic region

Signed-off-by: João Batista Fernandes <[email protected]>
@devreal
Copy link
Contributor

devreal commented Jun 22, 2022

Sorry for the delay, this went under my radar. The problem is the regions_lock, which is used to synchronize access to the list of attached regions. The problem is that in ompi_osc_rdma_find_dynamic_region we take the regions lock at the target exclusively but then in ompi_osc_rdma_refresh_dynamic_region (while holding the exclusive lock) there is a chance that we have to take the regions_lock if the regions have changed. I don't see why we need the exclusive lock at the target since we don't modify anything there. The shared lock is required to ensure consistency with the target process modifying the list of attached regions while holding the regions_lock exclusively (in ompi_osc_rdma_attach and ompi_osc_rdma_detach).

The lock you pointed out @awlauria is the accumulate_lock, which is used to guarantee atomicity if we cannot rely on hardware atomic operations. That is independent though.

@awlauria
Copy link
Contributor

Makes sense.. @jotabf can you fixup your commit message? Looks like it got swapped for a merge commit message.

osc/rdma: removed deadlock in find dynamic region

Signed-off-by: João Batista Fernandes <[email protected]>
@devreal devreal merged commit 21864d1 into open-mpi:main Jun 23, 2022
@devreal
Copy link
Contributor

devreal commented Jun 23, 2022

@jotabf thanks for the patch. Would you mind cherry-picking the commit to the 5.0.x and and 4.1.x branches?

@joaobfernandes0
Copy link
Contributor Author

@devreal I saw that this pull request was already merged and closed. To confirm, are you asking me to repeat this same commit and pull request to the 5.0.x and 4.1.x branches?

@devreal
Copy link
Contributor

devreal commented Jun 24, 2022

Correct. The workflow is typically to create a new branch from the release branch and git cherry-pick -x <SHA> the commit. The -x flag is important to include the cherry-pick tag. Then just create PRs against the release branches and tag the right target in the labels :) Feel free to request reviews from me. Thanks!

@joaobfernandes0 joaobfernandes0 deleted the patch-2 branch June 27, 2022 13:53
joaobfernandes0 added a commit to joaobfernandes0/ompi that referenced this pull request Jun 27, 2022
osc/rdma: removed deadlock in find dynamic region

Signed-off-by: João Batista Fernandes <[email protected]>
(cherry picked from commit 9779b99)
joaobfernandes0 added a commit to joaobfernandes0/ompi that referenced this pull request Jun 27, 2022
osc/rdma: removed deadlock in find dynamic region

Signed-off-by: João Batista Fernandes <[email protected]>
(cherry picked from commit 9779b99)
joaobfernandes0 added a commit to joaobfernandes0/ompi that referenced this pull request Jun 27, 2022
osc/rdma: removed deadlock in find dynamic region

Signed-off-by: João Batista Fernandes <[email protected]>
(cherry picked from commit 9779b99)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants