
Conversation

@joaobfernandes0
Contributor

This is my solution to issue #10328. The problem with this code is that inside ompi_osc_rdma_refresh_dynamic_region there is an ompi_osc_rdma_lock_acquire_shared on the same lock already taken by ompi_osc_rdma_lock_acquire_exclusive, generating a deadlock. My solution was simply to remove the outermost lock. I don't know whether this is the best solution, but with these changes my test codes ran as I expected.
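
To make the deadlock pattern concrete, here is a minimal sketch. It is not the actual osc/rdma code: a pthread_rwlock_t stands in for the module lock (assumed, as described above, to be non-reentrant), and attach_path_before_fix/refresh_dynamic_region are illustrative names for the caller that takes the lock exclusively and the callee that then tries to take the same lock shared. The inner trylock makes the self-deadlock visible instead of hanging:

/* Minimal sketch (not OMPI code): why taking a shared lock while the same
 * thread already holds the exclusive side of the same lock deadlocks. */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t module_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Analogue of ompi_osc_rdma_refresh_dynamic_region(): wants the lock shared. */
static int refresh_dynamic_region(void)
{
    /* With the real (blocking) shared acquire this call would never return,
     * because the caller still holds the lock exclusively. A trylock makes
     * the failure visible instead of hanging. */
    if (pthread_rwlock_tryrdlock(&module_lock) != 0) {
        return -1;  /* would deadlock: shared acquire can never succeed here */
    }
    /* ... refresh the cached region list ... */
    pthread_rwlock_unlock(&module_lock);
    return 0;
}

/* Analogue of the caller that used to take the lock exclusively first. */
static int attach_path_before_fix(void)
{
    pthread_rwlock_wrlock(&module_lock);  /* outer exclusive acquire */
    int rc = refresh_dynamic_region();    /* inner shared acquire on the same lock */
    pthread_rwlock_unlock(&module_lock);
    return rc;
}

int main(void)
{
    printf("before fix: %s\n",
           attach_path_before_fix() == 0 ? "ok" : "inner acquire blocked (deadlock)");
    /* The fix drops the outer exclusive acquire, so only the inner shared
     * lock is taken and the call completes. */
    printf("after fix:  %s\n",
           refresh_dynamic_region() == 0 ? "ok" : "blocked");
    return 0;
}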

Zhiming-Wang and others added 30 commits August 18, 2021 14:19
Signed-off-by: Austen Lauria <[email protected]>
(cherry picked from commit d483e3e)
There are a number of other places to do this, but we're starting with this one.

Signed-off-by: Austen Lauria <[email protected]>

Co-authored-by: Joseph Schuchart <[email protected]>
Co-authored-by: Jeff Squyres <[email protected]>
(cherry picked from commit 79d7fd2)
Signed-off-by: Austen Lauria <[email protected]>
(cherry picked from commit 904459a)
Signed-off-by: Joshua Hursey <[email protected]>
(cherry picked from commit a069749)
Avoid atomic cmpxchg operations for MPI requests that are already
complete. This improves performance in message rate benchmarks.

Signed-off-by: Austen Lauria <[email protected]>
(cherry picked from commit 3cf5004)

Conflicts:
	ompi/request/req_wait.c
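
A rough idea of the fast path the commit above describes, sketched with C11 atomics rather than OMPI's actual req_wait.c internals (the request_t layout and the REQ_* states are illustrative): check completion with a cheap acquire load first, and only fall back to the compare-exchange when the request is still pending.

/* Hypothetical sketch of the "skip the cmpxchg when already complete" fast path. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define REQ_PENDING   ((intptr_t) 0)
#define REQ_COMPLETED ((intptr_t) 1)

typedef struct {
    _Atomic intptr_t complete;  /* REQ_PENDING, REQ_COMPLETED, or a waiter handle */
} request_t;

/* Returns true if the caller installed itself as a waiter and must block. */
static bool register_waiter(request_t *req, intptr_t waiter)
{
    /* Fast path: the request is already complete, so skip the atomic RMW. */
    if (atomic_load_explicit(&req->complete, memory_order_acquire) == REQ_COMPLETED) {
        return false;
    }
    /* Slow path: try to install the waiter; this fails (and we return false)
     * if completion raced ahead of us. */
    intptr_t expected = REQ_PENDING;
    return atomic_compare_exchange_strong(&req->complete, &expected, waiter);
}

int main(void)
{
    request_t req;
    atomic_store(&req.complete, REQ_COMPLETED);
    /* Already complete: returns false without touching the cmpxchg path. */
    return register_waiter(&req, (intptr_t) 1) ? 1 : 0;
}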
Signed-off-by: Valentin Petrov <[email protected]>
(cherry picked from commit 13c8d22)
This patch attempts to open up libfabric resources in order to notify
libfabric when our memhooks patcher intercepts free calls.

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit 25811e2)
…on after wait completes.

We found an issue where, when using multiple threads, it is possible for the data
not to be in the buffer before MPI_Wait() returns. Testing the buffer again later,
after MPI_Wait() had returned, showed that without the rmb() the data does eventually arrive.

We have seen this issue intermittently on Power9 using PAMI, but in theory it could
happen with any transport.

Signed-off-by: Austen Lauria <[email protected]>
(cherry picked from commit 12192f1)
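
The ordering problem the commit above describes can be sketched with C11 atomics (assumed names, not the OMPI/PAMI code): the completing thread writes the payload and then sets a flag with release semantics, and the waiter must pair that with an acquire, which is the role of the strengthened rmb(). Without it, a weakly ordered CPU such as POWER9 may return from the wait with the flag set but the buffer contents not yet visible.

/* Minimal sketch of the publish/consume pattern behind the rmb() fix. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int recv_buffer;              /* payload written by the "transport" thread */
static atomic_int request_complete;  /* completion flag checked by the waiter */

static void *transport_thread(void *arg)
{
    (void) arg;
    recv_buffer = 42;                                 /* 1. write the data         */
    atomic_store_explicit(&request_complete, 1,
                          memory_order_release);      /* 2. then mark it complete  */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, transport_thread, NULL);

    /* Spin like a wait loop; the acquire load is what the added rmb() provides.
     * Without it the CPU may order the buffer read before the flag read. */
    while (!atomic_load_explicit(&request_complete, memory_order_acquire)) {
        /* spin */
    }
    printf("buffer = %d\n", recv_buffer);             /* guaranteed to print 42    */

    pthread_join(t, NULL);
    return 0;
}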
…alue.

Example:
#include <shmem.h>

int main(void) {
    int rc = 0;
    shmem_init();
    if (0 == shmem_my_pe())
        rc = 1;
    shmem_finalize();
    return rc;
}

If the user wants to return a non-zero status, `ompi_mpi_finalize()`
won't be called and the program terminates abnormally.

Signed-off-by: Boris Karasev <[email protected]>
(cherry picked from commit ceb9259)
Signed-off-by: Geoffrey Paulsen <[email protected]>
(cherry picked from commit a93d92b)
There are known issues with the API in libfabric 1.13.0 which will guarantee
segfaults when used. These issues are fixed in libfabric 1.13.1, but we
do not have a way to detect which patch version of libfabric is used.
Thus, delay the usage of the API until the subsequent minor release.

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit 190feba)
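
A sketch of the version gating the commit above describes, assuming libfabric's public fi_version()/FI_MAJOR()/FI_MINOR() query (which reports only major and minor, not the patch level); use_memhooks_import() is an illustrative name, not the actual OMPI code:

/* A 1.13.1 build cannot be told apart from a broken 1.13.0 one at runtime,
 * so the new API is only enabled from the next minor release (1.14) onward. */
#include <rdma/fabric.h>
#include <stdbool.h>
#include <stdint.h>

static bool use_memhooks_import(void)
{
    uint32_t v = fi_version();      /* runtime library version, major/minor only */
    unsigned major = FI_MAJOR(v);
    unsigned minor = FI_MINOR(v);

    /* 1.13.x: cannot distinguish 1.13.0 (segfaults) from 1.13.1 (fixed),
     * so treat the whole minor release as unusable. */
    return (major > 1) || (major == 1 && minor >= 14);
}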
…hmem_pc

Ignoring some generated oshmem .pc files
v4.1.x: common/ofi: Utilize new libfabric API to import memhooks monitor
v4.1.x: Improve MPI_Waitall performance for MPI_THREAD_MULTIPLE
v4.1.x: opal/ppc atomics: Optimize the RMB().
v4.1.x: fix a memory hook recursion hang
v4.1.x: fix --display-diffable-map free()
…_v4.1.x

v4.1.x: orte: Fix orte_report_silent_errors error reporting path
…nup_v4.1.x

v4.1.x: Unlink and rebind socket when session directory already exists
…4.1.x

v4.1.x: Silence some ppc atomics warnings
v4.1.x: Correctly process 0 slots with -host option.
v4.1.x: Fix launch when set -u is in users login .rc file.
v4.1.x: Remove hard coded -lmpi from oshc++ wrapper.
v4.1.x: reduce/ireduce: Return MPI_SUCCESS when count == 0 and send == recv.
v4.1.x: Fix wrong affinity under LSF with membind option.
…alize

v4.1.x: shmem: fix `oshmem_shmem_finalize()` when `main()` returns not zero value.
v4.1.x: orted_comm: Add debugging message to kill procs.
bwbarrett and others added 7 commits May 16, 2022 12:50
…gdouble-fix-v4.1

common/ompio: increase internal default fview size
topo/treematch: Update and sync with fixes from main
…nux-sysconf-sc-open-max-i-cant-even-dot-dot-dot

v4.1.x: odls/default: cap the max number of child FDs to close
Update the NEWS and RELEASE files for 4.1.4rc2.

Signed-off-by: Brian Barrett <[email protected]>
This is my solution to issue open-mpi#10328. The problem with this code is that inside `ompi_osc_rdma_refresh_dynamic_region` there is an `ompi_osc_rdma_lock_acquire_shared` on the same lock already taken by `ompi_osc_rdma_lock_acquire_exclusive`, generating a deadlock. My solution was simply to remove the outermost lock. I don't know whether this is the best solution, but with these changes my test codes ran as I expected.
@ompiteam-bot

Can one of the admins verify this patch?

@awlauria
Contributor

ok to test

@awlauria awlauria added this to the v4.1.4 milestone May 23, 2022
@janjust janjust requested a review from devreal May 24, 2022 15:07
@devreal devreal requested a review from hjelmn May 24, 2022 15:29
@devreal
Contributor

devreal commented May 24, 2022

I don't think removing the outer lock is the right way to go as that was probably protecting more than just ompi_osc_rdma_refresh_dynamic_region. Will have to take a closer look though.

@devreal
Contributor

devreal commented May 24, 2022

@jotabf After looking through the code, I think your patch is correct. The exclusive lock is not needed, and the shared lock in ompi_osc_rdma_find_dynamic_region is enough to ensure consistency. I'd like @hjelmn to confirm that, though.

@bwbarrett bwbarrett modified the milestones: v4.1.4, v4.1.5 May 25, 2022
@awlauria
Contributor

awlauria commented Jun 9, 2022

@devreal is this relevant to main/v5?

@devreal devreal marked this pull request as ready for review June 9, 2022 17:47
@devreal
Contributor

devreal commented Jun 9, 2022

I think this should go back to v5.x, too.

@awlauria
Contributor

@jotabf can you retarget this for main branch, and then cherry-pick back to the v4.1/v4.0/v5.0 branches?

@joaobfernandes0 joaobfernandes0 changed the base branch from v4.1.x to main June 15, 2022 15:28
@joaobfernandes0
Contributor Author

@awlauria If I understood correctly, I have changed the pull request base to the main branch, but is the next step to change it again?

@awlauria
Contributor

@jotabf since git is going to try to merge v4.1 into main (not what we want), you probably want to check out main locally, create a new branch, port the fix over (cherry-pick might even work, if there's no conflict), and then open a new PR targeting main. Sorry for this pain :(

After that's over and merged we can cherry-pick it back to the release branches.

