Removed deadlock in find dynamic region #10413
Conversation
Signed-off-by: Austen Lauria <[email protected]> (cherry picked from commit d483e3e)
There are a number of other places to do this, but starting with this. Signed-off-by: Austen Lauria <[email protected]> Co-authored-by: Joseph Schuchart <[email protected]> Co-authored-by: Jeff Squyres <[email protected]> (cherry picked from commit 79d7fd2)
Signed-off-by: Austen Lauria <[email protected]> (cherry picked from commit 904459a)
Signed-off-by: Joshua Hursey <[email protected]> (cherry picked from commit a069749)
Avoid atomic cmpxchg operations for MPI requests that are already complete. This improves performance in message-rate benchmarks. Signed-off-by: Austen Lauria <[email protected]> (cherry picked from commit 3cf5004) Conflicts: ompi/request/req_wait.c
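A hedged sketch of the optimization this commit describes (hypothetical names and layout, not the actual ompi/request/req_wait.c code): check completion with a plain atomic load first and only fall back to the more expensive compare-exchange when the request is still pending.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define REQUEST_COMPLETED ((void *) 1)   /* hypothetical sentinel marking a finished request */

typedef struct {
    _Atomic(void *) sync;   /* NULL = pending, REQUEST_COMPLETED = done, otherwise a waiter object */
} request_t;

/* Attach a waiter to a pending request; return false if the request already completed. */
static bool request_attach_waiter(request_t *req, void *waiter) {
    /* Fast path: a plain load is far cheaper than a cmpxchg and suffices for completed requests. */
    if (atomic_load(&req->sync) == REQUEST_COMPLETED) {
        return false;
    }
    /* Slow path: try to install the waiter; fails if completion raced with this call. */
    void *expected = NULL;
    return atomic_compare_exchange_strong(&req->sync, &expected, waiter);
}

int main(void) {
    request_t done    = { REQUEST_COMPLETED };
    request_t pending = { NULL };
    int waiter;   /* stand-in for a per-thread synchronization object */

    printf("completed request installed waiter? %d\n", request_attach_waiter(&done, &waiter));    /* 0 */
    printf("pending request installed waiter?   %d\n", request_attach_waiter(&pending, &waiter)); /* 1 */
    return 0;
}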
Signed-off-by: Valentin Petrov <[email protected]> (cherry picked from commit 13c8d22)
This patch attempts to open up libfabric resources in order to notify libfabric when our memhooks patcher intercepts free calls. Signed-off-by: William Zhang <[email protected]> (cherry picked from commit 25811e2)
…on after wait completes. We found an issue where, with multiple threads, it is possible for the data not to be in the buffer before MPI_Wait() returns. Testing the buffer later, after MPI_Wait() had returned, showed that the data eventually arrives without the rmb(). We have seen this issue intermittently on Power9 using PAMI, but in theory it could happen with any transport. Signed-off-by: Austen Lauria <[email protected]> (cherry picked from commit 12192f1)
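A minimal illustration of why the read barrier matters, using plain C11 atomics and pthreads rather than the actual PAMI/OPAL code: on weakly ordered CPUs such as Power9, the waiter must order its completion-flag read before the payload read, otherwise it can observe the flag set while the buffer contents are still stale.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int buffer;              /* stands in for the receive buffer */
static atomic_int complete;     /* stands in for the request-completion flag */

static void *producer(void *arg) {
    (void) arg;
    buffer = 42;                                                 /* write the data ...          */
    atomic_store_explicit(&complete, 1, memory_order_release);   /* ... then publish completion */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    /* The acquire load plays the role of the rmb() after wait completes: with a relaxed load
     * instead, reading `buffer` below could legally return stale data on a weakly ordered CPU. */
    while (!atomic_load_explicit(&complete, memory_order_acquire))
        ;

    printf("buffer = %d\n", buffer);    /* ordered after the flag read by the acquire barrier */
    pthread_join(t, NULL);
    return 0;
}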
…alue.
Example:
#include <shmem.h>

int main(void) {
    int rc = 0;
    shmem_init();
    if (0 == shmem_my_pe())
        rc = 1;
    shmem_finalize();
    return rc;
}
If the user returns a non-zero status, `ompi_mpi_finalize()` will not be called and the program terminates abnormally.
Signed-off-by: Boris Karasev <[email protected]>
(cherry picked from commit ceb9259)
Signed-off-by: Geoffrey Paulsen <[email protected]> (cherry picked from commit a93d92b)
There are known issues with the API in libfabric 1.13.0 which will guarantee segfaults when used. These issues are fixed in libfabric 1.13.1, but we do not have a way to detect which patch version of libfabric is used. Thus, delay the usage of the API until the subsequent minor release. Signed-off-by: William Zhang <[email protected]> (cherry picked from commit 190feba)
…hmem_pc Ignoring some generated oshmem .pc files
v4.1.x: common/ofi: Utilize new libfabric API to import memhooks monitor
v4.1.x: coll/hcoll: fixes dtypes mapping
v4.1.x: Improve MPI_Waitall performance for MPI_THREAD_MULTIPLE
v4.1.x: opal/ppc atomics: Optimize the RMB().
v4.1.x: Some string cleanup
v4.1.x: fix a memory hook recursion hang
v4.1.x: fix --display-diffable-map free()
…_v4.1.x v4.1.x: orte: Fix orte_report_silent_errors error reporting path
…nup_v4.1.x v4.1.x: Unlink and rebind socket when session directory already exists
…4.1.x v4.1.x: Silence some ppc atomics warnings
v4.1.x: Correctly process 0 slots with -host option.
v4.1.x: Fix launch when set -u is in users login .rc file.
v4.1.x: Remove hard coded -lmpi from oshc++ wrapper.
v4.1.x: reduce/ireduce: Return MPI_SUCCESS when count == 0 and send == recv.
v4.1.x: Fix wrong affinity under LSF with membind option.
…alize v4.1.x: shmem: fix `oshmem_shmem_finalize()` when `main()` returns not zero value.
v4.1.x: orted_comm: Add debugging message to kill procs.
v4.1.x: osc/pt2pt: Some fixes
…gdouble-fix-v4.1 common/ompio: increase internal default fview size
topo/treematch: Update and sync with fixes from main
common/ucx: fix variable registration
…nux-sysconf-sc-open-max-i-cant-even-dot-dot-dot v4.1.x: odls/default: cap the max number of child FDs to close
Update the NEWS and RELEASE files for 4.1.4rc2. Signed-off-by: Brian Barrett <[email protected]>
dist: Prep for v4.1.4rc2
This is my solution to issue open-mpi#10328. The problem with this code is that inside `ompi_osc_rdma_refresh_dynamic_region` there is an `ompi_osc_rdma_lock_acquire_shared` on the same lock already taken by `ompi_osc_rdma_lock_acquire_exclusive`, generating a deadlock. My solution was simply to remove the outermost lock. I don't know if this is the best solution, but with these changes my test codes ran as I expected.
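The sketch below illustrates the deadlock pattern described above; it is not Open MPI code, and the lock type and helper names are simplified stand-ins. Assuming a non-reentrant shared/exclusive lock, once the exclusive bit is held a nested shared acquisition by the same thread can never succeed, so a blocking acquire would spin forever.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define EXCLUSIVE_BIT 0x80000000u

typedef struct { atomic_uint state; } rw_lock_t;   /* 0 = free, EXCLUSIVE_BIT = writer, else reader count */

static bool try_acquire_exclusive(rw_lock_t *l) {
    unsigned expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, EXCLUSIVE_BIT);
}

static bool try_acquire_shared(rw_lock_t *l) {
    unsigned cur = atomic_load(&l->state);
    if (cur & EXCLUSIVE_BIT) return false;          /* blocked while anyone (even ourselves) holds exclusive */
    return atomic_compare_exchange_strong(&l->state, &cur, cur + 1);
}

int main(void) {
    rw_lock_t lock = { 0 };

    /* The caller takes the lock exclusively (the role played by ompi_osc_rdma_lock_acquire_exclusive). */
    try_acquire_exclusive(&lock);

    /* The nested helper (the role played by ompi_osc_rdma_refresh_dynamic_region) then asks for the
     * same lock shared.  With a non-reentrant lock this never succeeds, so a blocking acquire
     * would spin forever -- the deadlock this patch removes by dropping the outer acquisition. */
    if (!try_acquire_shared(&lock)) {
        printf("nested shared acquire cannot proceed: deadlock\n");
    }
    return 0;
}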
Can one of the admins verify this patch?

ok to test

I don't think removing the outer lock is the right way to go, as that was probably protecting more than just …

@jotabf After looking through the code, I think your patch is correct. The exclusive is not needed and the shared lock in …

@devreal is this relevant to main/v5?

I think this should go back to v5.x, too.

@jotabf can you retarget this for …

@awlauria If I understood correctly, I change the pull request base to the branch …

@jotabf since git is going to try and merge v4.1 to main (not what we want), you probably want to locally checkout … After that's over and merged, we can cherry-pick it back to the release branches.