Shm failure with PSM2 #48
Comments
Adrian, your problem doesn't ring any bells, but I've opened an internal bug report for it. Could you give me a little more info? What version of IFS are you using, and on what distro? |
Also, which MPI are you using? If you could provide the mpirun command line, it would help us understand what might be going on. |
Thanks. CentOS Linux release 7.5.1804. export FI_PROVIDER=psm2. Can you "remind me" how to get the IFS version? |
opaconfig -V reports: 10.8.0.0.204 |
Thanks. Are you using the version of PSM2 that comes packaged with Intel MPI or the upstream version? |
For this it was the PSM2 with Intel MPI but we do have other versions installed on the system. |
okay. |
Adrian, thinking about it, we've never tested PSM2 in conjunction with OMP, and we don't provide strong protections for using PSM2 in a multi-threaded environment. Does the problem still exist if you set |
I can check. I should say that we're not doing any MPI from within OpenMP regions, but I'll check nevertheless. |
Using a single OpenMP thread doesn't help, I'm afraid. How would you suggest I debug the issue? Can I build my own PSM2 from source and modify the failing part to see what's going wrong? Here's a stack trace of the current failure (or at least the relevant part): #0 pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24 |
Debugging PSM2 a bit, the error is happening in this function: psm2_error_t psmi_shm_create(ptl_t *ptl_gen) It's this bit of code that's failing:
It is looking for a specific psm2 file in /dev/shm that isn't on the current host but is on a remote host (I found it by searching /dev/shm on all the hosts used for the run). For instance, this failed on node 22, but the file it failed on (psm2_shm.2955100000000001624020) was on node 17. |
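To check on each node whether a given segment is present locally, the `psm2_shm.` entries in /dev/shm can be listed directly. This is a minimal sketch, not PSM2 code: the helper name `count_shm_objects` is hypothetical, and the `psm2_shm.` prefix is simply taken from the file names reported in this thread.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of PSM2): print and count the entries in
 * `dir` whose names start with `prefix`.
 * Returns -1 if the directory cannot be opened. */
int count_shm_objects(const char *dir, const char *prefix)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    size_t plen = strlen(prefix);
    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        if (strncmp(ent->d_name, prefix, plen) == 0) {
            printf("%s/%s\n", dir, ent->d_name);
            count++;
        }
    }
    closedir(d);
    return count;
}
```

Calling `count_shm_objects("/dev/shm", "psm2_shm.")` from each rank, alongside the hostname, would show which node actually holds the segment a failing rank is trying to open.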
Okay - try a workaround: add -X PSM2_DEVICES=self,hfi to your mpirun line. That will disable the shm device. I have no explanation for why a machine would be trying to open a shared memory handle on a different machine. |
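When comparing runs with and without the workaround, it can help to confirm from inside the application what each rank actually received in its environment. A minimal sketch, assuming only that PSM2_DEVICES is a comma-separated list as in the command above; `device_list_has` and `report_shm_setting` are hypothetical helpers, not PSM2 APIs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if `dev` appears as a whole token in the
 * comma-separated `list`, 0 otherwise. */
int device_list_has(const char *list, const char *dev)
{
    size_t dlen = strlen(dev);
    const char *p = list;
    while (*p != '\0') {
        size_t tlen = strcspn(p, ",");   /* length of the current token */
        if (tlen == dlen && strncmp(p, dev, dlen) == 0)
            return 1;
        p += tlen;
        if (*p == ',')
            p++;                         /* skip the separator */
    }
    return 0;
}

/* Print whether this process's environment explicitly lists the shm device.
 * If PSM2_DEVICES is unset, the library's own default applies; this sketch
 * does not try to guess what that default is. */
void report_shm_setting(void)
{
    const char *val = getenv("PSM2_DEVICES");
    if (val == NULL)
        printf("PSM2_DEVICES is unset (library default applies)\n");
    else
        printf("PSM2_DEVICES=%s, shm %s\n", val,
               device_list_has(val, "shm") ? "listed" : "not listed");
}
```

Calling `report_shm_setting()` early in `main` on every rank would show whether the variable was really exported to the remote nodes.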
It definitely works if we disable the shm device; we're just trying to get shm to work for better performance. |
Adrian, I know it's been 10 days, I just wanted to let you know we are looking at this. |
We have some ideas, but we were wondering if you could try adding the following to a test run: -x PSM2_TRACEMASK=0x40 -x HFI_DEBUG_FILENAME="/tmp/%h.%p.out" This will generate a ton of output in the .out files, but the contents should tell us if different machines are really trying to communicate over SHM. |
Thanks for looking into this. I'll try that out and let you know what it produces. |
I appreciate some time has passed, but I have finally been able to get back to PSM2 and find out where the problem is occurring. I've isolated it (with an OpenMPI application using PSM2) to the function psmi_shm_map_remote in the file ptl_am/am_reqrep_shmem.c (note this was with PSM2 11.2.78). The shm file opening completes correctly, i.e. this works without any error:
The mmap also works, i.e. this works without any error:
However, any attempt to dereference elements of dest_nodeinfo throws the error. This is the first place in the function where that happens, and the program crashes: volatile uint16_t *is_init = &dest_nodeinfo->is_init; Does this provide any pointers (apologies for the pun) on what's going wrong? |
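One possibility worth ruling out, offered as an assumption rather than anything established in this thread: mmap can succeed on an object that is shorter than the requested length, with the failure only appearing when a page beyond the end of the object is first touched. Comparing the size reported by fstat with the mapping length just before the first dereference would confirm or eliminate that. The sketch below is not PSM2 code, and `fd_covers_length` is a hypothetical helper.

```c
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: return 1 if the object behind `fd` is at least
 * `len` bytes long, 0 if it is shorter, and -1 if fstat itself fails. */
int fd_covers_length(int fd, size_t len)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return (st.st_size >= 0 && (size_t)st.st_size >= len) ? 1 : 0;
}
```

Calling this on the descriptor returned by shm_open, with the same length that is passed to mmap, and logging a result of 0 would point to a segment that was opened before its creator had finished sizing it.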
Thanks for the update. Two ideas come to mind:
We have some follow-up questions/requests:
|
Thanks for the response.
|
I'm getting this same error, but ONLY when using MPI_Comm_spawn(). Open MPI 4.1.2 |
Running with Intel MPI and PSM2 on a dual-rail Omni-Path network, we're getting these errors with some applications:
Error opening remote shared memory object in shm_open: No such file or directory (err=9)
PSM could not set up shared memory segment (err=9)
When we look in /dev/shm we see files with names like psm2_shm.295510000000000020e02, but it is still failing. We've tried cleaning up /dev/shm, but it does not seem to help.
We've seen this for PSM2 10.3.46, 11.2.23, 11.2.77, and 11.2.78.
Any idea what's going wrong?