prov/shm: Fix progress error cases and init id paths #10313

zachdworkin · 2024-08-13T23:46:05Z

Fix progress error cases to always be positive ints when calling write_err_comp
Make id always be valid when looking up if pid_fd has been set for level-zero
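
A minimal sketch of the first fix, assuming the usual convention that calls return negative error codes while the error field of a completion should be positive. The names `sketch_err_entry`, `sketch_positive_err`, and `sketch_write_err_comp` are hypothetical stand-ins for illustration only; they are not the provider's real structures or functions, and this is not the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Hypothetical stand-in for an error completion entry (not the libfabric struct). */
struct sketch_err_entry {
	int err;	/* expected to hold a positive errno-style value */
};

/* Hypothetical helper: normalize a return code to a positive error value. */
static int sketch_positive_err(int ret)
{
	if (ret == INT_MIN)	/* -INT_MIN would overflow, so clamp it */
		return INT_MAX;
	return ret < 0 ? -ret : ret;
}

/* Hypothetical stand-in for the write_err_comp step: always stores a positive value. */
static void sketch_write_err_comp(struct sketch_err_entry *entry, int ret)
{
	entry->err = sketch_positive_err(ret);
}
```

With this shape, a caller can pass either `-EIO` or `EIO` and the stored error is `EIO` in both cases.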

prov/shm/src/smr_progress.c

aingerson

Commit messages are a bit confusing, but changes look good!

zachdworkin · 2024-08-15T13:03:17Z

@shijin-aws can you share what the AWS CI failure is?

shijin-aws · 2024-08-15T15:47:02Z

All of the efa provider tests either failed or timed out. I don't have the logs yet, but I can try to reproduce it manually and get back to you today.

[2024-08-14 20:48:29] test_suites/libfabric/test_omb.py::test_omb_pt2pt[mpirun-single_node-openmpi4-disable_xpmem-osu_mbw_mr-host_to_host] FAILED [  3%]

[2024-08-14 21:18:32] test_suites/libfabric/test_omb.py::test_omb_pt2pt[mpirun-single_node-openmpi4-disable_xpmem-osu_multi_lat-host_to_host] FAILED [  3%]

...

shijin-aws · 2024-08-15T17:32:04Z

I ran the same test locally and hit a hang with the following warning:

libfabric:4029839:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029760:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029823:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029830:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029809:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029699:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029843:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029853:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029741:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029806:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029743:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029850:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029615:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: fi_sockaddr_in6://[de:b819:4495:ae21:40::]:0
libfabric:4029621:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029807:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029613:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029846:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029828:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029689:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029775:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
# OSU MPI Multiple Bandwidth / Message Rate Test v7.0-lrbison3
# OSU MPI Multiple Bandwidth / Message Rate Test v7.0-lrbison3

# [ pairs: 48 ] [ window size: 64 ]
# Size                  MB/s        Messages/s

zachdworkin · 2024-08-15T17:56:40Z

@shijin-aws how do you run this test? I would like to try to reproduce it on my cluster.
When I run the pt2pt osu 7.0 osu_mbw_mr using impi with -n 96 -ppn 48 I get the 48 pairs like you have in your test but mine is successful.

shijin-aws · 2024-08-15T18:00:00Z

The test is run via the efa provider, as Open MPI cannot pick shm directly:

mpirun -n 96 -N 96 -x FI_LOG_LEVEL=warn -x LD_LIBRARY_PATH ~/PortaFiducia/build/workloads/omb/openmpi-v4.1.6-installer/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr

shijin-aws · 2024-08-15T18:01:13Z

@zachdworkin OK, if I run with -n 48 -N 48, the test passes. It seems the problem is only triggered by a larger number of processes?

shijin-aws · 2024-08-15T18:01:26Z

(env) ubuntu@ip-172-31-39-234:~/PortaFiducia/build/libraries/libfabric/pr10313/source/libfabric$ mpirun -n 48 -N 48 -x FI_LOG_LEVEL=warn -x LD_LIBRARY_PATH ~/PortaFiducia/build/workloads/omb/openmpi-v4.1.6-installer/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr 
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
No protocol specified
libfabric:4031043:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031019:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030995:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031029:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031002:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030991:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031081:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031009:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030993:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: fi_sockaddr_in://0.0.0.0:0
libfabric:4031055:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030999:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031023:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031078:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030996:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031066:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031087:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031014:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031006:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031000:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: fi_sockaddr_in6://[ac:6f0c:c4a2:2466:40::]:0
libfabric:4031025:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031046:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031001:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030990:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030992:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031037:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031034:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030997:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031017:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031091:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031057:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031098:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030998:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031005:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031047:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: fi_sockaddr_ib://[29:4689:d87b:b860:40::]:0x0:0x756:0x0
libfabric:4031028:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030994:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031096:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031064:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031013:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031051:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031088:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031021:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031102:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031040:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031072:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031084:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031073:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
# OSU MPI Multiple Bandwidth / Message Rate Test v7.0-lrbison3
# OSU MPI Multiple Bandwidth / Message Rate Test v7.0-lrbison3

# [ pairs: 24 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                      99.58       99576540.39
2                     208.69      104345132.39
4                     434.99      108748213.19
8                     879.09      109886100.48
16                   1757.16      109822225.27
32                   3527.86      110245590.71
64                   6499.83      101559896.54
128                 12756.47       99659949.35
256                  9881.83       38600908.08
512                 17957.45       35073153.56
1024                35695.78       34859163.08
2048                64065.95       31282203.77
4096               100319.63       24492095.96
8192               144971.12       17696670.09
16384              225320.32       13752461.07
32768              325620.60        9937152.04
65536              415865.36        6345601.80
131072             469409.82        3581312.74
262144             376902.37        1437768.44
524288             458730.20         874958.41
1048576            506234.31         482782.66
2097152            512654.60         244452.76
4194304            438840.32         104627.68

zachdworkin · 2024-08-15T18:04:36Z

@shijin-aws what version of ompi are you using?

zachdworkin · 2024-08-15T18:05:19Z

@shijin-aws never mind, I see it is 4.1.6

zachdworkin · 2024-08-15T18:05:56Z

@shijin-aws so the failing case is -n 96 -ppn 48?

shijin-aws · 2024-08-15T18:07:33Z

> so the failing case is -n 96 -ppn 48?

No. I ran on a single node, with 96 total processes and 96 PPN.

shijin-aws · 2024-08-15T18:11:43Z

The problem seems to come from fcb61c899. If I revert this commit, the test passes.

shijin-aws · 2024-08-15T18:13:52Z

And also the warning libfabric:4030996:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null) disappears if I revert the commit ^^

Remember that EFA calls shm's fi_av_insert with the FI_AV_USER_ID flag. Can this path be impacted by the commit?

zachdworkin · 2024-08-15T18:15:10Z

@shijin-aws that is very helpful. I noted in that commit that when a peer disconnects and then reconnects with the same name and ID, it will have a valid id but no shm region. There is a TODO to create an smr_unmap_peer function to let it reinitialize everything. Do you think this test is hitting this edge case?

shijin-aws · 2024-08-15T18:16:28Z

I don't think so. This is at the beginning of the test, and I don't see Open MPI even calling fi_av_remove.

shijin-aws · 2024-08-15T18:18:05Z

prov/shm/src/smr_util.c

@@ -554,8 +555,7 @@ int smr_map_add(const struct fi_provider *prov, struct smr_map *map,
 	if (ret) {
 		assert(ret == -FI_EALREADY);
 		*id = (intptr_t) node->data;
-		ofi_spin_unlock(&map->lock);


This seems to change the behavior: you now have smr_map_to_region under the out label, whereas before your change it seems to return without mapping. Is that expected? @aingerson

I think I understand what is going on. It's trying to map to the region when it shouldn't, because the region already exists. So either I need to revert and return 0, or check whether map->peers[id].region is valid and, if it isn't, go to out to map to the region.

There is a check in smr_map_to_region that skips initialization if it's already been set, so I don't think it should do a double mapping. The mapping was added because if the mapping failed on av_insert, we need to trigger it again on the next map_add (in progress_connreq). I'm not saying it isn't related, but it should be skipping the duplicate mapping in the case you're talking about.
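
To make the control flow under discussion concrete, here is a small self-contained sketch, assuming a toy peer table. All names (`sketch_map`, `sketch_map_add`, `sketch_map_to_region`, `region_available`) are hypothetical and only model the idea; this is not the shm provider's code. The "already present" path falls through to the mapping step, and the mapping step returns early when the region is already mapped, so a repeated add retries a failed mapping but does not map twice.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_PEERS 4

/* Hypothetical peer slot; only models the state relevant to the discussion. */
struct sketch_peer {
	char name[32];
	int in_use;
	int region_mapped;	/* nonzero once the peer region is mapped */
	int map_attempts;	/* counts real mapping attempts, for illustration */
};

struct sketch_map {
	struct sketch_peer peers[SKETCH_MAX_PEERS];
	int region_available;	/* simulates whether the peer's region exists yet */
};

/* Maps the peer region unless it is already mapped (the "skip" check). */
static int sketch_map_to_region(struct sketch_map *map, int id)
{
	struct sketch_peer *peer = &map->peers[id];

	if (peer->region_mapped)
		return 0;	/* already mapped: no duplicate mapping */

	peer->map_attempts++;
	if (!map->region_available)
		return -ENOENT;	/* peer has not created its region yet */

	peer->region_mapped = 1;
	return 0;
}

/* Adds a name to the map; an existing name reuses its id and retries mapping. */
static int sketch_map_add(struct sketch_map *map, const char *name, int *id)
{
	int i, free_slot = -1;

	for (i = 0; i < SKETCH_MAX_PEERS; i++) {
		if (map->peers[i].in_use && !strcmp(map->peers[i].name, name)) {
			*id = i;	/* "already in map" path */
			return sketch_map_to_region(map, i);
		}
		if (!map->peers[i].in_use && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -ENOMEM;

	snprintf(map->peers[free_slot].name,
		 sizeof(map->peers[free_slot].name), "%s", name);
	map->peers[free_slot].in_use = 1;
	*id = free_slot;
	return sketch_map_to_region(map, free_slot);
}
```

In this model, a first add can fail to map with `-ENOENT`, a later add of the same name retries and succeeds, and any further add leaves the attempt counter unchanged.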

zachdworkin · 2024-08-16T22:20:03Z

@shijin-aws I just pushed a patch with a bunch of prints. I cannot figure out what the error is, and when I run ofi shm through ompi with osu 7.4 it doesn't hang. Can you please send over the log with all the prints when it finishes?

prov/shm/src/smr_av.c

prov/shm/src/smr_util.c

shijin-aws · 2024-08-19T18:02:28Z

@zachdworkin

--------------------------------- Captured Err ---------------------------------
2024-08-16 22:52:38,471 - INFO - conftest - info: {'EFA Installer Version': '1.34.0', 'Instance Type': 'c5n.18xlarge', 'AMI': 'ami-0ec6d79cb282e8eb3', 'OS': 'Amazon Linux2', 'Account ID': '891377098239', 'AWS Region': 'us-east-1', 'CMA status': True, 'Libfabric Provider': 'efa', 'Cluster Name': 'LibfabricPRCI-PR-10313-3-alinux2-c5n18xlar-yU125c1L', 'Test metrics tag': '', 'Libfabric Commit': 'e1ee305d64e9bb7d75810b530d08568fa5b1de1c'}
2024-08-16 22:52:38,704 - INFO - utils - Running on 1 nodes, 36 processes per node
2024-08-16 22:52:38,705 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.21.230 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/tx_bytes"
2024-08-16 22:52:38,939 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.21.230 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/rx_bytes"
2024-08-16 22:52:39,176 - INFO - utils - Executing command: export PATH=/home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/bin:$PATH;export LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib;/home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/bin/mpirun --wdir . -n 2 --hostfile /home/ec2-user/PortaFiducia/hostfile --map-by ppr:2:node --timeout 1800 -x OMPI_MCA_accelerator=null -x LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib -x PATH  /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v5.0.6a1-v5.0.x/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency  --buffer-num multiple  2>&1 | tee /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v5.0.6a1-v5.0.x/run/pt2pt/osu_latency/node1-ppn2.txt
2024-08-16 22:52:39,987 - INFO - utils - mpirun output:
pid: 13344: adding a new id to the map! id is 0
pid: 13343: adding a new id to the map! id is 0
pid: 13343: found name in list! hello self! id is 0
pid: 13343: smr_av_insert shm_id is 1163882224, name is (null)
pid: 13344: mapping occurred! id is 0, ret is
pid: 13344: smr_av_insert shm_id is 1077867520, name is (null)
pid: 13344: adding a new id to the map! id is 1
pid: 13344: found name in list! hello self! id is 1
[ip-172-31-21-230:13344] *** Process received signal ***
[ip-172-31-21-230:13344] Signal: Segmentation fault (11)
[ip-172-31-21-230:13344] Signal code: Address not mapped (1)
[ip-172-31-21-230:13344] Failing at address: 0x1
pid: 13343: adding a new id to the map! id is 0
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: smr_av_insert shm_id is 1163882224, name is (null)
[ip-172-31-21-230:13344] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7f7e564ad8e0]
[ip-172-31-21-230:13344] [ 1] /lib64/libc.so.6(+0x14d651)[0x7f7e5623c651]
[ip-172-31-21-230:13344] [ 2] /lib64/libc.so.6(_IO_vfprintf+0x1ed7)[0x7f7e5613be47]
[ip-172-31-21-230:13344] [ 3] /lib64/libc.so.6(_IO_printf+0xa6)[0x7f7e56142566]
[ip-172-31-21-230:13344] [ 4] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib/libfabric.so.1(+0xba0f4)[0x7f7e485620f4]
[ip-172-31-21-230:13344] [ 5] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib/libfabric.so.1(+0x6b9bd)[0x7f7e485139bd]
[ip-172-31-21-230:13344] [ 6] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib/libfabric.so.1(+0x6bd89)[0x7f7e48513d89]
[ip-172-31-21-230:13344] [ 7] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/openmpi/mca_mtl_ofi.so(ompi_mtl_ofi_add_procs+0x1e3)[0x7f7e46bf77e3]
[ip-172-31-21-230:13344] [ 8] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/libmpi.so.40(+0x64f8c)[0x7f7e5671ef8c]
[ip-172-31-21-230:13344] [ 9] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/libmpi.so.40(ompi_mpi_instance_init+0x61)[0x7f7e5671f2b1]
[ip-172-31-21-230:13344] [10] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/libmpi.so.40(ompi_mpi_init+0x96)[0x7f7e567170e6]
[ip-172-31-21-230:13344] [11] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f7e5674598e]
[ip-172-31-21-230:13344] [12] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v5.0.6a1-v5.0.x/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x401a45]
[ip-172-31-21-230:13344] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7f7e5611013a]
[ip-172-31-21-230:13344] [14] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v5.0.6a1-v5.0.x/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x40243a]
[ip-172-31-21-230:13344] *** End of error message ***
# OSU MPI Latency Test v7.0-lrbison3
# Size          Latency (us)
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

zachdworkin · 2024-08-19T20:53:25Z

@shijin-aws thank you! I had a print argument backwards... can you send this new output instead? I had addr and shm_id switched in smr_av.c's smr_av_insert call

prov/shm/src/smr_av.c

shijin-aws · 2024-08-20T00:37:02Z

The test is still running, but does this error message tell you anything?

pid: 13344: adding a new id to the map! id is 0
pid: 13343: adding a new id to the map! id is 0
pid: 13343: found name in list! hello self! id is 0
pid: 13343: smr_av_insert shm_id is 1163882224, name is (null)
pid: 13344: mapping occurred! id is 0, ret is
pid: 13344: smr_av_insert shm_id is 1077867520, name is (null)
pid: 13344: adding a new id to the map! id is 1
pid: 13344: found name in list! hello self! id is 1
pid: 13344: found name in list! hello self! id is 1
[ip-172-31-21-230:13344] *** Process received signal ***
[ip-172-31-21-230:13344] Signal: Segmentation fault (11)
[ip-172-31-21-230:13344] Signal code: Address not mapped (1)
[ip-172-31-21-230:13344] Failing at address: 0x1

Is it possible that your code is trying to access an address of 0x1 (your id)?

zachdworkin · 2024-08-20T16:54:31Z

@shijin-aws does the segfault only occur with the prints I added?

shijin-aws · 2024-08-20T21:03:57Z

@zachdworkin I am now running your PR manually on my local VM with Open MPI 4.1.6, and I see this output printed continuously:

pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2
pid: 222767: mapping failed! id is 0, ret is -2
pid: 222767: verifying peer map to region starting fe80::8cf:4dff:fe70:935f_0070_69c1c9ee_03e8
pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2
pid: 222767: mapping failed! id is 0, ret is -2
pid: 222767: verifying peer map to region starting fe80::8cf:4dff:fe70:935f_0070_69c1c9ee_03e8
pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2
pid: 222767: mapping failed! id is 0, ret is -2
pid: 222767: verifying peer map to region starting fe80::8cf:4dff:fe70:935f_0070_69c1c9ee_03e8
pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2
pid: 222767: mapping failed! id is 0, ret is -2
pid: 222767: verifying peer map to region starting fe80::8cf:4dff:fe70:935f_0070_69c1c9ee_03e8
pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2
pid: 222767: mapping failed! id is 0, ret is -2
pid: 222767: verifying peer map to region starting fe80::8cf:4dff:fe70:935f_0070_69c1c9ee_03e8
pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2
pid: 222767: mapping failed! id is 0, ret is -2
pid: 222767: verifying peer map to region starting fe80::8cf:4dff:fe70:935f_0070_69c1c9ee_03e8
pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2
pid: 222767: mapping failed! id is 0, ret is -2
pid: 222767: verifying peer map to region starting fe80::8cf:4dff:fe70:935f_0070_69c1c9ee_03e8
pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2
pid: 222767: mapping failed! id is 0, ret is -2
pid: 222767: verifying peer map to region starting fe80::8cf:4dff:fe70:935f_0070_69c1c9ee_03e8
pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2

Is that expected?

zachdworkin · 2024-08-20T21:08:54Z

@shijin-aws yes, that is expected. One of the peers gets stuck in a verify peer loop and keeps printing that while the other side is stuck. @aingerson helped me find the issue with the patch you identified and I am working on cleaning it up. Thanks for providing all the logs!

Move smr_map_create to an initialization type function. The map doesn't need to be calloc'd because there is a 1:1 relationship between av:map, so there is no reason to allocate them separately. With this change these functions can move into smr_av.c because that is the only place they are called. peer_data.addr.name is not used anywhere and is already initialized to zeros on region creation, so it doesn't need to be initialized.

Signed-off-by: Zach Dworkin <[email protected]>
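
To illustrate the 1:1 av:map point in the commit message above, here is a minimal sketch of embedding a map in its owner instead of allocating it separately. The `demo_*` names and fields are hypothetical and are not the shm provider's actual structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical map type; the real smr map has different fields. */
struct demo_map {
	int num_peers;
};

/* Hypothetical AV type. The map is embedded by value, so it shares the
 * AV's allocation and lifetime instead of needing its own calloc/free. */
struct demo_av {
	struct demo_map map;
};

/* Initialization-style function: fills in caller-owned storage. */
static void demo_map_init(struct demo_map *map)
{
	memset(map, 0, sizeof(*map));
}

/* One allocation covers both the AV and its map. */
static struct demo_av *demo_av_open(void)
{
	struct demo_av *av = calloc(1, sizeof(*av));

	if (!av)
		return NULL;
	demo_map_init(&av->map);
	return av;
}

/* One free releases both. */
static void demo_av_close(struct demo_av *av)
{
	free(av);
}
```

With this layout there is no separate failure path for the map allocation and no risk of freeing the two objects in the wrong order.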

The id field was incorrectly being used as an indicator that a region was mapped. Instead, a valid id shows that a peer is in the rbmap, not that region memory is valid.

Map_to_region has moved out of map_add, and a region can only be initialized in verify_peer or in progress_connreq. This was done to reduce initialization time and memory footprint because there are many cases where communication doesn't happen across every peer. We only need to initialize a region when we need to communicate with that peer.

Previously map_to_endpoint would be able to map to an endpoint if the peer was in the map but the region wasn't mapped. This now protects against segmentation faults because the region must be valid. Since map_to_region was moved out of map_add, there is no point in doing map_to_endpoint on av_insert because the region is not mapped. Map_to_endpoint now happens at the end of map_to_region because we guarantee that the peer is in the map and its region is valid. This endpoint mapping does not need to happen for every possible peer, so we can use a for each call to do it for initialized peers only.

With all of these changes we can silence Coverity issues about accessing the fields in the map without a lock. This is because these fields are properly protected, and region and id are guaranteed to be valid at the point of accessing them.

Signed-off-by: Zach Dworkin <[email protected]>
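
The distinction this commit message draws between "peer is in the map" and "region is mapped" can be sketched roughly as follows. This is an illustration only: every `demo_*` name, the fixed-size table, and the use of -1 as the "not in map" id are assumptions, not the provider's real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PEERS 4

/* Hypothetical stand-in for a mapped shared memory region. */
struct demo_region {
	int placeholder;
};

struct demo_peer {
	int64_t id;                 /* >= 0 once the peer is in the map */
	struct demo_region *region; /* non-NULL only after mapping */
};

struct demo_map {
	struct demo_peer peers[DEMO_MAX_PEERS];
	struct demo_region backing[DEMO_MAX_PEERS]; /* pretend shm */
};

static void demo_map_init(struct demo_map *map)
{
	for (int i = 0; i < DEMO_MAX_PEERS; i++) {
		map->peers[i].id = -1;
		map->peers[i].region = NULL;
	}
}

/* map_add analogue: records the peer but does NOT map its region. */
static int64_t demo_map_add(struct demo_map *map)
{
	for (int i = 0; i < DEMO_MAX_PEERS; i++) {
		if (map->peers[i].id < 0) {
			map->peers[i].id = i;
			return i;
		}
	}
	return -1;
}

/* verify_peer analogue: maps the region lazily on first use and
 * returns NULL rather than a dangling pointer for unknown peers. */
static struct demo_region *demo_verify_peer(struct demo_map *map, int64_t id)
{
	if (id < 0 || id >= DEMO_MAX_PEERS || map->peers[id].id != id)
		return NULL;
	if (!map->peers[id].region)
		map->peers[id].region = &map->backing[id];
	return map->peers[id].region;
}
```

Callers check the pointer returned by `demo_verify_peer` rather than the id, so a peer that was inserted but never contacted cannot be dereferenced by mistake.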

prov/shm/src/smr_av.c

prov/shm/src/smr_progress.c

Err was passed into progress_iov to avoid doing a copy if there was a previous error. This err was discovered to always be 0, so these changes remove it.

Signed-off-by: Zach Dworkin <[email protected]>

Make progress errors ints to match the fi_cq_err_entry struct's err type. This error needs to be positive. The paths have been checked and they all return negative errors, so the err sign is flipped when calling write_err_comp to keep it positive, as the API expects.

Signed-off-by: Zach Dworkin <[email protected]>
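
A rough sketch of the sign convention described in this commit message, using stand-in types so it compiles without libfabric headers. The `demo_*` functions and struct are hypothetical; only the idea (internal paths return negative errno values, the completion entry stores a positive one) comes from the commit message.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the int err field of fi_cq_err_entry. */
struct demo_cq_err_entry {
	int err;
};

/* Hypothetical stand-in for write_err_comp: expects a positive code. */
static void demo_write_err_comp(struct demo_cq_err_entry *entry, int err)
{
	assert(err > 0);
	entry->err = err;
}

/* Hypothetical progress step that reports failure as a negative errno. */
static int demo_progress_step(int should_fail)
{
	return should_fail ? -EIO : 0;
}

/* Runs one step; on failure, flips the sign before reporting. */
static int demo_progress(struct demo_cq_err_entry *entry, int should_fail)
{
	int ret = demo_progress_step(should_fail);

	if (ret)
		demo_write_err_comp(entry, -ret);
	return ret;
}
```

Keeping `ret` as a plain `int` end to end avoids any implicit conversion between signed and unsigned types before the negation.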

zachdworkin · 2024-08-30T19:08:19Z

@shijin-aws what is the latest failure? I have it passing all my tests now.

shijin-aws · 2024-08-30T19:42:23Z

@zachdworkin It's a sockets provider failure

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.37.69 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/fabtests/bin/fi_rdm_tagged_peek -p sockets -s 172.31.37.69'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.43.177 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/fabtests/bin/fi_rdm_tagged_peek -p sockets -s 172.31.43.177 172.31.37.69'"'"''
client_stdout:
libfabric:12823:1724783302::core:core:ofi_xpmem_init():112<warn> Failed to export process virtual address space for use with xpmem
timeout: the monitored command dumped core

client returncode: 255
server_stdout:
libfabric:151227:1724783301::core:core:ofi_xpmem_init():112<warn> Failed to export process virtual address space for use with xpmem
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_rx_pe_entry():2039<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.43.177:38577
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_tx_entry():1961<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.43.177:38577
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_tx_entry():1961<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.43.177:38577
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_tx_entry():1961<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.43.177:38577
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_tx_entry():1961<warn> Peer disconnected: removi

I am not sure AWS CI should even test this provider any longer, as I heard it's not maintained.

shijin-aws · 2024-08-30T19:42:28Z

bot:aws:retest

j-xiong reviewed Aug 14, 2024

prov/shm/src/smr_progress.c (resolved)

zachdworkin force-pushed the ze-shm branch from efff09b to 8b5a5d1 Compare August 14, 2024 20:12

aingerson approved these changes Aug 14, 2024

shijin-aws reviewed Aug 15, 2024

zachdworkin force-pushed the ze-shm branch from 8b5a5d1 to e1ee305 Compare August 16, 2024 22:18

github-advanced-security bot found potential problems Aug 16, 2024

zachdworkin force-pushed the ze-shm branch from e1ee305 to 48e39f2 Compare August 19, 2024 20:52

github-advanced-security bot found potential problems Aug 19, 2024

prov/shm/src/smr_av.c (fixed)

zachdworkin added 2 commits August 23, 2024 15:43

zachdworkin force-pushed the ze-shm branch 2 times, most recently from df919ba to c0822f9 Compare August 27, 2024 17:38

j-xiong reviewed Aug 27, 2024

prov/shm/src/smr_av.c (outdated, resolved)

prov/shm/src/smr_progress.c (outdated, resolved)

zachdworkin added 2 commits August 27, 2024 11:00

prov/shm: Remove unused err path from progress_iov

9634031

Err was passed into progress_iov to avoid doing a copy if there was a previous error. This err was discovered to always be 0, so these changes remove it.

Signed-off-by: Zach Dworkin <[email protected]>

zachdworkin force-pushed the ze-shm branch from c0822f9 to 2c7314b Compare August 27, 2024 18:00

j-xiong approved these changes Aug 27, 2024

j-xiong added the ⚠️ Do not merge label Aug 29, 2024

j-xiong removed the ⚠️ Do not merge label Aug 31, 2024

j-xiong merged commit 3c1fc65 into ofiwg:main Aug 31, 2024
21 checks passed
