
prov/shm: Fix progress error cases and init id paths #10313

Merged
merged 4 commits into ofiwg:main
Aug 31, 2024

Conversation

zachdworkin
Contributor

Fix progress error cases to always be positive ints when calling write_err_comp
Make id always be valid when looking up if pid_fd has been set for level-zero

Contributor

@aingerson left a comment


Commit messages are a bit confusing, but changes look good!

@zachdworkin
Contributor Author

@shijin-aws can you share what the AWS CI failure is?

@shijin-aws
Contributor

shijin-aws commented Aug 15, 2024

All of the efa provider tests either failed or timed out. I don't have the logs yet, but I can try to reproduce it manually and get back to you today.

[2024-08-14 20:48:29] test_suites/libfabric/test_omb.py::test_omb_pt2pt[mpirun-single_node-openmpi4-disable_xpmem-osu_mbw_mr-host_to_host] FAILED [  3%]

[2024-08-14 21:18:32] test_suites/libfabric/test_omb.py::test_omb_pt2pt[mpirun-single_node-openmpi4-disable_xpmem-osu_multi_lat-host_to_host] FAILED [  3%]

...

@shijin-aws
Contributor

Ran the same test locally; I hit a hang with the following warnings:

libfabric:4029839:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029760:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4029615:1723743090::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: fi_sockaddr_in6://[de:b819:4495:ae21:40::]:0
... (the same warning repeats once per rank, almost always with "(null)" as the address)
# OSU MPI Multiple Bandwidth / Message Rate Test v7.0-lrbison3
# OSU MPI Multiple Bandwidth / Message Rate Test v7.0-lrbison3

# [ pairs: 48 ] [ window size: 64 ]
# Size                  MB/s        Messages/s

@zachdworkin
Contributor Author

@shijin-aws how do you run this test? I would like to try to reproduce it on my cluster.
When I run the pt2pt OSU 7.0 osu_mbw_mr using impi with -n 96 -ppn 48, I get the 48 pairs like in your test, but mine succeeds.

@shijin-aws
Contributor

The test is run via the efa provider, as Open MPI cannot pick shm directly

mpirun -n 96 -N 96 -x FI_LOG_LEVEL=warn -x LD_LIBRARY_PATH ~/PortaFiducia/build/workloads/omb/openmpi-v4.1.6-installer/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr

@shijin-aws
Contributor

@zachdworkin OK, if I run -n 48 -N 48, the test passes. It seems the problem is only triggered by a larger process count?

@shijin-aws
Contributor

(env) ubuntu@ip-172-31-39-234:~/PortaFiducia/build/libraries/libfabric/pr10313/source/libfabric$ mpirun -n 48 -N 48 -x FI_LOG_LEVEL=warn -x LD_LIBRARY_PATH ~/PortaFiducia/build/workloads/omb/openmpi-v4.1.6-installer/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr 
No protocol specified
No protocol specified
... ("No protocol specified" repeated once per process)
libfabric:4031043:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4031019:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null)
libfabric:4030993:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: fi_sockaddr_in://0.0.0.0:0
libfabric:4031000:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: fi_sockaddr_in6://[ac:6f0c:c4a2:2466:40::]:0
libfabric:4031047:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: fi_sockaddr_ib://[29:4689:d87b:b860:40::]:0x0:0x756:0x0
... (the same warning repeats once per rank, almost always with "(null)" as the address)
# OSU MPI Multiple Bandwidth / Message Rate Test v7.0-lrbison3
# OSU MPI Multiple Bandwidth / Message Rate Test v7.0-lrbison3

# [ pairs: 24 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                      99.58       99576540.39
2                     208.69      104345132.39
4                     434.99      108748213.19
8                     879.09      109886100.48
16                   1757.16      109822225.27
32                   3527.86      110245590.71
64                   6499.83      101559896.54
128                 12756.47       99659949.35
256                  9881.83       38600908.08
512                 17957.45       35073153.56
1024                35695.78       34859163.08
2048                64065.95       31282203.77
4096               100319.63       24492095.96
8192               144971.12       17696670.09
16384              225320.32       13752461.07
32768              325620.60        9937152.04
65536              415865.36        6345601.80
131072             469409.82        3581312.74
262144             376902.37        1437768.44
524288             458730.20         874958.41
1048576            506234.31         482782.66
2097152            512654.60         244452.76
4194304            438840.32         104627.68

@zachdworkin
Contributor Author

@shijin-aws what version of ompi are you using?

@zachdworkin
Contributor Author

@shijin-aws never mind, I see it is 4.1.6

@zachdworkin
Contributor Author

@shijin-aws so the failing case is -n 96 -ppn 48?

@shijin-aws
Contributor

so the failing case is -n 96 -ppn 48?

No. I ran on a single node; it's 96 total processes and 96 PPN

@shijin-aws
Contributor

The problem seems to come from fcb61c899. If I revert that commit, the test passes

@shijin-aws
Contributor

And also the warning libfabric:4030996:1723744836::shm:av:ofi_av_insert_addr():313<warn> addr already in AV: (null) disappears if I revert the commit ^^

Remember EFA calls shm's fi_av_insert with the FI_AV_USER_ID flag. Could this path be impacted by the commit?

@zachdworkin
Contributor Author

@shijin-aws that is very helpful. I made a comment in that commit noting that in the case where a peer disconnects and then reconnects with the same name and ID, it will have a valid id with no shm region. There is a TODO to create an smr_unmap_peer function to let it re-initialize everything. Do you think this test is hitting this edge case?
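
For readers following along: smr_unmap_peer does not exist yet (it is the TODO mentioned above), so the sketch below is only a rough guess at its intent, with the peer struct reduced to a placeholder. The idea is that a reconnecting peer keeps its valid id while its region mapping is cleared so it can be re-initialized later.

#include <stdint.h>

/* Rough, hypothetical sketch of the TODO'd smr_unmap_peer; the struct and
 * body are placeholders, not the provider's real types. */
struct peer_entry_sketch {
	void *region;   /* mmap'd shared-memory region, NULL when unmapped */
	int64_t id;     /* stays valid; the peer remains in the rbmap */
};

static void smr_unmap_peer_sketch(struct peer_entry_sketch *peer)
{
	if (peer->region) {
		/* munmap()/close() of the stale region would go here */
		peer->region = NULL;
	}
	/* Only the region memory is invalidated; the id is left alone so
	 * the next verify/connreq can map the region again. */
}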

@shijin-aws
Contributor

I don't think so. This is at the beginning of the test, and I don't see Open MPI even call fi_av_remove

@@ -554,8 +555,7 @@ int smr_map_add(const struct fi_provider *prov, struct smr_map *map,
if (ret) {
assert(ret == -FI_EALREADY);
*id = (intptr_t) node->data;
ofi_spin_unlock(&map->lock);
Contributor

This seems to change the behavior: you have smr_map_to_region in the out path, whereas before your change it seems to return without mapping. Is that expected? @aingerson

Contributor Author

I think I understand what is going on. It's trying to map to the region when it shouldn't, because the region already exists... so either I need to revert and return 0, or check if map->peers[id].region is valid and, if it isn't, then go to the out to map to the region.

Contributor

There is a check in smr_map_to_region that skips initialization if it's already been set, so I don't think it should do a double mapping. The mapping was added there because, if the mapping failed on av_insert, we need to trigger it again on the next map_add (in progress_connreq). Not saying it isn't related, but it should be skipping the duplicate mapping in the case you're talking about.
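
To make the point above concrete, here is a minimal sketch of an idempotent map-to-region style guard. It is not the actual smr_map_to_region (which also handles shm_open/mmap and locking); only the early return on an already-set region pointer, which is what prevents a double mapping, is meant to match the description.

/* Placeholder peer type; the real provider struct differs. */
struct peer_sketch {
	void *region;   /* NULL until the shared-memory region is mapped */
};

static int map_to_region_guarded(struct peer_sketch *peer)
{
	if (peer->region)
		return 0;   /* already mapped: skip the duplicate mapping */

	/* shm_open()/mmap() of the peer's region would happen here; if it
	 * fails (e.g. the peer hasn't created its region yet), the caller
	 * retries on the next map_add / progress_connreq pass. */
	return -1;
}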

@zachdworkin
Contributor Author

@shijin-aws I just pushed a patch with a bunch of prints. I cannot figure out what the error is, and when I run ofi shm through OMPI with OSU 7.4 it doesn't have the hang. Can you please send over the log with all the prints when it finishes?

@shijin-aws
Contributor

@zachdworkin

--------------------------------- Captured Err ---------------------------------
2024-08-16 22:52:38,471 - INFO - conftest - info: {'EFA Installer Version': '1.34.0', 'Instance Type': 'c5n.18xlarge', 'AMI': 'ami-0ec6d79cb282e8eb3', 'OS': 'Amazon Linux2', 'Account ID': '891377098239', 'AWS Region': 'us-east-1', 'CMA status': True, 'Libfabric Provider': 'efa', 'Cluster Name': 'LibfabricPRCI-PR-10313-3-alinux2-c5n18xlar-yU125c1L', 'Test metrics tag': '', 'Libfabric Commit': 'e1ee305d64e9bb7d75810b530d08568fa5b1de1c'}
2024-08-16 22:52:38,704 - INFO - utils - Running on 1 nodes, 36 processes per node
2024-08-16 22:52:38,705 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.21.230 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/tx_bytes"
2024-08-16 22:52:38,939 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.21.230 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/rx_bytes"
2024-08-16 22:52:39,176 - INFO - utils - Executing command: export PATH=/home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/bin:$PATH;export LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib;/home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/bin/mpirun --wdir . -n 2 --hostfile /home/ec2-user/PortaFiducia/hostfile --map-by ppr:2:node --timeout 1800 -x OMPI_MCA_accelerator=null -x LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib -x PATH  /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v5.0.6a1-v5.0.x/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency  --buffer-num multiple  2>&1 | tee /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v5.0.6a1-v5.0.x/run/pt2pt/osu_latency/node1-ppn2.txt
2024-08-16 22:52:39,987 - INFO - utils - mpirun output:
pid: 13344: adding a new id to the map! id is 0
pid: 13343: adding a new id to the map! id is 0
pid: 13343: found name in list! hello self! id is 0
pid: 13343: smr_av_insert shm_id is 1163882224, name is (null)
pid: 13344: mapping occurred! id is 0, ret is
pid: 13344: smr_av_insert shm_id is 1077867520, name is (null)
pid: 13344: adding a new id to the map! id is 1
pid: 13344: found name in list! hello self! id is 1
[ip-172-31-21-230:13344] *** Process received signal ***
[ip-172-31-21-230:13344] Signal: Segmentation fault (11)
[ip-172-31-21-230:13344] Signal code: Address not mapped (1)
[ip-172-31-21-230:13344] Failing at address: 0x1
pid: 13343: adding a new id to the map! id is 0
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
pid: 13343: smr_av_insert shm_id is 1163882224, name is (null)
[ip-172-31-21-230:13344] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7f7e564ad8e0]
[ip-172-31-21-230:13344] [ 1] /lib64/libc.so.6(+0x14d651)[0x7f7e5623c651]
[ip-172-31-21-230:13344] [ 2] /lib64/libc.so.6(_IO_vfprintf+0x1ed7)[0x7f7e5613be47]
[ip-172-31-21-230:13344] [ 3] /lib64/libc.so.6(_IO_printf+0xa6)[0x7f7e56142566]
[ip-172-31-21-230:13344] [ 4] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib/libfabric.so.1(+0xba0f4)[0x7f7e485620f4]
[ip-172-31-21-230:13344] [ 5] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib/libfabric.so.1(+0x6b9bd)[0x7f7e485139bd]
[ip-172-31-21-230:13344] [ 6] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/libfabric/lib/libfabric.so.1(+0x6bd89)[0x7f7e48513d89]
[ip-172-31-21-230:13344] [ 7] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/openmpi/mca_mtl_ofi.so(ompi_mtl_ofi_add_procs+0x1e3)[0x7f7e46bf77e3]
[ip-172-31-21-230:13344] [ 8] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/libmpi.so.40(+0x64f8c)[0x7f7e5671ef8c]
[ip-172-31-21-230:13344] [ 9] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/libmpi.so.40(ompi_mpi_instance_init+0x61)[0x7f7e5671f2b1]
[ip-172-31-21-230:13344] [10] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/libmpi.so.40(ompi_mpi_init+0x96)[0x7f7e567170e6]
[ip-172-31-21-230:13344] [11] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v5.0.x/install/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f7e5674598e]
[ip-172-31-21-230:13344] [12] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v5.0.6a1-v5.0.x/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x401a45]
[ip-172-31-21-230:13344] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7f7e5611013a]
[ip-172-31-21-230:13344] [14] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v5.0.6a1-v5.0.x/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x40243a]
[ip-172-31-21-230:13344] *** End of error message ***
# OSU MPI Latency Test v7.0-lrbison3
# Size          Latency (us)
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
pid: 13343: shm_open error: name fe80::8ff:e3ff:fe72:479f_0000_61f5ae29_03e8 errno 2
pid: 13343: mapping failed! id is 0, ret is -2
... (these three lines repeat continuously while pid 13343 spins in the verify-peer loop)
pid: 13343: verifying peer map to region starting fe80::8ff:e3ff:fe72:479f_0001_570a1541_03e8
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------


@zachdworkin
Contributor Author

@shijin-aws thank you! I had a print argument backwards... can you send this new output instead? I had addr and shm_id switched in smr_av.c's smr_av_insert call

@shijin-aws
Contributor

The test is still running, but does this error message tell you something?

pid: 13344: adding a new id to the map! id is 0
pid: 13343: adding a new id to the map! id is 0
pid: 13343: found name in list! hello self! id is 0
pid: 13343: smr_av_insert shm_id is 1163882224, name is (null)
pid: 13344: mapping occurred! id is 0, ret is
pid: 13344: smr_av_insert shm_id is 1077867520, name is (null)
pid: 13344: adding a new id to the map! id is 1
pid: 13344: found name in list! hello self! id is 1
pid: 13344: found name in list! hello self! id is 1
[ip-172-31-21-230:13344] *** Process received signal ***
[ip-172-31-21-230:13344] Signal: Segmentation fault (11)
[ip-172-31-21-230:13344] Signal code: Address not mapped (1)
[ip-172-31-21-230:13344] Failing at address: 0x1

Is it possible that your code is trying to access an address of 0x1 (your id)?
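
For context on the 0x1 fault: if a small integer such as an id of 1 ends up being consumed by a %s conversion because the printf arguments are in the wrong order, vfprintf dereferences address 0x1, which matches the "Failing at address: 0x1" line above. Below is a hedged, self-contained illustration of that failure mode, not the provider's actual debug print.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	intptr_t id = 1;
	const char *name = "peer0";

	/* Correct order: %s consumes the string, %ld consumes the integer. */
	printf("name is %s, id is %ld\n", name, (long) id);

	/* Swapped order: %s would consume the integer 1, so vfprintf
	 * dereferences address 0x1 and segfaults.  Left commented out. */
	/* printf("name is %s, id is %ld\n", (long) id, name); */
	return 0;
}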

@zachdworkin
Contributor Author

@shijin-aws does the segfault only occur with the prints I added?

@shijin-aws
Contributor

@zachdworkin Now I am running your PR manually on my local VM with Open MPI 4.1.6, and I get this printing all the time

pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2
pid: 222767: mapping failed! id is 0, ret is -2
pid: 222767: verifying peer map to region starting fe80::8cf:4dff:fe70:935f_0070_69c1c9ee_03e8
... (these three lines repeat continuously)
pid: 222767: shm_open error: name fe80::8cf:4dff:fe70:935f_0091_63d33ceb_03e8 errno 2

Is that expected?

@zachdworkin
Contributor Author

@shijin-aws yes, that is expected. One of the peers gets into a verify-peer loop and will keep printing that while the other side is stuck. @aingerson helped me find the issue with the patch you identified, and I am working on cleaning it up. Thanks for providing all the logs!

Move smr_map_create to an initialization type function.
The map doesn't need to be calloc'd because there is a
1:1 relationship between av:map so there is no reason to
allocate them separately. With this change these functions
can move into smr_av.c because that is the only place they
are called.

peer_data.addr.name is not used anywhere and is already
initialized to zeros on region creation so it doesn't
need to be initialized.

Signed-off-by: Zach Dworkin <[email protected]>
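
A minimal sketch of the restructuring described above; the type and field names are paraphrased, not the exact prov/shm structures. Because the av-to-map relationship is 1:1, the map can be embedded in the AV and set up by an init-style function instead of being calloc'd separately.

/* Before (paraphrased): the AV held a pointer and smr_map_create()
 * allocated the map on its own:
 *
 *   struct smr_av { struct smr_map *smr_map; ... };
 */

/* After (paraphrased): the map is embedded and initialized in place. */
struct smr_map_sketch {
	int initialized;             /* stand-in for the rbmap, lock, peers[] */
};

struct smr_av_sketch {
	struct smr_map_sketch map;   /* 1:1 with the AV, no separate calloc */
};

static void smr_map_init_sketch(struct smr_map_sketch *map)
{
	map->initialized = 1;        /* real code sets up the rbmap and lock */
}
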
The id field was incorrectly being used as an indicator that
a region was mapped. Instead, a valid id shows that a peer is
in the rbmap, not that the region memory is valid.

Map_to_region has moved out of map_add and a region can only be
initialized in verify_peer or in progress_connreq. This was done
to reduce initialization time and memory footprint because there
are many cases where communication doesn't happen across every
peer. We only need to initialize a region when we need to
communicate with that peer.

Previously map_to_endpoint would be able to map to an endpoint
if the peer was in the map but the region wasn't mapped. This
now protects against segmentation faults because the region must
be valid. Since map_to_region was moved out of map_add there is
no point in doing map_to_endpoint on av_insert because the region
is not mapped. Map_to_endpoint now happens at the end of
map_to_region because we guarantee that the peer is in the map
and its region is valid. This endpoint mapping does not need to
happen for every possible peer so we can use a for each call to
do it for initialized peers only.

With all of these changes we can silence coverity issues about
accessing the fields in the map without a lock. This is because
these fields are properly protected and region and id are
guaranteed to be valid at the point of accessing them.

Signed-off-by: Zach Dworkin <[email protected]>
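
The lazy-mapping flow described in this commit message, sketched very roughly (signatures and names are simplified placeholders; only the control flow is meant to match the description): map_add records the peer and returns an id, and the region plus endpoint mapping happen later, only for peers we actually talk to.

#include <stdint.h>

struct lazy_peer_sketch {
	void *region;        /* NULL until verify_peer/progress_connreq maps it */
};

/* av_insert path: register the peer and return an id.  A valid id now only
 * means "the peer is in the rbmap", not "its region is mapped". */
static int map_add_sketch(struct lazy_peer_sketch *peer, int64_t *id)
{
	*id = 0;             /* stand-in for the rbmap insert result */
	peer->region = NULL; /* no mapping at insert time anymore */
	return 0;
}

/* Called lazily (from verify_peer or progress_connreq) only when
 * communication with this peer is actually needed; endpoint mapping runs
 * at the end, and only for peers whose region was initialized. */
static int map_to_region_sketch(struct lazy_peer_sketch *peer)
{
	if (!peer->region) {
		/* shm_open()/mmap() here, then the map_to_endpoint equivalent */
	}
	return 0;
}
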
Err was passed into progress_iov's error path to avoid
doing a copy if there was a previous error. This err was
discovered to always be 0, so these changes remove it.

Signed-off-by: Zach Dworkin <[email protected]>
Make progress errors ints to match the fi_cq_err_entry
struct's err type. This error needs to be positive; the paths
have been checked and they all return negative errors, so the
err sign is flipped when calling write_err_comp to keep it
positive as the API expects.

Signed-off-by: Zach Dworkin <[email protected]>
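
A small sketch of the sign convention described above, assuming the libfabric headers for the FI_* error codes are available; write_err_comp_sketch is a stand-in for the provider's helper, not its real signature. Progress paths keep negative return codes internally and flip the sign only when filling the completion error.

#include <rdma/fi_errno.h>

/* Stand-in for shm's write_err_comp helper: fi_cq_err_entry.err is a
 * positive errno, so callers must pass a positive value. */
static void write_err_comp_sketch(int err)
{
	(void) err;   /* would be stored as cq_err_entry.err = err */
}

static void progress_path_sketch(void)
{
	int err = -FI_ENOMEM;              /* progress paths return negatives */

	if (err)
		write_err_comp_sketch(-err);   /* flip sign: API expects positive */
}
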
@zachdworkin
Contributor Author

@shijin-aws what is the latest failure? I have it passing all my tests now.

@shijin-aws
Contributor

@zachdworkin It's a sockets provider failure

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.37.69 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/fabtests/bin/fi_rdm_tagged_peek -p sockets -s 172.31.37.69'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.43.177 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10313/install/fabtests/bin/fi_rdm_tagged_peek -p sockets -s 172.31.43.177 172.31.37.69'"'"''
client_stdout:
libfabric:12823:1724783302::core:core:ofi_xpmem_init():112<warn> Failed to export process virtual address space for use with xpmem
timeout: the monitored command dumped core

client returncode: 255
server_stdout:
libfabric:151227:1724783301::core:core:ofi_xpmem_init():112<warn> Failed to export process virtual address space for use with xpmem
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_rx_pe_entry():2039<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.43.177:38577
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_tx_entry():1961<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.43.177:38577
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_tx_entry():1961<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.43.177:38577
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_tx_entry():1961<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.43.177:38577
libfabric:151227:1724783302::sockets:ep_data:sock_pe_progress_tx_entry():1961<warn> Peer disconnected: removi

I am not sure AWS CI should even test this provider any longer, as I heard it's not maintained.

@shijin-aws
Contributor

bot:aws:retest

@j-xiong j-xiong merged commit 3c1fc65 into ofiwg:main Aug 31, 2024
21 checks passed