
prov/cxi: OFI poll failed during MPI calls on LUMI/Adastra #10072

Open
etiennemlb opened this issue Jun 11, 2024 · 3 comments

Comments


etiennemlb commented Jun 11, 2024

I'm far from having a decent understanding of libfabric's cxi provider, so I would like to ask for details about an observed error. I run the MUMPS solver (https://mumps-solver.org/index.php) on Adastra (a machine similar to LUMI and Frontier, but running SS 2.2 and libfabric/prov/cxi 1.20), and we observe, in a predictable manner, the following crash:

MPICH ERROR [Rank 519] [job id 947981.0] [Tue Jun 11 15:11:42 2024] [c1207] - Abort(539602063) (rank 519 in comm 0): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
PMPI_Iprobe(126)...............: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000005, flag=0x7ffdeb64a47c, status=0x7ffdeb64a960) failed
MPID_Iprobe(257)...............:
MPIDI_iprobe_safe(118).........:
MPIDI_iprobe_unsafe(42)........:
MPIDI_OFI_do_iprobe(77)........:
MPIDI_OFI_handle_cq_error(1075): OFI poll failed (ofi_events.c:1077:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)

I'm wondering what is triggering this issue and what could be done to fix it.

On LUMI, the issue has also been seen, and more backtraces are given in:
https://hackmd.io/@mxKVWCKbQd6NvRm0h72YpQ/Sk2JJKnS6
https://hackmd.io/@mxKVWCKbQd6NvRm0h72YpQ/SyjVLT3ra

Thanks.


GuoqiMa commented Jun 24, 2024

Hi @etiennemlb, I am suffering from the same problem now, with MUMPS on a large model. I didn't find any solution here. Could you please tell me how you fixed it? Thank you very much.


GuoqiMa commented Jun 24, 2024

Abort(1687183) on node 84 (rank 84 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffe8b171570, status=0x7ffe8b171990) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
Abort(405913231) on node 47 (rank 47 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffe001734c0, status=0x7ffe001738e0) failed
MPID_Iprobe(385)...............:
MPIDI_iprobe_safe(246).........:
MPIDI_iprobe_unsafe(72)........:
MPIDIG_mpi_iprobe(48)..........:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
Abort(672775823) on node 79 (rank 79 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffce3ea9430, status=0x7ffce3ea9850) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)

etiennemlb (Author) commented

Actually, this is not entirely fixed. Some people are saying that SHS 2.2.0, which introduced libfabric 1.20.1, fixed the issue. That is not my experience, TBH.

But for all the machines that are not going to upgrade to such a recent SHS, there is no fix reported by HPE nor on this repo.

To reproduce the issue, I need 20 bi-socket Genoa nodes, and the issue seems to show up only on that system, without any certainty of reproducibility. The lack of a reproducer is a big problem (not to mention the closed-source MUMPS code). If you can shrink a reproducer, you may be able to forward it to your site's support, which in turn could forward it to HPE.
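
A minimal sketch of the kind of shrunk reproducer this suggests, assuming the trigger is the probe-heavy pattern visible in the error stacks: it only hammers MPI_Iprobe with MPI_ANY_SOURCE/MPI_ANY_TAG and small eager messages, the iteration count and message size are arbitrary, and it does not reproduce MUMPS' actual communication.

```c
/* Hypothetical stress-test sketch (not MUMPS): every rank sends one small
 * message per iteration to a rotating peer while polling with
 * MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG), the call that fails in the
 * error stacks above. Each rank is the destination of exactly one message
 * per iteration, so the final drain loop terminates. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int iterations = 100000; /* arbitrary; tune to the failing scale */
    int received = 0;

    for (int i = 0; i < iterations; ++i) {
        int payload = i;
        int dest = (rank + i) % size; /* rotate destinations across ranks */
        MPI_Request req;
        MPI_Isend(&payload, 1, MPI_INT, dest, i % 1024, MPI_COMM_WORLD, &req);

        /* Poll with the same wildcard probe that appears in the failing
         * stacks, draining incoming messages until our own send completes. */
        int sent = 0;
        while (!sent) {
            MPI_Test(&req, &sent, MPI_STATUS_IGNORE);

            int flag = 0;
            MPI_Status status;
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
            if (flag) {
                int recv_buf;
                MPI_Recv(&recv_buf, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                ++received;
            }
        }
    }

    /* Every rank receives exactly `iterations` messages in total, so keep
     * probing until all of them have arrived. */
    while (received < iterations) {
        int flag = 0;
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (flag) {
            int recv_buf;
            MPI_Recv(&recv_buf, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            ++received;
        }
    }

    if (rank == 0)
        printf("completed %d iterations on %d ranks\n", iterations, size);

    MPI_Finalize();
    return 0;
}
```

If something like this, run at a scale similar to the failing jobs, ever hits the PTLTE_NOT_FOUND or "Transport endpoint is not connected" errors, it becomes a standalone program a site could hand to HPE.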

etiennemlb reopened this Jun 24, 2024