
prov/cxi: OFI poll failed during MPI calls on LUMI/Adastra #10072

Open
etiennemlb opened this issue Jun 11, 2024 · 3 comments

Comments


etiennemlb commented Jun 11, 2024

I'm far from having a decent understanding of libfabric's cxi provider, so I would like to ask for details about an observed error. I run the MUMPS solver (https://mumps-solver.org/index.php) on Adastra (a machine similar to LUMI and Frontier, but running SS 2.2 and libfabric/prov/cxi 1.20), and we observe, in a predictable manner, the following crash:

MPICH ERROR [Rank 519] [job id 947981.0] [Tue Jun 11 15:11:42 2024] [c1207] - Abort(539602063) (rank 519 in comm 0): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
PMPI_Iprobe(126)...............: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000005, flag=0x7ffdeb64a47c, status=0x7ffdeb64a960) failed
MPID_Iprobe(257)...............:
MPIDI_iprobe_safe(118).........:
MPIDI_iprobe_unsafe(42)........:
MPIDI_OFI_do_iprobe(77)........:
MPIDI_OFI_handle_cq_error(1075): OFI poll failed (ofi_events.c:1077:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)

I'm wondering what is triggering this issue and what could be done to fix it.

On LUMI, the issue has also been seen, and more backtraces are given in:
https://hackmd.io/@mxKVWCKbQd6NvRm0h72YpQ/Sk2JJKnS6
https://hackmd.io/@mxKVWCKbQd6NvRm0h72YpQ/SyjVLT3ra

Thanks.


GuoqiMa commented Jun 24, 2024

Hi @etiennemlb, I am suffering from the same problem now, with MUMPS on a large model. I didn't find any solution here. Could you please tell me how you fixed it? Thank you very much.


GuoqiMa commented Jun 24, 2024

Abort(1687183) on node 84 (rank 84 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffe8b171570, status=0x7ffe8b171990) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
Abort(405913231) on node 47 (rank 47 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffe001734c0, status=0x7ffe001738e0) failed
MPID_Iprobe(385)...............:
MPIDI_iprobe_safe(246).........:
MPIDI_iprobe_unsafe(72)........:
MPIDIG_mpi_iprobe(48)..........:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
Abort(672775823) on node 79 (rank 79 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffce3ea9430, status=0x7ffce3ea9850) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)

etiennemlb (Author) commented

Actually, this is not entirely fixed. Some people are saying that SHS 2.2.0, which introduced libfabric 1.20.1, fixed the issue. That is not my experience, TBH.

But for all the machines that are not going to upgrade to such a recent SHS, there is no fix reported by HPE nor on this repo.

To reproduce the issue, I need 20 bi-socket Genoa nodes, and the issue seems to show up only on that system, without any certainty of reproducibility. The lack of a reproducer is a big problem (not to mention the closed-source MUMPS code). If you can shrink a reproducer, you may be able to forward it to your site's support, which in turn could forward it to HPE.
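
A minimal sketch of the kind of shrunk reproducer this suggests, assuming the trigger is the probe-heavy pattern visible in the error stacks: it only hammers MPI_Iprobe with MPI_ANY_SOURCE/MPI_ANY_TAG and small eager messages, the iteration count and message size are arbitrary, and it does not reproduce MUMPS' actual communication.

```c
/* Hypothetical stress-test sketch (not MUMPS): every rank sends one small
 * message per iteration to a rotating peer while polling with
 * MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG), the call that fails in the
 * error stacks above. Each rank is the destination of exactly one message
 * per iteration, so the final drain loop terminates. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int iterations = 100000; /* arbitrary; tune to the failing scale */
    int received = 0;

    for (int i = 0; i < iterations; ++i) {
        int payload = i;
        int dest = (rank + i) % size; /* rotate destinations across ranks */
        MPI_Request req;
        MPI_Isend(&payload, 1, MPI_INT, dest, i % 1024, MPI_COMM_WORLD, &req);

        /* Poll with the same wildcard probe that appears in the failing
         * stacks, draining incoming messages until our own send completes. */
        int sent = 0;
        while (!sent) {
            MPI_Test(&req, &sent, MPI_STATUS_IGNORE);

            int flag = 0;
            MPI_Status status;
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
            if (flag) {
                int recv_buf;
                MPI_Recv(&recv_buf, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                ++received;
            }
        }
    }

    /* Every rank receives exactly `iterations` messages in total, so keep
     * probing until all of them have arrived. */
    while (received < iterations) {
        int flag = 0;
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (flag) {
            int recv_buf;
            MPI_Recv(&recv_buf, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            ++received;
        }
    }

    if (rank == 0)
        printf("completed %d iterations on %d ranks\n", iterations, size);

    MPI_Finalize();
    return 0;
}
```

If something like this, run at a scale similar to the failing jobs, ever hits the PTLTE_NOT_FOUND or "Transport endpoint is not connected" errors, it becomes a standalone program a site could hand to HPE.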

etiennemlb reopened this Jun 24, 2024