
Cannot run large-scale MPI jobs with collectives with UCX 1.18 + OpenMPI 5 #10522

Open

kah3f opened this issue Feb 26, 2025 · 3 comments

kah3f commented Feb 26, 2025

Describe the bug

Running OpenMPI 5.0.7 with a simple example code that does a collective (mpi_allgather), I get, from each process,
rcache.c:247 UCX ERROR mmap(size=151552) failed: Cannot allocate memory
rcache.c:674 UCX ERROR Failed to allocate invalidation entry for 0x149cb0ca4000..0x149cb4000000, data corruption may occur
eventually reaching
array.c:44 UCX ERROR failed to grow &worker->ep_config from 0 to 32 elems of 7744 bytes
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Steps to Reproduce

In a slurm script:
mpirun -mca btl ^uct ./mpi_ag

  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
    UCX 1.18.0
  • Any UCX environment variables used
    export UCX_IB_RCACHE_MAX_REGIONS="262144"
    export SLURM_CPU_BIND_TYPE='cores'
    export OMPI_MCA_btl='^openib'
    export OMP_

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • cat /etc/issue or cat /etc/redhat-release + uname -a
      Rocky Linux release 8.9 (Green Obsidian)
      Linux udc-ba38-32c0 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
    • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      rpm -q rdma-core or rpm -q libibverbs

      rdma-core-2307mlnx47-1.2310213.x86_64

    • or: MLNX_OFED version ofed_info -s
      MLNX_OFED_LINUX-23.10-2.1.3.1

  • HW information from ibstat or ibv_devinfo -vv command
    CA 'mlx5_2'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.38.1002
    Hardware version: 0
    Node GUID: 0xa088c20300c835dc
    System image GUID: 0xa088c20300c835dc
    Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 400
    Base lid: 592
    LMC: 0
    SM lid: 723
    Capability mask: 0xa751e848
    Port GUID: 0xa088c20300c835dc
    Link layer: InfiniBand
    CA 'mlx5_bond_0'
    CA type: MT4127
    Number of ports: 1
    Firmware version: 26.38.1002
    Hardware version: 0
    Node GUID: 0x387c760300948587
    System image GUID: 0x387c760300948587
    Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 25
    Base lid: 0
    LMC: 0
    SM lid: 0
    Capability mask: 0x00010000
    Port GUID: 0x3a7c76fffe948587
    Link layer: Ethernet

  • For GPU related issues:

    • GPU type
    • Cuda:
      • Drivers version
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
        Not building with GPU in this case

Additional information (depending on the issue)

  • OpenMPI version
    5.0.7

    This is only happening for collective communications at large scales (e.g. 800 cores). Smaller jobs with collectives work (e.g. 160 processes). The nodes have NDR InfiniBand hardware.

  • Output of ucx_info -d to show transports and devices recognized by UCX
    attached below

  • Configure result - config.log

  • This is the configure part from the EasyBuild log. It is from a build without logging enabled; for a build with logging, --disable-logging --disable-debug --disable-assertions would be absent and --enable-logging would be present.

    ./configure --prefix=/apps/software/standard/compiler/gcc/14.2.0/ucx/1.18.0 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --without-go --disable-doxygen-doc --disable-logging --disable-debug --disable-assertions --disable-params-check
    == 2025-02-20 15:42:00,883 run.py:703 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|.?\w) found:
    configure: XPMEM - failed to open the requested location (guess), guessing ...
    == 2025-02-20 15:42:00,883 run.py:660 WARNING Found 1 errors in command output (output: configure: XPMEM - failed to open the requested location (guess), guessing ...)
    == 2025-02-20 15:42:00,884 build_log.py:267 INFO ... (took 20 secs)
    == 2025-02-20 15:42:00,884 build_log.py:267 INFO building...
    == 2025-02-20 15:42:00,884 easyblock.py:3901 INFO Starting build step
    == 2025-02-20 15:42:00,884 easyconfig.py:1690 INFO Generating template values...
    == 2025-02-20 15:42:00,884 easyconfig.py:1709 INFO Template values: arch='x86_64', bitbucket_account='ucx', builddir='/tmp/uvacse/UCX/1.18.0/GCC-14.2.0', github_account='ucx', installdir='/apps/software/standard/compiler/gcc/14.2.0/ucx/1.18.0', module_name='ucx/1.18.0', name='UCX', nameletter='U', nameletterlower='u', namelower='ucx', parallel='40', start_dir='/tmp/uvacse/UCX/1.18.0/GCC-14.2.0/ucx-1.18.0/', toolchain_name='GCC', toolchain_version='14.2.0', version='1.18.0', version_major='1', version_major_minor='1.18', version_minor='18', versionprefix='', versionsuffix=''
    == 2025-02-20 15:42:00,884 easyblock.py:3909 INFO Running method build_step part of step build
    == 2025-02-20 15:42:00,884 configuremake.py:350 INFO Building target ''
    == 2025-02-20 15:42:00,884 run.py:236 INFO running cmd: make -j 40 V=1
    == 2025-02-20 15:43:20,735 run.py:648 INFO cmd " make -j 40 V=1" exited with exit code 0 and output:

  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
    abbreviated version attached

ucx.txt

ucx_short.log

kah3f added the Bug label Feb 26, 2025

@brminich
Contributor

The attached log is truncated. Is it a regression (i.e., did you try to run the same app with an older OMPI/UCX)?
Is it possible to share the application source code?
It seems that at least one of the nodes reached its memory limit.

@kah3f
Author

kah3f commented Feb 27, 2025

The full log is 300MB. All processes printed the same info. I can try to extract it from one.

I can run the code at this scale with UCX 1.15 and OpenMPI 4.1.4, but we had to build that OpenMPI with --without-verbs or it would also hang and/or segfault. The same code also runs without problems with Intel/Intel MPI 2025 (the newer LLVM-based oneAPI); I don't know what transport layer Intel MPI is using, maybe pmi2. But there are no application-level memory issues with either of those other MPI distributions.

I have attached the source. It's in Fortran. I also have a C++ version, since these are example programs I have used to teach MPI, but I'm using the Fortran one for testing because I'm more comfortable with it. I had to rename it (sorry) since GitHub doesn't seem to allow attachments ending in .f90.

allgather.f90.txt
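For reference, here is a minimal sketch of an mpi_allgather test along these lines. This is an illustrative reconstruction, not the attached file; the per-rank element count n and the datatype are placeholders.

```fortran
! Minimal MPI_Allgather test (illustrative sketch only; not the attached allgather.f90).
program allgather_test
   use mpi
   implicit none
   integer :: ierr, rank, nprocs
   integer, parameter :: n = 1024          ! placeholder per-rank element count
   double precision, allocatable :: sendbuf(:), recvbuf(:)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   ! Each rank contributes n elements; every rank receives n*nprocs elements,
   ! so the receive buffer grows linearly with the number of processes.
   allocate(sendbuf(n), recvbuf(n*nprocs))
   sendbuf = dble(rank)

   call MPI_Allgather(sendbuf, n, MPI_DOUBLE_PRECISION, &
                      recvbuf, n, MPI_DOUBLE_PRECISION, &
                      MPI_COMM_WORLD, ierr)

   if (rank == 0) print *, 'allgather done, received elements:', size(recvbuf)

   deallocate(sendbuf, recvbuf)
   call MPI_Finalize(ierr)
end program allgather_test
```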

@kah3f
Author

kah3f commented Feb 28, 2025

I have winnowed down the debug log to about 1/10 the size by confining it to one node (the job requested 80 cores on each of 10 nodes). I do not know whether this output is more helpful but at least it seems manageable.

ucx_short.log
