Running a simple example code that performs a collective (mpi_allgather) with OpenMPI 5.0.7, I get, from each process,
rcache.c:247 UCX ERROR mmap(size=151552) failed: Cannot allocate memory
rcache.c:674 UCX ERROR Failed to allocate invalidation entry for 0x149cb0ca4000..0x149cb4000000, data corruption may occur
eventually reaching
array.c:44 UCX ERROR failed to grow &worker->ep_config from 0 to 32 elems of 7744 bytes
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Steps to Reproduce
In a slurm script:
mpirun -mca btl ^uct ./mpi_ag
UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
UCX 1.18.0
Any UCX environment variables used
export UCX_IB_RCACHE_MAX_REGIONS="262144"
export SLURM_CPU_BIND_TYPE='cores'
export OMPI_MCA_btl='^openib'
export OMP_
Setup and versions
OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
cat /etc/issue or cat /etc/redhat-release + uname -a
Rocky Linux release 8.9 (Green Obsidian)
Linux udc-ba38-32c0 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
For RDMA/IB/RoCE related issues:
Driver version: rpm -q rdma-core or rpm -q libibverbs
HW information from ibstat or ibv_devinfo -vv command
CA 'mlx5_2'
CA type: MT4129
Number of ports: 1
Firmware version: 28.38.1002
Hardware version: 0
Node GUID: 0xa088c20300c835dc
System image GUID: 0xa088c20300c835dc
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 592
LMC: 0
SM lid: 723
Capability mask: 0xa751e848
Port GUID: 0xa088c20300c835dc
Link layer: InfiniBand
CA 'mlx5_bond_0'
CA type: MT4127
Number of ports: 1
Firmware version: 26.38.1002
Hardware version: 0
Node GUID: 0x387c760300948587
System image GUID: 0x387c760300948587
Port 1:
State: Active
Physical state: LinkUp
Rate: 25
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x3a7c76fffe948587
Link layer: Ethernet
For GPU related issues:
GPU type
Cuda:
Drivers version
Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
Not building with GPU in this case
Additional information (depending on the issue)
OpenMPI version
5.0.7
This only happens for collective communications at large scale (e.g. 800 cores). Smaller jobs with collectives (e.g. 160 processes) work. The nodes have NDR InfiniBand hardware.
Output of ucx_info -d to show transports and devices recognized by UCX
attached below
Configure result - config.log
This is the configure step from the EasyBuild log. It comes from a build without logging enabled; in the logging-enabled build, --disable-logging --disable-debug --disable-assertions are absent and --enable-logging is present.
./configure --prefix=/apps/software/standard/compiler/gcc/14.2.0/ucx/1.18.0 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --without-go --disable-doxygen-doc --disable-logging --disable-debug --disable-assertions --disable-params-check
== 2025-02-20 15:42:00,883 run.py:703 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|.?\w) found:
configure: XPMEM - failed to open the requested location (guess), guessing ...
== 2025-02-20 15:42:00,883 run.py:660 WARNING Found 1 errors in command output (output: configure: XPMEM - failed to open the requested location (guess), guessing ...)
== 2025-02-20 15:42:00,884 build_log.py:267 INFO ... (took 20 secs)
== 2025-02-20 15:42:00,884 build_log.py:267 INFO building...
== 2025-02-20 15:42:00,884 easyblock.py:3901 INFO Starting build step
== 2025-02-20 15:42:00,884 easyconfig.py:1690 INFO Generating template values...
== 2025-02-20 15:42:00,884 easyconfig.py:1709 INFO Template values: arch='x86_64', bitbucket_account='ucx', builddir='/tmp/uvacse/UCX/1.18.0/GCC-14.2.0', github_account='ucx', installdir='/apps/software/standard/compiler/gcc/14.2.0/ucx/1.18.0', module_name='ucx/1.18.0', name='UCX', nameletter='U', nameletterlower='u', namelower='ucx', parallel='40', start_dir='/tmp/uvacse/UCX/1.18.0/GCC-14.2.0/ucx-1.18.0/', toolchain_name='GCC', toolchain_version='14.2.0', version='1.18.0', version_major='1', version_major_minor='1.18', version_minor='18', versionprefix='', versionsuffix=''
== 2025-02-20 15:42:00,884 easyblock.py:3909 INFO Running method build_step part of step build
== 2025-02-20 15:42:00,884 configuremake.py:350 INFO Building target ''
== 2025-02-20 15:42:00,884 run.py:236 INFO running cmd: make -j 40 V=1
== 2025-02-20 15:43:20,735 run.py:648 INFO cmd " make -j 40 V=1" exited with exit code 0 and output:
Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
abbreviated version attached
The attached log is truncated. Is it a regression? (i.e. did you try to run the same app with older OMPI/UCX)
Is it possible to share application source code?
It seems that at least one of the nodes reached its memory limit.
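One way to sanity-check the memory-limit reading is to decode the sizes that appear in the three error lines from the report. The sketch below only does arithmetic on those numbers; the 4096-byte page size is an assumption (the usual x86_64 base page), and the variable names are made up for illustration.

```python
# Sizes copied from the error lines in the report.
PAGE_SIZE = 4096  # assumption: standard x86_64 base page size

mmap_size = 151552                                   # rcache.c:247 mmap(size=...)
inv_start, inv_end = 0x149cb0ca4000, 0x149cb4000000  # rcache.c:674 range
ep_config_elems, ep_config_elem_size = 32, 7744      # array.c:44 grow request

mmap_pages = mmap_size // PAGE_SIZE
inv_range_bytes = inv_end - inv_start
ep_config_bytes = ep_config_elems * ep_config_elem_size

print(f"failed mmap:        {mmap_size} bytes = {mmap_pages} pages")
print(f"invalidated range:  {inv_range_bytes} bytes "
      f"({inv_range_bytes / 2**20:.1f} MiB)")
print(f"ep_config request:  {ep_config_bytes} bytes "
      f"({ep_config_bytes / 1024:.0f} KiB)")
```

The failing requests are small (a few dozen pages and a few hundred KiB), so the process could not obtain even modest allocations by that point. That fits a node or per-process limit being exhausted rather than one oversized request, though it does not say which limit was hit.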
The full log is 300MB. All processes printed the same info. I can try to extract it from one.
I can run the code at this scale with UCX 1.15 and OpenMPI 4.1.4, but we had to build OpenMPI with --without-verbs or it would also hang and/or segfault. The same code also runs without problems with Intel/IntelMPI 2025 (the newer LLVM-based OneAPI). I don't know what transport layer they (IntelMPI) are using, maybe pmi2. But there are no application-based memory issues with either of these other MPI distributions.
I have attached the source. It's in Fortran. I also have a C++ version, since these are example programs I have used to teach MPI, but I'm using the Fortran one for testing because I'm more comfortable with it. I had to rename it (sorry) since GitHub doesn't seem to allow files ending in .f90.
I have winnowed down the debug log to about 1/10 the size by confining it to one node (the job requested 80 cores on each of 10 nodes). I do not know whether this output is more helpful but at least it seems manageable.
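For anyone who needs to cut a similar log down, here is a minimal sketch of filtering by node. It assumes each UCX log line carries a `[hostname:pid:tid]` tag after the timestamp, as UCX normally prints it. The function name and the sample lines are hypothetical and are not taken from the attached log.

```python
from typing import Iterable, Iterator


def lines_for_host(lines: Iterable[str], host: str) -> Iterator[str]:
    """Yield only the log lines tagged with the given hostname.

    Assumes the UCX line prefix contains "[<hostname>:<pid>:<tid>]".
    """
    tag = f"[{host}:"
    for line in lines:
        if tag in line:
            yield line


# Hypothetical sample lines, made up to illustrate the assumed format.
sample = [
    "[1740083000.000001] [udc-ba38-32c0:1234:0] rcache.c:247 UCX ERROR mmap failed",
    "[1740083000.000002] [other-node:5678:0] rcache.c:247 UCX ERROR mmap failed",
    "[1740083000.000003] [udc-ba38-32c0:1235:0] array.c:44 UCX ERROR failed to grow",
]

kept = list(lines_for_host(sample, "udc-ba38-32c0"))
for line in kept:
    print(line)
```

For a 300MB file, passing the open file object as `lines` streams it line by line, so the whole log never has to fit in memory.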
Driver version (rpm -q rdma-core, ofed_info -s): rdma-core-2307mlnx47-1.2310213.x86_64, MLNX_OFED_LINUX-23.10-2.1.3.1
ucx.txt
ucx_short.log