
Cannot run large-scale MPI jobs with collectives with UCX 1.18 + OpenMPI 5 #10522

Open

kah3f opened this issue Feb 26, 2025 · 3 comments

kah3f commented Feb 26, 2025

Describe the bug

Running OpenMPI 5.0.7 with a simple example code that does a collective (mpi_allgather), I get, from each process,
rcache.c:247 UCX ERROR mmap(size=151552) failed: Cannot allocate memory
rcache.c:674 UCX ERROR Failed to allocate invalidation entry for 0x149cb0ca4000..0x149cb4000000, data corruption may occur
eventually reaching
array.c:44 UCX ERROR failed to grow &worker->ep_config from 0 to 32 elems of 7744 bytes
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Steps to Reproduce

In a slurm script:
mpirun -mca btl ^uct ./mpi_ag

  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
    UCX 1.18.0
  • Any UCX environment variables used
    export UCX_IB_RCACHE_MAX_REGIONS="262144"
    export SLURM_CPU_BIND_TYPE='cores'
    export OMPI_MCA_btl='^openib'
    export OMP_

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • cat /etc/issue or cat /etc/redhat-release + uname -a
      Rocky Linux release 8.9 (Green Obsidian)
      Linux udc-ba38-32c0 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
    • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      rpm -q rdma-core or rpm -q libibverbs

      rdma-core-2307mlnx47-1.2310213.x86_64

    • or: MLNX_OFED version ofed_info -s
      MLNX_OFED_LINUX-23.10-2.1.3.1

  • HW information from ibstat or ibv_devinfo -vv command
    CA 'mlx5_2'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.38.1002
    Hardware version: 0
    Node GUID: 0xa088c20300c835dc
    System image GUID: 0xa088c20300c835dc
    Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 400
    Base lid: 592
    LMC: 0
    SM lid: 723
    Capability mask: 0xa751e848
    Port GUID: 0xa088c20300c835dc
    Link layer: InfiniBand
    CA 'mlx5_bond_0'
    CA type: MT4127
    Number of ports: 1
    Firmware version: 26.38.1002
    Hardware version: 0
    Node GUID: 0x387c760300948587
    System image GUID: 0x387c760300948587
    Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 25
    Base lid: 0
    LMC: 0
    SM lid: 0
    Capability mask: 0x00010000
    Port GUID: 0x3a7c76fffe948587
    Link layer: Ethernet

  • For GPU related issues:

    • GPU type
    • Cuda:
      • Drivers version
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
        Not building with GPU in this case

Additional information (depending on the issue)

  • OpenMPI version
    5.0.7

    This is only happening for collective communications at large scales (e.g. 800 cores). Smaller jobs with collectives work (e.g. 160 processes). The nodes have NDR InfiniBand hardware.

  • Output of ucx_info -d to show transports and devices recognized by UCX
    attached below

  • Configure result - config.log

  • This is the configure part from the EasyBuild log. It is from a build without logging enabled; for a build with logging, --disable-logging --disable-debug --disable-assertions would be absent and --enable-logging would be present.

    ./configure --prefix=/apps/software/standard/compiler/gcc/14.2.0/ucx/1.18.0 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --without-go --disable-doxygen-doc --disable-logging --disable-debug --disable-assertions --disable-params-check
    == 2025-02-20 15:42:00,883 run.py:703 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|.?\w) found:
    configure: XPMEM - failed to open the requested location (guess), guessing ...
    == 2025-02-20 15:42:00,883 run.py:660 WARNING Found 1 errors in command output (output: configure: XPMEM - failed to open the requested location (guess), guessing ...)
    == 2025-02-20 15:42:00,884 build_log.py:267 INFO ... (took 20 secs)
    == 2025-02-20 15:42:00,884 build_log.py:267 INFO building...
    == 2025-02-20 15:42:00,884 easyblock.py:3901 INFO Starting build step
    == 2025-02-20 15:42:00,884 easyconfig.py:1690 INFO Generating template values...
    == 2025-02-20 15:42:00,884 easyconfig.py:1709 INFO Template values: arch='x86_64', bitbucket_account='ucx', builddir='/tmp/uvacse/UCX/1.18.0/GCC-14.2.0', github_account='ucx', installdir='/apps/software/standard/compiler/gcc/14.2.0/ucx/1.18.0', module_name='ucx/1.18.0', name='UCX', nameletter='U', nameletterlower='u', namelower='ucx', parallel='40', start_dir='/tmp/uvacse/UCX/1.18.0/GCC-14.2.0/ucx-1.18.0/', toolchain_name='GCC', toolchain_version='14.2.0', version='1.18.0', version_major='1', version_major_minor='1.18', version_minor='18', versionprefix='', versionsuffix=''
    == 2025-02-20 15:42:00,884 easyblock.py:3909 INFO Running method build_step part of step build
    == 2025-02-20 15:42:00,884 configuremake.py:350 INFO Building target ''
    == 2025-02-20 15:42:00,884 run.py:236 INFO running cmd: make -j 40 V=1
    == 2025-02-20 15:43:20,735 run.py:648 INFO cmd " make -j 40 V=1" exited with exit code 0 and output:

  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
    abbreviated version attached

ucx.txt

ucx_short.log

kah3f added the Bug label Feb 26, 2025

@brminich
Contributor

The attached log is truncated. Is it a regression (i.e., did you try to run the same app with an older OMPI/UCX)?
Is it possible to share the application source code?
It seems that at least one of the nodes reached its memory limit.

@kah3f
Author

kah3f commented Feb 27, 2025

The full log is 300MB. All processes printed the same info. I can try to extract it from one.

I can run the code at this scale with UCX 1.15 and OpenMPI 4.1.4, but we had to build that OpenMPI with --without-verbs or it would also hang and/or segfault. The same code also runs without problems with Intel/Intel MPI 2025 (the newer LLVM-based oneAPI); I don't know what transport layer Intel MPI is using, maybe pmi2. But there are no application-level memory issues with either of those other MPI distributions.

I have attached the source. It's in Fortran. I also have a C++ version, since these are example programs I have used to teach MPI, but I'm using the Fortran one for testing because I'm more comfortable with it. I had to rename it (sorry) since GitHub doesn't seem to allow attachments ending in .f90.

allgather.f90.txt
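For reference, here is a minimal sketch of an mpi_allgather test along these lines. This is an illustrative reconstruction, not the attached file; the per-rank element count n and the datatype are placeholders.

```fortran
! Minimal MPI_Allgather test (illustrative sketch only; not the attached allgather.f90).
program allgather_test
   use mpi
   implicit none
   integer :: ierr, rank, nprocs
   integer, parameter :: n = 1024          ! placeholder per-rank element count
   double precision, allocatable :: sendbuf(:), recvbuf(:)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   ! Each rank contributes n elements; every rank receives n*nprocs elements,
   ! so the receive buffer grows linearly with the number of processes.
   allocate(sendbuf(n), recvbuf(n*nprocs))
   sendbuf = dble(rank)

   call MPI_Allgather(sendbuf, n, MPI_DOUBLE_PRECISION, &
                      recvbuf, n, MPI_DOUBLE_PRECISION, &
                      MPI_COMM_WORLD, ierr)

   if (rank == 0) print *, 'allgather done, received elements:', size(recvbuf)

   deallocate(sendbuf, recvbuf)
   call MPI_Finalize(ierr)
end program allgather_test
```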

@kah3f
Author

kah3f commented Feb 28, 2025

I have winnowed down the debug log to about 1/10 the size by confining it to one node (the job requested 80 cores on each of 10 nodes). I do not know whether this output is more helpful but at least it seems manageable.

ucx_short.log
