Segfault in uct_am_short_fill_data when transferring cudaMalloc3D allocated regions #10526

Open
uranix opened this issue Mar 1, 2025 · 0 comments · May be fixed by #10527
uranix commented Mar 1, 2025

Describe the bug

Transferring memory allocated by cudaMalloc3D via MPI can cause a segfault. UCX apparently fails to identify the address as device memory and accesses it as host memory, which causes the crash.

Steps to Reproduce

  • Compile and run the program test.txt (rename test.txt to test.cu); a reconstructed sketch of the program is shown below, after this list.

    $ module purge
    $ module load nvhpc
    $ mpic++ test.cu -o test
    $ mpirun -n 2 ./test

    The program segfaults with a message:

    rank = 1, ptr = 0x7f98a3200000, pitch = 512b, xsize = 200b, ysize = 100
    rank = 1, exch_ptr = 0x7f98a36a3800, type = 2
    rank = 0, ptr = 0x7fdbdf200000, pitch = 512b, xsize = 200b, ysize = 100
    rank = 0, exch_ptr = 0x7fdbdf6a3800, type = 2
    Sending 3.141593 to rank 1
    [bigmoose:1797603:0:1797603] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fdbdf6a3800)
    ==== backtrace (tid:1797603) ====
     0 0x000000000003fda0 __sigaction()  ???:0
     1 0x0000000000164d95 _dl_mcount_wrapper()  ???:0
     2 0x000000000001b72c uct_am_short_fill_data()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/base/uct_iface.h:995
     3 0x000000000001b72c uct_mm_ep_am_short()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/sm/mm/base/mm_ep.c:405
     4 0x000000000009e5ab uct_ep_am_short()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/api/uct.h:2989
     5 0x00000000000a53ad ucp_request_try_send()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_request.inl:307
     6 0x00000000000a53ad ucp_request_send()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_request.inl:330
     7 0x00000000000a53ad ucp_proto_request_send_op_common()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/proto/proto_common.inl:260
     8 0x00000000000a53ad ucp_proto_request_send_op()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/proto/proto_common.inl:310
     9 0x00000000000a53ad ucp_tag_send_nbx()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/tag/tag_send.c:295
    10 0x0000000000005b99 mca_pml_ucx_send_nbr()  /proj/scratch/qa/library_builds/Linux_x86_64/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:909
    11 0x0000000000005b99 mca_pml_ucx_send()  /proj/scratch/qa/library_builds/Linux_x86_64/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:949
    12 0x000000000007b75f PMPI_Send()  /proj/scratch/qa/library_builds/Linux_x86_64/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/c/profile/psend.c:81
    13 0x000000000040173a main()  /mnt/nvme/home/uranix/ucx-bug/test.cu:39
    14 0x0000000000029d68 __libc_init_first()  ???:0
    15 0x0000000000029e25 __libc_start_main()  ???:0
    16 0x00000000004011a1 _start()  ???:0
    =================================
    

    The CUDA Runtime correctly identifies 0x7fdbdf6a3800 as device memory (type = 2), but uct_am_short_fill_data in the backtrace indicates that a host-memory transfer was attempted.

    The problem does not occur if any of the following modifications is made:

    • the program is run as UCX_MEMTYPE_CACHE=n mpirun -n 2 ./test
    • the code uses cudaMemcpy with manual padding instead of cudaMemcpy3D
    • Nx is divisible by 128, or close to it (resulting in pitch < xsize / 0.95)
      It seems that the device memory region allocated by cudaMalloc3D is incorrectly registered as having size xsize * ysize * depth instead of pitch * ysize * depth
  • UCX version used + UCX configure flags: UCX 1.17.0 bundled with NVHPC 24.9 (HPC-X 2.20):

$ /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ucx/bin/ucx_info -d
# Library version: 1.17.0
# Library path: /lib/x86_64-linux-gnu/libucs.so.0
# API headers version: 1.17.0
# Git branch '', revision 39c8f9b
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.5.1/redhat8 --with-gdrcopy --prefix=/build-result/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx --with-bfd=/hpc/local/oss/binutils/2.37/redhat8
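
For reference, here is a minimal sketch of a possible reproducer, reconstructed from the log output above. It is not the attached test.cu: the extent (25 doubles x 100 rows x 100 slices), the slice index 95, and all variable names are assumptions chosen to match the printed pointers and sizes.

// Minimal sketch of a possible reproducer, reconstructed from the log above.
// NOT the attached test.cu: the extent (25 doubles x 100 x 100), the slice
// index 95 and all names are assumptions chosen to match the printed values.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Pitched 3D allocation: rows of 25 doubles (200 B), 100 rows, 100 slices.
    // The runtime rounds each row up to the pitch (512 B in the log), so the
    // allocation spans pitch * ysize * depth bytes, not xsize * ysize * depth.
    cudaExtent extent = make_cudaExtent(25 * sizeof(double), 100, 100);
    cudaPitchedPtr p;
    cudaMalloc3D(&p, extent);           // error checking omitted for brevity
    printf("rank = %d, ptr = %p, pitch = %zub, xsize = %zub, ysize = %zu\n",
           rank, p.ptr, p.pitch, p.xsize, p.ysize);

    // Pointer into slice 95: 95 * pitch * ysize = 4,864,000 B (= 0x4a3800)
    // past the base, matching exch_ptr in the log. It lies beyond
    // xsize * ysize * depth but inside the real pitched allocation.
    char *exch_ptr = (char *)p.ptr + 95 * p.pitch * extent.height;

    // The CUDA runtime reports this address as device memory (type = 2).
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, exch_ptr);
    printf("rank = %d, exch_ptr = %p, type = %d\n",
           rank, (void *)exch_ptr, (int)attr.type);

    double pi = 3.141593;
    if (rank == 0) {
        cudaMemcpy(exch_ptr, &pi, sizeof(pi), cudaMemcpyHostToDevice);
        printf("Sending %f to rank 1\n", pi);
        // Segfaults inside uct_am_short_fill_data when UCX treats the
        // device pointer as host memory.
        MPI_Send(exch_ptr, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(exch_ptr, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(p.ptr);
    MPI_Finalize();
    return 0;
}

With these assumed sizes, xsize * ysize * depth = 200 * 100 * 100 = 2,000,000 bytes, while the pitched allocation spans pitch * ysize * depth = 512 * 100 * 100 = 5,120,000 bytes. The faulting address sits 0x4a3800 = 4,864,000 bytes past the base, i.e. inside the pitched allocation but beyond the smaller, un-pitched size, which is consistent with the mis-sized-region hypothesis above.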

Setup and versions

  • OS version + CPU architecture: Debian GNU/Linux trixie/sid, x86_64
  • For GPU related issues:
    • GPU type: NVIDIA GeForce GTX 1660 Ti
    • CUDA:
      • Driver version: 535.216.03-1
      • Peer-direct (lsmod | grep nv_peer_mem) and gdrcopy (lsmod | grep gdrdrv): neither module is loaded, both commands give empty output

Additional information (depending on the issue)

  • OpenMPI version: mpirun (Open MPI) 4.1.7a1
  • Output of ucx_info -d to show transports and devices recognized by UCX:
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#           rkey_ptr is supported
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: self
#         Device: memory
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 19360.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: tcp
#         Device: eno2
#           Type: network
#  System device: eno2 (0)
#
#      capabilities:
#            bandwidth: 113.16/ppn + 0.00 MB/sec
#              latency: 5776 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 15946372K
#           remote key: 24 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
# < failed to open connection manager rdmacm >
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#