Segfault in uct_am_short_fill_data when transferring cudaMalloc3D allocated regions #10526

Open
uranix opened this issue Mar 1, 2025 · 0 comments · May be fixed by #10527
uranix commented Mar 1, 2025

Describe the bug

Transferring memory allocated by cudaMalloc3D via MPI can cause a segfault. UCX apparently fails to identify the address as device memory and accesses it as host memory, which causes the crash.

Steps to Reproduce

  • Compile and run the program test.txt (rename test.txt to test.cu); a reconstructed sketch of the program is shown below, after this list.

    $ module purge
    $ module load nvhpc
    $ mpic++ test.cu -o test
    $ mpirun -n 2 ./test

    The program segfaults with a message:

    rank = 1, ptr = 0x7f98a3200000, pitch = 512b, xsize = 200b, ysize = 100
    rank = 1, exch_ptr = 0x7f98a36a3800, type = 2
    rank = 0, ptr = 0x7fdbdf200000, pitch = 512b, xsize = 200b, ysize = 100
    rank = 0, exch_ptr = 0x7fdbdf6a3800, type = 2
    Sending 3.141593 to rank 1
    [bigmoose:1797603:0:1797603] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fdbdf6a3800)
    ==== backtrace (tid:1797603) ====
     0 0x000000000003fda0 __sigaction()  ???:0
     1 0x0000000000164d95 _dl_mcount_wrapper()  ???:0
     2 0x000000000001b72c uct_am_short_fill_data()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/base/uct_iface.h:995
     3 0x000000000001b72c uct_mm_ep_am_short()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/sm/mm/base/mm_ep.c:405
     4 0x000000000009e5ab uct_ep_am_short()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/api/uct.h:2989
     5 0x00000000000a53ad ucp_request_try_send()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_request.inl:307
     6 0x00000000000a53ad ucp_request_send()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_request.inl:330
     7 0x00000000000a53ad ucp_proto_request_send_op_common()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/proto/proto_common.inl:260
     8 0x00000000000a53ad ucp_proto_request_send_op()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/proto/proto_common.inl:310
     9 0x00000000000a53ad ucp_tag_send_nbx()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/ucp/tag/tag_send.c:295
    10 0x0000000000005b99 mca_pml_ucx_send_nbr()  /proj/scratch/qa/library_builds/Linux_x86_64/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:909
    11 0x0000000000005b99 mca_pml_ucx_send()  /proj/scratch/qa/library_builds/Linux_x86_64/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:949
    12 0x000000000007b75f PMPI_Send()  /proj/scratch/qa/library_builds/Linux_x86_64/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/c/profile/psend.c:81
    13 0x000000000040173a main()  /mnt/nvme/home/uranix/ucx-bug/test.cu:39
    14 0x0000000000029d68 __libc_init_first()  ???:0
    15 0x0000000000029e25 __libc_start_main()  ???:0
    16 0x00000000004011a1 _start()  ???:0
    =================================
    

    The CUDA Runtime correctly identifies 0x7fdbdf6a3800 as device memory (type = 2), but uct_am_short_fill_data in the backtrace indicates that a host-memory transfer was attempted.

    The problem does not occur if any of the following modifications is made:

    • the program is run as UCX_MEMTYPE_CACHE=n mpirun -n 2 ./test
    • the code uses cudaMemcpy with manual padding instead of cudaMemcpy3D
    • Nx is divisible by 128, or close to it (resulting in pitch < xsize / 0.95)
      It seems that the device memory region allocated by cudaMalloc3D is incorrectly registered as having size xsize * ysize * depth instead of pitch * ysize * depth
  • UCX version used + UCX configure flags: UCX 1.17.0 bundled with NVHPC 24.9 (HPC-X 2.20):

$ /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/comm_libs/12.6/hpcx/hpcx-2.20/ucx/bin/ucx_info -d
# Library version: 1.17.0
# Library path: /lib/x86_64-linux-gnu/libucs.so.0
# API headers version: 1.17.0
# Git branch '', revision 39c8f9b
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.5.1/redhat8 --with-gdrcopy --prefix=/build-result/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx --with-bfd=/hpc/local/oss/binutils/2.37/redhat8
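
For reference, here is a minimal sketch of a possible reproducer, reconstructed from the log output above. It is not the attached test.cu: the extent (25 doubles x 100 rows x 100 slices), the slice index 95, and all variable names are assumptions chosen to match the printed pointers and sizes.

// Minimal sketch of a possible reproducer, reconstructed from the log above.
// NOT the attached test.cu: the extent (25 doubles x 100 x 100), the slice
// index 95 and all names are assumptions chosen to match the printed values.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Pitched 3D allocation: rows of 25 doubles (200 B), 100 rows, 100 slices.
    // The runtime rounds each row up to the pitch (512 B in the log), so the
    // allocation spans pitch * ysize * depth bytes, not xsize * ysize * depth.
    cudaExtent extent = make_cudaExtent(25 * sizeof(double), 100, 100);
    cudaPitchedPtr p;
    cudaMalloc3D(&p, extent);           // error checking omitted for brevity
    printf("rank = %d, ptr = %p, pitch = %zub, xsize = %zub, ysize = %zu\n",
           rank, p.ptr, p.pitch, p.xsize, p.ysize);

    // Pointer into slice 95: 95 * pitch * ysize = 4,864,000 B (= 0x4a3800)
    // past the base, matching exch_ptr in the log. It lies beyond
    // xsize * ysize * depth but inside the real pitched allocation.
    char *exch_ptr = (char *)p.ptr + 95 * p.pitch * extent.height;

    // The CUDA runtime reports this address as device memory (type = 2).
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, exch_ptr);
    printf("rank = %d, exch_ptr = %p, type = %d\n",
           rank, (void *)exch_ptr, (int)attr.type);

    double pi = 3.141593;
    if (rank == 0) {
        cudaMemcpy(exch_ptr, &pi, sizeof(pi), cudaMemcpyHostToDevice);
        printf("Sending %f to rank 1\n", pi);
        // Segfaults inside uct_am_short_fill_data when UCX treats the
        // device pointer as host memory.
        MPI_Send(exch_ptr, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(exch_ptr, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(p.ptr);
    MPI_Finalize();
    return 0;
}

With these assumed sizes, xsize * ysize * depth = 200 * 100 * 100 = 2,000,000 bytes, while the pitched allocation spans pitch * ysize * depth = 512 * 100 * 100 = 5,120,000 bytes. The faulting address sits 0x4a3800 = 4,864,000 bytes past the base, i.e. inside the pitched allocation but beyond the smaller, un-pitched size, which is consistent with the mis-sized-region hypothesis above.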

Setup and versions

  • OS version + CPU architecture: Debian GNU/Linux trixie/sid, x86_64
  • For GPU related issues:
    • GPU type: NVIDIA GeForce GTX 1660 Ti
    • CUDA:
      • Driver version: 535.216.03-1
      • Peer-direct (lsmod | grep nv_peer_mem) and gdrcopy (lsmod | grep gdrdrv): neither module is loaded, both commands give empty output

Additional information (depending on the issue)

  • OpenMPI version: mpirun (Open MPI) 4.1.7a1
  • Output of ucx_info -d to show transports and devices recognized by UCX:
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#           rkey_ptr is supported
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: self
#         Device: memory
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 19360.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: tcp
#         Device: eno2
#           Type: network
#  System device: eno2 (0)
#
#      capabilities:
#            bandwidth: 113.16/ppn + 0.00 MB/sec
#              latency: 5776 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 15946372K
#           remote key: 24 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
# < failed to open connection manager rdmacm >
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#