[CUDA] Support IPC for allocations created by cuMemCreate and cudaMallocAsync #7110

Comments
@vchuravy Is staging cuMemCreate/MallocAsync allocations through cuMemAlloc/cudaMalloc memory not an option? Is it true that JuliaGPU/CUDA.jl#1053 strictly needs to use cuMemCreate/MallocAsync?
Two notes:

- From the perspective of CUDA.jl, we currently do not expose the different allocators to the user; the only choice the user has is whether the memory pool is managed by CUDA.jl or by the driver.
- We currently have a workaround for users who want to use UCX or MPI: disabling the use of the memory pool.
- From my perspective as a user of MPI or UCX, I would like to see IPC support for pooled allocations.

There seem to be two relevant pointer attributes:
This remains an issue (https://discourse.julialang.org/t/cuda-aware-mpi-works-on-system-but-not-for-julia/75060/20?u=vchuravy) and we have to tell users to explicitly disable CUDA's memory pool support.
See the "Interprocess communication support" section here: https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/
From the discussions I had with @Akshay-Venkatesh, it seems using an explicit pool handle for CUDA IPC may not be possible in UCX at the moment, but that will probably become possible in protov2. Meanwhile, support for
Describe the bug
CUDA 10.2 introduced a new set of memory allocation routines (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html#group__CUDA__VA) which allow for pooled allocation and stream-based allocation. These allocations do not support cuIpcGetMemHandle, as noted in https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/. It seems that cudaMallocAsync, introduced in CUDA 11.2, is using this new interface under the hood, as the memory-pool creation API (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html#group__CUDART__MEMORY__POOLS_1g8158cc4b2c0d2c2c771f9d1af3cf386e) takes a HandleType (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1gabde707dfb8a602b917e0b177f77f365).

Steps to Reproduce
See JuliaGPU/CUDA.jl#1053 for an application failure caused by this.
The error encountered is: