NIXL EP: Use VMM API for device memory allocation. #1415
ofirfarjun7 wants to merge 29 commits into ai-dynamo:main from
Conversation
👋 Hi ofirfarjun7! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the CodeRabbit settings, and reviews can be managed with the usual review commands.
📝 Walkthrough: Adds CUDA Driver VMM-backed allocation support.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/device/ep/csrc/nixl_ep.hpp`:
- Around line 66-77: The two calls to cuDeviceGetAttribute (checking
CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED and
CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED) do not check their CUresult
return values; update the code around the variables rdma_vmm_supported and
fabric_supported to capture the CUresult, test it against CUDA_SUCCESS, and on
failure throw or log a runtime_error that includes the cuGetErrorString result
and context (which attribute failed and for which device); ensure you only rely
on rdma_vmm_supported/fabric_supported after the call succeeds so you don't act
on zero-initialized values.
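The error-check pattern this comment asks for can be sketched as follows. This is a minimal illustration, not the repository's code: the CUDA types are stubbed inline so the snippet is self-contained, and in the real code `CUresult` and the error text would come from `<cuda.h>` and `cuGetErrorString()`.

```cpp
#include <stdexcept>
#include <string>

// Stubs standing in for the CUDA driver types (illustration only; the real
// definitions come from <cuda.h>).
enum CUresult { CUDA_SUCCESS = 0, CUDA_ERROR_INVALID_DEVICE = 101 };

static const char *cu_error_name(CUresult r) {
    return r == CUDA_SUCCESS ? "CUDA_SUCCESS" : "CUDA_ERROR_INVALID_DEVICE";
}

// Wrap each driver call so a failed attribute query throws with context,
// instead of the caller acting on zero-initialized rdma_vmm_supported /
// fabric_supported values.
static void check_cu(CUresult res, const std::string &what) {
    if (res != CUDA_SUCCESS) {
        throw std::runtime_error(what + " failed: " + cu_error_name(res));
    }
}
```

In the real code each `cuDeviceGetAttribute()` call would be wrapped, e.g. `check_cu(cuDeviceGetAttribute(&rdma_vmm_supported, ...), "RDMA-with-VMM attribute query for device N");`, so the flags are only read after the query is known to have succeeded.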
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8bc85385-7d61-405a-90d0-e86c5ca8956c
📒 Files selected for processing (2): examples/device/ep/csrc/nixl_ep.cpp, examples/device/ep/csrc/nixl_ep.hpp
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/device/ep/csrc/nixl_ep.hpp`:
- Around line 102-110: The destructor ~cuda_allocator() currently unmaps and
releases VMM state without waiting for GPU work; fix by calling
cudaDeviceSynchronize() at the start of ~cuda_allocator() before any
cuMemUnmap/cuMemAddressFree/cuMemRelease calls so all in-flight
kernels/transfers are fenced; additionally, ensure allocator creation paths
cannot bypass that fence on exception by either making explicitly_destroy
default to false or wrapping allocator construction in the init paths
(_nixl_agent_init(), _nixl_ep_init(), or any init() that creates the allocator)
with a try/catch that calls cudaDeviceSynchronize() before rethrowing so
stack-unwound destructors run safely; keep references to the methods destroy() and
~cuda_allocator() when making changes.
- Around line 56-64: The allocator currently queries the ambient CUDA context
via cuCtxGetDevice() which is unsafe; change cuda_allocator to accept an
explicit CUdevice (or device_id) parameter, set the context explicitly inside
the constructor using cuCtxSetCurrent() and handle errors (throw on failure),
then update Buffer::init call sites to pass the tracked device_id into
cuda_allocator so allocations are bound to the correct device regardless of
external context switches; ensure you remove the cuCtxGetDevice() usage in
cuda_allocator and propagate failures with clear runtime_error messages.
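The destructor-fencing order from the first comment above can be sketched like this. The CUDA calls are stubbed (names prefixed `fake_`) purely to make the ordering observable; the real code would call `cudaDeviceSynchronize()` and then `cuMemUnmap`/`cuMemAddressFree`/`cuMemRelease`.

```cpp
#include <stdexcept>

// Stubs standing in for the CUDA runtime/driver calls (illustration only).
static bool g_device_fenced = false;
static void fake_cudaDeviceSynchronize() { g_device_fenced = true; }
static void fake_cuMemUnmap() {
    if (!g_device_fenced) throw std::runtime_error("unmap before fence");
}

// Destructor ordering suggested above: fence all in-flight GPU work first,
// then tear down the VMM mappings.
struct cuda_allocator {
    ~cuda_allocator() {
        fake_cudaDeviceSynchronize(); // fence in-flight kernels/transfers
        fake_cuMemUnmap();            // then unmap / free address / release
    }
};
```

Putting the fence first inside the destructor also covers the stack-unwinding paths the comment mentions, since a destructor invoked during exception propagation runs the same body.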
Run ID: d14696d0-b771-4297-bed3-a7596c6c4643
📒 Files selected for processing (1): examples/device/ep/csrc/nixl_ep.hpp
examples/device/ep/csrc/nixl_ep.hpp (Outdated)

        throw std::runtime_error("Failed to get CUDA allocation granularity");
    }

    init_vmm(size, device, prop, granularity);
This seems to be the only thing we need to do per allocation, which makes me believe we don't need this cuda_allocator class; we can just verify once that we can allocate using VMM, get the granularity, etc., and then call init_vmm directly.
Can change it. Do you want to get rid of the class, or make it static?
/build
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 70-127: The static cuda_alloc_ctx ctx binds device-specific fields
(prop.location.id, granularity, fabric support) to whichever device was active
at first cuCtxGetDevice() call, causing wrong-device VMM allocations; change
initialization so driver/version checks remain global but device-specific
queries (cuCtxGetDevice(), cuDeviceGetAttribute(),
cuMemGetAllocationGranularity()) are performed per-call or cached per-device
(keyed by device ID) instead of in the static cuda_alloc_ctx constructor—either
remove static cuda_alloc_ctx ctx and build a ctx per vmm_init()/allocation (with
a device-ID cache), or split cuda_alloc_ctx into a static global verifier and a
per-device struct populated on each allocation using the current context.
Run ID: 120c00f8-eacc-4630-9658-e92390364a9c
📒 Files selected for processing (2): examples/device/ep/csrc/nixl_ep.cpp, examples/device/ep/csrc/nixl_ep.hpp
/build
Actionable comments posted: 3
♻️ Duplicate comments (1)
examples/device/ep/csrc/nixl_ep.cpp (1)
70-121: ⚠️ Potential issue | 🟠 Major — Do not cache device-specific VMM state in a function-local static.
static cuda_alloc_ctx ctx(device); is initialized only on the first vmm_init() call, so every later allocation reuses that first device's prop.location.id, granularity, and fallback decision. In a multi-GPU process, buffers allocated on GPU 1 can end up using GPU 0's VMM properties, which defeats the multi-device support this change is introducing.
Suggested direction:
- static cuda_alloc_ctx ctx(device);
+ const cuda_alloc_ctx ctx(device);
If the repeated driver/version probe is a concern, keep that part in a separate one-time helper and build/cache the device-specific state per CUdevice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/device/ep/csrc/nixl_ep.cpp` around lines 70 - 121, The device-specific VMM context is incorrectly cached in a function-local static (static cuda_alloc_ctx ctx(device);) causing all subsequent calls to reuse the first device's prop.location.id, granularity and fallback decision; change this by removing the function-local static and either (a) create a per-call cuda_alloc_ctx instance (e.g., cuda_alloc_ctx ctx(device);) so each device is probed correctly, or (b) implement a per-device cache keyed by CUdevice (e.g., std::unordered_map<CUdevice,cuda_alloc_ctx>) and look up/create the cuda_alloc_ctx for the specific device, while extracting any global-only driver/version probe into a separate one-time helper function to avoid repeated work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 124-129: The code incorrectly passes a CUdeviceptr* to cudaMalloc;
change the allocation to use a temporary void* (e.g., void* tmp = nullptr), call
cudaMalloc(&tmp, size), check the return, and then assign region.ptr =
reinterpret_cast<CUdeviceptr>(tmp) (or static_cast if appropriate) so that
vmm_region.region.ptr receives the allocated device pointer without violating
the CUDA Runtime API contract.
- Around line 101-104: Replace the throw when rdma_vmm_supported is false with
an early return so the function can continue to the existing fallback path (the
later check of fabric_supported that falls back to cudaMalloc); specifically,
remove the std::runtime_error throw and return (keeping the function's normal
flow) when rdma_vmm_supported == false to match the behavior used for the CUDA
version and fabric support checks and allow the cudaMalloc fallback to execute.
In `@examples/device/ep/csrc/nixl_ep.hpp`:
- Around line 53-58: The vmm_region fields are left uninitialized causing
vmm_free() to operate on garbage values; update the vmm_region definition so its
members are zero-initialized by default (e.g., initialize CUdeviceptr ptr,
size_t size, and CUmemGenericAllocationHandle handle to zero or provide a
default ctor that sets them to 0) so that Buffer's members (m_rdma_alloc,
m_mask_alloc, m_sync_alloc, m_sync_count_alloc, m_workspace_alloc) are safe if
Buffer::~Buffer()/destroy() runs before init(); ensure the guard in vmm_free()
will reliably detect an unused region.
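The zero-initialization the comment asks for can be sketched as below. The CUDA handle types are stubbed as integer aliases so the snippet stands alone; `vmm_free()` here only demonstrates the guard, not the real unmap/release sequence.

```cpp
#include <cstddef>
#include <cstdint>

// Stubs for the CUDA driver types used by the real vmm_region (illustration).
using CUdeviceptr = std::uintptr_t;
using CUmemGenericAllocationHandle = unsigned long long;

// Zero-initialized by default, so a Buffer destroyed before init() hands
// vmm_free() an all-zero region it can safely skip.
struct vmm_region {
    CUdeviceptr ptr = 0;
    size_t size = 0;
    CUmemGenericAllocationHandle handle = 0;
};

// Guard: an unused region is detected reliably and no unmap/release runs.
static bool vmm_free(vmm_region &r) {
    if (r.ptr == 0 && r.size == 0) {
        return false; // nothing was allocated; nothing to tear down
    }
    // ... cuMemUnmap / cuMemAddressFree / cuMemRelease in the real code ...
    r = vmm_region{};
    return true;
}
```

With in-class default initializers, members such as m_rdma_alloc and m_workspace_alloc are safe to destroy even when Buffer::~Buffer() runs before init().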
---
Duplicate comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 70-121: The device-specific VMM context is incorrectly cached in a
function-local static (static cuda_alloc_ctx ctx(device);) causing all
subsequent calls to reuse the first device's prop.location.id, granularity and
fallback decision; change this by removing the function-local static and either
(a) create a per-call cuda_alloc_ctx instance (e.g., cuda_alloc_ctx
ctx(device);) so each device is probed correctly, or (b) implement a per-device
cache keyed by CUdevice (e.g., std::unordered_map<CUdevice,cuda_alloc_ctx>) and
look up/create the cuda_alloc_ctx for the specific device, while extracting any
global-only driver/version probe into a separate one-time helper function to
avoid repeated work.
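Option (b) from the comment above, a per-device cache keyed by CUdevice, can be sketched as follows. `CUdevice` is stubbed as an int and `probe_device()` is a hypothetical stand-in for the real `cuDeviceGetAttribute()` / `cuMemGetAllocationGranularity()` calls.

```cpp
#include <cstddef>
#include <unordered_map>

using CUdevice = int; // stub; the real type comes from <cuda.h>

// Device-specific state that must not be frozen at the first device probed.
struct cuda_alloc_ctx {
    CUdevice device_id{};
    size_t granularity{};
    bool fabric_supported{};
};

// Hypothetical probe; the real code would query the driver for this
// specific device here (attributes, allocation granularity, fabric support).
static cuda_alloc_ctx probe_device(CUdevice dev) {
    return {dev, size_t{2} << 20, true};
}

// One cached entry per device, created on first use, so GPU 1 never
// inherits GPU 0's granularity or fallback decision.
static const cuda_alloc_ctx &get_ctx(CUdevice dev) {
    static std::unordered_map<CUdevice, cuda_alloc_ctx> cache;
    auto it = cache.find(dev);
    if (it == cache.end()) {
        it = cache.emplace(dev, probe_device(dev)).first;
    }
    return it->second;
}
```

Any global-only driver-version probe would stay in a separate one-time helper, so only the cheap per-device queries repeat.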
Run ID: cb93105b-51b9-4789-b86a-242031b99df3
📒 Files selected for processing (2): examples/device/ep/csrc/nixl_ep.cpp, examples/device/ep/csrc/nixl_ep.hpp
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/device/ep/csrc/nixl_ep.hpp`:
- Around line 53-58: The functions vmm_init and vmm_free currently have external
linkage but no header declarations; decide intended scope and fix accordingly:
if they are internal implementation details, add the static keyword to their
definitions in nixl_ep.cpp (i.e., mark vmm_init and vmm_free static) to restrict
linkage; if they are part of the API, add forward declarations for vmm_init and
vmm_free to the header alongside vmm_region so callers can see the prototypes
and linkage is explicit. Ensure the chosen change is applied consistently for
both functions.
Run ID: 7af83afb-0d9f-4ed5-bc2c-d16193eb3605
📒 Files selected for processing (1): examples/device/ep/csrc/nixl_ep.hpp
size_t size_ = 0;
CUmemGenericAllocationHandle handle_ = 0;
bool is_cuda_malloc_ = false;
bool vmm_addr_reserved_ = false;
vmm_addr_reserved_ can be removed.
examples/device/ep/csrc/vmm.cpp (Outdated)

if (!ctx.fabric_supported) {
    size_ = size;
    is_cuda_malloc_ = true;
    if (cudaMalloc(reinterpret_cast<void **>(&ptr_), size) != cudaSuccess) {
cudaMalloc -> cuMemAlloc and cudaFree -> cuMemFree, to avoid the cast; then #include <cuda_runtime.h> can be removed from vmm.hpp.
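The point of the suggestion can be sketched as below: the driver API's `cuMemAlloc` takes a `CUdeviceptr *` directly, so no `reinterpret_cast` is needed. The driver types and the allocator itself are stubbed (`fake_cuMemAlloc`, fake address) so the sketch runs without a GPU.

```cpp
#include <cstddef>

// Stubs mirroring the driver-API shapes (the real ones come from <cuda.h>).
using CUdeviceptr = unsigned long long;
enum CUresult { CUDA_SUCCESS = 0 };

// Stubbed cuMemAlloc: takes a CUdeviceptr* directly, unlike cudaMalloc's
// void** parameter, so the member pointer needs no cast.
static CUresult fake_cuMemAlloc(CUdeviceptr *dptr, size_t size) {
    *dptr = size ? 0x1000 : 0; // fake device address for illustration
    return CUDA_SUCCESS;
}

// Fallback path from the diff, rewritten per the suggestion: no cast, and
// no dependency on <cuda_runtime.h>.
static CUdeviceptr alloc_fallback(size_t size) {
    CUdeviceptr ptr = 0;
    if (fake_cuMemAlloc(&ptr, size) != CUDA_SUCCESS) {
        return 0;
    }
    return ptr;
}
```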
examples/device/ep/csrc/vmm.hpp (Outdated)

[[nodiscard]] size_t
size() const noexcept {
    return size_;
}

[[nodiscard]] CUmemGenericAllocationHandle
handle() const noexcept {
    return handle_;
}
These can be removed, as they are unused.
examples/device/ep/csrc/vmm.hpp (Outdated)

[[nodiscard]] CUdeviceptr
ptr() const noexcept {
    return ptr_;
Maybe do the reinterpret_cast here and return void *, as it is only used to get the pointer.
examples/device/ep/csrc/vmm.cpp (Outdated)

access_desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
access_desc.location.id = device;
access_desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
It seems this should also be done only once, in cuda_alloc_ctx.
examples/device/ep/csrc/vmm.cpp (Outdated)

prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = dev;
In this implementation cuda_alloc_ctx is initialized only once. So you can remove CUdevice device from the parameter list, call cuCtxGetDevice, and throw if it returns an error (which means the device should be set before constructing a vmm_region). Also remove const CUdevice cu_dev = static_cast<CUdevice>(device_id); from nixl_ep.cpp.
if (size == 0) {
    throw std::invalid_argument("vmm_region: size must be non-zero");
}

Is it really needed? I guess it can be removed.

I think cudaMalloc returns success even for size == 0 (and nullptr for the pointer), so we will need to check it if we remove this.

But it's safe to call cudaFree for 0 / NULL / nullptr. So I think it's not really an exceptional case for this class.

I see this class as an abstraction; don't you think we should hint to the user if they call the ctor with zero?

In my opinion, no, we shouldn't. Since this is an abstraction over various methods of memory allocation, it is in general not forbidden to pass a zero size to allocators.

AFAIK cuMemCreate will fail if we pass zero. If that's true, don't you think it is strange that vmm will fail with zero on some systems and not fail on others? I don't think the user should care which API vmm used; it should get the same behavior.
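One way to resolve this thread, sketched under assumptions (the `region`/`make_region` names are hypothetical, not the PR's API): normalize size == 0 before choosing a backend, so cuMemCreate's rejection of zero never leaks through on systems where the plain-allocation path would have accepted it.

```cpp
#include <cstddef>

// Hypothetical region type standing in for vmm_region (illustration only).
struct region {
    unsigned long long ptr = 0; // stands in for CUdeviceptr
    size_t size = 0;
};

// Treat size == 0 as a valid empty region for every backend, giving the
// caller identical behavior regardless of which allocation API is used.
static region make_region(size_t size, bool use_vmm) {
    region r;
    if (size == 0) {
        return r; // empty region; freeing it is a no-op on either backend
    }
    r.size = size;
    r.ptr = 0x2000; // placeholder for the real cuMemCreate/cuMemAlloc result
    (void)use_vmm;  // backend selection elided in this sketch
    return r;
}
```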
examples/device/ep/csrc/vmm.cpp (Outdated)

@@ -0,0 +1,151 @@
/*
 * SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Suggested change:
- * SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
examples/device/ep/csrc/vmm.hpp (Outdated)

@@ -0,0 +1,62 @@
/*
 * SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Suggested change:
- * SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
@@ -0,0 +1,48 @@
/*
 * SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Suggested change:
- * SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#include <cuda_runtime.h>
#include <cstddef>

class vmm_region {

Please use the existing namespace nixl_ep.
What?
Use VMM API for device memory allocation in nixl_ep
Why?
To support multi-node NVLink.
How?