Release stale CUDA primary contexts inherited by forked workers #42874
lokashrinav wants to merge 2 commits into
Code Review
This pull request introduces a mechanism to release stale CUDA primary contexts inherited by worker processes after a fork, preventing memory waste and issues with external tools. This is achieved through a new _release_stale_cuda_primary_contexts function in gpu_worker.py that utilizes the CUDA driver API, along with a new test suite to verify the cleanup. Feedback suggests adding error handling for the CUDA driver API return values to ensure robustness if an API call fails.
    libcuda.cuInit(0)
    device_count = torch.cuda.device_count()

    for dev_id in range(device_count):
        if dev_id == local_rank:
            continue
        dev = ctypes.c_int()
        libcuda.cuDeviceGet(ctypes.byref(dev), dev_id)
        flags = ctypes.c_uint()
        state = ctypes.c_int()
        libcuda.cuDevicePrimaryCtxGetState(
            dev, ctypes.byref(flags), ctypes.byref(state)
        )
        if state.value != 0:
            libcuda.cuDevicePrimaryCtxRelease(dev)
            logger.debug(
                "Released stale CUDA primary context for device %d "
                "(worker assigned to device %d)",
                dev_id,
                local_rank,
            )
The CUDA driver API functions (cuInit, cuDeviceGet, cuDevicePrimaryCtxGetState) return a CUresult integer, where 0 (CUDA_SUCCESS) indicates success. The current implementation does not check these return values. If a call fails (e.g., due to driver issues or invalid device indices), output variables such as dev or state are never written and keep their default ctypes value of 0, so subsequent calls may operate on the wrong device handle or silently skip a release. Each driver API call should be checked for success before its outputs are used.
    if libcuda.cuInit(0) != 0:
        return
    device_count = torch.cuda.device_count()
    for dev_id in range(device_count):
        if dev_id == local_rank:
            continue
        dev = ctypes.c_int()
        if libcuda.cuDeviceGet(ctypes.byref(dev), dev_id) != 0:
            continue
        flags = ctypes.c_uint()
        state = ctypes.c_int()
        if libcuda.cuDevicePrimaryCtxGetState(
                dev, ctypes.byref(flags), ctypes.byref(state)) == 0:
            if state.value != 0:
                libcuda.cuDevicePrimaryCtxRelease(dev)
                logger.debug(
                    "Released stale CUDA primary context for device %d "
                    "(worker assigned to device %d)",
                    dev_id,
                    local_rank,
                )

When vLLM uses the fork multiprocessing method, child worker processes inherit the parent's active CUDA primary contexts for all devices. A worker assigned to GPU 1 retains a stale GPU 0 context from the parent, wasting memory and causing failures in tools like NVIDIA cuda-checkpoint that enumerate per-process CUDA contexts.

Add _release_stale_cuda_primary_contexts() to gpu_worker.py that calls cuDevicePrimaryCtxRelease() for non-assigned devices after device setup.

Fixes vllm-project#42873

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Shrinav <lokashrinav@gmail.com>
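To see which devices hold an active primary context in a given process, the driver can be queried directly. The sketch below is illustrative only and is not code from this PR: the helper name `primary_context_states` is hypothetical, it assumes a Linux system where the driver library is loadable as `libcuda.so.1`, and it returns an empty dict when no driver is present. Whether the driver reports inherited contexts this way after a fork is the premise of the PR, not something this sketch verifies.

```python
import ctypes

CUDA_SUCCESS = 0  # CUresult value for success


def primary_context_states() -> dict[int, bool]:
    """Map each CUDA device ordinal to whether its primary context is
    active in the current process. Returns {} if the driver is unavailable."""
    try:
        # Assumption: Linux driver library name.
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return {}  # no NVIDIA driver on this machine

    if libcuda.cuInit(0) != CUDA_SUCCESS:
        return {}

    count = ctypes.c_int()
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != CUDA_SUCCESS:
        return {}

    states: dict[int, bool] = {}
    for dev_id in range(count.value):
        dev = ctypes.c_int()
        if libcuda.cuDeviceGet(ctypes.byref(dev), dev_id) != CUDA_SUCCESS:
            continue  # skip devices the driver cannot resolve
        flags = ctypes.c_uint()
        active = ctypes.c_int()
        rc = libcuda.cuDevicePrimaryCtxGetState(
            dev, ctypes.byref(flags), ctypes.byref(active)
        )
        if rc == CUDA_SUCCESS:
            states[dev_id] = bool(active.value)
    return states
```

Printing `primary_context_states()` together with `os.getpid()` in the parent and in each worker is one way to compare what each process holds before and after the cleanup runs.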
Add return value checks for cuInit, cuDeviceGet, and cuDevicePrimaryCtxGetState to avoid undefined behavior if a driver call fails.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Shrinav <lokashrinav@gmail.com>
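The return-value checks could also be centralized in a small helper instead of being repeated at each call site. This is a minimal sketch of that alternative, not what the PR does: `check_cu`, `CudaDriverError`, and `get_device` are hypothetical names, and `libcuda` stands for any object exposing the driver's `cuDeviceGet` (for example a `ctypes.CDLL` handle).

```python
import ctypes

CUDA_SUCCESS = 0  # CUresult value for success


class CudaDriverError(RuntimeError):
    """Hypothetical exception type for this sketch; not part of vLLM or CUDA."""


def check_cu(result: int, call_name: str) -> None:
    """Raise if a driver API call returned anything other than CUDA_SUCCESS."""
    if result != CUDA_SUCCESS:
        raise CudaDriverError(f"{call_name} failed with CUresult {result}")


def get_device(libcuda, ordinal: int) -> int:
    """Resolve a device ordinal to a CUdevice handle, failing loudly
    instead of returning the untouched default value of 0."""
    dev = ctypes.c_int()
    check_cu(libcuda.cuDeviceGet(ctypes.byref(dev), ordinal), "cuDeviceGet")
    return dev.value
```

In a cleanup path that must never take the worker down, the caller would wrap these calls in `try`/`except CudaDriverError` and log the failure, which keeps the skip-on-error behavior of the checks above while recording which call failed.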
991d395 to 14c482e
Summary
- Stale inherited contexts waste memory and cause cuda-checkpoint restore failures ("invalid argument")
- Add _release_stale_cuda_primary_contexts(), which calls cuDevicePrimaryCtxRelease() via ctypes after device setup in Worker.init_device()

Fixes #42873
Context
Discovered while integrating NVIDIA cuda-checkpoint with vLLM for multi-GPU cold start optimization (related to RFC #34303). The stale inherited contexts cause cuda-checkpoint --action restore to fail because it tries to restore both the inherited GPU 0 context and the worker's actual GPU 1 context in the same process.
Why this is not duplicating an existing PR
Searched open PRs and issues and found no existing fix for this. The existing _maybe_force_spawn() in system_utils.py forces spawn when CUDA is already initialized, but doesn't handle the case where fork is used intentionally or where contexts are inherited through other mechanisms.
AI assistance disclosure
AI assistance was used for code review and drafting. All changes reviewed and validated by the submitter.
Test plan
tests/cuda/test_stale_context_release.py - verifies stale contexts are released while preserving the assigned device's context (requires 2+ GPUs)
Generated with Claude Code
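The property the test plan describes (release every non-assigned device's context, keep the assigned one) can also be sanity-checked without GPUs by running the same loop against a stand-in driver. The following is a sketch under that assumption and does not reproduce the PR's test file: `FakeDriver` and `release_stale_contexts` are hypothetical names that only mirror the control flow of the cleanup function.

```python
class FakeDriver:
    """Hypothetical stand-in for the CUDA driver: tracks which device
    ordinals currently have an active primary context."""

    def __init__(self, active_devices):
        self.active = set(active_devices)

    def primary_ctx_active(self, dev_id: int) -> bool:
        return dev_id in self.active

    def release_primary_ctx(self, dev_id: int) -> None:
        self.active.discard(dev_id)


def release_stale_contexts(driver, local_rank: int, device_count: int) -> list[int]:
    """Release active primary contexts on every device except local_rank.
    Returns the ordinals that were released."""
    released = []
    for dev_id in range(device_count):
        if dev_id == local_rank:
            continue  # never touch the worker's own device
        if driver.primary_ctx_active(dev_id):
            driver.release_primary_ctx(dev_id)
            released.append(dev_id)
    return released


def test_releases_only_non_assigned_devices():
    driver = FakeDriver(active_devices={0, 1})
    assert release_stale_contexts(driver, local_rank=1, device_count=2) == [0]
    assert driver.active == {1}  # assigned device's context is preserved
```

A check like this covers only the loop logic; the behavior of real inherited contexts still requires the multi-GPU test listed above.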