Release stale CUDA primary contexts inherited by forked workers #42874

Open
lokashrinav wants to merge 2 commits into vllm-project:main from lokashrinav:fix/release-stale-cuda-contexts-in-workers

@lokashrinav

Summary

  • Release inherited CUDA primary contexts for non-assigned devices in forked worker processes
  • When using fork multiprocessing (Linux default), child workers inherit the parent's GPU 0 context even when assigned to GPU 1+
  • These stale contexts waste GPU memory and cause NVIDIA cuda-checkpoint restore failures ("invalid argument")
  • Add _release_stale_cuda_primary_contexts() that calls cuDevicePrimaryCtxRelease() via ctypes after device setup in Worker.init_device()
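
The release step in the last bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the function name `release_stale_primary_contexts` and the injectable `driver` parameter are hypothetical, introduced so the logic can be exercised with a stand-in object instead of a real `libcuda`. It also uses `ctypes.pointer` rather than `ctypes.byref` purely so that a Python stand-in can write through the output arguments.

```python
import ctypes

CUDA_SUCCESS = 0  # CUresult value for success in the CUDA driver API


def release_stale_primary_contexts(driver, device_count, local_rank):
    """Release active primary contexts on every device except `local_rank`.

    `driver` is assumed to expose the driver API entry points used below,
    for example ctypes.CDLL("libcuda.so.1"). It is a parameter here only
    so the sketch can run without a GPU.
    Returns the list of device ordinals whose context was released.
    """
    released = []
    if driver.cuInit(0) != CUDA_SUCCESS:
        return released
    for dev_id in range(device_count):
        if dev_id == local_rank:
            continue  # never touch the worker's own device
        dev = ctypes.c_int()
        if driver.cuDeviceGet(ctypes.pointer(dev), dev_id) != CUDA_SUCCESS:
            continue
        flags = ctypes.c_uint()
        active = ctypes.c_int()
        rc = driver.cuDevicePrimaryCtxGetState(
            dev, ctypes.pointer(flags), ctypes.pointer(active)
        )
        if rc != CUDA_SUCCESS or active.value == 0:
            continue  # query failed, or no primary context is active
        if driver.cuDevicePrimaryCtxRelease(dev) == CUDA_SUCCESS:
            released.append(dev_id)
    return released
```

On a single-GPU system with `local_rank == 0` the loop skips the only device, so the function returns an empty list without releasing anything.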

Fixes #42873

Context

Discovered while integrating NVIDIA cuda-checkpoint with vLLM for multi-GPU cold start optimization (related to RFC #34303). The stale inherited contexts cause cuda-checkpoint --action restore to fail because it tries to restore both the inherited GPU 0 context and the worker's actual GPU 1 context in the same process.

Why this is not duplicating an existing PR

Searched open PRs and issues - no existing fix for this. The existing _maybe_force_spawn() in system_utils.py forces spawn when CUDA is already initialized, but doesn't handle the case where fork is used intentionally or where contexts are inherited through other mechanisms.

AI assistance disclosure

AI assistance was used for code review and drafting. All changes reviewed and validated by the submitter.

Test plan

  • tests/cuda/test_stale_context_release.py - verifies stale contexts are released while preserving the assigned device's context (requires 2+ GPUs)
  • Manually verified with cuda-checkpoint on H100 tp=2: checkpoint/restore cycle passes after fix, fails without it
  • Single-GPU systems: function is a no-op (no contexts to release)
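
The multi-GPU test in the first bullet needs some way to observe which primary contexts are active before and after the release. One possible shape for that check is sketched below. The contents of the actual test file are not shown in this PR description, so every name here (`primary_context_states`, `FakeDriver`, `only_assigned_active`) is hypothetical, and the stand-in driver exists only so the example runs without GPUs.

```python
import ctypes

CUDA_SUCCESS = 0  # CUresult value for success in the CUDA driver API


def primary_context_states(driver, device_count):
    """Map each device ordinal to True/False (active or not), or None on error."""
    states = {}
    if driver.cuInit(0) != CUDA_SUCCESS:
        return states
    for dev_id in range(device_count):
        dev = ctypes.c_int()
        flags = ctypes.c_uint()
        active = ctypes.c_int()
        ok = (
            driver.cuDeviceGet(ctypes.pointer(dev), dev_id) == CUDA_SUCCESS
            and driver.cuDevicePrimaryCtxGetState(
                dev, ctypes.pointer(flags), ctypes.pointer(active)
            ) == CUDA_SUCCESS
        )
        states[dev_id] = bool(active.value) if ok else None
    return states


def only_assigned_active(states, local_rank):
    """True if the assigned device is the only one with an active context."""
    return all(active == (dev_id == local_rank) for dev_id, active in states.items())


class FakeDriver:
    """Hypothetical stand-in for libcuda that tracks a set of active devices."""

    def __init__(self, active_devices):
        self.active_devices = set(active_devices)

    def cuInit(self, flags):
        return CUDA_SUCCESS

    def cuDeviceGet(self, dev_ptr, ordinal):
        dev_ptr[0] = ordinal  # write the device handle through the pointer
        return CUDA_SUCCESS

    def cuDevicePrimaryCtxGetState(self, dev, flags_ptr, active_ptr):
        active_ptr[0] = 1 if dev.value in self.active_devices else 0
        return CUDA_SUCCESS


# A worker on GPU 1 that still holds a GPU 0 context fails the check;
# after the stale context is gone, the check passes.
stale = primary_context_states(FakeDriver({0, 1}), device_count=2)
clean = primary_context_states(FakeDriver({1}), device_count=2)
```

In a real test the `driver` argument would be the loaded `libcuda` handle inside the forked worker, and the check would run after `Worker.init_device()`.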

Generated with Claude Code

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a mechanism to release stale CUDA primary contexts inherited by worker processes after a fork, preventing memory waste and issues with external tools. This is achieved through a new _release_stale_cuda_primary_contexts function in gpu_worker.py that utilizes the CUDA driver API, along with a new test suite to verify the cleanup. Feedback suggests adding error handling for the CUDA driver API return values to ensure robustness if an API call fails.

Comment thread vllm/v1/worker/gpu_worker.py Outdated
Comment on lines +125 to +145
    libcuda.cuInit(0)
    device_count = torch.cuda.device_count()

    for dev_id in range(device_count):
        if dev_id == local_rank:
            continue
        dev = ctypes.c_int()
        libcuda.cuDeviceGet(ctypes.byref(dev), dev_id)
        flags = ctypes.c_uint()
        state = ctypes.c_int()
        libcuda.cuDevicePrimaryCtxGetState(
            dev, ctypes.byref(flags), ctypes.byref(state)
        )
        if state.value != 0:
            libcuda.cuDevicePrimaryCtxRelease(dev)
            logger.debug(
                "Released stale CUDA primary context for device %d "
                "(worker assigned to device %d)",
                dev_id,
                local_rank,
            )

Priority: high

The CUDA driver API functions (cuInit, cuDeviceGet, cuDevicePrimaryCtxGetState) return a CUresult integer where 0 (CUDA_SUCCESS) indicates success. The current implementation does not check these return values. If a call fails (e.g., due to driver issues or invalid device indices), output variables such as dev or state are not guaranteed to be written and may still hold their initial value (0 for a default-constructed ctypes.c_int), so subsequent calls can act on the wrong device or misread the context state. It is highly recommended to verify the success of each driver API call.

    if libcuda.cuInit(0) != 0:
        return
    device_count = torch.cuda.device_count()

    for dev_id in range(device_count):
        if dev_id == local_rank:
            continue
        dev = ctypes.c_int()
        if libcuda.cuDeviceGet(ctypes.byref(dev), dev_id) != 0:
            continue
        flags = ctypes.c_uint()
        state = ctypes.c_int()
        if libcuda.cuDevicePrimaryCtxGetState(
                dev, ctypes.byref(flags), ctypes.byref(state)) == 0:
            if state.value != 0:
                libcuda.cuDevicePrimaryCtxRelease(dev)
                logger.debug(
                    "Released stale CUDA primary context for device %d "
                    "(worker assigned to device %d)",
                    dev_id,
                    local_rank,
                )
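
The repeated `!= 0` comparisons in the suggestion above could be folded into a small helper that also records which call failed. The following is a sketch only: `cu_ok` is a hypothetical name and is not part of this PR or of vLLM.

```python
import logging

logger = logging.getLogger(__name__)

CUDA_SUCCESS = 0  # CUresult value for success in the CUDA driver API


def cu_ok(result, api_name):
    """Return True if a driver API call succeeded; log the CUresult otherwise."""
    if result == CUDA_SUCCESS:
        return True
    logger.debug("%s failed with CUresult %d", api_name, result)
    return False
```

A call site would then read `if not cu_ok(libcuda.cuDeviceGet(ctypes.byref(dev), dev_id), "cuDeviceGet"): continue`. The same wrapper could guard `cuDevicePrimaryCtxRelease`, whose return value the suggestion does not check, so that the "Released stale CUDA primary context" message is logged only when the release actually succeeded.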

lokashrinav and others added 2 commits May 17, 2026 04:21
When vLLM uses the fork multiprocessing method, child worker processes
inherit the parent's active CUDA primary contexts for all devices.
A worker assigned to GPU 1 retains a stale GPU 0 context from the
parent, wasting memory and causing failures in tools like NVIDIA
cuda-checkpoint that enumerate per-process CUDA contexts.

Add _release_stale_cuda_primary_contexts() to gpu_worker.py that calls
cuDevicePrimaryCtxRelease() for non-assigned devices after device setup.

Fixes vllm-project#42873

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Shrinav <lokashrinav@gmail.com>

Add return value checks for cuInit, cuDeviceGet, and
cuDevicePrimaryCtxGetState to avoid undefined behavior if
a driver call fails.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Shrinav <lokashrinav@gmail.com>
@lokashrinav force-pushed the fix/release-stale-cuda-contexts-in-workers branch from 991d395 to 14c482e on May 17, 2026 08:21
Development

Successfully merging this pull request may close these issues.

[Bug]: Forked workers retain stale CUDA primary contexts from parent process
