Release stale CUDA primary contexts inherited by forked workers by lokashrinav · Pull Request #42874 · vllm-project/vllm

lokashrinav · 2026-05-17T08:17:27Z

Summary

Release inherited CUDA primary contexts for non-assigned devices in forked worker processes
When using fork multiprocessing (Linux default), child workers inherit the parent's GPU 0 context even when assigned to GPU 1+
These stale contexts waste GPU memory and cause NVIDIA cuda-checkpoint restore failures ("invalid argument")
Add _release_stale_cuda_primary_contexts() that calls cuDevicePrimaryCtxRelease() via ctypes after device setup in Worker.init_device()
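
For orientation, here is a minimal sketch of the release step described above. It is not the PR's actual code. The names `load_cuda_driver` and `release_stale_primary_contexts`, the library name `libcuda.so.1`, and passing the driver handle in as a parameter are all assumptions made for illustration; injecting the handle lets the logic be exercised without a GPU.

```python
import ctypes
import logging

logger = logging.getLogger(__name__)

CUDA_SUCCESS = 0  # CUresult value that signals success


def load_cuda_driver(name="libcuda.so.1"):
    """Return a ctypes handle to the CUDA driver library, or None if absent.

    The default library name is an assumption for Linux systems.
    """
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


def release_stale_primary_contexts(driver, device_count, local_rank):
    """Release active primary contexts on every device except local_rank.

    Returns the device indices whose primary context was released.
    """
    released = []
    if driver is None or driver.cuInit(0) != CUDA_SUCCESS:
        return released

    for dev_id in range(device_count):
        if dev_id == local_rank:
            continue  # keep the context of the device this worker owns
        dev = ctypes.c_int()
        if driver.cuDeviceGet(ctypes.byref(dev), dev_id) != CUDA_SUCCESS:
            continue
        flags = ctypes.c_uint()
        active = ctypes.c_int()
        status = driver.cuDevicePrimaryCtxGetState(
            dev, ctypes.byref(flags), ctypes.byref(active)
        )
        if status != CUDA_SUCCESS or active.value == 0:
            continue  # query failed, or no primary context is active here
        if driver.cuDevicePrimaryCtxRelease(dev) == CUDA_SUCCESS:
            released.append(dev_id)
            logger.debug(
                "Released stale CUDA primary context for device %d "
                "(worker assigned to device %d)",
                dev_id,
                local_rank,
            )
    return released
```

On a machine without the driver library, `load_cuda_driver()` returns `None` and the release function returns an empty list, which also matches the single-GPU no-op case in the test plan.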

Context

Discovered while integrating NVIDIA cuda-checkpoint with vLLM for multi-GPU cold start optimization (related to RFC #34303). The stale inherited contexts cause cuda-checkpoint --action restore to fail because it tries to restore both the inherited GPU 0 context and the worker's actual GPU 1 context in the same process.

Why this is not duplicating an existing PR

Searched open PRs and issues - no existing fix for this. The existing _maybe_force_spawn() in system_utils.py forces spawn when CUDA is already initialized, but doesn't handle the case where fork is used intentionally or where contexts are inherited through other mechanisms.
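
To illustrate the gap described here, the following is a hypothetical sketch of a "force spawn when CUDA is already initialized" policy. It is not the body of `_maybe_force_spawn()`, which is not shown in this PR; `pick_start_method` and its parameters are invented for illustration.

```python
import multiprocessing


def pick_start_method(requested, cuda_initialized):
    """Return the multiprocessing start method to use.

    Hypothetical policy: fall back to "spawn" only when fork was
    requested and the caller reports CUDA as already initialized.
    """
    if requested == "fork" and cuda_initialized:
        return "spawn"
    return requested


# A spawn context starts children in a fresh interpreter, so they do not
# inherit the parent's CUDA state.
ctx = multiprocessing.get_context(pick_start_method("fork", True))
```

Under a policy of this shape, a worker that is still forked (because the check reported CUDA as uninitialized, or because fork was chosen deliberately) receives no cleanup, which is the case this PR targets.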

AI assistance disclosure

AI assistance was used for code review and drafting. All changes reviewed and validated by the submitter.

Test plan

tests/cuda/test_stale_context_release.py - verifies stale contexts are released while preserving the assigned device's context (requires 2+ GPUs)
Manually verified with cuda-checkpoint on H100 tp=2: checkpoint/restore cycle passes after fix, fails without it
Single-GPU systems: function is a no-op (no contexts to release)

Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces a mechanism to release stale CUDA primary contexts inherited by worker processes after a fork, preventing memory waste and issues with external tools. This is achieved through a new _release_stale_cuda_primary_contexts function in gpu_worker.py that utilizes the CUDA driver API, along with a new test suite to verify the cleanup. Feedback suggests adding error handling for the CUDA driver API return values to ensure robustness if an API call fails.

gemini-code-assist · 2026-05-17T08:19:16Z

+    libcuda.cuInit(0)
+    device_count = torch.cuda.device_count()
+
+    for dev_id in range(device_count):
+        if dev_id == local_rank:
+            continue
+        dev = ctypes.c_int()
+        libcuda.cuDeviceGet(ctypes.byref(dev), dev_id)
+        flags = ctypes.c_uint()
+        state = ctypes.c_int()
+        libcuda.cuDevicePrimaryCtxGetState(
+            dev, ctypes.byref(flags), ctypes.byref(state)
+        )
+        if state.value != 0:
+            libcuda.cuDevicePrimaryCtxRelease(dev)
+            logger.debug(
+                "Released stale CUDA primary context for device %d "
+                "(worker assigned to device %d)",
+                dev_id,
+                local_rank,
+            )


The CUDA driver API functions (cuInit, cuDeviceGet, cuDevicePrimaryCtxGetState) return a CUresult integer where 0 (CUDA_SUCCESS) indicates success. The current implementation does not check these return values. If a call fails (e.g., due to driver issues or invalid device indices), variables like dev or state will remain uninitialized, potentially leading to undefined behavior or crashes in subsequent calls. It is highly recommended to verify the success of each driver API call.

    if libcuda.cuInit(0) != 0:
        return
    device_count = torch.cuda.device_count()

    for dev_id in range(device_count):
        if dev_id == local_rank:
            continue
        dev = ctypes.c_int()
        if libcuda.cuDeviceGet(ctypes.byref(dev), dev_id) != 0:
            continue
        flags = ctypes.c_uint()
        state = ctypes.c_int()
        if libcuda.cuDevicePrimaryCtxGetState(
            dev, ctypes.byref(flags), ctypes.byref(state)) == 0:
            if state.value != 0:
                libcuda.cuDevicePrimaryCtxRelease(dev)
                logger.debug(
                    "Released stale CUDA primary context for device %d "
                    "(worker assigned to device %d)",
                    dev_id,
                    local_rank,
                )
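
The return-value checks suggested above could also be factored into a small helper that logs which call failed. This is a sketch under stated assumptions, not part of the PR: `cu_error_name` and `cu_ok` are hypothetical names, and the driver handle is passed in explicitly. It relies on the driver API function `cuGetErrorName`, which writes a pointer to the error's symbolic name.

```python
import ctypes
import logging

logger = logging.getLogger(__name__)


def cu_error_name(driver, code):
    """Return the symbolic name for a CUresult, or a numeric fallback."""
    name = ctypes.c_char_p()
    if driver.cuGetErrorName(code, ctypes.byref(name)) == 0 and name.value:
        return name.value.decode()
    return f"CUresult {code}"


def cu_ok(driver, code, call):
    """Return True when code is CUDA_SUCCESS (0); otherwise log and return False."""
    if code == 0:
        return True
    logger.warning("%s failed: %s", call, cu_error_name(driver, code))
    return False
```

With such a helper, a check could read `if not cu_ok(libcuda, libcuda.cuInit(0), "cuInit"): return`, keeping the loop body flat while still recording the failing call.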

When vLLM uses the fork multiprocessing method, child worker processes inherit the parent's active CUDA primary contexts for all devices. A worker assigned to GPU 1 retains a stale GPU 0 context from the parent, wasting memory and causing failures in tools like NVIDIA cuda-checkpoint that enumerate per-process CUDA contexts.

Add _release_stale_cuda_primary_contexts() to gpu_worker.py that calls cuDevicePrimaryCtxRelease() for non-assigned devices after device setup.

Fixes vllm-project#42873

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Shrinav <lokashrinav@gmail.com>

Add return value checks for cuInit, cuDeviceGet, and cuDevicePrimaryCtxGetState to avoid undefined behavior if a driver call fails.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Shrinav <lokashrinav@gmail.com>

lokashrinav requested a review from njhill as a code owner May 17, 2026 08:17

mergify Bot added nvidia v1 labels May 17, 2026

github-project-automation Bot added this to NVIDIA May 17, 2026

gemini-code-assist Bot reviewed May 17, 2026

lokashrinav and others added 2 commits May 17, 2026 04:21

lokashrinav force-pushed the fix/release-stale-cuda-contexts-in-workers branch from 991d395 to 14c482e Compare May 17, 2026 08:21

Sunt-ing mentioned this pull request Jun 1, 2026

[Bugfix] Detect driver-level CUDA init before fork #44252

Open

lokashrinav wants to merge 2 commits into vllm-project:main from lokashrinav:fix/release-stale-cuda-contexts-in-workers
