
[Bugfix] fix CUDA illegal memory access when sleep mode is triggered during request processing#28721

Open
cynton503 wants to merge 1 commit into vllm-project:main from cynton503:fix-sleep-mode-bug

Conversation


@cynton503 cynton503 commented Nov 14, 2025

This PR fixes a bug where the scheduler continues to process requests after sleep mode has freed GPU memory, causing "CUDA error: an illegal memory access was encountered" and crashing the engine.

This fixes #28714

Below is the test plan. More test cases will be added later (such as triggering sleep while the model is processing multiple requests, or while the scheduler is using batch_queue), along with related code fixes.

Test Plan

# Started vLLM server with sleep mode enabled
VLLM_LOGGING_LEVEL=DEBUG VLLM_SERVER_DEV_MODE=1 python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 7001 \
    --model /model/Qwen3-4B \
    --swap-space 20 \
    --no-enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 2048 \
    --served-model-name qwen3-4b \
    --enable_sleep_mode \
    --dtype bfloat16

# Sent a long output request
python test_req.py

# Triggered sleep mode during the decode phase
curl -X POST http://localhost:7001/sleep -H "Content-Type: application/json" -d '{"level": "1"}'
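For reference, a minimal sketch of what a long-output request script like test_req.py might look like (hypothetical — the script is not included in the PR; it assumes the OpenAI-compatible /v1/completions endpoint and the served model name from the launch command above):

```python
# Hypothetical sketch of test_req.py (not part of this PR).
# Endpoint, prompt, and parameter values are assumptions; the point is
# simply to request enough output tokens that /sleep can be hit mid-decode.
import json
import urllib.request


def build_payload(model: str = "qwen3-4b", max_tokens: int = 4096) -> dict:
    """Build a completion request long enough to still be decoding
    when the sleep endpoint is called."""
    return {
        "model": model,
        "prompt": "Write a long story about a space voyage.",
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def send_request(url: str = "http://localhost:7001/v1/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload()).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()


if __name__ == "__main__":
    print(send_request())
```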

Test Result

After the fix, the model can successfully sleep and wake up:
(screenshot omitted)

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix for a CUDA illegal memory access error that occurs when sleep mode is triggered during request processing. The fix uses shared memory to signal the sleep state to the scheduler. My review focuses on improving the robustness of the shared memory implementation by addressing a race condition, preventing resource leaks, and handling exceptions more safely.

Comment on lines +135 to +143
def set_sleep_signal(value: int = 1, shared_memory_name: str = 'sleep_signal') -> None:
    try:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=False, size=4)
    except FileNotFoundError:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=True, size=4)

    if shm is not None:
        shm.buf[0:4] = value.to_bytes(4, 'little')
        shm.close()
Contributor


high

This implementation has a race condition when creating the shared memory, which can lead to an unhandled FileExistsError. It also doesn't guarantee that shm.close() is called if an error occurs. Furthermore, the shared memory block is never unlinked, which can cause resource leaks and lead to stale state if the server restarts.

A more robust implementation should handle the race condition atomically, ensure resource closure with a try...finally block, and the application should manage the shared memory lifecycle (e.g., in the lifespan context manager) to create it on startup and unlink it on shutdown.

Suggested change

Before:

def set_sleep_signal(value: int = 1, shared_memory_name: str = 'sleep_signal') -> None:
    try:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=False, size=4)
    except FileNotFoundError:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=True, size=4)
    if shm is not None:
        shm.buf[0:4] = value.to_bytes(4, 'little')
        shm.close()

After:

def set_sleep_signal(value: int = 1, shared_memory_name: str = 'sleep_signal') -> None:
    shm = None
    try:
        try:
            shm = shared_memory.SharedMemory(name=shared_memory_name, create=True, size=4)
        except FileExistsError:
            shm = shared_memory.SharedMemory(name=shared_memory_name, create=False, size=4)
        shm.buf[0:4] = value.to_bytes(4, 'little')
    finally:
        if shm is not None:
            shm.close()
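The lifecycle management the reviewer recommends (create on startup, unlink on shutdown) could be sketched framework-agnostically with a context manager; in a FastAPI server this logic would live in the lifespan handler. The function name is hypothetical and not part of this PR:

```python
# Sketch of application-managed shared-memory lifecycle, as the review
# suggests. Name and structure are assumptions, not code from this PR.
from contextlib import contextmanager
from multiprocessing import shared_memory


@contextmanager
def sleep_signal_lifecycle(name: str = "sleep_signal"):
    """Create the shared flag on startup (reset to 0) and unlink it on
    shutdown, so stale state never survives a server restart."""
    try:
        shm = shared_memory.SharedMemory(name=name, create=True, size=4)
    except FileExistsError:
        # A previous run left the segment behind; reuse it, but reset it.
        shm = shared_memory.SharedMemory(name=name, create=False)
    shm.buf[0:4] = (0).to_bytes(4, "little")  # known-good initial state
    try:
        yield shm
    finally:
        shm.close()
        shm.unlink()  # remove the segment so no stale flag persists
```

Any process reading the flag during the server's lifetime then sees 0 until /sleep sets it, and a crashed run's leftover segment is reset on the next startup.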

Comment on lines +76 to +81
try:
    shm = shared_memory.SharedMemory(name=shared_memory_name)
    sleep_signal = int.from_bytes(shm.buf[0:4], 'little')
    shm.close()
except:
    pass
Contributor


high

Using a bare except: is dangerous as it can hide unexpected errors, such as issues with buffer access or data conversion. It's better to catch only the specific exception you expect, in this case FileNotFoundError. Additionally, using a try...finally block ensures that shm.close() is called even if an error occurs while reading from the shared memory, preventing resource leaks.

Suggested change

Before:

try:
    shm = shared_memory.SharedMemory(name=shared_memory_name)
    sleep_signal = int.from_bytes(shm.buf[0:4], 'little')
    shm.close()
except:
    pass

After:

shm = None
try:
    shm = shared_memory.SharedMemory(name=shared_memory_name)
    sleep_signal = int.from_bytes(shm.buf[0:4], 'little')
except FileNotFoundError:
    pass
finally:
    if shm is not None:
        shm.close()


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +134 to +143
from multiprocessing import shared_memory

def set_sleep_signal(value: int = 1, shared_memory_name: str = 'sleep_signal') -> None:
    try:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=False, size=4)
    except FileNotFoundError:
        shm = shared_memory.SharedMemory(name=shared_memory_name, create=True, size=4)

    if shm is not None:
        shm.buf[0:4] = value.to_bytes(4, 'little')
        shm.close()


P1 Badge Shared sleep flag persists across process restarts

The new set_sleep_signal creates a named shared memory block but never unlinks or resets it. If the server crashes or terminates after /sleep sets the flag to 1, the OS keeps /sleep_signal around with that value. Any new vLLM instance started on the same host will open the existing block and read 1, causing check_stop to immediately mark all requests as FINISHED_STOPPED even though the engine has not actually gone to sleep. This makes a fresh server unusable until someone issues /wake_up or manually deletes the shared memory segment. The shared flag should be cleaned up or defaulted to 0 during startup.

Useful? React with 👍 / 👎.
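The stale-flag scenario Codex describes can be defended against with an explicit startup reset that removes any leftover segment before the engine begins serving. A sketch (the function name is hypothetical, not code from this PR):

```python
# Sketch of the startup cleanup Codex recommends: drop any sleep flag
# left over from a previous process. Function name is an assumption.
from multiprocessing import shared_memory


def reset_sleep_signal_on_startup(name: str = "sleep_signal") -> None:
    """Remove a stale sleep flag so a fresh server never starts out
    believing it is asleep."""
    try:
        shm = shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        return  # nothing left over; nothing to clean
    try:
        shm.unlink()  # drop the stale segment entirely
    finally:
        shm.close()
```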

@cynton503 cynton503 force-pushed the fix-sleep-mode-bug branch 4 times, most recently from 6a5ab49 to e06a88e on November 14, 2025 at 10:54
…during request processing

Signed-off-by: root <root@nscc-ds-4a100-node10.cluster.local>
@cynton503 cynton503 closed this Nov 16, 2025
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Nov 16, 2025
@cynton503 cynton503 reopened this Nov 16, 2025
@cynton503 cynton503 closed this Nov 16, 2025
@cynton503 cynton503 reopened this Nov 16, 2025
@heheda12345
Collaborator

CC @22quinn

Collaborator

@22quinn 22quinn left a comment


The shared memory implementation feels hacky to me. Would it be possible to rely on this new pause_generation API in #28037 to avoid CUDA IMA?

@cynton503
Author

The shared memory implementation feels hacky to me. Would it be possible to rely on this new pause_generation API in #28037 to avoid CUDA IMA?

Thanks for the suggestion. I tested it, and the pause_generation API can avoid the CUDA IMA. However, there are issues when using it with pipeline parallelism (some states are incorrect after sleep, e.g., the number of running requests is not 0). I'll try to fix this bug and update the commit.
(screenshot omitted)

@mergify mergify bot added the bug Something isn't working label Jan 14, 2026

mergify bot commented Jan 14, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @cynton503.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: CUDA Illegal Memory Access Error When Sleep Mode is Triggered During Request Processing

3 participants