[Core] Cleanup engine pause/sleep logic #34528
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Code Review
This pull request refactors the engine's pause and sleep logic, centralizing it in the EngineCore base class and improving the separation of concerns between request submission and execution. While the overall refactoring is a positive change for code clarity and maintainability, I've identified a critical logic bug in the new pause_scheduler implementation that could lead to deadlocks.
hao-aaron left a comment:
LGTM, just some clarifying questions.
```python
# Pause scheduler before sleeping.
clear_prefix_cache = level >= 1
pause_future = self.pause_scheduler(mode=mode, clear_cache=clear_prefix_cache)
if level < 1:
    return pause_future
```
What are your thoughts on removing the whole pause/resume API and merging it with sleep/wakeup?
I think it would be a good idea, but we should put some thought into the API, and in any case we'll need to keep the existing ones to begin with for backwards compatibility. So we can consider that as a follow-on change.
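The control flow in the quoted hunk can be sketched as a small self-contained example. This is an illustration only: the `SketchEngine` class, its attributes, and the resolved-`Future` pause signal are hypothetical stand-ins for vLLM's actual engine classes, kept just to show how `level` selects between "pause only" and "pause, then sleep and clear the prefix cache".

```python
from concurrent.futures import Future


class SketchEngine:
    """Hypothetical stand-in illustrating the pause-then-sleep flow
    from the quoted diff; not vLLM's real EngineCore."""

    def __init__(self) -> None:
        self.paused = False
        self.prefix_cache_cleared = False
        self.asleep = False

    def pause_scheduler(self, mode: str, clear_cache: bool) -> Future:
        # Stop scheduling new requests; optionally drop the prefix cache.
        self.paused = True
        self.prefix_cache_cleared = clear_cache
        fut: Future = Future()
        fut.set_result(None)  # stands in for "in-flight work has drained"
        return fut

    def sleep(self, level: int, mode: str = "wait") -> Future:
        # Pause scheduler before sleeping; level >= 1 also clears prefix cache.
        clear_prefix_cache = level >= 1
        pause_future = self.pause_scheduler(mode=mode, clear_cache=clear_prefix_cache)
        if level < 1:
            # Level 0: pause only, no executor sleep.
            return pause_future
        pause_future.result()  # wait for the pause to complete
        self.asleep = True  # real code would call model_executor.sleep(level)
        return pause_future


engine = SketchEngine()
engine.sleep(level=0)  # pauses, keeps prefix cache, does not sleep
engine2 = SketchEngine()
engine2.sleep(level=1)  # pauses, clears prefix cache, then sleeps
```

The sketch also makes the level-0 question above concrete: with `clear_cache = level >= 1`, a level-0 sleep always retains the prefix cache.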
```python
self.model_executor.sleep(level)

# Pause scheduler before sleeping.
clear_prefix_cache = level >= 1
```
Are we okay with sleep level=0 always retaining the prefix cache?
Test failures are all known, existing, unrelated errors.

This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts: # vllm/entrypoints/llm.py
Co-authored-by: Aaron Hao <ahao@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com>
CI failures are known issues on main.

I think https://buildkite.com/vllm/ci/builds/52826/steps/canvas?sid=019c8d4d-b78f-43f0-810a-461f7d2befce&tab=output is related; it's not the spec decoding test that is failing.
markmc left a comment:
Nice work, lgtm now 👍
(Obviously we still need to sort out the test failures before merging.)
Ah thanks @DarkLight1337, will check that; it was the flaky spec decode test that failed before I retried it!
hao-aaron left a comment:
LGTM! Thanks for adding my tests.

The only failure is the known, unrelated audio model failure on main.
### Summary

- Fix `abort_generation()` and `sleep()` abort logic that broke silently after the vllm 0.16.0 bump (#1240)
- Add a backward-compatible `_get_unfinished_request_ids()` helper to resolve the internal vs external request ID mismatch
- Fixes #1243

### Root Cause

In vllm 0.16.0, [`InputProcessor.assign_request_id()`](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/input_processor.py) now creates **internal** request IDs (with a random suffix) that are distinct from the user-provided **external** request IDs:

```python
request.external_req_id = request.request_id  # save original as external
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"  # new internal ID
```

Our code was reading request IDs from `output_processor.request_states.keys()` (which are now **internal** IDs) and passing them to `engine.abort()` with `internal=False` (the default). The abort looked them up in the `external_req_ids` mapping, found nothing, and **silently did nothing**. Requests completed normally with `finish_reason="length"` instead of `"abort"`.

This broke fully async RL's pause/resume flow, which relies on abort returning partial outputs with `finish_reason="abort"` so the retry loop can re-submit with accumulated tokens.

Related vllm changes:

- vllm-project/vllm#32103
- vllm-project/vllm#32351
- vllm-project/vllm#34125
- vllm-project/vllm#34528

### Fix

Add a `_get_unfinished_request_ids()` static method on `BaseVLLMInferenceEngine` that:

- Uses `output_processor.external_req_ids.keys()` when available (vllm 0.16.0+)
- Falls back to `output_processor.request_states.keys()` for older vllm versions

Applied to all three abort call sites:

1. `AsyncVLLMInferenceEngine.abort_generation()` — used by fully async pause/resume
2. `AsyncVLLMInferenceEngine.sleep()` — cleanup before sleep
3. `VLLMInferenceEngine.sleep()` — sync engine cleanup before sleep

### Test plan

- [x] `test_abort_generation_vllm_engine` — passes (was failing with `assert 'length' == 'abort'`)
- [x] `test_continue_generation_vllm_engine_chat_completion` — passes
- [x] `test_continue_generation_generate_vllm_engine_generation` — passes
- [x] E2E fully async gsm8k (`gsm8k_fully_async_ci` project) — ran ~12 training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI): https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci

<img width="2163" height="976" alt="image" src="https://github.com/user-attachments/assets/eaece0dc-ca53-4dd1-b3d1-2f6e308a8a47" />

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
This is a follow-on from #33195, which I didn't get a chance to review, and the subsequent merging with #34125.
- (`VLLM_ENABLE_V1_MULTIPROCESSING=0` / external launcher mode), i.e. move from `EngineCoreProc` to `EngineCore`
- `LLM.enqueue` logic
- `OutputProcessor._requests_drained` logic missed from "[Core] Move pause and resume functions into engine" #34125

cc @jaewonlee-fb @houseroad @zhuohan123 @hao-aaron
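The refactor direction described above (shared pause/sleep logic living on the base engine class rather than only the multiprocess subclass) can be illustrated generically. The `BaseCore`/`ProcCore` names and methods below are hypothetical stand-ins, not vLLM's actual classes; the point is only that logic defined on the base class is reachable both in-process and via the subclass's IPC path.

```python
class BaseCore:
    """Hypothetical stand-in for a base engine class: pause logic lives
    here so both in-process and multiprocess engines share it."""

    def __init__(self) -> None:
        self.paused = False

    def pause(self) -> None:
        self.paused = True


class ProcCore(BaseCore):
    """Stand-in for the multiprocess variant: it only adds dispatch
    plumbing, reusing inherited logic instead of redefining it."""

    def handle_rpc(self, method: str) -> None:
        getattr(self, method)()  # dispatch an incoming request by name


in_process = BaseCore()  # e.g. the VLLM_ENABLE_V1_MULTIPROCESSING=0 path
in_process.pause()

multiproc = ProcCore()
multiproc.handle_rpc("pause")  # same shared logic, reached via dispatch
```

Keeping the logic on the base class means the in-process path cannot silently drift out of sync with the multiprocess one.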
Resolves #31619
Resolves #15483
Replaces #30186, #28721