[Feat][RL] Pause and Resume with keep requests for single engine #32351

robertgshaw2-redhat merged 18 commits into vllm-project:main from
Conversation
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Documentation preview: https://vllm--32351.org.readthedocs.build/en/32351/
Code Review
This pull request introduces a "keep" mode for pausing and resuming generation, which freezes requests in the queue. The implementation spans multiple components, from the API endpoint to the engine core and scheduler. The changes are logical and include a new example for testing. However, I've identified a critical race condition in AsyncLLM.pause_generation that could lead to inconsistent states, and a case of code duplication in EngineCore that should be refactored for better maintainability.
/gemini review
Code Review
This pull request introduces a pause and resume feature with a "keep" mode for single-engine setups, which is a valuable addition for managing engine state. The changes are comprehensive, touching the API, engine protocol, core scheduling logic, and client implementation. A new example is also included to demonstrate and test the feature. The implementation is mostly solid, but I've identified a potential race condition in the client implementation that could lead to issues if used concurrently, even though the current usage pattern in AsyncLLM prevents it. Addressing this would make the implementation more robust.
kouroshHakha
left a comment
some comments / questions
```python
continue
```

```python
if output_handler is not None:
    assert _self_ref is not None
```
Hi @ahao-anyscale, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
kouroshHakha
left a comment
My main question is: why should clear_cache be ignored in the case of mode=keep? That is not a fundamental limitation, other than wanting to separate concerns between pause/resume and kv-cache clearance, right?
> 2. Controller task: sends pause/resume commands

> If the engine properly pauses, we should see a gap in token timestamps
> matching the pause duration.
What would be the example output of this example?
SumanthRH
left a comment
Nice! Left minor comments primarily on examples/tests
Thanks @hao-aaron
Re the race condition: I had actually missed the fact that new requests are blocked (now that I've looked closer at the original PR #28037). So it's not such an issue, but there is technically still a race condition because of how the lock is scoped. That's independent of this PR though, so I can open a separate one for it (opened #33926).
Re "keep" mode: as mentioned above, I think this is different from the other two modes in that it should work fine with multiple API servers. It also doesn't require locking _pause_cond to synchronize with new requests, since we are essentially pausing in the engine.
So I think you could move the handling of that in pause_generation() outside of the _pause_cond context manager, and not subject it to the _client_count check (but leave the check for the other two modes).
I guess the only caveat is that the is_paused method won't work in the "keep" case; not sure how important that is, though.
nice, the utility output really simplified things. glad it could work with the existing infra. I will let Nick give final signoff
Thanks for the reviews! Do we anticipate needing to add pause/resume support for api-server-count > 1 in the future?
Yes, for the DP case |
njhill
left a comment
Thanks @hao-aaron, just one last comment!
```python
pause_token_idx = len(token_times)
print(f"Paused! Sleeping for {PAUSE_DURATION}s...")

# Sleep while paused - no tokens should be generated during this time
```
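The pause check this example relies on ("a gap in token timestamps matching the pause duration") can be sketched as a small self-contained timestamp scan; `token_times` and `PAUSE_DURATION` below are simulated stand-ins for the values the real example collects:

```python
# Find the largest gap between consecutive token timestamps; if the
# engine really paused, that gap should roughly match PAUSE_DURATION.
PAUSE_DURATION = 5.0


def largest_gap(token_times: list[float]) -> tuple[int, float]:
    """Return (index, seconds) of the biggest inter-token gap."""
    gaps = [(i, b - a)
            for i, (a, b) in enumerate(zip(token_times, token_times[1:]), start=1)]
    return max(gaps, key=lambda g: g[1])


# Simulated timestamps: steady streaming, then a ~5s pause after token 3.
token_times = [0.0, 0.1, 0.2, 0.3, 5.35, 5.45, 5.55]
idx, gap = largest_gap(token_times)
assert abs(gap - PAUSE_DURATION) < 0.5, "no pause-sized gap found"
print(f"pause detected before token {idx}: {gap:.2f}s gap")
```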
I found that using sleep here works without any problems, but referencing the parameter update implementation at url causes a deadlock. Are there any related tests for this functionality currently?
I think this PR is excellent. However, I’m seeing the following warnings when using the feature: In addition, after pause → update weights → resume, the model sometimes fails to terminate. Has this PR been tested for this workflow (pause/resume with in-flight requests)? If so, could you share the test setup or any recommended usage constraints (e.g., requiring all requests to drain before resetting caches)?
Hi @Freder-chen, thanks for trying the code. Right now there hasn't been any investigation into the interactions between this new keep mode and clearing caches, but I can address it in a follow-up PR.
Could you elaborate on what you mean by this? Anything you can share to help me reproduce would be great.
I’m using Qwen3-Thinking. After a pause → update weights → resume workflow, the generation becomes corrupted: the subsequent generated token_ids are either all zeros or decode into garbled output (e.g., 。。。。。????xxxx). I’ve already isolated the components:
So this looks like a state/caching issue triggered by resuming generation after an in-place weight update (possibly KV cache / prefix cache / tokenizer state / internal buffers becoming inconsistent), but I haven’t confirmed the root cause yet. Additionally, I'd like to confirm whether the pause/resume methods conflict with the sleep and wake_up methods; I'd like to try this implementation in a hybrid engine.
I referenced the implementation at https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/new_weight_syncing/rlhf_async_new_apis.py. |
…m-project#32351) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
### Summary

- Fix `abort_generation()` and `sleep()` abort logic that broke silently after the vllm 0.16.0 bump (#1240)
- Add backward-compatible `_get_unfinished_request_ids()` helper to resolve internal vs external request ID mismatch
- Fixes #1243

### Root Cause

In vllm 0.16.0, [`InputProcessor.assign_request_id()`](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/input_processor.py) now creates **internal** request IDs (with a random suffix) that are distinct from the user-provided **external** request IDs:

```python
request.external_req_id = request.request_id  # save original as external
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"  # new internal ID
```

Our code was reading request IDs from `output_processor.request_states.keys()` (which are now **internal** IDs) and passing them to `engine.abort()` with `internal=False` (the default). The abort looked them up in the `external_req_ids` mapping, found nothing, and **silently did nothing**. Requests completed normally with `finish_reason="length"` instead of `"abort"`.

This broke fully async RL's pause/resume flow, which relies on abort returning partial outputs with `finish_reason="abort"` so the retry loop can re-submit with accumulated tokens.

Related vllm changes:

- vllm-project/vllm#32103
- vllm-project/vllm#32351
- vllm-project/vllm#34125
- vllm-project/vllm#34528

### Fix

Add a `_get_unfinished_request_ids()` static method on `BaseVLLMInferenceEngine` that:

- Uses `output_processor.external_req_ids.keys()` when available (vllm 0.16.0+)
- Falls back to `output_processor.request_states.keys()` for older vllm versions

Applied to all three abort call sites:

1. `AsyncVLLMInferenceEngine.abort_generation()` — used by fully async pause/resume
2. `AsyncVLLMInferenceEngine.sleep()` — cleanup before sleep
3. `VLLMInferenceEngine.sleep()` — sync engine cleanup before sleep

### Test plan

- [x] `test_abort_generation_vllm_engine` — passes (was failing with `assert 'length' == 'abort'`)
- [x] `test_continue_generation_vllm_engine_chat_completion` — passes
- [x] `test_continue_generation_generate_vllm_engine_generation` — passes
- [x] E2E fully async gsm8k (`gsm8k_fully_async_ci` project) — ran ~12 training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI): https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Purpose
Completes part 1 of #32103
We introduce a new "mode" parameter to the pause and resume APIs, allowing the following behaviors:
- mode="abort": all inflight requests are immediately aborted and partially generated sequences are returned to callers
- mode="wait": before pausing, we wait for all inflight requests to finish first
- mode="keep": we stop all inflight requests asap, but do not return. From the perspective of the caller, it will appear as if token streaming is taking a long time.

Test Plan

pause_resume.py

Test Result
passing
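The semantics of the three modes can be sketched as a toy simulation. This is not the vLLM scheduler; `ToyEngine`, its fields, and the toy "max_tokens" cutoff are all illustrative stand-ins:

```python
import asyncio


class ToyEngine:
    """Toy scheduler illustrating the three pause modes."""

    def __init__(self):
        self.inflight: dict[str, list[int]] = {}  # req_id -> tokens so far
        self.aborted: dict[str, list[int]] = {}   # partial outputs returned
        self.frozen = False                       # "keep": stop but hold requests

    async def pause(self, mode: str = "abort") -> None:
        if mode == "abort":
            # Return partial sequences to callers immediately.
            self.aborted.update(self.inflight)
            self.inflight.clear()
        elif mode == "wait":
            # Drain: let inflight requests run to completion first.
            while self.inflight:
                await self.step()
        elif mode == "keep":
            # Freeze in place; callers just see a long token gap.
            self.frozen = True

    async def step(self) -> None:
        if self.frozen:
            return
        for req_id in list(self.inflight):
            tokens = self.inflight[req_id]
            tokens.append(len(tokens))
            if len(tokens) >= 4:  # toy "max_tokens"
                del self.inflight[req_id]


async def demo(mode: str) -> ToyEngine:
    eng = ToyEngine()
    eng.inflight["r0"] = [0, 1]  # one request, two tokens generated so far
    await eng.pause(mode)
    await eng.step()  # a scheduler tick after the pause request
    return eng
```

With "keep", `asyncio.run(demo("keep")).inflight` still holds the partial request untouched; with "abort", the partial tokens move to `aborted`; with "wait", the request runs to completion before the pause takes effect.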
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.