[Feat][RL] Pause and Resume with keep requests for single engine #32351

Merged
robertgshaw2-redhat merged 18 commits into vllm-project:main from hao-aaron:pause-resume
Feb 7, 2026

Conversation

@hao-aaron
Contributor

@hao-aaron hao-aaron commented Jan 14, 2026

Purpose

Completes part 1 of #32103
We introduce a new "mode" parameter to the pause and resume APIs, allowing for the following behaviors:

  • mode="abort": all inflight requests are immediately aborted, and partially generated sequences are returned to callers
  • mode="wait": before pausing, we wait for all inflight requests to finish first
  • NEW: mode="keep": we stop all inflight requests as soon as possible, but do not return them. From the caller's perspective, it will appear as if token streaming is simply taking a long time.
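
The three modes above can be sketched as follows. This is a toy illustration of the intended semantics only: the `Engine` class and its fields are stand-ins invented for this sketch, not the vLLM implementation; only the mode names and behaviors come from the description above.

```python
class Engine:
    """Toy stand-in illustrating the three pause modes described above."""

    def __init__(self):
        # one in-flight request with some partially generated tokens
        self.inflight = {"req-1": ["tok", "tok", "tok"]}
        self.paused = False

    def pause_generation(self, mode="abort"):
        if mode == "abort":
            # abort immediately; partial sequences go back to callers
            partial = dict(self.inflight)
            self.inflight.clear()
            self.paused = True
            return partial
        if mode == "wait":
            # drain first: inflight requests run to completion (modeled
            # here by simply clearing them), then pause
            self.inflight.clear()
            self.paused = True
            return {}
        if mode == "keep":
            # freeze requests in place; nothing is returned, so callers
            # just see token streaming stall until resume
            self.paused = True
            return {}
        raise ValueError(f"unknown mode: {mode}")

    def resume_generation(self):
        self.paused = False
```

The key difference: only "keep" leaves `inflight` populated while paused, which is what lets generation continue from where it stopped after resume.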

Test Plan

  • include new example pause_resume.py
  • include unit tests

Test Result

passing


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@mergify

mergify bot commented Jan 14, 2026

Documentation preview: https://vllm--32351.org.readthedocs.build/en/32351/

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a "keep" mode for pausing and resuming generation, which freezes requests in the queue. The implementation spans multiple components, from the API endpoint to the engine core and scheduler. The changes are logical and include a new example for testing. However, I've identified a critical race condition in AsyncLLM.pause_generation that could lead to inconsistent states, and a case of code duplication in EngineCore that should be refactored for better maintainability.

@hao-aaron
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a pause and resume feature with a "keep" mode for single-engine setups, which is a valuable addition for managing engine state. The changes are comprehensive, touching the API, engine protocol, core scheduling logic, and client implementation. A new example is also included to demonstrate and test the feature. The implementation is mostly solid, but I've identified a potential race condition in the client implementation that could lead to issues if used concurrently, even though the current usage pattern in AsyncLLM prevents it. Addressing this would make the implementation more robust.

Collaborator

@kouroshHakha kouroshHakha left a comment


some comments / questions

continue

if output_handler is not None:
assert _self_ref is not None
Collaborator


why remove the assert?

@hao-aaron hao-aaron marked this pull request as ready for review January 29, 2026 05:24
@hao-aaron hao-aaron changed the title [WIP] Pause and Resume with keep requests for single engine [Feat][RL] Pause and Resume with keep requests for single engine Jan 29, 2026
@mergify

mergify bot commented Jan 29, 2026

Hi @ahao-anyscale, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@kouroshHakha kouroshHakha added ready ONLY add when PR is ready to merge/full CI is needed and removed documentation Improvements or additions to documentation labels Jan 29, 2026
@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 29, 2026
Collaborator

@kouroshHakha kouroshHakha left a comment


My main question is: why should clear_cache be ignored in the case of mode="keep"? That is not a fundamental limitation, just a matter of wanting to separate concerns between pause/resume and KV-cache clearance, right?

2. Controller task: sends pause/resume commands

If the engine properly pauses, we should see a gap in token timestamps
matching the pause duration.
Collaborator


What would be the example output of this example?

Collaborator

@kouroshHakha kouroshHakha left a comment


LGTM

Contributor

@SumanthRH SumanthRH left a comment


Nice! Left minor comments primarily on examples/tests

Member

@njhill njhill left a comment


Thanks @hao-aaron

Re the race condition, I had actually missed the fact that new requests are blocked (now that I've looked closer at the original PR #28037). So it's not such an issue, but there is technically still a race condition because of how the lock is scoped. That's kind of independent of this PR though, so I can open a separate one for it (opened #33926).

Re "keep" mode, as mentioned above I think this is different from the other two modes in that it should work fine with multiple API servers. It also doesn't require locking _pause_cond to synchronize with new requests, since we are essentially pausing in the engine.

So I think you could move the handling of that in pause_generation() outside of the _pause_cond context manager, and not subject it to the _client_count check (but leave the check for the other two modes).

I guess the only caveat is that the is_paused method won't work in the "keep" case, not sure how important that is though.
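
The suggested restructuring can be sketched roughly as follows. Only `pause_generation`, `_pause_cond`, and `_client_count` are names from the comment above; the class, `_send_pause_to_engine`, and the exact form of the client-count check are assumptions for illustration, not the vLLM code.

```python
import asyncio


class PauseClient:
    """Toy sketch of scoping "keep" outside the _pause_cond lock."""

    def __init__(self, client_count=1):
        self._pause_cond = asyncio.Condition()
        self._client_count = client_count
        self._paused = False

    async def _send_pause_to_engine(self, mode):
        pass  # stand-in for the RPC into the engine core

    async def pause_generation(self, mode="abort"):
        if mode == "keep":
            # Pausing happens inside the engine itself, so no gating of
            # new requests is needed and multiple API servers are fine.
            await self._send_pause_to_engine(mode)
            return
        async with self._pause_cond:
            # "abort"/"wait" still require a single API server.
            assert self._client_count == 1
            self._paused = True
            await self._send_pause_to_engine(mode)
```

Note that `_paused` is never set in the "keep" branch, which matches the caveat that `is_paused` won't reflect a "keep"-mode pause.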

@robertgshaw2-redhat
Collaborator

nice, the utility output really simplified things. glad it could work with the existing infra. I will let Nick give final signoff

@hao-aaron
Contributor Author

Thanks for the reviews! Do we anticipate needing to add pause resume support for api-server-count > 1 in the future?

@robertgshaw2-redhat
Collaborator

Thanks for the reviews! Do we anticipate needing to add pause resume support for api-server-count > 1 in the future?

Yes, for the DP case

Member

@njhill njhill left a comment


Thanks @hao-aaron, just one last comment!

Member

@njhill njhill left a comment


Thanks @hao-aaron !

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) February 6, 2026 21:55
@robertgshaw2-redhat robertgshaw2-redhat merged commit 89a385d into vllm-project:main Feb 7, 2026
48 checks passed
pause_token_idx = len(token_times)
print(f"Paused! Sleeping for {PAUSE_DURATION}s...")

# Sleep while paused - no tokens should be generated during this time


I found that using sleep here works without any problems, but referencing the parameter update implementation at url causes a deadlock. Are there any related tests for this functionality currently?

@Freder-chen

I think this PR is excellent. However, I’m seeing the following warnings when using the feature:

[block_pool.py:435] Failed to reset prefix cache because some blocks (5150) are not freed yet
[core.py:578] Resetting the multi-modal cache when requests are in progress may lead to desynced internal caches.
[core.py:606] Resetting the encoder cache when requests are in progress may lead to desynced internal caches.

In addition, after pause → update weights → resume, the model sometimes fails to terminate.

Has this PR been tested for this workflow (pause/resume with in-flight requests)? If so, could you share the test setup or any recommended usage constraints (e.g., requiring all requests to drain before resetting caches)?

@hao-aaron
Contributor Author

Hi @Freder-chen, thanks for trying the code. Right now there hasn't been any investigation into the interactions between this new keep mode and clearing caches, but I can address it in a follow-up PR.

In addition, after pause → update weights → resume, the model sometimes fails to terminate.

Could you elaborate on what you mean by this? Anything you can share to help me reproduce would be great.

@Freder-chen

Hi @Freder-chen, thanks for trying the code. Right now there hasn't been any investigation into the interactions between this new keep mode and clearing caches, but I can address it in a follow-up PR.

In addition, after pause → update weights → resume, the model sometimes fails to terminate.

Could you elaborate on what you mean by this? Anything you can share to help me reproduce would be great.

I’m using Qwen3-Thinking. After a pause → update weights → resume workflow, the generation becomes corrupted: the subsequent generated token_ids are either all zeros or decode into garbled output (e.g., 。。。。。????xxxx).

I’ve already isolated the components:

  1. The weight update method is correct when tested independently (without pause/resume).
  2. Pause/resume alone works correctly if I don’t update weights in between.
  3. The abnormal behavior only happens when I do pause → update weights → resume.

So this looks like a state/caching issue triggered by resuming generation after an in-place weight update (possibly KV cache / prefix cache / tokenizer state / internal buffers becoming inconsistent), but I haven’t confirmed the root cause yet.

Additionally, I'd like to confirm whether the pause/resume methods conflict with the sleep and wake_up methods; I'd like to try this implementation in a hybrid engine.

@Freder-chen

I referenced the implementation at https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/new_weight_syncing/rlhf_async_new_apis.py.

ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…m-project#32351)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
SumanthRH pushed a commit to NovaSky-AI/SkyRL that referenced this pull request Mar 2, 2026
)

### Summary

- Fix `abort_generation()` and `sleep()` abort logic that broke silently
after the vllm 0.16.0 bump (#1240)
- Add backward-compatible `_get_unfinished_request_ids()` helper to
resolve internal vs external request ID mismatch
- Fixes #1243

### Root Cause

In vllm 0.16.0,
[`InputProcessor.assign_request_id()`](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/input_processor.py)
now creates **internal** request IDs (with a random suffix) that are
distinct from the user-provided **external** request IDs:

```python
request.external_req_id = request.request_id                       # save original as external
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"  # new internal ID
```

Our code was reading request IDs from
`output_processor.request_states.keys()` (which are now **internal**
IDs) and passing them to `engine.abort()` with `internal=False` (the
default). The abort looked them up in the `external_req_ids` mapping,
found nothing, and **silently did nothing**. Requests completed normally
with `finish_reason="length"` instead of `"abort"`.

This broke fully async RL's pause/resume flow, which relies on abort
returning partial outputs with `finish_reason="abort"` so the retry loop
can re-submit with accumulated tokens.

Related vllm changes:
- vllm-project/vllm#32103
- vllm-project/vllm#32351
- vllm-project/vllm#34125
- vllm-project/vllm#34528

### Fix

Add a `_get_unfinished_request_ids()` static method on
`BaseVLLMInferenceEngine` that:
- Uses `output_processor.external_req_ids.keys()` when available (vllm
0.16.0+)
- Falls back to `output_processor.request_states.keys()` for older vllm
versions

Applied to all three abort call sites:
1. `AsyncVLLMInferenceEngine.abort_generation()` — used by fully async
pause/resume
2. `AsyncVLLMInferenceEngine.sleep()` — cleanup before sleep
3. `VLLMInferenceEngine.sleep()` — sync engine cleanup before sleep
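
A rough sketch of the helper described above, written as a standalone function for illustration (the real code is a static method on `BaseVLLMInferenceEngine`; the attribute names `external_req_ids` and `request_states` follow the description above, and the `SimpleNamespace` processors stand in for vLLM's `OutputProcessor`):

```python
from types import SimpleNamespace


def get_unfinished_request_ids(output_processor):
    """Prefer external request IDs (vllm >= 0.16.0), else internal ones."""
    external = getattr(output_processor, "external_req_ids", None)
    if external is not None:
        return list(external.keys())
    # older vllm: request_states is keyed by the user-provided IDs
    return list(output_processor.request_states.keys())


# vllm >= 0.16.0 style: external -> internal mapping present
new_style = SimpleNamespace(
    external_req_ids={"req-1": "req-1-abc123"},
    request_states={"req-1-abc123": object()},
)
# pre-0.16.0 style: only request_states, keyed by external IDs
old_style = SimpleNamespace(request_states={"req-1": object()})

print(get_unfinished_request_ids(new_style))  # ['req-1']
print(get_unfinished_request_ids(old_style))  # ['req-1']
```

Either way the caller gets external IDs, which is what `engine.abort()` expects with `internal=False`.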

### Test plan

- [x] `test_abort_generation_vllm_engine` — passes (was failing with
`assert 'length' == 'abort'`)
- [x] `test_continue_generation_vllm_engine_chat_completion` — passes
- [x] `test_continue_generation_generate_vllm_engine_generation` —
passes
- [x] E2E fully async gsm8k (`gsm8k_fully_async_ci` project) — ran ~12
training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI)
https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci

<img width="2163" height="976" alt="image"
src="https://github.com/user-attachments/assets/eaece0dc-ca53-4dd1-b3d1-2f6e308a8a47"
/>



Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…m-project#32351)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

Labels

documentation (Improvements or additions to documentation), frontend, ready (ONLY add when PR is ready to merge/full CI is needed), v1
