[Feat][RL] Pause and Resume with keep requests for single engine #32351

robertgshaw2-redhat merged 18 commits into vllm-project:main from
Conversation
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Documentation preview: https://vllm--32351.org.readthedocs.build/en/32351/
Code Review
This pull request introduces a "keep" mode for pausing and resuming generation, which freezes requests in the queue. The implementation spans multiple components, from the API endpoint to the engine core and scheduler. The changes are logical and include a new example for testing. However, I've identified a critical race condition in AsyncLLM.pause_generation that could lead to inconsistent states, and a case of code duplication in EngineCore that should be refactored for better maintainability.
/gemini review
Code Review
This pull request introduces a pause and resume feature with a "keep" mode for single-engine setups, which is a valuable addition for managing engine state. The changes are comprehensive, touching the API, engine protocol, core scheduling logic, and client implementation. A new example is also included to demonstrate and test the feature. The implementation is mostly solid, but I've identified a potential race condition in the client implementation that could lead to issues if used concurrently, even though the current usage pattern in AsyncLLM prevents it. Addressing this would make the implementation more robust.
kouroshHakha
left a comment
some comments / questions
```python
continue
```

```python
if output_handler is not None:
    assert _self_ref is not None
```
Hi @ahao-anyscale, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
kouroshHakha
left a comment
My main question is: why should clear_cache be ignored in the case of mode=keep? That is not a fundamental limitation, other than wanting to separate concerns between pause/resume and kv-cache clearance, right?
> 2. Controller task: sends pause/resume commands

> If the engine properly pauses, we should see a gap in token timestamps
> matching the pause duration.
What would be the example output of this example?
SumanthRH
left a comment
Nice! Left minor comments primarily on examples/tests
Thanks @hao-aaron
Re the race condition: I had actually missed the fact that new requests are blocked (now that I've looked closer at the original PR #28037). So it's not such an issue, but there is technically still a race condition because of how the lock is scoped. That's independent of this PR though, so I can open a separate one for it (opened #33926).
Re "keep" mode: as mentioned above, I think this is different from the other two modes in that it should work fine with multiple API servers. It also doesn't require locking _pause_cond to synchronize with new requests, since we are essentially pausing in the engine.
So I think you could move the handling of that in pause_generation() outside of the _pause_cond context manager, and not subject it to the _client_count check (but leave the check for the other two modes).
I guess the only caveat is that the is_paused method won't work in the "keep" case; not sure how important that is, though.
nice, the utility output really simplified things. glad it could work with the existing infra. I will let Nick give final signoff
Thanks for the reviews! Do we anticipate needing to add pause/resume support for api-server-count > 1 in the future?
Yes, for the DP case |
njhill
left a comment
Thanks @hao-aaron, just one last comment!
```python
pause_token_idx = len(token_times)
print(f"Paused! Sleeping for {PAUSE_DURATION}s...")

# Sleep while paused - no tokens should be generated during this time
```
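The pause check this example relies on ("a gap in token timestamps matching the pause duration") can be sketched as a small self-contained timestamp scan; `token_times` and `PAUSE_DURATION` below are simulated stand-ins for the values the real example collects:

```python
# Find the largest gap between consecutive token timestamps; if the
# engine really paused, that gap should roughly match PAUSE_DURATION.
PAUSE_DURATION = 5.0


def largest_gap(token_times: list[float]) -> tuple[int, float]:
    """Return (index, seconds) of the biggest inter-token gap."""
    gaps = [(i, b - a)
            for i, (a, b) in enumerate(zip(token_times, token_times[1:]), start=1)]
    return max(gaps, key=lambda g: g[1])


# Simulated timestamps: steady streaming, then a ~5s pause after token 3.
token_times = [0.0, 0.1, 0.2, 0.3, 5.35, 5.45, 5.55]
idx, gap = largest_gap(token_times)
assert abs(gap - PAUSE_DURATION) < 0.5, "no pause-sized gap found"
print(f"pause detected before token {idx}: {gap:.2f}s gap")
```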
I found that using sleep here works without any problems, but referencing the parameter update implementation at url causes a deadlock. Are there any related tests for this functionality currently?
I think this PR is excellent. However, I’m seeing the following warnings when using the feature: In addition, after pause → update weights → resume, the model sometimes fails to terminate. Has this PR been tested for this workflow (pause/resume with in-flight requests)? If so, could you share the test setup or any recommended usage constraints (e.g., requiring all requests to drain before resetting caches)?
Hi @Freder-chen, thanks for trying the code. Right now there hasn't been any investigation into the interactions between this new keep mode and clearing caches, but I can address it in a follow-up PR.
Could you elaborate on what you mean by this? Anything you can share to help me reproduce would be great.
I’m using Qwen3-Thinking. After a pause → update weights → resume workflow, the generation becomes corrupted: the subsequent generated token_ids are either all zeros or decode into garbled output (e.g., 。。。。。????xxxx). I’ve already isolated the components:
So this looks like a state/caching issue triggered by resuming generation after an in-place weight update (possibly KV cache / prefix cache / tokenizer state / internal buffers becoming inconsistent), but I haven’t confirmed the root cause yet. Additionally, I'd like to confirm whether the pause/resume methods conflict with the sleep and wake_up methods; I'd like to try this implementation in a hybrid engine.
I referenced the implementation at https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/new_weight_syncing/rlhf_async_new_apis.py. |
…m-project#32351) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
### Summary

- Fix `abort_generation()` and `sleep()` abort logic that broke silently after the vllm 0.16.0 bump (#1240)
- Add backward-compatible `_get_unfinished_request_ids()` helper to resolve internal vs external request ID mismatch
- Fixes #1243

### Root Cause

In vllm 0.16.0, [`InputProcessor.assign_request_id()`](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/input_processor.py) now creates **internal** request IDs (with a random suffix) that are distinct from the user-provided **external** request IDs:

```python
request.external_req_id = request.request_id  # save original as external
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"  # new internal ID
```

Our code was reading request IDs from `output_processor.request_states.keys()` (which are now **internal** IDs) and passing them to `engine.abort()` with `internal=False` (the default). The abort looked them up in the `external_req_ids` mapping, found nothing, and **silently did nothing**. Requests completed normally with `finish_reason="length"` instead of `"abort"`.

This broke fully async RL's pause/resume flow, which relies on abort returning partial outputs with `finish_reason="abort"` so the retry loop can re-submit with accumulated tokens.

Related vllm changes:

- vllm-project/vllm#32103
- vllm-project/vllm#32351
- vllm-project/vllm#34125
- vllm-project/vllm#34528

### Fix

Add a `_get_unfinished_request_ids()` static method on `BaseVLLMInferenceEngine` that:

- Uses `output_processor.external_req_ids.keys()` when available (vllm 0.16.0+)
- Falls back to `output_processor.request_states.keys()` for older vllm versions

Applied to all three abort call sites:

1. `AsyncVLLMInferenceEngine.abort_generation()` — used by fully async pause/resume
2. `AsyncVLLMInferenceEngine.sleep()` — cleanup before sleep
3. `VLLMInferenceEngine.sleep()` — sync engine cleanup before sleep

### Test plan

- [x] `test_abort_generation_vllm_engine` — passes (was failing with `assert 'length' == 'abort'`)
- [x] `test_continue_generation_vllm_engine_chat_completion` — passes
- [x] `test_continue_generation_generate_vllm_engine_generation` — passes
- [x] E2E fully async gsm8k (`gsm8k_fully_async_ci` project) — ran ~12 training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI): https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Purpose
Completes part 1 of #32103
We introduce a new "mode" parameter to the pause and resume APIs, allowing the following behaviors:
- mode="abort": all inflight requests are immediately aborted and partially generated sequences are returned to callers
- mode="wait": before pausing, we wait for all inflight requests to finish first
- mode="keep": we stop all inflight requests asap, but do not return. From the perspective of the caller, it will appear as if token streaming is taking a long time.

Test Plan

pause_resume.py

Test Result
passing
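The semantics of the three modes can be sketched as a toy simulation. This is not the vLLM scheduler; `ToyEngine`, its fields, and the toy "max_tokens" cutoff are all illustrative stand-ins:

```python
import asyncio


class ToyEngine:
    """Toy scheduler illustrating the three pause modes."""

    def __init__(self):
        self.inflight: dict[str, list[int]] = {}  # req_id -> tokens so far
        self.aborted: dict[str, list[int]] = {}   # partial outputs returned
        self.frozen = False                       # "keep": stop but hold requests

    async def pause(self, mode: str = "abort") -> None:
        if mode == "abort":
            # Return partial sequences to callers immediately.
            self.aborted.update(self.inflight)
            self.inflight.clear()
        elif mode == "wait":
            # Drain: let inflight requests run to completion first.
            while self.inflight:
                await self.step()
        elif mode == "keep":
            # Freeze in place; callers just see a long token gap.
            self.frozen = True

    async def step(self) -> None:
        if self.frozen:
            return
        for req_id in list(self.inflight):
            tokens = self.inflight[req_id]
            tokens.append(len(tokens))
            if len(tokens) >= 4:  # toy "max_tokens"
                del self.inflight[req_id]


async def demo(mode: str) -> ToyEngine:
    eng = ToyEngine()
    eng.inflight["r0"] = [0, 1]  # one request, two tokens generated so far
    await eng.pause(mode)
    await eng.step()  # a scheduler tick after the pause request
    return eng
```

With "keep", `asyncio.run(demo("keep")).inflight` still holds the partial request untouched; with "abort", the partial tokens move to `aborted`; with "wait", the request runs to completion before the pause takes effect.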
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.