[Core] Move pause and resume functions into engine #34125
vllm-bot merged 25 commits into vllm-project:main
Conversation
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Documentation preview: https://vllm--34125.org.readthedocs.build/en/34125/
/gemini review
Code Review
This pull request refactors the pause/resume functionality to support data-parallel deployments by centralizing the logic within EngineCore. Pause and resume commands are now broadcast to all engine instances, ensuring a consistent state. A new DeferredUtilityResult mechanism is introduced to handle asynchronous operations that require multiple steps to complete, making the pause operation more robust. The changes are well-tested with new unit tests and a new example for data-parallel pause/resume. My review identifies a couple of areas for improvement: one related to a potentially unreliable synchronization mechanism using queue.qsize(), and another concerning a possibly outdated NotImplementedError that may be overly restrictive with the new architecture.
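The `queue.qsize()` concern flagged above can be illustrated in isolation: the Python standard library documents `qsize()` as an approximate count that must not be relied on for synchronization, since its value can be stale by the time it is read. A minimal sketch of the reliable alternative, using an explicit sentinel instead of polling the size (this is illustrative code, not the vllm implementation):

```python
import queue

def drain_until_sentinel(q: queue.Queue) -> list:
    """Consume items until an explicit sentinel is seen.

    This is race-free, unlike looping `while q.qsize() > 0`, where a
    producer may still be mid-put (or a consumer mid-get) when the
    size is sampled.
    """
    items = []
    while True:
        item = q.get()
        if item is None:  # sentinel: producer signals completion
            break
        items.append(item)
    return items

q = queue.Queue()
for i in range(3):
    q.put(i)
q.put(None)  # producer is done
print(drain_until_sentinel(q))  # [0, 1, 2]
```

The same idea applies with a `threading.Event` or a completion callback when no natural sentinel value exists.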
Code Review
This pull request introduces a significant refactoring to enable pause and resume functionality in data parallel deployments. The core logic for pausing and resuming is now centralized in EngineCore, and these actions are broadcast to all engine instances, ensuring a consistent state. A new DeferredUtilityResult mechanism is introduced to handle asynchronous operations that need to wait for specific conditions, which is a clean and effective pattern. The changes are well-implemented and supported by new and updated tests. My main concern is a potential IndexError in the new example test script, which could cause it to crash under certain conditions.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Aaron Hao <ahao@anyscale.com>
kouroshHakha
left a comment
Did the first pass. Overall looks good. Needs some touch ups in the implementation of the deferred utility to bring down its complexity
njhill
left a comment
Thanks @hao-aaron, overall I like the approach.
Have a couple of initial simplification comments from a first pass.
Signed-off-by: hao-aaron <ahao@anyscale.com>
/gemini review
Code Review
This pull request is a significant and well-executed refactoring that moves the pause and resume logic into the EngineCore. This is a crucial step for enabling this functionality in data-parallel deployments. The introduction of the UtilityFuture mechanism for handling deferred operations is a solid design choice for managing asynchronous completion of pause modes. The code is cleaner, and the logic is now centralized, which greatly improves maintainability. I've found one critical issue related to aborting requests that are submitted while the engine is paused, which I've detailed in a specific comment. Otherwise, the changes look excellent.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Aaron Hao <ahao@anyscale.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Hi @hao-aaron, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Force-pushed from 8bcc74a to 0f68852 (compare)
Signed-off-by: hao-aaron <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Looks like this got changed during vllm-project#34125 iterations, but the docs got out of sync. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>
### Summary
- Fix `abort_generation()` and `sleep()` abort logic that broke silently after the vllm 0.16.0 bump (#1240)
- Add backward-compatible `_get_unfinished_request_ids()` helper to resolve internal vs external request ID mismatch
- Fixes #1243

### Root Cause
In vllm 0.16.0, [`InputProcessor.assign_request_id()`](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/input_processor.py) now creates **internal** request IDs (with a random suffix) that are distinct from the user-provided **external** request IDs:

```python
request.external_req_id = request.request_id  # save original as external
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"  # new internal ID
```

Our code was reading request IDs from `output_processor.request_states.keys()` (which are now **internal** IDs) and passing them to `engine.abort()` with `internal=False` (the default). The abort looked them up in the `external_req_ids` mapping, found nothing, and **silently did nothing**. Requests completed normally with `finish_reason="length"` instead of `"abort"`. This broke fully async RL's pause/resume flow, which relies on abort returning partial outputs with `finish_reason="abort"` so the retry loop can re-submit with accumulated tokens.

Related vllm changes:
- vllm-project/vllm#32103
- vllm-project/vllm#32351
- vllm-project/vllm#34125
- vllm-project/vllm#34528

### Fix
Add a `_get_unfinished_request_ids()` static method on `BaseVLLMInferenceEngine` that:
- Uses `output_processor.external_req_ids.keys()` when available (vllm 0.16.0+)
- Falls back to `output_processor.request_states.keys()` for older vllm versions

Applied to all three abort call sites:
1. `AsyncVLLMInferenceEngine.abort_generation()` — used by fully async pause/resume
2. `AsyncVLLMInferenceEngine.sleep()` — cleanup before sleep
3. `VLLMInferenceEngine.sleep()` — sync engine cleanup before sleep

### Test plan
- [x] `test_abort_generation_vllm_engine` — passes (was failing with `assert 'length' == 'abort'`)
- [x] `test_continue_generation_vllm_engine_chat_completion` — passes
- [x] `test_continue_generation_generate_vllm_engine_generation` — passes
- [x] E2E fully async gsm8k (`gsm8k_fully_async_ci` project) — ran ~12 training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI): https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>
Purpose
Relevant RFC: #32103
Follow-up to #32351; enables pause and resume when the API server count is > 1. This is necessary for data-parallel deployments.
To keep a consistent pause state across multiple `AsyncLLM` instances, we push all pause logic down to the `EngineCore` level. Pause and resume requests received by one `AsyncLLM` are broadcast to all `EngineCore`s, so all engine instances share the same paused state, and any `AsyncLLM` can be queried for the current paused state. New requests submitted while paused behave the same way, i.e. they are frozen until resume.

To ensure that pause only returns when the operation is completely finished (`mode=wait` finishes processing, `mode=abort` finishes processing all aborts), I introduce the `FutureUtility` mechanism, which allows a custom callback function to be specified that runs every step and decides when to return the utility output and unblock the utility caller. An example is given for `mode=wait`: a return value of `DeferredUtility` lets the busy loop know to stash the callback function and call it every step, checking for the finish condition.
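The deferred-utility pattern described above can be sketched abstractly: the busy loop stashes a per-call check, polls it once per engine step, and only delivers the result (unblocking the caller) once the check reports completion. All names below are illustrative, not the actual vllm classes:

```python
from typing import Any, Callable, Optional

class DeferredUtilityResult:
    """Returned by a utility op that cannot complete within one step.

    `check` runs once per step; a non-None return value resolves the call.
    """
    def __init__(self, check: Callable[[], Optional[Any]]):
        self.check = check

def busy_loop_steps(pending: list, max_steps: int = 10) -> list:
    """Each step, poll every stashed check; resolved results are collected
    and their checks dropped, mirroring how a core loop would unblock callers."""
    results = []
    for _ in range(max_steps):
        still_pending = []
        for deferred in pending:
            out = deferred.check()
            if out is None:
                still_pending.append(deferred)  # not finished; retry next step
            else:
                results.append(out)             # finished; unblock the caller
        pending = still_pending
        if not pending:
            break
    return results

# Example: a "pause" that completes once an abort counter drains to zero.
remaining = [2]
def pause_done():
    remaining[0] -= 1
    return "paused" if remaining[0] <= 0 else None

print(busy_loop_steps([DeferredUtilityResult(pause_done)]))  # ['paused']
```

The key property is that the utility handler returns immediately (with a deferred marker) so the step loop is never blocked, while the caller's future resolves only when the finish condition holds.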
Support for DPEP is not included in this PR and will come in a separate one.
Test Plan
Unit tests
Test Result
passing
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.