[rollout] feat: support auto resume on abort in FullyAsyncLLMServerManager by wuxibin89 · Pull Request #5430 · verl-project/verl

wuxibin89 · 2026-02-27T16:57:14Z

What does this PR do?

Pass the trainer's global_steps to the rollout when updating weights.
FullyAsyncLLMServerManager: automatically resume generation during partial rollout, making rollout interruptions invisible to the AgentLoop.

A follow-up PR will clean up verl/experimental/fully_async_policy/agent_loop.
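
The auto-resume idea described above can be sketched roughly as follows. This is a toy illustration, not the code in this PR: `FakeServer`, `FakeOutput`, `generate_with_auto_resume`, and the `"aborted"` stop reason are all hypothetical names chosen for the example. The wrapper re-issues the request with the tokens generated so far appended to the prompt, so the caller only ever sees one complete response.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeOutput:
    """Hypothetical stand-in for a rollout server's token output."""

    token_ids: list[int]
    stop_reason: str  # "aborted" when a weight update interrupted generation
    global_steps: int  # weight version that produced these tokens


class FakeServer:
    """Toy server: emits at most 3 tokens per call, aborting until the budget is met."""

    def __init__(self):
        self.global_steps = 0

    async def generate(self, prompt_ids, max_tokens):
        n = min(3, max_tokens)
        tokens = list(range(len(prompt_ids), len(prompt_ids) + n))
        stop = "completed" if n == max_tokens else "aborted"
        out = FakeOutput(tokens, stop, self.global_steps)
        if stop == "aborted":
            self.global_steps += 1  # pretend the trainer pushed new weights
        return out


async def generate_with_auto_resume(server, prompt_ids, max_tokens):
    """Re-issue generation after each abort so the caller sees one response."""
    response_ids, versions = [], []
    while True:
        out = await server.generate(
            prompt_ids + response_ids, max_tokens - len(response_ids)
        )
        response_ids += out.token_ids
        versions.append(out.global_steps)  # one weight version per segment
        if out.stop_reason != "aborted" or len(response_ids) >= max_tokens:
            return response_ids, versions
```

Calling `asyncio.run(generate_with_auto_resume(FakeServer(), [100, 101], 8))` in this toy setup stitches three segments into one response, each segment tagged with the weight version that produced it.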

gemini-code-assist

Code Review

This pull request introduces an auto-resume feature for FullyAsyncLLMServerManager to handle interruptions during partial rollouts, making them transparent to the AgentLoop. The implementation involves passing global_steps to track model versions across various components, including checkpointing and different rollout backends (sglang, vllm, trtllm). The changes are well-integrated, and a new test case effectively validates the auto-resume functionality. I've identified one critical issue regarding a potential race condition due to in-place modification of a shared dictionary and have provided a suggestion to resolve it. The rest of the changes appear solid and consistently applied.

PeterSH6 · 2026-02-28T08:55:28Z

    num_preempted: Optional[int] = None
    """number of preempted times for metric calculation"""

+    global_steps: Optional[int] = None


Recording the global steps and min/max global steps in the rollout engine is a bit weird. Why do we need this?

Replaced with extra_info for any auxiliary information from rollout.

Recording global_steps in the rollout engine makes it possible to monitor the weight version of each trajectory segment in partial rollout.
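
As a rough illustration of why per-segment weight versions are useful, the sketch below computes the min/max global steps over the segments of one trajectory. The `Segment` class, the `"global_steps"` key inside `extra_info`, and `weight_version_span` are assumptions made for this example; only the existence of an `extra_info` field is taken from the discussion above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Segment:
    """Hypothetical trajectory segment generated under a single weight version."""

    token_ids: list[int]
    extra_info: dict[str, Any] = field(default_factory=dict)


def weight_version_span(segments):
    """Return (min, max) global_steps over segments, or None if none were recorded."""
    steps = [
        s.extra_info["global_steps"]
        for s in segments
        if s.extra_info.get("global_steps") is not None
    ]
    if not steps:
        return None
    return min(steps), max(steps)
```

A span such as `(7, 8)` would indicate that the trajectory was interrupted once and finished under newer weights, which is the kind of staleness metric a trainer might want to log.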

PeterSH6

This PR LGTM.
The only concern is whether the test cases cover update_weights with a None input parameter.

PeterSH6 · 2026-03-03T13:54:35Z

+    for global_steps in range(initial_steps, initial_steps + train_steps):
+        # wait a while and update weights to interrupt the generation
+        await asyncio.sleep(3)
+        await checkpoint_manager.update_weights(global_steps=global_steps)


Does the test case include update_weights(global_steps=None)?

Added a test case, _run_update_weights_with_global_steps_none.
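
A minimal sketch of what exercising both paths could look like, using a stub in place of the real checkpoint manager. `StubCheckpointManager` and `run_updates` are hypothetical names for this example; the stub only records the arguments it receives and makes no claim about how verl treats `global_steps=None`.

```python
import asyncio
from typing import Optional


class StubCheckpointManager:
    """Hypothetical stand-in; it records calls instead of syncing weights."""

    def __init__(self):
        self.calls: list[Optional[int]] = []

    async def update_weights(self, global_steps: Optional[int] = None) -> None:
        self.calls.append(global_steps)


async def run_updates(manager, initial_steps, train_steps, interval=0.0):
    # Mirror the loop in the test: wait, then push weights tagged with the step.
    for global_steps in range(initial_steps, initial_steps + train_steps):
        await asyncio.sleep(interval)
        await manager.update_weights(global_steps=global_steps)
    # Also exercise the path where no step is supplied.
    await manager.update_weights(global_steps=None)
    return manager.calls
```

In a real test the interval would be long enough for generation to be in flight when the update arrives, as with the 3-second sleep in the diff above.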

SnowCharmQ · 2026-03-09T13:27:36Z

Hi, after I updated verl to the latest version, an error occurs because of this PR:

ImportError: cannot import name 'ContinueGenerationReqInput' from 'sglang.srt.managers.io_struct' (/root/miniconda3/envs/searchr1/lib/python3.12/site-packages/sglang/srt/managers/io_struct.py)

It seems the default version of sglang is not new enough to support this feature?

SnowCharmQ · 2026-03-09T13:34:38Z

I reviewed the implementation in https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/io_struct.py, and it seems that the current implementation is not complete?

SnowCharmQ · 2026-03-10T03:24:16Z

    set_global_state,
 )
 from sglang.srt.managers.io_struct import (
+    ContinueGenerationReqInput,


It seems that this definition was introduced in sglang v0.5.6, whereas the default version of sglang used by verl is v0.5.2, so the import fails.
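
One way to check an environment for this mismatch before running is to probe for the symbol and report the installed version. The helpers below (`has_symbol`, `installed_version`) are hypothetical names for this sketch and use only the standard library; they are not part of verl or sglang.

```python
import importlib
from importlib import metadata
from typing import Optional


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if module_name imports cleanly and exposes symbol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError as well: the package is absent or broken.
        return False
    return hasattr(module, symbol)


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed distribution's version string, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

For the case reported here, `has_symbol("sglang.srt.managers.io_struct", "ContinueGenerationReqInput")` together with `installed_version("sglang")` would show whether the installed sglang is new enough.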

…sing fully_async (#5487) ### What does this PR do? Refactor the fully_async code based on #5430 to support the gateway mode, and decouple the tool invocation and rollout processes during the partial rollout phase. Feature `use_trainer_do_validate` is not ready for use; I will fix it in a subsequent PR.

wuxibin89 requested review from ArronHZG, PeterSH6, chenhaiq, eric-haibin-lin, tongyx361 and vermouth1992 as code owners February 27, 2026 16:57

gemini-code-assist Bot reviewed Feb 27, 2026

Comment thread verl/experimental/fully_async_policy/agent_loop/agent_loop.py

wuxibin89 requested a review from ISEEKYAN as a code owner February 28, 2026 03:10

ArronHZG previously approved these changes Feb 28, 2026

wuxibin89 dismissed ArronHZG’s stale review via 83a1ab8 February 28, 2026 03:36

wuxibin89 mentioned this pull request Feb 28, 2026

[roadmap] verl 26Q1 roadmap #4880

Open

28 tasks

PeterSH6 reviewed Feb 28, 2026

PeterSH6 reviewed Mar 3, 2026

wuxibin89 added 7 commits March 3, 2026 22:03

[rollout] feat: support auto resume on abort in AsyncLLMServerManager

3f05759

fix ci

a22b32b

unit test

57ddcf8

relax max_global_steps

780680b

fix ci

2012bb9

add TokenOutput.extra_info

ba41b47

add test case for update_weights(global_steps=None)

76596b6

wuxibin89 force-pushed the wuxibin/resume_client branch from 839e07c to 76596b6 Compare March 3, 2026 14:31

ArronHZG approved these changes Mar 4, 2026

wuxibin89 merged commit c4593e3 into verl-project:main Mar 4, 2026
135 of 243 checks passed

ArronHZG mentioned this pull request Mar 4, 2026

[fully_async, one_step_off] feat: support auto resume on abort when using fully_async #5487

Merged

8 tasks

SnowCharmQ reviewed Mar 10, 2026

ArronHZG mentioned this pull request Mar 19, 2026

feat: async partial rollout trainer with sample supplementation and caching verl-project/verl-recipe#58

Open

8 tasks
