[rollout] feat: support auto resume on abort in FullyAsyncLLMServerManager#5430

Merged
wuxibin89 merged 7 commits into verl-project:main from wuxibin89:wuxibin/resume_client
Mar 4, 2026

Conversation

@wuxibin89
Collaborator

@wuxibin89 wuxibin89 commented Feb 27, 2026

What does this PR do?

  1. Pass the trainer's global_steps to rollout when updating weights.
  2. FullyAsyncLLMServerManager: support auto-resuming generation during partial rollout, making rollout interruption invisible to the AgentLoop.

A follow-up PR will clean up verl/experimental/fully_async_policy/agent_loop.
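
A minimal sketch of the auto-resume idea described above, not the actual `FullyAsyncLLMServerManager` code. `TokenOutput`, `FakeServer`, `generate_with_resume`, and the `"aborted"` stop reason are hypothetical names chosen for illustration; the fake server simply stops after two tokens per call to imitate an interruption caused by a weight update.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class TokenOutput:
    """Hypothetical result of one generation call."""
    token_ids: list
    stop_reason: str  # "completed" or "aborted" (assumed labels)
    extra_info: dict = field(default_factory=dict)


class FakeServer:
    """Stand-in rollout server: every call is cut off after `chunk` tokens."""

    def __init__(self, chunk: int = 2):
        self.chunk = chunk
        self.global_steps = 0

    async def generate(self, prompt_ids: list, max_tokens: int) -> TokenOutput:
        await asyncio.sleep(0)
        n = min(self.chunk, max_tokens)
        new_tokens = [100 + len(prompt_ids) + i for i in range(n)]
        reason = "completed" if n == max_tokens else "aborted"
        out = TokenOutput(new_tokens, reason, {"global_steps": self.global_steps})
        self.global_steps += 1  # pretend a weight update interrupted the call
        return out


async def generate_with_resume(server, prompt_ids: list, max_tokens: int) -> TokenOutput:
    """Re-submit prompt + partial response until generation is no longer aborted."""
    response = []
    steps_seen = []
    while True:
        out = await server.generate(prompt_ids + response, max_tokens - len(response))
        response.extend(out.token_ids)
        steps_seen.append(out.extra_info["global_steps"])
        if out.stop_reason != "aborted" or len(response) >= max_tokens:
            break
    # The caller sees one uninterrupted result plus the weight versions involved.
    return TokenOutput(response, out.stop_reason, {"global_steps": steps_seen})
```

Calling `asyncio.run(generate_with_resume(FakeServer(), [1, 2, 3], 5))` returns a single `TokenOutput`, so a caller such as an agent loop never has to handle the intermediate aborts itself.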

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an auto-resume feature for FullyAsyncLLMServerManager to handle interruptions during partial rollouts, making them transparent to the AgentLoop. The implementation involves passing global_steps to track model versions across various components, including checkpointing and different rollout backends (sglang, vllm, trtllm). The changes are well-integrated, and a new test case effectively validates the auto-resume functionality. I've identified one critical issue regarding a potential race condition due to in-place modification of a shared dictionary and have provided a suggestion to resolve it. The rest of the changes appear solid and consistently applied.
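
The review does not say which dictionary is affected, so the following is only an illustrative sketch of the general hazard it names: coroutines that modify a shared dictionary in place can see each other's changes at `await` points. The `sampling_params` dictionary and both function names are assumptions made for this example.

```python
import asyncio


async def generate_in_place(sampling_params: dict, used: int) -> int:
    # Hazard: this mutates the caller's dict, which other coroutines also hold.
    sampling_params["max_tokens"] -= used
    await asyncio.sleep(0)  # another coroutine may run here
    return sampling_params["max_tokens"]


async def generate_with_copy(sampling_params: dict, used: int) -> int:
    # Fix: work on a private shallow copy so concurrent requests stay isolated.
    local = dict(sampling_params)
    local["max_tokens"] -= used
    await asyncio.sleep(0)
    return local["max_tokens"]


async def run(fn):
    shared = {"max_tokens": 100, "temperature": 1.0}
    results = await asyncio.gather(fn(shared, 10), fn(shared, 20))
    return results, shared["max_tokens"]
```

With the in-place version both requests end up reading the same, doubly decremented budget, while the copying version leaves the shared dictionary untouched.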

Comment thread verl/experimental/fully_async_policy/agent_loop/agent_loop.py
@wuxibin89 wuxibin89 requested a review from ISEEKYAN as a code owner February 28, 2026 03:10
ArronHZG
ArronHZG previously approved these changes Feb 28, 2026
Comment thread verl/workers/rollout/replica.py Outdated
num_preempted: Optional[int] = None
"""number of preempted times for metric calculation"""

global_steps: Optional[int] = None
Collaborator

Recording the global steps and min/max global steps in the rollout engine is a bit weird. Why do we need this?

Collaborator Author

Replaced with extra_info for any auxiliary information from rollout.

Collaborator Author

Recording global_steps in the rollout engine is for monitoring the weight version of each trajectory segment in partial rollout.
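
As a rough illustration of that monitoring, the sketch below summarizes which weight versions contributed to one trajectory. `Segment` and `weight_version_summary` are hypothetical names, and the output keys are assumptions rather than the fields verl actually stores in `extra_info`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One uninterrupted piece of a trajectory (hypothetical structure)."""
    num_tokens: int
    global_steps: Optional[int] = None  # None if no step was reported


def weight_version_summary(segments: list) -> dict:
    """Report the oldest and newest weight version used, plus a per-token view."""
    known = [s.global_steps for s in segments if s.global_steps is not None]
    per_token = [s.global_steps for s in segments for _ in range(s.num_tokens)]
    return {
        "min_global_steps": min(known) if known else None,
        "max_global_steps": max(known) if known else None,
        "per_token_global_steps": per_token,
    }
```

A trajectory whose minimum and maximum differ was generated across at least one weight update, which is the situation partial rollout creates.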

Collaborator

@PeterSH6 PeterSH6 left a comment

This PR LGTM.
The only concern is whether the test cases cover update_weights with a None input parameter.

for global_steps in range(initial_steps, initial_steps + train_steps):
    # wait a while and update weights to interrupt the generation
    await asyncio.sleep(3)
    await checkpoint_manager.update_weights(global_steps=global_steps)
Collaborator

Does the test case include update_weights(global_steps=None)?

Collaborator Author

Added a test case, _run_update_weights_with_global_steps_none.
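
The thread does not show the body of that test, so the following is only a sketch of what exercising the `None` path could look like. `FakeCheckpointManager` and `run_updates` are hypothetical stand-ins; the real `update_weights` may treat `None` differently.

```python
import asyncio
from typing import Optional


class FakeCheckpointManager:
    """Stand-in that records the last global_steps it was given."""

    def __init__(self):
        self.rollout_global_steps: Optional[int] = None
        self.num_updates = 0

    async def update_weights(self, global_steps: Optional[int] = None) -> None:
        await asyncio.sleep(0)  # placeholder for the real weight transfer
        self.num_updates += 1
        self.rollout_global_steps = global_steps


async def run_updates(steps: list):
    """Apply a sequence of updates, including None, and record what was stored."""
    manager = FakeCheckpointManager()
    seen = []
    for step in steps:
        await manager.update_weights(global_steps=step)
        seen.append(manager.rollout_global_steps)
    return manager.num_updates, seen
```

The point of such a test is that an update without a step number still completes and leaves the recorded version in a well-defined state.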

@wuxibin89 wuxibin89 force-pushed the wuxibin/resume_client branch from 839e07c to 76596b6 Compare March 3, 2026 14:31
@wuxibin89 wuxibin89 merged commit c4593e3 into verl-project:main Mar 4, 2026
135 of 243 checks passed
guillemgt pushed a commit to guillemgt/verl that referenced this pull request Mar 9, 2026
…nager (verl-project#5430)

guillemgt added a commit to guillemgt/verl that referenced this pull request Mar 9, 2026
…nager (verl-project#5430)

@SnowCharmQ

Hi, I updated verl to the latest version, and an error occurs due to this PR:

ImportError: cannot import name 'ContinueGenerationReqInput' from 'sglang.srt.managers.io_struct' (/root/miniconda3/envs/searchr1/lib/python3.12/site-packages/sglang/srt/managers/io_struct.py)

It seems the default version of sglang is not new enough to support this feature?

@SnowCharmQ

I reviewed the implementation at https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/io_struct.py, shown below:

image

It seems that the current implementation is not complete?

set_global_state,
)
from sglang.srt.managers.io_struct import (
ContinueGenerationReqInput,

It seems that this definition was introduced in sglang v0.5.6; however, the default version of sglang used by verl is v0.5.2, so the import fails.
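
One generic way to surface this kind of version mismatch with a clearer message is to probe for the symbol before relying on it. This is a sketch using only the standard library; `has_symbol` and `require_symbol` are hypothetical helpers, not part of verl or sglang.

```python
import importlib


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if the module can be imported and exposes the given name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, symbol)


def require_symbol(module_name: str, symbol: str, hint: str) -> None:
    """Raise an ImportError that tells the user what to upgrade."""
    if not has_symbol(module_name, symbol):
        raise ImportError(f"{symbol} not found in {module_name}. {hint}")
```

For the case reported here, a call such as `require_symbol("sglang.srt.managers.io_struct", "ContinueGenerationReqInput", "Upgrade sglang.")` would fail early with an actionable message instead of a bare import error.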

wuxibin89 pushed a commit that referenced this pull request Mar 10, 2026
…sing fully_async (#5487)

### What does this PR do?

Refactor the fully_async code based on
#5430 to support the gateway
mode, and decouple the tool invocation and rollout processes during the
partial rollout phase.

The `use_trainer_do_validate` feature is not ready for use; I will fix it in
a subsequent PR.

<img width="708" height="564" alt="image"
src="https://github.com/user-attachments/assets/18e282ea-a4cf-43bc-ae1f-b4108eee8dfd"
/>

DearFishi pushed a commit to KunlunxinAD/verl that referenced this pull request Mar 20, 2026
…nager (verl-project#5430)

DearFishi pushed a commit to KunlunxinAD/verl that referenced this pull request Mar 20, 2026
…sing fully_async (verl-project#5487)

sijyang pushed a commit to sijyang/verl that referenced this pull request Apr 1, 2026
…nager (verl-project#5430)

sijyang pushed a commit to sijyang/verl that referenced this pull request Apr 1, 2026
…sing fully_async (verl-project#5487)

DaizeDong pushed a commit to DaizeDong/verl that referenced this pull request Apr 19, 2026
…nager (verl-project#5430)

DaizeDong pushed a commit to DaizeDong/verl that referenced this pull request Apr 19, 2026
…sing fully_async (verl-project#5487)
