[rollout] feat: support auto resume on abort in FullyAsyncLLMServerManager#5430
Code Review
This pull request introduces an auto-resume feature for FullyAsyncLLMServerManager to handle interruptions during partial rollouts, making them transparent to the AgentLoop. The implementation involves passing global_steps to track model versions across various components, including checkpointing and different rollout backends (sglang, vllm, trtllm). The changes are well-integrated, and a new test case effectively validates the auto-resume functionality. I've identified one critical issue regarding a potential race condition due to in-place modification of a shared dictionary and have provided a suggestion to resolve it. The rest of the changes appear solid and consistently applied.
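The summary above does not say which dictionary is shared. As a minimal sketch, assume it is a per-request `sampling_params` dict that several concurrent coroutines receive by reference. `generate_segment` and the `max_tokens` key are hypothetical names used only for illustration, not verl's actual API. Copying before mutating keeps the caller's dict unchanged:

```python
import asyncio


async def generate_segment(sampling_params: dict, already_generated: int) -> dict:
    """Hypothetical helper: derive per-call params without touching the shared dict."""
    # Shallow copy so the adjustment below is local to this call.
    local_params = dict(sampling_params)
    local_params["max_tokens"] = sampling_params["max_tokens"] - already_generated
    # Yield to the event loop, as a real request to a rollout server would.
    await asyncio.sleep(0)
    return local_params


async def run_concurrently():
    shared = {"max_tokens": 100, "temperature": 1.0}
    results = await asyncio.gather(
        generate_segment(shared, 10),
        generate_segment(shared, 40),
    )
    return shared, results
```

Had `generate_segment` written to `sampling_params["max_tokens"]` directly, the second call would have started from the value already reduced by the first.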
    num_preempted: Optional[int] = None
    """number of preempted times for metric calculation"""

    global_steps: Optional[int] = None
Recording the global steps and min/max global steps in the rollout engine is a bit weird. Why do we need this?
Replaced with extra_info for any auxiliary information from rollout.
Recording global_steps in the rollout engine is for monitoring the weight version of each trajectory segment in partial rollout.
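To illustrate the idea of tracking the weight version per trajectory segment, here is a small sketch. The `Segment` and `Trajectory` classes are hypothetical and not taken from the PR; they only show how a min/max `global_steps` range could be derived when a trajectory spans several weight updates:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Segment:
    token_ids: List[int]
    # Weight version (trainer global step) that produced this segment; None if unknown.
    global_steps: Optional[int] = None


@dataclass
class Trajectory:
    segments: List[Segment] = field(default_factory=list)

    def add(self, token_ids: List[int], global_steps: Optional[int]) -> None:
        self.segments.append(Segment(list(token_ids), global_steps))

    def version_range(self) -> Optional[Tuple[int, int]]:
        """Return (min, max) weight version over all segments with a known version."""
        versions = [s.global_steps for s in self.segments if s.global_steps is not None]
        if not versions:
            return None
        return min(versions), max(versions)
```

A trajectory whose range has equal min and max was generated entirely by one set of weights; a wider range indicates it was interrupted by at least one weight update.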
    for global_steps in range(initial_steps, initial_steps + train_steps):
        # wait a while and update weights to interrupt the generation
        await asyncio.sleep(3)
        await checkpoint_manager.update_weights(global_steps=global_steps)
Does the test case include update_weights(global_steps=None)?
Added a test case, _run_update_weights_with_global_steps_none.
839e07c to 76596b6
…nager (verl-project#5430)

### What does this PR do?

1. Pass the trainer's `global_steps` to rollout when updating weights.
2. FullyAsyncLLMServerManager: support auto-resuming generation during partial rollout, making rollout interruption invisible to the AgentLoop.

Follow-up PR: clean up `verl/experimental/fully_async_policy/agent_loop`.
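The auto-resume behaviour described above can be sketched roughly as follows. This is not the PR's implementation: `FakeServer`, `Output`, the `"abort"` stop reason, and `generate_with_resume` are all hypothetical stand-ins, assuming an aborted request returns the tokens generated so far.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Output:
    token_ids: List[int]
    stop_reason: str  # assumed values: "abort" or "completed"


class FakeServer:
    """Stand-in rollout server that aborts the first `abort_times` requests."""

    def __init__(self, abort_times: int):
        self.abort_times = abort_times
        self.calls = 0

    async def generate(self, prompt_ids: List[int], max_tokens: int) -> Output:
        self.calls += 1
        await asyncio.sleep(0)
        if self.calls <= self.abort_times:
            # Simulate a weight update interrupting generation after 2 tokens.
            return Output(list(range(min(2, max_tokens))), "abort")
        return Output(list(range(max_tokens)), "completed")


async def generate_with_resume(server: FakeServer, prompt_ids: List[int], max_tokens: int) -> List[int]:
    """Re-issue the request with prompt + partial output until it is not aborted."""
    generated: List[int] = []
    while True:
        out = await server.generate(prompt_ids + generated, max_tokens - len(generated))
        generated.extend(out.token_ids)
        if out.stop_reason != "abort" or len(generated) >= max_tokens:
            return generated
```

The caller only sees the final token list, which is the sense in which the interruption stays invisible to the agent loop in this sketch.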
Hi, I update It seems the default version of
Once I reviewed the implementation of https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/io_struct.py as shown below:
It seems that the current implementation is not complete?
        set_global_state,
    )
    from sglang.srt.managers.io_struct import (
        ContinueGenerationReqInput,
It seems that this definition was introduced in sglang v0.5.6; however, the default version of sglang used by verl is v0.5.2, so this import will raise an error.
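One common way to tolerate an older installed version is a guarded import. The sketch below is an assumption about how that could look, not the fix adopted in the PR; `HAS_CONTINUE_GENERATION` and `require_continue_generation` are hypothetical names, and only the import path comes from the diff above.

```python
try:
    # Per the comment above, this class may be missing in older sglang releases.
    from sglang.srt.managers.io_struct import ContinueGenerationReqInput

    HAS_CONTINUE_GENERATION = True
except ImportError:
    ContinueGenerationReqInput = None
    HAS_CONTINUE_GENERATION = False


def require_continue_generation() -> None:
    """Fail with a clear message when the feature needs a newer sglang."""
    if not HAS_CONTINUE_GENERATION:
        raise RuntimeError(
            "ContinueGenerationReqInput is unavailable in the installed sglang; "
            "auto resume on abort needs a release that provides it."
        )
```

Calling `require_continue_generation()` only on the code path that resumes generation would let the rest of the module import cleanly on older versions.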
…sing fully_async (#5487)

### What does this PR do?

Refactor the fully_async code based on #5430 to support the gateway mode, and decouple the tool invocation and rollout processes during the partial rollout phase. Feature `use_trainer_do_validate` is not ready for use, I will fix it in a subsequent PR.