
feat: async partial rollout trainer with sample supplementation and caching#58

Open
mamazi0131 wants to merge 1 commit into verl-project:main from mamazi0131:main

Conversation


@mamazi0131 mamazi0131 commented Mar 1, 2026

What does this PR do?

This PR introduces the Async Partial Rollout (APR) mechanism to the verl framework to address the training efficiency bottleneck caused by long-tail samples (e.g., 160k tokens). By implementing Sample Supplementation and Interruption Techniques, we mitigate the "inference bubble" effect and significantly improve GPU utilization in synchronous RL training. Our implementation supports both verl 0.5.0 and 0.6.1.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Key Accomplishments:

  • Implemented Sample Supplementation and Interruption Mechanisms (SSIM) for dynamic sample replenishment.

  • Introduced Rollout Caching via a state-aware PromptsManager to resume partial generations, effectively managing sample staleness.

  • Ensured Off-Policy Correctness for PPO-style algorithms (GRPO/DAPO) using decoupled importance sampling.

  • Achieved up to 51.1% reduction in end-to-end training time on complex reasoning datasets.
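The decoupled importance-sampling idea behind the off-policy correction can be illustrated with a per-token sketch. All names and constants below are illustrative inventions, not the PR's actual loss code: tokens sampled under a stale behaviour policy are reweighted toward the proximal policy (the policy at the start of the update), and the usual PPO clipping is applied to the proximal ratio only.

```python
import math

def decoupled_token_objective(logp_new, logp_prox, logp_behave, adv,
                              clip_eps=0.2, weight_cap=10.0):
    """Per-token decoupled-PPO-style surrogate (sketch, not the PR's code).

    logp_behave: log-prob under the (possibly stale) rollout policy that
    generated the token; logp_prox: log-prob under the policy at the start
    of the current update; logp_new: log-prob under the current policy.
    """
    # Off-policy correction toward the proximal policy; a real
    # implementation treats this weight as a constant (no gradient).
    w = min(math.exp(logp_prox - logp_behave), weight_cap)
    # Standard PPO clipped surrogate on the proximal ratio.
    r = math.exp(logp_new - logp_prox)
    clipped_r = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
    return w * min(r * adv, clipped_r * adv)
```

On-policy (all log-probs equal) this reduces to the advantage itself, recovering the vanilla PPO objective at ratio 1.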

Test

We validated the APR mechanism on two benchmarks using 2 nodes with 8 H20 GPUs and the Qwen3-4B model:

  1. GSM8K (Accuracy & Efficiency)
    With comparable convergence, training time was reduced by 11.7%, with a 5.93% boost in GPU utilization.
    • Baseline (GRPO+noPR): 4h 59m
    • Proposed (GRPO+PR): 4h 24m (-35m)
      (figure: gsm8k results)
  2. DAPO-MATH17k (Long-sequence Stress Test)
    In the presence of 160k-token long-tail samples, APR achieved a 51.1% reduction in total training time while maintaining superior final performance.
    • Baseline (GRPO+noPR): 67h 34m
    • Proposed (GRPO+PR): 33h 02m (-34h 32m)
      (figure: dapo_math results)
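As a sanity check, the reported percentages follow directly from the wall-clock times above (the helper below is plain arithmetic, not part of the PR):

```python
def reduction_pct(baseline_min: int, proposed_min: int) -> float:
    """Percent reduction in end-to-end training time, to one decimal."""
    return round((baseline_min - proposed_min) / baseline_min * 100, 1)

# GSM8K: 4h59m -> 4h24m; DAPO-MATH17k: 67h34m -> 33h02m
print(reduction_pct(4 * 60 + 59, 4 * 60 + 24))   # 11.7
print(reduction_pct(67 * 60 + 34, 33 * 60 + 2))  # 51.1
```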

API and Usage Example

Users can trigger the partial rollout mode using the recipes provided in the recipe/partial_rollout/ directory.

# Run DAPO-MATH17k with Partial Rollout on 2 nodes
bash recipe/partial_rollout/run_dapo_math17k_pr_4b_2node.sh

Design & Code Changes

  1. Sample Supplementation and Interruption Mechanisms:
    Dynamically replenish samples and automatically schedule inference tasks, so that interrupted long generations do not stall the rest of the batch.

  2. Rollout Caching:
    A prompt manager resumes partial rollouts, managing complete and partial samples in the buffer according to sample staleness.
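To make the caching idea concrete, here is a minimal, self-contained sketch of a staleness-aware prompt buffer. Class and method names are illustrative inventions, not the PR's RolloutPromptManager API:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class CachedPrompt:
    prompt_id: str
    tokens: list              # tokens generated so far (partial or complete)
    finished: bool = False
    staleness: int = 0        # policy updates elapsed since generation began

class PromptBuffer:
    """Toy staleness-aware buffer for complete and partial samples."""

    def __init__(self, max_staleness: int = 2):
        self.pending = deque()       # partial samples awaiting resumption
        self.done = []               # complete samples ready for training
        self.max_staleness = max_staleness

    def cache(self, p: CachedPrompt) -> None:
        (self.done if p.finished else self.pending).append(p)

    def on_policy_update(self) -> None:
        # Age every cached partial sample; drop those too stale to resume,
        # keeping off-policy drift bounded.
        for p in self.pending:
            p.staleness += 1
        self.pending = deque(p for p in self.pending
                             if p.staleness <= self.max_staleness)

    def next_to_resume(self):
        return self.pending.popleft() if self.pending else None
```

Bounding staleness is what keeps the decoupled importance-sampling correction well-behaved: a partial sample is only resumed while its behaviour policy is still close enough to the current one.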

Checklist Before Submitting

Important

Please check all of the following items before requesting a review; otherwise the reviewer may deprioritize this PR.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive asynchronous partial rollout (APR) system to the verl framework, designed to dramatically enhance the efficiency of reinforcement learning training, particularly when dealing with datasets containing samples of highly varying lengths. By intelligently managing inference tasks, dynamically supplementing samples, and caching partial results, the system minimizes idle GPU time and accelerates the overall training process, leading to significant performance gains without compromising algorithmic correctness.

Highlights

  • Asynchronous Partial Rollout (APR) Mechanism: Introduced a novel APR mechanism to address training efficiency bottlenecks caused by long-tail samples in synchronous reinforcement learning, significantly improving GPU utilization.
  • Sample Supplementation and Interruption Techniques (SSIM): Implemented SSIM for dynamic sample replenishment and automated scheduling of inference tasks, mitigating 'inference bubbles' and caching unfinished samples for reuse.
  • Rollout Caching via PromptsManager: Integrated a state-aware PromptsManager to manage complete and partial samples, enabling the resumption of partial generations and effective handling of sample staleness.
  • Off-Policy Correctness: Ensured off-policy correctness for PPO-style algorithms (GRPO/DAPO) through decoupled importance sampling, preserving algorithmic integrity under interruptible generation and policy updates.
  • Performance Improvements: Achieved substantial reductions in end-to-end training time, including an 11.7% reduction on GSM8K and a 51.1% reduction on DAPO-MATH17k, alongside increased GPU utilization.
  • Compatibility: Ensured the implementation supports both verl 0.5.0 and 0.6.1, providing flexibility for users.

Changelog

  • partial_rollout/README.md
    • Added detailed documentation for the Async Partial Rollout Trainer, covering background, solution, experimental results, implementation details, and usage examples.
  • partial_rollout/agent_loop/__init__.py
    • Added imports for new agent loop components and updated the package's public interface.
  • partial_rollout/agent_loop/agent_loop.py
    • Introduced PRv3AsyncLLMServerManager to support partial generation in LLM servers.
    • Implemented PRv3AgentLoopWorker as a Ray remote actor to manage asynchronous sequence generation, including cancellation and prompt manager interaction.
    • Added PRv3AgentLoopManager to orchestrate the new async rollout workers and prompt management logic.
  • partial_rollout/agent_loop/partial_single_turn_agent_loop.py
    • Added PartialSingleTurnAgentLoop to enable partial generation and resumption for single-turn agent interactions.
  • partial_rollout/agent_loop/partial_tool_agent_loop.py
    • Added PartialToolAgentLoop to support partial generation and resumption within multi-turn tool invocation agent loops.
  • partial_rollout/main_ppo.py
    • Modified run_ppo and TaskRunner to integrate the new PRv3AgentLoopManager and RolloutPromptManager for asynchronous partial rollout training.
  • partial_rollout/prompt_manager.py
    • Added RolloutPrompt dataclass to encapsulate batch information and agent loop outputs for partial rollouts.
    • Implemented RolloutPromptManager as a Ray remote actor to manage the lifecycle of prompts (pending, ongoing, done), handle data iteration, and assemble batches for partial rollouts.
  • partial_rollout/ray_trainer.py
    • Updated RayPPOTrainer to incorporate the RolloutPromptManager and PRv3AgentLoopManager for asynchronous partial rollouts.
    • Modified the training loop to prepare, check, and pull prompts from the prompt manager, and handle cancellation events during generation.
  • partial_rollout/run_dapomath_nopr_grpo_4b_bs64.sh
    • Added a new shell script to configure and run DAPO-MATH17k training without the partial rollout feature.
  • partial_rollout/run_dapomath_pr_grpo_4b_bs64.sh
    • Added a new shell script to configure and run DAPO-MATH17k training with the partial rollout feature enabled.
  • partial_rollout/run_gsm8k_nopr_grpo_4b_bs128.sh
    • Added a new shell script to configure and run GSM8K training without the partial rollout feature.
  • partial_rollout/run_gsm8k_pr_grpo_4b_bs128.sh
    • Added a new shell script to configure and run GSM8K training with the partial rollout feature enabled.
  • partial_rollout/vllm_rollout/__init__.py
    • Added an empty initialization file to define the vllm_rollout directory as a Python package.
  • partial_rollout/vllm_rollout/vllm_async_server.py
    • Introduced vLLMHttpServerForPartial to extend vLLMHttpServerBase, adding support for partial generation, cancellation, and resumption of requests.
    • Implemented PRv3vLLMReplica to utilize the new vLLMHttpServerForPartial for managing rollout servers with partial generation capabilities.
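The abort-and-resume flow that PRv3AgentLoopWorker and vLLMHttpServerForPartial implement can be sketched with toy asyncio code. Everything below is a stand-in (no vLLM involved): a fake decoder emits one token per step, stops early when a cancellation event fires, and is later resumed from the cached partial output.

```python
import asyncio

async def generate(prompt_tokens, cancel, max_new=8, step=0.01):
    """Toy interruptible decoder: one fake token per decode step."""
    out = list(prompt_tokens)
    while len(out) - len(prompt_tokens) < max_new:
        if cancel.is_set():
            return out, False      # interrupted: partial result to cache
        await asyncio.sleep(step)  # stands in for one decode step
        out.append(f"tok{len(out)}")
    return out, True               # finished normally

async def demo():
    cancel = asyncio.Event()
    task = asyncio.create_task(generate([], cancel, max_new=1000))
    await asyncio.sleep(0.05)      # let a few tokens decode
    cancel.set()                   # trainer hits a step boundary: interrupt
    partial, finished = await task
    # Next rollout phase: resume from the cached partial generation.
    resumed, finished2 = await generate(partial, asyncio.Event(), max_new=3)
    return partial, finished, resumed, finished2

partial, finished, resumed, finished2 = asyncio.run(demo())
```

The key design point mirrored here is that cancellation returns the partial output instead of raising, so the prompt manager can cache it and schedule a resumption later.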


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces an Async Partial Rollout (APR) mechanism to enhance training efficiency, particularly for datasets with long-tail samples. The implementation is comprehensive, adding new components like PRv3AgentLoopManager, RolloutPromptManager, and specialized agent loops for partial generation. The overall design is solid and effectively addresses the stated problem. My review focuses on improving code clarity, maintainability, and fixing a few minor issues. I've identified opportunities for improvement regarding magic numbers, a potential performance concern with busy-waiting, and some inconsistencies in documentation and script files.

Review comment threads: partial_rollout/README.md (×2), partial_rollout/agent_loop/agent_loop.py (×3), partial_rollout/prompt_manager.py, partial_rollout/run_gsm8k_pr_grpo_4b_bs128.sh
@ArronHZG

Hello, thank you very much for your work. As I understand it, this implements a colocated asynchronous training architecture similar to Kimi's, which has been a missing piece of verl.

In terms of design, verl 0.7.1 supports an auto-resume mechanism that decouples the complex state-storage logic between the server and the agent. Meanwhile, parameter synchronization uniformly adopts the checkpoint-engine approach, vLLM supports a multi-process mode, and the training engine is integrated through the unified Model Engine interface. All of these changes facilitate subsequent development and iteration.

I suggest refactoring this PR against the following PRs and the current code: the rollout module should leverage the auto-resume capability, the training module should adopt the Model Engine, and parameter synchronization should use the checkpoint engine, so as to align with the current code and future planning.

[Completed] vLLM multi-process: verl-project/verl#4280
[Completed] Add CheckpointEngineManager: verl-project/verl#5031
[Completed] Refactor the trainer to improve code reuse across various fit phases: verl-project/verl#5184
[Completed] Fully async supports invocation in engine mode: verl-project/verl#5269
[Completed] Fully async supports checkpoint engine: verl-project/verl#5029
[Completed] Rollout supports the abort-resume interface: verl-project/verl#5430
[Completed] Clean up the partial-related logic in AgentLoop: verl-project/verl#5487


startju commented Apr 22, 2026

Hello @mamazi0131, I'm a beginner with verl, and I'd be glad to help you refactor this code. Could I help you with that?


startju commented Apr 22, 2026

@ArronHZG do you still need this feature in v0.8.0?

@mamazi0131 (author) commented


I’d be happy to, of course. I’ve been so busy with work lately that I haven’t had time to take care of this.


startju commented Apr 22, 2026


thank you!
