[data] feat: TransferQueue - An asynchronous streaming data management system by 0oshowero0 · Pull Request #3649 · verl-project/verl

0oshowero0 · 2025-09-30T06:36:45Z

What does this PR do?

This PR introduces the TransferQueue data management module to verl, aiming to accelerate experience data transfer and address performance bottlenecks in post-training systems. Detailed design rationale is available in our RFC (#2662).

This PR adds TransferQueue as a git dependency. In addition, we provide end-to-end scripts that integrate verl with TransferQueue.

TransferQueue is a high-performance data storage and transfer module with panoramic data visibility and streaming scheduling capabilities, optimized for efficient dataflow in post-training workflows (in progress).

The system will introduce the following core components:

TransferQueueClient: Deployed on each Worker; manages communication with the TransferQueue system via simple put/get semantics.
TransferQueueController: Centralized dataflow scheduler that tracks the production and consumption status of training samples.
TransferQueueStorage: Distributed storage units that hold the actual experience data.
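The division of labor between the three components can be pictured with a small in-memory toy. This is only a sketch under stated assumptions: `ToyStorageUnit`, `ToyController`, and `ToyClient` are hypothetical names, not TransferQueue classes, and routing samples to units by `index % len(units)` is an illustrative choice rather than the project's placement policy.

```python
class ToyStorageUnit:
    """Holds the actual sample payloads, keyed by sample index."""

    def __init__(self):
        self.rows = {}

    def put(self, index, fields):
        self.rows.setdefault(index, {}).update(fields)

    def get(self, index, names):
        return {name: self.rows[index][name] for name in names}


class ToyController:
    """Tracks which fields were produced per sample and what each task consumed."""

    def __init__(self):
        self.produced = {}  # sample index -> set of field names already stored
        self.consumed = {}  # task name -> set of sample indexes already handed out

    def mark_produced(self, index, names):
        self.produced.setdefault(index, set()).update(names)

    def ready_indexes(self, names, task_name, batch_size):
        needed = set(names)
        done = self.consumed.setdefault(task_name, set())
        ready = [i for i in sorted(self.produced)
                 if needed <= self.produced[i] and i not in done]
        picked = ready[:batch_size]
        done.update(picked)
        return picked


class ToyClient:
    """Worker-side entry point: asks the controller what is ready, reads from storage."""

    def __init__(self, controller, units):
        self.controller = controller
        self.units = units

    def _unit(self, index):
        return self.units[index % len(self.units)]

    def put(self, index, fields):
        self._unit(index).put(index, fields)
        self.controller.mark_produced(index, fields.keys())

    def get(self, names, task_name, batch_size):
        indexes = self.controller.ready_indexes(names, task_name, batch_size)
        return {i: self._unit(i).get(i, names) for i in indexes}


client = ToyClient(ToyController(), [ToyStorageUnit(), ToyStorageUnit()])
for i in range(4):
    client.put(i, {"prompt": f"p{i}"})
client.put(0, {"response": "r0"})
client.put(2, {"response": "r2"})

# Only samples 0 and 2 have both fields, so only they are streamed to the task.
first_batch = client.get(["prompt", "response"], "compute_reward", batch_size=8)
second_batch = client.get(["prompt", "response"], "compute_reward", batch_size=8)
```

In this toy, a consumer receives a sample as soon as all requested fields exist, and the same task never sees a sample twice, which is the kind of production/consumption bookkeeping the controller is described as doing.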

The primary motivation for integrating TransferQueue into verl now is to alleviate the data transfer bottleneck of the single-controller RayPPOTrainer. Currently, all DataProto objects must be routed through RayPPOTrainer, which makes it a single-point bottleneck for the whole post-training system.

Leveraging TransferQueue, we separate experience data transfer from metadata dispatch by:

Replacing DataProto with BatchMeta (metadata) and TensorDict (actual data) structures
Preserving verl's original Dispatch/Collect logic via BatchMeta (maintaining single-controller debuggability)
Accelerating data transfer through TransferQueue's distributed storage units

For the WorkerGroup class, we hide the above translation behind a decorator. For AgentLoop-related classes, we explicitly perform the adaptation in AgentLoopBase.
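A decorator of this kind might look roughly like the sketch below. Everything here is an assumption for illustration: `STORE` stands in for the distributed storage units, `ToyBatchMeta` for BatchMeta, plain dicts of lists for TensorDict, and `with_storage` is a hypothetical name rather than the decorator added in `verl/single_controller/base/decorator.py`.

```python
import functools
from dataclasses import dataclass

# Hypothetical stand-in for the storage units: sample index -> {field: value}.
STORE = {}


@dataclass(frozen=True)
class ToyBatchMeta:
    """Lightweight handle: which samples and which fields, but no payload."""
    indexes: tuple
    fields: tuple


def with_storage(func):
    """Let a method written against real data be called with metadata only."""

    @functools.wraps(func)
    def wrapper(self, meta):
        # Fetch the payload on the worker side instead of shipping it via the controller.
        data = {name: [STORE[i][name] for i in meta.indexes] for name in meta.fields}
        output = func(self, data)
        # Write the new columns back and hand only updated metadata to the caller.
        for name, column in output.items():
            for i, value in zip(meta.indexes, column):
                STORE[i][name] = value
        return ToyBatchMeta(meta.indexes, meta.fields + tuple(output))

    return wrapper


class ToyWorker:
    @with_storage
    def compute_length(self, data):
        # The method body is unaware of the metadata/data split.
        return {"length": [len(text) for text in data["response"]]}


STORE.update({0: {"response": "abc"}, 1: {"response": "hello"}})
meta = ToyBatchMeta(indexes=(0, 1), fields=("response",))
new_meta = ToyWorker().compute_length(meta)
```

The caller only ever holds `meta` and `new_meta`, so the dispatch and collect path carries a few indexes and field names while the payload stays in storage.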

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

We've validated TransferQueue functionality through

Unit tests for (Async)TransferQueueClient, TransferQueueController, and TransferQueueSimpleUnit
End-to-end demo that mimics the usage in verl

API and Usage Example

The primary interaction points are AsyncTransferQueueClient and TransferQueueClient, serving as the communication interface with the TransferQueue system.

Core client interfaces:

(async_)get_meta(data_fields: list[str], batch_size: int, global_step: int, get_n_samples: bool, task_name: str) -> BatchMeta
(async_)get_data(metadata: BatchMeta) -> TensorDict
(async_)put(data: TensorDict, metadata: BatchMeta, global_step)
(async_)clear(global_step: int)
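To make the call order concrete, here is a minimal in-memory mock that mirrors the four method names and parameters listed above. It is a sketch, not the TransferQueue client: `MockClient` and `MockBatchMeta` are hypothetical, plain dicts of lists replace TensorDict, `get_n_samples` is accepted but ignored, and passing `metadata=None` to `put` to append fresh rows is a convention of this mock only.

```python
from dataclasses import dataclass


@dataclass
class MockBatchMeta:
    global_step: int
    indexes: list
    fields: list


class MockClient:
    """In-memory mock whose method names and parameters mirror the list above."""

    def __init__(self):
        self.rows = {}      # (global_step, index) -> {field: value}
        self.consumed = {}  # task_name -> set of (global_step, index)

    def put(self, data, metadata, global_step):
        n = len(next(iter(data.values())))
        if metadata is None:
            # Mock-only convention: no metadata means "append n new samples".
            start = sum(1 for step, _ in self.rows if step == global_step)
            indexes = list(range(start, start + n))
        else:
            indexes = metadata.indexes
        for pos, index in enumerate(indexes):
            row = self.rows.setdefault((global_step, index), {})
            row.update({name: column[pos] for name, column in data.items()})

    def get_meta(self, data_fields, batch_size, global_step, get_n_samples, task_name):
        # get_n_samples is kept for signature parity and ignored in this mock.
        seen = self.consumed.setdefault(task_name, set())
        ready = [key for key in sorted(self.rows)
                 if key[0] == global_step and key not in seen
                 and all(name in self.rows[key] for name in data_fields)]
        picked = ready[:batch_size]
        seen.update(picked)
        return MockBatchMeta(global_step, [index for _, index in picked], list(data_fields))

    def get_data(self, metadata):
        return {name: [self.rows[(metadata.global_step, i)][name] for i in metadata.indexes]
                for name in metadata.fields}

    def clear(self, global_step):
        self.rows = {key: row for key, row in self.rows.items() if key[0] != global_step}


client = MockClient()
client.put({"prompt": ["1+1=", "2+2="]}, None, global_step=0)

# Rollout stage: ask for metadata first, then pull only the fields it needs.
meta = client.get_meta(["prompt"], batch_size=2, global_step=0,
                       get_n_samples=False, task_name="generate")
prompts = client.get_data(meta)["prompt"]
client.put({"response": [p + "?" for p in prompts]}, meta, global_step=0)

# Reward stage: the same samples become visible once both fields exist.
meta = client.get_meta(["prompt", "response"], batch_size=2, global_step=0,
                       get_n_samples=False, task_name="reward")
batch = client.get_data(meta)
client.clear(global_step=0)
```

Each stage follows the same pattern: `get_meta` to learn which samples are ready, `get_data` to fetch the payload, `put` to write results back, and `clear` once the step is finished.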

You may refer to the example here, where we mimic verl usage in both async and sync scenarios:
https://github.com/TransferQueue/TransferQueue/tree/dev/recipe/simple_use_case.

Please use pip install ".[transferqueue]" to install TransferQueue to verl.

Then you can try our recipe as follows, which is adapted from run_qwen3-8b.sh with async rollout mode enabled and the sglang backend. For more recipes and the accuracy comparison, refer to [this doc](https://www.yuque.com/haomingzi-lfse7/hlx5g0/bqm536cgc52kv2gk?singleDoc#).

# Tested successfully on the hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0 image.
# It outperforms the Qwen2 7B base model by two percentage points on the test set of GSM8K.

set -x

MODEL_PATH="/home/xxx/models/Qwen3-8B"

TRAIN_FILE="/home/xxx/data/DAPO-Math-17k/data/dapo-math-17k.parquet"
TEST_FILE="/home/xxx/data/DAPO-Math-17k/data/dapo-math-17k.parquet"

log_dir="./logs"
mkdir -p ${log_dir}
timestamp=$(date +"%Y%m%d%H%M%S")
log_file="${log_dir}/qwen3-8b_tq_${timestamp}.log"

rollout_mode="async"
rollout_name="sglang" # sglang or vllm
return_raw_chat="False" # default, so the override below is never empty in sync mode
if [ "$rollout_mode" = "async" ]; then
    export VLLM_USE_V1=1
    return_raw_chat="True"
fi

python3 -m recipe.transfer_queue.main_ppo \
    --config-name='transfer_queue_ppo_trainer' \
    algorithm.adv_estimator=grpo \
    data.train_files=${TRAIN_FILE} \
    data.val_files=${TEST_FILE} \
    data.return_raw_chat=$return_raw_chat \
    data.train_batch_size=128 \
    data.max_prompt_length=2048 \
    data.max_response_length=8192 \
    data.filter_overlong_prompts_workers=128 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=${MODEL_PATH} \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.max_num_batched_tokens=10240 \
    actor_rollout_ref.rollout.name=$rollout_name \
    actor_rollout_ref.rollout.mode=$rollout_mode \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=console \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen3_8b_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=1000 \
    trainer.total_epochs=15 \
    trainer.total_training_steps=50 \
    trainer.val_before_train=False \
    +trainer.num_global_batch=1 \
    +trainer.num_data_storage_units=8 \
    +trainer.num_data_controllers=8 \
    2>&1 | tee "$log_file"
echo "Finished, log is saved in: $log_file"
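A few of the overrides above are tied together by simple arithmetic, sketched below. The numbers are copied from the script. The interpretation is an assumption: `rollout.n` is read as responses sampled per prompt, and the replica count assumes tensor parallelism is the only way rollout GPUs are grouped.

```python
# Values copied from the overrides in the script above.
train_batch_size = 128          # data.train_batch_size (prompts per step)
rollout_n = 5                   # actor_rollout_ref.rollout.n
max_prompt_length = 2048        # data.max_prompt_length
max_response_length = 8192      # data.max_response_length
max_num_batched_tokens = 10240  # actor_rollout_ref.rollout.max_num_batched_tokens
n_gpus = 8 * 1                  # trainer.n_gpus_per_node * trainer.nnodes
tensor_parallel = 4             # actor_rollout_ref.rollout.tensor_model_parallel_size

# Assumed reading: every prompt is sampled rollout_n times per step.
responses_per_step = train_batch_size * rollout_n

# The longest possible sequence should fit into one rollout batch.
longest_sequence = max_prompt_length + max_response_length
fits_in_one_batch = longest_sequence <= max_num_batched_tokens

# Assumed reading: GPUs are split into tensor-parallel groups of equal size.
rollout_replicas = n_gpus // tensor_parallel

print(responses_per_step, longest_sequence, fits_in_one_batch, rollout_replicas)
```

Under these assumptions each step moves 640 responses of up to 10240 tokens through the data path, which is the traffic TransferQueue is meant to keep off the single controller.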

Accuracy comparison:

Design & Code Changes

Refer to our Paper, RFC, and Zhihu post :)

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)

* Support controller in TransferQueue
* Fix import
* Fix comments

---------

Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com>

Added copyright and licensing information to the controller.py file.

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* update client docstring

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* fix n_sample related problems

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

---------

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* Add metadata.py and test_simple_storage_unit.py
* Add copyright and license information to test_simple_storage_unit.py
* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Han Zhenyu 韩振宇 <o0shower0o@outlook.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com>

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* Origin recipe
* Integrate TransferQueue with Ray Trainer
* Fix codecheck
* Fix codecheck
* Fix codecheck
* Fix codecheck
* Fix
* Fix codecheck

---------

Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com>

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* fix chinese comments & add TODO
* provide general DataProto<->BatchMeta decorator

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* fix

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* fix

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* fix

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* optimize code

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* fix

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

* fix

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

---------

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

fix

wuxibin89 · 2025-10-17T02:48:24Z

verl/utils/transferqueue_utils.py

+    global _TRANSFER_QUEUE_CLIENT
+    global _VAL_TRANSFER_QUEUE_CLIENT
+    if "val" in client_id:
+        _VAL_TRANSFER_QUEUE_CLIENT = AsyncTransferQueueClient(client_id, controller_infos, storage_infos)


Why do we have a separate client for val? One client should be used for both train and val.
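One way to read this suggestion is a single process-wide client that both the training and validation paths obtain from the same accessor. The sketch below is hypothetical: `StubClient` only records its constructor arguments (the three-argument shape is taken from the diff above), `create_client` and `get_client` are illustrative names, and how train and val requests would then be told apart is not specified in this thread.

```python
_CLIENT = None  # one process-wide client, shared by train and val code paths


class StubClient:
    """Hypothetical stand-in that only records its constructor arguments."""

    def __init__(self, client_id, controller_infos, storage_infos):
        self.client_id = client_id
        self.controller_infos = controller_infos
        self.storage_infos = storage_infos


def create_client(client_id, controller_infos, storage_infos, client_cls=StubClient):
    """Create the shared client once; later calls return the same object."""
    global _CLIENT
    if _CLIENT is None:
        _CLIENT = client_cls(client_id, controller_infos, storage_infos)
    return _CLIENT


def get_client():
    """Return the shared client, failing loudly if it was never created."""
    if _CLIENT is None:
        raise RuntimeError("call create_client() before get_client()")
    return _CLIENT


# Both call sites end up with the same object, whatever id they pass in.
train_client = create_client("worker_0", controller_infos={}, storage_infos={})
val_client = create_client("worker_0_val", controller_infos={}, storage_infos={})
```

With this shape there is no `"val" in client_id` branch and no second global to keep in sync.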

Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com>

…t system (verl-project#3649)

### What does this PR do?

This PR introduces the [**TransferQueue**](https://github.com/TransferQueue/TransferQueue) data management module to verl, aiming to accelerate experience data transfer and address performance bottlenecks in post-training systems. Detailed design rationale is available in our RFC (verl-project#2662).

**This PR adds TransferQueue as a git submodule into `verl/experimental/transfer_queue`. In addition, we provide end-to-end scripts that integrate verl with TransferQueue.**

<img src="https://cdn.nlark.com/yuque/0/2025/png/23208217/1758696193102-a5654375-65a1-4e06-9c63-142b59df90b8.png" width="70%">

**TransferQueue is a high-performance data storage and transfer module with panoramic data visibility and streaming scheduling capabilities, optimized for efficient dataflow in post-training workflows (in progress).**

The system will introduce the following core components:

- **TransferQueueClient**: Deployed on each `Worker`; manages communication with the TransferQueue system via simple put/get semantics.
- **TransferQueueController**: Centralized dataflow scheduler that tracks the production and consumption status of training samples.
- **TransferQueueStorage**: Distributed storage units that hold the actual experience data.

The primary motivation for integrating TransferQueue into verl now is to **alleviate the data transfer bottleneck of the single-controller `RayPPOTrainer`**. Currently, all `DataProto` objects must be routed through `RayPPOTrainer`, which makes it a single-point bottleneck for the whole post-training system.

![verl_dataflow_DataProto](https://cdn.nlark.com/yuque/0/2025/jpeg/23208217/1758704289414-bcc54228-716b-4d4a-ad3b-f9ace6d10fcf.jpeg)

Leveraging TransferQueue, we separate experience data transfer from metadata dispatch by:

- Replacing `DataProto` with `BatchMeta` (metadata) and `TensorDict` (actual data) structures
- Preserving verl's original Dispatch/Collect logic via BatchMeta (maintaining single-controller debuggability)
- Accelerating data transfer through TransferQueue's distributed storage units

![verl_dataflow_TransferQueue](https://cdn.nlark.com/yuque/0/2025/jpeg/23208217/1758704301666-0807dc06-766c-4a2d-9cde-889a6bb56b34.jpeg)

For the `WorkerGroup` class, we hide the above translation behind a decorator. For `AgentLoop`-related classes, we explicitly perform the adaptation in `AgentLoopBase`.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

We've validated TransferQueue functionality through

- Unit tests for (Async)TransferQueueClient, TransferQueueController, and TransferQueueSimpleUnit
- End-to-end demo that mimics the usage in verl

### API and Usage Example

The primary interaction points are `AsyncTransferQueueClient` and `TransferQueueClient`, serving as the communication interface with the TransferQueue system.

Core client interfaces:

- (async_)get_meta(data_fields: list[str], batch_size: int, global_step: int, get_n_samples: bool, task_name: str) -> BatchMeta
- (async_)get_data(metadata: BatchMeta) -> TensorDict
- (async_)put(data: TensorDict, metadata: BatchMeta, global_step)
- (async_)clear(global_step: int)

You may refer to the example here, where we mimic verl usage in both async and sync scenarios: https://github.com/TransferQueue/TransferQueue/tree/dev/recipe/simple_use_case.

---

**Please use `pip install ".[transferqueue]"` to install TransferQueue to verl.**

Then you can try our recipe as follows, which is adapted from `run_qwen3-8b.sh` with async rollout mode enabled and the sglang backend. For more recipes and the accuracy comparison, refer to [this doc](https://www.yuque.com/haomingzi-lfse7/hlx5g0/bqm536cgc52kv2gk?singleDoc#).

```bash
# Tested successfully on the hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0 image.
# It outperforms the Qwen2 7B base model by two percentage points on the test set of GSM8K.

set -x

MODEL_PATH="/home/xxx/models/Qwen3-8B"

TRAIN_FILE="/home/xxx/data/DAPO-Math-17k/data/dapo-math-17k.parquet"
TEST_FILE="/home/xxx/data/DAPO-Math-17k/data/dapo-math-17k.parquet"

log_dir="./logs"
mkdir -p ${log_dir}
timestamp=$(date +"%Y%m%d%H%M%S")
log_file="${log_dir}/qwen3-8b_tq_${timestamp}.log"

rollout_mode="async"
rollout_name="sglang" # sglang or vllm
return_raw_chat="False" # default, so the override below is never empty in sync mode
if [ "$rollout_mode" = "async" ]; then
    export VLLM_USE_V1=1
    return_raw_chat="True"
fi

python3 -m recipe.transfer_queue.main_ppo \
    --config-name='transfer_queue_ppo_trainer' \
    algorithm.adv_estimator=grpo \
    data.train_files=${TRAIN_FILE} \
    data.val_files=${TEST_FILE} \
    data.return_raw_chat=$return_raw_chat \
    data.train_batch_size=128 \
    data.max_prompt_length=2048 \
    data.max_response_length=8192 \
    data.filter_overlong_prompts_workers=128 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=${MODEL_PATH} \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.max_num_batched_tokens=10240 \
    actor_rollout_ref.rollout.name=$rollout_name \
    actor_rollout_ref.rollout.mode=$rollout_mode \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=console \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen3_8b_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=1000 \
    trainer.total_epochs=15 \
    trainer.total_training_steps=50 \
    trainer.val_before_train=False \
    +trainer.num_global_batch=1 \
    +trainer.num_data_storage_units=8 \
    +trainer.num_data_controllers=8 \
    2>&1 | tee "$log_file"
echo "Finished, log is saved in: $log_file"
```

Accuracy comparison:

<img width="934" height="540" alt="image" src="https://github.com/user-attachments/assets/1edb7f35-9141-41cf-9202-ecf1df6a6c76" />

<img width="956" height="540" alt="image" src="https://github.com/user-attachments/assets/220102f0-b4ec-4f4d-b100-217fa02107f7" />

### Design & Code Changes

Refer to our [Paper](https://arxiv.org/abs/2507.01663), [RFC](verl-project#2662), and [Zhihu post](https://zhuanlan.zhihu.com/p/1930244241625449814) :)

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
Co-authored-by: FightingZhen <295632982@qq.com>
Co-authored-by: LLLLxmmm <130739718+LLLLxmmm@users.noreply.github.com>
Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com>
Co-authored-by: Huazhong <hzji210@gmail.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Jianjun Zhong <87791082+jianjunzhong@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: baymax591 <cbai@mail.nwpu.edu.cn>

…t system (verl-project#3649) This PR introduces the [**TransferQueue**](https://github.com/TransferQueue/TransferQueue) data management module to verl, aiming to accelerate experience data transfer and address performance bottlenecks in post-training systems. Detailed design rationale is available in our RFC (verl-project#2662). **This PR adds TransferQueue as a git submodule into `verl/experimental/transfer_queue`. Besides, we provide end-to-end scripts that integrate verl with TransferQueue.** <img src="https://cdn.nlark.com/yuque/0/2025/png/23208217/1758696193102-a5654375-65a1-4e06-9c63-142b59df90b8.png" width="70%"> **TransferQueue is a high-performance data storage and transfer module with panoramic data visibility and streaming scheduling capabilities, optimized for efficient dataflow in post-training workflows (in progress).** The system will introduce the following core components: - **TransferQueueClient**: Deployed on each `Worker`, manages the communication with TransferQueue system via simple put/get semantics. - **TransferQueueController**: Centralized dataflow scheduler tracking the production and consumption status of training samples. - **TransferQueueStorage**: Distributed storage units that holds the actual experience data. The primary motivation for integrating TransferQueue to verl now is to **alleviate the data transfer bottleneck of the single controller `RayPPOTrainer`**. Currently, all `DataProto` objects must be routed through `RayPPOTrainer`, resulting in a single point bottleneck of the whole post-training system. 
![verl_dataflow_DataProto](https://cdn.nlark.com/yuque/0/2025/jpeg/23208217/1758704289414-bcc54228-716b-4d4a-ad3b-f9ace6d10fcf.jpeg) Leveraging TransferQueue, we separate experience data transfer from metadata dispatch by - Replacing `DataProto` with `BatchMeta` (metadata) and `TensorDict` (actual data) structures - Preserving verl's original Dispatch/Collect logic via BatchMeta (maintaining single-controller debuggability) - Accelerating data transfer by TransferQueue's distributed storage units ![verl_dataflow_TransferQueue](https://cdn.nlark.com/yuque/0/2025/jpeg/23208217/1758704301666-0807dc06-766c-4a2d-9cde-889a6bb56b34.jpeg) For `WorkerGroup` class, we hide the above translation process by decorator. For `AgentLoop` related class, we explicitely do the adaption in `AgentLoopBase`. - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` We've validated TransferQueue functionality through - Unit test of (Async)TransferQueueClient, TransferQueueController, and TransferQueueSimpleUnit - End-to-end demo that mimics the usage in verl The primary interaction points are `AsyncTransferQueueClient` and `TransferQueueClient`, serving as the communication interface with the TransferQueue system. 
Core client interfaces: - (async_)get_meta(data_fields: list[str], batch_size:int, global_step:int, get_n_samples:bool, task_name:str) -> BatchMeta - (async_)get_data(metadata:BatchMeta) -> TensorDict - (async_)put(data:TensorDict, metadata:BatchMeta, global_step) - (async_)clear(global_step: int) You may refer to the example here, where we mimics the verl usage in both async & sync scenarios: https://github.com/TransferQueue/TransferQueue/tree/dev/recipe/simple_use_case. --- **Please use `pip install ".[transferqueue]"` to install TransferQueue to verl.** Then you can try our recipe as follows, which is adapted from `run_qwen3-8b.sh` with async rollout mode enabled using sglang backend. For more recipes and the accuracy comparison, refer to [this doc].(https://www.yuque.com/haomingzi-lfse7/hlx5g0/bqm536cgc52kv2gk?singleDoc#) ```bash set -x MODEL_PATH="/home/xxx/models/Qwen3-8B" TRAIN_FILE="/home/xxx/data/DAPO-Math-17k/data/dapo-math-17k.parquet" TEST_FILE="/home/xxx/data/DAPO-Math-17k/data/dapo-math-17k.parquet" log_dir="./logs" mkdir -p ${log_dir} timestamp=$(date +"%Y%m%d%H%M%S") log_file="${log_dir}/qwen3-8b_tq_${timestamp}.log" rollout_mode="async" rollout_name="sglang" # sglang or vllm if [ "$rollout_mode" = "async" ]; then export VLLM_USE_V1=1 return_raw_chat="True" fi python3 -m recipe.transfer_queue.main_ppo \ --config-name='transfer_queue_ppo_trainer' \ algorithm.adv_estimator=grpo \ data.train_files=${TRAIN_FILE} \ data.val_files=${TEST_FILE} \ data.return_raw_chat=$return_raw_chat \ data.train_batch_size=128 \ data.max_prompt_length=2048 \ data.max_response_length=8192 \ data.filter_overlong_prompts_workers=128 \ data.filter_overlong_prompts=True \ data.truncation='error' \ actor_rollout_ref.model.path=${MODEL_PATH} \ actor_rollout_ref.actor.optim.lr=1e-6 \ actor_rollout_ref.model.use_remove_padding=True \ actor_rollout_ref.actor.ppo_mini_batch_size=32 \ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \ actor_rollout_ref.actor.use_kl_loss=True \ 
actor_rollout_ref.actor.kl_loss_coef=0.001 \ actor_rollout_ref.actor.kl_loss_type=low_var_kl \ actor_rollout_ref.actor.entropy_coeff=0 \ actor_rollout_ref.model.enable_gradient_checkpointing=True \ actor_rollout_ref.actor.fsdp_config.param_offload=True \ actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \ actor_rollout_ref.rollout.tensor_model_parallel_size=4 \ actor_rollout_ref.rollout.max_num_batched_tokens=10240 \ actor_rollout_ref.rollout.name=$rollout_name \ actor_rollout_ref.rollout.mode=$rollout_mode \ actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \ actor_rollout_ref.rollout.n=5 \ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \ actor_rollout_ref.ref.fsdp_config.param_offload=True \ algorithm.use_kl_in_reward=False \ trainer.critic_warmup=0 \ trainer.logger=console \ trainer.project_name='verl_grpo_example_gsm8k' \ trainer.experiment_name='qwen3_8b_function_rm' \ trainer.n_gpus_per_node=8 \ trainer.nnodes=1 \ trainer.save_freq=-1 \ trainer.test_freq=1000 \ trainer.total_epochs=15 \ trainer.total_training_steps=50 \ trainer.val_before_train=False \ +trainer.num_global_batch=1 \ +trainer.num_data_storage_units=8 \ +trainer.num_data_controllers=8 \ 2>&1 | tee "$log_file" echo "Finished, log is saved in: $log_file" ``` Accuracy comparison: <img width="934" height="540" alt="image" src="https://github.com/user-attachments/assets/1edb7f35-9141-41cf-9202-ecf1df6a6c76" /> <img width="956" height="540" alt="image" src="https://github.com/user-attachments/assets/220102f0-b4ec-4f4d-b100-217fa02107f7" /> Refer to our [Paper](https://arxiv.org/abs/2507.01663), [RFC](verl-project#2662), and [Zhihu post](https://zhuanlan.zhihu.com/p/1930244241625449814) :) > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. 
- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) --------- Signed-off-by: 0oshowero0 <o0shower0o@outlook.com> Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com> Co-authored-by: FightingZhen <295632982@qq.com> Co-authored-by: LLLLxmmm <130739718+LLLLxmmm@users.noreply.github.com> Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com> Co-authored-by: Huazhong <hzji210@gmail.com> Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com> Co-authored-by: Jianjun Zhong <87791082+jianjunzhong@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: baymax591 <cbai@mail.nwpu.edu.cn>

…t system (verl-project#3649) ### What does this PR do? This PR introduces the [**TransferQueue**](https://github.com/TransferQueue/TransferQueue) data management module to verl, aiming to accelerate experience data transfer and address performance bottlenecks in post-training systems. Detailed design rationale is available in our RFC (verl-project#2662). **This PR adds TransferQueue as a git submodule into `verl/experimental/transfer_queue`. Besides, we provide end-to-end scripts that integrate verl with TransferQueue.** <img src="https://cdn.nlark.com/yuque/0/2025/png/23208217/1758696193102-a5654375-65a1-4e06-9c63-142b59df90b8.png" width="70%"> **TransferQueue is a high-performance data storage and transfer module with panoramic data visibility and streaming scheduling capabilities, optimized for efficient dataflow in post-training workflows (in progress).** The system will introduce the following core components: - **TransferQueueClient**: Deployed on each `Worker`, manages the communication with TransferQueue system via simple put/get semantics. - **TransferQueueController**: Centralized dataflow scheduler tracking the production and consumption status of training samples. - **TransferQueueStorage**: Distributed storage units that holds the actual experience data. The primary motivation for integrating TransferQueue to verl now is to **alleviate the data transfer bottleneck of the single controller `RayPPOTrainer`**. Currently, all `DataProto` objects must be routed through `RayPPOTrainer`, resulting in a single point bottleneck of the whole post-training system. 
![verl_dataflow_DataProto](https://cdn.nlark.com/yuque/0/2025/jpeg/23208217/1758704289414-bcc54228-716b-4d4a-ad3b-f9ace6d10fcf.jpeg) Leveraging TransferQueue, we separate experience data transfer from metadata dispatch by - Replacing `DataProto` with `BatchMeta` (metadata) and `TensorDict` (actual data) structures - Preserving verl's original Dispatch/Collect logic via BatchMeta (maintaining single-controller debuggability) - Accelerating data transfer by TransferQueue's distributed storage units ![verl_dataflow_TransferQueue](https://cdn.nlark.com/yuque/0/2025/jpeg/23208217/1758704301666-0807dc06-766c-4a2d-9cde-889a6bb56b34.jpeg) For `WorkerGroup` class, we hide the above translation process by decorator. For `AgentLoop` related class, we explicitely do the adaption in `AgentLoopBase`. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test We've validated TransferQueue functionality through - Unit test of (Async)TransferQueueClient, TransferQueueController, and TransferQueueSimpleUnit - End-to-end demo that mimics the usage in verl ### API and Usage Example The primary interaction points are `AsyncTransferQueueClient` and `TransferQueueClient`, serving as the communication interface with the TransferQueue system. 
Core client interfaces:

- `(async_)get_meta(data_fields: list[str], batch_size: int, global_step: int, get_n_samples: bool, task_name: str) -> BatchMeta`
- `(async_)get_data(metadata: BatchMeta) -> TensorDict`
- `(async_)put(data: TensorDict, metadata: BatchMeta, global_step)`
- `(async_)clear(global_step: int)`

You may refer to the example here, where we mimic the verl usage in both async and sync scenarios: https://github.com/TransferQueue/TransferQueue/tree/dev/recipe/simple_use_case.

---

**Please use `pip install ".[transferqueue]"` to install TransferQueue to verl.**

Then you can try our recipe as follows, which is adapted from `run_qwen3-8b.sh` with async rollout mode enabled using the sglang backend. For more recipes and the accuracy comparison, refer to [this doc](https://www.yuque.com/haomingzi-lfse7/hlx5g0/bqm536cgc52kv2gk?singleDoc#).

```bash
# Tested successfully on the hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0 image.
# It outperforms the Qwen2 7B base model by two percentage points on the test set of GSM8K.

set -x

MODEL_PATH="/home/xxx/models/Qwen3-8B"
TRAIN_FILE="/home/xxx/data/DAPO-Math-17k/data/dapo-math-17k.parquet"
TEST_FILE="/home/xxx/data/DAPO-Math-17k/data/dapo-math-17k.parquet"

log_dir="./logs"
mkdir -p ${log_dir}
timestamp=$(date +"%Y%m%d%H%M%S")
log_file="${log_dir}/qwen3-8b_tq_${timestamp}.log"

rollout_mode="async"
rollout_name="sglang" # sglang or vllm
if [ "$rollout_mode" = "async" ]; then
    export VLLM_USE_V1=1
    return_raw_chat="True"
fi

python3 -m recipe.transfer_queue.main_ppo \
    --config-name='transfer_queue_ppo_trainer' \
    algorithm.adv_estimator=grpo \
    data.train_files=${TRAIN_FILE} \
    data.val_files=${TEST_FILE} \
    data.return_raw_chat=$return_raw_chat \
    data.train_batch_size=128 \
    data.max_prompt_length=2048 \
    data.max_response_length=8192 \
    data.filter_overlong_prompts_workers=128 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=${MODEL_PATH} \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.max_num_batched_tokens=10240 \
    actor_rollout_ref.rollout.name=$rollout_name \
    actor_rollout_ref.rollout.mode=$rollout_mode \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=console \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen3_8b_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=1000 \
    trainer.total_epochs=15 \
    trainer.total_training_steps=50 \
    trainer.val_before_train=False \
    +trainer.num_global_batch=1 \
    +trainer.num_data_storage_units=8 \
    +trainer.num_data_controllers=8 \
    2>&1 | tee "$log_file"

echo "Finished, log is saved in: $log_file"
```

Accuracy comparison:

<img width="934" height="540" alt="image" src="https://github.com/user-attachments/assets/1edb7f35-9141-41cf-9202-ecf1df6a6c76" />

<img width="956" height="540" alt="image" src="https://github.com/user-attachments/assets/220102f0-b4ec-4f4d-b100-217fa02107f7" />

### Design & Code Changes

Refer to our [Paper](https://arxiv.org/abs/2507.01663), [RFC](verl-project#2662), and [Zhihu post](https://zhuanlan.zhihu.com/p/1930244241625449814) :)

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
Co-authored-by: FightingZhen <295632982@qq.com>
Co-authored-by: LLLLxmmm <130739718+LLLLxmmm@users.noreply.github.com>
Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com>
Co-authored-by: Huazhong <hzji210@gmail.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Jianjun Zhong <87791082+jianjunzhong@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: baymax591 <cbai@mail.nwpu.edu.cn>

FightingZhen and others added 29 commits September 23, 2025 19:56

Support storage unit in TransferQueue

a92a942

Fix importance error

bae27bb

Support controller in TransferQueue (#2)

0c03e14

* Support controller in TransferQueue * Fix import * Fix comments --------- Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com>

expose TransferQueueClient (#3)

64de012

Add copyright and license information

54e1889

Added copyright and licensing information to the controller.py file.

update client docstring (#5)

b79c2ab

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

merge TransferQueue utils (#4)

d67890e

[fix] Fix n_sample related problems (#8)

6a54445

* update client docstring Signed-off-by: 0oshowero0 <o0shower0o@outlook.com> * fix n_sample related problems Signed-off-by: 0oshowero0 <o0shower0o@outlook.com> --------- Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

expose TransferQueue client/controller UT (#6)

fec6303

Add reorder function to BatchMeta (#13)

7bf97ed

Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com>

Merge remote-tracking branch 'upstream/main' into main_tq_submodule

64d49d4

Add TransferQueue as submodule under experimental/transfer_queue

6ec7ca9

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

update requirements for transfer_queue

035d7a7

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

update doc

5cfa5f6

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

fix chinese comments & add TODO (#15)

02af787

update transferqueue submodule branch

fac98d0

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

update transferqueue submodule to latest commit

f0dbeb6

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

update transferqueue submodule to latest commit

d0add3f

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

simplify import

31814a3

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

fix

0abcacf

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

[recipe]feat: register tq server info for each workgroup (#18)

be9b19c

expose more API

667bb3d

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

expose all API

6eb1692

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

expose all APIs

9d92f0a

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

expose all APIs

ec91b86

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

fix

7f89443

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>

ji-huazhong force-pushed the main_tq_submodule branch from a50460e to 14ad39e Compare September 30, 2025 06:40

baymax591 added 2 commits October 16, 2025 20:04

fix

6462883

Merge pull request #40 from baymax591/main_tq_submodule

7a7a0c1

fix

wuxibin89 reviewed Oct 17, 2025

recipe/transfer_queue/ray_trainer.py (resolved)

wuxibin89 reviewed Oct 17, 2025

recipe/transfer_queue/main_ppo.py (resolved)

wuxibin89 reviewed Oct 17, 2025

verl/utils/transferqueue_utils.py (outdated, resolved)

LLLLxmmm and others added 2 commits October 17, 2025 17:05

Adapt for multimodal data format (#42)

9013472

Co-authored-by: liuximeng <13073314+liuximeng18772102439@user.noreply.gitee.com>

Merge branch 'main' into main_tq_submodule

a60ade4

ji-huazhong force-pushed the main_tq_submodule branch from aacf49d to df0c720 Compare October 18, 2025 06:07

apply review suggestions

77c9a0e

ji-huazhong force-pushed the main_tq_submodule branch 7 times, most recently from 5ef317d to 8fcf939 Compare October 20, 2025 09:00

simpify the implementation of main_ppo in recipe/transfer_queue

20d0f98

ji-huazhong force-pushed the main_tq_submodule branch from 8fcf939 to 20d0f98 Compare October 20, 2025 10:58

wuxibin89 approved these changes Oct 21, 2025


wuxibin89 merged commit ae4b5fe into verl-project:main Oct 21, 2025
74 of 76 checks passed

0oshowero0 mentioned this pull request Nov 18, 2025

[recipe, data] feat: TransferQueue - Support managing multiple data partitions for Train/Val/Test in controller #4175

Merged

7 tasks

wuxibin89 merged 52 commits into verl-project:main from TransferQueue:main_tq_submodule

8 participants
