[recipe] feat: asynchronous reward agent with mini-batch pipeline and one-step off-policy training #2854
haolinyan wants to merge 23 commits into verl-project:main from
Conversation
- add spacing in the header
- fix numerical errors and unify precision in "Training Time and Percentage"
- update the agent reward configuration introduction, emphasizing support for LLM-as-a-Judge, RAG, etc., and provide an OpenAI API-based LLM-as-a-Judge example
- fix some syntax errors
Code Review
This pull request introduces a significant new feature: an asynchronous reward agent designed to improve RL training efficiency by overlapping communication with computation. The implementation includes a mini-batch pipeline and a one-step off-policy strategy, with support for both FSDP and Megatron backends. The code is well-documented with comprehensive READMEs and example scripts.
My review has identified a critical race condition in `reward_agent.py` that could lead to incorrect behavior in reward processing. I've also found a high-severity correctness issue in both `dp_actor.py` and `megatron_actor.py` related to inconsistent logic for selecting the policy loss function. Addressing these points will be crucial for the stability and correctness of this new feature.
```python
request.group_dict[index][0][intra_data_index] = (score, valid_response_length)
request.group_dict[index][1] -= 1
timestamps.append(datetime.datetime.now().isoformat())
queries.append(query)
results.append(response)
latencies.append(end_time - start_time)
group_uids.append(index)

if request.group_dict[index][1] == 0:
```
There is a potential race condition in the proxy_func method. Multiple worker threads from the ThreadPoolExecutor can access and modify the shared request.group_dict concurrently for the same group_uid (index).
The operation request.group_dict[index][1] -= 1 (line 215) is not atomic. If two threads execute this line for the same group (index) concurrently, one of the decrements could be lost. This can lead to incorrect group completion logic, where a group might be processed before all its rewards are collected, or never processed at all.
To fix this, you should introduce a threading.Lock to protect the read and write access to request.group_dict within the for future in as_completed(futures): loop. I recommend creating a lock in the __init__ method (e.g., self.group_dict_lock = threading.Lock()) and using a with self.group_dict_lock: block around the critical sections that modify request.group_dict.
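The suggested fix can be sketched as follows. This is a minimal illustration, not the recipe's actual class: the class name, method name, and `group_dict_lock` attribute are hypothetical, while the dict bookkeeping mirrors the snippet under review.

```python
import threading


class RewardAgent:
    """Illustrative sketch of guarding the shared group bookkeeping with a lock."""

    def __init__(self):
        # Hypothetical lock protecting request.group_dict, created once in __init__.
        self.group_dict_lock = threading.Lock()

    def on_future_done(self, request, index, intra_data_index, score, valid_response_length):
        # The read-modify-write on the shared dict happens under the lock, so
        # concurrent completions for the same group cannot lose a decrement.
        with self.group_dict_lock:
            if index not in request.group_dict:
                request.group_dict[index] = [dict(), request.group_size]
            request.group_dict[index][0][intra_data_index] = (score, valid_response_length)
            request.group_dict[index][1] -= 1
            group_done = request.group_dict[index][1] == 0
        return group_done
```

The `group_done` flag computed inside the critical section lets the caller trigger group completion exactly once, even if several worker threads finish at the same moment.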
The access to `request.group_dict` only occurs within a single daemon thread, so there is no race condition.
```python
for (
    data_source,
    response_str,
    ground_truth,
    extra_info,
    group_uid,
    data_idx,
    valid_response_length,
) in request.request_data:
    # Each worker thread independently executes the user-defined function (self.user_defined_func)
    # without modifying the request object.
    future = self.executor.submit(
        self.user_defined_func, data_source, response_str, ground_truth, extra_info
    )
    future.meta_info = [group_uid, data_idx, valid_response_length, time.time(), response_str]
    futures.append(future)
for future in as_completed(futures):
    score, query, response = future.result()
    end_time = time.time()
    index, intra_data_index, valid_response_length, start_time, response_str = future.meta_info
    if index not in request.group_dict:
        print(f"Warning: index {index} not in request.group_dict, add it in func: proxy_func")
        request.group_dict[index] = [dict(), request.group_size]
    request.group_dict[index][0][intra_data_index] = (score, valid_response_length)
    request.group_dict[index][1] -= 1
    timestamps.append(datetime.datetime.now().isoformat())
    queries.append(query)
    results.append(response)
    latencies.append(end_time - start_time)
    group_uids.append(index)
```
```python
loss_mode = self.config.policy_loss.get("loss_mode", "vanilla")

if self.config.policy_loss.loss_mode == "vanilla":
```
There's an inconsistency in how the policy loss mode is checked. On line 435, you define a local variable loss_mode = self.config.policy_loss.get("loss_mode", "vanilla"). However, this if condition checks self.config.policy_loss.loss_mode == "vanilla" instead of using the local loss_mode variable. This could lead to incorrect behavior if the configuration structure changes or if loss_mode is intended to be the single source of truth for this logic. The elif on line 450 correctly uses the loss_mode variable. For consistency and correctness, you should use the loss_mode variable in this if condition.
```diff
- if self.config.policy_loss.loss_mode == "vanilla":
+ if loss_mode == "vanilla":
```
```python
loss_mode = self.config.policy_loss.get("loss_mode", "vanilla")

if self.config.policy_loss.loss_mode == "vanilla":
```
There's an inconsistency in how the policy loss mode is checked. On line 433, you define a local variable loss_mode = self.config.policy_loss.get("loss_mode", "vanilla"). However, this if condition checks self.config.policy_loss.loss_mode == "vanilla" instead of using the local loss_mode variable. This could lead to incorrect behavior if the configuration structure changes or if loss_mode is intended to be the single source of truth for this logic. The elif on line 448 correctly uses the loss_mode variable. For consistency and correctness, you should use the loss_mode variable in this if condition.
```diff
- if self.config.policy_loss.loss_mode == "vanilla":
+ if loss_mode == "vanilla":
```
thanks! could you remove the tensorboard artifacts from this PR?
Got it, I've removed the tensorboard artifacts in the new commit. Please check if everything looks good now. Let me know if there's anything else needed for this PR.
@eric-haibin-lin hi, the latest commit (dcd8dc3) has passed all CI checks. Could you please review the changes when you have time? I've removed the TensorBoard artifacts as requested, and everything should be ready for your final check. If you're unavailable, I'd also appreciate it if you could suggest or assign another appropriate reviewer. Thanks for your time!
@haolinyan Thanks for your great work. Batch rollout mode has some drawbacks:
Due to the above reasons, we're going to deprecate it and switch to server rollout mode: Agent Loop.
@wuxibin89 Thank you for your comments! It is true that the calculation of reward scores in

In addition, we would like to emphasize that the main dilemma we currently face in applying RL to many industrial tasks lies in how to define effective reward signals. We therefore adopt remote rewards, guiding LLMs (such as GPT-4) to score responses through prompt design, so as to enable rapid exploration and experience accumulation in the initial stage (we believe we are not alone in this). In this context, we propose this scheme to improve the training efficiency of verl, with the hope that it can be applied more efficiently to a wide range of tasks. Finally, to avoid duplicate development, we would like to ask whether there are already plans to implement reward calculation in
@haolinyan Good job! But I have an error after running your recipe. Error is "omegaconf.errors.ConfigAttributeError: Key 'ray_init' is not in struct" on recipe/async_reward_agent/main_ppo.py 226 |
@edc3000 Thanks for using our recipe! The error occurs because

We recommend merging our PR based on this commit: 3e2bceb and trying the training again. Let us know if you run into any further issues!
@haolinyan, your work is very helpful to me, and I am using your recipe to train my RL model. But I have now found that the entropy in the actor is unusual, as in this picture (however, the reward and response length are normal and rising). I really need your help.
@edc3000 I suspect this might be related to your training parameter settings, specifically PPO-related coefficients like
@haolinyan I think this might be caused by the code in async_reward_agent/main_ppo.py. In the function run(), fsdp_workers is imported from verl, not from your code. I changed it like


What does this PR do?
This PR introduces an asynchronous reward agent that schedules remote reward requests to mitigate communication bottlenecks in RL training scenarios that rely on remote reward services (e.g., LLM-as-a-Judge, RAG, hybrid rule-based scoring). By leveraging the "mini-batch pipeline + one-step off-policy" strategy, it overlaps communication latency with GPU computation, significantly improving training efficiency.
Checklist Before Starting
- Title follows the format `[{modules}] {type}: {description}` (this will be checked by the CI)

Test
To validate this solution, we utilize the GSM8K dataset and introduce randomized artificial delays ranging from 1 to 40 seconds during reward computation for each sample, simulating the latency typically incurred when calling remote reward services.
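The delay injection described above can be sketched as a wrapper around an ordinary reward function. This is an illustrative sketch only: the function name and the configurable delay bounds are assumptions, and the scoring rule is a placeholder rather than the recipe's actual GSM8K checker (the text specifies a 1-40 second range).

```python
import random
import time


def delayed_reward(data_source, response_str, ground_truth, extra_info=None,
                   min_delay_s=1.0, max_delay_s=40.0):
    """Emulate a slow remote reward service by sleeping for a random
    duration before scoring, simulating remote-call latency."""
    time.sleep(random.uniform(min_delay_s, max_delay_s))
    # Placeholder rule-based score; a real setup would parse the GSM8K
    # final answer instead of doing a substring check.
    return 1.0 if ground_truth in response_str else 0.0
```

Because each sample's delay is drawn independently, a synchronous reward manager would pay roughly the sum of these delays per batch, which is exactly the latency the mini-batch pipeline is designed to hide.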
The experimental results show that:
API and Usage Example
1. Reward Function Configuration:
Users can flexibly integrate a remote reward service (such as LLM-as-a-Judge, RAG-enhanced scoring, hybrid rule-based + model scoring, etc.) in two ways:
For example, users can implement a reward class that calls the OpenAI-style API to score individual responses, then performs group-wise post-processing of the results.
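Such a reward class might look like the following sketch. Everything here is illustrative: the class and method names are hypothetical, and the chat-completion call is abstracted behind an injectable `judge_fn` (in practice this would wrap an OpenAI-style client call) so the scoring and group post-processing logic can be tested without network access.

```python
class LLMJudgeReward:
    """Illustrative LLM-as-a-Judge reward: asks an OpenAI-style endpoint to
    grade each response, then post-processes scores within a group."""

    def __init__(self, judge_fn):
        # judge_fn: (prompt: str) -> str, e.g. a wrapper around a
        # chat-completions API call that returns the judge's raw text.
        self.judge_fn = judge_fn

    def score_one(self, response_str, ground_truth):
        prompt = (
            "Rate the response against the reference on a 0-10 scale. "
            f"Reference: {ground_truth}\nResponse: {response_str}\nScore:"
        )
        raw = self.judge_fn(prompt)
        try:
            # Clamp to [0, 10] and normalize to [0, 1].
            return max(0.0, min(10.0, float(raw.strip()))) / 10.0
        except ValueError:
            return 0.0  # unparseable judge output -> zero reward

    def score_group(self, responses, ground_truth):
        # Group-wise post-processing: mean-center scores within the group.
        scores = [self.score_one(r, ground_truth) for r in responses]
        mean = sum(scores) / len(scores)
        return [s - mean for s in scores]
```

Keeping the API call behind `judge_fn` also makes it easy to swap judges (or mock them in unit tests) without touching the scoring logic.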
Then, specify the function name and file path in the training configuration:
2. Training Configuration
When launching a training process, the parameters below should be configured:
```shell
# Make sure you set the correct path of the config folder.
# `+mini_batch_pipeline=True` enables the mini-batch pipeline strategy.
python3 -m recipe.async_reward_agent.main_ppo \
    --config-path="${HOME}/verl/trainer/config" \
    custom_reward_function.path=${reward_file} \
    custom_reward_function.name=${reward_function_name} \
    reward_model.reward_manager=batch \
    reward_model.launch_reward_fn_async=True \
    +mini_batch_pipeline=True
```

Design & Code Changes
We designed an asynchronous reward agent that handles concurrent requests and manages their lifecycle. We then leveraged the one-step off-policy training and mini-batch pipeline strategies to overlap communication latency with computation:
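The overlap idea can be illustrated with a toy sketch (not the recipe's actual classes; the function names and the use of a thread pool here are assumptions): reward requests for all mini-batches are submitted up front, and training on each mini-batch begins as soon as its rewards arrive, so the network latency of later batches hides behind the computation on earlier ones.

```python
from concurrent.futures import ThreadPoolExecutor


def pipeline_step(mini_batches, remote_reward, train_on):
    """Toy mini-batch pipeline: submit every reward request immediately,
    then consume results batch-by-batch so reward latency for batch k+1
    overlaps with the training computation on batch k."""
    with ThreadPoolExecutor(max_workers=max(1, len(mini_batches))) as pool:
        futures = [pool.submit(remote_reward, mb) for mb in mini_batches]
        results = []
        for mb, fut in zip(mini_batches, futures):
            rewards = fut.result()  # wait only for this batch's rewards
            results.append(train_on(mb, rewards))
    return results
```

The one-step off-policy strategy extends the same idea across iterations: rewards requested during step t are consumed when updating the policy at step t+1, at the cost of training on slightly stale rollouts.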
For detailed design and code changes, please refer to the documentation.
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI via the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)