[trainer] feat: Upstream Dynamic Sampling #2988
Hecate0821 wants to merge 78 commits into verl-project:main
Conversation
Code Review
This pull request introduces dynamic sampling from DAPO into the main trainer to improve sample efficiency and training robustness. The core changes involve filtering rollout samples based on reward variance and backfilling mini-batches. A key addition is a new metric for pre-filtering rewards to provide more accurate logging, addressing a noted issue with critic-reward logging. The changes are primarily in verl/trainer/ppo/ray_trainer.py, with corresponding updates to configuration files and metric utilities. My review identified a critical bug in the _extract_reward_extra_infos helper function within ray_trainer.py that would cause a runtime error. The rest of the implementation for dynamic sampling appears to be logically sound and aligns with the goals outlined in the description.
verl/trainer/ppo/ray_trainer.py
Outdated
if key not in standard_keys and isinstance((list, np.ndarray), values):
    reward_extra_infos_dict[key] = values.tolist() if isinstance(np.ndarray, values) else values
The arguments to isinstance are swapped in two places on these lines. The correct usage is isinstance(object, classinfo). The current code will raise a TypeError at runtime when this function is called.
- if key not in standard_keys and isinstance((list, np.ndarray), values):
-     reward_extra_infos_dict[key] = values.tolist() if isinstance(np.ndarray, values) else values
+ if key not in standard_keys and isinstance(values, (list, np.ndarray)):
+     reward_extra_infos_dict[key] = values.tolist() if isinstance(values, np.ndarray) else values
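A minimal, dependency-free reproduction of the flagged bug (using `(list, tuple)` in place of `(list, np.ndarray)` so the snippet needs no third-party imports): `isinstance` takes `(object, classinfo)`, so passing the tuple of types first raises `TypeError` at runtime.

```python
values = [1.0, 2.0]

# Swapped arguments: the classinfo position receives a plain list instance,
# which raises "TypeError: isinstance() arg 2 must be a type ...".
try:
    isinstance((list, tuple), values)
    swapped_ok = True
except TypeError:
    swapped_ok = False

# Correct order: isinstance(object, classinfo).
correct = isinstance(values, (list, tuple))
print(swapped_ok, correct)  # False True
```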
This comment conflicts with pre-commit; ignore.
Great! Having DAPO retool as the baseline.

Rebase with main to fix this.
Co-authored-by: Shawn/Yuxuan Tong <tongyuxuan361@gmail.com>
@gemini-code-assist /review
Code Review
This pull request introduces a dynamic sampling feature, a significant enhancement for improving training efficiency and robustness by filtering and backfilling data batches. The implementation is well-structured, introducing an extensible DynamicFilter class. The configuration changes and new metrics for pre-filter rewards are logical and well-documented. However, I've identified a critical issue in the batch backfilling logic that could lead to incorrect batch sizes, and another high-severity issue regarding the use of zip that could mask potential data inconsistencies. Addressing these points will ensure the feature is robust and reliable.
@gemini-code-assist /review
Code Review
This pull request introduces a dynamic sampling feature to improve training efficiency and robustness. The implementation is well-structured, featuring a modular DynamicFilter class and a clear separation of concerns for metrics and configuration. My review identifies a critical documentation issue regarding feature compatibility and a high-severity bug in the backfill limit logic. Addressing these points will enhance the feature's usability and prevent unexpected runtime failures for users.
What does this PR do?
Overview
We propose to upstream the Dynamic Sampling (also known as dynamic filtering) feature implemented in DAPO to the verl main branch. This feature significantly improves sample efficiency and training robustness through intelligent batch filtering and backfilling strategies.
What is Dynamic Sampling?
Dynamic Sampling is an advanced training strategy that addresses training instability in reinforcement learning by implementing intelligent sample filtering and batch construction:
Core Strategy
Benefits
Implementation Details
Configuration Structure
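As a rough illustration, the added options might be grouped like this (key names are illustrative, not necessarily the PR's exact schema):

```yaml
algorithm:
  dynamic_sampling:
    enable: true              # turn filtering + backfilling on
    metric: reward_variance   # statistic used to decide whether a group is kept
    max_gen_batches: 10       # cap on extra rollout batches generated for backfill
```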
Key Components
1. Dynamic Filtering Logic
2. Batch Accumulation and Backfilling
3. Intelligent Sample Selection
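The three components above can be sketched as a single filter-and-backfill loop. This is a minimal stdlib-only sketch of the idea, not the PR's actual API; `keep_group`, `collect_train_batch`, and `generate_groups` are illustrative names.

```python
from statistics import pstdev

def keep_group(rewards):
    """DAPO-style filter: keep a prompt group only if its rewards vary,
    i.e. the group still carries a gradient signal."""
    return pstdev(rewards) > 0.0

def collect_train_batch(generate_groups, target_size, max_gen_batches):
    """Accumulate kept groups across rollout rounds, backfilling the
    mini-batch with freshly generated samples until it is full."""
    kept = []
    for _ in range(max_gen_batches):
        # One group = all rollouts generated for a single prompt.
        kept.extend(g for g in generate_groups() if keep_group(g))
        if len(kept) >= target_size:
            return kept[:target_size]   # trim to the exact batch size
    raise RuntimeError("generation budget exhausted before batch was filled")
```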
Log Bias
Problem: Critic-Reward Logging Bias
When dynamic sampling is enabled, the logged critic reward is computed on the filtered training batch, which may have discarded high-reward samples during filtering. This produces artificially low critic-reward values that do not reflect the true quality of the generated samples.
Solution: Dual Logging Strategy
We've implemented a comprehensive logging approach that captures both perspectives:
1. Pre-Filter Metrics (Raw Sample Quality)
2. Post-Filter Metrics (Training Batch Quality)
This dual approach provides both an unbiased view of raw sample quality and an accurate picture of the batch the model actually trains on.
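A minimal sketch of how the two views could be logged side by side; the metric names and helper are illustrative, not the PR's actual keys.

```python
def reward_metrics(pre_filter_rewards, post_filter_rewards):
    """Compute both views of reward quality: the raw rollout rewards
    (before dynamic filtering) and the rewards of the batch that
    actually reaches the optimizer."""
    mean = lambda xs: sum(xs) / len(xs)
    return {
        "critic/rewards_pre_filter/mean": mean(pre_filter_rewards),  # raw sample quality
        "critic/rewards/mean": mean(post_filter_rewards),            # training batch quality
    }
```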
Extensible Filter Architecture
To better serve different datasets' reward patterns, we have separated the filtering logic into a modular DynamicFilterManager class that allows users to customize their own filter functions.

Modular Design

The DynamicFilterManager class provides a clean interface for loading custom filter functions:

Custom Filter Interface
Users can implement dataset-specific filter functions following a simple signature:
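A sketch of what the manager and a custom filter might look like; the class layout, method names, and the `keep_mixed_rewards` default are assumptions based on the description above, not the PR's actual code.

```python
import importlib
from typing import Callable, Sequence

# A filter function maps one prompt group's rewards to a keep/drop decision.
FilterFn = Callable[[Sequence[float]], bool]

class DynamicFilterManager:
    """Loads and applies a user-supplied filter function."""

    def __init__(self, filter_fn: FilterFn):
        self.filter_fn = filter_fn

    @classmethod
    def from_path(cls, dotted_path: str) -> "DynamicFilterManager":
        # e.g. "my_project.filters.keep_mixed_rewards" (hypothetical path)
        module_name, fn_name = dotted_path.rsplit(".", 1)
        return cls(getattr(importlib.import_module(module_name), fn_name))

    def apply(self, groups):
        return [g for g in groups if self.filter_fn(g)]

def keep_mixed_rewards(rewards: Sequence[float]) -> bool:
    """Default DAPO-style filter: keep groups whose rewards are not all equal."""
    return max(rewards) > min(rewards)
```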
Configuration Examples
Default Mixed Rewards Filter (Original DAPO):
Custom Filter for Specific Datasets:
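The two configuration styles above might look roughly like this (keys and the dotted path are illustrative, not the PR's exact schema):

```yaml
# Default mixed-rewards filter (original DAPO behavior)
algorithm:
  dynamic_sampling:
    enable: true
    filter_fn: null   # falls back to the built-in mixed-rewards filter
---
# Custom filter for a specific dataset, loaded from a user module
algorithm:
  dynamic_sampling:
    enable: true
    filter_fn: my_project.filters.keep_diverse_solutions
```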
This modular architecture enables easy adaptation to different datasets' unique reward patterns - from mathematical reasoning tasks requiring solution diversity, to code generation needing correctness variation, to creative tasks demanding quality spread. Users can implement domain-specific filtering strategies without modifying the core Dynamic Sampling infrastructure.
Experimental Results
We tested this feature with the DAPO task and observed significant improvements:
Future Work
In the DAPO recipe, oversampling is available to generate a customizable rollout batch size, avoiding backfill overhead. Our current implementation does not have this mechanism.
To achieve that, we may be able to integrate the oversample PR into dynamic sampling.
Checklist Before Starting
- Title follows the format [{modules}] {type}: {description} (this will be checked by the CI).
  - {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, like [megatron, fsdp, doc].
  - {type} is in feat, fix, refactor, chore, test.
  - If the PR breaks any API, prepend [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching.

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always.
- Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)