[trainer] feat: Upstream Dynamic Sampling #2988
Hecate0821 wants to merge 78 commits into verl-project:main
Conversation
Code Review
This pull request introduces dynamic sampling from DAPO into the main trainer to improve sample efficiency and training robustness. The core changes involve filtering rollout samples based on reward variance and backfilling mini-batches. A key addition is a new metric for pre-filtering rewards to provide more accurate logging, addressing a noted issue with critic-reward logging. The changes are primarily in verl/trainer/ppo/ray_trainer.py, with corresponding updates to configuration files and metric utilities. My review identified a critical bug in the _extract_reward_extra_infos helper function within ray_trainer.py that would cause a runtime error. The rest of the implementation for dynamic sampling appears to be logically sound and aligns with the goals outlined in the description.
verl/trainer/ppo/ray_trainer.py
Outdated
if key not in standard_keys and isinstance((list, np.ndarray), values):
    reward_extra_infos_dict[key] = values.tolist() if isinstance(np.ndarray, values) else values
The arguments to isinstance are swapped in two places on these lines. The correct usage is isinstance(object, classinfo). The current code will raise a TypeError at runtime when this function is called.
- if key not in standard_keys and isinstance((list, np.ndarray), values):
-     reward_extra_infos_dict[key] = values.tolist() if isinstance(np.ndarray, values) else values
+ if key not in standard_keys and isinstance(values, (list, np.ndarray)):
+     reward_extra_infos_dict[key] = values.tolist() if isinstance(values, np.ndarray) else values
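A minimal, dependency-free reproduction of the flagged bug (using `(list, tuple)` in place of `(list, np.ndarray)` so the snippet needs no third-party imports): `isinstance` takes `(object, classinfo)`, so passing the tuple of types first raises `TypeError` at runtime.

```python
values = [1.0, 2.0]

# Swapped arguments: the classinfo position receives a plain list instance,
# which raises "TypeError: isinstance() arg 2 must be a type ...".
try:
    isinstance((list, tuple), values)
    swapped_ok = True
except TypeError:
    swapped_ok = False

# Correct order: isinstance(object, classinfo).
correct = isinstance(values, (list, tuple))
print(swapped_ok, correct)  # False True
```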
This comment conflicts with pre-commit; ignore.
Great! Having DAPO retool as the baseline.

Rebase with main to fix this.
Co-authored-by: Shawn/Yuxuan Tong <tongyuxuan361@gmail.com>
@gemini-code-assist /review
Code Review
This pull request introduces a dynamic sampling feature, a significant enhancement for improving training efficiency and robustness by filtering and backfilling data batches. The implementation is well-structured, introducing an extensible DynamicFilter class. The configuration changes and new metrics for pre-filter rewards are logical and well-documented. However, I've identified a critical issue in the batch backfilling logic that could lead to incorrect batch sizes, and another high-severity issue regarding the use of zip that could mask potential data inconsistencies. Addressing these points will ensure the feature is robust and reliable.
@gemini-code-assist /review
Code Review
This pull request introduces a dynamic sampling feature to improve training efficiency and robustness. The implementation is well-structured, featuring a modular DynamicFilter class and a clear separation of concerns for metrics and configuration. My review identifies a critical documentation issue regarding feature compatibility and a high-severity bug in the backfill limit logic. Addressing these points will enhance the feature's usability and prevent unexpected runtime failures for users.
What does this PR do?
Overview
We propose to upstream the Dynamic Sampling (also known as dynamic filtering) feature implemented in DAPO to the verl main branch. This feature significantly improves sample efficiency and training robustness through intelligent batch filtering and backfilling strategies.
What is Dynamic Sampling?
Dynamic Sampling is an advanced training strategy that addresses training instability in reinforcement learning by implementing intelligent sample filtering and batch construction:
Core Strategy
Benefits
Implementation Details
Configuration Structure
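As a rough illustration, the added options might be grouped like this (key names are illustrative, not necessarily the PR's exact schema):

```yaml
algorithm:
  dynamic_sampling:
    enable: true              # turn filtering + backfilling on
    metric: reward_variance   # statistic used to decide whether a group is kept
    max_gen_batches: 10       # cap on extra rollout batches generated for backfill
```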
Key Components
1. Dynamic Filtering Logic
2. Batch Accumulation and Backfilling
3. Intelligent Sample Selection
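The three components above can be sketched as a single filter-and-backfill loop. This is a minimal stdlib-only sketch of the idea, not the PR's actual API; `keep_group`, `collect_train_batch`, and `generate_groups` are illustrative names.

```python
from statistics import pstdev

def keep_group(rewards):
    """DAPO-style filter: keep a prompt group only if its rewards vary,
    i.e. the group still carries a gradient signal."""
    return pstdev(rewards) > 0.0

def collect_train_batch(generate_groups, target_size, max_gen_batches):
    """Accumulate kept groups across rollout rounds, backfilling the
    mini-batch with freshly generated samples until it is full."""
    kept = []
    for _ in range(max_gen_batches):
        # One group = all rollouts generated for a single prompt.
        kept.extend(g for g in generate_groups() if keep_group(g))
        if len(kept) >= target_size:
            return kept[:target_size]   # trim to the exact batch size
    raise RuntimeError("generation budget exhausted before batch was filled")
```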
Log Bias
Problem: Critic-Reward Logging Bias
When dynamic sampling is enabled, the logged critic reward is computed on the filtered training batch, which may have discarded high-reward samples during filtering. This produces artificially low critic-reward values that do not reflect the true quality of the generated samples.
Solution: Dual Logging Strategy
We've implemented a comprehensive logging approach that captures both perspectives:
1. Pre-Filter Metrics (Raw Sample Quality)
2. Post-Filter Metrics (Training Batch Quality)
This dual approach provides both an unbiased view of raw sample quality and an accurate picture of the batch the model actually trains on.
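A minimal sketch of how the two views could be logged side by side; the metric names and helper are illustrative, not the PR's actual keys.

```python
def reward_metrics(pre_filter_rewards, post_filter_rewards):
    """Compute both views of reward quality: the raw rollout rewards
    (before dynamic filtering) and the rewards of the batch that
    actually reaches the optimizer."""
    mean = lambda xs: sum(xs) / len(xs)
    return {
        "critic/rewards_pre_filter/mean": mean(pre_filter_rewards),  # raw sample quality
        "critic/rewards/mean": mean(post_filter_rewards),            # training batch quality
    }
```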
Extensible Filter Architecture
To better serve different datasets' reward patterns, we have separated the filtering logic into a modular DynamicFilterManager class that allows users to customize their own filter functions.

Modular Design

The DynamicFilterManager class provides a clean interface for loading custom filter functions:

Custom Filter Interface
Users can implement dataset-specific filter functions following a simple signature:
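A sketch of what the manager and a custom filter might look like; the class layout, method names, and the `keep_mixed_rewards` default are assumptions based on the description above, not the PR's actual code.

```python
import importlib
from typing import Callable, Sequence

# A filter function maps one prompt group's rewards to a keep/drop decision.
FilterFn = Callable[[Sequence[float]], bool]

class DynamicFilterManager:
    """Loads and applies a user-supplied filter function."""

    def __init__(self, filter_fn: FilterFn):
        self.filter_fn = filter_fn

    @classmethod
    def from_path(cls, dotted_path: str) -> "DynamicFilterManager":
        # e.g. "my_project.filters.keep_mixed_rewards" (hypothetical path)
        module_name, fn_name = dotted_path.rsplit(".", 1)
        return cls(getattr(importlib.import_module(module_name), fn_name))

    def apply(self, groups):
        return [g for g in groups if self.filter_fn(g)]

def keep_mixed_rewards(rewards: Sequence[float]) -> bool:
    """Default DAPO-style filter: keep groups whose rewards are not all equal."""
    return max(rewards) > min(rewards)
```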
Configuration Examples
Default Mixed Rewards Filter (Original DAPO):
Custom Filter for Specific Datasets:
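The two configuration styles above might look roughly like this (keys and the dotted path are illustrative, not the PR's exact schema):

```yaml
# Default mixed-rewards filter (original DAPO behavior)
algorithm:
  dynamic_sampling:
    enable: true
    filter_fn: null   # falls back to the built-in mixed-rewards filter
---
# Custom filter for a specific dataset, loaded from a user module
algorithm:
  dynamic_sampling:
    enable: true
    filter_fn: my_project.filters.keep_diverse_solutions
```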
This modular architecture enables easy adaptation to different datasets' unique reward patterns - from mathematical reasoning tasks requiring solution diversity, to code generation needing correctness variation, to creative tasks demanding quality spread. Users can implement domain-specific filtering strategies without modifying the core Dynamic Sampling infrastructure.
Experimental Results
We tested this feature with the DAPO task and observed significant improvements:
Future Work
In the DAPO recipe, oversampling is available to generate a customizable rollout batch size, avoiding backfill overhead. Our current implementation does not have this mechanism.
To achieve that, we may be able to integrate the oversample PR into dynamic sampling.
Checklist Before Starting
- Title follows the format [{modules}] {type}: {description} (this will be checked by the CI).
  - {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, like [megatron, fsdp, doc].
  - {type} is in feat, fix, refactor, chore, test.
  - If the PR breaks any API, prepend [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching.

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always.
- Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)