[algo] refactor: Rollout Importance Sampling - Separate IS Weights from Rejection Sampling by szrlee · Pull Request #3915 · verl-project/verl

szrlee · 2025-10-26T20:30:11Z

Refactor Rollout Importance Sampling: Separate IS Weights from Rejection Sampling

Summary

Refactors rollout importance sampling to properly separate IS weight correction from rejection sampling, fixes loss normalization for fully masked sequences, and makes the veto mechanism opt-in (disabled by default).

This PR contains 3 important commits:

Main refactoring (39dd2e4): Separates IS weight correction from rejection sampling
Loss fix (ab3e8af): Excludes fully masked sequences from seq-mean loss denominator
Veto default (10350b6): Changes veto threshold default from 1e-4 to None (opt-in)

Motivation

The previous implementation applied rejection by zeroing IS weights, which conflated two distinct mechanisms. This refactoring separates them to follow correct rejection sampling principles and improves loss normalization.

Main Changes

1. Separates Two Mechanisms (`39dd2e4`)

IS weights (rollout_is_weights): Ratios π_train/π_rollout with processing applied

Safety-bounded to [exp(-20), exp(20)] ≈ [2e-9, 5e8] to prevent overflow:
- Token level: bounds per-token ratios
- Sequence/geometric: bounds aggregated ratio (broadcast to all tokens)
Truncate mode: upper clamped via .clamp(max=upper_threshold)
Mask mode: safety-bounded ratios preserved (no threshold clamping)
All modes: zeroed at padding positions
Preserved for policy gradient calculations

Rejection sampling (modified_response_mask): Applied via response_mask

Mask mode: Excludes tokens/sequences with outlier IS ratios
Veto: Excludes sequences with catastrophic tokens (checks unclamped per-token ratios)
Used for loss aggregation (excluded from denominator)
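The separation described above can be sketched for a single sequence at token level. This is a minimal illustration in plain Python, not the code in mismatch_helper.py: the function name `rollout_is_sketch`, its arguments, and the use of lists in place of tensors are all assumptions made for the example.

```python
import math

SAFETY_BOUND = 20.0  # log-ratio bound taken from the description: exp(-20)..exp(20)

def rollout_is_sketch(log_ratios, response_mask, mode, lower, upper):
    """Hypothetical helper: return (is_weights, modified_mask) for one sequence.

    log_ratios:    per-token log(pi_train / pi_rollout)
    response_mask: 1 for real response tokens, 0 for padding
    """
    weights, new_mask = [], []
    for log_r, m in zip(log_ratios, response_mask):
        # Safety bound is applied in log space so exp() cannot overflow.
        w = math.exp(max(-SAFETY_BOUND, min(SAFETY_BOUND, log_r)))
        keep = m
        if mode == "truncate":
            w = min(w, upper)              # clamp the upper side only
        elif mode == "mask":
            if not (lower <= w <= upper):  # outlier: reject via the mask,
                keep = 0                   # the weight itself is left alone
        weights.append(w * m)              # padding is zeroed in every mode
        new_mask.append(keep)
    return weights, new_mask

log_ratios = [0.0, math.log(5.0), 0.1, 0.0]
mask = [1, 1, 1, 0]  # last position is padding
w_trunc, m_trunc = rollout_is_sketch(log_ratios, mask, "truncate", 0.5, 2.0)
w_mask, m_mask = rollout_is_sketch(log_ratios, mask, "mask", 0.5, 2.0)
```

In this sketch, truncate mode changes the weight of the outlier token and leaves the mask untouched, while mask mode does the reverse: the weight stays near 5 and the token is dropped from the mask.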

2. Fixes Seq-Mean Loss Normalization (`ab3e8af`)

Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator:

Uses masked_mean utility for proper masking
Adds epsilon to prevent division by zero
Ensures fully masked sequences don't affect loss computation
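To see why fully masked sequences matter for the denominator, here is a rough sketch of a seq-mean-token-mean aggregation in plain Python. The function name, the epsilon value, and the list-based layout are assumptions for illustration; the actual change lives in core_algos.py and may differ in detail.

```python
def seq_mean_token_mean_sketch(loss, mask, eps=1e-8):
    """Hypothetical stand-in: average per-sequence token means over the
    sequences that still have at least one unmasked token."""
    seq_losses, seq_valid = [], []
    for row_loss, row_mask in zip(loss, mask):
        n_tok = sum(row_mask)
        tok_sum = sum(l * m for l, m in zip(row_loss, row_mask))
        seq_losses.append(tok_sum / (n_tok + eps))     # eps guards n_tok == 0
        seq_valid.append(1.0 if n_tok > 0 else 0.0)    # fully masked -> not counted
    numerator = sum(s * v for s, v in zip(seq_losses, seq_valid))
    return numerator / (sum(seq_valid) + eps)          # eps guards an empty batch

loss = [[1.0, 3.0], [4.0, 4.0]]
mask = [[1, 1], [0, 0]]  # second sequence is fully rejected
fixed = seq_mean_token_mean_sketch(loss, mask)

# Dividing by the full batch size instead counts the rejected sequence as a zero.
naive = (2.0 + 0.0) / len(loss)
```

With one of two sequences rejected, the sketch returns roughly 2.0, whereas dividing by the batch size would report 1.0 and understate the loss.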

3. Makes Veto Opt-In (`10350b6`)

Changed rollout_is_veto_threshold default from 1e-4 to None:

Veto mechanism is now opt-in (disabled by default)
Users must explicitly enable via config
Updated across 11 files (configs, docs, examples)

API Changes (Breaking)

Breaking change affecting ALL users: compute_rollout_importance_weights() now returns 3 values instead of 2:

Before: (weights_proto, metrics)
After: (weights_proto, modified_response_mask, metrics)

Migration required: All callers must be updated to handle the new return signature, regardless of which rollout_is_mode you use:

Truncate mode: Must still unpack 3 values (modified_response_mask is unchanged unless veto is enabled)
Mask mode: Must unpack 3 values AND update batch.response_mask with rejection applied
Veto enabled: Must update batch.response_mask regardless of mode

Files Changed

Main refactoring (39dd2e4):

docs/advance/rollout_is.md                       | 115 ++++++++++++++++++-----
tests/trainer/ppo/test_rollout_is.py             |  81 ++++++++++++++--
tests/trainer/ppo/test_rollout_is_integration.py |  12 +--
verl/trainer/ppo/mismatch_helper.py              | 105 ++++++++++++---------
verl/trainer/ppo/ray_trainer.py                  |  50 ++++++----
5 files changed, 267 insertions(+), 96 deletions(-)

Docs clarification (a5aa743):

docs/advance/rollout_is.md          | 23 ++++++++++++++---------
verl/trainer/ppo/mismatch_helper.py | 13 +++++++------
2 files changed, 21 insertions(+), 15 deletions(-)

Loss fix (ab3e8af):

verl/trainer/ppo/core_algos.py | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

Veto default (10350b6):

11 files changed, 28 insertions(+), 26 deletions(-)
(configs, docs, examples, mismatch_helper, ray_trainer)

Benefits

Correct loss normalization: Rejected samples excluded from denominator
Mode-specific weight processing: Truncate clamps, mask preserves safety-bounded ratios
Clear separation of concerns: Between IS correction and rejection
Safer defaults: Veto mechanism opt-in to prevent unexpected behavior
Numerical stability: Safety bounds prevent overflow; the added epsilon prevents division by zero in seq-mean modes

Testing

All tests passing (11/11):

pytest tests/trainer/ppo/test_rollout_is*.py -v

New test test_mask_mode() verifies:

IS weights remain non-zero for rejected samples (safety-bounded ratios, not zeroed)
Rejection correctly applied via response_mask (not by zeroing weights)

Migration Guide

API Signature Change (Required for ALL users)

# Before: 2-value return
weights_proto, metrics = compute_rollout_importance_weights(...)

# After: 3-value return (ALL users must update)
weights_proto, modified_response_mask, metrics = compute_rollout_importance_weights(...)

# ALWAYS update batch with modified response_mask
batch.response_mask = modified_response_mask

Mode-Specific Behavior

Truncate mode (rollout_is_mode="truncate"):

IS weights: upper clamped via .clamp(max=upper_threshold)
modified_response_mask equals input response_mask (unchanged for outlier ratios)
No outlier rejection applied, but must still handle 3-value return
Veto rejection (if enabled) still applies to mask

Mask mode (rollout_is_mode="mask"):

IS weights: safety-bounded ratios preserved (no threshold clamping)
modified_response_mask has outliers excluded (weights outside [lower, upper])
Rejection applied via mask, NOT by modifying IS weights
Veto rejection (if enabled) also applies to mask

Veto enabled (any mode with rollout_is_veto_threshold set):

Checks unclamped per-token ratios π_train(t)/π_rollout(t) (before safety bound)
Sequences with catastrophic tokens excluded from modified_response_mask
Works independently of truncate/mask mode
Does NOT modify IS weights
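A small sketch of the veto step may help. It assumes that a token counts as catastrophic when its unclamped ratio falls below the threshold, which the small former default of 1e-4 suggests but the description does not state outright. The function name `apply_veto_sketch` and the list-based inputs are hypothetical.

```python
import math

def apply_veto_sketch(log_ratios, response_mask, veto_threshold):
    """Hypothetical helper: drop whole sequences that contain a catastrophic token.

    Assumption: catastrophic means ratio < veto_threshold, checked on the raw
    per-token log-ratios before any safety bound or threshold clamp.
    """
    if veto_threshold is None:                 # new default: veto disabled
        return [list(row) for row in response_mask]
    log_thr = math.log(veto_threshold)
    out = []
    for row_lr, row_m in zip(log_ratios, response_mask):
        catastrophic = any(m == 1 and lr < log_thr
                           for lr, m in zip(row_lr, row_m))
        # The mask is zeroed for the whole sequence; IS weights are not touched.
        out.append([0] * len(row_m) if catastrophic else list(row_m))
    return out

log_ratios = [[0.0, math.log(1e-6)], [0.0, -0.1]]
mask = [[1, 1], [1, 1]]
vetoed = apply_veto_sketch(log_ratios, mask, 1e-4)
disabled = apply_veto_sketch(log_ratios, mask, None)
```

The first sequence contains a token with ratio 1e-6 and is removed entirely when the threshold is 1e-4; with the threshold left at None the mask passes through unchanged.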

Veto Default Change

If you relied on the default veto threshold (1e-4), explicitly enable it:

# Old: enabled by default with threshold=1e-4
# New: opt-in (default is None)
rollout_is_veto_threshold: 1e-4

Reference

When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch

Liu, Li, Fu, Wang, Liu, Shen (2025)

BibTeX

@misc{liu-li-2025,
  title = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch},
  url = {https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda},
  author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen},
  year = {2025},
  month = september,
}

The previous implementation applied rejection by zeroing IS weights, which conflated two distinct mechanisms. This refactoring properly separates IS weight correction from rejection sampling to follow correct principles.

This commit separates two mechanisms:

IS Weights (rollout_is_weights): Always TRUE ratios π_train/π_rollout
- Never zeroed, even for rejected samples
- Preserved for policy gradient calculations

Rejection Sampling (modified_response_mask): Applied via response_mask
- Mask mode: Excludes tokens/sequences with outlier IS ratios
- Veto: Excludes sequences with catastrophic tokens
- Used for loss aggregation (excluded from denominator)

This ensures:
- Correct loss normalization (rejected samples excluded from denominator)
- True IS ratios preserved for policy gradient calculations
- Clear separation of concerns between IS correction and rejection

Changes:
- compute_rollout_importance_weights() now returns 3 values instead of 2
- Always update batch response_mask with rejection applied
- Updated all tests to verify new behavior
- Comprehensive documentation update with BibTeX citation

Reference: When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch, Liu, Li, Fu, Wang, Liu, Shen (2025) https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda

gemini-code-assist

Code Review

This pull request is a well-executed refactoring of the rollout importance sampling mechanism. It correctly separates the concerns of Importance Sampling (IS) weight correction and rejection sampling, which is a significant improvement for both correctness and code clarity. The changes are consistently applied across the core logic, trainer integration, tests, and documentation. The new tests, especially test_mask_mode, are comprehensive and accurately validate the new behavior. I have identified one high-severity issue concerning misleading documentation within the mismatch_helper.py docstring, which could lead to unexpected behavior for users of the truncate mode. My review comment provides a suggestion to clarify this. Overall, this is a high-quality contribution.

verl/trainer/ppo/mismatch_helper.py

docs/advance/rollout_is.md

Fixed two documentation issues:
1. Truncate mode only clamps upper bound (not [1, upper])
2. Veto applies independently of rollout_is_mode

The previous documentation was misleading:
- Stated 'no rejection' for truncate mode (veto can still reject)
- Stated clamp at [1, upper] (only upper is clamped)

Changes:
- Clarified truncate only clamps max (no lower bound)
- Emphasized veto applies in both truncate and mask modes
- Updated docstring, docs, and in-code comments
- Prevents silent data loss when using truncate mode

Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator using masked_mean, and added epsilon to prevent division by zero.

Changed rollout_is_veto_threshold default from 1e-4 to None, making the veto mechanism opt-in across 11 files (configs, runtime, docs).

wuxibin89 · 2025-10-27T13:50:25Z

verl/trainer/ppo/mismatch_helper.py


-    # Apply response_mask to ensure weights are 0 where mask is 0
+    # Zero out padding positions in IS weights for correct aggregation
+    # This is different from rejection - padding must be zeroed regardless of mode


Why not also mask out rollout_is_weights by modified_response_mask?

rollout_is_weights can have non-zero values at rejected positions (line 254 only zeros padding)

But when computing loss: pg_losses = pg_losses * rollout_is_weights then agg_loss(pg_losses, modified_response_mask)

The modified_response_mask has 0s at rejected positions, so in masked_mean:

Numerator: sum(pg_losses * modified_response_mask) - rejected positions contribute 0 (masked out)

Denominator: sum(modified_response_mask) - rejected positions not counted

Result: Even though rollout_is_weights has non-zero values at rejected positions, those values get multiplied by 0 in the mask during aggregation, so they don't affect the final loss. This design correctly separates:

IS weights: The actual importance ratios (informational)

Rejection sampling: Which samples to train on (via modified_response_mask)
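The argument above can be checked with a few numbers. The sketch below uses a local `masked_mean` written for this example (a plain-Python stand-in, not verl's utility of the same name) and made-up loss and weight values.

```python
def masked_mean(values, mask):
    """Local stand-in: mean of values over positions where mask == 1."""
    return sum(v * m for v, m in zip(values, mask)) / sum(mask)

pg_losses = [0.5, 0.2, 0.4]
modified_mask = [1, 0, 1]          # middle token rejected
original_mask = [1, 1, 1]

weights_kept = [1.0, 7.0, 1.5]     # rejected token keeps its non-zero weight
weights_zeroed = [1.0, 0.0, 1.5]   # same weights with the rejected token zeroed

def weighted(losses, weights):
    return [l * w for l, w in zip(losses, weights)]

# New behaviour: rejection comes from the mask, so the weight at the
# rejected position is multiplied by 0 and never reaches the loss.
loss_kept = masked_mean(weighted(pg_losses, weights_kept), modified_mask)
loss_zeroed = masked_mean(weighted(pg_losses, weights_zeroed), modified_mask)

# Old behaviour: zero the weight but keep the token in the denominator.
loss_old = masked_mean(weighted(pg_losses, weights_zeroed), original_mask)
```

Here `loss_kept` and `loss_zeroed` are identical (about 0.55), while `loss_old` divides the same numerator by 3 and comes out near 0.37, which is the normalization problem the refactoring removes.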

Fix inaccurate documentation about IS weight processing:
- IS weights are safety-bounded to [exp(-20), exp(20)], not "true ratios"
- IS weights ARE zeroed at padding (not "never zeroed")
- Truncate mode: safety-bounded + upper clamped
- Mask mode: safety-bounded only (no threshold clamping)
- Veto checks unclamped ratios before safety bounds

Add "Operation Modes" section documenting independent control flags:
- rollout_is_threshold: main on/off switch
- rollout_is: controls IS weight application to loss
- Rejection sampling (mask mode) applies regardless of rollout_is flag
- Include mode combinations table and recommended workflow

Update terminology throughout:
- "safety-bounded ratios" replaces "true ratios" for mask mode
- Update code comments in ray_trainer.py and test files

zhaochenyang20 · 2025-10-27T22:12:26Z

Solid, and good to go




@misc


szrlee requested review from PeterSH6, eric-haibin-lin, tongyx361, vermouth1992 and zhaochenyang20 as code owners October 26, 2025 20:30

gemini-code-assist bot reviewed Oct 26, 2025

Review thread on verl/trainer/ppo/mismatch_helper.py (outdated, resolved)

szrlee force-pushed the yingru/mismatch-response-mask branch 2 times, most recently from 678bf01 to 5b982ff Compare October 26, 2025 20:45

vermouth1992 reviewed Oct 27, 2025

docs/advance/rollout_is.md (resolved)

szrlee force-pushed the yingru/mismatch-response-mask branch from 5b982ff to 965f4d0 Compare October 27, 2025 02:38

szrlee force-pushed the yingru/mismatch-response-mask branch 4 times, most recently from e8f5726 to da169f2 Compare October 27, 2025 08:02

szrlee added 2 commits October 27, 2025 16:02

fix(ppo): exclude fully masked sequences from seq-mean loss

ab3e8af

Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator using masked_mean, and added epsilon to prevent division by zero.
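The effect of this fix can be illustrated with a small plain-Python sketch of the seq-mean-token-mean case. The function name `seq_mean_token_mean` and the placement of the epsilon are assumptions for illustration; the actual change uses the `masked_mean` utility on tensors.

```python
def seq_mean_token_mean(loss_mat, loss_mask, eps=1e-8):
    """Hypothetical sketch: mean over sequences of the per-sequence token mean,
    where fully masked sequences do not count toward the denominator."""
    seq_losses = []
    seq_valid = []
    for losses, mask in zip(loss_mat, loss_mask):
        n_tokens = sum(mask)
        token_sum = sum(l * m for l, m in zip(losses, mask))
        # eps guards against division by zero for fully masked sequences.
        seq_losses.append(token_sum / (n_tokens + eps))
        # A sequence is valid only if at least one token survived masking.
        seq_valid.append(1.0 if n_tokens > 0 else 0.0)
    # Average over valid sequences only, not over the full batch size.
    total = sum(l * v for l, v in zip(seq_losses, seq_valid))
    return total / (sum(seq_valid) + eps)
```

For a batch of two sequences where the second is fully rejected, dividing by the batch size would halve the loss, whereas this version returns the first sequence's token mean essentially unchanged.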

feat(rollout_is): set veto threshold default to None

10350b6

Changed rollout_is_veto_threshold default from 1e-4 to None, making the veto mechanism opt-in across 11 files (configs, runtime, docs).

szrlee force-pushed the yingru/mismatch-response-mask branch from da169f2 to 10350b6 Compare October 27, 2025 08:02

wuxibin89 approved these changes Oct 27, 2025


wuxibin89 reviewed Oct 27, 2025


szrlee force-pushed the yingru/mismatch-response-mask branch 2 times, most recently from 8df6da7 to e4b9224 Compare October 27, 2025 14:43

szrlee force-pushed the yingru/mismatch-response-mask branch from e4b9224 to 359c966 Compare October 27, 2025 14:47

zhaochenyang20 merged commit 4424616 into verl-project:main Oct 27, 2025
79 of 82 checks passed

yueming-yuan mentioned this pull request Oct 31, 2025

Decouple IS Weights from Rejection Sampling in MIS THUDM/slime#657

Merged

szrlee mentioned this pull request Nov 2, 2025

[BREAKING][algo] feat: Rollout Correction for General Off-Policy Problems #3984

Merged

zhaochenyang20 merged 5 commits into verl-project:main from szrlee:yingru/mismatch-response-mask


zhaochenyang20 commented Oct 27, 2025


4 participants
