[algo] refactor: Rollout Importance Sampling - Separate IS Weights from Rejection Sampling #3915

Merged
zhaochenyang20 merged 5 commits into verl-project:main from szrlee:yingru/mismatch-response-mask
Oct 27, 2025

@szrlee szrlee commented Oct 26, 2025

Refactor Rollout Importance Sampling: Separate IS Weights from Rejection Sampling

Summary

Refactors rollout importance sampling to separate IS weight correction from rejection sampling, fixes loss normalization for fully masked sequences, and makes the veto mechanism opt-in (disabled by default).

This PR contains 3 important commits:

  1. Main refactoring (39dd2e4): Separates IS weight correction from rejection sampling
  2. Loss fix (ab3e8af): Excludes fully masked sequences from seq-mean loss denominator
  3. Veto default (10350b6): Changes veto threshold default from 1e-4 to None (opt-in)

Motivation

The previous implementation applied rejection by zeroing IS weights, which conflated two distinct mechanisms. This refactoring separates them to follow correct rejection sampling principles and improves loss normalization.

Main Changes

1. Separates Two Mechanisms (39dd2e4)

IS weights (rollout_is_weights): Ratios π_train/π_rollout with processing applied

  • Safety-bounded to [exp(-20), exp(20)] ≈ [2e-9, 5e8] to prevent overflow:
    • Token level: bounds per-token ratios
    • Sequence/geometric: bounds aggregated ratio (broadcast to all tokens)
  • Truncate mode: upper clamped via .clamp(max=upper_threshold)
  • Mask mode: safety-bounded ratios preserved (no threshold clamping)
  • All modes: zeroed at padding positions
  • Preserved for policy gradient calculations
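
The three aggregation levels and the safety bound can be illustrated with a minimal sketch. It uses plain Python lists in place of tensors, and the function name `bounded_is_weights` is hypothetical. The definitions of "sequence" (product of per-token ratios) and "geometric" (geometric mean of per-token ratios) are assumptions based on the names, not taken from `mismatch_helper.py`.

```python
import math

SAFETY_BOUND = 20.0  # log-space bound, so ratios stay within [exp(-20), exp(20)]


def bounded_is_weights(log_ratio, response_mask, level="token"):
    """Return per-token IS weights for one sequence.

    log_ratio[t] = log pi_train(t) - log pi_rollout(t)
    response_mask[t] is 1 for response tokens and 0 for padding.
    """
    valid = [lr for lr, m in zip(log_ratio, response_mask) if m]
    if level == "token":
        logs = list(log_ratio)
    elif level == "sequence":
        # Assumption: sequence ratio = product of token ratios (sum in log space).
        logs = [sum(valid)] * len(log_ratio)
    elif level == "geometric":
        # Assumption: geometric mean of token ratios (mean in log space).
        logs = [sum(valid) / max(len(valid), 1)] * len(log_ratio)
    else:
        raise ValueError(f"unknown level: {level}")

    weights = []
    for log_w, m in zip(logs, response_mask):
        # Clamp in log space before exponentiating so exp() cannot overflow.
        clamped = max(-SAFETY_BOUND, min(SAFETY_BOUND, log_w))
        # Padding positions are zeroed in every mode.
        weights.append(math.exp(clamped) if m else 0.0)
    return weights
```

Clamping the log ratio rather than the ratio itself is one way to get the overflow protection described above; for sequence and geometric levels the bound is applied once to the aggregated value, which is then broadcast to every token.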

Rejection sampling (modified_response_mask): Applied via response_mask

  • Mask mode: Excludes tokens/sequences with outlier IS ratios
  • Veto: Excludes sequences with catastrophic tokens (checks unclamped per-token ratios)
  • Used for loss aggregation (excluded from denominator)
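
The split between the two outputs can be sketched as follows. This is an illustrative token-level version with plain Python lists, not the code in `mismatch_helper.py`; the function name `apply_mode` and the explicit `lower` argument are assumptions made for the example.

```python
def apply_mode(weights, response_mask, mode, upper, lower):
    """Return (is_weights, modified_response_mask) for one sequence.

    weights: safety-bounded per-token ratios
    response_mask: 1 for response tokens, 0 for padding
    """
    if mode == "truncate":
        # Truncate: clamp the upper side only; nothing is rejected.
        new_weights = [min(w, upper) for w in weights]
        new_mask = list(response_mask)
    elif mode == "mask":
        # Mask: leave the weights alone and reject outliers through the mask.
        new_weights = list(weights)
        new_mask = [
            m if lower <= w <= upper else 0
            for w, m in zip(weights, response_mask)
        ]
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Zero padding only (original mask), never the rejected positions.
    new_weights = [w * m for w, m in zip(new_weights, response_mask)]
    return new_weights, new_mask
```

Note that in mask mode a rejected token keeps its non-zero weight; only the returned mask records the rejection.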

2. Fixes Seq-Mean Loss Normalization (ab3e8af)

Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator:

  • Uses masked_mean utility for proper masking
  • Adds epsilon to prevent division by zero
  • Ensures fully masked sequences don't affect loss computation
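
A rough sketch of the intended normalization, assuming a batch stored as nested Python lists. The function name `seq_mean_loss` and the epsilon value are illustrative; the actual change lives in `core_algos.py` and uses the `masked_mean` utility.

```python
def seq_mean_loss(loss, mask, token_reduce="mean", eps=1e-8):
    """Average per-sequence losses over sequences that have at least one valid token.

    loss, mask: lists of rows, one row per sequence.
    token_reduce: "mean" (seq-mean-token-mean) or "sum" (seq-mean-token-sum).
    """
    seq_losses, seq_valid = [], []
    for loss_row, mask_row in zip(loss, mask):
        n_tokens = sum(mask_row)
        total = sum(l * m for l, m in zip(loss_row, mask_row))
        if token_reduce == "mean":
            # eps keeps a fully masked row from dividing by zero.
            seq_losses.append(total / (n_tokens + eps))
        elif token_reduce == "sum":
            seq_losses.append(total)
        else:
            raise ValueError(f"unknown token_reduce: {token_reduce}")
        # A fully masked sequence is excluded from the denominator below.
        seq_valid.append(1.0 if n_tokens > 0 else 0.0)

    numerator = sum(s * v for s, v in zip(seq_losses, seq_valid))
    return numerator / (sum(seq_valid) + eps)
```

With one valid and one fully masked sequence, dividing by the batch size would halve the loss; dividing by the count of valid sequences does not.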

3. Makes Veto Opt-In (10350b6)

Changed rollout_is_veto_threshold default from 1e-4 to None:

  • Veto mechanism now opt-in by default
  • Users must explicitly enable via config
  • Updated across 11 files (configs, docs, examples)

API Changes (Breaking)

Breaking change affecting ALL users: compute_rollout_importance_weights() now returns 3 values instead of 2:

  • Before: (weights_proto, metrics)
  • After: (weights_proto, modified_response_mask, metrics)

Migration required: All callers must be updated to handle the new return signature, regardless of which rollout_is_mode you use:

  • Truncate mode: Must still unpack 3 values (modified_response_mask equals the input response_mask unless veto is enabled)
  • Mask mode: Must unpack 3 values AND update batch.response_mask with rejection applied
  • Veto enabled: Must update batch.response_mask regardless of mode

Files Changed

Main refactoring (39dd2e4):

docs/advance/rollout_is.md                       | 115 ++++++++++++++++++-----
tests/trainer/ppo/test_rollout_is.py             |  81 ++++++++++++++--
tests/trainer/ppo/test_rollout_is_integration.py |  12 +--
verl/trainer/ppo/mismatch_helper.py              | 105 ++++++++++++---------
verl/trainer/ppo/ray_trainer.py                  |  50 ++++++----
5 files changed, 267 insertions(+), 96 deletions(-)

Docs clarification (a5aa743):

docs/advance/rollout_is.md          | 23 ++++++++++++++---------
verl/trainer/ppo/mismatch_helper.py | 13 +++++++------
2 files changed, 21 insertions(+), 15 deletions(-)

Loss fix (ab3e8af):

verl/trainer/ppo/core_algos.py | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

Veto default (10350b6):

11 files changed, 28 insertions(+), 26 deletions(-)
(configs, docs, examples, mismatch_helper, ray_trainer)

Benefits

  • Correct loss normalization: Rejected samples excluded from denominator
  • Mode-specific weight processing: Truncate clamps, mask preserves safety-bounded ratios
  • Clear separation of concerns: Between IS correction and rejection
  • Safer defaults: Veto mechanism opt-in to prevent unexpected behavior
  • Numerical stability: Safety bounds prevent overflow, and the added epsilon prevents division by zero in seq-mean modes

Testing

All tests passing (11/11):

pytest tests/trainer/ppo/test_rollout_is*.py -v

New test test_mask_mode() verifies:

  • IS weights remain non-zero for rejected samples (safety-bounded ratios, not zeroed)
  • Rejection correctly applied via response_mask (not by zeroing weights)

Migration Guide

API Signature Change (Required for ALL users)

# Before: 2-value return
weights_proto, metrics = compute_rollout_importance_weights(...)

# After: 3-value return (ALL users must update)
weights_proto, modified_response_mask, metrics = compute_rollout_importance_weights(...)

# ALWAYS update batch with modified response_mask
batch.response_mask = modified_response_mask

Mode-Specific Behavior

Truncate mode (rollout_is_mode="truncate"):

  • IS weights: upper clamped via .clamp(max=upper_threshold)
  • modified_response_mask equals input response_mask (unchanged for outlier ratios)
  • No outlier rejection applied, but must still handle 3-value return
  • Veto rejection (if enabled) still applies to mask

Mask mode (rollout_is_mode="mask"):

  • IS weights: safety-bounded ratios preserved (no threshold clamping)
  • modified_response_mask has outliers excluded (weights outside [lower, upper])
  • Rejection applied via mask, NOT by modifying IS weights
  • Veto rejection (if enabled) also applies to mask

Veto enabled (any mode with rollout_is_veto_threshold set):

  • Checks unclamped per-token ratios π_train(t)/π_rollout(t) (before safety bound)
  • Sequences with catastrophic tokens excluded from modified_response_mask
  • Works independently of truncate/mask mode
  • Does NOT modify IS weights
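
The veto step might look roughly like the sketch below. It assumes that a "catastrophic" token is one whose unclamped ratio falls below the threshold, and it compares in log space so that no ratio ever has to be exponentiated; the function name `apply_veto` is hypothetical.

```python
import math


def apply_veto(log_ratio, response_mask, veto_threshold):
    """Return a new mask with vetoed sequences fully excluded.

    log_ratio, response_mask: lists of rows, one row per sequence.
    veto_threshold: None disables the veto (the new default).
    """
    if veto_threshold is None:
        return [list(row) for row in response_mask]

    log_threshold = math.log(veto_threshold)
    new_mask = []
    for lr_row, mask_row in zip(log_ratio, response_mask):
        # Only real response tokens can trigger a veto; padding is ignored.
        catastrophic = any(
            m and lr < log_threshold for lr, m in zip(lr_row, mask_row)
        )
        # One catastrophic token removes the whole sequence from the mask.
        new_mask.append([0] * len(mask_row) if catastrophic else list(mask_row))
    return new_mask
```

The function touches only the mask, which matches the point above that veto does not modify IS weights.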

Veto Default Change

If you relied on the default veto threshold (1e-4), explicitly enable it:

# Old: enabled by default with threshold=1e-4
# New: opt-in (default is None)
rollout_is_veto_threshold: 1e-4

Reference

When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch

Liu, Li, Fu, Wang, Liu, Shen (2025)

BibTeX

@misc{liu-li-2025,
  title = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch},
  url = {https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda},
  author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen},
  year = {2025},
  month = september,
}

The previous implementation applied rejection by zeroing IS weights, which
conflated two distinct mechanisms. This refactoring properly separates IS
weight correction from rejection sampling to follow correct principles.

This commit separates two mechanisms:

IS Weights (rollout_is_weights): Always TRUE ratios π_train/π_rollout
- Never zeroed, even for rejected samples
- Preserved for policy gradient calculations

Rejection Sampling (modified_response_mask): Applied via response_mask
- Mask mode: Excludes tokens/sequences with outlier IS ratios
- Veto: Excludes sequences with catastrophic tokens
- Used for loss aggregation (excluded from denominator)

This ensures:
- Correct loss normalization (rejected samples excluded from denominator)
- True IS ratios preserved for policy gradient calculations
- Clear separation of concerns between IS correction and rejection

Changes:
- compute_rollout_importance_weights() now returns 3 values instead of 2
- Always update batch response_mask with rejection applied
- Updated all tests to verify new behavior
- Comprehensive documentation update with BibTeX citation

Reference:
When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch
Liu, Li, Fu, Wang, Liu, Shen (2025)
https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a well-executed refactoring of the rollout importance sampling mechanism. It correctly separates the concerns of Importance Sampling (IS) weight correction and rejection sampling, which is a significant improvement for both correctness and code clarity. The changes are consistently applied across the core logic, trainer integration, tests, and documentation. The new tests, especially test_mask_mode, are comprehensive and accurately validate the new behavior. I have identified one high-severity issue concerning misleading documentation within the mismatch_helper.py docstring, which could lead to unexpected behavior for users of the truncate mode. My review comment provides a suggestion to clarify this. Overall, this is a high-quality contribution.

@szrlee szrlee force-pushed the yingru/mismatch-response-mask branch 2 times, most recently from 678bf01 to 5b982ff Compare October 26, 2025 20:45
@szrlee szrlee force-pushed the yingru/mismatch-response-mask branch from 5b982ff to 965f4d0 Compare October 27, 2025 02:38
Fixed two documentation issues:
1. Truncate mode only clamps upper bound (not [1, upper])
2. Veto applies independently of rollout_is_mode

The previous documentation was misleading:
- Stated 'no rejection' for truncate mode (veto can still reject)
- Stated clamp at [1, upper] (only upper is clamped)

Changes:
- Clarified truncate only clamps max (no lower bound)
- Emphasized veto applies in both truncate and mask modes
- Updated docstring, docs, and in-code comments
- Prevents silent data loss when using truncate mode
@szrlee szrlee force-pushed the yingru/mismatch-response-mask branch 4 times, most recently from e8f5726 to da169f2 Compare October 27, 2025 08:02
Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude
fully masked sequences from denominator using masked_mean, and added
epsilon to prevent division by zero.
Changed rollout_is_veto_threshold default from 1e-4 to None, making
the veto mechanism opt-in across 11 files (configs, runtime, docs).
@szrlee szrlee force-pushed the yingru/mismatch-response-mask branch from da169f2 to 10350b6 Compare October 27, 2025 08:02

# Apply response_mask to ensure weights are 0 where mask is 0
# Zero out padding positions in IS weights for correct aggregation
# This is different from rejection - padding must be zeroed regardless of mode

Collaborator

Why not also mask out rollout_is_weights by modified_response_mask?

Collaborator Author

  • rollout_is_weights can have non-zero values at rejected positions (line 254 only zeros padding)
  • But when computing loss: pg_losses = pg_losses * rollout_is_weights then agg_loss(pg_losses, modified_response_mask)
  • The modified_response_mask has 0s at rejected positions, so in masked_mean:
    • Numerator: sum(pg_losses * modified_response_mask) - rejected positions contribute 0 (masked out)
    • Denominator: sum(modified_response_mask) - rejected positions not counted

Result: Even though rollout_is_weights has non-zero values at rejected positions, those values get multiplied by 0 in the mask during aggregation, so they don't affect the final loss. This design correctly separates:

  • IS weights: The actual importance ratios (informational)
  • Rejection sampling: Which samples to train on (via modified_response_mask)
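
The argument above can be checked with a small worked example. This is a plain-Python stand-in for token-mean aggregation with made-up numbers; the local `masked_mean` here is a simplified illustration, not the library utility.

```python
def masked_mean(values, mask, eps=1e-8):
    """Mean of values over positions where mask is 1."""
    numerator = sum(v * m for v, m in zip(values, mask))
    return numerator / (sum(mask) + eps)


pg_losses = [0.5, -1.0, 2.0]
modified_response_mask = [1, 0, 1]  # middle token rejected

# Same batch with two different weights at the rejected position.
weights_kept = [1.1, 7.0, 0.9]    # rejected token keeps its ratio
weights_zeroed = [1.1, 0.0, 0.9]  # rejected token zeroed instead

loss_kept = masked_mean(
    [l * w for l, w in zip(pg_losses, weights_kept)], modified_response_mask
)
loss_zeroed = masked_mean(
    [l * w for l, w in zip(pg_losses, weights_zeroed)], modified_response_mask
)

# The rejected position is multiplied by 0 in the numerator and
# skipped in the denominator, so both losses are identical.
same_loss = loss_kept == loss_zeroed
```

Both losses come out the same because the weight at the rejected position never reaches the numerator, and the denominator counts only the two accepted tokens.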

@szrlee szrlee force-pushed the yingru/mismatch-response-mask branch 2 times, most recently from 8df6da7 to e4b9224 Compare October 27, 2025 14:43
Fix inaccurate documentation about IS weight processing:
- IS weights are safety-bounded to [exp(-20), exp(20)], not "true ratios"
- IS weights ARE zeroed at padding (not "never zeroed")
- Truncate mode: safety-bounded + upper clamped
- Mask mode: safety-bounded only (no threshold clamping)
- Veto checks unclamped ratios before safety bounds

Add "Operation Modes" section documenting independent control flags:
- rollout_is_threshold: main on/off switch
- rollout_is: controls IS weight application to loss
- Rejection sampling (mask mode) applies regardless of rollout_is flag
- Include mode combinations table and recommended workflow

Update terminology throughout:
- "safety-bounded ratios" replaces "true ratios" for mask mode
- Update code comments in ray_trainer.py and test files
@szrlee szrlee force-pushed the yingru/mismatch-response-mask branch from e4b9224 to 359c966 Compare October 27, 2025 14:47
@zhaochenyang20

Solid, and good to go

@zhaochenyang20 zhaochenyang20 merged commit 4424616 into verl-project:main Oct 27, 2025
79 of 82 checks passed
wangboxiong320 pushed a commit to wangboxiong320/verl that referenced this pull request Nov 1, 2025
NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request Nov 3, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…rom Rejection Sampling (verl-project#3915)

# Refactor Rollout Importance Sampling: Separate IS Weights from
Rejection Sampling

## Summary

Refactors rollout importance sampling to properly separate IS weight
correction from rejection sampling, fixes loss normalization for fully
masked sequences, and makes veto mechanism opt-in by default.

This PR contains 3 important commits:
1. **Main refactoring** (39dd2e4): Separates IS weight correction from
rejection sampling
2. **Loss fix** (ab3e8af): Excludes fully masked sequences from
seq-mean loss denominator
3. **Veto default** (10350b6): Changes veto threshold default from 1e-4
to None (opt-in)

## Motivation

The previous implementation applied rejection by zeroing IS weights,
which conflated two distinct mechanisms. This refactoring separates them
to follow correct rejection sampling principles and improves loss
normalization.

## Main Changes

### 1. Separates Two Mechanisms (39dd2e4)

**IS weights** (`rollout_is_weights`): Ratios π_train/π_rollout with
processing applied
- Safety-bounded to [exp(-20), exp(20)] ≈ [2e-9, 5e8] to prevent
overflow:
  * Token level: bounds per-token ratios
* Sequence/geometric: bounds aggregated ratio (broadcast to all tokens)
- Truncate mode: upper clamped via .clamp(max=upper_threshold)
- Mask mode: safety-bounded ratios preserved (no threshold clamping)
- All modes: zeroed at padding positions
- Preserved for policy gradient calculations

**Rejection sampling** (`modified_response_mask`): Applied via
response_mask
- Mask mode: Excludes tokens/sequences with outlier IS ratios
- Veto: Excludes sequences with catastrophic tokens (checks unclamped
per-token ratios)
- Used for loss aggregation (excluded from denominator)

### 2. Fixes Seq-Mean Loss Normalization (ab3e8af)

Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully
masked sequences from denominator:
- Uses `masked_mean` utility for proper masking
- Adds epsilon to prevent division by zero
- Ensures fully masked sequences don't affect loss computation

### 3. Makes Veto Opt-In (10350b6)

Changed `rollout_is_veto_threshold` default from `1e-4` to `None`:
- Veto mechanism now opt-in by default
- Users must explicitly enable via config
- Updated across 11 files (configs, docs, examples)

### API Changes (Breaking)

**Breaking change affecting ALL users**:
`compute_rollout_importance_weights()` now returns 3 values instead of
2:
- Before: `(weights_proto, metrics)`
- After: `(weights_proto, modified_response_mask, metrics)`

**Migration required**: All callers must be updated to handle the new
return signature, regardless of which `rollout_is_mode` you use:
- **Truncate mode**: Must still unpack 3 values (though
`modified_response_mask` is unchanged)
- **Mask mode**: Must unpack 3 values AND update `batch.response_mask`
with rejection applied
- **Veto enabled**: Must update `batch.response_mask` regardless of mode

### Files Changed

**Main refactoring** (39dd2e4):
```
docs/advance/rollout_is.md                       | 115 ++++++++++++++++++-----
tests/trainer/ppo/test_rollout_is.py             |  81 ++++++++++++++--
tests/trainer/ppo/test_rollout_is_integration.py |  12 +--
verl/trainer/ppo/mismatch_helper.py              | 105 ++++++++++++---------
verl/trainer/ppo/ray_trainer.py                  |  50 ++++++----
5 files changed, 267 insertions(+), 96 deletions(-)
```

**Docs clarification** (a5aa743):
```
docs/advance/rollout_is.md          | 23 ++++++++++++++---------
verl/trainer/ppo/mismatch_helper.py | 13 +++++++------
2 files changed, 21 insertions(+), 15 deletions(-)
```

**Loss fix** (ab3e8af):
```
verl/trainer/ppo/core_algos.py | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
```

**Veto default** (10350b6):
```
11 files changed, 28 insertions(+), 26 deletions(-)
(configs, docs, examples, mismatch_helper, ray_trainer)
```

## Benefits

- **Correct loss normalization**: Rejected samples excluded from
denominator
- **Mode-specific weight processing**: Truncate clamps, mask preserves
safety-bounded ratios
- **Clear separation of concerns**: Between IS correction and rejection
- **Safer defaults**: Veto mechanism opt-in to prevent unexpected
behavior
- **Numerical stability**: Safety bounds prevent overflow, prevents
division by zero in seq-mean modes

## Testing

All tests passing (11/11):

```bash
pytest tests/trainer/ppo/test_rollout_is*.py -v
```

New test `test_mask_mode()` verifies:
- IS weights remain non-zero for rejected samples (safety-bounded
ratios, not zeroed)
- Rejection correctly applied via response_mask (not by zeroing weights)

## Migration Guide

### API Signature Change (Required for ALL users)

```python
# Before: 2-value return
weights_proto, metrics = compute_rollout_importance_weights(...)

# After: 3-value return (ALL users must update)
weights_proto, modified_response_mask, metrics = compute_rollout_importance_weights(...)

# ALWAYS update batch with modified response_mask
batch.response_mask = modified_response_mask
```

### Mode-Specific Behavior

**Truncate mode** (`rollout_is_mode="truncate"`):
- IS weights: clamped from above via `.clamp(max=upper_threshold)`
- `modified_response_mask` equals input `response_mask` (unchanged for
outlier ratios)
- No outlier rejection applied, but must still handle 3-value return
- Veto rejection (if enabled) still applies to mask
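
A minimal plain-Python sketch of token-level truncate mode, for intuition only. The function name `truncate_is_weights`, the log-ratio input, and applying the safety bound in log space are assumptions of this example; the real implementation in `mismatch_helper.py` operates on tensors.

```python
import math

SAFETY_BOUND = 20.0  # log-ratio bound; exp(20) is roughly 5e8


def truncate_is_weights(log_ratios, response_mask, upper_threshold):
    """Token-level truncate mode on plain lists (illustrative only)."""
    weights = []
    for row, mask_row in zip(log_ratios, response_mask):
        out = []
        for log_r, m in zip(row, mask_row):
            # Safety bound first, so exp() cannot overflow.
            bounded = max(-SAFETY_BOUND, min(SAFETY_BOUND, log_r))
            # Upper clamp only; small ratios are left as they are.
            w = min(math.exp(bounded), upper_threshold)
            out.append(w * m)  # zero at padding positions
        weights.append(out)
    # Truncate mode does not reject anything: the mask is returned unchanged.
    modified_mask = [list(r) for r in response_mask]
    return weights, modified_mask


log_ratios = [[0.0, math.log(5.0), -1.0]]
response_mask = [[1, 1, 0]]  # last position is padding
weights, modified_mask = truncate_is_weights(
    log_ratios, response_mask, upper_threshold=2.0
)
```

The second token's ratio of 5 is clamped to 2.0, the padded position gets weight 0, and the returned mask is identical to the input mask.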

**Mask mode** (`rollout_is_mode="mask"`):
- IS weights: safety-bounded ratios preserved (no threshold clamping)
- `modified_response_mask` has outliers excluded (weights outside
[lower, upper])
- Rejection applied via mask, NOT by modifying IS weights
- Veto rejection (if enabled) also applies to mask
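
The following sketch illustrates the idea at token level: outliers are removed from the mask while their weights stay as they were. The function name `mask_mode` and the list-based inputs are hypothetical; sequence-level aggregation, where a whole sequence would be rejected at once, is not covered here.

```python
def mask_mode(weights, response_mask, lower, upper):
    """Return (weights, modified_mask); the weights are not modified."""
    modified_mask = []
    for w_row, m_row in zip(weights, response_mask):
        # Keep a token only if it was valid and its weight is within bounds.
        modified_mask.append(
            [m if lower <= w <= upper else 0 for w, m in zip(w_row, m_row)]
        )
    return weights, modified_mask


weights = [[1.0, 5.0, 0.1, 0.0]]
response_mask = [[1, 1, 1, 0]]  # last position is padding
out_weights, modified_mask = mask_mode(
    weights, response_mask, lower=0.5, upper=2.0
)
```

The tokens with weights 5.0 and 0.1 are dropped from the mask, yet their weights remain non-zero in `out_weights`, which matches the behavior `test_mask_mode()` is described as checking.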

**Veto enabled** (any mode with `rollout_is_veto_threshold` set):
- Checks **unclamped per-token ratios** π_train(t)/π_rollout(t) (before
safety bound)
- Sequences with catastrophic tokens excluded from
`modified_response_mask`
- Works independently of truncate/mask mode
- Does NOT modify IS weights
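
A hedged sketch of the veto logic on plain lists. It assumes that a "catastrophic" token is one whose ratio falls below the threshold and that the comparison is done in log space; neither detail is stated above, and the name `apply_veto` is made up for this example.

```python
import math


def apply_veto(log_ratios, response_mask, veto_threshold):
    """Zero the mask of any sequence that contains a catastrophic token."""
    if veto_threshold is None:
        # Opt-in behavior: None disables the veto entirely.
        return [list(r) for r in response_mask]
    log_veto = math.log(veto_threshold)
    modified_mask = []
    for row, m_row in zip(log_ratios, response_mask):
        # Only valid (unmasked) tokens can trigger the veto.
        catastrophic = any(
            m and log_r < log_veto for log_r, m in zip(row, m_row)
        )
        modified_mask.append([0] * len(m_row) if catastrophic else list(m_row))
    return modified_mask


log_ratios = [[0.0, -0.5], [0.0, math.log(1e-6)]]
response_mask = [[1, 1], [1, 1]]
vetoed = apply_veto(log_ratios, response_mask, veto_threshold=1e-4)
untouched = apply_veto(log_ratios, response_mask, veto_threshold=None)
```

The second sequence contains a token with ratio 1e-6, below the 1e-4 threshold, so its entire mask row is zeroed. With the threshold left at `None`, the mask comes back unchanged.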

### Veto Default Change

If you relied on the default veto threshold (1e-4), explicitly enable
it:

```yaml
# Old: enabled by default with threshold=1e-4
# New: opt-in (default is None)
rollout_is_veto_threshold: 1e-4
```

## Reference

[When Speed Kills Stability: Demystifying RL Collapse from the
Inference-Training
Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda)

Liu, Li, Fu, Wang, Liu, Shen (2025)

### BibTeX

```bibtex
@misc{liu-li-2025,
  title = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch},
  url = {https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda},
  author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen},
  year = {2025},
  month = sep,
}
```
chenhaiq pushed a commit to The-Hierophant/verl-1 that referenced this pull request Nov 18, 2025
…rom Rejection Sampling (verl-project#3915)

NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request Nov 26, 2025
…rom Rejection Sampling (verl-project#3915)

albertimff pushed a commit to albertimff/verl that referenced this pull request Dec 1, 2025
…rom Rejection Sampling (verl-project#3915)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…rom Rejection Sampling (verl-project#3915)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…rom Rejection Sampling (verl-project#3915)

# Refactor Rollout Importance Sampling: Separate IS Weights from
Rejection Sampling

## Summary

Refactors rollout importance sampling to properly separate IS weight
correction from rejection sampling, fixes loss normalization for fully
masked sequences, and makes veto mechanism opt-in by default.

This PR contains 3 important commits:
1. **Main refactoring** (39dd2e4): Separates IS weight correction from
rejection sampling
2. **Loss fix** (ab3e8af): Excludes fully masked sequences from
seq-mean loss denominator
3. **Veto default** (10350b6): Changes veto threshold default from 1e-4
to None (opt-in)

## Motivation

The previous implementation applied rejection by zeroing IS weights,
which conflated two distinct mechanisms. This refactoring separates them
to follow correct rejection sampling principles and improves loss
normalization.

## Main Changes

### 1. Separates Two Mechanisms (39dd2e4)

**IS weights** (`rollout_is_weights`): Ratios π_train/π_rollout with
processing applied
- Safety-bounded to [exp(-20), exp(20)] ≈ [2e-9, 5e8] to prevent
overflow:
  * Token level: bounds per-token ratios
* Sequence/geometric: bounds aggregated ratio (broadcast to all tokens)
- Truncate mode: upper clamped via .clamp(max=upper_threshold)
- Mask mode: safety-bounded ratios preserved (no threshold clamping)
- All modes: zeroed at padding positions
- Preserved for policy gradient calculations

**Rejection sampling** (`modified_response_mask`): Applied via
response_mask
- Mask mode: Excludes tokens/sequences with outlier IS ratios
- Veto: Excludes sequences with catastrophic tokens (checks unclamped
per-token ratios)
- Used for loss aggregation (excluded from denominator)

### 2. Fixes Seq-Mean Loss Normalization (ab3e8af)

Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully
masked sequences from denominator:
- Uses `masked_mean` utility for proper masking
- Adds epsilon to prevent division by zero
- Ensures fully masked sequences don't affect loss computation

### 3. Makes Veto Opt-In (10350b6)

Changed `rollout_is_veto_threshold` default from `1e-4` to `None`:
- Veto mechanism now opt-in by default
- Users must explicitly enable via config
- Updated across 11 files (configs, docs, examples)

### API Changes (Breaking)

**Breaking change affecting ALL users**:
`compute_rollout_importance_weights()` now returns 3 values instead of
2:
- Before: `(weights_proto, metrics)`
- After: `(weights_proto, modified_response_mask, metrics)`

**Migration required**: All callers must be updated to handle the new
return signature, regardless of which `rollout_is_mode` you use:
- **Truncate mode**: Must still unpack 3 values (though
`modified_response_mask` is unchanged)
- **Mask mode**: Must unpack 3 values AND update `batch.response_mask`
with rejection applied
- **Veto enabled**: Must update `batch.response_mask` regardless of mode

### Files Changed

**Main refactoring** (39dd2e4):
```
docs/advance/rollout_is.md                       | 115 ++++++++++++++++++-----
tests/trainer/ppo/test_rollout_is.py             |  81 ++++++++++++++--
tests/trainer/ppo/test_rollout_is_integration.py |  12 +--
verl/trainer/ppo/mismatch_helper.py              | 105 ++++++++++++---------
verl/trainer/ppo/ray_trainer.py                  |  50 ++++++----
5 files changed, 267 insertions(+), 96 deletions(-)
```

**Docs clarification** (a5aa743):
```
docs/advance/rollout_is.md          | 23 ++++++++++++++---------
verl/trainer/ppo/mismatch_helper.py | 13 +++++++------
2 files changed, 21 insertions(+), 15 deletions(-)
```

**Loss fix** (ab3e8af):
```
verl/trainer/ppo/core_algos.py | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
```

**Veto default** (10350b6):
```
11 files changed, 28 insertions(+), 26 deletions(-)
(configs, docs, examples, mismatch_helper, ray_trainer)
```

## Benefits

- **Correct loss normalization**: Rejected samples excluded from
denominator
- **Mode-specific weight processing**: Truncate clamps, mask preserves
safety-bounded ratios
- **Clear separation of concerns**: Between IS correction and rejection
- **Safer defaults**: Veto mechanism opt-in to prevent unexpected
behavior
- **Numerical stability**: Safety bounds prevent overflow, and the added
epsilon prevents division by zero in seq-mean modes

## Testing

All tests passing (11/11):

```bash
pytest tests/trainer/ppo/test_rollout_is*.py -v
```

New test `test_mask_mode()` verifies:
- IS weights remain non-zero for rejected samples (safety-bounded
ratios, not zeroed)
- Rejection correctly applied via response_mask (not by zeroing weights)

## Migration Guide

### API Signature Change (Required for ALL users)

```python
# Before: 2-value return
weights_proto, metrics = compute_rollout_importance_weights(...)

# After: 3-value return (ALL users must update)
weights_proto, modified_response_mask, metrics = compute_rollout_importance_weights(...)

# ALWAYS update batch with modified response_mask
batch.response_mask = modified_response_mask
```

### Mode-Specific Behavior

**Truncate mode** (`rollout_is_mode="truncate"`):
- IS weights: upper clamped via .clamp(max=upper_threshold)
- `modified_response_mask` equals input `response_mask` (unchanged for
outlier ratios)
- No outlier rejection applied, but must still handle 3-value return
- Veto rejection (if enabled) still applies to mask

**Mask mode** (`rollout_is_mode="mask"`):
- IS weights: safety-bounded ratios preserved (no threshold clamping)
- `modified_response_mask` has outliers excluded (weights outside
[lower, upper])
- Rejection applied via mask, NOT by modifying IS weights
- Veto rejection (if enabled) also applies to mask

**Veto enabled** (any mode with `rollout_is_veto_threshold` set):
- Checks **unclamped per-token ratios** π_train(t)/π_rollout(t) (before
safety bound)
- Sequences with catastrophic tokens excluded from
`modified_response_mask`
- Works independently of truncate/mask mode
- Does NOT modify IS weights
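
A rough sketch of the veto check follows. It assumes that a "catastrophic" token is one whose unclamped ratio falls below the threshold, and that the comparison is done in log space; both the function name and these details are illustrative assumptions rather than the code in `mismatch_helper.py`.

```python
import math


def apply_veto(train_logp, rollout_logp, response_mask, veto_threshold):
    """Zero a sequence's entire mask if any valid token's raw ratio is below the threshold."""
    if veto_threshold is None:  # opt-in: None disables the check
        return list(response_mask)
    log_threshold = math.log(veto_threshold)
    for lp_train, lp_rollout, m in zip(train_logp, rollout_logp, response_mask):
        # Compare the unclamped log ratio so tiny ratios do not underflow.
        if m and (lp_train - lp_rollout) < log_threshold:
            return [0] * len(response_mask)
    return list(response_mask)


# One token has ratio exp(-11), well below 1e-4, so the whole sequence is dropped.
vetoed = apply_veto([-12.0, -1.0], [-1.0, -1.0], [1, 1], veto_threshold=1e-4)
kept = apply_veto([-12.0, -1.0], [-1.0, -1.0], [1, 1], veto_threshold=None)
```

The function returns only a mask; IS weights are computed separately and are not altered by the veto.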

### Veto Default Change

If you relied on the default veto threshold (1e-4), explicitly enable
it:

```yaml
# Old: enabled by default with threshold=1e-4
# New: opt-in (default is None)
rollout_is_veto_threshold: 1e-4
```

## Reference

[When Speed Kills Stability: Demystifying RL Collapse from the
Inference-Training
Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda)

Liu, Li, Fu, Wang, Liu, Shen (2025)

### BibTeX

```bibtex
@misc{liu-li-2025,
  title = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch},
  url = {https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda},
  author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen},
  year = {2025},
  month = sep,
}
```