[algo] refactor: Rollout Importance Sampling - Separate IS Weights from Rejection Sampling#3915
The previous implementation applied rejection by zeroing IS weights, which conflated two distinct mechanisms. This refactoring properly separates IS weight correction from rejection sampling to follow correct principles.

This commit separates two mechanisms:

IS Weights (rollout_is_weights): Always TRUE ratios π_train/π_rollout
- Never zeroed, even for rejected samples
- Preserved for policy gradient calculations

Rejection Sampling (modified_response_mask): Applied via response_mask
- Mask mode: Excludes tokens/sequences with outlier IS ratios
- Veto: Excludes sequences with catastrophic tokens
- Used for loss aggregation (excluded from denominator)

This ensures:
- Correct loss normalization (rejected samples excluded from denominator)
- True IS ratios preserved for policy gradient calculations
- Clear separation of concerns between IS correction and rejection

Changes:
- compute_rollout_importance_weights() now returns 3 values instead of 2
- Always update batch response_mask with rejection applied
- Updated all tests to verify new behavior
- Comprehensive documentation update with BibTeX citation

Reference: When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch. Liu, Li, Fu, Wang, Liu, Shen (2025) https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda
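The separation described in this commit can be illustrated with a minimal pure-Python sketch of token-level mask mode. The function name `mask_mode_sketch` and its parameters are hypothetical, not the signature of `compute_rollout_importance_weights()`; the point is only that the ratio is kept as the weight while the rejection decision goes into a separate mask.

```python
def mask_mode_sketch(ratios, response_mask, lower, upper):
    """Hypothetical token-level mask mode: return (is_weights, modified_mask)."""
    is_weights, modified_mask = [], []
    for ratio, m in zip(ratios, response_mask):
        in_range = lower <= ratio <= upper
        # The IS weight keeps the ratio even when the token is rejected.
        is_weights.append(ratio)
        # Rejection is recorded in the mask, not in the weight.
        modified_mask.append(m if in_range else 0)
    return is_weights, modified_mask


weights, mask = mask_mode_sketch(
    ratios=[1.0, 4.0, 0.5],   # pi_train / pi_rollout per token
    response_mask=[1, 1, 1],
    lower=0.5,
    upper=2.0,
)
```

Here the second token has an outlier ratio, so it is dropped from the mask while its weight stays at 4.0.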
Code Review
This pull request is a well-executed refactoring of the rollout importance sampling mechanism. It correctly separates the concerns of Importance Sampling (IS) weight correction and rejection sampling, which is a significant improvement for both correctness and code clarity. The changes are consistently applied across the core logic, trainer integration, tests, and documentation. The new tests, especially test_mask_mode, are comprehensive and accurately validate the new behavior. I have identified one high-severity issue concerning misleading documentation within the mismatch_helper.py docstring, which could lead to unexpected behavior for users of the truncate mode. My review comment provides a suggestion to clarify this. Overall, this is a high-quality contribution.
678bf01 to 5b982ff
5b982ff to 965f4d0
Fixed two documentation issues:
1. Truncate mode only clamps upper bound (not [1, upper])
2. Veto applies independently of rollout_is_mode

The previous documentation was misleading:
- Stated 'no rejection' for truncate mode (veto can still reject)
- Stated clamp at [1, upper] (only upper is clamped)

Changes:
- Clarified truncate only clamps max (no lower bound)
- Emphasized veto applies in both truncate and mask modes
- Updated docstring, docs, and in-code comments
- Prevents silent data loss when using truncate mode
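The difference between the two clamping behaviours named in this fix is easy to see on a few numbers. The sketch below is a plain-Python stand-in for a tensor `.clamp(max=...)` call; both helper names are hypothetical.

```python
def truncate_upper_only(ratios, upper_threshold):
    """What the corrected docs describe: only the upper bound is clamped."""
    return [min(r, upper_threshold) for r in ratios]


def clamp_one_to_upper(ratios, upper_threshold):
    """What the old docs wrongly implied: clamping into [1, upper]."""
    return [min(max(r, 1.0), upper_threshold) for r in ratios]


ratios = [0.25, 1.0, 3.0]
truncated = truncate_upper_only(ratios, upper_threshold=2.0)
misdocumented = clamp_one_to_upper(ratios, upper_threshold=2.0)
```

With upper-only truncation the ratio 0.25 passes through unchanged, whereas a [1, upper] clamp would have raised it to 1.0.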
e8f5726 to da169f2
Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator using masked_mean, and added epsilon to prevent division by zero.
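A rough sketch of the seq-mean-token-mean case follows, written in pure Python rather than with the `masked_mean` utility the fix actually uses. The function name, the `eps` value, and the list-of-lists layout are assumptions made for illustration.

```python
def seq_mean_token_mean(loss_mat, mask, eps=1e-8):
    """Average per-sequence token means, skipping fully masked sequences."""
    seq_losses, seq_valid = [], []
    for losses, m in zip(loss_mat, mask):
        n_tokens = sum(m)
        token_sum = sum(l * mi for l, mi in zip(losses, m))
        # eps keeps the division defined when a sequence has no valid tokens.
        seq_losses.append(token_sum / (n_tokens + eps))
        # A sequence counts toward the denominator only if it has valid tokens.
        seq_valid.append(1.0 if n_tokens > 0 else 0.0)
    numerator = sum(s * v for s, v in zip(seq_losses, seq_valid))
    return numerator / (sum(seq_valid) + eps)


result = seq_mean_token_mean(
    loss_mat=[[1.0, 3.0], [5.0, 5.0]],
    mask=[[1, 1], [0, 0]],   # second sequence is fully masked
)
```

Dividing by the batch size of 2 instead would give roughly 1.0 here; excluding the fully masked sequence gives roughly 2.0, the mean of the only sequence that contributes.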
Changed rollout_is_veto_threshold default from 1e-4 to None, making the veto mechanism opt-in across 11 files (configs, runtime, docs).
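The opt-in behaviour can be sketched as below. This assumes that a "catastrophic" token is one whose ratio falls below the veto threshold, which the thread does not spell out; `apply_veto` and its parameters are hypothetical names, not the library's API.

```python
def apply_veto(token_ratios, response_mask, veto_threshold=None):
    """Return the mask for one sequence after an optional veto check."""
    if veto_threshold is None:
        # New default: veto disabled, mask returned unchanged.
        return list(response_mask)
    # Assumption: a valid token with ratio below the threshold is catastrophic.
    catastrophic = any(
        m == 1 and r < veto_threshold
        for r, m in zip(token_ratios, response_mask)
    )
    # One catastrophic token excludes the whole sequence.
    return [0] * len(response_mask) if catastrophic else list(response_mask)


disabled = apply_veto([1.0, 1e-6], [1, 1])
enabled = apply_veto([1.0, 1e-6], [1, 1], veto_threshold=1e-4)
```

With the default of `None` the sequence is kept; setting the threshold back to 1e-4 rejects it.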
da169f2 to 10350b6
# Apply response_mask to ensure weights are 0 where mask is 0
# Zero out padding positions in IS weights for correct aggregation
# This is different from rejection - padding must be zeroed regardless of mode
Why not also mask out rollout_is_weights by modified_response_mask?
- rollout_is_weights can have non-zero values at rejected positions (line 254 only zeros padding)
- But when computing loss: pg_losses = pg_losses * rollout_is_weights then agg_loss(pg_losses, modified_response_mask)
- The modified_response_mask has 0s at rejected positions, so in masked_mean:
- Numerator: sum(pg_losses * modified_response_mask) - rejected positions contribute 0 (masked out)
- Denominator: sum(modified_response_mask) - rejected positions not counted
Result: Even though rollout_is_weights has non-zero values at rejected positions, those values get multiplied by 0 in the mask during aggregation, so they don't affect the final loss. This design correctly separates:
- IS weights: The actual importance ratios (informational)
- Rejection sampling: Which samples to train on (via modified_response_mask)
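The argument above can be checked numerically. This is a small pure-Python sketch with a hand-written `masked_mean`, not the project's utility; the variable names mirror the ones used in the reply.

```python
def masked_mean(values, mask):
    """Mean of values over positions where mask is 1."""
    return sum(v * m for v, m in zip(values, mask)) / sum(mask)


pg_losses = [0.2, 0.4, 0.6]
modified_response_mask = [1, 0, 1]      # middle token was rejected

# Weights as returned: the rejected position keeps a non-zero ratio.
rollout_is_weights = [1.0, 50.0, 0.5]
weighted = [l * w for l, w in zip(pg_losses, rollout_is_weights)]
loss = masked_mean(weighted, modified_response_mask)

# Same computation with the rejected weight zeroed by hand.
zeroed_weights = [1.0, 0.0, 0.5]
weighted_zeroed = [l * w for l, w in zip(pg_losses, zeroed_weights)]
loss_zeroed = masked_mean(weighted_zeroed, modified_response_mask)
```

Both losses come out identical, because the rejected position is multiplied by 0 in the numerator and is not counted in the denominator.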
8df6da7 to e4b9224
Fix inaccurate documentation about IS weight processing:
- IS weights are safety-bounded to [exp(-20), exp(20)], not "true ratios"
- IS weights ARE zeroed at padding (not "never zeroed")
- Truncate mode: safety-bounded + upper clamped
- Mask mode: safety-bounded only (no threshold clamping)
- Veto checks unclamped ratios before safety bounds

Add "Operation Modes" section documenting independent control flags:
- rollout_is_threshold: main on/off switch
- rollout_is: controls IS weight application to loss
- Rejection sampling (mask mode) applies regardless of rollout_is flag
- Include mode combinations table and recommended workflow

Update terminology throughout:
- "safety-bounded ratios" replaces "true ratios" for mask mode
- Update code comments in ray_trainer.py and test files
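To make the safety bound concrete, here is a minimal sketch that assumes the bound is applied by clamping the log-ratio to [-20, 20] before exponentiating. The commit only states the resulting range, so the log-space clamp and the names `SAFETY_BOUND` and `safety_bounded_ratio` are assumptions.

```python
import math

SAFETY_BOUND = 20.0  # assumed log-space bound, giving [exp(-20), exp(20)]


def safety_bounded_ratio(log_ratio):
    """Exponentiate a log-ratio after clamping it to the safety bound."""
    clamped = max(-SAFETY_BOUND, min(SAFETY_BOUND, log_ratio))
    return math.exp(clamped)


huge = safety_bounded_ratio(1000.0)    # math.exp(1000.0) alone would overflow
tiny = safety_bounded_ratio(-1000.0)
normal = safety_bounded_ratio(0.0)
```

Extreme log-ratios land on the bounds of roughly 5e8 and 2e-9, while ordinary values pass through untouched.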
e4b9224 to 359c966
Solid, and good to go
…rom Rejection Sampling (verl-project#3915)

# Refactor Rollout Importance Sampling: Separate IS Weights from Rejection Sampling

## Summary

Refactors rollout importance sampling to properly separate IS weight correction from rejection sampling, fixes loss normalization for fully masked sequences, and makes veto mechanism opt-in by default.

This PR contains 3 important commits:
1. **Main refactoring** (39dd2e4): Separates IS weight correction from rejection sampling
2. **Loss fix** (ab3e8af): Excludes fully masked sequences from seq-mean loss denominator
3. **Veto default** (10350b6): Changes veto threshold default from 1e-4 to None (opt-in)

## Motivation

The previous implementation applied rejection by zeroing IS weights, which conflated two distinct mechanisms. This refactoring separates them to follow correct rejection sampling principles and improves loss normalization.

## Main Changes

### 1. Separates Two Mechanisms (39dd2e4)

**IS weights** (`rollout_is_weights`): Ratios π_train/π_rollout with processing applied
- Safety-bounded to [exp(-20), exp(20)] ≈ [2e-9, 5e8] to prevent overflow:
  * Token level: bounds per-token ratios
  * Sequence/geometric: bounds aggregated ratio (broadcast to all tokens)
- Truncate mode: upper clamped via .clamp(max=upper_threshold)
- Mask mode: safety-bounded ratios preserved (no threshold clamping)
- All modes: zeroed at padding positions
- Preserved for policy gradient calculations

**Rejection sampling** (`modified_response_mask`): Applied via response_mask
- Mask mode: Excludes tokens/sequences with outlier IS ratios
- Veto: Excludes sequences with catastrophic tokens (checks unclamped per-token ratios)
- Used for loss aggregation (excluded from denominator)

### 2. Fixes Seq-Mean Loss Normalization (ab3e8af)

Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator:
- Uses `masked_mean` utility for proper masking
- Adds epsilon to prevent division by zero
- Ensures fully masked sequences don't affect loss computation

### 3. Makes Veto Opt-In (10350b6)

Changed `rollout_is_veto_threshold` default from `1e-4` to `None`:
- Veto mechanism now opt-in by default
- Users must explicitly enable via config
- Updated across 11 files (configs, docs, examples)

### API Changes (Breaking)

**Breaking change affecting ALL users**: `compute_rollout_importance_weights()` now returns 3 values instead of 2:
- Before: `(weights_proto, metrics)`
- After: `(weights_proto, modified_response_mask, metrics)`

**Migration required**: All callers must be updated to handle the new return signature, regardless of which `rollout_is_mode` you use:
- **Truncate mode**: Must still unpack 3 values (though `modified_response_mask` is unchanged)
- **Mask mode**: Must unpack 3 values AND update `batch.response_mask` with rejection applied
- **Veto enabled**: Must update `batch.response_mask` regardless of mode

### Files Changed

**Main refactoring** (39dd2e4):
```
docs/advance/rollout_is.md                       | 115 ++++++++++++++++++-----
tests/trainer/ppo/test_rollout_is.py             |  81 ++++++++++++++--
tests/trainer/ppo/test_rollout_is_integration.py |  12 +--
verl/trainer/ppo/mismatch_helper.py              | 105 ++++++++++++---------
verl/trainer/ppo/ray_trainer.py                  |  50 ++++++----
5 files changed, 267 insertions(+), 96 deletions(-)
```

**Docs clarification** (a5aa743):
```
docs/advance/rollout_is.md          | 23 ++++++++++++++---------
verl/trainer/ppo/mismatch_helper.py | 13 +++++++------
2 files changed, 21 insertions(+), 15 deletions(-)
```

**Loss fix** (ab3e8af):
```
verl/trainer/ppo/core_algos.py | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
```

**Veto default** (10350b6):
```
11 files changed, 28 insertions(+), 26 deletions(-)
(configs, docs, examples, mismatch_helper, ray_trainer)
```

## Benefits

- **Correct loss normalization**: Rejected samples excluded from denominator
- **Mode-specific weight processing**: Truncate clamps, mask preserves safety-bounded ratios
- **Clear separation of concerns**: Between IS correction and rejection
- **Safer defaults**: Veto mechanism opt-in to prevent unexpected behavior
- **Numerical stability**: Safety bounds prevent overflow, prevents division by zero in seq-mean modes

## Testing

All tests passing (11/11):
```bash
pytest tests/trainer/ppo/test_rollout_is*.py -v
```

New test `test_mask_mode()` verifies:
- IS weights remain non-zero for rejected samples (safety-bounded ratios, not zeroed)
- Rejection correctly applied via response_mask (not by zeroing weights)

## Migration Guide

### API Signature Change (Required for ALL users)

```python
# Before: 2-value return
weights_proto, metrics = compute_rollout_importance_weights(...)

# After: 3-value return (ALL users must update)
weights_proto, modified_response_mask, metrics = compute_rollout_importance_weights(...)

# ALWAYS update batch with modified response_mask
batch.response_mask = modified_response_mask
```

### Mode-Specific Behavior

**Truncate mode** (`rollout_is_mode="truncate"`):
- IS weights: upper clamped via .clamp(max=upper_threshold)
- `modified_response_mask` equals input `response_mask` (unchanged for outlier ratios)
- No outlier rejection applied, but must still handle 3-value return
- Veto rejection (if enabled) still applies to mask

**Mask mode** (`rollout_is_mode="mask"`):
- IS weights: safety-bounded ratios preserved (no threshold clamping)
- `modified_response_mask` has outliers excluded (weights outside [lower, upper])
- Rejection applied via mask, NOT by modifying IS weights
- Veto rejection (if enabled) also applies to mask

**Veto enabled** (any mode with `rollout_is_veto_threshold` set):
- Checks **unclamped per-token ratios** π_train(t)/π_rollout(t) (before safety bound)
- Sequences with catastrophic tokens excluded from `modified_response_mask`
- Works independently of truncate/mask mode
- Does NOT modify IS weights

### Veto Default Change

If you relied on the default veto threshold (1e-4), explicitly enable it:
```yaml
# Old: enabled by default with threshold=1e-4
# New: opt-in (default is None)
rollout_is_veto_threshold: 1e-4
```

## Reference

[When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda)
Liu, Li, Fu, Wang, Liu, Shen (2025)

### BibTeX

```bibtex
@misc{liu-li-2025,
  title = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch},
  url = {https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda},
  author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen},
  year = {2025},
  month = september,
}
```
…rom Rejection Sampling (verl-project#3915) # Refactor Rollout Importance Sampling: Separate IS Weights from Rejection Sampling ## Summary Refactors rollout importance sampling to properly separate IS weight correction from rejection sampling, fixes loss normalization for fully masked sequences, and makes veto mechanism opt-in by default. This PR contains 3 important commits: 1. **Main refactoring** (39dd2e4): Separates IS weight correction from rejection sampling 2. **Loss fix** (ab3e8af): Excludes fully masked sequences from seq-mean loss denominator 3. **Veto default** (10350b6): Changes veto threshold default from 1e-4 to None (opt-in) ## Motivation The previous implementation applied rejection by zeroing IS weights, which conflated two distinct mechanisms. This refactoring separates them to follow correct rejection sampling principles and improves loss normalization. ## Main Changes ### 1. Separates Two Mechanisms (39dd2e4) **IS weights** (`rollout_is_weights`): Ratios π_train/π_rollout with processing applied - Safety-bounded to [exp(-20), exp(20)] ≈ [2e-9, 5e8] to prevent overflow: * Token level: bounds per-token ratios * Sequence/geometric: bounds aggregated ratio (broadcast to all tokens) - Truncate mode: upper clamped via .clamp(max=upper_threshold) - Mask mode: safety-bounded ratios preserved (no threshold clamping) - All modes: zeroed at padding positions - Preserved for policy gradient calculations **Rejection sampling** (`modified_response_mask`): Applied via response_mask - Mask mode: Excludes tokens/sequences with outlier IS ratios - Veto: Excludes sequences with catastrophic tokens (checks unclamped per-token ratios) - Used for loss aggregation (excluded from denominator) ### 2. 
Fixes Seq-Mean Loss Normalization (ab3e8af) Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator: - Uses `masked_mean` utility for proper masking - Adds epsilon to prevent division by zero - Ensures fully masked sequences don't affect loss computation ### 3. Makes Veto Opt-In (10350b6) Changed `rollout_is_veto_threshold` default from `1e-4` to `None`: - Veto mechanism now opt-in by default - Users must explicitly enable via config - Updated across 11 files (configs, docs, examples) ### API Changes (Breaking) **Breaking change affecting ALL users**: `compute_rollout_importance_weights()` now returns 3 values instead of 2: - Before: `(weights_proto, metrics)` - After: `(weights_proto, modified_response_mask, metrics)` **Migration required**: All callers must be updated to handle the new return signature, regardless of which `rollout_is_mode` you use: - **Truncate mode**: Must still unpack 3 values (though `modified_response_mask` is unchanged) - **Mask mode**: Must unpack 3 values AND update `batch.response_mask` with rejection applied - **Veto enabled**: Must update `batch.response_mask` regardless of mode ### Files Changed **Main refactoring** (39dd2e4): ``` docs/advance/rollout_is.md | 115 ++++++++++++++++++----- tests/trainer/ppo/test_rollout_is.py | 81 ++++++++++++++-- tests/trainer/ppo/test_rollout_is_integration.py | 12 +-- verl/trainer/ppo/mismatch_helper.py | 105 ++++++++++++--------- verl/trainer/ppo/ray_trainer.py | 50 ++++++---- 5 files changed, 267 insertions(+), 96 deletions(-) ``` **Docs clarification** (a5aa743): ``` docs/advance/rollout_is.md | 23 ++++++++++++++--------- verl/trainer/ppo/mismatch_helper.py | 13 +++++++------ 2 files changed, 21 insertions(+), 15 deletions(-) ``` **Loss fix** (ab3e8af): ``` verl/trainer/ppo/core_algos.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) ``` **Veto default** (10350b6): ``` 11 files changed, 28 insertions(+), 26 deletions(-) (configs, 
docs, examples, mismatch_helper, ray_trainer) ``` ## Benefits - **Correct loss normalization**: Rejected samples excluded from denominator - **Mode-specific weight processing**: Truncate clamps, mask preserves safety-bounded ratios - **Clear separation of concerns**: Between IS correction and rejection - **Safer defaults**: Veto mechanism opt-in to prevent unexpected behavior - **Numerical stability**: Safety bounds prevent overflow, prevents division by zero in seq-mean modes ## Testing All tests passing (11/11): ```bash pytest tests/trainer/ppo/test_rollout_is*.py -v ``` New test `test_mask_mode()` verifies: - IS weights remain non-zero for rejected samples (safety-bounded ratios, not zeroed) - Rejection correctly applied via response_mask (not by zeroing weights) ## Migration Guide ### API Signature Change (Required for ALL users) ```python # Before: 2-value return weights_proto, metrics = compute_rollout_importance_weights(...) # After: 3-value return (ALL users must update) weights_proto, modified_response_mask, metrics = compute_rollout_importance_weights(...) 
# ALWAYS update batch with modified response_mask batch.response_mask = modified_response_mask ``` ### Mode-Specific Behavior **Truncate mode** (`rollout_is_mode="truncate"`): - IS weights: upper clamped via .clamp(max=upper_threshold) - `modified_response_mask` equals input `response_mask` (unchanged for outlier ratios) - No outlier rejection applied, but must still handle 3-value return - Veto rejection (if enabled) still applies to mask **Mask mode** (`rollout_is_mode="mask"`): - IS weights: safety-bounded ratios preserved (no threshold clamping) - `modified_response_mask` has outliers excluded (weights outside [lower, upper]) - Rejection applied via mask, NOT by modifying IS weights - Veto rejection (if enabled) also applies to mask **Veto enabled** (any mode with `rollout_is_veto_threshold` set): - Checks **unclamped per-token ratios** π_train(t)/π_rollout(t) (before safety bound) - Sequences with catastrophic tokens excluded from `modified_response_mask` - Works independently of truncate/mask mode - Does NOT modify IS weights ### Veto Default Change If you relied on the default veto threshold (1e-4), explicitly enable it: ```yaml # Old: enabled by default with threshold=1e-4 # New: opt-in (default is None) rollout_is_veto_threshold: 1e-4 ``` ## Reference [When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda) Liu, Li, Fu, Wang, Liu, Shen (2025) ### BibTeX ```bibtex @misc{liu-li-2025, title = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch}, url = {https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda}, author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen}, year = {2025}, month = september, } ```
…rom Rejection Sampling (verl-project#3915) # Refactor Rollout Importance Sampling: Separate IS Weights from Rejection Sampling ## Summary Refactors rollout importance sampling to properly separate IS weight correction from rejection sampling, fixes loss normalization for fully masked sequences, and makes veto mechanism opt-in by default. This PR contains 3 important commits: 1. **Main refactoring** (39dd2e4): Separates IS weight correction from rejection sampling 2. **Loss fix** (ab3e8af): Excludes fully masked sequences from seq-mean loss denominator 3. **Veto default** (10350b6): Changes veto threshold default from 1e-4 to None (opt-in) ## Motivation The previous implementation applied rejection by zeroing IS weights, which conflated two distinct mechanisms. This refactoring separates them to follow correct rejection sampling principles and improves loss normalization. ## Main Changes ### 1. Separates Two Mechanisms (39dd2e4) **IS weights** (`rollout_is_weights`): Ratios π_train/π_rollout with processing applied - Safety-bounded to [exp(-20), exp(20)] ≈ [2e-9, 5e8] to prevent overflow: * Token level: bounds per-token ratios * Sequence/geometric: bounds aggregated ratio (broadcast to all tokens) - Truncate mode: upper clamped via .clamp(max=upper_threshold) - Mask mode: safety-bounded ratios preserved (no threshold clamping) - All modes: zeroed at padding positions - Preserved for policy gradient calculations **Rejection sampling** (`modified_response_mask`): Applied via response_mask - Mask mode: Excludes tokens/sequences with outlier IS ratios - Veto: Excludes sequences with catastrophic tokens (checks unclamped per-token ratios) - Used for loss aggregation (excluded from denominator) ### 2. 
Fixes Seq-Mean Loss Normalization (ab3e8af) Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator: - Uses `masked_mean` utility for proper masking - Adds epsilon to prevent division by zero - Ensures fully masked sequences don't affect loss computation ### 3. Makes Veto Opt-In (10350b6) Changed `rollout_is_veto_threshold` default from `1e-4` to `None`: - Veto mechanism now opt-in by default - Users must explicitly enable via config - Updated across 11 files (configs, docs, examples) ### API Changes (Breaking) **Breaking change affecting ALL users**: `compute_rollout_importance_weights()` now returns 3 values instead of 2: - Before: `(weights_proto, metrics)` - After: `(weights_proto, modified_response_mask, metrics)` **Migration required**: All callers must be updated to handle the new return signature, regardless of which `rollout_is_mode` you use: - **Truncate mode**: Must still unpack 3 values (though `modified_response_mask` is unchanged) - **Mask mode**: Must unpack 3 values AND update `batch.response_mask` with rejection applied - **Veto enabled**: Must update `batch.response_mask` regardless of mode ### Files Changed **Main refactoring** (39dd2e4): ``` docs/advance/rollout_is.md | 115 ++++++++++++++++++----- tests/trainer/ppo/test_rollout_is.py | 81 ++++++++++++++-- tests/trainer/ppo/test_rollout_is_integration.py | 12 +-- verl/trainer/ppo/mismatch_helper.py | 105 ++++++++++++--------- verl/trainer/ppo/ray_trainer.py | 50 ++++++---- 5 files changed, 267 insertions(+), 96 deletions(-) ``` **Docs clarification** (a5aa743): ``` docs/advance/rollout_is.md | 23 ++++++++++++++--------- verl/trainer/ppo/mismatch_helper.py | 13 +++++++------ 2 files changed, 21 insertions(+), 15 deletions(-) ``` **Loss fix** (ab3e8af): ``` verl/trainer/ppo/core_algos.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) ``` **Veto default** (10350b6): ``` 11 files changed, 28 insertions(+), 26 deletions(-) (configs, 
docs, examples, mismatch_helper, ray_trainer) ``` ## Benefits - **Correct loss normalization**: Rejected samples excluded from denominator - **Mode-specific weight processing**: Truncate clamps, mask preserves safety-bounded ratios - **Clear separation of concerns**: Between IS correction and rejection - **Safer defaults**: Veto mechanism opt-in to prevent unexpected behavior - **Numerical stability**: Safety bounds prevent overflow, prevents division by zero in seq-mean modes ## Testing All tests passing (11/11): ```bash pytest tests/trainer/ppo/test_rollout_is*.py -v ``` New test `test_mask_mode()` verifies: - IS weights remain non-zero for rejected samples (safety-bounded ratios, not zeroed) - Rejection correctly applied via response_mask (not by zeroing weights) ## Migration Guide ### API Signature Change (Required for ALL users) ```python # Before: 2-value return weights_proto, metrics = compute_rollout_importance_weights(...) # After: 3-value return (ALL users must update) weights_proto, modified_response_mask, metrics = compute_rollout_importance_weights(...) 
# ALWAYS update batch with modified response_mask batch.response_mask = modified_response_mask ``` ### Mode-Specific Behavior **Truncate mode** (`rollout_is_mode="truncate"`): - IS weights: upper clamped via .clamp(max=upper_threshold) - `modified_response_mask` equals input `response_mask` (unchanged for outlier ratios) - No outlier rejection applied, but must still handle 3-value return - Veto rejection (if enabled) still applies to mask **Mask mode** (`rollout_is_mode="mask"`): - IS weights: safety-bounded ratios preserved (no threshold clamping) - `modified_response_mask` has outliers excluded (weights outside [lower, upper]) - Rejection applied via mask, NOT by modifying IS weights - Veto rejection (if enabled) also applies to mask **Veto enabled** (any mode with `rollout_is_veto_threshold` set): - Checks **unclamped per-token ratios** π_train(t)/π_rollout(t) (before safety bound) - Sequences with catastrophic tokens excluded from `modified_response_mask` - Works independently of truncate/mask mode - Does NOT modify IS weights ### Veto Default Change If you relied on the default veto threshold (1e-4), explicitly enable it: ```yaml # Old: enabled by default with threshold=1e-4 # New: opt-in (default is None) rollout_is_veto_threshold: 1e-4 ``` ## Reference [When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda) Liu, Li, Fu, Wang, Liu, Shen (2025) ### BibTeX ```bibtex @misc{liu-li-2025, title = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch}, url = {https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda}, author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen}, year = {2025}, month = september, } ```
…rom Rejection Sampling (verl-project#3915) # Refactor Rollout Importance Sampling: Separate IS Weights from Rejection Sampling ## Summary Refactors rollout importance sampling to properly separate IS weight correction from rejection sampling, fixes loss normalization for fully masked sequences, and makes veto mechanism opt-in by default. This PR contains 3 important commits: 1. **Main refactoring** (39dd2e4): Separates IS weight correction from rejection sampling 2. **Loss fix** (ab3e8af): Excludes fully masked sequences from seq-mean loss denominator 3. **Veto default** (10350b6): Changes veto threshold default from 1e-4 to None (opt-in) ## Motivation The previous implementation applied rejection by zeroing IS weights, which conflated two distinct mechanisms. This refactoring separates them to follow correct rejection sampling principles and improves loss normalization. ## Main Changes ### 1. Separates Two Mechanisms (39dd2e4) **IS weights** (`rollout_is_weights`): Ratios π_train/π_rollout with processing applied - Safety-bounded to [exp(-20), exp(20)] ≈ [2e-9, 5e8] to prevent overflow: * Token level: bounds per-token ratios * Sequence/geometric: bounds aggregated ratio (broadcast to all tokens) - Truncate mode: upper clamped via .clamp(max=upper_threshold) - Mask mode: safety-bounded ratios preserved (no threshold clamping) - All modes: zeroed at padding positions - Preserved for policy gradient calculations **Rejection sampling** (`modified_response_mask`): Applied via response_mask - Mask mode: Excludes tokens/sequences with outlier IS ratios - Veto: Excludes sequences with catastrophic tokens (checks unclamped per-token ratios) - Used for loss aggregation (excluded from denominator) ### 2. 
Fixes Seq-Mean Loss Normalization (ab3e8af) Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator: - Uses `masked_mean` utility for proper masking - Adds epsilon to prevent division by zero - Ensures fully masked sequences don't affect loss computation ### 3. Makes Veto Opt-In (10350b6) Changed `rollout_is_veto_threshold` default from `1e-4` to `None`: - Veto mechanism now opt-in by default - Users must explicitly enable via config - Updated across 11 files (configs, docs, examples) ### API Changes (Breaking) **Breaking change affecting ALL users**: `compute_rollout_importance_weights()` now returns 3 values instead of 2: - Before: `(weights_proto, metrics)` - After: `(weights_proto, modified_response_mask, metrics)` **Migration required**: All callers must be updated to handle the new return signature, regardless of which `rollout_is_mode` you use: - **Truncate mode**: Must still unpack 3 values (though `modified_response_mask` is unchanged) - **Mask mode**: Must unpack 3 values AND update `batch.response_mask` with rejection applied - **Veto enabled**: Must update `batch.response_mask` regardless of mode ### Files Changed **Main refactoring** (39dd2e4): ``` docs/advance/rollout_is.md | 115 ++++++++++++++++++----- tests/trainer/ppo/test_rollout_is.py | 81 ++++++++++++++-- tests/trainer/ppo/test_rollout_is_integration.py | 12 +-- verl/trainer/ppo/mismatch_helper.py | 105 ++++++++++++--------- verl/trainer/ppo/ray_trainer.py | 50 ++++++---- 5 files changed, 267 insertions(+), 96 deletions(-) ``` **Docs clarification** (a5aa743): ``` docs/advance/rollout_is.md | 23 ++++++++++++++--------- verl/trainer/ppo/mismatch_helper.py | 13 +++++++------ 2 files changed, 21 insertions(+), 15 deletions(-) ``` **Loss fix** (ab3e8af): ``` verl/trainer/ppo/core_algos.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) ``` **Veto default** (10350b6): ``` 11 files changed, 28 insertions(+), 26 deletions(-) (configs, 
docs, examples, mismatch_helper, ray_trainer) ``` ## Benefits - **Correct loss normalization**: Rejected samples excluded from denominator - **Mode-specific weight processing**: Truncate clamps, mask preserves safety-bounded ratios - **Clear separation of concerns**: Between IS correction and rejection - **Safer defaults**: Veto mechanism opt-in to prevent unexpected behavior - **Numerical stability**: Safety bounds prevent overflow, prevents division by zero in seq-mean modes ## Testing All tests passing (11/11): ```bash pytest tests/trainer/ppo/test_rollout_is*.py -v ``` New test `test_mask_mode()` verifies: - IS weights remain non-zero for rejected samples (safety-bounded ratios, not zeroed) - Rejection correctly applied via response_mask (not by zeroing weights) ## Migration Guide ### API Signature Change (Required for ALL users) ```python # Before: 2-value return weights_proto, metrics = compute_rollout_importance_weights(...) # After: 3-value return (ALL users must update) weights_proto, modified_response_mask, metrics = compute_rollout_importance_weights(...) 
# ALWAYS update batch with modified response_mask batch.response_mask = modified_response_mask ``` ### Mode-Specific Behavior **Truncate mode** (`rollout_is_mode="truncate"`): - IS weights: upper clamped via .clamp(max=upper_threshold) - `modified_response_mask` equals input `response_mask` (unchanged for outlier ratios) - No outlier rejection applied, but must still handle 3-value return - Veto rejection (if enabled) still applies to mask **Mask mode** (`rollout_is_mode="mask"`): - IS weights: safety-bounded ratios preserved (no threshold clamping) - `modified_response_mask` has outliers excluded (weights outside [lower, upper]) - Rejection applied via mask, NOT by modifying IS weights - Veto rejection (if enabled) also applies to mask **Veto enabled** (any mode with `rollout_is_veto_threshold` set): - Checks **unclamped per-token ratios** π_train(t)/π_rollout(t) (before safety bound) - Sequences with catastrophic tokens excluded from `modified_response_mask` - Works independently of truncate/mask mode - Does NOT modify IS weights ### Veto Default Change If you relied on the default veto threshold (1e-4), explicitly enable it: ```yaml # Old: enabled by default with threshold=1e-4 # New: opt-in (default is None) rollout_is_veto_threshold: 1e-4 ``` ## Reference [When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda) Liu, Li, Fu, Wang, Liu, Shen (2025) ### BibTeX ```bibtex @misc{liu-li-2025, title = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch}, url = {https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda}, author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen}, year = {2025}, month = september, } ```
…rom Rejection Sampling (verl-project#3915)

# Refactor Rollout Importance Sampling: Separate IS Weights from Rejection Sampling

## Summary

Refactors rollout importance sampling to properly separate IS weight correction from rejection sampling, fixes loss normalization for fully masked sequences, and makes the veto mechanism opt-in by default.

This PR contains 3 important commits:

1. **Main refactoring** (39dd2e4): Separates IS weight correction from rejection sampling
2. **Loss fix** (ab3e8af): Excludes fully masked sequences from seq-mean loss denominator
3. **Veto default** (10350b6): Changes veto threshold default from 1e-4 to None (opt-in)

## Motivation

The previous implementation applied rejection by zeroing IS weights, which conflated two distinct mechanisms. This refactoring separates them to follow correct rejection sampling principles and improves loss normalization.

## Main Changes

### 1. Separates Two Mechanisms (39dd2e4)

**IS weights** (`rollout_is_weights`): Ratios π_train/π_rollout with processing applied

- Safety-bounded to [exp(-20), exp(20)] ≈ [2e-9, 5e8] to prevent overflow:
  * Token level: bounds per-token ratios
  * Sequence/geometric: bounds aggregated ratio (broadcast to all tokens)
- Truncate mode: upper clamped via `.clamp(max=upper_threshold)`
- Mask mode: safety-bounded ratios preserved (no threshold clamping)
- All modes: zeroed at padding positions
- Preserved for policy gradient calculations

**Rejection sampling** (`modified_response_mask`): Applied via response_mask

- Mask mode: Excludes tokens/sequences with outlier IS ratios
- Veto: Excludes sequences with catastrophic tokens (checks unclamped per-token ratios)
- Used for loss aggregation (excluded from denominator)

### 2. Fixes Seq-Mean Loss Normalization (ab3e8af)

Fixed seq-mean-token-sum and seq-mean-token-mean modes to exclude fully masked sequences from denominator:

- Uses `masked_mean` utility for proper masking
- Adds epsilon to prevent division by zero
- Ensures fully masked sequences don't affect loss computation

### 3. Makes Veto Opt-In (10350b6)

Changed `rollout_is_veto_threshold` default from `1e-4` to `None`:

- Veto mechanism now opt-in by default
- Users must explicitly enable via config
- Updated across 11 files (configs, docs, examples)

### API Changes (Breaking)

**Breaking change affecting ALL users**: `compute_rollout_importance_weights()` now returns 3 values instead of 2:

- Before: `(weights_proto, metrics)`
- After: `(weights_proto, modified_response_mask, metrics)`

**Migration required**: All callers must be updated to handle the new return signature, regardless of which `rollout_is_mode` you use:

- **Truncate mode**: Must still unpack 3 values (though `modified_response_mask` is unchanged)
- **Mask mode**: Must unpack 3 values AND update `batch.response_mask` with rejection applied
- **Veto enabled**: Must update `batch.response_mask` regardless of mode

### Files Changed

**Main refactoring** (39dd2e4):

```
docs/advance/rollout_is.md                       | 115 ++++++++++++++++++-----
tests/trainer/ppo/test_rollout_is.py             |  81 ++++++++++++++--
tests/trainer/ppo/test_rollout_is_integration.py |  12 +--
verl/trainer/ppo/mismatch_helper.py              | 105 ++++++++++++---------
verl/trainer/ppo/ray_trainer.py                  |  50 ++++++----
5 files changed, 267 insertions(+), 96 deletions(-)
```

**Docs clarification** (a5aa743):

```
docs/advance/rollout_is.md          | 23 ++++++++++++++---------
verl/trainer/ppo/mismatch_helper.py | 13 +++++++------
2 files changed, 21 insertions(+), 15 deletions(-)
```

**Loss fix** (ab3e8af):

```
verl/trainer/ppo/core_algos.py | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
```

**Veto default** (10350b6):

```
11 files changed, 28 insertions(+), 26 deletions(-)
(configs, docs, examples, mismatch_helper, ray_trainer)
```

## Benefits

- **Correct loss normalization**: Rejected samples excluded from denominator
- **Mode-specific weight processing**: Truncate clamps, mask preserves safety-bounded ratios
- **Clear separation of concerns**: Between IS correction and rejection
- **Safer defaults**: Veto mechanism opt-in to prevent unexpected behavior
- **Numerical stability**: Safety bounds prevent overflow, and the epsilon prevents division by zero in seq-mean modes

## Testing

All tests passing (11/11):

```bash
pytest tests/trainer/ppo/test_rollout_is*.py -v
```

New test `test_mask_mode()` verifies:

- IS weights remain non-zero for rejected samples (safety-bounded ratios, not zeroed)
- Rejection correctly applied via response_mask (not by zeroing weights)

## Migration Guide

### API Signature Change (Required for ALL users)

```python
# Before: 2-value return
weights_proto, metrics = compute_rollout_importance_weights(...)

# After: 3-value return (ALL users must update)
weights_proto, modified_response_mask, metrics = compute_rollout_importance_weights(...)

# ALWAYS update batch with modified response_mask
batch.response_mask = modified_response_mask
```

### Mode-Specific Behavior

**Truncate mode** (`rollout_is_mode="truncate"`):

- IS weights: upper clamped via `.clamp(max=upper_threshold)`
- `modified_response_mask` equals input `response_mask` (unchanged for outlier ratios)
- No outlier rejection applied, but must still handle 3-value return
- Veto rejection (if enabled) still applies to mask

**Mask mode** (`rollout_is_mode="mask"`):

- IS weights: safety-bounded ratios preserved (no threshold clamping)
- `modified_response_mask` has outliers excluded (weights outside [lower, upper])
- Rejection applied via mask, NOT by modifying IS weights
- Veto rejection (if enabled) also applies to mask

**Veto enabled** (any mode with `rollout_is_veto_threshold` set):

- Checks **unclamped per-token ratios** π_train(t)/π_rollout(t) (before safety bound)
- Sequences with catastrophic tokens excluded from `modified_response_mask`
- Works independently of truncate/mask mode
- Does NOT modify IS weights

### Veto Default Change

If you relied on the default veto threshold (1e-4), explicitly enable it:

```yaml
# Old: enabled by default with threshold=1e-4
# New: opt-in (default is None)
rollout_is_veto_threshold: 1e-4
```

## Reference

[When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda)
Liu, Li, Fu, Wang, Liu, Shen (2025)

### BibTeX

```bibtex
@misc{liu-li-2025,
  title  = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch},
  url    = {https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda},
  author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen},
  year   = {2025},
  month  = sep,
}
```
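
## Sketch: IS Weights vs. Rejection Mask

The separation described under "Separates Two Mechanisms" can be illustrated with a small token-level example. This is a minimal sketch in plain Python rather than the tensor code in `mismatch_helper.py`; the helper name `token_level_is`, its arguments, and the toy inputs are assumptions made for illustration only.

```python
import math

SAFETY_BOUND = 20.0  # clamp on the log-ratio, so ratios stay within [exp(-20), exp(20)]


def token_level_is(log_ratios, response_mask, mode, upper, lower=None):
    """Hypothetical helper: returns (is_weights, modified_response_mask) for one sequence.

    log_ratios[t] is assumed to be log pi_train(t) - log pi_rollout(t).
    """
    if mode not in ("truncate", "mask"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "mask" and lower is None:
        raise ValueError("mask mode needs a lower threshold")

    weights, new_mask = [], []
    for log_ratio, valid in zip(log_ratios, response_mask):
        # Safety bound is applied in log space before exponentiating.
        ratio = math.exp(max(-SAFETY_BOUND, min(SAFETY_BOUND, log_ratio)))
        keep = valid
        if mode == "truncate":
            ratio = min(ratio, upper)          # clamp the weight, keep the token
        elif not (lower <= ratio <= upper):
            keep = 0                           # reject via the mask, weight untouched
        weights.append(ratio * valid)          # weights are zeroed only at padding
        new_mask.append(keep)
    return weights, new_mask


log_ratios = [0.0, math.log(5.0), 0.0]
response_mask = [1, 1, 0]  # last position is padding

mask_weights, mask_mask = token_level_is(log_ratios, response_mask, "mask", upper=2.0, lower=0.5)
trunc_weights, trunc_mask = token_level_is(log_ratios, response_mask, "truncate", upper=2.0)
```

In mask mode the outlier token keeps its ratio of about 5 but is dropped from the returned mask; in truncate mode the same token stays in the mask with its weight clamped to 2.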
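
## Sketch: Veto Applied Through the Mask

The veto behavior listed under "Mode-Specific Behavior" can be sketched as a pure mask operation. This is an illustrative sketch only: the function name `apply_veto` is hypothetical, and it assumes a token counts as catastrophic when its unclamped ratio falls below `rollout_is_veto_threshold`, which the description above does not state explicitly.

```python
import math


def apply_veto(log_ratios, response_masks, veto_threshold):
    """Hypothetical helper: zero the mask of any sequence with a catastrophic token.

    log_ratios and response_masks are lists of per-sequence lists.
    The comparison uses raw log-ratios, i.e. before any safety bound is applied.
    """
    if veto_threshold is None:  # opt-in: no veto when the threshold is unset
        return [list(mask) for mask in response_masks]

    log_threshold = math.log(veto_threshold)
    new_masks = []
    for seq_log_ratios, mask in zip(log_ratios, response_masks):
        # Only valid (non-padding) tokens can trigger the veto (assumption).
        catastrophic = any(
            valid and log_ratio < log_threshold
            for log_ratio, valid in zip(seq_log_ratios, mask)
        )
        new_masks.append([0] * len(mask) if catastrophic else list(mask))
    return new_masks


log_ratios = [[0.0, -12.0, 0.0], [0.1, -0.2, -15.0]]
masks = [[1, 1, 1], [1, 1, 0]]

vetoed = apply_veto(log_ratios, masks, veto_threshold=1e-4)
unchanged = apply_veto(log_ratios, masks, veto_threshold=None)
```

The first sequence is excluded entirely because one valid token has a ratio far below 1e-4; the second survives because its extreme value sits at a padding position. The IS weights are never touched by this step.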
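
## Sketch: Seq-Mean Denominator with Fully Masked Sequences

The loss fix described under "Fixes Seq-Mean Loss Normalization" can be seen in a small numeric example for the seq-mean-token-sum mode. This is a sketch under assumptions, not the code in `core_algos.py`: the function name `seq_mean_token_sum` and the epsilon value are chosen here for illustration.

```python
def seq_mean_token_sum(token_losses, response_masks, eps=1e-8):
    """Hypothetical helper: sum losses over valid tokens, then average over sequences
    that still have at least one valid token."""
    seq_sums = [
        sum(loss * valid for loss, valid in zip(losses, mask))
        for losses, mask in zip(token_losses, response_masks)
    ]
    # A sequence counts toward the denominator only if it is not fully masked.
    seq_valid = [1.0 if any(mask) else 0.0 for mask in response_masks]
    numerator = sum(s * v for s, v in zip(seq_sums, seq_valid))
    return numerator / (sum(seq_valid) + eps)  # eps guards the all-rejected batch


token_losses = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
response_masks = [[1, 1], [0, 0], [1, 0]]  # second sequence was fully rejected

loss = seq_mean_token_sum(token_losses, response_masks)
all_rejected = seq_mean_token_sum(token_losses, [[0, 0], [0, 0], [0, 0]])
```

Here the per-sequence sums are 3, 0, and 5. Dividing by the two surviving sequences gives roughly 4.0, whereas dividing by all three would give about 2.67 and dilute the loss with a sequence that contributed nothing.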