
[fix][train] Fix advantage response masking for step wise training #1506

Closed
SumanthRH wants to merge 2 commits into main from fix-adv-masking-step-wise

Conversation

@SumanthRH
Member

What does this PR do?

Fixes #1492

SumanthRH and others added 2 commits April 13, 2026 22:18
Add a return_raw_scores flag to compute_advantages so raw per-token
scores are preserved before masking. Wire it through the trainer and add
unit tests covering step-wise advantage masking behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…o [N, seqlen]

- GAE: add return_raw_scores flag; apply response_mask to advantages (skipped
  when return_raw_scores=True) for consistency with other estimators
- GRPO/RLOO/MAXRL: expand [N,1] scores to [N, seqlen] via expand_as when
  return_raw_scores=True, so the step-wise trainer receives uniform shapes
- Parametrize step-wise trainer test over all 5 estimators
- Add unit test asserting return_raw_scores=True yields [N, seqlen] for all estimators
- Update docstrings with return_raw_scores documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
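The commit above describes expanding `[N, 1]` outcome scores to `[N, seqlen]` via `expand_as` and applying `response_mask` only when `return_raw_scores=False`. A minimal sketch of that shape handling, using NumPy's `broadcast_to` in place of torch's `expand_as` (function name and arguments here are illustrative, not the repo's actual API):

```python
import numpy as np

def compute_advantages_sketch(scores, response_mask, return_raw_scores=False):
    """Illustrative sketch of the shapes described in the PR, not real code.

    scores:        [N, 1] per-sequence scores (GRPO/RLOO-style outcome rewards)
    response_mask: [N, seqlen], 1 for response tokens, 0 for prompt/padding
    """
    n, seqlen = response_mask.shape
    # Expand [N, 1] -> [N, seqlen], analogous to torch's expand_as
    advantages = np.broadcast_to(scores, (n, seqlen)).astype(float).copy()
    if return_raw_scores:
        # Step-wise trainer path: uniform [N, seqlen] raw scores, unmasked
        return advantages
    # Default path: zero out non-response positions
    return advantages * response_mask

scores = np.array([[1.0], [-0.5]])       # [N=2, 1]
mask = np.array([[0, 1, 1], [0, 0, 1]])  # [N=2, seqlen=3]

raw = compute_advantages_sketch(scores, mask, return_raw_scores=True)
masked = compute_advantages_sketch(scores, mask)
print(raw.shape, masked)  # (2, 3), with prompt positions zeroed in masked
```

With `return_raw_scores=True` every position carries the raw per-sequence score, which is why the fix matters for step-wise training: masking must be deferred until the step-wise trainer has assigned scores to the right segments.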
@SumanthRH
Member Author

SumanthRH commented Apr 13, 2026

Closing in favour of #1507

@SumanthRH SumanthRH closed this Apr 13, 2026


Development

Successfully merging this pull request may close these issues.

[train] Incorrect advantage assignment for step_wise_trajectores in RayPPOTrainer?
