
[fix][train] Fix advantage response masking for step wise training #1506

Closed
SumanthRH wants to merge 2 commits into main from fix-adv-masking-step-wise

Conversation

@SumanthRH
Member

What does this PR do?

Fixes #1492

SumanthRH and others added 2 commits April 13, 2026 22:18
Add a return_raw_scores flag to compute_advantages so raw per-token
scores are preserved before masking. Wire it through the trainer and add
unit tests covering step-wise advantage masking behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…o [N, seqlen]

- GAE: add return_raw_scores flag; apply response_mask to advantages (skipped
  when return_raw_scores=True) for consistency with other estimators
- GRPO/RLOO/MAXRL: expand [N,1] scores to [N, seqlen] via expand_as when
  return_raw_scores=True, so the step-wise trainer receives uniform shapes
- Parametrize step-wise trainer test over all 5 estimators
- Add unit test asserting return_raw_scores=True yields [N, seqlen] for all estimators
- Update docstrings with return_raw_scores documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
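The commit above describes expanding `[N, 1]` outcome scores to `[N, seqlen]` via `expand_as` and applying `response_mask` only when `return_raw_scores=False`. A minimal sketch of that shape handling, using NumPy's `broadcast_to` in place of torch's `expand_as` (function name and arguments here are illustrative, not the repo's actual API):

```python
import numpy as np

def compute_advantages_sketch(scores, response_mask, return_raw_scores=False):
    """Illustrative sketch of the shapes described in the PR, not real code.

    scores:        [N, 1] per-sequence scores (GRPO/RLOO-style outcome rewards)
    response_mask: [N, seqlen], 1 for response tokens, 0 for prompt/padding
    """
    n, seqlen = response_mask.shape
    # Expand [N, 1] -> [N, seqlen], analogous to torch's expand_as
    advantages = np.broadcast_to(scores, (n, seqlen)).astype(float).copy()
    if return_raw_scores:
        # Step-wise trainer path: uniform [N, seqlen] raw scores, unmasked
        return advantages
    # Default path: zero out non-response positions
    return advantages * response_mask

scores = np.array([[1.0], [-0.5]])       # [N=2, 1]
mask = np.array([[0, 1, 1], [0, 0, 1]])  # [N=2, seqlen=3]

raw = compute_advantages_sketch(scores, mask, return_raw_scores=True)
masked = compute_advantages_sketch(scores, mask)
print(raw.shape, masked)  # (2, 3), with prompt positions zeroed in masked
```

With `return_raw_scores=True` every position carries the raw per-sequence score, which is why the fix matters for step-wise training: masking must be deferred until the step-wise trainer has assigned scores to the right segments.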
@SumanthRH
Member Author

SumanthRH commented Apr 13, 2026

Closing in favour of #1507

@SumanthRH SumanthRH closed this Apr 13, 2026


Development

Successfully merging this pull request may close these issues.

[train] Incorrect advantage assignment for step_wise_trajectores in RayPPOTrainer?
