[algo] SAPO algo by Qwen#4345
Code Review
This pull request introduces the Soft Adaptive Policy Optimization (SAPO) algorithm by adding a new policy loss function and its configuration. The implementation is mostly correct, but I've found a few critical issues that need to be addressed. Firstly, the new configuration parameters tau_pos and tau_neg are missing from the ActorConfig dataclass, which will cause a crash on startup. Secondly, the loss_agg_mode in the new SAPO loss function is hardcoded, making it non-configurable. Lastly, there's a potential for a division-by-zero error in the gating function that could lead to training failure. Please address these points to ensure the stability and correctness of the new algorithm.
```yaml
# Positive and negative tau for smoothing function in SAPO (https://arxiv.org/pdf/2511.20347)
# default values used in the paper with Qwen3-30B-A3B-Base
tau_pos: 1.0
tau_neg: 1.05
```
The new configuration parameters tau_pos and tau_neg are not defined in the ActorConfig dataclass located in verl/workers/config/actor.py. This will cause a ValidationError when Hydra/OmegaConf attempts to instantiate the ActorConfig from this YAML file, as these are unrecognized fields. To fix this, you need to add these fields to the ActorConfig dataclass definition.
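A minimal sketch of the fix being requested, assuming `ActorConfig` is a standard dataclass (the real class in `verl/workers/config/actor.py` has many more fields; only the two additions are shown, with defaults taken from the YAML above):

```python
from dataclasses import dataclass

# Sketch only: the actual ActorConfig has many more fields. The point is
# that tau_pos and tau_neg must be declared here, otherwise OmegaConf/Hydra
# rejects them as unrecognized keys when merging the YAML config.
@dataclass
class ActorConfig:
    # SAPO soft-gate temperatures; defaults match the YAML above
    # (values used in the paper with Qwen3-30B-A3B-Base).
    tau_pos: float = 1.0
    tau_neg: float = 1.05
```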
```python
# for SAPO, we need to aggregate the loss at the sequence level (seq-mean-token-mean)
pg_loss = agg_loss(
    loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info
)
```
The loss_agg_mode is hardcoded as "seq-mean-token-mean" in the call to agg_loss. This completely ignores the loss_agg_mode parameter passed to the compute_policy_loss_sapo function, making this aspect of the loss calculation non-configurable. The function should use the value from the loss_agg_mode argument to allow for flexibility.
```diff
-    loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info
+    loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode, **config.global_batch_info
```
Could you please add an example script in the example folder? Thanks.
Also, if possible, show a convergence curve for some task.
Force-pushed from a4293b1 to 545f1bc.
@vermouth1992 done!
@vermouth1992 two experiments are ongoing right now on DAPO comparing vanilla GRPO and SAPO (I'll probably also add one run with GSPO).
Could you please fix the sanity and pre-commit CI? Thanks.
@vermouth1992 done!
@vermouth1992 @eric-haibin-lin any update?
Will this work with VL?
It seems that there are two SAPO recipes. Shall we merge them into one and acknowledge both contributions? #4624
@CedricHwong could you please clarify the concrete differences you're seeing between the two? I'm confident we can align them if needed, and I'm happy to give you access so we can work on this together. Also, note that my latest commit may have overwritten some of your previous changes, making the two PRs effectively the same, so the behavior you're referring to might not reflect the current state of the code. Finally, did you run any tests or training runs with your version that you could share? That would help ground the discussion. Many thanks!
@vermouth1992 is anything missing to merge this PR? Thanks :)
What does this PR do?
Implements Soft Adaptive Policy Optimization (SAPO) for RL fine-tuning of LLMs. SAPO replaces hard clipping with temperature-controlled soft gating for more stable training and better sample efficiency. Paper: https://arxiv.org/abs/2511.20347
Checklist Before Starting

Title: [algo] feat: implement SAPO (Soft Adaptive Policy Optimization)

Design & Code Changes

- Adds `compute_policy_loss_sapo()` in `verl/trainer/ppo/core_algos.py`
- Soft gate f(r) = σ(τ(r-1)) · 4/τ, where r = π_θ / π_θ_old
- Asymmetric temperatures (τ_neg > τ_pos) as in the original paper
- Loss aggregated with `seq-mean-token-mean` as per the paper

Checklist Before Submitting
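The soft gate above can be sketched as follows. This is an illustrative scalar implementation, not the PR's actual code (which applies the gate elementwise to torch tensors, selecting tau_pos or tau_neg by the sign of the advantage); the epsilon guard on τ is an assumed mitigation for the division-by-zero concern raised in the review summary:

```python
import math

def sapo_gate(ratio, tau, eps=1e-8):
    """Soft gate f(r) = sigmoid(tau * (r - 1)) * 4 / tau.

    Illustrative sketch only. `ratio` is the importance ratio
    r = pi_theta / pi_theta_old; `tau` is the temperature (the PR uses
    asymmetric values, tau_neg > tau_pos). Clamping tau to at least `eps`
    avoids the division-by-zero failure mode flagged in the review.
    """
    tau = max(tau, eps)
    sigmoid = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return sigmoid * 4.0 / tau
```

At r = 1 the sigmoid evaluates to 0.5, so f(1) = 2/τ; with τ = 1.0 the gate value is exactly 2.0, and the gate increases smoothly in r rather than clipping hard.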
XP: SAPO vs GRPO