[algo] SAPO algo by Qwen#4345
Code Review
This pull request introduces the Soft Adaptive Policy Optimization (SAPO) algorithm by adding a new policy loss function and its configuration. The implementation is mostly correct, but I've found a few critical issues that need to be addressed. Firstly, the new configuration parameters tau_pos and tau_neg are missing from the ActorConfig dataclass, which will cause a crash on startup. Secondly, the loss_agg_mode in the new SAPO loss function is hardcoded, making it non-configurable. Lastly, there's a potential for a division-by-zero error in the gating function that could lead to training failure. Please address these points to ensure the stability and correctness of the new algorithm.
```yaml
# Positive and negative tau for smoothing function in SAPO (https://arxiv.org/pdf/2511.20347)
# default values used in the paper with Qwen3-30B-A3B-Base
tau_pos: 1.0
tau_neg: 1.05
```
The new configuration parameters tau_pos and tau_neg are not defined in the ActorConfig dataclass located in verl/workers/config/actor.py. This will cause a ValidationError when Hydra/OmegaConf attempts to instantiate the ActorConfig from this YAML file, as these are unrecognized fields. To fix this, you need to add these fields to the ActorConfig dataclass definition.
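A minimal sketch of the fix being requested, assuming `ActorConfig` is a standard dataclass (the real class in `verl/workers/config/actor.py` has many more fields; only the two additions are shown, with defaults taken from the YAML above):

```python
from dataclasses import dataclass

# Sketch only: the actual ActorConfig has many more fields. The point is
# that tau_pos and tau_neg must be declared here, otherwise OmegaConf/Hydra
# rejects them as unrecognized keys when merging the YAML config.
@dataclass
class ActorConfig:
    # SAPO soft-gate temperatures; defaults match the YAML above
    # (values used in the paper with Qwen3-30B-A3B-Base).
    tau_pos: float = 1.0
    tau_neg: float = 1.05
```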
```python
# for SAPO, we need to aggregate the loss at the sequence level (seq-mean-token-mean)
pg_loss = agg_loss(
    loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info
)
```
The loss_agg_mode is hardcoded as "seq-mean-token-mean" in the call to agg_loss. This completely ignores the loss_agg_mode parameter passed to the compute_policy_loss_sapo function, making this aspect of the loss calculation non-configurable. The function should use the value from the loss_agg_mode argument to allow for flexibility.
```diff
-    loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info
+    loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode, **config.global_batch_info
```
Could you please add an example script in the example folder? Thanks.
Also, if possible, show a convergence curve for some task.
Force-pushed from a4293b1 to 545f1bc.
@vermouth1992 done!
@vermouth1992 two experiments are ongoing right now on DAPO comparing vanilla GRPO and SAPO (I'll probably also add one run with GSPO).
Could you please fix the sanity and pre-commit CI? Thanks.
@vermouth1992 done!
@vermouth1992 @eric-haibin-lin any update?
Will this work with VL?
It seems that there are two SAPO recipes. Shall we merge them into one and acknowledge both contributions? #4624
@CedricHwong could you please clarify the concrete differences you're seeing between the two? I'm confident we can align them if needed, and I'm happy to give you access so we can work on this together. Also, note that my latest commit may have overwritten some of your previous changes, making the two PRs effectively the same, so the behavior you're referring to might not reflect the current state of the code. Finally, did you run any tests or training runs with your version that you could share? That would help ground the discussion. Many thanks!
@vermouth1992 is anything missing to merge this PR? Thanks :)
What does this PR do?
Implements Soft Adaptive Policy Optimization (SAPO) for RL fine-tuning of LLMs. SAPO replaces hard clipping with temperature-controlled soft gating for more stable training and better sample efficiency. Paper: https://arxiv.org/abs/2511.20347
Checklist Before Starting

Title: [algo] feat: implement SAPO (Soft Adaptive Policy Optimization)

Design & Code Changes

- Adds `compute_policy_loss_sapo()` in `verl/trainer/ppo/core_algos.py`
- Soft gate f(r) = σ(τ(r-1)) · 4/τ, where r = π_θ / π_θ_old
- Asymmetric temperatures (τ_neg > τ_pos) as in the original paper
- Loss aggregated with `seq-mean-token-mean` as per the paper

Checklist Before Submitting
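The soft gate above can be sketched as follows. This is an illustrative scalar implementation, not the PR's actual code (which applies the gate elementwise to torch tensors, selecting tau_pos or tau_neg by the sign of the advantage); the epsilon guard on τ is an assumed mitigation for the division-by-zero concern raised in the review summary:

```python
import math

def sapo_gate(ratio, tau, eps=1e-8):
    """Soft gate f(r) = sigmoid(tau * (r - 1)) * 4 / tau.

    Illustrative sketch only. `ratio` is the importance ratio
    r = pi_theta / pi_theta_old; `tau` is the temperature (the PR uses
    asymmetric values, tau_neg > tau_pos). Clamping tau to at least `eps`
    avoids the division-by-zero failure mode flagged in the review.
    """
    tau = max(tau, eps)
    sigmoid = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return sigmoid * 4.0 / tau
```

At r = 1 the sigmoid evaluates to 0.5, so f(1) = 2/τ; with τ = 1.0 the gate value is exactly 2.0, and the gate increases smoothly in r rather than clipping hard.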
XP: SAPO vs GRPO