feat(grpo_trainer.py): Variational Sequence-Level Soft Policy Optimization (VESPO)#5199

Merged
qgallouedec merged 12 commits into
huggingface:mainfrom
casinca:VESPO
Mar 14, 2026

Conversation

@casinca

@casinca casinca commented Feb 27, 2026

Contributor

What does this PR do?

This PR implements the VESPO loss and resolves #5196.

Official implementation: https://github.com/FloyedShen/VESPO/blob/main/recipe/vespo/code/core_algos.py
Paper: https://huggingface.co/papers/2602.10693

Note:

  • The paper and the official implementation use different variable names. To make the mapping explicit:

    • c1 = k = α
    • c2 = lambda
  • Docstrings and comments are a mix of the official implementation's wording and my own.

 

Alternative options:

  • VESPO currently has 4 hyperparameters (k_pos, lambda_pos, k_neg, lambda_neg), but I could reduce them to 2 tuples of 2 floats, e.g. lambdas = (pos, neg), if that's preferable.
  • The original implementation also returns w_seq for metrics. I can include it in the metrics, but that would force me to either return a tuple from get_gamma_weights or remove @staticmethod. I'm not sure which is preferred here.
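To make the first option concrete, here is a minimal sketch of the two layouts side by side. Both classes are hypothetical containers written only for illustration (they are not part of TRL or of this PR), and the numeric values are placeholders, not recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlatVespoParams:
    """Hypothetical container mirroring the four separate floats."""

    k_pos: float
    lambda_pos: float
    k_neg: float
    lambda_neg: float


@dataclass(frozen=True)
class TupleVespoParams:
    """Hypothetical container for the two-tuple alternative, ordered (pos, neg)."""

    ks: tuple[float, float]
    lambdas: tuple[float, float]

    def flatten(self) -> FlatVespoParams:
        # Unpack each (pos, neg) pair back into the four-float layout.
        k_pos, k_neg = self.ks
        lambda_pos, lambda_neg = self.lambdas
        return FlatVespoParams(k_pos, lambda_pos, k_neg, lambda_neg)


# Placeholder values, chosen only for illustration.
flat = FlatVespoParams(k_pos=2.0, lambda_pos=3.0, k_neg=3.0, lambda_neg=2.0)
grouped = TupleVespoParams(ks=(2.0, 3.0), lambdas=(3.0, 2.0))
assert grouped.flatten() == flat
```

The two layouts carry the same information. The tuple form halves the number of config fields, at the cost of relying on the (pos, neg) ordering convention.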

 

For efficiency, the TRL VESPO implementation differs slightly from the official one. It is roughly 1.2x to 1.25x faster per call on GPU in the benchmark below, and it has been tested for equivalence.

With importance_sampling_ratio:
-----------------------------------------------------------------
B x T         TRL_VESPO (ms)    OG_VESPO (ms)     Faster
-----------------------------------------------------------------
8 x 128         0.4290          0.5301          TRL_VESPO (1.24x)
16 x 256        0.4281          0.5302          TRL_VESPO (1.24x)
32 x 512        0.4283          0.5299          TRL_VESPO (1.24x)
64 x 512        0.4284          0.5294          TRL_VESPO (1.24x)
128 x 512       0.4286          0.5322          TRL_VESPO (1.24x)
32 x 1024       0.4473          0.5313          TRL_VESPO (1.19x)
64 x 1024       0.4285          0.5360          TRL_VESPO (1.25x)
128 x 1024      0.4240          0.5203          TRL_VESPO (1.23x)
-----------------------------------------------------------------

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Adds a new loss formulation to GRPOTrainer that changes how gradients are scaled and how vLLM importance-sampling correction is applied, so training behavior can shift for users selecting loss_type="vespo". Scope is contained to a new loss branch plus config/docs/tests, but it touches core loss computation.

Overview
Adds VESPO (loss_type="vespo") to GRPOTrainer, including a new get_gamma_weights helper that computes detached, advantage-sign-dependent Gamma weights from sequence-level importance ratios (optionally incorporating vLLM TIS/MIS correction in log space) and uses them in the loss.

Extends GRPOConfig with four new hyperparameters (vespo_k_pos, vespo_lambda_pos, vespo_k_neg, vespo_lambda_neg) and updates validation/behavior so VESPO warns about importance_sampling_level usage, restricts vLLM correction modes to token_truncate/token_mask, and skips the generic per-token vLLM correction multiplier for VESPO.

Updates docs (paper_index.md) to add a VESPO paper entry/config snippet and expands the GRPO loss-type test matrix to include vespo.

Written by Cursor Bugbot for commit c5d0d50. This will update automatically on new commits. Configure here.
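The Gamma weighting described in the overview above can be sketched for a single sequence as follows. This is a scalar, stdlib-only illustration, not the code of get_gamma_weights: the kernel form w^k · exp(λ(1 − w)), the treatment of zero advantages, and the numeric hyperparameter values are all assumptions made for the example. The real helper works on batched tensors and returns detached weights.

```python
import math

LOG_W_MIN = math.log(1e-8)  # lower clamp discussed in this PR (about -18.42)
LOG_W_MAX = 20.0


def gamma_weight(seq_log_ratio, advantage, k_pos, lambda_pos, k_neg, lambda_neg):
    """Scalar sketch of an advantage-sign-dependent Gamma weight."""
    # Clamp the sequence-level log ratio, then recover the ratio itself.
    log_w = min(max(seq_log_ratio, LOG_W_MIN), LOG_W_MAX)
    w = math.exp(log_w)
    # Assumption: zero advantages use the positive branch.
    k, lam = (k_pos, lambda_pos) if advantage >= 0 else (k_neg, lambda_neg)
    # Assumed kernel w**k * exp(lam * (1 - w)), evaluated in log space.
    return math.exp(k * log_w + lam * (1.0 - w))


# Placeholder hyperparameters, chosen only for illustration.
params = dict(k_pos=2.0, lambda_pos=3.0, k_neg=3.0, lambda_neg=2.0)
on_policy = gamma_weight(0.0, advantage=1.0, **params)   # ratio of exactly 1
too_high = gamma_weight(2.0, advantage=1.0, **params)    # ratio well above 1
too_low = gamma_weight(-2.0, advantage=1.0, **params)    # ratio well below 1
```

Under these assumptions the weight is 1 when the sequence-level ratio is 1. It shrinks as the ratio moves away from 1 in either direction, with k governing the low-ratio side and lambda the high-ratio side.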

@casinca casinca changed the title init feat(grpo_trainer.py): Variational Sequence-Level Soft Policy Optimization (VESPO) Feb 27, 2026
@casinca casinca marked this pull request as ready for review March 1, 2026 19:22
@casinca

casinca commented Mar 2, 2026

Contributor Author

I owe a better explanation, to facilitate the review, of why I import math for lower_clamp = math.log(1e-8) in get_gamma_weights.

In the original implementation (below), the author recomputes log_w_seq from w_seq, but we already have the log from seq_log_ratio_clampled. The only difference introduced by recomputing is that the lower clamp is tightened to min=1e-8.

[Image: screenshot of the relevant code in the original implementation]

 

To avoid a second log op in TRL, I clamp directly in log space, once: log_w_seq = torch.clamp(seq_log_ratio, lower_clamp, 20.0). This ends up being the same.

This is solely to follow the original implementation; otherwise, I'm not sure that raising the lower bound from $e^{-20}$ to $e^{-18.42}$ (i.e. $e^{\log(10^{-8})}$) matters. I opened an issue in the original repo about this: FloyedShen/VESPO#6

If keeping the original logic and importing math is a problem, the alternatives would be to hardcode the value of log(1e-8) or to compute it from a tensor with torch.log(torch.tensor(1e-8)).
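The equivalence claimed above can be checked with a small scalar sketch. It uses only the stdlib math module rather than torch, and both function names are hypothetical. The two-step version assumes, as described in this comment, that the original code exponentiates a log ratio already clamped to [-20, 20], clamps the result at 1e-8, and takes the log again.

```python
import math

LOWER_CLAMP = math.log(1e-8)  # about -18.42
UPPER_CLAMP = 20.0


def log_w_two_step(seq_log_ratio: float) -> float:
    """Exponentiate, clamp the ratio at 1e-8, then take the log again."""
    clamped = min(max(seq_log_ratio, -20.0), UPPER_CLAMP)
    w_seq = math.exp(clamped)
    return math.log(max(w_seq, 1e-8))


def log_w_one_step(seq_log_ratio: float) -> float:
    """Clamp once, directly in log space."""
    return min(max(seq_log_ratio, LOWER_CLAMP), UPPER_CLAMP)


# Both paths agree up to floating-point round-off across the whole range.
for x in (-50.0, -20.0, -19.0, -18.0, 0.0, 5.0, 20.0, 35.0):
    assert math.isclose(
        log_w_two_step(x), log_w_one_step(x), rel_tol=1e-9, abs_tol=1e-9
    )
```

Because log is monotonic, clamping the ratio at 1e-8 and then taking the log gives the same result as clamping the log ratio at log(1e-8), so the single log-space clamp saves the extra exp/log round trip.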

@casinca

casinca commented Mar 6, 2026

Contributor Author

hey @FloyedShen , feel free to share any thoughts on this TRL implementation.

@FloyedShen


Hi @casinca, sorry for the late reply — been swamped with work lately 😅

Thanks for this great contribution! I'm one of the authors of VESPO. I took some time to go through your implementation and ran a quick validation experiment comparing VESPO against the GRPO baseline.

Setup: Qwen3-4B-Base, DAPO-Math-17k, 8×H20 GPUs, 8 generations, 8 steps per generation, lr=1e-6, max_completion_length=8192.

The results look solid and are consistent with what we see in our verl-based experiments — VESPO shows improved reward over GRPO as training progresses, with notably more stable gradient norms and better entropy retention throughout training. Everything checks out on my end!

[Image: Screenshot 2026-03-14 at 01 52 02]

Full training logs: https://wandb.ai/brain-cog/vespo_trl

Thanks again for the clean implementation and the thorough benchmarking 👍

@casinca

casinca commented Mar 13, 2026

Contributor Author

Np, thanks for the feedback and for taking the time to test, appreciate it.

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec qgallouedec left a comment

Member

Thanks, clean implementation!

Comment thread trl/trainer/grpo_trainer.py
@qgallouedec qgallouedec merged commit 406d406 into huggingface:main Mar 14, 2026
12 checks passed
@casinca casinca deleted the VESPO branch March 15, 2026 14:40
qgallouedec added a commit that referenced this pull request Mar 18, 2026
commit 52cd0cc
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:31:26 2026 +0100

    Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model (#5295)

commit 7b42fc4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:29:11 2026 +0100

    Prevent corruption of DPO VLM training if "keep_end" truncation_mode (#5286)

commit 3acb8e8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:27:10 2026 +0100

    Support max_length in DPO VLM training (#5284)

commit ee339a0
Author: Carlos Miguel Patiño <carlos.patino@huggingface.co>
Date:   Tue Mar 17 14:01:44 2026 +0100

    [GKD] Buffer Implementation for Distillation Trainer (#5137)

    Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

commit d46131f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 15:27:19 2026 +0100

    Remove custom get_train/eval_dataloader from OnlineDPO (#5291)

commit 85cf8f4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 15:24:24 2026 +0100

    Remove TrainingArguments import from experimental trainers (#5290)

commit 91e3da0
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Mar 16 07:19:51 2026 -0600

    Fix `accuracy_reward` crash when called from non-main thread (#5281)

commit 4996631
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 07:44:28 2026 +0100

    Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string (#5274)

commit 5fceaa7
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 07:43:34 2026 +0100

    Simplify structured outputs logic across vLLM versions in scripts/vllm_serve (#5273)

commit 406d406
Author: casinca <47400729+casinca@users.noreply.github.com>
Date:   Sat Mar 14 04:12:49 2026 +0100

    feat(`grpo_trainer.py`): Variational Sequence-Level Soft Policy Optimization (VESPO) (#5199)

commit d0ac7ef
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date:   Sat Mar 14 02:53:33 2026 +0100

    Allow nullable logprobs in vLLM serve responses  (#5203)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit c0eabc4
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Mar 13 18:19:15 2026 -0600

    Change default `vllm_mode` to `"colocate"` and add v0→v1 migration guide (#5255)

    Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

commit 6c0fccd
Author: Mario Šaško <mariosasko777@gmail.com>
Date:   Sat Mar 14 00:19:38 2026 +0100

    35% faster packing + rename `bfd-requeue` to `bfd_split` (#5189)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
qgallouedec added a commit that referenced this pull request Mar 18, 2026
commit 3972d66
Author: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date:   Wed Mar 18 22:26:44 2026 +0100

    Suggest the `Json()` type for tool calling dataset format (#5307)

    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 5c6e915
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Mar 18 14:55:19 2026 -0600

    Update `RewardFunc` type annotation to allow `None`values in reward list (#5297)

commit ee96845
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Mar 18 17:03:54 2026 +0100

    Fix DPOTrainer collators to truncate sequences before padding (#5305)

commit 435c2ae
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Mar 18 08:09:42 2026 -0600

    Add guidance to avoid `hasattr` and `getattr` with defaults in `AGENTS.md` (#5294)

commit 26ce6a3
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Mar 18 00:44:12 2026 -0600

    Apply docstyle (#5296)

songhappy pushed a commit to songhappy/trl that referenced this pull request Apr 20, 2026

Development

Successfully merging this pull request may close these issues.

feat: Variational Sequence-Level Soft Policy Optimization (VESPO)

4 participants