[GRPO] Truncated Importance Sampling to address rollout-training mismatch#3867
Conversation
First experiment results: these are with vLLM rollout (in server mode) and a vanilla training backend (no fsdp/deepspeed).
We observe a considerable difference, in line with the blog. Run on 2ef3af6.

Code to run:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train[:50]")

# Define the reward function, which rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(...)
trainer = GRPOTrainer(...)
```
This is a very important one, thanks! Is it ready for review?
I'm wavering on how we want to address this. Either we keep recomputing and introduce the Truncated Importance Sampling approach from the blog, or we move away from recomputing and use vLLM logprobs directly, everywhere. Both are valid; I see this as more a question of which approach scales better.
I've cleaned up the implementation now. One remaining issue is that vLLM sometimes emits a NaN logprob for the chosen token; this needs to be handled.
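One way to guard against those NaN logprobs — this is only a sketch of the idea, not the handling adopted in the PR, and `safe_logprobs` is a hypothetical helper name — is to fall back to the logprob recomputed by the training backend wherever vLLM returned NaN, so the importance ratio for that token degenerates to 1 (no correction):

```python
import torch

def safe_logprobs(rollout_logprobs: torch.Tensor, recomputed_logprobs: torch.Tensor) -> torch.Tensor:
    # Wherever vLLM returned NaN for the sampled token, substitute the
    # logprob recomputed by the training backend; the resulting
    # importance ratio exp(recomputed - rollout) is then exactly 1 for
    # that token, i.e. no off-policy correction is applied there.
    nan_mask = torch.isnan(rollout_logprobs)
    return torch.where(nan_mask, recomputed_logprobs, rollout_logprobs)
```

Alternatively, NaN positions could simply be masked out of the loss; the fallback above has the advantage of keeping the token in the objective.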
We should update to use the final processed logprobs from vLLM, from vllm-project/vllm#22387; prior versions of vLLM didn't support retrieving the sampled logprobs. EDIT: the vLLM patch has not been released yet, so we can hold off on that change for a future PR.
I think we're good @LeonEricsson, right? Or is this PR still draft?
…atch (huggingface#3867) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Motivation
TRL provides the option of using vLLM for rollouts, enabling fast and scalable generation. However, the token probabilities for the generated completions—used in the GRPO objective—do not come directly from vLLM. Instead, these probabilities are recomputed by the training backend. It has been known for a while that vLLM probabilities differ from those computed by Hugging Face, which ultimately means we inadvertently train off-policy from our generation policy, despite using the same weights. A recent blog post highlights the effect of this discrepancy and proposes a solution in the form of an importance sampling factor.
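The importance sampling correction can be sketched as follows. This is a minimal illustration of Truncated Importance Sampling, not the implementation in this PR; the function name `tis_weights` and the default cap value are illustrative assumptions. Each token's per-token loss is reweighted by the ratio between the training policy's probability and the rollout (vLLM) policy's probability, truncated from above to bound variance:

```python
import torch

def tis_weights(train_logprobs: torch.Tensor, rollout_logprobs: torch.Tensor, cap: float = 2.0) -> torch.Tensor:
    # Per-token importance ratio pi_train(y_t) / pi_rollout(y_t),
    # computed in log space for numerical stability.
    ratio = torch.exp(train_logprobs - rollout_logprobs)
    # Truncate at `cap` so a few tokens with large probability
    # mismatch cannot dominate the gradient; detach so the weight
    # acts as a constant multiplier on the loss.
    return torch.clamp(ratio, max=cap).detach()
```

In use, the per-token GRPO loss would be multiplied elementwise by these weights before reduction, so tokens where generation and training policies agree (ratio near 1) are left essentially unchanged.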
What does this PR do?
Initially, we document the numerical differences in token probabilities when using vLLM.
Depending on the results, we may address the issue through the recommended Truncated Importance Sampling method.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.