Add SDPO (Self-Distillation Policy Optimization) trainer by MengAiDev · Pull Request #4935 · huggingface/trl

MengAiDev · 2026-01-30T02:07:30Z

Implements SDPO algorithm from arxiv.org/abs/2601.20802. SDPO augments on-policy optimization with self-distillation from the model's own high-reward trajectories, converting tokenized feedback into a dense learning signal.

- Add SDPOConfig with distillation parameters (alpha, topk, ema_update_rate, etc.)
- Add SDPOTrainer extending GRPOTrainer with self-distillation loss
- Add comprehensive tests for SDPOConfig and SDPOTrainer
- Add example script demonstrating SDPO usage

Fixes #4929

Note

Medium Risk
Introduces new experimental training algorithms (online rollout + self-distillation, EMA teacher syncing, reward-driven reprompting) that can affect training correctness and distributed behavior, though changes are mostly additive and isolated under trl.experimental.

Overview
Adds a new experimental self-distillation stack (SelfDistillationConfig, SelfDistillationMixin, OnlineRolloutMixin, BaseSelfDistillationTrainer) to support rollout reuse, reward scoring/normalization, self-distillation losses (token/logit-level + optional IS clipping), and callback hooks/diagnostics.

Introduces two new trainers: SDPOTrainer/SDPOConfig implementing online SDPO with successful-rollout/feedback-based teacher reprompting and optional EMA teacher synchronization, and SDFTTrainer/SDFTConfig for on-policy self-distilled fine-tuning using teacher-conditioned prompts and optional PEFT adapter EMA teacher.
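As a rough illustration of the EMA teacher synchronization mentioned above, the sketch below moves teacher weights toward student weights. The function name `ema_update`, the plain-dict parameter layout, and the convention that `ema_update_rate` is the weight placed on the student are all assumptions made for illustration; they are not taken from the trainer code.

```python
def ema_update(teacher, student, ema_update_rate):
    """Return teacher weights moved toward the student.

    Assumed convention (not confirmed by the PR):
        new_teacher = (1 - rate) * teacher + rate * student
    """
    if not 0.0 <= ema_update_rate <= 1.0:
        raise ValueError("ema_update_rate must be in [0, 1]")
    return {
        name: (1.0 - ema_update_rate) * teacher[name] + ema_update_rate * student[name]
        for name in teacher
    }


# Toy scalar "parameters" standing in for model tensors.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
teacher = ema_update(teacher, student, ema_update_rate=0.5)
```

With a small rate the teacher changes slowly and stays close to earlier policies; with a rate of 1.0 it simply copies the student at every sync.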

Adds example scripts (trl/experimental/sdpo/sdpo.py, trl/experimental/sdft/sdft.py), new docs pages and paper index entries, and extensive tests covering training flows, callback payloads, PEFT EMA behavior, masking/attention correctness, and diagnostic warnings.

Written by Cursor Bugbot for commit bf4cc67. This will update automatically on new commits.

kashif · 2026-02-02T11:26:22Z

@MengAiDev I have cleaned up the structure and docs and tests. Next we need to address the main TODOs regarding the teacher logits.

kashif · 2026-02-02T13:11:27Z

cc @jonhue here is a port of SDPO for TRL

jonhue · 2026-02-02T13:51:38Z

@MengAiDev @kashif Thanks so much for implementing this!! Let's coordinate with @Shekswess and #4941. It might be cleanest to have one implementation for SDFT & SDPO ("self-distillation") since both are algorithmically the same and they differ only in whether data is offline or online.

kashif · 2026-02-02T13:53:24Z

Agree! Let's try that, if it's OK with you @MengAiDev

Shekswess · 2026-02-02T15:24:45Z

Woohoo!
This is really awesome, bravo legends @kashif @jonhue @MengAiDev. Maybe we should then also have the offline version of the trainer, so that some folks (like me, who are GPU poor hahahahaha) can experiment with the approaches.

LeonEricsson · 2026-02-08T13:18:27Z

Regarding the discussion on how to combine SDFT/SDPO PRs:

This PR inherits from GRPOTrainer, while the SDFT PR modifies it in place. Both approaches carry baggage from GRPOTrainer that isn’t necessarily applicable to SDPO/SDFT — but this also provides a nice playground for experimentation.

The tradeoff with inheritance is less control, but I like how it nicely isolates SDPO’s key contributions and exposes relevant hparams clearly. If future research demands more flexibility, we can revisit and consider breaking out SDPO into its own trainer.

If we proceed with this PR’s approach, extending it to cover the offline case should, at first glance, just require modifying the _build_teacher_inputs function.

qgallouedec · 2026-02-08T16:02:21Z

That's a good point, Leon. I need to review the PR carefully, but in general I'd rather isolate first and abstract later, if needed (abstractions are easy to do, hard to undo).

Shekswess · 2026-02-08T17:30:06Z

@qgallouedec @LeonEricsson if you look at my implementation of the offline SDFT in #4941 (comment), I think it can be improved a lot. I tried to follow the official code from the authors with small modifications. Feel free to ping us about how we can make this stuff better. Cannot wait to start experimenting hehehehe

niksdagr8 · 2026-02-24T18:52:22Z

Any progress on this is much appreciated

HuggingFaceDocBuilderDev · 2026-03-20T20:34:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor · 2026-03-21T06:25:53Z

+            callbacks=callbacks,
+            optimizers=optimizers,
+            peft_config=peft_config,
+        )


SDPOTrainer positional constructor bypasses keyword validation

Low Severity

SDPOTrainer.__init__ accepts reward_funcs as the second positional argument followed by args as the third, but the test test_training_with_positional_config_argument passes (model, reward_func, training_args, dataset) positionally. The train_dataset parameter is the fourth positional argument, matching correctly. However, the parent BaseSelfDistillationTrainer constructor does not call super().__init__ with a signature that maps train_dataset positionally — the fourth parameter alignment is coincidental and fragile if the constructor signature changes.

cursor · 2026-03-21T06:25:53Z

+            teacher_logits = teacher_model(**teacher_model_inputs).logits
+            teacher_logits = teacher_logits[:, :-1, :]
+            teacher_logits = teacher_logits[:, -logits_to_keep:, :]
+            teacher_logits = teacher_logits / self.temperature


Teacher context uses wrong model reference during inference

High Severity

_get_teacher_model_for_self_distillation returns self.teacher_model when it exists, but _get_teacher_context_for_self_distillation returns nullcontext() by default. In the SDPO EMA teacher case, teacher_model is a separate deep-copied model, so the teacher logits are correctly computed on it. However, the teacher_model_inputs use teacher_input_ids (the reprompted sequence) which has a different prompt length than the student. When logits_to_keep is set based on completion_ids.size(1), both the student and teacher forward passes use the same logits_to_keep + 1. If the teacher prompt is longer than the student prompt, the teacher model's logits_to_keep parameter will correctly slice the last N logits. This is fine since both models share the same completion tokens at the end. So this is actually correct.
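The alignment argument in this comment can be checked with a small index-only sketch. It assumes a single unpadded sequence laid out as prompt followed by completion, and it models only the two slices shown in the diff; the helper name `positions_scoring_completion` is hypothetical.

```python
def positions_scoring_completion(prompt_len, completion_len):
    """Indices of the positions whose logits predict the completion tokens.

    Mirrors the slicing in the diff on one sequence: drop the last
    position (it predicts a token past the end), then keep the final
    `completion_len` positions.
    """
    seq_len = prompt_len + completion_len
    positions = list(range(seq_len))    # one logit vector per input position
    positions = positions[:-1]          # logits[:, :-1, :]
    return positions[-completion_len:]  # logits[:, -logits_to_keep:, :]


completion_len = 3
student_prompt_len = 4  # original prompt
teacher_prompt_len = 9  # longer, reprompted prompt

student = positions_scoring_completion(student_prompt_len, completion_len)
teacher = positions_scoring_completion(teacher_prompt_len, completion_len)

# Position p predicts token p + 1; convert to offsets within the completion.
student_targets = [p + 1 - student_prompt_len for p in student]
teacher_targets = [p + 1 - teacher_prompt_len for p in teacher]
```

Both target lists cover the same completion offsets, which is why slicing from the end keeps student and teacher logits aligned even when the prompts differ in length.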

cursor · 2026-03-21T07:08:53Z

+            inputs["completion_mask"] = completion_mask
+
+        loss = self._compute_self_distillation_loss(model, inputs)
+        return loss / self.current_gradient_accumulation_steps


SDFTTrainer ignores inherited distillation_weight config parameter

Low Severity

SDFTTrainer.compute_loss applies _compute_self_distillation_loss and divides by gradient accumulation steps, but never multiplies by self.args.distillation_weight. The distillation_weight parameter is inherited from SelfDistillationConfig and is settable by users, but silently ignored here. In contrast, SDPOTrainer._compute_loss correctly applies self.args.distillation_weight. Setting a non-default distillation_weight on SDFTConfig has no effect on training.

qgallouedec · 2026-03-21T17:18:35Z

Just running a final @codex review to see if there is any critical bug, otherwise we're good to merge

qgallouedec · 2026-03-21T17:23:41Z

+
+        model_init_kwargs (`dict[str, Any]`, *optional*):
+            Keyword arguments used when the `model` argument is passed as a string.
+        max_prompt_length (`int` or `None`, *optional*, defaults to `512`):


Not necessarily for this PR, but I think that, in general, truncating the prompt isn't a good idea: it truncates the generation prompt (<|im_end|><|im_start|>assistant\n), so the generation basically completes the user query instead of answering it, which doesn't make sense, and I don't think there is anything to learn from it. Unless there is a good reason to keep it, I recommend removing it.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a16b033390

chatgpt-codex-connector · 2026-03-21T17:25:13Z

+        if self.generation_batch_size % self.num_generations != 0:
+            raise ValueError(
+                f"generation_batch_size ({self.generation_batch_size}) must be divisible by num_generations ({self.num_generations})."


Validate eval batch divisibility for num_generations_eval

SelfDistillationConfig.__post_init__() only checks that the training generation batch is divisible by num_generations, but SDPO evaluation also groups samples by num_generations_eval. If a user enables eval with a global eval batch that is not divisible by that value, OnlineRolloutMixin._generate_and_score_completions() later does rewards.view(-1, num_generations) and will either raise or mix prompt groups incorrectly during evaluate(). GRPO already guards against this shape constraint, so SDPO needs the same validation here.
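A minimal sketch of the suggested guard, using a stand-in dataclass rather than the real `SelfDistillationConfig`. The class name `RolloutBatchConfig` and the `eval_batch_size` field are hypothetical, and falling back to `num_generations` when `num_generations_eval` is unset is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RolloutBatchConfig:
    """Hypothetical stand-in for the batch-related config fields."""

    generation_batch_size: int
    num_generations: int
    eval_batch_size: Optional[int] = None       # hypothetical global eval batch
    num_generations_eval: Optional[int] = None  # assumed to fall back to num_generations

    def __post_init__(self):
        # Training check, as in the diff above.
        if self.generation_batch_size % self.num_generations != 0:
            raise ValueError(
                f"generation_batch_size ({self.generation_batch_size}) must be "
                f"divisible by num_generations ({self.num_generations})."
            )
        # Eval check: the same shape constraint, applied to the eval batch.
        if self.eval_batch_size is not None:
            n_eval = self.num_generations_eval or self.num_generations
            if self.eval_batch_size % n_eval != 0:
                raise ValueError(
                    f"eval batch size ({self.eval_batch_size}) must be "
                    f"divisible by num_generations_eval ({n_eval})."
                )


ok = RolloutBatchConfig(8, 4, eval_batch_size=6, num_generations_eval=3)

try:
    RolloutBatchConfig(8, 4, eval_batch_size=6, num_generations_eval=4)
    eval_error = None
except ValueError as err:
    eval_error = str(err)
```

Failing at config construction surfaces the problem before any rollout is generated, instead of at the later reshape of the rewards.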

chatgpt-codex-connector · 2026-03-21T17:25:13Z

+        loss = self._compute_self_distillation_loss(model, inputs)
+        return loss / self.current_gradient_accumulation_steps


Avoid scaling SDFT eval loss by accumulation steps

SDFTTrainer.compute_loss() always divides by current_gradient_accumulation_steps, but prediction_step() reuses this path during evaluate(). When gradient_accumulation_steps > 1, the reported eval loss is therefore smaller by that factor, which can skew checkpoint selection or early-stopping decisions. The online self-distillation trainers already special-case eval here; SDFT should do the same.
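The fix amounts to applying the accumulation scaling only on the training path. The sketch below shows the idea on plain floats; `scale_loss` and its `training` flag are hypothetical names, with the flag standing in for whatever signal the trainer uses (for example `model.training`).

```python
def scale_loss(loss, gradient_accumulation_steps, training):
    """Divide by the accumulation steps only while training.

    During evaluation the loss is returned unchanged so the reported
    metric does not depend on gradient_accumulation_steps.
    """
    if training:
        return loss / gradient_accumulation_steps
    return loss


train_loss = scale_loss(2.0, gradient_accumulation_steps=4, training=True)
eval_loss = scale_loss(2.0, gradient_accumulation_steps=4, training=False)
```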

cursor · 2026-03-22T20:22:01Z

+        if alpha == 0.0:
+            kl = F.kl_div(student_log_probs, teacher_log_probs, reduction="none", log_target=True)
+        elif alpha == 1.0:
+            kl = F.kl_div(teacher_log_probs, student_log_probs, reduction="none", log_target=True)


KL divergence direction is swapped for forward/reverse

High Severity

The _compute_divergence method swaps forward and reverse KL directions. alpha=0.0 is documented as "forward KL" (KL(teacher||student)), but F.kl_div(input=student, target=teacher) computes sum(teacher * (log(teacher) - student)) which equals KL(teacher||student). Actually wait — F.kl_div with log_target=True computes exp(target) * (target - input), so F.kl_div(student, teacher) = exp(teacher) * (teacher - student) = KL(teacher||student). For alpha=1.0 (reverse KL = KL(student||teacher)), F.kl_div(teacher, student) = exp(student) * (student - teacher) = KL(student||teacher). This is actually correct. I withdraw this bug.
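The conclusion above can be verified numerically without PyTorch. The sketch below reimplements the pointwise formula that `F.kl_div` documents for `log_target=True`, namely `exp(target) * (target - input)`, summed over the vocabulary, and compares it with a textbook KL. The helper names and the toy distributions are illustrative only.

```python
import math


def kl_div_log_target(input_log_probs, target_log_probs):
    """Pure-Python analogue of F.kl_div(input, target, log_target=True),
    summed over all elements: sum of exp(target) * (target - input)."""
    return sum(
        math.exp(t) * (t - i) for i, t in zip(input_log_probs, target_log_probs)
    )


def kl(p, q):
    """Textbook KL(p || q) for discrete distributions given as probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
teacher_log = [math.log(x) for x in teacher]
student_log = [math.log(x) for x in student]

# alpha == 0.0 branch: F.kl_div(student_log_probs, teacher_log_probs, ...)
forward = kl_div_log_target(student_log, teacher_log)
# alpha == 1.0 branch: F.kl_div(teacher_log_probs, student_log_probs, ...)
reverse = kl_div_log_target(teacher_log, student_log)

forward_matches = math.isclose(forward, kl(teacher, student), rel_tol=1e-9)
reverse_matches = math.isclose(reverse, kl(student, teacher), rel_tol=1e-9)
```

The target argument is the distribution the expectation is taken over, so passing the teacher as target gives KL(teacher||student) and passing the student as target gives KL(student||teacher).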

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 7 total unresolved issues (including 5 from previous reviews).

cursor · 2026-03-22T20:47:01Z

+    top_entropy_quantile: float = field(
+        default=1.0,
+        metadata={"help": "Reserved for entropy-based token filtering."},
+    )


Unused config fields declared as reserved placeholders

Low Severity

top_entropy_quantile and use_transformers_paged are declared in SelfDistillationConfig with "Reserved" metadata but are never referenced anywhere in the trainer logic, loss computation, or generation code. These dead config fields add user-facing API surface that does nothing, and could confuse users into thinking they affect behavior.

Additional Locations (1)

trl/experimental/self_distillation/self_distillation_config.py#L150-L154

Neelectric · 2026-03-23T15:40:13Z

It looks like SDFTConfig has True as the default parameter for disable_dropout, while SelfDistillationConfig uses False as the default parameter.

The latter more closely matches the reference implementation by Shenfeld et al., where disable_dropout also uses False as the default parameter, and I did not find any overwrites for this elsewhere in their repo. Should SDFTConfig also default to False for consistency?

1dividedby0 · 2026-03-25T04:00:32Z

Does this implementation do online feedback? I see that the privileged context is generated offline as part of the dataset. Are there any plans to make this online (i.e. using the privileged context that is generated after the rollout)?

MengAiDev and others added 9 commits January 30, 2026 10:04

move to experimental

b382ea5

rename

4139122

remove example

9afaa0b

add docs

4de7cfb

fix tests and formatting

2ece95a

added paper index

63e9423

align loss hyper-params with paper suggestion

0d07988

update the docs

0c0f4d7

kashif added 2 commits February 2, 2026 13:16

add helper to make teacher prompt

4c321e9

Merge branch 'main' into 4929

cbf221c

kashif added 2 commits March 1, 2026 16:50

Merge branch 'main' into 4929

067322f

Merge branch 'main' into 4929

fec16e5

cursor Bot reviewed Mar 6, 2026

Comment thread trl/experimental/sdpo/sdpo_trainer.py Outdated

Comment thread trl/experimental/sdpo/sdpo_trainer.py Outdated

kashif added 4 commits March 8, 2026 13:36

refactored to a base self-distillation trainer and specific sdpo and sdft trainers

b91901b

add expected dataset format

220ed91

added sdft paper index

20cfdf0

cleanup config

12cbe91

cursor Bot reviewed Mar 8, 2026

Comment thread trl/experimental/self_distillation/base_self_distillation_trainer.py Outdated

Comment thread trl/experimental/sdpo/sdpo_trainer.py

Comment thread trl/experimental/self_distillation/self_distillation_mixin.py Outdated

add sdft example

90f13f6

cursor Bot reviewed Mar 8, 2026

Comment thread trl/experimental/sdpo/sdpo_trainer.py

Comment thread trl/experimental/self_distillation/teacher_context.py Outdated

kashif added 5 commits March 20, 2026 21:09

remove unneeded properties from BaseSelfDistillationTrainer

df867e4

add _paper

f9c9ef8

move PEFTAdapterEMACallback to experimental

4b89333

remove stale import

a6e586d

fix test

175230d

kashif added 2 commits March 20, 2026 21:39

Completion tensors are padded to the local max length per rank; align shapes before gathering.

5983d8c

Merge branch 'main' into 4929

c2ab993

cursor Bot reviewed Mar 20, 2026

Comment thread trl/experimental/sdpo/sdpo_trainer.py

pad completion_mask

ab7f630

cursor Bot reviewed Mar 20, 2026

Comment thread trl/experimental/sdpo/sdpo_trainer.py

Comment thread trl/experimental/self_distillation/teacher_context.py

kashif added 2 commits March 21, 2026 07:22

only padded_completion_ids is used for the cross-rank gather

7ac41fa

check role validation

20e17db

cursor Bot reviewed Mar 21, 2026

_set_signature_columns_if_needed moved to mixin

f0b8246

cursor Bot reviewed Mar 21, 2026

kashif and others added 2 commits March 21, 2026 14:30

Count groups with any successful rollout

d0e8cbc

Merge branch 'main' into 4929

a16b033

qgallouedec approved these changes Mar 21, 2026

qgallouedec reviewed Mar 21, 2026

chatgpt-codex-connector Bot reviewed Mar 21, 2026

check num_generations_eval are divisible

c4a3bab

cursor Bot reviewed Mar 22, 2026

scale loss for grad acc only during training

33b8d5b

cursor Bot reviewed Mar 22, 2026

remove ref_model reference

bf4cc67

kashif merged commit 9b59eed into huggingface:main Mar 23, 2026
1 check passed
