[Experimental] Add SDFT trainer, config, docs, and tests by Shekswess · Pull Request #4941 · huggingface/trl

Shekswess · 2026-01-31T14:24:48Z

What does this PR do?

Adds an experimental Self‑Distillation Fine‑Tuning (SDFT) trainer to TRL, including:

SDFTTrainer + SDFTConfig under trl.experimental.sdft
strict dataset validation for prompt/teacher_prompt
teacher handling (explicit or defaulted from student checkpoint)
docs page + toctree entry
unit tests (init checks + low‑priority training smoke test)
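The strict dataset validation mentioned above could look roughly like the sketch below. This is an illustration only, not the code from this PR: the helper names `validate_sdft_example` and `validate_sdft_dataset` are hypothetical, and the exact rules (which types are accepted, how empty values are treated) are assumptions. Only the two column names, `prompt` and `teacher_prompt`, come from the description.

```python
def validate_sdft_example(example: dict) -> None:
    """Raise ValueError unless `example` has usable `prompt` and `teacher_prompt` fields.

    Hypothetical helper: the accepted types below are an assumption.
    """
    for key in ("prompt", "teacher_prompt"):
        if key not in example:
            raise ValueError(f"missing required column: {key!r}")
        value = example[key]
        # Accept plain strings or conversational message lists; reject empty values.
        if not isinstance(value, (str, list)) or len(value) == 0:
            raise ValueError(f"column {key!r} must be a non-empty string or message list")


def validate_sdft_dataset(rows) -> int:
    """Validate every row and return the number of rows checked."""
    count = 0
    for row in rows:
        validate_sdft_example(row)
        count += 1
    return count
```

Failing fast at construction time, before any model is loaded, keeps a malformed dataset from surfacing as an obscure error in the middle of training.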

Fixes #4940

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

@qgallouedec

Shekswess · 2026-01-31T14:29:14Z

@qgallouedec maybe this is not the perfect implementation, but overall I think it's okay. I followed the original code from the paper's authors, and that was kinda messy hahahahaha

Any comments on how to get this into the best possible shape (improvements, coverage, etc.) are welcome; feel free to drop them here and I'll work through them. This is my first PR like this, so I'm really excited. ❤️

P.S. I want this trainer added as an experimental trainer because I want to do active research on self-distillation methods for tiny language models. I see it as a way to make them even more powerful.

jonhue · 2026-02-01T20:07:36Z

Hello @Shekswess, one of the authors here 👋
I had a quick glance over the code & it looks good. Thanks so much for making this PR!!

Currently, this implementation is for offline training (i.e., training on a fixed dataset of teacher prompts). I was wondering whether we could easily extend it to online training too?
This is what we did in our other paper: https://github.com/lasgroup/SDPO. Unfortunately, our implementation of this is in verl, but it is one-to-one the same algorithm. The only difference is where the teacher prompts come from. In SDPO, the teacher prompts are created "online" from generated trajectories that the environment marks as correct, together with any other rich signal the environment returns.

Do you think it would make sense to integrate these into one implementation of self-distillation?
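To make the offline/online distinction concrete, here is a minimal sketch of how a teacher prompt might be assembled online from rollouts. It is not taken from the SDPO or TRL code: the `Rollout` type, the `build_teacher_prompt` helper, and the prompt template are all hypothetical, and choosing the first correct rollout is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    completion: str      # text generated by the current model
    correct: bool        # verdict returned by the environment
    feedback: str = ""   # any richer signal the environment returned


def build_teacher_prompt(prompt: str, rollouts: list[Rollout]) -> Optional[str]:
    """Return a teacher prompt built from the first correct rollout, or None.

    Hypothetical template: the wording of the added context is an assumption.
    """
    for rollout in rollouts:
        if rollout.correct:
            parts = [prompt, "", "A solution verified as correct:", rollout.completion]
            if rollout.feedback:
                parts += ["", "Environment feedback:", rollout.feedback]
            return "\n".join(parts)
    # No correct trajectory was found, so there is nothing to condition the teacher on.
    return None
```

In the offline setting the same string would simply be read from the `teacher_prompt` column of a fixed dataset, so the distillation step itself does not need to change.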

Shekswess · 2026-02-02T00:07:53Z

Heyoooo @jonhue !
First of all, really awesome job with this idea, the paper, and everything you've all done with this approach. I love it and I cannot wait to test it on tiny language models. About the implementation: I think we can actually do an "online" version of this. Before modifying any code, I would like to consult @qgallouedec, because this legend knows a lot about trl (as main contributor) and what the best way of implementing this in trl would be. I can do all the heavy work, so no worries on that front. The only help I would need is some advice on how best to handle this in trl. This would be huge for a lot of research folks, because the whole approach looks really promising to me, and I cannot wait to get my hands dirty hahahaha.

jonhue · 2026-02-02T06:54:03Z

Amazing! Happy to help!

qgallouedec · 2026-02-18T04:02:41Z

Hey, sorry for the late review, this one is quite big! Thank you!

At this point, I have a few remarks:

I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation, I think we should enforce that.
In my understanding, we only generate one completion from the teacher and one from the model, so it's fair to just hard-code num_generations=1.
In the paper they say "If needed, add importance sampling to compensate for differences between the inference engine (e.g., VLLM) and the training code.", but they don't provide guidance on how to do it. I think for the first implementation we should drop this IS correction completely and maybe add it later.
There have been many changes in GRPO over the last days/weeks, including tool-calling support, which seems to be very important in the paper. I'm trying to integrate all of them in this PR: [WIP] Integrate latest changes to SDFT Shekswess/trl#1. It is a work in progress and not working yet, so no need to review it.

perceptiveshawty · 2026-02-19T03:09:52Z

> I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation, I think we should enforce that.

this makes sense if it's meant to serve as a reference implementation / to reproduce results.

that said, it would be easy to just default to self-distillation when ref_model isn't explicitly provided. Having that option would then support research into effective teachers, differences between model families, etc.
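The defaulting behaviour suggested here fits in a few lines. This is a sketch of the design choice only: `resolve_teacher` is a hypothetical helper, not part of TRL, and `student` and `ref_model` stand in for whatever model objects the trainer actually receives.

```python
def resolve_teacher(student, ref_model=None):
    """Return (teacher, is_self_distillation) for a hypothetical trainer setup."""
    if ref_model is None:
        # No explicit teacher was given: the student acts as its own teacher.
        return student, True
    # An explicit teacher was given: use it and report that this is not self-distillation.
    return ref_model, False
```

With this shape the default path reproduces the paper's setting, while passing `ref_model` keeps the door open for experiments with other teachers.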

jonhue · 2026-02-25T09:21:45Z

> I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation, I think we should enforce that.

@qgallouedec I'd suggest keeping the separate ref model. Two main axes that were important in our experiments were:

using a slowly moving teacher (EMA of student's weights); and
swapping out the reverse-KL for other divergences like Jensen-Shannon

Regularizing the teacher parameters has been particularly important.
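Both axes can be illustrated on plain Python lists. The sketch below is not the authors' implementation: weights are flat lists of floats rather than model parameters, the decay value is arbitrary, and the function names are chosen for this example. It shows an EMA teacher update, the reverse KL (expectation taken under the student), and the Jensen-Shannon divergence as a symmetric alternative.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(student, teacher):
    """KL(student || teacher): the expectation is taken under the student."""
    return kl(student, teacher)


def jensen_shannon(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A decay close to 1 makes the teacher trail the student slowly, which is one way to read "regularizing the teacher parameters". Unlike the reverse KL, the Jensen-Shannon divergence is symmetric in its arguments and bounded above by log 2.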

Shekswess · 2026-02-26T23:47:06Z

@qgallouedec @perceptiveshawty hey legends, I have some spare time this weekend. If I can help somehow, feel free to point out where I can be most useful :D

kashif · 2026-03-16T10:58:35Z

@Shekswess in #4935 we have refactored the trainer into a base self-distillation trainer so that we can implement the different self-distillation methods and also have an SDFT trainer implementation. Would you like to have a look?

kashif · 2026-03-23T10:20:18Z

@Shekswess the self-distillation trainer and 2 methods based on it are now merged into main. Could you check whether your optimisations can be carried over to those 2 methods? If you could add vLLM support or even tool calling, that would be great! Apologies again for the wasted work!

kashif · 2026-03-23T10:21:32Z

The two things that would be great are:

vLLM integration
top_entropy_quantile masking: already exists as a config field in SelfDistillationConfig (used by GSPO/PAPO), just not wired into SDFT yet
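For readers unfamiliar with the second item, the sketch below shows one way top-entropy masking can work, under the assumption that `top_entropy_quantile` means "keep only the given fraction of tokens with the highest entropy in the loss". The helper names are hypothetical and the rank-based threshold is a simplification; the thread does not show how TRL computes it.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_entropy_mask(entropies, quantile):
    """Return a keep-mask selecting roughly the top `quantile` fraction by entropy.

    Assumed semantics: quantile=1.0 keeps every token, quantile=0.2 keeps the
    20% of tokens with the highest entropy. Ties at the threshold are all kept.
    """
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    if not entropies:
        return []
    k = max(1, math.ceil(quantile * len(entropies)))
    # The k-th largest entropy is the cut-off for staying in the loss.
    threshold = sorted(entropies, reverse=True)[k - 1]
    return [e >= threshold for e in entropies]
```

The resulting boolean mask would be multiplied into the per-token loss, so low-entropy tokens contribute nothing to the gradient.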

Shekswess · 2026-03-23T10:42:13Z

@kashif Hey, I'm happy that we finally have this included hehehehe
I can start experimenting right now!

Shekswess added 9 commits January 31, 2026 15:11

Add SDFT experimental init

7f3a94a

Add SDFT config

742c67e

Add SDFT trainer

0854002

Add SDFT docs page

7b425da

Add SDFT to docs toctree

acf6548

Add SDFT trainer tests

c115461

Format SDFT init

f0f3411

Format SDFT config

5d641a3

Format SDFT trainer

955d0bd

jonhue mentioned this pull request Feb 2, 2026

Add SDPO (Self-Distillation Policy Optimization) trainer #4935

Merged

kirawi mentioned this pull request Feb 4, 2026

What is the difference between it and SDFT? lasgroup/SDPO#6

Closed

Merge branch 'main' into feature/sdft-trainer

1b16c7c

kashif closed this Mar 23, 2026

Shekswess wants to merge 10 commits into huggingface:main from Shekswess:feature/sdft-trainer

5 participants
