[Experimental] Add SDFT trainer, config, docs, and tests #4941

Closed
Shekswess wants to merge 10 commits into huggingface:main from Shekswess:feature/sdft-trainer

@Shekswess

What does this PR do?

Adds an experimental Self‑Distillation Fine‑Tuning (SDFT) trainer to TRL, including:

  • SDFTTrainer + SDFTConfig under trl.experimental.sdft
  • strict dataset validation for prompt/teacher_prompt
  • teacher handling (explicit or defaulted from student checkpoint)
  • docs page + toctree entry
  • unit tests (init checks + low‑priority training smoke test)
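A minimal sketch of what strict validation for the `prompt`/`teacher_prompt` columns might look like. The column names come from the description above; the function names and the "non-empty string" rule are assumptions for illustration, not the PR's actual checks (a real trainer may also accept conversational prompts).

```python
REQUIRED_COLUMNS = ("prompt", "teacher_prompt")  # column names from the PR description


def validate_sdft_example(example: dict) -> None:
    """Raise ValueError if a row lacks a usable prompt or teacher_prompt.

    Hypothetical helper: the rule "must be a non-empty string" is an assumption.
    """
    for column in REQUIRED_COLUMNS:
        if column not in example:
            raise ValueError(f"Missing required column: {column!r}")
        value = example[column]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"Column {column!r} must be a non-empty string")


def validate_sdft_dataset(rows) -> int:
    """Validate every row and return how many rows were checked."""
    count = 0
    for row in rows:
        validate_sdft_example(row)
        count += 1
    return count
```

Failing early at init time, rather than mid-training, is the point of such a check: a missing `teacher_prompt` would otherwise only surface once the teacher pass runs.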

Fixes #4940

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@qgallouedec

@Shekswess

Shekswess commented Jan 31, 2026

Author

@qgallouedec maybe this is not the perfect implementation, but overall I think it's okay. I followed the original code from the authors of the paper, and that was kinda messy hahahahaha

Any comments on how we can get this into the best shape possible (improvements, coverage, etc.) are welcome; feel free to drop them and I can help with this one. This is my first PR like this, so I'm really excited. ❤️

P.S. I want this trainer to be added as an experimental trainer because I want to do active research on self-distillation methods for tiny language models. I see it as a possibility to make them even more powerful.

@jonhue

jonhue commented Feb 1, 2026


Hello @Shekswess, one of the authors here 👋
I had a quick glance over the code & it looks good. Thanks so much for making this PR!!

Currently, this implementation is for offline training (i.e., training on a fixed dataset of teacher prompts). I was wondering whether we could easily extend this implementation to online training too?
This is what we did in our other paper: https://github.com/lasgroup/SDPO. Unfortunately, our implementation for this is in verl, but it is one-to-one the same algorithm. The only difference is where the teacher prompts come from. In SDPO, the teacher prompts are created "online" from generated trajectories that the environment marks as correct, together with any other rich signal the environment returns.

Do you think it would make sense to integrate these into one implementation of self-distillation?
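A rough sketch of the "online" teacher-prompt construction described above, assuming rollouts arrive as dicts with `completion` and `correct` keys. The function name, the dict keys, and the prompt template are all hypothetical; they are not taken from the SDPO or verl code.

```python
from typing import Optional


def build_online_teacher_prompt(prompt: str, rollouts: list, feedback: str = "") -> Optional[str]:
    """Build a teacher prompt from rollouts the environment marked as correct.

    Each rollout is a dict with hypothetical keys "completion" (str) and
    "correct" (bool). Returns None when there is neither a correct rollout
    nor any feedback, so the caller can skip the example.
    """
    solved = [r["completion"] for r in rollouts if r["correct"]]
    if not solved and not feedback:
        return None
    parts = [prompt]
    if solved:
        # Assumption: use the first correct trajectory as the demonstration.
        parts.append("Reference solution:\n" + solved[0])
    if feedback:
        parts.append("Environment feedback:\n" + feedback)
    return "\n\n".join(parts)
```

In the offline setting the same field would simply be read from the dataset, so the rest of the training step could stay unchanged.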

@Shekswess

Author

Heyoooo @jonhue !
First of all, really awesome job with this idea, the paper, and everything you guys have done in general with this approach. I love it and I cannot wait to test this on tiny language models. About the implementation: I think we can actually do an "online" version of this. Before starting to modify the code, I would like to consult @qgallouedec, because this legend knows a lot about trl (as main contributor) and what could be the best way of implementing this in trl. I can do all the heavy work, so no worries on that front. The only help I would need is some advice on how best to handle this in trl. This would be huge for a lot of research folks because the whole approach looks really promising to me, and I cannot wait to get my hands dirty hahahaha.

@jonhue

jonhue commented Feb 2, 2026


Amazing! Happy to help!

@qgallouedec

Member

Hey, sorry for the late review, this one is quite big! Thank you!

At this point, I have a few remarks:

  • I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation, I think we should enforce that.
  • in my understanding, we only generate one completion from the teacher and one from the model, so it's fair to just hard code num_generations=1
  • In the paper they say "If needed, add importance sampling to compensate for differences between the inference engine (e.g., VLLM) and the training code." but don't provide guidance on how they do it. I think for the first implementation we should just drop this IS correction completely, and maybe add it later.
  • There have been many changes in GRPO over the last days/weeks, including tool-calling support, which seems to be very important in the paper. I'm trying to integrate all of them in this PR: [WIP] Integrate latest changes to SDFT Shekswess/trl#1. It is a work in progress and not working yet, so no need to review it.
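For reference, one common way to do the importance-sampling correction mentioned above is truncated importance sampling: weight each token's loss by the ratio of training-code probability to inference-engine probability, capped at a constant. This is a generic technique chosen for illustration; the paper does not specify its method, and the function names and `cap` parameter are hypothetical.

```python
import math


def truncated_is_weights(train_logps, sampler_logps, cap=2.0):
    """Per-token weights min(exp(train_logp - sampler_logp), cap).

    train_logps: log-probs of the sampled tokens under the training code.
    sampler_logps: log-probs of the same tokens under the inference engine.
    cap: hypothetical truncation constant that bounds the variance.
    """
    if len(train_logps) != len(sampler_logps):
        raise ValueError("log-prob sequences must have the same length")
    return [min(math.exp(t - s), cap) for t, s in zip(train_logps, sampler_logps)]


def weighted_mean_loss(per_token_losses, weights):
    """Average the per-token losses after applying the importance weights."""
    if len(per_token_losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(l * w for l, w in zip(per_token_losses, weights)) / len(weights)
```

When both engines agree exactly, every weight is 1 and the loss is unchanged, which is why dropping the correction in a first version is a reasonable simplification.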

@perceptiveshawty

> I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation, I think we should enforce that.

this makes sense if it's meant to serve as a reference implementation / reproduce results.

that said, it would be easy to just default to self-distillation when ref_model isn't explicitly provided. then having the optionality will support research into effective teachers, differences between model families, etc.

@jonhue

jonhue commented Feb 25, 2026


> I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation, I think we should enforce that.

@qgallouedec I'd suggest keeping the separate ref model. Two main axes that were important in our experiments were:

  • using a slowly moving teacher (EMA of student's weights); and
  • swapping out the reverse-KL for other divergences like Jensen-Shannon

Regularizing the teacher parameters has been particularly important.
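The two axes above can be sketched in plain Python, assuming parameters are stored as a dict of float lists and token distributions as probability lists. The function names and the `decay` value are illustrative assumptions; a real trainer would do the same arithmetic on tensors.

```python
import math


def ema_update(teacher: dict, student: dict, decay: float = 0.99) -> None:
    """In place: teacher <- decay * teacher + (1 - decay) * student."""
    for name, s_values in student.items():
        t_values = teacher[name]
        for i, s in enumerate(s_values):
            t_values[i] = decay * t_values[i] + (1.0 - decay) * s


def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(student_probs, teacher_probs):
    """Reverse KL as used in distillation: KL(student || teacher)."""
    return kl(student_probs, teacher_probs)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With `decay` close to 1 the teacher moves slowly, which is one way to read "regularizing the teacher parameters"; setting `decay=0` would recover a teacher that always equals the current student.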

@Shekswess

Author

@qgallouedec @perceptiveshawty hey legends, I have some spare time this weekend. If I can help somehow, feel free to point out how I can be of most use :D

@kashif

kashif commented Mar 16, 2026

Collaborator

@Shekswess in #4935 we have refactored the trainer into a base self-distillation trainer so that we can implement the different self-distillation methods, and it also includes an SDFT trainer implementation, if you want to have a look?

@kashif

kashif commented Mar 23, 2026

Collaborator

@Shekswess so the self-distillation trainer and 2 methods based on it are now merged into main. Can you kindly check whether your optimisations can be carried over to those 2 methods? If you could add vLLM support or even tool calling, that would be great! Apologies again for the wasted work!

@kashif kashif closed this Mar 23, 2026
@kashif

kashif commented Mar 23, 2026

Collaborator

the two things that would be great would be:

  • vLLM integration
  • top_entropy_quantile masking: already exists as a config field in SelfDistillationConfig (used by GSPO/PAPO), just not wired into SDFT yet
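A minimal sketch of entropy-quantile masking, assuming `top_entropy_quantile` means "keep only that fraction of tokens with the highest entropy in the loss". That reading, and the helper names below, are assumptions for illustration; the merged config may define the field differently.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_entropy_mask(entropies, top_entropy_quantile):
    """Return a boolean mask keeping the highest-entropy fraction of tokens.

    top_entropy_quantile=1.0 keeps every token; 0.2 keeps roughly the top 20%.
    Ties at the threshold are kept, so slightly more tokens may pass.
    """
    if not 0.0 < top_entropy_quantile <= 1.0:
        raise ValueError("top_entropy_quantile must be in (0, 1]")
    if not entropies:
        return []
    n_keep = max(1, math.ceil(top_entropy_quantile * len(entropies)))
    threshold = sorted(entropies, reverse=True)[n_keep - 1]
    return [e >= threshold for e in entropies]
```

The mask would then multiply the per-token distillation loss, so low-entropy (already confident) positions contribute nothing to the gradient.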

@Shekswess

Author

@kashif Hey, I'm happy that we finally have this included hehehehe
I can start experimenting right now!

Development

Successfully merging this pull request may close these issues.

SDFT: Self-Distillation Fine-Tuning Trainer

5 participants