huggingface · kashif · Mar 23, 2026 · Jan 30, 2026 · Feb 2, 2026 · Feb 2, 2026
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -127,6 +127,10 @@
     title: PPO
   - local: prm_trainer
     title: PRM
+  - local: sdft_trainer
+    title: SDFT
+  - local: sdpo_trainer
+    title: SDPO
   - local: winrate_callback
     title: WinRateCallback
   - local: xpo_trainer

diff --git a/docs/source/paper_index.md b/docs/source/paper_index.md
@@ -1630,6 +1630,85 @@ trainer.train()
 
 For more details, see the [MiniLLM Trainer documentation](minillm) documentation.
 
+### Reinforcement Learning via Self-Distillation
+
+**📜 Paper**: https://huggingface.co/papers/2601.20802
+
+Self-Distillation Policy Optimization (SDPO) enhances reinforcement learning with verifiable rewards by converting rich textual feedback (e.g., runtime errors, judge evaluations) into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts.
+
+```python
+from trl.experimental.sdpo import SDPOConfig, SDPOTrainer
+
+training_args = SDPOConfig(
+    distillation_alpha=0.5,                # Jensen-Shannon divergence (recommended)
+    distillation_topk=100,                 # Top-K logit distillation approximation
+    full_logit_distillation=True,          # Required for top-K logit-level SDPO
+    distillation_is_clip=2.0,              # Importance sampling clipping
+    distillation_weight=1.0,               # Weight for self-distillation loss
+    sdpo_policy_loss_mode="distillation_only",
+    use_successful_as_teacher=True,        # Use successful rollouts as teacher
+    teacher_regularization="ema",          # Supported: "ema", "none"
+    teacher_update_rate=0.05,              # EMA update rate
+    include_environment_feedback=False,    # Set to True to use dataset privileged_context in teacher reprompts
+)
+
+trainer = SDPOTrainer(
+    model="Qwen/Qwen2.5-1.5B-Instruct",
+    reward_funcs=...,
+    args=training_args,
+    train_dataset=...,
+)
+trainer.train()
+```
+
+Expected dataset columns:
+
+- `prompt`
+- `privileged_context` for optional environment feedback
+
+For more details, see the [SDPO Trainer documentation](sdpo_trainer).
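
As a rough illustration of what `distillation_alpha` interpolates between, the sketch below computes a skewed Jensen-Shannon divergence over two small next-token distributions. It assumes the generalized-JSD parameterization in which `alpha=1.0` reduces to reverse KL and `alpha=0.5` gives the symmetric Jensen-Shannon divergence. The behavior at `alpha=0.0` (forward KL) and the names `kl` and `skewed_jsd` are assumptions for illustration, and the actual TRL loss (top-K approximation, masking, importance-sampling clipping) may differ.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def skewed_jsd(student, teacher, alpha):
    """Generalized JSD between student and teacher distributions (illustrative only).

    alpha=0.0 -> forward KL(teacher || student)   (assumed endpoint)
    alpha=1.0 -> reverse KL(student || teacher)
    otherwise -> alpha * KL(teacher || m) + (1 - alpha) * KL(student || m),
                 with mixture m = alpha * teacher + (1 - alpha) * student
    """
    if alpha == 0.0:
        return kl(teacher, student)
    if alpha == 1.0:
        return kl(student, teacher)
    mixture = [alpha * t + (1.0 - alpha) * s for s, t in zip(student, teacher)]
    return alpha * kl(teacher, mixture) + (1.0 - alpha) * kl(student, mixture)


student = [0.5, 0.5]
teacher = [0.9, 0.1]
print(skewed_jsd(student, teacher, alpha=0.5))  # symmetric Jensen-Shannon divergence
print(skewed_jsd(student, teacher, alpha=1.0))  # reverse KL
```

With `alpha=0.5` the value is unchanged when student and teacher are swapped, which is what makes the Jensen-Shannon setting symmetric.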
+
+### Self-Training with On-Policy Self-Distillation for Language Model Alignment
+
+**📜 Paper**: https://huggingface.co/papers/2601.19897
+
+Self-Distilled Fine-Tuning (SDFT) performs on-policy self-distillation by generating completions during training, then distilling an explicit teacher-conditioned view of those same completions back into the student. In TRL, SDFT shares its self-distillation core with SDPO but keeps its own explicit teacher, passed as `ref_model`, and relies on dataset-provided privileged context.
+The teacher prompt is composed internally from the student `prompt` plus the dataset `privileged_context`.
+
+```python
+from datasets import Dataset
+
+from trl.experimental.sdft import SDFTConfig, SDFTTrainer
+
+dataset = Dataset.from_dict(
+    {
+        "prompt": [[{"role": "user", "content": "Solve 2+2."}]],
+        "privileged_context": ["Example answer: 4."],
+    }
+)
+
+training_args = SDFTConfig(
+    distillation_alpha=0.5,
+    distillation_topk=5,
+    max_completion_length=64,
+)
+
+trainer = SDFTTrainer(
+    model="Qwen/Qwen2.5-1.5B-Instruct",
+    ref_model="Qwen/Qwen2.5-1.5B-Instruct",
+    args=training_args,
+    train_dataset=dataset,
+)
+trainer.train()
+```
+
+Expected dataset columns:
+
+- `prompt`
+- `privileged_context` containing only the extra teacher-only information
+
+For more details, see the [SDFT Trainer documentation](sdft_trainer).
+
 ## Distributed Training
 
 ### ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

diff --git a/docs/source/sdft_trainer.md b/docs/source/sdft_trainer.md
@@ -0,0 +1,82 @@
+# SDFT
+
+Self-Distilled Fine-Tuning (SDFT) is described in [Self-Training with On-Policy Self-Distillation for Language Model Alignment](https://huggingface.co/papers/2601.19897).
+
+The TRL implementation adapts SDFT to the experimental trainer API while reusing the shared self-distillation infrastructure also used by SDPO.
+
+In the current TRL implementation:
+
+- SDFT uses an explicit `ref_model` teacher
+- the dataset must provide both `prompt` and `privileged_context`
+- `privileged_context` contains only the extra teacher-only information; the trainer combines it with `prompt` to build the teacher prompt
+- `teacher_prompt_template` controls how `prompt` and `privileged_context` are combined into the teacher prompt
+- on-policy generation can use either the student prompt or the teacher-conditioned prompt via `generate_from_teacher`
+- `num_loss_tokens_to_skip` can exclude initial completion tokens from the distillation loss
+- SDFT currently supports text-only training and does not support `use_vllm=True`
+- the shared dataset contract is `prompt` plus `privileged_context`
+
+## Usage
+
+```python
+from datasets import Dataset
+
+from trl.experimental.sdft import SDFTConfig, SDFTTrainer
+
+dataset = Dataset.from_dict(
+    {
+        "prompt": [[{"role": "user", "content": "Solve 2+2."}]],
+        "privileged_context": ["Example answer: 4."],
+    }
+)
+
+training_args = SDFTConfig(
+    output_dir="sdft-model",
+    distillation_alpha=0.5,
+    distillation_topk=5,
+    max_completion_length=64,
+)
+
+trainer = SDFTTrainer(
+    model="Qwen/Qwen2.5-1.5B-Instruct",
+    ref_model="Qwen/Qwen2.5-1.5B-Instruct",
+    args=training_args,
+    train_dataset=dataset,
+)
+trainer.train()
+```
+
+To generate from the teacher-conditioned prompt instead of the student prompt, set `generate_from_teacher=True`.
+To customize how the teacher prompt is built, set `teacher_prompt_template` on [`SDFTConfig`].
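
To make the prompt composition concrete, here is a minimal sketch of combining a student `prompt` with `privileged_context` through a format-string template. The template text, its `{prompt}` and `{privileged_context}` placeholders, and the `build_teacher_prompt` helper are assumptions for illustration; check `SDFTConfig` for the placeholders that `teacher_prompt_template` actually expects.

```python
# Hypothetical template; the real default used by SDFT may differ.
TEACHER_PROMPT_TEMPLATE = "{prompt}\n\nAdditional context:\n{privileged_context}"


def build_teacher_prompt(prompt, privileged_context, template=TEACHER_PROMPT_TEMPLATE):
    """Return a teacher-side prompt for a plain-text or conversational student prompt."""
    if isinstance(prompt, str):
        return template.format(prompt=prompt, privileged_context=privileged_context)
    # Conversational prompt: rewrite only the final message, keep earlier turns intact.
    *history, last = prompt
    teacher_last = {
        **last,
        "content": template.format(prompt=last["content"], privileged_context=privileged_context),
    }
    return [*history, teacher_last]


student_prompt = [{"role": "user", "content": "Solve 2+2."}]
teacher_prompt = build_teacher_prompt(student_prompt, "Example answer: 4.")
print(teacher_prompt[0]["content"])
```

The helper returns a new list and leaves the student prompt untouched, so the same record can feed both the student and the teacher view.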
+
+## Expected dataset columns
+
+Each example must provide:
+
+- `prompt`: the student-facing prompt
+- `privileged_context`: only the extra teacher-only information, such as a demonstration, hint, or privileged feedback
+
+The trainer accepts both standard text prompts and conversational prompts.
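
The two prompt styles can be illustrated with plain Python records. The `check_sdft_example` helper below is a hypothetical pre-flight check, not part of TRL; it only verifies that both columns are present and that the prompt is either a string or a list of role/content messages.

```python
def check_sdft_example(example):
    """Raise ValueError if an example lacks the expected columns (illustrative check)."""
    for column in ("prompt", "privileged_context"):
        if column not in example:
            raise ValueError(f"missing required column: {column!r}")
    prompt = example["prompt"]
    is_text = isinstance(prompt, str)
    is_conversational = isinstance(prompt, list) and all(
        isinstance(m, dict) and {"role", "content"} <= m.keys() for m in prompt
    )
    if not (is_text or is_conversational):
        raise ValueError("prompt must be a string or a list of role/content messages")
    return "text" if is_text else "conversational"


text_example = {"prompt": "Solve 2+2.", "privileged_context": "Example answer: 4."}
chat_example = {
    "prompt": [{"role": "user", "content": "Solve 2+2."}],
    "privileged_context": "Example answer: 4.",
}
print(check_sdft_example(text_example), check_sdft_example(chat_example))
```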
+
+## Callbacks
+
+The trainer emits a small set of callback hooks that are useful for debugging, observability, and tests. These hooks are intended as practical integration points for experimental self-distillation workflows.
+
+Shared self-distillation hooks:
+
+- `on_self_distillation_batch_prepared`: fired when a self-distillation batch is ready. The payload includes `prompt_ids`, `completion_ids`, and, when importance-sampling clipping inputs are available, `old_per_token_logps`.
+- `on_generation_batch_built`: fired when a new buffered generation batch is created. The payload includes `generate_every` and `steps_per_generation`.
+
+SDFT-specific hook:
+
+- `on_generation_prompts_selected`: fired when SDFT chooses the prompt source for on-policy generation. The payload includes the selected `generation_prompts` and the corresponding `generation_prompt_text`.
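
A callback that listens to these hooks might look like the sketch below. It assumes the hooks follow the usual `transformers.TrainerCallback` convention of `(args, state, control, **kwargs)` and that the payload fields arrive as keyword arguments; both are assumptions rather than confirmed behavior. To keep the snippet dependency-free it is written as a plain class with a hypothetical name; in practice it would subclass `TrainerCallback` and be passed to the trainer through `callbacks=[...]`.

```python
class SelfDistillationLogger:
    """Records lightweight statistics from the self-distillation hooks (illustrative)."""

    def __init__(self):
        self.events = []

    def on_self_distillation_batch_prepared(self, args, state, control, **kwargs):
        prompt_ids = kwargs.get("prompt_ids")
        self.events.append(
            {
                "hook": "batch_prepared",
                "num_prompts": len(prompt_ids) if prompt_ids is not None else 0,
                # Only present when importance-sampling clipping inputs are available.
                "has_old_logps": kwargs.get("old_per_token_logps") is not None,
            }
        )

    def on_generation_prompts_selected(self, args, state, control, **kwargs):
        texts = kwargs.get("generation_prompt_text") or []
        self.events.append({"hook": "prompts_selected", "num_texts": len(texts)})


# Simulate the trainer firing the hooks with a fake payload.
logger = SelfDistillationLogger()
logger.on_self_distillation_batch_prepared(
    None, None, None, prompt_ids=[[1, 2], [3, 4]], completion_ids=[[5], [6]]
)
logger.on_generation_prompts_selected(None, None, None, generation_prompt_text=["Solve 2+2."])
print(logger.events)
```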
+
+## SDFTConfig
+
+[[autodoc]] experimental.sdft.SDFTConfig
+
+## SDFTTrainer
+
+[[autodoc]] experimental.sdft.SDFTTrainer
+    - train
+    - save_model
+    - push_to_hub
diff --git a/docs/source/sdpo_trainer.md b/docs/source/sdpo_trainer.md
@@ -0,0 +1,79 @@
+# SDPO
+
+Self-Distillation Policy Optimization (SDPO) was introduced in [Reinforcement Learning via Self-Distillation](https://huggingface.co/papers/2601.20802) by [Jonas Hübotter](https://huggingface.co/jonhue), Frederike Lübeck, Lejs Behric, [Anton Baumann](https://huggingface.co/antonbaumann), Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause.
+
+> Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
+
+The SDPO trainer is built on TRL's experimental shared self-distillation stack. It keeps the online rollout-and-reward training flow, then builds a teacher-conditioned view of the same completions from successful rollouts and optional environment feedback.
+
+In the current TRL implementation:
+
+- the default SDPO policy loss mode is `distillation_only`
+- `hybrid` mode is also available to combine the base policy loss with the self-distillation loss
+- supported teacher regularization modes are `ema` and `none`
+- `distillation_topk` is only valid when `full_logit_distillation=True`
+- when `full_logit_distillation=False`, SDPO uses token-level reverse KL and requires `distillation_alpha=1.0`
+- environment feedback can be injected into teacher reprompts when the dataset exposes a `privileged_context` column
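
The two argument constraints in the list above can be expressed as a small pre-flight check. The `check_sdpo_distillation_args` helper is hypothetical and is not TRL's own validation code; it only mirrors the rules stated here so that an inconsistent combination can be caught before building an `SDPOConfig`.

```python
def check_sdpo_distillation_args(full_logit_distillation, distillation_alpha, distillation_topk=None):
    """Return a list of problems; an empty list means the combination is consistent."""
    problems = []
    if distillation_topk is not None and not full_logit_distillation:
        problems.append("distillation_topk requires full_logit_distillation=True")
    if not full_logit_distillation and distillation_alpha != 1.0:
        problems.append("token-level reverse KL requires distillation_alpha=1.0")
    return problems


print(check_sdpo_distillation_args(True, 0.5, distillation_topk=100))   # []
print(check_sdpo_distillation_args(False, 0.5, distillation_topk=100))  # two problems
print(check_sdpo_distillation_args(False, 1.0))                         # []
```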
+
+## Expected dataset columns
+
+Each example must provide `prompt` and may additionally provide `privileged_context`:
+
+- `prompt`: the student-facing prompt
+- `privileged_context`: optional privileged text, such as environment feedback, used when `include_environment_feedback=True`
+
+## Usage
+
+```python
+from datasets import Dataset
+
+from trl.experimental.sdpo import SDPOConfig, SDPOTrainer
+
+dataset = Dataset.from_dict(
+    {
+        "prompt": [[{"role": "user", "content": "Solve 2+2."}]],
+        "privileged_context": ["Your earlier answer used the wrong format."],
+    }
+)
+
+training_args = SDPOConfig(
+    output_dir="sdpo-model",
+    distillation_topk=100,                 # Top-K logit distillation approximation
+    full_logit_distillation=True,          # Required for top-K; enables non-reverse divergences
+    include_environment_feedback=True,     # Use dataset privileged_context for teacher reprompts
+)
+
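+# Illustrative reward function, assuming the GRPO-style signature used elsewhere in TRL:
+# 1.0 when the (conversational) completion contains the expected answer, else 0.0.
+def reward_func(completions, **kwargs):
+    return [1.0 if "4" in completion[0]["content"] else 0.0 for completion in completions]
+
+trainer = SDPOTrainer(
+    model="Qwen/Qwen2.5-1.5B-Instruct",
+    reward_funcs=reward_func,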
+    args=training_args,
+    train_dataset=dataset,
+)
+trainer.train()
+```
+
+SDPO always requires a `prompt` column. To use environment feedback, also include a `privileged_context` column and set `include_environment_feedback=True`. SDPO builds teacher reprompts for self-distillation from successful rollouts and, when enabled, from that feedback text.
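
The idea of using successful rollouts as implicit feedback can be sketched without any model. The helper below is hypothetical and is not TRL's reprompt logic: the reward threshold, the prompt wording, and the name `build_teacher_reprompts` are all assumptions. For each failed rollout in a group it builds a teacher-side prompt that shows a successful sibling completion and, optionally, the environment feedback.

```python
def build_teacher_reprompts(prompt, completions, rewards, feedback=None, success_threshold=1.0):
    """Map the index of each failed rollout to a teacher prompt text (illustrative only)."""
    successes = [c for c, r in zip(completions, rewards) if r >= success_threshold]
    reprompts = {}
    for i, reward in enumerate(rewards):
        if reward >= success_threshold:
            continue  # successful rollouts need no correction
        parts = [prompt]
        if successes:
            parts.append(f"A correct solution to this problem:\n{successes[0]}")
        if feedback is not None:
            parts.append(f"Feedback on the previous attempt:\n{feedback}")
        if len(parts) > 1:  # skip rollouts with no extra signal to distill from
            reprompts[i] = "\n\n".join(parts)
    return reprompts


reprompts = build_teacher_reprompts(
    prompt="Solve 2+2.",
    completions=["The answer is 5.", "The answer is 4."],
    rewards=[0.0, 1.0],
    feedback="Your earlier answer used the wrong format.",
)
print(reprompts[0])
```

When a group contains no successful rollout and no feedback is supplied, the helper returns an empty mapping, reflecting that there is nothing extra for a self-teacher to condition on.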
+
+## Callbacks
+
+The trainer emits a small set of callback hooks that are useful for debugging, observability, and tests. These hooks are intended as practical integration points for experimental self-distillation workflows.
+
+Shared self-distillation hooks:
+
+- `on_self_distillation_batch_prepared`: fired when a self-distillation batch is ready. The payload includes `prompt_ids`, `completion_ids`, and, when importance-sampling clipping inputs are available, `old_per_token_logps`.
+- `on_generation_batch_built`: fired when a new buffered generation batch is created. The payload includes `generate_every` and `steps_per_generation`.
+
+SDPO-specific hook:
+
+- `on_teacher_context_built`: fired after SDPO constructs the teacher-conditioned inputs. The payload includes `teacher_input_ids`, `teacher_attention_mask`, `completion_mask`, and `self_distillation_mask`.
+
+## SDPOConfig
+
+[[autodoc]] experimental.sdpo.SDPOConfig
+
+## SDPOTrainer
+
+[[autodoc]] experimental.sdpo.SDPOTrainer
+    - train
+    - save_model
+    - push_to_hub