[GKD] Buffer Implementation for Distillation Trainer by cmpatino · Pull Request #5137 · huggingface/trl

cmpatino · 2026-02-20T22:27:41Z

Implement Buffer for Distillation Trainer (`GOLDTrainer`)

Implement generation buffering and multi-generation support for GOLDTrainer

Add a prompt-level generation buffer that decouples generation from the optimization steps. We adopt a buffer similar to GRPO's: all rollouts for all mini-batches within an optimization step are generated at once, which leverages parallel inference engines. Each worker therefore handles a buffer of `per_device_train_batch_size * gradient_accumulation_steps` samples.
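The buffering idea above can be sketched roughly as follows. This is a minimal illustration, not the code in `gold_trainer.py`: the `GenerationBuffer` class, its `next_micro_batch` method, and the `generate_fn` callback are hypothetical names chosen for the example.

```python
from typing import Callable, List


class GenerationBuffer:
    """Hypothetical buffer: generate once per optimizer window, then serve slices."""

    def __init__(
        self,
        generate_fn: Callable[[List[str]], List[str]],
        per_device_train_batch_size: int,
        gradient_accumulation_steps: int,
    ) -> None:
        self.generate_fn = generate_fn
        self.batch_size = per_device_train_batch_size
        self.accumulation_steps = gradient_accumulation_steps
        self._slices: List[List[str]] = []
        self.generate_calls = 0  # counts how often the inference engine is hit

    def next_micro_batch(self, window_prompts: List[str]) -> List[str]:
        # Refill only when every slice of the previous window has been consumed.
        if not self._slices:
            expected = self.batch_size * self.accumulation_steps
            if len(window_prompts) != expected:
                raise ValueError(f"expected {expected} prompts, got {len(window_prompts)}")
            completions = self.generate_fn(window_prompts)
            self.generate_calls += 1
            self._slices = [
                completions[i : i + self.batch_size]
                for i in range(0, expected, self.batch_size)
            ]
        return self._slices.pop(0)


# Toy "inference engine" that upper-cases prompts (assumption for the demo).
buffer = GenerationBuffer(lambda ps: [p.upper() for p in ps], 2, 3)
prompts = list("abcdef")
micro_batches = [buffer.next_micro_batch(prompts) for _ in range(3)]
```

In this sketch, three gradient-accumulation micro-steps trigger a single call to the generation function; a fourth call would start a new window and generate again.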

Buffer Details

We allow multiple rollouts per prompt, following Thinking Machines' Tinker example. The `num_generations` parameter sets the number of rollouts per prompt. To keep the effective batch size constant, we introduce the `generation_batch_size` parameter, which controls how many unique prompts are passed to the inference engine. We enforce `generation_batch_size = per_device_train_batch_size * gradient_accumulation_steps // num_generations` so that the effective batch size is invariant across setups.
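As a rough illustration of the constraint above, the check could look like the sketch below. The function name `resolve_generation_batch_size` and the exact error messages are assumptions; the actual validation lives in `GOLDConfig` and may differ in detail.

```python
from typing import Optional


def resolve_generation_batch_size(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_generations: int,
    generation_batch_size: Optional[int] = None,
) -> int:
    """Return the number of unique prompts per optimizer window (hypothetical helper)."""
    if num_generations < 1:
        raise ValueError("num_generations must be >= 1")
    window = per_device_train_batch_size * gradient_accumulation_steps
    # The window must split evenly into groups of num_generations rollouts.
    if window % num_generations != 0:
        raise ValueError(
            f"window size {window} is not divisible by num_generations={num_generations}"
        )
    expected = window // num_generations
    if generation_batch_size is not None and generation_batch_size != expected:
        raise ValueError(
            f"generation_batch_size must be {expected}, got {generation_batch_size}"
        )
    return expected


# 4 samples per device * 8 accumulation steps = 32 rollouts, 4 rollouts per prompt.
unique_prompts = resolve_generation_batch_size(4, 8, 4)
```

With these example values, each window holds 32 rollouts drawn from 8 unique prompts, so the number of samples per optimizer step stays the same whether `num_generations` is 1 or 4.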

Benchmarks

We can replicate Thinking Machines' results using both non-Liger and Liger losses, achieving a roughly 3x speedup (492.28 s vs. 173 s in total) on a setup with 8 training nodes in colocate mode.

| Phase | Tinker (s) | TRL (s) |
|---|---|---|
| Sampling | 329.83 | 130 |
| Loss | 37.96 | - |
| Training | 98.69 | 38 |
| Total | 492.28 | 173 |
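For reference, the per-phase ratios implied by the table can be computed as below. The numbers are copied from the table; the `speedups` helper is only an illustrative name, and the Loss row is skipped because no TRL value is reported for it.

```python
from typing import Dict

# Timings in seconds, copied from the benchmark table above.
tinker: Dict[str, float] = {"Sampling": 329.83, "Training": 98.69, "Total": 492.28}
trl: Dict[str, float] = {"Sampling": 130.0, "Training": 38.0, "Total": 173.0}


def speedups(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Ratio baseline/candidate for every phase present in both tables."""
    return {phase: baseline[phase] / candidate[phase] for phase in baseline if phase in candidate}


ratios = speedups(tinker, trl)
for phase, ratio in ratios.items():
    print(f"{phase}: {ratio:.2f}x")
```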

Before submitting

Did you read the contributor guideline, Pull Request section?
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Note

Medium Risk
Changes the GOLD training loop and dataloader sampling semantics to buffer generations across optimizer windows and support multiple rollouts per prompt, which can affect training dynamics and correctness across distributed setups.

Overview
Adds an optimizer-window generation buffer to GOLDTrainer, decoupling on-policy rollout generation from individual gradient-accumulation microsteps. Training now samples a full window batch via a RepeatSampler, generates (optionally via vLLM) once per window, and reuses buffered slices across accumulation steps.
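The window-level sampling described above can be illustrated with a simplified stand-in. This is not TRL's `RepeatSampler`; `repeated_window_indices` is a hypothetical generator that only shows how each prompt index is repeated `num_generations` times and how an incomplete final window is dropped.

```python
from typing import Iterator, List


def repeated_window_indices(
    num_prompts: int, prompts_per_window: int, num_generations: int
) -> Iterator[List[int]]:
    """Yield one list of dataset indices per optimizer window (illustrative only)."""
    # Stop before any window that cannot be filled with prompts_per_window prompts.
    last_start = num_prompts - prompts_per_window
    for start in range(0, last_start + 1, prompts_per_window):
        window: List[int] = []
        for idx in range(start, start + prompts_per_window):
            # Each unique prompt appears num_generations times in a row.
            window.extend([idx] * num_generations)
        yield window


windows = list(repeated_window_indices(num_prompts=5, prompts_per_window=2, num_generations=2))
```

Here five prompts with two prompts per window give two full windows; the fifth prompt is left out because it cannot fill a window on its own.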

Extends GOLDConfig with num_generations and generation_batch_size (with strict partitioning validation) and adds teacher_model_revision; updates model/teacher revision handling in gold.py and teacher model instantiation. Generation/label masking is refactored to correctly handle left-padded prompts and completion-only outputs, and Liger+ZeRO-3 gets an explicit parameter gather context.
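To make the masking point concrete, here is a small pure-Python sketch of building labels for a left-padded prompt followed by a completion. It assumes the prompt occupies a fixed padded width at the start of every row and that padding uses a dedicated token id; `build_labels` and `IGNORE_INDEX` are illustrative names, and the trainer's real logic operates on tensors and may treat padding differently (for example when the pad token equals the EOS token).

```python
from typing import List

IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy losses


def build_labels(sequence: List[int], prompt_width: int, pad_token_id: int) -> List[int]:
    """Mask the (left-padded) prompt and any padding; keep completion tokens."""
    labels = list(sequence)
    for position, token in enumerate(sequence):
        # Everything inside the padded prompt width is prompt or left padding.
        if position < prompt_width or token == pad_token_id:
            labels[position] = IGNORE_INDEX
    return labels


# Row layout: [pad, pad, prompt, prompt | completion, completion, right pad]
row = [0, 0, 5, 6, 7, 8, 0]
labels = build_labels(row, prompt_width=4, pad_token_id=0)
```

Because prompts are left-padded, the prompt width is the same for every row in a batch, so a single offset separates prompt positions from completion positions.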

Updates docs to describe buffering behavior and config changes, and adds/adjusts tests to cover left-padding retokenization and prompt-length masking expectations.

Written by Cursor Bugbot for commit abfadc1. This will update automatically on new commits.

HuggingFaceDocBuilderDev · 2026-03-12T13:38:33Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

qgallouedec · 2026-03-14T02:42:48Z

I haven't reviewed it in detail; I have a general idea of what it's about, but I'm leaving the implementation mostly up to you. In future PRs, we can try to align it better with the rest of the codebase, but what matters most right now are the results you're getting.

Make sure to run `make precommit` to make the CI happy.

qgallouedec

just ensure the CI is green before merging

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

cmpatino added 25 commits February 18, 2026 12:30

Implement buffer for GOLDTrainer

719c644

Clean up code from KD buffer

904378b

Test scripts for trial run

6a2ece5

Apply fixes to the Liger loss setting

ee07aec

Avoid crashing when using DeepSpeed ZeRO-3 and set up the correct values for `weight_hard_loss` and `weight_soft_loss`

Remove test scripts

a3fd2af

Handle config parameters better in gold script

b0669d9

Upload provisional SLURM script for GOLD

b0c4f3e

Refine logic and comments

602e564

Improve clarity of buffer implementation

c4f9a64

Add validation for num_generations

111b85e

Add clarifying comment to num_generations

022af62

Patch issue with ZeRO-3

33e0a82

Refactor context for ZeRO-3 + Liger

dbb6e70

Simplify comments and code logic

9da54b3

Merge pull request #1 from cmpatino/kd-buffer-fix

1cec9ea

KD Buffer Simplification

Add scripts to run GOLD

4435409

Merge pull request #2 from cmpatino/kd-buffer-fix

ce41aba

Add scripts to run GOLD

Merge branch 'kd-buffering' of github.com:cmpatino/trl into kd-buffering

c0a857f

Merge branch 'main' into kd-buffering

fa62472

Refactor to simplify logic

31161a0

Handle student versioning params

da7ef50

Add warning when dropping incomplete batches

e24e681

Add clarifying note in docs

8d31b7a

Remove SLURM script used for testing

1ef205b

Remove reference to wandb

506afc1

cmpatino requested review from Copilot, edbeeching, kashif and lewtun March 3, 2026 21:24

Copilot started reviewing on behalf of cmpatino March 3, 2026 21:24

cursor Bot reviewed Mar 5, 2026

Comment thread trl/experimental/gold/gold_trainer.py Outdated

Comment thread trl/experimental/gold/gold_trainer.py

Remove support for student_model_revision arg

da57e47

Fix prompt length calculation

c3a8d73

cursor Bot reviewed Mar 12, 2026

Comment thread trl/experimental/gold/gold.py

cmpatino added 2 commits March 12, 2026 14:52

Fix logic of padding tokens and prompt lengths

d185716

Add teacher_model_revision arg

30a0fd5

cursor Bot reviewed Mar 12, 2026

Comment thread trl/experimental/gold/gold_trainer.py

cmpatino added 2 commits March 12, 2026 16:12

Avoid creating padding gaps

58b9f74

Merge branch 'main' into kd-buffering

c8cd1f0

cursor Bot reviewed Mar 12, 2026

Comment thread trl/experimental/gold/gold_trainer.py Outdated

cmpatino added 2 commits March 12, 2026 17:39

Fix prompt completion calculation for transformers

ea72770

Merge branch 'kd-buffering' of github.com:cmpatino/trl into kd-buffering

1cbdc32

cursor Bot reviewed Mar 12, 2026

Comment thread trl/experimental/gold/gold_trainer.py

Comment thread trl/experimental/gold/gold_trainer.py

qgallouedec reviewed Mar 14, 2026

Comment thread trl/experimental/gold/gold_trainer.py

qgallouedec reviewed Mar 14, 2026

Comment thread docs/source/gold_trainer.md Outdated

qgallouedec reviewed Mar 14, 2026

Comment thread trl/experimental/gold/gold_config.py Outdated

qgallouedec approved these changes Mar 14, 2026

cmpatino added 3 commits March 17, 2026 11:02

Lint files with precommit

6fea90b

Remove reference to student_model_revision

bfa7406

Remove duplicated arg in config

d4d5ae2

kashif approved these changes Mar 17, 2026

Update test to reflect full generated output from transformers

abfadc1

cmpatino merged commit ee339a0 into huggingface:main Mar 17, 2026
4 checks passed

cmpatino deleted the kd-buffering branch March 17, 2026 13:01

cmpatino mentioned this pull request Mar 23, 2026

Kd vllm generation #5351

Merged

3 tasks

songhappy pushed a commit to songhappy/trl that referenced this pull request Apr 20, 2026

[GKD] Buffer Implementation for Distillation Trainer (huggingface#5137)

1ceee83

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

cmpatino merged 40 commits into huggingface:main from cmpatino:kd-buffering
