Merged
1307 commits
ee339a0
[GKD] Buffer Implementation for Distillation Trainer (#5137)
cmpatino Mar 17, 2026
3acb8e8
Support max_length in DPO VLM training (#5284)
albertvillanova Mar 17, 2026
7b42fc4
Prevent corruption of DPO VLM training if "keep_end" truncation_mode …
albertvillanova Mar 17, 2026
52cd0cc
Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a re…
albertvillanova Mar 17, 2026
26ce6a3
Apply docstyle (#5296)
qgallouedec Mar 18, 2026
435c2ae
Add guidance to avoid `hasattr` and `getattr` with defaults in `AGENT…
qgallouedec Mar 18, 2026
ee96845
Fix DPOTrainer collators to truncate sequences before padding (#5305)
albertvillanova Mar 18, 2026
5c6e915
Update `RewardFunc` type annotation to allow `None`values in reward l…
qgallouedec Mar 18, 2026
3972d66
Suggest the `Json()` type for tool calling dataset format (#5307)
lhoestq Mar 18, 2026
91583cd
Allow reward functions to log extra columns and scalar metrics (#5233)
manueldeprada Mar 18, 2026
116fc42
Fix GRPOTrainer attribute access for vLLM model config (#5302)
falcondai Mar 19, 2026
bc60040
Support truncation_mode in SFT (#5306)
albertvillanova Mar 19, 2026
b86b760
Asynchronous GRPO (#5293)
qgallouedec Mar 19, 2026
b5af98d
Fix datasets version supporting Json dtype in docs about tool calling…
albertvillanova Mar 19, 2026
321a3cd
Align docs about tool calling in trainers with dataset format (#5311)
albertvillanova Mar 19, 2026
ebdfe82
[GRPO] Fix re-tokenization bug in tool-calling loop by concatenating …
qgallouedec Mar 19, 2026
8e6e062
feat(experimental): Divergence Proximal Policy Optimization (#5117)
LeonEricsson Mar 19, 2026
e8dcece
Clean up model update group on worker exit (#5325)
AmineDiro Mar 20, 2026
4fea6d1
Fix style in DPPO docstrings (#5326)
albertvillanova Mar 20, 2026
5bb083c
`GRPOTrainer/async`: fix prefix EOS slicing for tool suffix (with Qwe…
casinca Mar 21, 2026
e923a9a
refactor(async_rollout_worker): renamed tool variables to mirror `grp…
casinca Mar 21, 2026
d8a2dd5
Add truncation to SFT DataCollatorForLanguageModeling (#5315)
albertvillanova Mar 23, 2026
9b59eed
Add SDPO (Self-Distillation Policy Optimization) trainer (#4935)
MengAiDev Mar 23, 2026
c0e3fb0
Update openenv examples to use `environment_factory` (#5235)
sergiopaniego Mar 23, 2026
4fda23e
Enhance `print_prompt_completions_sample` to include reasoning conten…
qgallouedec Mar 23, 2026
bad33f2
Add Cursor Bugbot rules from `AGENTS.md` (#5280)
qgallouedec Mar 23, 2026
05cfc90
Change model dtype from bfloat16 to float32 in AsyncGRPOTrainer (#5333)
qgallouedec Mar 23, 2026
5635466
docs: Add "It Takes Two: Your GRPO Is Secretly DPO" paper to GRPOTrai…
DhruvvArora Mar 24, 2026
eff9242
fix: apply reward_weights to logged reward/reward_std in GRPOTrainer …
lailanelkoussy Mar 24, 2026
ec1802e
Remove post-collation truncation from DPO (#5350)
albertvillanova Mar 24, 2026
ffa14ee
Remove unused flush_right (#5358)
albertvillanova Mar 24, 2026
ee77df9
Fix IDs shape mismatch in SFT for VLMs with text-only (#5354)
albertvillanova Mar 24, 2026
3822674
Remove post-collation truncation from SFT (#5359)
albertvillanova Mar 24, 2026
7e987af
Simplify DPO DataCollatorForPreference (#5362)
albertvillanova Mar 24, 2026
1c4382b
Simplify SFT tokenization (#5363)
albertvillanova Mar 24, 2026
5449131
Simplify SFT DataCollatorForLanguageModeling (#5360)
albertvillanova Mar 25, 2026
840c8fa
Use BaseConfig post_init in experimental KTO and MiniLLM configs (#5371)
albertvillanova Mar 25, 2026
7f1dd11
Move truncate_dataset to experimental (#5370)
albertvillanova Mar 25, 2026
9a29d28
Simplify DPO tokenization (#5369)
albertvillanova Mar 26, 2026
05eac2c
Use `trl.generation.VLLMGeneration` instead of the separate vLLM logic
cmpatino Mar 26, 2026
614845d
Adds support for the `pixel_position_ids` vision key (#5374)
qgallouedec Mar 26, 2026
f18d594
Minor diff reduction between RLOO and GRPO (#5368)
qgallouedec Mar 26, 2026
23b738f
Remove requirements.txt (#5377)
albertvillanova Mar 26, 2026
1eab0ab
Remove dead truncation_mode from experimental BCO, CPO and ORPO (#5378)
albertvillanova Mar 26, 2026
b27d9de
Centralize AI agent templates in `.ai` (#5268)
qgallouedec Mar 26, 2026
4890abf
Pass tools as None to `apply_chat_template` when it is an empty list …
rabinadk1 Mar 27, 2026
ac5421b
Require datasets>=4.7.0 for Json dtype to prevent insertion of None v…
albertvillanova Mar 27, 2026
c264266
Remove deprecated `TRACKIO_SPACE_ID` env var from all scripts (#5365)
sergiopaniego Mar 27, 2026
8e69b68
Mark test_rloo[fsdp2] as xfail for transformers 5.4.0 (#5387)
albertvillanova Mar 27, 2026
32a40bf
Enforce PR template for first-time contributors and document AI usage…
qgallouedec Mar 27, 2026
4cb7ab1
Enhance PR template check to exclude reopened PRs from first-time con…
qgallouedec Mar 27, 2026
83d68dd
chore: update `pr_template_check.yml` (#5393)
qgallouedec Mar 27, 2026
79e6e79
Move `disable_config=True` from `generate` to `GenerationConfig` (#5384)
qgallouedec Mar 28, 2026
1ee3975
Add vLLM inference to the Base Self-Distillation Trainer (#5388)
cmpatino Mar 30, 2026
71ff6a2
Add HF_TOKEN environment variable to workflow files (#5397)
qgallouedec Mar 30, 2026
e8d5dfc
Add second version of Qwen 3.5 chat template to chat_template_utils (…
apardyl Mar 30, 2026
f3e9ac1
Release: v1.0 (#5409)
qgallouedec Mar 30, 2026
bddb0fa
⬆️ Bump dev version (#5410)
qgallouedec Mar 30, 2026
129fbd8
Update "What's New": TRL v1 blog post (#5385)
qgallouedec Mar 30, 2026
cad4fb0
Fix CI slow-tests cannot remove: No such file or directory (#5401)
albertvillanova Mar 31, 2026
2e5a5b6
Remove xfail for Qwen3VL CI tests (#5402)
albertvillanova Mar 31, 2026
2f67d93
Fix flaky CI test_rloo[fsdp2]: Replace non-deterministic xfail with s…
albertvillanova Mar 31, 2026
cb410f3
Mark as strict the xfail tests with zero3 for RLOO and GRPO (#5404)
albertvillanova Mar 31, 2026
2d5cadb
Remove duplicated prepare_deepspeed (#5414)
albertvillanova Apr 1, 2026
67f2448
Hotfix CI: Mark tests as xfail due to missing input_ids or inputs_emb…
albertvillanova Apr 1, 2026
a144d3b
Update tests to not pass `eval_strategy` (#5426)
SunMarc Apr 1, 2026
d3a7fca
Hotfix CI: Mark tests as xfail with transformers dev due to TypeError…
albertvillanova Apr 1, 2026
1ad25f9
FIX CI: Targeting fused parameters with LoRA (#5430)
BenjaminBossan Apr 1, 2026
8501179
Support multimodal tool responses in `environment_factory` for VLM tr…
sergiopaniego Apr 2, 2026
b44b706
🔒 Pin GitHub Actions to commit SHAs (#5435)
paulinebm Apr 2, 2026
8f47d76
New carla vlm example (#5437)
sergiopaniego Apr 2, 2026
2929bc0
Revert hotfix CI for TypeError: 'NoneType' object is not iterable (#5…
albertvillanova Apr 2, 2026
9fdb1f3
Run make precommit to fix docstring style (#5436)
albertvillanova Apr 2, 2026
8c5a413
Fix ImportError with vllm-0.10.2 in OnlineDPO and OpenEnv (#5423)
albertvillanova Apr 2, 2026
512386c
Add chunked LM head for memory-efficient log-prob computation for As…
AmineDiro Apr 2, 2026
095cd91
Update tests with zero3 for RLOO and GRPO as xfail only with transfor…
albertvillanova Apr 3, 2026
eb93c86
Make images optional in prepare_multimodal_messages (#5424)
albertvillanova Apr 3, 2026
3358429
Hotfix CI: Update skipif for test_rloo[fsdp2] after transformers 5.5.…
albertvillanova Apr 3, 2026
8b837de
Update vLLM minimum supported version to 0.11.0 (#5443)
albertvillanova Apr 3, 2026
d7a2539
Better test consistency RLOO vs GRPO (#5396)
qgallouedec Apr 3, 2026
0b05331
Align KTO with DPO: Precompute reference log probs at init (#5447)
albertvillanova Apr 3, 2026
767595d
Add support for logging extra columns in reward functions and update …
qgallouedec Apr 4, 2026
8edc17e
Remove unnecessary `isinstance(part, dict)` checks in image extractio…
qgallouedec Apr 4, 2026
41eda42
Remove xfail for ZeRO 2 and 3 + SFT + PEFT test (#5383)
qgallouedec Apr 4, 2026
b03d6d2
Replace `pixel_position_ids` with `image_position_ids` for Gemma4 sup…
qgallouedec Apr 4, 2026
5c22894
Add test and docs for multimodal tool responses (#5448)
qgallouedec Apr 6, 2026
4532166
Add tests for Gemma pixel splitting (#5450)
qgallouedec Apr 6, 2026
4a3cae0
Generic device support for CI tests (#5357)
kaixuanliu Apr 6, 2026
49bf205
Revert speculative argument parsing and add Gemma4 agent support (#5454)
qgallouedec Apr 6, 2026
a78e898
fix _get_per_token_logps_and_entropies return type (#5456)
kashif Apr 7, 2026
9bacce7
Deprecate keep_end truncation mode (#5465)
albertvillanova Apr 7, 2026
97704ee
Fix SFT deprecation warning (#5466)
albertvillanova Apr 7, 2026
7eba096
Remove unused truncation_mode from experimental truncate_dataset (#5467)
albertvillanova Apr 7, 2026
0ac6216
Use generic VLM key passthrough in DPO (#5468)
albertvillanova Apr 7, 2026
f19fda4
Narrow prefix-preserving check to the actual requirement (#5458)
qgallouedec Apr 7, 2026
0cb9667
Simplify `_get_tool_suffix_ids` (#5440)
qgallouedec Apr 7, 2026
14deadd
Update docstring about tool messages in prepare_multimodal_messages (…
albertvillanova Apr 8, 2026
8a1b87d
CI Gemma 4 support (#5453)
qgallouedec Apr 8, 2026
f44cf20
Move chat templates from inline strings to `.jinja` files (#5459)
qgallouedec Apr 8, 2026
bf187bb
Align KTO with DPO: Reorganize KTOConfig (#5477)
albertvillanova Apr 8, 2026
25813b6
Add `supports_tool_calling` utility and validate tool support at init…
qgallouedec Apr 9, 2026
720c1f2
Add GPT-OSS tool calling support (#5464)
qgallouedec Apr 9, 2026
dd071d7
Add `{% generation %}` support to training chat templates (#5470)
qgallouedec Apr 9, 2026
1e667d8
Avoid image deepcopy in prepare_multimodal_messages (#5475)
albertvillanova Apr 9, 2026
89c5ed6
Remove dead token attributes from trainers (#5483)
albertvillanova Apr 9, 2026
c475b97
Add `DistillationTrainer` for efficient on-policy distillation (#5407)
cmpatino Apr 9, 2026
8b30e50
Replace deprecated `huggingface-cli` references with `hf` (#5486)
hanouticelina Apr 9, 2026
a253dbe
Fix broken validation of user-specified tokens (#5482)
albertvillanova Apr 9, 2026
e0b23ca
Deprecate pad_token config parameter (#5480)
albertvillanova Apr 9, 2026
8900a14
Remove redundant alignment of pad_token_id (#5487)
albertvillanova Apr 9, 2026
502bdb8
Fix PR template check bot reopen loop (#5488)
qgallouedec Apr 9, 2026
6f6440b
feat(gpt-oss): Add `{% generation %}` markers for training chat templ…
casinca Apr 10, 2026
9a1549e
Remove the `trl.experimental.judges` module and all judge support fro…
qgallouedec Apr 10, 2026
462e028
Hotfix CI: Mark tests as xfail with transformers dev for Llava models…
albertvillanova Apr 10, 2026
ea283c6
Restrict VLM padding workaround to transformers 5.3.0 (#5503)
albertvillanova Apr 10, 2026
d4e8354
Update GitHub Action to use specific version of github-script (#5491)
qgallouedec Apr 10, 2026
b48c788
[docs] Add code example for completion_only_loss in SFT trainer docs …
RudrenduPaul Apr 10, 2026
d4caab8
Fix prepare_multimodal_messages not normalizing empty string content …
albertvillanova Apr 10, 2026
f2925a8
Add trackio support to `DistillationTrainer` (#5501)
cmpatino Apr 10, 2026
dbd3fac
feat: add Llama 3 training chat template with generation markers (#5493)
RudrenduPaul Apr 10, 2026
9c8e191
Add GLM-4-MoE tool calling support (#5463)
qgallouedec Apr 10, 2026
c73c2ec
Add Qwen3-VL tool calling support (#5469)
qgallouedec Apr 10, 2026
ca995b4
Add docs and good defaults for `DistillationTrainer` (#5500)
cmpatino Apr 11, 2026
d6d5efc
feat: add Qwen2.5 training chat template with generation markers (#5522)
RudrenduPaul Apr 12, 2026
3179965
Release: v1.1 (#5524)
qgallouedec Apr 12, 2026
5abe9cd
⬆️ Bump dev version (#5525)
qgallouedec Apr 12, 2026
80eb47d
Simplify role handling in prepare_multimodal_messages (#5508)
albertvillanova Apr 13, 2026
b43b551
Fix CI dependency installs to use a single resolve (#5513)
qgallouedec Apr 13, 2026
bdc1e10
Fix `supports_tool_calling` falsely accepting templates that drop ass…
qgallouedec Apr 13, 2026
9157aa7
feat: add DeepSeek-V3 training chat template with generation markers …
RudrenduPaul Apr 14, 2026
2761732
Drop, don't truncate, overlong tool results in GRPOTrainer (#5521)
qgallouedec Apr 14, 2026
9a8d523
Set upper transformers version to skip distributed test_rloo after fi…
albertvillanova Apr 14, 2026
7ce3772
Align KTO with DPO: Add precompute_ref_batch_size (#5530)
albertvillanova Apr 14, 2026
5f3ec05
Update tests with zero3 for RLOO and GRPO once fixed in transformers …
albertvillanova Apr 14, 2026
6e65da0
Align KTO with DPO: Align ref_model initialization (#5534)
albertvillanova Apr 14, 2026
90c5da4
Align KTO with DPO: Align model initialization (#5533)
albertvillanova Apr 14, 2026
9aa8dcc
Remove unused dependencies for judges from dev requirements (#5515)
qgallouedec Apr 14, 2026
d717471
Remove xfail condition for Gemma4 response_schema regex bug (#5510)
qgallouedec Apr 14, 2026
dfdcb88
Align KTO with DPO: Support None args (#5531)
albertvillanova Apr 14, 2026
892ee95
Add example script section to experimental trainer docs (#5543)
sergiopaniego Apr 14, 2026
7419723
[SSD] Added SSD trainer in experimental (#5505)
kashif Apr 14, 2026
bfcba46
[Docs] Fix formatting in SSD training example script (#5548)
kashif Apr 14, 2026
2991255
Don't load ref_model when precompute_ref_log_probs in DPO/KTO (#5542)
albertvillanova Apr 15, 2026
7bc73e8
chore: bump doc-builder SHA for PR upload workflow (#5553)
rtrompier Apr 15, 2026
aa5e321
Nits is SSD docs (#5554)
sergiopaniego Apr 15, 2026
12901a2
Deprecate `use_transformers_paged` (#5544)
qgallouedec Apr 15, 2026
de0f93d
Update vLLM version support to 0.18.0 (#5547)
qgallouedec Apr 15, 2026
571a6ea
Align KTO with DPO: Remove generate_during_eval (#5551)
albertvillanova Apr 15, 2026
4a2f750
Align KTO with DPO: Remove model and ref adapter names (#5552)
albertvillanova Apr 15, 2026
dc84e41
Support messages with images in prepare_multimodal_messages (#5474)
albertvillanova Apr 16, 2026
5d4f7c8
Update CARLA VLM example scripts (#5557)
sergiopaniego Apr 16, 2026
811ff6f
Fix `add_response_schema` for VLM processors (#5520)
qgallouedec Apr 16, 2026
d75a550
[docs] Add LLaMA 3 / Qwen 2.5 entries to `chat_templates/README` (#5545)
qgallouedec Apr 16, 2026
abe20a8
Add LLaMA 3.1 and 3.2 tool calling support (#5518)
qgallouedec Apr 16, 2026
aca4515
Release: v1.2 (#5576)
qgallouedec Apr 17, 2026
6632cda
⬆️ Bump dev version (#5577)
qgallouedec Apr 17, 2026
6d05f86
Support processor in maybe_apply_chat_template (#5567)
albertvillanova Apr 17, 2026
21cf71d
Remove dead token attributes from experimental trainers (#5565)
albertvillanova Apr 17, 2026
4595347
Support VLM processors in `is_chat_template_prefix_preserving` (#5558)
qgallouedec Apr 17, 2026
a09320e
Align KTO with DPO: Align add_model_tags (#5582)
albertvillanova Apr 17, 2026
8ff0069
Align KTO with DPO: Align processing_class initialization (#5578)
albertvillanova Apr 17, 2026
d648dc0
Align KTO with DPO: Align _prepare_dataset (#5579)
albertvillanova Apr 17, 2026
573ea22
Align KTO with DPO: Align ref_model preparation for distributed train…
albertvillanova Apr 17, 2026
15bbfc7
Align KTO with DPO: Make conditional prompt extraction and unpairing …
albertvillanova Apr 18, 2026
88826fd
Update AsyncGRPO example with GSM8K and tested hyperparameters (#5580)
sergiopaniego Apr 20, 2026
1d9b612
[docs] Add chat templates page to web docs (#5581)
sergiopaniego Apr 20, 2026
9502575
Add additional model parameters to `TestSupportsToolCalling` for impr…
qgallouedec Apr 20, 2026
06244b0
Fix CI with dev dependencies for Llava models (#5499)
albertvillanova Apr 20, 2026
4a2dc7c
Differentiate Phi-3 and Phi-3.5 in tests (#5546)
qgallouedec Apr 20, 2026
6e1705a
Set _tokenizer as trainer attribute (#5489)
albertvillanova Apr 20, 2026
b8d69f7
Align KTO with DPO: Support dict eval_dataset (#5599)
albertvillanova Apr 20, 2026
4ca2e9b
Align KTO with DPO: Align tokenization (#5601)
albertvillanova Apr 20, 2026
d5b534e
Check prefix preservation at the token level (#5559)
qgallouedec Apr 20, 2026
dfe3788
Replace wrong comment about chat template with EOS (#5607)
albertvillanova Apr 20, 2026
14ca4af
Align KTO with DPO: Support IterableDataset (#5600)
albertvillanova Apr 20, 2026
0a54b4d
Drop vLLM 0.11 support (#5549)
qgallouedec Apr 21, 2026
1cc2b98
Align KTO with DPO: Remove maybe_apply_chat_template (#5606)
albertvillanova Apr 21, 2026
ecf9cb3
[TPO] experimental TPO trainer (#5506)
kashif Apr 21, 2026
a08e713
fix: Pass AsyncGRPOTrainer's processing_class to AsyncRolloutWorker (…
xuanduy04 Apr 21, 2026
166d550
docs: update RapidFire AI integration with FSDP and multi-backend tra…
kamran-rapidfireAI Apr 22, 2026
edaf6ec
Fix generate_tiny_models for gpt-oss (#5622)
albertvillanova Apr 22, 2026
6a4a077
Added speculative_config to vllm-serve (#5605)
Ofir408 Apr 22, 2026
9a52d73
feat(glm-4-moe): Add `{% generation %}` markers for training chat tem…
casinca Apr 22, 2026
95e76d5
Fix docstring style in vllm-serve script (#5628)
albertvillanova Apr 22, 2026
3256995
feat: add Gemma/Gemma2 training chat templates with generation marker…
ps-abhi Apr 22, 2026
b3da4eb
Align KTO with DPO: Inline tokenization, new output format, DataColla…
albertvillanova Apr 22, 2026
644d173
feat: add Phi-3 training chat template with generation markers (#5526)
RudrenduPaul Apr 22, 2026
6da8ec5
Remove `forward_masked_logits` (#5626)
qgallouedec Apr 23, 2026
a9cfe47
Use `PreTrainedTokenizerBase` for tokenizer type hints (#5629)
qgallouedec Apr 23, 2026
1996c39
Add doc-builder style check to pre-commit and CI (#5630)
albertvillanova Apr 24, 2026
b43476a
Align and update doc-builder commit hash in CI GitHub Actions (#5631)
albertvillanova Apr 24, 2026
4c8b2e9
Align KTO with DPO: Move completion assembly from _prepare_dataset to…
albertvillanova Apr 24, 2026
208337c
Hotfix CI: Add ruff dependency to doc-builder style check (#5634)
albertvillanova Apr 24, 2026
c693ca1
Fix entropy calculation in SFT (#5620)
qgallouedec Apr 24, 2026
43cbd78
Renaming of internal variables: `async_reward_X` to `async_X` (#5616)
qgallouedec Apr 24, 2026
3aa9519
Align KTO with DPO: Remove BOS/EOS handling (#5635)
albertvillanova Apr 24, 2026
2f10689
Qwen3.6 integration (#5642)
qgallouedec Apr 26, 2026
9679645
Release: v1.3 (#5647)
qgallouedec Apr 26, 2026
4798893
⬆️ Bump dev version (#5648)
qgallouedec Apr 26, 2026
923c318
Align KTO with DPO: Remove model_init parameter (#5659)
albertvillanova Apr 27, 2026
510a6f5
Align KTO with DPO: Remove preprocess_logits_for_metrics parameter (#…
albertvillanova Apr 27, 2026
a7648ba
Add tiny Qwen3-4B-Instruct-2507 (#5586)
qgallouedec Apr 27, 2026
9bcf729
Chunked cross-entropy loss for SFT (up to –50% VRAM) (#5575)
qgallouedec Apr 27, 2026
8d3a3a2
Fix missing PEFT validation when passing peft_config to core trainers…
albertvillanova Apr 28, 2026
4d0fd7d
Fix missing PEFT availability check when passing peft_config to exper…
albertvillanova Apr 28, 2026
9516563
Align KTO with DPO: Align PEFT handling (#5661)
albertvillanova Apr 28, 2026
4455858
Set _tokenizer attribute in experimental trainers (#5566)
albertvillanova Apr 28, 2026
574ebe0
Fix peft_config type hint in experimental trainers (#5666)
albertvillanova Apr 28, 2026
788555a
Add Cohere training chat template (#5627)
dschulmeist Apr 28, 2026
88e0ed4
Simplify peft_config handling in core trainers (#5673)
albertvillanova Apr 29, 2026
fdad6d8
Simplify peft_config handling in experimental trainers (#5674)
albertvillanova Apr 29, 2026
43e09b8
fix(distillation): reverse-KL server path NaN on variable completion …
k1064190 Apr 29, 2026
187c899
Fix discarded assertion message in trainer parameter checks (#5677)
qgallouedec Apr 29, 2026
eae4235
Align KTO with DPO: Replace direct type check with is_peft_model (#5679)
albertvillanova Apr 29, 2026
95719e2
Remove redundant is_peft_available from core trainers (#5682)
albertvillanova Apr 29, 2026
3c24b1e
Replace isinstance with is_peft_model in experimental trainers (#5683)
albertvillanova Apr 29, 2026
64e3eb1
Upload testing suite for `DistillationTrainer` (#5615)
cmpatino Apr 30, 2026
32bec88
Fix OOM in CI by reducing batch size in VLM SFT tests (#5687)
albertvillanova Apr 30, 2026
becec89
Fix OOM in CI by reducing image size of tiny Gemma3 model (#5680)
albertvillanova Apr 30, 2026
a0dc552
Fix OOM in CI test reruns due to GPU memory leak from traceback frame…
albertvillanova Apr 30, 2026
7d97bf8
Add training-invariance tests (#5686)
qgallouedec Apr 30, 2026
465a5fd
Regenerate invariance data + relax the tolerance (#5688)
qgallouedec May 1, 2026
25e294c
fix: prevent RuntimeError crash in activation offloading for non-cont…
butterwecksolutions May 4, 2026
5c521b7
[GRPO] update Liger-kernel grpo loss (delta, vespo, KL bias correctio…
kashif May 4, 2026
e55d788
Extend invariant suite with gradient-checkpointing equivalence (#5689)
qgallouedec May 4, 2026
909d0c1
Add Gemma 3 training chat template (#5685)
hwanython May 4, 2026
0a3d956
Add `{% generation %}` markers for Cohere2 chat template (#5675)
qgallouedec May 4, 2026
20b2489
Add length-normalized sigmoid loss type to DPO trainer (#5406)
BrownianNotion May 4, 2026
e5677da
Add training chat template for Qwen3-2507 (#5574)
SwayamInSync May 5, 2026
babb16b
Align KTO with DPO: Remove enforcement of causal language models (#5701)
albertvillanova May 5, 2026
17f8aac
Align KTO with DPO: Remove duplicate import of PreTrainedModel (#5702)
albertvillanova May 5, 2026
d2b80e0
Align KTO with DPO: Simplify max_length init logic (#5703)
albertvillanova May 5, 2026
f39373e
Align KTO with DPO: Group training arguments (#5704)
albertvillanova May 5, 2026
df6ae2a
Align KTO with DPO: Use _metrics attribute (#5705)
albertvillanova May 5, 2026
a01bf61
Reduce inconsistency across trainer test files (#5678)
qgallouedec May 5, 2026
e7c0019
Refactor tiny-model generation scripts (#5637)
qgallouedec May 5, 2026
225a234
Accept processor in `get_training_chat_template` (#5560)
qgallouedec May 5, 2026
b8e6fc0
Enable chunked NLL loss with PEFT in SFT (#5676)
qgallouedec May 5, 2026
ca8d909
Fix GRPO VLM tests: Multimodal training requires conversational promp…
kaixuanliu May 5, 2026
6ad4f30
[experimental] Add OpenReward Standard environment adapter (#5696)
adithya-s-k May 5, 2026
1240ecf
GKDTrainer: Fix return_outputs in Liger kernel path and update tests …
roycho96 May 6, 2026
733f3b1
Reject parallelism_config with cp_size>1 or sp_size>1 in GRPO/RLOO (#…
kashif May 6, 2026
afae06f
Fix typo in model name in README (#5711)
qgallouedec May 6, 2026
dddac4c
Explicitly set model_accepts_loss_kwargs=False in DPO and Reward (#5710)
albertvillanova May 6, 2026
19d007e
Fail early for unsupported PEFT + Liger Kernel in DPO (#5709)
albertvillanova May 6, 2026
de7efdc
Revert VLM support in `parse_response` (#5561)
qgallouedec May 6, 2026
acbd53f
Align KTO with DPO: Align _precompute_ref_logps (#5714)
albertvillanova May 7, 2026
8a6cc03
fix: prevent 5 GB+ CUDA memory leak in activation offloading by synci…
butterwecksolutions May 7, 2026
4601166
Align tiny Qwen3 MoE config with Qwen/Qwen3-30B-A3B (#5716)
qgallouedec May 7, 2026
47b3778
Add MFU helpers (#5698)
AmineDiro May 7, 2026
97 changes: 97 additions & 0 deletions .ai/AGENTS.md
@@ -0,0 +1,97 @@
# AGENTS.md

## Repository-specific guidance

### Main code vs experimental code

The repository is separated into **main code** and **experimental code**.

* **Main code** should remain stable, consistent, and well-tested.
* **Experimental code** may be less stable and may contain inconsistent patterns or limited testing.

Small non-invasive improvements that make experimental code more consistent with the main codebase are encouraged, but avoid large refactors.

### Paper implementations

If a PR implements a method, algorithm, or training approach from a research paper, it must also add a corresponding subsection to `paper_index.md`.

When reviewing such PRs, ensure that `paper_index.md` was updated.

### Code duplication and consistency

Trainers in this repository are **self-contained by design**. Shared logic (generation, reward computation, metric logging, weight syncing, etc.) is deliberately duplicated across trainers rather than abstracted into a shared base class.

This is intentional: each trainer must be readable, modifiable, and evolvable in isolation. The base class (`_BaseTrainer`) provides only minimal utilities (model card generation). Everything else — vLLM generation paths, `_get_per_token_logps_and_entropies`, `_calculate_rewards`, `_prepare_inputs`, metric logging — is copied in full.

**The tradeoff**: duplication is accepted, but **consistency is mandatory**. When the same logic appears in multiple trainers, the duplicated blocks must stay aligned:

- Same variable names (`self._last_loaded_step`, `self._metrics[mode]`, …)
- Same control flow structure (if/elif/else branches in the same order)
- Same comments (word-for-word when the logic is identical)
- Divergences only where the trainer's semantics require it (e.g., GRPO extracts logprobs from vLLM, RLOO discards them)

**Consistency over correctness**: this is a strong requirement. When duplicating code, reproduce it exactly — even if you believe the original has a bug. Do not silently fix the issue in your copy. Instead, keep your copy consistent with the source and report the problem so it can be fixed across all trainers in a dedicated PR. A correct-but-inconsistent codebase is harder to maintain than a consistently-wrong one that can be fixed in a single sweep.

**When modifying duplicated code**: if you change a pattern that exists in multiple trainers (e.g., the vLLM generation path in `_generate_single_turn`), apply the same change to all other trainers. A fix in GRPO often implies the same fix in RLOO, and vice versa. Not propagating a change is a bug.

**When reviewing**: if a PR touches duplicated logic, verify that all copies are updated consistently. A common mistake is fixing one trainer and forgetting the others.

### Simplicity

This codebase values **leanness and simplicity above all**. Prefer straightforward, inline code over abstractions, helpers, or utilities — even at the cost of some robustness or generality.

Concretely:

- Do not add layers of indirection (registries, factory patterns, plugin systems). A contributor should be able to read a trainer top to bottom and understand the full flow.
- Prefer a simple implementation that covers 90% of cases over a complex one that covers 100%. A function that handles the common path in 20 lines is better than a catch-all that handles every edge case in 80.
- Do not add defensive code, fallback paths, or configuration options "just in case". Only handle cases that actually exist today.
- Avoid `hasattr` and `getattr`. Their use is almost always a symptom of overly defensive programming or a disguised version check (e.g., "this attribute was added in version X"). Instead, either drop the conditional entirely or express the version check explicitly with a version comparison. There is nearly always a cleaner alternative.
- When in doubt, prefer less code. Every new function, parameter, or branch is maintenance burden. The best abstraction is often no abstraction.
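The `hasattr`/`getattr` guidance above can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the `OldConfig`/`NewConfig` stand-ins, the `max_model_len` attribute, the `0.12` threshold, and the small `parse_version` helper are not taken from the repository. In real code a version comparison would more likely go through `packaging.version.Version`.

```python
# Hypothetical sketch: class names, attribute names, and version numbers are
# illustrative assumptions, not code from the repository.


def parse_version(text: str) -> tuple[int, ...]:
    """Turn "5.4.0" into (5, 4, 0); stops at the first non-numeric piece such as "dev0"."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


class OldConfig:
    """Stand-in for a config object from a release that predates the attribute."""


class NewConfig:
    """Stand-in for a config object from a release that added `max_model_len`."""

    max_model_len = 4096


# Discouraged: the default hides why the attribute might be missing.
def max_len_defensive(config) -> int:
    return getattr(config, "max_model_len", 2048)


# Preferred: the dependency on the library version is stated explicitly.
def max_len_explicit(config, library_version: str) -> int:
    if parse_version(library_version) >= (0, 12):
        return config.max_model_len
    return 2048
```

Both functions return the same values for these stand-ins, but only the second tells the reader which release introduced the attribute and when the branch can be deleted.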

## Documentation

### Docstrings

Docstrings must follow the repository format below. Do **not** convert docstrings to other styles (Google, NumPy, etc.).

Rules:

* Types appear in backticks inside parentheses: (`str`)
* Optional parameters are marked with `*optional*`
* Defaults are written as: `defaults to <value>`
* When the default is `None`, prefer ```(`str`, *optional*)``` instead of ```(`str` or `None`, *optional*, defaults to `None`)```
* Union types use `or`: `str` or `None`
* References to classes use the format: [`~transformers.PreTrainedModel`]
* Class docstrings may group parameters using headers such as: `> Parameters for X:`

Example:

````python
def method(self, param1: str, param2: int = 1, param3: float | None = None):
    """
    Brief one-line description of what this does.

    Args:
        param1 (`str`):
            Description of required param.
        param2 (`int`, *optional*, defaults to `1`):
            Description of optional param with default.
        param3 (`float`, *optional*):
            Description of optional param without explicit default.

    Returns:
        `dict` with keys:
            - `key1` (`list[int]`):
                Description of this key.

    Examples:

    ```python
    >>> my_func("hello")
    ```
    """
````

### Links to papers

When linking to papers, use `https://huggingface.co/papers/<id>` instead of `https://arxiv.org/abs/<id>`; the `<id>` is the same in both URLs.
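As a rough illustration of the link rule above, the rewrite can be sketched as a small helper. The function name `to_hf_papers_url` is hypothetical, and the sketch assumes new-style arXiv identifiers (`YYMM.NNNNN`) and that a trailing version suffix such as `v2` should be dropped; neither assumption comes from the repository.

```python
import re

# Assumption: only new-style arXiv IDs (e.g. 2402.03300), optionally followed by a version suffix.
ARXIV_ABS = re.compile(r"https?://arxiv\.org/abs/(?P<id>\d{4}\.\d{4,5})(?:v\d+)?")


def to_hf_papers_url(url: str) -> str:
    """Rewrite an arXiv abstract URL to the Hugging Face papers form; leave other URLs unchanged."""
    match = ARXIV_ABS.fullmatch(url)
    if match is None:
        return url
    return f"https://huggingface.co/papers/{match.group('id')}"
```

For example, `to_hf_papers_url("https://arxiv.org/abs/2402.03300")` yields `https://huggingface.co/papers/2402.03300`, while a URL that does not match the pattern is returned as-is.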
1 change: 1 addition & 0 deletions .cursor/BUGBOT.md
18 changes: 11 additions & 7 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -14,18 +14,22 @@ Once you're done, someone will review your PR shortly. They may suggest changes

Fixes # (issue)


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md#create-a-pull-request),
Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue? Please add a link
to it if that's the case.
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md#create-a-pull-request), Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
- [ ] Did you write any new necessary tests?

## AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

- [ ] No AI usage: the PR was written entirely by a human.
- [ ] AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
- [ ] AI-generated: the PR was mostly or fully generated by an AI tool.

## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
6 changes: 4 additions & 2 deletions .github/workflows/build_documentation.yml
@@ -7,13 +7,15 @@ on:
- doc-builder*
- v*-release

env:
TRL_EXPERIMENTAL_SILENCE: 1

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
with:
commit_sha: ${{ github.sha }}
package: trl
version_tag_suffix: ""
custom_container: huggingface/transformers-doc-builder
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
6 changes: 4 additions & 2 deletions .github/workflows/build_pr_documentation.yml
@@ -3,17 +3,19 @@ name: Build PR Documentation
on:
pull_request:

env:
TRL_EXPERIMENTAL_SILENCE: 1

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

jobs:
build:
if: github.event.pull_request.draft == false
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
package: trl
version_tag_suffix: ""
custom_container: huggingface/transformers-doc-builder
2 changes: 1 addition & 1 deletion .github/workflows/clear_cache.yml
@@ -10,7 +10,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- name: Cleanup
run: |
6 changes: 3 additions & 3 deletions .github/workflows/codeQL.yml
@@ -14,13 +14,13 @@ jobs:

steps:
- name: "Checkout repository"
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- name: "Initialize CodeQL"
uses: github/codeql-action/init@v2
uses: github/codeql-action/init@b8d3b6e8af63cde30bdc382c0bc28114f4346c88 # v2
with:
languages: "yaml"
queries: +security-and-quality, ./.github/codeql/custom-queries.qls

- name: "Perform CodeQL Analysis"
uses: github/codeql-action/analyze@v2
uses: github/codeql-action/analyze@b8d3b6e8af63cde30bdc382c0bc28114f4346c88 # v2
101 changes: 46 additions & 55 deletions .github/workflows/docker-build.yml
@@ -1,95 +1,86 @@
name: Build Docker images (scheduled)
name: Build TRL Docker image

on:
push:
branches:
- main
workflow_dispatch:
workflow_call:
schedule:
- cron: "0 1 * * *"

concurrency:
group: docker-image-builds
cancel-in-progress: false

env:
CI_SLACK_CHANNEL: ${{ secrets.CI_DOCKER_CHANNEL }}

jobs:
trl-latest:
name: "Latest TRL GPU"
runs-on: ubuntu-latest
trl:
name: "Build and push TRL Docker image"
runs-on:
group: aws-general-8-plus
steps:
- name: Cleanup disk
- name: Checkout code
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- name: Get TRL version from PyPI
run: |
sudo ls -l /usr/local/lib/
sudo ls -l /usr/share/
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/share/dotnet
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
VERSION=$(curl -s https://pypi.org/pypi/trl/json | jq -r .info.version)
echo "VERSION=$VERSION" >> $GITHUB_ENV

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
- name: Check out code
uses: actions/checkout@v4
uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3

- name: Login to DockerHub
uses: docker/login-action@v1
uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}

- name: Build and Push GPU
uses: docker/build-push-action@v4
- name: Build and Push
uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8 # v6
with:
context: ./docker/trl-latest-gpu
context: docker/trl
push: true
tags: huggingface/trl-latest-gpu
tags: |
huggingface/trl:${{ env.VERSION }}
huggingface/trl

- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@a88e7fa2eaee28de5a4d6142381b1fb792349b67 # main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: 🤗 Results of the trl-latest-gpu Docker Image build
slack_channel: ${{ secrets.CI_DOCKER_CHANNEL }}
title: 🤗 Results of the TRL Dev Docker Image build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}

trl-source:
name: "Latest TRL + HF ecosystem from source"
runs-on: ubuntu-latest
trl-dev:
name: "Build and push TRL Dev Docker image"
runs-on:
group: aws-general-8-plus
steps:
- name: Cleanup disk
run: |
sudo ls -l /usr/local/lib/
sudo ls -l /usr/share/
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/share/dotnet
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
- name: Checkout code
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
- name: Check out code
uses: actions/checkout@v4
uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3

- name: Login to DockerHub
uses: docker/login-action@v1
uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}

- name: Build and Push GPU
uses: docker/build-push-action@v4
- name: Build and Push
uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8 # v6
with:
context: ./docker/trl-source-gpu
context: docker/trl-dev
push: true
tags: huggingface/trl-source-gpu
tags: |
huggingface/trl:dev

- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@a88e7fa2eaee28de5a4d6142381b1fb792349b67 # main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: 🤗 Results of the trl-source-gpu Docker Image build
slack_channel: ${{ secrets.CI_DOCKER_CHANNEL }}
title: 🤗 Results of the TRL Dev Docker Image build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/issue_auto_labeller.yml
@@ -9,7 +9,7 @@ jobs:
permissions:
issues: write
steps:
- uses: actions/checkout@v3
- uses: August-murr/auto-labeler@main
- uses: actions/checkout@v6
- uses: August-murr/auto-labeler@0.0.1
with:
hf-api-key: ${{ secrets.CI_HF_API_TOKEN }}
8 changes: 4 additions & 4 deletions .github/workflows/pr_style_bot.yml
@@ -18,7 +18,7 @@ jobs:
steps:
- name: Extract PR details
id: pr_info
uses: actions/github-script@v6
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8
with:
script: |
const prNumber = context.payload.issue.number;
@@ -35,7 +35,7 @@ jobs:
core.setOutput("headRepoFullName", pr.head.repo.full_name);

- name: Check out PR branch
uses: actions/checkout@v3
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
env:
HEADREPOFULLNAME: ${{ steps.pr_info.outputs.headRepoFullName }}
HEADREF: ${{ steps.pr_info.outputs.headRef }}
@@ -58,7 +58,7 @@ jobs:
echo "Head Repo Full Name: ${{ env.HEADREPOFULLNAME }}"

- name: Set up Python
uses: actions/setup-python@v4
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6

- name: Install dependencies
run: |
@@ -111,7 +111,7 @@ jobs:

- name: Comment on PR with workflow run link
if: steps.commit_and_push.outputs.changes_pushed == 'true'
uses: actions/github-script@v6
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8
with:
script: |
const prNumber = parseInt(process.env.prNumber, 10);