What is this PR?
This pull request refactors the DPO Trainer.
Docs: https://moon-ci-docs.huggingface.co/docs/trl/pr_3906
Benchmark: #3906 (comment)
Closes #2563
Closes #3985
Closes #4071
Closes #2047
Important modifications
1. Remove encoder-decoder support #
Remove encoder-decoder support to reduce code complexity and maintenance burden, focusing on decoder-only architectures, which dominate current LLM usage and training workflows.
2. Remove `RunningMoment` from the pairwise BCO objective #
Section 4.2 of [Binary Classifier Optimization for Large Language Model Alignment] shows that alignment objectives must be invariant to adding constants to rewards, and the paper enforces this by formulating losses in terms of likelihood ratios and relative (baseline-subtracted) rewards. But this is automatically satisfied in the preference case, so we don't need any running moment like the one used here
trl/trl/trainer/dpo_trainer.py
Lines 550 to 551 in e5503ea
and here
trl/trl/trainer/dpo_trainer.py
Lines 1133 to 1137 in e5503ea
This is probably a mistake carried over from its initial implementation in #1524.
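To illustrate the invariance argument, here is a minimal margin-based stand-in (not the exact BCO pairwise loss): any objective that depends only on the chosen-minus-rejected reward margin is unaffected by adding the same constant to both rewards, so no running-moment baseline is needed.

```python
import torch
import torch.nn.functional as F

def pairwise_margin_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # The loss depends only on the margin r_chosen - r_rejected, so adding the
    # same constant to both rewards leaves it unchanged.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

chosen = torch.tensor([1.3, 0.2, -0.5])
rejected = torch.tensor([0.4, -0.1, -1.2])
shift = 10.0  # arbitrary constant added to both rewards

assert torch.allclose(
    pairwise_margin_loss(chosen, rejected),
    pairwise_margin_loss(chosen + shift, rejected + shift),
)
```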
3. Rename `"aot_pair"` to `"aot_unpaired"` #
In the paper, we have `loss_type="aot"` and `loss_type="aot_pair"`. From what I understand, for consistency with the late `loss_type="kto_pair"`, the author initially called the latter `"aot_pair"`, even though it is the unpaired version (see #1701), which in my opinion is very misleading. I therefore propose to have `loss_type="aot"` and `loss_type="aot_unpaired"`. We will follow a minor-version deprecation cycle.
4. Deprecate separate prompt/completion truncation #
`DPOTrainer` currently truncates prompts and completions separately using `max_prompt_length` and `max_completion_length`. This is suboptimal: with a fixed total token budget, separate limits cannot adapt to varying prompt/completion lengths, causing unnecessary truncation.
Consider two samples that both fit within a 7-token budget:
- Separate truncation fails: no choice of `(max_prompt_length, max_completion_length)` preserves both samples.
- Single-sequence truncation works: truncating the concatenated sequence with `max_length` keeps both intact.
In short, separate prompt/completion truncation wastes token budget, while truncating the concatenated sequence with `max_length` is always strictly better (see the sketch below).
Recommendation: deprecate `max_prompt_length` and `max_completion_length`.
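A hypothetical numeric sketch of the argument above (the token lengths are illustrative, not taken from the original example):

```python
# Two samples that each fit within a 7-token budget, with opposite prompt/completion splits.
samples = [
    {"prompt_len": 5, "completion_len": 2},  # long prompt, short completion
    {"prompt_len": 2, "completion_len": 5},  # short prompt, long completion
]
budget = 7

# Separate truncation: preserving both samples would require
# max_prompt_length >= 5 and max_completion_length >= 5, i.e. a 10-token budget.
needed_separately = max(s["prompt_len"] for s in samples) + max(s["completion_len"] for s in samples)
print(needed_separately)  # 10 > 7, so at least one sample gets truncated

# Single-sequence truncation: each concatenated sample is exactly 7 tokens,
# so max_length=7 preserves both samples untouched.
print(all(s["prompt_len"] + s["completion_len"] <= budget for s in samples))  # True
```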
5. Switch the default truncation side to `"keep_start"` #
"keep_start"#In DPO, preference labels are defined over the conditional distribution$p(y|x)$ induced by the full prompt.
Left truncation (keeping the end) alters this conditioning context by removing system instructions or intent-setting tokens, so the model is trained on preferences that no longer correspond to the same conditional distribution, potentially invalidating or even reversing the preference signal.
Right truncation (keeping the start) preserves the conditioning distribution and task semantics; while it may weaken the signal by shortening completions, it does not change what the preference is conditioned on. Moreover, because chosen and rejected responses typically have different lengths, left truncation can remove a different number of tokens from each completion, introducing additional asymmetry and noise in the preference comparison.
Therefore I recommend setting `truncation_side="keep_start"` by default (instead of `"keep_end"`).
6. Deprecate `ref_model_init_kwargs` #
To my knowledge, the reference model is almost always initialized to be equal to the initial trained model. Initializing it differently is therefore an advanced/uncommon use case, and in that case it is more logical and simpler to let the user initialize the reference model themselves and pass it to the trainer via the `ref_model` argument:
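A minimal sketch of that flow, assuming a causal LM checkpoint (the model name, dtype, and dataset are illustrative):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# Any custom initialization previously expressed through ref_model_init_kwargs
# is now done directly by the user:
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16)

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=DPOConfig(output_dir="dpo-output"),
    train_dataset=dataset,
)
trainer.train()
```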
7. Deprecate `generate_during_eval` #
Note: I'm not entirely sure about this one yet.
In all trainers, we have a trained model `model`. It would seem simpler to use a callback, such as `LogCompletionsCallback`, instead of generating completions within the trainer. Having both `LogCompletionsCallback` and `generate_during_eval` feels like a duplicate to me.
8. Deprecate `force_use_ref_model` #
Before, providing both a `peft_config` and a `ref_model` would result in an error. There are two issues with this behavior:
- The error was only raised when `peft_config` was provided, but not when the `model` argument was already a PEFT model, leading to inconsistent behavior. More generally, using a different `ref_model` is an uncommon usage pattern in DPO, regardless of whether PEFT is used.
- When the user passes a `ref_model` explicitly, it is reasonable to assume that this is intentional. Rejecting this combination at the API level and having a dedicated argument for this case is therefore unnecessarily restrictive.
As a result, it is cleaner to always honor an explicitly provided `ref_model`, while documenting that passing a `ref_model` is not necessary in the vast majority of use cases.
9. Deprecate `use_logits_to_keep` #
Previously, we had the option to enable `use_logits_to_keep`, and the `lm_head` was only used on the last $N$ tokens, where $N$ was the largest completion length in the batch. It used to save VRAM, but needed a bit of complexity in the code:
trl/trl/trainer/dpo_trainer.py
Lines 1578 to 1615 in 1dc8bbc
We're working on something even more efficient: using the `lm_head` on completion tokens only (see internal discussion). This would always be activated.
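A rough sketch of the "lm_head on completion tokens only" idea (the `model.model` attribute access and the `completion_mask` construction are illustrative; the actual mechanism is still being discussed internally):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

batch = tokenizer(["The capital of France is Paris."], return_tensors="pt")
# Illustrative mask: pretend the last two tokens are the completion.
completion_mask = torch.zeros_like(batch["input_ids"], dtype=torch.bool)
completion_mask[:, -2:] = True

# Run the decoder once over the full sequence, then project only the completion positions.
hidden_states = model.model(**batch).last_hidden_state
completion_logits = model.lm_head(hidden_states[completion_mask])
print(completion_logits.shape)  # (num_completion_tokens, vocab_size)
```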
10. Deprecate `label_pad_token_id` #
It's now standard everywhere to use `-100`. In my opinion, having a way to parametrize this value is not useful.
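For context, `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, which is why it is the de facto standard label padding value:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                # 4 positions, vocabulary of size 10
labels = torch.tensor([3, -100, 7, -100])  # padded positions labeled with -100
loss = F.cross_entropy(logits, labels)     # ignore_index defaults to -100, so padded positions are skipped
print(loss)
```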
11. Deprecate `FDivergenceType` #
Before, the `f_divergence_type` could be provided either as an enum (`FDivergenceType`) or as a string. The enum adds unnecessary complexity and doesn't bring any real benefit. To stay consistent with `loss_type`, we should standardize on plain strings and deprecate the enum.
In the same spirit, I also removed `FDivergenceConstants`. I didn't add a deprecation path because this enum had no user-facing value: it was only used internally, and the implementation was unnecessarily complex. The new approach is both simpler and more readable, and matches what we now use everywhere else.
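A before/after sketch, assuming `FDivergenceType` is importable from `trl.trainer.dpo_config` (the divergence value is just an example):

```python
from trl import DPOConfig
from trl.trainer.dpo_config import FDivergenceType

# Before: enum value
config = DPOConfig(output_dir="out", f_divergence_type=FDivergenceType.ALPHA_DIVERGENCE)

# After: plain string, consistent with how loss_type is specified
config = DPOConfig(output_dir="out", f_divergence_type="alpha_divergence")
```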
12. Deprecate `DPOConfig.tools` #
To better align with other trainers (SFT, GRPO, RLOO, Reward), we should remove the `tools` argument from `DPOConfig` and instead provide tools per example in the dataset, via a `tools` column consumed by the chat template. A before/after sketch is shown below.
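A hedged before/after sketch (the tool function and dataset fields are illustrative, not the original snippets):

```python
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    ...

# Before: tools passed globally through the config
# config = DPOConfig(tools=[get_weather], ...)

# After: tools provided per example, via a `tools` column consumed by the chat template
example = {
    "prompt": [{"role": "user", "content": "What's the weather in Paris?"}],
    "chosen": [{"role": "assistant", "content": "Let me check the weather for you."}],
    "rejected": [{"role": "assistant", "content": "I can't help with that."}],
    "tools": [get_weather],
}
```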
13. Deprecate `reference_free` #
The `reference_free` argument is redundant with the CPO trainer, in my understanding. It introduces a lot of special cases in the code, making it more complex to maintain. Plus, I see no codebase that uses it. Consequently, I suggest deprecating it to simplify the codebase. Users wanting to do reference-free DPO can use CPO instead.
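For reference, a minimal sketch of switching to the CPO trainer (the model and dataset names are illustrative):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = CPOTrainer(
    model=model,
    args=CPOConfig(output_dir="cpo-output"),
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```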
14. Deprecate `base_model_attribute_name` #
In Liger, we need to retrieve the underlying base model. Today, `base_model_attribute_name` is only used as a fallback when `get_decoder` is unavailable or returns `None`. However:
- `get_decoder` is now a `PreTrainedModel` method, so `hasattr(model, "get_decoder")` is effectively always `True`.
- `PreTrainedModel.get_decoder` now handles the vast majority of cases (including VLMs). See the implementation here and the extension to VLMs in 🚨 Generalize `get_decoder()` for multimodal and delete redundant code 🔪 transformers#42156.
Consequently, we can safely deprecate `base_model_attribute_name`.
15. Deprecate `model_adapter_name` and `ref_adapter_name` #
These arguments were originally meant to select which PEFT adapter to use for the training model and the reference model, mainly for setups where a single model might contain multiple adapters.
In practice, that complexity isn't needed when resuming from a pretrained adapter.
Instead, the recommended flow is:
- load the pretrained adapter into `model` (keeping the default adapter name `"default"`);
- if a distinct reference adapter (e.g. `"ref"`) is needed, load it into a separate model and pass it via `ref_model`.
To keep behavior consistent, training assumes a single adapter named `"default"`. That means custom adapter names are no longer supported, and `model_adapter_name`/`ref_adapter_name` become unnecessary. Deprecating them removes redundant configuration and reduces confusion about how the reference model is produced.
16. Change the default value of `f_alpha_divergence_coef` from 1.0 to 0.5 #
We propose changing the default value of `f_alpha_divergence_coef` from 1.0 to 0.5. In the paper, the authors specify that α should lie in (0, 1), so α = 1 is excluded from the theoretically supported setting. Moreover, the α → 1 limit corresponds to the forward KL boundary case, which is already explicitly available via `f_divergence_type="forward_kl"` when that behavior is desired. In contrast, α = 0.5 sits well inside the valid interval and provides a more balanced trade-off between mode-seeking and mass-covering behavior, making it a safer and more generally robust default.