Remove post-collation truncation from SFT by albertvillanova · Pull Request #5359 · huggingface/trl

albertvillanova · 2026-03-24T08:40:26Z

Remove post-collation truncation from SFT.

Follow-up to:

Add truncation to SFT DataCollatorForLanguageModeling #5315

Related to:

Remove post-collation truncation from DPO #5350

This PR removes internal dataset truncation logic from the SFT trainer and enforces that sequence truncation is handled before padding. This shifts responsibility for truncation to dataset preparation or custom collators, making the data processing pipeline more explicit and less error-prone.

Changes

Removal of internal truncation logic:

Removed the use of the truncate_dataset function in the dataset processing pipeline; the trainer no longer performs truncation internally if packing is not enabled.
Updated comments and logic in the tokenize_fn to reflect that only packing (not truncation) is performed during dataset preparation.
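Since the trainer no longer truncates when packing is disabled, callers who need a length cap have to apply it themselves during dataset preparation. A minimal sketch of one way to do that is below. The helper name `truncate_example`, the field names `input_ids` and `completion_mask`, and the choice to keep the left-most tokens are all assumptions for illustration, not something this PR adds.

```python
from typing import Any


def truncate_example(example: dict[str, Any], max_length: int) -> dict[str, Any]:
    """Clip every list-valued field of a tokenized example to max_length items.

    Hypothetical helper: keeps the first max_length tokens and leaves
    non-list fields untouched.
    """
    return {
        key: value[:max_length] if isinstance(value, list) else value
        for key, value in example.items()
    }


# Field names here are assumed for the example.
example = {"input_ids": [1, 2, 3, 4, 5, 6], "completion_mask": [0, 0, 0, 1, 1, 1]}
truncated = truncate_example(example, max_length=4)
```

With a `datasets.Dataset`, the same function could be applied before the data reaches the trainer, for instance via `dataset.map(truncate_example, fn_kwargs={"max_length": 4})`.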

Documentation and enforcement of truncation responsibility:

Updated the docstring for the data_collator parameter to specify that custom collators must truncate sequences before padding; the trainer will not perform truncation after collation.
Changed error messaging in the constructor to remove the suggestion to "Disable skip_prepare_dataset" and clarify that inputs must be packed/truncated before reaching the collator if padding_free=True.
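To illustrate the "truncate before padding" requirement for custom collators, here is a rough sketch in plain Python. `TruncatingCollator` is a hypothetical class, not part of TRL; it works on lists rather than tensors to stay dependency-free, and it assumes each example carries an `input_ids` list and that `-100` is used as the ignored label value.

```python
from typing import Any


class TruncatingCollator:
    """Hypothetical collator: truncates each sequence first, then pads the batch."""

    def __init__(self, pad_token_id: int, max_length: int):
        self.pad_token_id = pad_token_id
        self.max_length = max_length

    def __call__(self, examples: list[dict[str, Any]]) -> dict[str, list[list[int]]]:
        # Step 1: truncate, so no sequence exceeds max_length.
        sequences = [ex["input_ids"][: self.max_length] for ex in examples]
        # Step 2: pad to the longest *truncated* sequence in the batch.
        width = max(len(seq) for seq in sequences)
        batch: dict[str, list[list[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
        for seq in sequences:
            pad = width - len(seq)
            batch["input_ids"].append(seq + [self.pad_token_id] * pad)
            batch["attention_mask"].append([1] * len(seq) + [0] * pad)
            # Padding positions get -100 so they are ignored by the loss (assumed convention).
            batch["labels"].append(seq + [-100] * pad)
        return batch


collator = TruncatingCollator(pad_token_id=0, max_length=3)
batch = collator([{"input_ids": [1, 2, 3, 4, 5]}, {"input_ids": [6, 7]}])
```

Doing the truncation first means the padded width is bounded by `max_length`, so nothing needs to be cut after collation. A real collator would convert these lists to tensors before returning.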

Note

Medium Risk
Changes SFT data-prep semantics by removing automatic dataset truncation when packing=False, which can increase sequence lengths and affect memory/accuracy unless callers pre-truncate. Adds stricter validation for padding_free=True with max_length, potentially breaking existing configs that relied on implicit enforcement.

Overview
Removes trainer-side dataset truncation from SFTTrainer when packing=False, leaving sequence length control to pre-truncated inputs or custom collators (packing remains the only built-in length enforcement during dataset prep).

Tightens and clarifies configuration rules by raising an error when padding_free=True is used without packing alongside a non-None max_length, and updates the data_collator docstring to state that custom collators must truncate before padding.
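The configuration rule described above can be sketched as a standalone check. This is only an illustration of the condition as summarized here: the function name `check_padding_free_config` and the error wording are made up, and the actual check and message in `SFTTrainer` may differ.

```python
from typing import Optional


def check_padding_free_config(padding_free: bool, packing: bool, max_length: Optional[int]) -> None:
    """Hypothetical stand-in for the constructor check described in the overview."""
    if padding_free and not packing and max_length is not None:
        # Illustrative wording only, not the message used in TRL.
        raise ValueError(
            "padding_free=True without packing cannot enforce max_length; "
            "pack or truncate inputs before they reach the collator."
        )


# Combinations that pass under this reading of the rule.
check_padding_free_config(padding_free=True, packing=True, max_length=512)
check_padding_free_config(padding_free=True, packing=False, max_length=None)
check_padding_free_config(padding_free=False, packing=False, max_length=512)
```

Only the combination `padding_free=True`, `packing=False`, and a non-None `max_length` raises in this sketch.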

Updates the related unit test to reflect the new error condition/message for padding-free + no packing + max_length.

Written by Cursor Bugbot for commit 12f9514.

HuggingFaceDocBuilderDev · 2026-03-24T08:42:52Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


albertvillanova added 4 commits March 24, 2026 09:30

Do not call truncate_dataset in SFTTrainer

2c808cf

Update comment

e03905a

Update ValueError mesage

7291ea8

Update data_collator docstring to specify it must truncate

27c822b

cursor Bot reviewed Mar 24, 2026


Comment thread trl/trainer/sft_trainer.py Outdated

Update ValueError logic and message

d5eef21

qgallouedec approved these changes Mar 24, 2026


Comment thread tests/test_sft_trainer.py Outdated

albertvillanova added 2 commits March 24, 2026 20:14

Apply suggestion

6cc199c

Merge remote-tracking branch 'upstream/main' into fu-5315

12f9514

albertvillanova merged commit 3822674 into huggingface:main Mar 24, 2026
12 checks passed

albertvillanova mentioned this pull request Mar 25, 2026

Move truncate_dataset to experimental #5370

Merged

kashif mentioned this pull request Jun 2, 2026

Fix SFT padding-free test config #5923

Merged

albertvillanova merged 7 commits into huggingface:main from albertvillanova:fu-5315

3 participants
