Remove post-collation truncation from SFT #5359

Merged
albertvillanova merged 7 commits into huggingface:main from albertvillanova:fu-5315 on Mar 24, 2026

@albertvillanova commented Mar 24, 2026

Member

Remove post-collation truncation from SFT.

Follow-up to:

Related to:

This PR removes the internal dataset truncation logic from the SFT trainer and enforces that sequence truncation is handled before padding. Responsibility for truncation now lies with dataset preparation or with custom collators, which makes the data processing pipeline more explicit and less error-prone.

Changes

Removal of internal truncation logic:

  • Removed the use of the truncate_dataset function from the dataset processing pipeline; the trainer no longer truncates internally when packing is disabled.
  • Updated the comments and logic in tokenize_fn to reflect that only packing (not truncation) is performed during dataset preparation.
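
With trainer-side truncation gone for the non-packing path, callers who relied on it would need to truncate during their own dataset preparation. The following is a minimal sketch of that idea, not the code from this PR: `truncate_example` is a hypothetical helper, and the column names are only illustrative.

```python
from typing import Any


def truncate_example(example: dict[str, Any], max_length: int) -> dict[str, Any]:
    """Return a copy of `example` with every list-valued column cut to `max_length`.

    Hypothetical helper for illustration; it is not part of TRL.
    """
    return {
        key: value[:max_length] if isinstance(value, list) else value
        for key, value in example.items()
    }


# Illustrative tokenized example (column names are assumptions).
example = {
    "input_ids": [101, 7592, 2088, 999, 102],
    "completion_mask": [0, 0, 1, 1, 1],
    "source": "demo",
}
truncated = truncate_example(example, max_length=3)
```

Assuming the data lives in a Hugging Face `datasets.Dataset`, a helper like this could be applied with `dataset.map(truncate_example, fn_kwargs={"max_length": 3})` before the dataset is handed to the trainer.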

Documentation and enforcement of truncation responsibility:

  • Updated the docstring for the data_collator parameter to specify that custom collators must truncate sequences before padding; the trainer will not perform truncation after collation.
  • Changed error messaging in the constructor to remove the suggestion to "Disable skip_prepare_dataset" and clarify that inputs must be packed/truncated before reaching the collator if padding_free=True.
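
The "truncate before padding" requirement for custom collators can be sketched as follows. This is an illustrative, dependency-free example under stated assumptions: `truncate_then_pad` is a hypothetical function, it works on plain Python lists rather than tensors, and the padding values (`0` for tokens, `-100` for labels) are assumed defaults rather than values taken from this PR.

```python
def truncate_then_pad(
    examples: list[dict[str, list[int]]],
    max_length: int,
    pad_token_id: int = 0,      # assumed padding token id
    label_pad_id: int = -100,   # assumed "ignore" value for padded label positions
) -> dict[str, list[list[int]]]:
    """Truncate each sequence first, then pad the batch to its longest remaining sequence."""
    # Step 1: truncate, so no sequence can exceed max_length after collation.
    truncated = [ex["input_ids"][:max_length] for ex in examples]

    # Step 2: pad to the longest truncated sequence in this batch.
    width = max(len(ids) for ids in truncated)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in truncated:
        n_pad = width - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        batch["labels"].append(ids + [label_pad_id] * n_pad)
    return batch


batch = truncate_then_pad(
    [{"input_ids": [1, 2, 3, 4, 5]}, {"input_ids": [6, 7]}],
    max_length=3,
)
```

A real collator passed as data_collator would typically return tensors instead of lists; the ordering of the two steps is the point of the sketch.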

Note

Medium Risk
Changes SFT data-prep semantics by removing automatic dataset truncation when packing=False, which can increase sequence lengths and affect memory/accuracy unless callers pre-truncate. Adds stricter validation for padding_free=True with max_length, potentially breaking existing configs that relied on implicit enforcement.

Overview
Removes trainer-side dataset truncation from SFTTrainer when packing=False, leaving sequence length control to pre-truncated inputs or custom collators (packing remains the only built-in length enforcement during dataset prep).

Tightens and clarifies configuration rules by raising an error when padding_free=True is used without packing alongside a non-None max_length, and updates the data_collator docstring to state that custom collators must truncate before padding.
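
The configuration rule described above can be illustrated with a small stand-alone check. This is a sketch of the condition only: `check_padding_free_config` is a hypothetical function, and both the exception type and the message wording are assumptions, not the trainer's actual code.

```python
from typing import Optional


def check_padding_free_config(
    padding_free: bool, packing: bool, max_length: Optional[int]
) -> None:
    """Reject padding-free mode without packing when a max_length is set (illustrative)."""
    if padding_free and not packing and max_length is not None:
        # Assumed exception type and wording.
        raise ValueError(
            "padding_free=True without packing does not enforce max_length; "
            "inputs must be packed or truncated before reaching the collator."
        )


# Combinations that the rule, as described, would accept.
check_padding_free_config(padding_free=True, packing=True, max_length=1024)
check_padding_free_config(padding_free=True, packing=False, max_length=None)
check_padding_free_config(padding_free=False, packing=False, max_length=1024)
```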

Updates the related unit test to reflect the new error condition/message for padding-free + no packing + max_length.

Written by Cursor Bugbot for commit 12f9514.

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread trl/trainer/sft_trainer.py Outdated
Comment thread tests/test_sft_trainer.py Outdated
@albertvillanova albertvillanova merged commit 3822674 into huggingface:main Mar 24, 2026
12 checks passed