Allow pre-tokenized datasets in SFTTrainer by BramVanroy · Pull Request #1520 · huggingface/trl

BramVanroy · 2024-04-11T12:32:59Z

This tiny PR makes it so that datasets that are already tokenized (i.e. that have an input_ids column) can be used as-is in SFTTrainer. This greatly improves flexibility, especially in larger training runs: we can do the data preprocessing (tokenization) on one CPU-heavy server, save the tokenized dataset to disk, and transfer it to a GPU server to use it directly as a dataset.

Note: this only works for datasets from the Hugging Face datasets library, not for PyTorch datasets.
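
As a rough illustration of the idea (not the actual TRL code), the check can be sketched as "does the dataset expose an input_ids column?". The helper names `get_column_names` and `is_pretokenized` and the two stand-in dataset classes below are hypothetical and exist only for this example. It assumes that a datasets-library dataset exposes a `column_names` attribute while a plain PyTorch-style dataset does not, which is also why a column-based check like this cannot see inside a PyTorch dataset.

```python
from typing import Any, List, Optional


def get_column_names(dataset: Any) -> Optional[List[str]]:
    """Return the dataset's column names, or None if it exposes none.

    Assumption: datasets-library datasets expose `column_names`;
    a plain PyTorch-style dataset generally does not.
    """
    names = getattr(dataset, "column_names", None)
    return list(names) if names is not None else None


def is_pretokenized(dataset: Any) -> bool:
    """Hypothetical helper: True if an `input_ids` column is already present."""
    names = get_column_names(dataset)
    return names is not None and "input_ids" in names


class FakeColumnarDataset:
    """Stand-in for a datasets-library dataset; only mimics `column_names`."""

    def __init__(self, columns):
        self.column_names = list(columns)


class FakeTorchStyleDataset:
    """Stand-in for a map-style PyTorch dataset: indexable, no column names."""

    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        return self._rows[idx]


raw = FakeColumnarDataset(["text"])
tokenized = FakeColumnarDataset(["input_ids", "attention_mask"])
torch_style = FakeTorchStyleDataset([{"input_ids": [1, 2, 3]}])

print(is_pretokenized(raw))          # False: still needs tokenization
print(is_pretokenized(tokenized))    # True: can be used as-is
print(is_pretokenized(torch_style))  # False: no column names to inspect
```

With the real datasets library, the CPU-side step would typically tokenize with `Dataset.map`, write the result with `Dataset.save_to_disk`, and read it back on the GPU server with `datasets.load_from_disk` before handing it to the trainer.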

HuggingFaceDocBuilderDev · 2024-04-11T12:47:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

younesbelkada

Thanks for adding pre-tokenized dataset support to TRL's SFTTrainer! Can you also add a few lines to the docs in a follow-up PR? 🙏 Otherwise, happy to do it!

BramVanroy · 2024-04-11T13:00:18Z

@younesbelkada Sure! #1521

younesbelkada · 2024-04-11T13:05:14Z

Thanks @BramVanroy!

allow pre-tokenized datasets

bcd0efb

BramVanroy mentioned this pull request Apr 11, 2024

Allow streaming (datasets.IterableDataset) #1468

Merged

younesbelkada approved these changes Apr 11, 2024

younesbelkada merged commit ebbd37b into huggingface:main Apr 11, 2024

BramVanroy mentioned this pull request Apr 11, 2024

[DOC] Add data description for sfttrainer doc #1521

Merged

lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024

allow pre-tokenized datasets (huggingface#1520)

2f76921

ashokponkumar mentioned this pull request Jun 5, 2024

Add Support for Passing Pretokenized Datasets to TRL foundation-model-stack/fms-hf-tuning#166

Closed

2 tasks

ashokponkumar mentioned this pull request Jun 12, 2024

feat: support pretokenised datasets foundation-model-stack/fms-hf-tuning#191

Closed

yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025

allow pre-tokenized datasets (huggingface#1520)

7b39702

younesbelkada merged 1 commit into huggingface:main from BramVanroy:allow_preprocessed_dataset
