Allow pre-tokenized datasets in SFTTrainer #1520

Merged
younesbelkada merged 1 commit into huggingface:main from BramVanroy:allow_preprocessed_dataset on Apr 11, 2024
@BramVanroy
Contributor

This tiny PR makes it so that datasets that are already tokenized (i.e., that have an input_ids column) can be used as-is. This greatly improves flexibility, especially in larger training runs: we can do the data preprocessing (tokenization) on one CPU-heavy server, save the tokenized dataset to disk, and transfer it to a GPU server to use it directly as a dataset.

Note: this only works for datasets from the Hugging Face `datasets` library, not for PyTorch datasets.
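
To illustrate the workflow described above, here is a minimal, standard-library-only sketch of the idea. It is not the actual SFTTrainer code: `toy_tokenize`, `is_pretokenized`, and `prepare_dataset` are hypothetical names, the whitespace tokenizer is a stand-in for a real one, and a JSON file stands in for a dataset saved to disk. The only detail taken from the PR description is that a dataset with an input_ids column is treated as already tokenized and passed through unchanged.

```python
import json
import tempfile
from pathlib import Path


def toy_tokenize(text, vocab):
    """Hypothetical stand-in for a real tokenizer: maps whitespace-split words to integer ids."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def is_pretokenized(column_names):
    """Assumed check: a dataset counts as pre-tokenized if it has an input_ids column."""
    return "input_ids" in column_names


def prepare_dataset(columns, vocab):
    """Return the dataset untouched if it is already tokenized, otherwise tokenize its text column."""
    if is_pretokenized(columns.keys()):
        return columns
    return {"input_ids": [toy_tokenize(text, vocab) for text in columns["text"]]}


# Step 1 (CPU-heavy server): tokenize once and save the result to disk.
vocab = {}
raw = {"text": ["hello world", "hello again"]}
tokenized = prepare_dataset(raw, vocab)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenized.json"
    path.write_text(json.dumps(tokenized))

    # Step 2 (GPU server): load the saved data; it is passed through as-is.
    loaded = json.loads(path.read_text())
    reused = prepare_dataset(loaded, vocab)
```

With the `datasets` library, the analogous steps would presumably be `Dataset.map` for tokenization, `Dataset.save_to_disk` on the CPU server, and `load_from_disk` on the GPU server before handing the dataset to the trainer.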

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@younesbelkada younesbelkada left a comment

Thanks for adding pre-tokenized dataset support to the TRL SFTTrainer! Can you also add a few lines to the docs in a follow-up PR? 🙏 Otherwise, happy to do it!

@younesbelkada younesbelkada merged commit ebbd37b into huggingface:main Apr 11, 2024
@BramVanroy
Contributor Author

@younesbelkada Sure! #1521

@younesbelkada
Contributor

Thanks @BramVanroy!
