Can `preprocess_data.py` support Huggingface Dataset? #1321

cafeii · 2024-11-13T01:17:55Z

Since many datasets are distributed in the Hugging Face `datasets` format, it would be convenient if `preprocess_data.py` could preprocess and tokenize HF datasets directly.

StellaAthena · 2024-11-20T14:06:23Z

The overwhelming majority of HuggingFace datasets are not structured in a way that makes sense for LLM pretraining. Given that, what do you envision this looking like? Specifying a field name and only training on the text in that field?
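For illustration, the "specify a field name" idea could be sketched as a small conversion step that runs before tokenization. This is only a sketch under stated assumptions, not how `preprocess_data.py` works today: the helper name `write_field_as_jsonl`, its parameters, and the choice of a one-object-per-line JSONL file with a `text` key are all hypothetical, and it assumes the preprocessing script can read such a file.

```python
import json
from typing import Any, Iterable, Mapping


def write_field_as_jsonl(
    records: Iterable[Mapping[str, Any]],
    field: str,
    out_path: str,
    out_key: str = "text",
) -> int:
    """Copy one text field from each record into a JSONL file.

    Hypothetical helper: `records` can be any iterable of dict-like rows,
    which is what iterating over a Hugging Face `datasets.Dataset` yields.
    Rows where the field is missing, not a string, or blank are skipped.
    Returns the number of lines written.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            value = rec.get(field)
            if not isinstance(value, str) or not value.strip():
                continue  # nothing usable to train on in this row
            f.write(json.dumps({out_key: value}, ensure_ascii=False) + "\n")
            written += 1
    return written


# Possible usage with the `datasets` library (dataset name is a placeholder):
#   from datasets import load_dataset
#   ds = load_dataset("some/dataset", split="train")
#   write_field_as_jsonl(ds, field="text", out_path="train.jsonl")
```

Everything outside the chosen field is dropped, which matches the "only training on the text in that field" reading; datasets with several relevant columns would still need a dataset-specific way to combine them.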

cafeii added the `feature request` label (New feature or request) on Nov 13, 2024
