Add HuggingFaceSink data source #5

wengh · 2025-01-27T23:26:18Z

Note: Description below is partially generated using AI.

This pull request introduces a new feature for writing Spark DataFrames to HuggingFace Datasets and includes associated dependencies and tests. The most important changes are the addition of the HuggingFaceSink data source, and new tests for the data source.

New Feature: HuggingFaceSink Data Source

pyspark_huggingface/huggingface_sink.py: Added a new data source huggingfacesink for writing Spark DataFrames to HuggingFace Datasets. This class supports writing data in Parquet format and offers options for authentication, file size limits, etc. It also supports overwrite and append modes.
- User must supply a HuggingFace token.
- The writer writes Arrow batches to Parquet file using a in-memory buffer. We might want to consider writing to a temp file to reduce memory requirements.
- Only the standard split names are supported.

Why is this a separate data source from HuggingFaceDatasets aka `huggingface`?

The writer has an completely different implementation from the reader:

Writer uses the huggingface_hub library rather than datasets library to interact with HuggingFace API.
Writer uses a different set of options.

The low cohesion between the reader and writer implementations makes it difficult to maintain them in a single class. If we want them to share the same name, we can add a wrapper class that delegates to the reader or writer based on the mode.

Dependency Updates

pyproject.toml:
- Added pytest-dotenv for loading HF_TOKEN from .env file in end-to-end tests.
- Added pytest-mock for mocking HuggingFace API in unit tests.

Testing

tests/test_huggingface_writer.py: Added tests for the HuggingFaceSink data source.
- End-to-end tests use real HuggingFace API.
  - HF_TOKEN environment variable (can be supplied in .env) is required to run these tests.
  - These tests create a new dataset on HuggingFace, write data to it, read it back, then delete the dataset.

lhoestq

Cool ! It looks all good to maybe :) Just one comment:

I think it would be easier for users if we mimic the same API as the reader ?

i.e. using .format("huggingface") and using .option(split=...) to specify a split.

(Btw, the current implementation works for splits named train/test/val but later we can also add the option to support arbitrary split names (see docs))

pyspark_huggingface/huggingface_sink.py

wengh · 2025-01-28T20:23:07Z

@lhoestq

i.e. using .format("huggingface") and using .option(split=...) to specify a split.

is it correct that to support arbitrary split name, the split name needs to be a prefix of the file name, instead of grouping the files under a directory with the split name? it seems huggingface only recognizes train, test, validation directory names.

edit: it seems that the files must be named data/<split_name>-xxxxx-of-xxxxx.parquet to be recognized?

lhoestq · 2025-01-29T09:52:54Z

it seems that the files must be named data/<split_name>-xxxxx-of-xxxxx.parquet to be recognized?

Yes this is the correct pattern to allow arbitrary split names.
For more advanced use cases it's also possible to define the mapping in YAML if you prefer but it's a bit more work and maybe not needed here

wengh · 2025-01-29T19:29:37Z

updated the API to take split name, revision, path in repo as option instead of having them all in path

lhoestq

Looks all good ! Can we also use the name huggingface instead of huggingfacesink ? (feel free to merge anyway)

pyspark_huggingface/huggingface_sink.py

wengh · 2025-01-30T00:07:01Z

@lhoestq

Can we also use the name huggingface instead of huggingfacesink?

See the following from the PR description. We can add the wrapper in a future PR.

Why is this a separate data source from HuggingFaceDatasets aka `huggingface`?

The writer has an completely different implementation from the reader:

Writer uses the huggingface_hub library rather than datasets library to interact with HuggingFace API.
Writer uses a different set of options.

The low cohesion between the reader and writer implementations makes it difficult to maintain them in a single class. If we want them to share the same name, we can add a wrapper class that delegates to the reader or writer based on the mode.

lhoestq · 2025-01-30T10:39:46Z

Woohoo ! Congrats on merging this, this will be so useful for the research/AI community :)

Related to the writer's name: I don't think the difference in implementation is a blocker to use the same name, and we can use the same options IMO (and mark some as not implemented for now if needed)

wengh added 9 commits January 23, 2025 15:16

add huggingfacesink data source

1837b25

support overwrite and append modes

d0764a8

limit size of each single parquet file

703581b

fix writing to new split

95d7a9d

improve comments

eb374a4

organize imports

4259225

add tests

6c68ae3

update dependencies

7b4bba9

clean up

c536a69

wengh changed the title ~~[WIP] Add HuggingFaceSink data source~~ Add HuggingFaceSink data source Jan 28, 2025

wengh requested review from allisonwang-db and lhoestq January 28, 2025 00:32

wengh marked this pull request as ready for review January 28, 2025 00:32

lhoestq reviewed Jan 28, 2025

View reviewed changes

pyspark_huggingface/huggingface_sink.py Outdated Show resolved Hide resolved

wengh added 2 commits January 29, 2025 10:14

pass split as a separate option

77f0667

add comment explaining split name support

fa3bdd9

wengh requested a review from lhoestq January 29, 2025 19:28

lhoestq approved these changes Jan 29, 2025

View reviewed changes

pyspark_huggingface/huggingface_sink.py Outdated Show resolved Hide resolved

wengh added 2 commits January 29, 2025 15:51

fix comment

3cb2a0b

fix path separator

104af7c

wengh merged commit 64da968 into main Jan 30, 2025

wengh deleted the haoyu-writer branch January 30, 2025 00:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add HuggingFaceSink data source #5

Add HuggingFaceSink data source #5

Uh oh!

wengh commented Jan 27, 2025 •

edited

Loading

Uh oh!

lhoestq left a comment

Uh oh!

Uh oh!

wengh commented Jan 28, 2025 •

edited

Loading

Uh oh!

lhoestq commented Jan 29, 2025

Uh oh!

wengh commented Jan 29, 2025

Uh oh!

lhoestq left a comment

Uh oh!

Uh oh!

wengh commented Jan 30, 2025

Uh oh!

lhoestq commented Jan 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Add HuggingFaceSink data source #5

Add HuggingFaceSink data source #5

Uh oh!

Conversation

wengh commented Jan 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

New Feature: HuggingFaceSink Data Source

Why is this a separate data source from HuggingFaceDatasets aka huggingface?

Dependency Updates

Testing

Uh oh!

lhoestq left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wengh commented Jan 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lhoestq commented Jan 29, 2025

Uh oh!

wengh commented Jan 29, 2025

Uh oh!

lhoestq left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wengh commented Jan 30, 2025

Why is this a separate data source from HuggingFaceDatasets aka huggingface?

Uh oh!

lhoestq commented Jan 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

wengh commented Jan 27, 2025 •

edited

Loading

Why is this a separate data source from HuggingFaceDatasets aka `huggingface`?

wengh commented Jan 28, 2025 •

edited

Loading

Why is this a separate data source from HuggingFaceDatasets aka `huggingface`?