Skip to content

Conversation

@wengh
Copy link
Collaborator

@wengh wengh commented Jan 27, 2025

Note: Description below is partially generated using AI.

This pull request introduces a new feature for writing Spark DataFrames to HuggingFace Datasets and includes associated dependencies and tests. The most important changes are the addition of the HuggingFaceSink data source, and new tests for the data source.

New Feature: HuggingFaceSink Data Source

  • pyspark_huggingface/huggingface_sink.py: Added a new data source huggingfacesink for writing Spark DataFrames to HuggingFace Datasets. This class supports writing data in Parquet format and offers options for authentication, file size limits, etc. It also supports overwrite and append modes.
    • User must supply a HuggingFace token.
    • The writer writes Arrow batches to Parquet file using a in-memory buffer. We might want to consider writing to a temp file to reduce memory requirements.
    • Only the standard split names are supported.

Why is this a separate data source from HuggingFaceDatasets aka huggingface?

The writer has an completely different implementation from the reader:

  • Writer uses the huggingface_hub library rather than datasets library to interact with HuggingFace API.
  • Writer uses a different set of options.

The low cohesion between the reader and writer implementations makes it difficult to maintain them in a single class. If we want them to share the same name, we can add a wrapper class that delegates to the reader or writer based on the mode.

Dependency Updates

  • pyproject.toml:
    • Added pytest-dotenv for loading HF_TOKEN from .env file in end-to-end tests.
    • Added pytest-mock for mocking HuggingFace API in unit tests.

Testing

  • tests/test_huggingface_writer.py: Added tests for the HuggingFaceSink data source.
    • End-to-end tests use real HuggingFace API.
      • HF_TOKEN environment variable (can be supplied in .env) is required to run these tests.
      • These tests create a new dataset on HuggingFace, write data to it, read it back, then delete the dataset.

@wengh wengh changed the title [WIP] Add HuggingFaceSink data source Add HuggingFaceSink data source Jan 28, 2025
@wengh wengh marked this pull request as ready for review January 28, 2025 00:32
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool ! It looks all good to maybe :) Just one comment:

I think it would be easier for users if we mimic the same API as the reader ?

i.e. using .format("huggingface") and using .option(split=...) to specify a split.

(Btw, the current implementation works for splits named train/test/val but later we can also add the option to support arbitrary split names (see docs))

@wengh
Copy link
Collaborator Author

wengh commented Jan 28, 2025

@lhoestq

i.e. using .format("huggingface") and using .option(split=...) to specify a split.

is it correct that to support arbitrary split name, the split name needs to be a prefix of the file name, instead of grouping the files under a directory with the split name? it seems huggingface only recognizes train, test, validation directory names.

edit: it seems that the files must be named data/<split_name>-xxxxx-of-xxxxx.parquet to be recognized?

@lhoestq
Copy link
Member

lhoestq commented Jan 29, 2025

it seems that the files must be named data/<split_name>-xxxxx-of-xxxxx.parquet to be recognized?

Yes this is the correct pattern to allow arbitrary split names.
For more advanced use cases it's also possible to define the mapping in YAML if you prefer but it's a bit more work and maybe not needed here

@wengh wengh requested a review from lhoestq January 29, 2025 19:28
@wengh
Copy link
Collaborator Author

wengh commented Jan 29, 2025

updated the API to take split name, revision, path in repo as option instead of having them all in path

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good ! Can we also use the name huggingface instead of huggingfacesink ? (feel free to merge anyway)

@wengh
Copy link
Collaborator Author

wengh commented Jan 30, 2025

@lhoestq

Can we also use the name huggingface instead of huggingfacesink?

See the following from the PR description. We can add the wrapper in a future PR.

Why is this a separate data source from HuggingFaceDatasets aka huggingface?

The writer has an completely different implementation from the reader:

  • Writer uses the huggingface_hub library rather than datasets library to interact with HuggingFace API.
  • Writer uses a different set of options.

The low cohesion between the reader and writer implementations makes it difficult to maintain them in a single class. If we want them to share the same name, we can add a wrapper class that delegates to the reader or writer based on the mode.

@wengh wengh merged commit 64da968 into main Jan 30, 2025
@wengh wengh deleted the haoyu-writer branch January 30, 2025 00:34
@lhoestq
Copy link
Member

lhoestq commented Jan 30, 2025

Woohoo ! Congrats on merging this, this will be so useful for the research/AI community :)

Related to the writer's name: I don't think the difference in implementation is a blocker to use the same name, and we can use the same options IMO (and mark some as not implemented for now if needed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants