Add HuggingFaceSink data source #5
Cool! It all looks good to me :) Just one comment:
I think it would be easier for users if we mimicked the same API as the reader,
i.e. using .format("huggingface") and using .option(split=...) to specify a split.
(Btw, the current implementation works for splits named train/test/val, but later we can also add an option to support arbitrary split names (see docs).)
Is it correct that, to support an arbitrary split name, the split name needs to be a prefix of the file name, instead of the files being grouped under a directory with the split name? It seems HuggingFace only recognizes the train, test, and validation directory names. Edit: it seems that the files must be named
Yes, this is the correct pattern to allow arbitrary split names.
Updated the API to take the split name, revision, and path in repo as options instead of having them all in the path.
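To illustrate the option-based API described in the comment above, here is a minimal sketch of a write call. The option names (split, revision, path_in_repo, token), the helper names sink_options and write_to_hub, and the use of the repo id as the save path are assumptions made for illustration; the PR's actual option names may differ.

```python
from typing import Any, Dict, Optional


def sink_options(
    split: str = "train",
    revision: Optional[str] = None,
    path_in_repo: Optional[str] = None,
    token: Optional[str] = None,
) -> Dict[str, str]:
    """Collect writer options, dropping any that were not set.

    The option names here are assumed, not taken from the PR.
    """
    candidates = {
        "split": split,
        "revision": revision,
        "path_in_repo": path_in_repo,
        "token": token,
    }
    return {key: value for key, value in candidates.items() if value is not None}


def write_to_hub(df: Any, repo_id: str, mode: str = "overwrite", **opts: Any) -> None:
    """Write a Spark DataFrame through the sink data source.

    Assumes the data source is registered under the name "huggingfacesink"
    and that the save path is the dataset repo id.
    """
    (
        df.write.format("huggingfacesink")
        .mode(mode)
        .options(**sink_options(**opts))
        .save(repo_id)
    )
```

With a real DataFrame, a call might look like write_to_hub(df, "username/my-dataset", split="train"), where the repo id is a placeholder.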
Looks all good! Can we also use the name huggingface instead of huggingfacesink? (Feel free to merge anyway.)
See the following from the PR description. We can add the wrapper in a future PR. Why is this a separate data source from HuggingFaceDatasets aka
Woohoo! Congrats on merging this, this will be so useful for the research/AI community :) Regarding the writer's name: I don't think the difference in implementation is a blocker to using the same name, and we can use the same options IMO (and mark some as not implemented for now if needed).
Note: Description below is partially generated using AI.
This pull request introduces a new feature for writing Spark DataFrames to HuggingFace Datasets and includes associated dependencies and tests. The most important changes are the addition of the HuggingFaceSink data source and new tests for the data source.
New Feature: HuggingFaceSink Data Source
pyspark_huggingface/huggingface_sink.py: Added a new data source, huggingfacesink, for writing Spark DataFrames to HuggingFace Datasets. This class supports writing data in Parquet format and offers options for authentication, file size limits, etc. It also supports overwrite and append modes.
Why is this a separate data source from HuggingFaceDatasets aka huggingface?
The writer has a completely different implementation from the reader:
It uses the huggingface_hub library rather than the datasets library to interact with the HuggingFace API.
The low cohesion between the reader and writer implementations makes it difficult to maintain them in a single class. If we want them to share the same name, we can add a wrapper class that delegates to the reader or writer based on the mode.
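The wrapper idea mentioned above could be sketched roughly as follows. PySpark's Python Data Source API has a data source expose reader(schema) and writer(schema, overwrite), so a single class registered as huggingface could forward each call to a separate implementation. ReaderSource, WriterSource, and HuggingFaceSource are hypothetical stand-ins written in plain Python so the sketch runs without Spark; they are not the classes in this PR.

```python
from typing import Any, Dict


class ReaderSource:
    """Hypothetical stand-in for the existing reader data source."""

    def __init__(self, options: Dict[str, str]) -> None:
        self.options = options

    def reader(self, schema: Any) -> str:
        # A real implementation would return a data source reader object.
        return f"reader(split={self.options.get('split')})"


class WriterSource:
    """Hypothetical stand-in for the sink data source."""

    def __init__(self, options: Dict[str, str]) -> None:
        self.options = options

    def writer(self, schema: Any, overwrite: bool) -> str:
        # A real implementation would return a data source writer object.
        mode = "overwrite" if overwrite else "append"
        return f"writer(mode={mode})"


class HuggingFaceSource:
    """Single entry point that delegates to the reader or the writer."""

    def __init__(self, options: Dict[str, str]) -> None:
        self.options = options

    @classmethod
    def name(cls) -> str:
        # One shared format name for both reading and writing.
        return "huggingface"

    def reader(self, schema: Any) -> str:
        return ReaderSource(self.options).reader(schema)

    def writer(self, schema: Any, overwrite: bool) -> str:
        return WriterSource(self.options).writer(schema, overwrite)
```

Keeping the two implementations in separate classes preserves their independence, while the wrapper only decides which one handles a given call.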
Dependency Updates
pyproject.toml: Added pytest-dotenv for loading HF_TOKEN from a .env file in end-to-end tests, and pytest-mock for mocking the HuggingFace API in unit tests.
Testing
tests/test_huggingface_writer.py: Added tests for the HuggingFaceSink data source. The HF_TOKEN environment variable (can be supplied in .env) is required to run these tests.
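As a rough illustration of the unit-test approach, the sketch below mocks the HuggingFace API so that neither a network connection nor HF_TOKEN is needed. It uses the standard library's unittest.mock directly instead of the pytest-mock plugin, so it runs without extra dependencies. The helper upload_parquet and the file name part-0.parquet are hypothetical and are not taken from the PR's tests; the mock stands in for a huggingface_hub.HfApi instance.

```python
from unittest.mock import MagicMock


def upload_parquet(api, repo_id: str, path_in_repo: str, data: bytes) -> None:
    """Hypothetical helper: upload one Parquet payload to a dataset repo."""
    api.upload_file(
        path_or_fileobj=data,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )


def test_upload_targets_dataset_repo() -> None:
    # The mock records calls instead of contacting the Hub.
    api = MagicMock()
    upload_parquet(api, "user/repo", "part-0.parquet", b"PAR1")
    api.upload_file.assert_called_once_with(
        path_or_fileobj=b"PAR1",
        path_in_repo="part-0.parquet",
        repo_id="user/repo",
        repo_type="dataset",
    )
```

End-to-end tests, by contrast, would talk to the real service and therefore need HF_TOKEN to be set.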