|  | 
| 1 |  | -# pyspark_huggingface | 
| 2 |  | -PySpark custom data source for Hugging Face Datasets | 
|  | 1 | +<p align="center"> | 
|  | 2 | +  <img alt="Hugging Face x Spark" src="https://pbs.twimg.com/media/FvN1b_2XwAAWI1H?format=jpg&name=large" width="352" style="max-width: 100%;"> | 
|  | 3 | +  <br/> | 
|  | 4 | +  <br/> | 
|  | 5 | +</p> | 
|  | 6 | + | 
|  | 7 | +<p align="center"> | 
|  | 8 | +    <a href="https://github.com/huggingface/pyspark_huggingface/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/pyspark_huggingface.svg"></a> | 
|  | 9 | +    <a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a> | 
|  | 10 | +</p> | 
|  | 11 | + | 
|  | 12 | +# Spark Data Source for Hugging Face Datasets | 
|  | 13 | + | 
|  | 14 | +A Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingface.co/datasets): | 
|  | 15 | + | 
|  | 16 | +- Stream datasets from Hugging Face as Spark DataFrames | 
|  | 17 | +- Select subsets and splits, apply projection and predicate filters | 
|  | 18 | +- Save Spark DataFrames as Parquet files to Hugging Face | 
|  | 19 | +- Fully distributed | 
|  | 20 | +- Authentication via `huggingface-cli login` or tokens | 
|  | 21 | +- Compatible with Spark 4 (with auto-import) | 
|  | 22 | +- Backport for Spark 3.5, 3.4 and 3.3 | 
|  | 23 | + | 
|  | 24 | +## Installation | 
|  | 25 | + | 
|  | 26 | +``` | 
|  | 27 | +pip install pyspark_huggingface | 
|  | 28 | +``` | 
|  | 29 | + | 
|  | 30 | +## Usage | 
|  | 31 | + | 
|  | 32 | +Load a dataset (here [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)): | 
|  | 33 | + | 
|  | 34 | +```python | 
|  | 35 | +df = spark.read.format("huggingface").load("stanfordnlp/imdb") | 
|  | 36 | +``` | 
|  | 37 | + | 
|  | 38 | +Save to Hugging Face: | 
|  | 39 | + | 
|  | 40 | +```python | 
|  | 41 | +# Log in first with `huggingface-cli login` | 
|  | 42 | +df.write.format("huggingface").save("username/my_dataset") | 
|  | 43 | +# Or pass a token manually | 
|  | 44 | +df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset") | 
|  | 45 | +``` | 
|  | 46 | + | 
|  | 47 | +## Advanced | 
|  | 48 | + | 
|  | 49 | +Select a split: | 
|  | 50 | + | 
|  | 51 | +```python | 
|  | 52 | +test_df = ( | 
|  | 53 | +    spark.read.format("huggingface") | 
|  | 54 | +    .option("split", "test") | 
|  | 55 | +    .load("stanfordnlp/imdb") | 
|  | 56 | +) | 
|  | 57 | +``` | 
|  | 58 | + | 
|  | 59 | +Select a subset/config: | 
|  | 60 | + | 
|  | 61 | +```python | 
|  | 62 | +df = ( | 
|  | 63 | +    spark.read.format("huggingface") | 
|  | 64 | +    .option("config", "sample-10BT") | 
|  | 65 | +    .load("HuggingFaceFW/fineweb-edu") | 
|  | 66 | +) | 
|  | 67 | +``` | 
|  | 68 | + | 
|  | 69 | +Filter columns and rows (especially efficient for Parquet datasets): | 
|  | 70 | + | 
|  | 71 | +```python | 
|  | 72 | +df = ( | 
|  | 73 | +    spark.read.format("huggingface") | 
|  | 74 | +    .option("filters", '[("language_score", ">", 0.99)]') | 
|  | 75 | +    .option("columns", '["text", "language_score"]') | 
|  | 76 | +    .load("HuggingFaceFW/fineweb-edu") | 
|  | 77 | +) | 
|  | 78 | +``` | 
|  | 79 | + | 
|  | 80 | +## Backport | 
|  | 81 | + | 
|  | 82 | +While the Data Source API was introduced in Spark 4, this package includes a backport for older versions. | 
|  | 83 | + | 
|  | 84 | +Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3: | 
|  | 85 | + | 
|  | 86 | +```python | 
|  | 87 | +>>> import pyspark_huggingface | 
|  | 88 | +huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4) | 
|  | 89 | +``` | 
|  | 90 | + | 
|  | 91 | +The import is only necessary on Spark 3.x to enable the backport. | 
|  | 92 | +Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed, and registers the "huggingface" data source. | 