Commit c0fbd4c

Merge pull request #11 from huggingface/backport
Backport data source to pyspark 3
2 parents ff7a265 + 37e6051 commit c0fbd4c

8 files changed, +416 -117 lines changed


README.md

Lines changed: 92 additions & 2 deletions

Replaces the previous two-line README (`# pyspark_huggingface` / "PySpark custom data source for Hugging Face Datasets") with:

<p align="center">
  <img alt="Hugging Face x Spark" src="https://pbs.twimg.com/media/FvN1b_2XwAAWI1H?format=jpg&name=large" width="352" style="max-width: 100%;">
  <br/>
  <br/>
</p>

<p align="center">
  <a href="https://github.com/huggingface/pyspark_huggingface/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/pyspark_huggingface.svg"></a>
  <a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a>
</p>

# Spark Data Source for Hugging Face Datasets

A Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingface.co/datasets):

- Stream datasets from Hugging Face as Spark DataFrames
- Select subsets and splits, apply projection and predicate filters
- Save Spark DataFrames as Parquet files to Hugging Face
- Fully distributed
- Authentication via `huggingface-cli login` or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3

## Installation

```
pip install pyspark_huggingface
```

## Usage

Load a dataset (here [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)):

```python
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```
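
The result is a regular Spark DataFrame, so the usual DataFrame API applies. A minimal usage sketch:

```python
df.show(5)        # preview the first 5 rows
df.printSchema()  # inspect the column names and types
```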

Save to Hugging Face:

```python
# Login with huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
```
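
For instance, a toy end-to-end sketch; the rows and column names below are made up, and `username/my_dataset` stands for a dataset repo you can write to:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative DataFrame with two text/label rows.
df = spark.createDataFrame(
    [("I loved this movie.", 1), ("Not my cup of tea.", 0)],
    ["text", "label"],
)
df.write.format("huggingface").save("username/my_dataset")
```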

## Advanced

Select a split:

```python
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
```

Select a subset/config:

```python
df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```

Filter columns and rows (especially efficient for Parquet datasets):

```python
df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
```
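
These options can also be combined in a single read; for example, a sketch (assuming the options compose as each is shown individually above) that selects the `sample-10BT` subset and filters it in the same pass:

```python
df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")                          # subset/config
    .option("filters", '[("language_score", ">", 0.99)]')     # row filter
    .option("columns", '["text", "language_score"]')          # column projection
    .load("HuggingFaceFW/fineweb-edu")
)
```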

## Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```

The import is only necessary on Spark 3.x to enable the backport.
Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed, and registers the "huggingface" data source.
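
On Spark 3.x, a session would therefore typically look like this (a minimal sketch; the app name is arbitrary, and the DataFrame API calls are the same as on Spark 4):

```python
import pyspark_huggingface  # prints: huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hf-datasets").getOrCreate()
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```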
