
Commit 0d8eae0

Add fenic-datasets integration (#1936)
* Add fenic integration documentation
* minor fixes
* Update fenic session examples with minimal configs
* docs: clarify schema merging note
1 parent 1916b23 commit 0d8eae0

File tree

3 files changed: +240 -0 lines changed


docs/hub/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -248,6 +248,8 @@
       title: Perform vector similarity search
     - local: datasets-embedding-atlas
       title: Embedding Atlas
+    - local: datasets-fenic
+      title: fenic
     - local: datasets-fiftyone
       title: FiftyOne
     - local: datasets-pandas

docs/hub/datasets-fenic.md

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
# fenic

[fenic](https://github.com/typedef-ai/fenic) is a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. fenic can read datasets directly from the Hugging Face Hub.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/fenic_hf.png"/>
</div>

## Getting Started

To get started, pip install `fenic`:

```bash
pip install fenic
```

### Create a Session

Instantiate a fenic session with the default configuration (sufficient for reading datasets and other non-semantic operations):

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())
```

## Overview

fenic is an opinionated data processing framework that combines:

- **DataFrame API**: PySpark-inspired operations for familiar data manipulation
- **Semantic Operations**: Built-in AI/LLM operations including semantic functions, embeddings, and clustering
- **Model Integration**: Native support for AI providers (Anthropic, OpenAI, Cohere, Google)
- **Query Optimization**: Automatic optimization through logical plan transformations (see the sketch below)
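For instance (a minimal sketch reusing the IMDB dataset from the examples later in this guide), transformations compose into a logical plan that fenic can optimize before anything runs:

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Each transformation extends the logical plan; nothing executes yet
reviews = (
    session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet")
    .filter(fc.col("label") == 1)
    .select("text", "label")
)

# Requesting results triggers plan optimization and execution
reviews.show()
```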
## Read from Hugging Face Hub

fenic can read datasets directly from the Hugging Face Hub using the `hf://` protocol. This functionality is built into fenic's `DataFrameReader` interface.

### Supported Formats

fenic supports reading the following formats from Hugging Face:

- **Parquet files** (`.parquet`)
- **CSV files** (`.csv`)

### Reading Datasets

To read a dataset from the Hugging Face Hub:

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Read a CSV file from a public dataset
df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv")

# Read Parquet files using glob patterns
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Read from a specific dataset revision (here, the auto-converted Parquet branch)
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet")
```

### Reading with Schema Management

```python
# Read multiple CSV files with schema merging
df = session.read.csv("hf://datasets/username/dataset_name/*.csv", merge_schemas=True)

# Read multiple Parquet files with schema merging
df = session.read.parquet("hf://datasets/username/dataset_name/*.parquet", merge_schemas=True)
```

> **Note:** In fenic, a schema is the set of column names and their data types. When you enable `merge_schemas`, fenic tries to reconcile differences across files by filling missing columns with nulls and widening types where it can. Some layouts still cannot be merged; consult the fenic docs for [CSV schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.csv) and [Parquet schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.parquet).
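As a self-contained illustration of the merge behavior (a sketch with hypothetical local demo files; the file and column names here are made up):

```python
import csv
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Two local CSVs with overlapping but unequal schemas (hypothetical demo files)
with open("demo_a.csv", "w", newline="") as f:
    csv.writer(f).writerows([["id", "text"], [1, "hello"]])
with open("demo_b.csv", "w", newline="") as f:
    csv.writer(f).writerows([["id", "text", "label"], [2, "world", 1]])

# With merge_schemas=True, rows from demo_a.csv get a null "label" column
df = session.read.csv("./demo_*.csv", merge_schemas=True)
df.show()
```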
### Authentication

To read private datasets, you need to set your Hugging Face token as an environment variable:

```shell
export HF_TOKEN="your_hugging_face_token_here"
```
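If you prefer to set the token from Python (a minimal sketch; fenic reads `HF_TOKEN` from the environment, so set it before the first read):

```python
import os

# Must be set before any hf:// read against a private dataset
os.environ["HF_TOKEN"] = "your_hugging_face_token_here"
```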
### Path Format

The Hugging Face path format in fenic follows this structure:

```
hf://{repo_type}/{repo_id}/{path_to_file}
```

You can also specify dataset revisions or versions:

```
hf://{repo_type}/{repo_id}@{revision}/{path_to_file}
```

Features:

- Supports glob patterns (`*`, `**`)
- Dataset revisions/versions using `@` notation (see the examples below):
  - Specific commit: `@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e`
  - Branch: `@refs/convert/parquet`
  - Branch alias: `@~parquet`
- Requires the `HF_TOKEN` environment variable for private datasets
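For instance, filling the template with the placeholder `username/dataset_name` repo used earlier:

```python
# These two reads are equivalent: full branch name vs. its alias
df = session.read.parquet("hf://datasets/username/dataset_name@refs/convert/parquet/**/*.parquet")
df = session.read.parquet("hf://datasets/username/dataset_name@~parquet/**/*.parquet")

# Pin to a specific commit for reproducible reads
df = session.read.parquet(
    "hf://datasets/username/dataset_name@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e/data.parquet"
)
```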
### Mixing Data Sources

fenic allows you to combine multiple data sources in a single read operation, including mixing different protocols:

```python
# Mix HF and local files in one read call
df = session.read.parquet([
    "hf://datasets/cais/mmlu/astronomy/*.parquet",
    "file:///local/data/*.parquet",
    "./relative/path/data.parquet"
])
```

This lets you seamlessly combine data from the Hugging Face Hub and local files in your data processing pipeline.

## Processing Data from Hugging Face

Once loaded from Hugging Face, you can use fenic's full DataFrame API:

### Basic DataFrame Operations

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Load the IMDB dataset from Hugging Face
df = session.read.parquet("hf://datasets/imdb/plain_text/train-*.parquet")

# Filter and select
positive_reviews = df.filter(fc.col("label") == 1).select("text", "label")

# Group by and aggregate
label_counts = df.group_by("label").agg(
    fc.count("*").alias("count")
)
```
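As elsewhere in fenic, these transformations run when you ask for output; for example:

```python
# Trigger execution and print a preview of the aggregated counts
label_counts.show()
```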
### AI-Powered Operations

To use semantic and embedding operations, configure language and embedding models in your `SessionConfig`:

```python
import fenic as fc

# Requires OPENAI_API_KEY to be set for language and embedding calls
session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={
                "gpt-4o-mini": fc.OpenAILanguageModel(
                    model_name="gpt-4o-mini",
                    rpm=60,
                    tpm=60000,
                )
            },
            embedding_models={
                "text-embedding-3-small": fc.OpenAIEmbeddingModel(
                    model_name="text-embedding-3-small",
                    rpm=60,
                    tpm=60000,
                )
            },
        )
    )
)

# Load a text dataset from Hugging Face
df = session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet")

# Add embeddings to text columns
df_with_embeddings = df.select(
    "*",
    fc.semantic.embed(fc.col("text")).alias("embedding")
)

# Apply semantic functions for sentiment analysis
df_analyzed = df_with_embeddings.select(
    "*",
    fc.semantic.analyze_sentiment(
        fc.col("text"),
        model_alias="gpt-4o-mini",  # Optional: specify model
    ).alias("sentiment")
)
```
## Example: Analyzing MMLU Dataset

```python
import fenic as fc

# Requires OPENAI_API_KEY to be set for semantic calls
session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={
                "gpt-4o-mini": fc.OpenAILanguageModel(
                    model_name="gpt-4o-mini",
                    rpm=60,
                    tpm=60000,
                )
            },
        )
    )
)

# Load the MMLU astronomy subset from Hugging Face
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Process the data
processed_df = (
    df
    # Filter for specific criteria
    .filter(fc.col("subject") == "astronomy")
    # Select relevant columns
    .select("question", "choices", "answer")
    # Add difficulty analysis using semantic.map
    .select(
        "*",
        fc.semantic.map(
            "Rate the difficulty of this question from 1-5: {{question}}",
            question=fc.col("question"),
            model_alias="gpt-4o-mini",  # Optional: specify model
        ).alias("difficulty")
    )
)

# Show results
processed_df.show()
```
## Resources

- [fenic GitHub Repository](https://github.com/typedef-ai/fenic)
- [fenic Documentation](https://docs.fenic.ai/latest/)

docs/hub/datasets-libraries.md

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ The table below summarizes the supported libraries and their level of integration.
 | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. |||
 | [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. |||
 | [Embedding Atlas](./datasets-embedding-atlas) | Interactive visualization and exploration tool for large embeddings. |||
+| [fenic](./datasets-fenic) | PySpark-inspired DataFrame framework for building production AI and agentic applications. |||
 | [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. |||
 | [Pandas](./datasets-pandas) | Python data analysis toolkit. |||
 | [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. |||
