2 changes: 2 additions & 0 deletions docs/hub/_toctree.yml
@@ -235,6 +235,8 @@
  title: Perform vector similarity search
- local: datasets-embedding-atlas
  title: Embedding Atlas
- local: datasets-fenic
  title: fenic
- local: datasets-fiftyone
  title: FiftyOne
- local: datasets-pandas
235 changes: 235 additions & 0 deletions docs/hub/datasets-fenic.md
@@ -0,0 +1,235 @@
# fenic

[fenic](https://github.com/typedef-ai/fenic) is a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. It supports reading datasets directly from the Hugging Face Hub.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/fenic_hf.png"/>
</div>

## Getting Started

To get started, pip install `fenic`:

```bash
pip install fenic
```

### Create a Session

Instantiate a fenic session with the default configuration (sufficient for reading datasets and other non-semantic operations):

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())
```

## Overview

fenic is an opinionated data processing framework that combines:
- **DataFrame API**: PySpark-inspired operations for familiar data manipulation
- **Semantic Operations**: Built-in AI/LLM operations including semantic functions, embeddings, and clustering
- **Model Integration**: Native support for AI providers (Anthropic, OpenAI, Cohere, Google)
- **Query Optimization**: Automatic optimization through logical plan transformations
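
Like PySpark, fenic evaluates DataFrames lazily: transformations build up a logical plan that is optimized as a whole and only executed when results are materialized. A minimal sketch of that behavior, using the reader described in the next section (the column names `kind` and `sound` are illustrative):

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# filter/select only build a logical plan; no work runs yet
df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv")
plan = df.filter(fc.col("kind") == "dog").select("kind", "sound")

# Execution happens when results are requested
plan.show()
```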

## Read from Hugging Face Hub

fenic can read datasets directly from the Hugging Face Hub using the `hf://` protocol. This functionality is built into fenic's DataFrameReader interface.

### Supported Formats

fenic supports reading the following formats from Hugging Face:
- **Parquet files** (`.parquet`)
- **CSV files** (`.csv`)

### Reading Datasets

To read a dataset from the Hugging Face Hub:

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Read a CSV file from a public dataset
df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv")

# Read Parquet files using glob patterns
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Read from a specific dataset revision (here, the `~parquet` auto-converted Parquet branch alias)
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet")
```

### Reading with Schema Management

When reading multiple files whose schemas differ, pass `merge_schemas=True` to combine them into a single unified schema:

```python
# Read multiple CSV files with schema merging
df = session.read.csv("hf://datasets/username/dataset_name/*.csv", merge_schemas=True)

# Read multiple Parquet files with schema merging
df = session.read.parquet("hf://datasets/username/dataset_name/*.parquet", merge_schemas=True)
```

### Authentication

To read private datasets, you need to set your Hugging Face token as an environment variable:

```shell
export HF_TOKEN="your_hugging_face_token_here"
```
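
Equivalently, you can set the token from Python before creating the session. A minimal sketch (`your-username/private-dataset` is a hypothetical repo id):

```python
import os

import fenic as fc

# Set the token before creating the session
# (alternatively, export HF_TOKEN in the shell as shown above)
os.environ["HF_TOKEN"] = "your_hugging_face_token_here"

session = fc.Session.get_or_create(fc.SessionConfig())

# Hypothetical private repo; requires a token with read access to it
df = session.read.parquet("hf://datasets/your-username/private-dataset/*.parquet")
```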

### Path Format

The Hugging Face path format in fenic follows this structure:
```
hf://{repo_type}/{repo_id}/{path_to_file}
```

You can also specify dataset revisions or versions:
```
hf://{repo_type}/{repo_id}@{revision}/{path_to_file}
```

Features:
- Supports glob patterns (`*`, `**`)
- Dataset revisions/versions using `@` notation:
- Specific commit: `@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e`
- Branch: `@refs/convert/parquet`
- Branch alias: `@~parquet`
- Requires `HF_TOKEN` environment variable for private datasets
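
As a concrete sketch, here is the `@` notation applied to the datasets used above (the commit hash is the illustrative one from the list, not a real revision of this repo):

```python
# Pin to an exact commit (hash is illustrative)
df = session.read.parquet(
    "hf://datasets/cais/mmlu@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e/astronomy/*.parquet"
)

# Read the auto-converted Parquet branch by its full ref name...
df = session.read.parquet(
    "hf://datasets/datasets-examples/doc-formats-csv-1@refs/convert/parquet/**/*.parquet"
)

# ...or by its shorthand alias
df = session.read.parquet(
    "hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet"
)
```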

### Mixing Data Sources

fenic allows you to combine multiple data sources in a single read operation, including mixing different protocols:

```python
# Mix HF and local files in one read call
df = session.read.parquet([
    "hf://datasets/cais/mmlu/astronomy/*.parquet",
    "file:///local/data/*.parquet",
    "./relative/path/data.parquet"
])
```

This makes it straightforward to combine Hub-hosted and local files within a single data pipeline.

## Processing Data from Hugging Face

Once loaded from Hugging Face, you can use fenic's full DataFrame API:

### Basic DataFrame Operations

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Load IMDB dataset from Hugging Face
df = session.read.parquet("hf://datasets/imdb/plain_text/train-*.parquet")

# Filter and select
positive_reviews = df.filter(fc.col("label") == 1).select("text", "label")

# Group by and aggregate
label_counts = df.group_by("label").agg(
    fc.count("*").alias("count")
)
```
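
Since transformations are lazy, call an action to trigger execution, such as `show()` (used throughout these examples) or a conversion to an in-memory table. A short sketch; `to_polars()` is an assumption about the conversion API, so check the fenic docs for the converters your version provides:

```python
# Trigger execution and print a sample of the aggregated result
label_counts.show()

# Materialize to an in-memory table for downstream use
# (to_polars() is assumed here; fenic exposes similar converters)
counts = label_counts.to_polars()
print(counts)
```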

### AI-Powered Operations

To use semantic and embedding operations, configure language and embedding models in your `SessionConfig`:

```python
import fenic as fc

# Requires OPENAI_API_KEY to be set for language and embedding calls
session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={
                "gpt-4o-mini": fc.OpenAILanguageModel(
                    model_name="gpt-4o-mini",
                    rpm=60,
                    tpm=60000,
                )
            },
            embedding_models={
                "text-embedding-3-small": fc.OpenAIEmbeddingModel(
                    model_name="text-embedding-3-small",
                    rpm=60,
                    tpm=60000,
                )
            },
        )
    )
)

# Load a text dataset from Hugging Face
df = session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet")

# Add embeddings to text columns
df_with_embeddings = df.select(
    "*",
    fc.semantic.embed(fc.col("text")).alias("embedding")
)

# Apply semantic functions for sentiment analysis
df_analyzed = df_with_embeddings.select(
    "*",
    fc.semantic.analyze_sentiment(
        fc.col("text"),
        model_alias="gpt-4o-mini",  # Optional: specify model
    ).alias("sentiment")
)
```
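
From here the enriched DataFrame can be persisted like any other, for instance back to Parquet. A one-line sketch, assuming fenic's writer mirrors its reader interface:

```python
# Persist the enriched dataset locally (writer API assumed; see the fenic docs)
df_analyzed.write.parquet("imdb_with_sentiment.parquet")
```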

## Example: Analyzing MMLU Dataset

```python
import fenic as fc

# Requires OPENAI_API_KEY to be set for semantic calls
session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={
                "gpt-4o-mini": fc.OpenAILanguageModel(
                    model_name="gpt-4o-mini",
                    rpm=60,
                    tpm=60000,
                )
            },
        )
    )
)

# Load MMLU astronomy subset from Hugging Face
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Process the data
processed_df = (df
    # Filter for specific criteria
    .filter(fc.col("subject") == "astronomy")
    # Select relevant columns
    .select("question", "choices", "answer")
    # Add difficulty analysis using semantic.map
    .select(
        "*",
        fc.semantic.map(
            "Rate the difficulty of this question from 1-5: {{question}}",
            question=fc.col("question"),
            model_alias="gpt-4o-mini"  # Optional: specify model
        ).alias("difficulty")
    )
)

# Show results
processed_df.show()
```

## Resources

- [fenic GitHub Repository](https://github.com/typedef-ai/fenic)
- [fenic Documentation](https://docs.fenic.ai/latest/)
1 change: 1 addition & 0 deletions docs/hub/datasets-libraries.md
@@ -15,6 +15,7 @@ The table below summarizes the supported libraries and their level of integration
| [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. | ✅ | ✅ |
| [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. | ✅ | ✅ |
| [Embedding Atlas](./datasets-embedding-atlas) | Interactive visualization and exploration tool for large embeddings. | ✅ | ❌ |
| [fenic](./datasets-fenic) | PySpark-inspired DataFrame framework for building production AI and agentic applications. | ✅ | ❌ |
| [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. | ✅ | ✅ |
| [Pandas](./datasets-pandas) | Python data analysis toolkit. | ✅ | ✅ |
| [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. | ✅ | ✅ |