# fenic

[fenic](https://github.com/typedef-ai/fenic) is a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. fenic provides support for reading datasets directly from the Hugging Face Hub.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/fenic_hf.png"/>
</div>

## Getting Started

To get started, install `fenic` with pip:

```bash
pip install fenic
```

### Create a Session

Instantiate a fenic session with the default configuration (sufficient for reading datasets and other non-semantic operations):

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())
```

## Overview

fenic is an opinionated data processing framework that combines:
- **DataFrame API**: PySpark-inspired operations for familiar data manipulation
- **Semantic Operations**: Built-in AI/LLM operations including semantic functions, embeddings, and clustering
- **Model Integration**: Native support for AI providers (Anthropic, OpenAI, Cohere, Google)
- **Query Optimization**: Automatic optimization through logical plan transformations

## Read from Hugging Face Hub

fenic can read datasets directly from the Hugging Face Hub using the `hf://` protocol. This functionality is built into fenic's `DataFrameReader` interface.

### Supported Formats

fenic supports reading the following formats from Hugging Face:
- **Parquet files** (`.parquet`)
- **CSV files** (`.csv`)

### Reading Datasets

To read a dataset from the Hugging Face Hub:

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Read a CSV file from a public dataset
df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv")

# Read Parquet files using glob patterns
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Read from a specific dataset revision
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet")
```

### Reading with Schema Management

```python
# Read multiple CSV files with schema merging
df = session.read.csv("hf://datasets/username/dataset_name/*.csv", merge_schemas=True)

# Read multiple Parquet files with schema merging
df = session.read.parquet("hf://datasets/username/dataset_name/*.parquet", merge_schemas=True)
```

> **Note:** In fenic, a schema is the set of column names and their data types. When you enable `merge_schemas`, fenic tries to reconcile differences across files by filling missing columns with nulls and widening types where it can. Some layouts still cannot be merged; consult the fenic docs for [CSV schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.csv) and [Parquet schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.parquet).

### Authentication

To read private datasets, you need to set your Hugging Face token as an environment variable:

```shell
export HF_TOKEN="your_hugging_face_token_here"
```

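You can also set the token from Python before creating the session, for example in a notebook. This sketch assumes fenic picks `HF_TOKEN` up from the process environment when the read happens; the repository name is a placeholder for your own private dataset:

```python
import os

import fenic as fc

# Must be set before reading from a private repository
os.environ["HF_TOKEN"] = "your_hugging_face_token_here"

session = fc.Session.get_or_create(fc.SessionConfig())

# Placeholder repository name; replace with your private dataset
df = session.read.parquet("hf://datasets/your-username/your-private-dataset/*.parquet")
```
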
### Path Format

The Hugging Face path format in fenic follows this structure:
```
hf://{repo_type}/{repo_id}/{path_to_file}
```

You can also specify dataset revisions or versions:
```
hf://{repo_type}/{repo_id}@{revision}/{path_to_file}
```

Features:
- Supports glob patterns (`*`, `**`)
- Dataset revisions/versions using `@` notation (see the sketch after this list):
  - Specific commit: `@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e`
  - Branch: `@refs/convert/parquet`
  - Branch alias: `@~parquet`
- Requires the `HF_TOKEN` environment variable for private datasets
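
Putting these pieces together, here is a short sketch that combines glob patterns with the `@` revision notation, reusing the public example dataset from earlier in this guide:

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Glob across the auto-converted Parquet files via the `~parquet` branch alias
df_alias = session.read.parquet(
    "hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet"
)

# The same read, spelled with the full branch name instead of the alias
df_branch = session.read.parquet(
    "hf://datasets/datasets-examples/doc-formats-csv-1@refs/convert/parquet/**/*.parquet"
)
```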

### Mixing Data Sources

fenic allows you to combine multiple data sources in a single read operation, including mixing different protocols:

```python
# Mix HF and local files in one read call
df = session.read.parquet([
    "hf://datasets/cais/mmlu/astronomy/*.parquet",
    "file:///local/data/*.parquet",
    "./relative/path/data.parquet"
])
```

This lets you combine data from the Hugging Face Hub and local files seamlessly in a single processing pipeline.

## Processing Data from Hugging Face

Once loaded from Hugging Face, you can use fenic's full DataFrame API:

### Basic DataFrame Operations

```python
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Load IMDB dataset from Hugging Face
df = session.read.parquet("hf://datasets/imdb/plain_text/train-*.parquet")

# Filter and select
positive_reviews = df.filter(fc.col("label") == 1).select("text", "label")

# Group by and aggregate
label_counts = df.group_by("label").agg(
    fc.count("*").alias("count")
)
```
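
fenic assembles these transformations into a logical plan and optimizes it before execution (see the overview above); calling `show()` runs the query and prints the result:

```python
# Run the query and print the per-label counts
label_counts.show()
```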

### AI-Powered Operations

To use semantic and embedding operations, configure language and embedding models in your `SessionConfig`. Once configured:

```python
import fenic as fc

# Requires OPENAI_API_KEY to be set for language and embedding calls
session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={
                "gpt-4o-mini": fc.OpenAILanguageModel(
                    model_name="gpt-4o-mini",
                    rpm=60,
                    tpm=60000,
                )
            },
            embedding_models={
                "text-embedding-3-small": fc.OpenAIEmbeddingModel(
                    model_name="text-embedding-3-small",
                    rpm=60,
                    tpm=60000,
                )
            },
        )
    )
)

# Load a text dataset from Hugging Face
df = session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet")

# Add embeddings to text columns
df_with_embeddings = df.select(
    "*",
    fc.semantic.embed(fc.col("text")).alias("embedding")
)

# Apply semantic functions for sentiment analysis
df_analyzed = df_with_embeddings.select(
    "*",
    fc.semantic.analyze_sentiment(
        fc.col("text"),
        model_alias="gpt-4o-mini",  # Optional: specify model
    ).alias("sentiment")
)
```

## Example: Analyzing MMLU Dataset

```python
import fenic as fc

# Requires OPENAI_API_KEY to be set for semantic calls
session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={
                "gpt-4o-mini": fc.OpenAILanguageModel(
                    model_name="gpt-4o-mini",
                    rpm=60,
                    tpm=60000,
                )
            },
        )
    )
)

# Load MMLU astronomy subset from Hugging Face
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Process the data
processed_df = (df
    # Filter for specific criteria
    .filter(fc.col("subject") == "astronomy")
    # Select relevant columns
    .select("question", "choices", "answer")
    # Add difficulty analysis using semantic.map
    .select(
        "*",
        fc.semantic.map(
            "Rate the difficulty of this question from 1-5: {{question}}",
            question=fc.col("question"),
            model_alias="gpt-4o-mini"  # Optional: specify model
        ).alias("difficulty")
    )
)

# Show results
processed_df.show()
```

## Resources

- [fenic GitHub Repository](https://github.com/typedef-ai/fenic)
- [fenic Documentation](https://docs.fenic.ai/latest/)