Zian Xu edited this page Dec 9, 2024 · 19 revisions
  • es3951, Elif Sakalli
  • em3772, Ella Mihailoff
  • vgs2114, Valentina Sartori Rodriguez
  • zx2514, Zian Xu (Annie)

Ibis - the portable Python dataframe library

“Ibis defines a Python dataframe API that executes on any query engine – the frontend for any backend data platform, with nearly 20 backends today. This allows Ibis to have excellent performance – as good as the backend it is connected to – with a consistent user experience” https://ibis-project.org/why

Part 1 - The Problem & The Solution

What problems does the technology solve? - Valentina

| Category | Problem | Details |
| --- | --- | --- |
| Why (Gaps in Traditional Tools) | Unified Abstraction Across Backends | SQL syntax and capabilities vary across databases (e.g., PostgreSQL, MySQL, Oracle). Developers must write custom queries for each backend, leading to complexity, redundancy, and fragmented workflows. |
| | Performance and Scalability | Writing performant SQL queries requires detailed knowledge of database internals (e.g., indexing, partitioning). Scaling computations efficiently is non-trivial, especially for distributed systems like Spark. |
| | Pythonic Interface | SQL's declarative nature feels unnatural for Python developers, making integration with Python workflows challenging. Debugging SQL queries is less intuitive than Pythonic debugging methods, reducing productivity. |
| | Portability | SQL queries often cannot run across databases due to syntax differences. Migrating workflows requires substantial rewriting, leading to vendor lock-in and reduced flexibility. |
| | Bridging Analytics and Development | Analysts use SQL, while developers use Python. Transitioning between the two involves redundant work and inefficiencies, as well as reduced collaboration due to disparate tools and languages. |
| What (Challenges in Traditional Data Processing) | Backend Integration | SQL tools are tied to specific backends, requiring separate libraries and connection management, increasing setup complexity and development time. |
| | High-Level API for Data Manipulation | SQL lacks expressive, high-level constructs for operations like joins or aggregations in a Pythonic way. This verbosity slows down development and increases errors. |
| | Deferred Execution | SQL queries execute immediately, limiting the ability to build, optimize, and debug complex workflows incrementally. Immediate execution reduces iterative development efficiency. |
| | Backend-Specific Optimization | SQL optimizations require knowledge of backend-specific features like indexes and caching. Generic SQL queries often fail to leverage these optimizations effectively. |
| | Interoperability with Python Ecosystem | SQL operates in isolation from Python libraries like Pandas, requiring manual data transfers and format conversions, introducing friction and error risks. |
| How (Manual Implementation) | Backend-Agnostic API | SQL doesn't support backend-independent queries. Even ANSI-SQL is inconsistently implemented, requiring developers to rewrite queries for each backend. |
| | Push-Down Computation | Python workflows often fetch large datasets into memory for processing, causing bottlenecks. SQL lacks mechanisms to dynamically push computations to the database. |
| | Performance Optimization | SQL optimization requires understanding backend-specific query plans and indexes. Generic queries often perform poorly without tuning, making optimization time-consuming and error-prone. |
| | Consistent User Experience | SQL debugging and error handling vary significantly across databases, requiring adaptation to different environments, increasing the overhead for developers. |

How does the technology solve the problem? - Zian / Annie

Ibis provides a unified Python dataframe API that abstracts the backend-specific query languages and APIs. Here's how it addresses some of the issues:

| Feature | Description |
| --- | --- |
| Backend-Agnostic API | Ibis defines a common API that users can interact with, regardless of the underlying backend. This eliminates the need to learn multiple query languages or backend-specific APIs, streamlining workflows across different systems. |
| Push-Down Computation | Instead of transferring data to the local environment for processing (as with pandas), Ibis translates Python dataframe operations into efficient backend-native queries. This ensures computations are performed on the backend, leveraging its full power and minimizing data transfer. |
| Performance Optimization | By delegating computation to the backend, Ibis achieves performance that matches the capabilities of the connected backend. For example, connecting Ibis to a distributed engine like Apache Spark enables distributed computation without additional code changes. |
| Consistent User Experience | Developers use the same Ibis API across nearly 20 supported backends, making it easier to write and maintain code that can seamlessly switch between backends as needed. |


Alternatives and What Makes Ibis Unique - Ella

What are the alternatives, and what are the pros and cons of Ibis compared with them? What makes it unique?

Alternatives

  • Pandas
  • Dask DataFrame

Why Ibis? The Pros and Cons of Alternatives Compared to Ibis

Pandas

A Python library used for data manipulation and analysis.

Pros Compared to Ibis:
  • API: Comprehensive functionality for in-memory data manipulation, with many community-driven extensions.
  • Easy to Use: Syntax is simple and easy for beginners to pick up.
  • Integration: Easily integrates with other Python libraries.
Cons Compared to Ibis:
  • Scalability: Pandas loads the dataset into the computer's memory. While this works well for small to medium-scale data, it can become difficult for very large datasets.
    • What Makes Ibis Unique:
      Ibis interacts directly with backends like SQL databases, which are designed to handle large datasets efficiently. This avoids the need to load all of the data into memory at once.
  • No Native Multi-Backend Support: Pandas operates on local CSV files, Excel sheets, or in-memory data structures. Switching from working with a CSV file to a database or a cloud-based data system requires significant code changes.
    • What Makes Ibis Unique:
      Ibis provides a single API that works across various backends (e.g., DuckDB, PostgreSQL, BigQuery), allowing you to write code once and run it on different systems without significant changes.
  • Local-Only Execution: Pandas processes data on your local machine, which can be inefficient or impossible for datasets stored elsewhere (e.g., databases or cloud storage systems).
    • What Makes Ibis Unique:
      Ibis executes data manipulation directly on remote systems. For example, using BigQuery, Ibis translates operations into SQL and executes them in BigQuery’s cloud environment, eliminating the need to transfer data locally.

Dask DataFrame

A parallelized implementation of Pandas, meant for processing large tabular data and larger-than-memory datasets.

Pros Compared to Ibis:
  • API Familiarity: Uses the same API as Pandas, and because it's all in Python, it’s straightforward and easy for beginners.
  • Handles Large Datasets: Processes data larger than available memory by using out-of-core processing, processing data in smaller chunks as needed.
  • Cluster Computing: Works well with cluster-computing frameworks.
Cons Compared to Ibis:
  • Ecosystem-Specific: Dask focuses more on scalability within its own ecosystem, requiring the user to follow Dask-specific workflows. It doesn’t provide a single interface to work with different types of data systems.
    • What Makes Ibis Unique:
      Ibis combines backend abstraction with SQL integration, offering portability across systems.
  • Additional Setup for Distributed Execution: Scaling Dask beyond a single machine requires configuring its distributed scheduler and workers.
    • What Makes Ibis Unique:
      Ibis delegates execution to the connected backend, so moving from local to distributed computation is a matter of connecting to a different backend rather than managing cluster infrastructure yourself.

What is the connection Between Ibis and COMS 4111? - Elif

Ibis relates to key database concepts covered in COMS 4111, particularly data modeling, query optimization, and database APIs, which emphasize how modern tools bridge abstraction gaps between users and database backends. Here are the most relevant connections:

| Topic | Lecture Connection | Ibis Relevance |
| --- | --- | --- |
| Query Abstraction and APIs | In the lecture on database APIs, we discussed how applications interface with DBMS engines through abstractions like DBAPI and ORMs to simplify operations and unify interactions across multiple database systems. | Ibis provides a Python-based dataframe API that abstracts SQL-like operations into a uniform programming interface, connecting seamlessly to backends like DuckDB, PostgreSQL, and BigQuery. This abstraction aligns with the need for a unified API layer that supports multiple database types and systems, mirroring concepts like embedded SQL and DBAPIs. |
| Query Optimization | Query optimization, as discussed in Lecture 9, focuses on how DBMSs evaluate SQL queries through optimized execution plans, including methods like predicate pushdown and join optimizations. | Ibis translates Python dataframe operations into SQL queries executed directly on the backend, leveraging the query optimizers of the connected databases. For example, when using BigQuery, Ibis pushes filters and computations down to the database, avoiding redundant data transfers and improving performance. This aligns with optimization techniques like reducing unnecessary joins or applying predicates early in query plans. |
| Normalization and Schema Refinement | Data normalization (Lecture 7) reduces redundancy and avoids anomalies, with schema refinement ensuring logical consistency. | While Ibis operates at the query level rather than schema design, its capability to interact with normalized relational data through abstract APIs ensures consistent querying across normalized backends. This aligns with the course's emphasis on designing applications that avoid redundancy while maintaining flexibility. |
| Declarative Programming in Relational Algebra | The relational algebra foundation (Lecture 4) and its mapping to SQL explain how declarative programming simplifies data operations and encourages logical data manipulation. | Ibis functions as a declarative API that abstracts the complexities of relational algebra into Python syntax. Operations like filtering (filter), projecting (select), and joining (join) are directly mapped to underlying relational database operations, demonstrating practical applications of the lecture's concepts. |
| Physical Design and Storage Optimization | Lectures on physical database design and storage discuss indexing, memory usage, and efficient access methods to optimize database performance. | Ibis relies on the storage and indexing efficiencies of its backends (e.g., DuckDB for local execution or BigQuery for distributed systems). Its approach of pushing execution to the backend aligns with lecture discussions on minimizing in-memory computations and leveraging backend optimizations. |

Part 2 - Tutorial (ALL)

Why Is Ibis Easier to Use Than Raw SQL?

  • Pythonic Syntax for SQL-Like Queries: Ibis provides a clean Python interface for querying the schema, so a query like "find restaurants without a user's allergens" becomes straightforward.
  • We can avoid nested subqueries and write the logic naturally in Python, reducing errors and improving readability.

We will use queries from our Project 1 Part 2 to illustrate this: (Zian / Annie)

Instead of constructing a raw SQL query like:

```sql
SELECT r.rest_name
FROM Restaurant_Creates r
WHERE NOT EXISTS (
    SELECT 1
    FROM Menu_Item_Includes mii
    JOIN Menu_Item_Contains mic ON mii.item_name = mic.item_name
    JOIN Customer_Allergens ca ON mii.ing_name = ca.allergen
    WHERE ca.user_id = 11 AND mic.rest_name = r.rest_name
);
```

We can achieve the same with Ibis:

```python
# mii/mic/allergens = Menu_Item_Includes / Menu_Item_Contains / Customer_Allergens
unsafe = (
    mii.join(mic, mii.item_name == mic.item_name)
    .join(allergens, mii.ing_name == allergens.allergen)
    .filter(allergens.user_id == 11)
    .select(mic.rest_name)
)
safe_restaurants = restaurants.filter(~restaurants.rest_name.isin(unsafe.rest_name))
```

Integration with Python Analysis Tools (Ella)

Convert Results Directly to Pandas for Analysis or Visualization:

```python
result_df = avg_ratings.head().to_pandas()
result_df.plot.bar(x="rest_name", y="avg_rating")
```
Why It’s Easier:

Eliminates manual data transformation steps between SQL and Python workflows.


Effortless Aggregations and Joins (Valentina)

Combine tables and compute summaries like average restaurant ratings:

```python
from ibis import _

avg_ratings = (
    reviews.join(restaurants, reviews.rest_name == restaurants.rest_name)
    .group_by("rest_name")
    .aggregate(avg_rating=reviews.rating.mean())
    .filter(_.avg_rating >= 4)  # filter on the aggregated column
)
```
Why It’s Easier:

No need to write JOIN or GROUP BY clauses; chaining simplifies operations.


Install Ibis (Elif)

Install the ibis-framework package with DuckDB support (recommended for local data operations):

```shell
pip install 'ibis-framework[duckdb,examples]'
```

Then import it in Python:

```python
import ibis
```

Connect to the Database (Elif)

We use DuckDB as the backend and set up the restaurant dataset from Project 1.

```python
# Connect to DuckDB (creates the file if it does not already exist)
con = ibis.connect("duckdb://restaurant_data.ddb")
```

Queries (All)

1. Tutorial Workflow (Elif)

Projections: Selecting Columns

Use select to specify the columns you need.

```python
restaurants = con.table("Restaurant_Creates")

# Select specific columns
restaurants.select("rest_name", "loc", "cuisineType").head()
```

2. Filtering: WHERE Clauses (Ella)

Use filter to narrow down rows based on conditions.

```python
# Restaurants with vegetarian diets
vegetarian_restaurants = restaurants.filter(
    restaurants.diet_name.contains("Vegetarian")
)
vegetarian_restaurants.head()
```

3. Aggregation: GROUP BY (Valentina)

Summarize data with aggregates.

```python
reviews = con.table("Review_Writes")

# Average rating per restaurant
avg_ratings = reviews.group_by("rest_name").aggregate(
    avg_rating=reviews.rating.mean()
)
avg_ratings.order_by(ibis.desc("avg_rating")).head()
```

4. Sorting: ORDER BY (Zian / Annie)

Rearrange rows based on column values.

```python
# Sort restaurants by average rating, highest first
sorted_restaurants = avg_ratings.order_by(ibis.desc("avg_rating"))
sorted_restaurants.head()
```

5. Top-K Operations (Elif)

Use limit to get the top results.

```python
# Get the top 3 highest-rated restaurants
top_restaurants = sorted_restaurants.limit(3)
top_restaurants.head()
```

6. Joins (Ella)

Combine data from multiple tables.

```python
allergens = con.table("Customer_Allergens")

# Combine reviews with restaurant details on the shared rest_name key
reviews_with_info = reviews.join(
    restaurants, reviews.rest_name == restaurants.rest_name
)
reviews_with_info.head()
```

7. Subqueries (Valentina)

Use subqueries to nest expressions.

```python
# Restaurants serving items that contain user 11's allergens
mii = con.table("Menu_Item_Includes")
mic = con.table("Menu_Item_Contains")

unsafe_subquery = (
    mii.join(mic, mii.item_name == mic.item_name)
    .join(allergens, mii.ing_name == allergens.allergen)
    .filter(allergens.user_id == 11)
    .select(mic.rest_name)
)

# Keep only restaurants that are not in the unsafe set
safe_restaurants = restaurants.filter(
    ~restaurants.rest_name.isin(unsafe_subquery.rest_name)
)
safe_restaurants.head()
```

8. Date/Time Data (Zian / Annie)

Work with date fields to filter data.

```python
from datetime import datetime, timedelta

# Reviews in the last year
one_year_ago = datetime.now() - timedelta(days=365)
recent_reviews = reviews.filter(reviews.dt > one_year_ago)
recent_reviews.head()
```

Putting Together Multiple Operations

```python
# Find top vegetarian restaurants with the highest ratings
top_veg_restaurants = (
    restaurants.filter(restaurants.diet_name.contains("Vegetarian"))
    .join(reviews, restaurants.rest_name == reviews.rest_name)
    .group_by("rest_name")
    .aggregate(avg_rating=reviews.rating.mean())
    .order_by(ibis.desc("avg_rating"))
    .limit(5)
)

top_veg_restaurants.head()
```
