feat(ibis): introduce DuckDB connector #1247

Merged
douenergy merged 4 commits into Canner:main from goldmedal:feat/duckdb-connector
Jul 8, 2025
Conversation

@goldmedal
Contributor

@goldmedal goldmedal commented Jul 4, 2025

Summary by CodeRabbit

  • New Features

    • Added support for DuckDB database files as a file format option for local, S3, Minio, and GCS connections.
    • Enabled attaching multiple DuckDB database files from remote storage for querying.
    • Enabled listing and querying of tables and columns from DuckDB files stored locally or in object storage.
    • Exposed DuckDB version information in metadata.
  • Bug Fixes

    • Improved error handling for missing or inaccessible DuckDB files during attachment.
  • Tests

    • Added tests to verify table listing and querying functionality for DuckDB file format.

@coderabbitai
Contributor

coderabbitai bot commented Jul 4, 2025

Walkthrough

DuckDB support is added as a new file format option for local and remote file-based connections. The system can now attach and query multiple DuckDB database files, retrieve their metadata, and run queries against them. New tests validate table listing and querying using DuckDB-formatted files.

Changes

| File(s) | Change Summary |
|---|---|
| `app/model/__init__.py` | Added "duckdb" as an example format for file-based connection info classes. |
| `app/model/connector.py` | Enhanced DuckDBConnector to attach multiple DuckDB files, handle remote files, update query signature, and add helper methods. |
| `app/model/metadata/factory.py` | Added logic to return DuckDBMetadata for DuckDB file formats in file-based sources. |
| `app/model/metadata/object_storage.py` | Introduced DuckDBMetadata class for metadata access to DuckDB databases. |
| `tests/routers/v2/connector/test_local_file.py` | Added async test for listing tables from DuckDB metadata. |
| `tests/routers/v3/connector/local_file/test_query.py` | Added async test for querying data from a DuckDB-formatted file. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API
    participant DuckDBMetadata
    participant DuckDBConnector
    participant FileSystem

    Client->>API: POST /metadata/tables (duckdb format)
    API->>DuckDBMetadata: get_table_list()
    DuckDBMetadata->>DuckDBConnector: query(metadata SQL)
    DuckDBConnector->>FileSystem: List & attach *.duckdb files
    DuckDBConnector-->>DuckDBMetadata: Table metadata
    DuckDBMetadata-->>API: Table list
    API-->>Client: Response with tables
sequenceDiagram
    participant Client
    participant API
    participant DuckDBConnector
    participant FileSystem

    Client->>API: POST /query (duckdb format)
    API->>DuckDBConnector: query(SQL, limit)
    DuckDBConnector->>FileSystem: List & attach *.duckdb files
    DuckDBConnector-->>API: Query result
    API-->>Client: Response with data

Suggested reviewers

  • onlyjackfrost

Poem

A hop and a leap, a DuckDB delight,
Now files and tables are queried just right.
With metadata fetched and queries that fly,
The rabbit applauds as the tests all comply.
🐇✨
DuckDB joins the warren—oh, what a sight!

Warning

Review ran into problems

🔥 Problems

Check-run timed out after 90 seconds. Some checks/pipelines were still in progress when the timeout was reached. Consider increasing the reviews.tools.github-checks.timeout_ms value in your CodeRabbit configuration to allow more time for checks to complete.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4a0ddf and 9d6459f.

📒 Files selected for processing (1)
  • ibis-server/app/model/metadata/object_storage.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ibis-server/app/model/metadata/object_storage.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ci

@github-actions github-actions bot added the ibis and python (Pull requests that update Python code) labels on Jul 4, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
ibis-server/app/model/connector.py (1)

214-225: Refactor to remove unnecessary else clause.

The static analysis tool correctly identifies an unnecessary else clause. The code can be simplified.

Apply this diff to improve readability:

-    def query(self, sql: str, limit: int | None) -> pa.Table:
-        try:
-            if limit is None:
-                # If no limit is specified, we return the full result
-                return self.connection.execute(sql).fetch_arrow_table()
-            else:
-                # If a limit is specified, we slice the result
-                # DuckDB does not support LIMIT in fetch_arrow_table, so we use slice
-                # to limit the number of rows returned
-                return (
-                    self.connection.execute(sql).fetch_arrow_table().slice(length=limit)
-                )
+    def query(self, sql: str, limit: int | None) -> pa.Table:
+        try:
+            if limit is None:
+                # If no limit is specified, we return the full result
+                return self.connection.execute(sql).fetch_arrow_table()
+            
+            # If a limit is specified, we slice the result
+            # DuckDB does not support LIMIT in fetch_arrow_table, so we use slice
+            # to limit the number of rows returned
+            return (
+                self.connection.execute(sql).fetch_arrow_table().slice(length=limit)
+            )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec0441b and 881bd9c.

📒 Files selected for processing (6)
  • ibis-server/app/model/__init__.py (4 hunks)
  • ibis-server/app/model/connector.py (4 hunks)
  • ibis-server/app/model/metadata/factory.py (2 hunks)
  • ibis-server/app/model/metadata/object_storage.py (3 hunks)
  • ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/local_file/test_query.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
Learnt from: goldmedal
PR: Canner/wren-engine#1038
File: ibis-server/app/model/utils.py:6-19
Timestamp: 2025-01-16T10:16:30.138Z
Learning: DuckDB's CREATE SECRET syntax does not support prepared statements or parameter binding, requiring direct value interpolation.
🧬 Code Graph Analysis (3)
ibis-server/tests/routers/v2/connector/test_local_file.py (1)
ibis-server/tests/conftest.py (1)
  • client (18-23)
ibis-server/tests/routers/v3/connector/local_file/test_query.py (2)
ibis-server/tests/conftest.py (1)
  • client (18-23)
wren-core-base/manifest-macro/src/lib.rs (1)
  • manifest (26-56)
ibis-server/app/model/metadata/factory.py (2)
ibis-server/app/model/metadata/object_storage.py (1)
  • DuckDBMetadata (278-353)
ibis-server/app/model/data_source.py (1)
  • DataSource (44-71)
🪛 GitHub Actions: ibis CI
ibis-server/app/model/__init__.py

[error] 1-1: Ruff formatting check failed. File would be reformatted.

ibis-server/tests/routers/v3/connector/local_file/test_query.py

[error] 1-1: Ruff formatting check failed. File would be reformatted.

🪛 Pylint (3.3.7)
ibis-server/app/model/connector.py

[refactor] 216-225: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

ibis-server/app/model/metadata/object_storage.py

[error] 348-348: Instance of 'DuckDBMetadata' has no 'connector' member

(E1101)

🔇 Additional comments (9)
ibis-server/app/model/metadata/factory.py (2)

10-10: LGTM: DuckDBMetadata import added correctly.

The import is properly placed with other metadata imports.


45-56: LGTM: DuckDB format handling implemented correctly.

The conditional logic properly checks for file-based data sources with "duckdb" format and returns the appropriate DuckDBMetadata instance. This maintains compatibility with existing behavior while adding DuckDB support.
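The dispatch described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `ObjectStorageMetadata`, `ConnectionInfo`, `FILE_BASED_SOURCES`, the source keys, and `pick_metadata_class` are hypothetical stand-ins. Only the `DuckDBMetadata` name and the "duckdb" format value come from the review.

```python
from dataclasses import dataclass


class ObjectStorageMetadata:
    """Hypothetical placeholder for the pre-existing file-based metadata handler."""


class DuckDBMetadata:
    """Placeholder for the handler this PR introduces (name taken from the review)."""


# Assumed keys for the file-based sources named in the PR summary.
FILE_BASED_SOURCES = {"local", "s3", "minio", "gcs"}


@dataclass
class ConnectionInfo:
    # Hypothetical connection info carrying only the file format.
    format: str


def pick_metadata_class(data_source: str, info: ConnectionInfo) -> type:
    """Return the metadata class for a file-based source, branching on format."""
    if data_source not in FILE_BASED_SOURCES:
        raise ValueError(f"not a file-based data source: {data_source!r}")
    if info.format == "duckdb":
        return DuckDBMetadata
    # Every other format keeps the existing behavior.
    return ObjectStorageMetadata
```

Checking the format only inside the file-based branch is what keeps non-DuckDB formats on their existing code path.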

ibis-server/tests/routers/v2/connector/test_local_file.py (1)

452-482: LGTM: Comprehensive DuckDB metadata test added.

The test thoroughly validates DuckDB metadata functionality by:

  • Testing the metadata/tables endpoint with DuckDB format
  • Validating table properties including catalog, schema, and table names
  • Checking column metadata structure and types
  • Ensuring proper response format and status codes

This provides good coverage for the DuckDB metadata integration.

ibis-server/app/model/connector.py (4)

3-3: LGTM: Required imports added for DuckDB enhancements.

The os and opendal imports are necessary for the new file listing and database attachment functionality.

Also applies to: 12-12


209-211: LGTM: DuckDB format detection and attachment.

The conditional check correctly identifies DuckDB format connections and triggers database attachment.


240-255: LGTM: Database attachment implementation is robust.

The method correctly:

  • Lists DuckDB files using the helper method
  • Validates that files exist before attempting attachment
  • Uses proper error handling for IOException and HTTPException
  • Attaches databases as read-only with appropriate naming

The error handling ensures that connection failures are properly propagated as UnprocessableEntityError.
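As a rough sketch of the attachment step summarized above, the snippet below builds one read-only `ATTACH` statement per file. It assumes the alias is derived from the file name, which the review only describes as "appropriate naming". The helper names are hypothetical, and a plain `FileNotFoundError` stands in for the project's `UnprocessableEntityError`.

```python
from pathlib import PurePosixPath


def alias_for(path: str) -> str:
    """Derive a database alias from the file name without its extension (assumed scheme)."""
    return PurePosixPath(path).stem


def build_attach_statements(paths: list[str]) -> list[str]:
    """Build one read-only ATTACH statement per DuckDB file path."""
    if not paths:
        # Stand-in for the project's own error type.
        raise FileNotFoundError("no DuckDB files found to attach")
    statements = []
    for path in paths:
        # Double single quotes in the string literal and double quotes in the identifier.
        escaped_path = path.replace("'", "''")
        escaped_alias = alias_for(path).replace('"', '""')
        statements.append(f"ATTACH '{escaped_path}' AS \"{escaped_alias}\" (READ_ONLY)")
    return statements
```

Each returned string could then be passed to a DuckDB connection's `execute` method. Attaching read-only matches the behavior described in the review.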


257-273: Confirm and Normalize Path Construction in _list_duckdb_files

I didn’t find any existing sanitization of connection_info.url.get_secret_value() or file.path in the codebase, so please double-check that neither value can introduce ../ segments or leading slashes before you concatenate:

• File: ibis-server/app/model/connector.py
Lines 266–269

full_path = f"{connection_info.url.get_secret_value()}/{file.path}"

• File: ibis-server/app/model/metadata/object_storage.py
Similar f-string concatenations around self.connection_info.url.get_secret_value() + paths

Recommendations:

  • Strip/normalize both segments before joining (e.g. .rstrip('/') on the root, .lstrip('/') on the relative path).
  • Consider using pathlib.PurePath or os.path.join to safely build the final URL/path.
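One way to apply the first recommendation is sketched below. `join_storage_path` is a hypothetical helper, not part of the codebase. It joins with plain string operations because `os.path.join` discards the root when the second segment starts with a slash, and `pathlib` collapses the double slash in URL-style roots such as `s3://bucket`.

```python
def join_storage_path(root: str, relative: str) -> str:
    """Join a storage root and a relative path, rejecting '..' segments."""
    # Drop empty and '.' segments so leading or doubled slashes are harmless.
    parts = [p for p in relative.split("/") if p not in ("", ".")]
    if ".." in parts:
        raise ValueError(f"path escapes the root: {relative!r}")
    return "/".join([root.rstrip("/"), *parts])
```

Rejecting `..` outright, rather than resolving it, is a deliberately conservative choice for this sketch.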
ibis-server/app/model/metadata/object_storage.py (2)

5-5: Also applies to: 15-15


283-339: Well-implemented metadata retrieval logic!

The SQL query correctly joins information_schema tables to retrieve comprehensive metadata, and the processing logic efficiently handles unique table identification and column aggregation.
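The shape of that logic might look like the sketch below. The SQL text, the row layout, and the output dictionaries are assumptions made for illustration, since the PR's actual query and response model are not quoted in this thread. `group_columns` is a hypothetical helper.

```python
# Assumed query: one row per column, joined back to its table.
METADATA_SQL = """
SELECT t.table_catalog, t.table_schema, t.table_name, c.column_name, c.data_type
FROM information_schema.tables AS t
JOIN information_schema.columns AS c
  ON t.table_catalog = c.table_catalog
 AND t.table_schema = c.table_schema
 AND t.table_name = c.table_name
ORDER BY t.table_catalog, t.table_schema, t.table_name, c.ordinal_position
"""


def group_columns(rows: list[tuple[str, str, str, str, str]]) -> list[dict]:
    """Collapse one-row-per-column results into one entry per table."""
    tables: dict[tuple[str, str, str], dict] = {}
    for catalog, schema, table, column, data_type in rows:
        # The (catalog, schema, table) triple identifies a table uniquely.
        entry = tables.setdefault(
            (catalog, schema, table),
            {"name": f"{catalog}.{schema}.{table}", "columns": []},
        )
        entry["columns"].append({"name": column, "type": data_type})
    return list(tables.values())
```

Keying on the full triple matters once several database files are attached, because the same schema and table names can then appear under different catalogs.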

@goldmedal goldmedal requested a review from douenergy July 7, 2025 03:02
@douenergy douenergy merged commit 69a5f8a into Canner:main Jul 8, 2025
10 checks passed
@douenergy
Contributor

Thanks @goldmedal
