feat(ibis): introduce DuckDB connector #1247
Walkthrough
DuckDB support is added as a new file format option for local and remote file-based connections. The system can now attach and query multiple DuckDB database files, retrieve their metadata, and run queries against them. New tests validate table listing and querying using DuckDB-formatted files.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant API
participant DuckDBMetadata
participant DuckDBConnector
participant FileSystem
Client->>API: POST /metadata/tables (duckdb format)
API->>DuckDBMetadata: get_table_list()
DuckDBMetadata->>DuckDBConnector: query(metadata SQL)
DuckDBConnector->>FileSystem: List & attach *.duckdb files
DuckDBConnector-->>DuckDBMetadata: Table metadata
DuckDBMetadata-->>API: Table list
API-->>Client: Response with tables
sequenceDiagram
participant Client
participant API
participant DuckDBConnector
participant FileSystem
Client->>API: POST /query (duckdb format)
API->>DuckDBConnector: query(SQL, limit)
DuckDBConnector->>FileSystem: List & attach *.duckdb files
DuckDBConnector-->>API: Query result
API-->>Client: Response with data
Actionable comments posted: 3
🧹 Nitpick comments (1)
ibis-server/app/model/connector.py (1)
214-225: Refactor to remove unnecessary else clause.
The static analysis tool correctly identifies an unnecessary else clause. The code can be simplified.
Apply this diff to improve readability:
-    def query(self, sql: str, limit: int | None) -> pa.Table:
-        try:
-            if limit is None:
-                # If no limit is specified, we return the full result
-                return self.connection.execute(sql).fetch_arrow_table()
-            else:
-                # If a limit is specified, we slice the result
-                # DuckDB does not support LIMIT in fetch_arrow_table, so we use slice
-                # to limit the number of rows returned
-                return (
-                    self.connection.execute(sql).fetch_arrow_table().slice(length=limit)
-                )
+    def query(self, sql: str, limit: int | None) -> pa.Table:
+        try:
+            if limit is None:
+                # If no limit is specified, we return the full result
+                return self.connection.execute(sql).fetch_arrow_table()
+
+            # If a limit is specified, we slice the result
+            # DuckDB does not support LIMIT in fetch_arrow_table, so we use slice
+            # to limit the number of rows returned
+            return (
+                self.connection.execute(sql).fetch_arrow_table().slice(length=limit)
+            )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- ibis-server/app/model/__init__.py (4 hunks)
- ibis-server/app/model/connector.py (4 hunks)
- ibis-server/app/model/metadata/factory.py (2 hunks)
- ibis-server/app/model/metadata/object_storage.py (3 hunks)
- ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
- ibis-server/tests/routers/v3/connector/local_file/test_query.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
Learnt from: goldmedal
PR: Canner/wren-engine#1038
File: ibis-server/app/model/utils.py:6-19
Timestamp: 2025-01-16T10:16:30.138Z
Learning: DuckDB's CREATE SECRET syntax does not support prepared statements or parameter binding, requiring direct value interpolation.
ibis-server/tests/routers/v2/connector/test_local_file.py (3)
Same learnings as listed above: PR #1029 (object_storage.py:44-44) and both entries from PR #1224 (util.py:49-56).
ibis-server/app/model/metadata/factory.py (1)
Same learning as listed above: PR #1029 (object_storage.py:44-44).
ibis-server/app/model/connector.py (3)
Same learnings as listed above: both entries from PR #1224 (util.py:49-56) and PR #1029 (object_storage.py:44-44).
ibis-server/app/model/metadata/object_storage.py (3)
Same learnings as listed above: both entries from PR #1224 (util.py:49-56) and PR #1029 (object_storage.py:44-44).
🧬 Code Graph Analysis (3)
ibis-server/tests/routers/v2/connector/test_local_file.py (1)
  ibis-server/tests/conftest.py (1)
    client (18-23)
ibis-server/tests/routers/v3/connector/local_file/test_query.py (2)
  ibis-server/tests/conftest.py (1)
    client (18-23)
  wren-core-base/manifest-macro/src/lib.rs (1)
    manifest (26-56)
ibis-server/app/model/metadata/factory.py (2)
  ibis-server/app/model/metadata/object_storage.py (1)
    DuckDBMetadata (278-353)
  ibis-server/app/model/data_source.py (1)
    DataSource (44-71)
🪛 GitHub Actions: ibis CI
ibis-server/app/model/__init__.py
[error] 1-1: Ruff formatting check failed. File would be reformatted.
ibis-server/tests/routers/v3/connector/local_file/test_query.py
[error] 1-1: Ruff formatting check failed. File would be reformatted.
🪛 Pylint (3.3.7)
ibis-server/app/model/connector.py
[refactor] 216-225: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
ibis-server/app/model/metadata/object_storage.py
[error] 348-348: Instance of 'DuckDBMetadata' has no 'connector' member
(E1101)
🔇 Additional comments (9)
ibis-server/app/model/metadata/factory.py (2)
10-10: LGTM: DuckDBMetadata import added correctly.
The import is properly placed with other metadata imports.
45-56: LGTM: DuckDB format handling implemented correctly.
The conditional logic properly checks for file-based data sources with "duckdb" format and returns the appropriate DuckDBMetadata instance. This maintains compatibility with existing behavior while adding DuckDB support.
ibis-server/tests/routers/v2/connector/test_local_file.py (1)
452-482: LGTM: Comprehensive DuckDB metadata test added.
The test thoroughly validates DuckDB metadata functionality by:
- Testing the metadata/tables endpoint with DuckDB format
- Validating table properties including catalog, schema, and table names
- Checking column metadata structure and types
- Ensuring proper response format and status codes
This provides good coverage for the DuckDB metadata integration.
ibis-server/app/model/connector.py (4)
3-3: LGTM: Required imports added for DuckDB enhancements.
The `os` and `opendal` imports are necessary for the new file listing and database attachment functionality.
Also applies to: 12-12
209-211: LGTM: DuckDB format detection and attachment.
The conditional check correctly identifies DuckDB format connections and triggers database attachment.
240-255: LGTM: Database attachment implementation is robust.
The method correctly:
- Lists DuckDB files using the helper method
- Validates that files exist before attempting attachment
- Uses proper error handling for IOException and HTTPException
- Attaches databases as read-only with appropriate naming
The error handling ensures that connection failures are properly propagated as UnprocessableEntityError.
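To make the attachment steps above concrete, here is a minimal sketch of how a list of discovered `.duckdb` paths could be turned into read-only `ATTACH` statements. It is not the code under review: the function name `build_attach_statements`, the alias rule (file name without its extension), and the use of `ValueError` in place of the project's `UnprocessableEntityError` are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def build_attach_statements(paths: list[str]) -> list[str]:
    """Return one read-only ATTACH statement per DuckDB file path.

    Hypothetical helper: the real connector may name aliases differently.
    """
    if not paths:
        # Stand-in for the "validate that files exist" step; the PR raises
        # its own error type here, which this sketch does not reproduce.
        raise ValueError("no .duckdb files found")

    statements = []
    for path in paths:
        # Assumed naming rule: alias is the file name without its extension.
        alias = PurePosixPath(path).stem
        # Double embedded quotes so the literal and identifier stay valid SQL.
        quoted_path = path.replace("'", "''")
        quoted_alias = alias.replace('"', '""')
        statements.append(
            f"ATTACH DATABASE '{quoted_path}' AS \"{quoted_alias}\" (READ_ONLY)"
        )
    return statements
```

Each returned string could then be passed to `connection.execute(...)`. Building the statements separately from executing them keeps the quoting logic easy to test without a live DuckDB connection.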
257-273: Confirm and Normalize Path Construction in `_list_duckdb_files`
I didn't find any existing sanitization of `connection_info.url.get_secret_value()` or `file.path` in the codebase, so please double-check that neither value can introduce `../` segments or leading slashes before you concatenate:
• File: ibis-server/app/model/connector.py, lines 266–269: `full_path = f"{connection_info.url.get_secret_value()}/{file.path}"`
• File: ibis-server/app/model/metadata/object_storage.py: similar f-string concatenations around `self.connection_info.url.get_secret_value()` + paths
Recommendations:
- Strip/normalize both segments before joining (e.g. `.rstrip('/')` on the root, `.lstrip('/')` on the relative path).
- Consider using `pathlib.PurePath` or `os.path.join` to safely build the final URL/path.
ibis-server/app/model/metadata/object_storage.py (2)
5-5: Also applies to: 15-15
283-339: Well-implemented metadata retrieval logic!
The SQL query correctly joins information_schema tables to retrieve comprehensive metadata, and the processing logic efficiently handles unique table identification and column aggregation.
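The "unique table identification and column aggregation" step can be sketched as follows. This is an illustration under assumptions, not the `DuckDBMetadata` implementation: the row shape `(catalog, schema, table, column, data_type)`, the function name `group_columns_by_table`, and the output dictionary layout are all hypothetical.

```python
def group_columns_by_table(rows):
    """Group flat (catalog, schema, table, column, data_type) rows by table.

    Hypothetical row shape: the actual metadata query may select other fields.
    """
    tables = {}
    for catalog, schema, table, column, data_type in rows:
        # The (catalog, schema, table) triple identifies a table uniquely,
        # even when several attached databases contain the same table name.
        key = (catalog, schema, table)
        entry = tables.setdefault(
            key,
            {
                "name": f"{catalog}.{schema}.{table}",
                "properties": {"catalog": catalog, "schema": schema, "table": table},
                "columns": [],
            },
        )
        entry["columns"].append({"name": column, "type": data_type})
    # Dicts preserve insertion order, so tables appear in first-seen order.
    return list(tables.values())
```

Keying on the full triple rather than the bare table name matters once multiple DuckDB files are attached, since each file shows up as its own catalog.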
Thanks @goldmedal
Summary by CodeRabbit
New Features
Bug Fixes
Tests