fix(ibis): using DataFusion to format the query result #1235

douenergy merged 8 commits into Canner:main from
Conversation
Walkthrough

This update migrates SQL execution on Arrow tables from DuckDB to DataFusion, introduces session timezone handling for timestamp fields, and updates SQL formatting to align with DataFusion's dialect. It also modernizes Python type hints, adjusts test expectations for timestamp and numeric formatting, adds DataFusion as a dependency, and refactors Rust code to set the session timezone via properties and update string formatting.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Router
    participant Util
    participant DataFusion
    participant DataSource
    Router->>Util: to_json(arrow_table, headers, data_source)
    Util->>Util: _with_session_timezone(arrow_table, headers, data_source)
    Util->>DataFusion: get_datafusion_context(headers)
    Util->>DataFusion: Register "arrow_table"
    Util->>DataFusion: Execute SQL query
    DataFusion-->>Util: Result Table
    Util->>Router: JSON response
```
📜 Recent review details

Configuration used: CodeRabbit UI
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (8)
Actionable comments posted: 0
🧹 Nitpick comments (6)
ibis-server/app/model/connector.py (3)

155-155: Remove unused `noqa` directive.

The static analysis indicates this `noqa` directive is unused because the `PLC0415` rule is not enabled in the linter configuration.

```diff
- import ibis.backends.bigquery  # noqa: PLC0415
+ import ibis.backends.bigquery
```

193-193: Remove unused `noqa` directive.

The static analysis indicates this `noqa` directive is unused because the `PLC0415` rule is not enabled in the linter configuration.

```diff
- import duckdb  # noqa: PLC0415
+ import duckdb
```

224-224: Remove unused `noqa` directive.

The static analysis indicates this `noqa` directive is unused because the `PLC0415` rule is not enabled in the linter configuration.

```diff
- import redshift_connector  # noqa: PLC0415
+ import redshift_connector
```

ibis-server/app/util.py (3)
70-111: Comprehensive timezone handling implementation.

The function properly handles:
- Session timezone adjustments for timestamp fields
- MySQL special case for missing timezone information
- Nullable field handling to prevent casting errors
Note the TODOs indicate known limitations with MySQL and ClickHouse connectors.
Would you like me to create tracking issues for the MySQL timezone and ClickHouse nullable field TODOs?
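The MySQL special case can be illustrated with a small sketch (a hypothetical helper, not the PR's actual code, assuming naive MySQL timestamps should be interpreted as UTC before any session-timezone cast):

```python
from datetime import datetime, timezone

def attach_utc(ts: datetime) -> datetime:
    # Treat a naive timestamp as UTC so later timezone casts are well-defined
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts

print(attach_utc(datetime(2011, 1, 1, 12, 0, 0)).isoformat())
# 2011-01-01T12:00:00+00:00
```

Already-aware timestamps pass through unchanged, so the helper is safe to apply to every value in a column.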
207-211: Decimal formatting workaround implemented.

The conditional logic prevents unnecessary scientific notation for zero values. Consider implementing the suggested `to_char` UDF for better decimal formatting control, as noted in the TODO.

Would you like me to create an issue to track the implementation of a custom `to_char` UDF for better decimal formatting?
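A minimal sketch of that zero-value workaround, using Python's `Decimal` rather than the server's Arrow types (`format_decimal` is a hypothetical name, for illustration only):

```python
from decimal import Decimal

def format_decimal(value: Decimal) -> str:
    # str(Decimal("0E-9")) would yield "0E-9"; emit a plain zero instead
    if value == 0:
        return "0"
    return str(value)

print(format_decimal(Decimal("0E-9")))   # 0
print(format_decimal(Decimal("74.66")))  # 74.66
```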
215-220: Consider removing unnecessary else block.

The timestamp formatting is correct, but the code can be simplified:

```diff
 elif pa.types.is_timestamp(field.type):
     if field.type.tz is None:
         return f"to_char({column_name}, '%Y-%m-%d %H:%M:%S%.6f') as {column_name}"
-    else:
-        return (
-            f"to_char({column_name}, '%Y-%m-%d %H:%M:%S%.6f %Z') as {column_name}"
-        )
+    return (
+        f"to_char({column_name}, '%Y-%m-%d %H:%M:%S%.6f %Z') as {column_name}"
+    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`ibis-server/poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (22)
- ibis-server/app/model/connector.py (3 hunks)
- ibis-server/app/model/data_source.py (1 hunks)
- ibis-server/app/query_cache/__init__.py (2 hunks)
- ibis-server/app/routers/v2/connector.py (1 hunks)
- ibis-server/app/routers/v3/connector.py (1 hunks)
- ibis-server/app/util.py (4 hunks)
- ibis-server/pyproject.toml (1 hunks)
- ibis-server/tests/routers/v2/connector/test_bigquery.py (4 hunks)
- ibis-server/tests/routers/v2/connector/test_clickhouse.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_mysql.py (2 hunks)
- ibis-server/tests/routers/v2/connector/test_oracle.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_postgres.py (2 hunks)
- ibis-server/tests/routers/v2/connector/test_redshift.py (2 hunks)
- ibis-server/tests/routers/v2/connector/test_snowflake.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_trino.py (1 hunks)
- ibis-server/tests/routers/v3/connector/bigquery/test_query.py (2 hunks)
- ibis-server/tests/routers/v3/connector/local_file/test_query.py (1 hunks)
- ibis-server/tests/routers/v3/connector/oracle/test_query.py (1 hunks)
- ibis-server/tests/routers/v3/connector/postgres/test_query.py (4 hunks)
- ibis-server/tests/util/sql_generator.py (1 hunks)
- wren-core/core/src/mdl/context.rs (2 hunks)
- wren-core/core/src/mdl/mod.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
ibis-server/app/model/connector.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
ibis-server/app/routers/v3/connector.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.570Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
ibis-server/app/routers/v2/connector.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.570Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
wren-core/core/src/mdl/context.rs (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1161
File: ibis-server/app/routers/v3/connector.py:78-83
Timestamp: 2025-05-05T02:27:29.829Z
Learning: The row-level access control implementation in Wren Engine filters headers with the prefix `X_WREN_VARIABLE_PREFIX` in `EmbeddedEngineRewriter.get_session_properties` and validates session property expressions in `access_control.rs` to ensure they only contain literal values, preventing SQL injection.
wren-core/core/src/mdl/mod.rs (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1161
File: ibis-server/app/routers/v3/connector.py:78-83
Timestamp: 2025-05-05T02:27:29.829Z
Learning: The row-level access control implementation in Wren Engine filters headers with the prefix `X_WREN_VARIABLE_PREFIX` in `EmbeddedEngineRewriter.get_session_properties` and validates session property expressions in `access_control.rs` to ensure they only contain literal values, preventing SQL injection.
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
ibis-server/app/util.py (4)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.570Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
🧬 Code Graph Analysis (2)
ibis-server/tests/util/sql_generator.py (1)
ibis-server/tests/model/__init__.py (1)
Function (4-10)
ibis-server/app/routers/v2/connector.py (1)
ibis-server/app/util.py (1)
to_json (43-67)
🪛 Ruff (0.11.9)
ibis-server/app/model/connector.py
155-155: Unused noqa directive (non-enabled: PLC0415)
(RUF100)
193-193: Unused noqa directive (non-enabled: PLC0415)
(RUF100)
224-224: Unused noqa directive (non-enabled: PLC0415)
(RUF100)
🪛 Pylint (3.3.7)
ibis-server/app/util.py
[refactor] 215-220: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 205-205: Too many return statements (7/6)
(R0911)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: ci
- GitHub Check: cargo test (win64)
- GitHub Check: cargo check
- GitHub Check: cargo test (macos-aarch64)
- GitHub Check: cargo test (macos)
- GitHub Check: test
🔇 Additional comments (56)
ibis-server/pyproject.toml (1)
49-49: LGTM! DataFusion dependency addition aligns with PR objectives.

The addition of DataFusion as a dependency is necessary for the migration from DuckDB to DataFusion for query result formatting, as mentioned in the PR objectives.

ibis-server/tests/util/sql_generator.py (1)

12-12: LGTM! Type hint modernization improves consistency.

The change from `Optional[str]` to `str | None` modernizes the type annotation to use the newer union syntax, improving code consistency across the codebase.

ibis-server/app/model/data_source.py (1)

243-243: LGTM! Type hint modernization improves consistency.

The change from `Optional[ssl.SSLContext]` to `ssl.SSLContext | None` modernizes the type annotation to use the newer union syntax, consistent with similar updates across the codebase.

ibis-server/app/query_cache/__init__.py (2)

3-3: LGTM! Import cleanup removes unused Optional.

Removing the unused `Optional` import after modernizing the type hints is good housekeeping.

20-20: LGTM! Type hint modernization improves consistency.

The change from `Optional[Any]` to `Any | None` modernizes the type annotation to use the newer union syntax, consistent with similar updates across the codebase.

ibis-server/tests/routers/v2/connector/test_trino.py (1)
114-114: LGTM! Numeric type assertion updated correctly.

The change from string to float representation aligns with the DataFusion migration and properly reflects the column's float type definition in the manifest.

ibis-server/tests/routers/v2/connector/test_oracle.py (1)

197-197: LGTM! Timezone format updated to use offset notation.

The change from "UTC" to "+00:00" correctly implements the timezone standardization described in the PR objectives.

ibis-server/tests/routers/v2/connector/test_redshift.py (2)

113-113: LGTM! Timezone format standardized across test scenarios.

The changes from "UTC" to "+00:00" correctly implement the timezone offset notation as described in the PR objectives.

153-153: LGTM! Consistent timezone format implementation.

This change maintains consistency with the timezone format standardization applied in the primary test function.

ibis-server/tests/routers/v3/connector/local_file/test_query.py (1)

144-144: LGTM! Calculated field type assertion updated correctly.

The change from string to float representation aligns with the column's float type definition and the new DataFusion serialization behavior.

ibis-server/tests/routers/v2/connector/test_snowflake.py (1)

97-97: LGTM! Timezone format standardization completed.

The change from "UTC" to "+00:00" ensures consistent timezone offset notation across all connector tests, completing the standardization effort.
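The `+00:00` notation can be reproduced with the standard library alone (a sketch of the expected output format, not the server's implementation):

```python
from datetime import datetime, timezone

def format_with_offset(dt: datetime) -> str:
    # Render the UTC offset as "+00:00" instead of a named zone like "UTC"
    raw = dt.strftime("%z")  # e.g. "+0000"
    return dt.strftime("%Y-%m-%d %H:%M:%S ") + raw[:3] + ":" + raw[3:]

ts = datetime(2011, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(format_with_offset(ts))  # 2011-01-01 12:00:00 +00:00
```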
ibis-server/tests/routers/v2/connector/test_clickhouse.py (1)
186-186: Timestamp format updated correctly for DataFusion compatibility.

The test expectation correctly reflects the new timezone format where timestamps with timezone display the offset (`+00:00`) instead of the named timezone (`UTC`). This aligns with the migration from DuckDB to DataFusion for query result formatting.

ibis-server/tests/routers/v3/connector/oracle/test_query.py (1)

100-100: Timestamp format updated correctly for DataFusion compatibility.

The test expectation correctly reflects the new timezone format where timestamps with timezone display the offset (`+00:00`) instead of the named timezone (`UTC`). This change is consistent with the DataFusion migration for improved timezone handling.

ibis-server/app/routers/v3/connector.py (1)

161-161: Data source parameter correctly added for timezone handling.

The addition of the `data_source=data_source` parameter to the `to_json` function call is necessary for the DataFusion migration. This parameter enables session timezone adjustments and proper timestamp formatting in query results.

ibis-server/app/routers/v2/connector.py (1)

175-175: Data source parameter correctly added for timezone handling.

The addition of the `data_source=data_source` parameter to the `to_json` function call aligns with the DataFusion migration changes. This ensures consistent timezone handling across the v2 and v3 API endpoints.

ibis-server/tests/routers/v2/connector/test_mysql.py (2)

165-166: Timestamp values updated correctly for DataFusion timezone format.

The timestamp values now include explicit timezone offsets (`+00:00`) instead of naive timestamps, which aligns with the DataFusion migration's approach to consistent timezone representation in query results.

177-179: Data types updated to include timezone information.

The timestamp data types now correctly include timezone information (`timestamp[us, tz=UTC]`) instead of timezone-naive types (`timestamp[us]`). This reflects DataFusion's enhanced timezone handling and provides more accurate type information for timestamp columns.

ibis-server/tests/routers/v3/connector/bigquery/test_query.py (3)

85-85: LGTM: Numeric type change reflects DataFusion behavior.

The change from string `"356711.63"` to float `356711.63` correctly reflects DataFusion's native numeric type handling, which is more accurate than DuckDB's string representation.
89-91: LGTM: Timezone format standardized to offset notation.

The changes from the `"UTC"` suffix to the `"+00:00"` offset format align with the PR objective to standardize timezone representation using DataFusion. The DST calculations appear correct:
- EST (UTC-5) for January 15th
- EDT (UTC-4) for July 15th
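Those offsets can be sanity-checked with the standard library's `zoneinfo` (the tests' year is not stated, so 2024 is assumed here for illustration):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
# Convert the same UTC wall-clock time on a winter and a summer date
jan = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc).astimezone(ny)
jul = datetime(2024, 7, 15, 12, 0, tzinfo=timezone.utc).astimezone(ny)

print(jan.utcoffset() == timedelta(hours=-5))  # True (EST)
print(jul.utcoffset() == timedelta(hours=-4))  # True (EDT)
```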
296-298: LGTM: Timestamp function tests updated consistently.

The timezone format changes are consistently applied to all timestamp function test results, maintaining the same offset notation pattern throughout the test suite.
ibis-server/tests/routers/v2/connector/test_postgres.py (2)
191-191: LGTM: Timezone format standardized across connectors.

The change from the `"UTC"` suffix to `"+00:00"` offset notation maintains consistency with the DataFusion timezone handling implementation across all connector tests.

1010-1010: LGTM: Improved numeric precision with DataFusion.

The change from string `"74.6626535"` to the precise float `74.66265347816136` reflects DataFusion's improved numeric precision handling for PostGIS geometry calculations.

wren-core/core/src/mdl/context.rs (3)

42-42: LGTM: Required import for DataFusion configuration.

The `ScalarValue` import is necessary for setting DataFusion session configuration values.
55-58: LGTM: Safe timezone extraction with fallback.

The timezone extraction properly handles the optional nature of session properties:

- Uses the `"x-wren-timezone"` key as documented in the PR objectives
- Provides a safe fallback to `"UTC"` when not specified
- Properly handles nested Option types
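In Python terms, the extraction with fallback amounts to the following (a paraphrase of the Rust logic, with `headers` as a plain dict; `session_timezone` is a hypothetical name):

```python
def session_timezone(headers: dict[str, str]) -> str:
    # Fall back to UTC when the x-wren-timezone header is absent or empty
    return headers.get("x-wren-timezone") or "UTC"

print(session_timezone({}))                                       # UTC
print(session_timezone({"x-wren-timezone": "America/New_York"}))  # America/New_York
```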
61-64: LGTM: Correct DataFusion timezone configuration.

The timezone configuration correctly uses DataFusion's `"datafusion.execution.time_zone"` setting with a `ScalarValue::Utf8` wrapper, which is the proper way to set the session timezone in DataFusion.

wren-core/core/src/mdl/mod.rs (9)

474-474: LGTM: SessionContext import added.

The import is necessary for the updated test functions that now create SessionContext instances with `SessionContext::new()`.

1122-1125: LGTM: Timezone handling updated to use headers.

The test correctly implements the new approach of setting the timezone via headers with the "x-wren-timezone" key instead of using SessionConfig. This aligns with the PR objective of switching from DuckDB to DataFusion for timezone handling.
1132-1132: LGTM: Using headers for session properties.

The test now correctly passes the headers containing timezone information to `transform_sql_with_ctx` instead of an empty HashMap.

1144-1144: LGTM: Consistent use of headers for timezone.

The test maintains consistency by using the same headers throughout the timestamp-with-timezone test.

1151-1157: LGTM: Timezone setup for America/New_York test case.

The test correctly creates a new headers map for testing America/New_York timezone handling, following the same pattern as the previous test case.

1165-1165: LGTM: Using America/New_York timezone headers.

The test correctly applies the timezone headers for the America/New_York test case.

1177-1177: LGTM: Consistent timezone header usage.

The test maintains consistency by using the same timezone headers for all related test cases.

1330-1331: LGTM: Updated assertion reflects stricter type casting.

The assertion now expects explicit casting of timestamp literals as `TIMESTAMP WITH TIME ZONE` before the final cast, which reflects stricter typing expectations in DataFusion compared to DuckDB.

1359-1361: LGTM: Consistent type casting in assertions.

The assertion follows the same pattern as the previous test case, expecting explicit casting of timestamp columns to `TIMESTAMP WITH TIME ZONE` before comparison operations.

ibis-server/tests/routers/v2/connector/test_bigquery.py (10)
94-94: LGTM!

The change from string to float representation for numeric values is correct and aligns with DataFusion's numeric type handling.

98-98: Timestamp format correctly updated to offset notation.

The change from the `"UTC"` suffix to the `"+00:00"` offset is consistent with the PR objectives for DataFusion timestamp formatting.

320-320: Interval format correctly updated for DataFusion.

The interval representation now uses DataFusion's format, with months abbreviated as "mons" and simplified hour notation.

335-335: Average interval format correctly updated.

The new format explicitly labels time units (hours, mins, secs), which is DataFusion's more verbose but clearer interval representation.

365-365: Consistent interval formatting applied.

The interval format matches the earlier update, maintaining consistency across all interval test assertions.

94-94: LGTM: Numeric value formatting updated correctly.

The change from string `"172799.49"` to float `172799.49` aligns with proper data type representation in the DataFusion migration.

98-98: LGTM: Timezone format updated as documented.

The change from `"UTC"` to `"+00:00"` format is explicitly mentioned in the PR objectives as a key change when migrating to DataFusion for query result formatting.

320-320: LGTM: Interval format updated to DataFusion's representation.

The change from the verbose format to the condensed format (`"112 mons 100 days 1 hours"`) reflects DataFusion's interval formatting conventions.

335-335: LGTM: Average interval format updated consistently.

The more explicit breakdown format (`"10484 days 8 hours 54 mins 14.400000000 secs"`) is consistent with DataFusion's interval representation for aggregate functions.

365-365: LGTM: Consistent interval format across test cases.

This change mirrors the format update in line 320, ensuring consistency across all interval-related test assertions.
ibis-server/tests/routers/v3/connector/postgres/test_query.py (5)
144-144: Numeric formatting consistent with DataFusion.

Float representation matches the changes in the other connector tests.

148-150: Timestamp formatting correctly handles timezones.

The changes properly show:

- UTC timestamps with `+00:00` offset notation
- Correct DST conversions for the America/New_York timezone

183-190: Data types correctly updated to Arrow format.

The dtype changes from pandas/numpy style to Arrow types align with DataFusion's type system.

627-627: Test cases updated to cover DataFusion's floating point formatting.

The modified test cases better reflect DataFusion's handling of scientific notation and decimal formatting thresholds.

Also applies to: 635-635

668-699: Test assertions correctly updated for DataFusion formatting.

The expected results properly reflect:
- Consistent decimal string formatting
- Scientific notation for small values
- Special float values (NaN, Infinity) as null
Note the TODO comment about future scientific notation improvements.
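The NaN/Infinity-to-null mapping follows from JSON having no literal for non-finite floats; a minimal sketch (`json_safe` is a hypothetical name for illustration):

```python
import math

def json_safe(value):
    # Replace non-finite floats with None, since JSON cannot represent them
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value

print(json_safe(float("nan")))       # None
print(json_safe(float("inf")))       # None
print(json_safe(74.66265347816136))  # 74.66265347816136
```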
ibis-server/app/util.py (7)
43-44: Function signature properly extended for timezone handling.

The optional `data_source` parameter enables data source-specific timezone adjustments while maintaining backward compatibility.

53-62: DataFusion integration correctly implemented.

The code properly:
- Creates timezone-aware DataFusion context
- Registers Arrow tables using DataFusion API
- Formats SQL for DataFusion dialect
113-123: DataFusion context correctly configured with timezone.

The implementation properly sets the session timezone from headers with a sensible UTC default.

200-203: SQL identifier quoting correctly implemented.

A standard implementation that properly escapes double quotes and prevents SQL injection.
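A standard double-quote escaping routine matching that description looks like this (a sketch, not necessarily the PR's exact code):

```python
def quote_identifier(name: str) -> str:
    # Double any embedded double quotes, then wrap the whole name in quotes
    return '"' + name.replace('"', '""') + '"'

print(quote_identifier("order"))   # "order"
print(quote_identifier('we"ird'))  # "we""ird"
```

Doubling embedded quotes keeps a malicious column name from breaking out of the quoted identifier.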
213-213: Date formatting correctly updated for DataFusion.

Uses the PostgreSQL-style `to_char` function with an appropriate format string.

222-222: Binary encoding correctly updated for DataFusion.

Uses the PostgreSQL-style `encode` function, which is the appropriate method for DataFusion.
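Putting the quoting and `to_char`/`encode` pieces together, a per-column SELECT expression builder can be sketched as follows (the real code dispatches on pyarrow field types; the plain string type names here are illustrative assumptions):

```python
def format_expr(column_name: str, type_name: str) -> str:
    # Build one DataFusion SELECT expression per column (illustrative sketch)
    quoted = '"' + column_name.replace('"', '""') + '"'
    if type_name == "timestamp":
        # Format string taken from the review diff above
        return f"to_char({quoted}, '%Y-%m-%d %H:%M:%S%.6f') as {quoted}"
    if type_name == "binary":
        return f"encode({quoted}, 'hex') as {quoted}"
    return quoted

print(format_expr("created_at", "timestamp"))
# to_char("created_at", '%Y-%m-%d %H:%M:%S%.6f') as "created_at"
```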
3-3:

```shell
#!/bin/bash
# Search for any reference to datafusion in pyproject.toml to ensure it's listed under dependencies
grep -R "datafusion" -n ibis-server/pyproject.toml
```
BigQuery test has been tested locally. Let's ignore the CI failure.
Description
Because DuckDB doesn't support setting a timezone offset as the session timezone, we switched to DataFusion to format the query result. This changes the result in a few ways:

- `2011-01-01 12:00:00 UTC` -> `2011-01-01 12:00:00 +00:00`
- `TIMESTAMP` and `DATE` types are not affected by the session timezone; it only affects the `TIMESTAMP WITH TIME ZONE` type.

Known Issue
Summary by CodeRabbit
New Features
Bug Fixes
Chores