feat(ibis): handle query result using Arrow Type#1224
feat(ibis): handle query result using Arrow Type#1224douenergy merged 26 commits intoCanner:mainfrom
Conversation
WalkthroughThis update transitions the query result handling across the codebase from pandas DataFrames to Apache Arrow Tables, revises JSON serialization to leverage DuckDB SQL formatting, and refines data type and value expectations throughout the test suite. It also introduces new HTTP header constants, updates cache management logic, and enhances connector and cache-related response header handling. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant API
participant Connector
participant QueryCache
participant DuckDB
participant Response
Client->>API: Send query request
API->>QueryCache: Check for cached result
alt Cache Hit
QueryCache->>API: Return Arrow Table
API->>Response: Serialize Arrow Table to JSON via DuckDB
else Cache Miss
API->>Connector: Execute query
Connector->>API: Return Arrow Table
API->>QueryCache: Store Arrow Table in cache
API->>Response: Serialize Arrow Table to JSON via DuckDB
end
API->>Client: Return JSON response with cache headers
Suggested labels
Suggested reviewers
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 6
🔭 Outside diff range comments (5)
ibis-server/tests/routers/v3/connector/oracle/conftest.py (1)
69-75: Large CLOB insertion may break Arrow conversionThe comment in the v2 fixture said “ibis to pyarrow conversion does not support CLOBs yet”, but here the CLOB table is actually created and populated with a 2 MB value. If Arrow support is still missing, subsequent schema introspection or query calls could raise at runtime.
If the table is only for future work, consider skipping it for now:
- large_text = "x" * (1024 * 1024 * 2) # 2MB - conn.execute(text("CREATE TABLE test_lob (id NUMBER, content CLOB)")) - conn.execute( - text("INSERT INTO test_lob VALUES (1, :content)"), {"content": large_text} - ) + # TODO: enable once ibis→pyarrow supports CLOBibis-server/app/routers/v3/connector.py (1)
106-118:X-Cache-Create-Atmissing when a brand-new cache file is createdIn the cache-miss branch
(True, False, _)you persist the query result, but you never populateX_CACHE_CREATE_AT.
Down-stream clients – and your own tests – rely on this header to know when the cache was materialised. They currently receive the header only on a hit or an override, which is counter-intuitive.case (True, False, _): query_cache_manager.set( data_source, dto.sql, result, dto.connection_info ) + cache_headers[X_CACHE_CREATE_AT] = str( + query_cache_manager.get_cache_file_timestamp( + data_source, dto.sql, dto.connection_info + ) + )ibis-server/tests/routers/v2/connector/test_bigquery.py (1)
142-145: Hard-coded Unix timestamp will break after 2025-04-07assert int(response2.headers["X-Cache-Create-At"]) > 1743984000 # 2025.04.07This assertion will start failing once we pass that date. Use a dynamic bound instead:
create_at = int(response2.headers["X-Cache-Create-At"]) assert create_at >= int(time.time()) - 5 # allow small skewKeeps the test future-proof.
ibis-server/app/model/connector.py (2)
113-118: Return-type annotation is stale
CannerConnector.querynow returns an Arrowpa.Table, but the annotation still sayspd.DataFrame, which will mislead users and static type-checkers.- def query(self, sql: str, limit: int) -> pd.DataFrame: + def query(self, sql: str, limit: int) -> pa.Table:
145-181: Inconsistent return type on empty-result workaroundWhen the BigQuery path hits the
"Must pass schema"branch, it returns
pd.DataFrame(...), contradicting the declared-> pa.Tableand the rest
of the pipeline, which now assumes Arrow.
Down-stream utilities (e.g.to_json) will raise.Suggested fix:
- ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema) - return pd.DataFrame(columns=ibis_fields.names) + ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema) + empty_tbl = pa.Table.from_arrays( + [pa.array([], type=typ.to_pyarrow()) for typ in ibis_fields.types], + names=ibis_fields.names, + ) + return empty_tblYou can then drop the
import pandas as pdat the top.
♻️ Duplicate comments (3)
ibis-server/tests/routers/v2/connector/test_gcs_file.py (1)
119-129: GCS connector mirrors the decimal/type drift observed in local_fileThe same mismatch (
floatin manifest vsdecimal128in response) appears here.
Prefer fixing in one shared serialisation path instead of per-connector hacks.ibis-server/tests/routers/v2/connector/test_s3_file.py (2)
119-129: Same schema divergence (decimal128vsfloat) seen for S3 connectorPlease consolidate the typing strategy across all file connectors.
165-167: Aggregate precision upgrade – ensure consistency with analytics layerSee earlier note on
sum_totalpriceprecision jump.
🧹 Nitpick comments (16)
ibis-server/app/dependencies.py (1)
9-13: Header-constant naming diverges from existing patternEarlier constants embed the literal header value in lower-case (
"x-wren-*") while new ones use mixed-case ("X-Cache-*"). HTTP headers are case-insensitive, but mixing styles can be confusing and increases the risk of subtle bugs during look-ups or tests that use exact strings.Consider normalising the literals to one style (preferably lower-case to stay consistent with the pre-existing
x-wren-*values) or, at minimum, documenting the rationale for the different casing.ibis-server/tools/query_local_run.py (2)
17-22: Remove unusedpandasimport
pandasis no longer referenced after switching toto_pyarrow(), and Ruff flags this (F401). Delete the import to avoid needless dependency loading.-from app.model import MySqlConnectionInfo, OracleConnectionInfo, PostgresConnectionInfo -from app.util import to_json -import sqlglot -import sys -import pandas as pd +from app.model import MySqlConnectionInfo, OracleConnectionInfo, PostgresConnectionInfo +from app.util import to_json +import sqlglot +import sys
97-104: Minor clean-ups & header handling
dict()→{}is shorter and matches the rest of the file.to_jsoncurrently receives a plain dict for headers; its signature elsewhere expects astarlette.datastructures.Headers(or similar) object. Pass an actualHeaders({})or adapt the function to acceptdictexplicitly to prevent runtime type mismatches.-json_str = to_json(df, dict()) +from starlette.datastructures import Headers + +json_str = to_json(df, Headers({}))ibis-server/app/custom_ibis/backends/sql/datatypes.py (1)
5-8: Nit: typo in comment and link formatting
"ibs BQ connector"→"ibis BigQuery connector"for clarity. Also consider shortening the link with an inline comment to keep the file compact.- # It's a workaround for the issue of ibs BQ connector not supporting interval precision. + # Work-around: the ibis BigQuery connector ignores interval precision.ibis-server/tests/routers/v2/connector/test_trino.py (1)
115-133: Payload / dtype mismatch for several numeric & timestamp columns
totalprice,timestamp,timestamptz, andtest_null_timeare sent as strings but typed asdouble/timestamp[…]. Decide on one convention and apply it across tests.If string serialisation stays, consider updating dtypes to
stringor, better, serialise true numbers/timestamps (ISO 8601) and keep the precise Arrow dtypes.ibis-server/tests/routers/v2/connector/test_canner.py (1)
98-114: Consistency: numeric & timestamp values serialised as stringsYet again,
totalpriceand all timestamps are stringified while dtypes claim numeric/timestamp Arrow types. A helper that converts an Arrow table to the expected JSON (respecting Arrow type → JSON mapping) would prevent these per-file divergences.ibis-server/app/routers/v3/connector.py (1)
161-163: Cache-headers assignment duplicated across v2 & v3 routersThe same (non-trivial) block that derives
cache_headersnow lives in two routers. Any future tweak (e.g. adding a new header) must be done twice and risks drifting.Consider extracting a small helper, e.g.
build_cache_headers(cache_enable, cache_hit, override_cache, cache_manager, …)insideapp.util, returning both the Arrow table (if it needs to write) and the header dict.
That keeps router code focused on HTTP orchestration.ibis-server/app/routers/v2/connector.py (1)
175-177: Coupling response creation to header injection
update_response_headersmutates the response post-creation. A safer & cheaper pattern is to build the full header dict first and pass it toORJSONResponsevia itsheaders=param – that avoids a second pass over the headers and keeps them immutable after instantiation.Not blocking, but worth considering.
ibis-server/tests/routers/v2/connector/test_minio_file.py (1)
175-185: Minor: decimal values are compared as strings
"172799.49"is asserted as a string becauseto_jsonformatsdecimal128via DuckDB’sCAST(value AS VARCHAR).
This is fine, but keep in mind you lose numeric ordering in client code. If consumers need arithmetic, consider returning numbers and letting the JSON encoder handle precision, or add a follow-up test to ensure ordering works client-side.ibis-server/app/query_cache/__init__.py (2)
29-33: Log message & variable naming are misleading
ibis.read_parquet(...).to_pyarrow()returns apyarrow.Table, not a “dataframe”.
Keeping the wording consistent will avoid confusion when debugging.-logger.info("query cache to dataframe") +logger.info("query cache loaded as Arrow Table")
113-115: Redundant helper after refactorIf the above direct-write refactor is applied,
_get_duckdb_connection()(and the DuckDB import) become dead code and should be removed to avoid unused-code drift.ibis-server/tests/routers/v2/connector/test_postgres.py (1)
238-239: Time-based assertion is brittle
assert int(response2.headers["X-Cache-Create-At"]) > 1743984000embeds a hard-coded epoch (≈ 2025-04-07 UTC).
When this date becomes “too far in the past” the assertion will continue to pass, but it gives no guarantee that the header represents now.
Safer pattern:import time assert abs(int(response2.headers["X-Cache-Create-At"]) - int(time.time())) < 60…which confirms the timestamp is within a reasonable window of the test run.
ibis-server/app/model/connector.py (1)
199-202: Minor: slice() ignores negative / None limits
pa.Table.slice(length=limit)will happily accept negative or
None, yielding unexpected results.
Consider guarding with:if limit is not None and limit >= 0: return tbl.slice(length=limit) return tblibis-server/app/util.py (1)
169-190: _formater complexity & style nitsThe huge CASE string embeds newlines and uses RTRIM chains; readability
suffers and pylint rightly flags the many returns / redundantelif.Not urgent, but extracting discrete helpers (e.g.
_format_numeric,
_format_datetime) would improve maintainability and silence R0911 /
R1705.ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
155-162: Confirm dtype expectationsThe dtype strings now use Arrow notation (
"string","double",
"date32[day]", …). Ensureto_jsonalways produces these exact tokens
for every backend; a mismatch will cascade across many tests.ibis-server/tests/routers/v3/connector/oracle/test_query.py (1)
204-210: Nit: avoid shadowing –manifest_strhere is no longer a stringInside
test_query_number_scalethe variable namedmanifest_strnow holds a dict, and latermanifest_str_base64is the actual base-64 string.
Renaming the first one tomanifest_dict(or similar) would make the intent clearer and prevent accidental misuse.- manifest_str = { + manifest_dict = { ... - manifest_str_base64 = base64.b64encode(orjson.dumps(manifest_str)).decode("utf-8") + manifest_str_base64 = base64.b64encode(orjson.dumps(manifest_dict)).decode("utf-8") ... - assert len(result["columns"]) == len(manifest_str["models"][0]["columns"]) + assert len(result["columns"]) == len(manifest_dict["models"][0]["columns"])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (28)
ibis-server/app/custom_ibis/backends/sql/datatypes.py(1 hunks)ibis-server/app/dependencies.py(1 hunks)ibis-server/app/model/connector.py(6 hunks)ibis-server/app/model/metadata/object_storage.py(1 hunks)ibis-server/app/query_cache/__init__.py(4 hunks)ibis-server/app/routers/v2/connector.py(5 hunks)ibis-server/app/routers/v3/connector.py(6 hunks)ibis-server/app/util.py(4 hunks)ibis-server/tests/routers/v2/connector/test_bigquery.py(10 hunks)ibis-server/tests/routers/v2/connector/test_canner.py(2 hunks)ibis-server/tests/routers/v2/connector/test_clickhouse.py(5 hunks)ibis-server/tests/routers/v2/connector/test_gcs_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_local_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_minio_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_mssql.py(2 hunks)ibis-server/tests/routers/v2/connector/test_mysql.py(2 hunks)ibis-server/tests/routers/v2/connector/test_oracle.py(4 hunks)ibis-server/tests/routers/v2/connector/test_postgres.py(3 hunks)ibis-server/tests/routers/v2/connector/test_s3_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_snowflake.py(1 hunks)ibis-server/tests/routers/v2/connector/test_trino.py(2 hunks)ibis-server/tests/routers/v3/connector/bigquery/test_functions.py(1 hunks)ibis-server/tests/routers/v3/connector/bigquery/test_query.py(5 hunks)ibis-server/tests/routers/v3/connector/local_file/test_query.py(2 hunks)ibis-server/tests/routers/v3/connector/oracle/conftest.py(1 hunks)ibis-server/tests/routers/v3/connector/oracle/test_query.py(4 hunks)ibis-server/tests/routers/v3/connector/postgres/test_query.py(8 hunks)ibis-server/tools/query_local_run.py(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
ibis-server/tests/routers/v3/connector/oracle/conftest.py (2)
ibis-server/tests/routers/v2/connector/test_oracle.py (1)
oracle(121-168)ibis-server/tests/conftest.py (1)
file_path(10-11)
ibis-server/app/routers/v3/connector.py (1)
ibis-server/app/util.py (2)
update_response_headers(158-166)to_json(43-59)
ibis-server/tests/routers/v2/connector/test_oracle.py (1)
ibis-server/tests/conftest.py (1)
file_path(10-11)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (2)
ibis-server/tests/conftest.py (1)
client(18-23)ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)
connection_info(37-44)
🪛 Ruff (0.11.9)
ibis-server/tools/query_local_run.py
21-21: pandas imported but unused
(F401)
98-98: print found
(T201)
99-99: print found
(T201)
100-100: print found
(T201)
102-102: print found
(T201)
103-103: print found
(T201)
104-104: print found
(T201)
🪛 Pylint (3.3.7)
ibis-server/tools/query_local_run.py
[refactor] 101-101: Consider using '{}' instead of a call to 'dict'.
(R1735)
ibis-server/app/query_cache/__init__.py
[refactor] 40-40: Useless return at end of function or method
(R1711)
ibis-server/app/util.py
[refactor] 171-189: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 180-185: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 169-169: Too many return statements (7/6)
(R0911)
🔇 Additional comments (19)
ibis-server/app/dependencies.py (1)
32-45: ```shell
#!/bin/bash
set -eecho "🔍 Displaying lines 1-200 of ibis-server/app/dependencies.py for context:"
sed -n '1,200p' ibis-server/app/dependencies.py || trueecho
echo "🔍 Displayingget_wren_headersdefinition in ibis-server/app/dependencies.py:"
rg -n "def get_wren_headers" -C5 ibis-server/app/dependencies.py || trueecho
echo "🔍 Searching for 'cache' in ibis-server/app:"
rg -n "cache" -C2 ibis-server/app || trueecho
echo "🔍 Searching for 'Cache' in ibis-server/app:"
rg -n "Cache" -C2 ibis-server/app || true</details> <details> <summary>ibis-server/app/model/metadata/object_storage.py (1)</summary> `254-257`: **LGTM – message fix only** The log message now correctly says “duckdb gcs”, which removes a misleading leftover from Minio. </details> <details> <summary>ibis-server/tests/routers/v3/connector/bigquery/test_functions.py (1)</summary> `100-102`: **Expectation update matches Arrow migration** The new expected payload (`"2024-01-02"` with `date32[day]`) aligns with the Arrow-based JSON serialisation. Looks good. </details> <details> <summary>ibis-server/tests/routers/v3/connector/local_file/test_query.py (2)</summary> `148-149`: **Double-check calculated-field precision** `sum_totalprice` now expects Arrow `double`, but the payload remains a string (`"2860895.79"`). Same comment as above: decide whether the API guarantees numeric JSON values or stringified numbers. --- `96-105`: ```shell #!/bin/bash echo "Occurrences of the string literal \"172799.49\":" rg -n '"172799.49"' -C1 ibis-server/tests/routers/v3/connector echo -e "\nOccurrences of the numeric literal 172799.49 (no quotes):" rg -n '172799\.49,' -C1 ibis-server/tests/routers/v3/connectoribis-server/tests/routers/v3/connector/oracle/conftest.py (1)
32-52: 👍 Explicit dtype mapping avoids implicit CLOBsGood call adding explicit
VARCHAR/DECIMALmappings – this prevents Oracle defaulting to CLOB for text columns.ibis-server/tests/routers/v2/connector/test_mssql.py (1)
129-147: ```shell
#!/bin/bashFind all occurrences of totalprice dtype declarations in connector tests
rg -n '"totalprice":' -C 1 ibis-server/tests/routers/v2/connector/test_*.py
</details> <details> <summary>ibis-server/tests/routers/v2/connector/test_mysql.py (1)</summary> `158-169`: **`bytea_column` value looks like raw hex, not binary – ensure encoding matches the API contract** `"616263"` is the ASCII-hex representation of `abc`, yet the declared dtype is `"binary"`. If the backend promises real binary payloads (e.g. base64 or escaped bytes), treating a plain hex string as binary can break downstream consumers. Consider normalising the JSON serialisation so that: 1. `data` carries either a base64 string or a regular UTF-8 string while 2. the dtype is `"binary"` only when the JSON value is base64/binary-safe. Otherwise rename the dtype to `"string"` to avoid mis-hinting. </details> <details> <summary>ibis-server/tests/routers/v2/connector/test_local_file.py (1)</summary> `110-120`: **Manifest says `float`, runtime returns `decimal128` – schema drift** `totalprice` and `orderdate` now come back as `decimal128(15, 2)` and `date32[day]`. The manifest, however, still marks `totalprice` as `"float"` and `orderdate` as `"date"`. Leaving this divergence un-addressed will surface as validation errors once the engine starts checking manifest ↔︎ result compatibility. Synchronise the manifest or adjust the DuckDB/Arrow casting layer. </details> <details> <summary>ibis-server/tests/routers/v2/connector/test_gcs_file.py (1)</summary> `164-167`: **High-precision aggregate promoted to `decimal128(38,2)` – verify client compatibility** Large–scale decimals may not round-trip cleanly to JS/JSON consumers. Confirm that the public API guarantees stringified decimals and document it. </details> <details> <summary>ibis-server/tests/routers/v2/connector/test_snowflake.py (1)</summary> `93-109`: **Snowflake: `totalprice` now a string, dates as `date32[day]`; confirm cross-connector parity** Snowflake diverges from file connectors (decimal128) and from the manifest (`float`). Before these expectations ossify in tests, decide on a canonical logical type for money values and propagate it uniformly. Also note that `timestamptz` correctly keeps the `tz=UTC` qualifier – good catch. </details> <details> <summary>ibis-server/tests/routers/v2/connector/test_oracle.py (1)</summary> `198-206`: **Timestamp unit inconsistency across connectors** Here you expect `"timestamp[ns]"`, while BigQuery tests expect `"timestamp[us]"`. If this stems from connector-specific casting, that’s fine, but if it’s accidental you may want to normalise the unit (µs vs ns) to avoid surprising API consumers that mix data sources. Please double-check the Arrow schema produced by the Oracle connector. </details> <details> <summary>ibis-server/tests/routers/v3/connector/bigquery/test_query.py (1)</summary> `108-111`: **Randomised constant alters column count** Appending `now` (`SELECT *, {now}`) changes the schema each run, so cache entries for identical SQL executed seconds apart will **never hit**. If the intention is *only* to avoid collisions during the test, consider embedding a fixed comment instead: ```python sql = "/* cache-bypass */ SELECT * FROM orders ORDER BY o_orderkey LIMIT 1"This keeps the schema stable while still producing a unique cache key.
ibis-server/tests/routers/v2/connector/test_clickhouse.py (1)
190-201: Inconsistent null-timestamp dtypeFor ClickHouse the nullable timestamp
test_null_timeis asserted as"string"whereas other back-ends map the same column to"timestamp[us]".
If this is deliberate (e.g. ClickHouse driver limitation) add a short comment explaining the divergence; otherwise the discrepancy may hide a coercion bug.ibis-server/app/util.py (1)
78-92: Reliance onpg_timezone_names()may fail inside DuckDB
pg_timezone_names()is a Postgres catalog view and is not available
in vanilla DuckDB.duckdb.execute(sql)will therefore error on most
deployments.Consider shipping an internal mapping of offsets → IANA names or
delegating to Python’szoneinfoinstead of a SQL call.ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
640-662: NaN assertion could be brittle
CAST('NaN' AS FLOAT)is compared againstNone. Whether DuckDB returns
NULLor the string'nan'after the new formatter is backend-version
dependent. If DuckDB upgrades its formatting, this test may flip.Recommend using
math.isnan(...)or asserting membership in{None, "nan", "NaN"}to avoid flaky CI.ibis-server/tests/routers/v3/connector/oracle/test_query.py (3)
178-182: Ensuredecimal128(10, 2)matches the actual Oracle column definitionThe manifest hard-codes a scale/precision of
(10, 2)forid_p_s.
If the underlying column is defined asNUMBER(10, 2)this is perfect.
If it isNUMBER(8, 2)(or unconstrainedNUMBER) the driver will happily coerce, but Arrow will report a differentdecimal128(p, s)at runtime, causing the test to fail.#!/bin/bash # Quickly inspect the column definition through information_schema to confirm precision/scale echo "SELECT data_precision, data_scale FROM all_tab_columns WHERE table_name='TEST_NUMBER' AND column_name='ID_P_S';" | sqlplus -s "${DB_USER}/${DB_PASSWORD}@${DB_SVC}"
92-102: ```shell
#!/bin/bashSearch for any occurrences of TRUNC("O_ORDERDATE") in test files under ibis-server
rg -n 'TRUNC("O_ORDERDATE")' ibis-server/tests
--- `103-113`: ```shell #!/bin/bash set -e # List files in the oracle router directory echo "Files in ibis-server/app/routers/v3/connector/oracle:" find ibis-server/app/routers/v3/connector/oracle -type f # Show the beginning of the query router QUERY_FILE=$(find ibis-server/app/routers/v3/connector/oracle -type f | grep 'query.py$' || true) echo "Opening $QUERY_FILE" if [ -n "$QUERY_FILE" ]; then sed -n '1,200p' "$QUERY_FILE" else echo "No query.py found" fi
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
ibis-server/app/model/validator.py (1)
145-149: Guard against connectors that already return aDataFrameand keep Arrow benefits where possible
self.connector.query(...).to_pandas()assumes every connector now returns apyarrow.Table.
If any connector still delivers apandas.DataFrame, this will raise anAttributeError.
A small guard removes that fragility and lets you gradually migrate connectors, while avoiding a needless conversion when aDataFrameis already returned.- result = self.connector.query(rewritten_sql, limit=1).to_pandas() + raw_result = self.connector.query(rewritten_sql, limit=1) + # Convert to pandas only when the object supports it + result = raw_result.to_pandas() if hasattr(raw_result, "to_pandas") else raw_resultThis also keeps open the possibility of later rewriting the validation logic to operate directly on Arrow Tables (and thus preserve the Arrow performance gains) without changing call-sites again.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ibis-server/app/model/validator.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
|
There are some CI issues for BigQuery. |
a27f4ef to
8293caa
Compare
|
Test redshift locally: |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (3)
ibis-server/app/model/metadata/redshift.py (1)
110-112: Same optimisation applies to the constraints queryDuplicate of the comment above – the constraints path hits the same pandas dependency and memory cost. Reuse the Arrow-native conversion here as well.
ibis-server/app/query_cache/__init__.py (1)
52-58: Still opening an unused file handle & unnecessary DuckDB round-tripThe earlier feedback hasn’t been addressed: a binary file handle is opened but not used, and the Arrow → DuckDB → Parquet detour remains. This wastes I/O, allocates an unneeded connection, and may leave an empty file on failure.
See previous diff suggestion (unchanged):
- with op.open(cache_file_name, mode="wb") as file: - con = self._get_duckdb_connection() - arrow_table = con.from_arrow(result) - if file.writable(): - arrow_table.write_parquet(full_path) + import pyarrow.parquet as pq + pq.write_table(result, full_path)This also lets you drop
_get_duckdb_connection()entirely.ibis-server/app/util.py (1)
65-71: Critical bug –.startwithtypo still present
timezone.startwith("+")should betimezone.startswith("+")(same for the-branch).
The typo raisesAttributeErrorfor every offset based-timezone and prevents the header from working at all.- if timezone.startwith("+") or timezone.startswith("-"): + if timezone.startswith(("+", "-")):
🧹 Nitpick comments (9)
ibis-server/tests/routers/v3/connector/redshift/test_query.py (1)
84-91:timestamp[us]may mask TZ information
pa.timestamp('us')without a timezone is not equivalent to RedshiftTIMESTAMPTZ; Arrow will keep a separate'us', tz=<zone>logical type for tz-aware values.
If the backend starts returning timezone info (e.g., after a session parameter change), this strict equality will fail.Either:
- Cast explicitly in SQL:
CAST('...' AS TIMESTAMP WITHOUT TIME ZONE)vsWITH TIME ZONE; or- Relax the assertion to accept
"timestamp[us]"or"timestamp[us, tz=*]".ibis-server/tools/query_local_run.py (2)
17-22: Remove unusedpandasimport
pandasis imported but never referenced after the migration topyarrow.
Eliminate the import (and optionally add# noqa: F401if you must keep it for local experiments) to silence Ruff F401 and keep the script tidy.-from app.util import to_json -import sqlglot -import sys -import pandas as pd +from app.util import to_json +import sqlglot +import sys
100-104: Console prints flagged by Ruff T201 – consider replacing withloggerFor a CLI script this is often acceptable, but if you want a clean Ruff run or richer log control switch to
loguru/logging.ibis-server/tests/routers/v2/connector/test_redshift.py (1)
64-69: Track unsupportedvarbytecolumn in an issueGood call to comment the bytea column until ConnectorX supports
varbyte.
Consider adding a TODO reference to an issue number to avoid losing the task.ibis-server/app/util.py (3)
79-91:get_timezone_from_offsetsilently returns the first matchIf multiple time-zones share the same UTC offset (very common) the first row returned is arbitrary and can change between DuckDB versions.
Return all matches or document the behaviour; at minimum log a warning.
158-167: Header propagation helper is verbose
update_response_headersrepeats the same pattern four times and will drift when new cache headers arrive.
Use a loop to keep it DRY:-def update_response_headers(response, required_headers: dict): - if X_CACHE_HIT in required_headers: - response.headers[X_CACHE_HIT] = required_headers[X_CACHE_HIT] - ... +def update_response_headers(response, required_headers: dict): + for key in (X_CACHE_HIT, X_CACHE_CREATE_AT, X_CACHE_OVERRIDE, X_CACHE_OVERRIDE_AT): + if key in required_headers: + response.headers[key] = required_headers[key]
145-155: Dead code?pd_to_arrow_schemano longer usedThe new pipeline is Arrow-native; nothing invokes
pd_to_arrow_schema.
Delete it to keep the module lean or add a comment explaining future usage.ibis-server/tests/routers/v3/connector/postgres/test_query.py (2)
640-662: Flaky assertion – NaN mapped toNone
_formatercurrently convertsNaNto SQLNULL, which is marshalled toNone.
If the implementation ever changes to produce"nan"(e.g. by adding explicitformat()handling) this test will break.
Consider asserting withmath.isnanor accepting both representations.import math assert result["data"][0][18] is None or math.isnan(float('nan')) # relax expectation
591-668: Very large test payload – weigh cost vs. benefitThe
test_format_floatingSQL literal list is massive (28 literals) and all values are asserted individually.
While exhaustive, it slows the suite and is brittle for insignificant formatting tweaks.Suggestion: parametrize a smaller representative subset and focus on edge-cases (
0, very small/large, NaN/Inf).[performance]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
ibis-server/poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (36)
ibis-server/app/custom_ibis/backends/sql/datatypes.py(1 hunks)ibis-server/app/dependencies.py(1 hunks)ibis-server/app/model/connector.py(7 hunks)ibis-server/app/model/metadata/object_storage.py(1 hunks)ibis-server/app/model/metadata/redshift.py(2 hunks)ibis-server/app/model/validator.py(1 hunks)ibis-server/app/query_cache/__init__.py(4 hunks)ibis-server/app/routers/v2/connector.py(5 hunks)ibis-server/app/routers/v3/connector.py(6 hunks)ibis-server/app/util.py(4 hunks)ibis-server/pyproject.toml(1 hunks)ibis-server/tests/routers/v2/connector/test_athena.py(4 hunks)ibis-server/tests/routers/v2/connector/test_bigquery.py(10 hunks)ibis-server/tests/routers/v2/connector/test_canner.py(2 hunks)ibis-server/tests/routers/v2/connector/test_clickhouse.py(5 hunks)ibis-server/tests/routers/v2/connector/test_gcs_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_local_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_minio_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_mssql.py(2 hunks)ibis-server/tests/routers/v2/connector/test_mysql.py(2 hunks)ibis-server/tests/routers/v2/connector/test_oracle.py(4 hunks)ibis-server/tests/routers/v2/connector/test_postgres.py(3 hunks)ibis-server/tests/routers/v2/connector/test_redshift.py(2 hunks)ibis-server/tests/routers/v2/connector/test_s3_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_snowflake.py(1 hunks)ibis-server/tests/routers/v2/connector/test_trino.py(2 hunks)ibis-server/tests/routers/v3/connector/athena/test_query.py(2 hunks)ibis-server/tests/routers/v3/connector/bigquery/test_functions.py(1 hunks)ibis-server/tests/routers/v3/connector/bigquery/test_query.py(5 hunks)ibis-server/tests/routers/v3/connector/local_file/test_query.py(2 hunks)ibis-server/tests/routers/v3/connector/oracle/conftest.py(1 hunks)ibis-server/tests/routers/v3/connector/oracle/test_query.py(4 hunks)ibis-server/tests/routers/v3/connector/postgres/test_query.py(8 hunks)ibis-server/tests/routers/v3/connector/redshift/test_functions.py(1 hunks)ibis-server/tests/routers/v3/connector/redshift/test_query.py(1 hunks)ibis-server/tools/query_local_run.py(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- ibis-server/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (25)
- ibis-server/app/model/metadata/object_storage.py
- ibis-server/app/model/validator.py
- ibis-server/app/custom_ibis/backends/sql/datatypes.py
- ibis-server/tests/routers/v2/connector/test_trino.py
- ibis-server/tests/routers/v3/connector/oracle/conftest.py
- ibis-server/tests/routers/v2/connector/test_canner.py
- ibis-server/app/dependencies.py
- ibis-server/tests/routers/v2/connector/test_mssql.py
- ibis-server/tests/routers/v3/connector/local_file/test_query.py
- ibis-server/tests/routers/v2/connector/test_local_file.py
- ibis-server/tests/routers/v3/connector/bigquery/test_functions.py
- ibis-server/tests/routers/v2/connector/test_mysql.py
- ibis-server/tests/routers/v2/connector/test_gcs_file.py
- ibis-server/app/routers/v3/connector.py
- ibis-server/tests/routers/v3/connector/bigquery/test_query.py
- ibis-server/tests/routers/v2/connector/test_minio_file.py
- ibis-server/app/routers/v2/connector.py
- ibis-server/tests/routers/v2/connector/test_bigquery.py
- ibis-server/tests/routers/v2/connector/test_oracle.py
- ibis-server/tests/routers/v2/connector/test_snowflake.py
- ibis-server/tests/routers/v2/connector/test_s3_file.py
- ibis-server/tests/routers/v3/connector/oracle/test_query.py
- ibis-server/tests/routers/v2/connector/test_clickhouse.py
- ibis-server/app/model/connector.py
- ibis-server/tests/routers/v2/connector/test_postgres.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)
connection_info(37-44)
🪛 Pylint (3.3.7)
ibis-server/app/util.py
[refactor] 171-189: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 180-185: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 169-169: Too many return statements (7/6)
(R0911)
ibis-server/app/query_cache/__init__.py
[refactor] 40-40: Useless return at end of function or method
(R1711)
ibis-server/tools/query_local_run.py
[refactor] 101-101: Consider using '{}' instead of a call to 'dict'.
(R1735)
🪛 Ruff (0.11.9)
ibis-server/tools/query_local_run.py
21-21: pandas imported but unused
(F401)
98-98: print found
(T201)
99-99: print found
(T201)
100-100: print found
(T201)
102-102: print found
(T201)
103-103: print found
(T201)
104-104: print found
(T201)
🔇 Additional comments (7)
ibis-server/tests/routers/v3/connector/redshift/test_functions.py (1)
90-91: Confirmint32assumption for portability
ABS(-1)indeed returns the same type as its operand, but the SQL literal-1is not consistently typed as a 32-bit integer across engines (it isint64in some builds / arrow back-ends).
If Redshift/Arrow ever widens the literal toint64, this assertion will start failing.Consider loosening the check to
{"col": re.compile(r"int(32|64)")}or build the value with an explicit cast (SELECT CAST(-1 AS INT4)).ibis-server/tools/query_local_run.py (1)
91-98: Use.execute().fetch_arrow_table()instead oflimit(...).to_pyarrow()Calling
limit(...).to_pyarrow()materialises the whole query in the database and then truncates the Arrow table client-side, which still streams the full result over the wire.
Most back-ends (including Ibis ≥ 9) supportexecute(limit=n, result="arrow"), yielding the first n records directly on the server.-df = connection.sql(dialect_sql).limit(10).to_pyarrow() +df = connection.sql(dialect_sql).execute(limit=10, result="arrow")This reduces network traffic and latency for large result sets.
(Keep the current call if the chosen backend/driver doesn’t yet support the optimisation.)ibis-server/tests/routers/v2/connector/test_athena.py (2)
102-125: Updated expectations match Arrow-backed types – looks goodThe row payload and
dtypesassertions now reflectdate32anddecimal128as produced by Arrow/DuckDB.
No further action required.
147-165: Second manifest-driven query block consistent with first — okMirrors the changes above for the Glue database path; expectations are coherent.
ibis-server/tests/routers/v2/connector/test_redshift.py (1)
109-116: Arrow-aware dtype assertions validated – looks goodDtype refinements (
double,timestamp[us, tz=+00:00], etc.) align with the new serialization path.ibis-server/tests/routers/v3/connector/athena/test_query.py (1)
77-99: V3 API expectations updated correctlyDate simplification and Arrow dtype updates are consistent with the v2 tests; assertions remain deterministic.
ibis-server/app/util.py (1)
169-176: ```shell
#!/bin/bashSearch for usage of _formater in the repository
rg -n "_formater" -n .
</details> </blockquote></details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
ibis-server/app/util.py (1)
65-69:startwithtypo breaks offset handling (already reported)The guard still uses
timezone.startwith("+"); this will raiseAttributeErrorfor offset values and bypass the intended logic.- if timezone.startwith("+") or timezone.startswith("-"): + if timezone.startswith(("+", "-")):
🧹 Nitpick comments (1)
ibis-server/app/util.py (1)
52-52: Log message typo (“formmated_sql”)Minor nit – the key is misspelled. This makes log filtering harder.
- logger.debug(f"formmated_sql: {formatted_sql}") + logger.debug(f"formatted_sql: {formatted_sql}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ibis-server/app/model/metadata/redshift.py(2 hunks)ibis-server/app/util.py(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- ibis-server/app/model/metadata/redshift.py
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.516Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
🪛 Pylint (3.3.7)
ibis-server/app/util.py
[refactor] 168-186: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 177-182: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 166-166: Too many return statements (7/6)
(R0911)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
🔇 Additional comments (1)
ibis-server/app/util.py (1)
166-174: [web_search]Does DuckDB support the SQL function format() for numeric formatting, or is printf/to_char the correct function to use?[web_search]
What is the correct syntax and argument order for the strftime function in DuckDB?
6ea21a2 to
4775465
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
ibis-server/app/util.py (1)
62-70:startwithtypo crashes timezone handling
timezone.startwith("+")raisesAttributeError, preventing any non-named-tz offsets from working.- if timezone.startwith("+") or timezone.startswith("-"): + if timezone.startswith(("+", "-")):
🧹 Nitpick comments (2)
ibis-server/app/util.py (2)
50-54: Minor typos & missing variable exposure reduce debuggability
formmated_sqlis miss-spelled in the debug log – makes searching harder.- Consider logging the first n rows instead of the full SQL when extremely long – avoids megabyte-sized log lines.
- logger.debug(f"formmated_sql: {formatted_sql}") + logger.debug("formatted_sql: %s", formatted_sql[:500])
166-187: Simplify_formaterbranchingSeven early returns plus nested triple-quoted strings are hard to scan. Consider a mapping table or a
match/case(py3.10+) to reduce cognitive load.Not blocking, purely readability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
ibis-server/app/model/connector.py(7 hunks)ibis-server/app/model/metadata/redshift.py(2 hunks)ibis-server/app/util.py(4 hunks)ibis-server/tests/routers/v2/connector/test_redshift.py(1 hunks)ibis-server/tests/routers/v3/connector/redshift/test_query.py(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- ibis-server/tests/routers/v2/connector/test_redshift.py
🚧 Files skipped from review as they are similar to previous changes (3)
- ibis-server/app/model/metadata/redshift.py
- ibis-server/tests/routers/v3/connector/redshift/test_query.py
- ibis-server/app/model/connector.py
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (3)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.516Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
🪛 Pylint (3.3.7)
ibis-server/app/util.py
[refactor] 168-186: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 177-182: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 166-166: Too many return statements (7/6)
(R0911)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
ibis-server/app/util.py (1)
65-69: Fix typo that breaks offset-based timezone detection
timezone.startwith("+")raisesAttributeErrorat runtime – the correct method isstartswith.- if timezone.startwith("+") or timezone.startswith("-"): + if timezone.startswith(("+", "-")):
🧹 Nitpick comments (4)
ibis-server/app/util.py (4)
51-52: Spelling/clarity – typo in debug messageSmall nit:
formmated_sql→formatted_sql.- logger.debug(f"formmated_sql: {formatted_sql}") + logger.debug(f"formatted_sql: {formatted_sql}")
61-75: Connection lifecycle – ensure explicit close
get_duckdb_conncreates a new in-memory connection on every call but none of the callers close it.
While GC will eventually collect the object, explicitly closing the connection avoids surprise file-descriptor exhaustion during high-volume requests.Consider:
- conn = get_duckdb_conn(headers) - formatted_df = conn.execute(formatted_sql).fetch_df() + with get_duckdb_conn(headers) as conn: + formatted_df = conn.execute(formatted_sql).fetch_df()
154-163: Reduce repetition when copying cache headersThe four nearly-identical
ifblocks can be condensed, improving maintainability:-def update_response_headers(response, required_headers: dict): - if X_CACHE_HIT in required_headers: - response.headers[X_CACHE_HIT] = required_headers[X_CACHE_HIT] - if X_CACHE_CREATE_AT in required_headers: - response.headers[X_CACHE_CREATE_AT] = required_headers[X_CACHE_CREATE_AT] - if X_CACHE_OVERRIDE in required_headers: - response.headers[X_CACHE_OVERRIDE] = required_headers[X_CACHE_OVERRIDE] - if X_CACHE_OVERRIDE_AT in required_headers: - response.headers[X_CACHE_OVERRIDE_AT] = required_headers[X_CACHE_OVERRIDE_AT] +def update_response_headers(response, required_headers: dict): + for hdr in ( + X_CACHE_HIT, + X_CACHE_CREATE_AT, + X_CACHE_OVERRIDE, + X_CACHE_OVERRIDE_AT, + ): + if hdr in required_headers: + response.headers[hdr] = required_headers[hdr]
165-186: _formater: simplify branching & silence linter warningsPylint flags “unnecessary elif after return” and “too many return statements”.
A compact dict-based dispatch avoids both issues and is easier to extend.-def _formater(field: pa.Field) -> str: - column_name = _quote_identifier(field.name) - if pa.types.is_decimal(field.type) or pa.types.is_floating(field.type): - return f""" - case when {column_name} = 0 then '0' - when length(CAST({column_name} AS VARCHAR)) > 15 then format('{{:.9g}}', {column_name}) - else RTRIM(RTRIM(format('{{:.8f}}', {column_name}), '0'), '.') - end as {column_name}""" - elif pa.types.is_date(field.type): - return f"strftime({column_name}, '%Y-%m-%d') as {column_name}" - elif pa.types.is_timestamp(field.type): - if field.type.tz is None: - return f"strftime({column_name}, '%Y-%m-%d %H:%M:%S.%f') as {column_name}" - else: - return ( - f"strftime({column_name}, '%Y-%m-%d %H:%M:%S.%f %Z') as {column_name}" - ) - elif pa.types.is_binary(field.type): - return f"to_hex({column_name}) as {column_name}" - elif pa.types.is_interval(field.type): - return f"cast({column_name} as varchar) as {column_name}" - return column_name +def _formater(field: pa.Field) -> str: + col = _quote_identifier(field.name) + + if pa.types.is_decimal(field.type) or pa.types.is_floating(field.type): + return ( + f"case when {col} = 0 then '0' " + f"when length(CAST({col} AS VARCHAR)) > 15 " + f"then format('{{:.9g}}', {col}) " + f"else RTRIM(RTRIM(format('{{:.8f}}', {col}), '0'), '.') end as {col}" + ) + + if pa.types.is_date(field.type): + return f"strftime({col}, '%Y-%m-%d') as {col}" + + if pa.types.is_timestamp(field.type): + fmt = '%Y-%m-%d %H:%M:%S.%f%Z' if field.type.tz else '%Y-%m-%d %H:%M:%S.%f' + return f"strftime({col}, '{fmt}') as {col}" + + if pa.types.is_binary(field.type): + return f"to_hex({col}) as {col}" + + if pa.types.is_interval(field.type): + return f"cast({col} as varchar) as {col}" + + return col
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ibis-server/app/util.py(4 hunks)
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (3)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.516Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
🪛 Pylint (3.3.7)
ibis-server/app/util.py
[refactor] 167-185: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 176-181: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 165-165: Too many return statements (7/6)
(R0911)
🪛 GitHub Actions: ibis CI
ibis-server/app/util.py
[error] 1-1: Ruff formatting check failed. File would be reformatted. Run 'ruff --fix' to fix code style issues.
🔇 Additional comments (1)
ibis-server/app/util.py (1)
1-191: CI failure – run Ruff auto-formatterThe GitHub Action reports: “Ruff formatting check failed. File would be reformatted.”
Runruff --fix ibis-server/app/util.pyto pass the style gate.
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
ibis-server/app/util.py (1)
68-74: Fix fatal typostartwith→startswithin timezone handling
strhas no attributestartwith; the first call will raise anAttributeError, breaking every request that supplies a+HH:MMoffset.- if timezone.startwith("+") or timezone.startswith("-"): + if timezone.startswith(("+", "-")):Please patch immediately – nothing downstream will run until this is corrected.
🧹 Nitpick comments (3)
ibis-server/app/util.py (3)
55-55: Nit: misspelled log labelformmated_sqlMinor, but typos in log messages hinder grep-based troubleshooting.
- logger.debug(f"formmated_sql: {formatted_sql}") + logger.debug(f"formatted_sql: {formatted_sql}")
169-190: Simplify_formaterbranches to appease static-analysis and readabilityFive
elifchains after an earlyreturntrigger pylint R1705 and make the function longer than required. De-indent the branches once a match is found:@@ - elif pa.types.is_date(field.type): + if pa.types.is_date(field.type): @@ - elif pa.types.is_timestamp(field.type): + elif pa.types.is_timestamp(field.type): @@ - elif pa.types.is_binary(field.type): + elif pa.types.is_binary(field.type): @@ - elif pa.types.is_interval(field.type): + elif pa.types.is_interval(field.type):(or switch to a
match/casewhen Python ≥3.10)
This removes superfluouseliflevels, satisfying linters and future readers.
145-155: Potential dtype loss when converting Pandas → ArrowFallback to
pa.string()for unsupported dtypes silently degrades information (e.g.,category,Int64,datetime64[ns, tz]). Consider usingpyarrow’spandas_compatto infer richer types:- # Fallback to string type for unsupported dtypes - pa_type = pa.string() + # Let Arrow attempt a best-effort mapping first + try: + pa_type = pa.from_numpy_dtype(dtype) + except (TypeError, NotImplementedError): + pa_type = pa.string()This preserves more semantic information without much extra code.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ibis-server/app/util.py(4 hunks)
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (3)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.516Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
🪛 Pylint (3.3.7)
ibis-server/app/util.py
[refactor] 171-189: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 180-185: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 169-169: Too many return statements (7/6)
(R0911)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
cc6b2b7 to
76c5114
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (2)
ibis-server/app/query_cache/__init__.py (1)
52-58: Inefficient file handling with unused handle.You open a file handle with
op.open(..., "wb")but don't use it, then write via DuckDB tofull_path. This creates an unnecessary file handle allocation and potential for a stray empty file if the DuckDB write fails.Consider writing the Arrow table directly to Parquet:
- # Create cache directory if it doesn't exist - with op.open(cache_file_name, mode="wb") as file: - con = self._get_duckdb_connection() - arrow_table = con.from_arrow(result) - if file.writable(): - arrow_table.write_parquet(full_path) + # Write Arrow table directly to Parquet + with op.open(cache_file_name, mode="wb") as sink: + pq.write_table(result, sink)This requires adding
import pyarrow.parquet as pqand removes the DuckDB dependency for cache writing.ibis-server/app/util.py (1)
70-70: Fix typo in method name.There's a typo in the method name
startwithwhich should bestartswith. This will cause an AttributeError at runtime.- if timezone.startwith("+") or timezone.startswith("-"): + if timezone.startswith(("+", "-")):
🧹 Nitpick comments (1)
ibis-server/tools/query_local_run.py (1)
21-21: Remove unused pandas import.The pandas import is no longer needed since the code has migrated to PyArrow.
-import pandas as pd
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
ibis-server/poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (35)
ibis-server/app/custom_ibis/backends/sql/datatypes.py(1 hunks)ibis-server/app/dependencies.py(1 hunks)ibis-server/app/model/connector.py(7 hunks)ibis-server/app/model/metadata/object_storage.py(1 hunks)ibis-server/app/model/metadata/redshift.py(2 hunks)ibis-server/app/model/validator.py(1 hunks)ibis-server/app/query_cache/__init__.py(4 hunks)ibis-server/app/routers/v2/connector.py(5 hunks)ibis-server/app/routers/v3/connector.py(6 hunks)ibis-server/app/util.py(4 hunks)ibis-server/pyproject.toml(1 hunks)ibis-server/tests/routers/v2/connector/test_athena.py(4 hunks)ibis-server/tests/routers/v2/connector/test_bigquery.py(10 hunks)ibis-server/tests/routers/v2/connector/test_canner.py(2 hunks)ibis-server/tests/routers/v2/connector/test_clickhouse.py(5 hunks)ibis-server/tests/routers/v2/connector/test_gcs_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_local_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_minio_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_mssql.py(2 hunks)ibis-server/tests/routers/v2/connector/test_mysql.py(2 hunks)ibis-server/tests/routers/v2/connector/test_oracle.py(3 hunks)ibis-server/tests/routers/v2/connector/test_postgres.py(3 hunks)ibis-server/tests/routers/v2/connector/test_redshift.py(1 hunks)ibis-server/tests/routers/v2/connector/test_s3_file.py(2 hunks)ibis-server/tests/routers/v2/connector/test_snowflake.py(1 hunks)ibis-server/tests/routers/v2/connector/test_trino.py(2 hunks)ibis-server/tests/routers/v3/connector/athena/test_query.py(2 hunks)ibis-server/tests/routers/v3/connector/bigquery/test_functions.py(1 hunks)ibis-server/tests/routers/v3/connector/bigquery/test_query.py(5 hunks)ibis-server/tests/routers/v3/connector/local_file/test_query.py(2 hunks)ibis-server/tests/routers/v3/connector/oracle/conftest.py(1 hunks)ibis-server/tests/routers/v3/connector/oracle/test_query.py(4 hunks)ibis-server/tests/routers/v3/connector/postgres/test_query.py(8 hunks)ibis-server/tests/routers/v3/connector/redshift/test_query.py(1 hunks)ibis-server/tools/query_local_run.py(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (31)
- ibis-server/app/model/metadata/object_storage.py
- ibis-server/tests/routers/v3/connector/bigquery/test_functions.py
- ibis-server/app/model/validator.py
- ibis-server/app/model/metadata/redshift.py
- ibis-server/app/custom_ibis/backends/sql/datatypes.py
- ibis-server/pyproject.toml
- ibis-server/tests/routers/v2/connector/test_redshift.py
- ibis-server/tests/routers/v3/connector/bigquery/test_query.py
- ibis-server/tests/routers/v3/connector/oracle/conftest.py
- ibis-server/app/dependencies.py
- ibis-server/tests/routers/v3/connector/redshift/test_query.py
- ibis-server/tests/routers/v2/connector/test_gcs_file.py
- ibis-server/tests/routers/v2/connector/test_trino.py
- ibis-server/tests/routers/v2/connector/test_canner.py
- ibis-server/app/routers/v3/connector.py
- ibis-server/tests/routers/v3/connector/local_file/test_query.py
- ibis-server/tests/routers/v2/connector/test_mysql.py
- ibis-server/tests/routers/v2/connector/test_minio_file.py
- ibis-server/app/routers/v2/connector.py
- ibis-server/tests/routers/v2/connector/test_mssql.py
- ibis-server/tests/routers/v2/connector/test_local_file.py
- ibis-server/tests/routers/v2/connector/test_athena.py
- ibis-server/tests/routers/v2/connector/test_s3_file.py
- ibis-server/tests/routers/v2/connector/test_oracle.py
- ibis-server/tests/routers/v3/connector/athena/test_query.py
- ibis-server/tests/routers/v2/connector/test_snowflake.py
- ibis-server/tests/routers/v2/connector/test_clickhouse.py
- ibis-server/tests/routers/v2/connector/test_bigquery.py
- ibis-server/tests/routers/v2/connector/test_postgres.py
- ibis-server/tests/routers/v3/connector/oracle/test_query.py
- ibis-server/app/model/connector.py
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (3)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.570Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
🪛 Pylint (3.3.7)
ibis-server/app/util.py
[refactor] 171-189: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 180-185: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 169-169: Too many return statements (7/6)
(R0911)
ibis-server/app/query_cache/__init__.py
[refactor] 40-40: Useless return at end of function or method
(R1711)
ibis-server/tools/query_local_run.py
[refactor] 101-101: Consider using '{}' instead of a call to 'dict'.
(R1735)
🪛 Ruff (0.11.9)
ibis-server/tools/query_local_run.py
21-21: pandas imported but unused
(F401)
98-98: print found
(T201)
99-99: print found
(T201)
100-100: print found
(T201)
102-102: print found
(T201)
103-103: print found
(T201)
104-104: print found
(T201)
🔇 Additional comments (8)
ibis-server/tools/query_local_run.py (2)
91-93: Oracle support implementation looks good.The Oracle connection setup follows the same pattern as other data sources with proper validation using
OracleConnectionInfo.
97-104: PyArrow migration implemented correctly.The migration from
.to_pandas()to.to_pyarrow()aligns with the PR objectives and the updatedto_jsonfunction correctly handles PyArrow Tables.ibis-server/tests/routers/v3/connector/postgres/test_query.py (3)
145-145: Date format simplified correctly.The change from timestamp format to date-only format
"1996-01-02"aligns with the PyArrow date handling and is more appropriate for date columns.
155-162: Data type specifications updated to Arrow types.The updated dtypes reflect more precise Arrow type information (e.g.,
"string","double","date32[day]","timestamp[us, tz=UTC]"), which provides better type fidelity in the PyArrow migration.
591-667: Comprehensive floating point formatting test.This test covers extensive floating point scenarios including scientific notation, decimals, special values (NaN, Infinity), and precision handling. It's a valuable addition for validating the new DuckDB-based formatting logic.
ibis-server/app/util.py (3)
43-62: PyArrow-based JSON serialization implemented well.The new
to_jsonfunction correctly handles PyArrow Tables with proper empty table handling, DuckDB-based formatting, and maintains the requiredorjsonserialization format. The use of DuckDB's automatic PyArrow Table recognition is elegant.
65-78: DuckDB connection with timezone support.The timezone handling logic properly supports both named timezones and offset formats, with appropriate fallback to UTC. The parameterized query approach prevents SQL injection.
169-190: Comprehensive type-specific formatting logic.The
_formaterfunction handles various Arrow types appropriately:
- Decimal/float formatting with scientific notation for large numbers
- Proper date and timestamp formatting with timezone awareness
- Binary and interval type handling
The logic is comprehensive and handles edge cases well.
|
Thanks @goldmedal for the hard working 👍 |
Description
Known Issue
LOBtype forto_pyarrow()andto_polars()ibis-project/ibis#11350Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Tests
Chores