
feat(ibis): handle query result using Arrow Type#1224

Merged
douenergy merged 26 commits into Canner:main from goldmedal:feat/use-arrow-type
Jun 23, 2025

Conversation

@goldmedal
Contributor

@goldmedal goldmedal commented Jun 17, 2025

Description

  • Use PyArrow to handle the query result.
  • The query result type is now ArrowType.
  • Format the query result using DuckDB instead of Pandas.
  • Store the cached result using DuckDB instead of Ibis.

Known Issue

Summary by CodeRabbit

  • New Features

    • Added support for additional HTTP headers to enhance query response metadata.
  • Refactor

    • Query results now use Apache Arrow Tables for improved performance and consistency.
    • Enhanced JSON serialization leverages DuckDB and PyArrow, providing more accurate data formatting and timezone handling.
    • Centralized and streamlined cache-related HTTP header management in query endpoints.
  • Bug Fixes

    • Improved handling of interval precision for BigQuery to prevent casting errors.
    • Corrected log messages for Google Cloud Storage initialization.
  • Tests

    • Updated test suites to expect more precise data types and simplified date formats in query results.
    • Improved cache testing by introducing dynamic queries and refining type assertions.
    • Added and updated tests for floating-point and decimal formatting.
  • Chores

    • Improved type conversions and data handling in connectors and metadata retrieval.
    • Enforced explicit column types in Oracle test fixtures to avoid unsupported CLOB usage.
    • Updated validation and metadata methods to accommodate new data formats.

@coderabbitai
Contributor

coderabbitai bot commented Jun 17, 2025

Walkthrough

This update transitions the query result handling across the codebase from pandas DataFrames to Apache Arrow Tables, revises JSON serialization to leverage DuckDB SQL formatting, and refines data type and value expectations throughout the test suite. It also introduces new HTTP header constants, updates cache management logic, and enhances connector and cache-related response header handling.

Changes

Files / Area Change Summary
app/model/connector.py Query methods in all connector classes now return Arrow Tables (pa.Table) instead of pandas DataFrames; method signatures and query logic updated accordingly.
app/query_cache/__init__.py Query cache now reads/writes PyArrow Tables using DuckDB for parquet operations; method signatures and logic updated, including a new DuckDB connection helper.
app/util.py Rewrote to_json to use DuckDB SQL on Arrow Tables; added utilities for schema conversion, timezone handling, header updates, and SQL formatting expressions.
app/routers/v2/connector.py, app/routers/v3/connector.py Refactored cache-related HTTP response header handling to use a centralized dictionary and utility function for header updates, consolidating response creation and header assignment.
app/model/metadata/object_storage.py Corrected debug log message to accurately reflect initialization of DuckDB with GCS.
app/model/metadata/redshift.py Changed query result conversion from .to_dict(orient="records") to .to_pylist() in metadata retrieval methods.
app/model/validator.py Query result in relationship validation now explicitly converted to pandas DataFrame after Arrow Table query.
app/dependencies.py Added new HTTP header constants for timezone and cache control.
app/custom_ibis/backends/sql/datatypes.py Changed default interval precision for BigQueryType from seconds to hours, with explanatory comments.
tools/query_local_run.py Added Oracle support, switched query result handling to Arrow Tables, and output JSON using new serialization utility.
tests/routers/v2/connector/test_*.py, tests/routers/v3/connector/*/test_*.py Updated test assertions for query results: refined expected data types (e.g., from "object" to precise Arrow types), changed date/time string formats, improved cache tests, and removed or added floating-point formatting tests as appropriate.
tests/routers/v3/connector/oracle/conftest.py Explicitly defined SQLAlchemy column types in Oracle fixture to prevent CLOB columns and ensure correct table creation in tests.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API
    participant Connector
    participant QueryCache
    participant DuckDB
    participant Response

    Client->>API: Send query request
    API->>QueryCache: Check for cached result
    alt Cache Hit
        QueryCache->>API: Return Arrow Table
        API->>Response: Serialize Arrow Table to JSON via DuckDB
    else Cache Miss
        API->>Connector: Execute query
        Connector->>API: Return Arrow Table
        API->>QueryCache: Store Arrow Table in cache
        API->>Response: Serialize Arrow Table to JSON via DuckDB
    end
    API->>Client: Return JSON response with cache headers

Suggested labels

core

Suggested reviewers

  • douenergy

Poem

A hop, a leap, a bounding cheer—
Arrow Tables now appear!
Pandas waved a fond goodbye,
DuckDB helps data fly.
Types are crisp, the tests refined,
Cache and headers well-aligned.
With every hop, our code grows bright—
🐇✨ Data handled just right!


@github-actions github-actions bot added bigquery ibis python Pull requests that update Python code labels Jun 17, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🔭 Outside diff range comments (5)
ibis-server/tests/routers/v3/connector/oracle/conftest.py (1)

69-75: Large CLOB insertion may break Arrow conversion

The comment in the v2 fixture said “ibis to pyarrow conversion does not support CLOBs yet”, but here the CLOB table is actually created and populated with a 2 MB value. If Arrow support is still missing, subsequent schema introspection or query calls could raise at runtime.

If the table is only for future work, consider skipping it for now:

- large_text = "x" * (1024 * 1024 * 2)  # 2MB
- conn.execute(text("CREATE TABLE test_lob (id NUMBER, content CLOB)"))
- conn.execute(
-     text("INSERT INTO test_lob VALUES (1, :content)"), {"content": large_text}
- )
+ # TODO: enable once ibis→pyarrow supports CLOB
ibis-server/app/routers/v3/connector.py (1)

106-118: X-Cache-Create-At missing when a brand-new cache file is created

In the cache-miss branch (True, False, _) you persist the query result, but you never populate X_CACHE_CREATE_AT.
Down-stream clients – and your own tests – rely on this header to know when the cache was materialised. They currently receive the header only on a hit or an override, which is counter-intuitive.

case (True, False, _):
    query_cache_manager.set(
        data_source, dto.sql, result, dto.connection_info
    )
+   cache_headers[X_CACHE_CREATE_AT] = str(
+       query_cache_manager.get_cache_file_timestamp(
+           data_source, dto.sql, dto.connection_info
+       )
+   )
ibis-server/tests/routers/v2/connector/test_bigquery.py (1)

142-145: Hard-coded Unix timestamp will break after 2025-04-07

assert int(response2.headers["X-Cache-Create-At"]) > 1743984000  # 2025.04.07

This assertion will start failing once we pass that date. Use a dynamic bound instead:

create_at = int(response2.headers["X-Cache-Create-At"])
assert create_at >= int(time.time()) - 5    # allow small skew

Keeps the test future-proof.

ibis-server/app/model/connector.py (2)

113-118: Return-type annotation is stale

CannerConnector.query now returns an Arrow pa.Table, but the annotation still says pd.DataFrame, which will mislead users and static type-checkers.

-    def query(self, sql: str, limit: int) -> pd.DataFrame:
+    def query(self, sql: str, limit: int) -> pa.Table:

145-181: Inconsistent return type on empty-result workaround

When the BigQuery path hits the "Must pass schema" branch, it returns
pd.DataFrame(...), contradicting the declared -> pa.Table and the rest
of the pipeline, which now assumes Arrow.
Down-stream utilities (e.g. to_json) will raise.

Suggested fix:

-                    ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema)
-                    return pd.DataFrame(columns=ibis_fields.names)
+                    ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema)
+                    empty_tbl = pa.Table.from_arrays(
+                        [pa.array([], type=typ.to_pyarrow()) for typ in ibis_fields.types],
+                        names=ibis_fields.names,
+                    )
+                    return empty_tbl

You can then drop the import pandas as pd at the top.

♻️ Duplicate comments (3)
ibis-server/tests/routers/v2/connector/test_gcs_file.py (1)

119-129: GCS connector mirrors the decimal/type drift observed in local_file

The same mismatch (float in manifest vs decimal128 in response) appears here.
Prefer fixing in one shared serialisation path instead of per-connector hacks.

ibis-server/tests/routers/v2/connector/test_s3_file.py (2)

119-129: Same schema divergence (decimal128 vs float) seen for S3 connector

Please consolidate the typing strategy across all file connectors.


165-167: Aggregate precision upgrade – ensure consistency with analytics layer

See earlier note on sum_totalprice precision jump.

🧹 Nitpick comments (16)
ibis-server/app/dependencies.py (1)

9-13: Header-constant naming diverges from existing pattern

Earlier constants embed the literal header value in lower-case ("x-wren-*") while new ones use mixed-case ("X-Cache-*"). HTTP headers are case-insensitive, but mixing styles can be confusing and increases the risk of subtle bugs during look-ups or tests that use exact strings.

Consider normalising the literals to one style (preferably lower-case to stay consistent with the pre-existing x-wren-* values) or, at minimum, documenting the rationale for the different casing.
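A normalisation sketch (constant names are illustrative, not the module's actual values): since HTTP header names are case-insensitive, lower-casing at the lookup boundary removes the risk regardless of which literal style wins:

```python
# Illustrative constants in one consistent lower-case style,
# matching the pre-existing "x-wren-*" values.
X_WREN_TIMEZONE = "x-wren-timezone"
X_CACHE_HIT = "x-cache-hit"
X_CACHE_CREATE_AT = "x-cache-create-at"

# Normalise incoming header names before any exact-string lookup:
raw_headers = {"X-Cache-Hit": "true", "X-Wren-Timezone": "UTC"}
headers = {k.lower(): v for k, v in raw_headers.items()}
print(headers[X_CACHE_HIT], headers[X_WREN_TIMEZONE])
```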

ibis-server/tools/query_local_run.py (2)

17-22: Remove unused pandas import

pandas is no longer referenced after switching to to_pyarrow(), and Ruff flags this (F401). Delete the import to avoid needless dependency loading.

-from app.model import MySqlConnectionInfo, OracleConnectionInfo, PostgresConnectionInfo
-from app.util import to_json
-import sqlglot
-import sys
-import pandas as pd
+from app.model import MySqlConnectionInfo, OracleConnectionInfo, PostgresConnectionInfo
+from app.util import to_json
+import sqlglot
+import sys

97-104: Minor clean-ups & header handling

  1. `{}` is shorter than `dict()` and matches the rest of the file.
  2. to_json currently receives a plain dict for headers; its signature elsewhere expects a starlette.datastructures.Headers (or similar) object. Pass an actual Headers({}) or adapt the function to accept dict explicitly to prevent runtime type mismatches.
-json_str = to_json(df, dict())
+from starlette.datastructures import Headers
+
+json_str = to_json(df, Headers({}))
ibis-server/app/custom_ibis/backends/sql/datatypes.py (1)

5-8: Nit: typo in comment and link formatting

"ibs BQ connector" → "ibis BigQuery connector" for clarity. Also consider shortening the link with an inline comment to keep the file compact.

-    # It's a workaround for the issue of ibs BQ connector not supporting interval precision.
+    # Work-around: the ibis BigQuery connector ignores interval precision.
ibis-server/tests/routers/v2/connector/test_trino.py (1)

115-133: Payload / dtype mismatch for several numeric & timestamp columns

totalprice, timestamp, timestamptz, and test_null_time are sent as strings but typed as double / timestamp[…]. Decide on one convention and apply it across tests.

If string serialisation stays, consider updating dtypes to string or, better, serialise true numbers/timestamps (ISO 8601) and keep the precise Arrow dtypes.
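If the convention lands on true JSON numbers plus ISO 8601 timestamps, the payload could look like this (a hedged sketch; the field names mirror the test columns, not a fixed contract):

```python
import json
from datetime import datetime, timezone

row = {
    "totalprice": 172799.49,  # a real JSON number, not the string "172799.49"
    "timestamptz": datetime(2024, 1, 2, tzinfo=timezone.utc).isoformat(),
}
payload = json.dumps(row)
print(payload)
```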

ibis-server/tests/routers/v2/connector/test_canner.py (1)

98-114: Consistency: numeric & timestamp values serialised as strings

Yet again, totalprice and all timestamps are stringified while dtypes claim numeric/timestamp Arrow types. A helper that converts an Arrow table to the expected JSON (respecting Arrow type → JSON mapping) would prevent these per-file divergences.

ibis-server/app/routers/v3/connector.py (1)

161-163: Cache-headers assignment duplicated across v2 & v3 routers

The same (non-trivial) block that derives cache_headers now lives in two routers. Any future tweak (e.g. adding a new header) must be done twice and risks drifting.

Consider extracting a small helper, e.g. build_cache_headers(cache_enable, cache_hit, override_cache, cache_manager, …) inside app.util, returning both the Arrow table (if it needs to write) and the header dict.
That keeps router code focused on HTTP orchestration.

ibis-server/app/routers/v2/connector.py (1)

175-177: Coupling response creation to header injection

update_response_headers mutates the response post-creation. A safer & cheaper pattern is to build the full header dict first and pass it to ORJSONResponse via its headers= param – that avoids a second pass over the headers and keeps them immutable after instantiation.

Not blocking, but worth considering.

ibis-server/tests/routers/v2/connector/test_minio_file.py (1)

175-185: Minor: decimal values are compared as strings

"172799.49" is asserted as a string because to_json formats decimal128 via DuckDB’s CAST(value AS VARCHAR).
This is fine, but keep in mind you lose numeric ordering in client code. If consumers need arithmetic, consider returning numbers and letting the JSON encoder handle precision, or add a follow-up test to ensure ordering works client-side.

ibis-server/app/query_cache/__init__.py (2)

29-33: Log message & variable naming are misleading

ibis.read_parquet(...).to_pyarrow() returns a pyarrow.Table, not a “dataframe”.
Keeping the wording consistent will avoid confusion when debugging.

-logger.info("query cache to dataframe")
+logger.info("query cache loaded as Arrow Table")

113-115: Redundant helper after refactor

If the above direct-write refactor is applied, _get_duckdb_connection() (and the DuckDB import) become dead code and should be removed to avoid unused-code drift.

ibis-server/tests/routers/v2/connector/test_postgres.py (1)

238-239: Time-based assertion is brittle

assert int(response2.headers["X-Cache-Create-At"]) > 1743984000 embeds a hard-coded epoch (≈ 2025-04-07 UTC).
When this date becomes “too far in the past” the assertion will continue to pass, but it gives no guarantee that the header represents now.
Safer pattern:

import time
assert abs(int(response2.headers["X-Cache-Create-At"]) - int(time.time())) < 60

…which confirms the timestamp is within a reasonable window of the test run.

ibis-server/app/model/connector.py (1)

199-202: Minor: slice() ignores negative / None limits

pa.Table.slice(length=limit) will happily accept negative or
None, yielding unexpected results.
Consider guarding with:

if limit is not None and limit >= 0:
    return tbl.slice(length=limit)
return tbl
ibis-server/app/util.py (1)

169-190: _formater complexity & style nits

The huge CASE string embeds newlines and uses RTRIM chains; readability
suffers and pylint rightly flags the many returns / redundant elif.

Not urgent, but extracting discrete helpers (e.g. _format_numeric,
_format_datetime) would improve maintainability and silence R0911 /
R1705.
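The decomposition could look roughly like this (the helper names come from the suggestion above; the SQL fragments are illustrative, not the actual expressions in `app/util.py`):

```python
def _format_numeric(col: str) -> str:
    # Trim trailing zeros, then any trailing decimal point
    return f"RTRIM(RTRIM(CAST({col} AS VARCHAR), '0'), '.')"


def _format_datetime(col: str) -> str:
    return f"STRFTIME({col}, '%Y-%m-%d %H:%M:%S.%f')"


def format_expr(col: str, dtype: str) -> str:
    # Single dispatch point replacing the monolithic CASE/elif chain
    if dtype in ("float", "double") or dtype.startswith("decimal"):
        return _format_numeric(col)
    if dtype.startswith("timestamp"):
        return _format_datetime(col)
    return f"CAST({col} AS VARCHAR)"


print(format_expr("price", "double"))
```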

ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

155-162: Confirm dtype expectations

The dtype strings now use Arrow notation ("string", "double",
"date32[day]", …). Ensure to_json always produces these exact tokens
for every backend; a mismatch will cascade across many tests.

ibis-server/tests/routers/v3/connector/oracle/test_query.py (1)

204-210: Nit: avoid shadowing – manifest_str here is no longer a string

Inside test_query_number_scale the variable named manifest_str now holds a dict, and later manifest_str_base64 is the actual base-64 string.
Renaming the first one to manifest_dict (or similar) would make the intent clearer and prevent accidental misuse.

-    manifest_str = {
+    manifest_dict = {
...
-    manifest_str_base64 = base64.b64encode(orjson.dumps(manifest_str)).decode("utf-8")
+    manifest_str_base64 = base64.b64encode(orjson.dumps(manifest_dict)).decode("utf-8")
...
-    assert len(result["columns"]) == len(manifest_str["models"][0]["columns"])
+    assert len(result["columns"]) == len(manifest_dict["models"][0]["columns"])
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fa94acd and 12b55a7.

📒 Files selected for processing (28)
  • ibis-server/app/custom_ibis/backends/sql/datatypes.py (1 hunks)
  • ibis-server/app/dependencies.py (1 hunks)
  • ibis-server/app/model/connector.py (6 hunks)
  • ibis-server/app/model/metadata/object_storage.py (1 hunks)
  • ibis-server/app/query_cache/__init__.py (4 hunks)
  • ibis-server/app/routers/v2/connector.py (5 hunks)
  • ibis-server/app/routers/v3/connector.py (6 hunks)
  • ibis-server/app/util.py (4 hunks)
  • ibis-server/tests/routers/v2/connector/test_bigquery.py (10 hunks)
  • ibis-server/tests/routers/v2/connector/test_canner.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_clickhouse.py (5 hunks)
  • ibis-server/tests/routers/v2/connector/test_gcs_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_local_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_minio_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_mssql.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_mysql.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_oracle.py (4 hunks)
  • ibis-server/tests/routers/v2/connector/test_postgres.py (3 hunks)
  • ibis-server/tests/routers/v2/connector/test_s3_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_snowflake.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_trino.py (2 hunks)
  • ibis-server/tests/routers/v3/connector/bigquery/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/bigquery/test_query.py (5 hunks)
  • ibis-server/tests/routers/v3/connector/local_file/test_query.py (2 hunks)
  • ibis-server/tests/routers/v3/connector/oracle/conftest.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/oracle/test_query.py (4 hunks)
  • ibis-server/tests/routers/v3/connector/postgres/test_query.py (8 hunks)
  • ibis-server/tools/query_local_run.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
ibis-server/tests/routers/v3/connector/oracle/conftest.py (2)
ibis-server/tests/routers/v2/connector/test_oracle.py (1)
  • oracle (121-168)
ibis-server/tests/conftest.py (1)
  • file_path (10-11)
ibis-server/app/routers/v3/connector.py (1)
ibis-server/app/util.py (2)
  • update_response_headers (158-166)
  • to_json (43-59)
ibis-server/tests/routers/v2/connector/test_oracle.py (1)
ibis-server/tests/conftest.py (1)
  • file_path (10-11)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (2)
ibis-server/tests/conftest.py (1)
  • client (18-23)
ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)
  • connection_info (37-44)
🪛 Ruff (0.11.9)
ibis-server/tools/query_local_run.py

21-21: pandas imported but unused

(F401)


98-98: print found

(T201)


99-99: print found

(T201)


100-100: print found

(T201)


102-102: print found

(T201)


103-103: print found

(T201)


104-104: print found

(T201)

🪛 Pylint (3.3.7)
ibis-server/tools/query_local_run.py

[refactor] 101-101: Consider using '{}' instead of a call to 'dict'.

(R1735)

ibis-server/app/query_cache/__init__.py

[refactor] 40-40: Useless return at end of function or method

(R1711)

ibis-server/app/util.py

[refactor] 171-189: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 180-185: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 169-169: Too many return statements (7/6)

(R0911)

🔇 Additional comments (19)
ibis-server/app/dependencies.py (1)

`32-45`: Verification script:

```shell
#!/bin/bash
set -e

echo "🔍 Displaying lines 1-200 of ibis-server/app/dependencies.py for context:"
sed -n '1,200p' ibis-server/app/dependencies.py || true

echo
echo "🔍 Displaying get_wren_headers definition in ibis-server/app/dependencies.py:"
rg -n "def get_wren_headers" -C5 ibis-server/app/dependencies.py || true

echo
echo "🔍 Searching for 'cache' in ibis-server/app:"
rg -n "cache" -C2 ibis-server/app || true

echo
echo "🔍 Searching for 'Cache' in ibis-server/app:"
rg -n "Cache" -C2 ibis-server/app || true
```

ibis-server/app/model/metadata/object_storage.py (1)

`254-257`: **LGTM – message fix only**

The log message now correctly says “duckdb gcs”, which removes a misleading leftover from Minio.

ibis-server/tests/routers/v3/connector/bigquery/test_functions.py (1)

`100-102`: **Expectation update matches Arrow migration**

The new expected payload (`"2024-01-02"` with `date32[day]`) aligns with the Arrow-based JSON serialisation. Looks good.

ibis-server/tests/routers/v3/connector/local_file/test_query.py (2)

`148-149`: **Double-check calculated-field precision**

`sum_totalprice` now expects Arrow `double`, but the payload remains a string (`"2860895.79"`). Same comment as above: decide whether the API guarantees numeric JSON values or stringified numbers.

---

`96-105`: Verification script:

```shell
#!/bin/bash
echo "Occurrences of the string literal \"172799.49\":"
rg -n '"172799.49"' -C1 ibis-server/tests/routers/v3/connector

echo -e "\nOccurrences of the numeric literal 172799.49 (no quotes):"
rg -n '172799\.49,' -C1 ibis-server/tests/routers/v3/connector
```

ibis-server/tests/routers/v3/connector/oracle/conftest.py (1)

32-52: 👍 Explicit dtype mapping avoids implicit CLOBs

Good call adding explicit VARCHAR / DECIMAL mappings – this prevents Oracle defaulting to CLOB for text columns.

ibis-server/tests/routers/v2/connector/test_mssql.py (1)

`129-147`: Verification script:

```shell
#!/bin/bash
# Find all occurrences of totalprice dtype declarations in connector tests
rg -n '"totalprice":' -C 1 ibis-server/tests/routers/v2/connector/test_*.py
```

ibis-server/tests/routers/v2/connector/test_mysql.py (1)
`158-169`: **`bytea_column` value looks like raw hex, not binary – ensure encoding matches the API contract**

`"616263"` is the ASCII-hex representation of `abc`, yet the declared dtype is `"binary"`.  
If the backend promises real binary payloads (e.g. base64 or escaped bytes), treating a plain hex string as binary can break downstream consumers.

Consider normalising the JSON serialisation so that:

1. `data` carries either a base64 string or a regular UTF-8 string while  
2. the dtype is `"binary"` only when the JSON value is base64/binary-safe.

Otherwise rename the dtype to `"string"` to avoid mis-hinting.
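To illustrate the difference between the two encodings (plain hex vs base64) for the same three bytes:

```python
import base64
import binascii

raw = b"abc"
hex_repr = raw.hex()                       # "616263" – what the test currently asserts
b64_repr = base64.b64encode(raw).decode()  # "YWJj" – a binary-safe JSON encoding
# Both round-trip, but only base64 is the conventional JSON carrier for binary:
assert binascii.unhexlify(hex_repr) == raw
assert base64.b64decode(b64_repr) == raw
print(hex_repr, b64_repr)
```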

ibis-server/tests/routers/v2/connector/test_local_file.py (1)

`110-120`: **Manifest says `float`, runtime returns `decimal128` – schema drift**

`totalprice` and `orderdate` now come back as `decimal128(15, 2)` and `date32[day]`.  
The manifest, however, still marks `totalprice` as `"float"` and `orderdate` as `"date"`.

Leaving this divergence un-addressed will surface as validation errors once the engine starts
checking manifest ↔︎ result compatibility.

Synchronise the manifest or adjust the DuckDB/Arrow casting layer.

ibis-server/tests/routers/v2/connector/test_gcs_file.py (1)

`164-167`: **High-precision aggregate promoted to `decimal128(38,2)` – verify client compatibility**

Large–scale decimals may not round-trip cleanly to JS/JSON consumers.  
Confirm that the public API guarantees stringified decimals and document it.

ibis-server/tests/routers/v2/connector/test_snowflake.py (1)

`93-109`: **Snowflake: `totalprice` now a string, dates as `date32[day]`; confirm cross-connector parity**

Snowflake diverges from file connectors (decimal128) and from the manifest (`float`).  
Before these expectations ossify in tests, decide on a canonical logical type for money values
and propagate it uniformly.

Also note that `timestamptz` correctly keeps the `tz=UTC` qualifier – good catch.

ibis-server/tests/routers/v2/connector/test_oracle.py (1)

`198-206`: **Timestamp unit inconsistency across connectors**

Here you expect `"timestamp[ns]"`, while BigQuery tests expect `"timestamp[us]"`.  
If this stems from connector-specific casting, that’s fine, but if it’s accidental you may want to normalise the unit (µs vs ns) to avoid surprising API consumers that mix data sources.

Please double-check the Arrow schema produced by the Oracle connector.

ibis-server/tests/routers/v3/connector/bigquery/test_query.py (1)

`108-111`: **Randomised constant alters column count**

Appending `now` (`SELECT *, {now}`) changes the schema each run, so cache entries for identical SQL executed seconds apart will **never hit**.  
If the intention is *only* to avoid collisions during the test, consider embedding a fixed comment instead:

```python
sql = "/* cache-bypass */ SELECT * FROM orders ORDER BY o_orderkey LIMIT 1"
```

This keeps the schema stable while still producing a unique cache key.

ibis-server/tests/routers/v2/connector/test_clickhouse.py (1)

190-201: Inconsistent null-timestamp dtype

For ClickHouse the nullable timestamp test_null_time is asserted as "string" whereas other back-ends map the same column to "timestamp[us]".
If this is deliberate (e.g. ClickHouse driver limitation) add a short comment explaining the divergence; otherwise the discrepancy may hide a coercion bug.

ibis-server/app/util.py (1)

78-92: Reliance on pg_timezone_names() may fail inside DuckDB

pg_timezone_names() is a Postgres catalog view and is not available
in vanilla DuckDB. duckdb.execute(sql) will therefore error on most
deployments.

Consider shipping an internal mapping of offsets → IANA names or
delegating to Python’s zoneinfo instead of a SQL call.
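A stdlib-only sketch of the offset-handling alternative using `datetime` fixed offsets (the function name is illustrative; a full IANA-name mapping would use `zoneinfo` instead):

```python
from datetime import datetime, timedelta, timezone


def tz_from_offset(offset: str) -> timezone:
    """Parse an offset string like '+08:00' or '-05:30' into a fixed-offset tzinfo."""
    sign = -1 if offset.startswith("-") else 1
    hours, _, minutes = offset.lstrip("+-").partition(":")
    return timezone(sign * timedelta(hours=int(hours), minutes=int(minutes or 0)))


ts = datetime(2025, 6, 17, 12, 0, tzinfo=timezone.utc)
print(ts.astimezone(tz_from_offset("+08:00")).isoformat())
```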

ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

640-662: NaN assertion could be brittle

CAST('NaN' AS FLOAT) is compared against None. Whether DuckDB returns
NULL or the string 'nan' after the new formatter is backend-version
dependent. If DuckDB upgrades its formatting, this test may flip.

Recommend using math.isnan(...) or asserting membership in {None, "nan", "NaN"} to avoid flaky CI.
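A tolerant predicate along those lines (a sketch, not the test suite's actual helper):

```python
import math


def is_nan_like(value) -> bool:
    # Accept the representations different backends/formatters may produce
    if value is None:
        return True
    if isinstance(value, str):
        return value.lower() == "nan"
    return isinstance(value, float) and math.isnan(value)


assert is_nan_like(float("nan"))
assert is_nan_like(None)
assert is_nan_like("NaN")
assert not is_nan_like(0.0)
```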

ibis-server/tests/routers/v3/connector/oracle/test_query.py (3)

178-182: Ensure decimal128(10, 2) matches the actual Oracle column definition

The manifest hard-codes a scale/precision of (10, 2) for id_p_s.
If the underlying column is defined as NUMBER(10, 2) this is perfect.
If it is NUMBER(8, 2) (or unconstrained NUMBER) the driver will happily coerce, but Arrow will report a different decimal128(p, s) at runtime, causing the test to fail.

#!/bin/bash
# Quickly inspect the column definition through information_schema to confirm precision/scale
echo "SELECT data_precision, data_scale FROM all_tab_columns WHERE table_name='TEST_NUMBER' AND column_name='ID_P_S';" | sqlplus -s "${DB_USER}/${DB_PASSWORD}@${DB_SVC}"

92-102: ```shell
#!/bin/bash

Search for any occurrences of TRUNC("O_ORDERDATE") in test files under ibis-server

rg -n 'TRUNC("O_ORDERDATE")' ibis-server/tests


---

`103-113`: ```shell
#!/bin/bash
set -e
# List files in the oracle router directory
echo "Files in ibis-server/app/routers/v3/connector/oracle:"
find ibis-server/app/routers/v3/connector/oracle -type f

# Show the beginning of the query router
QUERY_FILE=$(find ibis-server/app/routers/v3/connector/oracle -type f | grep 'query.py$' || true)
echo "Opening $QUERY_FILE"
if [ -n "$QUERY_FILE" ]; then
  sed -n '1,200p' "$QUERY_FILE"
else
  echo "No query.py found"
fi

@goldmedal goldmedal closed this Jun 17, 2025
@goldmedal goldmedal reopened this Jun 17, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
ibis-server/app/model/validator.py (1)

145-149: Guard against connectors that already return a DataFrame and keep Arrow benefits where possible

self.connector.query(...).to_pandas() assumes every connector now returns a pyarrow.Table.
If any connector still delivers a pandas.DataFrame, this will raise an AttributeError.
A small guard removes that fragility and lets you gradually migrate connectors, while avoiding a needless conversion when a DataFrame is already returned.

-            result = self.connector.query(rewritten_sql, limit=1).to_pandas()
+            raw_result = self.connector.query(rewritten_sql, limit=1)
+            # Convert to pandas only when the object supports it
+            result = raw_result.to_pandas() if hasattr(raw_result, "to_pandas") else raw_result

This also keeps open the possibility of later rewriting the validation logic to operate directly on Arrow Tables (and thus preserve the Arrow performance gains) without changing call-sites again.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 12b55a7 and a27f4ef.

📒 Files selected for processing (1)
  • ibis-server/app/model/validator.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci

@goldmedal
Copy link
Copy Markdown
Contributor Author

There are some CI issues for BigQuery.
The BigQuery tests pass locally:

poetry run pytest -m 'bigquery'
================================================================================================================================= test session starts =================================================================================================================================
platform darwin -- Python 3.11.11, pytest-8.4.0, pluggy-1.5.0
rootdir: /Users/jax/git/wren-engine/ibis-server
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 353 items / 317 deselected / 36 selected                                                                                                                                                                                                                                    

tests/routers/v2/connector/test_bigquery.py ......................                                                                                                                                                                                                              [ 61%]
tests/routers/v3/connector/bigquery/test_functions.py ...                                                                                                                                                                                                                       [ 69%]
tests/routers/v3/connector/bigquery/test_query.py ...........   

@goldmedal goldmedal requested a review from douenergy June 17, 2025 07:47
@goldmedal goldmedal marked this pull request as draft June 17, 2025 07:47
@goldmedal goldmedal force-pushed the feat/use-arrow-type branch from a27f4ef to 8293caa Compare June 17, 2025 10:17
@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Jun 17, 2025
@goldmedal
Copy link
Copy Markdown
Contributor Author

Tested Redshift locally:

poetry run pytest -m 'redshift'
================================================================================================================== test session starts ==================================================================================================================
platform darwin -- Python 3.11.11, pytest-8.4.0, pluggy-1.5.0
rootdir: /Users/jax/git/wren-engine/ibis-server
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 379 items / 353 deselected / 26 selected                                                                                                                                                                                                      

tests/routers/v2/connector/test_redshift.py ...............                                                                                                                                                                                       [ 57%]
tests/routers/v3/connector/redshift/test_functions.py ...                                                                                                                                                                                         [ 69%]
tests/routers/v3/connector/redshift/test_query.py ........   

@goldmedal goldmedal marked this pull request as ready for review June 17, 2025 10:30
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (3)
ibis-server/app/model/metadata/redshift.py (1)

110-112: Same optimisation applies to the constraints query

Duplicate of the comment above – the constraints path hits the same pandas dependency and memory cost. Reuse the Arrow-native conversion here as well.

ibis-server/app/query_cache/__init__.py (1)

52-58: Still opening an unused file handle & unnecessary DuckDB round-trip

The earlier feedback hasn’t been addressed: a binary file handle is opened but not used, and the Arrow → DuckDB → Parquet detour remains. This wastes I/O, allocates an unneeded connection, and may leave an empty file on failure.

See previous diff suggestion (unchanged):

-            with op.open(cache_file_name, mode="wb") as file:
-                con = self._get_duckdb_connection()
-                arrow_table = con.from_arrow(result)
-                if file.writable():
-                    arrow_table.write_parquet(full_path)
+            import pyarrow.parquet as pq
+            pq.write_table(result, full_path)

This also lets you drop _get_duckdb_connection() entirely.

ibis-server/app/util.py (1)

65-71: Critical bug – .startwith typo still present

timezone.startwith("+") should be timezone.startswith("+") (same for the - branch).
The typo raises AttributeError for every offset-based timezone and prevents the header from working at all.

-        if timezone.startwith("+") or timezone.startswith("-"):
+        if timezone.startswith(("+", "-")):
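For reference, `str.startswith` accepts a tuple of prefixes, which is what makes the one-call fix above possible:

```python
# A tuple argument means "starts with any of these prefixes".
for tz in ("+08:00", "-05:30"):
    assert tz.startswith(("+", "-"))

# A named zone matches neither prefix.
assert not "Asia/Taipei".startswith(("+", "-"))
```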
🧹 Nitpick comments (9)
ibis-server/tests/routers/v3/connector/redshift/test_query.py (1)

84-91: timestamp[us] may mask TZ information

pa.timestamp('us') without a timezone is not equivalent to Redshift TIMESTAMPTZ; Arrow uses a distinct timestamp[us, tz=<zone>] logical type for tz-aware values.
If the backend starts returning timezone info (e.g., after a session parameter change), this strict equality will fail.

Either:

  1. Cast explicitly in SQL: CAST('...' AS TIMESTAMP WITHOUT TIME ZONE) vs WITH TIME ZONE; or
  2. Relax the assertion to accept "timestamp[us]" or "timestamp[us, tz=*]".
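Option 2 can be expressed as a small regex check (sketch; the pattern name is illustrative and the strings stand in for the reported Arrow type):

```python
import re

# Accept both naive and tz-aware microsecond timestamps.
TS_PATTERN = re.compile(r"timestamp\[us(, tz=[^\]]+)?\]")

assert TS_PATTERN.fullmatch("timestamp[us]")
assert TS_PATTERN.fullmatch("timestamp[us, tz=+00:00]")
assert TS_PATTERN.fullmatch("timestamp[ns]") is None
```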
ibis-server/tools/query_local_run.py (2)

17-22: Remove unused pandas import

pandas is imported but never referenced after the migration to pyarrow.
Eliminate the import (and optionally add # noqa: F401 if you must keep it for local experiments) to silence Ruff F401 and keep the script tidy.

-from app.util import to_json
-import sqlglot
-import sys
-import pandas as pd
+from app.util import to_json
+import sqlglot
+import sys

100-104: Console prints flagged by Ruff T201 – consider replacing with logger

For a CLI script this is often acceptable, but if you want a clean Ruff run or richer log control switch to loguru / logging.

ibis-server/tests/routers/v2/connector/test_redshift.py (1)

64-69: Track unsupported varbyte column in an issue

Good call to comment the bytea column until ConnectorX supports varbyte.
Consider adding a TODO reference to an issue number to avoid losing the task.

ibis-server/app/util.py (3)

79-91: get_timezone_from_offset silently returns the first match

If multiple time-zones share the same UTC offset (very common) the first row returned is arbitrary and can change between DuckDB versions.
Return all matches or document the behaviour; at minimum log a warning.


158-167: Header propagation helper is verbose

update_response_headers repeats the same pattern four times and will drift when new cache headers arrive.
Use a loop to keep it DRY:

-def update_response_headers(response, required_headers: dict):
-    if X_CACHE_HIT in required_headers:
-        response.headers[X_CACHE_HIT] = required_headers[X_CACHE_HIT]
-    ...
+def update_response_headers(response, required_headers: dict):
+    for key in (X_CACHE_HIT, X_CACHE_CREATE_AT, X_CACHE_OVERRIDE, X_CACHE_OVERRIDE_AT):
+        if key in required_headers:
+            response.headers[key] = required_headers[key]

145-155: Dead code? pd_to_arrow_schema no longer used

The new pipeline is Arrow-native; nothing invokes pd_to_arrow_schema.
Delete it to keep the module lean or add a comment explaining future usage.

ibis-server/tests/routers/v3/connector/postgres/test_query.py (2)

640-662: Flaky assertion – NaN mapped to None

_formater currently converts NaN to SQL NULL, which is marshalled to None.
If the implementation ever changes to produce "nan" (e.g. by adding explicit format() handling) this test will break.
Consider asserting with math.isnan or accepting both representations.

import math
value = result["data"][0][18]
assert value is None or (isinstance(value, float) and math.isnan(value))

591-668: Very large test payload – weigh cost vs. benefit

The test_format_floating SQL literal list is massive (28 literals) and all values are asserted individually.
While exhaustive, it slows the suite and is brittle for insignificant formatting tweaks.

Suggestion: parametrize a smaller representative subset and focus on edge-cases (0, very small/large, NaN/Inf).

[performance]

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a27f4ef and 8293caa.

⛔ Files ignored due to path filters (1)
  • ibis-server/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
  • ibis-server/app/custom_ibis/backends/sql/datatypes.py (1 hunks)
  • ibis-server/app/dependencies.py (1 hunks)
  • ibis-server/app/model/connector.py (7 hunks)
  • ibis-server/app/model/metadata/object_storage.py (1 hunks)
  • ibis-server/app/model/metadata/redshift.py (2 hunks)
  • ibis-server/app/model/validator.py (1 hunks)
  • ibis-server/app/query_cache/__init__.py (4 hunks)
  • ibis-server/app/routers/v2/connector.py (5 hunks)
  • ibis-server/app/routers/v3/connector.py (6 hunks)
  • ibis-server/app/util.py (4 hunks)
  • ibis-server/pyproject.toml (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_athena.py (4 hunks)
  • ibis-server/tests/routers/v2/connector/test_bigquery.py (10 hunks)
  • ibis-server/tests/routers/v2/connector/test_canner.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_clickhouse.py (5 hunks)
  • ibis-server/tests/routers/v2/connector/test_gcs_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_local_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_minio_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_mssql.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_mysql.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_oracle.py (4 hunks)
  • ibis-server/tests/routers/v2/connector/test_postgres.py (3 hunks)
  • ibis-server/tests/routers/v2/connector/test_redshift.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_s3_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_snowflake.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_trino.py (2 hunks)
  • ibis-server/tests/routers/v3/connector/athena/test_query.py (2 hunks)
  • ibis-server/tests/routers/v3/connector/bigquery/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/bigquery/test_query.py (5 hunks)
  • ibis-server/tests/routers/v3/connector/local_file/test_query.py (2 hunks)
  • ibis-server/tests/routers/v3/connector/oracle/conftest.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/oracle/test_query.py (4 hunks)
  • ibis-server/tests/routers/v3/connector/postgres/test_query.py (8 hunks)
  • ibis-server/tests/routers/v3/connector/redshift/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/redshift/test_query.py (1 hunks)
  • ibis-server/tools/query_local_run.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • ibis-server/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (25)
  • ibis-server/app/model/metadata/object_storage.py
  • ibis-server/app/model/validator.py
  • ibis-server/app/custom_ibis/backends/sql/datatypes.py
  • ibis-server/tests/routers/v2/connector/test_trino.py
  • ibis-server/tests/routers/v3/connector/oracle/conftest.py
  • ibis-server/tests/routers/v2/connector/test_canner.py
  • ibis-server/app/dependencies.py
  • ibis-server/tests/routers/v2/connector/test_mssql.py
  • ibis-server/tests/routers/v3/connector/local_file/test_query.py
  • ibis-server/tests/routers/v2/connector/test_local_file.py
  • ibis-server/tests/routers/v3/connector/bigquery/test_functions.py
  • ibis-server/tests/routers/v2/connector/test_mysql.py
  • ibis-server/tests/routers/v2/connector/test_gcs_file.py
  • ibis-server/app/routers/v3/connector.py
  • ibis-server/tests/routers/v3/connector/bigquery/test_query.py
  • ibis-server/tests/routers/v2/connector/test_minio_file.py
  • ibis-server/app/routers/v2/connector.py
  • ibis-server/tests/routers/v2/connector/test_bigquery.py
  • ibis-server/tests/routers/v2/connector/test_oracle.py
  • ibis-server/tests/routers/v2/connector/test_snowflake.py
  • ibis-server/tests/routers/v2/connector/test_s3_file.py
  • ibis-server/tests/routers/v3/connector/oracle/test_query.py
  • ibis-server/tests/routers/v2/connector/test_clickhouse.py
  • ibis-server/app/model/connector.py
  • ibis-server/tests/routers/v2/connector/test_postgres.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)
  • connection_info (37-44)
🪛 Pylint (3.3.7)
ibis-server/app/util.py

[refactor] 171-189: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 180-185: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 169-169: Too many return statements (7/6)

(R0911)

ibis-server/app/query_cache/__init__.py

[refactor] 40-40: Useless return at end of function or method

(R1711)

ibis-server/tools/query_local_run.py

[refactor] 101-101: Consider using '{}' instead of a call to 'dict'.

(R1735)

🪛 Ruff (0.11.9)
ibis-server/tools/query_local_run.py

21-21: pandas imported but unused

(F401)


98-98: print found

(T201)


99-99: print found

(T201)


100-100: print found

(T201)


102-102: print found

(T201)


103-103: print found

(T201)


104-104: print found

(T201)

🔇 Additional comments (7)
ibis-server/tests/routers/v3/connector/redshift/test_functions.py (1)

90-91: Confirm int32 assumption for portability

ABS(-1) indeed returns the same type as its operand, but the SQL literal -1 is not consistently typed as a 32-bit integer across engines (it is int64 in some builds / arrow back-ends).
If Redshift/Arrow ever widens the literal to int64, this assertion will start failing.

Consider loosening the check to {"col": re.compile(r"int(32|64)")} or build the value with an explicit cast (SELECT CAST(-1 AS INT4)).
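The loosened assertion might look like this (sketch; the pattern name is illustrative):

```python
import re

INT_PATTERN = re.compile(r"int(32|64)")

# Either integer width satisfies the relaxed expectation.
assert INT_PATTERN.fullmatch("int32")
assert INT_PATTERN.fullmatch("int64")
assert INT_PATTERN.fullmatch("float64") is None
```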

ibis-server/tools/query_local_run.py (1)

91-98: Prefer to_pyarrow(limit=10) over limit(...).to_pyarrow()

Chaining .limit(10) before .to_pyarrow() already pushes a LIMIT into the generated SQL, so correctness is not at stake; however, Ibis also accepts the limit directly on the conversion call, which reads more clearly and avoids building an intermediate expression.

-df = connection.sql(dialect_sql).limit(10).to_pyarrow()
+df = connection.sql(dialect_sql).to_pyarrow(limit=10)

(Keep the current call if the chosen backend/driver doesn't support the keyword.)

ibis-server/tests/routers/v2/connector/test_athena.py (2)

102-125: Updated expectations match Arrow-backed types – looks good

The row payload and dtypes assertions now reflect date32 and decimal128 as produced by Arrow/DuckDB.
No further action required.


147-165: Second manifest-driven query block consistent with first — ok

Mirrors the changes above for the Glue database path; expectations are coherent.

ibis-server/tests/routers/v2/connector/test_redshift.py (1)

109-116: Arrow-aware dtype assertions validated – looks good

Dtype refinements (double, timestamp[us, tz=+00:00], etc.) align with the new serialization path.

ibis-server/tests/routers/v3/connector/athena/test_query.py (1)

77-99: V3 API expectations updated correctly

Date simplification and Arrow dtype updates are consistent with the v2 tests; assertions remain deterministic.

ibis-server/app/util.py (1)

169-176: ```shell
#!/bin/bash
# Search for usage of _formater in the repository
rg -n "_formater" .
```



Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
ibis-server/app/util.py (1)

65-69: startwith typo breaks offset handling (already reported)

The guard still uses timezone.startwith("+"); this will raise AttributeError for offset values and bypass the intended logic.

-        if timezone.startwith("+") or timezone.startswith("-"):
+        if timezone.startswith(("+", "-")):
🧹 Nitpick comments (1)
ibis-server/app/util.py (1)

52-52: Log message typo (“formmated_sql”)

Minor nit – the key is misspelled. This makes log filtering harder.

-    logger.debug(f"formmated_sql: {formatted_sql}")
+    logger.debug(f"formatted_sql: {formatted_sql}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8293caa and 6ea21a2.

📒 Files selected for processing (2)
  • ibis-server/app/model/metadata/redshift.py (2 hunks)
  • ibis-server/app/util.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ibis-server/app/model/metadata/redshift.py
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.516Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
🪛 Pylint (3.3.7)
ibis-server/app/util.py

[refactor] 168-186: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 177-182: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 166-166: Too many return statements (7/6)

(R0911)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci
🔇 Additional comments (1)
ibis-server/app/util.py (1)

166-174: [web_search]

Does DuckDB support the SQL function format() for numeric formatting, or is printf/to_char the correct function to use?

[web_search]

What is the correct syntax and argument order for the strftime function in DuckDB?

@goldmedal goldmedal force-pushed the feat/use-arrow-type branch from 6ea21a2 to 4775465 Compare June 18, 2025 06:12
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
ibis-server/app/util.py (1)

62-70: startwith typo crashes timezone handling

timezone.startwith("+") raises AttributeError, preventing any non-named-tz offsets from working.

-        if timezone.startwith("+") or timezone.startswith("-"):
+        if timezone.startswith(("+", "-")):
🧹 Nitpick comments (2)
ibis-server/app/util.py (2)

50-54: Minor typos & missing variable exposure reduce debuggability

  1. formmated_sql is misspelled in the debug log – makes searching harder.
  2. Consider logging only a prefix of the SQL when it is extremely long – avoids megabyte-sized log lines.
-    logger.debug(f"formmated_sql: {formatted_sql}")
+    logger.debug(f"formatted_sql: {formatted_sql[:500]}")

166-187: Simplify _formater branching

Seven early returns plus nested triple-quoted strings are hard to scan. Consider a mapping table or a match/case (py3.10+) to reduce cognitive load.

Not blocking, purely readability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ea21a2 and 4775465.

📒 Files selected for processing (5)
  • ibis-server/app/model/connector.py (7 hunks)
  • ibis-server/app/model/metadata/redshift.py (2 hunks)
  • ibis-server/app/util.py (4 hunks)
  • ibis-server/tests/routers/v2/connector/test_redshift.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/redshift/test_query.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • ibis-server/tests/routers/v2/connector/test_redshift.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • ibis-server/app/model/metadata/redshift.py
  • ibis-server/tests/routers/v3/connector/redshift/test_query.py
  • ibis-server/app/model/connector.py
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (3)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.516Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
🪛 Pylint (3.3.7)
ibis-server/app/util.py

[refactor] 168-186: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 177-182: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 166-166: Too many return statements (7/6)

(R0911)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
ibis-server/app/util.py (1)

65-69: Fix typo that breaks offset-based timezone detection

timezone.startwith("+") raises AttributeError at runtime – the correct method is startswith.

-        if timezone.startwith("+") or timezone.startswith("-"):
+        if timezone.startswith(("+", "-")):
🧹 Nitpick comments (4)
ibis-server/app/util.py (4)

51-52: Spelling/clarity – typo in debug message

Small nit: formmated_sqlformatted_sql.

-    logger.debug(f"formmated_sql: {formatted_sql}")
+    logger.debug(f"formatted_sql: {formatted_sql}")

61-75: Connection lifecycle – ensure explicit close

get_duckdb_conn creates a new in-memory connection on every call but none of the callers close it.
While GC will eventually collect the object, explicitly closing the connection avoids surprise file-descriptor exhaustion during high-volume requests.

Consider:

-    conn = get_duckdb_conn(headers)
-    formatted_df = conn.execute(formatted_sql).fetch_df()
+    with get_duckdb_conn(headers) as conn:
+        formatted_df = conn.execute(formatted_sql).fetch_df()

154-163: Reduce repetition when copying cache headers

The four nearly-identical if blocks can be condensed, improving maintainability:

-def update_response_headers(response, required_headers: dict):
-    if X_CACHE_HIT in required_headers:
-        response.headers[X_CACHE_HIT] = required_headers[X_CACHE_HIT]
-    if X_CACHE_CREATE_AT in required_headers:
-        response.headers[X_CACHE_CREATE_AT] = required_headers[X_CACHE_CREATE_AT]
-    if X_CACHE_OVERRIDE in required_headers:
-        response.headers[X_CACHE_OVERRIDE] = required_headers[X_CACHE_OVERRIDE]
-    if X_CACHE_OVERRIDE_AT in required_headers:
-        response.headers[X_CACHE_OVERRIDE_AT] = required_headers[X_CACHE_OVERRIDE_AT]
+def update_response_headers(response, required_headers: dict):
+    for hdr in (
+        X_CACHE_HIT,
+        X_CACHE_CREATE_AT,
+        X_CACHE_OVERRIDE,
+        X_CACHE_OVERRIDE_AT,
+    ):
+        if hdr in required_headers:
+            response.headers[hdr] = required_headers[hdr]

165-186: _formater: simplify branching & silence linter warnings

Pylint flags “unnecessary elif after return” and “too many return statements”.
A compact dict-based dispatch avoids both issues and is easier to extend.

-def _formater(field: pa.Field) -> str:
-    column_name = _quote_identifier(field.name)
-    if pa.types.is_decimal(field.type) or pa.types.is_floating(field.type):
-        return f"""
-        case when {column_name} = 0 then '0'
-        when length(CAST({column_name} AS VARCHAR)) > 15 then format('{{:.9g}}', {column_name})
-        else RTRIM(RTRIM(format('{{:.8f}}', {column_name}), '0'), '.')
-        end as {column_name}"""
-    elif pa.types.is_date(field.type):
-        return f"strftime({column_name}, '%Y-%m-%d') as {column_name}"
-    elif pa.types.is_timestamp(field.type):
-        if field.type.tz is None:
-            return f"strftime({column_name}, '%Y-%m-%d %H:%M:%S.%f') as {column_name}"
-        else:
-            return (
-                f"strftime({column_name}, '%Y-%m-%d %H:%M:%S.%f %Z') as {column_name}"
-            )
-    elif pa.types.is_binary(field.type):
-        return f"to_hex({column_name}) as {column_name}"
-    elif pa.types.is_interval(field.type):
-        return f"cast({column_name} as varchar) as {column_name}"
-    return column_name
+def _formater(field: pa.Field) -> str:
+    col = _quote_identifier(field.name)
+
+    if pa.types.is_decimal(field.type) or pa.types.is_floating(field.type):
+        return (
+            f"case when {col} = 0 then '0' "
+            f"when length(CAST({col} AS VARCHAR)) > 15 "
+            f"then format('{{:.9g}}', {col}) "
+            f"else RTRIM(RTRIM(format('{{:.8f}}', {col}), '0'), '.') end as {col}"
+        )
+
+    if pa.types.is_date(field.type):
+        return f"strftime({col}, '%Y-%m-%d') as {col}"
+
+    if pa.types.is_timestamp(field.type):
+        fmt = '%Y-%m-%d %H:%M:%S.%f%Z' if field.type.tz else '%Y-%m-%d %H:%M:%S.%f'
+        return f"strftime({col}, '{fmt}') as {col}"
+
+    if pa.types.is_binary(field.type):
+        return f"to_hex({col}) as {col}"
+
+    if pa.types.is_interval(field.type):
+        return f"cast({col} as varchar) as {col}"
+
+    return col
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4775465 and 78e4dda.

📒 Files selected for processing (1)
  • ibis-server/app/util.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (3)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.516Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
🪛 Pylint (3.3.7)
ibis-server/app/util.py

[refactor] 167-185: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 176-181: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 165-165: Too many return statements (7/6)

(R0911)

🪛 GitHub Actions: ibis CI
ibis-server/app/util.py

[error] 1-1: Ruff formatting check failed. File would be reformatted. Run 'ruff --fix' to fix code style issues.

🔇 Additional comments (1)
ibis-server/app/util.py (1)

1-191: CI failure – run Ruff auto-formatter

The GitHub Action reports: “Ruff formatting check failed. File would be reformatted.”
Run ruff --fix ibis-server/app/util.py to pass the style gate.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
ibis-server/app/util.py (1)

68-74: Fix fatal typo startwithstartswith in timezone handling

str has no attribute startwith; the first call will raise an AttributeError, breaking every request that supplies a +HH:MM offset.

-        if timezone.startwith("+") or timezone.startswith("-"):
+        if timezone.startswith(("+", "-")):

Please patch immediately – nothing downstream will run until this is corrected.

🧹 Nitpick comments (3)
ibis-server/app/util.py (3)

55-55: Nit: misspelled log label formmated_sql

Minor, but typos in log messages hinder grep-based troubleshooting.

-    logger.debug(f"formmated_sql: {formatted_sql}")
+    logger.debug(f"formatted_sql: {formatted_sql}")

169-190: Simplify _formater branches to appease static-analysis and readability

Five elif chains after an early return trigger pylint R1705 and make the function longer than required. De-indent the branches once a match is found:

@@
-    elif pa.types.is_date(field.type):
+    if pa.types.is_date(field.type):
@@
-    elif pa.types.is_timestamp(field.type):
+    elif pa.types.is_timestamp(field.type):
@@
-    elif pa.types.is_binary(field.type):
+    elif pa.types.is_binary(field.type):
@@
-    elif pa.types.is_interval(field.type):
+    elif pa.types.is_interval(field.type):

(or switch to a match/case when Python ≥3.10)
This removes superfluous elif levels, satisfying linters and future readers.


145-155: Potential dtype loss when converting Pandas → Arrow

Fallback to pa.string() for unsupported dtypes silently degrades information (e.g., category, Int64, datetime64[ns, tz]). Consider using pyarrow’s pandas_compat to infer richer types:

-            # Fallback to string type for unsupported dtypes
-            pa_type = pa.string()
+            # Let Arrow attempt a best-effort mapping first
+            try:
+                pa_type = pa.from_numpy_dtype(dtype)
+            except (TypeError, NotImplementedError):
+                pa_type = pa.string()

This preserves more semantic information without much extra code.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78e4dda and 1ae2d5a.

📒 Files selected for processing (1)
  • ibis-server/app/util.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (3)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.516Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.012Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
🪛 Pylint (3.3.7)
ibis-server/app/util.py

[refactor] 171-189: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 180-185: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 169-169: Too many return statements (7/6)

(R0911)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci

@goldmedal goldmedal force-pushed the feat/use-arrow-type branch from cc6b2b7 to 76c5114 Compare June 20, 2025 08:33
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
ibis-server/app/query_cache/__init__.py (1)

52-58: Inefficient file handling with unused handle.

You open a file handle with op.open(..., "wb") but don't use it, then write via DuckDB to full_path. This creates an unnecessary file handle allocation and potential for a stray empty file if the DuckDB write fails.

Consider writing the Arrow table directly to Parquet:

-            # Create cache directory if it doesn't exist
-            with op.open(cache_file_name, mode="wb") as file:
-                con = self._get_duckdb_connection()
-                arrow_table = con.from_arrow(result)
-                if file.writable():
-                    arrow_table.write_parquet(full_path)
+            # Write Arrow table directly to Parquet
+            with op.open(cache_file_name, mode="wb") as sink:
+                pq.write_table(result, sink)

This requires adding import pyarrow.parquet as pq and removes the DuckDB dependency for cache writing.

ibis-server/app/util.py (1)

70-70: Fix typo in method name.

There's a typo in the method name startwith which should be startswith. This will cause an AttributeError at runtime.

-        if timezone.startwith("+") or timezone.startswith("-"):
+        if timezone.startswith(("+", "-")):
🧹 Nitpick comments (1)
ibis-server/tools/query_local_run.py (1)

21-21: Remove unused pandas import.

The pandas import is no longer needed since the code has migrated to PyArrow.

-import pandas as pd
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc6b2b7 and 76c5114.

⛔ Files ignored due to path filters (1)
  • ibis-server/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (35)
  • ibis-server/app/custom_ibis/backends/sql/datatypes.py (1 hunks)
  • ibis-server/app/dependencies.py (1 hunks)
  • ibis-server/app/model/connector.py (7 hunks)
  • ibis-server/app/model/metadata/object_storage.py (1 hunks)
  • ibis-server/app/model/metadata/redshift.py (2 hunks)
  • ibis-server/app/model/validator.py (1 hunks)
  • ibis-server/app/query_cache/__init__.py (4 hunks)
  • ibis-server/app/routers/v2/connector.py (5 hunks)
  • ibis-server/app/routers/v3/connector.py (6 hunks)
  • ibis-server/app/util.py (4 hunks)
  • ibis-server/pyproject.toml (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_athena.py (4 hunks)
  • ibis-server/tests/routers/v2/connector/test_bigquery.py (10 hunks)
  • ibis-server/tests/routers/v2/connector/test_canner.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_clickhouse.py (5 hunks)
  • ibis-server/tests/routers/v2/connector/test_gcs_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_local_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_minio_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_mssql.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_mysql.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_oracle.py (3 hunks)
  • ibis-server/tests/routers/v2/connector/test_postgres.py (3 hunks)
  • ibis-server/tests/routers/v2/connector/test_redshift.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_s3_file.py (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_snowflake.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_trino.py (2 hunks)
  • ibis-server/tests/routers/v3/connector/athena/test_query.py (2 hunks)
  • ibis-server/tests/routers/v3/connector/bigquery/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/bigquery/test_query.py (5 hunks)
  • ibis-server/tests/routers/v3/connector/local_file/test_query.py (2 hunks)
  • ibis-server/tests/routers/v3/connector/oracle/conftest.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/oracle/test_query.py (4 hunks)
  • ibis-server/tests/routers/v3/connector/postgres/test_query.py (8 hunks)
  • ibis-server/tests/routers/v3/connector/redshift/test_query.py (1 hunks)
  • ibis-server/tools/query_local_run.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (31)
  • ibis-server/app/model/metadata/object_storage.py
  • ibis-server/tests/routers/v3/connector/bigquery/test_functions.py
  • ibis-server/app/model/validator.py
  • ibis-server/app/model/metadata/redshift.py
  • ibis-server/app/custom_ibis/backends/sql/datatypes.py
  • ibis-server/pyproject.toml
  • ibis-server/tests/routers/v2/connector/test_redshift.py
  • ibis-server/tests/routers/v3/connector/bigquery/test_query.py
  • ibis-server/tests/routers/v3/connector/oracle/conftest.py
  • ibis-server/app/dependencies.py
  • ibis-server/tests/routers/v3/connector/redshift/test_query.py
  • ibis-server/tests/routers/v2/connector/test_gcs_file.py
  • ibis-server/tests/routers/v2/connector/test_trino.py
  • ibis-server/tests/routers/v2/connector/test_canner.py
  • ibis-server/app/routers/v3/connector.py
  • ibis-server/tests/routers/v3/connector/local_file/test_query.py
  • ibis-server/tests/routers/v2/connector/test_mysql.py
  • ibis-server/tests/routers/v2/connector/test_minio_file.py
  • ibis-server/app/routers/v2/connector.py
  • ibis-server/tests/routers/v2/connector/test_mssql.py
  • ibis-server/tests/routers/v2/connector/test_local_file.py
  • ibis-server/tests/routers/v2/connector/test_athena.py
  • ibis-server/tests/routers/v2/connector/test_s3_file.py
  • ibis-server/tests/routers/v2/connector/test_oracle.py
  • ibis-server/tests/routers/v3/connector/athena/test_query.py
  • ibis-server/tests/routers/v2/connector/test_snowflake.py
  • ibis-server/tests/routers/v2/connector/test_clickhouse.py
  • ibis-server/tests/routers/v2/connector/test_bigquery.py
  • ibis-server/tests/routers/v2/connector/test_postgres.py
  • ibis-server/tests/routers/v3/connector/oracle/test_query.py
  • ibis-server/app/model/connector.py
🧰 Additional context used
🧠 Learnings (1)
ibis-server/app/util.py (3)
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:50-57
Timestamp: 2025-06-18T02:12:43.570Z
Learning: In the `to_json` function in `ibis-server/app/util.py`, the code intentionally uses `fetch_df()` to get a pandas DataFrame and then calls `to_dict(orient='split')` because this specific format is required for `orjson` serialization. The pandas conversion step is necessary to generate the correct dictionary structure for orjson.
Learnt from: goldmedal
PR: Canner/wren-engine#1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
🪛 Pylint (3.3.7)
ibis-server/app/util.py

[refactor] 171-189: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 180-185: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 169-169: Too many return statements (7/6)

(R0911)

ibis-server/app/query_cache/__init__.py

[refactor] 40-40: Useless return at end of function or method

(R1711)

ibis-server/tools/query_local_run.py

[refactor] 101-101: Consider using '{}' instead of a call to 'dict'.

(R1735)

🪛 Ruff (0.11.9)
ibis-server/tools/query_local_run.py

21-21: pandas imported but unused

(F401)


98-98: print found

(T201)


99-99: print found

(T201)


100-100: print found

(T201)


102-102: print found

(T201)


103-103: print found

(T201)


104-104: print found

(T201)

🔇 Additional comments (8)
ibis-server/tools/query_local_run.py (2)

91-93: Oracle support implementation looks good.

The Oracle connection setup follows the same pattern as other data sources with proper validation using OracleConnectionInfo.


97-104: PyArrow migration implemented correctly.

The migration from .to_pandas() to .to_pyarrow() aligns with the PR objectives and the updated to_json function correctly handles PyArrow Tables.

ibis-server/tests/routers/v3/connector/postgres/test_query.py (3)

145-145: Date format simplified correctly.

The change from timestamp format to date-only format "1996-01-02" aligns with the PyArrow date handling and is more appropriate for date columns.


155-162: Data type specifications updated to Arrow types.

The updated dtypes reflect more precise Arrow type information (e.g., "string", "double", "date32[day]", "timestamp[us, tz=UTC]"), which provides better type fidelity in the PyArrow migration.


591-667: Comprehensive floating point formatting test.

This test covers extensive floating point scenarios including scientific notation, decimals, special values (NaN, Infinity), and precision handling. It's a valuable addition for validating the new DuckDB-based formatting logic.

ibis-server/app/util.py (3)

43-62: PyArrow-based JSON serialization implemented well.

The new to_json function correctly handles PyArrow Tables with proper empty table handling, DuckDB-based formatting, and maintains the required orjson serialization format. The use of DuckDB's automatic PyArrow Table recognition is elegant.


65-78: DuckDB connection with timezone support.

The timezone handling logic properly supports both named timezones and offset formats, with appropriate fallback to UTC. The parameterized query approach prevents SQL injection.


169-190: Comprehensive type-specific formatting logic.

The _formater function handles various Arrow types appropriately:

  • Decimal/float formatting with scientific notation for large numbers
  • Proper date and timestamp formatting with timezone awareness
  • Binary and interval type handling

The logic is comprehensive and handles edge cases well.

@douenergy
Contributor

Thanks @goldmedal for the hard work 👍


Labels

bigquery dependencies Pull requests that update a dependency file ibis python Pull requests that update Python code
