
feat(core): use BigQuery python client directly #1370

Merged
douenergy merged 3 commits into Canner:main from goldmedal:feat/optimize-bigquery-connector
Nov 14, 2025
Conversation

@goldmedal
Contributor

@goldmedal goldmedal commented Nov 13, 2025

Description

  • This PR stops using ibis for the connector because it performs many additional steps that aren't necessary for us.
  • After switching to the BigQuery Python client, the average query time of a simple `SELECT 1` is reduced by about 1 second.
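For context, the new query path decodes a base64-encoded service-account JSON before building credentials. Below is a minimal, runnable sketch of just that decoding step, using a fabricated payload (the real connector reads the secret from connection_info.credentials and passes the parsed dict to service_account.Credentials.from_service_account_info):

```python
import base64
import json

# Fabricated service-account payload for illustration only; a real secret
# contains private-key material and is never hard-coded like this.
fake_secret = base64.b64encode(
    json.dumps({"type": "service_account", "project_id": "my-project"}).encode("utf-8")
)

# The connector decodes the stored secret the same way before handing the
# dict to google.oauth2.service_account.Credentials.from_service_account_info.
credits_json = json.loads(base64.b64decode(fake_secret).decode("utf-8"))
print(credits_json["project_id"])  # my-project
```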

Summary by CodeRabbit

  • Refactor
    • BigQuery connector now initializes credentials and client directly and returns query results as Arrow tables with proper result limiting.
    • Metadata retrieval updated to use the unified query flow while preserving existing table and constraint outputs.
    • Improved observability of query execution via added tracing.

@github-actions github-actions bot added bigquery ibis python Pull requests that update Python code labels Nov 13, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 13, 2025

Walkthrough

BigQueryConnector.query now builds Google Cloud credentials from stored secrets, instantiates a BigQuery client, runs the SQL and returns a PyArrow table (honoring an optional limit). DuckDB is imported during DuckDBConnector initialization. BigQuery metadata adapters call connection.query(...) instead of connection.sql(...).

Changes

  • BigQuery Connector Implementation (ibis-server/app/model/connector.py): BigQueryConnector.query now constructs GCP credentials from the stored secret, creates a google.cloud.bigquery.Client, executes the query, and returns a PyArrow table (using result(max_results=limit) when a limit is provided). The previous fallback/schema-reconstruction path was removed, and a tracing span was added. The duckdb import was also moved into DuckDBConnector.__init__.
  • BigQuery Metadata Adapter (ibis-server/app/model/metadata/bigquery.py): Replaced calls to self.connection.sql(...) with self.connection.query(...) in get_table_list and get_constraints; result handling (to pandas -> records) remains unchanged.
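The adapter-side change can be sketched with fake stand-ins for the connection and the Arrow result (FakeConnection, FakeArrowTable, and MetadataAdapter below are illustrative, not the project's real classes; the real adapter converts via to_pandas before building records):

```python
class FakeArrowTable:
    """Stand-in for the pyarrow.Table returned by BigQueryConnector.query."""
    def __init__(self, records):
        self._records = records

    def to_pylist(self):
        return self._records


class FakeConnection:
    def query(self, sql, limit=None):
        # The real connector runs the SQL on BigQuery and returns an Arrow table.
        return FakeArrowTable([{"table_name": "orders"}, {"table_name": "users"}])


class MetadataAdapter:
    def __init__(self, connection):
        self.connection = connection

    def get_table_list(self):
        # Previously self.connection.sql(...); now the unified query flow.
        result = self.connection.query("SELECT table_name FROM INFORMATION_SCHEMA.TABLES")
        return result.to_pylist()


tables = MetadataAdapter(FakeConnection()).get_table_list()
print([t["table_name"] for t in tables])  # ['orders', 'users']
```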

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant BigQueryConnector
    participant SecretStore
    participant GCPCredentials
    participant BigQueryClient
    participant ArrowTable

    Caller->>BigQueryConnector: query(sql, limit)

    rect rgb(240,248,255)
    Note over BigQueryConnector,SecretStore: build credentials from stored secret
    BigQueryConnector->>SecretStore: fetch secret
    SecretStore-->>BigQueryConnector: secret
    BigQueryConnector->>GCPCredentials: construct credentials
    GCPCredentials-->>BigQueryConnector: credentials
    end

    rect rgb(240,255,240)
    Note over BigQueryConnector,BigQueryClient: init client and run query
    BigQueryConnector->>BigQueryClient: init(credentials)
    BigQueryConnector->>BigQueryClient: execute query (max_results=limit)
    BigQueryClient-->>BigQueryConnector: results
    end

    BigQueryConnector->>ArrowTable: convert results -> PyArrow Table
    ArrowTable-->>BigQueryConnector: pa.Table
    BigQueryConnector-->>Caller: pa.Table

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas to verify:
    • Correctness of credential construction and required OAuth scopes.
    • Query behavior for empty results and schema inference (fallback removed).
    • PyArrow conversion and max_results=limit semantics.
    • DuckDB import timing and any side effects in environments where duckdb is optional.
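The max_results=limit semantics flagged above can be illustrated with fake stand-ins for the client's job and result objects (all classes here are illustrative, not google-cloud-bigquery's real ones):

```python
class FakeRowIterator:
    """Mimics the limit behavior of a BigQuery result iterator."""
    def __init__(self, rows, max_results=None):
        # max_results=None yields every row; otherwise at most max_results.
        self.rows = rows if max_results is None else rows[:max_results]

    def to_arrow(self):
        # The real iterator builds a pyarrow.Table; a plain list stands in.
        return self.rows


class FakeQueryJob:
    def __init__(self, rows):
        self._rows = rows

    def result(self, max_results=None):
        return FakeRowIterator(self._rows, max_results)


job = FakeQueryJob([1, 2, 3])
print(job.result().to_arrow())               # [1, 2, 3]
print(job.result(max_results=2).to_arrow())  # [1, 2]
```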

Possibly related PRs

Suggested reviewers

  • wwwy3y3
  • douenergy

Poem

🐰 I hopped inside the connector tree,
Built creds from secrets, set the client free.
DuckDB yawned and woke in time,
Arrow tables gleamed, results in rhyme. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: switching from ibis to using the BigQuery Python client directly, which aligns with the core modifications in both connector.py and bigquery.py files.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3f7468d and 7afcfd9.

📒 Files selected for processing (2)
  • ibis-server/app/model/connector.py (1 hunks)
  • ibis-server/app/model/metadata/bigquery.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ibis-server/app/model/metadata/bigquery.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1055
File: ibis-server/app/model/connector.py:152-157
Timestamp: 2025-02-06T07:58:58.830Z
Learning: In BigQueryConnector, credentials with additional scopes (drive and cloud-platform) should only be created in the retry path when handling empty results with special types, not during initialization, to maintain lazy initialization.
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
🧬 Code graph analysis (1)
ibis-server/app/model/connector.py (2)
ibis-server/app/routers/v2/connector.py (1)
  • query (65-207)
ibis-server/app/routers/v3/connector.py (1)
  • query (60-232)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ci
🔇 Additional comments (3)
ibis-server/app/model/connector.py (3)

453-453: LGTM: Lazy import pattern improves performance.

The in-method import duckdb defers the import cost until the connector is instantiated, which is consistent with RedshiftConnector (line 534) and appropriate for optional dependencies.
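The lazy-import pattern described here can be sketched as follows; importlib.import_module("json") stands in for `import duckdb` so the sketch runs without duckdb installed:

```python
import importlib


class LazyConnector:
    """Illustrative connector that defers a heavy import to instantiation."""
    def __init__(self, module_name: str):
        # A module-level `import duckdb` would pay the import cost at startup
        # even when this connector is never used; importing here defers it
        # (and the hard dependency) until the connector is constructed.
        self._mod = importlib.import_module(module_name)


conn = LazyConnector("json")  # "json" stands in for "duckdb"
print(conn._mod.__name__)  # json
```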


428-430: No functional issues with dry_run, but fix credentials/scopes handling per design guidance.

The dry_run method will work correctly—it inherits SimpleConnector.dry_run which calls self.connection.sql(sql) on the ibis BigQuery connection. While the hybrid approach (ibis for dry_run, BigQuery client for query) creates unnecessary overhead by initializing an unused connection, it's functionally sound.

However, there's a more significant issue: credentials and scopes are created on every query() call (lines 438-446), not just in a retry path. Per design guidance, credentials with additional scopes should only be instantiated in the retry path, not during normal execution. Refactor to create credentials once at initialization or defer creation to a retry mechanism.


432-448: Add consistency note or verify BigQuery .to_arrow() handles precision.

The concern about missing Decimal/UUID handling is architecturally valid, but the evidence suggests it's managed differently:

  • BigQuery NUMERIC and BIGNUMERIC types are automatically supported in the python-bigquery library, and the direct client.query().result().to_arrow() call handles this conversion natively without explicit type mapping.
  • The test test_decimal_precision in test_query.py confirms Decimal values (NUMERIC columns) return correctly as serialized strings.
  • However, this differs from SimpleConnector.query, which explicitly applies _handle_pyarrow_unsupported_type to Ibis tables before converting to PyArrow. CannerConnector similarly includes this handling despite using custom logic.

Recommendation: Document why BigQuery bypasses this type handling (native .to_arrow() support) to clarify the architectural difference, or apply the same defensive pattern as CannerConnector for consistency and future maintainability.

@goldmedal
Contributor Author

BigQuery has been tested locally:

poetry run pytest -m 'bigquery'
================================================================================================================================ test session starts ================================================================================================================================
platform darwin -- Python 3.11.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/jax/git/wren-engine/ibis-server
configfile: pyproject.toml
plugins: anyio-4.10.0
collected 396 items / 349 deselected / 47 selected                                                                                                                                                                                                                                  

tests/routers/v2/connector/test_bigquery.py ..................                                                                                                                                                                                                                [ 38%]
tests/routers/v3/connector/bigquery/test_functions.py ............                                                                                                                                                                                                            [ 63%]
tests/routers/v3/connector/bigquery/test_query.py .................  

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
ibis-server/app/model/connector.py (1)

442-447: Reconsider unconditional broad credential scopes.

The credentials are scoped with drive and cloud-platform for every query execution. Based on learnings, these broad scopes were previously reserved for a retry path when handling empty results with special types, not for normal execution. Creating credentials with broader-than-necessary scopes for every query violates the principle of least privilege and may have performance implications.

Based on learnings.

Consider one of the following approaches:

  1. If the retry logic for special types is no longer needed, document why these broad scopes are now required for all queries
  2. If these scopes are truly necessary, explain the requirement in a comment
  3. If the original scopes are sufficient, remove the additional scopes:
-        credentials = credentials.with_scopes(
-            [
-                "https://www.googleapis.com/auth/drive",
-                "https://www.googleapis.com/auth/cloud-platform",
-            ]
-        )
+        # Use default scopes from the service account
🧹 Nitpick comments (2)
ibis-server/app/model/connector.py (2)

448-449: Consider caching the BigQuery client for efficiency.

A new BigQuery client is instantiated on every query call. This could be inefficient as it involves credential setup and connection initialization overhead.

Consider caching the client as an instance variable:

+    def _get_client(self) -> bigquery.Client:
+        if not hasattr(self, "_client"):
+            credits_json = loads(
+                base64.b64decode(
+                    self.connection_info.credentials.get_secret_value()
+                ).decode("utf-8")
+            )
+            credentials = service_account.Credentials.from_service_account_info(
+                credits_json
+            )
+            credentials = credentials.with_scopes(
+                [
+                    "https://www.googleapis.com/auth/drive",
+                    "https://www.googleapis.com/auth/cloud-platform",
+                ]
+            )
+            self._client = bigquery.Client(credentials=credentials)
+        return self._client
+
     @tracer.start_as_current_span("connector_query", kind=trace.SpanKind.CLIENT)
     def query(self, sql: str, limit: int | None = None) -> pa.Table:
-        credits_json = loads(
-            base64.b64decode(
-                self.connection_info.credentials.get_secret_value()
-            ).decode("utf-8")
-        )
-        credentials = service_account.Credentials.from_service_account_info(
-            credits_json
-        )
-        credentials = credentials.with_scopes(
-            [
-                "https://www.googleapis.com/auth/drive",
-                "https://www.googleapis.com/auth/cloud-platform",
-            ]
-        )
-        client = bigquery.Client(credentials=credentials)
+        client = self._get_client()
         return client.query(sql).result(max_results=limit).to_arrow()

Also add cleanup in a close method:

def close(self) -> None:
    """Close the BigQuery client and parent connection."""
    if hasattr(self, "_client"):
        self._client.close()
    super().close()

448-448: Consider specifying the project ID explicitly.

The BigQuery client is instantiated without an explicit project parameter. While it may be inferred from the credentials, explicitly specifying it improves clarity and prevents potential ambiguity.

Extract and pass the project ID:

+        project_id = credits_json.get("project_id")
-        client = bigquery.Client(credentials=credentials)
+        client = bigquery.Client(credentials=credentials, project=project_id)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7afcfd9 and 173dafb.

📒 Files selected for processing (1)
  • ibis-server/app/model/connector.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ci

@goldmedal goldmedal requested a review from douenergy November 14, 2025 02:23
@douenergy
Contributor

Thanks @goldmedal

@douenergy douenergy merged commit c53150e into Canner:main Nov 14, 2025
4 of 5 checks passed
nhaluc1005 pushed a commit to nhaluc1005/text2sql-practice that referenced this pull request Apr 3, 2026
