
feat(core): use BigQuery python client directly #1370

Merged
douenergy merged 3 commits into Canner:main from goldmedal:feat/optimize-bigquery-connector
Nov 14, 2025
Conversation

@goldmedal
Contributor

@goldmedal goldmedal commented Nov 13, 2025

Description

  • This PR stops using ibis for the connector because it performs many additional steps that aren't necessary for us.
  • After switching to the BigQuery Python client, the average query time of a simple `SELECT 1` is reduced by about 1 second.
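For context, the new query path decodes a base64-encoded service-account JSON before building credentials. Below is a minimal, runnable sketch of just that decoding step, using a fabricated payload (the real connector reads the secret from connection_info.credentials and passes the parsed dict to service_account.Credentials.from_service_account_info):

```python
import base64
import json

# Fabricated service-account payload for illustration only; a real secret
# contains private-key material and is never hard-coded like this.
fake_secret = base64.b64encode(
    json.dumps({"type": "service_account", "project_id": "my-project"}).encode("utf-8")
)

# The connector decodes the stored secret the same way before handing the
# dict to google.oauth2.service_account.Credentials.from_service_account_info.
credits_json = json.loads(base64.b64decode(fake_secret).decode("utf-8"))
print(credits_json["project_id"])  # my-project
```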

Summary by CodeRabbit

  • Refactor
    • BigQuery connector now initializes credentials and client directly and returns query results as Arrow tables with proper result limiting.
    • Metadata retrieval updated to use the unified query flow while preserving existing table and constraint outputs.
    • Improved observability of query execution via added tracing.

@github-actions github-actions bot added bigquery ibis python Pull requests that update Python code labels Nov 13, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 13, 2025

Walkthrough

BigQueryConnector.query now builds Google Cloud credentials from stored secrets, instantiates a BigQuery client, runs the SQL and returns a PyArrow table (honoring an optional limit). DuckDB is imported during DuckDBConnector initialization. BigQuery metadata adapters call connection.query(...) instead of connection.sql(...).

Changes

  • BigQuery Connector Implementation (ibis-server/app/model/connector.py): BigQueryConnector.query now constructs GCP credentials from the stored secret, creates a google.cloud.bigquery.Client, executes the query, and returns a PyArrow table (using result(max_results=limit) when a limit is provided). The previous fallback/schema-reconstruction path was removed, and a tracing span was added. The duckdb import was also moved into DuckDBConnector.__init__.
  • BigQuery Metadata Adapter (ibis-server/app/model/metadata/bigquery.py): Replaced calls to self.connection.sql(...) with self.connection.query(...) in get_table_list and get_constraints; result handling (to pandas -> records) remains unchanged.
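The adapter-side change can be sketched with fake stand-ins for the connection and the Arrow result (FakeConnection, FakeArrowTable, and MetadataAdapter below are illustrative, not the project's real classes; the real adapter converts via to_pandas before building records):

```python
class FakeArrowTable:
    """Stand-in for the pyarrow.Table returned by BigQueryConnector.query."""
    def __init__(self, records):
        self._records = records

    def to_pylist(self):
        return self._records


class FakeConnection:
    def query(self, sql, limit=None):
        # The real connector runs the SQL on BigQuery and returns an Arrow table.
        return FakeArrowTable([{"table_name": "orders"}, {"table_name": "users"}])


class MetadataAdapter:
    def __init__(self, connection):
        self.connection = connection

    def get_table_list(self):
        # Previously self.connection.sql(...); now the unified query flow.
        result = self.connection.query("SELECT table_name FROM INFORMATION_SCHEMA.TABLES")
        return result.to_pylist()


tables = MetadataAdapter(FakeConnection()).get_table_list()
print([t["table_name"] for t in tables])  # ['orders', 'users']
```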

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant BigQueryConnector
    participant SecretStore
    participant GCPCredentials
    participant BigQueryClient
    participant ArrowTable

    Caller->>BigQueryConnector: query(sql, limit)

    rect rgb(240,248,255)
    Note over BigQueryConnector,SecretStore: build credentials from stored secret
    BigQueryConnector->>SecretStore: fetch secret
    SecretStore-->>BigQueryConnector: secret
    BigQueryConnector->>GCPCredentials: construct credentials
    GCPCredentials-->>BigQueryConnector: credentials
    end

    rect rgb(240,255,240)
    Note over BigQueryConnector,BigQueryClient: init client and run query
    BigQueryConnector->>BigQueryClient: init(credentials)
    BigQueryConnector->>BigQueryClient: execute query (max_results=limit)
    BigQueryClient-->>BigQueryConnector: results
    end

    BigQueryConnector->>ArrowTable: convert results -> PyArrow Table
    ArrowTable-->>BigQueryConnector: pa.Table
    BigQueryConnector-->>Caller: pa.Table

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas to verify:
    • Correctness of credential construction and required OAuth scopes.
    • Query behavior for empty results and schema inference (fallback removed).
    • PyArrow conversion and max_results=limit semantics.
    • DuckDB import timing and any side effects in environments where duckdb is optional.
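The max_results=limit semantics flagged above can be illustrated with fake stand-ins for the client's job and result objects (all classes here are illustrative, not google-cloud-bigquery's real ones):

```python
class FakeRowIterator:
    """Mimics the limit behavior of a BigQuery result iterator."""
    def __init__(self, rows, max_results=None):
        # max_results=None yields every row; otherwise at most max_results.
        self.rows = rows if max_results is None else rows[:max_results]

    def to_arrow(self):
        # The real iterator builds a pyarrow.Table; a plain list stands in.
        return self.rows


class FakeQueryJob:
    def __init__(self, rows):
        self._rows = rows

    def result(self, max_results=None):
        return FakeRowIterator(self._rows, max_results)


job = FakeQueryJob([1, 2, 3])
print(job.result().to_arrow())               # [1, 2, 3]
print(job.result(max_results=2).to_arrow())  # [1, 2]
```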

Possibly related PRs

Suggested reviewers

  • wwwy3y3
  • douenergy

Poem

🐰 I hopped inside the connector tree,
Built creds from secrets, set the client free.
DuckDB yawned and woke in time,
Arrow tables gleamed, results in rhyme. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: switching from ibis to using the BigQuery Python client directly, which aligns with the core modifications in both connector.py and bigquery.py files.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3f7468d and 7afcfd9.

📒 Files selected for processing (2)
  • ibis-server/app/model/connector.py (1 hunks)
  • ibis-server/app/model/metadata/bigquery.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ibis-server/app/model/metadata/bigquery.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1055
File: ibis-server/app/model/connector.py:152-157
Timestamp: 2025-02-06T07:58:58.830Z
Learning: In BigQueryConnector, credentials with additional scopes (drive and cloud-platform) should only be created in the retry path when handling empty results with special types, not during initialization, to maintain lazy initialization.
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
🧬 Code graph analysis (1)
ibis-server/app/model/connector.py (2)
ibis-server/app/routers/v2/connector.py (1)
  • query (65-207)
ibis-server/app/routers/v3/connector.py (1)
  • query (60-232)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ci
🔇 Additional comments (3)
ibis-server/app/model/connector.py (3)

453-453: LGTM: Lazy import pattern improves performance.

The in-method import duckdb defers the import cost until the connector is instantiated, which is consistent with RedshiftConnector (line 534) and appropriate for optional dependencies.
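The lazy-import pattern described here can be sketched as follows; importlib.import_module("json") stands in for `import duckdb` so the sketch runs without duckdb installed:

```python
import importlib


class LazyConnector:
    """Illustrative connector that defers a heavy import to instantiation."""
    def __init__(self, module_name: str):
        # A module-level `import duckdb` would pay the import cost at startup
        # even when this connector is never used; importing here defers it
        # (and the hard dependency) until the connector is constructed.
        self._mod = importlib.import_module(module_name)


conn = LazyConnector("json")  # "json" stands in for "duckdb"
print(conn._mod.__name__)  # json
```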


428-430: No functional issues with dry_run, but fix credentials/scopes handling per design guidance.

The dry_run method will work correctly—it inherits SimpleConnector.dry_run which calls self.connection.sql(sql) on the ibis BigQuery connection. While the hybrid approach (ibis for dry_run, BigQuery client for query) creates unnecessary overhead by initializing an unused connection, it's functionally sound.

However, there's a more significant issue: credentials and scopes are created on every query() call (lines 438-446), not just in a retry path. Per design guidance, credentials with additional scopes should only be instantiated in the retry path, not during normal execution. Refactor to create credentials once at initialization or defer creation to a retry mechanism.


432-448: Add consistency note or verify BigQuery .to_arrow() handles precision.

The concern about missing Decimal/UUID handling is architecturally valid, but the evidence suggests it's managed differently:

  • BigQuery NUMERIC and BIGNUMERIC types are automatically supported in the python-bigquery library, and the direct client.query().result().to_arrow() call handles this conversion natively without explicit type mapping.
  • The test test_decimal_precision in test_query.py confirms Decimal values (NUMERIC columns) return correctly as serialized strings.
  • However, this differs from SimpleConnector.query, which explicitly applies _handle_pyarrow_unsupported_type to Ibis tables before converting to PyArrow. CannerConnector similarly includes this handling despite using custom logic.

Recommendation: Document why BigQuery bypasses this type handling (native .to_arrow() support) to clarify the architectural difference, or apply the same defensive pattern as CannerConnector for consistency and future maintainability.

@goldmedal
Contributor Author

BigQuery has been tested locally:

poetry run pytest -m 'bigquery'
================================================================================================================================ test session starts ================================================================================================================================
platform darwin -- Python 3.11.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/jax/git/wren-engine/ibis-server
configfile: pyproject.toml
plugins: anyio-4.10.0
collected 396 items / 349 deselected / 47 selected                                                                                                                                                                                                                                  

tests/routers/v2/connector/test_bigquery.py ..................                                                                                                                                                                                                                [ 38%]
tests/routers/v3/connector/bigquery/test_functions.py ............                                                                                                                                                                                                            [ 63%]
tests/routers/v3/connector/bigquery/test_query.py .................  

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
ibis-server/app/model/connector.py (1)

442-447: Reconsider unconditional broad credential scopes.

The credentials are scoped with drive and cloud-platform for every query execution. Based on learnings, these broad scopes were previously reserved for a retry path when handling empty results with special types, not for normal execution. Creating credentials with broader-than-necessary scopes for every query violates the principle of least privilege and may have performance implications.

Based on learnings.

Consider one of the following approaches:

  1. If the retry logic for special types is no longer needed, document why these broad scopes are now required for all queries
  2. If these scopes are truly necessary, explain the requirement in a comment
  3. If the original scopes are sufficient, remove the additional scopes:
-        credentials = credentials.with_scopes(
-            [
-                "https://www.googleapis.com/auth/drive",
-                "https://www.googleapis.com/auth/cloud-platform",
-            ]
-        )
+        # Use default scopes from the service account
🧹 Nitpick comments (2)
ibis-server/app/model/connector.py (2)

448-449: Consider caching the BigQuery client for efficiency.

A new BigQuery client is instantiated on every query call. This could be inefficient as it involves credential setup and connection initialization overhead.

Consider caching the client as an instance variable:

+    def _get_client(self) -> bigquery.Client:
+        if not hasattr(self, "_client"):
+            credits_json = loads(
+                base64.b64decode(
+                    self.connection_info.credentials.get_secret_value()
+                ).decode("utf-8")
+            )
+            credentials = service_account.Credentials.from_service_account_info(
+                credits_json
+            )
+            credentials = credentials.with_scopes(
+                [
+                    "https://www.googleapis.com/auth/drive",
+                    "https://www.googleapis.com/auth/cloud-platform",
+                ]
+            )
+            self._client = bigquery.Client(credentials=credentials)
+        return self._client
+
     @tracer.start_as_current_span("connector_query", kind=trace.SpanKind.CLIENT)
     def query(self, sql: str, limit: int | None = None) -> pa.Table:
-        credits_json = loads(
-            base64.b64decode(
-                self.connection_info.credentials.get_secret_value()
-            ).decode("utf-8")
-        )
-        credentials = service_account.Credentials.from_service_account_info(
-            credits_json
-        )
-        credentials = credentials.with_scopes(
-            [
-                "https://www.googleapis.com/auth/drive",
-                "https://www.googleapis.com/auth/cloud-platform",
-            ]
-        )
-        client = bigquery.Client(credentials=credentials)
+        client = self._get_client()
         return client.query(sql).result(max_results=limit).to_arrow()

Also add cleanup in a close method:

def close(self) -> None:
    """Close the BigQuery client and parent connection."""
    if hasattr(self, "_client"):
        self._client.close()
    super().close()

448-448: Consider specifying the project ID explicitly.

The BigQuery client is instantiated without an explicit project parameter. While it may be inferred from the credentials, explicitly specifying it improves clarity and prevents potential ambiguity.

Extract and pass the project ID:

+        project_id = credits_json.get("project_id")
-        client = bigquery.Client(credentials=credentials)
+        client = bigquery.Client(credentials=credentials, project=project_id)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7afcfd9 and 173dafb.

📒 Files selected for processing (1)
  • ibis-server/app/model/connector.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ci

@goldmedal goldmedal requested a review from douenergy November 14, 2025 02:23
@douenergy
Contributor

Thanks @goldmedal

@douenergy douenergy merged commit c53150e into Canner:main Nov 14, 2025
4 of 5 checks passed
nhaluc1005 pushed a commit to nhaluc1005/text2sql-practice that referenced this pull request Apr 3, 2026
