feat(core): use BigQuery python client directly #1370
Walkthrough

`BigQueryConnector.query` now builds Google Cloud credentials from stored secrets, instantiates a BigQuery client, runs the SQL, and returns a PyArrow table (honoring an optional limit). DuckDB is imported during `DuckDBConnector` initialization. BigQuery metadata adapters call

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant BigQueryConnector
    participant SecretStore
    participant GCPCredentials
    participant BigQueryClient
    participant ArrowTable
    Caller->>BigQueryConnector: query(sql, limit)
    rect rgb(240,248,255)
        Note over BigQueryConnector,SecretStore: build credentials from stored secret
        BigQueryConnector->>SecretStore: fetch secret
        SecretStore-->>BigQueryConnector: secret
        BigQueryConnector->>GCPCredentials: construct credentials
        GCPCredentials-->>BigQueryConnector: credentials
    end
    rect rgb(240,255,240)
        Note over BigQueryConnector,BigQueryClient: init client and run query
        BigQueryConnector->>BigQueryClient: init(credentials)
        BigQueryConnector->>BigQueryClient: execute query (max_results=limit)
        BigQueryClient-->>BigQueryConnector: results
    end
    BigQueryConnector->>ArrowTable: convert results -> PyArrow Table
    ArrowTable-->>BigQueryConnector: pa.Table
    BigQueryConnector-->>Caller: pa.Table
```
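The credential-building step in the flow above starts from a base64-encoded service-account JSON key stored as a secret. A minimal sketch of just the decode step (the helper name is hypothetical, not from the PR):

```python
import base64
from json import loads


def decode_service_account_secret(encoded: str) -> dict:
    """Decode a base64-encoded service-account JSON key (hypothetical helper)."""
    return loads(base64.b64decode(encoded).decode("utf-8"))


# Round-trip demo with a dummy key payload.
secret = base64.b64encode(b'{"type": "service_account", "project_id": "demo"}').decode()
info = decode_service_account_secret(secret)
print(info["project_id"])  # demo
```

The subsequent steps (building `service_account.Credentials` from this dict, creating a `bigquery.Client`, and calling `.result(max_results=limit).to_arrow()`) need the google-cloud-bigquery package and real credentials, so they are omitted from the sketch.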
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- ibis-server/app/model/connector.py (1 hunks)
- ibis-server/app/model/metadata/bigquery.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- ibis-server/app/model/metadata/bigquery.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1055
File: ibis-server/app/model/connector.py:152-157
Timestamp: 2025-02-06T07:58:58.830Z
Learning: In BigQueryConnector, credentials with additional scopes (drive and cloud-platform) should only be created in the retry path when handling empty results with special types, not during initialization, to maintain lazy initialization.
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
🧬 Code graph analysis (1)
ibis-server/app/model/connector.py (2)
ibis-server/app/routers/v2/connector.py (1)
- query (65-207)

ibis-server/app/routers/v3/connector.py (1)
- query (60-232)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: ci
🔇 Additional comments (3)
ibis-server/app/model/connector.py (3)
453-453: LGTM: Lazy import pattern improves performance. The in-method `import duckdb` defers the import cost until the connector is instantiated, which is consistent with `RedshiftConnector` (line 534) and appropriate for optional dependencies.
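The lazy-import pattern praised here can be illustrated with a self-contained sketch (the class is hypothetical, and stdlib `sqlite3` stands in for duckdb so the example runs anywhere):

```python
class LazyConnector:
    """Hypothetical connector: the import cost is paid only on instantiation."""

    def __init__(self):
        import sqlite3  # deferred import: the module loads only when needed

        self.connection = sqlite3.connect(":memory:")

    def query(self, sql: str):
        return self.connection.execute(sql).fetchall()


print(LazyConnector().query("SELECT 1"))  # [(1,)]
```

Code paths that never construct the connector never pay for the import, which is the point of the pattern.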
428-430: No functional issues with dry_run, but fix credentials/scopes handling per design guidance. The `dry_run` method will work correctly: it inherits `SimpleConnector.dry_run`, which calls `self.connection.sql(sql)` on the ibis BigQuery connection. While the hybrid approach (ibis for dry_run, BigQuery client for query) creates unnecessary overhead by initializing an unused connection, it's functionally sound.

However, there's a more significant issue: credentials and scopes are created on every `query()` call (lines 438-446), not just in a retry path. Per design guidance, credentials with additional scopes should only be instantiated in the retry path, not during normal execution. Refactor to create credentials once at initialization or defer creation to a retry mechanism.
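One way to honor that guidance is to build credentials lazily and attach the broader scopes only on the path that needs them. A sketch with stand-in objects (all names here are hypothetical, not from the PR):

```python
class FakeCredentials:
    """Stand-in for a google.oauth2 credentials object."""

    def __init__(self, scopes=None):
        self.scopes = scopes or []

    def with_scopes(self, scopes):
        # Mirrors the real API: returns a new credentials object.
        return FakeCredentials(scopes)


class LazyCredentialConnector:
    def __init__(self, credential_factory):
        self._factory = credential_factory
        self._credentials = None  # not built at init time

    def credentials(self, extra_scopes=None):
        if self._credentials is None:
            self._credentials = self._factory()  # built on first use only
        if extra_scopes:
            # Broader scopes only on the path that needs them (e.g. a retry).
            return self._credentials.with_scopes(extra_scopes)
        return self._credentials


builds = []
conn = LazyCredentialConnector(lambda: builds.append(1) or FakeCredentials())
conn.credentials()
conn.credentials()
print(len(builds))  # 1: the factory ran once, lazily
```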
432-448: Add consistency note or verify BigQuery `.to_arrow()` handles precision. The concern about missing Decimal/UUID handling is architecturally valid, but the evidence suggests it's managed differently:

- BigQuery NUMERIC and BIGNUMERIC types are automatically supported in the python-bigquery library, and the direct `client.query().result().to_arrow()` call handles this conversion natively without explicit type mapping.
- The test `test_decimal_precision` in `test_query.py` confirms Decimal values (NUMERIC columns) return correctly as serialized strings.
- However, this differs from `SimpleConnector.query`, which explicitly applies `_handle_pyarrow_unsupported_type` to Ibis tables before converting to PyArrow. `CannerConnector` similarly includes this handling despite using custom logic.

Recommendation: Document why BigQuery bypasses this type handling (native `.to_arrow()` support) to clarify the architectural difference, or apply the same defensive pattern as `CannerConnector` for consistency and future maintainability.
BigQuery has been tested locally
Actionable comments posted: 0
♻️ Duplicate comments (1)
ibis-server/app/model/connector.py (1)
442-447: Reconsider unconditional broad credential scopes. The credentials are scoped with `drive` and `cloud-platform` for every query execution. Based on learnings, these broad scopes were previously reserved for a retry path when handling empty results with special types, not for normal execution. Creating credentials with broader-than-necessary scopes for every query violates the principle of least privilege and may have performance implications.

Consider one of the following approaches:

- If the retry logic for special types is no longer needed, document why these broad scopes are now required for all queries
- If these scopes are truly necessary, explain the requirement in a comment
- If the original scopes are sufficient, remove the additional scopes:
```diff
-        credentials = credentials.with_scopes(
-            [
-                "https://www.googleapis.com/auth/drive",
-                "https://www.googleapis.com/auth/cloud-platform",
-            ]
-        )
+        # Use default scopes from the service account
```
🧹 Nitpick comments (2)
ibis-server/app/model/connector.py (2)
448-449: Consider caching the BigQuery client for efficiency. A new BigQuery client is instantiated on every query call. This could be inefficient, as it involves credential setup and connection initialization overhead.
Consider caching the client as an instance variable:
```diff
+    def _get_client(self) -> bigquery.Client:
+        if not hasattr(self, "_client"):
+            credits_json = loads(
+                base64.b64decode(
+                    self.connection_info.credentials.get_secret_value()
+                ).decode("utf-8")
+            )
+            credentials = service_account.Credentials.from_service_account_info(
+                credits_json
+            )
+            credentials = credentials.with_scopes(
+                [
+                    "https://www.googleapis.com/auth/drive",
+                    "https://www.googleapis.com/auth/cloud-platform",
+                ]
+            )
+            self._client = bigquery.Client(credentials=credentials)
+        return self._client
+
     @tracer.start_as_current_span("connector_query", kind=trace.SpanKind.CLIENT)
     def query(self, sql: str, limit: int | None = None) -> pa.Table:
-        credits_json = loads(
-            base64.b64decode(
-                self.connection_info.credentials.get_secret_value()
-            ).decode("utf-8")
-        )
-        credentials = service_account.Credentials.from_service_account_info(
-            credits_json
-        )
-        credentials = credentials.with_scopes(
-            [
-                "https://www.googleapis.com/auth/drive",
-                "https://www.googleapis.com/auth/cloud-platform",
-            ]
-        )
-        client = bigquery.Client(credentials=credentials)
+        client = self._get_client()
         return client.query(sql).result(max_results=limit).to_arrow()
```

Also add cleanup in a `close` method:

```python
def close(self) -> None:
    """Close the BigQuery client and parent connection."""
    if hasattr(self, "_client"):
        self._client.close()
    super().close()
```
448-448: Consider specifying the project ID explicitly. The BigQuery client is instantiated without an explicit `project` parameter. While it may be inferred from the credentials, explicitly specifying it improves clarity and prevents potential ambiguity.

Extract and pass the project ID:

```diff
+        project_id = credits_json.get("project_id")
-        client = bigquery.Client(credentials=credentials)
+        client = bigquery.Client(credentials=credentials, project=project_id)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- ibis-server/app/model/connector.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: ci
Thanks @goldmedal
Description

With this change, the query time for `SELECT 1` can be reduced by about 1 second.

Summary by CodeRabbit