feat(wren): add standalone wren Python SDK package #1471
goldmedal merged 11 commits into Canner:main from
Conversation
Introduces a new top-level `wren/` package — a standalone SDK and CLI for the Wren semantic SQL layer that can be installed independently of ibis-server. Key components:

- `WrenEngine`: SQL transform (MDL → dialect SQL) + execute against 20+ data sources
- `wren.connector/`: per-connector files with lazy DB driver imports
- `wren.model/`: ConnectionInfo models, DataSource enum, WrenError
- `wren.mdl/`: wren-core-py SessionContext wrapper
- `wren.cli`: Typer CLI with `query`, `dry-run`, `transpile`, `validate` commands
- `pyproject.toml`: hatchling build, optional extras per connector (wren[postgres], etc.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use cursor for executemany() calls (psycopg v3 API requires it)
- Add pyarrow-hotfix to core dependencies
- Support --dev flag in `just install-extra`
- Add fmt alias in justfile

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs lint, unit tests, and postgres connector tests on PRs touching wren/, wren-core/, wren-core-py/, or wren-core-base/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be changed in the CodeRabbit settings.
📝 Walkthrough

Adds a new Wren Python package and CLI: a WrenEngine for manifest-based planning/transpilation/execution, a connector framework with many backend implementations, typed connection models and errors, test suites (unit and connector), packaging/automation, and a targeted GitHub Actions CI workflow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as "Client"
    participant Engine as "WrenEngine"
    participant MDL as "SessionContext (wren_core)"
    participant Connector as "Connector (e.g., Postgres/DuckDB)"
    participant DB as "Database"
    Client->>Engine: query(sql, limit)
    Engine->>MDL: transform_sql(sql)
    MDL-->>Engine: planned_sql
    Engine->>Engine: transpile(planned_sql) -> dialect_sql
    Engine->>Connector: query(dialect_sql, limit)
    Connector->>DB: execute(dialect_sql)
    DB-->>Connector: pa.Table
    Connector->>Connector: normalize types (Decimal/UUID/JSON)
    Connector-->>Engine: pa.Table
    Engine-->>Client: pa.Table
    Note over Engine,Connector: Exceptions at any step are wrapped as WrenError with ErrorPhase/metadata
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: 2 passed, 1 failed (warning)
The lint job only needs ruff, not the full dependency tree. Using uvx avoids the missing platform wheel error for wren-core-py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 19
🧹 Nitpick comments (22)
wren/.claude/CLAUDE.md (1)
7-18: Add language specifier to fenced code block. The directory structure code block lacks a language specifier. Per Markdown best practices, specify a language (e.g., `text` or `plaintext`) for syntax highlighting consistency.

Proposed fix

````diff
-```
+```text
 wren/
   src/wren/
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/.claude/CLAUDE.md` around lines 7 - 18, Update the fenced code block in CLAUDE.md to include a language specifier (e.g., replace the opening ``` with ```text or ```plaintext) so the directory tree (showing engine.py, cli.py, mdl/, connector/, model/data_source.py, model/error.py, tests/) gets proper syntax highlighting; locate the block that starts with ``` and the directory listing and change only the opening fence to include the specifier.

wren/pyproject.toml (1)
49-49: The `all` extra excludes `spark`. The `spark` extra is defined but not included in the `all` aggregate. If this is intentional (e.g., due to PySpark's size), consider adding a brief comment. Otherwise, include it for completeness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/pyproject.toml` at line 49, The 'all' extras entry currently omits the 'spark' extra; update the pyproject.toml so the all aggregate either includes "spark" alongside the other extras or add a brief inline comment next to the 'all' definition explaining why 'spark' is intentionally excluded (e.g., PySpark size/footprint), referencing the 'all' extras line and the existing 'spark' extra name so reviewers can see the intended behavior.

wren/tests/unit/test_engine.py (2)
121-127: Test doesn't verify close behavior for an initialized connector. The test name suggests it verifies connector cleanup, but since `e._connector` is never initialized (no `query()` or `dry_run()` call), the test only confirms that `_connector` remains `None` throughout. Consider adding an assertion that actually triggers connector initialization and verifies cleanup:

Proposed enhancement
```diff
 def test_context_manager_closes_connector() -> None:
-    conn_info = {"url": "/tmp", "format": "duckdb"}
+    import tempfile
+    with tempfile.TemporaryDirectory() as tmp_dir:
+        conn_info = {"url": tmp_dir, "format": "duckdb"}

-    with WrenEngine(_MANIFEST_STR, DataSource.duckdb, conn_info) as e:
-        assert e._connector is None  # connector is lazily initialized
+        with WrenEngine(_MANIFEST_STR, DataSource.duckdb, conn_info) as e:
+            assert e._connector is None  # connector is lazily initialized
+            # Optionally trigger initialization to test actual close:
+            # _ = e._get_connector()
+            # assert e._connector is not None

-    # After __exit__, internal state is cleaned up
-    assert e._connector is None
+        # After __exit__, internal state is cleaned up
+        assert e._connector is None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/tests/unit/test_engine.py` around lines 121 - 127, The test currently never initializes WrenEngine._connector so it only asserts None; update test_context_manager_closes_connector to trigger connector creation (call a method that lazily initializes it such as WrenEngine.query(...) or WrenEngine.dry_run()) inside the with block, assert e._connector is not None while inside the context, then after exiting the with (which invokes WrenEngine.__exit__), assert e._connector is None to verify cleanup; reference WrenEngine, test_context_manager_closes_connector, query/dry_run, __exit__, and _connector when making the change.
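As a minimal sketch of the initialize-then-verify-cleanup pattern this comment asks for, a toy context manager behaves the same way (this Engine is a hypothetical stand-in, not WrenEngine):

```python
class Engine:
    """Hypothetical stand-in illustrating lazy connector init and cleanup."""

    def __init__(self) -> None:
        self._connector = None

    def query(self) -> str:
        self._connector = object()  # lazily initialized on first use
        return "ok"

    def __enter__(self):
        return self

    def __exit__(self, *exc) -> None:
        self._connector = None  # cleanup on exit


with Engine() as e:
    assert e._connector is None      # nothing initialized yet
    e.query()
    assert e._connector is not None  # initialized after first use
assert e._connector is None          # cleaned up by __exit__
```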
10-10: Unused import. `math` is imported but never used in the test file.

Proposed fix

```diff
 import base64
-import math
 import orjson
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/tests/unit/test_engine.py` at line 10, Remove the unused import symbol "math" from the top of the test module (in test_engine.py) so the file no longer imports a module that isn't used; if you intended to use math in tests instead, add the actual usage (e.g., math.isclose) where appropriate, otherwise delete the import line.

wren/src/wren/connector/spark.py (1)
40-48: Consider logging suppressed exceptions in `close()`. Silently swallowing all exceptions during `close()` can hide important errors. Consider logging at debug/warning level to aid debugging while still ensuring cleanup completes.

Proposed fix

```diff
+from loguru import logger
+
 def close(self) -> None:
     if self._closed:
         return
     try:
         self.connection.stop()
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug(f"Exception during Spark session close: {e}")
     finally:
         self._closed = True
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/spark.py` around lines 40 - 48, The close method currently swallows all exceptions from self.connection.stop(), which should be logged; update close(self) in the class to catch Exception as e (around the self.connection.stop() call), log the exception with a debug or warning level (including the exception object and traceback) using the module/class logger (or create one if missing), and still set self._closed = True in the finally block so cleanup always completes; keep the existing early-return when self._closed is already True.

wren/tests/connectors/test_duckdb.py (1)
35-48: Return type annotation mismatch for generator fixture. The fixture uses `yield` but declares `-> WrenEngine`. For a yielding fixture, the type hint should be `Generator[WrenEngine, None, None]` or `Iterator[WrenEngine]`.

Also, the comment on line 40 states "1500 orders, 150 customers" but per the `WrenQueryTestSuite` defaults (`order_count=15000`, `customer_count=1500` for sf=0.01), this should be "15000 orders, 1500 customers".

Proposed fix

```diff
+from collections.abc import Generator
+
 class TestDuckDB(WrenQueryTestSuite):
     manifest = make_tpch_manifest(table_catalog=_CATALOG, table_schema=_SCHEMA)
     # DuckDB TPCH dbgen produces INTEGER as int64 in Arrow
     order_id_dtype = "int64"

     @pytest.fixture(scope="class")
-    def engine(self, tmp_path_factory) -> WrenEngine:  # type: ignore[override]
+    def engine(self, tmp_path_factory) -> Generator[WrenEngine, None, None]:  # type: ignore[override]
         db_dir = tmp_path_factory.mktemp("duckdb")
         db_path = db_dir / "tpch.duckdb"
-        # Generate TPCH sf=0.01 (1500 orders, 150 customers) into the file.
+        # Generate TPCH sf=0.01 (15000 orders, 1500 customers) into the file.
         con = duckdb.connect(str(db_path))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/tests/connectors/test_duckdb.py` around lines 35 - 48, The engine pytest fixture currently declares a non-generator return type WrenEngine but uses yield; change its annotation to a generator type such as Generator[WrenEngine, None, None] or Iterator[WrenEngine] on the def engine(...) signature, and update the in-code comment that describes TPCH generation in the engine fixture from "1500 orders, 150 customers" to the correct "15000 orders, 1500 customers" per the WrenQueryTestSuite defaults; locate the changes in the engine fixture and adjust imports if needed (typing.Generator or typing.Iterator).

.github/workflows/wren-ci.yml (1)
64-73: Consider extracting wren-core-py build steps into a composite action. The wren-core-py wheel build steps (lines 64-69, 100-105) are duplicated across the `test-unit` and `test-postgres` jobs. As the number of connector test jobs grows, consider extracting this into a reusable composite action to reduce maintenance overhead.

Also applies to: 100-109
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/wren-ci.yml around lines 64 - 73, Duplicate step sequence that builds the wren-core-py wheel and installs deps (the steps named "Build wren-core-py wheel", "Install dependencies" and any identical blocks in the `test-unit` and `test-postgres` jobs) should be extracted into a reusable composite action; create a composite action (e.g., .github/actions/build-wren-core-py/action.yml) that runs the pipx/poetry/maturin build and the uv sync install, accept any inputs needed (like working-directory or extra flags), then replace the duplicated step blocks in both jobs with a single uses: ./github/actions/build-wren-core-py (or appropriate path) invocation and wire any inputs/outputs through the action so downstream steps (like "Run unit tests") remain unchanged.

wren/src/wren/connector/canner.py (1)
37-49: Consider reusing base class helpers for type handling. This duplicates logic from `IbisConnector._round_decimal_columns` and `_cast_uuid_columns`. If `CannerConnector` extended `IbisConnector` (or a shared mixin), you could call the existing helpers instead of reimplementing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/canner.py` around lines 37 - 49, The _handle_pyarrow_unsupported_type method in CannerConnector duplicates decimal and UUID handling already implemented as IbisConnector._round_decimal_columns and IbisConnector._cast_uuid_columns; change CannerConnector to reuse those helpers instead of reimplementing: have CannerConnector inherit from IbisConnector or import the shared mixin, then replace the loop in _handle_pyarrow_unsupported_type with calls to IbisConnector._round_decimal_columns(ibis_table, ...) and IbisConnector._cast_uuid_columns(...) (or the equivalent mixin methods) so decimals and UUIDs are processed by the existing helpers, returning the final transformed Table.

wren/src/wren/connector/bigquery.py (2)
16-20: Minor typo: `credits_json` → `creds_json`. The variable name appears to be a typo: "credits" should likely be "creds" (short for credentials).

✏️ Suggested fix

```diff
-    credits_json = loads(
+    creds_json = loads(
         base64.b64decode(connection_info.credentials.get_secret_value()).decode(
             "utf-8"
         )
     )
     credentials = service_account.Credentials.from_service_account_info(
-        credits_json
+        creds_json
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/bigquery.py` around lines 16 - 20, Rename the mistakenly named variable credits_json to creds_json in the BigQuery connector: update the assignment statement that decodes and loads the base64 credentials (the variable currently declared as credits_json) and replace all subsequent references to that variable within the same scope (e.g., anywhere creds_json is used after the loads(...) call) to use creds_json so the name matches "credentials" shorthand; ensure no other variables collide and run tests/imports to verify no unresolved references.
24-29: Consider deferring OAuth scope expansion to retry path. The current implementation adds drive and cloud-platform scopes eagerly during initialization. Based on learnings from a previous PR, credentials with additional scopes should only be created in the retry path when handling empty results with special types, not during initialization, to maintain lazy initialization.
If this connector needs to handle Google Sheets or other drive-backed tables, consider moving scope expansion to a retry mechanism when the initial query returns empty results for special types.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/bigquery.py` around lines 24 - 29, The code currently calls credentials.with_scopes([...]) eagerly during connector initialization (the credentials variable in bigquery.py); remove that eager scope expansion and keep the original credentials untouched at init. Instead, perform credentials.with_scopes(...) only in the retry path that handles empty results for special Drive-backed types (the method that detects empty query results and retries, e.g., the connector's retry/query handling function), using the same scopes ["https://www.googleapis.com/auth/drive","https://www.googleapis.com/auth/cloud-platform"] just before re-creating the BigQuery client for the retry attempt.

wren/src/wren/mdl/__init__.py (1)
8-17: Consider bounded cache for long-running processes. `@cache` provides unbounded memoization. For CLI usage this is fine, but if this module is used in a long-running server with many different manifests, consider `@lru_cache(maxsize=N)` to prevent unbounded memory growth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/mdl/__init__.py` around lines 8 - 17, The get_session_context function is decorated with functools.cache which is unbounded; replace it with functools.lru_cache(maxsize=...) (or configurable maxsize) to prevent unbounded memory growth in long-running processes: update the decorator on get_session_context, add/import lru_cache from functools, and choose a reasonable maxsize (or read from config/env) so SessionContext instances are memoized but capped.

wren/src/wren/connector/ibis.py (2)
29-29: Style nit: prefer `!=` over `not ... ==`. `if not e.error_name == "EXCEEDED_TIME_LIMIT"` is clearer as `if e.error_name != "EXCEEDED_TIME_LIMIT"`. Same applies to line 46.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/ibis.py` at line 29, Replace the awkward "not ... == ..." style checks with "!=" comparisons: change the condition in the exception handling where the code uses if not e.error_name == "EXCEEDED_TIME_LIMIT" to use if e.error_name != "EXCEEDED_TIME_LIMIT" (also update the same pattern found later in the file around the other exception branch); look for the exception handler referencing e.error_name in ibis.py to make both replacements.
99-103: Simplify redundant conditional. Lines 101-102 can be simplified since the condition explicitly returns the same type.

✏️ Suggested simplification

```diff
 def create_connector(data_source: DataSource, connection_info) -> IbisConnector:
     cls = _DATA_SOURCE_TO_CLASS.get(data_source, IbisConnector)
-    if cls is IbisConnector:
-        return IbisConnector(data_source, connection_info)
-    return cls(connection_info)
+    return cls(data_source, connection_info) if cls is IbisConnector else cls(connection_info)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/ibis.py` around lines 99 - 103, The create_connector function can be simplified by removing the explicit IbisConnector constructor and instantiating cls with the correct arguments: if cls is IbisConnector call cls(data_source, connection_info) else call cls(connection_info); implement this by building args = (data_source, connection_info) when cls is IbisConnector and args = (connection_info,) otherwise, then return cls(*args). Update the function that uses _DATA_SOURCE_TO_CLASS, IbisConnector, DataSource and connection_info accordingly.

wren/src/wren/connector/mysql.py (1)
14-51: Extract duplicate `_handle_pyarrow_unsupported_type` logic. `MySqlConnector` and `DorisConnector` have identical implementations of `_handle_pyarrow_unsupported_type`. Consider:

- Adding JSON handling to the base `IbisConnector._handle_pyarrow_unsupported_type`, or
- Creating a shared mixin/helper that both connectors use.

♻️ Option: Extend base class and only add JSON handling

```diff
 class MySqlConnector(IbisConnector):
     def __init__(self, connection_info):
         super().__init__(DataSource.mysql, connection_info)

     def _handle_pyarrow_unsupported_type(self, ibis_table: Table, **kwargs) -> Table:
-        result_table = ibis_table
-        for name, dtype in ibis_table.schema().items():
-            if isinstance(dtype, Decimal):
-                result_table = self._round_decimal_columns(
-                    result_table=result_table, col_name=name, **kwargs
-                )
-            elif isinstance(dtype, UUID):
-                result_table = self._cast_uuid_columns(
-                    result_table=result_table, col_name=name
-                )
-            elif isinstance(dtype, dt.JSON):
-                result_table = result_table.mutate(
-                    **{name: result_table[name].cast("string")}
-                )
-        return result_table
+        # Let base class handle Decimal and UUID
+        result_table = super()._handle_pyarrow_unsupported_type(ibis_table, **kwargs)
+        # Additionally handle JSON for MySQL/Doris
+        for name, dtype in result_table.schema().items():
+            if isinstance(dtype, dt.JSON):
+                result_table = result_table.mutate(
+                    **{name: result_table[name].cast("string")}
+                )
+        return result_table
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/mysql.py` around lines 14 - 51, MySqlConnector and DorisConnector duplicate the same _handle_pyarrow_unsupported_type logic; move the shared logic into IbisConnector._handle_pyarrow_unsupported_type (handling Decimal via _round_decimal_columns, UUID via _cast_uuid_columns, and dt.JSON -> cast to string) and have MySqlConnector and DorisConnector call or inherit that base implementation (override only if connector-specific behavior is needed). Alternatively, extract the shared code into a small mixin/helper function (e.g., handle_pyarrow_unsupported_types) that uses ibis_table.schema(), Decimal, UUID, dt.JSON and the existing _round_decimal_columns/_cast_uuid_columns helpers, then have both MySqlConnector and DorisConnector delegate to that single implementation to remove duplication.

wren/src/wren/connector/duckdb.py (1)
57-62: Use `elif` instead of separate `if` statements. Using separate `if` statements means multiple branches could execute if the connection_info happens to be an instance of multiple types (e.g., via inheritance). Use `elif` to ensure mutual exclusivity.

♻️ Proposed fix

```diff
 if isinstance(connection_info, S3FileConnectionInfo):
     _init_duckdb_s3(self.connection, connection_info)
-if isinstance(connection_info, MinioFileConnectionInfo):
+elif isinstance(connection_info, MinioFileConnectionInfo):
     _init_duckdb_minio(self.connection, connection_info)
-if isinstance(connection_info, GcsFileConnectionInfo):
+elif isinstance(connection_info, GcsFileConnectionInfo):
     _init_duckdb_gcs(self.connection, connection_info)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/duckdb.py` around lines 57 - 62, The three separate conditionals checking connection_info should be made mutually exclusive to avoid multiple initializers running; change the sequential ifs to an if / elif / elif chain so only one branch executes for S3FileConnectionInfo, MinioFileConnectionInfo, or GcsFileConnectionInfo and call the corresponding initializer (_init_duckdb_s3, _init_duckdb_minio, _init_duckdb_gcs) with self.connection and connection_info.

wren/src/wren/connector/base.py (1)
55-61: Potential precision loss with fixed decimal parameters. Hardcoding `precision=38` and `scale=9` may truncate high-precision decimal values from databases that support larger scales (e.g., BigQuery supports scale up to 38). Consider making this configurable or preserving original precision where possible.

Consider preserving original precision when feasible:

```python
def _round_decimal_columns(
    self, result_table: Table, col_name: str, scale: int = 9
) -> Table:
    col = result_table[col_name]
    original_dtype = result_table.schema()[col_name]
    # Preserve original precision if within PyArrow limits
    precision = min(original_dtype.precision or 38, 38)
    decimal_type = Decimal(precision=precision, scale=scale)
    rounded_col = col.cast(decimal_type).round(scale)
    return result_table.mutate(**{col_name: rounded_col})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/base.py` around lines 55 - 61, The _round_decimal_columns method currently hardcodes Decimal(precision=38, scale=9) which can truncate higher-precision values; change it to derive precision from the column's original dtype (fetch original_dtype = result_table.schema()[col_name] or equivalent), use that precision when present (cap it to the supported maximum, e.g., 38) or fall back to a configurable default, then construct Decimal(precision=derived_precision, scale=scale) and cast/round using that type; ensure the change is applied in the _round_decimal_columns function and consider adding a configurable default precision on the connector class if needed.

wren/src/wren/cli.py (1)
150-175: `transpile` command doesn't close the engine. Unlike `query`, `dry_run`, and `validate`, the `transpile` command creates an engine but never calls `engine.close()`. While transpile doesn't use a connector, future changes could leak resources.

♻️ Proposed fix using context manager

```diff
 # For transpile we don't need real connection_info — pass a dummy dict
-engine = WrenEngine(manifest_str=manifest_str, data_source=ds, connection_info={})
-try:
-    result = engine.transpile(sql)
-    typer.echo(result)
-except Exception as e:
-    typer.echo(f"Error: {e}", err=True)
-    raise typer.Exit(1)
+with WrenEngine(manifest_str=manifest_str, data_source=ds, connection_info={}) as engine:
+    try:
+        result = engine.transpile(sql)
+        typer.echo(result)
+    except Exception as e:
+        typer.echo(f"Error: {e}", err=True)
+        raise typer.Exit(1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/cli.py` around lines 150 - 175, The transpile command creates a WrenEngine instance (WrenEngine in function transpile) but never calls engine.close(), risking resource leaks; update transpile to ensure engine.close() is always invoked (use a with/context manager on WrenEngine if it implements __enter__/__exit__, or wrap engine usage in try/finally and call engine.close() in finally) after creating engine = WrenEngine(manifest_str=manifest_str, data_source=ds, connection_info={}); keep existing manifest loading (_load_manifest) and DataSource handling, and ensure exceptions still result in typer.Exit(1) while guaranteeing engine.close() runs.

wren/tests/suite/query.py (1)
135-137: `manifest_str` fixture defined but not used by subclass engine fixtures. The `manifest_str` fixture is defined here, but both `TestDuckDB` and `TestPostgres` (per relevant snippets) rebuild `manifest_str` inline within their `engine` fixtures rather than requesting it as a parameter. This fixture is effectively dead code.

Either remove the unused fixture or update subclass fixtures to depend on it for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/tests/suite/query.py` around lines 135 - 137, The manifest_str pytest fixture is defined but never consumed by subclass fixtures; update the TestDuckDB and TestPostgres engine fixtures to accept manifest_str as a parameter and use that value instead of rebuilding it inline (or remove the unused manifest_str fixture). Specifically, modify the engine fixtures in the TestDuckDB and TestPostgres classes to include manifest_str in their signature and replace any inline base64.b64encode(orjson.dumps(self.manifest)).decode() calls with the provided manifest_str, ensuring the manifest_str fixture is actually used; if you prefer removing duplication, delete the manifest_str fixture instead of changing the subclasses.

wren/src/wren/model/data_source.py (2)
11-13: Top-level imports of `boto3` and `ibis` may slow startup for non-AWS/non-Ibis use cases. These imports happen at module load time. For CLI usage that only needs transpilation (no DB connection), this adds unnecessary overhead.

Consider lazy imports within the methods that need them, following the pattern used elsewhere in this codebase (e.g., `from duckdb import ...  # noqa: PLC0415` in duckdb.py).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/model/data_source.py` around lines 11 - 13, Top-level imports of boto3, ibis, and BaseBackend slow startup; change them to lazy imports inside the functions/classes that actually use AWS or Ibis functionality (replace "import boto3", "import ibis", and "from ibis import BaseBackend" with in-function imports where needed). Locate call sites that depend on boto3 or ibis (search for usages of boto3, ibis, and BaseBackend in this module) and add local imports at the start of those functions/methods (follow the existing project pattern used in duckdb.py, e.g., local import with a noqa comment if necessary) so CLI/transpile paths that don't touch DB/AWS avoid importing those heavy libs at module load.
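A sketch of the lazy-import pattern suggested above; `make_s3_client` is a hypothetical helper, and `boto3` is only imported when the function actually runs, so module load stays cheap:

```python
def make_s3_client(region: str):
    # Lazy import: boto3 loads only when an AWS code path is actually hit
    import boto3  # noqa: PLC0415

    return boto3.client("s3", region_name=region)

# Merely defining the helper does not import boto3.
```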
287-289: Minor typo: `credits_json` should likely be `creds_json`. The variable name `credits_json` (implying monetary credits) seems like a typo for `creds_json` (credentials JSON).

✏️ Proposed rename

```diff
-    credits_json = loads(
+    creds_json = loads(
         base64.b64decode(info.credentials.get_secret_value()).decode("utf-8")
     )
     credentials = service_account.Credentials.from_service_account_info(
-        credits_json
+        creds_json
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/model/data_source.py` around lines 287 - 289, Rename the mistakenly named variable credits_json to creds_json in the data source parsing code: update the assignment that decodes base64 and loads JSON (the expression using base64.b64decode(info.credentials.get_secret_value()).decode("utf-8") passed to loads) and update all subsequent references in the same function/method to use creds_json instead of credits_json, ensuring there are no name collisions and tests/imports still pass.

wren/src/wren/engine.py (1)
166-173: Silent fallback on manifest extraction failure hides potential issues. When `extractor.resolve_used_table_names` or `extractor.extract_by` fails, the exception is silently caught and the full manifest is used. This could mask configuration or MDL issues.

Consider logging the fallback for debugging:

```diff
 try:
     # Extract minimal manifest for the query
     extractor = get_manifest_extractor(self.manifest_str)
     tables = extractor.resolve_used_table_names(sql)
     manifest = extractor.extract_by(tables)
     effective_manifest = to_json_base64(manifest)
-except Exception:
+except Exception as e:
+    from loguru import logger
+    logger.debug(f"Manifest extraction failed, using full manifest: {e}")
     effective_manifest = self.manifest_str
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/engine.py` around lines 166 - 173, The try/except around manifest extraction (get_manifest_extractor, extractor.resolve_used_table_names, extractor.extract_by, to_json_base64) silently falls back to self.manifest_str; update the except block to log the exception and context before using the full manifest so failures aren't hidden: capture the exception object, call the module/class logger to emit a clear message including the exception details and which SQL/manifest caused the fallback, then set effective_manifest = self.manifest_str as before.

wren/src/wren/connector/databricks.py (1)
53-55: Dry-run SQL wrapping may fail with CTEs or complex queries. Wrapping SQL as `SELECT * FROM ({sql}) AS sub LIMIT 0` will fail if `sql` contains CTEs (`WITH` clauses), multiple statements, or certain DDL. This differs from how other connectors handle dry-run.

Consider using Databricks' `EXPLAIN` to validate SQL without execution, similar to how other enterprise databases handle query validation:

```python
def dry_run(self, sql: str) -> None:
    with closing(self.connection.cursor()) as cursor:
        cursor.execute(f"EXPLAIN {sql}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/databricks.py` around lines 53 - 55, The dry_run implementation in dry_run(self, sql: str) wraps the query in a SELECT subquery which breaks for CTEs, multiple statements or DDL; change it to validate the SQL using Databricks' EXPLAIN instead of wrapping (i.e., execute "EXPLAIN {sql}" via the same cursor in dry_run) so CTEs and complex queries are accepted without executing data-changing statements; keep using closing(self.connection.cursor()) and ensure you only call cursor.execute with the EXPLAIN-prefixed SQL in the dry_run function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wren/src/wren/cli.py`:
- Around line 202-206: The current JSON branch contains a dead loop (for batch
in table.to_batches(): for row in batch.to_pydict(): pass); remove the no-op
loops and replace them with actual JSON-lines emission: iterate
table.to_batches(), for each batch call batch.to_pydict() and write each row as
a JSON string (e.g., via json.dumps) followed by newline to the output stream.
Update the code around the output == "json" branch to import/use json and ensure
rows from batch.to_pydict() are serialized and emitted instead of being passed
over.
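The row-emission fix described above can be sketched with the stdlib alone; `batches` here is a hypothetical stand-in for what `batch.to_pydict()` yields per Arrow record batch (a column-name to column-values mapping):

```python
import io
import json

# Hypothetical stand-in for [batch.to_pydict() for batch in table.to_batches()]
batches = [{"id": [1, 2], "name": ["a", "b"]}]

out = io.StringIO()
for cols in batches:
    # Pivot the columnar dict into per-row dicts and emit one JSON object per line
    for row in zip(*cols.values()):
        out.write(json.dumps(dict(zip(cols.keys(), row))) + "\n")

lines = out.getvalue().splitlines()
```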
In `@wren/src/wren/connector/duckdb.py`:
- Around line 67-70: The query method currently fetches the full Arrow table
then slices it; change it to push the limit into DuckDB SQL instead by modifying
query(sql: str, limit: int | None = None) to, when limit is provided, wrap or
append a LIMIT clause to the SQL (e.g., add "LIMIT {limit}" or "SELECT * FROM
({sql}) LIMIT {limit}") before calling
self.connection.execute(...).fetch_arrow_table(), so the database applies the
limit and only the needed rows are returned; ensure you use the existing
connection.execute and fetch_arrow_table calls and handle None limit by leaving
the SQL unchanged.
- Around line 16-23: The SQL string in _init_duckdb_s3 (and similarly
_init_duckdb_minio and _init_duckdb_gcs) interpolates secret values via
f-strings, risking SQL injection or syntax errors if a secret contains a single
quote; fix by not directly injecting raw secrets: retrieve each secret via
get_secret_value() into local variables, safely escape single quotes by doubling
them (or use the DB driver's parameterized execution if supported), then
construct the CREATE SECRET statement using the escaped values (or parameter
placeholders) and call connection.execute with the safe values; apply the same
change to _init_duckdb_minio and _init_duckdb_gcs.
- Around line 80-88: The ATTACH call in the db_files loop
(self.connection.execute) interpolates raw filenames, risking SQL injection;
instead validate and sanitize each path and alias before building the SQL:
ensure the file exists and is an allowed/absolute path (use
pathlib.Path.resolve() and check it is inside your data directory), create a
safe alias from os.path.basename(file) by removing/normalizing any characters
other than letters/digits/underscore (e.g., regexp replace to [A-Za-z0-9_]), and
escape single quotes in the resolved path by doubling them before interpolation;
keep the existing exception handling (WrenError / ErrorCode.ATTACH_DUCKDB_ERROR)
but perform these validations in the same loop before calling
self.connection.execute.
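Both DuckDB injection fixes above reduce to two small helpers; this is a sketch under the assumption that values are interpolated as single-quoted SQL literals (`quote_sql_literal` and `safe_attach_parts` are hypothetical names, and real code should additionally verify the file exists):

```python
import re
from pathlib import Path

def quote_sql_literal(value: str) -> str:
    """Standard SQL escaping: double embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def safe_attach_parts(file: str, data_dir: str) -> tuple[str, str]:
    """Resolve the db file path, refuse paths outside data_dir, and
    derive an alias containing only [A-Za-z0-9_]."""
    resolved = Path(file).resolve()
    if not resolved.is_relative_to(Path(data_dir).resolve()):
        raise ValueError(f"{file} escapes {data_dir}")
    alias = re.sub(r"[^A-Za-z0-9_]", "_", resolved.stem)
    return str(resolved).replace("'", "''"), alias

stmt = "CREATE SECRET s3 (TYPE S3, SECRET " + quote_sql_literal("k'ey") + ");"
path, alias = safe_attach_parts("/tmp/data/my-db.duckdb", "/tmp/data")
try:
    safe_attach_parts("/etc/passwd", "/tmp/data")
    rejected = False
except ValueError:
    rejected = True
```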
In `@wren/src/wren/connector/factory.py`:
- Around line 49-54: The install hint currently uses data_source.value directly
in the WrenError raised in factory.py (within the except ImportError block),
which can recommend non-existent extras (e.g., 'doris'); add a mapping (e.g.,
EXTRA_MAP keyed by the data source enum or name) that maps aliases to the actual
optional extra (for example map "doris" -> "mysql"), then use that mapped_extra
in the error message (instead of data_source.value) when composing the pip
install suggestion; update the WrenError call (and ErrorCode.NOT_IMPLEMENTED
usage) to reference the mapped name and ensure the mapping covers known aliases.
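The extras mapping the comment asks for might look like this sketch (only the `doris` → `mysql` alias is named in the review; any other entries would be project-specific):

```python
# Data sources that ship inside another package extra map to that extra;
# anything absent from the map falls back to its own name.
EXTRA_MAP = {"doris": "mysql"}  # Doris reuses the MySQL driver

def install_hint(data_source_value: str) -> str:
    extra = EXTRA_MAP.get(data_source_value, data_source_value)
    return f"pip install 'wren[{extra}]'"

doris_hint = install_hint("doris")
pg_hint = install_hint("postgres")
```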
In `@wren/src/wren/connector/ibis.py`:
- Around line 23-39: The query method in ibis.py currently only handles
trino.exceptions.TrinoQueryError, WrenError, and TimeoutError, leaving other
exceptions unwrapped; add a catch-all except Exception as e at the end of
Connector.query to mirror PostgresConnector behavior and wrap any unexpected
exception into a WrenError (use ErrorCode.INVALID_SQL, include str(e),
phase=ErrorPhase.SQL_EXECUTION, and metadata={DIALECT_SQL: sql}) so all errors
from query(...) (the query method) are consistently wrapped; keep the existing
TrinoQueryError special-case and the pass-through for WrenError and
TimeoutError.
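The uniform wrapping described above might be sketched as follows (`WrenError` here is a minimal stand-in for `wren.model.error.WrenError`, and the error-code string is an assumption):

```python
class WrenError(Exception):
    """Minimal stand-in for wren.model.error.WrenError."""
    def __init__(self, code, message, metadata=None):
        super().__init__(message)
        self.code = code
        self.metadata = metadata or {}

def query(sql, run):
    try:
        return run(sql)
    except (WrenError, TimeoutError):
        raise  # already meaningful; pass straight through
    except Exception as e:
        # anything unexpected becomes a WrenError, mirroring PostgresConnector
        raise WrenError("INVALID_SQL", str(e), metadata={"dialect_sql": sql}) from e

def _boom(sql):
    raise RuntimeError("bad column")

try:
    query("SELECT x", _boom)
    wrapped = None
except WrenError as e:
    wrapped = e
```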
In `@wren/src/wren/connector/mssql.py`:
- Around line 98-105: wrap the body of _describe_sql_for_error_message in a
try/except so any exception from sge.convert, building/executing describe_sql,
or self.connection.raw_sql is caught and does not propagate; on exception return
a safe fallback string like "Unknown reason" (and optionally log the exception
if a logger is available on self) instead of letting the helper raise and mask
the original dry-run error—ensure you still attempt the original logic
(sge.convert(...).sql("mssql"), building describe_sql, using
closing(self.connection.raw_sql(...))) but guard the whole sequence with
exception handling.
- Around line 77-78: The function _flatten_pagination_limit currently swallows
exceptions by returning a string like f"Error: {e!s}", which then becomes
invalid SQL; change the error handling to propagate the original exception
instead of returning an error string—either re-raise the caught exception
(raise) or raise a new descriptive exception (e.g., RuntimeError) including the
original exception information (use from e) so callers like the SQL execution
path receive an exception rather than malformed SQL; update the except block in
_flatten_pagination_limit accordingly.
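A sketch of the re-raise approach this comment suggests (names are hypothetical stand-ins; the key point is `raise ... from e` instead of returning an error string that downstream code would execute as SQL):

```python
def flatten_pagination_limit(sql: str, parse) -> str:
    try:
        return parse(sql)
    except Exception as e:
        # Propagate instead of returning f"Error: {e!s}", which callers
        # would otherwise send to the database as malformed SQL.
        raise RuntimeError(f"failed to flatten pagination for {sql!r}") from e

def _bad_parse(sql):
    raise ValueError("unbalanced parens")

try:
    flatten_pagination_limit("SELECT 1", _bad_parse)
    err = None
except RuntimeError as e:
    err = e
```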
In `@wren/src/wren/connector/redshift.py`:
- Around line 46-54: The query method currently fetches all rows then truncates,
which is inefficient; update the query(self, sql: str, limit: int | None = None)
implementation to apply the limit in the SQL sent to the database when limit is
not None (e.g., append a safe "LIMIT {limit}" clause or wrap the original SQL)
before calling cursor.execute so the DB returns only the requested rows; keep
the rest of the logic that builds cols, rows and converts to a pandas DataFrame
and pa.Table (references: query, cursor.execute, cursor.description,
pd.DataFrame, pa.Table.from_pandas).
- Around line 31-37: The port passed to redshift_connector.connect() is a string
(connection_info.port.get_secret_value()) but must be an int; update the call in
the redshift connector where self.connection is assigned (the
redshift_connector.connect(...) invocation) to convert the port to an integer
using int(...) around connection_info.port.get_secret_value(), matching the
pattern used in other connectors.
In `@wren/src/wren/model/__init__.py`:
- Around line 267-289: The ConnectionInfo union (symbol: ConnectionInfo) is
missing DatabricksServicePrincipalConnectionInfo which causes validation to fail
for service-principal Databricks connections; update the ConnectionInfo union to
include DatabricksServicePrincipalConnectionInfo alongside
DatabricksTokenConnectionInfo so the union accepts both connection types (locate
where ConnectionInfo is defined in __init__.py and add
DatabricksServicePrincipalConnectionInfo to the listed alternatives).
- Around line 127-134: DorisConnectionInfo defines port as SecretStr while other
connection classes use SecretPort (which normalizes ints); change the port field
in DorisConnectionInfo to use SecretPort instead of SecretStr so integer ports
are normalized to strings. Update the Field declaration for the port attribute
in the DorisConnectionInfo class to use SecretPort (keeping examples and any
defaults intact) so behavior matches other connection models.
- Around line 253-259: The GcsFileConnectionInfo.credentials field is typed as
SecretStr but given a default of None, causing a type mismatch; update the field
to allow None by changing its type to Optional[SecretStr] (or SecretStr | None)
and import typing.Optional (or use PEP 604 union) so the default=None is valid,
or alternatively remove the default=None if credentials must be required; locate
the credentials attribute in class GcsFileConnectionInfo to apply the change.
In `@wren/src/wren/model/data_source.py`:
- Around line 79-82: The isinstance(data, ConnectionInfo) check is invalid
because ConnectionInfo is a typing.Union, not a class; update the branch in the
method that currently does "if isinstance(data, ConnectionInfo):" (the same
block that calls self._build_connection_info(data) when false) to detect
concrete types instead: either check for dict with "if isinstance(data, dict):
info = self._build_connection_info(data)" or import and use a tuple of the
actual Pydantic model classes used in the ConnectionInfo union (e.g., "if
isinstance(data, (ConcreteModelA, ConcreteModelB)): info = data"); ensure you
reference the concrete model class names you see in the Union and replace the
isinstance(...) call accordingly so the branch no longer uses the Union type.
- Around line 438-444: get_databricks_connection currently only supports
DatabricksTokenConnectionInfo and will crash when given a
DatabricksServicePrincipalConnectionInfo (built by _build_connection_info)
because it accesses access_token; update get_databricks_connection to accept
either DatabricksTokenConnectionInfo or DatabricksServicePrincipalConnectionInfo
and branch on databricks auth type (or detect presence of
client_id/client_secret/tenant_id) to call ibis.databricks.connect with either
access_token or service principal params, mirroring the logic in
DatabricksConnector (see wren/src/wren/connector/databricks.py) or alternatively
refactor DataSourceExtension.get_connection to delegate Databricks connections
to DatabricksConnector so both token and service principal flows are handled
correctly.
- Around line 358-373: Add an explanatory comment above the per-connection
override in get_doris_connection explaining why we set
connection.con.get_autocommit = lambda: True: describe that Doris does not
reflect SERVER_STATUS_AUTOCOMMIT in the MySQL handshake so MySQLdb's
get_autocommit() remains False even after autocommit(True), causing
ibis.raw_sql() to wrap queries in BEGIN/ROLLBACK (which Doris rejects); note
this is an instance-level workaround on the connection returned by
get_doris_connection (not a global patch to MySQLdb) and reference the relevant
symbols (get_doris_connection, connection.con.get_autocommit, ibis.raw_sql,
SERVER_STATUS_AUTOCOMMIT) so future maintainers understand the rationale.
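The concrete-type check for the `ConnectionInfo` union can be sketched with a hypothetical two-member union; `typing.get_args` turns the Union into the tuple that `isinstance` accepts:

```python
from dataclasses import dataclass
from typing import Union, get_args

@dataclass
class PostgresConnectionInfo:
    host: str

@dataclass
class MySqlConnectionInfo:
    host: str

# Hypothetical stand-in for the real, much larger ConnectionInfo union
ConnectionInfo = Union[PostgresConnectionInfo, MySqlConnectionInfo]

def coerce(data):
    # isinstance(data, ConnectionInfo) raises TypeError on most CPython
    # releases (a typing.Union is not a class); check the members instead.
    if isinstance(data, get_args(ConnectionInfo)):
        return data
    return PostgresConnectionInfo(**data)  # stand-in for _build_connection_info

info = coerce({"host": "localhost"})
again = coerce(info)
```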
In `@wren/src/wren/model/error.py`:
- Around line 52-65: The constructor accepts a cause but never preserves it;
modify the Error class __init__ to assign the cause to an instance attribute
(e.g., self.cause = cause) and, if cause is not None, set self.__cause__ = cause
to preserve exception chaining; keep the existing super().__init__(message) call
(or call it before setting __cause__) so the original message is preserved while
the underlying exception context is retained.
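The cause-preserving constructor amounts to a few lines; a minimal sketch (real WrenError carries more fields such as code and phase):

```python
class WrenError(Exception):
    """Keep both a .cause attribute and Python's native exception
    chaining (__cause__) on the wrapper error."""
    def __init__(self, message, cause=None):
        super().__init__(message)  # preserve the original message
        self.cause = cause
        if cause is not None:
            self.__cause__ = cause  # enables "direct cause" tracebacks

original = ValueError("bad port")
err = WrenError("connection failed", cause=original)
```

Setting `__cause__` means tracebacks print the underlying error under "The above exception was the direct cause of the following exception", the same effect as `raise ... from cause`.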
In `@wren/tests/suite/query.py`:
- Around line 127-129: The hardcoded constants order_count, customer_count (and
their inline comments) are inconsistent with TPCH sf=0.01 and appear off by a
factor of 10; update the values in the class so they match actual TPCH sf=0.01
outputs (e.g., set order_count to 1500 and customer_count to 150 if your tests
use 1500 orders / 150 customers) and fix the inline comment math to reflect the
correct base counts and multiplication (reference the symbols order_count,
customer_count, and order_id_dtype) so the comment shows the correct calculation
for TPCH sf=0.01.
---
Nitpick comments:
In @.github/workflows/wren-ci.yml:
- Around line 64-73: Duplicate step sequence that builds the wren-core-py wheel
and installs deps (the steps named "Build wren-core-py wheel", "Install
dependencies" and any identical blocks in the `test-unit` and `test-postgres`
jobs) should be extracted into a reusable composite action; create a composite
action (e.g., .github/actions/build-wren-core-py/action.yml) that runs the
pipx/poetry/maturin build and the uv sync install, accept any inputs needed
(like working-directory or extra flags), then replace the duplicated step blocks
in both jobs with a single uses: ./.github/actions/build-wren-core-py (or
appropriate path) invocation and wire any inputs/outputs through the action so
downstream steps (like "Run unit tests") remain unchanged.
In `@wren/.claude/CLAUDE.md`:
- Around line 7-18: Update the fenced code block in CLAUDE.md to include a
language specifier (e.g., replace the opening ``` with ```text or ```plaintext)
so the directory tree (showing engine.py, cli.py, mdl/, connector/,
model/data_source.py, model/error.py, tests/) gets proper syntax highlighting;
locate the block that starts with ``` and the directory listing and change only
the opening fence to include the specifier.
In `@wren/pyproject.toml`:
- Line 49: The 'all' extras entry currently omits the 'spark' extra; update the
pyproject.toml so the all aggregate either includes "spark" alongside the other
extras or add a brief inline comment next to the 'all' definition explaining why
'spark' is intentionally excluded (e.g., PySpark size/footprint), referencing
the 'all' extras line and the existing 'spark' extra name so reviewers can see
the intended behavior.
In `@wren/src/wren/cli.py`:
- Around line 150-175: The transpile command creates a WrenEngine instance
(WrenEngine in function transpile) but never calls engine.close(), risking
resource leaks; update transpile to ensure engine.close() is always invoked (use
a with/context manager on WrenEngine if it implements __enter__/__exit__, or
wrap engine usage in try/finally and call engine.close() in finally) after
creating engine = WrenEngine(manifest_str=manifest_str, data_source=ds,
connection_info={}); keep existing manifest loading (_load_manifest) and
DataSource handling, and ensure exceptions still result in typer.Exit(1) while
guaranteeing engine.close() runs.
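The try/finally shape the comment asks for can be sketched with a stand-in engine (`FakeEngine` is hypothetical; the real code would echo the error and raise `typer.Exit(1)` in the except branch):

```python
class FakeEngine:
    """Stand-in for WrenEngine, just to show the cleanup shape."""
    def __init__(self):
        self.closed = False
    def transpile(self, sql):
        raise RuntimeError("boom")
    def close(self):
        self.closed = True

def transpile_cmd(engine, sql):
    try:
        return engine.transpile(sql)
    except Exception:
        return None  # CLI would typer.echo(...) and raise typer.Exit(1)
    finally:
        engine.close()  # runs whether transpile succeeded or raised

engine = FakeEngine()
transpile_cmd(engine, "SELECT 1")
```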
In `@wren/src/wren/connector/base.py`:
- Around line 55-61: The _round_decimal_columns method currently hardcodes
Decimal(precision=38, scale=9) which can truncate higher-precision values;
change it to derive precision from the column's original dtype (fetch
original_dtype = result_table.schema()[col_name] or equivalent), use that
precision when present (cap it to the supported maximum, e.g., 38) or fall back
to a configurable default, then construct Decimal(precision=derived_precision,
scale=scale) and cast/round using that type; ensure the change is applied in the
_round_decimal_columns function and consider adding a configurable default
precision on the connector class if needed.
In `@wren/src/wren/connector/bigquery.py`:
- Around line 16-20: Rename the mistakenly named variable credits_json to
creds_json in the BigQuery connector: update the assignment statement that
decodes and loads the base64 credentials (the variable currently declared as
credits_json) and replace all subsequent references to that variable within the
same scope (e.g., anywhere creds_json is used after the loads(...) call) to use
creds_json so the name matches "credentials" shorthand; ensure no other
variables collide and run tests/imports to verify no unresolved references.
- Around line 24-29: The code currently calls credentials.with_scopes([...])
eagerly during connector initialization (the credentials variable in
bigquery.py); remove that eager scope expansion and keep the original
credentials untouched at init. Instead, perform credentials.with_scopes(...)
only in the retry path that handles empty results for special Drive-backed types
(the method that detects empty query results and retries—e.g., the connector's
retry/query handling function), using the same scopes
["https://www.googleapis.com/auth/drive","https://www.googleapis.com/auth/cloud-platform"]
just before re-creating the BigQuery client for the retry attempt.
In `@wren/src/wren/connector/canner.py`:
- Around line 37-49: The _handle_pyarrow_unsupported_type method in
CannerConnector duplicates decimal and UUID handling already implemented as
IbisConnector._round_decimal_columns and IbisConnector._cast_uuid_columns;
change CannerConnector to reuse those helpers instead of reimplementing: have
CannerConnector inherit from IbisConnector or import the shared mixin, then
replace the loop in _handle_pyarrow_unsupported_type with calls to
IbisConnector._round_decimal_columns(ibis_table, ...) and
IbisConnector._cast_uuid_columns(...) (or the equivalent mixin methods) so
decimals and UUIDs are processed by the existing helpers, returning the final
transformed Table.
In `@wren/src/wren/connector/databricks.py`:
- Around line 53-55: The dry_run implementation in dry_run(self, sql: str) wraps
the query in a SELECT subquery which breaks for CTEs, multiple statements or
DDL; change it to validate the SQL using Databricks' EXPLAIN instead of wrapping
(i.e., execute "EXPLAIN {sql}" via the same cursor in dry_run) so CTEs and
complex queries are accepted without executing data-changing statements; keep
using closing(self.connection.cursor()) and ensure you only call cursor.execute
with the EXPLAIN-prefixed SQL in the dry_run function.
In `@wren/src/wren/connector/duckdb.py`:
- Around line 57-62: The three separate conditionals checking connection_info
should be made mutually exclusive to avoid multiple initializers running; change
the sequential ifs to an if / elif / elif chain so only one branch executes for
S3FileConnectionInfo, MinioFileConnectionInfo, or GcsFileConnectionInfo and call
the corresponding initializer (_init_duckdb_s3, _init_duckdb_minio,
_init_duckdb_gcs) with self.connection and connection_info.
In `@wren/src/wren/connector/ibis.py`:
- Line 29: Replace the awkward "not ... == ..." style checks with "!="
comparisons: change the condition in the exception handling where the code uses
if not e.error_name == "EXCEEDED_TIME_LIMIT" to use if e.error_name !=
"EXCEEDED_TIME_LIMIT" (also update the same pattern found later in the file
around the other exception branch); look for the exception handler referencing
e.error_name in ibis.py to make both replacements.
- Around line 99-103: The create_connector function can be simplified by
removing the explicit IbisConnector constructor and instantiating cls with the
correct arguments: if cls is IbisConnector call cls(data_source,
connection_info) else call cls(connection_info); implement this by building args
= (data_source, connection_info) when cls is IbisConnector and args =
(connection_info,) otherwise, then return cls(*args). Update the function that
uses _DATA_SOURCE_TO_CLASS, IbisConnector, DataSource and connection_info
accordingly.
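The simplification reads roughly like this, with dummy classes standing in for the real connector hierarchy:

```python
class IbisConnector:
    """Generic fallback; also needs the data source to pick a backend."""
    def __init__(self, data_source, connection_info):
        self.args = (data_source, connection_info)

class DuckDBConnector:
    def __init__(self, connection_info):
        self.args = (connection_info,)

_DATA_SOURCE_TO_CLASS = {"trino": IbisConnector, "duckdb": DuckDBConnector}

def create_connector(data_source, connection_info):
    cls = _DATA_SOURCE_TO_CLASS[data_source]
    # Build the argument tuple once instead of branching on the constructor call
    args = (data_source, connection_info) if cls is IbisConnector else (connection_info,)
    return cls(*args)

c1 = create_connector("trino", {"host": "h"})
c2 = create_connector("duckdb", {"path": "p"})
```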
In `@wren/src/wren/connector/mysql.py`:
- Around line 14-51: MySqlConnector and DorisConnector duplicate the same
_handle_pyarrow_unsupported_type logic; move the shared logic into
IbisConnector._handle_pyarrow_unsupported_type (handling Decimal via
_round_decimal_columns, UUID via _cast_uuid_columns, and dt.JSON -> cast to
string) and have MySqlConnector and DorisConnector call or inherit that base
implementation (override only if connector-specific behavior is needed).
Alternatively, extract the shared code into a small mixin/helper function (e.g.,
handle_pyarrow_unsupported_types) that uses ibis_table.schema(), Decimal, UUID,
dt.JSON and the existing _round_decimal_columns/_cast_uuid_columns helpers, then
have both MySqlConnector and DorisConnector delegate to that single
implementation to remove duplication.
In `@wren/src/wren/connector/spark.py`:
- Around line 40-48: The close method currently swallows all exceptions from
self.connection.stop(), which should be logged; update close(self) in the class
to catch Exception as e (around the self.connection.stop() call), log the
exception with a debug or warning level (including the exception object and
traceback) using the module/class logger (or create one if missing), and still
set self._closed = True in the finally block so cleanup always completes; keep
the existing early-return when self._closed is already True.
In `@wren/src/wren/engine.py`:
- Around line 166-173: The try/except around manifest extraction
(get_manifest_extractor, extractor.resolve_used_table_names,
extractor.extract_by, to_json_base64) silently falls back to self.manifest_str;
update the except block to log the exception and context before using the full
manifest so failures aren’t hidden—capture the exception object, call the
module/class logger (or processLogger) to emit a clear message including the
exception details and which SQL/manifest caused the fallback, then set
effective_manifest = self.manifest_str as before.
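The logged fallback can be sketched as below; `_extract` is a hypothetical stand-in for the wren-core-py extraction pipeline named in the comment:

```python
import logging

logger = logging.getLogger("wren.engine")

def _extract(manifest_str, sql):
    # stand-in for get_manifest_extractor / extract_by / to_json_base64
    raise RuntimeError("unresolvable table reference")

def effective_manifest(manifest_str, sql):
    try:
        return _extract(manifest_str, sql)
    except Exception as e:
        logger.warning(
            "Manifest extraction failed for SQL %r, falling back to full manifest: %s",
            sql, e,
        )
        return manifest_str  # same fallback as before, but no longer silent

result = effective_manifest("<full-manifest>", "SELECT * FROM orders")
```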
In `@wren/src/wren/mdl/__init__.py`:
- Around line 8-17: The get_session_context function is decorated with
functools.cache which is unbounded; replace it with
functools.lru_cache(maxsize=...) (or configurable maxsize) to prevent unbounded
memory growth in long‑running processes: update the decorator on
get_session_context, add/import lru_cache from functools, and choose a
reasonable maxsize (or read from config/env) so SessionContext instances are
memoized but capped.
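Swapping the decorator is a one-line change; a sketch with a stand-in session object (the `WREN_SESSION_CACHE_SIZE` env var is an assumed name for the configurable size):

```python
import os
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=int(os.environ.get("WREN_SESSION_CACHE_SIZE", "128")))
def get_session_context(manifest_str):
    """Memoized as before, but the cache is now bounded instead of
    growing without limit in long-running processes."""
    calls["n"] += 1
    return {"manifest": manifest_str}  # stand-in for a SessionContext

a = get_session_context("m1")
b = get_session_context("m1")  # cache hit: same object, no second build
```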
In `@wren/src/wren/model/data_source.py`:
- Around line 11-13: Top-level imports of boto3, ibis, and BaseBackend slow
startup; change them to lazy imports inside the functions/classes that actually
use AWS or Ibis functionality (replace "import boto3", "import ibis", and "from
ibis import BaseBackend" with in-function imports where needed). Locate call
sites that depend on boto3 or ibis (search for usages of boto3, ibis, and
BaseBackend in this module) and add local imports at the start of those
functions/methods (follow the existing project pattern used in duckdb.py, e.g.,
local import with a noqa comment if necessary) so CLI/transpile paths that don't
touch DB/AWS avoid importing those heavy libs at module load.
- Around line 287-289: Rename the mistakenly named variable credits_json to
creds_json in the data source parsing code: update the assignment that decodes
base64 and loads JSON (the expression using
base64.b64decode(info.credentials.get_secret_value()).decode("utf-8") passed to
loads) and update all subsequent references in the same function/method to use
creds_json instead of credits_json, ensuring there are no name collisions and
tests/imports still pass.
In `@wren/tests/connectors/test_duckdb.py`:
- Around line 35-48: The engine pytest fixture currently declares a
non-generator return type WrenEngine but uses yield; change its annotation to a
generator type such as Generator[WrenEngine, None, None] or Iterator[WrenEngine]
on the def engine(...) signature, and update the in-code comment that describes
TPCH generation in the engine fixture from "1500 orders, 150 customers" to the
correct "15000 orders, 1500 customers" per the WrenQueryTestSuite defaults;
locate the changes in the engine fixture and adjust imports if needed
(typing.Generator or typing.Iterator).
In `@wren/tests/suite/query.py`:
- Around line 135-137: The manifest_str pytest fixture is defined but never
consumed by subclass fixtures; update the TestDuckDB and TestPostgres engine
fixtures to accept manifest_str as a parameter and use that value instead of
rebuilding it inline (or remove the unused manifest_str fixture). Specifically,
modify the engine fixtures in the TestDuckDB and TestPostgres classes to include
manifest_str in their signature and replace any inline
base64.b64encode(orjson.dumps(self.manifest)).decode() calls with the provided
manifest_str, ensuring the manifest_str fixture is actually used; if you prefer
removing duplication, delete the manifest_str fixture instead of changing the
subclasses.
In `@wren/tests/unit/test_engine.py`:
- Around line 121-127: The test currently never initializes
WrenEngine._connector so it only asserts None; update
test_context_manager_closes_connector to trigger connector creation (call a
method that lazily initializes it such as WrenEngine.query(...) or
WrenEngine.dry_run()) inside the with block, assert e._connector is not None
while inside the context, then after exiting the with (which invokes
WrenEngine.__exit__), assert e._connector is None to verify cleanup; reference
WrenEngine, test_context_manager_closes_connector, query/dry_run, __exit__, and
_connector when making the change.
- Line 10: Remove the unused import symbol "math" from the top of the test
module (in test_engine.py) so the file no longer imports a module that isn't
used; if you intended to use math in tests instead, add the actual usage (e.g.,
math.isclose) where appropriate, otherwise delete the import line.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 36c5e8e0-aecb-4fb1-80b0-c10741e81c6c
⛔ Files ignored due to path filters (1)
`wren/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (36)
- .github/workflows/wren-ci.yml
- ibis-server/docs/development.md
- wren/.claude/CLAUDE.md
- wren/README.md
- wren/justfile
- wren/pyproject.toml
- wren/src/wren/__init__.py
- wren/src/wren/cli.py
- wren/src/wren/connector/__init__.py
- wren/src/wren/connector/base.py
- wren/src/wren/connector/bigquery.py
- wren/src/wren/connector/canner.py
- wren/src/wren/connector/databricks.py
- wren/src/wren/connector/duckdb.py
- wren/src/wren/connector/factory.py
- wren/src/wren/connector/ibis.py
- wren/src/wren/connector/mssql.py
- wren/src/wren/connector/mysql.py
- wren/src/wren/connector/postgres.py
- wren/src/wren/connector/redshift.py
- wren/src/wren/connector/spark.py
- wren/src/wren/engine.py
- wren/src/wren/mdl/__init__.py
- wren/src/wren/model/__init__.py
- wren/src/wren/model/data_source.py
- wren/src/wren/model/error.py
- wren/tests/__init__.py
- wren/tests/conftest.py
- wren/tests/connectors/__init__.py
- wren/tests/connectors/test_duckdb.py
- wren/tests/connectors/test_postgres.py
- wren/tests/suite/__init__.py
- wren/tests/suite/manifests.py
- wren/tests/suite/query.py
- wren/tests/unit/__init__.py
- wren/tests/unit/test_engine.py
CI: add uv lock before uv sync so the lockfile regenerates for Linux after building the wren-core-py wheel (it was locked to macOS ARM64 only).

Code fixes from review:
- Remove dead loop in cli.py _print_result
- Escape SQL values in DuckDB CREATE SECRET and ATTACH statements
- MSSQL: return original SQL on parse failure, robust error matching, wrap _describe_sql_for_error_message in try/except
- DorisConnectionInfo.port: SecretStr → SecretPort
- GcsFileConnectionInfo.credentials: add Optional type
- Add DatabricksServicePrincipalConnectionInfo to ConnectionInfo union
- WrenError: preserve exception cause via __cause__
- Redshift: cast port to int for redshift_connector
- Factory: correct pip install hint for shared connector modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addressed all CodeRabbit review comments in d4986bf: CI fix:
Code fixes:
Not changed (intentional):
Actionable comments posted: 6
♻️ Duplicate comments (2)
wren/src/wren/connector/mssql.py (1)
98-108: ⚠️ Potential issue | 🟡 Minor

Missing null check for error_message value.

Line 106 returns `rows[0][0]` directly, but if SQL Server returns a row where `error_message` is `NULL`, this will return `None` instead of the expected `"Unknown reason"` string.

🛡️ Proposed fix

```diff
 with closing(self.connection.raw_sql(describe_sql)) as cur:
     rows = cur.fetchall()
-    if not rows:
+    if not rows or rows[0][0] is None:
         return "Unknown reason"
     return rows[0][0]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/mssql.py` around lines 98 - 108, In _describe_sql_for_error_message, after executing describe_sql (built with sge.convert(sql).sql("mssql")) and reading rows, guard against a NULL error_message by checking rows[0][0] for None and returning "Unknown reason" when it is None (in addition to the existing empty-rows and exception handling); update the return logic that currently returns rows[0][0] so it returns rows[0][0] only when it is a non-null string, otherwise return "Unknown reason".

wren/src/wren/connector/duckdb.py (1)
71-74: ⚠️ Potential issue | 🟠 Major

`limit` is still enforced client-side. This still fetches the entire Arrow table before trimming it, so large queries don't get the memory/latency protection that `limit` implies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/duckdb.py` around lines 71 - 74, The query method currently fetches the full Arrow table then slices it, so enforce the limit in the SQL execution instead: in query (wren/src/wren/connector/duckdb.py) when limit is not None build a limited SQL such as "SELECT * FROM ({sql}) AS _wren_subquery LIMIT {int(limit)}" (cast limit to int) and pass that to self.connection.execute(...).fetch_arrow_table(), otherwise execute the original sql; this ensures DuckDB applies the limit server-side and avoids client-side slicing.
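The server-side limit in the prompt above is a one-line SQL rewrite; a minimal sketch (`apply_limit` is a hypothetical helper and the subquery alias is arbitrary):

```python
def apply_limit(sql: str, limit=None) -> str:
    """Wrap the query so the database enforces the row limit itself.
    Wrapping (rather than appending LIMIT) stays valid even when the
    original SQL already ends in its own LIMIT or ORDER BY clause."""
    if limit is None:
        return sql
    return f"SELECT * FROM ({sql}) AS _wren_limited LIMIT {int(limit)}"

unlimited = apply_limit("SELECT 1")
limited = apply_limit("SELECT 1", 10)
```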
🧹 Nitpick comments (2)
wren/src/wren/cli.py (2)
59-65: Unused parameter `sql` in `_make_engine`.

The `sql` parameter is accepted but never used within the function body. This appears to be leftover from refactoring.

♻️ Proposed fix to remove unused parameter

```diff
 def _make_engine(
-    sql: str,
     datasource: str,
     mdl: str,
     connection_info: str | None,
     connection_file: str | None,
 ):
```

Then update all call sites:

```diff
-engine = _make_engine(sql, datasource, mdl, connection_info, connection_file)
+engine = _make_engine(datasource, mdl, connection_info, connection_file)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/cli.py` around lines 59 - 65, Remove the unused parameter from the _make_engine function signature: drop the sql parameter and update the definition of _make_engine(sql: str, datasource: str, mdl: str, connection_info: str | None, connection_file: str | None) to only accept the actually used parameters (datasource, mdl, connection_info, connection_file); then update every call site that invokes _make_engine to stop passing the now-removed sql argument (search for _make_engine( or references to _make_engine in the codebase) so the calls match the new signature.
43-56: Consider case-insensitive extension check.

Line 51 uses `mdl.endswith(".json")`, which is case-sensitive. Files named `manifest.JSON` or `manifest.Json` would be treated as pre-encoded base64 instead of being encoded.

♻️ Proposed fix

```diff
-    if mdl.endswith(".json"):
+    if mdl.lower().endswith(".json"):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/cli.py` around lines 43 - 56, The _load_manifest function treats files with extension check using mdl.endswith(".json") which is case-sensitive; update the check to be case-insensitive (e.g., compare Path(mdl).suffix.lower() or use mdl.lower().endswith(".json")) so filenames like "manifest.JSON" are correctly detected and base64-encoded; modify the conditional that currently uses mdl.endswith(".json") in _load_manifest to use a lower-cased suffix check while keeping the rest of the logic intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wren/src/wren/cli.py`:
- Around line 167-175: The transpile command creates a WrenEngine
(WrenEngine(manifest_str=..., data_source=..., connection_info={})) and calls
engine.transpile(...) but never closes the engine; update the transpile command
to wrap the transpile call in try/except/finally (matching
query/dry-run/validate) so that engine.close() is always called in the finally
block, ensuring engine.close() runs whether transpile succeeds or raises; keep
the existing error handling (typer.echo and typer.Exit) inside the except and
invoke engine.close() in finally.
- Around line 200-207: The fallback in _print_result currently prints
table.to_pydict() directly which emits a Python dict repr (single quotes)
instead of valid JSON; change the except block to serialize the dict to proper
JSON (e.g., import json and call json.dumps(table.to_pydict(),
ensure_ascii=False) or equivalent) and pass that string to typer.echo so the
"json" output always emits valid JSON; reference _print_result,
table.to_pydict(), and typer.echo when making the change.
In `@wren/src/wren/connector/duckdb.py`:
- Around line 59-69: The constructor opens a transient DuckDB connection with
duckdb.connect() but does not close it if any of _init_duckdb_s3,
_init_duckdb_minio, _init_duckdb_gcs or _attach_database raise, leaking the
connection; wrap the initialization sequence in a try/except (or try/finally)
inside __init__ so that on any exception you call self.connection.close() before
re-raising the error, ensuring the connection is always closed on failure while
preserving successful path behavior.
- Around line 96-105: The _list_duckdb_files function currently always
constructs an OpenDAL Operator with backend "fs", which will fail for remote
connection types; either enforce validators on the connection models
(S3FileConnectionInfo, MinioFileConnectionInfo, GcsFileConnectionInfo) to forbid
format="duckdb" (allow it only on LocalFileConnectionInfo), or update
_list_duckdb_files to inspect connection_info's concrete type or a backend
indicator and construct the correct Operator (e.g., "s3", "minio", "gcs", or
"fs") using the appropriate credentials/URL before listing; update references to
connection_info.url.get_secret_value() accordingly so remote backends use the
right root and do not attempt local fs listing.
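One way to sketch the backend dispatch; the class names follow the review text, and mapping MinIO onto the `s3` scheme is an assumption (MinIO speaks the S3 protocol) rather than confirmed project behavior:

```python
def opendal_scheme(connection_info) -> str:
    """Pick the OpenDAL backend scheme from the concrete ConnectionInfo type.

    Dispatching on the type name keeps this sketch import-free; real code
    would use isinstance checks against the pydantic models.
    """
    return {
        "S3FileConnectionInfo": "s3",
        "MinioFileConnectionInfo": "s3",  # MinIO is S3-compatible
        "GcsFileConnectionInfo": "gcs",
        "LocalFileConnectionInfo": "fs",
    }.get(type(connection_info).__name__, "fs")
```

The chosen scheme (plus the matching credentials/root from the connection model) would then feed the `opendal.Operator(...)` constructor instead of the hardcoded `"fs"`.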
- Around line 76-77: The dry_run implementation currently calls
connection.execute(sql) which may run mutations; change dry_run in the DuckDB
connector to avoid executing the raw statement by enforcing read-only validation
(for example, wrap the incoming sql in a read-only wrapper like SELECT * FROM
({sql}) LIMIT 0 and execute that instead, or alternatively use EXPLAIN or
PREPARE to validate without side effects). Update the dry_run method (symbol:
dry_run) to build and execute the safe validation statement rather than calling
connection.execute(sql) directly, so connection.execute is only ever invoked on
the wrapped/EXPLAIN/PREPARE statement.
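The read-only wrapper option reduces to a pure SQL-building helper; the `_wren_dry_run` alias is illustrative:

```python
def build_dry_run_sql(sql: str) -> str:
    # Wrapping in a zero-row SELECT forces the engine to plan and bind the
    # inner statement without executing its side effects.
    return f"SELECT * FROM ({sql}) AS _wren_dry_run LIMIT 0"
```

`connection.execute` would then only ever see the wrapped statement (or an `EXPLAIN`/`PREPARE` variant), never the raw SQL.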
In `@wren/src/wren/connector/mssql.py`:
- Around line 19-24: The override of query (which flattens pagination via
_flatten_pagination_limit and rounds decimals via _round_decimal_columns)
bypasses the base-class PyArrow compatibility step and thus fails for UUID
columns; update query to ensure _handle_pyarrow_unsupported_type is applied
before returning (e.g., call self._handle_pyarrow_unsupported_type(...) on the
ibis_table or the result of _round_decimal_columns), or add UUID-to-string
casting inside _round_decimal_columns so that any UUID columns are converted to
strings prior to to_pyarrow()/PyArrow conversion.
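A simplified stand-in for the UUID casting, operating on plain row dicts rather than an Ibis table or Arrow data:

```python
import uuid


def stringify_uuids(rows: list) -> list:
    # Sketch of what _handle_pyarrow_unsupported_type needs to guarantee:
    # PyArrow has no native UUID type, so convert uuid.UUID values to str
    # before any to_pyarrow()/Arrow conversion.
    return [
        {k: str(v) if isinstance(v, uuid.UUID) else v for k, v in row.items()}
        for row in rows
    ]
```

In the real connector the same conversion would run on the result of `_round_decimal_columns`, so the MSSQL override keeps the base class's PyArrow compatibility guarantees.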
---
Duplicate comments:
In `@wren/src/wren/connector/duckdb.py`:
- Around line 71-74: The query method currently fetches the full Arrow table
then slices it, so enforce the limit in the SQL execution instead: in query
(wren/src/wren/connector/duckdb.py) when limit is not None build a limited SQL
such as "SELECT * FROM ({sql}) AS _wren_subquery LIMIT {int(limit)}" (cast limit
to int) and pass that to self.connection.execute(...).fetch_arrow_table(),
otherwise execute the original sql; this ensures DuckDB applies the limit
server-side and avoids client-side slicing.
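The limit push-down reduces to a small SQL-building helper; the `_wren_subquery` alias comes from the review text:

```python
def apply_limit(sql: str, limit=None) -> str:
    # Push the limit into the SQL so the engine applies it server-side,
    # instead of fetching the full Arrow table and slicing client-side.
    if limit is None:
        return sql
    return f"SELECT * FROM ({sql}) AS _wren_subquery LIMIT {int(limit)}"
```

The result is what gets passed to `self.connection.execute(...).fetch_arrow_table()`.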
In `@wren/src/wren/connector/mssql.py`:
- Around line 98-108: In _describe_sql_for_error_message, after executing
describe_sql (built with sge.convert(sql).sql("mssql")) and reading rows, guard
against a NULL error_message by checking rows[0][0] for None and returning
"Unknown reason" when it is None (in addition to the existing empty-rows and
exception handling); update the return logic that currently returns rows[0][0]
so it returns rows[0][0] only when it is a non-null string, otherwise return
"Unknown reason".
---
Nitpick comments:
In `@wren/src/wren/cli.py`:
- Around line 59-65: Remove the unused parameter from the _make_engine function
signature: drop the sql parameter and update the definition of _make_engine(sql:
str, datasource: str, mdl: str, connection_info: str | None, connection_file:
str | None) to only accept the actually used parameters (datasource, mdl,
connection_info, connection_file); then update every call site that invokes
_make_engine to stop passing the now-removed sql argument (search for
_make_engine( or references to _make_engine in the codebase) so the calls match
the new signature.
- Around line 43-56: The _load_manifest function treats files with extension
check using mdl.endswith(".json") which is case-sensitive; update the check to
be case-insensitive (e.g., compare Path(mdl).suffix.lower() or use
mdl.lower().endswith(".json")) so filenames like "manifest.JSON" are correctly
detected and base64-encoded; modify the conditional that currently uses
mdl.endswith(".json") in _load_manifest to use a lower-cased suffix check while
keeping the rest of the logic intact.
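A sketch of the case-insensitive check using `Path.suffix`:

```python
from pathlib import Path


def is_json_manifest(mdl: str) -> bool:
    # Lower-casing the suffix makes "manifest.JSON" behave the same as
    # "manifest.json"; anything else is treated as an inline manifest.
    return Path(mdl).suffix.lower() == ".json"
```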
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3ed940ad-de97-49ae-971d-0fe28ffb622d
📒 Files selected for processing (8)
- .github/workflows/wren-ci.yml
- wren/src/wren/cli.py
- wren/src/wren/connector/duckdb.py
- wren/src/wren/connector/factory.py
- wren/src/wren/connector/mssql.py
- wren/src/wren/connector/redshift.py
- wren/src/wren/model/__init__.py
- wren/src/wren/model/error.py
✅ Files skipped from review due to trivial changes (3)
- .github/workflows/wren-ci.yml
- wren/src/wren/model/error.py
- wren/src/wren/model/__init__.py
🚧 Files skipped from review as they are similar to previous changes (2)
- wren/src/wren/connector/factory.py
- wren/src/wren/connector/redshift.py
The committed uv.lock contains macOS ARM64 wheel paths from local dev. Adding --upgrade-package wren-core-py to uv lock forces re-resolution from the freshly built Linux wheel in CI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix isinstance check against union type (use BaseConnectionInfo)
- Close DuckDB connection on init failure to prevent resource leak
- Push limit into SQL for DuckDB and Redshift instead of post-fetch slicing
- Add missing engine.close() in transpile command
- Fix JSON fallback to output valid JSON instead of Python dict repr
- Fix TPCH count comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
♻️ Duplicate comments (3)
wren/src/wren/connector/duckdb.py (2)
69-70: ⚠️ Potential issue | 🟠 Major

Remote `format="duckdb"` still falls back to the local filesystem backend.

`__init__()` calls `_attach_database()` for any connection type with `format == "duckdb"`, but `_list_duckdb_files()` always constructs `opendal.Operator("fs", ...)`. `S3FileConnectionInfo`, `MinioFileConnectionInfo`, and `GcsFileConnectionInfo` therefore try to discover databases on the local filesystem instead of the configured object store.

Also applies to: 100-101
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/duckdb.py` around lines 69 - 70, The bug is that __init__ calls _attach_database for any connection with format == "duckdb" but _list_duckdb_files always creates an opendal.Operator("fs", ...) so S3/Minio/GCS connections are scanned on the local filesystem; change _list_duckdb_files to build the opendal operator from the passed connection_info (S3FileConnectionInfo, MinioFileConnectionInfo, GcsFileConnectionInfo, or local fs) instead of hardcoding "fs", or alternatively accept an Operator parameter (update calls from __init__ and the callers around the referenced block/lines) so _attach_database/_list_duckdb_files use the correct remote backend and bucket/path when listing DuckDB files.
80-81: ⚠️ Potential issue | 🟠 Major

`dry_run()` still executes the real SQL.

This makes `dry-run`/`validate` capable of applying mutations if a non-SELECT statement makes it through planning. Use a read-only probe here (`SELECT * FROM (...) LIMIT 0`, `EXPLAIN`, or `PREPARE`) instead of executing the statement itself.

🛡️ One safe option

```diff
 def dry_run(self, sql: str) -> None:
-    self.connection.execute(sql)
+    self.connection.execute(f"SELECT * FROM ({sql}) AS _q LIMIT 0")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/connector/duckdb.py` around lines 80 - 81, The dry_run method currently calls self.connection.execute(sql) which actually executes mutations; change it to a read-only probe instead (e.g., use EXPLAIN, PREPARE, or wrap the query in a SELECT * FROM (<sql>) LIMIT 0) so the statement is validated/planned but not applied. Update the dry_run implementation to call the connection method that runs the chosen probe (not execute) and ensure it handles errors the same way as planning failures; reference the dry_run method and replace the direct use of self.connection.execute with a safe read-only validation call.

wren/src/wren/model/data_source.py (1)
167-173: ⚠️ Potential issue | 🟠 Major

Service-principal Databricks configs still parse into a type this path can't connect.

`_build_connection_info()` now returns `DatabricksServicePrincipalConnectionInfo`, but `get_databricks_connection()` still unconditionally reads `info.access_token`. A `databricks_type="service_principal"` config will therefore validate here and then fail with `AttributeError` during connection creation instead of failing fast with a clear supported/unsupported-auth result.

Also applies to: 439-445
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/model/data_source.py` around lines 167 - 173, _build_connection_info() can return DatabricksServicePrincipalConnectionInfo for databricks_type="service_principal", but get_databricks_connection() still assumes info.access_token and will raise AttributeError; update get_databricks_connection() to check the concrete type (use isinstance(info, DatabricksServicePrincipalConnectionInfo) or test for the absence of access_token) and either perform the correct service-principal auth flow or raise a clear UnsupportedAuth/ValueError describing that service-principal auth is unsupported, referencing the symbols _build_connection_info, get_databricks_connection, DatabricksServicePrincipalConnectionInfo, and access_token so the fix targets the correct branch; apply the same change to the other occurrence noted (lines ~439-445).
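A minimal fail-fast sketch, assuming `access_token` is the attribute that distinguishes the two Databricks connection-info types; the error type and message here are illustrative:

```python
def databricks_access_token(info):
    """Return the token, or fail fast with a clear message.

    Without this guard, a service-principal config would pass validation
    and only blow up later with an opaque AttributeError.
    """
    token = getattr(info, "access_token", None)
    if token is None:
        raise ValueError(
            "service-principal authentication is not supported "
            "by this connection path"
        )
    return token
```

Real code could instead dispatch with `isinstance(info, DatabricksServicePrincipalConnectionInfo)` and perform the proper service-principal auth flow.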
🧹 Nitpick comments (1)
wren/tests/suite/query.py (1)
135-137: Minor: `manifest_str` fixture may be unused by current subclasses.

The fixture duplicates the inline logic shown in the docstring example (lines 36-37). If subclass engine fixtures consistently create `manifest_str` inline rather than using this fixture, consider removing it to avoid dead code. Otherwise, this is fine as a convenience fixture.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/tests/suite/query.py` around lines 135 - 137, The manifest_str pytest fixture in tests/suite/query.py (function manifest_str) appears unused by current subclasses and duplicates inline logic in the docstring example; either remove this fixture to avoid dead code or update subclass engine fixtures/tests to use the shared manifest_str fixture instead of duplicating base64.b64encode(orjson.dumps(self.manifest)).decode() — locate the fixture named manifest_str and either delete it and clean up imports, or replace inline copies in subclass fixtures/tests with a reference to manifest_str so the shared fixture is exercised consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wren/src/wren/cli.py`:
- Around line 167-168: The CLI is passing an empty dict into WrenEngine which
triggers eager normalization via data_source.get_connection_info() and
datasource model validation; to fix, change the transpile path to avoid forcing
normalization by passing a sentinel (e.g., None or a dedicated sentinel object)
instead of {} from the CLI (where engine is instantiated) and update WrenEngine
(or its constructor code that calls data_source.get_connection_info()) to skip
calling data_source.get_connection_info() when the provided connection_info is
that sentinel (or falsy) so pure transpile does not invoke datasource-specific
validation in wren/src/wren/model/data_source.py.
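The sentinel behavior can be sketched as a small gate in front of normalization; the names mirror the review text but the exact API is an assumption:

```python
def normalize_connection_info(data_source, connection_info):
    # Pure transpile passes no real connection details; skip the
    # datasource-specific validation instead of forcing it on an
    # empty/None sentinel value.
    if not connection_info:
        return None
    return data_source.get_connection_info(connection_info)
```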
In `@wren/src/wren/model/data_source.py`:
- Around line 86-88: The code currently overwrites caller-provided Postgres
connect_timeout because it checks hasattr(info, "connect_timeout") instead of
looking inside info.kwargs; change the guard to check the kwargs dict itself
(e.g., if "connect_timeout" not in kwargs) and only set
kwargs["connect_timeout"] = 120 when that key is absent, using the existing
kwargs = info.kwargs if info.kwargs else {} logic so caller-supplied timeouts
are preserved.
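The guard reduces to `dict.setdefault` on the kwargs dict; the 120-second default comes from the review text:

```python
def postgres_kwargs(info_kwargs=None) -> dict:
    kwargs = dict(info_kwargs) if info_kwargs else {}
    # Only apply the default when the caller didn't set a timeout; checking
    # the kwargs dict (not model attributes) preserves caller overrides.
    kwargs.setdefault("connect_timeout", 120)
    return kwargs
```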
---
Duplicate comments:
In `@wren/src/wren/connector/duckdb.py`:
- Around line 69-70: The bug is that __init__ calls _attach_database for any
connection with format == "duckdb" but _list_duckdb_files always creates an
opendal.Operator("fs", ...) so S3/Minio/GCS connections are scanned on the local
filesystem; change _list_duckdb_files to build the opendal operator from the
passed connection_info (S3FileConnectionInfo, MinioFileConnectionInfo,
GcsFileConnectionInfo, or local fs) instead of hardcoding "fs", or alternatively
accept an Operator parameter (update calls from __init__ and the callers around
the referenced block/lines) so _attach_database/_list_duckdb_files use the
correct remote backend and bucket/path when listing DuckDB files.
- Around line 80-81: The dry_run method currently calls
self.connection.execute(sql) which actually executes mutations; change it to a
read-only probe instead (e.g., use EXPLAIN, PREPARE, or wrap the query in a
SELECT * FROM (<sql>) LIMIT 0) so the statement is validated/planned but not
applied. Update the dry_run implementation to call the connection method that
runs the chosen probe (not execute) and ensure it handles errors the same way as
planning failures; reference the dry_run method and replace the direct use of
self.connection.execute with a safe read-only validation call.
In `@wren/src/wren/model/data_source.py`:
- Around line 167-173: _build_connection_info() can return
DatabricksServicePrincipalConnectionInfo for
databricks_type="service_principal", but get_databricks_connection() still
assumes info.access_token and will raise AttributeError; update
get_databricks_connection() to check the concrete type (use isinstance(info,
DatabricksServicePrincipalConnectionInfo) or test for the absence of
access_token) and either perform the correct service-principal auth flow or
raise a clear UnsupportedAuth/ValueError describing that service-principal auth
is unsupported, referencing the symbols _build_connection_info,
get_databricks_connection, DatabricksServicePrincipalConnectionInfo, and
access_token so the fix targets the correct branch; apply the same change to the
other occurrence noted (lines ~439-445).
---
Nitpick comments:
In `@wren/tests/suite/query.py`:
- Around line 135-137: The manifest_str pytest fixture in tests/suite/query.py
(function manifest_str) appears unused by current subclasses and duplicates
inline logic in the docstring example; either remove this fixture to avoid dead
code or update subclass engine fixtures/tests to use the shared manifest_str
fixture instead of duplicating
base64.b64encode(orjson.dumps(self.manifest)).decode() — locate the fixture
named manifest_str and either delete it and clean up imports, or replace inline
copies in subclass fixtures/tests with a reference to manifest_str so the shared
fixture is exercised consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 50ed8c2f-10b9-432d-8562-6f6cc79f3b33
📒 Files selected for processing (5)
- wren/src/wren/cli.py
- wren/src/wren/connector/duckdb.py
- wren/src/wren/connector/redshift.py
- wren/src/wren/model/data_source.py
- wren/tests/suite/query.py
🚧 Files skipped from review as they are similar to previous changes (1)
- wren/src/wren/connector/redshift.py
- Allow empty dict connection_info for transpile-only usage
- Fix Postgres connect_timeout guard to check kwargs dict, not model attrs
- Use EXPLAIN for DuckDB dry_run instead of executing SQL directly
- Handle UUID columns in MSSQL connector via _handle_pyarrow_unsupported_type

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- `wren/` module — standalone Python SDK and CLI wrapping wren-core-py + Ibis connectors into a single installable package
- GitHub Actions workflow (`wren-ci.yml`) for lint, unit tests, and postgres connector tests
- psycopg v3 compatibility fix (`executemany` requires cursor)
- `pyarrow-hotfix` dependency

Test plan
- `just test-unit` passes unit tests
- `just test-postgres` passes postgres connector tests with testcontainers
- `just lint` passes ruff format and lint checks

🤖 Generated with Claude Code
Summary by CodeRabbit