feat(wren): CTE-based SQL planning with per-model expansion #1479

douenergy merged 10 commits into Canner:main from
Conversation
Replace the whole-query `session_context.transform_sql()` approach with a CTE-based rewriter that:

1. Parses user SQL with sqlglot (source dialect)
2. Uses qualify_tables + normalize_identifiers + qualify_columns to resolve all column references (including SELECT *, aliases, and correlated subqueries)
3. Calls wren-core transform_sql once per model with a simple SELECT col1, col2 FROM model
4. Parses each wren-core output with the Wren dialect and injects it as a CTE
5. Outputs final SQL in the target data source dialect

This eliminates the need for a separate _transpile() step, since dialect conversion happens naturally via sqlglot AST generation.

Also:
- Replace CLI `transpile` command with `dry-plan`
- Remove WrenEngine.transpile(), unify on dry_plan()
- Add fallback parameter to CTERewriter/WrenEngine for test strictness
- Fix relationship column filter (handle null values from extracted manifests)
- All tests use fallback=False to catch silent CTE path failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

Replaces the transpile flow with a dry-plan API, adds CTERewriter to expand MDL model references into injected CTEs using sqlglot, introduces a small Wren sqlglot dialect, updates engine planning to use the rewriter with an optional fallback, updates DuckDB query handling, and adjusts tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User SQL
    participant CLI as CLI (dry-plan)
    participant Engine as WrenEngine
    participant Planner as _plan()
    participant Rewriter as CTERewriter
    participant Session as SessionContext
    participant DB as Connector/DB
    User->>CLI: run dry-plan(sql)
    CLI->>Engine: engine.dry_plan(sql)
    Engine->>Planner: _plan(sql, manifest, data_source)
    Planner->>Rewriter: instantiate(manifest, session, data_source, fallback)
    Planner->>Rewriter: rewrite(sql)
    Rewriter->>Session: transform per-model SELECTs
    Session-->>Rewriter: expanded model SQL
    Rewriter-->>Planner: rewritten SQL with injected CTEs
    Planner->>Engine: planned/executable SQL
    Engine->>DB: query/planned SQL (dry-run or execute)
    DB-->>Engine: result/rows or dry-run output
    Engine-->>CLI: print expanded SQL / results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
…CTE rewriter

Models referenced in the FROM clause but without specific column references (e.g. SELECT COUNT(*) FROM model) now generate CTEs with all columns instead of being treated as "no model references found".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
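The guard this commit fixes can be sketched in a few lines. The function and argument names below are hypothetical, not the PR's actual code: the point is only that a table-only query must not be mistaken for "no model references".

```python
def should_fallback(used_columns: dict[str, list[str]], referenced_models: set[str]) -> bool:
    """Fall back to whole-query transform_sql only when the SQL touches no model at all.

    A table-only query such as SELECT COUNT(*) FROM orders yields no column
    references but does reference a model, so it must stay on the CTE path
    (where its model CTE is generated with all columns).
    """
    return not used_columns and not referenced_models

# SELECT COUNT(*) FROM orders: a model reference without column references.
print(should_fallback({}, {"orders"}))  # False: take the CTE path
# SELECT 1: no model references anywhere.
print(should_fallback({}, set()))       # True: fallback is legitimate
```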
Force-pushed 3985317 to 432988c
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wren/src/wren/mdl/cte_rewriter.py`:
- Around line 141-144: The code only records user CTE names from the root WITH
via _collect_user_cte_names(), so nested WITHs' CTEs (e.g., a local "WITH
customer AS (...)") are not treated as shadowing and _collect_model_columns()
may incorrectly inject a top-level model CTE; update CTE tracking to be
scope-aware by walking the AST and collecting CTE names per scope (not just
root), then change the lookup in the table collection logic (the loop using
qualified.find_all(exp.Table) and the checks comparing table_name against
self.model_dict and user_cte_names) to consult the nearest-enclosing scope's CTE
set; ensure the same fix is applied to the other occurrence referenced (the
block around the other conditional that currently uses the flat user_cte_names
set).
- Around line 104-109: The current guard uses only used_columns and so treats
table-only model queries (e.g., SELECT COUNT(*) FROM "orders") as having no
model refs; update the check that decides fallback so it also considers
table/model references extracted earlier (e.g., any used_models/used_table_refs
or whatever symbol holds referenced models) and only trigger
self.fallback/session_context.transform_sql when there are truly no model
references at all; ensure queries with table-only references follow the CTE
serialization path (the same path taken by rewrite()/CTE generation) instead of
bypassing it.
- Around line 150-171: The current code collects used columns in a set (used:
dict[str, set[str]]) and then sorts them in _build_model_ctes (sorted(columns)),
which reorders SELECT * column output; change the collection to preserve
manifest/order and stop sorting before building CTEs: replace the per-model
value type from set to a list (or OrderedSet-like append-if-missing) where you
append columns in the order discovered, and in _build_model_ctes use the
incoming columns sequence directly (use columns rather than sorted(columns))
when computing resolved and col_list; update references to used, used_columns,
_build_model_ctes, and resolved to reflect the ordered collection.
In `@wren/tests/unit/test_cte_rewriter.py`:
- Around line 88-93: Change the helper _make_rewriter to default fallback to
False by adding a parameter fallback: bool = False, and forward that flag into
the session creation so the returned CTERewriter uses the new planner by
default; e.g. update the call to get_session_context(...) to include/propagate
the fallback argument (or named parameter) so tests can opt in with
_make_rewriter(..., fallback=True) when they need legacy behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7fdbe27a-6976-434e-9d83-9c81e7a47b57
⛔ Files ignored due to path filters (1)
wren/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
wren/pyproject.toml
wren/src/wren/cli.py
wren/src/wren/connector/duckdb.py
wren/src/wren/engine.py
wren/src/wren/mdl/cte_rewriter.py
wren/src/wren/mdl/wren_dialect.py
wren/tests/connectors/test_duckdb.py
wren/tests/connectors/test_postgres.py
wren/tests/suite/query.py
wren/tests/unit/test_cte_rewriter.py
wren/tests/unit/test_engine.py
The fallback path (session_context.transform_sql) outputs Wren dialect, which needs to be transpiled to the target data source dialect before returning — matching the CTE path's behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
♻️ Duplicate comments (2)
wren/src/wren/mdl/cte_rewriter.py (2)
122-123: ⚠️ Potential issue | 🟠 Major: Preserve discovered column order; `set` + `sorted()` can change `SELECT *` schema order.

Line 153 stores columns in a set, and Line 174 sorts them. That can reorder exposed columns for star queries, changing visible output schema order.

Suggested fix (ordered de-dup instead of set):

```diff
-    ) -> dict[str, set[str]]:
-        """Return ``{model_name: {col1, col2, ...}}`` for all referenced models.
+    ) -> dict[str, list[str]]:
+        """Return ``{model_name: [col1, col2, ...]}`` for all referenced models.
     ...
-        used: dict[str, set[str]] = {m: set() for m in alias_to_model.values()}
+        used: dict[str, list[str]] = {m: [] for m in alias_to_model.values()}
+        seen: dict[str, set[str]] = {m: set() for m in alias_to_model.values()}
     ...
             if model_name:
-                used[model_name].add(col.name)
+                if col.name not in seen[model_name]:
+                    seen[model_name].add(col.name)
+                    used[model_name].append(col.name)

-    def _build_model_ctes(self, used_columns: dict[str, set[str]]) -> list[exp.CTE]:
+    def _build_model_ctes(self, used_columns: dict[str, list[str]]) -> list[exp.CTE]:
     ...
-        resolved = [orig.get(c, c) for c in sorted(columns)]
+        resolved = [orig.get(c, c) for c in columns]
```

Also applies to: 153-160, 168-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/mdl/cte_rewriter.py` around lines 122 - 123, The code currently accumulates columns into a set and later sorts them (the docstring shows Return ``{model_name: {col1, col2, ...}}``), which loses and reorders the discovered column order; change the implementation to preserve discovery order by collecting columns in an ordered container (e.g., use a list and de-duplicate with a local seen set or use dict.fromkeys) instead of a set, remove the final sorted() step, and update the return shape to map model_name to an ordered sequence (e.g., dict[str, list[str]]) so SELECT * retains the original schema order; adjust any callers expecting set[str] accordingly.
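The ordered de-duplication this comment asks for needs no extra machinery in Python: `dict.fromkeys` de-duplicates while keeping first-seen order. A minimal stdlib illustration, not the PR's code:

```python
# Columns in manifest/discovery order, with a repeat.
discovered = ["o_orderstatus", "o_orderkey", "o_orderstatus", "o_totalprice"]

lossy = sorted(set(discovered))            # alphabetical: SELECT * order changes
ordered = list(dict.fromkeys(discovered))  # de-duplicated, first-seen order kept

print(lossy)    # ['o_orderkey', 'o_orderstatus', 'o_totalprice']
print(ordered)  # ['o_orderstatus', 'o_orderkey', 'o_totalprice']
```

The `seen`-set approach in the suggested diff is equivalent; `dict.fromkeys` is simply the one-liner form when the full column sequence is available up front.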
101-102: ⚠️ Potential issue | 🟠 Major: Make CTE shadowing scope-aware; root-only tracking is still incorrect.

Line 223 only reads the root WITH, but Line 144 uses that flat set to exclude model names globally. A nested `WITH <model_name> AS (...)` can still be mistaken for an MDL model and trigger wrong top-level CTE injection.

Suggested direction:

```diff
-        user_cte_names = self._collect_user_cte_names(ast)
-        used_columns = self._collect_model_columns(ast, user_cte_names)
+        used_columns = self._collect_model_columns(ast)

-    def _collect_model_columns(
-        self, ast: exp.Expression, user_cte_names: set[str]
-    ) -> dict[str, set[str]]:
+    def _collect_model_columns(self, ast: exp.Expression) -> dict[str, set[str]]:
     ...
         for table in qualified.find_all(exp.Table):
             table_name = table.name
-            if table_name not in self.model_dict or table_name in user_cte_names:
+            if table_name not in self.model_dict or self._is_shadowed_by_cte_scope(table, table_name):
                 continue

+    @staticmethod
+    def _is_shadowed_by_cte_scope(table: exp.Table, table_name: str) -> bool:
+        node: exp.Expression | None = table
+        target = table_name.lower()
+        while node is not None:
+            with_clause = node.args.get("with_")
+            if with_clause:
+                for cte in with_clause.expressions:
+                    alias = cte.args.get("alias")
+                    if alias and getattr(alias.this, "name", str(alias.this)).lower() == target:
+                        return True
+            node = node.parent
+        return False
```

Also applies to: 120-145, 219-233
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/mdl/cte_rewriter.py` around lines 101 - 102, The current logic flattens CTE names by calling self._collect_user_cte_names(ast) and then passing that global set into self._collect_model_columns(ast, user_cte_names), which misses nested WITH shadowing; change the implementation so CTE collection is scope-aware and the correct per-scope CTE set is used when examining a given query block. Concretely: update _collect_user_cte_names to accept an AST node (or add a new helper) that walks INTO each WITH block and returns CTE names visible at that scope, and modify callers (the place that currently calls _collect_user_cte_names(ast) and _collect_model_columns(ast, ...)) to compute and pass the scope-specific CTE set to _collect_model_columns (or call _collect_model_columns per query block with that node-specific set) so nested WITH <name> correctly shadows MDL model names; ensure symbols _collect_user_cte_names and _collect_model_columns are the ones updated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@wren/src/wren/mdl/cte_rewriter.py`:
- Around line 122-123: The code currently accumulates columns into a set and
later sorts them (the docstring shows Return ``{model_name: {col1, col2,
...}}``), which loses and reorders the discovered column order; change the
implementation to preserve discovery order by collecting columns in an ordered
container (e.g., use a list and de-duplicate with a local seen set or use
dict.fromkeys) instead of a set, remove the final sorted() step, and update the
return shape to map model_name to an ordered sequence (e.g., dict[str,
list[str]]) so SELECT * retains the original schema order; adjust any callers
expecting set[str] accordingly.
- Around line 101-102: The current logic flattens CTE names by calling
self._collect_user_cte_names(ast) and then passing that global set into
self._collect_model_columns(ast, user_cte_names), which misses nested WITH
shadowing; change the implementation so CTE collection is scope-aware and the
correct per-scope CTE set is used when examining a given query block.
Concretely: update _collect_user_cte_names to accept an AST node (or add a new
helper) that walks INTO each WITH block and returns CTE names visible at that
scope, and modify callers (the place that currently calls
_collect_user_cte_names(ast) and _collect_model_columns(ast, ...)) to compute
and pass the scope-specific CTE set to _collect_model_columns (or call
_collect_model_columns per query block with that node-specific set) so nested
WITH <name> correctly shadows MDL model names; ensure symbols
_collect_user_cte_names and _collect_model_columns are the ones updated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ddcf8c82-6e8f-4d1a-9fac-d934be809133
📒 Files selected for processing (1)
wren/src/wren/mdl/cte_rewriter.py
…traction

The input SQL is in the target dialect (e.g. Postgres), not DataFusion. Use sqlglot to parse and extract table names for manifest extraction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…er, test strictness

- Collect user CTE names from all scopes (find_all(exp.With)), not just root, so nested WITH clauses properly shadow model names
- Preserve manifest column order in CTE generation instead of sorting alphabetically — fixes SELECT * column order
- Default test helper _make_rewriter to fallback=False so tests exercise the CTE path strictly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Verifies that a WITH clause inside a subquery correctly shadows the model name, preventing a model CTE from being generated for it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…servation

- Nested CTE in subquery correctly shadows model name
- WITH RECURSIVE keyword preserved after model CTE injection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
wren/src/wren/engine.py (1)
163-172: ⚠️ Potential issue | 🟠 Major: Preserve manifest/planner failures instead of masking them as SQL errors.

This path now swallows every extractor failure by reverting to the full manifest, and anything raised by `get_session_context()`/`CTERewriter()` is then funneled into the generic planning wrapper. A malformed manifest or extractor regression will surface as bad user SQL instead of keeping its real code/phase.

Also applies to: 181-187
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/engine.py` around lines 163 - 172, Current broad except Exception around manifest extraction masks manifest/extractor/planner failures as SQL errors; instead, narrow the try/except so only SQL parsing failures are handled locally and all other exceptions (from get_manifest_extractor, extractor.extract_by, to_json_base64, get_session_context, CTERewriter, etc.) are allowed to surface or are re-raised with context. Specifically, replace the blanket except Exception with targeted except for sql parsing errors from sqlglot (e.g., sqlglot.errors.ParseError) around parse_one/get_sqlglot_dialect, and do not swallow exceptions from get_manifest_extractor, extractor.extract_by, to_json_base64, get_session_context, or CTERewriter — either let them propagate or wrap them in a clear, phase-specific error so they are not reported as generic SQL planning errors; apply the same change to the other occurrence (lines handling effective_manifest later).
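The phase separation the comment asks for can be sketched with plain stdlib code. `SqlParseError` and `ManifestError` below are hypothetical stand-ins (e.g. for `sqlglot.errors.ParseError` and a manifest-extraction failure), not the PR's actual types:

```python
class SqlParseError(Exception):
    """Stand-in for a SQL parse failure."""

class ManifestError(Exception):
    """Stand-in for a manifest/extractor failure."""

def plan(sql, parse, extract):
    # Narrow guard: only the parse phase is translated into a "bad SQL" error.
    try:
        ast = parse(sql)
    except SqlParseError as e:
        raise ValueError(f"invalid SQL: {e}") from e
    # Extractor/planner failures propagate with their own type and phase intact.
    return extract(ast)

def broken_extract(ast):
    raise ManifestError("malformed manifest")

try:
    plan("SELECT 1", parse=lambda s: s, extract=broken_extract)
except ManifestError as e:
    print(f"kept its phase: {e}")  # not masked as a SQL error
```

With a blanket `except Exception` around both phases, the `ManifestError` here would instead surface as a generic planning error, which is exactly the masking the review flags.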
♻️ Duplicate comments (1)
wren/src/wren/mdl/cte_rewriter.py (1)
145-146: ⚠️ Potential issue | 🟠 Major: CTE shadowing is still resolved globally.

`_collect_user_cte_names()` now walks every `WITH`, but it still flattens them into one set. That means a nested `WITH customer AS (...)` can suppress rewriting for an unrelated outer `FROM customer`, because the check at lines 145-146 has no notion of the nearest enclosing scope. Track CTE names per scope instead of globally.

Also applies to: 221-234
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/mdl/cte_rewriter.py` around lines 145 - 146, The current global CTE-name set from _collect_user_cte_names() causes nested WITH clauses to shadow unrelated tables; change the algorithm to track CTE names per scope (e.g., maintain a stack/list of sets pushed when entering a WITH node and popped when exiting) and replace the global membership checks (the conditional using user_cte_names in the block that contains "if table_name not in self.model_dict or table_name in user_cte_names: continue") with lookups against the innermost scope set (or test against the stacked scopes for visibility rules you want). Update the same pattern applied around the other block (the logic covering lines 221-234) so both places consult the current scope's CTE set rather than a flattened global set and ensure _collect_user_cte_names() or its caller is adjusted to build/return scoped sets (or the traversal itself manages the scope stack).
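The per-scope tracking described above can be sketched over a toy query representation. This is stdlib only; the dict shape is an illustration of the scope-stack idea, not sqlglot's AST:

```python
# A table reference is shadowed when any enclosing query block defines a CTE
# with the same name; a flat global set cannot distinguish this.
def collect_model_refs(block, models, scopes=()):
    # Push this block's CTE names onto the scope stack.
    scope = scopes + (frozenset(n.lower() for n in block.get("ctes", [])),)
    refs = []
    for name in block.get("tables", []):
        shadowed = any(name.lower() in s for s in scope)
        if name.lower() in models and not shadowed:
            refs.append(name)
    for sub in block.get("subqueries", []):
        refs.extend(collect_model_refs(sub, models, scope))
    return refs

models = {"customer"}
query = {
    "tables": ["customer"],          # outer FROM customer: a real model reference
    "subqueries": [
        {
            "ctes": ["customer"],    # nested WITH customer AS (...)
            "tables": ["customer"],  # shadowed here, so not a model reference
        }
    ],
}
print(collect_model_refs(query, models))  # only the outer reference survives
```

A flattened set of all CTE names would suppress both references; the scope stack keeps the outer `FROM customer` rewritable while the nested one stays shadowed.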
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@wren/src/wren/engine.py`:
- Around line 163-172: Current broad except Exception around manifest extraction
masks manifest/extractor/planner failures as SQL errors; instead, narrow the
try/except so only SQL parsing failures are handled locally and all other
exceptions (from get_manifest_extractor, extractor.extract_by, to_json_base64,
get_session_context, CTERewriter, etc.) are allowed to surface or are re-raised
with context. Specifically, replace the blanket except Exception with targeted
except for sql parsing errors from sqlglot (e.g., sqlglot.errors.ParseError)
around parse_one/get_sqlglot_dialect, and do not swallow exceptions from
get_manifest_extractor, extractor.extract_by, to_json_base64,
get_session_context, or CTERewriter — either let them propagate or wrap them in
a clear, phase-specific error so they are not reported as generic SQL planning
errors; apply the same change to the other occurrence (lines handling
effective_manifest later).
---
Duplicate comments:
In `@wren/src/wren/mdl/cte_rewriter.py`:
- Around line 145-146: The current global CTE-name set from
_collect_user_cte_names() causes nested WITH clauses to shadow unrelated tables;
change the algorithm to track CTE names per scope (e.g., maintain a stack/list
of sets pushed when entering a WITH node and popped when exiting) and replace
the global membership checks (the conditional using user_cte_names in the block
that contains "if table_name not in self.model_dict or table_name in
user_cte_names: continue") with lookups against the innermost scope set (or test
against the stacked scopes for visibility rules you want). Update the same
pattern applied around the other block (the logic covering lines 221-234) so
both places consult the current scope's CTE set rather than a flattened global
set and ensure _collect_user_cte_names() or its caller is adjusted to
build/return scoped sets (or the traversal itself manages the scope stack).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 647bef17-8a8e-4853-9563-5c1d44aa0489
📒 Files selected for processing (3)
wren/src/wren/engine.pywren/src/wren/mdl/cte_rewriter.pywren/tests/unit/test_cte_rewriter.py
✅ Files skipped from review due to trivial changes (1)
- wren/tests/unit/test_cte_rewriter.py
Given the following SQL:

```sql
WITH Orders AS (SELECT 1 AS o_orderkey, 'OPEN' AS o_orderstatus)
SELECT * FROM Orders
```

the current rewrite process generates a duplicate CTE named Orders. We should address this issue in a follow-up PR.

```python
import sqlglot

rw = _make_rewriter(_SINGLE_MODEL_MANIFEST, fallback=True)
result = rw.rewrite(
    "WITH Orders AS (SELECT 1 AS o_orderkey, 'OPEN' AS o_orderstatus) "
    "SELECT * FROM Orders"
)
print("\n--- generated SQL ---")
print(result)

ast = sqlglot.parse_one(result, dialect="duckdb")
with_clause = ast.args.get("with_")
if with_clause:
    cte_names = [
        cte.args["alias"].this.name
        for cte in with_clause.expressions
        if cte.args.get("alias")
    ]
    print("--- CTE names in output ---")
    for name in cte_names:
        print(f"  {repr(name)}")
    orders_count = sum(1 for n in cte_names if n.lower() == "orders")
    print(f"--- 'orders' count (case-insensitive): {orders_count} ---")
    assert orders_count <= 1, (
        f"Duplicate 'orders' CTE (case-insensitive) in output:\n{result}"
    )
```
…spile→dry-plan

- Rename default conn file from conn.json to connection_info.json to match mcp-server's ~/.wren/connection_info.json default
- Rename `wren transpile` CLI command to `wren dry-plan` (aligns with PR Canner#1479 which removed WrenEngine.transpile() in favour of dry_plan())
- Update all references in cli.py, README.md, docs/cli.md, docs/connections.md, and .claude/skills/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_collect_user_cte_names() collected raw CTE names (e.g. "Orders")
while _collect_model_columns() compared against normalized/lowercased
table names ("orders"). The case-sensitive check missed the shadow,
generating a duplicate CTE. Lowercase names on collection to match
sqlglot's normalize_identifiers behavior.
Closes Canner#1479 (comment)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
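The mismatch this commit closes is easy to reproduce in isolation. A minimal illustration with hypothetical values: raw CTE names must be lowercased before comparing against sqlglot-normalized table names.

```python
raw_cte_names = ["Orders"]   # as written in the user's WITH clause
normalized_table = "orders"  # what normalize_identifiers yields for Orders

case_sensitive = set(raw_cte_names)
lowered = {n.lower() for n in raw_cte_names}

# Case-sensitive check misses the shadow, so a duplicate "orders" CTE gets injected.
print(normalized_table in case_sensitive)  # False
# Lowercasing on collection detects the shadow and suppresses the model CTE.
print(normalized_table in lowered)         # True
```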
Summary
- Replace whole-query `session_context.transform_sql()` with a CTE-based rewriter that expands each MDL model independently, then injects results as CTEs into the original SQL
- Use the `qualify_tables` + `normalize_identifiers` + `qualify_columns` pipeline to fully resolve column references (SELECT *, aliases, correlated subqueries)
- Eliminate the separate `_transpile()` step
- Replace the CLI `transpile` command with `dry-plan`; remove `WrenEngine.transpile()`, unify on `dry_plan()`
- Add a `fallback` parameter to `CTERewriter`/`WrenEngine` so tests can disable silent fallback to the legacy path
- Handle `"relationship": null` values from extracted manifests

Test plan

- Unit tests (`test_cte_rewriter.py` + `test_engine.py`)
- All tests use `fallback=False` to ensure the CTE path is always exercised
- Connector tests (`just test-connector duckdb`, `just test-connector postgres`)

🤖 Generated with Claude Code