feat(wren): add dlt DuckDB introspection and project generation #1532
goldmedal wants to merge 5 commits into Canner:main
Conversation
Adds `wren context init --from-dlt <path.duckdb>`, which introspects a dlt-produced DuckDB file and auto-generates a complete Wren v2 YAML project: models, relationships, wren_project.yml, and instructions.md.

New files:
- `src/wren/dlt_introspect.py` — DltIntrospector class using duckdb_tables()/duckdb_columns() for schema discovery; filters dlt internal tables (_dlt_loads, _dlt_pipeline_state, _dlt_version) and columns (_dlt_id, _dlt_parent_id, _dlt_load_id, _dlt_list_idx); detects ONE_TO_MANY parent-child relationships from dlt's __ naming convention and _dlt_parent_id columns.
- `tests/unit/test_dlt_introspect.py` — 32 unit and CLI tests covering table discovery, column filtering, type normalization, relationship detection, project file generation, and CLI integration.

Modified:
- `src/wren/context.py` — adds convert_dlt_to_project(), which bridges DltIntrospector output to the ProjectFile list.
- `src/wren/context_cli.py` — adds --from-dlt and --profile options to init(), with a mutual-exclusion guard against --from-mdl.

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o
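The internal-table and internal-column filtering described above can be sketched as follows. This is a minimal illustration: `filter_dlt_internals` and its `{table: [columns]}` input shape are hypothetical, while the real DltIntrospector works from `duckdb_tables()`/`duckdb_columns()` query results.

```python
# Hypothetical sketch of the dlt bookkeeping filter described in the PR.
_DLT_INTERNAL_TABLES = {"_dlt_loads", "_dlt_pipeline_state", "_dlt_version"}
_DLT_INTERNAL_COLUMNS = {"_dlt_id", "_dlt_parent_id", "_dlt_load_id", "_dlt_list_idx"}


def filter_dlt_internals(tables: dict[str, list[str]]) -> dict[str, list[str]]:
    """Drop dlt internal tables and columns from a {table: [columns]} map."""
    return {
        name: [col for col in cols if col not in _DLT_INTERNAL_COLUMNS]
        for name, cols in tables.items()
        if name not in _DLT_INTERNAL_TABLES and not name.startswith("_dlt_")
    }
```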
📝 Walkthrough

Adds DuckDB (dlt) introspection, a conversion function to produce Wren v2 project files, CLI import options, and tests; includes relationship detection, duplicate-table-name validation, and generation of project YAML, per-model metadata, relationships.yml, and instructions.md.
Sequence Diagram(s)sequenceDiagram
actor User
participant CLI as "wren context init"
participant Converter as "convert_dlt_to_project()"
participant Introspector as "DltIntrospector"
participant DuckDB as "DuckDB File"
participant Writer as "write_project_files()"
User->>CLI: run --from-dlt <path> [--profile <name>]
CLI->>Converter: convert_dlt_to_project(path, project_name?)
Converter->>Introspector: __init__ / introspect()
Introspector->>DuckDB: ATTACH read-only & query tables/columns
DuckDB-->>Introspector: table/column metadata
Introspector->>Introspector: filter internals, normalize types, detect relationships
Introspector-->>Converter: (tables[], relationships[])
Converter->>Converter: build ProjectFile objects (project.yml, models/*/metadata.yml, relationships.yml, instructions.md)
Converter-->>CLI: list[ProjectFile]
CLI->>Writer: write_project_files(files)
Writer-->>CLI: success
CLI-->>User: print summary (models N, relationships M) / profile info
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🧹 Nitpick comments (1)
wren/src/wren/dlt_introspect.py (1)
108-111: Let `parse_type()` handle its own fallback. Line 109 already calls a helper that falls back to the raw type on parse errors. The blanket `except Exception` here only hides unrelated sqlglot bugs and makes bad mappings look successful.

Proposed simplification:

-    try:
-        normalized = parse_type(raw_type, "duckdb")
-    except Exception:
-        normalized = raw_type
+    normalized = parse_type(raw_type, "duckdb")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/dlt_introspect.py` around lines 108 - 111, Remove the broad try/except around parse_type in dlt_introspect: instead of catching all exceptions and defaulting to raw_type, call normalized = parse_type(raw_type, "duckdb") directly and let parse_type perform its own fallback behavior; remove the surrounding exception handling so unrelated sqlglot errors surface rather than being swallowed.
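The point of this comment can be illustrated with a stand-in normalizer. This `parse_type` is a hypothetical sketch, not the real wren utility; the idea is that once the helper owns its fallback, the call site needs no exception handling of its own.

```python
def parse_type(raw_type: str, dialect: str) -> str:
    """Hypothetical stand-in: normalize a type, falling back to the input."""
    known = {"VARCHAR": "string", "BIGINT": "int64", "DOUBLE": "float64"}
    try:
        return known[raw_type.upper()]
    except KeyError:
        return raw_type  # the helper owns its own fallback


# Call site: no outer try/except needed, so unrelated bugs still surface.
normalized = parse_type("BIGINT", "duckdb")
```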
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wren/src/wren/context_cli.py`:
- Around line 34-44: The CLI currently allows --profile without --from-dlt;
update the validation in the context init command to fail fast when profile is
provided but from_dlt is None by raising a Typer/CLI error (e.g.,
typer.BadParameter or typer.Exit with an error message). Locate the parameters
from_dlt and profile in the context_cli command and modify the existing
validation block that checks those options so it enforces "if profile and not
from_dlt: raise an error with a clear message" to prevent scaffolding an
unrelated project.
In `@wren/src/wren/context.py`:
- Around line 242-266: The code in the for table in tables loop builds model
output files using only table.name, which causes collisions when the same table
name exists in multiple schemas; update the logic that creates
ProjectFile(relative_path=f"models/{table.name}/metadata.yml") to include the
schema (e.g., use table.schema + table.name or a schema-qualified identifier) or
detect duplicates and raise an error; also ensure any references written to
relationships.yml (where bare names are used) are schema-qualified the same way
so names remain unambiguous; modify the model creation and file naming in this
block and add a pre-check for duplicate (schema,name) vs name-only collisions to
fail fast if you prefer strictness.
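The strict fail-fast variant of this check could look like the sketch below. The function name and the `(schema, name)` input shape are assumptions for illustration, not the actual `context.py` code.

```python
from collections import Counter


def check_duplicate_table_names(tables: list[tuple[str, str]]) -> None:
    """Fail fast when the same bare table name appears in multiple schemas,
    since model paths like models/<name>/metadata.yml use the name alone."""
    counts = Counter(name for _, name in tables)
    dupes = sorted(name for name, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(f"duplicate table names across schemas: {dupes}")
```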
In `@wren/src/wren/dlt_introspect.py`:
- Around line 60-62: The ATTACH SQL is built by interpolating self._path and
self._catalog directly into the string; escape these inputs before building the
SQL to avoid syntax errors and injection (escape single quotes in self._path by
doubling them, and escape double quotes in self._catalog by doubling them), then
construct the ATTACH statement using the escaped values passed to
self._con.execute (reference the call where self._con.execute(...) is invoked
and the attributes self._path and self._catalog).
- Around line 79-85: The SQL filter on table_name currently uses LIKE '_dlt_%'
which treats '_' as a single-character wildcard and can incorrectly match names;
update the WHERE clause used in the self._con.execute(...) query that populates
table_rows to exclude DLT tables by using an explicit prefix test such as
starts_with(table_name, '_dlt_') (i.e. WHERE database_name = ? AND NOT
starts_with(table_name, '_dlt_')) or equivalently LEFT(table_name, 5) != '_dlt_'
so only names starting with the literal "_dlt_" are excluded; keep the rest of
the query (ORDER BY schema_name, table_name) intact and continue passing
[self._catalog] as the parameter.
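Why `LIKE '_dlt_%'` over-matches: in SQL LIKE, `_` is a single-character wildcard, not a literal underscore. A small Python model of LIKE makes the false positive concrete:

```python
import re


def sql_like(value: str, pattern: str) -> bool:
    """Minimal model of SQL LIKE: '%' matches any run, '_' any single char."""
    regex = re.escape(pattern).replace("%", ".*").replace("_", ".")
    return re.fullmatch(regex, value) is not None


# '_dlt_%' matches "xdlty_tmp" because each '_' can match any character,
# while an exact prefix test only matches names that literally start "_dlt_".
assert sql_like("_dlt_loads", "_dlt_%")
assert sql_like("xdlty_tmp", "_dlt_%")       # false positive under LIKE
assert not "xdlty_tmp".startswith("_dlt_")   # exact prefix test is correct
```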
---
Nitpick comments:
In `@wren/src/wren/dlt_introspect.py`:
- Around line 108-111: Remove the broad try/except around parse_type in
dlt_introspect: instead of catching all exceptions and defaulting to raw_type,
call normalized = parse_type(raw_type, "duckdb") directly and let parse_type
perform its own fallback behavior; remove the surrounding exception handling so
unrelated sqlglot errors surface rather than being swallowed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 505503d0-c298-4848-89c1-b39ed7b5c6d0
📒 Files selected for processing (4)
- wren/src/wren/context.py
- wren/src/wren/context_cli.py
- wren/src/wren/dlt_introspect.py
- wren/tests/unit/test_dlt_introspect.py
- dlt_introspect: escape path/catalog in ATTACH SQL to handle single quotes in paths and double quotes in catalog aliases
- dlt_introspect: replace LIKE '_dlt_%' with starts_with(table_name, '_dlt_') to avoid underscore wildcard mismatches
- dlt_introspect: remove the redundant try/except around parse_type(), since parse_type() already falls back to the raw string on parse errors
- context: add a duplicate-table-name guard before generating model files to fail fast when the same table name appears in multiple schemas
- context_cli: fail fast when --profile is used without --from-dlt

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o
Move all local imports in test_dlt_introspect.py to module level to satisfy PLC0415. Apply ruff format to all changed files. https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wren/src/wren/dlt_introspect.py`:
- Around line 55-62: The ATTACH alias uses the raw filename stem (self._catalog)
which can collide with DuckDB reserved names; change the logic that sets
self._catalog to a safe prefixed alias (e.g., "dlt_" + stem) and ensure the
DuckDB connection (self._con = duckdb.connect()) is created before any execute
calls; keep escaping for the path and catalog name and use the prefixed
self._catalog when calling self._con.execute("ATTACH ... AS
\"{escaped_catalog}\" (READ_ONLY)").
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8e9cc38a-0d6d-4d71-b62a-dc60bf81770e
📒 Files selected for processing (4)
- wren/src/wren/context.py
- wren/src/wren/context_cli.py
- wren/src/wren/dlt_introspect.py
- wren/tests/unit/test_dlt_introspect.py
🚧 Files skipped from review as they are similar to previous changes (1)
- wren/src/wren/context_cli.py
DuckDB reserves 'memory', 'system', and 'temp' as catalog names. Using a filename stem like 'temp.duckdb' or 'memory.duckdb' would cause ATTACH to fail. Prefix the internal alias with 'dlt_' (stored as self._alias) while keeping self._catalog as the bare stem for table_reference output in the generated YAML project. https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o
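The aliasing scheme from this commit can be sketched as follows. The function name and return shape are hypothetical; per the commit message, the code stores these two values as `self._catalog` and `self._alias`.

```python
from pathlib import Path


def make_attach_alias(db_path: str) -> tuple[str, str]:
    """Return (catalog, alias) for a dlt DuckDB file.

    catalog: the bare filename stem, used for table_reference output in the
             generated YAML project.
    alias:   a 'dlt_'-prefixed name that is safe to ATTACH even when the stem
             collides with a DuckDB-reserved catalog (memory, system, temp).
    """
    stem = Path(db_path).stem
    return stem, f"dlt_{stem}"
```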
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wren/src/wren/dlt_introspect.py`:
- Around line 36-41: The parent-resolution logic currently matches parents by
bare table name and assumes matched parents have a DLT id column; update the
matching to require schema+name equality (use DltTable.schema and DltTable.name
together) rather than falling back to name-only matches, and if schema is
omitted by the child only allow a name-only match when the candidate is unique
across schemas in the same catalog; also replace any hard assumption that a
matched parent has "_dlt_id" with an explicit check for that column on the
candidate (inspect its DltColumn entries) before setting has_dlt_parent_id or
emitting relationships. Refer to the DltTable, DltColumn and has_dlt_parent_id
usages and update the parent-match loops (the blocks that search tables by name)
to enforce schema-aware matching and explicit column existence checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d0ef9f89-1533-4bf7-862f-c1b30df064f6
📒 Files selected for processing (1)
wren/src/wren/dlt_introspect.py
class DltTable:
    catalog: str  # attached database alias (filename stem)
    schema: str  # DuckDB schema (e.g. "main", "hubspot_crm")
    name: str  # table name
    columns: list[DltColumn] = field(default_factory=list)
    has_dlt_parent_id: bool = False
Don't resolve dlt parents from bare table names alone.
Line 144 throws away schema, so a child like sales.orders__items can bind to crm.orders if that's the only orders table in the file. Line 175 also assumes the matched parent has _dlt_id, which is not true for non-dlt tables that this introspector still imports. Since wren/src/wren/context.py:201-291 only rejects duplicate table names, a unique-but-wrong match can still make it into relationships.yml.
Proposed fix
@dataclass
class DltTable:
catalog: str # attached database alias (filename stem)
schema: str # DuckDB schema (e.g. "main", "hubspot_crm")
name: str # table name
columns: list[DltColumn] = field(default_factory=list)
+ has_dlt_id: bool = False
has_dlt_parent_id: bool = False
@@
- has_parent_id = False
+ has_dlt_id = False
+ has_parent_id = False
columns: list[DltColumn] = []
for col_name, raw_type, is_nullable in col_rows:
+ if col_name == "_dlt_id":
+ has_dlt_id = True
if col_name == "_dlt_parent_id":
has_parent_id = True
if col_name in _DLT_INTERNAL_COLUMNS:
continue
@@
DltTable(
catalog=self._catalog,
schema=schema,
name=table_name,
columns=columns,
+ has_dlt_id=has_dlt_id,
has_dlt_parent_id=has_parent_id,
)
)
@@
- table_names = {t.name for t in tables}
+ parent_lookup = {
+ (t.schema, t.name)
+ for t in tables
+ if t.has_dlt_id
+ }
relationships: list[dict] = []
@@
for i in range(len(parts) - 1, 0, -1):
candidate = "__".join(parts[:i])
- if candidate in table_names:
+ if (table.schema, candidate) in parent_lookup:
parent_name = candidate
break
@@
logger.warning(
- "Child table '%s' has _dlt_parent_id but no matching parent "
- "found — skipping relationship",
- table.name,
+ "Child table '%s.%s' has _dlt_parent_id but no matching "
+ "same-schema parent with _dlt_id was found; skipping relationship",
+ table.schema,
+ table.name,
    )

Also applies to: 108-123, 144-176
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@wren/src/wren/dlt_introspect.py` around lines 36 - 41, The parent-resolution
logic currently matches parents by bare table name and assumes matched parents
have a DLT id column; update the matching to require schema+name equality (use
DltTable.schema and DltTable.name together) rather than falling back to
name-only matches, and if schema is omitted by the child only allow a name-only
match when the candidate is unique across schemas in the same catalog; also
replace any hard assumption that a matched parent has "_dlt_id" with an explicit
check for that column on the candidate (inspect its DltColumn entries) before
setting has_dlt_parent_id or emitting relationships. Refer to the DltTable,
DltColumn and has_dlt_parent_id usages and update the parent-match loops (the
blocks that search tables by name) to enforce schema-aware matching and explicit
column existence checks.
Summary
Add support for importing dlt-produced DuckDB files into Wren projects. This enables users to automatically generate Wren v2 project structure (models, relationships, and configuration) from an existing dlt pipeline output.
Key Changes
New `dlt_introspect.py` module: implements a `DltIntrospector` class that:
- filters dlt internal tables (`_dlt_loads`, `_dlt_pipeline_state`, etc.) and columns (`_dlt_id`, `_dlt_parent_id`, `_dlt_load_id`, `_dlt_list_idx`)
- detects relationships from dlt's `__` naming convention and the `_dlt_parent_id` column

New `convert_dlt_to_project()` function in `context.py`:
- generates `wren_project.yml` with DuckDB data source configuration
- generates `relationships.yml` with detected ONE_TO_MANY relationships
- generates `instructions.md` with generation metadata

CLI enhancement in `context_cli.py`:
- adds the `--from-dlt` flag to the `context init` command
- adds the `--profile` flag to optionally create a named DuckDB connection profile
- enforces mutual exclusion between `--from-mdl` and `--from-dlt`

Comprehensive test suite (`test_dlt_introspect.py`).

Implementation Details
- uses the `duckdb_tables()` and `duckdb_columns()` system functions for schema introspection
- parses `__`-delimited table names
- handles orphan children (tables with `_dlt_parent_id` but no matching parent)
- normalizes types via the `parse_type()` utility with the duckdb dialect

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o
Summary by CodeRabbit
New Features
Validation / Bug Fixes
Tests