feat(wren): add dlt DuckDB introspection and project generation#1532

Closed
goldmedal wants to merge 5 commits into Canner:main from goldmedal:claude/add-from-dlt-option-KjOeC

Conversation


goldmedal (Contributor) commented Apr 8, 2026

Summary

Add support for importing dlt-produced DuckDB files into Wren projects. This enables users to automatically generate Wren v2 project structure (models, relationships, and configuration) from an existing dlt pipeline output.

Key Changes

  • New dlt_introspect.py module: Implements DltIntrospector class that:

    • Connects READ_ONLY to a dlt-produced DuckDB file
    • Discovers user tables across all schemas
    • Filters out dlt internal tables (_dlt_loads, _dlt_pipeline_state, etc.) and columns (_dlt_id, _dlt_parent_id, _dlt_load_id, _dlt_list_idx)
    • Detects parent-child relationships using dlt's __ naming convention and _dlt_parent_id column
    • Normalizes column types using sqlglot
  • New convert_dlt_to_project() function in context.py:

    • Generates complete Wren v2 project files from introspected dlt database
    • Creates wren_project.yml with DuckDB data source configuration
    • Generates model metadata files with proper table references and column definitions
    • Produces relationships.yml with detected ONE_TO_MANY relationships
    • Includes instructions.md with generation metadata
  • CLI enhancement in context_cli.py:

    • Added --from-dlt flag to context init command
    • Added --profile flag to optionally create a named DuckDB connection profile
    • Enforces mutual exclusivity between --from-mdl and --from-dlt
    • Provides summary output showing number of models and relationships imported
    • Includes proper error handling for missing files and invalid inputs
  • Comprehensive test suite (test_dlt_introspect.py):

    • 30+ unit tests covering table discovery, column filtering, type normalization, relationship detection
    • Tests for edge cases (empty databases, orphaned child tables, multiple schemas)
    • CLI integration tests validating end-to-end workflow
    • Tests for mutual exclusivity and force overwrite behavior

Implementation Details

  • Uses DuckDB's duckdb_tables() and duckdb_columns() system functions for schema introspection
  • Attaches the DuckDB file READ_ONLY to avoid accidental modifications
  • Parent-child relationship detection uses longest-prefix matching on __-delimited table names
  • Logs warnings for orphaned child tables (those with _dlt_parent_id but no matching parent)
  • Type normalization leverages existing parse_type() utility with duckdb dialect

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o

Summary by CodeRabbit

  • New Features

    • Import dlt-produced DuckDB files as Wren v2 projects via a new CLI option, generating project manifest, per-model metadata, relationships, and an instructions file with source path and timestamp
    • Optionally create and activate a DuckDB connection profile during import
  • Validation / Bug Fixes

    • Error on ambiguous duplicate table names across schemas to prevent overwrites
    • CLI rejects incompatible flags and validates required flag combinations; reports counts of imported models and relationships
  • Tests

    • Added comprehensive unit and CLI tests covering introspection, conversion, and error cases

Adds `wren context init --from-dlt <path.duckdb>` which introspects a
dlt-produced DuckDB file and auto-generates a complete Wren v2 YAML
project — models, relationships, wren_project.yml, and instructions.md.

New files:
- `src/wren/dlt_introspect.py` — DltIntrospector class using
  duckdb_tables()/duckdb_columns() for schema discovery; filters dlt
  internal tables (_dlt_loads, _dlt_pipeline_state, _dlt_version) and
  columns (_dlt_id, _dlt_parent_id, _dlt_load_id, _dlt_list_idx);
  detects ONE_TO_MANY parent-child relationships from dlt's __ naming
  convention and _dlt_parent_id columns.
- `tests/unit/test_dlt_introspect.py` — 32 unit + CLI tests covering
  table discovery, column filtering, type normalization, relationship
  detection, project file generation, and CLI integration.

Modified:
- `src/wren/context.py` — adds convert_dlt_to_project() which bridges
  DltIntrospector output to the ProjectFile list.
- `src/wren/context_cli.py` — adds --from-dlt and --profile options to
  init() with mutual-exclusion guard against --from-mdl.

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

Adds DuckDB (dlt) introspection, a conversion function to produce Wren v2 project files, CLI import options, and tests; includes relationship detection, duplicate-table-name validation, and generation of project YAML, per-model metadata, relationships.yml, and instructions.md.

Changes

Changes by cohort / file(s):

  • DuckDB dlt Introspection — wren/src/wren/dlt_introspect.py: New module: DltIntrospector, DltTable, DltColumn. Attaches a DuckDB file read-only, enumerates user tables/columns (filters dlt internals), normalizes types, flags _dlt_parent_id, and emits parent-child relationships with join conditions.
  • Project Conversion Function — wren/src/wren/context.py: Added convert_dlt_to_project(duckdb_path, *, project_name=None) -> list[ProjectFile]. Builds explicit wren_project.yml, per-model models/<name>/metadata.yml, relationships.yml, and instructions.md; errors on duplicate table names across different schemas.
  • CLI Integration — wren/src/wren/context_cli.py: Extended wren context init with --from-dlt and --profile options, validations (mutual exclusion with --from-mdl, --profile requires --from-dlt), invokes conversion, writes files, reports counts, and can create/activate a DuckDB profile.
  • Tests — wren/tests/unit/test_dlt_introspect.py: New unit tests covering DltIntrospector, convert_dlt_to_project, and context init --from-dlt: discovery, dlt-column filtering, type normalization, relationship detection/orphans, generated files, CLI error/force behaviors.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as "wren context init"
    participant Converter as "convert_dlt_to_project()"
    participant Introspector as "DltIntrospector"
    participant DuckDB as "DuckDB File"
    participant Writer as "write_project_files()"

    User->>CLI: run --from-dlt <path> [--profile <name>]
    CLI->>Converter: convert_dlt_to_project(path, project_name?)
    Converter->>Introspector: __init__ / introspect()
    Introspector->>DuckDB: ATTACH read-only & query tables/columns
    DuckDB-->>Introspector: table/column metadata
    Introspector->>Introspector: filter internals, normalize types, detect relationships
    Introspector-->>Converter: (tables[], relationships[])
    Converter->>Converter: build ProjectFile objects (project.yml, models/*/metadata.yml, relationships.yml, instructions.md)
    Converter-->>CLI: list[ProjectFile]
    CLI->>Writer: write_project_files(files)
    Writer-->>CLI: success
    CLI-->>User: print summary (models N, relationships M) / profile info

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
I nosed a DuckDB beneath moonlit beams,
I traced the tables, stitched them into schemes.
I hopped through columns, linked parent and child,
Wren files sprouted neat — my work made them tiled. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 27.66%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the pull request title accurately reflects the main change, adding dlt DuckDB introspection and project generation capability to Wren.


goldmedal changed the title from "Add dlt DuckDB introspection and project generation" to "feat(wren): add dlt DuckDB introspection and project generation" on Apr 8, 2026
coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
wren/src/wren/dlt_introspect.py (1)

108-111: Let parse_type() handle its own fallback.

Line 109 already calls a helper that falls back to the raw type on parse errors. The blanket except Exception here only hides unrelated sqlglot bugs and makes bad mappings look successful.

Proposed simplification
-                try:
-                    normalized = parse_type(raw_type, "duckdb")
-                except Exception:
-                    normalized = raw_type
+                normalized = parse_type(raw_type, "duckdb")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wren/src/wren/dlt_introspect.py` around lines 108 - 111, Remove the broad
try/except around parse_type in dlt_introspect: instead of catching all
exceptions and defaulting to raw_type, call normalized = parse_type(raw_type,
"duckdb") directly and let parse_type perform its own fallback behavior; remove
the surrounding exception handling so unrelated sqlglot errors surface rather
than being swallowed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wren/src/wren/context_cli.py`:
- Around line 34-44: The CLI currently allows --profile without --from-dlt;
update the validation in the context init command to fail fast when profile is
provided but from_dlt is None by raising a Typer/CLI error (e.g.,
typer.BadParameter or typer.Exit with an error message). Locate the parameters
from_dlt and profile in the context_cli command and modify the existing
validation block that checks those options so it enforces "if profile and not
from_dlt: raise an error with a clear message" to prevent scaffolding an
unrelated project.

In `@wren/src/wren/context.py`:
- Around line 242-266: The code in the for table in tables loop builds model
output files using only table.name, which causes collisions when the same table
name exists in multiple schemas; update the logic that creates
ProjectFile(relative_path=f"models/{table.name}/metadata.yml") to include the
schema (e.g., use table.schema + table.name or a schema-qualified identifier) or
detect duplicates and raise an error; also ensure any references written to
relationships.yml (where bare names are used) are schema-qualified the same way
so names remain unambiguous; modify the model creation and file naming in this
block and add a pre-check for duplicate (schema,name) vs name-only collisions to
fail fast if you prefer strictness.

In `@wren/src/wren/dlt_introspect.py`:
- Around line 60-62: The ATTACH SQL is built by interpolating self._path and
self._catalog directly into the string; escape these inputs before building the
SQL to avoid syntax errors and injection (escape single quotes in self._path by
doubling them, and escape double quotes in self._catalog by doubling them), then
construct the ATTACH statement using the escaped values passed to
self._con.execute (reference the call where self._con.execute(...) is invoked
and the attributes self._path and self._catalog).
- Around line 79-85: The SQL filter on table_name currently uses LIKE '_dlt_%'
which treats '_' as a single-character wildcard and can incorrectly match names;
update the WHERE clause used in the self._con.execute(...) query that populates
table_rows to exclude DLT tables by using an explicit prefix test such as
starts_with(table_name, '_dlt_') (i.e. WHERE database_name = ? AND NOT
starts_with(table_name, '_dlt_')) or equivalently LEFT(table_name, 5) != '_dlt_'
so only names starting with the literal "_dlt_" are excluded; keep the rest of
the query (ORDER BY schema_name, table_name) intact and continue passing
[self._catalog] as the parameter.

---

Nitpick comments:
In `@wren/src/wren/dlt_introspect.py`:
- Around line 108-111: Remove the broad try/except around parse_type in
dlt_introspect: instead of catching all exceptions and defaulting to raw_type,
call normalized = parse_type(raw_type, "duckdb") directly and let parse_type
perform its own fallback behavior; remove the surrounding exception handling so
unrelated sqlglot errors surface rather than being swallowed.
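The LIKE-wildcard issue flagged in the inline comments above is easy to reproduce: in SQL LIKE patterns, `_` matches any single character, so `'_dlt_%'` also matches names that merely resemble the prefix. This demo uses stdlib sqlite3 (whose `_`/`%` LIKE wildcards behave the same way as DuckDB's); the table and column names are made up for illustration.

```python
# Demonstrates why LIKE '_dlt_%' over-matches: `_` is a one-character
# wildcard, so 'xdltXstate' matches even though it is not a dlt table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?)",
    [("_dlt_loads",), ("xdltXstate",), ("orders",)],
)

like_matches = [
    r[0]
    for r in con.execute("SELECT name FROM t WHERE name LIKE '_dlt_%' ORDER BY name")
]
# An explicit prefix test (DuckDB offers starts_with(); substr works everywhere)
# only excludes names that literally begin with "_dlt_".
prefix_matches = [
    r[0]
    for r in con.execute(
        "SELECT name FROM t WHERE substr(name, 1, 5) = '_dlt_' ORDER BY name"
    )
]
```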
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 505503d0-c298-4848-89c1-b39ed7b5c6d0

📥 Commits

Reviewing files that changed from the base of the PR and between 9d2639f and 9d03760.

📒 Files selected for processing (4)
  • wren/src/wren/context.py
  • wren/src/wren/context_cli.py
  • wren/src/wren/dlt_introspect.py
  • wren/tests/unit/test_dlt_introspect.py

claude added 2 commits April 8, 2026 15:46
- dlt_introspect: escape path/catalog in ATTACH SQL to handle single
  quotes in paths and double quotes in catalog aliases
- dlt_introspect: replace LIKE '_dlt_%' with starts_with(table_name,
  '_dlt_') to avoid underscore wildcard mismatches
- dlt_introspect: remove redundant try/except around parse_type() since
  parse_type() already falls back to the raw string on parse errors
- context: add duplicate-table-name guard before generating model files
  to fail fast when the same table name appears in multiple schemas
- context_cli: fail fast when --profile is used without --from-dlt

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o
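The quoting fix in the commit above can be sketched as follows; the function name is illustrative, but the escaping rules (double single quotes inside a SQL string literal, double double quotes inside a quoted identifier) are standard SQL.

```python
# Sketch of building a safely quoted ATTACH statement for DuckDB:
# the file path goes in a single-quoted string literal, the catalog
# alias in a double-quoted identifier.

def build_attach_sql(path: str, alias: str) -> str:
    escaped_path = path.replace("'", "''")    # escape for SQL string literal
    escaped_alias = alias.replace('"', '""')  # escape for quoted identifier
    return f"ATTACH '{escaped_path}' AS \"{escaped_alias}\" (READ_ONLY)"
```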
Move all local imports in test_dlt_introspect.py to module level to
satisfy PLC0415. Apply ruff format to all changed files.

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wren/src/wren/dlt_introspect.py`:
- Around line 55-62: The ATTACH alias uses the raw filename stem (self._catalog)
which can collide with DuckDB reserved names; change the logic that sets
self._catalog to a safe prefixed alias (e.g., "dlt_" + stem) and ensure the
DuckDB connection (self._con = duckdb.connect()) is created before any execute
calls; keep escaping for the path and catalog name and use the prefixed
self._catalog when calling self._con.execute("ATTACH ... AS
\"{escaped_catalog}\" (READ_ONLY)").

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8e9cc38a-0d6d-4d71-b62a-dc60bf81770e

📥 Commits

Reviewing files that changed from the base of the PR and between 9d03760 and daa7647.

📒 Files selected for processing (4)
  • wren/src/wren/context.py
  • wren/src/wren/context_cli.py
  • wren/src/wren/dlt_introspect.py
  • wren/tests/unit/test_dlt_introspect.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • wren/src/wren/context_cli.py

claude added 2 commits April 8, 2026 15:59
DuckDB reserves 'memory', 'system', and 'temp' as catalog names. Using
a filename stem like 'temp.duckdb' or 'memory.duckdb' would cause ATTACH
to fail. Prefix the internal alias with 'dlt_' (stored as self._alias)
while keeping self._catalog as the bare stem for table_reference output
in the generated YAML project.

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o
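The alias-prefixing fix in the commit above can be sketched as a small helper; the function name is illustrative, while the reserved catalog names ('memory', 'system', 'temp') and the 'dlt_' prefix come from the commit message.

```python
# Sketch of deriving a collision-safe ATTACH alias from a DuckDB filename:
# the bare stem stays as the user-facing catalog for table_reference output,
# while the prefixed alias avoids DuckDB's reserved catalog names
# ('memory', 'system', 'temp').
from pathlib import Path

def make_attach_alias(duckdb_path: str) -> tuple[str, str]:
    stem = Path(duckdb_path).stem
    return stem, f"dlt_{stem}"
```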

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wren/src/wren/dlt_introspect.py`:
- Around line 36-41: The parent-resolution logic currently matches parents by
bare table name and assumes matched parents have a DLT id column; update the
matching to require schema+name equality (use DltTable.schema and DltTable.name
together) rather than falling back to name-only matches, and if schema is
omitted by the child only allow a name-only match when the candidate is unique
across schemas in the same catalog; also replace any hard assumption that a
matched parent has "_dlt_id" with an explicit check for that column on the
candidate (inspect its DltColumn entries) before setting has_dlt_parent_id or
emitting relationships. Refer to the DltTable, DltColumn and has_dlt_parent_id
usages and update the parent-match loops (the blocks that search tables by name)
to enforce schema-aware matching and explicit column existence checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0ef9f89-1533-4bf7-862f-c1b30df064f6

📥 Commits

Reviewing files that changed from the base of the PR and between 1ae1027 and 3199637.

📒 Files selected for processing (1)
  • wren/src/wren/dlt_introspect.py

Comment on lines +36 to +41

    class DltTable:
        catalog: str  # attached database alias (filename stem)
        schema: str  # DuckDB schema (e.g. "main", "hubspot_crm")
        name: str  # table name
        columns: list[DltColumn] = field(default_factory=list)
        has_dlt_parent_id: bool = False

⚠️ Potential issue | 🟠 Major

Don't resolve dlt parents from bare table names alone.

Line 144 throws away schema, so a child like sales.orders__items can bind to crm.orders if that's the only orders table in the file. Line 175 also assumes the matched parent has _dlt_id, which is not true for non-dlt tables that this introspector still imports. Since wren/src/wren/context.py:201-291 only rejects duplicate table names, a unique-but-wrong match can still make it into relationships.yml.

Proposed fix
 @dataclass
 class DltTable:
     catalog: str  # attached database alias (filename stem)
     schema: str  # DuckDB schema (e.g. "main", "hubspot_crm")
     name: str  # table name
     columns: list[DltColumn] = field(default_factory=list)
+    has_dlt_id: bool = False
     has_dlt_parent_id: bool = False
@@
-            has_parent_id = False
+            has_dlt_id = False
+            has_parent_id = False
             columns: list[DltColumn] = []
             for col_name, raw_type, is_nullable in col_rows:
+                if col_name == "_dlt_id":
+                    has_dlt_id = True
                 if col_name == "_dlt_parent_id":
                     has_parent_id = True
                 if col_name in _DLT_INTERNAL_COLUMNS:
                     continue
@@
                 DltTable(
                     catalog=self._catalog,
                     schema=schema,
                     name=table_name,
                     columns=columns,
+                    has_dlt_id=has_dlt_id,
                     has_dlt_parent_id=has_parent_id,
                 )
             )
@@
-        table_names = {t.name for t in tables}
+        parent_lookup = {
+            (t.schema, t.name)
+            for t in tables
+            if t.has_dlt_id
+        }
         relationships: list[dict] = []
@@
             for i in range(len(parts) - 1, 0, -1):
                 candidate = "__".join(parts[:i])
-                if candidate in table_names:
+                if (table.schema, candidate) in parent_lookup:
                     parent_name = candidate
                     break
@@
                 logger.warning(
-                    "Child table '%s' has _dlt_parent_id but no matching parent "
-                    "found — skipping relationship",
-                    table.name,
+                    "Child table '%s.%s' has _dlt_parent_id but no matching "
+                    "same-schema parent with _dlt_id was found; skipping relationship",
+                    table.schema,
+                    table.name,
                 )

Also applies to: 108-123, 144-176

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wren/src/wren/dlt_introspect.py` around lines 36 - 41, The parent-resolution
logic currently matches parents by bare table name and assumes matched parents
have a DLT id column; update the matching to require schema+name equality (use
DltTable.schema and DltTable.name together) rather than falling back to
name-only matches, and if schema is omitted by the child only allow a name-only
match when the candidate is unique across schemas in the same catalog; also
replace any hard assumption that a matched parent has "_dlt_id" with an explicit
check for that column on the candidate (inspect its DltColumn entries) before
setting has_dlt_parent_id or emitting relationships. Refer to the DltTable,
DltColumn and has_dlt_parent_id usages and update the parent-match loops (the
blocks that search tables by name) to enforce schema-aware matching and explicit
column existence checks.

goldmedal marked this pull request as draft Apr 9, 2026
goldmedal closed this Apr 9, 2026