feat(wren): add dlt DuckDB introspection and project generation#1532

Closed
goldmedal wants to merge 5 commits into Canner:main from goldmedal:claude/add-from-dlt-option-KjOeC

Conversation


goldmedal (Contributor) commented Apr 8, 2026

Summary

Add support for importing dlt-produced DuckDB files into Wren projects. This enables users to automatically generate Wren v2 project structure (models, relationships, and configuration) from an existing dlt pipeline output.

Key Changes

  • New dlt_introspect.py module: Implements DltIntrospector class that:

    • Connects READ_ONLY to a dlt-produced DuckDB file
    • Discovers user tables across all schemas
    • Filters out dlt internal tables (_dlt_loads, _dlt_pipeline_state, etc.) and columns (_dlt_id, _dlt_parent_id, _dlt_load_id, _dlt_list_idx)
    • Detects parent-child relationships using dlt's __ naming convention and _dlt_parent_id column
    • Normalizes column types using sqlglot
  • New convert_dlt_to_project() function in context.py:

    • Generates complete Wren v2 project files from introspected dlt database
    • Creates wren_project.yml with DuckDB data source configuration
    • Generates model metadata files with proper table references and column definitions
    • Produces relationships.yml with detected ONE_TO_MANY relationships
    • Includes instructions.md with generation metadata
  • CLI enhancement in context_cli.py:

    • Added --from-dlt flag to context init command
    • Added --profile flag to optionally create a named DuckDB connection profile
    • Enforces mutual exclusivity between --from-mdl and --from-dlt
    • Provides summary output showing number of models and relationships imported
    • Includes proper error handling for missing files and invalid inputs
  • Comprehensive test suite (test_dlt_introspect.py):

    • 30+ unit tests covering table discovery, column filtering, type normalization, relationship detection
    • Tests for edge cases (empty databases, orphaned child tables, multiple schemas)
    • CLI integration tests validating end-to-end workflow
    • Tests for mutual exclusivity and force overwrite behavior

Implementation Details

  • Uses DuckDB's duckdb_tables() and duckdb_columns() system functions for schema introspection
  • Attaches the DuckDB file READ_ONLY to avoid accidental modifications
  • Parent-child relationship detection uses longest-prefix matching on __-delimited table names
  • Logs warnings for orphaned child tables (those with _dlt_parent_id but no matching parent)
  • Type normalization leverages existing parse_type() utility with duckdb dialect

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o

Summary by CodeRabbit

  • New Features

    • Import dlt-produced DuckDB files as Wren v2 projects via a new CLI option, generating project manifest, per-model metadata, relationships, and an instructions file with source path and timestamp
    • Optionally create and activate a DuckDB connection profile during import
  • Validation / Bug Fixes

    • Error on ambiguous duplicate table names across schemas to prevent overwrites
    • CLI rejects incompatible flags and validates required flag combinations; reports counts of imported models and relationships
  • Tests

    • Added comprehensive unit and CLI tests covering introspection, conversion, and error cases

Adds `wren context init --from-dlt <path.duckdb>` which introspects a
dlt-produced DuckDB file and auto-generates a complete Wren v2 YAML
project — models, relationships, wren_project.yml, and instructions.md.

New files:
- `src/wren/dlt_introspect.py` — DltIntrospector class using
  duckdb_tables()/duckdb_columns() for schema discovery; filters dlt
  internal tables (_dlt_loads, _dlt_pipeline_state, _dlt_version) and
  columns (_dlt_id, _dlt_parent_id, _dlt_load_id, _dlt_list_idx);
  detects ONE_TO_MANY parent-child relationships from dlt's __ naming
  convention and _dlt_parent_id columns.
- `tests/unit/test_dlt_introspect.py` — 32 unit + CLI tests covering
  table discovery, column filtering, type normalization, relationship
  detection, project file generation, and CLI integration.

Modified:
- `src/wren/context.py` — adds convert_dlt_to_project() which bridges
  DltIntrospector output to the ProjectFile list.
- `src/wren/context_cli.py` — adds --from-dlt and --profile options to
  init() with mutual-exclusion guard against --from-mdl.

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

Adds DuckDB (dlt) introspection, a conversion function to produce Wren v2 project files, CLI import options, and tests; includes relationship detection, duplicate-table-name validation, and generation of project YAML, per-model metadata, relationships.yml, and instructions.md.

Changes

Changes by cohort / file(s):

  • DuckDB dlt Introspection — wren/src/wren/dlt_introspect.py: New module: DltIntrospector, DltTable, DltColumn. Attaches a DuckDB file read-only, enumerates user tables/columns (filters dlt internals), normalizes types, flags _dlt_parent_id, and emits parent-child relationships with join conditions.
  • Project Conversion Function — wren/src/wren/context.py: Added convert_dlt_to_project(duckdb_path, *, project_name=None) -> list[ProjectFile]. Builds explicit wren_project.yml, per-model models/<name>/metadata.yml, relationships.yml, and instructions.md; errors on duplicate table names across different schemas.
  • CLI Integration — wren/src/wren/context_cli.py: Extended wren context init with --from-dlt and --profile options, validations (mutual exclusion with --from-mdl, --profile requires --from-dlt), invokes conversion, writes files, reports counts, and can create/activate a DuckDB profile.
  • Tests — wren/tests/unit/test_dlt_introspect.py: New unit tests covering DltIntrospector, convert_dlt_to_project, and context init --from-dlt: discovery, dlt-column filtering, type normalization, relationship detection/orphans, generated files, CLI error/force behaviors.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as "wren context init"
    participant Converter as "convert_dlt_to_project()"
    participant Introspector as "DltIntrospector"
    participant DuckDB as "DuckDB File"
    participant Writer as "write_project_files()"

    User->>CLI: run --from-dlt <path> [--profile <name>]
    CLI->>Converter: convert_dlt_to_project(path, project_name?)
    Converter->>Introspector: __init__ / introspect()
    Introspector->>DuckDB: ATTACH read-only & query tables/columns
    DuckDB-->>Introspector: table/column metadata
    Introspector->>Introspector: filter internals, normalize types, detect relationships
    Introspector-->>Converter: (tables[], relationships[])
    Converter->>Converter: build ProjectFile objects (project.yml, models/*/metadata.yml, relationships.yml, instructions.md)
    Converter-->>CLI: list[ProjectFile]
    CLI->>Writer: write_project_files(files)
    Writer-->>CLI: success
    CLI-->>User: print summary (models N, relationships M) / profile info

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
I nosed a DuckDB beneath moonlit beams,
I traced the tables, stitched them into schemes.
I hopped through columns, linked parent and child,
Wren files sprouted neat — my work made them tiled. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 27.66%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the pull request title accurately reflects the main change, adding dlt DuckDB introspection and project generation capability to Wren.


goldmedal changed the title from "Add dlt DuckDB introspection and project generation" to "feat(wren): add dlt DuckDB introspection and project generation" on Apr 8, 2026
coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
wren/src/wren/dlt_introspect.py (1)

108-111: Let parse_type() handle its own fallback.

Line 109 already calls a helper that falls back to the raw type on parse errors. The blanket except Exception here only hides unrelated sqlglot bugs and makes bad mappings look successful.

Proposed simplification
-                try:
-                    normalized = parse_type(raw_type, "duckdb")
-                except Exception:
-                    normalized = raw_type
+                normalized = parse_type(raw_type, "duckdb")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wren/src/wren/dlt_introspect.py` around lines 108 - 111, Remove the broad
try/except around parse_type in dlt_introspect: instead of catching all
exceptions and defaulting to raw_type, call normalized = parse_type(raw_type,
"duckdb") directly and let parse_type perform its own fallback behavior; remove
the surrounding exception handling so unrelated sqlglot errors surface rather
than being swallowed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wren/src/wren/context_cli.py`:
- Around line 34-44: The CLI currently allows --profile without --from-dlt;
update the validation in the context init command to fail fast when profile is
provided but from_dlt is None by raising a Typer/CLI error (e.g.,
typer.BadParameter or typer.Exit with an error message). Locate the parameters
from_dlt and profile in the context_cli command and modify the existing
validation block that checks those options so it enforces "if profile and not
from_dlt: raise an error with a clear message" to prevent scaffolding an
unrelated project.

In `@wren/src/wren/context.py`:
- Around line 242-266: The code in the for table in tables loop builds model
output files using only table.name, which causes collisions when the same table
name exists in multiple schemas; update the logic that creates
ProjectFile(relative_path=f"models/{table.name}/metadata.yml") to include the
schema (e.g., use table.schema + table.name or a schema-qualified identifier) or
detect duplicates and raise an error; also ensure any references written to
relationships.yml (where bare names are used) are schema-qualified the same way
so names remain unambiguous; modify the model creation and file naming in this
block and add a pre-check for duplicate (schema,name) vs name-only collisions to
fail fast if you prefer strictness.

In `@wren/src/wren/dlt_introspect.py`:
- Around line 60-62: The ATTACH SQL is built by interpolating self._path and
self._catalog directly into the string; escape these inputs before building the
SQL to avoid syntax errors and injection (escape single quotes in self._path by
doubling them, and escape double quotes in self._catalog by doubling them), then
construct the ATTACH statement using the escaped values passed to
self._con.execute (reference the call where self._con.execute(...) is invoked
and the attributes self._path and self._catalog).
- Around line 79-85: The SQL filter on table_name currently uses LIKE '_dlt_%'
which treats '_' as a single-character wildcard and can incorrectly match names;
update the WHERE clause used in the self._con.execute(...) query that populates
table_rows to exclude DLT tables by using an explicit prefix test such as
starts_with(table_name, '_dlt_') (i.e. WHERE database_name = ? AND NOT
starts_with(table_name, '_dlt_')) or equivalently LEFT(table_name, 5) != '_dlt_'
so only names starting with the literal "_dlt_" are excluded; keep the rest of
the query (ORDER BY schema_name, table_name) intact and continue passing
[self._catalog] as the parameter.

---

Nitpick comments:
In `@wren/src/wren/dlt_introspect.py`:
- Around line 108-111: Remove the broad try/except around parse_type in
dlt_introspect: instead of catching all exceptions and defaulting to raw_type,
call normalized = parse_type(raw_type, "duckdb") directly and let parse_type
perform its own fallback behavior; remove the surrounding exception handling so
unrelated sqlglot errors surface rather than being swallowed.
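The LIKE-wildcard issue flagged in the inline comments above is easy to reproduce: in SQL LIKE patterns, `_` matches any single character, so `'_dlt_%'` also matches names that merely resemble the prefix. This demo uses stdlib sqlite3 (whose `_`/`%` LIKE wildcards behave the same way as DuckDB's); the table and column names are made up for illustration.

```python
# Demonstrates why LIKE '_dlt_%' over-matches: `_` is a one-character
# wildcard, so 'xdltXstate' matches even though it is not a dlt table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?)",
    [("_dlt_loads",), ("xdltXstate",), ("orders",)],
)

like_matches = [
    r[0]
    for r in con.execute("SELECT name FROM t WHERE name LIKE '_dlt_%' ORDER BY name")
]
# An explicit prefix test (DuckDB offers starts_with(); substr works everywhere)
# only excludes names that literally begin with "_dlt_".
prefix_matches = [
    r[0]
    for r in con.execute(
        "SELECT name FROM t WHERE substr(name, 1, 5) = '_dlt_' ORDER BY name"
    )
]
```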
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 505503d0-c298-4848-89c1-b39ed7b5c6d0

📥 Commits

Reviewing files that changed from the base of the PR and between 9d2639f and 9d03760.

📒 Files selected for processing (4)
  • wren/src/wren/context.py
  • wren/src/wren/context_cli.py
  • wren/src/wren/dlt_introspect.py
  • wren/tests/unit/test_dlt_introspect.py

claude added 2 commits April 8, 2026 15:46
- dlt_introspect: escape path/catalog in ATTACH SQL to handle single
  quotes in paths and double quotes in catalog aliases
- dlt_introspect: replace LIKE '_dlt_%' with starts_with(table_name,
  '_dlt_') to avoid underscore wildcard mismatches
- dlt_introspect: remove redundant try/except around parse_type() since
  parse_type() already falls back to the raw string on parse errors
- context: add duplicate-table-name guard before generating model files
  to fail fast when the same table name appears in multiple schemas
- context_cli: fail fast when --profile is used without --from-dlt

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o
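The quoting fix in the commit above can be sketched as follows; the function name is illustrative, but the escaping rules (double single quotes inside a SQL string literal, double double quotes inside a quoted identifier) are standard SQL.

```python
# Sketch of building a safely quoted ATTACH statement for DuckDB:
# the file path goes in a single-quoted string literal, the catalog
# alias in a double-quoted identifier.

def build_attach_sql(path: str, alias: str) -> str:
    escaped_path = path.replace("'", "''")    # escape for SQL string literal
    escaped_alias = alias.replace('"', '""')  # escape for quoted identifier
    return f"ATTACH '{escaped_path}' AS \"{escaped_alias}\" (READ_ONLY)"
```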
Move all local imports in test_dlt_introspect.py to module level to
satisfy PLC0415. Apply ruff format to all changed files.

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wren/src/wren/dlt_introspect.py`:
- Around line 55-62: The ATTACH alias uses the raw filename stem (self._catalog)
which can collide with DuckDB reserved names; change the logic that sets
self._catalog to a safe prefixed alias (e.g., "dlt_" + stem) and ensure the
DuckDB connection (self._con = duckdb.connect()) is created before any execute
calls; keep escaping for the path and catalog name and use the prefixed
self._catalog when calling self._con.execute("ATTACH ... AS
\"{escaped_catalog}\" (READ_ONLY)").

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8e9cc38a-0d6d-4d71-b62a-dc60bf81770e

📥 Commits

Reviewing files that changed from the base of the PR and between 9d03760 and daa7647.

📒 Files selected for processing (4)
  • wren/src/wren/context.py
  • wren/src/wren/context_cli.py
  • wren/src/wren/dlt_introspect.py
  • wren/tests/unit/test_dlt_introspect.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • wren/src/wren/context_cli.py

claude added 2 commits April 8, 2026 15:59
DuckDB reserves 'memory', 'system', and 'temp' as catalog names. Using
a filename stem like 'temp.duckdb' or 'memory.duckdb' would cause ATTACH
to fail. Prefix the internal alias with 'dlt_' (stored as self._alias)
while keeping self._catalog as the bare stem for table_reference output
in the generated YAML project.

https://claude.ai/code/session_01X9z6PBFBk59A3oDaxYrt2o
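The alias-prefixing fix in the commit above can be sketched as a small helper; the function name is illustrative, while the reserved catalog names ('memory', 'system', 'temp') and the 'dlt_' prefix come from the commit message.

```python
# Sketch of deriving a collision-safe ATTACH alias from a DuckDB filename:
# the bare stem stays as the user-facing catalog for table_reference output,
# while the prefixed alias avoids DuckDB's reserved catalog names
# ('memory', 'system', 'temp').
from pathlib import Path

def make_attach_alias(duckdb_path: str) -> tuple[str, str]:
    stem = Path(duckdb_path).stem
    return stem, f"dlt_{stem}"
```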

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wren/src/wren/dlt_introspect.py`:
- Around line 36-41: The parent-resolution logic currently matches parents by
bare table name and assumes matched parents have a DLT id column; update the
matching to require schema+name equality (use DltTable.schema and DltTable.name
together) rather than falling back to name-only matches, and if schema is
omitted by the child only allow a name-only match when the candidate is unique
across schemas in the same catalog; also replace any hard assumption that a
matched parent has "_dlt_id" with an explicit check for that column on the
candidate (inspect its DltColumn entries) before setting has_dlt_parent_id or
emitting relationships. Refer to the DltTable, DltColumn and has_dlt_parent_id
usages and update the parent-match loops (the blocks that search tables by name)
to enforce schema-aware matching and explicit column existence checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0ef9f89-1533-4bf7-862f-c1b30df064f6

📥 Commits

Reviewing files that changed from the base of the PR and between 1ae1027 and 3199637.

📒 Files selected for processing (1)
  • wren/src/wren/dlt_introspect.py

Comment on lines +36 to +41

    class DltTable:
        catalog: str  # attached database alias (filename stem)
        schema: str  # DuckDB schema (e.g. "main", "hubspot_crm")
        name: str  # table name
        columns: list[DltColumn] = field(default_factory=list)
        has_dlt_parent_id: bool = False

⚠️ Potential issue | 🟠 Major

Don't resolve dlt parents from bare table names alone.

Line 144 throws away schema, so a child like sales.orders__items can bind to crm.orders if that's the only orders table in the file. Line 175 also assumes the matched parent has _dlt_id, which is not true for non-dlt tables that this introspector still imports. Since wren/src/wren/context.py:201-291 only rejects duplicate table names, a unique-but-wrong match can still make it into relationships.yml.

Proposed fix
 @dataclass
 class DltTable:
     catalog: str  # attached database alias (filename stem)
     schema: str  # DuckDB schema (e.g. "main", "hubspot_crm")
     name: str  # table name
     columns: list[DltColumn] = field(default_factory=list)
+    has_dlt_id: bool = False
     has_dlt_parent_id: bool = False
@@
-            has_parent_id = False
+            has_dlt_id = False
+            has_parent_id = False
             columns: list[DltColumn] = []
             for col_name, raw_type, is_nullable in col_rows:
+                if col_name == "_dlt_id":
+                    has_dlt_id = True
                 if col_name == "_dlt_parent_id":
                     has_parent_id = True
                 if col_name in _DLT_INTERNAL_COLUMNS:
                     continue
@@
                 DltTable(
                     catalog=self._catalog,
                     schema=schema,
                     name=table_name,
                     columns=columns,
+                    has_dlt_id=has_dlt_id,
                     has_dlt_parent_id=has_parent_id,
                 )
             )
@@
-        table_names = {t.name for t in tables}
+        parent_lookup = {
+            (t.schema, t.name)
+            for t in tables
+            if t.has_dlt_id
+        }
         relationships: list[dict] = []
@@
             for i in range(len(parts) - 1, 0, -1):
                 candidate = "__".join(parts[:i])
-                if candidate in table_names:
+                if (table.schema, candidate) in parent_lookup:
                     parent_name = candidate
                     break
@@
                 logger.warning(
-                    "Child table '%s' has _dlt_parent_id but no matching parent "
-                    "found — skipping relationship",
-                    table.name,
+                    "Child table '%s.%s' has _dlt_parent_id but no matching "
+                    "same-schema parent with _dlt_id was found; skipping relationship",
+                    table.schema,
+                    table.name,
                 )

Also applies to: 108-123, 144-176

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wren/src/wren/dlt_introspect.py` around lines 36 - 41, The parent-resolution
logic currently matches parents by bare table name and assumes matched parents
have a DLT id column; update the matching to require schema+name equality (use
DltTable.schema and DltTable.name together) rather than falling back to
name-only matches, and if schema is omitted by the child only allow a name-only
match when the candidate is unique across schemas in the same catalog; also
replace any hard assumption that a matched parent has "_dlt_id" with an explicit
check for that column on the candidate (inspect its DltColumn entries) before
setting has_dlt_parent_id or emitting relationships. Refer to the DltTable,
DltColumn and has_dlt_parent_id usages and update the parent-match loops (the
blocks that search tables by name) to enforce schema-aware matching and explicit
column existence checks.

goldmedal marked this pull request as draft Apr 9, 2026
goldmedal closed this Apr 9, 2026