Merged
1 change: 1 addition & 0 deletions skills/README.md
@@ -57,6 +57,7 @@ Once installed, invoke a skill by name in your conversation:
|-------|-------------|
| [wren-usage](wren-usage/SKILL.md) | **Primary skill** — CLI workflow guide: query data via `wren --sql`, gather schema context with `wren memory`, store/recall queries, handle errors |
| [wren-generate-mdl](wren-generate-mdl/SKILL.md) | Generate a Wren MDL project from a live database — schema discovery, type normalization, YAML generation |
| [wren-dlt-connector](wren-dlt-connector/SKILL.md) | Connect SaaS data (HubSpot, Stripe, Salesforce, etc.) via dlt pipelines into DuckDB, then auto-generate a Wren project |

### wren-usage reference files

21 changes: 21 additions & 0 deletions skills/SKILLS.md
@@ -57,6 +57,27 @@ Generates a Wren MDL project by exploring a live database using whatever tools a

---

## wren-dlt-connector

**File:** [wren-dlt-connector/SKILL.md](wren-dlt-connector/SKILL.md)

Connects SaaS data (HubSpot, Stripe, Salesforce, GitHub, Slack, etc.) to Wren Engine for SQL analysis. Walks through the full flow: install dlt, pick a SaaS source, set up credentials, run the data pipeline into DuckDB, then auto-generate a Wren semantic project from the loaded data.

### When to use

- Connecting SaaS data sources (HubSpot, Stripe, Salesforce, GitHub, Slack, etc.)
- Importing data from an API via dlt pipelines
- Loading SaaS data into DuckDB for SQL analysis
- Creating a Wren project from an existing dlt-produced DuckDB file

### Dependent skills

| Skill | Purpose |
|-------|---------|
| `wren-generate-mdl` | Generate or regenerate MDL from the DuckDB database |

---

## Installing a skill

```bash
21 changes: 21 additions & 0 deletions skills/index.json
@@ -5,6 +5,27 @@
"repository": "https://github.com/Canner/wren-engine",
"license": "Apache-2.0",
"skills": [
{
"name": "wren-dlt-connector",
"version": "1.0",
"description": "Connect SaaS data (HubSpot, Stripe, Salesforce, GitHub, Slack, etc.) to Wren Engine for SQL analysis via dlt pipelines into DuckDB, then auto-generate a Wren semantic project.",
"tags": [
"wren",
"dlt",
"saas",
"duckdb",
"pipeline",
"hubspot",
"stripe",
"salesforce",
"github",
"slack"
],
"dependencies": [
"wren-generate-mdl"
],
"repository": "https://github.com/Canner/wren-engine/tree/main/skills/wren-dlt-connector"
},
{
"name": "wren-generate-mdl",
"version": "2.1",
13 changes: 11 additions & 2 deletions skills/install.sh
@@ -13,7 +13,7 @@ set -euo pipefail
REPO="Canner/wren-engine"
BRANCH="${WREN_SKILLS_BRANCH:-main}"
DEST="${CLAUDE_SKILLS_DIR:-$HOME/.claude/skills}"
ALL_SKILLS=(wren-generate-mdl wren-usage)
ALL_SKILLS=(wren-dlt-connector wren-generate-mdl wren-usage)

# Parse --force flag and skill list from arguments
FORCE=false
@@ -51,8 +51,17 @@ fi

# Locate index.json for dependency resolution (local or remote)
INDEX_JSON=""
INDEX_JSON_TMP=""
if [ -n "$SCRIPT_DIR" ] && [ -f "$SCRIPT_DIR/index.json" ]; then
INDEX_JSON="$SCRIPT_DIR/index.json"
elif command -v curl &>/dev/null; then
INDEX_JSON_TMP="$(mktemp)"
if curl -fsSL "https://raw.githubusercontent.com/$REPO/$BRANCH/skills/index.json" -o "$INDEX_JSON_TMP" 2>/dev/null; then
INDEX_JSON="$INDEX_JSON_TMP"
else
rm -f "$INDEX_JSON_TMP"
INDEX_JSON_TMP=""
fi
fi

# Expand SELECTED_SKILLS to include dependencies declared in index.json.
@@ -153,7 +162,7 @@ else
echo "Destination: $DEST"
echo ""
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
trap 'rm -rf "$tmpdir"; [ -n "${INDEX_JSON_TMP:-}" ] && rm -f "$INDEX_JSON_TMP"' EXIT

extract_paths=()
for skill in "${SELECTED_SKILLS[@]}"; do
1 change: 1 addition & 0 deletions skills/versions.json
@@ -1,4 +1,5 @@
{
"wren-dlt-connector": "1.0",
"wren-generate-mdl": "2.1",
"wren-usage": "2.1"
}
269 changes: 269 additions & 0 deletions skills/wren-dlt-connector/SKILL.md
@@ -0,0 +1,269 @@
---
name: wren-dlt-connector
description: "Connect SaaS data (HubSpot, Stripe, Salesforce, GitHub, Slack, etc.) to Wren Engine for SQL analysis. Guides the user through the full flow: install dlt, pick a SaaS source, set up credentials, run the data pipeline into DuckDB, then auto-generate a Wren semantic project from the loaded data. Use this skill whenever the user mentions: connecting SaaS data, importing data from an API, dlt pipelines, loading HubSpot/Stripe/Salesforce/GitHub/Slack data, querying SaaS data with SQL, or setting up a new data source from a REST API. Also trigger when the user already has a dlt-produced DuckDB file and wants to create a Wren project from it."
license: Apache-2.0
metadata:
author: wren-engine
version: "1.0"
---

# wren-dlt-connector

Connect SaaS data to Wren Engine for SQL analysis — from zero to a verified, queryable project in one conversation.

## Who this is for

Data analysts who know SQL and some Python, but may not have used dlt or Wren before. Explain concepts briefly when they first appear, but don't over-explain things a SQL-literate person would already know.

## Overview

This skill walks through a four-phase workflow:

1. **Extract** — Use dlt (data load tool) to pull data from a SaaS API into a local DuckDB file
2. **Model** — Introspect the DuckDB schema and auto-generate a Wren semantic project (YAML models, relationships, profile)
3. **Build & Verify** — Build the project and run actual SQL queries to confirm everything works end-to-end
4. **Handoff** — Show the user their data and next steps

The user might enter at any phase. Ask which phase they're starting from — they may already have a `.duckdb` file and just need phases 2–4.

**The goal is a project that actually queries successfully, not just files that look correct.** Always run the verification step before declaring success.

## Critical: DuckDB catalog naming

When Wren Engine connects to a DuckDB file, it ATTACHes the file using the filename stem (without the `.duckdb` extension) as the catalog alias:

```sql
ATTACH DATABASE 'stripe_data.duckdb' AS "stripe_data" (READ_ONLY)
```


This means **every model's `table_reference.catalog` must equal the DuckDB filename stem**. If the file is `hubspot.duckdb`, the catalog is `hubspot`. If it's `my_pipeline.duckdb`, the catalog is `my_pipeline`.

Getting this wrong causes "table not found" errors at query time. The `introspect_dlt.py` script handles this automatically.
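Deriving the catalog name from a file path is a one-liner; a minimal sketch of the rule the script applies:

```python
from pathlib import Path

def catalog_for(duckdb_path: str) -> str:
    # Wren engine ATTACHes the file under its filename stem, so
    # "data/stripe_data.duckdb" yields the catalog "stripe_data".
    return Path(duckdb_path).stem

print(catalog_for("data/stripe_data.duckdb"))  # stripe_data
```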

## Critical: Type normalization

Column types must be normalized using wren SDK's `type_mapping.parse_type()` function, which uses sqlglot to convert database-specific types (like DuckDB's `HUGEINT`, `TIMESTAMP WITH TIME ZONE`) into canonical SQL types that wren-core understands. Do not hardcode type mappings — always delegate to `parse_type(raw_type, "duckdb")`.

The `introspect_dlt.py` script does this automatically when wren SDK is installed.
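To illustrate the kind of conversion `parse_type` performs, here is a hand-rolled fallback sketch — the mappings below are illustrative assumptions, not the SDK's actual table, and real projects should always delegate to `wren.type_mapping.parse_type(raw_type, "duckdb")`:

```python
# Illustrative fallback only -- delegate to wren.type_mapping.parse_type()
# in real use. These example mappings are assumptions, not the SDK's table.
FALLBACK_TYPES = {
    "HUGEINT": "BIGINT",                        # lossy: no 128-bit canonical type
    "UBIGINT": "BIGINT",
    "TIMESTAMP WITH TIME ZONE": "TIMESTAMPTZ",
}

def normalize_type(raw_type: str) -> str:
    # Canonical types pass through; known DuckDB-specific types are remapped.
    upper = raw_type.strip().upper()
    return FALLBACK_TYPES.get(upper, upper)
```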

## Phase 1: Extract — dlt Pipeline Setup

### Step 1: Pick the SaaS source

Ask the user which SaaS service they want to connect. Read `references/dlt_sources.md` for a list of popular verified sources and their auth requirements. If the source isn't listed, check whether dlt has a verified source for it by searching `dlthub.com/docs/dlt-ecosystem/verified-sources`.

### Step 2: Install dlt

```bash
pip install "dlt[duckdb]" --break-system-packages
```

### Step 3: Write the pipeline script

Create a Python script that:
1. Imports the dlt source function for the chosen SaaS
2. Configures the pipeline with `destination='duckdb'` and a local file path
3. Runs the pipeline with `pipeline.run(source)`

Here's the general pattern — adapt it per source (check `references/dlt_sources.md` for source-specific templates):

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="<source>_pipeline",
    destination="duckdb",
    dataset_name="<source>_data",
)

# Source-specific: check references/dlt_sources.md for auth patterns
source = <source_function>(api_key=dlt.secrets.value)

info = pipeline.run(source)
print(info)
```

### Step 4: Set up credentials

dlt reads credentials from environment variables or `.dlt/secrets.toml`. The simplest approach for a one-time run:

```bash
# Set the credential as an environment variable
# The exact variable name depends on the source — check references/dlt_sources.md
export SOURCES__<SOURCE>__API_KEY="the-actual-key"
```

Ask the user for their API key or token. Remind them:
- Never commit credentials to git
- Environment variables are the simplest way for a one-time run
- For repeated use, they can create `.dlt/secrets.toml`
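For that repeated-use case, `.dlt/secrets.toml` mirrors the env-var naming — a sketch assuming a `hubspot` source whose secret is named `api_key` (check `references/dlt_sources.md` for the actual key name per source):

```toml
# .dlt/secrets.toml -- keep this file out of git
[sources.hubspot]
api_key = "the-actual-key"
```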

### Step 5: Run the pipeline

```bash
python <pipeline_script>.py
```

After the run, confirm:
1. The pipeline completed without errors
2. A `.duckdb` file was created (usually at `<pipeline_name>.duckdb`)
3. Print discovered tables and their column counts

```python
import duckdb
con = duckdb.connect("<pipeline_name>.duckdb", read_only=True)
for row in con.execute("""
    SELECT table_schema, table_name,
           (SELECT COUNT(*) FROM information_schema.columns c
            WHERE c.table_schema = t.table_schema AND c.table_name = t.table_name) AS col_count
    FROM information_schema.tables t
    WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
      AND table_name NOT LIKE '_dlt_%'
    ORDER BY table_schema, table_name
""").fetchall():
    print(f"  {row[0]}.{row[1]} ({row[2]} columns)")
con.close()
```

## Phase 2: Model — Generate Wren Project

Run the introspection script to auto-generate a complete Wren project from the DuckDB file:

```bash
python <path-to-this-skill>/scripts/introspect_dlt.py \
--duckdb-path <path-to-duckdb-file> \
--output-dir <project-directory> \
--project-name <name>
```

This script:
- Connects to the DuckDB file (read-only)
- **Sets `table_reference.catalog` to the DuckDB filename stem** (matching wren engine's ATTACH behavior)
- Discovers all tables and columns via `information_schema`
- Filters out dlt internal tables (`_dlt_loads`, `_dlt_pipeline_state`, etc.)
- Filters out dlt metadata columns (`_dlt_id`, `_dlt_load_id`, `_dlt_list_idx`) from model definitions
- Detects parent-child relationships from `_dlt_parent_id` columns and table naming conventions
- **Normalizes column types using `wren.type_mapping.parse_type()`** (sqlglot-based)
- Generates a complete v2 YAML project (wren_project.yml, models/, relationships.yml, instructions.md)

After running, show the user what was generated:

```bash
# Show project summary
cat <project-directory>/wren_project.yml
echo "---"
ls <project-directory>/models/
echo "---"
cat <project-directory>/relationships.yml
```

### Verify model correctness

Spot-check one generated model to confirm:
1. `table_reference.catalog` matches the DuckDB filename (e.g., `stripe_data` for `stripe_data.duckdb`)
2. `table_reference.schema` matches the DuckDB schema (usually `main`)
3. No `_dlt_*` columns appear in the columns list
4. Column types look reasonable (VARCHAR, BIGINT, BOOLEAN, TIMESTAMP, etc.)
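The spot-check above can be scripted; a minimal sketch that assumes each model YAML parses to a dict with `table_reference` and `columns` keys (adjust the keys to the actual v2 layout):

```python
def check_model(model: dict, duckdb_stem: str) -> list:
    """Return a list of problems found in one generated model dict."""
    problems = []
    table_ref = model.get("table_reference", {})
    # Catalog must equal the DuckDB filename stem (ATTACH behavior).
    if table_ref.get("catalog") != duckdb_stem:
        problems.append(f"catalog {table_ref.get('catalog')!r} != {duckdb_stem!r}")
    # dlt metadata columns should have been filtered out of the model.
    for col in model.get("columns", []):
        if col.get("name", "").startswith("_dlt_"):
            problems.append(f"dlt metadata column leaked: {col['name']}")
    return problems
```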

### Set up the connection profile

Create a Wren profile so the user can query without specifying connection details every time. The `url` must point to the **directory containing** the `.duckdb` file (not the file itself):

```python
import yaml
from pathlib import Path

wren_home = Path.home() / ".wren"
wren_home.mkdir(exist_ok=True)
profiles_file = wren_home / "profiles.yml"

existing = (yaml.safe_load(profiles_file.read_text()) or {}) if profiles_file.exists() else {}
existing.setdefault("profiles", {})

profile_name = "<source>_dlt"
existing["profiles"][profile_name] = {
    "datasource": "duckdb",
    "url": str(Path("<duckdb-path>").resolve().parent),
    "format": "duckdb",
}
existing["active"] = profile_name

profiles_file.write_text(yaml.dump(existing, default_flow_style=False, sort_keys=False))
```

## Phase 3: Build & Verify — The Project Must Actually Work

This phase is not optional. A project that generates YAML but fails at query time is not a success.

### Step 1: Build the MDL

```bash
cd <project-directory>
wren context build
```

This compiles the YAML models into `target/mdl.json`. If this fails, fix the issues before proceeding (see Troubleshooting below).

### Step 2: Validate with a real query

Run at least one query per generated model to confirm the project is functional:

```bash
# For each model, verify it resolves correctly
wren --sql 'SELECT COUNT(*) as total FROM "<table_name>"'
```

If any query fails, debug and fix the model before moving on. Common issues:
- Wrong catalog in table_reference → "table not found"
- Type mismatch → fix the column type in metadata.yml
- Missing profile → check `wren profile list`

### Step 3: Run interesting queries

Once basic queries pass, run 2–3 more interesting queries to show the user what their data looks like:

```bash
# Preview data
wren --sql 'SELECT * FROM "<table_name>" LIMIT 5'

# If there's a relationship, verify both models are queryable
wren --sql 'SELECT * FROM "<parent>" LIMIT 5'
wren --sql 'SELECT * FROM "<child>" LIMIT 5'
```

Show the results to the user and explain what they're seeing. This is their first look at the data through Wren — make it count.

### Step 4: Confirm success

Only after queries return real data, tell the user the setup is complete. Summarize:
- How many models were created
- What relationships were detected
- Which profile is active
- Example queries they can try next

## Troubleshooting

If `wren context build` fails:
- Check that `data_source: duckdb` is set in `wren_project.yml`
- Verify the DuckDB file path in the profile is correct
- Run `wren context validate` for detailed error messages

If queries fail with "table not found":
- **Most likely cause:** `table_reference.catalog` doesn't match the DuckDB filename. If the file is `pipeline.duckdb`, the catalog must be `pipeline`, not empty string.
- Check the profile's `url` points to the directory containing the `.duckdb` file
- Table names with double underscores need quoting: `"hubspot__contacts"`

If queries fail with type errors:
- Check column types in the model YAML — they should be canonical SQL types (VARCHAR, BIGINT, etc.)
- Re-run `introspect_dlt.py` with wren SDK installed to get proper type normalization

General:
- Check that the profile is active: `wren profile list`
- The DuckDB file might be locked if a dlt pipeline is running — wait for it to finish

## Important notes

- dlt's `_dlt_parent_id` / `_dlt_id` columns are kept in the actual DuckDB tables but hidden from Wren model definitions. They're only used in relationship conditions.
- DuckDB has a single-writer limitation. Don't run a dlt sync while querying. For concurrent access, dlt should write to a separate file and swap atomically.
- The generated models use `table_reference` (not `ref_sql`) since they map directly to DuckDB tables created by dlt.
- Column types are normalized using wren SDK's `parse_type()` with sqlglot's DuckDB dialect. If a type looks wrong, the user can edit the model's `metadata.yml` directly.
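The write-then-swap pattern mentioned above can be sketched in a few lines of shell — filenames are illustrative, and the `echo` stands in for a dlt run that writes to the staging file:

```shell
set -euo pipefail
workdir=$(mktemp -d)
staging="$workdir/staging.duckdb"
live="$workdir/live.duckdb"

echo "placeholder" > "$staging"   # your dlt pipeline writes here instead
mv -f "$staging" "$live"          # rename(2) is atomic on one filesystem
```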
23 changes: 23 additions & 0 deletions skills/wren-dlt-connector/evals/evals.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
{
"skill_name": "wren-dlt-connector",
"evals": [
{
"id": 1,
"prompt": "我想把公司 HubSpot CRM 的資料拉進來用 SQL 分析,可以幫我設定嗎?我有 HubSpot 的 private app token。",
"expected_output": "Should guide through: install dlt, write a HubSpot pipeline script, set credentials, run pipeline, introspect DuckDB, generate wren project, build, and run sample queries on contacts/deals tables.",
"files": []
},
{
"id": 2,
"prompt": "I already have a DuckDB file from a dlt pipeline at ./stripe_data.duckdb. I want to create a wren project so I can query the Stripe data with SQL. Can you set that up?",
"expected_output": "Should skip Phase 1 (dlt setup), go directly to introspecting the DuckDB file, generate wren project YAML, set up profile, build, and run sample queries.",
"files": []
},
{
"id": 3,
"prompt": "我們團隊用 GitHub 管理 open source project,我想定期把 issues 和 PR 資料抓下來做分析。怎麼開始?",
"expected_output": "Should guide through: install dlt, configure GitHub access token, write pipeline for github source (issues + PRs), run into DuckDB, generate wren project with models for issues/pull_requests/comments, detect relationships, build, run sample queries.",
"files": []
}
]
}