
feat(memory): seed query_history with canonical NL-SQL pairs on index#1510

Open
goldmedal wants to merge 2 commits into main from feat/seed-examples

Conversation

Contributor

@goldmedal goldmedal commented Apr 2, 2026

Summary

  • Adds wren/src/wren/memory/seed_queries.py — pure, deterministic template engine that generates 2–3 NL→SQL pairs per model (listing, aggregation, grouped aggregation) and one JOIN pair per relationship
  • Extends MemoryStore.index_schema() to call _upsert_seed_queries() on every index run: old source:seed entries are replaced, user-confirmed pairs are preserved; returns {"schema_items": int, "seed_queries": int}
  • Adds --no-seed flag to wren memory index CLI; updates output to show seed count
  • Updates WrenMemory.index_manifest() public API to match the new dict return type

Motivation

wren memory recall returned nothing on first use, making it hard for agents to see its value. Seeding canonical examples at index time solves the cold-start problem without requiring an LLM or network access.

Test plan

  • just test-unit — 61 passed, 18 skipped (lancedb/sentence-transformers skipped without memory extra)
  • 16 new unit tests in test_seed_queries.py covering all templates, edge cases, and type handling
  • 5 new integration tests in TestMemoryStoreSeedLifecycle covering seed creation, --no-seed flag, idempotent re-index, user-entry preservation, and recall

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Indexing now generates canonical NL↔SQL "seed" example queries by default.
    • Added a CLI flag (--no-seed) to skip generating seed examples.
  • Changes

    • Index operation reports separate counts for schema items and seed queries (instead of a single count).
  • Tests

    • Added coverage validating seed generation, edge cases, and lifecycle behavior.

When `wren memory index` runs, automatically generate 2-3 example NL→SQL
pairs per model (listing, aggregation, grouped aggregation) and one per
relationship (JOIN), inserting them into query_history tagged `source:seed`.
This solves the cold-start problem so `wren memory recall` is useful from
the first session.

Re-indexing replaces old seed entries while preserving user-confirmed pairs.
Use `--no-seed` to skip generation. `index_schema()` now returns a dict
with `schema_items` and `seed_queries` counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
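The replace-seeds-but-preserve-user-pairs behavior described above can be sketched as a pure function over query_history records. This is an illustrative simulation only, not the PR's actual LanceDB-backed implementation; the record fields and the `source:seed` tag follow the PR description, while the `source:user` tag value is a hypothetical stand-in for user-confirmed entries.

```python
SEED_TAG = "source:seed"

def upsert_seed_queries(history: list[dict], new_seeds: list[dict]) -> list[dict]:
    """Replace previously seeded rows while keeping all other pairs.

    Rows tagged `source:seed` are dropped and re-created from `new_seeds`;
    rows with any other tag (e.g. user-confirmed pairs) pass through untouched.
    """
    kept = [row for row in history if row.get("tags") != SEED_TAG]
    seeded = [{**pair, "tags": SEED_TAG} for pair in new_seeds]
    return kept + seeded

history = [
    {"nl": "list customers", "sql": "SELECT * FROM customers", "tags": SEED_TAG},
    {"nl": "top buyers by spend", "sql": "SELECT 1", "tags": "source:user"},
]
new_seeds = [{"nl": "count orders", "sql": "SELECT COUNT(*) FROM orders"}]
result = upsert_seed_queries(history, new_seeds)
# the user-confirmed pair survives; the stale seed is replaced by the new one
```

Running this against the same history twice is idempotent, which matches the "idempotent re-index" integration test mentioned in the test plan.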
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough


Adds optional seed NL↔SQL example generation to the memory subsystem: new seed generator module, plumbing to index/store/CLI to enable/disable seeding, and return/report seed counts; tests updated to cover seed behavior.

Changes

  • Core Memory API — wren/src/wren/memory/__init__.py: WrenMemory.index_manifest signature adds seed_queries: bool = True and now returns a dict {"schema_items": int, "seed_queries": int}.
  • Store Implementation — wren/src/wren/memory/store.py: MemoryStore.index_schema adds seed_queries: bool = True, returns the same dict shape, and a new helper _upsert_seed_queries(manifest) -> int inserts seed pairs into query_history with tag source:seed.
  • CLI Integration — wren/src/wren/memory/cli.py: memory index adds a --no-seed option (mapped to seed_queries) and prints both schema item and seed query counts.
  • Seed Generation — wren/src/wren/memory/seed_queries.py: new pure-utility module exporting SEED_TAG and generate_seed_queries(manifest) -> list[dict] that emits listing, aggregation, grouped-aggregation, and join seed NL↔SQL pairs based on manifest models/relationships.
  • Tests — wren/tests/unit/test_memory.py, wren/tests/unit/test_seed_queries.py: updated tests to expect dict returns; added comprehensive unit tests for seed generation and lifecycle behavior (seeding on/off, replacement, recall).
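The seed generator's template approach can be sketched with a simplified, self-contained illustration. Only `SEED_TAG` and `generate_seed_queries` come from the PR; the manifest field names (`models`, `columns`, `isPrimaryKey`), the type set, and the templates shown are assumptions, and the grouped-aggregation and join templates are omitted for brevity.

```python
SEED_TAG = "source:seed"
_NUMERIC_TYPES = {"INTEGER", "BIGINT", "DOUBLE", "DECIMAL"}

def generate_seed_queries(manifest: dict) -> list[dict]:
    """Emit deterministic NL->SQL pairs: a listing pair per model and,
    where a numeric non-key column exists, an aggregation pair."""
    pairs = []
    for model in manifest.get("models", []):
        name = model["name"]
        pairs.append({"nl": f"List all {name}", "sql": f"SELECT * FROM {name}"})
        numeric = next(
            (c["name"] for c in model.get("columns", [])
             if c.get("type") in _NUMERIC_TYPES
             and not c.get("isPrimaryKey")),   # skip PKs: SUM(id) is nonsense
            None,
        )
        if numeric:
            pairs.append({
                "nl": f"What is the total {numeric} across {name}?",
                "sql": f"SELECT SUM({numeric}) FROM {name}",
            })
    return pairs

manifest = {"models": [{"name": "orders", "columns": [
    {"name": "id", "type": "INTEGER", "isPrimaryKey": True},
    {"name": "amount", "type": "DOUBLE"},
]}]}
pairs = generate_seed_queries(manifest)
# one listing pair plus a SUM(amount) aggregation; the PK `id` is skipped
```

Because the output depends only on the manifest, re-running it is deterministic, which is what makes seeding safe to repeat on every index run.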

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI as Memory CLI
    participant WrenMemory as WrenMemory
    participant MemoryStore as MemoryStore
    participant SeedGen as Seed Generator
    participant QueryDB as Query History DB

    User->>CLI: memory index --manifest=... [--no-seed]
    CLI->>WrenMemory: index_manifest(manifest, seed_queries=bool)
    WrenMemory->>MemoryStore: index_schema(manifest, seed_queries=bool)
    MemoryStore->>MemoryStore: index schema items

    alt seed_queries enabled
        MemoryStore->>SeedGen: generate_seed_queries(manifest)
        SeedGen-->>MemoryStore: list[{"nl":..., "sql":...}]
        MemoryStore->>QueryDB: DELETE WHERE tags='source:seed'
        MemoryStore->>QueryDB: INSERT seed pairs with tags='source:seed'
        MemoryStore->>MemoryStore: seed_count = n
    end

    MemoryStore-->>WrenMemory: {"schema_items": int, "seed_queries": int}
    WrenMemory-->>CLI: {"schema_items": int, "seed_queries": int}
    CLI->>User: print status (schema items + seed queries)
```

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

Suggested Reviewers

  • douenergy

Poem

🐰 I hopped through manifest rows so neat,
Plucked seed queries, both NL and SQL sweet.
"List all" and "Total" I gently sow,
Joins and groups in tidy row—
Memory gardens sprout seeds complete 🌱

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 17.95%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ — the title 'feat(memory): seed query_history with canonical NL-SQL pairs on index' accurately describes the main change: adding seed NL-SQL examples to query_history during indexing.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wren/src/wren/memory/seed_queries.py`:
- Around line 93-97: The code currently treats whitespace-only join conditions
as present; update the handling of condition in seed_queries.py so you strip
whitespace before validation (e.g., set condition = rel.get("condition",
"").strip()) and then use the existing check if len(models) < 2 or not
condition: return None so relationships with only whitespace are treated as
missing.
- Around line 56-63: The selection loop for numeric_col may pick a primary key
(is_pk) as a numeric aggregate; update the logic in the loop that assigns
numeric_col (the branch that checks col_type in _NUMERIC_TYPES and not is_calc)
to also require that the column is not a primary key (is_pk is False). Locate
the for-loop over columns and the variables numeric_col, col_type, is_calc,
is_pk and change the condition so primary keys are skipped when choosing the
aggregate column.

In `@wren/src/wren/memory/store.py`:
- Around line 165-172: The code returns early when
generate_seed_queries(manifest) yields no pairs, preventing removal of existing
seed rows; change the flow so that old seeds are always cleared before
returning: check for _QUERY_TABLE via _table_names(self._db), open the table
with self._db.open_table(_QUERY_TABLE) and call table.delete(f"tags =
'{SEED_TAG}'") unconditionally (or at least before the early return), then if
pairs is empty return 0; keep the user entries intact by only deleting rows
where tags == SEED_TAG; use the existing symbols generate_seed_queries,
manifest, _QUERY_TABLE, _table_names, self._db, table.delete, and SEED_TAG to
locate and update the logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51643891-ee9c-42ac-aaa7-5ce8df2823e7

📥 Commits

Reviewing files that changed from the base of the PR and between 8f15e8c and 7399064.

📒 Files selected for processing (6)
  • wren/src/wren/memory/__init__.py
  • wren/src/wren/memory/cli.py
  • wren/src/wren/memory/seed_queries.py
  • wren/src/wren/memory/store.py
  • wren/tests/unit/test_memory.py
  • wren/tests/unit/test_seed_queries.py

- Exclude PK columns from numeric aggregation template to avoid nonsensical
  SUM(primary_key) queries
- Strip whitespace from relationship condition before validation so
  whitespace-only strings are treated as missing
- Clear old seed entries before checking if new pairs exist, so re-indexing
  an empty manifest removes stale seeds rather than leaving them

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
wren/src/wren/memory/store.py (1)

175-180: Batch seed inserts to reduce embedding/DB overhead.

Line 176 currently calls store_query() once per pair, which recomputes embeddings and performs per-row table operations. Consider batching records in _upsert_seed_queries() for fewer round trips.

♻️ Proposed refactor (batched insert)
```diff
-        # Insert new seeds via the existing store_query() method
-        for pair in pairs:
-            self.store_query(
-                nl_query=pair["nl"],
-                sql_query=pair["sql"],
-                tags=SEED_TAG,
-            )
+        now = datetime.now(timezone.utc)
+        nl_queries = [pair["nl"] for pair in pairs]
+        vectors = self._embed_fn.compute_source_embeddings(nl_queries)
+        records = [
+            {
+                "text": pair["nl"],
+                "vector": vec,
+                "nl_query": pair["nl"],
+                "sql_query": pair["sql"],
+                "datasource": "",
+                "created_at": now,
+                "tags": SEED_TAG,
+            }
+            for pair, vec in zip(pairs, vectors)
+        ]
+
+        if _QUERY_TABLE in _table_names(self._db):
+            table = self._db.open_table(_QUERY_TABLE)
+            table.add(records)
+        else:
+            self._db.create_table(
+                _QUERY_TABLE,
+                records,
+                schema=self._query_table_schema(),
+            )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wren/src/wren/memory/store.py` around lines 175 - 180, The loop in
_upsert_seed_queries currently calls store_query for every pair causing repeated
embedding computation and per-row DB operations; change _upsert_seed_queries to
batch the input (pairs) and perform a single embedding call and a single bulk
upsert instead: collect the NL texts from pairs, call the embedding function
once to get embeddings for the batch, build the list of records with their SQL,
tags (SEED_TAG) and embeddings, and then call a new or existing bulk
insert/upsert method on the store/DB rather than calling store_query per row;
update or add a batched signature (e.g., store_queries_bulk or extend
store_query to accept a list) and update callers accordingly so embeddings and
DB writes happen in one round trip.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 694842fd-e3a2-4a29-8f79-9913f4df3578

📥 Commits

Reviewing files that changed from the base of the PR and between 7399064 and 202c13f.

📒 Files selected for processing (3)
  • wren/src/wren/memory/seed_queries.py
  • wren/src/wren/memory/store.py
  • wren/tests/unit/test_seed_queries.py
✅ Files skipped from review due to trivial changes (2)
  • wren/tests/unit/test_seed_queries.py
  • wren/src/wren/memory/seed_queries.py

@goldmedal goldmedal requested a review from douenergy April 2, 2026 07:33
