
feat(memory): seed query_history with canonical NL-SQL pairs on index#1510

Open
goldmedal wants to merge 2 commits into main from feat/seed-examples

Conversation

Contributor

@goldmedal goldmedal commented Apr 2, 2026

Summary

  • Adds wren/src/wren/memory/seed_queries.py — pure, deterministic template engine that generates 2–3 NL→SQL pairs per model (listing, aggregation, grouped aggregation) and one JOIN pair per relationship
  • Extends MemoryStore.index_schema() to call _upsert_seed_queries() on every index run: old source:seed entries are replaced, user-confirmed pairs are preserved; returns {"schema_items": int, "seed_queries": int}
  • Adds --no-seed flag to wren memory index CLI; updates output to show seed count
  • Updates WrenMemory.index_manifest() public API to match the new dict return type

Motivation

wren memory recall returned nothing on first use, making it hard for agents to see its value. Seeding canonical examples at index time solves the cold-start problem without requiring an LLM or network access.

Test plan

  • just test-unit — 61 passed, 18 skipped (lancedb/sentence-transformers skipped without memory extra)
  • 16 new unit tests in test_seed_queries.py covering all templates, edge cases, and type handling
  • 5 new integration tests in TestMemoryStoreSeedLifecycle covering seed creation, --no-seed flag, idempotent re-index, user-entry preservation, and recall

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Indexing now generates canonical NL↔SQL "seed" example queries by default.
    • Added a CLI flag (--no-seed) to skip generating seed examples.
  • Changes

    • Index operation reports separate counts for schema items and seed queries (instead of a single count).
  • Tests

    • Added coverage validating seed generation, edge cases, and lifecycle behavior.

When `wren memory index` runs, automatically generate 2-3 example NL→SQL
pairs per model (listing, aggregation, grouped aggregation) and one per
relationship (JOIN), inserting them into query_history tagged `source:seed`.
This solves the cold-start problem so `wren memory recall` is useful from
the first session.

Re-indexing replaces old seed entries while preserving user-confirmed pairs.
Use `--no-seed` to skip generation. `index_schema()` now returns a dict
with `schema_items` and `seed_queries` counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
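The replace-seeds-but-preserve-user-pairs behavior described above can be sketched as a pure function over query_history records. This is an illustrative simulation only, not the PR's actual LanceDB-backed implementation; the record fields and the `source:seed` tag follow the PR description, while the `source:user` tag value is a hypothetical stand-in for user-confirmed entries.

```python
SEED_TAG = "source:seed"

def upsert_seed_queries(history: list[dict], new_seeds: list[dict]) -> list[dict]:
    """Replace previously seeded rows while keeping all other pairs.

    Rows tagged `source:seed` are dropped and re-created from `new_seeds`;
    rows with any other tag (e.g. user-confirmed pairs) pass through untouched.
    """
    kept = [row for row in history if row.get("tags") != SEED_TAG]
    seeded = [{**pair, "tags": SEED_TAG} for pair in new_seeds]
    return kept + seeded

history = [
    {"nl": "list customers", "sql": "SELECT * FROM customers", "tags": SEED_TAG},
    {"nl": "top buyers by spend", "sql": "SELECT 1", "tags": "source:user"},
]
new_seeds = [{"nl": "count orders", "sql": "SELECT COUNT(*) FROM orders"}]
result = upsert_seed_queries(history, new_seeds)
# the user-confirmed pair survives; the stale seed is replaced by the new one
```

Running this against the same history twice is idempotent, which matches the "idempotent re-index" integration test mentioned in the test plan.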
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough


Adds optional seed NL↔SQL example generation to the memory subsystem: new seed generator module, plumbing to index/store/CLI to enable/disable seeding, and return/report seed counts; tests updated to cover seed behavior.

Changes

  • Core Memory API — wren/src/wren/memory/__init__.py: WrenMemory.index_manifest signature adds seed_queries: bool = True and now returns a dict {"schema_items": int, "seed_queries": int}.
  • Store Implementation — wren/src/wren/memory/store.py: MemoryStore.index_schema adds seed_queries: bool = True, returns the same dict shape, and a new helper _upsert_seed_queries(manifest) -> int inserts seed pairs into query_history with tag source:seed.
  • CLI Integration — wren/src/wren/memory/cli.py: memory index adds a --no-seed option (mapped to seed_queries) and prints both schema item and seed query counts.
  • Seed Generation — wren/src/wren/memory/seed_queries.py: new pure-utility module exporting SEED_TAG and generate_seed_queries(manifest) -> list[dict] that emits listing, aggregation, grouped-aggregation, and join seed NL↔SQL pairs based on manifest models/relationships.
  • Tests — wren/tests/unit/test_memory.py, wren/tests/unit/test_seed_queries.py: updated tests to expect dict returns; added comprehensive unit tests for seed generation and lifecycle behavior (seeding on/off, replacement, recall).
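The seed generator's template approach can be sketched with a simplified, self-contained illustration. Only `SEED_TAG` and `generate_seed_queries` come from the PR; the manifest field names (`models`, `columns`, `isPrimaryKey`), the type set, and the templates shown are assumptions, and the grouped-aggregation and join templates are omitted for brevity.

```python
SEED_TAG = "source:seed"
_NUMERIC_TYPES = {"INTEGER", "BIGINT", "DOUBLE", "DECIMAL"}

def generate_seed_queries(manifest: dict) -> list[dict]:
    """Emit deterministic NL->SQL pairs: a listing pair per model and,
    where a numeric non-key column exists, an aggregation pair."""
    pairs = []
    for model in manifest.get("models", []):
        name = model["name"]
        pairs.append({"nl": f"List all {name}", "sql": f"SELECT * FROM {name}"})
        numeric = next(
            (c["name"] for c in model.get("columns", [])
             if c.get("type") in _NUMERIC_TYPES
             and not c.get("isPrimaryKey")),   # skip PKs: SUM(id) is nonsense
            None,
        )
        if numeric:
            pairs.append({
                "nl": f"What is the total {numeric} across {name}?",
                "sql": f"SELECT SUM({numeric}) FROM {name}",
            })
    return pairs

manifest = {"models": [{"name": "orders", "columns": [
    {"name": "id", "type": "INTEGER", "isPrimaryKey": True},
    {"name": "amount", "type": "DOUBLE"},
]}]}
pairs = generate_seed_queries(manifest)
# one listing pair plus a SUM(amount) aggregation; the PK `id` is skipped
```

Because the output depends only on the manifest, re-running it is deterministic, which is what makes seeding safe to repeat on every index run.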

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI as Memory CLI
    participant WrenMemory as WrenMemory
    participant MemoryStore as MemoryStore
    participant SeedGen as Seed Generator
    participant QueryDB as Query History DB

    User->>CLI: memory index --manifest=... [--no-seed]
    CLI->>WrenMemory: index_manifest(manifest, seed_queries=bool)
    WrenMemory->>MemoryStore: index_schema(manifest, seed_queries=bool)
    MemoryStore->>MemoryStore: index schema items

    alt seed_queries enabled
        MemoryStore->>SeedGen: generate_seed_queries(manifest)
        SeedGen-->>MemoryStore: list[{"nl":..., "sql":...}]
        MemoryStore->>QueryDB: DELETE WHERE tags='source:seed'
        MemoryStore->>QueryDB: INSERT seed pairs with tags='source:seed'
        MemoryStore->>MemoryStore: seed_count = n
    end

    MemoryStore-->>WrenMemory: {"schema_items": int, "seed_queries": int}
    WrenMemory-->>CLI: {"schema_items": int, "seed_queries": int}
    CLI->>User: print status (schema items + seed queries)
```

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

Suggested Reviewers

  • douenergy

Poem

🐰 I hopped through manifest rows so neat,
Plucked seed queries, both NL and SQL sweet.
"List all" and "Total" I gently sow,
Joins and groups in tidy row—
Memory gardens sprout seeds complete 🌱

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 17.95%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ — the title 'feat(memory): seed query_history with canonical NL-SQL pairs on index' accurately describes the main change: adding seed NL-SQL examples to query_history during indexing.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wren/src/wren/memory/seed_queries.py`:
- Around line 93-97: The code currently treats whitespace-only join conditions
as present; update the handling of condition in seed_queries.py so you strip
whitespace before validation (e.g., set condition = rel.get("condition",
"").strip()) and then use the existing check if len(models) < 2 or not
condition: return None so relationships with only whitespace are treated as
missing.
- Around line 56-63: The selection loop for numeric_col may pick a primary key
(is_pk) as a numeric aggregate; update the logic in the loop that assigns
numeric_col (the branch that checks col_type in _NUMERIC_TYPES and not is_calc)
to also require that the column is not a primary key (is_pk is False). Locate
the for-loop over columns and the variables numeric_col, col_type, is_calc,
is_pk and change the condition so primary keys are skipped when choosing the
aggregate column.

In `@wren/src/wren/memory/store.py`:
- Around line 165-172: The code returns early when
generate_seed_queries(manifest) yields no pairs, preventing removal of existing
seed rows; change the flow so that old seeds are always cleared before
returning: check for _QUERY_TABLE via _table_names(self._db), open the table
with self._db.open_table(_QUERY_TABLE) and call table.delete(f"tags =
'{SEED_TAG}'") unconditionally (or at least before the early return), then if
pairs is empty return 0; keep the user entries intact by only deleting rows
where tags == SEED_TAG; use the existing symbols generate_seed_queries,
manifest, _QUERY_TABLE, _table_names, self._db, table.delete, and SEED_TAG to
locate and update the logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51643891-ee9c-42ac-aaa7-5ce8df2823e7

📥 Commits

Reviewing files that changed from the base of the PR and between 8f15e8c and 7399064.

📒 Files selected for processing (6)
  • wren/src/wren/memory/__init__.py
  • wren/src/wren/memory/cli.py
  • wren/src/wren/memory/seed_queries.py
  • wren/src/wren/memory/store.py
  • wren/tests/unit/test_memory.py
  • wren/tests/unit/test_seed_queries.py

- Exclude PK columns from numeric aggregation template to avoid nonsensical
  SUM(primary_key) queries
- Strip whitespace from relationship condition before validation so
  whitespace-only strings are treated as missing
- Clear old seed entries before checking if new pairs exist, so re-indexing
  an empty manifest removes stale seeds rather than leaving them

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
wren/src/wren/memory/store.py (1)

175-180: Batch seed inserts to reduce embedding/DB overhead.

Line 176 currently calls store_query() once per pair, which recomputes embeddings and performs per-row table operations. Consider batching records in _upsert_seed_queries() for fewer round trips.

♻️ Proposed refactor (batched insert)
```diff
-        # Insert new seeds via the existing store_query() method
-        for pair in pairs:
-            self.store_query(
-                nl_query=pair["nl"],
-                sql_query=pair["sql"],
-                tags=SEED_TAG,
-            )
+        now = datetime.now(timezone.utc)
+        nl_queries = [pair["nl"] for pair in pairs]
+        vectors = self._embed_fn.compute_source_embeddings(nl_queries)
+        records = [
+            {
+                "text": pair["nl"],
+                "vector": vec,
+                "nl_query": pair["nl"],
+                "sql_query": pair["sql"],
+                "datasource": "",
+                "created_at": now,
+                "tags": SEED_TAG,
+            }
+            for pair, vec in zip(pairs, vectors)
+        ]
+
+        if _QUERY_TABLE in _table_names(self._db):
+            table = self._db.open_table(_QUERY_TABLE)
+            table.add(records)
+        else:
+            self._db.create_table(
+                _QUERY_TABLE,
+                records,
+                schema=self._query_table_schema(),
+            )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wren/src/wren/memory/store.py` around lines 175 - 180, The loop in
_upsert_seed_queries currently calls store_query for every pair causing repeated
embedding computation and per-row DB operations; change _upsert_seed_queries to
batch the input (pairs) and perform a single embedding call and a single bulk
upsert instead: collect the NL texts from pairs, call the embedding function
once to get embeddings for the batch, build the list of records with their SQL,
tags (SEED_TAG) and embeddings, and then call a new or existing bulk
insert/upsert method on the store/DB rather than calling store_query per row;
update or add a batched signature (e.g., store_queries_bulk or extend
store_query to accept a list) and update callers accordingly so embeddings and
DB writes happen in one round trip.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 694842fd-e3a2-4a29-8f79-9913f4df3578

📥 Commits

Reviewing files that changed from the base of the PR and between 7399064 and 202c13f.

📒 Files selected for processing (3)
  • wren/src/wren/memory/seed_queries.py
  • wren/src/wren/memory/store.py
  • wren/tests/unit/test_seed_queries.py
✅ Files skipped from review due to trivial changes (2)
  • wren/tests/unit/test_seed_queries.py
  • wren/src/wren/memory/seed_queries.py

@goldmedal goldmedal requested a review from douenergy April 2, 2026 07:33
