feat(memory): seed query_history with canonical NL-SQL pairs on index#1510
When `wren memory index` runs, automatically generate 2-3 example NL→SQL pairs per model (listing, aggregation, grouped aggregation) and one per relationship (JOIN), inserting them into query_history tagged `source:seed`. This solves the cold-start problem so `wren memory recall` is useful from the first session. Re-indexing replaces old seed entries while preserving user-confirmed pairs. Use `--no-seed` to skip generation. `index_schema()` now returns a dict with `schema_items` and `seed_queries` counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
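To make the template categories concrete, here is a hypothetical illustration of the kinds of pairs the seeder produces for a model named `orders`. The function name, model, and column names below are illustrative, not the actual generator API.

```python
def example_seed_pairs(model: str, numeric_col: str, group_col: str) -> list[dict]:
    """Sketch of the three per-model templates: listing, aggregation,
    grouped aggregation. Every pair carries the source:seed tag so
    re-indexing can replace it without touching user-confirmed entries."""
    return [
        {"nl": f"Show all {model}",
         "sql": f"SELECT * FROM {model} LIMIT 100",
         "tags": "source:seed"},
        {"nl": f"What is the total {numeric_col} across {model}?",
         "sql": f"SELECT SUM({numeric_col}) FROM {model}",
         "tags": "source:seed"},
        {"nl": f"Total {numeric_col} by {group_col}",
         "sql": f"SELECT {group_col}, SUM({numeric_col}) FROM {model} "
                f"GROUP BY {group_col}",
         "tags": "source:seed"},
    ]

pairs = example_seed_pairs("orders", "amount", "status")
print(len(pairs))  # 3
```

Because the templates are pure string construction over the manifest, seeding needs neither an LLM nor network access.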
📝 Walkthrough

Adds optional seed NL↔SQL example generation to the memory subsystem: a new seed generator module, plumbing through index/store/CLI to enable/disable seeding and return/report seed counts, and updated tests covering seed behavior.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI as Memory CLI
    participant WrenMemory as WrenMemory
    participant MemoryStore as MemoryStore
    participant SeedGen as Seed Generator
    participant QueryDB as Query History DB
    User->>CLI: memory index --manifest=... [--no-seed]
    CLI->>WrenMemory: index_manifest(manifest, seed_queries=bool)
    WrenMemory->>MemoryStore: index_schema(manifest, seed_queries=bool)
    MemoryStore->>MemoryStore: index schema items
    alt seed_queries enabled
        MemoryStore->>SeedGen: generate_seed_queries(manifest)
        SeedGen-->>MemoryStore: list[{"nl":..., "sql":...}]
        MemoryStore->>QueryDB: DELETE WHERE tags='source:seed'
        MemoryStore->>QueryDB: INSERT seed pairs with tags='source:seed'
        MemoryStore->>MemoryStore: seed_count = n
    end
    MemoryStore-->>WrenMemory: {"schema_items": int, "seed_queries": int}
    WrenMemory-->>CLI: {"schema_items": int, "seed_queries": int}
    CLI->>User: print status (schema items + seed queries)
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wren/src/wren/memory/seed_queries.py`:
- Around line 93-97: The code currently treats whitespace-only join conditions
as present; update the handling of condition in seed_queries.py so you strip
whitespace before validation (e.g., set condition = rel.get("condition",
"").strip()) and then use the existing check if len(models) < 2 or not
condition: return None so relationships with only whitespace are treated as
missing.
- Around line 56-63: The selection loop for numeric_col may pick a primary key
(is_pk) as a numeric aggregate; update the logic in the loop that assigns
numeric_col (the branch that checks col_type in _NUMERIC_TYPES and not is_calc)
to also require that the column is not a primary key (is_pk is False). Locate
the for-loop over columns and the variables numeric_col, col_type, is_calc,
is_pk and change the condition so primary keys are skipped when choosing the
aggregate column.
In `@wren/src/wren/memory/store.py`:
- Around line 165-172: The code returns early when
generate_seed_queries(manifest) yields no pairs, preventing removal of existing
seed rows; change the flow so that old seeds are always cleared before
returning: check for _QUERY_TABLE via _table_names(self._db), open the table
with self._db.open_table(_QUERY_TABLE) and call table.delete(f"tags =
'{SEED_TAG}'") unconditionally (or at least before the early return), then if
pairs is empty return 0; keep the user entries intact by only deleting rows
where tags == SEED_TAG; use the existing symbols generate_seed_queries,
manifest, _QUERY_TABLE, _table_names, self._db, table.delete, and SEED_TAG to
locate and update the logic.
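The control-flow change can be sketched against a tiny in-memory stand-in for the query table (the `FakeStore` below is purely illustrative, not the LanceDB-backed store):

```python
class FakeStore:
    """Minimal in-memory stand-in for the query_history table."""
    def __init__(self):
        self.rows = []
    def delete_where(self, tags):
        self.rows = [r for r in self.rows if r["tags"] != tags]
    def insert(self, **row):
        self.rows.append(row)

def upsert_seed_queries(store, pairs, seed_tag="source:seed"):
    """Clear stale seeds *before* the empty-pairs early return, so
    re-indexing a manifest that yields no pairs still removes old seeds.
    Only rows tagged seed_tag are deleted; user entries are untouched."""
    store.delete_where(tags=seed_tag)
    if not pairs:
        return 0
    for pair in pairs:
        store.insert(nl=pair["nl"], sql=pair["sql"], tags=seed_tag)
    return len(pairs)

store = FakeStore()
store.insert(nl="old seed", sql="SELECT 1", tags="source:seed")
store.insert(nl="user query", sql="SELECT 2", tags="source:user")
count = upsert_seed_queries(store, [])  # empty manifest
print(count, len(store.rows))  # 0 1  (stale seed removed, user row kept)
```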
📒 Files selected for processing (6)
- wren/src/wren/memory/__init__.py
- wren/src/wren/memory/cli.py
- wren/src/wren/memory/seed_queries.py
- wren/src/wren/memory/store.py
- wren/tests/unit/test_memory.py
- wren/tests/unit/test_seed_queries.py
Follow-up commit:

- Exclude PK columns from the numeric aggregation template to avoid nonsensical SUM(primary_key) queries
- Strip whitespace from relationship condition before validation so whitespace-only strings are treated as missing
- Clear old seed entries before checking if new pairs exist, so re-indexing an empty manifest removes stale seeds rather than leaving them

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (1)
wren/src/wren/memory/store.py (1)
175-180: Batch seed inserts to reduce embedding/DB overhead.

Line 176 currently calls `store_query()` once per pair, which recomputes embeddings and performs per-row table operations. Consider batching records in `_upsert_seed_queries()` for fewer round trips.

♻️ Proposed refactor (batched insert)

```diff
-        # Insert new seeds via the existing store_query() method
-        for pair in pairs:
-            self.store_query(
-                nl_query=pair["nl"],
-                sql_query=pair["sql"],
-                tags=SEED_TAG,
-            )
+        now = datetime.now(timezone.utc)
+        nl_queries = [pair["nl"] for pair in pairs]
+        vectors = self._embed_fn.compute_source_embeddings(nl_queries)
+        records = [
+            {
+                "text": pair["nl"],
+                "vector": vec,
+                "nl_query": pair["nl"],
+                "sql_query": pair["sql"],
+                "datasource": "",
+                "created_at": now,
+                "tags": SEED_TAG,
+            }
+            for pair, vec in zip(pairs, vectors)
+        ]
+
+        if _QUERY_TABLE in _table_names(self._db):
+            table = self._db.open_table(_QUERY_TABLE)
+            table.add(records)
+        else:
+            self._db.create_table(
+                _QUERY_TABLE,
+                records,
+                schema=self._query_table_schema(),
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren/src/wren/memory/store.py` around lines 175 - 180, The loop in _upsert_seed_queries currently calls store_query for every pair causing repeated embedding computation and per-row DB operations; change _upsert_seed_queries to batch the input (pairs) and perform a single embedding call and a single bulk upsert instead: collect the NL texts from pairs, call the embedding function once to get embeddings for the batch, build the list of records with their SQL, tags (SEED_TAG) and embeddings, and then call a new or existing bulk insert/upsert method on the store/DB rather than calling store_query per row; update or add a batched signature (e.g., store_queries_bulk or extend store_query to accept a list) and update callers accordingly so embeddings and DB writes happen in one round trip.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@wren/src/wren/memory/store.py`:
- Around line 175-180: The loop in _upsert_seed_queries currently calls
store_query for every pair causing repeated embedding computation and per-row DB
operations; change _upsert_seed_queries to batch the input (pairs) and perform a
single embedding call and a single bulk upsert instead: collect the NL texts
from pairs, call the embedding function once to get embeddings for the batch,
build the list of records with their SQL, tags (SEED_TAG) and embeddings, and
then call a new or existing bulk insert/upsert method on the store/DB rather
than calling store_query per row; update or add a batched signature (e.g.,
store_queries_bulk or extend store_query to accept a list) and update callers
accordingly so embeddings and DB writes happen in one round trip.
📒 Files selected for processing (3)
- wren/src/wren/memory/seed_queries.py
- wren/src/wren/memory/store.py
- wren/tests/unit/test_seed_queries.py
✅ Files skipped from review due to trivial changes (2)
- wren/tests/unit/test_seed_queries.py
- wren/src/wren/memory/seed_queries.py
Summary

- New `wren/src/wren/memory/seed_queries.py` — pure, deterministic template engine that generates 2–3 NL→SQL pairs per model (listing, aggregation, grouped aggregation) and one JOIN pair per relationship
- Updates `MemoryStore.index_schema()` to call `_upsert_seed_queries()` on every index run: old `source:seed` entries are replaced, user-confirmed pairs are preserved; returns `{"schema_items": int, "seed_queries": int}`
- Adds a `--no-seed` flag to the `wren memory index` CLI; updates output to show the seed count
- Updates the `WrenMemory.index_manifest()` public API to match the new dict return type

Motivation

`wren memory recall` returned nothing on first use, making it hard for agents to see its value. Seeding canonical examples at index time solves the cold-start problem without requiring an LLM or network access.

Test plan

- `just test-unit` — 61 passed, 18 skipped (lancedb/sentence-transformers skipped without the memory extra)
- New `test_seed_queries.py` covering all templates, edge cases, and type handling
- New `TestMemoryStoreSeedLifecycle` covering seed creation, the `--no-seed` flag, idempotent re-index, user-entry preservation, and recall

🤖 Generated with Claude Code
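Since `index_schema()` and `index_manifest()` now return a dict of counts rather than a bare int, a caller-side sketch of consuming the new shape (the `report` helper and its message format are illustrative, not the actual CLI output):

```python
def report(result: dict) -> str:
    """Format the dict returned by index_schema()/index_manifest(),
    e.g. {"schema_items": int, "seed_queries": int}."""
    return (f"Indexed {result['schema_items']} schema items, "
            f"seeded {result['seed_queries']} example queries")

print(report({"schema_items": 12, "seed_queries": 7}))
# Indexed 12 schema items, seeded 7 example queries
```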