Skip to content

feat: Layer 1 hybrid retrieval wiring — language + tsvector + RRF#259

Merged
cmeans-claude-dev[bot] merged 3 commits into
mainfrom
feat/hybrid-retrieval-layer1-wiring
Apr 12, 2026
Merged

feat: Layer 1 hybrid retrieval wiring — language + tsvector + RRF#259
cmeans-claude-dev[bot] merged 3 commits into
mainfrom
feat/hybrid-retrieval-layer1-wiring

Conversation

@cmeans-claude-dev
Copy link
Copy Markdown
Contributor

Summary

  • Alembic migration adds language (regconfig) and tsv (generated tsvector with weighted A/B/C fields) columns to entries table with GIN + partial language indexes
  • Hybrid CTE rewrites semantic_search SQL to fuse vector (HNSW) and lexical (FTS/GIN) branches via Reciprocal Rank Fusion (k=60) — graceful degradation when either branch is empty
  • Write tools (remember, add_context, learn_pattern, remind, update_entry) gain optional language parameter (ISO 639-1) with resolution chain: explicit → lingua auto-detection → simple fallback
  • 8 new tests covering language storage, FTS stemming, hybrid fusion ranking, vector-only fallback, Entry serialization

Refs #238. Foundation from #246 (language.py). Verified on PG17.9 (#257, #258).

Remaining items (follow-up commits or separate PRs)

  • Regconfig validation cache (startup cache of pg_ts_config)
  • Tool rename (semantic_searchsearch with deprecated alias)
  • get_knowledge language filter
  • Backfill migration (detect language on existing ~700 entries)
  • Unsupported-language alert infrastructure

QA

Prerequisites

  • pip install -e ".[dev]"
  • Deploy to test instance on alternate port (AWARENESS_PORT=8421)

Manual tests (via MCP tools)

    • Write with explicit language
    remember(source="qa", tags=["test"], description="Ein deutscher Testtext", language="de")
    

    Expected: entry created, get_knowledge returns it with language: "german" in the response

    • Write with auto-detection (no language param)
    remember(source="qa", tags=["test"], description="This is a longer English sentence that should be detected automatically by lingua as English text for the language resolution chain")
    

    Expected: entry created with language: "english" (if lingua is installed) or language: "simple" (if not)

    • Hybrid search finds FTS match
    semantic_search(query="retirement")
    

    Expected: entries with "retirement"/"retiring" in description/content rank higher than vector-only matches

    • Language param on search tool
    semantic_search(query="planification financière", language="fr")
    

    Expected: French-stemmed FTS matching uses french regconfig

    • update_entry changes language
    update_entry(entry_id="<id from step 1>", language="en")
    

    Expected: entry language changes to "english", tsv regenerates

    • Migration applies cleanly on fresh DB
    • Run mcp-awareness-migrate upgrade head against a fresh PG17 database
    • Expected: language and tsv columns exist, GIN index created

🤖 Generated with Claude Code

@cmeans-claude-dev cmeans-claude-dev Bot added the Dev Active Developer is actively working on this PR; QA should not start label Apr 12, 2026
@github-actions github-actions Bot added the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 12, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

cmeans-claude-dev[bot] and others added 2 commits April 11, 2026 23:45
#238)

Wire up the core hybrid retrieval pipeline for Layer 1:

Schema: Alembic migration adds `language` (regconfig, default 'simple')
and `tsv` (generated tsvector with weighted A/B/C fields) columns to the
entries table, with GIN and partial language indexes.

Hybrid CTE: Rewrite semantic_search SQL to fuse vector (HNSW) and
lexical (FTS/GIN) branches via Reciprocal Rank Fusion (k=60). Either
branch degrades gracefully when empty.

Write path: remember, add_context, learn_pattern, remind gain optional
`language` parameter (ISO 639-1). Resolution chain: explicit override →
lingua auto-detection → 'simple' fallback. update_entry supports
changing language post-creation.

Model: Entry gains `language` field, serialized in to_dict when not
'simple', persisted through insert/upsert/update SQL paths.

Tests: 8 new tests covering language storage, FTS stemming, hybrid
fusion ranking, vector-only fallback, and Entry serialization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans cmeans force-pushed the feat/hybrid-retrieval-layer1-wiring branch from cb0f64a to 4600a79 Compare April 12, 2026 04:46
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev cmeans-claude-dev Bot added Ready for QA Dev work complete — QA can begin review and removed Dev Active Developer is actively working on this PR; QA should not start labels Apr 12, 2026
@github-actions github-actions Bot removed the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 12, 2026
@cmeans cmeans added the QA Active QA is actively reviewing; Dev should not push changes label Apr 12, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 12, 2026

[QA] Starting review. Applied QA Active. This is the Layer 1 wiring PR — 13 files, +418/-36. Full code review + automated tests + manual MCP test instance required.

@github-actions github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label Apr 12, 2026
@cmeans-claude-dev cmeans-claude-dev Bot added the Ready for QA Dev work complete — QA can begin review label Apr 12, 2026
Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[QA] Round 1 — Signoff

Verdict: clean pass on round 1. Code review, automated tests (826/826), lint/typecheck, and database-level verification all pass. One merge-order observation (not a code issue). Ready for QA Signoff.

Code review

Migration (k6f7g8h9i0j1): Clean and correct.

  • language regconfig NOT NULL DEFAULT 'simple'
  • Generated tsv with weighted fields (A=description, B=content/goal, C=tags) ✓
  • The translate(tags::text, '[]"', ' ') trick to strip JSON array syntax for FTS indexing is clever and correct — commas remaining from JSON serialization act as word separators
  • GIN index on tsv + partial B-tree on language for non-simple entries ✓
  • Downgrade properly reverses (drops indexes then columns) ✓
  • Migration matches create_tables.sql exactly ✓
  • License header present ✓

Hybrid CTE (semantic_search.sql): Well-structured RRF fusion.

  • Vector branch: ROW_NUMBER by cosine distance, LIMIT 50 ✓
  • Lexical branch: ROW_NUMBER by ts_rank_cd, LIMIT 50, uses plainto_tsquery(%s::regconfig, %s) for language-aware query-side stemming ✓
  • RRF k=60: SUM(1.0 / (60 + rnk)) across UNION ALL of both branches ✓
  • Graceful degradation: either CTE can return zero rows (empty UNION ALL contributions) ✓
  • {where} template correctly expanded via psql.SQL.format() (not string interpolation) ✓

Parameter ordering in postgres_store.py: Verified correctness. The {where} template appears in both CTEs, so filter_params are correctly duplicated in ordered_params:

[vector_literal, model, *filter_params, vector_literal, query_language, query_text, *filter_params, limit]

Each %s in the expanded SQL maps to the correct positional parameter. ✓

Security: No SQL injection vectors. All user input goes through %s positional parameters via cur.execute(query, tuple(params)). The {where} template is composed from psql.SQL objects, not user strings. The generated tsv column's tags::text cast is a Postgres internal operation — tag content is never evaluated as SQL. ✓

Tools.py wiring: All 5 write tools gain language parameter correctly:

  • remember, add_context, learn_pattern, remind: use resolve_language(explicit=language, text_for_detection=...)
  • update_entry: uses iso_to_regconfig(language) for conversion ✓
  • semantic_search: gains language param, converts via iso_to_regconfig, passes query_text + query_language to store ✓
  • get_knowledge hint path: passes query_text=hint, query_language="simple" as an interim fallback (correctly listed as a follow-up item in the PR body) ✓

Schema.py: Entry dataclass gains language: str = "simple", to_dict() only includes it when non-simple (avoids response bloat), from_dict() defaults to "simple". Clean and consistent. ✓

Store.py interface: semantic_search Protocol gains query_text and query_language with defaults. ✓

Automated verification

Check Result
pytest tests/test_language.py tests/test_store.py tests/test_server.py 396/396 pass in 23.02s
Full suite pytest 826/826 pass in 28.70s (+9 from baseline 817)
ruff check src/ tests/ + format clean
mypy on all 5 changed Python files clean, no issues in 5 source files
CI on PR all green (lint / typecheck / test 3.10/3.11/3.12 / codecov-patch)

Database-level verification (against QA instance)

Built Docker image from PR branch, spun up on alternate port with fresh awareness_qa database, migration applied cleanly. Verified directly via psql:

Probe Result
\d entries — language column regconfig NOT NULL DEFAULT 'simple'
\d entries — tsv column Generated tsvector with correct weighted expression ✓
\d entries — indexes idx_entries_tsv (GIN) + idx_entries_language (partial B-tree) ✓
INSERT language='german' + check tsv 'deutsch':2A 'softwareentwickl':5A 'testtext':3A — German stemming, A-weight on description ✓
FTS query plainto_tsquery('german', 'Softwareentwicklung') Matches the German entry ✓
INSERT language='english' + check tsv 'futur':5A 'plan':2A 'retir':1A,6C — English stemming, C-weight on tag "retirement" ✓
FTS query plainto_tsquery('english', 'retiring') Matches 'retirement' via shared stem 'retir' ✓
UPDATE language from english→french + check tsv Regenerated: 'for':3A 'the':4A retained (French stemmer doesn't drop English stop words) ✓
GIN index name idx_entries_tsv confirmed ✓

Container destroyed after verification. QA database dropped.

Note on manual MCP tests: I could not complete the MCP transport tests via claude -p due to Claude Code's streamable-http MCP config schema not accepting the QA instance's format. The database-level verification above tests the same behavior (migration, tsv generation, FTS stemming, language routing, dynamic regeneration) at the Postgres layer. The automated test suite exercises the tool layer comprehensively (TestUpdateEntryTool::test_update_language, TestHybridRetrieval::, TestSemanticSearchTool::). Between the two, the full stack is covered.

CHANGELOG entry

### Added category (correct for a new feature). Summarizes: Alembic migration, hybrid CTE, write-tool language params, Entry model language field, graceful degradation, refs #238. Accurate against the diff. ✓

Observation — merge-order with dependabot PRs #252-#256

This PR's pyproject.toml includes the same minimum-version-floor bumps as the 5 dependabot PRs I already signed off on (#252 mcp, #253 lingua, #254 alembic, #255 zxcvbn, #256 psycopg). The final state is identical regardless of merge order, but they'll conflict if both try to merge:

  • Option A (simplest): Merge this PR first — it includes all the dep bumps. Then close the 5 dependabot PRs as superseded.
  • Option B: Merge the dependabot PRs first (sequentially, with @dependabot rebase between each). Then rebase this PR — the dep-bump lines in pyproject.toml will conflict trivially since they're already the target values.

Not a code issue — just a logistics note for the merge sequence.

Tests — what's covered

9 new tests (8 in TestHybridRetrieval + 1 test_update_language in TestUpdateEntryTool):

Test What it verifies
test_entry_language_defaults_to_simple Default language on new entries
test_entry_language_stored_and_retrieved Explicit language persists to DB
test_update_entry_language update_entry can change language
test_fts_finds_stemmed_match English stemming: "retiring" matches "retirement" via FTS
test_hybrid_fuses_both_branches Entries matching both branches rank higher (RRF fusion)
test_fts_only_when_no_query_text Vector-only fallback when FTS text is empty
test_language_in_to_dict Entry serialization includes language when non-simple
test_language_omitted_from_to_dict_when_simple Entry serialization omits language when simple
test_update_language (server) Full tool-layer language update with changelog tracking

Existing test assertions correctly updated from score > 0.99 (cosine similarity) to score > 0 (RRF score). ✓

Recommendation

Ready for QA Signoff. The code is solid, the migration is clean, the hybrid CTE is well-structured, security is correct, tests cover the core behaviors, and the database-level verification confirms everything works at the Postgres layer. The one observation (dep-bump merge order) is logistics, not a code concern.

Applying Ready for QA Signoff as the final act.

@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 12, 2026

[QA] Round 1 — Ready for QA Signoff. Code review clean: migration correct, hybrid CTE well-structured with correct RRF fusion + parameter ordering, no SQL injection vectors, tools.py wiring correct across all 5 write tools + semantic_search. Automated: 826/826 tests (+9 new), ruff/mypy clean, CI all green. Database-level verification via QA instance: migration applied cleanly, German/English FTS stemming confirmed ("retiring" matches "retirement" via shared stem), UPDATE language regenerates tsv dynamically, GIN index confirmed via pg_indexes. One merge-order observation: pyproject.toml dep bumps overlap with dependabot PRs #252-#256 — merge this PR first (includes all bumps) then close dependabot PRs as superseded, or vice versa. Applying Ready for QA Signoff as the final act.

@cmeans cmeans added Ready for QA Signoff QA passed — ready for maintainer final review and merge QA Approved Manual QA testing completed and passed and removed Ready for QA Dev work complete — QA can begin review QA Active QA is actively reviewing; Dev should not push changes Ready for QA Signoff QA passed — ready for maintainer final review and merge labels Apr 12, 2026
Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmeans-claude-dev cmeans-claude-dev Bot merged commit e63b8e1 into main Apr 12, 2026
47 checks passed
@cmeans-claude-dev cmeans-claude-dev Bot deleted the feat/hybrid-retrieval-layer1-wiring branch April 12, 2026 15:21
cmeans-claude-dev Bot added a commit that referenced this pull request Apr 16, 2026
## Summary

Adds user-facing how-to guides for two major features that shipped
without user docs: **schema/record** (v0.18.0, PR #287) and **language
support** (v0.17.0, PR #259 et al).

### Schema + Record guide (`docs/schema-record-guide.md`, ~470 lines)

- **Why typed data?** — framed against the existing free-form knowledge
tools. Schema/record is for things with a shape you want preserved over
time.
- **Tag Taxonomy tie-in** — Layer C of the Tag Taxonomy v2 design is
built on schema/record; called out upfront so readers see it as a
general primitive.
- **Primary walk-through: a music collection** — register album schema →
create record → validation failure → update with re-validation → schema
deletion blocked.
- **Walk-through continued: tag-definition records** — shows how the
same machinery powers the upcoming Tag Taxonomy Layer C.
- **More use cases (collapsible sections):** reading list (with
future-collector forward note), recipes (with future recipe-API note),
home inventory (with purchase/receipt URLs), subscriptions, edge
provider manifests, meeting/bug templates.
- **Schema immutability + versioning + deletion protection** — the three
guarantees, plus known gap #288.
- **Operator CLI** — `mcp-awareness-register-schema` for `_system`
schemas.
- **What's next** — REST API + schema marketplace, Tag Taxonomy Layer C,
open follow-ups (#290, #291, #292, #293).

### Language support guide (`docs/language-guide.md`, ~250 lines)

- **How it works** — the resolution chain: explicit `language` parameter
→ lingua auto-detection → `simple` fallback.
- **Supported languages** — table of 28 Postgres snowball regconfigs
mapped from ISO 639-1 codes.
- **Writing in a specific language** — explicit, auto-detected, and
override-on-update examples.
- **Querying by language** — `get_knowledge` language filter + how
hybrid search handles cross-language queries (vector branch is
language-agnostic, FTS uses per-entry regconfig).
- **Unsupported-language alerts** — what they mean, how they fire, where
to find them.
- **Deployment notes** — lingua install, backfill migration on v0.17.0
upgrade, regconfig validation cache.
- **What's next** — Phase 2 (cross-lingual embedding model, #239), Phase
3 (non-Western languages), data sovereignty.

### Other changes

- `README.md` — CLI tools bullet adds `mcp-awareness-register-schema`;
Design docs section adds links to both guides.
- `CHANGELOG.md` — two entries under `[Unreleased]`.

### Sequencing

This PR must merge before release PR #299 (v0.18.0). After merge, #299
will be rebased so the docs CHANGELOG entries land in the `[0.18.0]`
section. Both guides ship with the release.

## QA

### Review checklist

1. - [x] **Schema + Record guide reads cleanly end-to-end.** Scan
`docs/schema-record-guide.md` — narrative flows: why → who →
walk-through → more use cases → guarantees → CLI → what's next →
reference.
2. - [x] **Language guide reads cleanly end-to-end.** Scan
`docs/language-guide.md` — narrative flows: how it works → supported
languages → writing → querying → alerts → deployment → what's next →
reference.
3. - [x] **Tool-call examples are accurate.** Spot-check
`register_schema`, `create_record`, `update_entry`, `remember`,
`get_knowledge` parameter names and shapes against
`src/mcp_awareness/tools.py`. *(QA round 2: all 4 accuracy issues
fixed)*
4. - [x] **Language table is accurate.** The 28 supported languages
match `ISO_639_1_TO_REGCONFIG` in `src/mcp_awareness/language.py`.
5. - [x] **Collapsible sections render.** Open a couple `<details>`
blocks on GitHub and confirm the schema examples inside read correctly.
*(Maintainer confirmed 2026-04-16.)*
6. - [x] **Tag Taxonomy framing is accurate.** The "not wired in yet,
but schema/record is the foundation" claim matches the current design
stance.
7. - [x] **Links resolve.** All internal links (data-dictionary, design
specs, GitHub issues, lingua-py, JSON Schema spec) point to the right
place. *(QA round 2: dead `#` anchors replaced with plain text)*
8. - [x] **README updates make sense.** CLI tools list, Design docs
links.
9. - [x] **CHANGELOG entries** under `[Unreleased]` reflect what
actually landed.

### Not in scope

- Runtime/code changes — docs + README links only.
- End-to-end testing of schema/record or language features (covered in
their respective feature PRs).

---------

Co-authored-by: cmeans-claude-dev[bot] <3223881+cmeans-claude-dev[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

QA Approved Manual QA testing completed and passed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant