feat: Layer 1 hybrid retrieval wiring — language + tsvector + RRF by cmeans-claude-dev[bot] · Pull Request #259 · cmeans/mcp-awareness

cmeans-claude-dev · 2026-04-12T04:32:11Z

Summary

Alembic migration adds language (regconfig) and tsv (generated tsvector with weighted A/B/C fields) columns to entries table with GIN + partial language indexes
Hybrid CTE rewrites semantic_search SQL to fuse vector (HNSW) and lexical (FTS/GIN) branches via Reciprocal Rank Fusion (k=60) — graceful degradation when either branch is empty
Write tools (remember, add_context, learn_pattern, remind, update_entry) gain optional language parameter (ISO 639-1) with resolution chain: explicit → lingua auto-detection → simple fallback
8 new tests covering language storage, FTS stemming, hybrid fusion ranking, vector-only fallback, Entry serialization

Refs #238. Foundation from #246 (language.py). Verified on PG17.9 (#257, #258).

Remaining items (follow-up commits or separate PRs)

Regconfig validation cache (startup cache of pg_ts_config)
Tool rename (semantic_search → search with deprecated alias)
get_knowledge language filter
Backfill migration (detect language on existing ~700 entries)
Unsupported-language alert infrastructure

QA

Prerequisites

pip install -e ".[dev]"
Deploy to test instance on alternate port (AWARENESS_PORT=8421)

Manual tests (via MCP tools)

- Write with explicit language
```
remember(source="qa", tags=["test"], description="Ein deutscher Testtext", language="de")
```
Expected: entry created, get_knowledge returns it with language: "german" in the response

Write with auto-detection (no language param)

remember(source="qa", tags=["test"], description="This is a longer English sentence that should be detected automatically by lingua as English text for the language resolution chain")

Expected: entry created with language: "english" (if lingua is installed) or language: "simple" (if not)

- Hybrid search finds FTS match
```
semantic_search(query="retirement")
```
Expected: entries with "retirement"/"retiring" in description/content rank higher than vector-only matches
- Language param on search tool
```
semantic_search(query="planification financière", language="fr")
```
Expected: French-stemmed FTS matching uses french regconfig
- update_entry changes language
```
update_entry(entry_id="<id from step 1>", language="en")
```
Expected: entry language changes to "english", tsv regenerates
- Migration applies cleanly on fresh DB
- Run mcp-awareness-migrate upgrade head against a fresh PG17 database
- Expected: language and tsv columns exist, GIN index created

🤖 Generated with Claude Code

codecov · 2026-04-12T04:34:35Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

#238) Wire up the core hybrid retrieval pipeline for Layer 1: Schema: Alembic migration adds `language` (regconfig, default 'simple') and `tsv` (generated tsvector with weighted A/B/C fields) columns to the entries table, with GIN and partial language indexes. Hybrid CTE: Rewrite semantic_search SQL to fuse vector (HNSW) and lexical (FTS/GIN) branches via Reciprocal Rank Fusion (k=60). Either branch degrades gracefully when empty. Write path: remember, add_context, learn_pattern, remind gain optional `language` parameter (ISO 639-1). Resolution chain: explicit override → lingua auto-detection → 'simple' fallback. update_entry supports changing language post-creation. Model: Entry gains `language` field, serialized in to_dict when not 'simple', persisted through insert/upsert/update SQL paths. Tests: 8 new tests covering language storage, FTS stemming, hybrid fusion ranking, vector-only fallback, and Entry serialization. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cmeans · 2026-04-12T04:59:43Z

[QA] Starting review. Applied QA Active. This is the Layer 1 wiring PR — 13 files, +418/-36. Full code review + automated tests + manual MCP test instance required.

cmeans

[QA] Round 1 — Signoff

Verdict: clean pass on round 1. Code review, automated tests (826/826), lint/typecheck, and database-level verification all pass. One merge-order observation (not a code issue). Ready for QA Signoff.

Code review

Migration (k6f7g8h9i0j1): Clean and correct.

language regconfig NOT NULL DEFAULT 'simple' ✓
Generated tsv with weighted fields (A=description, B=content/goal, C=tags) ✓
The translate(tags::text, '[]"', ' ') trick to strip JSON array syntax for FTS indexing is clever and correct — commas remaining from JSON serialization act as word separators
GIN index on tsv + partial B-tree on language for non-simple entries ✓
Downgrade properly reverses (drops indexes then columns) ✓
Migration matches create_tables.sql exactly ✓
License header present ✓

Hybrid CTE (semantic_search.sql): Well-structured RRF fusion.

Vector branch: ROW_NUMBER by cosine distance, LIMIT 50 ✓
Lexical branch: ROW_NUMBER by ts_rank_cd, LIMIT 50, uses plainto_tsquery(%s::regconfig, %s) for language-aware query-side stemming ✓
RRF k=60: SUM(1.0 / (60 + rnk)) across UNION ALL of both branches ✓
Graceful degradation: either CTE can return zero rows (empty UNION ALL contributions) ✓
{where} template correctly expanded via psql.SQL.format() (not string interpolation) ✓

Parameter ordering in postgres_store.py: Verified correctness. The {where} template appears in both CTEs, so filter_params are correctly duplicated in ordered_params:

[vector_literal, model, *filter_params, vector_literal, query_language, query_text, *filter_params, limit]

Each %s in the expanded SQL maps to the correct positional parameter. ✓

Security: No SQL injection vectors. All user input goes through %s positional parameters via cur.execute(query, tuple(params)). The {where} template is composed from psql.SQL objects, not user strings. The generated tsv column's tags::text cast is a Postgres internal operation — tag content is never evaluated as SQL. ✓

Tools.py wiring: All 5 write tools gain language parameter correctly:

remember, add_context, learn_pattern, remind: use resolve_language(explicit=language, text_for_detection=...) ✓
update_entry: uses iso_to_regconfig(language) for conversion ✓
semantic_search: gains language param, converts via iso_to_regconfig, passes query_text + query_language to store ✓
get_knowledge hint path: passes query_text=hint, query_language="simple" as an interim fallback (correctly listed as a follow-up item in the PR body) ✓

Schema.py: Entry dataclass gains language: str = "simple", to_dict() only includes it when non-simple (avoids response bloat), from_dict() defaults to "simple". Clean and consistent. ✓

Store.py interface: semantic_search Protocol gains query_text and query_language with defaults. ✓

Automated verification

Check	Result
`pytest tests/test_language.py tests/test_store.py tests/test_server.py`	396/396 pass in 23.02s
Full suite `pytest`	826/826 pass in 28.70s (+9 from baseline 817)
`ruff check src/ tests/` + format	clean
`mypy` on all 5 changed Python files	clean, no issues in 5 source files
CI on PR	all green (lint / typecheck / test 3.10/3.11/3.12 / codecov-patch)

Database-level verification (against QA instance)

Built Docker image from PR branch, spun up on alternate port with fresh awareness_qa database, migration applied cleanly. Verified directly via psql:

Probe	Result
`\d entries` — language column	`regconfig NOT NULL DEFAULT 'simple'` ✓
`\d entries` — tsv column	Generated tsvector with correct weighted expression ✓
`\d entries` — indexes	`idx_entries_tsv` (GIN) + `idx_entries_language` (partial B-tree) ✓
INSERT `language='german'` + check tsv	`'deutsch':2A 'softwareentwickl':5A 'testtext':3A` — German stemming, A-weight on description ✓
FTS query `plainto_tsquery('german', 'Softwareentwicklung')`	Matches the German entry ✓
INSERT `language='english'` + check tsv	`'futur':5A 'plan':2A 'retir':1A,6C` — English stemming, C-weight on tag "retirement" ✓
FTS query `plainto_tsquery('english', 'retiring')`	Matches 'retirement' via shared stem 'retir' ✓
UPDATE `language` from english→french + check tsv	Regenerated: `'for':3A 'the':4A` retained (French stemmer doesn't drop English stop words) ✓
GIN index name	`idx_entries_tsv` confirmed ✓

Container destroyed after verification. QA database dropped.

Note on manual MCP tests: I could not complete the MCP transport tests via claude -p due to Claude Code's streamable-http MCP config schema not accepting the QA instance's format. The database-level verification above tests the same behavior (migration, tsv generation, FTS stemming, language routing, dynamic regeneration) at the Postgres layer. The automated test suite exercises the tool layer comprehensively (TestUpdateEntryTool::test_update_language, TestHybridRetrieval::, TestSemanticSearchTool::). Between the two, the full stack is covered.

CHANGELOG entry

### Added category (correct for a new feature). Summarizes: Alembic migration, hybrid CTE, write-tool language params, Entry model language field, graceful degradation, refs #238. Accurate against the diff. ✓

Observation — merge-order with dependabot PRs #252-#256

This PR's pyproject.toml includes the same minimum-version-floor bumps as the 5 dependabot PRs I already signed off on (#252 mcp, #253 lingua, #254 alembic, #255 zxcvbn, #256 psycopg). The final state is identical regardless of merge order, but they'll conflict if both try to merge:

Option A (simplest): Merge this PR first — it includes all the dep bumps. Then close the 5 dependabot PRs as superseded.
Option B: Merge the dependabot PRs first (sequentially, with @dependabot rebase between each). Then rebase this PR — the dep-bump lines in pyproject.toml will conflict trivially since they're already the target values.

Not a code issue — just a logistics note for the merge sequence.

Tests — what's covered

9 new tests (8 in TestHybridRetrieval + 1 test_update_language in TestUpdateEntryTool):

Test	What it verifies
`test_entry_language_defaults_to_simple`	Default language on new entries
`test_entry_language_stored_and_retrieved`	Explicit language persists to DB
`test_update_entry_language`	update_entry can change language
`test_fts_finds_stemmed_match`	English stemming: "retiring" matches "retirement" via FTS
`test_hybrid_fuses_both_branches`	Entries matching both branches rank higher (RRF fusion)
`test_fts_only_when_no_query_text`	Vector-only fallback when FTS text is empty
`test_language_in_to_dict`	Entry serialization includes language when non-simple
`test_language_omitted_from_to_dict_when_simple`	Entry serialization omits language when simple
`test_update_language` (server)	Full tool-layer language update with changelog tracking

Existing test assertions correctly updated from score > 0.99 (cosine similarity) to score > 0 (RRF score). ✓

Recommendation

Ready for QA Signoff. The code is solid, the migration is clean, the hybrid CTE is well-structured, security is correct, tests cover the core behaviors, and the database-level verification confirms everything works at the Postgres layer. The one observation (dep-bump merge order) is logistics, not a code concern.

Applying Ready for QA Signoff as the final act.

cmeans · 2026-04-12T05:09:41Z

[QA] Round 1 — Ready for QA Signoff. Code review clean: migration correct, hybrid CTE well-structured with correct RRF fusion + parameter ordering, no SQL injection vectors, tools.py wiring correct across all 5 write tools + semantic_search. Automated: 826/826 tests (+9 new), ruff/mypy clean, CI all green. Database-level verification via QA instance: migration applied cleanly, German/English FTS stemming confirmed ("retiring" matches "retirement" via shared stem), UPDATE language regenerates tsv dynamically, GIN index confirmed via pg_indexes. One merge-order observation: pyproject.toml dep bumps overlap with dependabot PRs #252-#256 — merge this PR first (includes all bumps) then close dependabot PRs as superseded, or vice versa. Applying Ready for QA Signoff as the final act.

cmeans

LGTM

## Summary Adds user-facing how-to guides for two major features that shipped without user docs: **schema/record** (v0.18.0, PR #287) and **language support** (v0.17.0, PR #259 et al). ### Schema + Record guide (`docs/schema-record-guide.md`, ~470 lines) - **Why typed data?** — framed against the existing free-form knowledge tools. Schema/record is for things with a shape you want preserved over time. - **Tag Taxonomy tie-in** — Layer C of the Tag Taxonomy v2 design is built on schema/record; called out upfront so readers see it as a general primitive. - **Primary walk-through: a music collection** — register album schema → create record → validation failure → update with re-validation → schema deletion blocked. - **Walk-through continued: tag-definition records** — shows how the same machinery powers the upcoming Tag Taxonomy Layer C. - **More use cases (collapsible sections):** reading list (with future-collector forward note), recipes (with future recipe-API note), home inventory (with purchase/receipt URLs), subscriptions, edge provider manifests, meeting/bug templates. - **Schema immutability + versioning + deletion protection** — the three guarantees, plus known gap #288. - **Operator CLI** — `mcp-awareness-register-schema` for `_system` schemas. - **What's next** — REST API + schema marketplace, Tag Taxonomy Layer C, open follow-ups (#290, #291, #292, #293). ### Language support guide (`docs/language-guide.md`, ~250 lines) - **How it works** — the resolution chain: explicit `language` parameter → lingua auto-detection → `simple` fallback. - **Supported languages** — table of 28 Postgres snowball regconfigs mapped from ISO 639-1 codes. - **Writing in a specific language** — explicit, auto-detected, and override-on-update examples. - **Querying by language** — `get_knowledge` language filter + how hybrid search handles cross-language queries (vector branch is language-agnostic, FTS uses per-entry regconfig). - **Unsupported-language alerts** — what they mean, how they fire, where to find them. - **Deployment notes** — lingua install, backfill migration on v0.17.0 upgrade, regconfig validation cache. - **What's next** — Phase 2 (cross-lingual embedding model, #239), Phase 3 (non-Western languages), data sovereignty. ### Other changes - `README.md` — CLI tools bullet adds `mcp-awareness-register-schema`; Design docs section adds links to both guides. - `CHANGELOG.md` — two entries under `[Unreleased]`. ### Sequencing This PR must merge before release PR #299 (v0.18.0). After merge, #299 will be rebased so the docs CHANGELOG entries land in the `[0.18.0]` section. Both guides ship with the release. ## QA ### Review checklist 1. - [x] **Schema + Record guide reads cleanly end-to-end.** Scan `docs/schema-record-guide.md` — narrative flows: why → who → walk-through → more use cases → guarantees → CLI → what's next → reference. 2. - [x] **Language guide reads cleanly end-to-end.** Scan `docs/language-guide.md` — narrative flows: how it works → supported languages → writing → querying → alerts → deployment → what's next → reference. 3. - [x] **Tool-call examples are accurate.** Spot-check `register_schema`, `create_record`, `update_entry`, `remember`, `get_knowledge` parameter names and shapes against `src/mcp_awareness/tools.py`. *(QA round 2: all 4 accuracy issues fixed)* 4. - [x] **Language table is accurate.** The 28 supported languages match `ISO_639_1_TO_REGCONFIG` in `src/mcp_awareness/language.py`. 5. - [x] **Collapsible sections render.** Open a couple `<details>` blocks on GitHub and confirm the schema examples inside read correctly. *(Maintainer confirmed 2026-04-16.)* 6. - [x] **Tag Taxonomy framing is accurate.** The "not wired in yet, but schema/record is the foundation" claim matches the current design stance. 7. - [x] **Links resolve.** All internal links (data-dictionary, design specs, GitHub issues, lingua-py, JSON Schema spec) point to the right place. *(QA round 2: dead `#` anchors replaced with plain text)* 8. - [x] **README updates make sense.** CLI tools list, Design docs links. 9. - [x] **CHANGELOG entries** under `[Unreleased]` reflect what actually landed. ### Not in scope - Runtime/code changes — docs + README links only. - End-to-end testing of schema/record or language features (covered in their respective feature PRs). --------- Co-authored-by: cmeans-claude-dev[bot] <3223881+cmeans-claude-dev[bot]@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cmeans-claude-dev Bot added the Dev Active Developer is actively working on this PR; QA should not start label Apr 12, 2026

github-actions Bot added the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 12, 2026

cmeans-claude-dev[bot] and others added 2 commits April 11, 2026 23:45

test: cover update_entry language param (Codecov gap)

4600a79

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cmeans force-pushed the feat/hybrid-retrieval-layer1-wiring branch from cb0f64a to 4600a79 Compare April 12, 2026 04:46

style: ruff format test_server.py

8c07626

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cmeans-claude-dev Bot added Ready for QA Dev work complete — QA can begin review and removed Dev Active Developer is actively working on this PR; QA should not start labels Apr 12, 2026

github-actions Bot removed the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 12, 2026

cmeans added the QA Active QA is actively reviewing; Dev should not push changes label Apr 12, 2026

github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label Apr 12, 2026

cmeans-claude-dev Bot added the Ready for QA Dev work complete — QA can begin review label Apr 12, 2026

cmeans reviewed Apr 12, 2026

View reviewed changes

cmeans approved these changes Apr 12, 2026

View reviewed changes

cmeans-claude-dev Bot merged commit e63b8e1 into main Apr 12, 2026
47 checks passed

cmeans-claude-dev Bot deleted the feat/hybrid-retrieval-layer1-wiring branch April 12, 2026 15:21

This was referenced Apr 12, 2026

feat: regconfig validation cache — startup pg_ts_config check #260

Closed

feat: backfill language on existing entries via lingua detection #263

Closed

docs: README update for Layer 1 hybrid retrieval features #271

Closed

cmeans mentioned this pull request Apr 12, 2026

docs: README update for Layer 1 hybrid retrieval features #273

Merged

cmeans-claude-dev Bot mentioned this pull request Apr 16, 2026

docs: add Schema + Record and Language support guides #300

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Layer 1 hybrid retrieval wiring — language + tsvector + RRF#259

feat: Layer 1 hybrid retrieval wiring — language + tsvector + RRF#259
cmeans-claude-dev[bot] merged 3 commits into
mainfrom
feat/hybrid-retrieval-layer1-wiring

cmeans-claude-dev Bot commented Apr 12, 2026

Uh oh!

codecov Bot commented Apr 12, 2026 •

edited

Loading

Uh oh!

cmeans commented Apr 12, 2026

Uh oh!

cmeans left a comment

Uh oh!

cmeans commented Apr 12, 2026

Uh oh!

cmeans left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

cmeans-claude-dev Bot commented Apr 12, 2026

Summary

Remaining items (follow-up commits or separate PRs)

QA

Prerequisites

Manual tests (via MCP tools)

Uh oh!

codecov Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

cmeans commented Apr 12, 2026

Uh oh!

cmeans left a comment

Choose a reason for hiding this comment

[QA] Round 1 — Signoff

Code review

Automated verification

Database-level verification (against QA instance)

CHANGELOG entry

Observation — merge-order with dependabot PRs #252-#256

Tests — what's covered

Recommendation

Uh oh!

cmeans commented Apr 12, 2026

Uh oh!

cmeans left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented Apr 12, 2026 •

edited

Loading