Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -21,6 +21,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **`TestWriteResponseShapes`** test class — sentinel-scan parametrized over every write tool plus two registry-completeness tests. Catches regressions when a future write tool echoes caller-supplied payload fields. The `ECHO_EXEMPTIONS` registry doubles as the executable spec for what counts as a primary handle vs a payload echo ([#243](https://github.com/cmeans/mcp-awareness/issues/243)).

### Changed
- **README** — document Layer 1 hybrid retrieval features: hybrid `search` tool (RRF fusion), per-entry language support with auto-detection, `get_knowledge` language filter, regconfig validation, unsupported-language alerts, FTS infrastructure. Refs [#271](https://github.com/cmeans/mcp-awareness/issues/271).
- **Design doc** — `docs/design/hybrid-retrieval-multilingual.md` — close Phase 1.05 (extension selection) with **option 3 — defer non-Western language support from Layer 1**. Decision recorded 2026-04-11 after the empirical PG17.9 verification ([#257](https://github.com/cmeans/mcp-awareness/pull/257)) returned a definitive negative on pgroonga regconfig integration, ruling out the original "install pgroonga, use the 4 entries" path. The trilemma was resolved in favor of pragmatic Layer 1 shipping: at 1 month into mcp-awareness development with no public users and no signal on multilingual demand, Layer 1 ships with the 28 stock snowball regconfigs and `simple` as the fallback for everything else. CJK + Hebrew + Thai + Khmer support becomes a deliberate follow-up release when actual demand surfaces. The decision tree (per-language parser extensions, branched-pgroonga path, or external search index) is preserved in the design doc for the future evaluation, with the empirical verification status of each option documented in the "Verified empirical results for future reference" subsection — zhparser confirmed via context7 during [#246](https://github.com/cmeans/mcp-awareness/pull/246), pgroonga 4.0.6 empirically ruled out by [#257](https://github.com/cmeans/mcp-awareness/pull/257)'s PG17.9 verification, Typesense 29.0 empirically tested in a 20-operation spike on 2026-04-11 (see awareness `typesense-spike-2026-04-11` and `~/.local/state/mcp-awareness-typesense-spike/test-results-2026-04-11.md` for the full test matrix), and Meilisearch documented per its official documentation reviewed via context7 against `/meilisearch/documentation` on 2026-04-11 but not empirically tested. Phase 3 (non-Western language extension install) is reframed as a wiring-PR follow-on contingent on demand. The managed-Postgres compatibility section is reframed as contingent on Phase 3 reactivation. 
Closes [#249](https://github.com/cmeans/mcp-awareness/issues/249) (gating question answered, mechanism chosen) and [#248](https://github.com/cmeans/mcp-awareness/issues/248) (original premise — measure pgroonga regconfig memory cost — moot since those regconfigs do not exist; surviving stock-snowball measurement scope deferred as below-the-line for current scale).
- **Design doc** — `docs/design/hybrid-retrieval-multilingual.md` — record the empirical PG17.9 verification results for Steps 0 and 1 of the schema verification task ([#249](https://github.com/cmeans/mcp-awareness/issues/249)). **Step 0 (Substantive 3, gating): pgroonga 4.0.6 does not register any regconfigs in `pg_ts_config`** — verified by capturing `SELECT cfgname FROM pg_ts_config` before and after `CREATE EXTENSION pgroonga` against `groonga/pgroonga:latest-alpine-17`; both queries returned the same 29 rows (28 stock snowball + `simple`). `to_tsvector('japanese', '...')` errors with `text search configuration "japanese" does not exist`. The pgroonga extension is functional under its documented integration model (`USING pgroonga` index access method + `&@` operator successfully indexes/queries Japanese and Chinese content); the regconfig absence is by design, not a packaging bug. **Step 1 (Substantive 2, generated-column pattern): works on PG17.9** — `tsv tsvector GENERATED ALWAYS AS (to_tsvector(language, content)) STORED` is accepted at `CREATE TABLE`, populates correctly per row's regconfig, regenerates dynamically when `language` is updated, works with a standard GIN index (`Bitmap Index Scan` confirmed via `EXPLAIN ANALYZE` with `enable_seqscan=off`), and fails at INSERT time when handed a missing regconfig — exactly the case the startup-cache validation is designed to catch. The trigger-based fallback is therefore not needed for the wiring PR (kept in the design doc as documented escape hatch). One Step 1 checkbox remains open: confirming the combined hybrid CTE plan uses both HNSW and GIN indexes (requires a `pgvector` + chosen-non-Western-FTS image, deferred to the wiring PR). Step 2 (#248 memory measurement) and Step 3 (RDS compatibility) remain open and now contingent on Phase 1.05's mechanism choice. 
**Phase 1.05 (extension selection) is now the load-bearing open decision** — the original "install pgroonga, use the 4 entries" path is empirically ruled out, leaving the three documented options: per-language parser extensions like zhparser, pgroonga with a branched query path, or deferral of non-Western support from Layer 1.
- **Design doc** — `docs/design/hybrid-retrieval-multilingual.md` — record the pgroonga regconfig finding from [#246](https://github.com/cmeans/mcp-awareness/pull/246)'s QA cycle (rounds 3–5): pgroonga's documented integration is its own PostgreSQL index access method, not the standard `regconfig` registry the Layer 1 design assumes. Layer 1's verification task (Substantive 2) is now gated on a new Substantive 3 task (Step 0 of the revised verification): does pgroonga even register the assumed regconfigs in `pg_ts_config`? Tracked as [#249](https://github.com/cmeans/mcp-awareness/issues/249); [#248](https://github.com/cmeans/mcp-awareness/issues/248) (Postgres memory cost) is now blocked on [#249](https://github.com/cmeans/mcp-awareness/issues/249). Adds zhparser as a verified counter-example proving the design pattern (regconfig → tsvector → GIN → standard FTS operators) works for non-Western languages with the right extension, but only for Chinese; Japanese / Korean / Hebrew equivalents are not yet verified. Defers non-Western FTS mechanism selection to the wiring PR (new Phase 1.05) with three explicit options: per-language parser extensions, pgroonga with a branched query path, or deferral from Layer 1. Phase 3 (non-Western language extension install) is reframed to cover all three options. Risk section and managed-Postgres compatibility analysis updated to reflect that extension choice is open.
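
The startup-cache validation referred to in the Step 1 results can be pictured roughly as follows. This is only a sketch under assumptions: `RegconfigCache`, `fetch_configs`, and `FALLBACK` are hypothetical names, and the callable stands in for whatever runs `SELECT cfgname FROM pg_ts_config` in the real server.

```python
from typing import Callable, Iterable, Set

FALLBACK = "simple"  # fallback regconfig named in the changelog


class RegconfigCache:
    """Hypothetical cache of valid regconfig names, filled once at startup."""

    def __init__(self, fetch_configs: Callable[[], Iterable[str]]) -> None:
        # fetch_configs stands in for a query against pg_ts_config (assumption).
        self._fetch = fetch_configs
        self._valid: Set[str] = set(fetch_configs())

    def resolve(self, name: str) -> str:
        """Return name if it is a known regconfig, else retry once, else fall back."""
        if name in self._valid:
            return name
        # Cache miss: refresh once in case an extension was installed after startup.
        self._valid = set(self._fetch())
        return name if name in self._valid else FALLBACK


# Simulated catalog so the sketch runs without a database.
catalog = ["simple", "english", "german"]
cache = RegconfigCache(lambda: list(catalog))

print(cache.resolve("english"))   # english
print(cache.resolve("japanese"))  # simple
catalog.append("chinese")         # pretend a parser extension was installed later
print(cache.resolve("chinese"))   # chinese
```

Resolving the name before the write is what keeps a missing regconfig from surfacing as an INSERT-time failure in the generated column.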
37 changes: 28 additions & 9 deletions README.md
@@ -204,7 +204,7 @@ The server is running on port 8420. Point any MCP client at `http://localhost:84

| Variable | Default | Description |
|----------|---------|-------------|
| `AWARENESS_EMBEDDING_PROVIDER` | _(none)_ | Set to `ollama` to enable semantic search. Without it, all features work except `semantic_search` and `backfill_embeddings`. |
| `AWARENESS_EMBEDDING_PROVIDER` | _(none)_ | Set to `ollama` to enable the vector branch of hybrid search. Without it, `search` uses FTS only and `backfill_embeddings` is unavailable. |
| `AWARENESS_EMBEDDING_MODEL` | `nomic-embed-text` | Ollama model name for embeddings. Must match the model pulled in the Ollama container. |
| `AWARENESS_OLLAMA_URL` | `http://ollama:11434` | Ollama API endpoint. Default works with Docker Compose; change for external Ollama instances. |
| `AWARENESS_EMBEDDING_DIMENSIONS` | `768` | Vector dimensions. Must match the model output. Only change if using a non-default model. |
@@ -282,7 +282,7 @@ Results from the initial run (2026-03-27): HNSW query P50 stays under 4ms from 5

## Tools

The server exposes 29 MCP tools. Clients that support MCP resources also get 6 read-only resources, but since not all clients surface resources, every resource has a tool mirror.
The server exposes 30 MCP tools. Clients that support MCP resources also get 6 read-only resources, but since not all clients surface resources, every resource has a tool mirror.

### Read tools

@@ -301,7 +301,8 @@ The server exposes 29 MCP tools. Clients that support MCP resources also get 6 r
| `get_unread` | Entries with zero reads. Cleanup candidates or missed knowledge. |
| `get_activity` | Combined read + action feed, chronological. |
| `get_related` | Bidirectional entry relationships — entries referenced via `related_ids` and entries that reference the given entry. |
| `semantic_search` | Find entries by meaning using vector similarity (pgvector + Ollama). Combines with tag/source/type/date filters. Requires embedding provider. |
| `search` | Hybrid search — finds entries by meaning (vector) and by exact terms (FTS), fused via Reciprocal Rank Fusion. Combines with tag/source/type/date/language filters. Requires embedding provider for vector branch; FTS works without it. |
| `semantic_search` | Deprecated alias for `search`. Will be removed in a future release. |

### Write tools

@@ -350,28 +351,34 @@ For single-user deployments, secret path + WAF is sufficient. For multi-user, en
### Getting started
- **One-line demo install** — `curl | bash` sets up Awareness + Postgres + Cloudflare quick tunnel with pre-loaded demo data and a `getting-started` prompt that personalizes your instance
- **Published Docker images** — `ghcr.io/cmeans/mcp-awareness` (GHCR) and Docker Hub, auto-built on release tags
- **Optional semantic search** — add `AWARENESS_EMBEDDING_PROVIDER=ollama` and `docker compose --profile embeddings up -d` for vector similarity search
- **Optional embedding provider** — add `AWARENESS_EMBEDDING_PROVIDER=ollama` and `docker compose --profile embeddings up -d` to enable the vector branch of hybrid search. FTS works without it
- **CLI tools** — `mcp-awareness-user` (user management), `mcp-awareness-token` (JWT generation), `mcp-awareness-secret` (signing secret generation)

### Knowledge store
- `remember`, `learn_pattern`, `add_context`, `set_preference` with filtered retrieval
- Idempotent upserts via `logical_key` — same source + key updates in place with changelog tracking
- In-place updates with changelog tracking (`update_entry` + `include_history`)
- General-purpose notes with optional content payload and MIME type
- Per-entry language support — optional `language` parameter (ISO 639-1) on write tools, auto-detection via lingua-py, `simple` fallback for unsupported languages
- `get_knowledge` language filter — query entries by their detected language
- Unsupported-language alerts — info-level alerts fire when lingua detects a language without a Postgres regconfig, signaling demand for future language support
- Store introspection: `get_stats` for entry counts, `get_tags` for tag discovery
- Soft delete with 30-day trash, dry-run confirmation for bulk operations
- Delete and restore by tags with AND logic
- Pagination (`limit`/`offset`) on all list queries
- Entry relationships via `related_ids` convention + `get_related` bidirectional traversal
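
The per-entry language bullets above can be illustrated with a small sketch. It assumes an explicit `language` argument takes priority over auto-detection, and that an unmapped code should both fall back to `simple` and be flagged for an info-level alert. `ISO_TO_REGCONFIG`, `LanguageChoice`, and `choose_regconfig` are hypothetical names, and the mapping is a deliberately partial subset of the stock snowball configurations.

```python
from typing import NamedTuple, Optional

# Partial, illustrative mapping from ISO 639-1 codes to stock Postgres regconfigs.
ISO_TO_REGCONFIG = {
    "en": "english",
    "fr": "french",
    "de": "german",
    "es": "spanish",
    "nl": "dutch",
}


class LanguageChoice(NamedTuple):
    regconfig: str     # value to store in the entry's language column
    unsupported: bool  # True when an alert about missing support would fire


def choose_regconfig(explicit: Optional[str], detected: Optional[str]) -> LanguageChoice:
    """Pick a regconfig from a caller-supplied or detected ISO 639-1 code."""
    # Assumption: the caller's explicit choice wins over auto-detection.
    code = (explicit or detected or "").lower()
    if not code:
        # Nothing supplied and nothing detected: plain fallback, no alert.
        return LanguageChoice("simple", False)
    regconfig = ISO_TO_REGCONFIG.get(code)
    if regconfig is None:
        # A language was identified but has no regconfig: fall back and flag it.
        return LanguageChoice("simple", True)
    return LanguageChoice(regconfig, False)


print(choose_regconfig(None, "de"))  # german, not flagged
print(choose_regconfig("ja", None))  # simple, flagged as unsupported
```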

### Semantic search
- `semantic_search` tool — find entries by meaning using pgvector cosine similarity
### Hybrid search
- `search` tool — hybrid vector + full-text search fused via Reciprocal Rank Fusion (RRF, k=60). Finds entries by meaning *and* by exact terms — long documents are rescued by lexical matches, rare identifiers are found by FTS, semantic queries still use vector similarity
- `semantic_search` — deprecated alias for `search`, will be removed in a future release
- `backfill_embeddings` tool — embed pre-existing entries and re-embed stale ones
- `hint` parameter on `get_knowledge` — re-rank tag-filtered results by semantic similarity
- `hint` parameter on `get_knowledge` — re-rank tag-filtered results by hybrid similarity
- Per-entry language-aware FTS — generated `tsvector` column with weighted fields (description=A, content/goal=B, tags=C) and language-specific stemming via Postgres regconfigs (28 stock snowball languages)
- Regconfig validation — valid configs cached from `pg_ts_config` at startup, invalid values fall back to `simple` with cache-refresh retry
- Background embedding generation via thread pool (non-blocking writes)
- Stale embedding detection via `text_hash` comparison
- Powered by Ollama (`nomic-embed-text`, 768 dimensions) — optional, self-hosted, zero cost
- Graceful degradation: everything works without an embedding provider
- Graceful degradation: FTS works without embeddings, vector search works without FTS matches, and every feature that does not need embeddings works without an embedding provider
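
For readers unfamiliar with Reciprocal Rank Fusion, the following is a minimal sketch of the scoring rule with k=60, not the server's actual query (which, per the changelog, is a hybrid CTE in Postgres). `rrf_fuse` and the sample entry IDs are hypothetical. Each entry scores the sum of `1 / (k + rank)` over every ranked list it appears in.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of entry IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each branch contributes 1 / (k + 1).
        for rank, entry_id in enumerate(ranking, start=1):
            scores[entry_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda entry_id: (-scores[entry_id], entry_id))


vector_hits = ["a", "b", "c"]  # hypothetical vector-branch ranking
fts_hits = ["c", "a", "d"]     # hypothetical FTS-branch ranking

print(rrf_fuse([vector_hits, fts_hits]))  # ['a', 'c', 'b', 'd']
print(rrf_fuse([[], fts_hits]))           # ['c', 'a', 'd']
```

The second call shows the degradation case: with an empty vector branch, the fused order is just the FTS order.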

### Awareness engine
- Ambient awareness: status reporting, alert detection, suppression, briefing generation
@@ -387,7 +394,7 @@ For single-user deployments, secret path + WAF is sufficient. For multi-user, en
- Streamable HTTP + stdio transports

### Infrastructure
- PostgreSQL 17 with pgvector, GIN-indexed tag queries, HNSW-indexed embeddings, Debezium CDC-ready
- PostgreSQL 17 with pgvector, GIN-indexed tag queries, HNSW-indexed embeddings, GIN-indexed tsvector for full-text search, Debezium CDC-ready
- Connection pooling (psycopg_pool, min 2 / max 5) with automatic health checks and reconnection
- List mode and since/until/created_after/created_before filters for lightweight queries
- Storage abstraction: `Store` protocol — backends are swappable without changing server or collator logic
@@ -399,6 +406,18 @@ For single-user deployments, secret path + WAF is sufficient. For multi-user, en
- Request timing instrumentation and `/health` endpoint
- Comprehensive test suite (all against real Postgres + Ollama in CI), strict type checking, CI pipeline with coverage, QA gate

### Upgrading

When upgrading to a release with hybrid retrieval (Layer 1), running `mcp-awareness-migrate upgrade head` applies two migrations:

1. **Schema migration** — adds `language` (regconfig) and `tsv` (generated tsvector) columns to the entries table, plus GIN and partial indexes. Fast (DDL only).
2. **Language backfill** — runs lingua-py detection on all existing entries and updates the `language` column where a known language is detected. This is a one-time data migration that may take longer than usual on the first deploy:
   - lingua's first call loads ~300MB of n-gram models (multi-second startup cost)
   - Each existing entry is processed for language detection
   - If `lingua-language-detector` is not installed, the backfill is skipped and entries remain as `simple` (FTS still works, just without language-specific stemming)
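
The backfill behaviour described in step 2 can be sketched as a loop with an optional detector. This is an illustration only, not the migration's code: `backfill_languages` and `fake_detect` are hypothetical, entries are plain dicts rather than database rows, and the `detect` callable stands in for lingua-py detection followed by a regconfig lookup.

```python
from typing import Callable, Dict, List, Optional

Detector = Callable[[str], Optional[str]]


def backfill_languages(entries: List[Dict[str, str]], detect: Optional[Detector] = None) -> int:
    """Set each entry's language where a known one is detected; return the update count."""
    if detect is None:
        # Mirrors the "detector not installed" case: skip, entries stay as `simple`.
        return 0
    updated = 0
    for entry in entries:
        regconfig = detect(entry["content"])
        # Only touch rows where detection produced a known, different regconfig.
        if regconfig is not None and regconfig != entry["language"]:
            entry["language"] = regconfig
            updated += 1
    return updated


def fake_detect(text: str) -> Optional[str]:
    """Toy detector used only to make the sketch runnable."""
    return "german" if "und" in text.split() else None


rows = [
    {"content": "Hund und Katze", "language": "simple"},
    {"content": "zzz 123", "language": "simple"},
]
print(backfill_languages(rows, fake_detect))  # 1
print(rows[0]["language"])                    # german
```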

The `semantic_search` tool continues to work as a deprecated alias for the new `search` tool. Update your agent prompts to use `search` — the alias will be removed in a future release.

### Not yet implemented
- Layer 2 (baseline) detection — rolling averages and deviation calculation
