From f73aae5f283a215794d54a2d4d549599db6b8035 Mon Sep 17 00:00:00 2001 From: "cmeans-claude-dev[bot]" <3223881+cmeans-claude-dev[bot]@users.noreply.github.com> Date: Wed, 15 Apr 2026 20:35:14 -0500 Subject: [PATCH 1/6] docs: add Schema + Record user guide MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit New docs/schema-record-guide.md covers why typed data matters (framing against free-form knowledge tools), who the feature is for (personal collections + team/integration use), and a full worked example: registering an album schema, creating a record, what a validation failure looks like, update with re-validation, and why schema deletion is blocked when live records reference it. Extends the walk-through into the Tag Taxonomy tie-in — the Layer C design for user-defined tag vocabularies will consume records validated against a tag-definition schema, so schema/record doubles as the foundation under that upcoming feature. Six additional use cases in collapsible sections: reading list, recipes, home inventory (with purchase/receipt URLs), subscriptions, edge provider manifests, and meeting/bug templates. Each names the kind of future edge provider that would naturally extend the schema (Goodreads for books, recipe APIs for recipes, etc.) without making the doc depend on any specific service being available. Closes the "What's next" section with links to the REST API + schema marketplace roadmap idea (awareness logical_key design-schema-marketplace-import), Tag Taxonomy Layer C, and the open P2/P3 follow-ups on main (#290, #291, #292, #293). README updates: adds mcp-awareness-register-schema to the CLI tools bullet; links the new guide from Design docs. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- CHANGELOG.md | 1 + README.md | 3 +- docs/schema-record-guide.md | 469 ++++++++++++++++++++++++++++++++++++ 3 files changed, 472 insertions(+), 1 deletion(-) create mode 100644 docs/schema-record-guide.md diff --git a/CHANGELOG.md b/CHANGELOG.md index dcfd636..12c3cbd 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added +- **Schema + Record user guide** — new `docs/schema-record-guide.md` walks through why typed data matters, a full music-collection worked example (register schema → create record → validation failure → update → delete blocked), the Tag Taxonomy (Layer C) tie-in, and six collapsible use cases: reading list, recipes, home inventory, subscriptions, edge provider manifests, meeting/bug templates. Linked from README "Design docs". README also updates the CLI tools list to include `mcp-awareness-register-schema`. - **CLA bot** — installed [CLA Assistant](https://cla-assistant.io) to gate pull requests on a signed Contributor License Agreement. New `CLA.md` (v1.0) at the repo root holds the authoritative text (dual-license sublicensing grant, no copyright transfer). New `docs/cla.md` documents the bot, the signature record location (a public Gist owned by the maintainer), and the maintainer/bot whitelist. `CONTRIBUTING.md` "How to sign" section rewritten with the concrete signing flow. Refs [#297](https://github.com/cmeans/mcp-awareness/issues/297) (end-to-end verification against a non-maintainer account remains open). - Two new entry types: `schema` (JSON Schema Draft 2020-12 definition) and `record` (validated payload conforming to a schema). Tools: `register_schema`, `create_record`. Schemas are absolutely immutable after registration; records re-validate on content update. Schema deletion is blocked while live records reference a version. 
Per-owner storage with a shared `_system` fallback namespace for built-in schemas. - New CLI: `mcp-awareness-register-schema` for operators to seed `_system`-owned schemas at deploy time. diff --git a/README.md b/README.md index 9b5dfd3..7b1066b 100644 --- a/README.md +++ b/README.md @@ -354,7 +354,7 @@ For single-user deployments, secret path + WAF is sufficient. For multi-user, en - **One-line demo install** — `curl | bash` sets up Awareness + Postgres + Cloudflare quick tunnel with pre-loaded demo data and a `getting-started` prompt that personalizes your instance - **Published Docker images** — `ghcr.io/cmeans/mcp-awareness` (GHCR) and Docker Hub, auto-built on release tags - **Optional embedding provider** — add `AWARENESS_EMBEDDING_PROVIDER=ollama` and `docker compose --profile embeddings up -d` to enable the vector branch of hybrid search. FTS works without it -- **CLI tools** — `mcp-awareness-user` (user management), `mcp-awareness-token` (JWT generation), `mcp-awareness-secret` (signing secret generation) +- **CLI tools** — `mcp-awareness-user` (user management), `mcp-awareness-token` (JWT generation), `mcp-awareness-secret` (signing secret generation), `mcp-awareness-register-schema` (seed `_system`-owned schemas at deploy time) ### Knowledge store - `remember`, `learn_pattern`, `add_context`, `set_preference` with filtered retrieval @@ -462,6 +462,7 @@ Read the full vision: **[What Knowledge Becomes When It's Ambient](docs/vision.m - [From Metrics to Mental Models](docs/from-metrics-to-mental-models.md) — core spec: three-layer detection model, API design, data schema - [Collation Layer](docs/collation-layer.md) — briefing resource, token optimization, escalation logic - [Data Dictionary](docs/data-dictionary.md) — database schema, entry types, data field structures, lifecycle rules +- [Schema + Record Guide](docs/schema-record-guide.md) — how to define typed data contracts with JSON Schema, with worked examples (music collection, reading list, recipes, 
home inventory, subscriptions, edge manifests, tag taxonomy) - [Memory Prompts](docs/memory-prompts.md) — how to configure your AI to use awareness (platform memory, global CLAUDE.md, project CLAUDE.md) - [Changelog](CHANGELOG.md) — version history diff --git a/docs/schema-record-guide.md b/docs/schema-record-guide.md new file mode 100644 index 0000000..7d9f3e8 --- /dev/null +++ b/docs/schema-record-guide.md @@ -0,0 +1,469 @@ + +# Schema + Record guide + +New in v0.18.0. This guide walks through what schemas and records are, +why you'd use them, and how to register one and start storing validated +data. + +- Reference: entry-type schemas for `schema` and `record` live in the + [Data Dictionary](data-dictionary.md#schema--json-schema-definitions). +- Design context: + [2026-04-13 schema/record design](superpowers/specs/2026-04-13-schema-record-entry-types-design.md). + +--- + +## Why typed data? + +`remember`, `add_context`, `learn_pattern`, and `remind` let you store +free-form knowledge. That's the right tool for "the router is in the +basement closet" or "update the CLA bot whitelist weekly." Most things +an agent learns about you are notes — no shape required. + +Some things *do* have a shape, though. A book you're reading has a +title, an author, and a status. A recipe has ingredients and steps. A +home-inventory item has a location and a purchase date. Without a +schema, one entry says `status: "reading"`, another says +`state: "in progress"`, and a third forgets status entirely. Three +months later nothing lines up and your agent can't answer "what am I +partway through?" + +**Schemas fix that.** You define the shape once (as +[JSON Schema Draft 2020-12](https://json-schema.org/draft/2020-12)), +and records that conform to it are validated on write and re-validated +on update. Invalid data is rejected at the boundary, not discovered +later. 
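To make the drift problem described above concrete, here is a minimal Python sketch of validate-on-write. It is not the mcp-awareness validator: `BOOK_SHAPE` and `check_record` are hypothetical names, and the checker handles only `required`, `enum`, and unknown-field rejection, a small subset of what a real JSON Schema Draft 2020-12 validator covers.

```python
# Illustrative only: a hand-rolled check for a tiny subset of JSON Schema
# (required, enum, additionalProperties: false). All names are hypothetical.
BOOK_SHAPE = {
    "required": ["title", "author", "status"],
    "properties": {
        "title": {},
        "author": {},
        "status": {"enum": ["to-read", "reading", "finished", "abandoned"]},
    },
    "additionalProperties": False,
}


def check_record(content, shape):
    """Return every problem found, so the caller sees all failures at once."""
    errors = []
    # Required fields that are absent.
    for field in shape.get("required", []):
        if field not in content:
            errors.append(f"missing required field '{field}'")
    # Unknown fields and enum violations.
    for field, value in content.items():
        rules = shape["properties"].get(field)
        if rules is None:
            if shape.get("additionalProperties") is False:
                errors.append(f"unexpected field '{field}'")
            continue
        allowed = rules.get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"'{field}' must be one of {allowed}")
    return errors


# Three entries that drift the way the text describes: one uses `status`,
# one uses `state`, and one forgets the field entirely.
entries = [
    {"title": "Dune", "author": "Frank Herbert", "status": "reading"},
    {"title": "Emma", "author": "Jane Austen", "state": "in progress"},
    {"title": "Ulysses", "author": "James Joyce"},
]
results = [check_record(entry, BOOK_SHAPE) for entry in entries]
for entry, problems in zip(entries, results):
    print(entry["title"], "->", problems or "ok")
```

Only the first entry passes. The second is rejected twice (missing `status`, unexpected `state`) and the third once, which is the "rejected at the boundary" behavior the guide describes, reduced to its simplest form.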
+ +### The Tag Taxonomy tie-in + +The [Tag Taxonomy v2 design](#) (Layer C) is built on top of +schema/record: a user-defined tag vocabulary is just a set of records +validated against a `tag-taxonomy` schema. That layer isn't wired in +yet, but it means the schema/record primitive in this release is doing +double duty — the feature you're reading about now is also the +foundation under something bigger. If you start using schema/record +for your own data today, you'll be using the same machinery that'll +later power shared tag vocabularies, edge provider manifests, and +more. + +--- + +## Who it's for + +**Personal use.** You're curating something that matters to you — your +music collection, reading list, recipe file, home inventory, +subscription tracker. You want the agent to validate what it stores so +fields stay consistent as the collection grows. When a future [edge +provider](superpowers/specs/2026-04-13-schema-record-entry-types-design.md) +syncs with an external service (Goodreads for books, Spotify for +music, a recipe API for your saved recipes), the schema is unchanged +— the data just starts arriving automatically instead of being typed +by hand. + +**Team and integration use.** You're building an edge provider that +writes structured telemetry to an awareness instance, or you need +shared vocabulary across multiple agents. Schema/record gives you a +typed contract that both sides agree on, with immutability and +versioning so the contract can evolve without breaking consumers. + +--- + +## Walk-through: a music collection + +Imagine you want to keep an inventory of albums you own. You care +about the title, artist, year, rating, and a short note. You'd like +the agent to reject `year: "nineteen ninety seven"` and +`rating: 11` before they pollute the store. + +### 1. 
Register an `album` schema + +``` +register_schema( + family="album", + version="1", + schema={ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "type": "object", + "required": ["title", "artist", "year"], + "properties": { + "title": {"type": "string", "minLength": 1}, + "artist": {"type": "string", "minLength": 1}, + "year": {"type": "integer", "minimum": 1877}, + "rating": {"type": "integer", "minimum": 1, "maximum": 5}, + "notes": {"type": "string"} + }, + "additionalProperties": false + }, + description="A single album in my music collection." +) +``` + +A few things are happening: + +- `family` + `version` become the schema's `logical_key` + (`album:1`). That's how records reference it. +- The schema body is a standard JSON Schema. Anything a JSON Schema + Draft 2020-12 validator understands, mcp-awareness understands. +- **Schemas are immutable after registration.** If you need to change + the shape, register `album:2`. The data dictionary explains why: + immutability is what makes the schema safely referenceable by every + record pinned to it. + +### 2. Create a record + +``` +create_record( + schema_ref="album", + schema_version="1", + content={ + "title": "OK Computer", + "artist": "Radiohead", + "year": 1997, + "rating": 5, + "notes": "Still the reference" + }, + tags=["music", "album", "90s"] +) +``` + +The record is stored with its content validated against `album:1`. +Tags work as usual, so you can retrieve it the same way you retrieve +any other entry. + +### 3. 
What a validation failure looks like + +``` +create_record( + schema_ref="album", + schema_version="1", + content={"title": "Kid A", "artist": "Radiohead", "year": "2000"} +) +``` + +The write is rejected with a structured error: + +``` +{ + "error": "validation_failed", + "schema_ref": "album:1", + "validation_errors": [ + {"path": "/year", "message": "'2000' is not of type 'integer'"} + ] +} +``` + +All failures are reported at once — you don't fix one and discover the +next on the retry. + +### 4. Update with re-validation + +``` +update_entry( + id="", + content={"title": "OK Computer", "artist": "Radiohead", "year": 1997, "rating": 4, "notes": "Downgraded after fresh listen"} +) +``` + +`update_entry` re-runs schema validation on record content. If the +update would produce an invalid record, the write is rejected and the +stored entry is unchanged. You can't accidentally corrupt a record by +editing one field into an invalid state. + +### 5. Schemas with live records can't be deleted + +If you try to `delete_entry` on `album:1` while any records still +reference it: + +``` +{ + "error": "schema_in_use", + "schema_ref": "album:1", + "referencing_records": 42 +} +``` + +Deletion is blocked because records pin their exact schema version. +To retire a schema, migrate records to a newer version (`album:2`) +first. (A follow-up, [#293](https://github.com/cmeans/mcp-awareness/issues/293), +will add a migration helper; today you do it by creating new records +and deleting the old ones.) + +--- + +## Extending the walk-through: tag taxonomy + +Your music collection probably already uses tags — `["music", "rock", "90s"]` on +one album, `["music", "rock", "alternative"]` on another. What is +"alternative rock" exactly? Is it the same as "alt rock"? Without a +shared definition, different entries drift. 
+ +Register a `tag-definition` schema: + +``` +register_schema( + family="tag-definition", + version="1", + schema={ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "type": "object", + "required": ["path", "description"], + "properties": { + "path": {"type": "string", "pattern": "^[a-z0-9/_-]+$"}, + "description": {"type": "string", "minLength": 1}, + "synonyms": {"type": "array", "items": {"type": "string"}}, + "display": {"type": "string"} + }, + "additionalProperties": false + }, + description="A human-defined tag with description, synonyms, and display name." +) +``` + +Then seed a few records: + +``` +create_record(schema_ref="tag-definition", schema_version="1", + content={"path": "music/genre/rock/alternative", + "description": "Rock that deliberately departs from mainstream rock conventions; 90s onward.", + "synonyms": ["alt rock", "alternative"], + "display": "Alternative Rock"}, + tags=["tag-definition", "music"]) +``` + +Once the [Tag Taxonomy Layer C](#) work lands, the awareness server +will automatically consume these records to disambiguate cross-user +tags, provide display names in shared views, and power prefix-aware +tag searches. Today they serve as self-documenting tag definitions +that future tooling will leverage. + +--- + +## More use cases + +
+Reading list — books you're reading, finished, or gave up on + +``` +register_schema(family="book", version="1", schema={ + "type": "object", + "required": ["title", "author", "status"], + "properties": { + "title": {"type": "string"}, + "author": {"type": "string"}, + "status": {"enum": ["to-read", "reading", "finished", "abandoned"]}, + "rating": {"type": "integer", "minimum": 1, "maximum": 5}, + "notes": {"type": "string"} + } +}) +``` + +A status enum catches typos the way free text can't — you can't end +up with records split across `"in progress"`, `"reading"`, and +`"currently reading"` a year from now. + +**Future collector:** an edge provider that syncs from Goodreads, +Kindle, or another reading service can add fields like `pages_read`, +`last_opened`, or `progress_pct` to an updated schema (`book:2`). +Until that lands, keep the manual schema lean — only fields you'd +actually type by hand. +
+ +
+Recipes — a personal cookbook + +``` +register_schema(family="recipe", version="1", schema={ + "type": "object", + "required": ["title", "ingredients", "steps"], + "properties": { + "title": {"type": "string"}, + "servings": {"type": "integer", "minimum": 1}, + "prep_min": {"type": "integer", "minimum": 0}, + "cook_min": {"type": "integer", "minimum": 0}, + "ingredients": {"type": "array", "items": {"type": "string"}, "minItems": 1}, + "steps": {"type": "array", "items": {"type": "string"}, "minItems": 1}, + "source_url": {"type": "string", "format": "uri"} + } +}) +``` + +Arrays of strings keep it simple; a future version could tighten +`ingredients` into `{name, quantity, unit}` objects if you want +shopping-list generation. `source_url` preserves the link to wherever +you found the recipe. + +**Future collector:** an edge provider could ingest recipes from +third-party services or from `application/ld+json` Recipe microdata +on web pages you bookmark. +
+ +
+Home inventory — what you own and where it is + +``` +register_schema(family="inventory-item", version="1", schema={ + "type": "object", + "required": ["name", "location"], + "properties": { + "name": {"type": "string"}, + "location": {"type": "string"}, + "purchase_date": {"type": "string", "format": "date"}, + "purchase_price": {"type": "number", "minimum": 0}, + "purchase_url": {"type": "string", "format": "uri"}, + "receipt_url": {"type": "string", "format": "uri"}, + "warranty_expires": {"type": "string", "format": "date"}, + "serial_number": {"type": "string"} + } +}) +``` + +The URL fields turn this into a durable audit trail: years later, +`warranty_expires` and `receipt_url` still resolve; `purchase_url` +takes you back to the original listing for replacement or +comparison-shopping. + +**Future collector:** an edge provider could watch an email inbox for +shipping confirmations and pre-populate records; or parse receipts +from a file drop. +
+ +
+Subscriptions — services you pay for + +``` +register_schema(family="subscription", version="1", schema={ + "type": "object", + "required": ["service", "cost", "billing_cycle"], + "properties": { + "service": {"type": "string"}, + "cost": {"type": "number", "minimum": 0}, + "currency": {"type": "string", "default": "USD"}, + "billing_cycle": {"enum": ["monthly", "annual", "quarterly"]}, + "renewal_date": {"type": "string", "format": "date"}, + "auto_renew": {"type": "boolean"}, + "cancel_url": {"type": "string", "format": "uri"} + } +}) +``` + +Enforcing `billing_cycle` as an enum prevents the "once a year vs +yearly vs annual" drift that makes cost totals hard to compute. + +**Future collector:** an edge provider could read bank or card +statements and propose matching records. +
+ +
+Edge provider manifests — the technical motivator + +An edge provider is an external daemon that writes status, alerts, or +knowledge into your awareness instance. Each provider declares its +capabilities via a manifest record validated against an +`edge-manifest` schema. See the +[schema/record design doc](superpowers/specs/2026-04-13-schema-record-entry-types-design.md) +for the manifest shape; it's registered as a `_system` schema at +deploy time. +
+ +
+Meeting notes and bug templates — team artifacts + +If you run standups, retros, or incident reviews, a schema enforces +the fields you always want — `attendees`, `decisions`, `action_items` +for meetings; `title`, `repro_steps`, `environment`, `severity` for +bugs. Drops the "did I remember to capture X?" overhead. +
+ +--- + +## Schema immutability, versioning, and deletion protection + +**Immutable.** Once a schema is registered, its body can't be changed. +Register a new version (`album:2`) instead. This guarantee is what +lets records safely pin an exact version at write time — the agent +validating `album:1` records five years from now is running against +the same rules as the day the first record was written. + +**Versioned.** Records store both `schema_ref` (family) and +`schema_version` (exact version). A single family can have many +versions in flight simultaneously. Migrating records from one version +to another is a deliberate action, not an implicit one. + +**Deletion-protected.** `delete_entry` on a schema blocks if any +records still reference it. This is why immutability and versioning +matter together: you can't accidentally orphan a record by deleting +its schema out from under it. To retire an old version, migrate its +records to a newer version first (or to a different shape), then +delete the schema. + +**Known gap.** As of v0.18.0, bulk `delete_entry` paths (by tags, by +source) don't yet consult the referencing-record check. Single-id +deletion is protected. Tracked at +[#288](https://github.com/cmeans/mcp-awareness/issues/288). + +--- + +## Operator-seeded `_system` schemas (CLI) + +If you're running an awareness instance and want certain schemas +available to every user on the server (e.g., a standard +`edge-manifest` schema that all edge providers can reference), the +`mcp-awareness-register-schema` CLI registers schemas under the +shared `_system` owner namespace. Users looking up a schema fall back +to `_system` if they don't have a per-owner version of the same +family. 
+ +``` +mcp-awareness-register-schema --system \ + --family edge-manifest --version 1 \ + --schema /etc/awareness/schemas/edge-manifest.v1.json \ + --description "Edge provider capability manifest" +``` + +Typical pattern: + +- At deploy time, your Docker image or compose config runs this for + each built-in schema. +- Agents and edge providers reference the schema as if it were + owned by the current user; the store transparently falls back to + `_system`. + +This is how built-in schemas stay versioned and deletion-protected +just like user-registered ones, without being duplicated per tenant. + +--- + +## What's next + +A few threads already in motion that will make schema/record more +powerful: + +- **REST API** (roadmap) — HTTP surface for writing and reading + schemas and records, useful for non-MCP clients and web UIs. +- **Schema marketplace / one-click import** — once the REST API + lands, we're planning a way to import community-contributed schemas + (plus optional starter records) into your instance with a single + click. Imagine "install the Music Collection pack" or "import this + shared tag vocabulary." Tracked in awareness as + `design-schema-marketplace-import`. +- **Tag Taxonomy Layer C** — wires user-defined tag records into the + server's tag resolution. Design at `design-tag-taxonomy-v2`. +- **Cross-schema `$ref`** ([#291](https://github.com/cmeans/mcp-awareness/issues/291)) + — compose schemas from reusable fragments. +- **Validator caching** ([#290](https://github.com/cmeans/mcp-awareness/issues/290)) + — perf win at edge-manifest scale. +- **Generic `create_entry`** ([#292](https://github.com/cmeans/mcp-awareness/issues/292)) + — collapse type-specific write tools behind a single polymorphic + tool. +- **Record-version migration** ([#293](https://github.com/cmeans/mcp-awareness/issues/293)) + — a first-class way to move records from `album:1` to `album:2`. 
+ +--- + +## Reference + +- [Data Dictionary](data-dictionary.md) — entry schemas for `schema` + and `record`, including all fields, indexes, and constraints. +- [Schema/Record design doc](superpowers/specs/2026-04-13-schema-record-entry-types-design.md) + — the design this implementation shipped from. +- [JSON Schema Draft 2020-12](https://json-schema.org/draft/2020-12) + — the external spec schemas conform to. From 0ea42374ed7ff0eb9565b94ad34afb9d80a6c173 Mon Sep 17 00:00:00 2001 From: "cmeans-claude-dev[bot]" <3223881+cmeans-claude-dev[bot]@users.noreply.github.com> Date: Wed, 15 Apr 2026 20:40:22 -0500 Subject: [PATCH 2/6] docs: add Language support guide New docs/language-guide.md covers how mcp-awareness handles multilingual content: per-entry language detection (explicit ISO 639-1 parameter or auto-detection via lingua-py), the 28 supported Postgres snowball regconfigs, querying by language with get_knowledge, how hybrid search handles cross-language queries (vector branch is language-agnostic, FTS branch uses per-entry regconfig), and unsupported-language alerts as a demand signal for Phase 3 non-Western language support. Includes deployment notes: lingua install, the one-time language backfill migration on v0.17.0 upgrade, and the regconfig validation cache that prevents INSERT failures from invalid language values. README links the new guide from the Design docs section. CHANGELOG entry under [Unreleased]. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- CHANGELOG.md | 1 + README.md | 1 + docs/language-guide.md | 253 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 255 insertions(+) create mode 100644 docs/language-guide.md diff --git a/CHANGELOG.md b/CHANGELOG.md index 12c3cbd..624e3f1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - **Schema + Record user guide** — new `docs/schema-record-guide.md` walks through why typed data matters, a full music-collection worked example (register schema → create record → validation failure → update → delete blocked), the Tag Taxonomy (Layer C) tie-in, and six collapsible use cases: reading list, recipes, home inventory, subscriptions, edge provider manifests, meeting/bug templates. Linked from README "Design docs". README also updates the CLI tools list to include `mcp-awareness-register-schema`. +- **Language support guide** — new `docs/language-guide.md` covers per-entry language detection (explicit parameter or auto-detect via lingua), supported languages (28 Postgres snowball regconfigs), querying by language (`get_knowledge` language filter), how hybrid search handles cross-language queries, unsupported-language alerts as a demand signal, and deployment notes (lingua install, backfill migration, regconfig validation cache). Linked from README "Design docs". - **CLA bot** — installed [CLA Assistant](https://cla-assistant.io) to gate pull requests on a signed Contributor License Agreement. New `CLA.md` (v1.0) at the repo root holds the authoritative text (dual-license sublicensing grant, no copyright transfer). New `docs/cla.md` documents the bot, the signature record location (a public Gist owned by the maintainer), and the maintainer/bot whitelist. `CONTRIBUTING.md` "How to sign" section rewritten with the concrete signing flow. 
Refs [#297](https://github.com/cmeans/mcp-awareness/issues/297) (end-to-end verification against a non-maintainer account remains open). - Two new entry types: `schema` (JSON Schema Draft 2020-12 definition) and `record` (validated payload conforming to a schema). Tools: `register_schema`, `create_record`. Schemas are absolutely immutable after registration; records re-validate on content update. Schema deletion is blocked while live records reference a version. Per-owner storage with a shared `_system` fallback namespace for built-in schemas. - New CLI: `mcp-awareness-register-schema` for operators to seed `_system`-owned schemas at deploy time. diff --git a/README.md b/README.md index 7b1066b..44f198f 100644 --- a/README.md +++ b/README.md @@ -463,6 +463,7 @@ Read the full vision: **[What Knowledge Becomes When It's Ambient](docs/vision.m - [Collation Layer](docs/collation-layer.md) — briefing resource, token optimization, escalation logic - [Data Dictionary](docs/data-dictionary.md) — database schema, entry types, data field structures, lifecycle rules - [Schema + Record Guide](docs/schema-record-guide.md) — how to define typed data contracts with JSON Schema, with worked examples (music collection, reading list, recipes, home inventory, subscriptions, edge manifests, tag taxonomy) +- [Language Support Guide](docs/language-guide.md) — per-entry language detection, language-specific FTS stemming, supported languages, querying by language, unsupported-language alerts, deployment notes - [Memory Prompts](docs/memory-prompts.md) — how to configure your AI to use awareness (platform memory, global CLAUDE.md, project CLAUDE.md) - [Changelog](CHANGELOG.md) — version history diff --git a/docs/language-guide.md b/docs/language-guide.md new file mode 100644 index 0000000..b032770 --- /dev/null +++ b/docs/language-guide.md @@ -0,0 +1,253 @@ + +# Language support guide + +New in v0.17.0. 
This guide explains how mcp-awareness handles +multilingual content — per-entry language detection, language-specific +full-text search, and what happens when you write in a language the +server doesn't yet have a Postgres regconfig for. + +- Design context: + [Hybrid Retrieval + Multilingual](design/hybrid-retrieval-multilingual.md). +- Data Dictionary: the `language` column is documented in the + [common envelope](data-dictionary.md). + +--- + +## How it works + +Every entry has a `language` column that stores a Postgres +[regconfig](https://www.postgresql.org/docs/17/textsearch-configuration.html) +name (e.g., `english`, `french`, `german`). This regconfig controls +how full-text search (FTS) tokenizes and stems that entry's text. + +The language is resolved at write time through this chain: + +1. **Explicit parameter.** Write tools (`remember`, `add_context`, + `learn_pattern`, `remind`, `register_schema`, `create_record`) + accept an optional `language` parameter. Pass an ISO 639-1 code + (e.g., `"fr"` for French) to force a specific language. + +2. **Auto-detection.** If no language is provided, + [lingua-py](https://github.com/pemistahl/lingua-py) analyzes the + entry's text (description + content) and returns an ISO code. That + code is mapped to a Postgres regconfig. + +3. **Fallback.** If lingua is not installed, or the text is too short + for reliable detection, or lingua detects a language without a + Postgres regconfig, the entry is stored with `simple` — a + language-agnostic config that tokenizes on whitespace without + stemming. + +This means entries are always searchable via FTS. The question is +whether they get language-specific stemming (better recall for +inflected forms) or the `simple` fallback (exact-token matching only). 
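The resolution chain above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the server's code: `resolve_regconfig` is a hypothetical name, the `detect` callback stands in for lingua-py (returning an ISO 639-1 code, or `None` when the text is too short to classify), and the mapping is a three-entry excerpt of the full table. How the real server treats an explicit code with no regconfig is not stated in this guide; the sketch assumes it also falls back to `simple`.

```python
from typing import Callable, Optional

# Excerpt of the ISO 639-1 -> regconfig mapping; the real table is larger.
ISO_TO_REGCONFIG = {"en": "english", "fr": "french", "de": "german"}
FALLBACK = "simple"


def resolve_regconfig(
    text: str,
    explicit_iso: Optional[str] = None,
    detect: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Resolve the regconfig for an entry: explicit, then detected, then fallback."""
    # 1. Explicit parameter wins over detection.
    if explicit_iso is not None:
        # Assumption: an unmapped explicit code falls back to "simple".
        return ISO_TO_REGCONFIG.get(explicit_iso, FALLBACK)
    # 2. Auto-detection, only if a detector is available (lingua installed).
    if detect is not None:
        iso = detect(text)  # None means "too short / not confident"
        if iso is not None:
            # A detected language without a regconfig also maps to the fallback.
            return ISO_TO_REGCONFIG.get(iso, FALLBACK)
    # 3. Fallback: no detector, or no usable detection result.
    return FALLBACK


if __name__ == "__main__":
    german = "Der NAS-Server steht im Kellerschrank."
    print(resolve_regconfig(german, explicit_iso="fr"))
    print(resolve_regconfig(german, detect=lambda _text: "de"))
    print(resolve_regconfig(german))
```

Passing the detector as a callback keeps the sketch runnable without lingua installed and makes the "lingua is not installed" case the same code path as "no detector supplied".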
+ +--- + +## Supported languages + +28 languages have a Postgres snowball regconfig mapped in +mcp-awareness: + +| ISO code | Regconfig | | ISO code | Regconfig | +|----------|------------|-|----------|-----------| +| `ar` | arabic | | `it` | italian | +| `ca` | catalan | | `lt` | lithuanian | +| `da` | danish | | `ne` | nepali | +| `de` | german | | `nl` | dutch | +| `el` | greek | | `no` | norwegian | +| `en` | english | | `pt` | portuguese | +| `es` | spanish | | `ro` | romanian | +| `eu` | basque | | `ru` | russian | +| `fi` | finnish | | `sr` | serbian | +| `fr` | french | | `sv` | swedish | +| `ga` | irish | | `ta` | tamil | +| `hi` | hindi | | `tr` | turkish | +| `hu` | hungarian | | `yi` | yiddish | +| `hy` | armenian | | | | +| `id` | indonesian | | | | + +Languages not in this list (e.g., Chinese, Japanese, Korean, Hebrew) +fall back to `simple`. Phase 3 of the hybrid retrieval design covers +non-Western language support via Postgres extensions (`pgroonga`, +`zhparser`, etc.), but that hasn't shipped yet. + +--- + +## Writing in a specific language + +### Explicit language + +``` +remember( + description="Le serveur NAS est dans le placard du sous-sol.", + source="personal", + tags=["infra", "nas"], + language="fr" +) +``` + +The entry is stored with `language = 'french'`. FTS will stem +French inflections correctly — a search for "serveurs" will match +"serveur". + +### Auto-detected language + +``` +remember( + description="Der NAS-Server steht im Kellerschrank.", + source="personal", + tags=["infra", "nas"] +) +``` + +With lingua installed, this auto-detects as German (`de`) → stored +as `german` regconfig. Without lingua, it falls back to `simple`. + +### Override on update + +``` +update_entry( + id="", + language="de" +) +``` + +If auto-detection guessed wrong (or the entry was written before +lingua was installed), you can update the language explicitly. 
+ +--- + +## Querying by language + +### Filter `get_knowledge` to a single language + +``` +get_knowledge(tags=["infra"], language="fr") +``` + +Returns only French-language entries matching the tag filter. The +`language` parameter accepts an ISO 639-1 code (`"fr"`) or the +special value `"simple"` (entries with no detected language). + +### Hybrid search across languages + +``` +search(query="NAS server basement", tags=["infra"]) +``` + +The `search` tool runs two branches: + +- **Vector branch** — if an embedding provider is configured, + compares the query's embedding against entry embeddings. This is + language-agnostic (the embedding model handles cross-lingual + similarity internally). +- **FTS branch** — runs a Postgres `tsquery` against the + `tsv` column, using the *query's* resolved language for + stemming. This means a French query stems French terms, matching + entries stored with `language = 'french'`. + +Results from both branches are fused via Reciprocal Rank Fusion +(RRF, k=60). In practice: + +- Same-language queries get strong matches from both branches. +- Cross-language queries rely more heavily on the vector branch + (embedding similarity crosses language barriers; FTS stemming + doesn't). This is why the embedding provider matters most for + multilingual use — FTS alone only finds same-language matches. + +--- + +## Unsupported-language alerts + +When you write an entry and lingua detects a language that has no +Postgres regconfig (e.g., Chinese, Japanese, Korean), mcp-awareness: + +1. Stores the entry with `language = 'simple'` (FTS still works, + just without stemming). +2. Fires an **info-level structural alert** with the tag + `unsupported-language-{iso}` (e.g., `unsupported-language-zh`). + +These alerts are upserted per language — you'll see at most one +alert per unsupported language, not one per entry. 
They serve as a +demand signal: if `unsupported-language-ja` fires, the operator +knows users are writing in Japanese and should consider installing +Phase 3 language support when it ships. + +You can see current unsupported-language alerts via: + +``` +get_alerts(tags=["language", "unsupported"]) +``` + +--- + +## Deployment notes + +### Installing lingua + +lingua-py is an optional dependency. Without it, all entries get +`language = 'simple'` (still searchable, just without stemming). + +```bash +pip install lingua-language-detector +``` + +Or, if using the Docker image, lingua is included by default. + +### Language backfill on upgrade + +When upgrading to v0.17.0+, two Alembic migrations run: + +1. **Schema migration** — adds `language` and `tsv` columns (fast, + DDL only). +2. **Language backfill** — runs lingua detection on all existing + entries and updates the `language` column. This is a one-time data + migration: + - lingua's first call loads ~300 MB of n-gram models (multi-second + startup cost) + - Each existing entry is processed for language detection + - If lingua is not installed, the backfill is skipped and entries + remain as `simple` + +After backfill, existing entries participate in language-specific FTS +immediately — no re-indexing needed (the `tsv` column is a generated +column that updates automatically when `language` changes). + +### Regconfig validation + +At startup, `PostgresStore` caches valid Postgres regconfig names from +`pg_ts_config`. If a write provides a regconfig that doesn't exist in +the server's Postgres (e.g., a third-party config was uninstalled), +the entry falls back to `simple` with a one-time cache refresh. This +prevents INSERT failures from invalid `language` values reaching the +generated `tsv` column. + +--- + +## What's next + +- **Phase 2: Cross-lingual vector model** — swap the embedding model + to one with strong cross-lingual properties (e.g., multilingual-e5 + or similar). 
Tracked at + [#239](https://github.com/cmeans/mcp-awareness/issues/239). +- **Phase 3: Non-Western language support** — install Postgres + extensions for CJK, Hebrew, and other languages that need + non-snowball tokenizers. Driven by unsupported-language alerts. +- **Data sovereignty framework** — governs where content is sent for + inference, required before cloud embedding providers ship as + defaults. + +--- + +## Reference + +- [Hybrid Retrieval + Multilingual design](design/hybrid-retrieval-multilingual.md) + — full design doc covering Layers 1–3, data sovereignty, and the + dilution-bug root cause. +- [Data Dictionary](data-dictionary.md) — `language` and `tsv` + column definitions. +- [Postgres text search configs](https://www.postgresql.org/docs/17/textsearch-configuration.html) + — how regconfigs work. +- [lingua-py](https://github.com/pemistahl/lingua-py) — the + language detection library. From 34a89f4f9ae5ecfbc81b5b8f9d46d5a257b5f320 Mon Sep 17 00:00:00 2001 From: "cmeans-claude-dev[bot]" <3223881+cmeans-claude-dev[bot]@users.noreply.github.com> Date: Wed, 15 Apr 2026 21:31:25 -0500 Subject: [PATCH 3/6] fix: address QA findings + add Mermaid diagrams to guides MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Schema + Record guide: - Fix all register_schema examples: add required source, tags, description params to primary walk-through examples - Fix all create_record examples: add required source, tags, description, logical_key params - Fix update_entry: id= → entry_id= (correct param name) - Fix dead # anchor links for Tag Taxonomy references - Add brevity note before collapsible sections explaining that abbreviated examples omit required admin params shown above - Add lifecycle Mermaid diagram (register → create → validate → update → delete-blocked) Language guide: - Fix get_alerts(tags=...) 
→ search(query=..., entry_type="alert") since get_alerts has no tags parameter - Add resolution-chain Mermaid diagram (explicit → lingua → simple) - Add hybrid-search two-lane Mermaid diagram (vector + FTS → RRF) Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/language-guide.md | 31 ++++++++++++++-- docs/schema-record-guide.md | 72 +++++++++++++++++++++++++++++-------- 2 files changed, 86 insertions(+), 17 deletions(-) diff --git a/docs/language-guide.md b/docs/language-guide.md index b032770..3364fbe 100644 --- a/docs/language-guide.md +++ b/docs/language-guide.md @@ -15,6 +15,21 @@ server doesn't yet have a Postgres regconfig for. ## How it works +```mermaid +flowchart TD + A["Entry written"] --> B{language\nparam?} + B -- "Yes (e.g. 'fr')" --> C["Map ISO → regconfig\n(fr → french)"] + B -- No --> D{lingua\ninstalled?} + D -- Yes --> E["Auto-detect language"] + E --> F{Regconfig\nexists?} + F -- Yes --> C + F -- No --> G["Use 'simple'\n+ fire unsupported-language alert"] + D -- No --> H["Use 'simple'"] + C --> I["Store entry with\nlanguage-specific FTS"] + G --> I + H --> I +``` + Every entry has a `language` column that stores a Postgres [regconfig](https://www.postgresql.org/docs/17/textsearch-configuration.html) name (e.g., `english`, `french`, `german`). This regconfig controls @@ -132,6 +147,15 @@ special value `"simple"` (entries with no detected language). ### Hybrid search across languages +```mermaid +flowchart LR + Q["search(query)"] --> V["Vector branch\n(language-agnostic)"] + Q --> F["FTS branch\n(per-entry regconfig)"] + V --> RRF["Reciprocal Rank\nFusion (k=60)"] + F --> RRF + RRF --> R["Merged results"] +``` + ``` search(query="NAS server basement", tags=["infra"]) ``` @@ -174,12 +198,15 @@ demand signal: if `unsupported-language-ja` fires, the operator knows users are writing in Japanese and should consider installing Phase 3 language support when it ships. 
-You can see current unsupported-language alerts via: +You can find current unsupported-language alerts via: ``` -get_alerts(tags=["language", "unsupported"]) +search(query="unsupported language", entry_type="alert") ``` +Or browse all active alerts with `get_alerts()` and look for alert IDs +starting with `unsupported-language-`. + --- ## Deployment notes diff --git a/docs/schema-record-guide.md b/docs/schema-record-guide.md index 7d9f3e8..475c32b 100644 --- a/docs/schema-record-guide.md +++ b/docs/schema-record-guide.md @@ -35,7 +35,7 @@ later. ### The Tag Taxonomy tie-in -The [Tag Taxonomy v2 design](#) (Layer C) is built on top of +The Tag Taxonomy v2 design (Layer C) is built on top of schema/record: a user-defined tag vocabulary is just a set of records validated against a `tag-taxonomy` schema. That layer isn't wired in yet, but it means the schema/record primitive in this release is doing @@ -67,6 +67,23 @@ versioning so the contract can evolve without breaking consumers. --- +## The lifecycle at a glance + +```mermaid +flowchart LR + A["Register Schema\n(immutable)"] --> B["Create Record\n(validated)"] + B --> C{Valid?} + C -- Yes --> D["Stored ✓"] + C -- No --> E["Rejected ✗\n(all errors returned)"] + D --> F["Update Record"] + F --> C + G["Delete Schema"] --> H{Records\nexist?} + H -- Yes --> I["Blocked ✗"] + H -- No --> J["Deleted ✓"] +``` + +--- + ## Walk-through: a music collection Imagine you want to keep an inventory of albums you own. You care @@ -78,6 +95,9 @@ the agent to reject `year: "nineteen ninety seven"` and ``` register_schema( + source="personal", + tags=["music", "schema"], + description="A single album in my music collection.", family="album", version="1", schema={ @@ -92,8 +112,7 @@ register_schema( "notes": {"type": "string"} }, "additionalProperties": false - }, - description="A single album in my music collection." 
+ } ) ``` @@ -112,6 +131,10 @@ A few things are happening: ``` create_record( + source="personal", + tags=["music", "album", "90s"], + description="OK Computer by Radiohead (1997)", + logical_key="album-ok-computer", schema_ref="album", schema_version="1", content={ @@ -120,8 +143,7 @@ create_record( "year": 1997, "rating": 5, "notes": "Still the reference" - }, - tags=["music", "album", "90s"] + } ) ``` @@ -133,6 +155,10 @@ any other entry. ``` create_record( + source="personal", + tags=["music", "album"], + description="Kid A by Radiohead (2000)", + logical_key="album-kid-a", schema_ref="album", schema_version="1", content={"title": "Kid A", "artist": "Radiohead", "year": "2000"} @@ -158,7 +184,7 @@ next on the retry. ``` update_entry( - id="", + entry_id="", content={"title": "OK Computer", "artist": "Radiohead", "year": 1997, "rating": 4, "notes": "Downgraded after fresh listen"} ) ``` @@ -200,6 +226,9 @@ Register a `tag-definition` schema: ``` register_schema( + source="personal", + tags=["tag-definition", "schema"], + description="A human-defined tag with description, synonyms, and display name.", family="tag-definition", version="1", schema={ @@ -213,23 +242,30 @@ register_schema( "display": {"type": "string"} }, "additionalProperties": false - }, - description="A human-defined tag with description, synonyms, and display name." 
+ } ) ``` Then seed a few records: ``` -create_record(schema_ref="tag-definition", schema_version="1", - content={"path": "music/genre/rock/alternative", - "description": "Rock that deliberately departs from mainstream rock conventions; 90s onward.", - "synonyms": ["alt rock", "alternative"], - "display": "Alternative Rock"}, - tags=["tag-definition", "music"]) +create_record( + source="personal", + tags=["tag-definition", "music"], + description="Tag definition: music/genre/rock/alternative", + logical_key="tag-music-genre-rock-alternative", + schema_ref="tag-definition", + schema_version="1", + content={ + "path": "music/genre/rock/alternative", + "description": "Rock that deliberately departs from mainstream rock conventions; 90s onward.", + "synonyms": ["alt rock", "alternative"], + "display": "Alternative Rock" + } +) ``` -Once the [Tag Taxonomy Layer C](#) work lands, the awareness server +Once the Tag Taxonomy Layer C work lands, the awareness server will automatically consume these records to disambiguate cross-user tags, provide display names in shared views, and power prefix-aware tag searches. Today they serve as self-documenting tag definitions @@ -239,6 +275,12 @@ that future tooling will leverage. ## More use cases +> **Note:** The examples below show only the schema body for brevity. +> In practice, every `register_schema` call also requires `source`, +> `tags`, and `description` (as shown in the primary walk-through +> above), and every `create_record` call also requires `source`, +> `tags`, `description`, and `logical_key`. +
Reading list — books you're reading, finished, or gave up on From 745772f2ef9314e31b1e3919079578b75881d115 Mon Sep 17 00:00:00 2001 From: "cmeans-claude-dev[bot]" <3223881+cmeans-claude-dev[bot]@users.noreply.github.com> Date: Wed, 15 Apr 2026 22:27:27 -0500 Subject: [PATCH 4/6] =?UTF-8?q?fix:=20language=20guide=20=E2=80=94=20TOC,?= =?UTF-8?q?=20embedding=20languages,=20detailed=20search=20diagram?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses QA round 3 inline review findings: - Add table of contents organized by write vs search vs operations - Add "Embedding model languages" section: granite-embedding:278m is trained on 12 languages (Arabic, Chinese, Czech, English, French, German, Italian, Japanese, Korean, Dutch, Portuguese, Spanish) and covers ~100 via XLM-RoBERTa base. Includes a coverage matrix showing which languages get FTS stemming, vector search, or both — key for understanding why the embedding provider matters for CJK etc. - Replace simple hybrid-search Mermaid with a detailed diagram showing both branches: vector (embed → cosine → top-N) and FTS (ts_query → match → ts_rank_cd) merging into RRF. Color-coded branches. - Add scenario table (same-language, cross-language, rare terms, long docs, no embeddings) showing which branch contributes. - Fix update_entry(id=...) → update_entry(entry_id=...) — same bug as finding 3 round 1, missed in the language guide. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/language-guide.md | 206 ++++++++++++++++++++++++++++------------- 1 file changed, 144 insertions(+), 62 deletions(-) diff --git a/docs/language-guide.md b/docs/language-guide.md index 3364fbe..b4cf29d 100644 --- a/docs/language-guide.md +++ b/docs/language-guide.md @@ -3,13 +3,25 @@ New in v0.17.0. 
This guide explains how mcp-awareness handles multilingual content — per-entry language detection, language-specific -full-text search, and what happens when you write in a language the -server doesn't yet have a Postgres regconfig for. +full-text search, and vector search across languages. -- Design context: - [Hybrid Retrieval + Multilingual](design/hybrid-retrieval-multilingual.md). -- Data Dictionary: the `language` column is documented in the - [common envelope](data-dictionary.md). +## Contents + +### Adding and maintaining data +- [How it works](#how-it-works) — the language resolution chain +- [Supported FTS languages](#supported-fts-languages) — 28 Postgres regconfigs +- [Embedding model languages](#embedding-model-languages) — 12 natively trained + ~100 via XLM-RoBERTa +- [Writing in a specific language](#writing-in-a-specific-language) — explicit, auto-detected, override +- [Unsupported-language alerts](#unsupported-language-alerts) — demand signals for new languages + +### Searching and retrieving data +- [Querying by language](#querying-by-language) — filter by language +- [Hybrid search across languages](#hybrid-search-across-languages) — how vector + FTS work together + +### Operations +- [Deployment notes](#deployment-notes) — lingua install, backfill, regconfig cache +- [What's next](#whats-next) — Phases 2–3, data sovereignty +- [Reference](#reference) --- @@ -59,10 +71,11 @@ inflected forms) or the `simple` fallback (exact-token matching only). --- -## Supported languages +## Supported FTS languages 28 languages have a Postgres snowball regconfig mapped in -mcp-awareness: +mcp-awareness. 
These get **language-specific stemming** in full-text +search — inflected forms like "serveurs" match "serveur": | ISO code | Regconfig | | ISO code | Regconfig | |----------|------------|-|----------|-----------| @@ -83,12 +96,48 @@ mcp-awareness: | `id` | indonesian | | | | Languages not in this list (e.g., Chinese, Japanese, Korean, Hebrew) -fall back to `simple`. Phase 3 of the hybrid retrieval design covers -non-Western language support via Postgres extensions (`pgroonga`, +fall back to `simple` for FTS. Phase 3 of the hybrid retrieval design +covers non-Western FTS support via Postgres extensions (`pgroonga`, `zhparser`, etc.), but that hasn't shipped yet. --- +## Embedding model languages + +The default embedding model +([granite-embedding:278m](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual)) +is a multilingual model trained on **12 languages**: + +| Language | Language | Language | +|----------|----------|----------| +| Arabic | English | Japanese | +| Chinese | French | Korean | +| Czech | German | Dutch | +| Italian | Portuguese | Spanish | + +The model is built on XLM-RoBERTa, which covers approximately **100 +languages** in its vocabulary. Languages outside the 12 training +languages still produce usable embeddings — cross-lingual retrieval +will work, just with lower recall than for the trained set. + +**Why this matters even without FTS stemming.** Languages like +Japanese, Korean, and Chinese have no Postgres regconfig in our +mapping — FTS falls back to `simple` (whitespace tokenization, no +stemming). But the embedding model *was* trained on these languages, +so the **vector branch of hybrid search still works well for them**. +A Japanese query will find Japanese entries via vector similarity +even though FTS can't stem the text. This is why enabling the +embedding provider is especially valuable for multilingual use. + +| Language | FTS stemming | Vector search | +|----------|:---:|:---:| +| English, French, German, ... 
(28 FTS languages) | ✓ | ✓ | +| Japanese, Korean, Chinese, Czech (in embedding model, no regconfig) | ✗ (simple fallback) | ✓ | +| Other XLM-RoBERTa languages (not in embedding training set) | ✗ (simple fallback) | partial | +| Languages outside XLM-RoBERTa vocabulary | ✗ (simple fallback) | ✗ | + +--- + ## Writing in a specific language ### Explicit language @@ -123,7 +172,7 @@ as `german` regconfig. Without lingua, it falls back to `simple`. ``` update_entry( - id="", + entry_id="", language="de" ) ``` @@ -133,62 +182,14 @@ lingua was installed), you can update the language explicitly. --- -## Querying by language - -### Filter `get_knowledge` to a single language - -``` -get_knowledge(tags=["infra"], language="fr") -``` - -Returns only French-language entries matching the tag filter. The -`language` parameter accepts an ISO 639-1 code (`"fr"`) or the -special value `"simple"` (entries with no detected language). - -### Hybrid search across languages - -```mermaid -flowchart LR - Q["search(query)"] --> V["Vector branch\n(language-agnostic)"] - Q --> F["FTS branch\n(per-entry regconfig)"] - V --> RRF["Reciprocal Rank\nFusion (k=60)"] - F --> RRF - RRF --> R["Merged results"] -``` - -``` -search(query="NAS server basement", tags=["infra"]) -``` - -The `search` tool runs two branches: - -- **Vector branch** — if an embedding provider is configured, - compares the query's embedding against entry embeddings. This is - language-agnostic (the embedding model handles cross-lingual - similarity internally). -- **FTS branch** — runs a Postgres `ts_query` against the - `tsv` column, using the *query's* resolved language for - stemming. This means a French query stems French terms, matching - entries stored with `language = 'french'`. - -Results from both branches are fused via Reciprocal Rank Fusion -(RRF, k=60). In practice: - -- Same-language queries get strong matches from both branches. 
-- Cross-language queries rely more heavily on the vector branch - (embedding similarity crosses language barriers; FTS stemming - doesn't). This is why the embedding provider matters most for - multilingual use — FTS alone only finds same-language matches. - ---- - ## Unsupported-language alerts When you write an entry and lingua detects a language that has no Postgres regconfig (e.g., Chinese, Japanese, Korean), mcp-awareness: 1. Stores the entry with `language = 'simple'` (FTS still works, - just without stemming). + just without stemming; vector search still works if the language + is in the embedding model's training set). 2. Fires an **info-level structural alert** with the tag `unsupported-language-{iso}` (e.g., `unsupported-language-zh`). @@ -209,6 +210,85 @@ starting with `unsupported-language-`. --- +## Querying by language + +### Filter `get_knowledge` to a single language + +``` +get_knowledge(tags=["infra"], language="fr") +``` + +Returns only French-language entries matching the tag filter. The +`language` parameter accepts an ISO 639-1 code (`"fr"`) or the +special value `"simple"` (entries with no detected language). 
+ +--- + +## Hybrid search across languages + +```mermaid +flowchart TD + Q["search(query, ...)"] --> P["Resolve query language"] + P --> V["Vector branch\n(HNSW index)"] + P --> F["FTS branch\n(GIN index)"] + + V --> V1["Embed query text\n(granite-embedding)"] + V1 --> V2["Cosine similarity\nagainst entry embeddings"] + V2 --> V3["Top N by vector score"] + + F --> F1["Parse query as ts_query\n(using query's regconfig)"] + F1 --> F2["Match against entry tsvector\n(per-entry language stemming)"] + F2 --> F3["Rank by ts_rank_cd"] + + V3 --> RRF["Reciprocal Rank Fusion\n(k=60)"] + F3 --> RRF + RRF --> R["Merged results\n(best of both branches)"] + + style V fill:#e8f4e8 + style F fill:#e8e8f4 + style RRF fill:#f4e8e8 +``` + +``` +search(query="NAS server basement", tags=["infra"]) +``` + +The `search` tool runs two branches in parallel: + +- **Vector branch** (green above) — if an embedding provider is + configured, embeds the query and compares it against stored entry + embeddings via cosine similarity. This is **language-agnostic** — + the multilingual embedding model handles cross-lingual matching + internally. A French query can find English entries and vice versa. +- **FTS branch** (blue above) — parses the query as a Postgres + `ts_query` using the query's resolved language for stemming, then + matches against entries' `tsv` column (which uses each entry's own + language for stemming). This is **language-specific** — French + stemming matches French entries, English matches English. + +Results from both branches are fused via **Reciprocal Rank Fusion** +(red above, k=60). RRF doesn't care about absolute scores — it +combines rankings, so an entry that ranks highly in *either* branch +surfaces in the final results. 
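To make the fusion step concrete, here is a minimal sketch of standard Reciprocal Rank Fusion with k=60. It is an illustration of the general technique, not the server's implementation: `rrf_fuse` and the entry IDs are hypothetical, and the real query presumably does this in SQL rather than Python.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion (illustrative helper)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; each branch contributes 1 / (k + rank) per entry.
        for rank, entry_id in enumerate(ranking, start=1):
            scores[entry_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda entry_id: (-scores[entry_id], entry_id))


# Hypothetical branch outputs: entry IDs ordered best-first.
vector_ranking = ["a", "b", "c"]
fts_ranking = ["c", "a", "d"]
fused = rrf_fuse([vector_ranking, fts_ranking])
```

Only ranks enter the formula, so the cosine similarities and `ts_rank_cd` values never need to be put on a common scale. An entry returned by just one branch (`b` or `d` here) still receives a score and appears in the fused list.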
+ +### What this means in practice + +| Scenario | Vector branch | FTS branch | Result | +|----------|:---:|:---:|--------| +| Same-language query (e.g., English → English) | ✓ strong | ✓ strong | Best recall — both branches contribute | +| Cross-language query (e.g., French → English) | ✓ strong | ✗ miss | Vector carries the match; still works | +| Rare identifier or exact term | ✗ weak | ✓ strong | FTS rescues the match | +| Long document (>500 chars) | ✗ partial (first 500 chars embedded) | ✓ full text indexed | FTS rescues buried terms | +| No embedding provider configured | ✗ skipped | ✓ only branch | FTS-only mode, still functional | + +**Graceful degradation:** if no embedding provider is configured, +search runs FTS only. If an entry has no embedding (new entry, +backfill not yet run), it still participates in FTS. If the query +text is too short for meaningful FTS (stop words only), the vector +branch carries. Each branch compensates for the other's gaps. + +--- + ## Deployment notes ### Installing lingua @@ -255,7 +335,7 @@ generated `tsv` column. ## What's next - **Phase 2: Cross-lingual vector model** — swap the embedding model - to one with strong cross-lingual properties (e.g., multilingual-e5 + to one with stronger cross-lingual properties (e.g., multilingual-e5 or similar). Tracked at [#239](https://github.com/cmeans/mcp-awareness/issues/239). - **Phase 3: Non-Western language support** — install Postgres @@ -278,3 +358,5 @@ generated `tsv` column. — how regconfigs work. - [lingua-py](https://github.com/pemistahl/lingua-py) — the language detection library. +- [granite-embedding:278m](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual) + — the default embedding model (IBM, multilingual, 768 dimensions). 
From 0b738f0e7e50ebab3d7a7197da6d171842011b44 Mon Sep 17 00:00:00 2001 From: "cmeans-claude-dev[bot]" <3223881+cmeans-claude-dev[bot]@users.noreply.github.com> Date: Wed, 15 Apr 2026 23:25:40 -0500 Subject: [PATCH 5/6] =?UTF-8?q?fix:=20language=20guide=20=E2=80=94=20repla?= =?UTF-8?q?ce=203-column=20table,=20add=20XLM-R=20list,=20credit=20upstrea?= =?UTF-8?q?m?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses round 4 QA inline findings on docs/language-guide.md: Line 116 finding ("terrible table, 3 columns with the same heading"): Replaced the 3-column "Language | Language | Language" grid with a proper alphabetical bullet list of the 12 Granite training languages. Cleaner, actually scannable. Line 118 finding ("link to outside providers, libraries we rely on"): Rewrote the Reference section as "Reference and credits", explicitly crediting and linking: - IBM Granite (model, paper, HuggingFace) - Meta AI FAIR team (XLM-RoBERTa, paper, CC-100) - Ollama (local model serving) - lingua-py (Peter M. Stahl, auto-detection) - Hugging Face (model hosting) - PostgreSQL (FTS infrastructure) - pgvector (Andrew Kane, vector index) - Snowball (Martin Porter et al., stemmers) Line 121 finding ("what are those other languages? No listing"): Added a collapsible
with the full list of ~100 XLM-RoBERTa languages (sourced from fairseq XLM-R docs). Kept collapsed by default so the section stays scannable but the full list is there for anyone curious about coverage for a specific language. Also reframed the embedding-languages section to explain *why* we chose Granite (open-weight, enterprise-licensed, 768-dim, runs on modest hardware) and clarified the relationship between Granite's 12 fine-tuned languages and the ~100 XLM-RoBERTa inherited. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/language-guide.md | 163 +++++++++++++++++++++++++++++++---------- 1 file changed, 125 insertions(+), 38 deletions(-) diff --git a/docs/language-guide.md b/docs/language-guide.md index b4cf29d..9dceb99 100644 --- a/docs/language-guide.md +++ b/docs/language-guide.md @@ -21,7 +21,7 @@ full-text search, and vector search across languages. ### Operations - [Deployment notes](#deployment-notes) — lingua install, backfill, regconfig cache - [What's next](#whats-next) — Phases 2–3, data sovereignty -- [Reference](#reference) +- [Reference and credits](#reference-and-credits) — upstream projects we depend on --- @@ -104,36 +104,88 @@ covers non-Western FTS support via Postgres extensions (`pgroonga`, ## Embedding model languages -The default embedding model -([granite-embedding:278m](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual)) -is a multilingual model trained on **12 languages**: - -| Language | Language | Language | -|----------|----------|----------| -| Arabic | English | Japanese | -| Chinese | French | Korean | -| Czech | German | Dutch | -| Italian | Portuguese | Spanish | - -The model is built on XLM-RoBERTa, which covers approximately **100 -languages** in its vocabulary. Languages outside the 12 training -languages still produce usable embeddings — cross-lingual retrieval -will work, just with lower recall than for the trained set. 
- -**Why this matters even without FTS stemming.** Languages like -Japanese, Korean, and Chinese have no Postgres regconfig in our -mapping — FTS falls back to `simple` (whitespace tokenization, no -stemming). But the embedding model *was* trained on these languages, -so the **vector branch of hybrid search still works well for them**. -A Japanese query will find Japanese entries via vector similarity -even though FTS can't stem the text. This is why enabling the -embedding provider is especially valuable for multilingual use. - -| Language | FTS stemming | Vector search | -|----------|:---:|:---:| -| English, French, German, ... (28 FTS languages) | ✓ | ✓ | -| Japanese, Korean, Chinese, Czech (in embedding model, no regconfig) | ✗ (simple fallback) | ✓ | -| Other XLM-RoBERTa languages (not in embedding training set) | ✗ (simple fallback) | partial | +The default embedding model is +[granite-embedding:278m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual), +from the [IBM Granite](https://www.ibm.com/granite) family +([paper](https://arxiv.org/html/2502.20204v1)), served locally via +[Ollama](https://ollama.com/library/granite-embedding). We chose it +because it is open-weight, enterprise-licensed, produces 768-dim +vectors that fit our HNSW index, and runs well on modest hardware. +If you're using the default configuration, you're using this model; +see the [embedding provider docs](deployment-guide.md#embedding-provider) +for alternatives. + +### Natively trained (12 languages) + +The Granite team fine-tuned the model on retrieval pairs in 12 +languages. 
These get the strongest embedding quality: + +- Arabic +- Chinese +- Czech +- Dutch +- English +- French +- German +- Italian +- Japanese +- Korean +- Portuguese +- Spanish + +### Inherited from XLM-RoBERTa (~100 languages) + +Granite's embedding model is built on top of +[XLM-RoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base) +([paper](https://arxiv.org/pdf/1911.02116), from Meta AI's +[FAIR team](https://ai.meta.com/research/)), which was pre-trained +on 2.5 TB of Common Crawl data covering ~100 languages. Languages +outside the 12 fine-tuned set still produce usable embeddings: recall +is lower than for the fine-tuned languages, but cross-lingual +retrieval still works. + +<details>
+<summary>Full list of ~100 XLM-RoBERTa languages</summary> + +Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, +Basque, Belarusian, Bengali, Bengali (Romanized), Bosnian, Breton, +Bulgarian, Burmese, Burmese (zawgyi font), Catalan, Chinese +(Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, +English, Esperanto, Estonian, Filipino, Finnish, French, Galician, +Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi +(Romanized), Hungarian, Icelandic, Indonesian, Irish, Italian, +Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish +(Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, +Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, +Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, +Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, +Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil +(Romanized), Telugu, Telugu (Romanized), Thai, Turkish, Ukrainian, +Urdu, Urdu (Romanized), Uyghur, Uzbek, Vietnamese, Welsh, Western +Frisian, Xhosa, Yiddish. + +Source: [fairseq XLM-R docs](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr). + +</details>
+ +### Why this matters even without FTS stemming + +Languages like Japanese, Korean, and Chinese have no Postgres +regconfig in our mapping — FTS falls back to `simple` (whitespace +tokenization, no stemming). But the embedding model *was* trained on +these languages, so the **vector branch of hybrid search still works +well for them**. A Japanese query will find Japanese entries via +vector similarity even though FTS can't stem the text. This is why +enabling the embedding provider is especially valuable for +multilingual use. + +### Coverage summary + +| Language category | FTS stemming | Vector search | +|------------------|:---:|:---:| +| 28 mapped FTS languages (e.g., English, French, German) | ✓ | ✓ | +| In Granite's 12 fine-tuned languages, no FTS regconfig (Chinese, Japanese, Korean, Czech) | ✗ (simple fallback) | ✓ (strong) | +| Other XLM-RoBERTa languages (e.g., Swahili, Thai, Vietnamese) | ✗ (simple fallback) | ✓ (partial) | | Languages outside XLM-RoBERTa vocabulary | ✗ (simple fallback) | ✗ | --- @@ -347,16 +399,51 @@ generated `tsv` column. --- -## Reference +## Reference and credits + +Multilingual support in mcp-awareness stands on the shoulders of a +number of open-source and open-weight projects. Credit where credit +is due: + +### Projects we depend on + +- **[IBM Granite](https://www.ibm.com/granite)** — the Granite + Embedding team trained the 278m multilingual model we use by + default. Released under an Apache-2.0 license on + [Hugging Face](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual), + with a + [detailed model paper](https://arxiv.org/html/2502.20204v1). +- **[Meta AI (FAIR team)](https://ai.meta.com/research/)** — authors + of [XLM-RoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base) + (Conneau et al., + [*Unsupervised Cross-lingual Representation Learning at Scale*](https://arxiv.org/pdf/1911.02116)), + the multilingual transformer backbone that makes Granite's ~100- + language coverage possible. 
Trained on the CC-100 corpus they also + curated. +- **[Ollama](https://ollama.com)** — makes it trivial to run embedding + models locally. We pull `granite-embedding:278m` from + [ollama.com/library/granite-embedding](https://ollama.com/library/granite-embedding). +- **[lingua-py](https://github.com/pemistahl/lingua-py)** (Peter + M. Stahl) — the language detection library that powers + auto-detection. Fast, accurate, works offline, supports 75+ + languages. +- **[Hugging Face](https://huggingface.co)** — hosts the model + weights, model cards, and community around both Granite and + XLM-RoBERTa. +- **[PostgreSQL](https://www.postgresql.org)** — the + [text search infrastructure](https://www.postgresql.org/docs/17/textsearch-configuration.html) + (regconfigs, tsvector, ts_rank_cd) we lean on for FTS. +- **[pgvector](https://github.com/pgvector/pgvector)** (Andrew + Kane) — the Postgres extension that gives us HNSW-indexed vector + search alongside everything else in the same database. +- **[Snowball](https://snowballstem.org)** (Martin Porter et al.) — + the stemmers behind the 28 FTS regconfigs listed above, most of + which ship with Postgres by default. + +### Internal references - [Hybrid Retrieval + Multilingual design](design/hybrid-retrieval-multilingual.md) — full design doc covering Layers 1–3, data sovereignty, and the dilution-bug root cause. - [Data Dictionary](data-dictionary.md) — `language` and `tsv` column definitions. -- [Postgres text search configs](https://www.postgresql.org/docs/17/textsearch-configuration.html) - — how regconfigs work. -- [lingua-py](https://github.com/pemistahl/lingua-py) — the - language detection library. -- [granite-embedding:278m](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual) - — the default embedding model (IBM, multilingual, 768 dimensions). 
From 6bae9e0d04a681256fbfba4e39c6587aef1f1584 Mon Sep 17 00:00:00 2001
From: "cmeans-claude-dev[bot]" <3223881+cmeans-claude-dev[bot]@users.noreply.github.com>
Date: Thu, 16 Apr 2026 14:50:25 -0500
Subject: [PATCH 6/6] fix: address round 5 QA findings on docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Four findings from QA:

1. Missing ecosystem copyright footer on both guides. Added the
   standard "Part of the Awareness ecosystem" footer (matching the
   pattern used in data-dictionary.md, case-studies.md, vision.md,
   deployment-guide.md).

2. French "remember" example showed a localized description but kept
   source/tags in English, which feels half-committed. Reworked the
   section to show BOTH realistic scenarios explicitly:
   (a) primarily-English user writing a single French entry (English
   source/tags, French description), (b) primarily-French user writing
   in French natively (French source/tags/description). Also added a
   symmetric note that a French user could write in English when
   convenient, and a heads-up that MCP tool names and parameters are
   themselves English-only today.

3. Validation-error example in schema-record guide overstated how
   helpful the jsonschema-sourced message is. Added a "Heads-up"
   callout acknowledging:
   - The quoted-vs-unquoted distinction is subtle for non-experts
   - The message doesn't suggest a fix
   - Tracked as #301 for richer typed envelopes
   - Agents can pre-validate client-side by fetching the schema via
     get_knowledge(tags=["schema"]) to avoid the round-trip
   Also added #301 to the "What's next" section.

4. Filed new issue #301 (feat: structured, actionable validation error
   envelopes for create_record / update_entry) with the design sketch,
   scope, and acceptance criteria; it references the QA comment that
   surfaced it.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 docs/language-guide.md      | 35 ++++++++++++++++++++++++++++++++---
 docs/schema-record-guide.md | 22 ++++++++++++++++++++--
 2 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/docs/language-guide.md b/docs/language-guide.md
index 9dceb99..eef6103 100644
--- a/docs/language-guide.md
+++ b/docs/language-guide.md
@@ -194,6 +194,10 @@ multilingual use.
 
 ### Explicit language
 
+Two reasonable ways this gets used in practice:
+
+**A primarily English-speaking user who also writes in another language** — for example, you own a property in France and keep notes about it in French:
+
 ```
 remember(
     description="Le serveur NAS est dans le placard du sous-sol.",
@@ -203,12 +207,33 @@ remember(
 )
 ```
 
-The entry is stored with `language = 'french'`. FTS will stem
-French inflections correctly — a search for "serveurs" will match
-"serveur".
+**A primarily French-speaking user keeping their own notes in French** — in this case it's natural to also use French-language values for `source` and `tags`, since they're just labels you'll search by later:
+
+```
+remember(
+    description="Le serveur NAS est dans le placard du sous-sol.",
+    source="personnel",
+    tags=["infrastructure", "nas", "maison"],
+    language="fr"
+)
+```
+
+In both cases the entry is stored with `language = 'french'`. FTS
+will stem French inflections correctly — a search for "serveurs" will
+match "serveur". Symmetrically, a French-speaking user can keep
+entries in English (or any other supported language) whenever that's
+more convenient — `language="en"` would store the same content with
+English stemming.
+
+> Note: the MCP tool names and parameters themselves (`remember`,
+> `description`, etc.) are currently English-only. A future
+> localization pass for tool metadata is out of scope — the values
+> you write are free to be in any language the model supports.
 
 ### Auto-detected language
 
+If you don't pass `language`, lingua-py is used to auto-detect:
+
 ```
 remember(
     description="Der NAS-Server steht im Kellerschrank.",
@@ -447,3 +472,7 @@ is due:
   dilution-bug root cause.
 - [Data Dictionary](data-dictionary.md) — `language` and `tsv` column
   definitions.
+
+---
+
+Part of the [Awareness](https://github.com/cmeans/mcp-awareness) ecosystem. © 2026 Chris Means
diff --git a/docs/schema-record-guide.md b/docs/schema-record-guide.md
index 475c32b..9bcc624 100644
--- a/docs/schema-record-guide.md
+++ b/docs/schema-record-guide.md
@@ -177,8 +177,19 @@ The write is rejected with a structured error:
 }
 ```
 
-All failures are reported at once — you don't fix one and discover the
-next on the retry.
+All failures are reported at once — you don't fix one and discover
+the next on the retry.
+
+> **Heads-up:** the message text comes straight from the underlying
+> JSON Schema library. It's accurate but not always as actionable as
+> it could be — the quote-vs-no-quote distinction between `"2000"` (a
+> string) and `2000` (an integer) is subtle, and no fix is suggested.
+> Richer, typed error envelopes (with `expected_type`, `actual_type`,
+> and fix suggestions where mechanical) are tracked in
+> [#301](https://github.com/cmeans/mcp-awareness/issues/301). If
+> you're an agent, pre-validating client-side by fetching the schema
+> first (`get_knowledge(tags=["schema"])`) avoids the round-trip
+> entirely.
 
 ### 4. Update with re-validation
 
@@ -498,6 +509,9 @@ powerful:
   tool.
 - **Record-version migration** ([#293](https://github.com/cmeans/mcp-awareness/issues/293))
   — a first-class way to move records from `album:1` to `album:2`.
+- **Structured, actionable validation errors** ([#301](https://github.com/cmeans/mcp-awareness/issues/301))
+  — typed error envelopes with `expected_type`, `actual_type`, and
+  fix suggestions for the common keyword failures.
 
 ---
 
@@ -509,3 +523,7 @@ powerful:
   — the design this implementation shipped from.
 - [JSON Schema Draft 2020-12](https://json-schema.org/draft/2020-12)
   — the external spec schemas conform to.
+
+---
+
+Part of the [Awareness](https://github.com/cmeans/mcp-awareness) ecosystem. © 2026 Chris Means
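
The client-side pre-validation mentioned in the Heads-up callout could look roughly like the sketch below. It assumes the agent has already fetched the schema as a plain dict; the `ALBUM_SCHEMA` fragment, its field names, and the helper names are hypothetical, and the `expected_type` / `actual_type` keys only borrow the vocabulary of #301 rather than reproduce any shipped envelope. It checks top-level, single-valued `type` keywords only and is no substitute for a real JSON Schema Draft 2020-12 validator (for instance, it would flag `2000.0` against an integer field, which the spec allows).

```python
def json_type(value):
    """Name the JSON type of a Python value (bool is checked before int)."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def type_mismatches(schema, record):
    """Collect every top-level type mismatch instead of stopping at the first."""
    problems = []
    for field, spec in schema.get("properties", {}).items():
        expected = spec.get("type")
        if field not in record or not isinstance(expected, str):
            continue  # sketch only handles a single string-valued "type"
        actual = json_type(record[field])
        # JSON Schema treats every integer as a valid "number".
        if actual == expected or (expected == "number" and actual == "integer"):
            continue
        problems.append(
            {"field": field, "expected_type": expected, "actual_type": actual}
        )
    return problems


# Hypothetical album schema fragment; the field names are assumptions.
ALBUM_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
}

if __name__ == "__main__":
    print(type_mismatches(ALBUM_SCHEMA, {"title": "Example Album", "year": "2000"}))
```

Reporting the expected and actual type side by side makes the `"2000"` versus `2000` distinction explicit before the record is ever sent, which is the round-trip the callout suggests avoiding.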