Skip to content

fix(embeddings): version-stamp embeddings and batch the sync embed path#1069

Merged
buremba merged 6 commits into
mainfrom
feat/fix-embeddings-integrity
May 26, 2026
Merged

fix(embeddings): version-stamp embeddings and batch the sync embed path#1069
buremba merged 6 commits into
mainfrom
feat/fix-embeddings-integrity

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 26, 2026

Fixes two confirmed embeddings-integrity bugs.

Finding #3 (HIGH) — embeddings never version-stamped -> silent vector-space mixing

The embeddings service returns {model, dimensions, embeddings}, but the connector-worker destructured only {embeddings, dimensions}model was discarded and only dimensionality was validated. event_embeddings had no model column. Swapping EMBEDDINGS_MODEL to a different same-dimension model therefore mixed incompatible vector spaces with no detection.

Scope chosen: loud mismatch guard + persisted stamp column (not full query-segregation). Full segregation of every similarity query across stamps is broad and risky; the recall/search SQL is a multi-branch candidate UNION. A persisted embedding_model column plus a fail-loud guard prevents new silent mixing immediately and leaves the data tagged so a targeted backfill/filter can follow later. This matches the task's "loud mismatch guard + stamped column is acceptable" guidance.

Changes:

  • Migration db/migrations/20260526120000_event_embeddings_model_stamp.sql adds embedding_model text (nullable; NULL = legacy/pre-stamp) + a column comment. Idempotent ADD COLUMN IF NOT EXISTS / DROP COLUMN IF EXISTS, matching the existing forward-delta format. No schema.sql regeneration (none is checked in; baseline-only), so the schema_migrations varchar(128) drift note does not apply.
  • New embeddings-model.ts resolveServiceModel()fails loud when the service-reported model differs from the worker's expected model (EMBEDDINGS_MODEL/default), else returns the stamp. Kept in its own module so it is unit-testable (the executor tests module-mock embeddings.ts and bun's mock.module is process-global).
  • batchGenerateEmbeddings/generateEmbedding return { embeddings, model }; the stamp threads through ContentItem.embedding_model, CompleteEmbeddingsRequest, and InsertEventParams.embeddingModel. Both server INSERT paths (worker-api completeEmbeddings + insert-event upsertEmbedding) persist it.

Red -> green

  • packages/connector-worker/src/__tests__/embeddings-model-stamp.test.tsresolveServiceModel did not exist on origin/main and no model check was performed (mismatch passed silently). Now: rejects a same-dimension mismatch, resolves the stamp on match, falls back when omitted. 3 pass.
  • packages/server/src/__tests__/integration/events/embedding-model-stamp.test.ts — RED on origin/main: PostgresError: column "embedding_model" does not exist. GREEN with the migration + threading: stamp round-trips through insertEvent; NULL when unsupplied. 2 pass (embedded PG, Node 22).

Finding #12 (MED, perf) — sync embedding path was per-event

onEventChunk looped await processEvent() -> generateEmbedding(singleText) per event = one HTTP round-trip / ONNX pass each. Now processEventChunk accumulates the chunk's embeddable texts, calls batchGenerateEmbeddings once, and maps vectors back by index. Empty-text events get no vector; a batch failure fails open (items stream without embeddings), preserving the prior per-event error behaviour. The embed_backfill lane already batched and is unchanged except for stamping.

Red -> green

  • packages/connector-worker/src/__tests__/executor-batch-embed.test.ts — RED against the original executor: batchGenerateEmbeddings called 0 times (per-event path), test expects 1. GREEN: one 3-event chunk -> exactly one batch call with all embeddable texts, vectors + embedding_model mapped back to the right events, empty event left unembedded. 1 pass.

Validation

  • typecheck: connector-worker, embeddings, server — all clean (per-package tsc).
  • bun test connector-worker: 56 pass.
  • Server integration reproducer: 2 pass on embedded PG (Node 22).
  • Pre-commit (biome + tsc) passed.

Migration callout

Adds db/migrations/20260526120000_event_embeddings_model_stamp.sql (new event_embeddings.embedding_model column).

Summary by CodeRabbit

  • Improvements

    • Embeddings are now stamped with their generating model so searches only compare vectors from the same model.
    • Semantic search, recall, and ranking ignore mismatched or unstamped embeddings to prevent mixing vector spaces.
    • Backfill and ingest detect stale/mismatched embeddings and safely restamp them; same-model re-submits are no-ops.
  • New Features

    • Sync embedding generation batches whole event chunks once for efficiency and correct model tracking.
    • Embedding submission endpoints now accept and persist per-embedding model stamps.
  • Tests

    • Added unit and integration tests validating model-stamp guards, batching, and backfill behavior.

Review Change Stack

Finding #3 (HIGH): embeddings were never version-stamped, so swapping
EMBEDDINGS_MODEL to a different same-dimension model silently mixed
incompatible vector spaces with no detection. The connector-worker
discarded the service-reported `model`; event_embeddings had no model
column.

- Add `embedding_model text` to event_embeddings (migration
  20260526120000) plus a column comment.
- Thread the model stamp through the pipeline: capture the service
  `model`, FAIL LOUD via resolveServiceModel() when it differs from the
  worker's expected model (equal dimensionality is not enough), and
  persist it via ContentItem.embedding_model and
  CompleteEmbeddingsRequest. Both server INSERT paths (worker-api
  completeEmbeddings + insert-event upsertEmbedding) write the stamp;
  legacy/omitted stamps store NULL.

Finding #12 (MED, perf): the sync embedding path generated one embedding
per event (one HTTP round-trip / ONNX pass each). Accumulate a chunk's
texts and call batchGenerateEmbeddings once, mapping vectors back to each
event by index; empty-text events get no vector and a batch failure
fails open (items stream without embeddings), matching the prior
behaviour.

Reproducers:
- embeddings-model-stamp.test.ts: resolveServiceModel rejects a
  same-dimension mismatch and resolves the stamp otherwise.
- executor-batch-embed.test.ts: one chunk -> exactly one batch call with
  vectors + stamp mapped back per event.
- events/embedding-model-stamp.test.ts (integration): embedding_model
  round-trips through insertEvent; NULL when unsupplied.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 52bea2a6-80c5-4948-b4f3-459c6cc34895

📥 Commits

Reviewing files that changed from the base of the PR and between 6fc71e7 and fd78551.

📒 Files selected for processing (1)
  • db/migrations/20260526120000_event_embeddings_model_stamp.sql

📝 Walkthrough

Walkthrough

Adds embedding_model stamps end-to-end: DB migration, worker batch embedding with model stamps and validation, server model-config helpers and guards, model-scoped search/backfill, upsert persistence of model stamps, and extensive unit/integration tests.

Changes

Embedding Model Stamping & Vector Space Safety

Layer / File(s) Summary
Database schema and view updates
db/migrations/20260526120000_event_embeddings_model_stamp.sql
Migration adds nullable embedding_model text column to event_embeddings and updates current_event_records view to export the model field alongside existing embedding data.
Connector-worker embedding API and model validation
packages/connector-worker/src/embeddings-model.ts, packages/connector-worker/src/embeddings.ts, packages/connector-worker/src/daemon/client.ts
New resolveServiceModel guard function; embedding backends now return EmbeddingResult { embeddings, model }; worker exposes expected-model helper; client types extended to carry embedding_model.
Connector-worker batch embedding execution
packages/connector-worker/src/daemon/executor.ts, packages/connector-worker/src/__tests__/executor-batch-embed.test.ts
Executor batches embedding generation per event chunk using processEventChunk/toContentItem, maps embeddings+model back to streamed items, and persists embedding_model for backfill. Tests verify single-batch behavior and mapping.
Connector-worker embedding tests
packages/connector-worker/src/__tests__/embeddings-model-stamp.test.ts, packages/connector-worker/src/__tests__/executor-heartbeat.test.ts
Unit tests for resolveServiceModel and updated mocks to the new { embeddings, model } return shape.
Server embedding model configuration and guards
packages/server/src/utils/embeddings.ts
Adds getConfiguredEmbeddingModel() and configuredEmbeddingModelSqlLiteral(); generateEmbeddings enforces model-mismatch guard and rejects embeddings from a different model.
Server event insertion with embedding model persistence
packages/server/src/utils/insert-event.ts
InsertEventParams extended with optional embeddingModel; upsertEmbedding now upserts embedding+embedding_model and updates only when stored model differs from incoming.
Worker API integration for embedding model forwarding
packages/server/src/worker-api.ts
streamContent accepts optional embedding_model and forwards to insertion; fetchEventsForEmbedding selects missing or model-mismatched rows; completeEmbeddings accepts embedding_model per item and upserts with conditional update.
Vector search model scoping in content search
packages/server/src/utils/content-search.ts
Cosine-distance/ranking restricted to rows with matching embedding_model; similarity expressions return NULL for mismatches, falling back to text scoring; ivfflat recall candidates filtered by model.
Embed backfill with model-mismatch detection
packages/server/src/scheduled/trigger-embed-backfill.ts
needsEmbeddingPredicate() treats NULL/different embedding_model as needing re-embedding and is used in batching and run-creation queries.
Benchmark adapter embedding model stamping
packages/server/src/benchmarks/memory/adapters/lobu-inprocess.ts
Benchmark adapter now stamps saved embeddings with configured model.
Test fixture support for embedding model stamping
packages/server/src/__tests__/setup/test-fixtures.ts
createTestEvent fixture accepts optional embedding_model and defaults to configured model when undefined.
Server embedding model guard unit tests
packages/server/src/__tests__/unit/embeddings-model-guard.test.ts
Unit tests validate server-side model guard behavior and SQL-literal validation.
Server embedding model integration tests
packages/server/src/__tests__/integration/events/embedding-model-stamp.test.ts, packages/server/src/__tests__/integration/events/embedding-model-swap-e2e.test.ts
Integration and E2E tests verify stamped vs legacy-unstamped persistence, model-scoped search exclusion, backfill detection, in-place upserts replacing stale-model rows, and completeEmbeddings handler behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

  • lobu-ai/lobu#892: Changes to trigger-embed-backfill.ts update the "needs embedding" predicate to include embedding model mismatch, touching the scheduler query logic discussed in this issue.

Possibly related PRs

  • lobu-ai/lobu#931: Overlaps in executor executeSyncRun / onEventChunk batching and streaming pipeline modifications.

Suggested labels

skip-size-check

Poem

🐰 I stamp each vector with the model's name,
so search and backfill won't play a game.
Batch by chunk, map back with care,
legacy rows stay known and fair.
Hop, test, and stamp — the vectors stay tame!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.84% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(embeddings): version-stamp embeddings and batch the sync embed path' accurately summarizes the two main changes: adding version stamps to embeddings and batching the sync embedding path.
Description check ✅ Passed The description comprehensively covers both findings with context, scope, changes, validation, and migration callouts. It includes detailed explanations aligned with the template structure.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/fix-embeddings-integrity

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 26, 2026

bug_free 82, simplicity 82, slop 0, bugs 1, 0 blockers

Script suites reported exit 0. Also ran targeted server embedding model integration tests in current worktree: 2 files, 9 tests passed. Found one rollback-only migration defect; up path and model-scoped search/backfill behavior look covered.

Suggested fixes

File Line Change
db/migrations/20260526120000_event_embeddings_model_stamp.sql 63 Do not use CREATE OR REPLACE VIEW to remove embedding_model in the down migration; PostgreSQL cannot drop columns from an existing view that way. Drop/recreate current_event_records for the down path before ALTER TABLE ... DROP COLUMN.
Full verdict JSON
{
  "bug_free_confidence": 82,
  "bugs": 1,
  "slop": 0,
  "simplicity": 82,
  "blockers": [],
  "change_type": "fix",
  "behavior_change_risk": "medium",
  "tests_adequate": true,
  "suggested_fixes": [
    {
      "file": "db/migrations/20260526120000_event_embeddings_model_stamp.sql",
      "line": 63,
      "change": "Do not use CREATE OR REPLACE VIEW to remove embedding_model in the down migration; PostgreSQL cannot drop columns from an existing view that way. Drop/recreate current_event_records for the down path before ALTER TABLE ... DROP COLUMN."
    }
  ],
  "notes": "Script suites reported exit 0. Also ran targeted server embedding model integration tests in current worktree: 2 files, 9 tests passed. Found one rollback-only migration defect; up path and model-scoped search/backfill behavior look covered.",
  "categories": {
    "src": 304,
    "tests": 703,
    "docs": 0,
    "config": 0,
    "deps": 0,
    "migrations": 106,
    "ci": 0,
    "generated": 0
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.

buremba added 3 commits May 26, 2026 05:10
Close the loop on the version stamp: persisting embedding_model is not
enough — vector search and the backfill trigger must also scope by it,
or a same-dimension model swap still compares the query against stale
vectors from another model and never re-embeds them.

- current_event_records now exposes emb.embedding_model (migration
  recreates the view, column appended at the end; down-migration
  restores the prior shape).
- content-search scopes every <=> comparison to the configured model
  (NULL = legacy, assumed current): matchCondition, similarity /
  combined-score exprs, and the candidate (recall) vector branch. The
  filtered_ids CTEs carry embedding_model so fi.* references resolve.
  Configured model is inlined as a validated SQL literal
  (configuredEmbeddingModelSqlLiteral), avoiding param-index surgery in
  the hot query builder.
- trigger-embed-backfill treats rows whose stamp differs from the
  configured model as needing backfill (not only missing rows);
  fetchEventsForEmbedding returns them too.
- completeEmbeddings + insert-event upsertEmbedding REPLACE a stale-model
  row on conflict (DO UPDATE ... WHERE model IS DISTINCT FROM), idempotent
  for same-model re-submits.

E2E reproducer (embedding-model-swap-e2e.test.ts): ingest under model A,
switch configured model to same-dimension model B; under B the row is
excluded from both the main and candidate search paths and flagged stale
by the backfill query, while under A it is still returned and not stale.
RED against the unscoped query (row leaked under B); GREEN after scoping.
Address the review blockers: a NULL embedding_model (legacy row written
before stamping) has an UNKNOWN true model, so it must not be assumed to
match the configured model.

- content-search modelScopeFor now requires an EXACT match
  (embedding_model = configured); NULL rows are excluded from vector
  comparison until restamped. They remain reachable by text search.
- trigger-embed-backfill + worker-api fetchEventsForEmbedding now treat
  NULL as stale via `IS DISTINCT FROM`, so the backfill restamps legacy
  rows (self-healing; no permanent vector-search blackout).
- server-side generateEmbeddings (used to embed the search query) now
  FAILS LOUD when the embeddings service reports a model different from
  the configured one, instead of only logging — a wrong-model query
  vector must never be compared against model-scoped rows.
- configuredEmbeddingModelSqlLiteral validates the model against the
  service's name allowlist before inlining (defense-in-depth).

Tests:
- createTestEvent now stamps the configured model by default (mirrors
  real ingestion; pass embedding_model: null to simulate a legacy row),
  so existing vector-search fixtures stay searchable under exact scoping.
- embedding-model-swap-e2e adds a NULL-stamp case: a legacy row is
  excluded from vector search and flagged stale by the backfill.
- new unit embeddings-model-guard.test.ts: generateEmbeddings rejects a
  service model mismatch; configuredEmbeddingModelSqlLiteral rejects an
  unsafe identifier.
…odel

The memory-benchmark adapter inserted event_embeddings without a model
stamp, so under exact model-scoped vector search its rows would be
NULL-stamped and invisible to recall. Stamp them with the configured
model, consistent with real ingestion.
@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 26, 2026

Final status — code complete; fresh verdict REQUIRED before ready

Fixes for #3 (embedding version-stamp) and #12 (batch sync embedding) committed (head ae4a463f), including a migration adding event_embeddings.embedding_model.

Review caught that the first attempt was incomplete (stamp written but search/backfill did not scope by it). Follow-up fixes applied: search excludes NULL/mismatched stamps, backfill treats NULL-or-mismatched as stale (both paths), restamp-on-conflict, and a server-side throw on embeddings-service model mismatch.

Deterministic suites passed on the prior head, but the latest pi verdict on file (48, 2 blockers) predates these fixes and a fresh verdict could not be obtained — the Codex review quota is exhausted (429, resets ~4.2h). Left DRAFT intentionally until a fresh pi review confirms the blockers are cleared (I will re-run centrally after the quota resets). Diff: 18 files, +984/-59. Not merged.

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 26, 2026

Marked ready for review

Readying on deterministic evidence (the fresh pi verdict is externally blocked, not the code):

  • The 2 blockers pi flagged in the stale verdict 48 — search/backfill not scoping by embedding_model — are fixed and verified by targeted tests: model-swap E2E (8 integration tests, Node 22 + embedded PG) + model-guard unit tests; both backfill paths restamp NULL/mismatched rows (self-healing); server-side throw on service model mismatch.
  • All deterministic suites green; migration 20260526120000_event_embeddings_model_stamp.sql adds the column + view exposure.
  • The posted 48 verdict predates these fixes and is stale.

Fresh pi verdict still quota-blocked (~3h); I will run it as confirmation post-reset and post the score (re-drafting only on a genuine new blocker).

Operational note before deploy: under exact model-scoping, pre-existing prod rows (NULL stamp) drop out of vector search until the backfill restamps them — self-healing, not instant on a large corpus; text search unaffected. Not merged.

@buremba buremba marked this pull request as ready for review May 26, 2026 05:49
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/connector-worker/src/__tests__/executor-batch-embed.test.ts`:
- Around line 49-52: The SubprocessExecutor test stub defines a constructor with
an unused parameter named _opts; remove the unused parameter entirely from the
SubprocessExecutor constructor declaration so the stub becomes constructor() {}
(i.e., edit the SubprocessExecutor class's constructor to drop "_opts" to
satisfy the repo TS rule against unused parameters).

In `@packages/server/src/utils/insert-event.ts`:
- Around line 176-186: The ON CONFLICT update currently allows a NULL incoming
embeddingModel to overwrite a stamped row; change the conflict WHERE to only
apply when the incoming stamp is non-null by adding a condition on
EXCLUDED.embedding_model. In the SQL block that INSERTs into event_embeddings
(using values from eventId, vectorLiteral, embeddingModel), update the WHERE
clause on the DO UPDATE to: keep the existing check
event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model AND
require EXCLUDED.embedding_model IS NOT NULL so null/unstamped writes do not
erase a current stamp.

In `@packages/server/src/worker-api.ts`:
- Around line 1711-1718: The UPSERT in completeEmbeddings currently allows an
incoming NULL embedding_model to overwrite an existing non-null value; change
the ON CONFLICT DO UPDATE WHERE clause so updates only occur when the incoming
embedding_model is different AND the incoming value is NOT NULL. Concretely,
modify the WHERE clause that currently uses "event_embeddings.embedding_model IS
DISTINCT FROM EXCLUDED.embedding_model" to add "AND EXCLUDED.embedding_model IS
NOT NULL" so the existing embedding_model will not be replaced by a NULL from
older workers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d3a623c0-135e-407d-9095-7fc2f3252355

📥 Commits

Reviewing files that changed from the base of the PR and between 56b6cff and ae4a463.

📒 Files selected for processing (18)
  • db/migrations/20260526120000_event_embeddings_model_stamp.sql
  • packages/connector-worker/src/__tests__/embeddings-model-stamp.test.ts
  • packages/connector-worker/src/__tests__/executor-batch-embed.test.ts
  • packages/connector-worker/src/__tests__/executor-heartbeat.test.ts
  • packages/connector-worker/src/daemon/client.ts
  • packages/connector-worker/src/daemon/executor.ts
  • packages/connector-worker/src/embeddings-model.ts
  • packages/connector-worker/src/embeddings.ts
  • packages/server/src/__tests__/integration/events/embedding-model-stamp.test.ts
  • packages/server/src/__tests__/integration/events/embedding-model-swap-e2e.test.ts
  • packages/server/src/__tests__/setup/test-fixtures.ts
  • packages/server/src/__tests__/unit/embeddings-model-guard.test.ts
  • packages/server/src/benchmarks/memory/adapters/lobu-inprocess.ts
  • packages/server/src/scheduled/trigger-embed-backfill.ts
  • packages/server/src/utils/content-search.ts
  • packages/server/src/utils/embeddings.ts
  • packages/server/src/utils/insert-event.ts
  • packages/server/src/worker-api.ts

Comment on lines +49 to +52
SubprocessExecutor: class {
// biome-ignore lint/suspicious/noExplicitAny: test stub
constructor(_opts: any) {}
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the unused constructor parameter instead of underscore-prefixing it.

_opts is unused and violates the repo TS rule; drop it from the stub constructor.

Suggested fix
 mock.module('../executor/subprocess.js', () => ({
   SubprocessExecutor: class {
     // biome-ignore lint/suspicious/noExplicitAny: test stub
-    constructor(_opts: any) {}
+    constructor() {}
   },
 }));

As per coding guidelines, **/*.{ts,tsx}: Fix unused function parameters by deleting them instead of prefixing with underscore.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SubprocessExecutor: class {
// biome-ignore lint/suspicious/noExplicitAny: test stub
constructor(_opts: any) {}
},
SubprocessExecutor: class {
// biome-ignore lint/suspicious/noExplicitAny: test stub
constructor() {}
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/connector-worker/src/__tests__/executor-batch-embed.test.ts` around
lines 49 - 52, The SubprocessExecutor test stub defines a constructor with an
unused parameter named _opts; remove the unused parameter entirely from the
SubprocessExecutor constructor declaration so the stub becomes constructor() {}
(i.e., edit the SubprocessExecutor class's constructor to drop "_opts" to
satisfy the repo TS rule against unused parameters).

Comment on lines +176 to +186
// On conflict, REPLACE a stale-model row with the freshly-embedded vector +
// stamp; the WHERE makes a same-model re-submit a no-op (idempotent), so a
// re-ingest of unchanged content under the same model never churns the row.
await sql`
INSERT INTO event_embeddings (event_id, embedding)
VALUES (${eventId}, ${vectorLiteral}::vector)
ON CONFLICT (event_id) DO NOTHING
INSERT INTO event_embeddings (event_id, embedding, embedding_model)
VALUES (${eventId}, ${vectorLiteral}::vector, ${embeddingModel ?? null})
ON CONFLICT (event_id) DO UPDATE
SET embedding = EXCLUDED.embedding,
embedding_model = EXCLUDED.embedding_model,
created_at = now()
WHERE event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't let unstamped writes erase a current stamp.

Because embeddingModel is optional here, an older caller that still omits it will satisfy IS DISTINCT FROM and overwrite an existing stamped row with NULL. That makes the embedding immediately invisible to model-scoped search and creates backfill churn during mixed-version rollouts. Only update on conflict when the incoming stamp is non-null.

Suggested fix
     ON CONFLICT (event_id) DO UPDATE
       SET embedding = EXCLUDED.embedding,
           embedding_model = EXCLUDED.embedding_model,
           created_at = now()
-      WHERE event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model
+      WHERE EXCLUDED.embedding_model IS NOT NULL
+        AND event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/utils/insert-event.ts` around lines 176 - 186, The ON
CONFLICT update currently allows a NULL incoming embeddingModel to overwrite a
stamped row; change the conflict WHERE to only apply when the incoming stamp is
non-null by adding a condition on EXCLUDED.embedding_model. In the SQL block
that INSERTs into event_embeddings (using values from eventId, vectorLiteral,
embeddingModel), update the WHERE clause on the DO UPDATE to: keep the existing
check event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model
AND require EXCLUDED.embedding_model IS NOT NULL so null/unstamped writes do not
erase a current stamp.

Comment on lines +1711 to +1718
`INSERT INTO event_embeddings (event_id, embedding, embedding_model)
VALUES ($1, $2::vector, $3)
ON CONFLICT (event_id) DO UPDATE
SET embedding = EXCLUDED.embedding,
embedding_model = EXCLUDED.embedding_model,
created_at = now()
WHERE event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model`,
[item.event_id, vectorStr, item.embedding_model ?? null]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard completeEmbeddings against NULL-stamp downgrades.

This UPSERT will also replace an already-stamped row with NULL when an older worker posts embedding_model as missing. That effectively unpublishes searchable embeddings until backfill repairs them. Apply the same non-null guard on conflict updates here.

Suggested fix
         `INSERT INTO event_embeddings (event_id, embedding, embedding_model)
          VALUES ($1, $2::vector, $3)
          ON CONFLICT (event_id) DO UPDATE
            SET embedding = EXCLUDED.embedding,
                embedding_model = EXCLUDED.embedding_model,
                created_at = now()
-            WHERE event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model`,
+            WHERE EXCLUDED.embedding_model IS NOT NULL
+              AND event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model`,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/worker-api.ts` around lines 1711 - 1718, The UPSERT in
completeEmbeddings currently allows an incoming NULL embedding_model to
overwrite an existing non-null value; change the ON CONFLICT DO UPDATE WHERE
clause so updates only occur when the incoming embedding_model is different AND
the incoming value is NOT NULL. Concretely, modify the WHERE clause that
currently uses "event_embeddings.embedding_model IS DISTINCT FROM
EXCLUDED.embedding_model" to add "AND EXCLUDED.embedding_model IS NOT NULL" so
the existing embedding_model will not be replaced by a NULL from older workers.

buremba added 2 commits May 26, 2026 13:43
…eplace + idempotency

Closes the coverage gap flagged in pre-merge review: the prior test hand-rolled
the upsert SQL instead of calling the handler, so a regression of completeEmbeddings
to ON CONFLICT DO NOTHING would have stayed green. New test invokes the real
handler via a minimal Context and asserts updated=1 on a stale-model replace and
updated=0 on a same-model re-submit (idempotent).
CREATE OR REPLACE VIEW cannot remove a column from an existing view in
Postgres, so the prior down path would fail on rollback. Drop and recreate
current_event_records without embedding_model, then DROP COLUMN. (pi review
bugs:1 finding on the rollback-only path; up path was unaffected.)
@buremba buremba merged commit e8c354b into main May 26, 2026
20 checks passed
@buremba buremba deleted the feat/fix-embeddings-integrity branch May 26, 2026 14:36
buremba added a commit that referenced this pull request May 28, 2026
…t guard (#1138)

Stability-audit follow-up. Adds CI guards so two recently-changed schema
invariants can't silently regress in a future migration:

- migration-invariants.test.ts: asserts the per-user pending oauth_account
  unique index from #1121 exists (and the old org-wide index is gone), plus a
  functional contract test — a user's second parallel pending OAuth flow
  collides while a distinct user's is allowed. Also asserts event_embeddings
  carries the embedding_model stamp column (#1069/#1080).
- embedding-model-literal.test.ts: the legacy-stamp backfill migration
  hard-codes the model literal; this fails if it ever drifts from
  DEFAULT_EMBEDDING_MODEL, which would silently re-open the full-corpus recall
  regression. Exports DEFAULT_EMBEDDING_MODEL for the assertion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants