fix(embeddings): version-stamp embeddings and batch the sync embed path#1069
Conversation
Finding #3 (HIGH): embeddings were never version-stamped, so swapping EMBEDDINGS_MODEL to a different same-dimension model silently mixed incompatible vector spaces with no detection. The connector-worker discarded the service-reported `model`; event_embeddings had no model column. - Add `embedding_model text` to event_embeddings (migration 20260526120000) plus a column comment. - Thread the model stamp through the pipeline: capture the service `model`, FAIL LOUD via resolveServiceModel() when it differs from the worker's expected model (equal dimensionality is not enough), and persist it via ContentItem.embedding_model and CompleteEmbeddingsRequest. Both server INSERT paths (worker-api completeEmbeddings + insert-event upsertEmbedding) write the stamp; legacy/omitted stamps store NULL. Finding #12 (MED, perf): the sync embedding path generated one embedding per event (one HTTP round-trip / ONNX pass each). Accumulate a chunk's texts and call batchGenerateEmbeddings once, mapping vectors back to each event by index; empty-text events get no vector and a batch failure fails open (items stream without embeddings), matching the prior behaviour. Reproducers: - embeddings-model-stamp.test.ts: resolveServiceModel rejects a same-dimension mismatch and resolves the stamp otherwise. - executor-batch-embed.test.ts: one chunk -> exactly one batch call with vectors + stamp mapped back per event. - events/embedding-model-stamp.test.ts (integration): embedding_model round-trips through insertEvent; NULL when unsupplied.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds embedding_model stamps end-to-end: DB migration, worker batch embedding with model stamps and validation, server model-config helpers and guards, model-scoped search/backfill, upsert persistence of model stamps, and extensive unit/integration tests. ChangesEmbedding Model Stamping & Vector Space Safety
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related issues
Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
bug_free 82, simplicity 82, slop 0, bugs 1, 0 blockers Script suites reported exit 0. Also ran targeted server embedding model integration tests in current worktree: 2 files, 9 tests passed. Found one rollback-only migration defect; up path and model-scoped search/backfill behavior look covered. Suggested fixes
Full verdict JSON{
"bug_free_confidence": 82,
"bugs": 1,
"slop": 0,
"simplicity": 82,
"blockers": [],
"change_type": "fix",
"behavior_change_risk": "medium",
"tests_adequate": true,
"suggested_fixes": [
{
"file": "db/migrations/20260526120000_event_embeddings_model_stamp.sql",
"line": 63,
"change": "Do not use CREATE OR REPLACE VIEW to remove embedding_model in the down migration; PostgreSQL cannot drop columns from an existing view that way. Drop/recreate current_event_records for the down path before ALTER TABLE ... DROP COLUMN."
}
],
"notes": "Script suites reported exit 0. Also ran targeted server embedding model integration tests in current worktree: 2 files, 9 tests passed. Found one rollback-only migration defect; up path and model-scoped search/backfill behavior look covered.",
"categories": {
"src": 304,
"tests": 703,
"docs": 0,
"config": 0,
"deps": 0,
"migrations": 106,
"ci": 0,
"generated": 0
}
}Local review gate — branch protection can require the |
Close the loop on the version stamp: persisting embedding_model is not enough — vector search and the backfill trigger must also scope by it, or a same-dimension model swap still compares the query against stale vectors from another model and never re-embeds them. - current_event_records now exposes emb.embedding_model (migration recreates the view, column appended at the end; down-migration restores the prior shape). - content-search scopes every <=> comparison to the configured model (NULL = legacy, assumed current): matchCondition, similarity / combined-score exprs, and the candidate (recall) vector branch. The filtered_ids CTEs carry embedding_model so fi.* references resolve. Configured model is inlined as a validated SQL literal (configuredEmbeddingModelSqlLiteral), avoiding param-index surgery in the hot query builder. - trigger-embed-backfill treats rows whose stamp differs from the configured model as needing backfill (not only missing rows); fetchEventsForEmbedding returns them too. - completeEmbeddings + insert-event upsertEmbedding REPLACE a stale-model row on conflict (DO UPDATE ... WHERE model IS DISTINCT FROM), idempotent for same-model re-submits. E2E reproducer (embedding-model-swap-e2e.test.ts): ingest under model A, switch configured model to same-dimension model B; under B the row is excluded from both the main and candidate search paths and flagged stale by the backfill query, while under A it is still returned and not stale. RED against the unscoped query (row leaked under B); GREEN after scoping.
Address the review blockers: a NULL embedding_model (legacy row written before stamping) has an UNKNOWN true model, so it must not be assumed to match the configured model. - content-search modelScopeFor now requires an EXACT match (embedding_model = configured); NULL rows are excluded from vector comparison until restamped. They remain reachable by text search. - trigger-embed-backfill + worker-api fetchEventsForEmbedding now treat NULL as stale via `IS DISTINCT FROM`, so the backfill restamps legacy rows (self-healing; no permanent vector-search blackout). - server-side generateEmbeddings (used to embed the search query) now FAILS LOUD when the embeddings service reports a model different from the configured one, instead of only logging — a wrong-model query vector must never be compared against model-scoped rows. - configuredEmbeddingModelSqlLiteral validates the model against the service's name allowlist before inlining (defense-in-depth). Tests: - createTestEvent now stamps the configured model by default (mirrors real ingestion; pass embedding_model: null to simulate a legacy row), so existing vector-search fixtures stay searchable under exact scoping. - embedding-model-swap-e2e adds a NULL-stamp case: a legacy row is excluded from vector search and flagged stale by the backfill. - new unit embeddings-model-guard.test.ts: generateEmbeddings rejects a service model mismatch; configuredEmbeddingModelSqlLiteral rejects an unsafe identifier.
…odel The memory-benchmark adapter inserted event_embeddings without a model stamp, so under exact model-scoped vector search its rows would be NULL-stamped and invisible to recall. Stamp them with the configured model, consistent with real ingestion.
Final status — code complete; fresh verdict REQUIRED before readyFixes for #3 (embedding version-stamp) and #12 (batch sync embedding) committed (head Review caught that the first attempt was incomplete (stamp written but search/backfill did not scope by it). Follow-up fixes applied: search excludes NULL/mismatched stamps, backfill treats NULL-or-mismatched as stale (both paths), restamp-on-conflict, and a server-side throw on embeddings-service model mismatch. Deterministic suites passed on the prior head, but the latest pi verdict on file (48, 2 blockers) predates these fixes and a fresh verdict could not be obtained — the Codex review quota is exhausted (429, resets ~4.2h). Left DRAFT intentionally until a fresh pi review confirms the blockers are cleared (I will re-run centrally after the quota resets). Diff: 18 files, +984/-59. Not merged. |
Marked ready for reviewReadying on deterministic evidence (the fresh pi verdict is externally blocked, not the code):
Fresh pi verdict still quota-blocked (~3h); I will run it as confirmation post-reset and post the score (re-drafting only on a genuine new blocker). Operational note before deploy: under exact model-scoping, pre-existing prod rows (NULL stamp) drop out of vector search until the backfill restamps them — self-healing, not instant on a large corpus; text search unaffected. Not merged. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/connector-worker/src/__tests__/executor-batch-embed.test.ts`:
- Around line 49-52: The SubprocessExecutor test stub defines a constructor with
an unused parameter named _opts; remove the unused parameter entirely from the
SubprocessExecutor constructor declaration so the stub becomes constructor() {}
(i.e., edit the SubprocessExecutor class's constructor to drop "_opts" to
satisfy the repo TS rule against unused parameters).
In `@packages/server/src/utils/insert-event.ts`:
- Around line 176-186: The ON CONFLICT update currently allows a NULL incoming
embeddingModel to overwrite a stamped row; change the conflict WHERE to only
apply when the incoming stamp is non-null by adding a condition on
EXCLUDED.embedding_model. In the SQL block that INSERTs into event_embeddings
(using values from eventId, vectorLiteral, embeddingModel), update the WHERE
clause on the DO UPDATE to: keep the existing check
event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model AND
require EXCLUDED.embedding_model IS NOT NULL so null/unstamped writes do not
erase a current stamp.
In `@packages/server/src/worker-api.ts`:
- Around line 1711-1718: The UPSERT in completeEmbeddings currently allows an
incoming NULL embedding_model to overwrite an existing non-null value; change
the ON CONFLICT DO UPDATE WHERE clause so updates only occur when the incoming
embedding_model is different AND the incoming value is NOT NULL. Concretely,
modify the WHERE clause that currently uses "event_embeddings.embedding_model IS
DISTINCT FROM EXCLUDED.embedding_model" to add "AND EXCLUDED.embedding_model IS
NOT NULL" so the existing embedding_model will not be replaced by a NULL from
older workers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: d3a623c0-135e-407d-9095-7fc2f3252355
📒 Files selected for processing (18)
db/migrations/20260526120000_event_embeddings_model_stamp.sqlpackages/connector-worker/src/__tests__/embeddings-model-stamp.test.tspackages/connector-worker/src/__tests__/executor-batch-embed.test.tspackages/connector-worker/src/__tests__/executor-heartbeat.test.tspackages/connector-worker/src/daemon/client.tspackages/connector-worker/src/daemon/executor.tspackages/connector-worker/src/embeddings-model.tspackages/connector-worker/src/embeddings.tspackages/server/src/__tests__/integration/events/embedding-model-stamp.test.tspackages/server/src/__tests__/integration/events/embedding-model-swap-e2e.test.tspackages/server/src/__tests__/setup/test-fixtures.tspackages/server/src/__tests__/unit/embeddings-model-guard.test.tspackages/server/src/benchmarks/memory/adapters/lobu-inprocess.tspackages/server/src/scheduled/trigger-embed-backfill.tspackages/server/src/utils/content-search.tspackages/server/src/utils/embeddings.tspackages/server/src/utils/insert-event.tspackages/server/src/worker-api.ts
| SubprocessExecutor: class { | ||
| // biome-ignore lint/suspicious/noExplicitAny: test stub | ||
| constructor(_opts: any) {} | ||
| }, |
There was a problem hiding this comment.
Remove the unused constructor parameter instead of underscore-prefixing it.
_opts is unused and violates the repo TS rule; drop it from the stub constructor.
Suggested fix
mock.module('../executor/subprocess.js', () => ({
SubprocessExecutor: class {
// biome-ignore lint/suspicious/noExplicitAny: test stub
- constructor(_opts: any) {}
+ constructor() {}
},
}));As per coding guidelines, **/*.{ts,tsx}: Fix unused function parameters by deleting them instead of prefixing with underscore.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| SubprocessExecutor: class { | |
| // biome-ignore lint/suspicious/noExplicitAny: test stub | |
| constructor(_opts: any) {} | |
| }, | |
| SubprocessExecutor: class { | |
| // biome-ignore lint/suspicious/noExplicitAny: test stub | |
| constructor() {} | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/connector-worker/src/__tests__/executor-batch-embed.test.ts` around
lines 49 - 52, The SubprocessExecutor test stub defines a constructor with an
unused parameter named _opts; remove the unused parameter entirely from the
SubprocessExecutor constructor declaration so the stub becomes constructor() {}
(i.e., edit the SubprocessExecutor class's constructor to drop "_opts" to
satisfy the repo TS rule against unused parameters).
| // On conflict, REPLACE a stale-model row with the freshly-embedded vector + | ||
| // stamp; the WHERE makes a same-model re-submit a no-op (idempotent), so a | ||
| // re-ingest of unchanged content under the same model never churns the row. | ||
| await sql` | ||
| INSERT INTO event_embeddings (event_id, embedding) | ||
| VALUES (${eventId}, ${vectorLiteral}::vector) | ||
| ON CONFLICT (event_id) DO NOTHING | ||
| INSERT INTO event_embeddings (event_id, embedding, embedding_model) | ||
| VALUES (${eventId}, ${vectorLiteral}::vector, ${embeddingModel ?? null}) | ||
| ON CONFLICT (event_id) DO UPDATE | ||
| SET embedding = EXCLUDED.embedding, | ||
| embedding_model = EXCLUDED.embedding_model, | ||
| created_at = now() | ||
| WHERE event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model |
There was a problem hiding this comment.
Don't let unstamped writes erase a current stamp.
Because embeddingModel is optional here, an older caller that still omits it will satisfy IS DISTINCT FROM and overwrite an existing stamped row with NULL. That makes the embedding immediately invisible to model-scoped search and creates backfill churn during mixed-version rollouts. Only update on conflict when the incoming stamp is non-null.
Suggested fix
ON CONFLICT (event_id) DO UPDATE
SET embedding = EXCLUDED.embedding,
embedding_model = EXCLUDED.embedding_model,
created_at = now()
- WHERE event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model
+ WHERE EXCLUDED.embedding_model IS NOT NULL
+ AND event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/utils/insert-event.ts` around lines 176 - 186, The ON
CONFLICT update currently allows a NULL incoming embeddingModel to overwrite a
stamped row; change the conflict WHERE to only apply when the incoming stamp is
non-null by adding a condition on EXCLUDED.embedding_model. In the SQL block
that INSERTs into event_embeddings (using values from eventId, vectorLiteral,
embeddingModel), update the WHERE clause on the DO UPDATE to: keep the existing
check event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model
AND require EXCLUDED.embedding_model IS NOT NULL so null/unstamped writes do not
erase a current stamp.
| `INSERT INTO event_embeddings (event_id, embedding, embedding_model) | ||
| VALUES ($1, $2::vector, $3) | ||
| ON CONFLICT (event_id) DO UPDATE | ||
| SET embedding = EXCLUDED.embedding, | ||
| embedding_model = EXCLUDED.embedding_model, | ||
| created_at = now() | ||
| WHERE event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model`, | ||
| [item.event_id, vectorStr, item.embedding_model ?? null] |
There was a problem hiding this comment.
Guard completeEmbeddings against NULL-stamp downgrades.
This UPSERT will also replace an already-stamped row with NULL when an older worker posts embedding_model as missing. That effectively unpublishes searchable embeddings until backfill repairs them. Apply the same non-null guard on conflict updates here.
Suggested fix
`INSERT INTO event_embeddings (event_id, embedding, embedding_model)
VALUES ($1, $2::vector, $3)
ON CONFLICT (event_id) DO UPDATE
SET embedding = EXCLUDED.embedding,
embedding_model = EXCLUDED.embedding_model,
created_at = now()
- WHERE event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model`,
+ WHERE EXCLUDED.embedding_model IS NOT NULL
+ AND event_embeddings.embedding_model IS DISTINCT FROM EXCLUDED.embedding_model`,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/worker-api.ts` around lines 1711 - 1718, The UPSERT in
completeEmbeddings currently allows an incoming NULL embedding_model to
overwrite an existing non-null value; change the ON CONFLICT DO UPDATE WHERE
clause so updates only occur when the incoming embedding_model is different AND
the incoming value is NOT NULL. Concretely, modify the WHERE clause that
currently uses "event_embeddings.embedding_model IS DISTINCT FROM
EXCLUDED.embedding_model" to add "AND EXCLUDED.embedding_model IS NOT NULL" so
the existing embedding_model will not be replaced by a NULL from older workers.
…eplace + idempotency Closes the coverage gap flagged in pre-merge review: the prior test hand-rolled the upsert SQL instead of calling the handler, so a regression of completeEmbeddings to ON CONFLICT DO NOTHING would have stayed green. New test invokes the real handler via a minimal Context and asserts updated=1 on a stale-model replace and updated=0 on a same-model re-submit (idempotent).
CREATE OR REPLACE VIEW cannot remove a column from an existing view in Postgres, so the prior down path would fail on rollback. Drop and recreate current_event_records without embedding_model, then DROP COLUMN. (pi review bugs:1 finding on the rollback-only path; up path was unaffected.)
…t guard (#1138) Stability-audit follow-up. Adds CI guards so two recently-changed schema invariants can't silently regress in a future migration: - migration-invariants.test.ts: asserts the per-user pending oauth_account unique index from #1121 exists (and the old org-wide index is gone), plus a functional contract test — a user's second parallel pending OAuth flow collides while a distinct user's is allowed. Also asserts event_embeddings carries the embedding_model stamp column (#1069/#1080). - embedding-model-literal.test.ts: the legacy-stamp backfill migration hard-codes the model literal; this fails if it ever drifts from DEFAULT_EMBEDDING_MODEL, which would silently re-open the full-corpus recall regression. Exports DEFAULT_EMBEDDING_MODEL for the assertion.
Fixes two confirmed embeddings-integrity bugs.
Finding #3 (HIGH) — embeddings never version-stamped -> silent vector-space mixing
The embeddings service returns
{model, dimensions, embeddings}, but the connector-worker destructured only{embeddings, dimensions}—modelwas discarded and only dimensionality was validated.event_embeddingshad no model column. SwappingEMBEDDINGS_MODELto a different same-dimension model therefore mixed incompatible vector spaces with no detection.Scope chosen: loud mismatch guard + persisted stamp column (not full query-segregation). Full segregation of every similarity query across stamps is broad and risky; the recall/search SQL is a multi-branch candidate UNION. A persisted
embedding_modelcolumn plus a fail-loud guard prevents new silent mixing immediately and leaves the data tagged so a targeted backfill/filter can follow later. This matches the task's "loud mismatch guard + stamped column is acceptable" guidance.Changes:
db/migrations/20260526120000_event_embeddings_model_stamp.sqladdsembedding_model text(nullable; NULL = legacy/pre-stamp) + a column comment. IdempotentADD COLUMN IF NOT EXISTS/DROP COLUMN IF EXISTS, matching the existing forward-delta format. Noschema.sqlregeneration (none is checked in; baseline-only), so theschema_migrations varchar(128)drift note does not apply.embeddings-model.tsresolveServiceModel()— fails loud when the service-reported model differs from the worker's expected model (EMBEDDINGS_MODEL/default), else returns the stamp. Kept in its own module so it is unit-testable (the executor tests module-mockembeddings.tsand bun'smock.moduleis process-global).batchGenerateEmbeddings/generateEmbeddingreturn{ embeddings, model }; the stamp threads throughContentItem.embedding_model,CompleteEmbeddingsRequest, andInsertEventParams.embeddingModel. Both server INSERT paths (worker-apicompleteEmbeddings +insert-eventupsertEmbedding) persist it.Red -> green
packages/connector-worker/src/__tests__/embeddings-model-stamp.test.ts—resolveServiceModeldid not exist onorigin/mainand no model check was performed (mismatch passed silently). Now: rejects a same-dimension mismatch, resolves the stamp on match, falls back when omitted. 3 pass.packages/server/src/__tests__/integration/events/embedding-model-stamp.test.ts— RED onorigin/main:PostgresError: column "embedding_model" does not exist. GREEN with the migration + threading: stamp round-trips throughinsertEvent; NULL when unsupplied. 2 pass (embedded PG, Node 22).Finding #12 (MED, perf) — sync embedding path was per-event
onEventChunkloopedawait processEvent()->generateEmbedding(singleText)per event = one HTTP round-trip / ONNX pass each. NowprocessEventChunkaccumulates the chunk's embeddable texts, callsbatchGenerateEmbeddingsonce, and maps vectors back by index. Empty-text events get no vector; a batch failure fails open (items stream without embeddings), preserving the prior per-event error behaviour. Theembed_backfilllane already batched and is unchanged except for stamping.Red -> green
packages/connector-worker/src/__tests__/executor-batch-embed.test.ts— RED against the original executor:batchGenerateEmbeddingscalled 0 times (per-event path), test expects 1. GREEN: one 3-event chunk -> exactly one batch call with all embeddable texts, vectors +embedding_modelmapped back to the right events, empty event left unembedded. 1 pass.Validation
bun testconnector-worker: 56 pass.Migration callout
Adds
db/migrations/20260526120000_event_embeddings_model_stamp.sql(newevent_embeddings.embedding_modelcolumn).Summary by CodeRabbit
Improvements
New Features
Tests