fix(embeddings): stamp legacy embedding_model + stop liveness probe killing the embeddings service #1080
20260526120000 added event_embeddings.embedding_model but left existing rows NULL; the model-scoped vector search excludes NULL stamps, so the entire legacy embedding corpus silently dropped out of vector search on deploy (prod: 1.23M rows, full semantic-recall regression — hotfixed manually). Stamp legacy rows with the default model (the label is accurate — no re-embedding needed) so the fix is reproducible on a fresh env / PITR restore. Idempotent; no-op once stamped.
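The failure mode and the fix described above can be illustrated with a small in-memory sketch. It is not the migration itself: `EmbeddingRow`, `searchableRows`, `stampLegacyRows`, and the model name `example-embedding-model` are hypothetical stand-ins, since the real default model literal and query code are not shown here.

```typescript
// Hypothetical stand-in: the real default model literal is not shown in this PR text.
const DEFAULT_EMBEDDING_MODEL = "example-embedding-model";

interface EmbeddingRow {
  id: number;
  // null models a row written before the embedding_model column existed.
  embeddingModel: string | null;
}

// Model-scoped search: only rows stamped with the requested model are candidates.
// A null stamp never equals a model name, so unstamped rows silently drop out.
function searchableRows(rows: EmbeddingRow[], model: string): EmbeddingRow[] {
  return rows.filter((row) => row.embeddingModel === model);
}

// Backfill: stamp only the null rows and report how many were changed.
// Already-stamped rows are left alone, which is what makes a rerun a no-op.
function stampLegacyRows(rows: EmbeddingRow[], model: string): number {
  let changed = 0;
  for (const row of rows) {
    if (row.embeddingModel === null) {
      row.embeddingModel = model;
      changed += 1;
    }
  }
  return changed;
}

const corpus: EmbeddingRow[] = [
  { id: 1, embeddingModel: null },
  { id: 2, embeddingModel: null },
  { id: 3, embeddingModel: DEFAULT_EMBEDDING_MODEL },
];

const visibleBefore = searchableRows(corpus, DEFAULT_EMBEDDING_MODEL).length;
const firstRun = stampLegacyRows(corpus, DEFAULT_EMBEDDING_MODEL);
const secondRun = stampLegacyRows(corpus, DEFAULT_EMBEDDING_MODEL);
const visibleAfter = searchableRows(corpus, DEFAULT_EMBEDDING_MODEL).length;
```

In this toy corpus only the one stamped row is searchable before the backfill; the first run stamps the two legacy rows, the second run changes nothing, and all three rows are searchable afterwards.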
Walkthrough
This PR improves embeddings service reliability through coordinated database and Kubernetes configuration changes. A SQL migration backfills missing embedding model identifiers, while Helm chart updates increase CPU allocation and introduce embeddings-specific health check timeouts to prevent premature service termination during heavy batch processing.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Root cause of the embed-backfill failure (0/100, restart loop): the embeddings service runs the model inline on a single-threaded event loop, so a heavy batch blocks /health. It shared the stateless app's probe (3s timeout x3), so k8s killed a busy-but-healthy service mid-request (exit 137, 'failed liveness probe') -> in-flight embedding requests died with 'fetch failed' -> restart loop. Not OOM, not data, not a code bug. Give embeddings its own tolerant probe (liveness detects a DEAD process, not a BUSY one: 15s timeout x8) and bump CPU 500m->1000m so batches finish faster. Chart version 9.4.1->9.4.2 so Flux ChartVersion reconcile ships it.
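To see why the probe change matters, here is a rough estimate of how long a blocked `/health` endpoint can go unanswered before the container is restarted. This is a simplified model, not the kubelet's exact scheduling: it assumes consecutive probes start `max(periodSeconds, timeoutSeconds)` apart and that every probe runs to its full timeout. The `periodSeconds` value of 10 is the Kubernetes default and is an assumption here, since the chart's actual period is not stated. `ProbeConfig` and `secondsUntilRestart` are hypothetical names.

```typescript
interface ProbeConfig {
  timeoutSeconds: number;   // how long one probe waits for /health
  failureThreshold: number; // consecutive failures before the container is killed
  periodSeconds: number;    // interval between probe starts
}

// Rough model: the last failure lands (failureThreshold - 1) probe spacings
// after the first probe starts, plus one final timeout.
function secondsUntilRestart(probe: ProbeConfig): number {
  const spacing = Math.max(probe.periodSeconds, probe.timeoutSeconds);
  return (probe.failureThreshold - 1) * spacing + probe.timeoutSeconds;
}

// Assumption: Kubernetes default period; the chart's real value is not in the PR text.
const ASSUMED_PERIOD_SECONDS = 10;

// The probe shared with the stateless app (3s timeout x3).
const sharedProbe: ProbeConfig = {
  timeoutSeconds: 3,
  failureThreshold: 3,
  periodSeconds: ASSUMED_PERIOD_SECONDS,
};

// The embeddings-specific probe from this PR (15s timeout x8).
const embeddingsProbe: ProbeConfig = {
  timeoutSeconds: 15,
  failureThreshold: 8,
  periodSeconds: ASSUMED_PERIOD_SECONDS,
};

const sharedWindow = secondsUntilRestart(sharedProbe);
const embeddingsWindow = secondsUntilRestart(embeddingsProbe);
```

Under these assumptions the shared probe tolerates roughly 23 seconds of a blocked event loop and the new probe roughly 120 seconds. The exact figures depend on the real period, but the ratio shows why a heavy batch could trip the old probe and not the new one.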
…t guard (#1138)
Stability-audit follow-up. Adds CI guards so two recently-changed schema invariants can't silently regress in a future migration:
- migration-invariants.test.ts: asserts the per-user pending oauth_account unique index from #1121 exists (and the old org-wide index is gone), plus a functional contract test: a user's second parallel pending OAuth flow collides while a distinct user's is allowed. Also asserts event_embeddings carries the embedding_model stamp column (#1069/#1080).
- embedding-model-literal.test.ts: the legacy-stamp backfill migration hard-codes the model literal; this fails if it ever drifts from DEFAULT_EMBEDDING_MODEL, which would silently re-open the full-corpus recall regression. Exports DEFAULT_EMBEDDING_MODEL for the assertion.
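A literal-drift guard of the kind described for embedding-model-literal.test.ts might look roughly like the sketch below. This is not the actual test: the SQL text, the model name `example-embedding-model`, and the helpers `extractStampedModel` and `assertLiteralMatches` are all hypothetical, and the real test presumably reads the migration file from disk rather than using an inline string.

```typescript
// Hypothetical stand-in for the exported constant; the real value is not shown here.
const DEFAULT_EMBEDDING_MODEL = "example-embedding-model";

// Assumed shape of the backfill migration, based on the PR's description of an
// UPDATE that stamps NULL embedding_model rows on event_embeddings.
const MIGRATION_SQL = `
  UPDATE event_embeddings
     SET embedding_model = 'example-embedding-model'
   WHERE embedding_model IS NULL;
`;

// Pull the hard-coded model literal out of the SET clause, or null if absent.
function extractStampedModel(sql: string): string | null {
  const match = /SET\s+embedding_model\s*=\s*'([^']+)'/i.exec(sql);
  return match?.[1] ?? null;
}

// Fail loudly if the migration's literal no longer matches the code's default.
function assertLiteralMatches(sql: string, expected: string): void {
  const literal = extractStampedModel(sql);
  if (literal !== expected) {
    throw new Error(
      `migration stamps ${String(literal)}, but the default model is ${expected}`,
    );
  }
}

assertLiteralMatches(MIGRATION_SQL, DEFAULT_EMBEDDING_MODEL);
```

Comparing against the exported constant rather than a second hard-coded string means that changing the default model in code breaks CI until someone decides what to do about rows stamped by the old literal.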
Two fixes for the embeddings-backfill incident found during the #1066-#1070 prod rollout verification.
1. Migration: stamp legacy NULL `embedding_model` rows (recall regression)
`20260526120000` (#1069) added `event_embeddings.embedding_model` but left existing rows NULL. The model-scoped vector search excludes NULL stamps, so on deploy the entire legacy corpus (prod: 1,231,945 rows) dropped out of vector search: a full semantic-recall regression (text search unaffected). Hotfixed in prod with the same UPDATE; this migration makes it reproducible (fresh env / PITR restore). The stamp is accurate (legacy rows were all the default model); no re-embedding. Idempotent; no-op once stamped.

2. Chart: tolerant liveness probe + CPU for the embeddings service (backfill 0/100)
Root-caused the failing backfill (`0/100`, restart loop): the embeddings service runs the model inline on a single-threaded event loop, so a heavy batch blocks `/health`. It shared the stateless app's probe (timeout 3s x3), so k8s killed a busy-but-healthy service mid-request (exit 137, "failed liveness probe", `/health context deadline exceeded`) → in-flight embedding requests died with `fetch failed` → restart loop. Not OOM, not data, not a code bug: an infra misconfig.

Fix (no code workaround): the embeddings service gets its own tolerant probe (liveness must detect a dead process, not a busy one: `timeout 15s x8`) and `cpu 500m→1000m` so batches finish faster. `Chart.yaml 9.4.1→9.4.2` so Flux ChartVersion reconcile ships it (release-please doesn't manage the chart).

Validation
- `migrations` job validates apply. No committed `schema.sql`.
- `helm lint` clean; `helm template` renders the new probe (liveness timeout 15 / failureThreshold 8) + `cpu 1000m`.

Notes