Skip to content

fix(embeddings): stamp legacy embedding_model + stop liveness probe killing the embeddings service#1080

Merged
buremba merged 2 commits into
mainfrom
feat/fix-embedding-backfill-followup
May 26, 2026
Merged

fix(embeddings): stamp legacy embedding_model + stop liveness probe killing the embeddings service#1080
buremba merged 2 commits into
mainfrom
feat/fix-embedding-backfill-followup

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 26, 2026

Two fixes for the embeddings-backfill incident found during the #1066-#1070 prod rollout verification.

1. Migration: stamp legacy NULL embedding_model rows (recall regression)

20260526120000 (#1069) added event_embeddings.embedding_model but left existing rows NULL. The model-scoped vector search excludes NULL stamps, so on deploy the entire legacy corpus (prod: 1,231,945 rows) dropped out of vector search — full semantic-recall regression (text search unaffected). Hotfixed in prod with the same UPDATE; this migration makes it reproducible (fresh env / PITR restore). The stamp is accurate (legacy rows were all the default model) — no re-embedding. Idempotent; no-op once stamped.

2. Chart: tolerant liveness probe + CPU for the embeddings service (backfill 0/100)

Root-caused the failing backfill (0/100, restart loop): the embeddings service runs the model inline on a single-threaded event loop, so a heavy batch blocks /health. It shared the stateless app's probe (timeout 3s x3), so k8s killed a busy-but-healthy service mid-request (exit 137, "failed liveness probe", /health context deadline exceeded) → in-flight embedding requests died with fetch failed → restart loop. Not OOM, not data, not a code bug — an infra misconfig.

Fix (no code workaround): the embeddings service gets its own tolerant probe (liveness must detect a dead process, not a busy one: timeout 15s x8) and cpu 500m→1000m so batches finish faster. Chart.yaml 9.4.1→9.4.2 so Flux ChartVersion reconcile ships it (release-please doesn't manage the chart).

Validation

  • Migration: idempotent data UPDATE; CI migrations job validates apply. No committed schema.sql.
  • Chart: helm lint clean; helm template renders the new probe (liveness timeout 15 / failureThreshold 8) + cpu 1000m.

Notes

  • Prod recall is already restored (manual stamp). This PR makes that reproducible AND permanently fixes the embeddings restart loop.
  • Until the chart ships, the backfill still intermittently restarts the embeddings pod (~0.3% of events lack embeddings); merging + shipping the chart stops the churn.

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced embeddings service stability with refined health check configuration, improving reliability during high-load operations
    • Increased CPU resource allocation for embeddings service to optimize performance during batch processing
    • Applied database migration to ensure consistent embedding model version tracking
  • Chores

    • Updated Helm chart version to 9.4.2

Review Change Stack

20260526120000 added event_embeddings.embedding_model but left existing rows
NULL; the model-scoped vector search excludes NULL stamps, so the entire legacy
embedding corpus silently dropped out of vector search on deploy (prod: 1.23M
rows, full semantic-recall regression — hotfixed manually). Stamp legacy rows
with the default model (the label is accurate — no re-embedding needed) so the
fix is reproducible on a fresh env / PITR restore. Idempotent; no-op once stamped.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f3aebee7-88c0-42b3-b25f-d52ecd94c7d7

📥 Commits

Reviewing files that changed from the base of the PR and between 6eb19f4 and 870d620.

📒 Files selected for processing (4)
  • charts/lobu/Chart.yaml
  • charts/lobu/templates/embeddings-deployment.yaml
  • charts/lobu/values.yaml
  • db/migrations/20260526170000_backfill_legacy_embedding_model_stamp.sql

📝 Walkthrough

Walkthrough

This PR improves embeddings service reliability through coordinated database and Kubernetes configuration changes. A SQL migration backfills missing embedding model identifiers, while Helm chart updates increase CPU allocation and introduce embeddings-specific health check timeouts to prevent premature service termination during heavy batch processing.

Changes

Embeddings Service Reliability

Layer / File(s) Summary
Database migration for embedding model backfill
db/migrations/20260526170000_backfill_legacy_embedding_model_stamp.sql
Migration sets embedding_model to 'Xenova/bge-base-en-v1.5' for all event_embeddings rows where the column is NULL; down migration is a no-op to avoid reintroducing incorrect behavior.
Embeddings service Helm configuration and deployment
charts/lobu/values.yaml, charts/lobu/templates/embeddings-deployment.yaml, charts/lobu/Chart.yaml
Helm values.yaml increases embeddings CPU request to 1000m and adds embeddings-scoped healthCheck with more tolerant readinessProbe/livenessProbe timings; deployment template references embeddings-specific health check values; chart version bumped to 9.4.2.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A migration to stamp what was lost,
More CPU to power embeddings through cost,
Health checks now patient, no false alarms—
The service breathes easy in embedding farms! 🌾

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/fix-embedding-backfill-followup

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Root cause of the embed-backfill failure (0/100, restart loop): the embeddings
service runs the model inline on a single-threaded event loop, so a heavy batch
blocks /health. It shared the stateless app's probe (3s timeout x3), so k8s
killed a busy-but-healthy service mid-request (exit 137, 'failed liveness probe')
-> in-flight embedding requests died with 'fetch failed' -> restart loop. Not OOM,
not data, not a code bug. Give embeddings its own tolerant probe (liveness detects
a DEAD process, not a BUSY one: 15s timeout x8) and bump CPU 500m->1000m so batches
finish faster. Chart version 9.4.1->9.4.2 so Flux ChartVersion reconcile ships it.
@buremba buremba changed the title fix(db): backfill legacy NULL embedding_model stamps (recall regression follow-up) fix(embeddings): stamp legacy embedding_model + stop liveness probe killing the embeddings service May 26, 2026
@buremba buremba marked this pull request as ready for review May 26, 2026 16:02
@buremba buremba merged commit b9ba6c9 into main May 26, 2026
21 of 23 checks passed
@buremba buremba deleted the feat/fix-embedding-backfill-followup branch May 26, 2026 16:02
buremba added a commit that referenced this pull request May 28, 2026
…t guard (#1138)

Stability-audit follow-up. Adds CI guards so two recently-changed schema
invariants can't silently regress in a future migration:

- migration-invariants.test.ts: asserts the per-user pending oauth_account
  unique index from #1121 exists (and the old org-wide index is gone), plus a
  functional contract test — a user's second parallel pending OAuth flow
  collides while a distinct user's is allowed. Also asserts event_embeddings
  carries the embedding_model stamp column (#1069/#1080).
- embedding-model-literal.test.ts: the legacy-stamp backfill migration
  hard-codes the model literal; this fails if it ever drifts from
  DEFAULT_EMBEDDING_MODEL, which would silently re-open the full-corpus recall
  regression. Exports DEFAULT_EMBEDDING_MODEL for the assertion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants