fix(embeddings): stamp legacy embedding_model + stop liveness probe killing the embeddings service by buremba · Pull Request #1080 · lobu-ai/lobu

buremba · 2026-05-26T15:42:17Z

Two fixes for the embeddings-backfill incident found during the #1066-#1070 prod rollout verification.

1. Migration: stamp legacy NULL `embedding_model` rows (recall regression)

20260526120000 (#1069) added event_embeddings.embedding_model but left existing rows NULL. The model-scoped vector search excludes NULL stamps, so on deploy the entire legacy corpus (prod: 1,231,945 rows) dropped out of vector search — full semantic-recall regression (text search unaffected). Hotfixed in prod with the same UPDATE; this migration makes it reproducible (fresh env / PITR restore). The stamp is accurate (legacy rows were all the default model) — no re-embedding. Idempotent; no-op once stamped.

2. Chart: tolerant liveness probe + CPU for the embeddings service (backfill 0/100)

Root-caused the failing backfill (0/100, restart loop): the embeddings service runs the model inline on a single-threaded event loop, so a heavy batch blocks /health. It shared the stateless app's probe (timeout 3s x3), so k8s killed a busy-but-healthy service mid-request (exit 137, "failed liveness probe", /health context deadline exceeded) → in-flight embedding requests died with fetch failed → restart loop. Not OOM, not data, not a code bug — an infra misconfig.

Fix (no code workaround): the embeddings service gets its own tolerant probe (liveness must detect a dead process, not a busy one: timeout 15s x8) and cpu 500m→1000m so batches finish faster. Chart.yaml 9.4.1→9.4.2 so Flux ChartVersion reconcile ships it (release-please doesn't manage the chart).

Validation

Migration: idempotent data UPDATE; CI migrations job validates apply. No committed schema.sql.
Chart: helm lint clean; helm template renders the new probe (liveness timeout 15 / failureThreshold 8) + cpu 1000m.

Notes

Prod recall is already restored (manual stamp). This PR makes that reproducible AND permanently fixes the embeddings restart loop.
Until the chart ships, the backfill still intermittently restarts the embeddings pod (~0.3% of events lack embeddings); merging + shipping the chart stops the churn.

Summary by CodeRabbit

Bug Fixes
- Enhanced embeddings service stability with refined health check configuration, improving reliability during high-load operations
- Increased CPU resource allocation for embeddings service to optimize performance during batch processing
- Applied database migration to ensure consistent embedding model version tracking
Chores
- Updated Helm chart version to 9.4.2

20260526120000 added event_embeddings.embedding_model but left existing rows NULL; the model-scoped vector search excludes NULL stamps, so the entire legacy embedding corpus silently dropped out of vector search on deploy (prod: 1.23M rows, full semantic-recall regression — hotfixed manually). Stamp legacy rows with the default model (the label is accurate — no re-embedding needed) so the fix is reproducible on a fresh env / PITR restore. Idempotent; no-op once stamped.

coderabbitai · 2026-05-26T15:42:26Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f3aebee7-88c0-42b3-b25f-d52ecd94c7d7

📥 Commits

Reviewing files that changed from the base of the PR and between 6eb19f4 and 870d620.

📒 Files selected for processing (4)

charts/lobu/Chart.yaml
charts/lobu/templates/embeddings-deployment.yaml
charts/lobu/values.yaml
db/migrations/20260526170000_backfill_legacy_embedding_model_stamp.sql

📝 Walkthrough

Walkthrough

This PR improves embeddings service reliability through coordinated database and Kubernetes configuration changes. A SQL migration backfills missing embedding model identifiers, while Helm chart updates increase CPU allocation and introduce embeddings-specific health check timeouts to prevent premature service termination during heavy batch processing.

Changes

Embeddings Service Reliability

Layer / File(s)	Summary
Database migration for embedding model backfill `db/migrations/20260526170000_backfill_legacy_embedding_model_stamp.sql`	Migration sets `embedding_model` to `'Xenova/bge-base-en-v1.5'` for all `event_embeddings` rows where the column is `NULL`; down migration is a no-op to avoid reintroducing incorrect behavior.
Embeddings service Helm configuration and deployment `charts/lobu/values.yaml`, `charts/lobu/templates/embeddings-deployment.yaml`, `charts/lobu/Chart.yaml`	Helm `values.yaml` increases embeddings CPU request to 1000m and adds embeddings-scoped `healthCheck` with more tolerant `readinessProbe`/`livenessProbe` timings; deployment template references embeddings-specific health check values; chart version bumped to 9.4.2.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A migration to stamp what was lost,
More CPU to power embeddings through cost,
Health checks now patient, no false alarms—
The service breathes easy in embedding farms! 🌾

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/fix-embedding-backfill-followup

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov-commenter · 2026-05-26T15:44:10Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Root cause of the embed-backfill failure (0/100, restart loop): the embeddings service runs the model inline on a single-threaded event loop, so a heavy batch blocks /health. It shared the stateless app's probe (3s timeout x3), so k8s killed a busy-but-healthy service mid-request (exit 137, 'failed liveness probe') -> in-flight embedding requests died with 'fetch failed' -> restart loop. Not OOM, not data, not a code bug. Give embeddings its own tolerant probe (liveness detects a DEAD process, not a BUSY one: 15s timeout x8) and bump CPU 500m->1000m so batches finish faster. Chart version 9.4.1->9.4.2 so Flux ChartVersion reconcile ships it.

…t guard (#1138) Stability-audit follow-up. Adds CI guards so two recently-changed schema invariants can't silently regress in a future migration: - migration-invariants.test.ts: asserts the per-user pending oauth_account unique index from #1121 exists (and the old org-wide index is gone), plus a functional contract test — a user's second parallel pending OAuth flow collides while a distinct user's is allowed. Also asserts event_embeddings carries the embedding_model stamp column (#1069/#1080). - embedding-model-literal.test.ts: the legacy-stamp backfill migration hard-codes the model literal; this fails if it ever drifts from DEFAULT_EMBEDDING_MODEL, which would silently re-open the full-corpus recall regression. Exports DEFAULT_EMBEDDING_MODEL for the assertion.

buremba changed the title ~~fix(db): backfill legacy NULL embedding_model stamps (recall regression follow-up)~~ fix(embeddings): stamp legacy embedding_model + stop liveness probe killing the embeddings service May 26, 2026

buremba marked this pull request as ready for review May 26, 2026 16:02

buremba merged commit b9ba6c9 into main May 26, 2026
21 of 23 checks passed

buremba deleted the feat/fix-embedding-backfill-followup branch May 26, 2026 16:02

This was referenced May 26, 2026

chore(main): release lobu 10.0.0 #1057

Merged

test(server): pin migration invariants + embedding-model literal drift guard #1138

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(embeddings): stamp legacy embedding_model + stop liveness probe killing the embeddings service#1080

fix(embeddings): stamp legacy embedding_model + stop liveness probe killing the embeddings service#1080
buremba merged 2 commits into
mainfrom
feat/fix-embedding-backfill-followup

buremba commented May 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

codecov-commenter commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

buremba commented May 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Migration: stamp legacy NULL embedding_model rows (recall regression)

2. Chart: tolerant liveness probe + CPU for the embeddings service (backfill 0/100)

Validation

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

codecov-commenter commented May 26, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

buremba commented May 26, 2026 •

edited by coderabbitai Bot

Loading

1. Migration: stamp legacy NULL `embedding_model` rows (recall regression)

coderabbitai Bot commented May 26, 2026 •

edited

Loading