fix(reliability): gate boot on schema, surface err, split readiness#767

Merged
buremba merged 2 commits into
mainfrom
fix/post-incident-reliability
May 16, 2026

Conversation


@buremba buremba commented May 16, 2026

Why

Postmortem for the 2026-05-16 prod outage. The chain that day:

  1. Pre-upgrade migration Job for 20260516200000_events_search_tsv.sql timed out at the DB's 60s statement_timeout (1.15M-row events rewrite under ACCESS EXCLUSIVE; the migration's own operational note flagged this).
  2. The Job exited non-zero but the app Deployment rolled forward to lobu-app:20260515-181119, which queried current_event_records expecting e.search_tsv to exist.
  3. Every CheckDueFeeds tick failed; pino's default err serializer doesn't fire on the { error } key, so logs showed only error: {} and hid the real cause (column "events.search_tsv" does not exist).
  4. Pod OOM-killed at 1Gi every ~70 min. Mac Bridge users saw Cloudflare 502s during each restart window.
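
Step 3 comes down to how JavaScript serializes `Error` objects: `message` and `stack` are non-enumerable, so plain JSON serialization of `{ error }` produces `error: {}`. The sketch below is a dependency-free illustration of that effect. `serializeError` is a hypothetical stand-in for what an error serializer does and is not pino's implementation.

```typescript
// Hypothetical stand-in for an error serializer. pino's real one is
// pino.stdSerializers.err and handles more cases than this sketch.
interface SerializedError {
  type: string;
  message: string;
  stack: string | undefined;
}

function serializeError(value: unknown): unknown {
  // Pass non-error values through untouched.
  if (!(value instanceof Error)) return value;
  const out: SerializedError = {
    type: value.name,
    message: value.message,
    stack: value.stack,
  };
  return out;
}

const failure = new Error('column "events.search_tsv" does not exist');

// Without a serializer: message and stack are non-enumerable, so they vanish.
const naive = JSON.stringify({ error: failure });

// With a serializer applied to the `error` key, the cause becomes visible.
const serialized = JSON.stringify({ error: serializeError(failure) });
```

Here `naive` carries no trace of the failing column, while `serialized` includes the message and stack.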

This PR adds three guards so the same shape can't recur:

What

  • Boot-time schema-version assertion (utils/schema-version-check.ts). On startup, compare the max version under db/migrations/ with SELECT max(version) FROM public.schema_migrations. If the DB is behind, log the missing version and exit non-zero — the pod fails readiness and Kubernetes refuses to route traffic. Skippable via SKIP_SCHEMA_VERSION_CHECK=1 for emergency rollbacks to an older image whose migrations dir is a strict prefix of what's already applied.
  • pino error-key serializer (utils/logger.ts). Pino's default serializer only fires on the err key. Register pino.stdSerializers.err on the error key too so existing logger.error({ error }, '...') call sites surface real stacks/messages — no repo-wide rename required.
  • Split liveness/readiness. New GET /health/ready does SELECT 1; wired up as the readinessProbe in the chart. Liveness stays on /health (process-up only) so a transient pooler blip drops the pod from the Service endpoint set without restarting it.
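
The schema-version assertion in the first bullet can be pictured as a pure parse-and-compare step. This is a minimal sketch under stated assumptions: the function names, the filename pattern, and the `20260515090000_init.sql` example file are hypothetical, and the real `utils/schema-version-check.ts` may be structured differently.

```typescript
// Illustrative only: names and the filename pattern are assumptions.
type SchemaCheck =
  | { ok: true }
  | { ok: false; expected: string; applied: string | null };

// Extract the leading numeric version from names such as
// "20260516200000_events_search_tsv.sql" and return the largest one.
// Fixed-width timestamp versions sort correctly as plain strings.
function maxMigrationVersion(fileNames: string[]): string | null {
  const versions = fileNames
    .map((name) => /^(\d+)_.*\.sql$/.exec(name)?.[1])
    .filter((v): v is string => v !== undefined)
    .sort();
  const last = versions[versions.length - 1];
  return last ?? null;
}

// The DB is acceptable when it is at or ahead of the image's newest migration.
// "Ahead" covers a rollback to an older image.
function compareSchema(expected: string, applied: string | null): SchemaCheck {
  if (applied !== null && applied >= expected) return { ok: true };
  return { ok: false, expected, applied };
}

const expectedVersion = maxMigrationVersion([
  '20260515090000_init.sql', // hypothetical earlier migration
  '20260516200000_events_search_tsv.sql',
  'README.md', // ignored: not a migration file
]);

// A database that stopped at the earlier migration is reported as behind.
const behind = compareSchema(expectedVersion ?? '', '20260515090000');
```

A caller would log `behind.expected` and exit non-zero when `behind.ok` is false.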

Out of scope (explicit)

  • App pod memory limit. Chart defaults (requests: 512Mi, limits: 2Gi) are already sane. Prod is running with a tighter override (256Mi/1Gi) sourced from a values file outside this repo. Bumping that needs a separate ops-repo PR.
  • Migration job runtime. The actual migration is already applied to prod (manually, with statement_timeout=0). Long-rewrite migration patterns (nullable column + batched backfill + trigger + CREATE INDEX CONCURRENTLY) are a separate hardening task.
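
The batched-backfill idea named in the last bullet splits one long rewrite into many short statements, so that each one stays well under `statement_timeout`. The sketch below shows only the range arithmetic. It assumes an integer primary key, and the batch size is arbitrary. Nothing here is taken from the repository.

```typescript
// Assumes an integer key; the function name and batch size are illustrative.
function batchRanges(
  minId: number,
  maxId: number,
  batchSize: number,
): Array<[number, number]> {
  if (batchSize <= 0) throw new RangeError('batchSize must be positive');
  const ranges: Array<[number, number]> = [];
  for (let start = minId; start <= maxId; start += batchSize) {
    // Inclusive [start, end] range, clamped to maxId for the final batch.
    ranges.push([start, Math.min(start + batchSize - 1, maxId)]);
  }
  return ranges;
}

// Roughly the size of the events table mentioned above, in batches of 10,000.
const ranges = batchRanges(1, 1150000, 10000);
```

Each `[start, end]` pair would drive one short `UPDATE ... WHERE id BETWEEN start AND end`, committed separately.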

Test plan

  • make typecheck clean
  • make build-packages succeeds
  • vitest packages/server/src/utils/__tests__/schema-version-check.test.ts — 8/8 pass
  • Verify in a follow-up rollout that a new image with the schema check refuses to start against a behind DB (induced by holding a migration back locally).

Summary by CodeRabbit

  • New Features

    • Added a readiness endpoint (/health/ready) that verifies DB connectivity.
    • Added a boot-time database schema validation that can block startup on migration mismatches.
  • Bug Fixes

    • Improved error logging so thrown errors include real stack/messages.
  • Tests

    • Added tests covering schema-version checks and related behaviors.
  • Chores

    • Adjusted health probe configuration to separate readiness vs liveness.
    • Updated web subproject pointer to a new commit.



coderabbitai Bot commented May 16, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 86adbc42-47b9-4c84-9458-76270ce3cc5a

📥 Commits

Reviewing files that changed from the base of the PR and between 6e75bb5 and 8c6cf00.

📒 Files selected for processing (7)
  • charts/lobu/templates/deployment.yaml
  • packages/server/src/index.ts
  • packages/server/src/server.ts
  • packages/server/src/utils/__tests__/schema-version-check.test.ts
  • packages/server/src/utils/logger.ts
  • packages/server/src/utils/schema-version-check.ts
  • packages/web

📝 Walkthrough

Walkthrough

This PR splits liveness/readiness probes, adds a boot-time DB schema version check with tests and server integration, and configures the logger to serialize error objects. It also updates the packages/web submodule pointer.

Changes

Health Probes, Schema Validation, and Logging

| Layer / File(s) | Summary |
|---|---|
| Health probe split: liveness vs readiness (charts/lobu/templates/deployment.yaml, packages/server/src/index.ts) | Separates /health (lightweight process liveness) from new /health/ready (performs SELECT 1 for DB readiness). Helm readiness probe updated to /health/ready; liveness stays on /health. |
| Schema version checking: utility, boot-time enforcement, and tests (packages/server/src/utils/schema-version-check.ts, packages/server/src/server.ts, packages/server/src/utils/__tests__/schema-version-check.test.ts) | Adds functions to read expected migration version from filesystem and applied version from public.schema_migrations, compares them, and enforces at startup via assertSchemaUpToDate(getDb(), { migrationsDir }) unless SKIP_SCHEMA_VERSION_CHECK=1. Includes Vitest tests covering parsing, comparison, and environment-dependent behavior. |
| Logger error serialization fix (packages/server/src/utils/logger.ts) | Applies Pino's standard error serializer to both err and error fields so logged errors include stacks/messages. |
| Web submodule pointer update (packages/web) | Updates the git submodule commit pointer for packages/web to a newer hash. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A hop, a check, the health divides,
Ready peeks where DB hides.
Startup guards the schema's gate,
Logs that show the error's state.
Submodule nudged — the rabbit's pleased.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix(reliability): gate boot on schema, surface err, split readiness' accurately summarizes the three main changes: schema boot check, error serializer fix, and readiness/liveness probe split. |
| Description check | ✅ Passed | The PR description is comprehensive, covering why the changes are needed, what was changed, out-of-scope items, and a detailed test plan with checkmarks showing what was run. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 669591dcbf


// utils/schema-version-check.ts for the 2026-05-16 incident this guards.
if (process.env.SKIP_SCHEMA_VERSION_CHECK !== '1') {
  const migrationsDir =
    process.env.LOBU_MIGRATIONS_DIR?.trim() || path.join(PACKAGE_REPO_ROOT, 'db', 'migrations');

[P2] Resolve bundled CLI migrations before gating schema

When lobu run uses the published CLI with an external DATABASE_URL, this default derives from the bundle location as if it were the repo root; for @lobu/cli/dist/server.bundle.mjs it points outside the package, while the CLI build copies migrations to dist/db/migrations (packages/cli/scripts/build.cjs) and start-local.ts already checks that bundle-relative path. Because readExpectedSchemaVersion() treats an unreadable directory as null, the new boot gate silently skips in that self-contained CLI deployment and won't catch a stale external database.
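
One way to picture the suggestion is a lookup that tries several candidate directories in order and reports when none exists, instead of silently settling on a path outside the package. This is a sketch only: `pickMigrationsDir` and the candidate paths are hypothetical, and the existence check is injected so the logic stays testable (in Node it could be `fs.existsSync`).

```typescript
// Hypothetical helper: returns the first candidate directory that exists,
// or null so the caller can decide whether to fail the boot gate.
function pickMigrationsDir(
  candidates: string[],
  exists: (dir: string) => boolean,
): string | null {
  for (const dir of candidates) {
    if (exists(dir)) return dir;
  }
  return null;
}

// Illustrative candidate order: bundle-relative copy first, repo root second.
const candidates = ['/pkg/dist/db/migrations', '/repo/db/migrations'];

// Simulate a published CLI install where only the bundled copy is present.
const present = new Set(['/pkg/dist/db/migrations']);
const chosen = pickMigrationsDir(candidates, (dir) => present.has(dir));
```

Returning `null` rather than a default path keeps the "directory not found" case explicit for whatever policy the boot gate applies.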



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/index.ts`:
- Around line 413-416: The readiness route's catch block currently returns
internal DB error details via c.json using errorMessage(error); instead, log the
full error server-side (e.g., with your existing logger or console.error) and
change the response to a generic readiness payload (e.g., { status: 'unready',
service: 'lobu-api' } with 503) so no internal errors are exposed; update the
catch in the readiness handler to call logger.error(error) and remove
errorMessage(error) from the c.json response.

In `@packages/server/src/utils/schema-version-check.ts`:
- Around line 36-42: The current try/catch around reading migrationsDir (the
block that logs with logger.warn and returns null) causes unreadable migrations
to be treated as "ok" later; instead, update the schema-version check so that
failure to read the migrations directory surfaces as an error that fails the
boot gate (do not return null). Locate the function that reads migrationsDir and
returns the schema version (referenced by migrationsDir, logger.warn and the
caller that currently treats null as ok) and change the catch to either throw a
descriptive Error or return an explicit failure result (e.g., throw new Error or
return { ok: false, reason: ... }) so callers (the assert/check function) will
abort startup rather than logging "schema version up to date."

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8c9c4ccf-08c2-45b2-b379-005f5e5d4231

📥 Commits

Reviewing files that changed from the base of the PR and between ca15df7 and 669591d.

📒 Files selected for processing (6)
  • charts/lobu/templates/deployment.yaml
  • packages/server/src/index.ts
  • packages/server/src/server.ts
  • packages/server/src/utils/__tests__/schema-version-check.test.ts
  • packages/server/src/utils/logger.ts
  • packages/server/src/utils/schema-version-check.ts

Comment on lines +413 to +416
  } catch (error) {
    return c.json(
      { status: 'unready', service: 'lobu-api', error: errorMessage(error) },
      503

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t return raw DB errors from readiness responses.

Line 415 exposes internal database error details from a public unauthenticated route. Return a generic readiness payload and log the error server-side instead.

Proposed fix
 app.get('/health/ready', async (c) => {
   try {
     const sql = getDb();
     await sql`SELECT 1`;
     return c.json({ status: 'ok', service: 'lobu-api' });
   } catch (error) {
+    logger.warn({ err: error }, '[health/ready] database check failed');
     return c.json(
-      { status: 'unready', service: 'lobu-api', error: errorMessage(error) },
+      { status: 'unready', service: 'lobu-api' },
       503
     );
   }
 });
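
The same idea can be expressed without the web framework, which makes the "log internally, respond generically" split easy to test. This is a hedged sketch: `checkReadiness`, `ping`, and `logFailure` are hypothetical names, with `ping` standing in for the `SELECT 1` query and `logFailure` for the server-side logger.

```typescript
interface ReadinessResult {
  status: 200 | 503;
  body: { status: 'ok' | 'unready'; service: string };
}

// Hypothetical framework-free version of the readiness handler.
async function checkReadiness(
  ping: () => Promise<unknown>,
  logFailure: (error: unknown) => void,
): Promise<ReadinessResult> {
  try {
    await ping();
    return { status: 200, body: { status: 'ok', service: 'lobu-api' } };
  } catch (error) {
    // Error details go to server logs only, never into the response body.
    logFailure(error);
    return { status: 503, body: { status: 'unready', service: 'lobu-api' } };
  }
}

// Simulated database failure: the caller sees a generic 503 payload.
const loggedErrors: unknown[] = [];
const unreadyResult = checkReadiness(
  () => Promise.reject(new Error('connection refused')),
  (error) => {
    loggedErrors.push(error);
  },
);
```

A route handler would await the result and pass `body` and `status` to the framework's JSON response helper.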

Comment thread packages/server/src/utils/schema-version-check.ts

buremba commented May 16, 2026

pi review — addressed

Ran `pi -p` on the PR. Three findings, two fixed in this push, one acknowledged as out-of-scope.

1. Prod chart in submodule wasn't touched — fixed.
The chart at `packages/web/deploy/charts/lobu/` is what Flux deploys (per `deploy/k8s/apps/lobu/base/helmrelease.yaml`); the parent-repo `charts/lobu/` is the public-install chart. Mirrored the readiness-probe split in owletto-web#135 and bumped the submodule pointer (6e75bb5). Merge order: owletto-web#135 first, then this PR. Once owletto-web lands, the parent pointer is already at the new submodule SHA so no further bump is needed.

2. Schema gate failed open if migrations dir was missing — fixed.
`assertSchemaUpToDate` now throws when `NODE_ENV=production` and the migrations directory is empty/unreadable (e.g. Dockerfile copy regression, wrong volume mount). Dev still degrades to a warning so worktrees without `db/` checked out boot. New tests cover both branches.

3. `strategy: Recreate` means a failing schema gate still results in no-service rather than "old version keeps serving" — acknowledged, follow-up.
`charts/lobu/templates/deployment.yaml:14-16` (and the prod chart mirror) sets `strategy: Recreate` whenever `app.workspaces.enabled` is true, because the workspaces PVC is RWO. So with this PR: new pod refuses to boot on schema mismatch ⇒ no Ready endpoints, but the old pod has already been deleted. Strictly better than the pre-PR state (silent queue throws + OOM + cryptic logs) but doesn't yet deliver the "keep serving the old version" promise. Proper fix requires either an RWX storage class for workspaces or splitting the workspaces volume out of the main Deployment (StatefulSet per replica or sidecar mount). Out of scope here; opening as a separate task.
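
The fail-closed behaviour described in item 2 could be sketched as a small policy function. This is an assumption-laden illustration, not the merged code: `requireExpectedVersion` is a hypothetical name, and the real `assertSchemaUpToDate` may organize the check differently.

```typescript
// Hypothetical policy function. `expected` is null when the migrations
// directory was empty or unreadable.
function requireExpectedVersion(
  expected: string | null,
  nodeEnv: string | undefined,
  warn: (message: string) => void,
): string | null {
  if (expected !== null) return expected;
  if (nodeEnv === 'production') {
    // Fail closed: a missing directory in prod must not disable the guard.
    throw new Error('migrations directory is empty or unreadable; refusing to boot');
  }
  // Dev degrades to a warning so worktrees without db/ can still boot.
  warn('migrations directory not found; skipping schema version check');
  return null;
}

const warnings: string[] = [];
const devResult = requireExpectedVersion(null, 'development', (m) => {
  warnings.push(m);
});
```

In production the same call throws, so the process exits before it can report a schema as up to date.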


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/utils/__tests__/schema-version-check.test.ts`:
- Around line 101-104: The afterEach cleanup currently restores NODE_ENV by
assignment which turns undefined into the string "undefined"; update the
afterEach block that references afterEach, rmSync, dir, process.env.NODE_ENV and
originalNodeEnv so that if originalNodeEnv is undefined you use the delete
operator to remove process.env.NODE_ENV, otherwise assign process.env.NODE_ENV =
originalNodeEnv; keep the existing rmSync(dir, { recursive: true, force: true })
call and ensure no other changes to the test teardown.

In `@packages/web`:
- Line 1: The packages/web submodule is pinned to a SHA that is not reachable
from owletto-web/main; update the submodule pointer so it references a commit
present on owletto-web/main. To fix: fetch or view owletto-web/main to identify
a reachable commit (preferably the latest merged one), update the packages/web
submodule reference in the repo to that commit, stage and commit the submodule
change, and push the branch so the pipeline/FluxCD drift gate can validate
against owletto-web/main.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a07d2bf4-12c0-42d1-a5f7-7a5de11a2f7d

📥 Commits

Reviewing files that changed from the base of the PR and between 669591d and 6e75bb5.

📒 Files selected for processing (3)
  • packages/server/src/utils/__tests__/schema-version-check.test.ts
  • packages/server/src/utils/schema-version-check.ts
  • packages/web

Comment on lines +101 to +104
afterEach(() => {
  rmSync(dir, { recursive: true, force: true });
  process.env.NODE_ENV = originalNodeEnv;
});

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In Node.js, does assigning process.env.NODE_ENV = undefined delete the env var, or does it persist as a string value?

💡 Result:

In Node.js, assigning undefined to a property on process.env does not delete the environment variable; instead, it converts the value to the string "undefined" [1][2][3]. Because process.env implicitly converts assigned values to strings, the expression process.env.NODE_ENV = undefined results in the environment variable being set to the literal string "undefined" [1][4][5].

To correctly remove an environment variable from the process.env object, you must use the delete operator [1][4][5]:

delete process.env.NODE_ENV;

Using delete removes the property from the process.env object entirely, which is the intended way to unset an environment variable within a Node.js process [2][3][5].


Restore NODE_ENV using delete when originally unset.

Line 103 leaks test state: assigning undefined to process.env.NODE_ENV converts it to the string "undefined" instead of deleting it. Use the delete operator when the original value is undefined.

Proposed fix
   afterEach(() => {
     rmSync(dir, { recursive: true, force: true });
-    process.env.NODE_ENV = originalNodeEnv;
+    if (originalNodeEnv === undefined) {
+      delete process.env.NODE_ENV;
+    } else {
+      process.env.NODE_ENV = originalNodeEnv;
+    }
   });
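
If several tests need the same teardown, the restore logic could be pulled into a small helper. The sketch below is hypothetical (`restoreEnvVar` is not part of the PR) and takes the environment object as a parameter so it can be exercised with a plain record. Note that the string coercion described above is specific to `process.env`, and a plain object would simply store `undefined`.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical test helper: restores `key` to `original`,
// deleting the key when it was originally unset.
function restoreEnvVar(env: Env, key: string, original: string | undefined): void {
  if (original === undefined) {
    delete env[key];
  } else {
    env[key] = original;
  }
}

// In a test teardown this would be called as:
//   restoreEnvVar(process.env, 'NODE_ENV', originalNodeEnv);
const fakeEnv: Env = { NODE_ENV: 'production' };
restoreEnvVar(fakeEnv, 'NODE_ENV', undefined);
```

After the call, `NODE_ENV` is absent from `fakeEnv` rather than present with a placeholder value.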

Comment thread packages/web
buremba added 2 commits May 16, 2026 18:35
Three guards added after the 2026-05-16 prod outage. The chain that day:
the pre-upgrade migration Job for `20260516200000_events_search_tsv.sql`
timed out at the DB's 60s statement_timeout (1.15M-row events rewrite
under ACCESS EXCLUSIVE); the Job exited non-zero but the Deployment
rolled forward anyway. The new image queried `current_event_records`
expecting `e.search_tsv` to exist; every queue tick threw; pino logged
`error: {}` so the real cause stayed invisible; the pod OOM'd at 1Gi and
CrashLoopBackOff'd until the migration was applied manually. Mac Bridge
users saw Cloudflare 502s during each restart window.

- Boot-time schema-version assertion (utils/schema-version-check.ts):
  on startup, compare the max version under `db/migrations/` with
  `SELECT max(version) FROM schema_migrations`. If the DB is behind,
  log the missing version and exit non-zero so the pod fails readiness
  and Kubernetes refuses to route traffic. Skippable for emergency
  rollbacks via `SKIP_SCHEMA_VERSION_CHECK=1`. Unit tests cover the
  pure parse + compare paths.

- pino `error`-key serializer (utils/logger.ts): pino's default `err`
  serializer only fires on the `err` key, so `logger.error({ error },
  '...')` silently logged `error: {}`. Register the same serializer
  on `error` so existing call sites surface the real stack/message
  without a repo-wide rename.

- Split liveness/readiness: new GET `/health/ready` does `SELECT 1`
  and is wired up as the readinessProbe in the Helm chart. Liveness
  stays on `/health` (process-up only), so a transient DB blip drops
  the pod from the Service endpoint set without restarting it.

Out of scope: app pod memory limit. Chart defaults (`requests: 512Mi,
limits: 2Gi`) are already sane; prod is running with a tighter override
(`256Mi`/`1Gi`) sourced from a values file outside this repo. Bumping
that needs an ops-repo PR.
…chart

- schema-version-check: fail closed when migrations dir is missing in
  NODE_ENV=production. The original implementation degraded a missing
  /app/db/migrations to a warning, which would silently disable the
  postmortem guard if the Dockerfile copy regressed. Dev still degrades
  gracefully so worktrees without db/ checked out can boot.
- Bump packages/web submodule to pull in owletto-web#135, which mirrors
  the readiness-probe split in the prod chart (Flux source). Parent-repo
  chart edit alone wouldn't have moved prod off process-only readiness.
- Tests cover both fail-closed branches and the existing pure paths.
@buremba buremba force-pushed the fix/post-incident-reliability branch from 6e75bb5 to 8c6cf00 May 16, 2026 17:35
@buremba buremba merged commit ca4ba0e into main May 16, 2026
16 of 19 checks passed
@buremba buremba deleted the fix/post-incident-reliability branch May 16, 2026 17:35