fix(reliability): gate boot on schema, surface err, split readiness#767
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughThis PR splits liveness/readiness probes, adds a boot-time DB schema version check with tests and server integration, and configures the logger to serialize error objects. It also updates the ChangesHealth Probes, Schema Validation, and Logging
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 669591dcbf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // utils/schema-version-check.ts for the 2026-05-16 incident this guards. | ||
| if (process.env.SKIP_SCHEMA_VERSION_CHECK !== '1') { | ||
| const migrationsDir = | ||
| process.env.LOBU_MIGRATIONS_DIR?.trim() || path.join(PACKAGE_REPO_ROOT, 'db', 'migrations'); |
There was a problem hiding this comment.
Resolve bundled CLI migrations before gating schema
When lobu run uses the published CLI with an external DATABASE_URL, this default derives from the bundle location as if it were the repo root; for @lobu/cli/dist/server.bundle.mjs it points outside the package, while the CLI build copies migrations to dist/db/migrations (packages/cli/scripts/build.cjs) and start-local.ts already checks that bundle-relative path. Because readExpectedSchemaVersion() treats an unreadable directory as null, the new boot gate silently skips in that self-contained CLI deployment and won't catch a stale external database.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/index.ts`:
- Around line 413-416: The readiness route's catch block currently returns
internal DB error details via c.json using errorMessage(error); instead, log the
full error server-side (e.g., with your existing logger or console.error) and
change the response to a generic readiness payload (e.g., { status: 'unready',
service: 'lobu-api' } with 503) so no internal errors are exposed; update the
catch in the readiness handler to call logger.error(error) and remove
errorMessage(error) from the c.json response.
In `@packages/server/src/utils/schema-version-check.ts`:
- Around line 36-42: The current try/catch around reading migrationsDir (the
block that logs with logger.warn and returns null) causes unreadable migrations
to be treated as "ok" later; instead, update the schema-version check so that
failure to read the migrations directory surfaces as an error that fails the
boot gate (do not return null). Locate the function that reads migrationsDir and
returns the schema version (referenced by migrationsDir, logger.warn and the
caller that currently treats null as ok) and change the catch to either throw a
descriptive Error or return an explicit failure result (e.g., throw new Error or
return { ok: false, reason: ... }) so callers (the assert/check function) will
abort startup rather than logging "schema version up to date."
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 8c9c4ccf-08c2-45b2-b379-005f5e5d4231
📒 Files selected for processing (6)
charts/lobu/templates/deployment.yamlpackages/server/src/index.tspackages/server/src/server.tspackages/server/src/utils/__tests__/schema-version-check.test.tspackages/server/src/utils/logger.tspackages/server/src/utils/schema-version-check.ts
| } catch (error) { | ||
| return c.json( | ||
| { status: 'unready', service: 'lobu-api', error: errorMessage(error) }, | ||
| 503 |
There was a problem hiding this comment.
Don’t return raw DB errors from readiness responses.
Line 415 exposes internal database error details from a public unauthenticated route. Return a generic readiness payload and log the error server-side instead.
Proposed fix
app.get('/health/ready', async (c) => {
try {
const sql = getDb();
await sql`SELECT 1`;
return c.json({ status: 'ok', service: 'lobu-api' });
} catch (error) {
+ logger.warn({ err: error }, '[health/ready] database check failed');
return c.json(
- { status: 'unready', service: 'lobu-api', error: errorMessage(error) },
+ { status: 'unready', service: 'lobu-api' },
503
);
}
});🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/index.ts` around lines 413 - 416, The readiness route's
catch block currently returns internal DB error details via c.json using
errorMessage(error); instead, log the full error server-side (e.g., with your
existing logger or console.error) and change the response to a generic readiness
payload (e.g., { status: 'unready', service: 'lobu-api' } with 503) so no
internal errors are exposed; update the catch in the readiness handler to call
logger.error(error) and remove errorMessage(error) from the c.json response.
pi review — addressedRan `pi -p` on the PR. Three findings, two fixed in this push, one acknowledged as out-of-scope. 1. Prod chart in submodule wasn't touched — fixed. 2. Schema gate failed open if migrations dir was missing — fixed. 3. `strategy: Recreate` means a failing schema gate still results in no-service rather than "old version keeps serving" — acknowledged, follow-up. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/utils/__tests__/schema-version-check.test.ts`:
- Around line 101-104: The afterEach cleanup currently restores NODE_ENV by
assignment which turns undefined into the string "undefined"; update the
afterEach block that references afterEach, rmSync, dir, process.env.NODE_ENV and
originalNodeEnv so that if originalNodeEnv is undefined you use the delete
operator to remove process.env.NODE_ENV, otherwise assign process.env.NODE_ENV =
originalNodeEnv; keep the existing rmSync(dir, { recursive: true, force: true })
call and ensure no other changes to the test teardown.
In `@packages/web`:
- Line 1: The packages/web submodule is pinned to a SHA that is not reachable
from owletto-web/main; update the submodule pointer so it references a commit
present on owletto-web/main. To fix: fetch or view owletto-web/main to identify
a reachable commit (preferably the latest merged one), update the packages/web
submodule reference in the repo to that commit, stage and commit the submodule
change, and push the branch so the pipeline/FluxCD drift gate can validate
against owletto-web/main.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: a07d2bf4-12c0-42d1-a5f7-7a5de11a2f7d
📒 Files selected for processing (3)
packages/server/src/utils/__tests__/schema-version-check.test.tspackages/server/src/utils/schema-version-check.tspackages/web
| afterEach(() => { | ||
| rmSync(dir, { recursive: true, force: true }); | ||
| process.env.NODE_ENV = originalNodeEnv; | ||
| }); |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In Node.js, does assigning process.env.NODE_ENV = undefined delete the env var, or does it persist as a string value?
💡 Result:
In Node.js, assigning undefined to a property on process.env does not delete the environment variable; instead, it converts the value to the string "undefined" [1][2][3]. Because process.env implicitly converts assigned values to strings, the expression process.env.NODE_ENV = undefined results in the environment variable being set to the literal string "undefined" [1][4][5]. To correctly remove an environment variable from the process.env object, you must use the delete operator [1][4][5]: delete process.env.NODE_ENV; Using delete removes the property from the process.env object entirely, which is the intended way to unset an environment variable within a Node.js process [2][3][5].
Citations:
- 1: Setting
undefinedtoprocess.env.*sets a"undefined"string value nodejs/node#48875 - 2: https://stackoverflow.com/questions/43314209/node-js-unset-environment-variable
- 3: https://stackoverflow.com/questions/43314209/node-js-unset-environment-variable/60147167
- 4: https://nodejs.org/api/process.md
- 5: https://nodejs.org/api/process.html
Restore NODE_ENV using delete when originally unset.
Line 103 leaks test state: assigning undefined to process.env.NODE_ENV converts it to the string "undefined" instead of deleting it. Use the delete operator when the original value is undefined.
Proposed fix
afterEach(() => {
rmSync(dir, { recursive: true, force: true });
- process.env.NODE_ENV = originalNodeEnv;
+ if (originalNodeEnv === undefined) {
+ delete process.env.NODE_ENV;
+ } else {
+ process.env.NODE_ENV = originalNodeEnv;
+ }
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| afterEach(() => { | |
| rmSync(dir, { recursive: true, force: true }); | |
| process.env.NODE_ENV = originalNodeEnv; | |
| }); | |
| afterEach(() => { | |
| rmSync(dir, { recursive: true, force: true }); | |
| if (originalNodeEnv === undefined) { | |
| delete process.env.NODE_ENV; | |
| } else { | |
| process.env.NODE_ENV = originalNodeEnv; | |
| } | |
| }); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/utils/__tests__/schema-version-check.test.ts` around
lines 101 - 104, The afterEach cleanup currently restores NODE_ENV by assignment
which turns undefined into the string "undefined"; update the afterEach block
that references afterEach, rmSync, dir, process.env.NODE_ENV and originalNodeEnv
so that if originalNodeEnv is undefined you use the delete operator to remove
process.env.NODE_ENV, otherwise assign process.env.NODE_ENV = originalNodeEnv;
keep the existing rmSync(dir, { recursive: true, force: true }) call and ensure
no other changes to the test teardown.
Three guards added after the 2026-05-16 prod outage. The chain that day:
the pre-upgrade migration Job for `20260516200000_events_search_tsv.sql`
timed out at the DB's 60s statement_timeout (1.15M-row events rewrite
under ACCESS EXCLUSIVE); the Job exited non-zero but the Deployment
rolled forward anyway. The new image queried `current_event_records`
expecting `e.search_tsv` to exist; every queue tick threw; pino logged
`error: {}` so the real cause stayed invisible; the pod OOM'd at 1Gi and
CrashLoopBackOff'd until the migration was applied manually. Mac Bridge
users saw Cloudflare 502s during each restart window.
- Boot-time schema-version assertion (utils/schema-version-check.ts):
on startup, compare the max version under `db/migrations/` with
`SELECT max(version) FROM schema_migrations`. If the DB is behind,
log the missing version and exit non-zero so the pod fails readiness
and Kubernetes refuses to route traffic. Skippable for emergency
rollbacks via `SKIP_SCHEMA_VERSION_CHECK=1`. Unit tests cover the
pure parse + compare paths.
- pino `error`-key serializer (utils/logger.ts): pino's default `err`
serializer only fires on the `err` key, so `logger.error({ error },
'...')` silently logged `error: {}`. Register the same serializer
on `error` so existing call sites surface the real stack/message
without a repo-wide rename.
- Split liveness/readiness: new GET `/health/ready` does `SELECT 1`
and is wired up as the readinessProbe in the Helm chart. Liveness
stays on `/health` (process-up only), so a transient DB blip drops
the pod from the Service endpoint set without restarting it.
Out of scope: app pod memory limit. Chart defaults (`requests: 512Mi,
limits: 2Gi`) are already sane; prod is running with a tighter override
(`256Mi`/`1Gi`) sourced from a values file outside this repo. Bumping
that needs an ops-repo PR.
…chart - schema-version-check: fail closed when migrations dir is missing in NODE_ENV=production. The original implementation degraded a missing /app/db/migrations to a warning, which would silently disable the postmortem guard if the Dockerfile copy regressed. Dev still degrades gracefully so worktrees without db/ checked out can boot. - Bump packages/web submodule to pull in owletto-web#135, which mirrors the readiness-probe split in the prod chart (Flux source). Parent-repo chart edit alone wouldn't have moved prod off process-only readiness. - Tests cover both fail-closed branches and the existing pure paths.
6e75bb5 to
8c6cf00
Compare
Why
Postmortem for the 2026-05-16 prod outage. The chain that day:
20260516200000_events_search_tsv.sqltimed out at the DB's 60sstatement_timeout(1.15M-roweventsrewrite under ACCESS EXCLUSIVE; the migration's own operational note flagged this).lobu-app:20260515-181119, which queriedcurrent_event_recordsexpectinge.search_tsvto exist.CheckDueFeedstick failed; pino's defaulterrserializer doesn't fire on the{ error }key, so logs showed onlyerror: {}and hid the real cause (column "events.search_tsv" does not exist).This PR adds three guards so the same shape can't recur:
What
utils/schema-version-check.ts). On startup, compare the max version underdb/migrations/withSELECT max(version) FROM public.schema_migrations. If the DB is behind, log the missing version and exit non-zero — the pod fails readiness and Kubernetes refuses to route traffic. Skippable viaSKIP_SCHEMA_VERSION_CHECK=1for emergency rollbacks to an older image whose migrations dir is a strict prefix of what's already applied.error-key serializer (utils/logger.ts). Pino's default serializer only fires on theerrkey. Registerpino.stdSerializers.erron theerrorkey too so existinglogger.error({ error }, '...')call sites surface real stacks/messages — no repo-wide rename required.GET /health/readydoesSELECT 1; wired up as the readinessProbe in the chart. Liveness stays on/health(process-up only) so a transient pooler blip drops the pod from the Service endpoint set without restarting it.Out of scope (explicit)
requests: 512Mi,limits: 2Gi) are already sane. Prod is running with a tighter override (256Mi/1Gi) sourced from a values file outside this repo. Bumping that needs a separate ops-repo PR.statement_timeout=0). Long-rewrite migration patterns (nullable column + batched backfill + trigger +CREATE INDEX CONCURRENTLY) are a separate hardening task.Test plan
make typecheckcleanmake build-packagessucceedsvitest packages/server/src/utils/__tests__/schema-version-check.test.ts— 8/8 passSummary by CodeRabbit
New Features
Bug Fixes
Tests
Chores