docs: safe-migration patterns by buremba · Pull Request #769 · lobu-ai/lobu

buremba · 2026-05-16T17:06:34Z

Why

Codifies what we wish #765 had followed before it timed out under the 60s `statement_timeout` and triggered the 2026-05-16 outage. Pairs with the boot-time schema-version assertion in #767 — the doc keeps migrations from timing out, the gate keeps the app from rolling forward when one does.

What's in it

The four DDL shapes that bite on a hot table:
- `GENERATED STORED` columns — table rewrite under ACCESS EXCLUSIVE. The exact shape perf(events): stored fulltext column + lifecycle partial index #765 used.
- `CREATE INDEX` — non-CONCURRENTLY takes ACCESS EXCLUSIVE.
- `SET NOT NULL` — full-table scan unless you stage via NOT VALID CHECK.
- `ON DELETE SET NULL / CASCADE` on small parent → wide dependent — `events_connection_id_fkey` is doing exactly this; pg_stat_statements shows 13.4s per call.
For each: the unsafe shape, the safe shape, and a working SQL example.
Operational checklist for "is this migration safe to ship?"
"When dbmate fails in prod" runbook — the recipe we ran manually on 2026-05-16.

Out of scope (deferred to follow-ups)

While measuring for this PR I dug into the connections-list query (565ms mean × 319 calls in pg_stat_statements). The cost is entirely in the per-row `event_count` correlated sub-select going through `current_event_records`: each event_count touches `idx_events_connection_id` for ~53k rows + anti-joins against `idx_events_superseded_by` for ~690k probes. 48% of `events` rows are tombstones (552k / 1.15M with `supersedes_event_id IS NOT NULL`), which is what makes the anti-join expensive.

A naive CTE rewrite that computes counts once for the page is actually slower (1.57s vs 1.30s) — the planner does the same anti-join either way. The real fix is a denormalized `is_superseded` boolean on `events` maintained by trigger, with a partial index. That's the same shape of change as the embed-backfill state column in PR 3, and will be folded in there or a sibling PR.

Test plan

`make typecheck` clean (no code changed)
`make build-packages` builds

Summary by CodeRabbit

Documentation
- Added migration guidance covering rules for writing migrations, safety cautions for PostgreSQL operations that can rewrite tables or rebuild indexes, an operational checklist for large migrations (timing and timeout considerations), and troubleshooting/recovery procedures to diagnose and manually recover failed migrations, plus references to related incident learnings and follow-ups.

Codifies what we wish PR #765 had followed before it timed out under the 60s statement_timeout and triggered the 2026-05-16 outage. Covers the four DDL shapes that bite on a hot table: * GENERATED STORED columns — table rewrite under ACCESS EXCLUSIVE. Safe pattern: nullable column + trigger + batched backfill outside the Helm hook. * CREATE INDEX — switch to CONCURRENTLY (transaction:false migration). * SET NOT NULL — add a NOT VALID CHECK first, VALIDATE separately, then SET NOT NULL becomes O(1). * ON DELETE SET NULL / CASCADE on small parent → wide dependent — cascade UPDATE blocks under exclusive lock (events.connection_id_fkey takes 13s per call in pg_stat_statements). Plus an operational checklist for sizing the migration locally before opening the PR, and a "when dbmate fails in prod" runbook that's the recipe we ran on 2026-05-16. References the boot-time schema-version assertion from lobu#767 — the two complement each other: the doc keeps migrations from timing out, the gate keeps the app from rolling forward when one does.

coderabbitai · 2026-05-16T17:06:46Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0ee52a1f-508e-4572-9a1d-e16ef654988b

📥 Commits

Reviewing files that changed from the base of the PR and between d25a4b4 and 5d4819e.

📒 Files selected for processing (1)

docs/MIGRATIONS.md

✅ Files skipped from review due to trivial changes (1)

docs/MIGRATIONS.md

📝 Walkthrough

Walkthrough

Adds docs/MIGRATIONS.md, documenting dbmate migration rules, PostgreSQL risk patterns with safe practices, a pre-deploy checklist for large migrations, and step‑by‑step production failure diagnosis and manual recovery procedures.

Changes

Migration Operational Guidance

Layer / File(s)	Summary
Migration rules and framework `docs/MIGRATIONS.md`	Introduces the migrations doc introduction and three core "Always" guidelines: read actual Postgres DDL behavior, document measured operational cost, and keep up-side operations idempotent where practical.
PostgreSQL risk patterns and safe practices `docs/MIGRATIONS.md`	Details PostgreSQL operations that can rewrite rows or rebuild indexes (generated stored columns, large CREATE INDEX, SET NOT NULL, cascading operations, DROP INDEX) and provides safer patterns (CONCURRENTLY, NOT VALID/VALIDATE, batching, indexing).
Operational checklist and production recovery `docs/MIGRATIONS.md`	Adds a checklist for big migrations (timing against prod-sized data, statement_timeout comparison, recovery plan, boot-time schema checks) and stepwise recovery instructions for dbmate failures, including SQL steps for normal and CONCURRENT/transaction:false cases and deployment restart guidance.

🐰 I hopped through code and schema plains,
I penned the rules to dodge migration pains.
Check timings, build indexes slow and steady,
Keep recovery steps clear — stay calm, be ready.
Now deploys can dance while I munch a carrot. 🥕

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'docs: safe-migration patterns' is concise and clearly describes the main change—adding documentation for safe database migration patterns—which aligns with the PR's primary objective.
Description check	✅ Passed	The PR description provides comprehensive context (Why, What's in it, Out of scope, Test plan), but deviates from the repository template by omitting the Summary section structure and test checkbox format specified in the template.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/sql-hot-paths

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d25a4b4d6d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-16T17:07:40Z

+
+```sh
+psql "$DB" -v ON_ERROR_STOP=1 <<'SQL'
+BEGIN;


Split recovery for transaction:false migrations

When the failed migration is one of the CREATE INDEX CONCURRENTLY/DROP INDEX CONCURRENTLY migrations recommended above, pasting its migrate:up section inside this BEGIN block will fail because PostgreSQL rejects concurrent index operations inside a transaction block. That makes the prod recovery recipe unusable for exactly the large-index migrations this document tells authors to write; call out that transaction:false migrations must be replayed without BEGIN/COMMIT and then record the schema version separately.

Useful? React with 👍 / 👎.

codecov-commenter · 2026-05-16T17:08:03Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/MIGRATIONS.md`:
- Around line 126-136: Update the recovery documentation to show two explicit
recipes: for transactional migrations use the existing BEGIN/COMMIT block (keep
SET LOCAL statement_timeout = 0; SET LOCAL lock_timeout = 0; run the migrate:up
SQL, then INSERT INTO public.schema_migrations(version) VALUES
('<UTC-yyyymmddHHMMSS>'); COMMIT) and for non-transactional (transaction:false)
migrations provide a separate example that omits BEGIN/COMMIT and uses
session-level SET (e.g. SET statement_timeout = 0; SET lock_timeout = 0;) before
running the migrate:up SQL and then INSERT the schema_migrations row afterward;
reference the same migrate:up section and the schema_migrations insert in both
examples so readers know where to paste their SQL and how to record the applied
version.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 56532a67-2179-4dca-abdb-9cfc69b92ed8

📥 Commits

Reviewing files that changed from the base of the PR and between 71e9b0f and d25a4b4.

📒 Files selected for processing (1)

docs/MIGRATIONS.md

Four corrections from pi on #769: * CREATE INDEX without CONCURRENTLY takes a SHARE lock (blocks writes but not reads), not ACCESS EXCLUSIVE. Fixed the lock description. * Added the CONCURRENTLY + IF NOT EXISTS trap: a failed concurrent build leaves an invalid index with the same name, and subsequent IF NOT EXISTS skips the rebuild — silent half-built index. Doc now tells engineers to check pg_index.indisvalid and DROP CONCURRENTLY before retrying. * Recovery runbook now has two paths: standard migration (one txn with timeouts lifted) and transaction:false migration (session-level SET, statements outside BEGIN, separate tiny txn to insert the schema_migrations row). The previous single-path recipe would have errored on any CONCURRENTLY statement. * Cascade guidance now explicitly calls out indexing the child FK column (otherwise the cascade falls back to a seq scan) and sequencing the parent delete after dependents are already nulled.

buremba · 2026-05-16T17:09:35Z

pi review — addressed

Four corrections in 5d4819e:

CREATE INDEX lock description was wrong. Plain CREATE INDEX takes a SHARE lock (blocks writes, not reads), not ACCESS EXCLUSIVE. Fixed.
CONCURRENTLY + IF NOT EXISTS trap added. Failed concurrent build leaves an invalid index with the same name → next IF NOT EXISTS skips the rebuild → silent half-built index the planner refuses to read. Doc now tells engineers to check `pg_index.indisvalid` and `DROP INDEX CONCURRENTLY` before retrying.
Recovery runbook split into two paths. Standard migration: one txn with timeouts lifted. `transaction:false` migration (any CONCURRENTLY DDL): session-level SET, statements outside BEGIN, separate tiny txn to insert the `schema_migrations` row. The previous single-path recipe would have errored on the first CONCURRENTLY line.
Cascade guidance expanded. Indexing the child FK column comes first (otherwise the cascade falls back to a seq scan and the UPDATE goes from 13s to minutes); batched nulling before the parent delete is what keeps the API responsive even with the index in place.

chatgpt-codex-connector Bot reviewed May 16, 2026

View reviewed changes

coderabbitai Bot reviewed May 16, 2026

View reviewed changes

Comment thread docs/MIGRATIONS.md Outdated

buremba mentioned this pull request May 16, 2026

perf(events): has_embedding column + maintenance triggers #770

Closed

buremba merged commit 24603de into main May 16, 2026
21 of 22 checks passed

buremba deleted the perf/sql-hot-paths branch May 16, 2026 17:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: safe-migration patterns#769

docs: safe-migration patterns#769
buremba merged 2 commits into
mainfrom
perf/sql-hot-paths

buremba commented May 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 16, 2026 •

edited

Loading

Walkthrough

Changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Uh oh!

codecov-commenter commented May 16, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

buremba commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

buremba commented May 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What's in it

Out of scope (deferred to follow-ups)

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 16, 2026

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

buremba commented May 16, 2026

pi review — addressed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

buremba commented May 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 16, 2026 •

edited

Loading