docs: safe-migration patterns #769
Codifies what we wish PR #765 had followed before it timed out under the 60s statement_timeout and triggered the 2026-05-16 outage. Covers the four DDL shapes that bite on a hot table:

* GENERATED STORED columns: table rewrite under ACCESS EXCLUSIVE. Safe pattern: nullable column + trigger + batched backfill outside the Helm hook.
* CREATE INDEX: switch to CONCURRENTLY (transaction:false migration).
* SET NOT NULL: add a NOT VALID CHECK first, VALIDATE separately, then SET NOT NULL becomes O(1).
* ON DELETE SET NULL / CASCADE on small parent → wide dependent: cascade UPDATE blocks under exclusive lock (events.connection_id_fkey takes 13s per call in pg_stat_statements).

Plus an operational checklist for sizing the migration locally before opening the PR, and a "when dbmate fails in prod" runbook, which is the recipe we ran on 2026-05-16. References the boot-time schema-version assertion from lobu#767. The two complement each other: the doc keeps migrations from timing out, the gate keeps the app from rolling forward when one does.
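The SET NOT NULL item above can be sketched as follows. This is a minimal illustration, not the text of docs/MIGRATIONS.md: the table `widgets`, the column `owner_id`, and the constraint name are hypothetical, and the scan-skipping behaviour in step 3 assumes PostgreSQL 12 or later.

```sql
-- Hypothetical names throughout: widgets, owner_id, widgets_owner_id_not_null.

-- 1. Add the check without validating existing rows. This takes a brief
--    ACCESS EXCLUSIVE lock but does not scan the table.
ALTER TABLE widgets
  ADD CONSTRAINT widgets_owner_id_not_null
  CHECK (owner_id IS NOT NULL) NOT VALID;

-- 2. In a separate migration, validate. This scans the table under
--    SHARE UPDATE EXCLUSIVE, so normal reads and writes continue.
ALTER TABLE widgets VALIDATE CONSTRAINT widgets_owner_id_not_null;

-- 3. With a valid CHECK proving no NULLs exist, PostgreSQL 12+ skips the
--    full-table scan that SET NOT NULL would otherwise need.
ALTER TABLE widgets ALTER COLUMN owner_id SET NOT NULL;

-- 4. Optional: the CHECK is now redundant and can be dropped.
ALTER TABLE widgets DROP CONSTRAINT widgets_owner_id_not_null;
```

Splitting steps 1 and 2 into separate migrations keeps the long-running scan out of the transaction that holds the strongest lock.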
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d25a4b4d6d
```sh
psql "$DB" -v ON_ERROR_STOP=1 <<'SQL'
BEGIN;
```
Split recovery for transaction:false migrations
When the failed migration is one of the CREATE INDEX CONCURRENTLY/DROP INDEX CONCURRENTLY migrations recommended above, pasting its migrate:up section inside this BEGIN block will fail because PostgreSQL rejects concurrent index operations inside a transaction block. That makes the prod recovery recipe unusable for exactly the large-index migrations this document tells authors to write; call out that transaction:false migrations must be replayed without BEGIN/COMMIT and then record the schema version separately.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/MIGRATIONS.md`:
- Around line 126-136: Update the recovery documentation to show two explicit
recipes: for transactional migrations use the existing BEGIN/COMMIT block (keep
SET LOCAL statement_timeout = 0; SET LOCAL lock_timeout = 0; run the migrate:up
SQL, then INSERT INTO public.schema_migrations(version) VALUES
('<UTC-yyyymmddHHMMSS>'); COMMIT) and for non-transactional (transaction:false)
migrations provide a separate example that omits BEGIN/COMMIT and uses
session-level SET (e.g. SET statement_timeout = 0; SET lock_timeout = 0;) before
running the migrate:up SQL and then INSERT the schema_migrations row afterward;
reference the same migrate:up section and the schema_migrations insert in both
examples so readers know where to paste their SQL and how to record the applied
version.
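The two recipes requested above might look like the sketch below when run through `psql` with `ON_ERROR_STOP=1`. It is an illustration under assumptions, not the text that landed in docs/MIGRATIONS.md: `'<UTC-yyyymmddHHMMSS>'` is the placeholder from the review comment and must be replaced with the failed migration's version, and the comment lines mark where the migration's own `migrate:up` SQL goes.

```sql
-- Recipe 1: transactional migration.
-- SET LOCAL scopes the lifted timeouts to this transaction only.
BEGIN;
SET LOCAL statement_timeout = 0;
SET LOCAL lock_timeout = 0;
-- (paste the migration's migrate:up SQL here)
INSERT INTO public.schema_migrations (version)
VALUES ('<UTC-yyyymmddHHMMSS>');
COMMIT;

-- Recipe 2: transaction:false migration (e.g. CREATE INDEX CONCURRENTLY).
-- No BEGIN/COMMIT around the DDL: PostgreSQL rejects concurrent index
-- operations inside a transaction block. Session-level SET applies until
-- the connection closes.
SET statement_timeout = 0;
SET lock_timeout = 0;
-- (paste the migration's migrate:up SQL here, outside any transaction)
-- Record the version only after the statements above succeed; in
-- autocommit mode this INSERT runs as its own small transaction.
INSERT INTO public.schema_migrations (version)
VALUES ('<UTC-yyyymmddHHMMSS>');
```

Run each recipe in its own `psql` session so the session-level settings from recipe 2 cannot leak into unrelated work.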
Four corrections from pi on #769:

* CREATE INDEX without CONCURRENTLY takes a SHARE lock (blocks writes but not reads), not ACCESS EXCLUSIVE. Fixed the lock description.
* Added the CONCURRENTLY + IF NOT EXISTS trap: a failed concurrent build leaves an invalid index with the same name, and a subsequent IF NOT EXISTS skips the rebuild, leaving a silent half-built index. Doc now tells engineers to check pg_index.indisvalid and DROP CONCURRENTLY before retrying.
* Recovery runbook now has two paths: standard migration (one txn with timeouts lifted) and transaction:false migration (session-level SET, statements outside BEGIN, separate tiny txn to insert the schema_migrations row). The previous single-path recipe would have errored on any CONCURRENTLY statement.
* Cascade guidance now explicitly calls out indexing the child FK column (otherwise the cascade falls back to a seq scan) and sequencing the parent delete after dependents are already nulled.
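The invalid-index check from the second correction can be sketched as below. The catalog query uses only standard `pg_index` and `pg_class` columns; the index name `idx_widgets_owner_id` and the table `widgets` are hypothetical stand-ins for whatever the failed migration was building.

```sql
-- List every index left invalid by a failed or cancelled concurrent build.
SELECT c.relname              AS index_name,
       i.indrelid::regclass   AS table_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;

-- If the index appears above, drop it before retrying; otherwise
-- CREATE INDEX ... IF NOT EXISTS sees the name and skips the rebuild.
-- Both statements must run outside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS idx_widgets_owner_id;   -- hypothetical name
CREATE INDEX CONCURRENTLY idx_widgets_owner_id
  ON widgets (owner_id);                                  -- hypothetical table/column
```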
pi review: addressed
Four corrections in 5d4819e:
Why
Codifies what we wish #765 had followed before it timed out under the 60s `statement_timeout` and triggered the 2026-05-16 outage. Pairs with the boot-time schema-version assertion in #767 — the doc keeps migrations from timing out, the gate keeps the app from rolling forward when one does.
What's in it
Out of scope (deferred to follow-ups)
While measuring for this PR I dug into the connections-list query (565ms mean × 319 calls in pg_stat_statements). The cost is entirely in the per-row `event_count` correlated sub-select going through `current_event_records`: each event_count touches `idx_events_connection_id` for ~53k rows + anti-joins against `idx_events_superseded_by` for ~690k probes. 48% of `events` rows are tombstones (552k / 1.15M with `supersedes_event_id IS NOT NULL`), which is what makes the anti-join expensive.
A naive CTE rewrite that computes counts once for the page is actually slower (1.57s vs 1.30s) — the planner does the same anti-join either way. The real fix is a denormalized `is_superseded` boolean on `events` maintained by trigger, with a partial index. That's the same shape of change as the embed-backfill state column in PR 3, and will be folded in there or a sibling PR.
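One possible shape for that denormalized flag is sketched below. It rests on assumptions the discussion does not confirm: that `events` has a primary key column named `id`, and that a row counts as superseded once another row's `supersedes_event_id` points at it. The function, trigger, and index names are hypothetical, and existing rows would still need a batched backfill, which is not shown.

```sql
-- Assumed schema: events(id, connection_id, supersedes_event_id, ...).
-- Only connection_id and supersedes_event_id appear in the discussion above.

-- 1. Constant default, so this is a metadata-only change on PostgreSQL 11+.
ALTER TABLE events
  ADD COLUMN is_superseded boolean NOT NULL DEFAULT false;

-- 2. When a superseding row is inserted, flag the row it replaces.
CREATE FUNCTION events_mark_superseded() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  UPDATE events
     SET is_superseded = true
   WHERE id = NEW.supersedes_event_id
     AND NOT is_superseded;
  RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$;

CREATE TRIGGER events_mark_superseded_trg
AFTER INSERT ON events
FOR EACH ROW
WHEN (NEW.supersedes_event_id IS NOT NULL)
EXECUTE FUNCTION events_mark_superseded();

-- 3. Partial index covering only live rows. Needs a transaction:false
--    migration because of CONCURRENTLY.
CREATE INDEX CONCURRENTLY idx_events_live_connection_id
  ON events (connection_id)
  WHERE NOT is_superseded;
```

With the flag in place, the per-connection count could filter on `NOT is_superseded` and read the partial index instead of running the anti-join.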
Test plan