Skip to content

docs: safe-migration patterns#769

Merged
buremba merged 2 commits into
mainfrom
perf/sql-hot-paths
May 16, 2026
Merged

docs: safe-migration patterns#769
buremba merged 2 commits into
mainfrom
perf/sql-hot-paths

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 16, 2026

Why

Codifies what we wish #765 had followed before it timed out under the 60s `statement_timeout` and triggered the 2026-05-16 outage. Pairs with the boot-time schema-version assertion in #767 — the doc keeps migrations from timing out, the gate keeps the app from rolling forward when one does.

What's in it

  • The four DDL shapes that bite on a hot table:
    • `GENERATED STORED` columns — table rewrite under ACCESS EXCLUSIVE. The exact shape perf(events): stored fulltext column + lifecycle partial index #765 used.
    • `CREATE INDEX` — non-CONCURRENTLY takes ACCESS EXCLUSIVE.
    • `SET NOT NULL` — full-table scan unless you stage via NOT VALID CHECK.
    • `ON DELETE SET NULL / CASCADE` on small parent → wide dependent — `events_connection_id_fkey` is doing exactly this; pg_stat_statements shows 13.4s per call.
  • For each: the unsafe shape, the safe shape, and a working SQL example.
  • Operational checklist for "is this migration safe to ship?"
  • "When dbmate fails in prod" runbook — the recipe we ran manually on 2026-05-16.

Out of scope (deferred to follow-ups)

While measuring for this PR I dug into the connections-list query (565ms mean × 319 calls in pg_stat_statements). The cost is entirely in the per-row `event_count` correlated sub-select going through `current_event_records`: each event_count touches `idx_events_connection_id` for ~53k rows + anti-joins against `idx_events_superseded_by` for ~690k probes. 48% of `events` rows are tombstones (552k / 1.15M with `supersedes_event_id IS NOT NULL`), which is what makes the anti-join expensive.

A naive CTE rewrite that computes counts once for the page is actually slower (1.57s vs 1.30s) — the planner does the same anti-join either way. The real fix is a denormalized `is_superseded` boolean on `events` maintained by trigger, with a partial index. That's the same shape of change as the embed-backfill state column in PR 3, and will be folded in there or a sibling PR.

Test plan

  • `make typecheck` clean (no code changed)
  • `make build-packages` builds

Summary by CodeRabbit

  • Documentation
    • Added migration guidance covering rules for writing migrations, safety cautions for PostgreSQL operations that can rewrite tables or rebuild indexes, an operational checklist for large migrations (timing and timeout considerations), and troubleshooting/recovery procedures to diagnose and manually recover failed migrations, plus references to related incident learnings and follow-ups.

Review Change Stack

Codifies what we wish PR #765 had followed before it timed out under
the 60s statement_timeout and triggered the 2026-05-16 outage. Covers
the four DDL shapes that bite on a hot table:

* GENERATED STORED columns — table rewrite under ACCESS EXCLUSIVE.
  Safe pattern: nullable column + trigger + batched backfill outside
  the Helm hook.
* CREATE INDEX — switch to CONCURRENTLY (transaction:false migration).
* SET NOT NULL — add a NOT VALID CHECK first, VALIDATE separately,
  then SET NOT NULL becomes O(1).
* ON DELETE SET NULL / CASCADE on small parent → wide dependent —
  cascade UPDATE blocks under exclusive lock (events.connection_id_fkey
  takes 13s per call in pg_stat_statements).

Plus an operational checklist for sizing the migration locally before
opening the PR, and a "when dbmate fails in prod" runbook that's the
recipe we ran on 2026-05-16.

References the boot-time schema-version assertion from lobu#767 — the
two complement each other: the doc keeps migrations from timing out,
the gate keeps the app from rolling forward when one does.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0ee52a1f-508e-4572-9a1d-e16ef654988b

📥 Commits

Reviewing files that changed from the base of the PR and between d25a4b4 and 5d4819e.

📒 Files selected for processing (1)
  • docs/MIGRATIONS.md
✅ Files skipped from review due to trivial changes (1)
  • docs/MIGRATIONS.md

📝 Walkthrough

Walkthrough

Adds docs/MIGRATIONS.md, documenting dbmate migration rules, PostgreSQL risk patterns with safe practices, a pre-deploy checklist for large migrations, and step‑by‑step production failure diagnosis and manual recovery procedures.

Changes

Migration Operational Guidance

Layer / File(s) Summary
Migration rules and framework
docs/MIGRATIONS.md
Introduces the migrations doc introduction and three core "Always" guidelines: read actual Postgres DDL behavior, document measured operational cost, and keep up-side operations idempotent where practical.
PostgreSQL risk patterns and safe practices
docs/MIGRATIONS.md
Details PostgreSQL operations that can rewrite rows or rebuild indexes (generated stored columns, large CREATE INDEX, SET NOT NULL, cascading operations, DROP INDEX) and provides safer patterns (CONCURRENTLY, NOT VALID/VALIDATE, batching, indexing).
Operational checklist and production recovery
docs/MIGRATIONS.md
Adds a checklist for big migrations (timing against prod-sized data, statement_timeout comparison, recovery plan, boot-time schema checks) and stepwise recovery instructions for dbmate failures, including SQL steps for normal and CONCURRENT/transaction:false cases and deployment restart guidance.

🐰 I hopped through code and schema plains,
I penned the rules to dodge migration pains.
Check timings, build indexes slow and steady,
Keep recovery steps clear — stay calm, be ready.
Now deploys can dance while I munch a carrot. 🥕

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'docs: safe-migration patterns' is concise and clearly describes the main change—adding documentation for safe database migration patterns—which aligns with the PR's primary objective.
Description check ✅ Passed The PR description provides comprehensive context (Why, What's in it, Out of scope, Test plan), but deviates from the repository template by omitting the Summary section structure and test checkbox format specified in the template.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/sql-hot-paths

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d25a4b4d6d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs/MIGRATIONS.md

```sh
psql "$DB" -v ON_ERROR_STOP=1 <<'SQL'
BEGIN;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Split recovery for transaction:false migrations

When the failed migration is one of the CREATE INDEX CONCURRENTLY/DROP INDEX CONCURRENTLY migrations recommended above, pasting its migrate:up section inside this BEGIN block will fail because PostgreSQL rejects concurrent index operations inside a transaction block. That makes the prod recovery recipe unusable for exactly the large-index migrations this document tells authors to write; call out that transaction:false migrations must be replayed without BEGIN/COMMIT and then record the schema version separately.

Useful? React with 👍 / 👎.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/MIGRATIONS.md`:
- Around line 126-136: Update the recovery documentation to show two explicit
recipes: for transactional migrations use the existing BEGIN/COMMIT block (keep
SET LOCAL statement_timeout = 0; SET LOCAL lock_timeout = 0; run the migrate:up
SQL, then INSERT INTO public.schema_migrations(version) VALUES
('<UTC-yyyymmddHHMMSS>'); COMMIT) and for non-transactional (transaction:false)
migrations provide a separate example that omits BEGIN/COMMIT and uses
session-level SET (e.g. SET statement_timeout = 0; SET lock_timeout = 0;) before
running the migrate:up SQL and then INSERT the schema_migrations row afterward;
reference the same migrate:up section and the schema_migrations insert in both
examples so readers know where to paste their SQL and how to record the applied
version.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 56532a67-2179-4dca-abdb-9cfc69b92ed8

📥 Commits

Reviewing files that changed from the base of the PR and between 71e9b0f and d25a4b4.

📒 Files selected for processing (1)
  • docs/MIGRATIONS.md

Comment thread docs/MIGRATIONS.md Outdated
Four corrections from pi on #769:

* CREATE INDEX without CONCURRENTLY takes a SHARE lock (blocks writes
  but not reads), not ACCESS EXCLUSIVE. Fixed the lock description.
* Added the CONCURRENTLY + IF NOT EXISTS trap: a failed concurrent
  build leaves an invalid index with the same name, and subsequent
  IF NOT EXISTS skips the rebuild — silent half-built index. Doc now
  tells engineers to check pg_index.indisvalid and DROP CONCURRENTLY
  before retrying.
* Recovery runbook now has two paths: standard migration (one txn with
  timeouts lifted) and transaction:false migration (session-level SET,
  statements outside BEGIN, separate tiny txn to insert the
  schema_migrations row). The previous single-path recipe would have
  errored on any CONCURRENTLY statement.
* Cascade guidance now explicitly calls out indexing the child FK
  column (otherwise the cascade falls back to a seq scan) and
  sequencing the parent delete after dependents are already nulled.
@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 16, 2026

pi review — addressed

Four corrections in 5d4819e:

  1. CREATE INDEX lock description was wrong. Plain CREATE INDEX takes a SHARE lock (blocks writes, not reads), not ACCESS EXCLUSIVE. Fixed.
  2. CONCURRENTLY + IF NOT EXISTS trap added. Failed concurrent build leaves an invalid index with the same name → next IF NOT EXISTS skips the rebuild → silent half-built index the planner refuses to read. Doc now tells engineers to check `pg_index.indisvalid` and `DROP INDEX CONCURRENTLY` before retrying.
  3. Recovery runbook split into two paths. Standard migration: one txn with timeouts lifted. `transaction:false` migration (any CONCURRENTLY DDL): session-level SET, statements outside BEGIN, separate tiny txn to insert the `schema_migrations` row. The previous single-path recipe would have errored on the first CONCURRENTLY line.
  4. Cascade guidance expanded. Indexing the child FK column comes first (otherwise the cascade falls back to a seq scan and the UPDATE goes from 13s to minutes); batched nulling before the parent delete is what keeps the API responsive even with the index in place.

@buremba buremba merged commit 24603de into main May 16, 2026
21 of 22 checks passed
@buremba buremba deleted the perf/sql-hot-paths branch May 16, 2026 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants