From d25a4b4d6d5b61ff1dd619b082efc08caff1dad5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Burak=20Emre=20Kabakc=C4=B1?= Date: Sat, 16 May 2026 18:05:48 +0100 Subject: [PATCH 1/2] docs: safe-migration patterns MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codifies what we wish PR #765 had followed before it timed out under the 60s statement_timeout and triggered the 2026-05-16 outage. Covers the four DDL shapes that bite on a hot table: * GENERATED STORED columns — table rewrite under ACCESS EXCLUSIVE. Safe pattern: nullable column + trigger + batched backfill outside the Helm hook. * CREATE INDEX — switch to CONCURRENTLY (transaction:false migration). * SET NOT NULL — add a NOT VALID CHECK first, VALIDATE separately, then SET NOT NULL becomes O(1). * ON DELETE SET NULL / CASCADE on small parent → wide dependent — cascade UPDATE blocks under exclusive lock (events.connection_id_fkey takes 13s per call in pg_stat_statements). Plus an operational checklist for sizing the migration locally before opening the PR, and a "when dbmate fails in prod" runbook that's the recipe we ran on 2026-05-16. References the boot-time schema-version assertion from lobu#767 — the two complement each other: the doc keeps migrations from timing out, the gate keeps the app from rolling forward when one does. --- docs/MIGRATIONS.md | 148 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 148 insertions(+) create mode 100644 docs/MIGRATIONS.md diff --git a/docs/MIGRATIONS.md b/docs/MIGRATIONS.md new file mode 100644 index 000000000..1b464dd8a --- /dev/null +++ b/docs/MIGRATIONS.md @@ -0,0 +1,148 @@ +# Migrations + +dbmate-managed SQL under `db/migrations/`. Names are `_.sql` with a `-- migrate:up` / `-- migrate:down` split. Pre-upgrade Helm hook runs `dbmate up` against the prod database before the app rolls. + +Most migrations are routine: an additive column, a small index, a view bump. The risk lives in the ones that **rewrite a hot table** or **rebuild a large index**. Get those wrong and prod stalls under `ACCESS EXCLUSIVE` until the database's `statement_timeout` fires, after which dbmate exits non-zero, the migration leaves no row in `schema_migrations`, and the app deploys forward into a schema it expects but doesn't have. That's the 2026-05-16 outage in one sentence ([PR #767](https://github.com/lobu-ai/lobu/pull/767)). + +The rules below are what we wish #765 had followed. + +--- + +## Always + +- **Read what your DDL actually does.** Postgres has three speeds: `O(1)` metadata flip, `O(rows)` table rewrite under `ACCESS EXCLUSIVE`, and `O(rows)` write with regular row locks. Know which one you wrote. +- **State the operational cost in the migration body** as a comment, in seconds for the real table size, not "should be fast." Cite the row count you tested on. If you didn't test on the real size, say so. +- **Keep migrations idempotent on the up-side** when practical (`CREATE INDEX IF NOT EXISTS`, `ALTER TABLE … ADD COLUMN IF NOT EXISTS`). dbmate doesn't retry, but a half-applied transaction recovered manually is easier when the SQL is re-runnable. + +--- + +## Watch out for + +### `ADD COLUMN ... GENERATED ALWAYS AS (...) STORED` + +This rewrites every row to materialize the generated value. On the 1.15M-row `events` table the rewrite took >60s and tripped the DB's `statement_timeout`. Do not use `GENERATED STORED` on any table over ~100k rows during a Helm hook. + +**Safe pattern:** + +```sql +-- migrate:up + +-- Step 1: nullable column, no backfill, no rewrite. O(1). +ALTER TABLE events ADD COLUMN IF NOT EXISTS search_tsv tsvector; + +-- Step 2: trigger so new rows get the value (fast inserts; trigger overhead +-- is the cost of doing this incrementally). +CREATE OR REPLACE FUNCTION events_set_search_tsv() RETURNS trigger AS $$ +BEGIN + NEW.search_tsv := + setweight(to_tsvector('english', COALESCE(NEW.title, '')), 'A') || + setweight(to_tsvector('english', COALESCE(NEW.payload_text, '')), 'B'); + RETURN NEW; +END; +$$ LANGUAGE plpgsql; + +DROP TRIGGER IF EXISTS trg_events_set_search_tsv ON events; +CREATE TRIGGER trg_events_set_search_tsv + BEFORE INSERT OR UPDATE OF title, payload_text ON events + FOR EACH ROW EXECUTE FUNCTION events_set_search_tsv(); +``` + +Then **backfill in a separate migration / script** that runs in batches outside the Helm hook (e.g. a `BATCH_BACKFILL=1`-gated boot path or a `scripts/backfill-*.sh` runbook): + +```sql +UPDATE events SET search_tsv = + setweight(to_tsvector('english', COALESCE(title, '')), 'A') || + setweight(to_tsvector('english', COALESCE(payload_text, '')), 'B') +WHERE id BETWEEN $START AND $END AND search_tsv IS NULL; +``` + +Pace ~10k rows per batch with a sleep between batches; each batch is its own transaction so VACUUM can keep up. + +### `CREATE INDEX` on a large table + +Plain `CREATE INDEX` takes `ACCESS EXCLUSIVE` on the table for the duration of the build. On `events` that's measured in seconds, not subseconds. + +**Safe pattern:** `CREATE INDEX CONCURRENTLY`. It does multiple passes but only takes `SHARE UPDATE EXCLUSIVE` so reads + writes continue. + +```sql +-- CONCURRENTLY does not run inside a transaction. dbmate runs each .sql +-- file in a single implicit transaction unless you opt out, so: +-- +-- migrate:up transaction:false +CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_events_search_tsv + ON events USING gin (search_tsv); +``` + +If the migration *must* be one file with the column-add too, the column-add stays in a normal transaction-wrapped migration and the index is a follow-up `transaction:false` migration. + +### `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL` + +Postgres has to verify every row, scanning the whole table under `ACCESS EXCLUSIVE`. On a hot table this stalls. + +**Safe pattern (PG 12+):** add a `CHECK (col IS NOT NULL) NOT VALID` first, run `ALTER TABLE … VALIDATE CONSTRAINT` (only takes `SHARE UPDATE EXCLUSIVE`), then `ALTER COLUMN SET NOT NULL` becomes `O(1)` because the constraint already proves the property. + +```sql +ALTER TABLE events ADD CONSTRAINT events_title_not_null + CHECK (title IS NOT NULL) NOT VALID; +ALTER TABLE events VALIDATE CONSTRAINT events_title_not_null; +ALTER TABLE events ALTER COLUMN title SET NOT NULL; +ALTER TABLE events DROP CONSTRAINT events_title_not_null; +``` + +### Cascading `ON DELETE SET NULL` / `ON DELETE CASCADE` on hot tables + +A single `DELETE FROM connections WHERE id IN (...)` triggers an internal `UPDATE events SET connection_id = NULL WHERE connection_id IN (...)`. For a connection with 92k events that's a 13s blocking UPDATE — visible to ops as "the admin action hung the API." + +**Safe pattern:** keep cascades for *small* parent tables only. For parents that fan out to events, watcher_window_events, or anything else 100k+ rows wide, manage the dependent nulling in application code with batched UPDATEs (e.g. 1k rows per loop, sleep between). + +### Bare `DROP INDEX` + +Takes `ACCESS EXCLUSIVE`. Use `DROP INDEX CONCURRENTLY` (also `transaction:false`). + +--- + +## Operational checklist for any "big" migration + +If the migration touches a table over ~100k rows or rebuilds a GIN/ivfflat index, before you open the PR: + +1. **Time it locally against a copy of prod-sized data**, not a 1k-row dev DB. We have `psql "$DATABASE_URL" -c '\COPY ...'` paths to clone a single table for this purpose. +2. **Write the actual observed timing in the migration's header comment**, not a guess. +3. **Compare against the DB's `statement_timeout`** for the migration role. Prod is 60s. If your migration is close, restructure — don't ship hoping it'll fit. +4. **State the recovery plan** in the PR body: if the migration times out, what's the manual `SET statement_timeout=0; \i path/to/file` recipe? +5. **Verify the new app image still boots against an un-applied schema** by running it locally with the migration held back. With the boot-time schema-version assertion from [#767](https://github.com/lobu-ai/lobu/pull/767), it should refuse to start — that's good. The point is to confirm the refusal is loud and clean. + +--- + +## When dbmate fails in prod + +Symptoms: `summaries-app-lobu-migrate-*` pod in `Error` state, app pod in `CrashLoopBackOff` or stuck in a tight error loop. Confirm the schema state: + +```sh +DB="$(... pull DATABASE_URL from secret ...)" +psql "$DB" -tAc "SELECT max(version) FROM schema_migrations" +psql "$DB" -tAc "SELECT 1 FROM information_schema.columns WHERE table_name='' AND column_name=''" +``` + +If a migration left no `schema_migrations` row, the safe recovery is to apply it manually with the cap lifted: + +```sh +psql "$DB" -v ON_ERROR_STOP=1 <<'SQL' +BEGIN; +SET LOCAL statement_timeout = 0; +SET LOCAL lock_timeout = 0; +-- paste the migrate:up section here +INSERT INTO public.schema_migrations(version) VALUES (''); +COMMIT; +SQL +``` + +Then `kubectl rollout restart deploy/-lobu-app` to wipe cached prepared statements. + +If the migration partially applied (column created, index missing, view missing), figure out which subset succeeded with `\d
` and run only the missing pieces. + +--- + +## See also + +- [PR #767](https://github.com/lobu-ai/lobu/pull/767) — the boot-time schema-version assertion that catches behind-DB images at startup. +- The post-incident commentary on the original migration ([#765](https://github.com/lobu-ai/lobu/pull/765)) for what "operational note" looks like when a migration's risks are flagged but not addressed. From 5d4819e0e33805b1778a430d1f45c6473f12117e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Burak=20Emre=20Kabakc=C4=B1?= Date: Sat, 16 May 2026 18:09:22 +0100 Subject: [PATCH 2/2] docs: address pi review on MIGRATIONS.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Four corrections from pi on #769: * CREATE INDEX without CONCURRENTLY takes a SHARE lock (blocks writes but not reads), not ACCESS EXCLUSIVE. Fixed the lock description. * Added the CONCURRENTLY + IF NOT EXISTS trap: a failed concurrent build leaves an invalid index with the same name, and subsequent IF NOT EXISTS skips the rebuild — silent half-built index. Doc now tells engineers to check pg_index.indisvalid and DROP CONCURRENTLY before retrying. * Recovery runbook now has two paths: standard migration (one txn with timeouts lifted) and transaction:false migration (session-level SET, statements outside BEGIN, separate tiny txn to insert the schema_migrations row). The previous single-path recipe would have errored on any CONCURRENTLY statement. * Cascade guidance now explicitly calls out indexing the child FK column (otherwise the cascade falls back to a seq scan) and sequencing the parent delete after dependents are already nulled. --- docs/MIGRATIONS.md | 47 ++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 43 insertions(+), 4 deletions(-) diff --git a/docs/MIGRATIONS.md b/docs/MIGRATIONS.md index 1b464dd8a..92cee1127 100644 --- a/docs/MIGRATIONS.md +++ b/docs/MIGRATIONS.md @@ -60,9 +60,9 @@ Pace ~10k rows per batch with a sleep between batches; each batch is its own tra ### `CREATE INDEX` on a large table -Plain `CREATE INDEX` takes `ACCESS EXCLUSIVE` on the table for the duration of the build. On `events` that's measured in seconds, not subseconds. +Plain `CREATE INDEX` takes a `SHARE` lock on the table — reads still work, but **writes block** for the duration of the build. On `events` that's seconds of stalled INSERTs. -**Safe pattern:** `CREATE INDEX CONCURRENTLY`. It does multiple passes but only takes `SHARE UPDATE EXCLUSIVE` so reads + writes continue. +**Safe pattern:** `CREATE INDEX CONCURRENTLY`. Multiple passes, but only takes `SHARE UPDATE EXCLUSIVE` so reads + writes continue. ```sql -- CONCURRENTLY does not run inside a transaction. dbmate runs each .sql @@ -75,6 +75,17 @@ CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_events_search_tsv If the migration *must* be one file with the column-add too, the column-add stays in a normal transaction-wrapped migration and the index is a follow-up `transaction:false` migration. +**`CONCURRENTLY` + `IF NOT EXISTS` trap:** if a previous `CREATE INDEX CONCURRENTLY` failed mid-build (timeout, deadlock, connection dropped), Postgres leaves an **invalid** index behind. The index *name* exists, so the next `CREATE INDEX CONCURRENTLY IF NOT EXISTS` skips the rebuild — and now your "successful" deploy is using a half-built index that the planner refuses to read from. Before retrying, check: + +```sql +SELECT i.relname, x.indisvalid + FROM pg_class i + JOIN pg_index x ON x.indexrelid = i.oid + WHERE i.relname = 'idx_events_search_tsv'; +``` + +If `indisvalid = false`, drop it with `DROP INDEX CONCURRENTLY idx_events_search_tsv;` (also `transaction:false`) and retry the create. + ### `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL` Postgres has to verify every row, scanning the whole table under `ACCESS EXCLUSIVE`. On a hot table this stalls. @@ -93,7 +104,13 @@ ALTER TABLE events DROP CONSTRAINT events_title_not_null; A single `DELETE FROM connections WHERE id IN (...)` triggers an internal `UPDATE events SET connection_id = NULL WHERE connection_id IN (...)`. For a connection with 92k events that's a 13s blocking UPDATE — visible to ops as "the admin action hung the API." -**Safe pattern:** keep cascades for *small* parent tables only. For parents that fan out to events, watcher_window_events, or anything else 100k+ rows wide, manage the dependent nulling in application code with batched UPDATEs (e.g. 1k rows per loop, sleep between). +**Safe pattern:** keep cascades for *small* parent tables only. For parents that fan out to events, watcher_window_events, or anything else 100k+ rows wide: + +1. Make sure the **child FK column is indexed** (e.g. `idx_events_connection_id`) — without it, the cascade falls back to a seq scan and the UPDATE goes from 13s to minutes. +2. Manage the dependent nulling in application code with batched UPDATEs (e.g. 1k rows per loop, sleep between). +3. Run the parent `DELETE` only **after** the child rows are already nulled/orphaned — the cascade then has nothing to do and the delete is fast. + +Indexing alone helps the cascade, but it doesn't eliminate the per-row WAL write; batching before the delete is what keeps the API responsive. ### Bare `DROP INDEX` @@ -123,7 +140,9 @@ psql "$DB" -tAc "SELECT max(version) FROM schema_migrations" psql "$DB" -tAc "SELECT 1 FROM information_schema.columns WHERE table_name='
' AND column_name=''" ``` -If a migration left no `schema_migrations` row, the safe recovery is to apply it manually with the cap lifted: +If a migration left no `schema_migrations` row, the safe recovery is to apply it manually with the cap lifted. Two paths depending on whether the migration uses `CONCURRENTLY`: + +**Standard migration (no CONCURRENTLY):** one transaction with the timeouts lifted. ```sh psql "$DB" -v ON_ERROR_STOP=1 <<'SQL' @@ -136,6 +155,26 @@ COMMIT; SQL ``` +**`transaction:false` migration (any `CREATE/DROP INDEX CONCURRENTLY` or `REINDEX CONCURRENTLY`):** these can't run inside `BEGIN`. Use session-level `SET` and run statements one at a time. Verify success between each, then record the migration: + +```sh +psql "$DB" <<'SQL' +SET statement_timeout = 0; +SET lock_timeout = 0; + +-- one statement at a time, no BEGIN +CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_foo ON bar (baz); +-- verify the index is valid before continuing: +SELECT indisvalid FROM pg_index WHERE indexrelid = 'idx_foo'::regclass; +-- if false: DROP INDEX CONCURRENTLY idx_foo; and rerun the create above. + +-- once everything is healthy, record the migration in its own tiny txn: +BEGIN; +INSERT INTO public.schema_migrations(version) VALUES (''); +COMMIT; +SQL +``` + Then `kubectl rollout restart deploy/-lobu-app` to wipe cached prepared statements. If the migration partially applied (column created, index missing, view missing), figure out which subset succeeded with `\d
` and run only the missing pieces.