Skip to content

chore(db): squash 82 migrations into baseline + retire schema.sql + embedded patches#908

Merged
buremba merged 6 commits into
mainfrom
chore/db-squash-baseline
May 19, 2026
Merged

chore(db): squash 82 migrations into baseline + retire schema.sql + embedded patches#908
buremba merged 6 commits into
mainfrom
chore/db-squash-baseline

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 19, 2026

Summary

Clean-cut consolidation of the DB schema management story. Net: -13,024 / +1,683 lines, 89 files.

What changed

Change Why
`db/migrations/`: 82 files → 1 baseline Single source of truth; cold-start drops from 82 sequential applies to one CREATE-everything
`db/schema.sql` deleted Baseline IS the schema; no more dual source, no more drift gate
`scripts/normalize-schema.sh` deleted Was only needed to scrub pg_dump output for the drift diff
`Makefile`: `db-schema` target removed No schema.sql to regenerate
`.github/workflows/ci.yml` Removed: drift gate, normalize step, `--schema-file` flag on `dbmate up`. Kept: dbmate-up validation + status check + immutability check (honors `[squash-baseline]` sentinel)
`packages/server/src/db/embedded-schema-patches.ts` deleted Embedded path now runs migrations the same way prod does — no second mirror to maintain
`packages/server/src/start-local.ts` Replaced the "skip migrations if `organization` exists" branch + patches loop with a single `schema_migrations`-aware applier that mirrors dbmate's behavior

Audit findings (4 parallel Explore agents)

Category Findings
Dead tables (zero TS readers/writers) `mcp_proxy_sessions`, `organization_lobu_links`, 4 `migration_*` temp artifacts
Dead columns `agents.skill_auto_granted_domains` (0 hits), `runs.retry_delay_seconds` (comments only)
Dead views / functions / triggers none (past migrations already cleaned up)
Deprecation markers none actionable (already done in past migrations)

Kept after spot-check: `agents.{soulMd, nixConfig, networkConfig, pluginsConfig}` — heavily referenced via camelCase in the owletto web admin agent editor + CLI apply diff/desired-state. Audit's snake_case grep missed them.

Top-15 tables get `COMMENT ON TABLE` descriptions

`events`, `runs`, `agents`, `connections`, `entities`, `auth_profiles`, `organization`, `user`, `member`, `watchers`, `feeds`, `personal_access_tokens`, `oauth_tokens`, `entity_types`, `event_classifications`. Self-documenting schema.


Data safety

Nothing is dropped from prod. All "removed" tables get renamed; all "removed" columns get snapshotted into side tables. Data stays inside the live DB, queryable any time, restorable in one SQL statement.

Two layered recovery paths:

Path A — CNPG point-in-time recovery (preferred, full restore)

This codebase already has CNPG WAL archiving wired:

Config Value Source
Backup target Cloudflare R2 (`s3://summaries-db-backup`) `packages/owletto/deploy/k8s/apps/lobu/base/helmrelease.yaml`
`retentionPolicy` 30 days same
`archive_timeout` 900s (15-min force-archive) same
`ScheduledBackup` daily 02:00 UTC same
Recovery template `db-recovery.yaml` (used 2026-03-15 after Reddit re-sync incident) proof-point in same dir

If anything breaks post-deploy, apply a fresh `Cluster` CR with `bootstrap.recovery.recoveryTarget.targetTime` set to the pre-surgery timestamp recorded in STEP 0 — CNPG restores transaction-level state from the latest base backup + WAL replay. Flip the app's `DATABASE_URL` to point at the recovered cluster.

Path B — In-DB safety net (lightweight, just-the-squash rollback)

Surgery leaves behind:

  • 6 renamed tables (suffix `_d20260519`)
  • 2 column-snapshot tables (one per dropped column, keyed off the parent's PK)
  • The pre-surgery `schema_migrations` ledger CSV at `/tmp/schema-migrations-pre-squash.csv`

To undo just the squash without a full PITR:

```
-- 1. Restore the ledger:
psql "$PROD_DATABASE_URL" -c "DELETE FROM public.schema_migrations"
psql "$PROD_DATABASE_URL"
-c "\\copy public.schema_migrations FROM '/tmp/schema-migrations-pre-squash.csv' CSV HEADER"
-- 2. Rename tables back:
ALTER TABLE public.mcp_proxy_sessions_d20260519 RENAME TO mcp_proxy_sessions;
-- ... (one per renamed table)
-- 3. ADD COLUMN + UPDATE from snapshot tables.
```

Full rollback details in the baseline file's header.


Rollout procedure

  1. STEP 0 — record pre-surgery timestamp:
    ```
    psql "$PROD_DATABASE_URL" -c "SELECT now()" | tee /tmp/pre-surgery-ts.txt
    ```
  2. STEP 1 — full pg_dump as belt + suspenders (CNPG PITR is the real safety net).
  3. STEP 2 — CSV the schema_migrations ledger (for Path B rollback).
  4. STEP 3 — sanity-check row counts on the 6 droppee tables. Abort if non-zero.
  5. STEP 4 — apply the rename + snapshot + ledger-reset surgery in a single BEGIN/COMMIT (full script in the baseline header).
  6. STEP 5 — deploy the new image. `dbmate up` skips the baseline (already-applied per the surgery's ledger insert).

Fresh DBs (local dev, PGlite, CI): no surgery; `dbmate up` applies the baseline from scratch. For local PGlite: `rm -rf /data` next to your `lobu run` to take advantage of the squash.


Verification

Manual end-to-end test, two Docker containers:

Container Setup Outcome
A — "prod simulation" `pgvector/pgvector:pg16` + 82 old migrations + surgery Canonical schema aligns to baseline; 8 backup tables retain rows
B — "fresh DB" `pgvector/pgvector:pg16` + new baseline only 67 user tables + dbmate-managed schema_migrations

Diff between A and B's canonical schemas (excluding backup tables): empty at the column level, across all 67 tables.

  • `bun run typecheck` clean
  • Container B applies the baseline in 2.6s without errors
  • Container A surgery preserves all rows; canonical schema matches B
  • `agents.skill_auto_granted_domains` absent in both; data preserved in `agents_d20260519_skill_auto_granted_domains` on A
  • `runs.retry_delay_seconds` absent in both; data preserved in `runs_d20260519_retry_delay_seconds` on A
  • `schema_migrations` has only `'00000000000000'` post-surgery on A; fresh-applied on B
  • CI: `migrations` job + `dbmate up` validation (running now)
  • Code review on the baseline file's accuracy
  • Pre-merge: confirm the recovery procedure works on a CNPG staging clone

CI

The `[squash-baseline]` sentinel in commit messages signals to CI's immutability check that this PR is a one-time squash. Future schema PRs without that marker still get strict "applied migrations are immutable" enforcement.

Summary by CodeRabbit

Release Notes

  • Chores
    • Consolidated numerous database schema migrations into a baseline schema to streamline database upgrades and initialization processes.
    • Removed legacy migration tooling, embedded schema patching infrastructure, and related build utilities.
    • Simplified CI pipeline configuration and removed schema normalization scripts.
    • Updated local development bootstrapping to use streamlined migration ledger tracking.

Review Change Stack

…a.sql + embedded patches

Clean-cut consolidation of the DB schema management story. Authorized
by the user with "nobody is using our app yet, we can patch prod once."
Net: -13,024 / +1,518 lines.

What changed
------------

db/migrations/
  - Replaced 82 existing files (the stale 00000000000000_baseline.sql
    + 81 forward deltas) with one regenerated baseline that captures
    the current schema verbatim.
  - Baseline generated by: `dbmate up` all 82 against a fresh
    pgvector/pgvector:pg16 container (same image CI uses) → drop dead
    schema (audit-confirmed) → annotate top-15 tables with COMMENT ON
    → pg_dump --schema-only → strip dump noise.

db/schema.sql — DELETED.
  The baseline IS the schema. No more dual source of truth; no more
  drift gate; no more "did I forget to regenerate?" gotcha.

scripts/normalize-schema.sh — DELETED.
  Was only used to scrub pg_dump output before the drift diff.
  No diff, no script.

Makefile
  - Removed `db-schema` target (no schema.sql to regenerate).
  - Help line dropped.

.github/workflows/ci.yml
  - Removed: normalize step, drift-gate step, `--schema-file` flag on
    `dbmate up`.
  - Kept: immutability check, rebased to exclude
    `00000000000000_baseline.sql` so future re-squashes can ship.
  - Kept: `dbmate up` + status check (validates baseline applies
    cleanly to a fresh DB).

packages/server/src/db/embedded-schema-patches.ts — DELETED.
  Embedded path now runs the migrations directory the same way prod
  does — no second mirror to maintain.

packages/server/src/start-local.ts
  - Replaced the "skip migrations if `organization` exists" branch +
    `applyEmbeddedSchemaPatches` loop with a single `schema_migrations`-
    aware applier that mirrors dbmate's behavior: ensure ledger table,
    read applied versions, apply only the unseen ones, record each on
    success.
  - Idempotent against any starting state (fresh or pre-initialized),
    so legacy embedded DBs catch up to the baseline on next boot
    without a separate code path.

Dead schema dropped in the baseline (audit-flagged)
----------------------------------------------------

Tables (6):
- mcp_proxy_sessions (no readers/writers in TS)
- organization_lobu_links (no readers/writers in TS)
- migration_20260315300000_entity_type_org_backfill (one-off temp)
- migration_20260316100000_created_entity_types (one-off temp)
- migration_20260316100000_deleted_default_entity_types (one-off temp)
- migration_20260316100000_events_kind_backup (one-off temp)

Columns (2):
- agents.skill_auto_granted_domains (jsonb, 0 hits)
- runs.retry_delay_seconds (in comments only, never assigned)

Kept after verification: agents.{soulMd,nixConfig,networkConfig,
pluginsConfig} — heavy camelCase usage in owletto web admin agent
editor + lobu CLI apply diff/desired-state.

Documentation
-------------

Added COMMENT ON TABLE for the 15 load-bearing tables: events, runs,
agents, connections, entities, auth_profiles, organization, user,
member, watchers, feeds, personal_access_tokens, oauth_tokens,
entity_types, event_classifications. Self-documenting schema.

Prod rollout — REQUIRED before deploying this image
-----------------------------------------------------

Run on each prod DB once, BEFORE rolling out the new code:

  BEGIN;
  DROP TABLE IF EXISTS public.mcp_proxy_sessions CASCADE;
  DROP TABLE IF EXISTS public.organization_lobu_links CASCADE;
  DROP TABLE IF EXISTS public.migration_20260315300000_entity_type_org_backfill CASCADE;
  DROP TABLE IF EXISTS public.migration_20260316100000_created_entity_types CASCADE;
  DROP TABLE IF EXISTS public.migration_20260316100000_deleted_default_entity_types CASCADE;
  DROP TABLE IF EXISTS public.migration_20260316100000_events_kind_backup CASCADE;
  ALTER TABLE public.agents DROP COLUMN IF EXISTS skill_auto_granted_domains;
  ALTER TABLE public.runs DROP COLUMN IF EXISTS retry_delay_seconds;
  DELETE FROM public.schema_migrations;
  INSERT INTO public.schema_migrations (version) VALUES ('00000000000000');
  COMMIT;

Why this order: new pods boot with `dbmate up` which now expects
schema_migrations to list `'00000000000000'`. The DELETE+INSERT makes
that the only applied row, so dbmate skips the baseline (whose
contents already match prod's schema after the DROPs).

PGlite / local dev — wipe and rebuild
--------------------------------------

The simplest path: `rm -rf <workspace>/data` next to your `lobu run`
invocations. Next boot recreates the schema from the baseline.
Equivalent: `dbmate drop && dbmate up` against your dev DB.

Why not let dbmate self-heal on boot
-------------------------------------

Without the surgery, prod's `schema_migrations` table has 82 ghost
rows (the old applied versions). dbmate would see baseline as
unapplied, try to apply it, but its strict `CREATE TABLE` (no IF NOT
EXISTS) would error against prod's already-existing tables. Hence the
manual reset.

Pre-flight verification
-----------------------

`bun run typecheck` clean.
`dbmate --migrations-dir db/migrations up` against a fresh
pgvector/pgvector:pg16 container applies the baseline successfully
and produces the same schema state as the pre-squash 82-migration
chain.

Audit trail
-----------

Three Explore agents ran in parallel before the squash to find stale
schema everywhere. Findings (zero false positives uncovered when
spot-checking):

- Audit 1 (dead tables): 6 dead, all dropped.
- Audit 2 (dead views/functions/triggers/sequences): 0 dead. Past
  migrations had already cleaned up event_thread_tree,
  normalize_event_created_by, three notify_* functions.
- Audit 3 (dead columns): 1 confirmed (agents.skill_auto_granted_domains),
  1 confirmed dead by comment-only references (runs.retry_delay_seconds).
  4 false-positive suspects (agents.soulMd et al) — verified alive in
  owletto admin UI + CLI apply.
- Audit 4 (deprecation markers in source): no actionable items not
  already cleaned up.

Codex pushback was applied earlier in the conversation: don't drop
`embedded-schema-patches.ts` while keeping the two-execution-model
embedded boot (which would break). This commit also rewrites that
boot path to remove the dual-model, so the file's deletion is now
safe.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 9f95ab1c-9445-4254-b4e5-898bd579e550

📥 Commits

Reviewing files that changed from the base of the PR and between 54de2e0 and 283c2ee.

📒 Files selected for processing (92)
  • .github/workflows/ci.yml
  • Makefile
  • db/migrations/00000000000000_baseline.sql
  • db/migrations/20260405193000_add_mcp_sessions.sql
  • db/migrations/20260408120000_remove_system_connectors.sql
  • db/migrations/20260408120001_optional_compiled_code.sql
  • db/migrations/20260409110000_add_active_watcher_run_index.sql
  • db/migrations/20260409130000_connector_default_config.sql
  • db/migrations/20260410120000_add_agent_secrets.sql
  • db/migrations/20260413170000_add_watcher_group_id.sql
  • db/migrations/20260416120000_add_entity_wa_jid_index.sql
  • db/migrations/20260417100000_add_entity_identities.sql
  • db/migrations/20260418100000_add_auth_runs.sql
  • db/migrations/20260418110000_add_runs_created_by_user.sql
  • db/migrations/20260419120000_add_event_identity_indexes.sql
  • db/migrations/20260420120000_extend_reserved_org_slugs.sql
  • db/migrations/20260424030000_add_watcher_run_correlation.sql
  • db/migrations/20260424130000_relax_events_client_id_fk.sql
  • db/migrations/20260425100000_normalize_watcher_feedback.sql
  • db/migrations/20260425120000_add_run_diagnostics.sql
  • db/migrations/20260425130000_add_repair_agent_plumbing.sql
  • db/migrations/20260426120000_entities_entity_type_fk.sql
  • db/migrations/20260426130000_db_integrity_cleanup.sql
  • db/migrations/20260426130001_db_integrity_cleanup_concurrent.sql
  • db/migrations/20260427133000_events_created_by_nullable.sql
  • db/migrations/20260427140000_identity_engine_indexes.sql
  • db/migrations/20260427150000_drop_events_source_id.sql
  • db/migrations/20260427160000_drop_dead_schema.sql
  • db/migrations/20260427170000_market_founder_to_member.sql
  • db/migrations/20260428040000_cascade_events_watchers_org_fk.sql
  • db/migrations/20260428050000_add_runs_approved_input.sql
  • db/migrations/20260429010000_auth_profile_tenant_scoped_fk.sql
  • db/migrations/20260429060000_extend_runs_for_lobu_queue.sql
  • db/migrations/20260429120000_agent_changed_notify.sql
  • db/migrations/20260429120100_user_auth_profiles_and_model_prefs.sql
  • db/migrations/20260429120200_fix_notify_old_keys.sql
  • db/migrations/20260429130000_oauth_states_cli_sessions_rate_limits.sql
  • db/migrations/20260429140000_phase8_grants_chat_connections_mcp_sessions.sql
  • db/migrations/20260429140100_runs_priority_expires_at_retry_delay.sql
  • db/migrations/20260429180000_drop_invalidatable_cache_triggers.sql
  • db/migrations/20260430005614_agents_apply_fields.sql
  • db/migrations/20260430022231_fix_connection_config_encryption.sql
  • db/migrations/20260430151215_add_task_run_type.sql
  • db/migrations/20260501000000_drop_cli_sessions.sql
  • db/migrations/20260501133000_lobu_memory_mcp_id.sql
  • db/migrations/20260502000000_drop_chat_connections.sql
  • db/migrations/20260503000000_agent_secrets_org_scope.sql
  • db/migrations/20260504000000_flatten_agents_drop_sandbox_model.sql
  • db/migrations/20260510220000_connector_required_capability.sql
  • db/migrations/20260512000000_device_worker_connection_binding.sql
  • db/migrations/20260512131703_connections_slug.sql
  • db/migrations/20260513000000_chat_user_identities.sql
  • db/migrations/20260513120000_auth_profiles_device_binding.sql
  • db/migrations/20260513150000_auth_profiles_cdp_url.sql
  • db/migrations/20260513200000_notifications_as_events.sql
  • db/migrations/20260514000000_scheduled_jobs.sql
  • db/migrations/20260514120000_auth_profiles_connector_key_nullable.sql
  • db/migrations/20260514130000_connection_action_modes.sql
  • db/migrations/20260514160000_auth_profiles_mirror_mode.sql
  • db/migrations/20260515120000_agents_per_org_pk.sql
  • db/migrations/20260515150000_geo_enrichment.sql
  • db/migrations/20260515160000_drop_agents_org_id_unique.sql
  • db/migrations/20260515170000_auth_profiles_default_for_connector.sql
  • db/migrations/20260516120000_agents_per_org_pk_swap.sql
  • db/migrations/20260516200000_events_search_tsv.sql
  • db/migrations/20260516200100_events_lifecycle_changes_index.sql
  • db/migrations/20260517010000_drop_unused_indexes.sql
  • db/migrations/20260517020000_softdelete_orphan_feeds.sql
  • db/migrations/20260517030000_pat_worker_id_binding.sql
  • db/migrations/20260517040000_archive_orphan_watchers.sql
  • db/migrations/20260517050000_watcher_agent_id_not_null.sql
  • db/migrations/20260517060000_watcher_schema_additions.sql
  • db/migrations/20260517150000_goals_primitive.sql
  • db/migrations/20260517160000_drop_goals_primitive.sql
  • db/migrations/20260518000000_pending_interactions.sql
  • db/migrations/20260518010000_runs_heartbeat_reaper_index.sql
  • db/migrations/20260518020000_runs_heartbeat_inflight_narrow.sql
  • db/migrations/20260518040000_agent_transcript_snapshot.sql
  • db/migrations/20260518050000_runs_denormalize_agent_conversation.sql
  • db/migrations/20260518060000_revert_runs_denormalize.sql
  • db/migrations/20260518070000_runs_heartbeat_inflight_widen.sql
  • db/migrations/20260519000000_passkey_table.sql
  • db/migrations/20260519020000_chat_state_tables.sql
  • db/migrations/20260519020001_revoked_tokens.sql
  • db/schema.sql
  • packages/server/src/__tests__/integration/embedded-schema-patches.test.ts
  • packages/server/src/__tests__/integration/identity/founder-to-member-migration.test.ts
  • packages/server/src/db/embedded-schema-patches.ts
  • packages/server/src/gateway/auth/revoked-token-store.ts
  • packages/server/src/gateway/connections/state-adapter.ts
  • packages/server/src/start-local.ts
  • scripts/normalize-schema.sh

📝 Walkthrough

Walkthrough

This PR implements a large-scale database migration squashing and boot refactoring. Dozens of migration files (dating February–May 2026) are removed to consolidate into a baseline migration. The CI workflow gains a [squash-baseline] commit-message bypass for immutability checks, removes schema normalization steps, and simplifies to basic migration application. The local server boot switches from embedded schema patches to a migration ledger approach using the schema_migrations table.

Changes

Squashed Baseline Migration & Boot Refactor

Layer / File(s) Summary
CI immutability and migration workflow simplification
.github/workflows/ci.yml
Immutability check gains [squash-baseline] bypass; dbmate up is invoked without --schema-file for drift checking; schema normalization logic and post-migration snapshot diffing are removed.
Build target removal
Makefile
The db-schema make target (which ran dbmate + ./scripts/normalize-schema.sh) is deleted; .PHONY and help output are updated.
Local server migration ledger and boot refactor
packages/server/src/start-local.ts
Embedded schema patches import and fallback are removed. runMigrations now ensures schema_migrations table exists, loads applied versions into memory, filters and applies unapplied migrations from db/migrations/, and records versions via INSERT ... ON CONFLICT IGNORE.
Documentation clarifications on schema source
packages/server/src/gateway/auth/revoked-token-store.ts, packages/server/src/gateway/connections/state-adapter.ts
Module docs clarify that Postgres schema is now sourced from the squashed baseline migration (db/migrations/00000000000000_baseline.sql) applied by all boot paths, replacing the prior embedded patches approach.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • lobu-ai/lobu#834: Adds public.pending_interactions table that this PR removes via deleted migration 20260518000000_pending_interactions.sql.
  • lobu-ai/lobu#901: Adds scripts/normalize-schema.sh and db-schema target that this PR removes.
  • lobu-ai/lobu#893: Modifies packages/server/src/db/embedded-schema-patches.ts that this PR deletes entirely.

Suggested labels

skip-size-check

Poem

🐰 Migrations once many, now one baseline true,
Schema patches gone, ledger-tracked through and through,
CI bypasses squash-commits with grace,
Boot from db/migrations finds its rightful place!

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chore/db-squash-baseline

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

buremba added 4 commits May 19, 2026 04:50
…teps to baseline header

Two follow-ups to the schema squash (#908):

1. CI's migration-immutability check now skips when ANY commit in the PR
   contains the sentinel `[squash-baseline]` in its message. This is
   the one-time-per-squash escape hatch. Code review is the human gate;
   future random deletions won't ship without the sentinel + reviewer
   sign-off.

2. The baseline file's header now spells out the full prod-rollout
   procedure with explicit backup commands (pg_dump full snapshot, CSV
   dumps of the schema_migrations ledger and the named droppee tables)
   plus a rollback procedure for both kinds of restoration. Audit said
   the dropped tables had zero TS readers; the dumps are paranoia.

[squash-baseline]
… rollback

Three fixes after end-to-end verification against two containers
(Container A: 82 old migrations + surgery; Container B: fresh DB +
new baseline). Canonical schemas now match byte-for-byte across both.

1. Strip schema_migrations table CREATE from the baseline body. dbmate
   creates that table itself on first use, and the previous baseline
   colliding `CREATE TABLE public.schema_migrations` would error a
   fresh `dbmate up` with `relation "schema_migrations" already exists`.
   Verified Container B now applies in 2.6s with 67 user tables + the
   dbmate-managed schema_migrations.

2. Switch the surgery from DROP TABLE to RENAME for the 6 dead tables
   so no row-level data can be lost even if the audit was wrong about
   one of them. Column drops get the same treatment: snapshot the
   value into a side table BEFORE the ALTER TABLE DROP COLUMN, so
   restore is a single UPDATE if needed. Shortened suffix to
   `_d20260519` (10 chars) to fit Postgres's 63-char identifier limit
   for the long `migration_*` artifact table names. All renames
   verified non-truncating.

3. Anchor the rollback story on CNPG point-in-time recovery, which
   this codebase already has wired up:

   - `Cluster.spec.backup.barmanObjectStore` streams WAL to Cloudflare
     R2 (`s3://summaries-db-backup`)
   - `retentionPolicy: 30d` (30-day PITR window)
   - `archive_timeout: 900` (15-min WAL force-archive)
   - `ScheduledBackup` daily at 02:00 UTC
   - Recovery proof-point exists at
     packages/owletto/deploy/k8s/apps/lobu/base/db-recovery.yaml
     (used on 2026-03-15 after the Reddit re-sync incident)

   The baseline header now documents two layered recovery paths:
   - Path A (preferred, full): apply a recovery Cluster CR with
     `recoveryTarget.targetTime` set to the pre-surgery timestamp
     captured in STEP 0. CNPG fetches the latest base backup + replays
     WAL up to that time. Flip the app's DATABASE_URL to point at the
     recovered cluster.
   - Path B (lightweight, just the squash): restore the ledger CSV +
     rename the backup tables back + ADD COLUMN + UPDATE from
     snapshot tables.

End-to-end test (manual; not automated since requires Docker):
- Boot pgvector/pgvector:pg16, apply old 82 migrations → snapshot A.
- Run surgery on A → snapshot A' (post-surgery).
- Boot fresh pgvector, apply new baseline → snapshot B.
- Diff canonical (non-d20260519, non-dropped_*) schemas between A' and B:
    * Identical column-by-column for all 67+ canonical tables
    * agents.skill_auto_granted_domains absent in both
    * runs.retry_delay_seconds absent in both
    * schema_migrations contains only '00000000000000' in both
- A' additionally has 8 backup tables (renamed originals + 2 column
  snapshots). B doesn't have them by design.

[squash-baseline]
CI integration job failed on two tests that hard-coded paths to files
the baseline squash removed:

1. `packages/server/src/__tests__/integration/embedded-schema-patches.test.ts`
   - Imported `EMBEDDED_SCHEMA_PATCHES` from the now-deleted
     `db/embedded-schema-patches.ts`. Test was a unit-shape verifier
     for that file; with the file gone (embedded path now runs
     migrations the same way prod does), the test has nothing to
     verify.

2. `packages/server/src/__tests__/integration/identity/founder-to-member-migration.test.ts`
   - Read `db/migrations/20260427170000_market_founder_to_member.sql`
     directly (extracted -- migrate:up section, re-ran it manually to
     check idempotency). The migration is now collapsed into the
     baseline; idempotency of one historical migration is no longer a
     meaningful unit boundary.

Both tests deleted.

Also updated two stale doc-comments in
`packages/server/src/gateway/{auth/revoked-token-store,connections/state-adapter}.ts`
that pointed at the deleted forward-delta migrations + the deleted
`embedded-schema-patches.ts`. Both now point at the baseline.

[squash-baseline]
Codex review of #908 flagged three real issues. All three fixed:

1. (HIGH) PITR recovery doc was wrong about the ledger state. The
   recovered DB has the OLD 82 ledger rows but not '00000000000000'.
   When the new image's dbmate-up runs against it, it sees the
   baseline as pending and tries CREATE TABLE against existing
   tables, which errors. Rewrote recovery path A to spell out the
   two reconciliation choices:
     (a) revert to old image then repoint (safest — old image
         expects 82 ledger rows; new recovered DB has them);
     (b) keep new image and manually
           DELETE FROM schema_migrations;
           INSERT … VALUES ('00000000000000');
         on the recovered DB before any new-image migration job runs.
   Without (b), the new image fails on first boot against the
   recovered DB.

2. (HIGH) agents column snapshot missed the composite primary key.
   agents_pkey is (organization_id, id); the snapshot stored only
   `id AS agent_id`, and the rollback UPDATE matched only
   `a.id = b.agent_id`. When the same id is reused across orgs (the
   PK contract permits it), restoration would target a wrong row
   nondeterministically. Snapshot now includes organization_id; the
   rollback UPDATE joins on both. runs.PK is just (id) so the runs
   snapshot is unchanged.

3. (MEDIUM) pg_dump emitted
       SELECT pg_catalog.set_config('search_path', '', false)
   which is session-scoped. Subsequent forward migrations using
   unqualified names would fail under CI's `dbmate up` (which doesn't
   reset between files). Changed `false` → `true` (transaction-scoped)
   with a comment explaining why.

Re-verified Container B (fresh DB + baseline) applies in 1.7s, 68
tables. Surgery logic in the header was edited in-place (it's a
comment block, not an executable statement), so the change is doc-only
on the apply side; the script you'd paste at surgery time is now the
correct one.

[squash-baseline]
@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 19, 2026

Codex review (high-effort) — applied (commit e3dad80)

Three findings, all fixed before merge.

# Severity Finding Fix
1 HIGH PITR recovery doc was wrong about the ledger. Recovered DB has the OLD 82 ledger rows but not '00000000000000'. New image's dbmate up would treat the baseline as pending → CREATE TABLE against existing tables → error. Rewrote recovery path A. Two reconciliation choices now spelled out: (a) revert to old image then repoint (safest), (b) keep new image and insert baseline ledger row before any new-image migration job.
2 HIGH agents snapshot missed the composite primary key. agents_pkey is (organization_id, id); snapshot stored only id. Restore would clobber the wrong org's agent if id is reused across orgs. Snapshot now keeps organization_id, id, value; rollback UPDATE joins on both. (runs.PK = (id) so the runs snapshot is unchanged.)
3 MEDIUM pg_dump emitted SELECT pg_catalog.set_config('search_path', '', false) — session-scoped, leaks past baseline. Later forward migrations using unqualified names would fail under CI's dbmate up. Changed falsetrue (transaction-scoped).

Re-verification: Container B (fresh DB + baseline) still applies cleanly in 1.7s with 68 tables.

Direct codex confirmations:

  • runMigrations() in start-local.ts correctly handles the post-surgery prod DB (with only '00000000000000' in schema_migrations).
  • Fresh PGlite bootstrap path works.
  • Surgery sequence is data-preserving after the agents composite-PK fix.

…earing

CI integration job caught what the audit missed: `runs.retry_delay_seconds`
is heavily used by RunsQueue, not "comments only" as the audit reported.

Concrete references in packages/server/src/gateway/infrastructure/queue/runs-queue.ts:
- L301: `const retryDelaySeconds = options?.retryDelay ?? null;`
- L368: `retry_delay_seconds,` (INSERT column list)
- L386: `retryDelaySeconds,` (INSERT value binding)
- L575: type signature exposes `retryDelaySeconds: number | null;`
- L584: SQL projection types `retry_delay_seconds: number | string | null`
- L603: `RETURNING r.id, r.action_input, r.attempts, r.max_attempts, r.retry_delay_seconds`
- L617-620: post-claim value extraction

Audit's snake-case grep returned 7 hits "in comments/type hints" — the
file genuinely uses the snake_case column in SQL strings AND the
camelCase JS binding for the same data. The miss was a methodology
blind spot the audit also had on agents.soulMd / nixConfig (caught
earlier by spot-check).

The CI failure was 6 RunsQueue integration tests all failing with:
  PostgresError: column "retry_delay_seconds" of relation "runs" does not exist

Fix:
- Re-add `retry_delay_seconds integer,` to the runs CREATE TABLE block
  in the baseline (between `expires_at` and the constraints, matching
  origin/main's schema.sql).
- Remove the runs.retry_delay_seconds drop from the surgery script
  (it stays in prod; nothing to surgery).
- Remove the runs_d20260519_retry_delay_seconds snapshot table from
  the rollback section.
- Update the docstring drop list.

agents.skill_auto_granted_domains stays dropped — audit was correct
about that one (verified with spot-check earlier).

Re-verified Container B (fresh DB + baseline) applies in 1.9s. The
`runs` table now has the column; SELECT confirms it.

[squash-baseline]
@buremba buremba marked this pull request as ready for review May 19, 2026 04:33
@buremba buremba merged commit 54207a8 into main May 19, 2026
19 of 20 checks passed
@buremba buremba deleted the chore/db-squash-baseline branch May 19, 2026 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants