perf: drop 8 unused indexes (5.16 GB) + event_count from list#771
Conversation
Three changes bundled around the post-incident perf brainstorm, each verified against prod: * db/migrations: drops 8 indexes that pg_stat_user_indexes reports idx_scan=0 after 28h of prod uptime — `idx_events_embedding` (4.3 GB ivfflat ANN), `idx_events_raw_content_trgm` (554 MB trigram GIN), `idx_events_search_tsv` (228 MB fulltext GIN), and 5 smaller ones. pg_stat_statements over the same window confirms zero queries touch the shapes they serve (<->, <=>, payload_text ILIKE, @@ to_tsquery). Search code paths in content-search.ts exist but aren't called in prod today; rebuild CONCURRENTLY if/when needed. Migration uses plain DROP INDEX rather than CONCURRENTLY because dbmate's transaction:false directive doesn't actually exit the transaction block against the pq driver (see comment in 20260426130001_db_integrity_cleanup_concurrent.sql). The operator runbook in docs/MIGRATIONS.md "When dbmate fails in prod" covers applying CONCURRENTLY out-of-band first, then recording the schema_migrations row. * manage_connections.ts handleList: removes the per-row event_count subselect. The supersedes anti-join through current_event_records was the entire cost of that query — verified by EXPLAIN ANALYZE against prod, 1303ms → 2.3ms (566x). handleGet (single connection detail) still computes it; that path is one row and costs ~1.2ms. Submodule bump in packages/web pulls in the matching frontend changes (owletto-web#136). * docs/MIGRATIONS.md: appends a Lobu-specific policy paragraph to the cascade section — connections are soft-deleted only in prod. The cascade UPDATE on events.connection_id is ~13s per call at current scale and is the issue at rank #8 in pg_stat_statements. The connection-creation rollback path that hard-deletes never-activated rows (no events yet, no cascade) is the only acceptable use.
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughMigration drops four event-related indexes; schema.sql is updated to remove several event indexes and record the migration. Docs add a connections soft-delete ops policy. Connection list omits per-row event_count. packages/web submodule pointer advanced. ChangesDatabase Schema and Index Consolidation
Connection Query Optimization and Operations Policy
Web Submodule Update
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90dbc38393
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| DROP INDEX IF EXISTS public.idx_events_embedding; | ||
| DROP INDEX IF EXISTS public.idx_events_raw_content_trgm; | ||
| DROP INDEX IF EXISTS public.idx_events_search_tsv; |
There was a problem hiding this comment.
Keep the search candidate indexes available
In orgs with embedded content, search_memory still opts into the bounded candidate path (packages/server/src/tools/search.ts:257-268), and that path explicitly relies on these three indexes for the vector, full-text, and trigram branches (packages/server/src/utils/content-search.ts:1707-1731). After dropping them, those branches become table scans over event_embeddings/events; the candidate path has a 6s statement timeout and catches failures by returning an empty content list, so memory recall can silently lose content rather than just run a little slower. Either keep these indexes or disable/replace the candidate path before removing them.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@db/migrations/20260517010000_drop_unused_indexes.sql`:
- Around line 15-23: The migration file 20260517010000_drop_unused_indexes.sql
performs plain DROP INDEX operations that can block prod; add a fail-fast guard
at the top of this SQL that prevents execution unless an explicit opt-in session
flag is set (e.g. check current_setting('app.allow_blocking_index_drop', true)
or similar) and RAISE EXCEPTION with a clear message if the flag is absent or
false, so accidental runs of the DROP INDEX statements abort immediately;
reference the file name and the DROP INDEX statements when making the change.
In `@packages/web`:
- Line 1: The packages/web submodule is pinned to a commit not reachable from
owletto-web/main; update the submodule reference in packages/web to point at a
commit on owletto-web/main (if this PR depends on owletto-web#136, first merge
that PR into owletto-web/main), then update the submodule pointer to the
resulting main commit (e.g., fetch/checkout owletto-web/main, get the target
commit SHA, and update packages/web to that SHA) so the Submodule Drift job no
longer rejects it.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 7bcfe2ca-547b-4167-be60-eaed66eedac4
📒 Files selected for processing (5)
db/migrations/20260517010000_drop_unused_indexes.sqldb/schema.sqldocs/MIGRATIONS.mdpackages/server/src/tools/admin/manage_connections.tspackages/web
| -- The migration uses plain `DROP INDEX` (not CONCURRENTLY) because | ||
| -- dbmate's `transaction:false` directive doesn't actually exit the | ||
| -- transaction block when running against the `pq` driver — see the | ||
| -- comment in 20260426130001_db_integrity_cleanup_concurrent.sql. | ||
| -- Operator runbook: for prod application, run `DROP INDEX CONCURRENTLY` | ||
| -- manually first (see docs/MIGRATIONS.md "When dbmate fails in prod" | ||
| -- → transaction:false recipe), then run dbmate to record this row in | ||
| -- schema_migrations. On fresh installs / CI / dev the events table is | ||
| -- empty so the brief ACCESS EXCLUSIVE is irrelevant. |
There was a problem hiding this comment.
Add a fail-fast guard to prevent accidental blocking DROP INDEX in prod.
This migration depends on an out-of-band manual step, but SQL does not enforce that precondition. If someone runs dbmate directly, the plain DROP INDEX path can still hit hot tables.
🛡️ Suggested guard before the DROP statements
+DO $$
+BEGIN
+ IF EXISTS (
+ SELECT 1
+ FROM pg_class i
+ JOIN pg_namespace n ON n.oid = i.relnamespace
+ WHERE n.nspname = 'public'
+ AND i.relname = ANY (ARRAY[
+ 'idx_events_embedding',
+ 'idx_events_raw_content_trgm',
+ 'idx_events_search_tsv',
+ 'idx_events_entity_ids_occurred_at',
+ 'idx_events_origin_parent_id',
+ 'idx_events_thread_lookup',
+ 'idx_events_run_id',
+ 'idx_events_type'
+ ])
+ )
+ AND EXISTS (SELECT 1 FROM public.events LIMIT 1)
+ THEN
+ RAISE EXCEPTION
+ 'Precondition failed: run DROP INDEX CONCURRENTLY out-of-band first, then rerun dbmate to record migration 20260517010000.';
+ END IF;
+END
+$$;
+
DROP INDEX IF EXISTS public.idx_events_embedding;
DROP INDEX IF EXISTS public.idx_events_raw_content_trgm;
DROP INDEX IF EXISTS public.idx_events_search_tsv;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@db/migrations/20260517010000_drop_unused_indexes.sql` around lines 15 - 23,
The migration file 20260517010000_drop_unused_indexes.sql performs plain DROP
INDEX operations that can block prod; add a fail-fast guard at the top of this
SQL that prevents execution unless an explicit opt-in session flag is set (e.g.
check current_setting('app.allow_blocking_index_drop', true) or similar) and
RAISE EXCEPTION with a clear message if the flag is absent or false, so
accidental runs of the DROP INDEX statements abort immediately; reference the
file name and the DROP INDEX statements when making the change.
| @@ -1 +1 @@ | |||
| Subproject commit c390103ed009f00f91fb5547a811235c914dd3d8 | |||
| Subproject commit 00b249cafd7c239b633193cd9b5edcd2fad02166 | |||
There was a problem hiding this comment.
Re-pin this submodule to a commit reachable from owletto-web/main.
Line 1 currently points at a SHA that the Submodule Drift job rejects as off-main, and the failure text says FluxCD deploy would break. If this PR is meant to consume owletto-web#136, merge that first and then update packages/web to the resulting main commit before landing this change.
🧰 Tools
🪛 GitHub Actions: Submodule Drift / 0_check-drift.txt
[error] 1-1: Pinned SHA $PINNED is not reachable from owletto-web/main. FluxCD deploy would break because parent pinned commit is off-main.
🪛 GitHub Actions: Submodule Drift / check-drift
[error] 1-1: Pinned SHA $PINNED is not reachable from owletto-web/main. Error emitted: 'Pinned SHA $PINNED is not reachable from owletto-web/main.'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/web` at line 1, The packages/web submodule is pinned to a commit not
reachable from owletto-web/main; update the submodule reference in packages/web
to point at a commit on owletto-web/main (if this PR depends on owletto-web#136,
first merge that PR into owletto-web/main), then update the submodule pointer to
the resulting main commit (e.g., fetch/checkout owletto-web/main, get the target
commit SHA, and update packages/web to that SHA) so the Submodule Drift job no
longer rejects it.
CI's pg_dump produces plain NOT NULL form (vs PG18's named CONSTRAINTs). The earlier commit's schema.sql had the CONSTRAINT form leaked in from local regen, failing the schema-drift check. Restored from origin/main + re-applied only the intended 4-index removals. Web submodule bump: the earlier pin (perf/connections-detail-event-count- fallback branch tip) wasn't reachable from web/main per the strict reachability check. Re-point to current web/main HEAD.
The previous fixup commit updated db/schema.sql to match the reduced 4-index scope but missed committing the migration file itself, so HEAD still had the original 8-index DROP statements. CI's dbmate ran the 8-drop migration but schema.sql only removed 4 → drift fail. This commits the matching migration content.
* fix: close monitoring + deploy gaps from post-incident audit Three gaps identified by the audit after #771/#772 landed (1914 errors/5min in stdout, zero Sentry issues, single-replica Recreate deploy strategy = ~30s "no available server" window per rollout): * **Sentry forwarding from pino** (utils/logger.ts). The existing Sentry capture middleware only fires on HTTP 500 responses; `logger.error()` from background jobs (CheckDueFeeds, runs queue, scheduled tasks) was invisible to monitoring. Now a pino destination inspects each log line and forwards `level >= error` to `Sentry.captureException` (with stack reconstruction) or `captureMessage`. In-process dedupe by (msg, err.type, top stack frame) with a 60s window prevents the 380/min repeating-error pattern from saturating Sentry — Sentry has its own grouping but every captureException still incurs an HTTP call. * **Orphan-feed cleanup + graceful skip** (queue-helpers.ts + 20260517020000_softdelete_orphan_feeds.sql). The 'website' connector was archived/uninstalled in some orgs but their feeds weren't soft-deleted. `createSyncRunWithClient` threw on every CheckDueFeeds tick when no active connector_definition existed for the (key, org) pair — produced the 380/min error stream that masked real issues. Code change: warn + skip instead of throw, so future orphans don't spam logs. Migration: one-shot soft-delete of feeds matching the orphan criteria (no pinned_version, active connection, no active connector_definition for the org). * **PreStop drain + grace period** (charts/lobu/templates/ deployment.yaml, values.yaml). Single-replica gateway with `strategy: Recreate` (workspaces PVC is RWO) means every deploy has a ~30s window where the old pod is gone but the new one isn't ready — Cloudflare returns "no available server" to live traffic during this window. The 2026-05-16 incident felt extreme because the pod was actually crash-looping; the same shape happens briefly on every healthy deploy. PreStop hook (15s sleep, configurable via app.preStopDelaySeconds) gives Service endpoint deregistration time to propagate before SIGTERM, so requests in-flight during the drain still hit a live process. terminationGracePeriodSeconds bumped to 45 to cover preStop + graceful shutdown (~5s). Full elimination needs RWX storage for workspaces so we can roll instead of recreate — documented in the deployment.yaml strategy comment. * fix(audit): address pi review on #775 Six findings, all addressed: * **#1 preStop on Recreate makes the gap WORSE.** Pi caught the reversal — under `strategy: Recreate` the new pod doesn't start until the old one fully terminates, so adding a preStop sleep EXTENDS the no-available-server window rather than shrinking it. Default `preStopDelaySeconds` to 0 and only emit the lifecycle hook when > 0. Document that ops repos should set it only with RollingUpdate (which needs RWX storage on workspaces — out of scope here). * **#2 migration too broad.** Tightened WHERE to require `feeds.status='active'` and `connections.status='active'` so we match exactly the set CheckDueFeeds would process. Paused / pending_auth / error feeds are left alone — they don't contribute to the error spam and may be intentionally recoverable. * **#3 Sentry double-report.** `server.ts onError` already calls `Sentry.captureException` then `logger.error`; the new logger forwarder would have sent the same event again. Added `sentryReported: true` marker on that log line; logger transport skips forwarding when it sees the marker. * **#4 dedupe fingerprint too coarse.** Added `err.message` to the fingerprint so distinct messages from the same catch site (same Error type, same top stack frame) get distinct fingerprints within the 60s window. * **#5 existing string-error call sites are pre-existing tech debt** — left as-is for now. Documented in the PR body as a follow-up; the new forwarder handles `err` / `error` Error-object payloads well, and call sites with stringified errors are a separate cleanup pass. * **#6 warn-skip leaves future orphans being polled forever.** Now `createSyncRunWithClient` soft-deletes the feed in-place when no active connector_definition is found, so CheckDueFeeds stops selecting it. Operators recover by clearing `deleted_at` after reinstalling the definition. * chore: bump web submodule to current main for drift check
Why
Post-incident perf brainstorm with the user picked the verified-savings items. Three bundled here:
Index drops
Drops these 8 indexes (all confirmed 0 scans in 28h):
`pg_stat_statements` over the same 28h confirms no queries hit the shapes these indexes serve: no `<->` / `<=>` / `<#>` vector ops, no `payload_text ILIKE`, no `@@ to_tsquery`. The code paths in `packages/server/src/utils/content-search.ts` exist but aren't being called in prod today. If/when they start, queries fall back to seq-scan + filter until the index is rebuilt CONCURRENTLY.
The migration uses plain `DROP INDEX` (not CONCURRENTLY) because dbmate's `transaction:false` directive doesn't actually exit the transaction block against the `pq` driver. The operator runbook in `docs/MIGRATIONS.md` covers applying CONCURRENTLY out-of-band first; on fresh installs / CI / dev the events table is empty so the brief ACCESS EXCLUSIVE is irrelevant.
event_count
`manage_connections.handleList` drops the per-row `event_count` subselect. EXPLAIN ANALYZE on prod shows the cost was entirely in the supersedes anti-join through `current_event_records` (48% of `events` rows are tombstones; the anti-join probes 690k rows per page). Without it: 1303ms → 2.3ms (566×).
`handleGet` (single connection detail) still computes `event_count` — that path is one row and costs ~1.2ms.
Web side (owletto-web#136) makes `ConnectionItem.event_count` optional and switches the events landing's `totalEvents` stat to sum the 14-day per-day sparkline series instead of summing per-connection counts. Same data the chart already shows; semantics shift from "all-time live total" to "last 14 days."
Soft-delete policy doc
Appends to `docs/MIGRATIONS.md` cascade section: connections are never hard-deleted in prod. The `events_connection_id_fkey ... ON DELETE SET NULL` cascade is rank #8 in pg_stat_statements (5 calls × 13.4s each). Soft-deleted connections cost ~50 bytes apiece; the occasional accumulation isn't worth the recurring stall. The connection-creation rollback path that hard-deletes never-activated rows (no events yet, no cascade) is the only acceptable use.
Merge order
Operator runbook for the index drops in prod
```sh
DB="$(... pull DATABASE_URL ...)"
Time the biggest one first to validate headroom
psql "$DB" <<'SQL'
SET statement_timeout = 0;
SET lock_timeout = 0;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_embedding;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_raw_content_trgm;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_search_tsv;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_entity_ids_occurred_at;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_origin_parent_id;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_thread_lookup;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_run_id;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_type;
BEGIN;
INSERT INTO public.schema_migrations(version) VALUES ('20260517010000');
COMMIT;
SQL
```
Test plan
Summary by CodeRabbit
Performance Improvements
Chores
Documentation