Skip to content

perf: drop 8 unused indexes (5.16 GB) + event_count from list#771

Merged
buremba merged 3 commits into
mainfrom
perf/drop-unused-indexes-and-event-count
May 16, 2026
Merged

perf: drop 8 unused indexes (5.16 GB) + event_count from list#771
buremba merged 3 commits into
mainfrom
perf/drop-unused-indexes-and-event-count

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 16, 2026

Why

Post-incident perf brainstorm with the user picked the verified-savings items. Three bundled here:

Fix Today After Source
Drop 8 unused indexes 5.16 GB disk + RAM + write amp 0 `pg_stat_user_indexes.idx_scan = 0` for 28h
Drop event_count from list 1303ms mean 2.3ms EXPLAIN ANALYZE on prod
Soft-delete-only policy doc Recurring 13s API stall n/a rank #8 in pg_stat_statements

Index drops

Drops these 8 indexes (all confirmed 0 scans in 28h):

Index Size Type
`idx_events_embedding` 4.3 GB ivfflat ANN on `event_embeddings`
`idx_events_raw_content_trgm` 554 MB trigram GIN on `events.payload_text`
`idx_events_search_tsv` 228 MB fulltext GIN (added today; never used)
`idx_events_entity_ids_occurred_at` 37 MB partial btree
`idx_events_origin_parent_id` 11 MB plain btree
`idx_events_thread_lookup` 10 MB partial btree
`idx_events_run_id` 8 MB plain btree
`idx_events_type` 8 MB partial btree

`pg_stat_statements` over the same 28h confirms no queries hit the shapes these indexes serve: no `<->` / `<=>` / `<#>` vector ops, no `payload_text ILIKE`, no `@@ to_tsquery`. The code paths in `packages/server/src/utils/content-search.ts` exist but aren't being called in prod today. If/when they start, queries fall back to seq-scan + filter until the index is rebuilt CONCURRENTLY.

The migration uses plain `DROP INDEX` (not CONCURRENTLY) because dbmate's `transaction:false` directive doesn't actually exit the transaction block against the `pq` driver. The operator runbook in `docs/MIGRATIONS.md` covers applying CONCURRENTLY out-of-band first; on fresh installs / CI / dev the events table is empty so the brief ACCESS EXCLUSIVE is irrelevant.

event_count

`manage_connections.handleList` drops the per-row `event_count` subselect. EXPLAIN ANALYZE on prod shows the cost was entirely in the supersedes anti-join through `current_event_records` (48% of `events` rows are tombstones; the anti-join probes 690k rows per page). Without it: 1303ms → 2.3ms (566×).

`handleGet` (single connection detail) still computes `event_count` — that path is one row and costs ~1.2ms.

Web side (owletto-web#136) makes `ConnectionItem.event_count` optional and switches the events landing's `totalEvents` stat to sum the 14-day per-day sparkline series instead of summing per-connection counts. Same data the chart already shows; semantics shift from "all-time live total" to "last 14 days."

Soft-delete policy doc

Appends to `docs/MIGRATIONS.md` cascade section: connections are never hard-deleted in prod. The `events_connection_id_fkey ... ON DELETE SET NULL` cascade is rank #8 in pg_stat_statements (5 calls × 13.4s each). Soft-deleted connections cost ~50 bytes apiece; the occasional accumulation isn't worth the recurring stall. The connection-creation rollback path that hard-deletes never-activated rows (no events yet, no cascade) is the only acceptable use.

Merge order

  1. owletto-web#136 first (frontend type change + events-page sum)
  2. This PR (parent already bumped the submodule pointer; will be reachable from web/main once feat(worker): token usage and cost tracking #136 lands)

Operator runbook for the index drops in prod

```sh
DB="$(... pull DATABASE_URL ...)"

Time the biggest one first to validate headroom

psql "$DB" <<'SQL'
SET statement_timeout = 0;
SET lock_timeout = 0;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_embedding;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_raw_content_trgm;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_search_tsv;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_entity_ids_occurred_at;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_origin_parent_id;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_thread_lookup;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_run_id;
DROP INDEX CONCURRENTLY IF EXISTS public.idx_events_type;
BEGIN;
INSERT INTO public.schema_migrations(version) VALUES ('20260517010000');
COMMIT;
SQL
```

Test plan

  • `make typecheck` clean
  • `make build-packages` builds
  • EXPLAIN ANALYZE on prod confirms 1303ms → 2.3ms for the connections-list query
  • `pg_stat_user_indexes` confirms 0 scans on all 8 indexes over 28h
  • Migration applies cleanly against a fresh PG (verified via local docker)

Summary by CodeRabbit

  • Performance Improvements

    • Admin connections list loads faster by removing costly per-row event count calculations (detail view still shows counts).
  • Chores

    • Database migration that drops several unused indexes and updates migration tracking; dropped indexes can be restored on rollback.
    • Subproject reference updated to a newer commit.
  • Documentation

    • Operations guidance added clarifying connection-deletion and rollback behavior in migrations.

Review Change Stack

Three changes bundled around the post-incident perf brainstorm, each
verified against prod:

* db/migrations: drops 8 indexes that pg_stat_user_indexes reports
  idx_scan=0 after 28h of prod uptime — `idx_events_embedding` (4.3 GB
  ivfflat ANN), `idx_events_raw_content_trgm` (554 MB trigram GIN),
  `idx_events_search_tsv` (228 MB fulltext GIN), and 5 smaller ones.
  pg_stat_statements over the same window confirms zero queries touch
  the shapes they serve (<->, <=>, payload_text ILIKE, @@ to_tsquery).
  Search code paths in content-search.ts exist but aren't called in
  prod today; rebuild CONCURRENTLY if/when needed.

  Migration uses plain DROP INDEX rather than CONCURRENTLY because
  dbmate's transaction:false directive doesn't actually exit the
  transaction block against the pq driver (see comment in
  20260426130001_db_integrity_cleanup_concurrent.sql). The operator
  runbook in docs/MIGRATIONS.md "When dbmate fails in prod" covers
  applying CONCURRENTLY out-of-band first, then recording the
  schema_migrations row.

* manage_connections.ts handleList: removes the per-row event_count
  subselect. The supersedes anti-join through current_event_records
  was the entire cost of that query — verified by EXPLAIN ANALYZE
  against prod, 1303ms → 2.3ms (566x). handleGet (single connection
  detail) still computes it; that path is one row and costs ~1.2ms.
  Submodule bump in packages/web pulls in the matching frontend
  changes (owletto-web#136).

* docs/MIGRATIONS.md: appends a Lobu-specific policy paragraph to the
  cascade section — connections are soft-deleted only in prod. The
  cascade UPDATE on events.connection_id is ~13s per call at current
  scale and is the issue at rank #8 in pg_stat_statements. The
  connection-creation rollback path that hard-deletes never-activated
  rows (no events yet, no cascade) is the only acceptable use.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: eb830c57-5a9c-407c-9926-48a677cc5070

📥 Commits

Reviewing files that changed from the base of the PR and between fdf1961 and 8a6f3b1.

📒 Files selected for processing (1)
  • db/migrations/20260517010000_drop_unused_indexes.sql

📝 Walkthrough

Walkthrough

Migration drops four event-related indexes; schema.sql is updated to remove several event indexes and record the migration. Docs add a connections soft-delete ops policy. Connection list omits per-row event_count. packages/web submodule pointer advanced.

Changes

Database Schema and Index Consolidation

Layer / File(s) Summary
Index cleanup migration
db/migrations/20260517010000_drop_unused_indexes.sql
Migration drops four event-related indexes in migrate:up and recreates them in migrate:down.
Schema output updates and migration record
db/schema.sql
db/schema.sql updated to remove the composite entity_ids+occurred_at index and several other public.events indexes; public.schema_migrations seed adds 20260517010000.

Connection Query Optimization and Operations Policy

Layer / File(s) Summary
Connection list query and ops docs
packages/server/src/tools/admin/manage_connections.ts, docs/MIGRATIONS.md
Connections list no longer selects per-row event_count (detail still computes it); docs add a policy that connections are soft-deleted in prod and only never-activated rollbacks may be hard-deleted.

Web Submodule Update

Layer / File(s) Summary
Web submodule pointer
packages/web
Advance packages/web subproject Git commit reference to a new revision.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • lobu-ai/lobu#765: This PR drops event indexes including idx_events_search_tsv which were added/used in that related PR.

Poem

🐰 I nudge old indexes out of sight,
I tidy schemas late at night,
Lists run lighter, docs explain,
A tiny web pointer moves its train—
Hop, patch, and everything's more spry.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main changes: dropping 8 unused indexes (5.16 GB) and removing event_count from the connections list query.
Description check ✅ Passed The description is comprehensive and well-structured, covering Why, Index drops, event_count changes, soft-delete policy doc, merge order, operator runbook, and test plan with checkboxes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/drop-unused-indexes-and-event-count

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90dbc38393

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +25 to +27
DROP INDEX IF EXISTS public.idx_events_embedding;
DROP INDEX IF EXISTS public.idx_events_raw_content_trgm;
DROP INDEX IF EXISTS public.idx_events_search_tsv;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep the search candidate indexes available

In orgs with embedded content, search_memory still opts into the bounded candidate path (packages/server/src/tools/search.ts:257-268), and that path explicitly relies on these three indexes for the vector, full-text, and trigram branches (packages/server/src/utils/content-search.ts:1707-1731). After dropping them, those branches become table scans over event_embeddings/events; the candidate path has a 6s statement timeout and catches failures by returning an empty content list, so memory recall can silently lose content rather than just run a little slower. Either keep these indexes or disable/replace the candidate path before removing them.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@db/migrations/20260517010000_drop_unused_indexes.sql`:
- Around line 15-23: The migration file 20260517010000_drop_unused_indexes.sql
performs plain DROP INDEX operations that can block prod; add a fail-fast guard
at the top of this SQL that prevents execution unless an explicit opt-in session
flag is set (e.g. check current_setting('app.allow_blocking_index_drop', true)
or similar) and RAISE EXCEPTION with a clear message if the flag is absent or
false, so accidental runs of the DROP INDEX statements abort immediately;
reference the file name and the DROP INDEX statements when making the change.

In `@packages/web`:
- Line 1: The packages/web submodule is pinned to a commit not reachable from
owletto-web/main; update the submodule reference in packages/web to point at a
commit on owletto-web/main (if this PR depends on owletto-web#136, first merge
that PR into owletto-web/main), then update the submodule pointer to the
resulting main commit (e.g., fetch/checkout owletto-web/main, get the target
commit SHA, and update packages/web to that SHA) so the Submodule Drift job no
longer rejects it.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 7bcfe2ca-547b-4167-be60-eaed66eedac4

📥 Commits

Reviewing files that changed from the base of the PR and between e5c93a3 and 90dbc38.

📒 Files selected for processing (5)
  • db/migrations/20260517010000_drop_unused_indexes.sql
  • db/schema.sql
  • docs/MIGRATIONS.md
  • packages/server/src/tools/admin/manage_connections.ts
  • packages/web

Comment on lines +15 to +23
-- The migration uses plain `DROP INDEX` (not CONCURRENTLY) because
-- dbmate's `transaction:false` directive doesn't actually exit the
-- transaction block when running against the `pq` driver — see the
-- comment in 20260426130001_db_integrity_cleanup_concurrent.sql.
-- Operator runbook: for prod application, run `DROP INDEX CONCURRENTLY`
-- manually first (see docs/MIGRATIONS.md "When dbmate fails in prod"
-- → transaction:false recipe), then run dbmate to record this row in
-- schema_migrations. On fresh installs / CI / dev the events table is
-- empty so the brief ACCESS EXCLUSIVE is irrelevant.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a fail-fast guard to prevent accidental blocking DROP INDEX in prod.

This migration depends on an out-of-band manual step, but SQL does not enforce that precondition. If someone runs dbmate directly, the plain DROP INDEX path can still hit hot tables.

🛡️ Suggested guard before the DROP statements
+DO $$
+BEGIN
+  IF EXISTS (
+    SELECT 1
+    FROM pg_class i
+    JOIN pg_namespace n ON n.oid = i.relnamespace
+    WHERE n.nspname = 'public'
+      AND i.relname = ANY (ARRAY[
+        'idx_events_embedding',
+        'idx_events_raw_content_trgm',
+        'idx_events_search_tsv',
+        'idx_events_entity_ids_occurred_at',
+        'idx_events_origin_parent_id',
+        'idx_events_thread_lookup',
+        'idx_events_run_id',
+        'idx_events_type'
+      ])
+  )
+  AND EXISTS (SELECT 1 FROM public.events LIMIT 1)
+  THEN
+    RAISE EXCEPTION
+      'Precondition failed: run DROP INDEX CONCURRENTLY out-of-band first, then rerun dbmate to record migration 20260517010000.';
+  END IF;
+END
+$$;
+
 DROP INDEX IF EXISTS public.idx_events_embedding;
 DROP INDEX IF EXISTS public.idx_events_raw_content_trgm;
 DROP INDEX IF EXISTS public.idx_events_search_tsv;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@db/migrations/20260517010000_drop_unused_indexes.sql` around lines 15 - 23,
The migration file 20260517010000_drop_unused_indexes.sql performs plain DROP
INDEX operations that can block prod; add a fail-fast guard at the top of this
SQL that prevents execution unless an explicit opt-in session flag is set (e.g.
check current_setting('app.allow_blocking_index_drop', true) or similar) and
RAISE EXCEPTION with a clear message if the flag is absent or false, so
accidental runs of the DROP INDEX statements abort immediately; reference the
file name and the DROP INDEX statements when making the change.

Comment thread packages/web Outdated
@@ -1 +1 @@
Subproject commit c390103ed009f00f91fb5547a811235c914dd3d8
Subproject commit 00b249cafd7c239b633193cd9b5edcd2fad02166
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Re-pin this submodule to a commit reachable from owletto-web/main.

Line 1 currently points at a SHA that the Submodule Drift job rejects as off-main, and the failure text says FluxCD deploy would break. If this PR is meant to consume owletto-web#136, merge that first and then update packages/web to the resulting main commit before landing this change.

🧰 Tools
🪛 GitHub Actions: Submodule Drift / 0_check-drift.txt

[error] 1-1: Pinned SHA $PINNED is not reachable from owletto-web/main. FluxCD deploy would break because parent pinned commit is off-main.

🪛 GitHub Actions: Submodule Drift / check-drift

[error] 1-1: Pinned SHA $PINNED is not reachable from owletto-web/main. Error emitted: 'Pinned SHA $PINNED is not reachable from owletto-web/main.'

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/web` at line 1, The packages/web submodule is pinned to a commit not
reachable from owletto-web/main; update the submodule reference in packages/web
to point at a commit on owletto-web/main (if this PR depends on owletto-web#136,
first merge that PR into owletto-web/main), then update the submodule pointer to
the resulting main commit (e.g., fetch/checkout owletto-web/main, get the target
commit SHA, and update packages/web to that SHA) so the Submodule Drift job no
longer rejects it.

buremba added 2 commits May 16, 2026 21:47
CI's pg_dump produces plain NOT NULL form (vs PG18's named CONSTRAINTs).
The earlier commit's schema.sql had the CONSTRAINT form leaked in from
local regen, failing the schema-drift check. Restored from origin/main +
re-applied only the intended 4-index removals.

Web submodule bump: the earlier pin (perf/connections-detail-event-count-
fallback branch tip) wasn't reachable from web/main per the strict
reachability check. Re-point to current web/main HEAD.
The previous fixup commit updated db/schema.sql to match the reduced
4-index scope but missed committing the migration file itself, so HEAD
still had the original 8-index DROP statements. CI's dbmate ran the
8-drop migration but schema.sql only removed 4 → drift fail.

This commits the matching migration content.
@buremba buremba merged commit 653566f into main May 16, 2026
20 of 22 checks passed
@buremba buremba deleted the perf/drop-unused-indexes-and-event-count branch May 16, 2026 21:05
buremba added a commit that referenced this pull request May 16, 2026
* fix: close monitoring + deploy gaps from post-incident audit

Three gaps identified by the audit after #771/#772 landed (1914
errors/5min in stdout, zero Sentry issues, single-replica Recreate
deploy strategy = ~30s "no available server" window per rollout):

* **Sentry forwarding from pino** (utils/logger.ts). The existing
  Sentry capture middleware only fires on HTTP 500 responses;
  `logger.error()` from background jobs (CheckDueFeeds, runs queue,
  scheduled tasks) was invisible to monitoring. Now a pino destination
  inspects each log line and forwards `level >= error` to
  `Sentry.captureException` (with stack reconstruction) or
  `captureMessage`. In-process dedupe by (msg, err.type, top stack
  frame) with a 60s window prevents the 380/min repeating-error
  pattern from saturating Sentry — Sentry has its own grouping but
  every captureException still incurs an HTTP call.

* **Orphan-feed cleanup + graceful skip** (queue-helpers.ts +
  20260517020000_softdelete_orphan_feeds.sql). The 'website'
  connector was archived/uninstalled in some orgs but their feeds
  weren't soft-deleted. `createSyncRunWithClient` threw on every
  CheckDueFeeds tick when no active connector_definition existed for
  the (key, org) pair — produced the 380/min error stream that
  masked real issues. Code change: warn + skip instead of throw, so
  future orphans don't spam logs. Migration: one-shot soft-delete of
  feeds matching the orphan criteria (no pinned_version, active
  connection, no active connector_definition for the org).

* **PreStop drain + grace period** (charts/lobu/templates/
  deployment.yaml, values.yaml). Single-replica gateway with
  `strategy: Recreate` (workspaces PVC is RWO) means every deploy
  has a ~30s window where the old pod is gone but the new one isn't
  ready — Cloudflare returns "no available server" to live traffic
  during this window. The 2026-05-16 incident felt extreme because
  the pod was actually crash-looping; the same shape happens briefly
  on every healthy deploy. PreStop hook (15s sleep, configurable via
  app.preStopDelaySeconds) gives Service endpoint deregistration time
  to propagate before SIGTERM, so requests in-flight during the drain
  still hit a live process. terminationGracePeriodSeconds bumped to
  45 to cover preStop + graceful shutdown (~5s). Full elimination
  needs RWX storage for workspaces so we can roll instead of recreate
  — documented in the deployment.yaml strategy comment.

* fix(audit): address pi review on #775

Six findings, all addressed:

* **#1 preStop on Recreate makes the gap WORSE.** Pi caught the
  reversal — under `strategy: Recreate` the new pod doesn't start
  until the old one fully terminates, so adding a preStop sleep
  EXTENDS the no-available-server window rather than shrinking it.
  Default `preStopDelaySeconds` to 0 and only emit the lifecycle
  hook when > 0. Document that ops repos should set it only with
  RollingUpdate (which needs RWX storage on workspaces — out of
  scope here).
* **#2 migration too broad.** Tightened WHERE to require
  `feeds.status='active'` and `connections.status='active'` so we
  match exactly the set CheckDueFeeds would process. Paused /
  pending_auth / error feeds are left alone — they don't
  contribute to the error spam and may be intentionally recoverable.
* **#3 Sentry double-report.** `server.ts onError` already calls
  `Sentry.captureException` then `logger.error`; the new logger
  forwarder would have sent the same event again. Added
  `sentryReported: true` marker on that log line; logger transport
  skips forwarding when it sees the marker.
* **#4 dedupe fingerprint too coarse.** Added `err.message` to the
  fingerprint so distinct messages from the same catch site (same
  Error type, same top stack frame) get distinct fingerprints
  within the 60s window.
* **#5 existing string-error call sites are pre-existing tech debt**
  — left as-is for now. Documented in the PR body as a follow-up;
  the new forwarder handles `err` / `error` Error-object payloads
  well, and call sites with stringified errors are a separate
  cleanup pass.
* **#6 warn-skip leaves future orphans being polled forever.** Now
  `createSyncRunWithClient` soft-deletes the feed in-place when no
  active connector_definition is found, so CheckDueFeeds stops
  selecting it. Operators recover by clearing `deleted_at` after
  reinstalling the definition.

* chore: bump web submodule to current main for drift check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants