perf(events): has_embedding column + maintenance triggers by buremba · Pull Request #770 · lobu-ai/lobu

buremba · 2026-05-16T17:14:38Z

Why

Postmortem measurement (pg_stat_statements, see #767): the embed-backfill scheduler is the #1 burner — 564s total / 1274ms mean × 443 calls. Each tick does a parallel seq scan over 1.15M `events` rows + hash anti-join against 1.08M `event_embeddings` rows, just to find the ~25k events without embeddings.

```
Parallel Seq Scan on events e (rows=1029034, t=152ms)
Filter: payload_text IS NOT NULL AND payload_text <> ''
Parallel Hash Anti Join (events e ⋈ event_embeddings emb)
rows=8355 of 1.03M scanned per worker × 3 workers
```

The fix: denormalize "does this event have an embedding?" as a boolean on `events`, maintained by trigger. The scheduler then becomes a tiny partial-index range scan over the ~25k pending rows instead of a 1.15M-row table scan.

What's in this PR (Phase 1 of 3)

Migration `20260517000000_events_has_embedding.sql`:
- `ADD COLUMN has_embedding boolean` — nullable, no default, no rewrite (O(1) metadata flip per PG 11+). Specifically not `GENERATED STORED` — that's the trap that broke #765. Follows the pattern documented in #769.
- Triggers on `event_embeddings`: AFTER INSERT flips `events.has_embedding = true`; AFTER DELETE flips back to `false`.
- No partial index yet — column is mostly NULL until backfill runs, so an index would be empty.
Backfill script `scripts/backfill-events-has-embedding.sh`:
- Batched UPDATE (10k rows per txn, configurable). Each batch bounded so VACUUM keeps up.
- Idempotent: only touches `has_embedding IS NULL` rows; safe to resume after Ctrl-C.
- Runs outside the Helm hook (it's a one-time prod op, not an automated migration step).

No app code changes. The scheduler still uses the existing seq-scan path until the column is fully backfilled and verified in prod.

Phases 2 + 3 (follow-up PRs)

PR 4: run the backfill script against prod, then `CREATE INDEX CONCURRENTLY idx_events_pending_embedding ON events (organization_id, id) WHERE NOT has_embedding AND payload_text IS NOT NULL AND payload_text <> ''`.
PR 5: rewrite `triggerEmbedBackfill` to query the partial index. Drop the seq-scan path. Verify EXPLAIN ANALYZE before/after.

Sequencing matters: the index can't be added until the column has values, the code can't switch over until the index exists, and each step is independently verifiable. Bundling them would mean a single rollout window with three things to revert.

Validation

Tested the migration in a `BEGIN; ... ROLLBACK` against the prod schema:

Column adds (column_name=has_embedding, data_type=boolean, nullable).
Both triggers register (`trg_event_embeddings_after_insert` AFTER INSERT, `trg_event_embeddings_after_delete` AFTER DELETE).
Insert a fresh embedding for an event with no existing embedding → `has_embedding` flips from `NULL` to `TRUE`.
Delete that embedding → `has_embedding` flips to `FALSE`.
ROLLBACK: prod state unchanged.

`make build-packages` + `make typecheck` clean.

Risks + mitigations

Trigger overhead on event_embeddings inserts. event_embeddings INSERT is pg_stat_statements rank feat: Add comprehensive architecture documentation #11 (38s / 135k calls / 0.3ms mean). The trigger adds one UPDATE per insert. Estimate: +1-2ms per insert. On steady throughput (~5/sec) that's negligible; on a backfill burst (1k inserts at once) it adds ~1-2s. Acceptable.
Backfill script bottlenecking other writes. Each batch is a single 10k-row UPDATE that takes a SHARE ROW EXCLUSIVE lock per row. `FOR UPDATE SKIP LOCKED` in the picker means concurrent triggers/inserts don't block. Worst case: a single batch takes a few seconds and other writes wait briefly.
NULL semantic during backfill. Until backfill completes, old rows have `has_embedding = NULL`. No code reads the column yet, so no behavior change. Phase 2's partial index uses `WHERE NOT has_embedding` which excludes NULLs, but that matches the pre-fix semantic ("rows we haven't proven have an embedding"); PR 5 just adds a guard that the partial-index path is only used after backfill finished.

Summary by CodeRabbit

Chores
- Enhanced event embedding tracking with automatic status recording. Database-level mechanisms now maintain consistency between events and their associated embeddings.

Phase 1 of the embed-backfill speedup. Adds a `events.has_embedding` boolean column and the AFTER INSERT / AFTER DELETE triggers on `event_embeddings` that keep it in sync. No code change yet — the scheduler still does the seq-scan + hash anti-join until the column is fully backfilled and a follow-up swaps it for a partial-index lookup. The embed-backfill scheduler is the #1 burner in pg_stat_statements: 1.27s mean × 443 calls = 564s of DB time, dominated by a parallel seq scan over 1.15M `events` rows + hash anti-join against 1.08M `event_embeddings` rows. With `has_embedding` + a partial index (WHERE NOT has_embedding AND payload_text…), it becomes a small index range scan over the actually-missing rows (~25k of 1.15M). Migration is **deliberately minimal**: * `ADD COLUMN has_embedding boolean` (nullable, no DEFAULT) — O(1) metadata flip in PG 11+. Specifically NOT GENERATED STORED to avoid the 1.15M-row rewrite trap that broke #765 / triggered the 2026-05-16 outage. Pattern follows docs/MIGRATIONS.md. * Triggers on `event_embeddings`, not `events`: cheaper (event_embeddings is 1.08M rows but only the inserts/deletes pay) and gives us the invariant for free regardless of how events are inserted. * No partial index yet. The column is mostly NULL until the backfill script runs, so an index would be useless. * No backfill in the migration itself — `scripts/backfill-events-has-embedding.sh` runs in 10k-row batches outside the Helm hook. Idempotent (skips rows where has_embedding IS NULL is already false). Validated against prod schema in a BEGIN/ROLLBACK txn: column adds, both triggers fire, INSERT flips to true, DELETE flips to false. No prod state changed. Next PRs: * PR 4: run the backfill script in prod, then CREATE INDEX CONCURRENTLY on (organization_id, id) WHERE NOT has_embedding AND payload_text IS NOT NULL AND payload_text <> ''. * PR 5: rewrite triggerEmbedBackfill to query the partial index.

coderabbitai · 2026-05-16T17:14:50Z

📝 Walkthrough

Walkthrough

This PR introduces automatic embedding status tracking on events. A new has_embedding column is added to the events table with triggers that keep it synchronized when embeddings are inserted or deleted, followed by a batch backfill script to populate the flag for existing events.

Changes

Event Embedding Tracking

Layer / File(s)	Summary
Schema and trigger setup for embedding tracking `db/migrations/20260517000000_events_has_embedding.sql`	Adds nullable `has_embedding` column to `public.events`; defines `event_embeddings_after_insert()` and `event_embeddings_after_delete()` trigger functions that automatically update the flag when embeddings are added or removed; creates corresponding row-level triggers; includes reversible down migration.
Backfill existing events with embedding status `scripts/backfill-events-has-embedding.sh`	Batch backfill script that idempotently populates `has_embedding` for existing events by left-joining against `event_embeddings` table; uses `FOR UPDATE SKIP LOCKED` for safe concurrent execution; supports configurable batch size and inter-batch delays; reports progress periodically and exits safely when complete or interrupted.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A flag to track embeddings true and clear,
Triggers dance when vectors appear—
Insert brings light, delete brings dusk,
Backfill batches earn our trust.
hop hop hop 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately captures the main change: adding a has_embedding column with trigger-based maintenance for performance.
Description check	✅ Passed	The description covers the motivation, technical implementation, validation testing, and risk analysis. However, the required test plan section is missing.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/events-has-embedding-column

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c898a62cc3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-16T17:16:01Z

+-- partial index (next migration) treats NULL the same as FALSE so the
+-- scheduler keeps working.
+
+ALTER TABLE public.events ADD COLUMN has_embedding boolean;


Set new events to missing embeddings

Because the new column is nullable with no default and only event_embeddings has maintenance triggers, any event inserted after the one-time backfill finishes but before its embedding row exists will keep has_embedding = NULL. The planned partial-index path uses WHERE NOT has_embedding (and the current scheduler's missing-embedding predicate is in packages/server/src/scheduled/trigger-embed-backfill.ts), so those new NULL rows would be excluded rather than queued for embedding. Add a default/insert-side maintenance path so freshly created events start as false until the embedding insert flips them to true.

Useful? React with 👍 / 👎.

codecov-commenter · 2026-05-16T17:16:14Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@db/migrations/20260517000000_events_has_embedding.sql`:
- Line 16: Update the stale comment in the migration file that references
"scripts/backfill-events-has-embedding.sql" so it reflects the actual script
added by this PR ("scripts/backfill-events-has-embedding.sh"); locate the
comment text in db/migrations/20260517000000_events_has_embedding.sql (the line
containing scripts/backfill-events-has-embedding.sql) and change the file
extension in the comment to .sh to match the new script name.

In `@scripts/backfill-events-has-embedding.sh`:
- Around line 27-29: Validate and normalize the BATCH_SIZE and
SLEEP_BETWEEN_BATCHES env vars before they are interpolated into SQL or passed
to sleep: ensure BATCH_SIZE is a positive integer (>=1) and
SLEEP_BETWEEN_BATCHES is a non-negative number (float allowed), otherwise set
them to safe defaults or exit with an error; perform this check immediately
after their assignment (the BATCH_SIZE and SLEEP_BETWEEN_BATCHES variables at
the top of the script) so the sanitized values are used in the SQL LIMIT and the
sleep call.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f58b4da3-5be1-4f52-b6c0-3e5a864ffe19

📥 Commits

Reviewing files that changed from the base of the PR and between 71e9b0f and c898a62.

📒 Files selected for processing (2)

db/migrations/20260517000000_events_has_embedding.sql
scripts/backfill-events-has-embedding.sh

coderabbitai · 2026-05-16T17:17:55Z

+--     DEFAULT would rewrite all 1.15M rows under ACCESS EXCLUSIVE (the same
+--     trap that timed out 20260516200000_events_search_tsv).
+--   * not backfilled here: existing rows are populated by a batched script
+--     (scripts/backfill-events-has-embedding.sql) that runs outside the


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix stale script path in migration comment (Line 16).

The comment points to scripts/backfill-events-has-embedding.sql, but this PR adds scripts/backfill-events-has-embedding.sh.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@db/migrations/20260517000000_events_has_embedding.sql` at line 16, Update the stale comment in the migration file that references "scripts/backfill-events-has-embedding.sql" so it reflects the actual script added by this PR ("scripts/backfill-events-has-embedding.sh"); locate the comment text in db/migrations/20260517000000_events_has_embedding.sql (the line containing scripts/backfill-events-has-embedding.sql) and change the file extension in the comment to .sh to match the new script name.

coderabbitai · 2026-05-16T17:17:55Z

+BATCH_SIZE="${BATCH_SIZE:-10000}"
+SLEEP_BETWEEN_BATCHES="${SLEEP_BETWEEN_BATCHES:-0.1}"
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate batch/sleep env vars before using them in SQL and sleep.

BATCH_SIZE is interpolated into LIMIT (Lines 53-54). A bad value causes runtime SQL errors (or unsafe huge batches). Add numeric/positive guards up front.

🛠️ Suggested guardrails

BATCH_SIZE="${BATCH_SIZE:-10000}" SLEEP_BETWEEN_BATCHES="${SLEEP_BETWEEN_BATCHES:-0.1}" + +if ! [[ "$BATCH_SIZE" =~ ^[0-9]+$ ]] || [ "$BATCH_SIZE" -le 0 ]; then + echo "BATCH_SIZE must be a positive integer" >&2 + exit 1 +fi + +if ! [[ "$SLEEP_BETWEEN_BATCHES" =~ ^[0-9]+([.][0-9]+)?$ ]]; then + echo "SLEEP_BETWEEN_BATCHES must be a non-negative number" >&2 + exit 1 +fi

Also applies to: 49-55

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/backfill-events-has-embedding.sh` around lines 27 - 29, Validate and normalize the BATCH_SIZE and SLEEP_BETWEEN_BATCHES env vars before they are interpolated into SQL or passed to sleep: ensure BATCH_SIZE is a positive integer (>=1) and SLEEP_BETWEEN_BATCHES is a non-negative number (float allowed), otherwise set them to safe defaults or exit with an error; perform this check immediately after their assignment (the BATCH_SIZE and SLEEP_BETWEEN_BATCHES variables at the top of the script) so the sanitized values are used in the SQL LIMIT and the sleep call.

buremba · 2026-05-16T17:27:06Z

Closing per design feedback: trigger-maintained boolean on events conflicts with the append-only rule in CLAUDE.md/AGENTS.md. Reopening as a sidecar-table proposal or high-watermark scheduler in a follow-up after brainstorm.

chatgpt-codex-connector Bot reviewed May 16, 2026

View reviewed changes

coderabbitai Bot reviewed May 16, 2026

View reviewed changes

buremba closed this May 16, 2026

This was referenced May 18, 2026

Perf: embed-backfill scheduler — 1,210s/window DB time on anti-join over 1.156M events #892

Open

Perf: org-root bootstrap (resolve_path) — ~3s cold cache from active-events anti-join + missing org-recent index #922

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(events): has_embedding column + maintenance triggers#770

perf(events): has_embedding column + maintenance triggers#770
buremba wants to merge 1 commit into
mainfrom
perf/events-has-embedding-column

buremba commented May 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 16, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Uh oh!

codecov-commenter commented May 16, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 16, 2026

Uh oh!

coderabbitai Bot May 16, 2026

Uh oh!

buremba commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		BATCH_SIZE="${BATCH_SIZE:-10000}"
		SLEEP_BETWEEN_BATCHES="${SLEEP_BETWEEN_BATCHES:-0.1}"

Conversation

buremba commented May 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What's in this PR (Phase 1 of 3)

Phases 2 + 3 (follow-up PRs)

Validation

Risks + mitigations

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 16, 2026

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

buremba commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

buremba commented May 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 16, 2026 •

edited

Loading