Skip to content

perf(events): has_embedding column + maintenance triggers#770

Closed
buremba wants to merge 1 commit into
mainfrom
perf/events-has-embedding-column
Closed

perf(events): has_embedding column + maintenance triggers#770
buremba wants to merge 1 commit into
mainfrom
perf/events-has-embedding-column

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 16, 2026

Why

Postmortem measurement (pg_stat_statements, see #767): the embed-backfill scheduler is the #1 burner — 564s total / 1274ms mean × 443 calls. Each tick does a parallel seq scan over 1.15M `events` rows + hash anti-join against 1.08M `event_embeddings` rows, just to find the ~25k events without embeddings.

```
Parallel Seq Scan on events e (rows=1029034, t=152ms)
Filter: payload_text IS NOT NULL AND payload_text <> ''
Parallel Hash Anti Join (events e ⋈ event_embeddings emb)
rows=8355 of 1.03M scanned per worker × 3 workers
```

The fix: denormalize "does this event have an embedding?" as a boolean on `events`, maintained by trigger. The scheduler then becomes a tiny partial-index range scan over the ~25k pending rows instead of a 1.15M-row table scan.

What's in this PR (Phase 1 of 3)

  • Migration `20260517000000_events_has_embedding.sql`:
    • `ADD COLUMN has_embedding boolean` — nullable, no default, no rewrite (O(1) metadata flip per PG 11+). Specifically not `GENERATED STORED` — that's the trap that broke #765. Follows the pattern documented in #769.
    • Triggers on `event_embeddings`: AFTER INSERT flips `events.has_embedding = true`; AFTER DELETE flips back to `false`.
    • No partial index yet — column is mostly NULL until backfill runs, so an index would be empty.
  • Backfill script `scripts/backfill-events-has-embedding.sh`:
    • Batched UPDATE (10k rows per txn, configurable). Each batch bounded so VACUUM keeps up.
    • Idempotent: only touches `has_embedding IS NULL` rows; safe to resume after Ctrl-C.
    • Runs outside the Helm hook (it's a one-time prod op, not an automated migration step).

No app code changes. The scheduler still uses the existing seq-scan path until the column is fully backfilled and verified in prod.

Phases 2 + 3 (follow-up PRs)

  • PR 4: run the backfill script against prod, then `CREATE INDEX CONCURRENTLY idx_events_pending_embedding ON events (organization_id, id) WHERE NOT has_embedding AND payload_text IS NOT NULL AND payload_text <> ''`.
  • PR 5: rewrite `triggerEmbedBackfill` to query the partial index. Drop the seq-scan path. Verify EXPLAIN ANALYZE before/after.

Sequencing matters: the index can't be added until the column has values, the code can't switch over until the index exists, and each step is independently verifiable. Bundling them would mean a single rollout window with three things to revert.

Validation

Tested the migration in a `BEGIN; ... ROLLBACK` against the prod schema:

  • Column adds (column_name=has_embedding, data_type=boolean, nullable).
  • Both triggers register (`trg_event_embeddings_after_insert` AFTER INSERT, `trg_event_embeddings_after_delete` AFTER DELETE).
  • Insert a fresh embedding for an event with no existing embedding → `has_embedding` flips from `NULL` to `TRUE`.
  • Delete that embedding → `has_embedding` flips to `FALSE`.
  • ROLLBACK: prod state unchanged.

`make build-packages` + `make typecheck` clean.

Risks + mitigations

  • Trigger overhead on event_embeddings inserts. event_embeddings INSERT is pg_stat_statements rank feat: Add comprehensive architecture documentation #11 (38s / 135k calls / 0.3ms mean). The trigger adds one UPDATE per insert. Estimate: +1-2ms per insert. On steady throughput (~5/sec) that's negligible; on a backfill burst (1k inserts at once) it adds ~1-2s. Acceptable.
  • Backfill script bottlenecking other writes. Each batch is a single 10k-row UPDATE that takes a SHARE ROW EXCLUSIVE lock per row. `FOR UPDATE SKIP LOCKED` in the picker means concurrent triggers/inserts don't block. Worst case: a single batch takes a few seconds and other writes wait briefly.
  • NULL semantic during backfill. Until backfill completes, old rows have `has_embedding = NULL`. No code reads the column yet, so no behavior change. Phase 2's partial index uses `WHERE NOT has_embedding` which excludes NULLs, but that matches the pre-fix semantic ("rows we haven't proven have an embedding"); PR 5 just adds a guard that the partial-index path is only used after backfill finished.

Summary by CodeRabbit

  • Chores
    • Enhanced event embedding tracking with automatic status recording. Database-level mechanisms now maintain consistency between events and their associated embeddings.

Review Change Stack

Phase 1 of the embed-backfill speedup. Adds a `events.has_embedding`
boolean column and the AFTER INSERT / AFTER DELETE triggers on
`event_embeddings` that keep it in sync. No code change yet — the
scheduler still does the seq-scan + hash anti-join until the column
is fully backfilled and a follow-up swaps it for a partial-index lookup.

The embed-backfill scheduler is the #1 burner in pg_stat_statements:
1.27s mean × 443 calls = 564s of DB time, dominated by a parallel seq
scan over 1.15M `events` rows + hash anti-join against 1.08M
`event_embeddings` rows. With `has_embedding` + a partial index
(WHERE NOT has_embedding AND payload_text…), it becomes a small
index range scan over the actually-missing rows (~25k of 1.15M).

Migration is **deliberately minimal**:

* `ADD COLUMN has_embedding boolean` (nullable, no DEFAULT) — O(1)
  metadata flip in PG 11+. Specifically NOT GENERATED STORED to avoid
  the 1.15M-row rewrite trap that broke #765 / triggered the 2026-05-16
  outage. Pattern follows docs/MIGRATIONS.md.
* Triggers on `event_embeddings`, not `events`: cheaper (event_embeddings
  is 1.08M rows but only the inserts/deletes pay) and gives us the
  invariant for free regardless of how events are inserted.
* No partial index yet. The column is mostly NULL until the backfill
  script runs, so an index would be useless.
* No backfill in the migration itself — `scripts/backfill-events-has-embedding.sh`
  runs in 10k-row batches outside the Helm hook. Idempotent (skips
  rows where has_embedding IS NULL is already false).

Validated against prod schema in a BEGIN/ROLLBACK txn: column adds, both
triggers fire, INSERT flips to true, DELETE flips to false. No prod
state changed.

Next PRs:
* PR 4: run the backfill script in prod, then CREATE INDEX CONCURRENTLY
  on (organization_id, id) WHERE NOT has_embedding AND payload_text IS
  NOT NULL AND payload_text <> ''.
* PR 5: rewrite triggerEmbedBackfill to query the partial index.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

📝 Walkthrough

Walkthrough

This PR introduces automatic embedding status tracking on events. A new has_embedding column is added to the events table with triggers that keep it synchronized when embeddings are inserted or deleted, followed by a batch backfill script to populate the flag for existing events.

Changes

Event Embedding Tracking

Layer / File(s) Summary
Schema and trigger setup for embedding tracking
db/migrations/20260517000000_events_has_embedding.sql
Adds nullable has_embedding column to public.events; defines event_embeddings_after_insert() and event_embeddings_after_delete() trigger functions that automatically update the flag when embeddings are added or removed; creates corresponding row-level triggers; includes reversible down migration.
Backfill existing events with embedding status
scripts/backfill-events-has-embedding.sh
Batch backfill script that idempotently populates has_embedding for existing events by left-joining against event_embeddings table; uses FOR UPDATE SKIP LOCKED for safe concurrent execution; supports configurable batch size and inter-batch delays; reports progress periodically and exits safely when complete or interrupted.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A flag to track embeddings true and clear,
Triggers dance when vectors appear—
Insert brings light, delete brings dusk,
Backfill batches earn our trust.
hop hop hop 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: adding a has_embedding column with trigger-based maintenance for performance.
Description check ✅ Passed The description covers the motivation, technical implementation, validation testing, and risk analysis. However, the required test plan section is missing.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/events-has-embedding-column

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c898a62cc3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

-- partial index (next migration) treats NULL the same as FALSE so the
-- scheduler keeps working.

ALTER TABLE public.events ADD COLUMN has_embedding boolean;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Set new events to missing embeddings

Because the new column is nullable with no default and only event_embeddings has maintenance triggers, any event inserted after the one-time backfill finishes but before its embedding row exists will keep has_embedding = NULL. The planned partial-index path uses WHERE NOT has_embedding (and the current scheduler's missing-embedding predicate is in packages/server/src/scheduled/trigger-embed-backfill.ts), so those new NULL rows would be excluded rather than queued for embedding. Add a default/insert-side maintenance path so freshly created events start as false until the embedding insert flips them to true.

Useful? React with 👍 / 👎.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@db/migrations/20260517000000_events_has_embedding.sql`:
- Line 16: Update the stale comment in the migration file that references
"scripts/backfill-events-has-embedding.sql" so it reflects the actual script
added by this PR ("scripts/backfill-events-has-embedding.sh"); locate the
comment text in db/migrations/20260517000000_events_has_embedding.sql (the line
containing scripts/backfill-events-has-embedding.sql) and change the file
extension in the comment to .sh to match the new script name.

In `@scripts/backfill-events-has-embedding.sh`:
- Around line 27-29: Validate and normalize the BATCH_SIZE and
SLEEP_BETWEEN_BATCHES env vars before they are interpolated into SQL or passed
to sleep: ensure BATCH_SIZE is a positive integer (>=1) and
SLEEP_BETWEEN_BATCHES is a non-negative number (float allowed), otherwise set
them to safe defaults or exit with an error; perform this check immediately
after their assignment (the BATCH_SIZE and SLEEP_BETWEEN_BATCHES variables at
the top of the script) so the sanitized values are used in the SQL LIMIT and the
sleep call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f58b4da3-5be1-4f52-b6c0-3e5a864ffe19

📥 Commits

Reviewing files that changed from the base of the PR and between 71e9b0f and c898a62.

📒 Files selected for processing (2)
  • db/migrations/20260517000000_events_has_embedding.sql
  • scripts/backfill-events-has-embedding.sh

-- DEFAULT would rewrite all 1.15M rows under ACCESS EXCLUSIVE (the same
-- trap that timed out 20260516200000_events_search_tsv).
-- * not backfilled here: existing rows are populated by a batched script
-- (scripts/backfill-events-has-embedding.sql) that runs outside the
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix stale script path in migration comment (Line 16).

The comment points to scripts/backfill-events-has-embedding.sql, but this PR adds scripts/backfill-events-has-embedding.sh.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@db/migrations/20260517000000_events_has_embedding.sql` at line 16, Update the
stale comment in the migration file that references
"scripts/backfill-events-has-embedding.sql" so it reflects the actual script
added by this PR ("scripts/backfill-events-has-embedding.sh"); locate the
comment text in db/migrations/20260517000000_events_has_embedding.sql (the line
containing scripts/backfill-events-has-embedding.sql) and change the file
extension in the comment to .sh to match the new script name.

Comment on lines +27 to +29
BATCH_SIZE="${BATCH_SIZE:-10000}"
SLEEP_BETWEEN_BATCHES="${SLEEP_BETWEEN_BATCHES:-0.1}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate batch/sleep env vars before using them in SQL and sleep.

BATCH_SIZE is interpolated into LIMIT (Lines 53-54). A bad value causes runtime SQL errors (or unsafe huge batches). Add numeric/positive guards up front.

🛠️ Suggested guardrails
 BATCH_SIZE="${BATCH_SIZE:-10000}"
 SLEEP_BETWEEN_BATCHES="${SLEEP_BETWEEN_BATCHES:-0.1}"
+
+if ! [[ "$BATCH_SIZE" =~ ^[0-9]+$ ]] || [ "$BATCH_SIZE" -le 0 ]; then
+  echo "BATCH_SIZE must be a positive integer" >&2
+  exit 1
+fi
+
+if ! [[ "$SLEEP_BETWEEN_BATCHES" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
+  echo "SLEEP_BETWEEN_BATCHES must be a non-negative number" >&2
+  exit 1
+fi

Also applies to: 49-55

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/backfill-events-has-embedding.sh` around lines 27 - 29, Validate and
normalize the BATCH_SIZE and SLEEP_BETWEEN_BATCHES env vars before they are
interpolated into SQL or passed to sleep: ensure BATCH_SIZE is a positive
integer (>=1) and SLEEP_BETWEEN_BATCHES is a non-negative number (float
allowed), otherwise set them to safe defaults or exit with an error; perform
this check immediately after their assignment (the BATCH_SIZE and
SLEEP_BETWEEN_BATCHES variables at the top of the script) so the sanitized
values are used in the SQL LIMIT and the sleep call.

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 16, 2026

Closing per design feedback: trigger-maintained boolean on events conflicts with the append-only rule in CLAUDE.md/AGENTS.md. Reopening as a sidecar-table proposal or high-watermark scheduler in a follow-up after brainstorm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants