perf(events): has_embedding column + maintenance triggers #770
Phase 1 of the embed-backfill speedup. Adds an `events.has_embedding` boolean column and the AFTER INSERT / AFTER DELETE triggers on `event_embeddings` that keep it in sync. No code change yet: the scheduler still does the seq-scan + hash anti-join until the column is fully backfilled and a follow-up swaps it for a partial-index lookup.

The embed-backfill scheduler is the #1 burner in pg_stat_statements: 1.27s mean × 443 calls = 564s of DB time, dominated by a parallel seq scan over 1.15M `events` rows + hash anti-join against 1.08M `event_embeddings` rows. With `has_embedding` + a partial index (WHERE NOT has_embedding AND payload_text…), it becomes a small index range scan over the actually-missing rows (~25k of 1.15M).

Migration is **deliberately minimal**:

* `ADD COLUMN has_embedding boolean` (nullable, no DEFAULT): an O(1) metadata flip in PG 11+. Specifically NOT GENERATED STORED, to avoid the 1.15M-row rewrite trap that broke #765 / triggered the 2026-05-16 outage. Pattern follows docs/MIGRATIONS.md.
* Triggers on `event_embeddings`, not `events`: cheaper (event_embeddings is 1.08M rows but only the inserts/deletes pay) and gives us the invariant for free regardless of how events are inserted.
* No partial index yet. The column is mostly NULL until the backfill script runs, so an index would be useless.
* No backfill in the migration itself. `scripts/backfill-events-has-embedding.sh` runs in 10k-row batches outside the Helm hook. It is idempotent: it only touches rows where `has_embedding IS NULL`, so rows that already hold a value are skipped on a re-run.

Validated against prod schema in a BEGIN/ROLLBACK txn: column adds, both triggers fire, INSERT flips to true, DELETE flips to false. No prod state changed.

Next PRs:

* PR 4: run the backfill script in prod, then CREATE INDEX CONCURRENTLY on (organization_id, id) WHERE NOT has_embedding AND payload_text IS NOT NULL AND payload_text <> ''.
* PR 5: rewrite triggerEmbedBackfill to query the partial index.
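The batched, idempotent backfill described above might look roughly like the sketch below. The script itself is not shown on this page, so everything here is an assumption made for illustration: the function names `batch_sql`, `run_batch`, and `backfill`, the `DATABASE_URL` environment variable, and the `event_embeddings.event_id` join column. The idea is that each batch only touches rows where `has_embedding IS NULL`, so a re-run after a partial or complete backfill finds nothing left to do.

```shell
# Hypothetical sketch, not the contents of scripts/backfill-events-has-embedding.sh.
BATCH_SIZE="${BATCH_SIZE:-10000}"
SLEEP_BETWEEN_BATCHES="${SLEEP_BETWEEN_BATCHES:-0.1}"

# Print the SQL for one batch. It selects up to BATCH_SIZE rows that are still
# NULL, sets each to true/false, and returns how many rows were updated.
# Assumption: event_embeddings references events through an event_id column.
batch_sql() {
  cat <<SQL
WITH batch AS (
  SELECT id FROM public.events
  WHERE has_embedding IS NULL
  LIMIT ${BATCH_SIZE}
), updated AS (
  UPDATE public.events e
  SET has_embedding = EXISTS (
    SELECT 1 FROM public.event_embeddings emb WHERE emb.event_id = e.id
  )
  FROM batch
  WHERE e.id = batch.id
  RETURNING 1
)
SELECT count(*) FROM updated;
SQL
}

# Run one batch and print the number of rows it updated.
# -A/-t make psql print only the bare count; DATABASE_URL is assumed to be set.
run_batch() {
  psql -X -At -v ON_ERROR_STOP=1 "$DATABASE_URL" -c "$(batch_sql)"
}

# Loop until a batch updates zero rows, pausing between batches.
# Prints the total number of rows updated.
backfill() {
  local total=0 n
  while :; do
    n="$(run_batch)"
    [ "$n" -eq 0 ] && break
    total=$((total + n))
    sleep "$SLEEP_BETWEEN_BATCHES"
  done
  echo "$total"
}
```

Calling `backfill` runs batches until none remain. Keeping each batch in its own statement means each one commits separately, which is what keeps lock duration short compared with a single 1.15M-row update.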
📝 Walkthrough
This PR introduces automatic embedding status tracking on events.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c898a62cc3
    -- partial index (next migration) treats NULL the same as FALSE so the
    -- scheduler keeps working.

    ALTER TABLE public.events ADD COLUMN has_embedding boolean;
Set new events to missing embeddings
Because the new column is nullable with no default and only event_embeddings has maintenance triggers, any event inserted after the one-time backfill finishes but before its embedding row exists will keep has_embedding = NULL. The planned partial-index path uses WHERE NOT has_embedding (and the current scheduler's missing-embedding predicate is in packages/server/src/scheduled/trigger-embed-backfill.ts), so those new NULL rows would be excluded rather than queued for embedding. Add a default/insert-side maintenance path so freshly created events start as false until the embedding insert flips them to true.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@db/migrations/20260517000000_events_has_embedding.sql`:
- Line 16: Update the stale comment in the migration file that references
"scripts/backfill-events-has-embedding.sql" so it reflects the actual script
added by this PR ("scripts/backfill-events-has-embedding.sh"); locate the
comment text in db/migrations/20260517000000_events_has_embedding.sql (the line
containing scripts/backfill-events-has-embedding.sql) and change the file
extension in the comment to .sh to match the new script name.
In `@scripts/backfill-events-has-embedding.sh`:
- Around line 27-29: Validate and normalize the BATCH_SIZE and
SLEEP_BETWEEN_BATCHES env vars before they are interpolated into SQL or passed
to sleep: ensure BATCH_SIZE is a positive integer (>=1) and
SLEEP_BETWEEN_BATCHES is a non-negative number (float allowed), otherwise set
them to safe defaults or exit with an error; perform this check immediately
after their assignment (the BATCH_SIZE and SLEEP_BETWEEN_BATCHES variables at
the top of the script) so the sanitized values are used in the SQL LIMIT and the
sleep call.
📒 Files selected for processing (2)
- db/migrations/20260517000000_events_has_embedding.sql
- scripts/backfill-events-has-embedding.sh
    -- DEFAULT would rewrite all 1.15M rows under ACCESS EXCLUSIVE (the same
    -- trap that timed out 20260516200000_events_search_tsv).
    -- * not backfilled here: existing rows are populated by a batched script
    -- (scripts/backfill-events-has-embedding.sql) that runs outside the
Fix stale script path in migration comment (Line 16).
The comment points to scripts/backfill-events-has-embedding.sql, but this PR adds scripts/backfill-events-has-embedding.sh.
    BATCH_SIZE="${BATCH_SIZE:-10000}"
    SLEEP_BETWEEN_BATCHES="${SLEEP_BETWEEN_BATCHES:-0.1}"
Validate batch/sleep env vars before using them in SQL and sleep.
BATCH_SIZE is interpolated into LIMIT (Lines 53-54). A bad value causes runtime SQL errors (or unsafe huge batches). Add numeric/positive guards up front.
🛠️ Suggested guardrails

     BATCH_SIZE="${BATCH_SIZE:-10000}"
     SLEEP_BETWEEN_BATCHES="${SLEEP_BETWEEN_BATCHES:-0.1}"
    +
    +if ! [[ "$BATCH_SIZE" =~ ^[0-9]+$ ]] || [ "$BATCH_SIZE" -le 0 ]; then
    +  echo "BATCH_SIZE must be a positive integer" >&2
    +  exit 1
    +fi
    +
    +if ! [[ "$SLEEP_BETWEEN_BATCHES" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
    +  echo "SLEEP_BETWEEN_BATCHES must be a non-negative number" >&2
    +  exit 1
    +fi

Also applies to: 49-55
Closing per design feedback: trigger-maintained boolean on events conflicts with the append-only rule in CLAUDE.md/AGENTS.md. Reopening as a sidecar-table proposal or high-watermark scheduler in a follow-up after brainstorm.
Why
Postmortem measurement (pg_stat_statements, see #767): the embed-backfill scheduler is the #1 burner — 564s total / 1274ms mean × 443 calls. Each tick does a parallel seq scan over 1.15M `events` rows + hash anti-join against 1.08M `event_embeddings` rows, just to find the ~25k events without embeddings.
```
Parallel Seq Scan on events e (rows=1029034, t=152ms)
Filter: payload_text IS NOT NULL AND payload_text <> ''
Parallel Hash Anti Join (events e ⋈ event_embeddings emb)
rows=8355 of 1.03M scanned per worker × 3 workers
```
The fix: denormalize "does this event have an embedding?" as a boolean on `events`, maintained by trigger. The scheduler then becomes a tiny partial-index range scan over the ~25k pending rows instead of a 1.15M-row table scan.
What's in this PR (Phase 1 of 3)
No app code changes. The scheduler still uses the existing seq-scan path until the column is fully backfilled and verified in prod.
Phases 2 + 3 (follow-up PRs)
Sequencing matters: the index can't be added until the column has values, the code can't switch over until the index exists, and each step is independently verifiable. Bundling them would mean a single rollout window with three things to revert.
Validation
Tested the migration in a `BEGIN; ... ROLLBACK` against the prod schema:
`make build-packages` + `make typecheck` clean.
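A `BEGIN; ... ROLLBACK` dry run of this kind could be scripted along the lines below. This is only a sketch under stated assumptions: `dry_run_sql` and `dry_run_migration` are hypothetical helper names, `DATABASE_URL` is an assumed environment variable, and the approach only works if the migration file contains no transaction control (`BEGIN`/`COMMIT`) of its own.

```shell
# Hypothetical helpers, not part of this PR.

# Wrap SQL read from stdin in a transaction that is always rolled back,
# so none of its effects are committed.
dry_run_sql() {
  printf 'BEGIN;\n'
  cat
  printf '\nROLLBACK;\n'
}

# Feed a wrapped migration file to psql. ON_ERROR_STOP makes psql exit on the
# first error; the open transaction is then discarded when the session ends.
dry_run_migration() {
  dry_run_sql < "$1" | psql -X -v ON_ERROR_STOP=1 "$DATABASE_URL" -f -
}
```

For example, `dry_run_migration db/migrations/20260517000000_events_has_embedding.sql` would apply the column and triggers inside a transaction and then discard them. Checks such as inserting into `event_embeddings` and reading back `has_embedding` would have to be appended to the SQL before the `ROLLBACK` to be visible.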
Risks + mitigations