fix(runs): add heartbeat + stale-run reaper#849
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (8)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
When a connector worker crashes, OOM-kills, or scales down mid-run the
`runs` row sits in `claimed`/`running` forever and the feed never gets a
retry. The legacy `checkStalledExecutions` cron caught some of this on a
5-minute cadence but only covered `sync` + `embed_backfill`, did a full
table scan, and had no cross-pod coordination.
`reapStaleRuns()` (packages/server/src/scheduled/check-stalled-executions.ts)
covers all four connector lanes (`sync`, `action`, `embed_backfill`,
`auth`), wraps the sweep in `pg_try_advisory_lock` for multi-pod safety,
and reads the threshold from `RUNS_REAPER_STALE_AFTER_SECONDS` (default
120s). It writes the failure as `error_message='worker_heartbeat_lost'`
and re-queues stalled `sync` runs so the feed self-heals.
Wired on a 30s `setInterval` from the gateway boot path (server.ts and
start-local.ts) with explicit teardown on SIGTERM. The legacy 5-minute
TaskScheduler cron stays as a backstop and now delegates to the same
function; the advisory lock keeps the two cadences from double-failing
rows. Watcher lane stays out of scope — it already has its own 2h sweep
in watchers/automation.ts.
Adds the missing partial index `idx_runs_heartbeat_inflight` on
`runs(last_heartbeat_at) WHERE status IN ('claimed','running') AND
run_type IN ('sync','action','embed_backfill','auth')` so the sweeper
query is index-only. Also sets `last_heartbeat_at = current_timestamp`
on the worker poll claim path so a freshly-claimed row has a sane
initial timestamp instead of relying on the first 30s worker heartbeat
to land.
Slice extracted from #615 — placement-tagging, organization-default-device,
and scale-from-zero remain out of scope and the bigger design is
unchanged by this PR.
…pending_interactions
3207f4c to
3081cc7
Compare
… (sync + auth) (#859) PR #849 added a heartbeat-based stale-run reaper that covered four connector lanes (sync, action, embed_backfill, auth). Only `sync` and `auth` runs actually call `client.heartbeat()` from packages/connector-worker/src/daemon/executor.ts — `executeActionRun` (line 353-377) and `executeEmbedBackfillRun` (line 577-622) never heartbeat. Net result on prod: any in-flight `action` or `embed_backfill` run lasting longer than `RUNS_REAPER_STALE_AFTER_SECONDS` (default 120s) gets marked `timeout` with `error_message='worker_heartbeat_lost'` while it is still executing successfully. Narrow scope of this fix: - New migration 20260518020000_runs_heartbeat_inflight_narrow.sql drops and recreates `idx_runs_heartbeat_inflight` restricted to `('sync', 'auth')`. Embedded PGlite schema patch mirrors it. - `reapStaleRuns()` WHERE clause narrowed to the same lane set so the bulk UPDATE matches the index. - schema.sql regenerated by hand (predicate updated + new migration row appended). - Test updated: `action`/`embed_backfill` runs that look stale by heartbeat must NOT be reaped today. Out of scope, tracked as follow-ups: 1. Heartbeat action + embed_backfill runs so they can be safely reaped. 2. Heartbeat the Chrome/Owletto browser-worker run path. 3. Restore atomicity of sync timeout + retry insert. Once #1 lands, the lane set can widen back; the index + WHERE clause stay the single source of truth so they cannot drift again.
Summary
Reaper for connector runs whose worker crashed, OOM-killed, or scaled down mid-run. Without this, a
runsrow sits inclaimed/runningforever and the feed never gets a retry.Slice extracted from #615. Placement-tagging columns, organization-default-device, and scale-from-zero machinery are NOT in this PR — that wider design isn't greenlit. This slice is independently shippable because the heartbeat column and per-worker heartbeat call already exist; what was missing was a single-cadence cross-pod-safe sweep over the connector lanes.
What's in scope
reapStaleRuns()inpackages/server/src/scheduled/check-stalled-executions.ts. Covers all four connector lanes (sync,action,embed_backfill,auth) — the legacy code only sweptsync+embed_backfill. Wraps the body inpg_try_advisory_lockso two pods (or the 30s setInterval + the 5min cron) can't double-fail a row. Threshold configurable viaRUNS_REAPER_STALE_AFTER_SECONDS(default 120s).server.tsandstart-local.ts:startStaleRunReaper()registers a 30ssetIntervaland returns a teardown invoked from the SIGTERM/SIGINT handler. Existing 5-minutecheck-stalled-executionsTaskScheduler cron stays as a backstop and now delegates to the same function; advisory lock keeps the cadences safe.idx_runs_heartbeat_inflightonruns(last_heartbeat_at) WHERE status IN ('claimed','running') AND run_type IN ('sync','action','embed_backfill','auth')— shipped asdb/migrations/20260518000000_runs_heartbeat_reaper_index.sql+ mirrored inembedded-schema-patches.tsfor PGlite + appended todb/schema.sql.worker-api.tsnow setslast_heartbeat_at = current_timestampat claim time so a freshly-claimed row has a sane initial timestamp instead of relying on the worker's first 30s heartbeat to land.packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts(5 tests, all passing): fresh-heartbeat is left alone, stale-heartbeat is timed out asworker_heartbeat_lost, terminal-state rows are never touched, never-heartbeat-but-stale-claim rows are timed out via theclaimed_atfallback, watcher lane is excluded (handled bysweepStaleWatcherRuns), action + auth lanes reach parity with sync.What's reused vs. new
runs.last_heartbeat_at) — already existed./api/workers/heartbeatevery 30s; gateway handler already updates the column.checkStalledExecutions— kept, now delegates toreapStaleRuns()so the 5min cron still runs as a backstop alongside the new 30s setInterval. No parallel state.sweepStaleWatcherRuns/resetOrphanedWatcherRunsinwatchers/automation.ts) — unchanged; the watcher lane has its own 2h TTL by design.claimed_atheartbeat and operates only onchat_message/schedule/agent_run/internal/task.Threshold rationale
120s default = ~3 missed 30s worker heartbeats. Real network blips / GC pauses get a grace window; a crashed worker frees the feed within ~2 minutes instead of 5.
Reproducer
Existing tests cover the surface end-to-end against PGlite:
make build-packagesclean.make typecheckclean for the files this PR touches (pre-existing errors elsewhere unrelated to this change).Test plan
worker_heartbeat_lostclaimed_atwindowRelated
Slice of #615. Not closing #615 — only the heartbeat+reaper concern ships here.