fix(runs): add heartbeat + stale-run reaper#849

Merged
buremba merged 4 commits into main from fix/stale-run-reaper
May 18, 2026
@buremba buremba commented May 18, 2026

Summary

Reaper for connector runs whose worker crashed, was OOM-killed, or was scaled down mid-run. Without this, a runs row sits in claimed/running forever and the feed never gets a retry.

Slice extracted from #615. Placement-tagging columns, organization-default-device, and scale-from-zero machinery are NOT in this PR — that wider design isn't greenlit. This slice is independently shippable because the heartbeat column and per-worker heartbeat call already exist; what was missing was a single-cadence cross-pod-safe sweep over the connector lanes.

What's in scope

  • reapStaleRuns() in packages/server/src/scheduled/check-stalled-executions.ts. Covers all four connector lanes (sync, action, embed_backfill, auth) — the legacy code only swept sync + embed_backfill. Wraps the body in pg_try_advisory_lock so two pods (or the 30s setInterval + the 5min cron) can't double-fail a row. Threshold configurable via RUNS_REAPER_STALE_AFTER_SECONDS (default 120s).
  • Wiring in server.ts and start-local.ts: startStaleRunReaper() registers a 30s setInterval and returns a teardown invoked from the SIGTERM/SIGINT handler. Existing 5-minute check-stalled-executions TaskScheduler cron stays as a backstop and now delegates to the same function; advisory lock keeps the cadences safe.
  • Partial index idx_runs_heartbeat_inflight on runs(last_heartbeat_at) WHERE status IN ('claimed','running') AND run_type IN ('sync','action','embed_backfill','auth') — shipped as db/migrations/20260518000000_runs_heartbeat_reaper_index.sql + mirrored in embedded-schema-patches.ts for PGlite + appended to db/schema.sql.
  • Claim path in worker-api.ts now sets last_heartbeat_at = current_timestamp at claim time so a freshly-claimed row has a sane initial timestamp instead of relying on the worker's first 30s heartbeat to land.
  • Tests in packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts (5 tests, all passing): fresh-heartbeat is left alone, stale-heartbeat is timed out as worker_heartbeat_lost, terminal-state rows are never touched, never-heartbeat-but-stale-claim rows are timed out via the claimed_at fallback, watcher lane is excluded (handled by sweepStaleWatcherRuns), action + auth lanes reach parity with sync.
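
A minimal sketch of the lock-guarded sweep described in the list above. This is not the PR's actual code: the `Db` interface, the `reapStaleRunsSketch` name, and the lock key are assumptions made for illustration, and the sync re-queue step is omitted. It also assumes `db` is a single dedicated connection, because session-level advisory locks belong to the session that took them.

```typescript
// Hypothetical minimal database interface; the real code uses its own SQL client.
interface Db {
  query<T>(sql: string, params?: unknown[]): Promise<T[]>;
}

// Arbitrary lock key chosen for this sketch; the PR does not state the real key.
const REAPER_LOCK_KEY = 849_001;

// Returns the number of rows timed out, or null when another sweeper holds the lock.
async function reapStaleRunsSketch(
  db: Db,
  staleAfterSeconds = 120,
): Promise<number | null> {
  const [lock] = await db.query<{ locked: boolean }>(
    "SELECT pg_try_advisory_lock($1) AS locked",
    [REAPER_LOCK_KEY],
  );
  // Another pod (or the other cadence) is already sweeping: skip this tick.
  if (!lock?.locked) return null;

  try {
    const reaped = await db.query<{ id: string }>(
      `UPDATE runs
          SET status = 'timeout',
              error_message = 'worker_heartbeat_lost'
        WHERE status IN ('claimed', 'running')
          AND run_type IN ('sync', 'action', 'embed_backfill', 'auth')
          AND COALESCE(last_heartbeat_at, claimed_at)
              < current_timestamp - make_interval(secs => $1)
        RETURNING id`,
      [staleAfterSeconds],
    );
    return reaped.length;
  } finally {
    // Always release the lock, even if the UPDATE throws.
    await db.query("SELECT pg_advisory_unlock($1)", [REAPER_LOCK_KEY]);
  }
}
```

Because `pg_try_advisory_lock` returns immediately instead of blocking, a tick that loses the race simply does nothing, which is what lets the 30s interval and the 5min cron share one function.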

What's reused vs. new

  • Heartbeat column (runs.last_heartbeat_at) — already existed.
  • Worker-side heartbeat call — connector-worker daemon already sends /api/workers/heartbeat every 30s; gateway handler already updates the column.
  • Legacy checkStalledExecutions — kept, now delegates to reapStaleRuns() so the 5min cron still runs as a backstop alongside the new 30s setInterval. No parallel state.
  • Watcher reaper (sweepStaleWatcherRuns / resetOrphanedWatcherRuns in watchers/automation.ts) — unchanged; the watcher lane has its own 2h TTL by design.
  • RunsQueue stale sweep for lobu-queue lanes — unchanged; that path uses claimed_at heartbeat and operates only on chat_message / schedule / agent_run / internal / task.
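
The interval wiring with teardown could look roughly like the sketch below. The function name, the `onError` parameter, and the in-process overlap guard are assumptions for illustration; the PR only states that startStaleRunReaper() registers a 30s setInterval and returns a teardown.

```typescript
type Teardown = () => void;

// Hypothetical wiring helper; `reap` stands in for reapStaleRuns().
function startStaleRunReaperSketch(
  reap: () => Promise<unknown>,
  intervalMs = 30_000,
  onError: (err: unknown) => void = () => {},
): Teardown {
  let inFlight = false;

  const tick = (): void => {
    // Assumed guard: do not start a new sweep while the previous one is
    // still running in this process. Cross-pod safety is the lock's job.
    if (inFlight) return;
    inFlight = true;
    reap()
      .catch(onError)
      .finally(() => {
        inFlight = false;
      });
  };

  const timer = setInterval(tick, intervalMs);
  // The teardown stops future ticks; an in-flight sweep is left to finish.
  return () => clearInterval(timer);
}

// Usage from a shutdown handler (commented out so the sketch stays inert):
// const stop = startStaleRunReaperSketch(() => reapStaleRuns());
// process.on("SIGTERM", stop);
// process.on("SIGINT", stop);
```

Returning the teardown instead of exposing the timer keeps the shutdown handler unaware of how the cadence is implemented.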

Threshold rationale

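The 120s default spans four 30s heartbeat intervals, so a worker has to miss ~3 consecutive heartbeats before its run counts as stale. Real network blips / GC pauses get a grace window; a crashed worker frees the feed within roughly 2 to 2.5 minutes (the 120s threshold plus up to one 30s sweep interval) instead of 5.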
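
As a sketch of the threshold logic, assuming the env value is parsed as a plain number and the staleness check falls back to claimed_at when no heartbeat has landed (the helper names and the `RunTimestamps` shape are hypothetical):

```typescript
const DEFAULT_STALE_AFTER_SECONDS = 120;

// Reads RUNS_REAPER_STALE_AFTER_SECONDS, falling back to the default when
// the value is missing, empty, non-numeric, or not positive.
function staleAfterSeconds(env: Record<string, string | undefined>): number {
  const parsed = Number(env.RUNS_REAPER_STALE_AFTER_SECONDS);
  return Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_STALE_AFTER_SECONDS;
}

// Hypothetical in-memory view of the two timestamps on a runs row.
interface RunTimestamps {
  lastHeartbeatAt: Date | null;
  claimedAt: Date;
}

// A run is stale when its last sign of life is older than the threshold.
// Rows that never heartbeated are judged by claimedAt instead.
function isStale(
  run: RunTimestamps,
  now: Date,
  thresholdSeconds: number,
): boolean {
  const lastSeen = run.lastHeartbeatAt ?? run.claimedAt;
  return now.getTime() - lastSeen.getTime() > thresholdSeconds * 1000;
}
```

With a 30s sweep, a run whose worker dies right after a heartbeat would be caught between 120s and about 150s later, assuming each sweep runs promptly.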

Reproducer

Existing tests cover the surface end-to-end against PGlite:

$ cd packages/server && bun test src/scheduled/__tests__/stale-run-reaper.test.ts
 5 pass
 0 fail
 20 expect() calls
Ran 5 tests across 1 file. [3.05s]

$ cd packages/server && bun test src/gateway/__tests__/runs-queue-integration.test.ts
 6 pass
 0 fail

make build-packages is clean. make typecheck is clean for the files this PR touches (pre-existing errors elsewhere are unrelated to this change).

Test plan

  • Unit/integration: stale-claimed run is failed with worker_heartbeat_lost
  • Unit/integration: fresh-heartbeat run is not touched
  • Unit/integration: terminal-state run is not touched
  • Unit/integration: never-heartbeated claim falls back to claimed_at window
  • Unit/integration: watcher lane skipped
  • Unit/integration: action + auth lanes covered
  • Re-run of the sweep doesn't re-fail an already-timed-out row
  • Manual e2e against a real worker (out of scope; the daemon's heartbeat path is already exercised by every running connector)

Related

Slice of #615. Not closing #615 — only the heartbeat+reaper concern ships here.

coderabbitai Bot commented May 18, 2026

Warning

Rate limit exceeded

@buremba has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 2 minutes and 23 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 37eedc01-0842-4533-be83-80f0506dfe45

📥 Commits

Reviewing files that changed from the base of the PR and between 7b0c819 and 3081cc7.

📒 Files selected for processing (8)
  • db/migrations/20260518010000_runs_heartbeat_reaper_index.sql
  • db/schema.sql
  • packages/server/src/db/embedded-schema-patches.ts
  • packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts
  • packages/server/src/scheduled/check-stalled-executions.ts
  • packages/server/src/server.ts
  • packages/server/src/start-local.ts
  • packages/server/src/worker-api.ts
@codecov-commenter
Codecov Report

✅ All modified and coverable lines are covered by tests.

buremba added 4 commits May 18, 2026 03:45
When a connector worker crashes, OOM-kills, or scales down mid-run the
`runs` row sits in `claimed`/`running` forever and the feed never gets a
retry. The legacy `checkStalledExecutions` cron caught some of this on a
5-minute cadence but only covered `sync` + `embed_backfill`, did a full
table scan, and had no cross-pod coordination.

`reapStaleRuns()` (packages/server/src/scheduled/check-stalled-executions.ts)
covers all four connector lanes (`sync`, `action`, `embed_backfill`,
`auth`), wraps the sweep in `pg_try_advisory_lock` for multi-pod safety,
and reads the threshold from `RUNS_REAPER_STALE_AFTER_SECONDS` (default
120s). It writes the failure as `error_message='worker_heartbeat_lost'`
and re-queues stalled `sync` runs so the feed self-heals.

Wired on a 30s `setInterval` from the gateway boot path (server.ts and
start-local.ts) with explicit teardown on SIGTERM. The legacy 5-minute
TaskScheduler cron stays as a backstop and now delegates to the same
function; the advisory lock keeps the two cadences from double-failing
rows. Watcher lane stays out of scope — it already has its own 2h sweep
in watchers/automation.ts.

Adds the missing partial index `idx_runs_heartbeat_inflight` on
`runs(last_heartbeat_at) WHERE status IN ('claimed','running') AND
run_type IN ('sync','action','embed_backfill','auth')` so the sweeper
query is index-only. Also sets `last_heartbeat_at = current_timestamp`
on the worker poll claim path so a freshly-claimed row has a sane
initial timestamp instead of relying on the first 30s worker heartbeat
to land.

Slice extracted from #615 — placement-tagging, organization-default-device,
and scale-from-zero remain out of scope and the bigger design is
unchanged by this PR.
@buremba buremba force-pushed the fix/stale-run-reaper branch from 3207f4c to 3081cc7 Compare May 18, 2026 02:46
@buremba buremba enabled auto-merge (squash) May 18, 2026 02:46
@buremba buremba merged commit 741a4d7 into main May 18, 2026
19 of 20 checks passed
@buremba buremba deleted the fix/stale-run-reaper branch May 18, 2026 02:58
buremba added a commit that referenced this pull request May 18, 2026
… (sync + auth) (#859)

PR #849 added a heartbeat-based stale-run reaper that covered four connector
lanes (sync, action, embed_backfill, auth). Only `sync` and `auth` runs
actually call `client.heartbeat()` from
packages/connector-worker/src/daemon/executor.ts — `executeActionRun`
(lines 353-377) and `executeEmbedBackfillRun` (lines 577-622) never heartbeat.

Net result on prod: any in-flight `action` or `embed_backfill` run lasting
longer than `RUNS_REAPER_STALE_AFTER_SECONDS` (default 120s) gets marked
`timeout` with `error_message='worker_heartbeat_lost'` while it is still
executing successfully.

Narrow scope of this fix:

  - New migration 20260518020000_runs_heartbeat_inflight_narrow.sql drops
    and recreates `idx_runs_heartbeat_inflight` restricted to
    `('sync', 'auth')`. Embedded PGlite schema patch mirrors it.
  - `reapStaleRuns()` WHERE clause narrowed to the same lane set so the
    bulk UPDATE matches the index.
  - schema.sql regenerated by hand (predicate updated + new migration row
    appended).
  - Test updated: `action`/`embed_backfill` runs that look stale by
    heartbeat must NOT be reaped today.

Out of scope, tracked as follow-ups:

  1. Heartbeat action + embed_backfill runs so they can be safely reaped.
  2. Heartbeat the Chrome/Owletto browser-worker run path.
  3. Restore atomicity of sync timeout + retry insert.

Once follow-up 1 lands, the lane set can widen back; the index + WHERE clause stay
the single source of truth so they cannot drift again.
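
One way to keep the index predicate and the sweep's WHERE clause from drifting is to derive both from a single lane list, sketched below. Every name here is hypothetical, and since the real migrations are static SQL files, in practice this would more likely back a test that compares the two than generate the DDL itself.

```typescript
// Lanes whose workers actually heartbeat today, per the follow-up commit.
const REAPABLE_LANES = ["sync", "auth"] as const;

// Lanes are compile-time constants here, never user input, so plain
// string interpolation is acceptable for this sketch.
function laneList(lanes: readonly string[]): string {
  return lanes.map((lane) => `'${lane}'`).join(", ");
}

// Shared predicate used by both the partial index and the sweep query.
function inflightPredicate(lanes: readonly string[] = REAPABLE_LANES): string {
  return `status IN ('claimed', 'running') AND run_type IN (${laneList(lanes)})`;
}

// DDL for the partial index, built from the shared predicate.
function createIndexSql(): string {
  return `CREATE INDEX idx_runs_heartbeat_inflight ON runs (last_heartbeat_at) WHERE ${inflightPredicate()}`;
}

// WHERE clause for the sweep: the same predicate plus the staleness cutoff.
function sweepWhereSql(): string {
  return `${inflightPredicate()} AND COALESCE(last_heartbeat_at, claimed_at) < current_timestamp - make_interval(secs => $1)`;
}
```

Widening the lane set back to four lanes would then be a one-line change to `REAPABLE_LANES`, with both statements following automatically.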