fix(runs): add heartbeat + stale-run reaper by buremba · Pull Request #849 · lobu-ai/lobu

buremba · 2026-05-18T02:08:18Z

Summary

Adds a reaper for connector runs whose worker crashed, was OOM-killed, or was scaled down mid-run. Without it, a runs row sits in claimed/running forever and the feed never gets a retry.

Slice extracted from #615. Placement-tagging columns, organization-default-device, and scale-from-zero machinery are NOT in this PR — that wider design isn't greenlit. This slice is independently shippable because the heartbeat column and per-worker heartbeat call already exist; what was missing was a single-cadence cross-pod-safe sweep over the connector lanes.

What's in scope

- reapStaleRuns() in packages/server/src/scheduled/check-stalled-executions.ts. Covers all four connector lanes (sync, action, embed_backfill, auth); the legacy code only swept sync + embed_backfill. Wraps the body in pg_try_advisory_lock so two pods (or the 30s setInterval and the 5min cron) can't double-fail a row. The threshold is configurable via RUNS_REAPER_STALE_AFTER_SECONDS (default 120s).
- Wiring in server.ts and start-local.ts: startStaleRunReaper() registers a 30s setInterval and returns a teardown that is invoked from the SIGTERM/SIGINT handler. The existing 5-minute check-stalled-executions TaskScheduler cron stays as a backstop and now delegates to the same function; the advisory lock keeps the two cadences safe.
- Partial index idx_runs_heartbeat_inflight on runs(last_heartbeat_at) WHERE status IN ('claimed','running') AND run_type IN ('sync','action','embed_backfill','auth'). Shipped as db/migrations/20260518000000_runs_heartbeat_reaper_index.sql (renamed to 20260518010000 in a later commit on this PR to avoid a collision with #834), mirrored in embedded-schema-patches.ts for PGlite, and appended to db/schema.sql.
- Claim path in worker-api.ts now sets last_heartbeat_at = current_timestamp at claim time, so a freshly claimed row has a sane initial timestamp instead of relying on the worker's first 30s heartbeat to land.
- Tests in packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts (5 tests, all passing): fresh-heartbeat is left alone, stale-heartbeat is timed out as worker_heartbeat_lost, terminal-state rows are never touched, never-heartbeat-but-stale-claim rows are timed out via the claimed_at fallback, watcher lane is excluded (handled by sweepStaleWatcherRuns), action + auth lanes reach parity with sync.
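The cross-pod guard described in the first item can be sketched as below. This is a minimal illustration of the pg_try_advisory_lock pattern, not the PR's actual code: the Queryable interface, the withReaperLock helper, and the lock key value are all assumptions made for the example.

```typescript
// Minimal query interface (assumption). A real Postgres client should be able
// to satisfy it. Both calls must run on ONE connection, because session-level
// advisory locks belong to the session that acquired them.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

// Hypothetical lock key; any stable integer shared by all pods would do.
const REAPER_LOCK_KEY = 849001;

// Runs `sweep` only if this session wins the advisory lock.
// Resolves to null when another pod or cadence already holds it.
async function withReaperLock<T>(db: Queryable, sweep: () => Promise<T>): Promise<T | null> {
  const res = await db.query("SELECT pg_try_advisory_lock($1) AS locked", [REAPER_LOCK_KEY]);
  if (res.rows[0]?.locked !== true) return null; // someone else is sweeping
  try {
    return await sweep();
  } finally {
    // Release even if the sweep throws, so the next tick can proceed.
    await db.query("SELECT pg_advisory_unlock($1)", [REAPER_LOCK_KEY]);
  }
}
```

Because pg_try_advisory_lock returns immediately instead of blocking, the losing caller simply skips its tick, which is what lets the 30s interval and the 5-minute cron coexist.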

What's reused vs. new

- Heartbeat column (runs.last_heartbeat_at): already existed.
- Worker-side heartbeat call: the connector-worker daemon already sends /api/workers/heartbeat every 30s, and the gateway handler already updates the column.
- Legacy checkStalledExecutions: kept, now delegates to reapStaleRuns() so the 5min cron still runs as a backstop alongside the new 30s setInterval. No parallel state.
- Watcher reaper (sweepStaleWatcherRuns / resetOrphanedWatcherRuns in watchers/automation.ts): unchanged; the watcher lane has its own 2h TTL by design.
- RunsQueue stale sweep for lobu-queue lanes: unchanged; that path uses a claimed_at-based heartbeat and operates only on chat_message / schedule / agent_run / internal / task.
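The 30s interval with an explicit teardown might be wired roughly as follows. This is a sketch under stated assumptions, not the PR's startStaleRunReaper(): the startPeriodicSweep name, the Timers interface, and the skip-while-running guard are hypothetical and exist only to make the idea concrete and testable.

```typescript
// Injectable timer functions so the sketch can be exercised without real time.
interface Timers {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realTimers: Timers = {
  set: (fn, ms) => setInterval(fn, ms),
  clear: (handle) => clearInterval(handle as ReturnType<typeof setInterval>),
};

// Hypothetical helper: calls `sweep` every `intervalMs` and returns a teardown.
// A tick is skipped while the previous sweep is still in flight.
function startPeriodicSweep(
  sweep: () => Promise<unknown>,
  intervalMs = 30_000,
  timers: Timers = realTimers,
): () => void {
  let running = false;
  const handle = timers.set(() => {
    if (running) return;
    running = true;
    sweep()
      .catch((err) => console.error("stale-run sweep failed", err))
      .then(() => {
        running = false; // runs after success or after the logged failure
      });
  }, intervalMs);
  // Teardown: the caller would invoke this from its SIGTERM/SIGINT handler.
  return () => timers.clear(handle);
}
```

A caller would keep the returned function and invoke it during shutdown, so the process does not fire another sweep while it is draining.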

Threshold rationale

120s default = ~3 missed 30s worker heartbeats. Real network blips / GC pauses get a grace window; a crashed worker frees the feed within ~2 minutes instead of 5.
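The arithmetic can be made concrete with a small staleness check. The sketch below is illustrative only: parseStaleAfterSeconds, isStale, and the RunTimestamps shape are hypothetical names, and the strict "older than" comparison and the handling of rows with no timestamps are assumptions rather than confirmed behavior.

```typescript
const DEFAULT_STALE_AFTER_SECONDS = 120;

// Parses the raw RUNS_REAPER_STALE_AFTER_SECONDS value; falls back to the
// default when the value is missing, empty, non-numeric, or not positive.
function parseStaleAfterSeconds(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_STALE_AFTER_SECONDS;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : DEFAULT_STALE_AFTER_SECONDS;
}

interface RunTimestamps {
  lastHeartbeatAt: Date | null;
  claimedAt: Date | null;
}

// A run is stale when its newest liveness signal is older than the threshold.
// Rows that never heartbeated fall back to the claim time.
function isStale(run: RunTimestamps, now: Date, staleAfterSeconds: number): boolean {
  const lastSeen = run.lastHeartbeatAt ?? run.claimedAt;
  if (lastSeen === null) return false; // assumption: nothing to measure against
  return now.getTime() - lastSeen.getTime() > staleAfterSeconds * 1000;
}
```

With a last heartbeat at t=0 and beats expected at 30s, 60s, and 90s, a run checked at t=95s has missed three beats but is still inside the 120s window; the same run checked at t=121s is stale.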

Reproducer

Existing tests cover the surface end-to-end against PGlite:

$ cd packages/server && bun test src/scheduled/__tests__/stale-run-reaper.test.ts
 5 pass
 0 fail
 20 expect() calls
Ran 5 tests across 1 file. [3.05s]

$ cd packages/server && bun test src/gateway/__tests__/runs-queue-integration.test.ts
 6 pass
 0 fail

make build-packages is clean. make typecheck is clean for the files this PR touches (the pre-existing errors elsewhere are unrelated to this change).

Test plan

Unit/integration: stale-claimed run is failed with worker_heartbeat_lost
Unit/integration: fresh-heartbeat run is not touched
Unit/integration: terminal-state run is not touched
Unit/integration: never-heartbeated claim falls back to claimed_at window
Unit/integration: watcher lane skipped
Unit/integration: action + auth lanes covered
Re-run of the sweep doesn't re-fail an already-timed-out row
Manual e2e against a real worker (out of scope; the daemon's heartbeat path is already exercised by every running connector)
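The cases in the test plan can be modeled in memory to show why a second sweep is a no-op. This is a simplified stand-in for the SQL sweep, using the four-lane set as described in this PR; the Run shape, the sweep function, and the treatment of every non-claimed, non-running status as terminal are assumptions for illustration.

```typescript
interface Run {
  id: number;
  runType: string; // e.g. "sync", "action", "watcher"
  status: string; // "claimed" and "running" are in flight; others are treated as terminal here
  lastHeartbeatAt: number | null; // epoch ms
  claimedAt: number | null; // epoch ms
  errorMessage?: string;
}

const CONNECTOR_LANES = new Set(["sync", "action", "embed_backfill", "auth"]);
const IN_FLIGHT = new Set(["claimed", "running"]);

// Marks stale in-flight connector runs as timed out; returns the ids it changed.
function sweep(runs: Run[], nowMs: number, staleAfterMs: number): number[] {
  const reaped: number[] = [];
  for (const run of runs) {
    // Terminal rows and non-connector lanes (such as watcher) are never touched.
    if (!IN_FLIGHT.has(run.status) || !CONNECTOR_LANES.has(run.runType)) continue;
    const lastSeen = run.lastHeartbeatAt ?? run.claimedAt;
    if (lastSeen === null || nowMs - lastSeen <= staleAfterMs) continue;
    run.status = "timeout";
    run.errorMessage = "worker_heartbeat_lost";
    reaped.push(run.id);
  }
  return reaped;
}
```

Because a reaped row leaves the in-flight set, the status filter alone keeps a re-run from failing it a second time.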

codecov-commenter · 2026-05-18T02:11:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

When a connector worker crashes, is OOM-killed, or is scaled down mid-run, the `runs` row sits in `claimed`/`running` forever and the feed never gets a retry. The legacy `checkStalledExecutions` cron caught some of this on a 5-minute cadence but only covered `sync` + `embed_backfill`, did a full table scan, and had no cross-pod coordination.

`reapStaleRuns()` (packages/server/src/scheduled/check-stalled-executions.ts) covers all four connector lanes (`sync`, `action`, `embed_backfill`, `auth`), wraps the sweep in `pg_try_advisory_lock` for multi-pod safety, and reads the threshold from `RUNS_REAPER_STALE_AFTER_SECONDS` (default 120s). It writes the failure as `error_message='worker_heartbeat_lost'` and re-queues stalled `sync` runs so the feed self-heals.

It is wired on a 30s `setInterval` from the gateway boot path (server.ts and start-local.ts) with explicit teardown on SIGTERM. The legacy 5-minute TaskScheduler cron stays as a backstop and now delegates to the same function; the advisory lock keeps the two cadences from double-failing rows. The watcher lane stays out of scope: it already has its own 2h sweep in watchers/automation.ts.

Adds the missing partial index `idx_runs_heartbeat_inflight` on `runs(last_heartbeat_at) WHERE status IN ('claimed','running') AND run_type IN ('sync','action','embed_backfill','auth')` so the sweeper query is served by the index rather than a full table scan. Also sets `last_heartbeat_at = current_timestamp` on the worker poll claim path so a freshly claimed row has a sane initial timestamp instead of relying on the first 30s worker heartbeat to land.

Slice extracted from #615. Placement-tagging, organization-default-device, and scale-from-zero remain out of scope, and the bigger design is unchanged by this PR.

…pending_interactions

… (sync + auth) (#859)

PR #849 added a heartbeat-based stale-run reaper that covered four connector lanes (sync, action, embed_backfill, auth). Only `sync` and `auth` runs actually call `client.heartbeat()` from packages/connector-worker/src/daemon/executor.ts; `executeActionRun` (lines 353-377) and `executeEmbedBackfillRun` (lines 577-622) never heartbeat. Net result on prod: any in-flight `action` or `embed_backfill` run lasting longer than `RUNS_REAPER_STALE_AFTER_SECONDS` (default 120s) gets marked `timeout` with `error_message='worker_heartbeat_lost'` while it is still executing successfully.

Narrow scope of this fix:

- New migration 20260518020000_runs_heartbeat_inflight_narrow.sql drops and recreates `idx_runs_heartbeat_inflight` restricted to `('sync', 'auth')`. The embedded PGlite schema patch mirrors it.
- `reapStaleRuns()` WHERE clause narrowed to the same lane set so the bulk UPDATE matches the index.
- schema.sql regenerated by hand (predicate updated + new migration row appended).
- Test updated: `action`/`embed_backfill` runs that look stale by heartbeat must NOT be reaped today.

Out of scope, tracked as follow-ups:

1. Heartbeat action + embed_backfill runs so they can be safely reaped.
2. Heartbeat the Chrome/Owletto browser-worker run path.
3. Restore atomicity of sync timeout + retry insert.

Once #1 lands, the lane set can widen back; the index + WHERE clause stay the single source of truth so they cannot drift again.
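One way to keep the index predicate and the reaper's WHERE clause from drifting is to derive both from a single constant. The sketch below is an assumption about how that could look, not the repository's code: REAPABLE_LANES, sqlInList, and the exact UPDATE shape are hypothetical, and a hand-written SQL migration would still need a test asserting that it contains the same predicate.

```typescript
// Lanes that currently heartbeat and are therefore safe to reap.
const REAPABLE_LANES = ["sync", "auth"] as const;
const IN_FLIGHT_STATUSES = ["claimed", "running"] as const;

// Renders trusted constants as a SQL IN-list, e.g. ('sync','auth').
// Only for compile-time constants; never interpolate user input this way.
function sqlInList(values: readonly string[]): string {
  return "(" + values.map((v) => `'${v.replace(/'/g, "''")}'`).join(",") + ")";
}

// The one predicate that both the index DDL and the reaper's UPDATE embed.
const INFLIGHT_PREDICATE =
  `status IN ${sqlInList(IN_FLIGHT_STATUSES)} AND run_type IN ${sqlInList(REAPABLE_LANES)}`;

const createIndexSql =
  `CREATE INDEX IF NOT EXISTS idx_runs_heartbeat_inflight ` +
  `ON runs (last_heartbeat_at) WHERE ${INFLIGHT_PREDICATE}`;

// Hypothetical UPDATE shape; $1 is the stale threshold in seconds.
const reapSql =
  `UPDATE runs SET status = 'timeout', error_message = 'worker_heartbeat_lost' ` +
  `WHERE ${INFLIGHT_PREDICATE} AND last_heartbeat_at < now() - make_interval(secs => $1)`;
```

Widening the lane set later then becomes a one-line change to REAPABLE_LANES plus a new migration generated from the same predicate.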

This was referenced May 18, 2026

design: connector-worker placement (#615) #842

Closed

Dynamic connector-worker placement: device vs. cloud, scale-from-zero pod pool #615

Closed

buremba added 4 commits May 18, 2026 03:45

fix(schema): regenerate db/schema.sql after heartbeat index migration

9ceb13a

fix(migration): rename to 20260518010000 to avoid collision with #834 …

8cc823d

…pending_interactions

fix(schema): add 20260518010000 to schema_migrations after rebase

3081cc7

buremba force-pushed the fix/stale-run-reaper branch from 3207f4c to 3081cc7 Compare May 18, 2026 02:46

buremba enabled auto-merge (squash) May 18, 2026 02:46

buremba merged commit 741a4d7 into main May 18, 2026
19 of 20 checks passed

buremba deleted the fix/stale-run-reaper branch May 18, 2026 02:58

buremba mentioned this pull request May 18, 2026

feat(connector-worker,server): heartbeat action+embed_backfill, atomic reaper retries #893

Merged

5 tasks
