fix(reaper): narrow stale-run reaper to lanes that actually heartbeat (sync + auth)#859
Conversation
… (sync + auth) PR #849 added a heartbeat-based stale-run reaper that covered four connector lanes (sync, action, embed_backfill, auth). Only `sync` and `auth` runs actually call `client.heartbeat()` from packages/connector-worker/src/daemon/executor.ts — `executeActionRun` (line 353-377) and `executeEmbedBackfillRun` (line 577-622) never heartbeat. Net result on prod: any in-flight `action` or `embed_backfill` run lasting longer than `RUNS_REAPER_STALE_AFTER_SECONDS` (default 120s) gets marked `timeout` with `error_message='worker_heartbeat_lost'` while it is still executing successfully. Narrow scope of this fix: - New migration 20260518020000_runs_heartbeat_inflight_narrow.sql drops and recreates `idx_runs_heartbeat_inflight` restricted to `('sync', 'auth')`. Embedded PGlite schema patch mirrors it. - `reapStaleRuns()` WHERE clause narrowed to the same lane set so the bulk UPDATE matches the index. - schema.sql regenerated by hand (predicate updated + new migration row appended). - Test updated: `action`/`embed_backfill` runs that look stale by heartbeat must NOT be reaped today. Out of scope, tracked as follow-ups: 1. Heartbeat action + embed_backfill runs so they can be safely reaped. 2. Heartbeat the Chrome/Owletto browser-worker run path. 3. Restore atomicity of sync timeout + retry insert. Once #1 lands, the lane set can widen back; the index + WHERE clause stay the single source of truth so they cannot drift again.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThis PR narrows the stale-heartbeat run reaper to only timeout sync and auth connector lanes, excluding action and embed_backfill. The reaper logic, database index scope, migrations, and embedded patches are updated consistently, with new tests verifying that only heartbeat-capable lanes are reaped. ChangesHeartbeat-based stale run detection scope
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…c reaper retries (#893) * feat(connector-worker,server): heartbeat action+embed_backfill, atomic reaper retries Three coupled fixes that together restore safe reaping for the connector-worker action + embed_backfill lanes (lobu#860), and close a non-atomicity hole in the sync timeout + retry path (lobu#862). ## connector-worker: heartbeat action + embed_backfill `executeActionRun` and `executeEmbedBackfillRun` in `packages/connector-worker/src/daemon/executor.ts` did not call `client.heartbeat()`. PR #859 narrowed the gateway's stale-run reaper to the two lanes that did heartbeat (sync + auth) so it stopped killing in-flight action / embed_backfill runs after 120s. Both executors now `setInterval(client.heartbeat, 30000)` at the start of the run and `clearInterval` from a `finally` block so every exit path stops the timer. ## server/reaper: widen back to all four lanes `reapStaleRuns` in `packages/server/src/scheduled/check-stalled-executions.ts` widens the WHERE clause back to `('sync', 'action', 'embed_backfill', 'auth')`. The matching partial index `idx_runs_heartbeat_inflight` widens with it via `db/migrations/20260518070000_runs_heartbeat_inflight_widen.sql` + mirror in `embedded-schema-patches.ts`. ## server/reaper: atomic timeout + retry via CTE (lobu#862) Replaces the previous bulk-UPDATE-then-per-row-INSERT-loop with a single CTE: \`\`\`sql WITH timed_out AS (UPDATE ... RETURNING ...), retries AS (INSERT INTO runs ... SELECT ... FROM timed_out WHERE run_type='sync' AND feed_id IS NOT NULL AND NOT EXISTS (... dedup against active sibling ...) RETURNING ...) SELECT count(*) FROM timed_out, count(*) FROM retries \`\`\` Before, a crash between the UPDATE and the per-row INSERTs left rows in `timeout` with no retry queued and the feed went dark until the next sweep happened to find another stale row for the same feed. The retry INSERT uses `WHERE NOT EXISTS` against the partial unique index `idx_runs_active_sync_per_feed` (with an exclusion for the row the sibling CTE is timing out — the in-CTE snapshot still shows it as `running`). The `ON CONFLICT (col) WHERE ...` form was tried first but PostgreSQL infers it badly inside a CTE against a partial unique index; `NOT EXISTS` is the portable shape. A new `[reaper] Skipped sync retries — another active sync run exists` log fires when the dedup branch trips so operators can spot cross-pod advisory-lock-boundary races. ## Tests `packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts` — 10 tests, all passing: - existing 6 tests updated for the wider lane set - action/embed_backfill/auth lanes are reaped (parity with sync) - a stale sync run gets timed out AND a retry queued in one statement - non-sync stale runs do NOT produce retries - two stale sync runs on the same feed dedup correctly - a pending sync that pre-exists prevents a duplicate retry - reaper output count and DB state agree (no UPDATE-without-INSERT gap) `packages/connector-worker/src/__tests__/executor-heartbeat.test.ts` — 4 tests, all passing: - executeActionRun heartbeats at least twice over a 70s simulated run - executeEmbedBackfillRun heartbeats at least twice over a 70s simulated run - heartbeat interval is cleared after a successful action run - heartbeat interval is cleared after a failed action run ## Owletto The Chrome browser-worker heartbeat ships in owletto#186 (merged 577f0e06b97a48ec6cdbb7b95ddfe8ece635b703), addressing lobu#861. A follow-up small PR will bump the `packages/owletto` submodule pointer. Closes lobu#860 lobu#862 Refs lobu#861 * chore(submodule,test): bump owletto to 577f0e0, harden 30s cadence assertion - Bump packages/owletto to 577f0e06 (post-merge of owletto#186) so check-drift passes. Pulls in the Chrome browser-worker heartbeat fix for lobu#861. - Replace the tautological `concat([30_000]).toContain(30_000)` assertion with a real check against an accumulator that records every setInterval ms argument across the run.
Summary
PR #849 added a heartbeat-based stale-run reaper covering four connector
lanes (
sync,action,embed_backfill,auth). Codex audit (gpt-5.5,reasoning_effort=high) flagged a critical bug: onlysyncandauthruns actually call
client.heartbeat()frompackages/connector-worker/src/daemon/executor.ts.executeActionRun(lines 353-377) and
executeEmbedBackfillRun(lines 577-622) neverheartbeat.
Net result on prod: any in-flight
actionorembed_backfillrunlasting longer than
RUNS_REAPER_STALE_AFTER_SECONDS(default 120s) getsmarked
status='timeout'witherror_message='worker_heartbeat_lost'while it is still executing successfully. This kills active runs the
moment #849 hits prod.
Scope (narrow + safe)
db/migrations/20260518020000_runs_heartbeat_inflight_narrow.sqldrops and recreates
idx_runs_heartbeat_inflightrestricted to('sync', 'auth').packages/server/src/db/embedded-schema-patches.tsso already-initialised embedded/PGlite databases pick it up at boot.
reapStaleRuns()WHEREclause inpackages/server/src/scheduled/check-stalled-executions.tsnarrowed tothe same lane set so the bulk UPDATE matches the partial index.
db/schema.sqlregenerated by hand (predicate updated + new migrationrow appended).
packages/server/src/scheduled/__tests__/stale-run-reaper.test.tsupdated:
action/embed_backfillruns with stalelast_heartbeat_atmust NOT be reaped while their executors do not emit heartbeats.
Out of scope (separate follow-up issues will be filed)
action+embed_backfillruns so they can be safelyreaped on
last_heartbeat_at. Once that lands, the lane set widensback; the index + WHERE clause stay the single source of truth so
they cannot drift again.
(
apps/chrome/background.js) — same staleness blindspot, differentlane.
audit finding around the SELECT-then-UPDATE-then-INSERT shape; the
advisory lock helps cross-pod, but the per-row sequence is still not
atomic within a single tick).
Reproducer
Test file
packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts:action+embed_backfillruns are reapedby the WHERE clause that still lists all four lanes — the assertion that
they stay
runningfails.Test plan
bun test packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts(6/6 pass)biome check+tsc --noEmitpassmake typecheckalready shows pre-existingWorkerTokenData.organizationIderrors onorigin/mainunrelated to this change; check-drift is a known submodule-pointer concern (owletto behind main), not blocking. Required checks: typecheck / format-lint / build-test / pr-title / migrations / unit / integration / frontend.Codex audit reference
Inline finding from the Codex audit of PR #849 — flagged the index +
WHERE clause covering four lanes while only two emit heartbeats today.
Summary by CodeRabbit