Skip to content

fix(reaper): narrow stale-run reaper to lanes that actually heartbeat (sync + auth)#859

Merged
buremba merged 1 commit into
mainfrom
fix/reaper-narrow-lanes
May 18, 2026
Merged

fix(reaper): narrow stale-run reaper to lanes that actually heartbeat (sync + auth)#859
buremba merged 1 commit into
mainfrom
fix/reaper-narrow-lanes

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 18, 2026

Summary

PR #849 added a heartbeat-based stale-run reaper covering four connector
lanes (sync, action, embed_backfill, auth). Codex audit (gpt-5.5,
reasoning_effort=high) flagged a critical bug: only sync and auth
runs actually call client.heartbeat() from
packages/connector-worker/src/daemon/executor.ts. executeActionRun
(lines 353-377) and executeEmbedBackfillRun (lines 577-622) never
heartbeat.

Net result on prod: any in-flight action or embed_backfill run
lasting longer than RUNS_REAPER_STALE_AFTER_SECONDS (default 120s) gets
marked status='timeout' with error_message='worker_heartbeat_lost'
while it is still executing successfully. This kills active runs the
moment #849 hits prod.

Scope (narrow + safe)

  • New migration db/migrations/20260518020000_runs_heartbeat_inflight_narrow.sql
    drops and recreates idx_runs_heartbeat_inflight restricted to
    ('sync', 'auth').
  • Mirrored in packages/server/src/db/embedded-schema-patches.ts so already-
    initialised embedded/PGlite databases pick it up at boot.
  • reapStaleRuns() WHERE clause in
    packages/server/src/scheduled/check-stalled-executions.ts narrowed to
    the same lane set so the bulk UPDATE matches the partial index.
  • db/schema.sql regenerated by hand (predicate updated + new migration
    row appended).
  • Test in packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts
    updated: action/embed_backfill runs with stale last_heartbeat_at
    must NOT be reaped while their executors do not emit heartbeats.

Out of scope (separate follow-up issues will be filed)

  1. Heartbeat action + embed_backfill runs so they can be safely
    reaped on last_heartbeat_at. Once that lands, the lane set widens
    back; the index + WHERE clause stay the single source of truth so
    they cannot drift again.
  2. Heartbeat the Chrome/Owletto browser-worker run path
    (apps/chrome/background.js) — same staleness blindspot, different
    lane.
  3. Restore atomicity of the sync timeout + retry insert (the medium
    audit finding around the SELECT-then-UPDATE-then-INSERT shape; the
    advisory lock helps cross-pod, but the per-row sequence is still not
    atomic within a single tick).

Reproducer

Test file packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts:

  • Red (without this fix): the action + embed_backfill runs are reaped
    by the WHERE clause that still lists all four lanes — the assertion that
    they stay running fails.
  • Green (with this fix): all 6 tests pass.
$ bun test packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts
 6 pass
 0 fail
 22 expect() calls
Ran 6 tests across 1 file. [3.16s]

Test plan

  • bun test packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts (6/6 pass)
  • Pre-commit biome check + tsc --noEmit pass
  • CI: make typecheck already shows pre-existing WorkerTokenData.organizationId errors on origin/main unrelated to this change; check-drift is a known submodule-pointer concern (owletto behind main), not blocking. Required checks: typecheck / format-lint / build-test / pr-title / migrations / unit / integration / frontend.

Codex audit reference

Inline finding from the Codex audit of PR #849 — flagged the index +
WHERE clause covering four lanes while only two emit heartbeats today.

Summary by CodeRabbit

  • Bug Fixes
    • Corrected stalled connector run detection to focus on sync and auth run types that actively send heartbeats, excluding action and embed_backfill runs from heartbeat-based staleness monitoring.

Review Change Stack

… (sync + auth)

PR #849 added a heartbeat-based stale-run reaper that covered four connector
lanes (sync, action, embed_backfill, auth). Only `sync` and `auth` runs
actually call `client.heartbeat()` from
packages/connector-worker/src/daemon/executor.ts — `executeActionRun`
(line 353-377) and `executeEmbedBackfillRun` (line 577-622) never heartbeat.

Net result on prod: any in-flight `action` or `embed_backfill` run lasting
longer than `RUNS_REAPER_STALE_AFTER_SECONDS` (default 120s) gets marked
`timeout` with `error_message='worker_heartbeat_lost'` while it is still
executing successfully.

Narrow scope of this fix:

  - New migration 20260518020000_runs_heartbeat_inflight_narrow.sql drops
    and recreates `idx_runs_heartbeat_inflight` restricted to
    `('sync', 'auth')`. Embedded PGlite schema patch mirrors it.
  - `reapStaleRuns()` WHERE clause narrowed to the same lane set so the
    bulk UPDATE matches the index.
  - schema.sql regenerated by hand (predicate updated + new migration row
    appended).
  - Test updated: `action`/`embed_backfill` runs that look stale by
    heartbeat must NOT be reaped today.

Out of scope, tracked as follow-ups:

  1. Heartbeat action + embed_backfill runs so they can be safely reaped.
  2. Heartbeat the Chrome/Owletto browser-worker run path.
  3. Restore atomicity of sync timeout + retry insert.

Once #1 lands, the lane set can widen back; the index + WHERE clause stay
the single source of truth so they cannot drift again.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 2722a7ec-9513-4877-97dc-11f4f3de28d1

📥 Commits

Reviewing files that changed from the base of the PR and between bd3001e and cac979f.

📒 Files selected for processing (5)
  • db/migrations/20260518020000_runs_heartbeat_inflight_narrow.sql
  • db/schema.sql
  • packages/server/src/db/embedded-schema-patches.ts
  • packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts
  • packages/server/src/scheduled/check-stalled-executions.ts

📝 Walkthrough

Walkthrough

This PR narrows the stale-heartbeat run reaper to only timeout sync and auth connector lanes, excluding action and embed_backfill. The reaper logic, database index scope, migrations, and embedded patches are updated consistently, with new tests verifying that only heartbeat-capable lanes are reaped.

Changes

Heartbeat-based stale run detection scope

Layer / File(s) Summary
Stale run reaper logic narrowing
packages/server/src/scheduled/check-stalled-executions.ts
reapStaleRuns() now only marks runs as timeout when run_type is sync or auth. Documentation clarifies that these lanes have active heartbeating, while action and embed_backfill do not and are excluded pending follow-up.
Database index and migration updates
db/migrations/20260518020000_runs_heartbeat_inflight_narrow.sql, db/schema.sql, packages/server/src/db/embedded-schema-patches.ts
Migration file defines narrowed idx_runs_heartbeat_inflight partial index (sync/auth only); schema.sql and migration seeds are updated; embedded schema patch is added for PGlite replay on boot.
Test verification of reaping scope
packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts
Added separate tests: one verifying auth runs are reaped when stale, another verifying action and embed_backfill runs are not reaped under identical conditions.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A reaper learns to skip the non-heartbeaters,
Sync and auth alone now face the eater,
Tests confirm the narrow scope so true,
No more false timeouts—just the lanes that knew.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically summarizes the main change: narrowing the stale-run reaper to only the lanes that actually emit heartbeats (sync and auth), directly addressing the critical bug being fixed.
Description check ✅ Passed The description is comprehensive and addresses all template sections: clear summary of the bug, complete test plan with checkmarks, and detailed notes on scope, follow-ups, and reproducer steps.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/reaper-narrow-lanes

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@buremba buremba merged commit cc4dbe3 into main May 18, 2026
23 checks passed
@buremba buremba deleted the fix/reaper-narrow-lanes branch May 18, 2026 04:12
buremba added a commit that referenced this pull request May 18, 2026
…c reaper retries (#893)

* feat(connector-worker,server): heartbeat action+embed_backfill, atomic reaper retries

Three coupled fixes that together restore safe reaping for the
connector-worker action + embed_backfill lanes (lobu#860), and close
a non-atomicity hole in the sync timeout + retry path (lobu#862).

## connector-worker: heartbeat action + embed_backfill

`executeActionRun` and `executeEmbedBackfillRun` in
`packages/connector-worker/src/daemon/executor.ts` did not call
`client.heartbeat()`. PR #859 narrowed the gateway's stale-run reaper
to the two lanes that did heartbeat (sync + auth) so it stopped killing
in-flight action / embed_backfill runs after 120s.

Both executors now `setInterval(client.heartbeat, 30000)` at the start
of the run and `clearInterval` from a `finally` block so every exit
path stops the timer.

## server/reaper: widen back to all four lanes

`reapStaleRuns` in
`packages/server/src/scheduled/check-stalled-executions.ts` widens the
WHERE clause back to `('sync', 'action', 'embed_backfill', 'auth')`.
The matching partial index `idx_runs_heartbeat_inflight` widens with
it via `db/migrations/20260518070000_runs_heartbeat_inflight_widen.sql`
+ mirror in `embedded-schema-patches.ts`.

## server/reaper: atomic timeout + retry via CTE (lobu#862)

Replaces the previous bulk-UPDATE-then-per-row-INSERT-loop with a
single CTE:

\`\`\`sql
WITH timed_out AS (UPDATE ... RETURNING ...),
     retries   AS (INSERT INTO runs ... SELECT ... FROM timed_out
                   WHERE run_type='sync' AND feed_id IS NOT NULL
                     AND NOT EXISTS (... dedup against active sibling ...)
                   RETURNING ...)
SELECT count(*) FROM timed_out, count(*) FROM retries
\`\`\`

Before, a crash between the UPDATE and the per-row INSERTs left rows
in `timeout` with no retry queued and the feed went dark until the
next sweep happened to find another stale row for the same feed.

The retry INSERT uses `WHERE NOT EXISTS` against the partial unique
index `idx_runs_active_sync_per_feed` (with an exclusion for the row
the sibling CTE is timing out — the in-CTE snapshot still shows it as
`running`). The `ON CONFLICT (col) WHERE ...` form was tried first
but PostgreSQL infers it badly inside a CTE against a partial unique
index; `NOT EXISTS` is the portable shape.

A new `[reaper] Skipped sync retries — another active sync run exists`
log fires when the dedup branch trips so operators can spot cross-pod
advisory-lock-boundary races.

## Tests

`packages/server/src/scheduled/__tests__/stale-run-reaper.test.ts` —
10 tests, all passing:

- existing 6 tests updated for the wider lane set
- action/embed_backfill/auth lanes are reaped (parity with sync)
- a stale sync run gets timed out AND a retry queued in one statement
- non-sync stale runs do NOT produce retries
- two stale sync runs on the same feed dedup correctly
- a pending sync that pre-exists prevents a duplicate retry
- reaper output count and DB state agree (no UPDATE-without-INSERT gap)

`packages/connector-worker/src/__tests__/executor-heartbeat.test.ts` —
4 tests, all passing:

- executeActionRun heartbeats at least twice over a 70s simulated run
- executeEmbedBackfillRun heartbeats at least twice over a 70s simulated run
- heartbeat interval is cleared after a successful action run
- heartbeat interval is cleared after a failed action run

## Owletto

The Chrome browser-worker heartbeat ships in owletto#186 (merged
577f0e06b97a48ec6cdbb7b95ddfe8ece635b703), addressing lobu#861. A
follow-up small PR will bump the `packages/owletto` submodule pointer.

Closes lobu#860 lobu#862
Refs lobu#861

* chore(submodule,test): bump owletto to 577f0e0, harden 30s cadence assertion

- Bump packages/owletto to 577f0e06 (post-merge of owletto#186) so
  check-drift passes. Pulls in the Chrome browser-worker heartbeat
  fix for lobu#861.
- Replace the tautological `concat([30_000]).toContain(30_000)`
  assertion with a real check against an accumulator that records
  every setInterval ms argument across the run.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants