Skip to content

fix(server): heartbeat-aware reaping for orphaned watcher runs#1020

Merged
buremba merged 3 commits into
mainfrom
feat/watcher-heartbeat-reaper
May 22, 2026
Merged

fix(server): heartbeat-aware reaping for orphaned watcher runs#1020
buremba merged 3 commits into
mainfrom
feat/watcher-heartbeat-reaper

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 22, 2026

What

A device-pinned watcher run left running by an executor crash or an app/gateway restart mid-turn was only recovered by the coarse 2h sweepStaleWatcherRuns TTL — which keys off COALESCE(claimed_at, created_at), so a recently-claimed-but-abandoned run dodges it. The goal then shows a false "running" for up to ~2h (real-world: run #191, claimed 02:37, idle device, never reaped).

Adds a fast reap path keyed on the run's own liveness: a run whose executor heartbeats (the paired Owletto WatcherDispatcher now beats every 30s) and has gone silent for >3min is reaped. Gated on last_heartbeat_at > claimed_at, so it never fires for clients that don't heartbeat (the claim sets the two equal) — those fall through to the unchanged 2h coarse path. Backward-compatible.

Multi-replica: correct by construction — a run executing anywhere keeps its heartbeat fresh, so no replica's sweep touches it; an abandoned one lapses and any replica can reap it.

Testing (red→green hard gate)

src/__tests__/integration/watchers/automation-contract.test.tssweepStaleWatcherRuns liveness reaping (4 cases):

  • heartbeating-then-silent (>3min) → reapedfails on the old query (the reproducer), passes here
  • fresh heartbeat → kept running
  • recent + never-heartbeated → kept running (compat: falls to 2h path)
  • non-heartbeating + >2h → reaped via coarse backstop

Verified locally: old query → 1 failed; fix → 4 passed; full automation-contract suite 21/21; server typecheck clean. Bumps owletto pointer to b244f54.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved stale watcher detection with separate timeout paths for heartbeat-stale and non-heartbeating runs, yielding more reliable timeouts and clearer error messages.
  • Tests

    • Added integration tests covering multiple liveness scenarios to validate reaping behavior and timeout outcomes.
  • Chores

    • Updated a subproject commit pointer to align dependencies.

Review Change Stack

A device-pinned watcher run left `running` by an executor crash or an
app/gateway restart mid-turn was only recovered by the coarse 2h
sweepStaleWatcherRuns TTL — and that TTL keys off COALESCE(claimed_at,
created_at), so a recently-claimed-but-abandoned run dodges it entirely. The
goal then shows a false "running" for up to ~2h (real-world: run #191).

Add a fast reap path keyed on the run's OWN liveness: a run whose executor
heartbeats (the Owletto WatcherDispatcher now beats every 30s during the turn)
and has gone silent for >3min is reaped. Gated on `last_heartbeat_at >
claimed_at` so it NEVER fires for clients that don't heartbeat — the claim sets
the two equal, so those fall through to the unchanged 2h coarse path
(backward-compatible). Correct under N replicas: a run executing anywhere keeps
its heartbeat fresh, so no replica's sweep touches it.

Bumps the owletto pointer to b244f54 (the paired WatcherDispatcher heartbeat).

Reproducer (integration): red on the old query, green here —
src/__tests__/integration/watchers/automation-contract.test.ts
"sweepStaleWatcherRuns liveness reaping" (4 cases: heartbeat-silent reaped,
fresh-heartbeat kept, recent-no-heartbeat kept, coarse-2h reaped).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a18ed2e9-9341-4f57-9f27-874f4f986806

📥 Commits

Reviewing files that changed from the base of the PR and between 36d7b66 and 4fdf53b.

📒 Files selected for processing (1)
  • packages/owletto

📝 Walkthrough

Walkthrough

Implements heartbeat-aware reaping for watcher runs (fast heartbeat-stale path vs coarse non-heartbeating TTL), adds integration tests for both behaviors, and bumps the packages/owletto submodule pointer.

Changes

Watcher Run Timeout Refactoring

Layer / File(s) Summary
Dual-mode timeout implementation
packages/server/src/watchers/automation.ts
Adds WATCHER_RUN_HEARTBEAT_STALE_INTERVAL and WATCHER_RUN_STALE_INTERVAL; updates sweepStaleWatcherRuns to CASE-select error_message and to update runs timed out either by heartbeat silence (fast path using last_heartbeat_at > claimed_at) or by coarse TTL since COALESCE(claimed_at, created_at).
Timeout behavior integration tests
packages/server/src/__tests__/integration/watchers/automation-contract.test.ts
Adds tests that seed running runs with controlled claimed_at and last_heartbeat_at ages and assert: heartbeat-silent runs time out with heartbeat error, fresh-heartbeat runs remain running, recent non-heartbeat runs are not fast-reaped, and older non-heartbeat runs are coarse-reaped to timeout with coarse error.

Submodule Dependency Update

Layer / File(s) Summary
Owletto submodule update
packages/owletto
Advances the packages/owletto git submodule pointer to a new commit revision.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

I nibble logs beneath the moon so bright,
I count the beats that fade into the night.
Fast when a heartbeat quits its song,
Slow when none have ever been long.
A rabbit hops and marks the timeout right. 🐰

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description provides comprehensive context (problem, solution, testing) but the test plan section from the template is missing entirely. Add the Test plan section from the template to document which validation steps were performed (e.g., typecheck, integration tests).
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: adding heartbeat-aware logic to the watcher run reaping mechanism.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/watcher-heartbeat-reaper

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 22, 2026

bug_free 52, simplicity 72, slop 8, bugs 1, 1 blockers

[env] integration exited 1 because DATABASE_URL was empty/defaulted to unsafe postgres; failures are setup, not diff. Typecheck/unit passed. Ran git diff --check and a predicate probe showing claimed 3h ago + heartbeat 30s ago still reaps. Skipped server boot due missing DATABASE_URL.

Blockers

  • heartbeat-aware sweep still times out live heartbeating watcher runs once claimed_at is older than 2h via the coarse TTL branch

Suggested fixes

File Line Change
packages/server/src/watchers/automation.ts 466 Constrain the coarse TTL branch to non-heartbeating rows, e.g. ((last_heartbeat_at IS NULL OR claimed_at IS NULL OR last_heartbeat_at <= claimed_at) AND COALESCE(claimed_at, created_at) < current_timestamp - ${WATCHER_RUN_STALE_INTERVAL}::interval), and keep heartbeatError only for the heartbeat-stale branch.
packages/server/src/__tests__/integration/watchers/automation-contract.test.ts 1099 Add a regression case with claimedAgo: '3 hours' and heartbeatAgo: '30 seconds' that expects the run to remain running.
Full verdict JSON
{
  "bug_free_confidence": 52,
  "bugs": 1,
  "slop": 8,
  "simplicity": 72,
  "blockers": [
    "heartbeat-aware sweep still times out live heartbeating watcher runs once claimed_at is older than 2h via the coarse TTL branch"
  ],
  "change_type": "fix",
  "behavior_change_risk": "high",
  "tests_adequate": false,
  "suggested_fixes": [
    {
      "file": "packages/server/src/watchers/automation.ts",
      "line": 466,
      "change": "Constrain the coarse TTL branch to non-heartbeating rows, e.g. `((last_heartbeat_at IS NULL OR claimed_at IS NULL OR last_heartbeat_at <= claimed_at) AND COALESCE(claimed_at, created_at) < current_timestamp - ${WATCHER_RUN_STALE_INTERVAL}::interval)`, and keep `heartbeatError` only for the heartbeat-stale branch."
    },
    {
      "file": "packages/server/src/__tests__/integration/watchers/automation-contract.test.ts",
      "line": 1099,
      "change": "Add a regression case with `claimedAgo: '3 hours'` and `heartbeatAgo: '30 seconds'` that expects the run to remain `running`."
    }
  ],
  "notes": "[env] integration exited 1 because DATABASE_URL was empty/defaulted to unsafe postgres; failures are setup, not diff. Typecheck/unit passed. Ran git diff --check and a predicate probe showing claimed 3h ago + heartbeat 30s ago still reaps. Skipped server boot due missing DATABASE_URL.",
  "categories": {
    "src": 57,
    "tests": 82,
    "docs": 0,
    "config": 0,
    "deps": 2,
    "migrations": 0,
    "ci": 0,
    "generated": 0
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/owletto`:
- Line 1: The pinned submodule SHA b244f54504dedfd3fea873d097f779c86690ccda for
package "packages/owletto" is not an ancestor of owletto/main and fails the
check-drift gate; fix by repinning packages/owletto to a commit that is
reachable from owletto/main (i.e., an ancestor or a commit on origin/main) or
first merge the referenced Owletto commit onto owletto/main, then update the
submodule reference so the new SHA is verifiably part of owletto/main and the CI
check passes.

In `@packages/server/src/watchers/automation.ts`:
- Around line 449-469: The coarse backstop branch currently matches runs older
than WATCHER_RUN_STALE_INTERVAL even if they have a fresh last_heartbeat_at;
update the WHERE clause so the "Coarse backstop" condition only applies to runs
that have never heartbeated or whose heartbeat is not newer than the claim time.
Concretely, modify the COALESCE(...) < current_timestamp -
${WATCHER_RUN_STALE_INTERVAL}::interval clause in the watcher timeout SQL to
also require (last_heartbeat_at IS NULL OR last_heartbeat_at <= claimed_at), so
heartbeating runs (last_heartbeat_at > claimed_at and checked by the heartbeat
branch) are excluded and error_message/heartbeat semantics remain correct.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 95dc1666-a980-4fbc-8997-4096c06f99ec

📥 Commits

Reviewing files that changed from the base of the PR and between b3fb2d7 and c1b44c5.

📒 Files selected for processing (3)
  • packages/owletto
  • packages/server/src/__tests__/integration/watchers/automation-contract.test.ts
  • packages/server/src/watchers/automation.ts

Comment thread packages/owletto Outdated
@@ -1 +1 @@
Subproject commit 6e7243ffa1262876430755f3405423995d937eb5
Subproject commit b244f54504dedfd3fea873d097f779c86690ccda
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin a submodule SHA that is reachable from owletto/main.

This bump currently fails the hard check-drift gate: CI reports b244f54504dedfd3fea873d097f779c86690ccda is not an ancestor of origin/main. Either merge this Owletto commit onto owletto/main first, or repin packages/owletto to a commit that is already reachable there.

🧰 Tools
🪛 GitHub Actions: Submodule Drift / check-drift

[error] 1-1: Pinned SHA $PINNED is not reachable from owletto/main. Hard rule failed (command: git -C packages/owletto merge-base --is-ancestor "$PINNED" origin/main).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/owletto` at line 1, The pinned submodule SHA
b244f54504dedfd3fea873d097f779c86690ccda for package "packages/owletto" is not
an ancestor of owletto/main and fails the check-drift gate; fix by repinning
packages/owletto to a commit that is reachable from owletto/main (i.e., an
ancestor or a commit on origin/main) or first merge the referenced Owletto
commit onto owletto/main, then update the submodule reference so the new SHA is
verifiably part of owletto/main and the CI check passes.

Comment thread packages/server/src/watchers/automation.ts
pi-review blocker: the 2h coarse branch fired for ANY watcher run claimed >2h
ago — including a live, actively-heartbeating run, killing a legitimately long
turn despite a fresh heartbeat. Constrain the coarse branch to rows that never
heartbeated (last_heartbeat_at IS NULL / <= claimed_at); a heartbeating run is
now governed solely by the fast heartbeat-stale path. Add a regression test
(claimed 3h ago + heartbeat 30s ago → stays running).
@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 22, 2026

Update after the make-review verdict above:

  • Blocker resolved (36d7b663): the coarse 2h branch is now constrained to non-heartbeating rows (last_heartbeat_at IS NULL OR claimed_at IS NULL OR last_heartbeat_at <= claimed_at), so a live heartbeating run >2h is no longer killed. Added a regression test (claimed 3h ago + heartbeat 30s ago → stays running). Re-review: LGTM, no remaining bugs.
  • Integration exit=1 is an environment artifact, not the diff: the runner found no DATABASE_URL and the ephemeral fallback DB is named postgres, which the DROP-SCHEMA safety guard refuses (the verdict notes this too). I ran the integration coverage directly against a throwaway lobu_test: the 5-case sweepStaleWatcherRuns liveness reaping block + the full automation-contract suite (22/22) pass; the heartbeat-silent case is red on the old query, green here. typecheck + unit + server tsc all clean.

@buremba buremba merged commit a0a15b9 into main May 22, 2026
15 checks passed
@buremba buremba deleted the feat/watcher-heartbeat-reaper branch May 22, 2026 03:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants