fix(server): heartbeat-aware reaping for orphaned watcher runs#1020
Conversation
A device-pinned watcher run left `running` by an executor crash or an app/gateway restart mid-turn was only recovered by the coarse 2h sweepStaleWatcherRuns TTL — and that TTL keys off COALESCE(claimed_at, created_at), so a recently-claimed-but-abandoned run dodges it entirely. The goal then shows a false "running" for up to ~2h (real-world: run #191). Add a fast reap path keyed on the run's OWN liveness: a run whose executor heartbeats (the Owletto WatcherDispatcher now beats every 30s during the turn) and has gone silent for >3min is reaped. Gated on `last_heartbeat_at > claimed_at` so it NEVER fires for clients that don't heartbeat — the claim sets the two equal, so those fall through to the unchanged 2h coarse path (backward-compatible). Correct under N replicas: a run executing anywhere keeps its heartbeat fresh, so no replica's sweep touches it. Bumps the owletto pointer to b244f54 (the paired WatcherDispatcher heartbeat). Reproducer (integration): red on the old query, green here — src/__tests__/integration/watchers/automation-contract.test.ts "sweepStaleWatcherRuns liveness reaping" (4 cases: heartbeat-silent reaped, fresh-heartbeat kept, recent-no-heartbeat kept, coarse-2h reaped).
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughImplements heartbeat-aware reaping for watcher runs (fast heartbeat-stale path vs coarse non-heartbeating TTL), adds integration tests for both behaviors, and bumps the ChangesWatcher Run Timeout Refactoring
Submodule Dependency Update
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
bug_free 52, simplicity 72, slop 8, bugs 1, 1 blockers [env] integration exited 1 because DATABASE_URL was empty/defaulted to unsafe postgres; failures are setup, not diff. Typecheck/unit passed. Ran git diff --check and a predicate probe showing claimed 3h ago + heartbeat 30s ago still reaps. Skipped server boot due missing DATABASE_URL. Blockers
Suggested fixes
Full verdict JSON{
"bug_free_confidence": 52,
"bugs": 1,
"slop": 8,
"simplicity": 72,
"blockers": [
"heartbeat-aware sweep still times out live heartbeating watcher runs once claimed_at is older than 2h via the coarse TTL branch"
],
"change_type": "fix",
"behavior_change_risk": "high",
"tests_adequate": false,
"suggested_fixes": [
{
"file": "packages/server/src/watchers/automation.ts",
"line": 466,
"change": "Constrain the coarse TTL branch to non-heartbeating rows, e.g. `((last_heartbeat_at IS NULL OR claimed_at IS NULL OR last_heartbeat_at <= claimed_at) AND COALESCE(claimed_at, created_at) < current_timestamp - ${WATCHER_RUN_STALE_INTERVAL}::interval)`, and keep `heartbeatError` only for the heartbeat-stale branch."
},
{
"file": "packages/server/src/__tests__/integration/watchers/automation-contract.test.ts",
"line": 1099,
"change": "Add a regression case with `claimedAgo: '3 hours'` and `heartbeatAgo: '30 seconds'` that expects the run to remain `running`."
}
],
"notes": "[env] integration exited 1 because DATABASE_URL was empty/defaulted to unsafe postgres; failures are setup, not diff. Typecheck/unit passed. Ran git diff --check and a predicate probe showing claimed 3h ago + heartbeat 30s ago still reaps. Skipped server boot due missing DATABASE_URL.",
"categories": {
"src": 57,
"tests": 82,
"docs": 0,
"config": 0,
"deps": 2,
"migrations": 0,
"ci": 0,
"generated": 0
}
}Local review gate — branch protection can require the |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/owletto`:
- Line 1: The pinned submodule SHA b244f54504dedfd3fea873d097f779c86690ccda for
package "packages/owletto" is not an ancestor of owletto/main and fails the
check-drift gate; fix by repinning packages/owletto to a commit that is
reachable from owletto/main (i.e., an ancestor or a commit on origin/main) or
first merge the referenced Owletto commit onto owletto/main, then update the
submodule reference so the new SHA is verifiably part of owletto/main and the CI
check passes.
In `@packages/server/src/watchers/automation.ts`:
- Around line 449-469: The coarse backstop branch currently matches runs older
than WATCHER_RUN_STALE_INTERVAL even if they have a fresh last_heartbeat_at;
update the WHERE clause so the "Coarse backstop" condition only applies to runs
that have never heartbeated or whose heartbeat is not newer than the claim time.
Concretely, modify the COALESCE(...) < current_timestamp -
${WATCHER_RUN_STALE_INTERVAL}::interval clause in the watcher timeout SQL to
also require (last_heartbeat_at IS NULL OR last_heartbeat_at <= claimed_at), so
heartbeating runs (last_heartbeat_at > claimed_at and checked by the heartbeat
branch) are excluded and error_message/heartbeat semantics remain correct.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 95dc1666-a980-4fbc-8997-4096c06f99ec
📒 Files selected for processing (3)
packages/owlettopackages/server/src/__tests__/integration/watchers/automation-contract.test.tspackages/server/src/watchers/automation.ts
| @@ -1 +1 @@ | |||
| Subproject commit 6e7243ffa1262876430755f3405423995d937eb5 | |||
| Subproject commit b244f54504dedfd3fea873d097f779c86690ccda | |||
There was a problem hiding this comment.
Pin a submodule SHA that is reachable from owletto/main.
This bump currently fails the hard check-drift gate: CI reports b244f54504dedfd3fea873d097f779c86690ccda is not an ancestor of origin/main. Either merge this Owletto commit onto owletto/main first, or repin packages/owletto to a commit that is already reachable there.
🧰 Tools
🪛 GitHub Actions: Submodule Drift / check-drift
[error] 1-1: Pinned SHA $PINNED is not reachable from owletto/main. Hard rule failed (command: git -C packages/owletto merge-base --is-ancestor "$PINNED" origin/main).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/owletto` at line 1, The pinned submodule SHA
b244f54504dedfd3fea873d097f779c86690ccda for package "packages/owletto" is not
an ancestor of owletto/main and fails the check-drift gate; fix by repinning
packages/owletto to a commit that is reachable from owletto/main (i.e., an
ancestor or a commit on origin/main) or first merge the referenced Owletto
commit onto owletto/main, then update the submodule reference so the new SHA is
verifiably part of owletto/main and the CI check passes.
pi-review blocker: the 2h coarse branch fired for ANY watcher run claimed >2h ago — including a live, actively-heartbeating run, killing a legitimately long turn despite a fresh heartbeat. Constrain the coarse branch to rows that never heartbeated (last_heartbeat_at IS NULL / <= claimed_at); a heartbeating run is now governed solely by the fast heartbeat-stale path. Add a regression test (claimed 3h ago + heartbeat 30s ago → stays running).
|
Update after the make-review verdict above:
|
What
A device-pinned watcher run left
runningby an executor crash or an app/gateway restart mid-turn was only recovered by the coarse 2hsweepStaleWatcherRunsTTL — which keys offCOALESCE(claimed_at, created_at), so a recently-claimed-but-abandoned run dodges it. The goal then shows a false "running" for up to ~2h (real-world: run #191, claimed 02:37, idle device, never reaped).Adds a fast reap path keyed on the run's own liveness: a run whose executor heartbeats (the paired Owletto
WatcherDispatchernow beats every 30s) and has gone silent for >3min is reaped. Gated onlast_heartbeat_at > claimed_at, so it never fires for clients that don't heartbeat (the claim sets the two equal) — those fall through to the unchanged 2h coarse path. Backward-compatible.Multi-replica: correct by construction — a run executing anywhere keeps its heartbeat fresh, so no replica's sweep touches it; an abandoned one lapses and any replica can reap it.
Testing (red→green hard gate)
src/__tests__/integration/watchers/automation-contract.test.ts→sweepStaleWatcherRuns liveness reaping(4 cases):Verified locally: old query → 1 failed; fix → 4 passed; full
automation-contractsuite 21/21; server typecheck clean. Bumps owletto pointer to b244f54.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests
Chores