fix(server): server-side agent must skip device-pinned watcher runs #808
The server-side watcher claim query (claimWatcherRun in packages/server/src/watchers/automation.ts) previously grabbed every pending run_type='watcher' row regardless of pinning. Once #798's dispatcher starts setting approved_input.device_worker_id, the user's Mac (or any other device worker polling /api/workers/poll) will race the server-side agent for the same row. That is the exact failure mode that masked the watcher-run silent-success bug in the past.

Add a JSONB-level guard to the CTE so rows with a non-empty device_worker_id are invisible to the server-side claim path. The pin currently lives in approved_input (issue #799 will add a proper column); the guard handles both NULL and empty-string explicitly so the existing test fixtures (no device_worker_id) keep matching.

Test: a watcher run with approved_input.device_worker_id set stays pending after dispatchPendingWatcherRuns runs, in both the global-scan and explicit-runIds branches.

Closes #802. Prereq for #798.
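To make the guard's semantics concrete, here is a minimal TypeScript sketch of the same predicate evaluated in memory. The `WatcherRun` shape and the `isServerClaimable` name are hypothetical and do not appear in the PR; the real check is the SQL predicate inside the claim CTE. The sketch assumes the usual Postgres behaviour that `->>` yields NULL for a missing key or a JSON null.

```typescript
// Hypothetical in-memory model of the guard; the real check lives in SQL.
interface WatcherRun {
  id: number;
  status: "pending" | "running" | "completed" | "failed";
  approved_input: Record<string, unknown> | null;
}

// Mirrors: approved_input->>'device_worker_id' IS NULL
//       OR approved_input->>'device_worker_id' = ''
function isServerClaimable(run: WatcherRun): boolean {
  const pin = run.approved_input?.["device_worker_id"];
  return pin === undefined || pin === null || pin === "";
}

const runs: WatcherRun[] = [
  { id: 1, status: "pending", approved_input: null },
  { id: 2, status: "pending", approved_input: {} },
  { id: 3, status: "pending", approved_input: { device_worker_id: "" } },
  { id: 4, status: "pending", approved_input: { device_worker_id: "mac-device-abc" } },
];

// Only unpinned runs remain visible to the server-side claim path.
const claimableIds = runs.filter(isServerClaimable).map((r) => r.id);
console.log(claimableIds);
```

Runs 1 to 3 stay claimable, which is why fixtures without a `device_worker_id` keep matching; run 4 is skipped and stays `pending`.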
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
The PR implements server-side filtering to prevent the dispatcher from claiming watcher runs pinned to a device worker.
Changes: Device-pinned watcher run exclusion filter
Estimated Code Review Effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Codecov Report: ✅ All modified and coverable lines are covered by tests.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6d55c1753
    r.approved_input->>'device_worker_id' IS NULL
    OR r.approved_input->>'device_worker_id' = ''
Keep a claimant for device-pinned watcher runs
When a watcher run has approved_input.device_worker_id, this predicate makes dispatchPendingWatcherRuns skip it, but the worker polling path does not currently claim watcher runs: in packages/server/src/worker-api.ts the /api/workers/poll SQL allow-list is r.run_type IN ('sync', 'action', 'embed_backfill', 'auth'). In the device-pinned watcher scenario described by this change, the row therefore remains pending indefinitely instead of being picked up by the device worker; add the watcher lane to the device claim path (with the same pin/scope checks) before excluding it here.
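The gap described in this comment can be illustrated with a small model of the two claim paths. This is a sketch under stated assumptions: `serverClaims`, `deviceClaims`, and `hasClaimant` are hypothetical stand-ins, the `device_worker_id` field stands in for `approved_input->>'device_worker_id'`, and only the allow-list values are taken from the review text.

```typescript
type RunType = "sync" | "action" | "embed_backfill" | "auth" | "watcher";

interface Run {
  run_type: RunType;
  // Stand-in for approved_input->>'device_worker_id'.
  device_worker_id: string | null;
}

// Allow-list quoted in the review for the /api/workers/poll SQL.
const POLL_ALLOW_LIST: readonly RunType[] = ["sync", "action", "embed_backfill", "auth"];

// Hypothetical model of the server-side path after this PR:
// watcher runs only, and only when no device pin is set.
function serverClaims(run: Run): boolean {
  const unpinned = run.device_worker_id === null || run.device_worker_id === "";
  return run.run_type === "watcher" && unpinned;
}

// Hypothetical model of the device poll path as the review describes it
// (pin and scope checks omitted for brevity).
function deviceClaims(run: Run): boolean {
  return POLL_ALLOW_LIST.indexOf(run.run_type) !== -1;
}

function hasClaimant(run: Run): boolean {
  return serverClaims(run) || deviceClaims(run);
}

const pinnedWatcher: Run = { run_type: "watcher", device_worker_id: "mac-device-abc" };
const unpinnedWatcher: Run = { run_type: "watcher", device_worker_id: null };

console.log("pinned watcher has claimant:", hasClaimant(pinnedWatcher));
console.log("unpinned watcher has claimant:", hasClaimant(unpinnedWatcher));
```

In this model the pinned watcher run matches neither path, which is the "remains pending indefinitely" outcome the review warns about until the watcher lane is added to the device claim path.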
* feat(watchers): device-pinned watcher runs end-to-end (#798 PR-1)

The schema work in #811 added watchers.device_worker_id + agent_kind + notification + cooldown columns, and #808 made the server-side dispatcher skip runs already pinned to a device. This PR wires the pinning + claim + completion sides so a user's Lobu Mac app can actually execute the run via a local CLI agent (Claude Code, etc.).

Server-side:

1. `materializeDueWatcherRuns` now reads `watchers.device_worker_id` and `watchers.agent_kind` and persists both into the run's `approved_input` JSONB. The existing #802 dispatcher exclusion keys off `approved_input->>'device_worker_id'`, so device-pinned rows stay `pending` on the server side.
2. `/api/workers/poll` gains a parallel CTE branch for `run_type='watcher'` AND `approved_input->>'device_worker_id' = <this device's id>`. The poll response is short-circuited for watchers: no connector code, no credentials, no compiled_code lookup. It returns a watcher payload envelope `{ watcher, event, context }` that the device-side dispatcher uses to build a CLI prompt.
3. New endpoint `POST /api/workers/me/runs/:runId/complete-watcher`: authorizes via the existing claim-ownership gate (`authorizeRunForWorker`), writes a `watcher_windows` row with `model_used='device-cli'` and `extracted_data.kind='device_cli_output'`, updates `runs.status` to completed/failed, advances `watchers.last_fired_at`, and emits a `watcher:updated` lifecycle event so dashboard metric_series picks up the run.
4. The new path is whitelisted in the user-scoped worker route gate (was previously 403 for `/api/workers/me/...` outside auth-profiles/feeds).

Tests:
- materializeDueWatcherRuns persists device_worker_id + agent_kind.
- End-to-end complete-watcher → runs.completed, watcher_windows row with the CLI output, last_fired_at advanced.
- Failure path: error supplied → runs.failed, no window row.
- 409 for non-watcher run types, 404 for unknown run ids.
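Items 1 and 2 above can be sketched as a pin snapshot followed by a device-side match. This is an illustrative model only: `materializeRun` and `claimableByDevice` are hypothetical names, the sample `agent_kind` value is invented, and omitting the keys for unpinned watchers is an assumption rather than something the commit message states.

```typescript
interface Watcher {
  id: number;
  device_worker_id: string | null;
  agent_kind: string | null;
}

interface PendingRun {
  watcher_id: number;
  run_type: "watcher";
  status: "pending";
  approved_input: Record<string, string>;
}

// Hypothetical analogue of materializeDueWatcherRuns: snapshot the watcher's
// pin into the run's approved_input (assumption: keys are omitted when unset).
function materializeRun(w: Watcher): PendingRun {
  const approved_input: Record<string, string> = {};
  if (w.device_worker_id) approved_input["device_worker_id"] = w.device_worker_id;
  if (w.agent_kind) approved_input["agent_kind"] = w.agent_kind;
  return { watcher_id: w.id, run_type: "watcher", status: "pending", approved_input };
}

// Hypothetical analogue of the poll branch: a device only sees watcher runs
// whose snapshotted pin equals its own device worker id.
function claimableByDevice(run: PendingRun, deviceWorkerId: string): boolean {
  return (
    run.run_type === "watcher" &&
    run.status === "pending" &&
    run.approved_input["device_worker_id"] === deviceWorkerId
  );
}

// "claude-code" is an invented sample value for agent_kind.
const pinned = materializeRun({ id: 7, device_worker_id: "dev-1", agent_kind: "claude-code" });
const unpinned = materializeRun({ id: 8, device_worker_id: null, agent_kind: null });

console.log(claimableByDevice(pinned, "dev-1"), claimableByDevice(pinned, "dev-2"));
console.log(claimableByDevice(unpinned, "dev-1"));
```

Because the pin is copied at materialization time, the server-side exclusion and the device-side match both key off the same `approved_input` value.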
PR-2 (Owletto Mac app dispatcher + ClaudeCodeExecutor) consumes this contract.

* fix(watchers): close pi review on device-side complete-watcher (#814)

5 issues from pi's review of #814's WatcherDispatcher / completeWatcherRun that all materialise on real device traffic:

1. (BLOCKER) Schedule advancement mismatch: completing a device watcher moved `last_fired_at` forward but never bumped `next_run_at`, so the scheduler tick re-materialised the same watcher every minute forever. Extract `advanceWatcherSchedule(sql|tx, watcherId)` from automation.ts (was `advanceWatcherScheduleAfterTerminalFailure`) and call it from both manage_watchers(action="complete_window") and the device complete-watcher endpoint after the in-transaction completion writes.
2. Device-identity binding: `claimed_by === body.worker_id` alone is spoofable across devices that share a user OAuth token. Any other device with the token could claim a run under any worker_id, then a third caller with the same token could complete it. For user-scoped workers we now resolve the caller's `device_workers.id` from `(workerUserId, body.worker_id)` and require it to equal `approved_input.device_worker_id` (which materializeDueWatcherRuns already snapshotted from the watcher pin). Mismatch → 403.
3. Completion race: two concurrent POSTs both passed the unlocked `authorizeRunForWorker` status read, both opened a tx, both INSERTed a watcher_windows row; the loser's run-UPDATE saw 0 rows and the tx committed a phantom window. Now lock `SELECT ... FOR UPDATE` on the run row at the top of the tx; if status is no longer 'running', return 200 idempotently with `{idempotent: true}`.
4. Window-id allocation: drop the inline `COALESCE(MAX(id), 0) + 1` in favour of the codebase's shared `getNextNumericId(tx, 'watcher_windows')` helper (whitelisted, same pattern used by manage_watchers).
5. Malformed payload no longer strands the run. Previously, validation that threw inside the tx rolled the run back to `running`, and there's no stale-run sweep today. Validate `approved_input.window_start` / `window_end` BEFORE the transaction; on bad input mark the run `failed` (so next_run_at advances normally) and return 400.

Tests: three integration tests covering #1, #3, and #5. The user-scoped spoof test (#2) needs OAuth device-flow setup that isn't wired up in the current test fixtures, so it is left for a follow-up that lands the helper alongside other user-scoped worker tests.

* fix(watchers): pi round-2 - bind workerId from auth, race-free id alloc, no double-advance

A) Bound-worker check now reads `c.var.mcpAuthInfo?.workerId` (PAT/OAuth token binding), not body.worker_id. Same-user attacker can't complete as another registered worker by lying in the payload.
B) `getNextNumericId` takes a per-table `pg_advisory_xact_lock` (`hashtext('<table>_id_alloc')`). Inside `completeWatcherRun`'s tx the lock is held until commit, so two concurrent completions on different watcher runs serialize on allocation instead of racing on MAX(id)+1.
C) Validation-failure path only advances the schedule when the `UPDATE … WHERE status='running'` actually matched a row (RETURNING-gated). Two concurrent malformed POSTs no longer double-tick `next_run_at`.

Tests:
- device spoof (Fix A): token bound to worker A, run pinned to worker B, body posts worker-B; asserts 403 and run stays running.
- concurrent allocation (Fix B): two completions on different watchers fired in parallel; both 200, distinct watcher_windows.id.
- double-advance (Fix C): two malformed POSTs against the same run; next_run_at advances only once.
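The idempotent completion (issue 3) and the RETURNING-gated schedule advance (Fix C) can be modelled in memory as follows. This is a sketch, not the endpoint's implementation: `completeWatcherRunSketch`, `transitionIfRunning`, and the `State` shape are hypothetical, a `null` output stands in for a malformed `window_start` / `window_end`, and returning 400 for a repeated malformed POST is an assumption. JavaScript's single-threaded execution stands in for the row lock that `SELECT ... FOR UPDATE` provides in the real transaction.

```typescript
type Status = "running" | "completed" | "failed";

interface State {
  run: { status: Status };
  windows: string[];        // stand-in for watcher_windows rows
  scheduleAdvances: number; // how many times next_run_at was bumped
}

interface Result {
  http: number;
  idempotent?: boolean;
}

// Models UPDATE ... WHERE status='running' RETURNING: reports whether a row matched.
function transitionIfRunning(run: { status: Status }, next: Status): boolean {
  if (run.status !== "running") return false;
  run.status = next;
  return true;
}

function completeWatcherRunSketch(state: State, output: string | null): Result {
  // Validate before the "transaction"; advance only if the update matched a row.
  if (output === null) {
    if (transitionIfRunning(state.run, "failed")) state.scheduleAdvances += 1;
    return { http: 400 };
  }
  // In SQL this status read would hold a FOR UPDATE lock on the run row.
  if (state.run.status !== "running") return { http: 200, idempotent: true };
  state.windows.push(output);
  transitionIfRunning(state.run, "completed");
  state.scheduleAdvances += 1;
  return { http: 200 };
}

const state: State = { run: { status: "running" }, windows: [], scheduleAdvances: 0 };
const first = completeWatcherRunSketch(state, "cli output");
const second = completeWatcherRunSketch(state, "cli output");
console.log(first, second, state.windows.length, state.scheduleAdvances);
```

The second call returns the idempotent response without inserting another window or advancing the schedule again, which is the behaviour the integration tests above assert against the real endpoint.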
Summary
Closes #802. Prereq for #798.
`claimWatcherRun` in `packages/server/src/watchers/automation.ts` previously grabbed every pending `run_type='watcher'` row, ignoring whether the run was pinned to a specific device worker. Once the dispatcher in #798 starts setting `approved_input.device_worker_id`, the user's Mac (polling `/api/workers/poll`, which already honors device pinning) will race the in-process server-side agent for the same row.

This is the exact failure mode behind the historical "owletto-worker daemon claims run_type='watcher' and silently marks success" bug: different surface, same shape (two claimants on one row, one of them eats it without doing the work).
Change
Add a JSONB-level guard inside the `WITH next_run AS (...)` CTE so the server-side dispatcher's `FOR UPDATE SKIP LOCKED` selection ignores any run whose `approved_input->>'device_worker_id'` is non-null and non-empty. The pin lives in `approved_input` JSONB for now; issue #799 will add a proper column, at which point this guard plus a column predicate is a single-line follow-up.

No schema change. `/api/workers/poll` is untouched (already has device pinning).

Test plan
- `automation-contract.test.ts`: a watcher run with `approved_input.device_worker_id = 'mac-device-abc'` stays in `pending` (with `claimed_by`/`claimed_at` still NULL) after both the global `dispatchPendingWatcherRuns({} as Env, { db })` scan AND the explicit `runIds: [queued.runId]` branch.
- `make typecheck`: clean for my files; the only failure is the pre-existing `@electric-sql/pglite-postgis` module-not-found in `pglite-backend.ts`, which exists on `origin/main` unchanged.