Skip to content

fix(server): server-side agent must skip device-pinned watcher runs#808

Merged
buremba merged 1 commit into
mainfrom
fix/server-skip-device-pinned-watchers
May 17, 2026
Merged

fix(server): server-side agent must skip device-pinned watcher runs#808
buremba merged 1 commit into
mainfrom
fix/server-skip-device-pinned-watchers

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 17, 2026

Summary

Closes #802. Prereq for #798.

claimWatcherRun in packages/server/src/watchers/automation.ts previously grabbed every pending run_type='watcher' row, ignoring whether the run was pinned to a specific device worker. Once the dispatcher in #798 starts setting approved_input.device_worker_id, the user's Mac (polling /api/workers/poll, which already honors device pinning) will race the in-process server-side agent for the same row.

This is the exact failure mode behind the historical "owletto-worker daemon claims run_type='watcher' and silently marks success" bug — different surface, same shape (two claimants on one row, one of them eats it without doing the work).

Change

Add a JSONB-level guard inside the WITH next_run AS (...) CTE so the server-side dispatcher's FOR UPDATE SKIP LOCKED selection ignores any run whose approved_input->>'device_worker_id' is non-null and non-empty. The pin lives in approved_input JSONB for now; issue #799 will add a proper column, at which point this guard plus a column predicate is a single-line follow-up.

No schema change. /api/workers/poll is untouched (already has device pinning).

Test plan

  • New test in automation-contract.test.ts: a watcher run with approved_input.device_worker_id = 'mac-device-abc' stays in pending (with claimed_by/claimed_at still NULL) after both the global dispatchPendingWatcherRuns({} as Env, { db }) scan AND the explicit runIds: [queued.runId] branch.
  • make typecheck — clean for my files; the only failure is the pre-existing @electric-sql/pglite-postgis module-not-found in pglite-backend.ts, which exists on origin/main unchanged.
  • Integration test run gated on Postgres in CI.

Summary by CodeRabbit

  • Bug Fixes

    • Improved watcher run scheduling to prevent race conditions when runs are assigned to device workers, ensuring proper coordination between dispatcher and device worker assignments.
  • Tests

    • Added test coverage for scheduled watcher runs pinned to device workers.

Review Change Stack

The server-side watcher claim query (claimWatcherRun in
packages/server/src/watchers/automation.ts) previously grabbed every
pending run_type='watcher' row regardless of pinning. Once #798's
dispatcher starts setting approved_input.device_worker_id, the user's
Mac (or other device worker polling /api/workers/poll) will race the
server-side agent for the same row — the exact failure mode that
masked the watcher-run silent-success bug in the past.

Add a JSONB-level guard to the CTE so rows with a non-empty
device_worker_id are invisible to the server-side claim path. The pin
currently lives in approved_input (issue #799 will add a proper
column); the guard handles both NULL and empty-string explicitly so
the existing test fixtures (no device_worker_id) keep matching.

Test: a watcher run with approved_input.device_worker_id set stays
pending after dispatchPendingWatcherRuns runs, both the global-scan
and explicit-runIds branches.

Closes #802
Prereq for #798
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 739aad0b-d880-4c1f-8889-9c9509c246a6

📥 Commits

Reviewing files that changed from the base of the PR and between 32920c8 and b6d55c1.

📒 Files selected for processing (2)
  • packages/server/src/__tests__/integration/watchers/automation-contract.test.ts
  • packages/server/src/watchers/automation.ts

📝 Walkthrough

Walkthrough

The PR implements server-side filtering to prevent the dispatcher from claiming watcher runs pinned to a device worker. The claimWatcherRun query now excludes pending runs where approved_input.device_worker_id is set, and an integration test verifies both default and explicit dispatch paths skip such runs.

Changes

Device-pinned watcher run exclusion filter

Layer / File(s) Summary
Device-pinned run exclusion filter and validation
packages/server/src/watchers/automation.ts, packages/server/src/__tests__/integration/watchers/automation-contract.test.ts
claimWatcherRun SQL query filters out pending runs with approved_input.device_worker_id set (not NULL, not empty); integration test covers both default and explicit runIds dispatch paths, asserting claimed/dispatched counts remain 0 and run status/provenance stay unchanged.

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly Related Issues

  • #798 — Device-pinning epic for watchers; this PR implements the required server-side exclusion filter (WHERE clause on approved_input.device_worker_id) that prevents the server dispatcher from racing device workers for the same run.

Poem

A rabbit hops through the watcher flow,
Checking if a device claims the show,
"Skip this one!" the filter does say,
Let the worker have its way. 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and concisely describes the main change: adding a guard to skip device-pinned watcher runs in the server-side agent.
Description check ✅ Passed The PR description includes all required sections: Summary (with linked issues), detailed Change explanation, and Test plan with checkmarks showing completion status.
Linked Issues check ✅ Passed The changes fully address issue #802's requirement: a JSONB-level guard was added to skip runs where approved_input->>'device_worker_id' is non-null/non-empty, and an integration test verifies this behavior for both dispatch paths.
Out of Scope Changes check ✅ Passed All changes are in scope: the automation.ts fix implements the required device-pinning filter, and the test file validates the exact behavior specified in issue #802.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/server-skip-device-pinned-watchers

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6d55c1753

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +593 to +594
r.approved_input->>'device_worker_id' IS NULL
OR r.approved_input->>'device_worker_id' = ''
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep a claimant for device-pinned watcher runs

When a watcher run has approved_input.device_worker_id, this predicate makes dispatchPendingWatcherRuns skip it, but the worker polling path does not currently claim watcher runs: in packages/server/src/worker-api.ts the /api/workers/poll SQL allow-list is r.run_type IN ('sync', 'action', 'embed_backfill', 'auth'). In the device-pinned watcher scenario described by this change, the row therefore remains pending indefinitely instead of being picked up by the device worker; add the watcher lane to the device claim path (with the same pin/scope checks) before excluding it here.

Useful? React with 👍 / 👎.

@buremba buremba merged commit a88d840 into main May 17, 2026
23 of 24 checks passed
@buremba buremba deleted the fix/server-skip-device-pinned-watchers branch May 17, 2026 04:05
buremba added a commit that referenced this pull request May 17, 2026
* feat(watchers): device-pinned watcher runs end-to-end (#798 PR-1)

The schema work in #811 added watchers.device_worker_id + agent_kind +
notification + cooldown columns, and #808 made the server-side
dispatcher skip runs already pinned to a device. This PR wires the
pinning + claim + completion sides so a user's Lobu Mac app can
actually execute the run via a local CLI agent (Claude Code, etc.).

Server-side:

  1. `materializeDueWatcherRuns` now reads `watchers.device_worker_id`
     and `watchers.agent_kind` and persists both into the run's
     `approved_input` JSONB. The existing #802 dispatcher exclusion
     keys off `approved_input->>'device_worker_id'`, so device-pinned
     rows stay `pending` on the server side.

  2. `/api/workers/poll` gains a parallel CTE branch for
     `run_type='watcher'` AND
     `approved_input->>'device_worker_id' = <this device's id>`. The
     poll response is short-circuited for watchers — no connector
     code, no credentials, no compiled_code lookup. It returns a
     watcher payload envelope `{ watcher, event, context }` that the
     device-side dispatcher uses to build a CLI prompt.

  3. New endpoint `POST /api/workers/me/runs/:runId/complete-watcher`:
     authorizes via the existing claim-ownership gate
     (`authorizeRunForWorker`), writes a `watcher_windows` row with
     `model_used='device-cli'` and `extracted_data.kind='device_cli_output'`,
     updates `runs.status` to completed/failed, advances
     `watchers.last_fired_at`, and emits a `watcher:updated` lifecycle
     event so dashboard metric_series picks up the run.

  4. The new path is whitelisted in the user-scoped worker route gate
     (was previously 403 for `/api/workers/me/...` outside
     auth-profiles/feeds).

Tests:

  - materializeDueWatcherRuns persists device_worker_id + agent_kind.
  - End-to-end complete-watcher → runs.completed, watcher_windows
    row with the CLI output, last_fired_at advanced.
  - Failure path: error supplied → runs.failed, no window row.
  - 409 for non-watcher run types, 404 for unknown run ids.

PR-2 (Owletto Mac app dispatcher + ClaudeCodeExecutor) consumes this
contract.

* fix(watchers): close pi review on device-side complete-watcher (#814)

5 issues from pi's review of #814's WatcherDispatcher / completeWatcherRun
that all materialise on real device traffic:

1. (BLOCKER) Schedule advancement mismatch — completing a device watcher
   moved `last_fired_at` forward but never bumped `next_run_at`, so the
   scheduler tick re-materialised the same watcher every minute forever.
   Extract `advanceWatcherSchedule(sql|tx, watcherId)` from automation.ts
   (was `advanceWatcherScheduleAfterTerminalFailure`) and call it from
   both manage_watchers(action="complete_window") and the device
   complete-watcher endpoint after the in-transaction completion writes.

2. Device-identity binding — `claimed_by === body.worker_id` alone is
   spoofable across devices that share a user OAuth token: any other
   device with the token could claim a run under any worker_id, then a
   third caller with the same token could complete it. For user-scoped
   workers we now resolve the caller's `device_workers.id` from
   `(workerUserId, body.worker_id)` and require it to equal
   `approved_input.device_worker_id` (which materializeDueWatcherRuns
   already snapshotted from the watcher pin). Mismatch → 403.

3. Completion race — two concurrent POSTs both passed the unlocked
   `authorizeRunForWorker` status read, both opened a tx, both INSERTed
   a watcher_windows row; the loser's run-UPDATE saw 0 rows and the tx
   committed a phantom window. Now lock `SELECT ... FOR UPDATE` on the
   run row at the top of the tx; if status is no longer 'running',
   return 200 idempotently with `{idempotent: true}`.

4. Window-id allocation — drop the inline `COALESCE(MAX(id), 0) + 1`
   in favour of the codebase's shared `getNextNumericId(tx, 'watcher_windows')`
   helper (whitelisted, same pattern used by manage_watchers).

5. Malformed payload no longer stranded the run — validation that
   throws inside the tx rolls back to `running`, and there's no
   stale-run sweep today. Validate `approved_input.window_start` /
   `window_end` BEFORE the transaction; on bad input mark the run
   `failed` (so next_run_at advances normally) and return 400.

Tests: three integration tests covering #1, #3, and #5. The user-scoped
spoof test (#2) needs OAuth device-flow setup that isn't wired up in
the current test fixtures — leaving it for a follow-up that lands the
helper alongside other user-scoped worker tests.

* fix(watchers): pi round-2 — bind workerId from auth, race-free id alloc, no double-advance

A) Bound-worker check now reads `c.var.mcpAuthInfo?.workerId` (PAT/OAuth
   token binding), not body.worker_id. Same-user attacker can't complete
   as another registered worker by lying in the payload.

B) `getNextNumericId` takes a per-table `pg_advisory_xact_lock`
   (`hashtext('<table>_id_alloc')`). Inside `completeWatcherRun`'s tx the
   lock is held until commit, so two concurrent completions on different
   watcher runs serialize on allocation instead of racing on MAX(id)+1.

C) Validation-failure path only advances the schedule when the
   `UPDATE … WHERE status='running'` actually matched a row (RETURNING-gated).
   Two concurrent malformed POSTs no longer double-tick `next_run_at`.

Tests:
- device spoof (Fix A): token bound to worker A, run pinned to worker B,
  body posts worker-B — asserts 403 and run stays running.
- concurrent allocation (Fix B): two completions on different watchers
  fired in parallel — both 200, distinct watcher_windows.id.
- double-advance (Fix C): two malformed POSTs against the same run —
  next_run_at advances only once.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Server-side agent must skip watcher runs pinned to a device worker

2 participants