Skip to content

fix(server): unwedge watchers (array-binding bug) + hardening#1046

Merged
buremba merged 3 commits into
mainfrom
feat/fix-watcher-reconcile-array
May 25, 2026
Merged

fix(server): unwedge watchers (array-binding bug) + hardening#1046
buremba merged 3 commits into
mainfrom
feat/fix-watcher-reconcile-array

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 25, 2026

Summary

Every watcher in prod stopped firing on 2026-05-13 ~08:00 UTC for 12 days. Root cause: one SQL array-binding line. This PR fixes it and hardens the subsystem so the same class can't recur or self-deadlock. Plan reviewed by pi before implementation; diff reviewed by pi after (no blocking issues).

The bug

reconcileWatcherRuns bound a raw JS array into = ANY(${pendingDispatchIds}). The prod pool (db/client.ts) runs with fetch_types: false, so postgres.js can't infer the array element type and ships the lone element as a scalar → malformed array literal: "<uuid>". Every other ANY() in the file uses the pgTextArray(...)::text[] idiom for exactly this reason; line 391 was the lone exception.

It only fires when ≥1 active watcher run carries a dispatched_message_id. Run 146501 (stuck running since 05-13) tripped it on every watcher-automation tick (every minute) and every check-stalled-executions tick — both call reconcileWatcherRuns. So the scheduler and the reaper that would have cleared the triggering run both died: self-deadlock. 27,141 + 3,601 failed task runs traced to it.

Changes

Fix

  • reconcileWatcherRuns: bind via pgTextArray(pendingDispatchIds)::text[].

Hardening (root cause + blast radius)

  • Tests now use the production DB client config. Extracted prod's value-serialization options (fetch_types:false + JSON/bigint handling) to a shared PROD_PG_VALUE_OPTIONS; getTestDb() reuses them. The original bug was invisible to the suite because the test client fetched types and silently made = ANY(${jsArray}) work. This is what would have caught it.
  • Per-phase isolation. watcher-automation extracted to runWatcherAutomationTick; each phase (reset/reconcile/materialize/dispatch) is isolated so one throw can't starve the rest. checkStalledExecutions isolates all three phases so reapStaleRuns always runs (kills the self-deadlock).
  • Don't schedule unrunnable watchers. materializeDueWatcherRuns only schedules watchers with a runnable executor — device-pinned or a matching agents row. A watcher whose agent was deleted is skipped at the source (self-healing) and surfaced as unrunnable in the tick summary instead of minting a doomed run every tick. (ensureWatcherAgentExists stays as a dispatch-time backstop.) Fixes the prod lobu-crm/lobu-team watchers that point at deleted agents.

Tests (red → green)

  • Reproducer: an active watcher run with a dispatched_message_id reconciles cleanly — calls getDb() (the prod pool) on purpose; red emits the exact malformed array literal error, green passes.
  • Tick survives a wedging in-flight run and still materializes other due watchers.
  • A dangling-agent watcher is not scheduled and is counted unrunnable.

Local: tsc clean; watcher suite 25/25; full server integration suite 757 passing (4 failures are pre-existing local-env: getPublicWebUrl fallback needs PUBLIC_WEB_URL, and a connector_definitions.api_type schema drift predating this change — none touch files in this PR).

Prod remediation already applied

Stuck run 146501 was manually marked timeout to unblock immediately, but that's not durable (any in-flight watcher re-wedges the buggy code). This PR is the durable fix; merging auto-builds an image that Flux deploys to summaries-prod.

Out of scope (follow-ups): alerting/SLO on scheduler-tick + watcher-run failure rate; the api_type schema drift; recreating agents for the lobu-crm/lobu-team watchers.

Summary by CodeRabbit

  • New Features

    • Added a multi-phase watcher automation tick that runs reconcile/materialize/dispatch resiliently and reports per-tick errors and an “unrunnable” count.
  • Bug Fixes

    • Prevented crashes when reconciling watcher runs that reference orphaned dispatch records.
  • Stability / Jobs

    • Scheduled watcher job and stalled-execution cron now continue later phases even if earlier phases fail.
  • Tests

    • Added regression and end-to-end tests for stuck in-flight runs, materialization, unrunnable-worker handling, and test DB behavior aligned with production.

Review Change Stack

…tchers wedged 12d)

reconcileWatcherRuns bound a raw JS array into `= ANY(${pendingDispatchIds})`.
The production pool (db/client.ts) runs with `fetch_types: false`, so postgres.js
can't infer the array element type and ships the lone element as a scalar; PG
then throws `malformed array literal: "<uuid>"`. Every other ANY() in this file
already uses the explicit `pgTextArray(...)::text[]` literal idiom for exactly
this reason — line 391 was the lone exception.

Impact: the query is only reached when >=1 active watcher run carries a
dispatched_message_id. A run stuck `running` (146501, since 2026-05-13) tripped
it on every `watcher-automation` tick (every minute) AND every
`check-stalled-executions` tick — both call reconcileWatcherRuns. So watchers
stopped firing across all orgs for 12 days, and the reaper that would have
cleared the stuck run was itself wedged by the same error (self-deadlock).
27,141 + 3,601 failed task runs in prod traced to this single line.

Reproducer (integration): an active watcher run with a dispatched_message_id and
no window must reconcile cleanly. It exercises getDb() (the prod pool) on
purpose — the test-harness client fetches types and silently masks the bug.
Red before fix (exact `malformed array literal` error), green after.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c03d5f74-b72f-4ec8-9d97-7a1abef922e2

📥 Commits

Reviewing files that changed from the base of the PR and between fc21a5f and 4e07c54.

📒 Files selected for processing (1)
  • packages/server/src/watchers/automation.ts

📝 Walkthrough

Walkthrough

Adds a guarded multi‑phase watcher automation tick (reset, reconcile, materialize, dispatch) that isolates phase errors, expands materialization to skip unrunnable watchers and return an unrunnable metric, fixes Postgres array parameterization in reconciliation using pgTextArray, and centralizes postgres.js value options reused by tests.

Changes

Watcher automation and scheduled jobs

Layer / File(s) Summary
SQL parameterization fix
packages/server/src/watchers/automation.ts
Import pgTextArray and bind pendingDispatchIds as pgTextArray(... )::text[] in reconcileWatcherRuns SQL.
Materialize runnable/unrunnable logic & metrics
packages/server/src/watchers/automation.ts
Materialize only watchers with runnable executors (device-pinned or existing agent row); compute unrunnable count and include it in returned summaries and early returns.
Run tick orchestration & job wiring
packages/server/src/watchers/automation.ts, packages/server/src/scheduled/jobs.ts, packages/server/src/scheduled/check-stalled-executions.ts
Add exported runWatcherAutomationTick(env) that runs phases with per-phase error capture and returns phase summaries + errors; scheduled job now calls this tick and logs consolidated summary; cron phases isolated so later reapers run despite earlier failures.
Integration & regression tests
packages/server/src/__tests__/integration/watchers/automation-contract.test.ts
Add regression test for a running watcher run with dispatched_message_id but no watcher_windows row; add end-to-end tick tests covering wedged in-flight runs and counting unrunnable watchers when assigned agents are missing.

Postgres client and test DB alignment

Layer / File(s) Summary
Add PROD_PG_VALUE_OPTIONS
packages/server/src/db/client.ts
Introduce exported PROD_PG_VALUE_OPTIONS containing postgres.js VALUE transforms (fetch_types:false, JSON/JSONB parsing, bigint handling).
Reuse options in createDbClient
packages/server/src/db/client.ts
Replace inline value options in createDbClient with ...PROD_PG_VALUE_OPTIONS.
Apply production options to tests
packages/server/src/__tests__/setup/test-db.ts
Import and spread PROD_PG_VALUE_OPTIONS into the test postgres client so test DB behavior matches production serialization/parsing.

Sequence Diagram(s)

sequenceDiagram
  participant Scheduler
  participant WatcherAutomation
  participant DB
  participant Dispatcher
  Scheduler->>WatcherAutomation: runWatcherAutomationTick(env)
  WatcherAutomation->>DB: resetOrphanedWatcherRuns()
  WatcherAutomation->>DB: reconcileWatcherRuns(pgTextArray pending IDs)
  WatcherAutomation->>DB: materializeDueWatcherRuns()
  WatcherAutomation->>Dispatcher: dispatchPendingWatcherRuns()
  Dispatcher->>DB: record dispatch outcomes
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • lobu-ai/lobu#994: Both PRs fix watcher SQL array-parameterization for dispatched_message_id using pgTextArray(...)::text[].
  • lobu-ai/lobu#814: Related changes touching materializeDueWatcherRuns and device-pinned execution logic.
  • lobu-ai/lobu#893: Overlaps scheduled housekeeping and stale-run reaper behavior in check-stalled-executions.

"I'm a rabbit, code in tow,
Tamed arrays now parse and flow,
Ticks run safe, phases hum along,
Tests clap softly—nothing's wrong.
Hoppity cheer for DB glow!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main bug fix (array-binding bug in watchers) and the hardening efforts applied.
Description check ✅ Passed The description fully adheres to the template, including a comprehensive Summary section, detailed Test plan checkboxes, and Notes with context, linked issue reference, and follow-up scope clarifications.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/fix-watcher-reconcile-array

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@buremba buremba enabled auto-merge (squash) May 25, 2026 17:34
@buremba buremba disabled auto-merge May 25, 2026 17:55
buremba added 2 commits May 25, 2026 19:04
…st client, phase isolation

Follow-ups to the malformed-array reconcile fix, hardening the same subsystem.

- Tests now use prod's value-serialization options (fetch_types:false + JSON/bigint
  handling) via a shared PROD_PG_VALUE_OPTIONS. Root cause the original bug slipped
  through: getTestDb() used a more forgiving client than prod, so `= ANY(${jsArray})`
  worked in tests and threw `malformed array literal` only in prod.

- materializeDueWatcherRuns only schedules watchers with a runnable executor:
  device-pinned (claimed via the worker poll lane, no cloud agent needed) OR a
  matching agents row. A watcher whose agent was deleted is no longer materialized
  into a doomed run every tick; it is skipped at the source (self-healing) and
  counted as `unrunnable` in the tick summary. ensureWatcherAgentExists stays as a
  dispatch-time delete-after-select backstop.

- watcher-automation extracted to runWatcherAutomationTick with per-phase isolation:
  a throw in one phase (reset/reconcile/materialize/dispatch) no longer aborts the
  others. The original bug was a reconcile throw starving materialize+dispatch.

- checkStalledExecutions isolates all three phases (reconcile, sweep, reapStaleRuns)
  so the reaper always runs — previously a reconcile throw disabled the very reaper
  that would clear the run triggering it (self-deadlock).

- Integration tests: tick survives a wedging in-flight run and still materializes
  other due watchers; a dangling-agent watcher is not scheduled + counted unrunnable.

Full server integration suite: 757 passed (4 pre-existing local-env failures
unrelated to these files: getPublicWebUrl fallback needs PUBLIC_WEB_URL, and a
connector_definitions.api_type schema-drift predating this change).
pi review nit: the unrunnable metric omitted the no-active-run clause, so a
ghost-agent watcher with an in-flight run could be overcounted. Mirror the
dueWatchers predicate exactly. Metric accuracy only; no scheduling change.
@buremba buremba changed the title fix(server): reconcile dispatched_message_id ANY() binding — watchers wedged 12 days in prod fix(server): unwedge watchers (array-binding bug) + hardening May 25, 2026
@buremba buremba enabled auto-merge (squash) May 25, 2026 18:09
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/server/src/watchers/automation.ts (1)

537-540: 💤 Low value

Unused _env parameter should be removed per coding guidelines.

The _env parameter is prefixed with underscore but never used. The guideline states to delete unused parameters rather than prefix them.

Proposed fix
 export async function materializeDueWatcherRuns(
-  _env: Env,
   db?: DbClient
 ): Promise<MaterializeDueWatcherRunsResult> {

This would also require updating the call site in runWatcherAutomationTick (line 669) and jobs.ts.

As per coding guidelines: "Fix unused parameters by deleting them, not by prefixing with _"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/watchers/automation.ts` around lines 537 - 540, Remove
the unused `_env` parameter from the materializeDueWatcherRuns function
signature (rename from materializeDueWatcherRuns(_env: Env, db?: DbClient) to
materializeDueWatcherRuns(db?: DbClient>) and update all call sites accordingly
(notably runWatcherAutomationTick and any references in jobs.ts) so they pass
the db argument only; ensure any type annotations or imports referencing Env are
removed if no longer used and adjust usages inside materializeDueWatcherRuns to
reference the existing db parameter by its original name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/server/src/watchers/automation.ts`:
- Around line 537-540: Remove the unused `_env` parameter from the
materializeDueWatcherRuns function signature (rename from
materializeDueWatcherRuns(_env: Env, db?: DbClient) to
materializeDueWatcherRuns(db?: DbClient>) and update all call sites accordingly
(notably runWatcherAutomationTick and any references in jobs.ts) so they pass
the db argument only; ensure any type annotations or imports referencing Env are
removed if no longer used and adjust usages inside materializeDueWatcherRuns to
reference the existing db parameter by its original name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: af30304b-e32c-49ca-8379-0b535294e887

📥 Commits

Reviewing files that changed from the base of the PR and between 12ef38d and fc21a5f.

📒 Files selected for processing (6)
  • packages/server/src/__tests__/integration/watchers/automation-contract.test.ts
  • packages/server/src/__tests__/setup/test-db.ts
  • packages/server/src/db/client.ts
  • packages/server/src/scheduled/check-stalled-executions.ts
  • packages/server/src/scheduled/jobs.ts
  • packages/server/src/watchers/automation.ts

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 25, 2026

bug_free 30, simplicity 72, slop 0, bugs 1, 1 blockers

Typecheck and unit passed. Integration failed in changed test-db.ts because DATABASE_URL db "postgres" is rejected before setup; no extra exploratory probe beyond log/diff review because deterministic integration is blocked.

Blockers

  • integration tests fail: changed test database guard rejects DATABASE_URL database "postgres" before DB setup

Suggested fixes

File Line Change
packages/server/src/__tests__/setup/test-db.ts 63 Align this guard with the review/CI integration harness: either make the harness use a database name like lobu_test, or add an explicit safe override path so the canonical integration suite no longer fails when DATABASE_URL ends in /postgres.
Full verdict JSON
{
  "bug_free_confidence": 30,
  "bugs": 1,
  "slop": 0,
  "simplicity": 72,
  "blockers": [
    "integration tests fail: changed test database guard rejects DATABASE_URL database \"postgres\" before DB setup"
  ],
  "change_type": "fix",
  "behavior_change_risk": "high",
  "tests_adequate": false,
  "suggested_fixes": [
    {
      "file": "packages/server/src/__tests__/setup/test-db.ts",
      "line": 63,
      "change": "Align this guard with the review/CI integration harness: either make the harness use a database name like lobu_test, or add an explicit safe override path so the canonical integration suite no longer fails when DATABASE_URL ends in /postgres."
    }
  ],
  "notes": "Typecheck and unit passed. Integration failed in changed test-db.ts because DATABASE_URL db \"postgres\" is rejected before setup; no extra exploratory probe beyond log/diff review because deterministic integration is blocked.",
  "categories": {
    "src": 934,
    "tests": 392,
    "docs": 104,
    "config": 154,
    "deps": 4,
    "migrations": 0,
    "ci": 0,
    "generated": 72
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.

@buremba buremba merged commit c524b42 into main May 25, 2026
19 of 21 checks passed
@buremba buremba deleted the feat/fix-watcher-reconcile-array branch May 25, 2026 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants