fix(server): unwedge watchers (array-binding bug) + hardening #1046
…tchers wedged 12d)
reconcileWatcherRuns bound a raw JS array into `= ANY(${pendingDispatchIds})`.
The production pool (db/client.ts) runs with `fetch_types: false`, so postgres.js
can't infer the array element type and ships the lone element as a scalar; PG
then throws `malformed array literal: "<uuid>"`. Every other ANY() in this file
already uses the explicit `pgTextArray(...)::text[]` literal idiom for exactly
this reason — line 391 was the lone exception.
Impact: the query is only reached when >=1 active watcher run carries a
dispatched_message_id. A run stuck `running` (146501, since 2026-05-13) tripped
it on every `watcher-automation` tick (every minute) AND every
`check-stalled-executions` tick — both call reconcileWatcherRuns. So watchers
stopped firing across all orgs for 12 days, and the reaper that would have
cleared the stuck run was itself wedged by the same error (self-deadlock).
27,141 + 3,601 failed task runs in prod traced to this single line.
Reproducer (integration): an active watcher run with a dispatched_message_id and
no window must reconcile cleanly. It exercises getDb() (the prod pool) on
purpose — the test-harness client fetches types and silently masks the bug.
Red before fix (exact `malformed array literal` error), green after.
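The explicit-literal idiom described above can be sketched as follows. This is a minimal illustration only: the body of the repo's `pgTextArray` helper is not shown in this PR, so `toPgTextArrayLiteral` is a hypothetical stand-in that assumes the helper builds a standard Postgres array literal string, which the query then casts with `::text[]`.

```typescript
// Hypothetical stand-in for pgTextArray; the real helper's body is not shown in this PR.
// Builds a Postgres array literal such as {"a","b"} from a list of strings.
function toPgTextArrayLiteral(values: readonly string[]): string {
  const quoted = values.map((value) => {
    // Inside a quoted array element, backslashes and double quotes are backslash-escaped.
    const escaped = value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
    return `"${escaped}"`;
  });
  return `{${quoted.join(",")}}`;
}

// A single id still serializes as an array literal, never as a bare scalar.
const literal = toPgTextArrayLiteral(["3f2a-example-id"]);
console.log(literal);
```

Because the value reaches the server as one text parameter that is then cast to `text[]`, the driver never has to infer an array element type, which is the step that fails when `fetch_types` is `false`.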
No actionable comments were generated in the recent review. 🎉
Walkthrough
Adds a guarded multi‑phase watcher automation tick (reset, reconcile, materialize, dispatch) that isolates phase errors, expands materialization to skip unrunnable watchers and return an …
Changes
Watcher automation and scheduled jobs
Postgres client and test DB alignment
Sequence Diagram(s)
sequenceDiagram
participant Scheduler
participant WatcherAutomation
participant DB
participant Dispatcher
Scheduler->>WatcherAutomation: runWatcherAutomationTick(env)
WatcherAutomation->>DB: resetOrphanedWatcherRuns()
WatcherAutomation->>DB: reconcileWatcherRuns(pgTextArray pending IDs)
WatcherAutomation->>DB: materializeDueWatcherRuns()
WatcherAutomation->>Dispatcher: dispatchPendingWatcherRuns()
Dispatcher->>DB: record dispatch outcomes
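The per-phase isolation that the walkthrough attributes to the tick can be sketched like this. It is a minimal sketch under stated assumptions: `runIsolatedPhases`, `Phase`, and `PhaseResult` are hypothetical names, and the real `runWatcherAutomationTick` may structure its error handling and summary differently.

```typescript
// Hypothetical types; the repo's actual tick summary shape is not shown in this PR.
type Phase = { name: string; run: () => Promise<void> };
type PhaseResult = { phase: string; ok: boolean; error?: string };

// Runs every phase in order. A throw is recorded and the loop moves on,
// so a failing reconcile cannot starve materialize or dispatch.
async function runIsolatedPhases(phases: readonly Phase[]): Promise<PhaseResult[]> {
  const results: PhaseResult[] = [];
  for (const { name, run } of phases) {
    try {
      await run();
      results.push({ phase: name, ok: true });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.push({ phase: name, ok: false, error });
    }
  }
  return results;
}
```

With the phases from the diagram passed in order (reset, reconcile, materialize, dispatch), a reconcile throw shows up as one failed entry in the results while the later phases still execute.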
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 ESLint skipped: no ESLint configuration detected in root package.json.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
…st client, phase isolation
Follow-ups to the malformed-array reconcile fix, hardening the same subsystem.
- Tests now use prod's value-serialization options (fetch_types:false + JSON/bigint
handling) via a shared PROD_PG_VALUE_OPTIONS. Why the original bug slipped
through: getTestDb() used a more forgiving client than prod, so `= ANY(${jsArray})`
worked in tests and threw `malformed array literal` only in prod.
- materializeDueWatcherRuns only schedules watchers with a runnable executor:
device-pinned (claimed via the worker poll lane, no cloud agent needed) OR a
matching agents row. A watcher whose agent was deleted is no longer materialized
into a doomed run every tick; it is skipped at the source (self-healing) and
counted as `unrunnable` in the tick summary. ensureWatcherAgentExists stays as a
dispatch-time delete-after-select backstop.
- watcher-automation extracted to runWatcherAutomationTick with per-phase isolation:
a throw in one phase (reset/reconcile/materialize/dispatch) no longer aborts the
others. The original bug was a reconcile throw starving materialize+dispatch.
- checkStalledExecutions isolates all three phases (reconcile, sweep, reapStaleRuns)
so the reaper always runs — previously a reconcile throw disabled the very reaper
that would clear the run triggering it (self-deadlock).
- Integration tests: tick survives a wedging in-flight run and still materializes
other due watchers; a dangling-agent watcher is not scheduled + counted unrunnable.
Full server integration suite: 757 passed (4 pre-existing local-env failures
unrelated to these files: getPublicWebUrl fallback needs PUBLIC_WEB_URL, and a
connector_definitions.api_type schema-drift predating this change).
pi review nit: the unrunnable metric omitted the no-active-run clause, so a ghost-agent watcher with an in-flight run could be overcounted. Mirror the dueWatchers predicate exactly. Metric accuracy only; no scheduling change.
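The idea of mirroring the predicate can be sketched in plain TypeScript. This is an illustration only: the real selection is presumably a SQL predicate inside `materializeDueWatcherRuns`, and `WatcherRow`, its fields, and `partitionDueWatchers` are hypothetical names, not the repo's schema.

```typescript
// Hypothetical row shape; field names are illustrative, not taken from the repo.
interface WatcherRow {
  id: string;
  isDue: boolean;
  hasActiveRun: boolean;
  devicePinned: boolean;
  agentExists: boolean;
}

// One shared eligibility predicate, including the no-active-run clause.
const isDueCandidate = (w: WatcherRow): boolean => w.isDue && !w.hasActiveRun;

// Runnable executor: device-pinned OR a matching agents row.
const hasRunnableExecutor = (w: WatcherRow): boolean => w.devicePinned || w.agentExists;

// Both the scheduled set and the unrunnable count start from the same candidates,
// so a ghost-agent watcher with an in-flight run is not counted.
function partitionDueWatchers(rows: readonly WatcherRow[]): { due: WatcherRow[]; unrunnable: number } {
  const candidates = rows.filter(isDueCandidate);
  return {
    due: candidates.filter(hasRunnableExecutor),
    unrunnable: candidates.filter((w) => !hasRunnableExecutor(w)).length,
  };
}
```

Deriving both outputs from one candidate set is one way to keep the metric and the scheduler from drifting apart; duplicating the clause in two queries, as the nit describes, works as long as the two copies stay identical.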
🧹 Nitpick comments (1)
packages/server/src/watchers/automation.ts (1)
537-540: 💤 Low value. Unused `_env` parameter should be removed per coding guidelines.
The `_env` parameter is prefixed with an underscore but never used. The guideline states to delete unused parameters rather than prefix them.
Proposed fix:
 export async function materializeDueWatcherRuns(
-  _env: Env,
   db?: DbClient
 ): Promise<MaterializeDueWatcherRunsResult> {
This would also require updating the call site in `runWatcherAutomationTick` (line 669) and `jobs.ts`.
As per coding guidelines: "Fix unused parameters by deleting them, not by prefixing with `_`"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/watchers/automation.ts` around lines 537 - 540, remove the unused `_env` parameter from the materializeDueWatcherRuns function signature (change it from materializeDueWatcherRuns(_env: Env, db?: DbClient) to materializeDueWatcherRuns(db?: DbClient)) and update all call sites accordingly (notably runWatcherAutomationTick and any references in jobs.ts) so they pass the db argument only; ensure any type annotations or imports referencing Env are removed if no longer used and adjust usages inside materializeDueWatcherRuns to reference the existing db parameter by its original name.
📒 Files selected for processing (6)
packages/server/src/__tests__/integration/watchers/automation-contract.test.ts
packages/server/src/__tests__/setup/test-db.ts
packages/server/src/db/client.ts
packages/server/src/scheduled/check-stalled-executions.ts
packages/server/src/scheduled/jobs.ts
packages/server/src/watchers/automation.ts
bug_free 30, simplicity 72, slop 0, bugs 1, 1 blockers
Typecheck and unit passed. Integration failed in changed test-db.ts because DATABASE_URL db "postgres" is rejected before setup; no extra exploratory probe beyond log/diff review because deterministic integration is blocked.
Full verdict JSON
{
"bug_free_confidence": 30,
"bugs": 1,
"slop": 0,
"simplicity": 72,
"blockers": [
"integration tests fail: changed test database guard rejects DATABASE_URL database \"postgres\" before DB setup"
],
"change_type": "fix",
"behavior_change_risk": "high",
"tests_adequate": false,
"suggested_fixes": [
{
"file": "packages/server/src/__tests__/setup/test-db.ts",
"line": 63,
"change": "Align this guard with the review/CI integration harness: either make the harness use a database name like lobu_test, or add an explicit safe override path so the canonical integration suite no longer fails when DATABASE_URL ends in /postgres."
}
],
"notes": "Typecheck and unit passed. Integration failed in changed test-db.ts because DATABASE_URL db \"postgres\" is rejected before setup; no extra exploratory probe beyond log/diff review because deterministic integration is blocked.",
"categories": {
"src": 934,
"tests": 392,
"docs": 104,
"config": 154,
"deps": 4,
"migrations": 0,
"ci": 0,
"generated": 72
}
}
Local review gate: branch protection can require the …
Summary
Every watcher in prod stopped firing on 2026-05-13 at ~08:00 UTC and stayed down for 12 days. Root cause: one SQL array-binding line. This PR fixes it and hardens the subsystem so the same class of bug can't recur or self-deadlock. Plan reviewed by `pi` before implementation; diff reviewed by `pi` after (no blocking issues).
The bug
`reconcileWatcherRuns` bound a raw JS array into `= ANY(${pendingDispatchIds})`. The prod pool (`db/client.ts`) runs with `fetch_types: false`, so postgres.js can't infer the array element type and ships the lone element as a scalar → `malformed array literal: "<uuid>"`. Every other `ANY()` in the file uses the `pgTextArray(...)::text[]` idiom for exactly this reason; line 391 was the lone exception.
It only fires when ≥1 active watcher run carries a `dispatched_message_id`. Run 146501 (stuck `running` since 05-13) tripped it on every `watcher-automation` tick (every minute) and every `check-stalled-executions` tick; both call `reconcileWatcherRuns`. So the scheduler and the reaper that would have cleared the triggering run both died: self-deadlock. 27,141 + 3,601 failed task runs traced to it.
Changes
Fix
`reconcileWatcherRuns`: bind via `pgTextArray(pendingDispatchIds)::text[]`.
Hardening (root cause + blast radius)
- Prod's value-serialization options (`fetch_types:false` + JSON/bigint handling) moved to a shared `PROD_PG_VALUE_OPTIONS`; `getTestDb()` reuses them. The original bug was invisible to the suite because the test client fetched types and silently made `= ANY(${jsArray})` work. This is what would have caught it.
- `watcher-automation` extracted to `runWatcherAutomationTick`; each phase (reset/reconcile/materialize/dispatch) is isolated so one throw can't starve the rest.
- `checkStalledExecutions` isolates all three phases so `reapStaleRuns` always runs (kills the self-deadlock).
- `materializeDueWatcherRuns` only schedules watchers with a runnable executor: device-pinned or a matching `agents` row. A watcher whose agent was deleted is skipped at the source (self-healing) and surfaced as `unrunnable` in the tick summary instead of minting a doomed run every tick. (`ensureWatcherAgentExists` stays as a dispatch-time backstop.) Fixes the prod `lobu-crm`/`lobu-team` watchers that point at deleted agents.
Tests (red → green)
- An active watcher run with a `dispatched_message_id` reconciles cleanly. The test calls `getDb()` (the prod pool) on purpose; red emits the exact `malformed array literal` error, green passes.
- The tick survives a wedging in-flight run and still materializes other due watchers; a dangling-agent watcher is not scheduled and is counted `unrunnable`.
Local: tsc clean; watcher suite 25/25; full server integration suite 757 passing (4 failures are pre-existing local-env: `getPublicWebUrl` fallback needs `PUBLIC_WEB_URL`, and a `connector_definitions.api_type` schema drift predating this change; none touch files in this PR).
Prod remediation already applied
Stuck run 146501 was manually marked `timeout` to unblock immediately, but that's not durable (any in-flight watcher re-wedges the buggy code). This PR is the durable fix; merging auto-builds an image that Flux deploys to `summaries-prod`.
Summary by CodeRabbit
New Features
Bug Fixes
Stability / Jobs
Tests