fix: stop server startup from auto-failing in-flight workflow runs (#1216)#1231
fix: stop server startup from auto-failing in-flight workflow runs (#1216)#1231
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughServer startup no longer auto-fails Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/web/src/components/layout/TopNav.tsx (1)
18-23: Query key change may cause brief staleness after chat operations.ChatInterface invalidates
['workflowRuns']but this query uses['dashboardRuns', { status: 'running', forCount: true }]. Workflow state changes triggered from chat won't immediately update this count. The 10srefetchIntervalprovides eventual consistency, so this is a minor UX gap rather than a bug.Consider adding
['dashboardRuns']to ChatInterface's invalidation list if immediate consistency is desired.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/web/src/components/layout/TopNav.tsx` around lines 18 - 23, The dashboardRuns useQuery (queryKey ['dashboardRuns', { status: 'running', forCount: true }], queryFn listDashboardRuns) can become briefly stale because ChatInterface currently only invalidates ['workflowRuns']; update ChatInterface to also invalidate the ['dashboardRuns'] query key (or include the matching key shape) after chat-triggered workflow changes so runningCount reflects updates immediately instead of waiting for the 10s refetchInterval.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/web/src/components/layout/TopNav.tsx`:
- Around line 18-23: The dashboardRuns useQuery (queryKey ['dashboardRuns', {
status: 'running', forCount: true }], queryFn listDashboardRuns) can become
briefly stale because ChatInterface currently only invalidates ['workflowRuns'];
update ChatInterface to also invalidate the ['dashboardRuns'] query key (or
include the matching key shape) after chat-triggered workflow changes so
runningCount reflects updates immediately instead of waiting for the 10s
refetchInterval.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dc978b37-07be-4213-bd83-1f87825e78ea
📒 Files selected for processing (6)
CHANGELOG.mdpackages/server/src/index.tspackages/web/src/components/dashboard/ConfirmRunActionDialog.tsxpackages/web/src/components/dashboard/WorkflowHistoryTable.tsxpackages/web/src/components/dashboard/WorkflowRunCard.tsxpackages/web/src/components/layout/TopNav.tsx
PR Review SummarySix specialized agents reviewed this PR in parallel: code-reviewer, silent-failure-hunter, type-design-analyzer, comment-analyzer, docs-impact, code-simplifier. Critical Issues (0 found)None. Important Issues (0 found)None. Suggestions (4 found)
Strengths
Documentation Issues
VerdictREADY TO MERGE — with one doc fix recommended. The only substantive finding is the stale docs-web page describing the removed behavior. The three code suggestions are nits (4 lines total). Nothing blocks merge. Recommended Actions
|
PR review surfaced one real correctness issue in docs and three small code polish items. None block merge; addressing for cleanliness. - packages/docs-web/src/content/docs/guides/authoring-workflows.md:486 removed the "auto-marked as failed on next startup" paragraph that described the now-deleted behavior. Replaced with a "Crashed servers / orphaned runs" note pointing users at `archon workflow cleanup` and the dashboard Cancel/Abandon buttons; explains the auto-resume mechanism still works once the row reaches a terminal status. - ConfirmRunActionDialog: narrow `onConfirm` from `() => void | Promise<void>` to `() => void`. All five callsites are synchronous wrappers around React Query mutations whose error handling lives at the page level (`runAction` in DashboardPage). The union widened the API for no current caller. Documented in the JSDoc what to do if an awaiting caller appears later. - TopNav: dropped the redundant `String(runningCount)` cast in the aria-label — template literal coerces. Also rewrote the comment above the `listDashboardRuns` query: the previous version implied `limit=1` constrained `counts.running`; in fact `counts` is a server-side aggregate independent of `limit`, and `limit=1` only minimises the `runs` array we discard.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/docs-web/src/content/docs/guides/authoring-workflows.md`:
- Line 486: Update the paragraph to clarify that the CLI command "archon
workflow cleanup" invokes deleteOldWorkflowRuns(days) and only deletes terminal
runs (failed, cancelled, completed) older than N days; it does NOT clean up
orphaned "running" rows. For stuck "running" rows instruct users to use the
dashboard per-row Cancel/Abandon buttons or rely on the server-side
failOrphanedRuns() which runs on server startup, and adjust the statement about
auto-resume to reference that rows become failed/cancelled when explicitly
cleaned so subsequent invocations can auto-resume from completed nodes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0602701d-9068-49bd-a413-50af37a216e6
📒 Files selected for processing (3)
packages/docs-web/src/content/docs/guides/authoring-workflows.mdpackages/web/src/components/dashboard/ConfirmRunActionDialog.tsxpackages/web/src/components/layout/TopNav.tsx
✅ Files skipped from review due to trivial changes (1)
- packages/web/src/components/layout/TopNav.tsx
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx
…1216) `failOrphanedRuns()` at server startup unconditionally flipped every `running` workflow row to `failed`, including runs actively executing in another process (CLI / adapters). The dag-executor's between-layer status check then bailed out of the run, exit code 1 — even though every node had completed successfully. Same class of bug the CLI already learned (see comment at packages/cli/src/cli.ts:256-258). Per the new CLAUDE.md principle "No Autonomous Lifecycle Mutation Across Process Boundaries", we don't replace the call with a timer-based heuristic. Instead we remove it and surface running workflows to the user with one-click actions. Backend - `packages/server/src/index.ts` — remove the `failOrphanedRuns()` call at startup. Replace with explanatory comment referencing the CLI precedent and the CLAUDE.md principle. The function in `packages/core/src/db/workflows.ts:911` is preserved for use by the explicit `archon workflow cleanup` command. UI - `packages/web/src/components/layout/TopNav.tsx` — replace the binary pulse dot on the Dashboard nav with a numeric count badge sourced from `/api/dashboard/runs` `counts.running`. Hidden when count is 0. Same 10s polling interval as before. No animation — a steady factual count is honest; a pulse would imply system judgment. - `packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx` (new) — shadcn AlertDialog wrapper for destructive workflow-run actions, mirroring the codebase-delete pattern in `sidebar/ProjectSelector.tsx`. Caller passes the existing button as `trigger` slot; dialog handles open/close via Radix. - `packages/web/src/components/dashboard/WorkflowRunCard.tsx` — replace 4 `window.confirm()` callsites (Reject, Abandon, Cancel, Delete) with ConfirmRunActionDialog. Each gets a context-appropriate description. - `packages/web/src/components/dashboard/WorkflowHistoryTable.tsx` — replace 1 `window.confirm()` (Delete) with the same dialog. CHANGELOG entries under [Unreleased]: Fixed for #1216, two Changed entries for the nav badge and dialog upgrade. No new tests: the web package has no React component testing infrastructure (existing `bun test` covers `src/lib/` and `src/stores/` only). Type-check + lint + manual UI verification + the backend reproducer are the verification levels. Closes #1216.
PR review surfaced one real correctness issue in docs and three small code polish items. None block merge; addressing for cleanliness. - packages/docs-web/src/content/docs/guides/authoring-workflows.md:486 removed the "auto-marked as failed on next startup" paragraph that described the now-deleted behavior. Replaced with a "Crashed servers / orphaned runs" note pointing users at `archon workflow cleanup` and the dashboard Cancel/Abandon buttons; explains the auto-resume mechanism still works once the row reaches a terminal status. - ConfirmRunActionDialog: narrow `onConfirm` from `() => void | Promise<void>` to `() => void`. All five callsites are synchronous wrappers around React Query mutations whose error handling lives at the page level (`runAction` in DashboardPage). The union widened the API for no current caller. Documented in the JSDoc what to do if an awaiting caller appears later. - TopNav: dropped the redundant `String(runningCount)` cast in the aria-label — template literal coerces. Also rewrote the comment above the `listDashboardRuns` query: the previous version implied `limit=1` constrained `counts.running`; in fact `counts` is a server-side aggregate independent of `limit`, and `limit=1` only minimises the `runs` array we discard.
CodeRabbit caught a factual error I introduced in the doc update: `archon workflow cleanup` calls `deleteOldWorkflowRuns(days)` which DELETEs old terminal rows (`completed`/`failed`/`cancelled` older than N days) for disk hygiene. It does NOT transition stuck `running` rows. The correct remediation for a stuck `running` row is either the dashboard's per-row Cancel/Abandon button (already documented) or `archon workflow abandon <run-id>` from the CLI (existing subcommand, see packages/cli/src/cli.ts:366-374). Fixed three locations: - packages/docs-web/.../guides/authoring-workflows.md — replaced the vague "clean up explicitly" with concrete Web UI / CLI instructions and an explicit "Not to be confused with `archon workflow cleanup`" callout to close off the ambiguity CodeRabbit flagged. - packages/server/src/index.ts — comment updated to point at the correct remediation (`archon workflow abandon`) and clarify that `archon workflow cleanup` is unrelated disk-hygiene. - CHANGELOG.md — same correction in the [Unreleased] Fixed entry.
e7e7be9 to
fc1a41d
Compare
…oleam00#1216) (coleam00#1231) * fix: stop server startup from auto-failing in-flight workflow runs (coleam00#1216) `failOrphanedRuns()` at server startup unconditionally flipped every `running` workflow row to `failed`, including runs actively executing in another process (CLI / adapters). The dag-executor's between-layer status check then bailed out of the run, exit code 1 — even though every node had completed successfully. Same class of bug the CLI already learned (see comment at packages/cli/src/cli.ts:256-258). Per the new CLAUDE.md principle "No Autonomous Lifecycle Mutation Across Process Boundaries", we don't replace the call with a timer-based heuristic. Instead we remove it and surface running workflows to the user with one-click actions. Backend - `packages/server/src/index.ts` — remove the `failOrphanedRuns()` call at startup. Replace with explanatory comment referencing the CLI precedent and the CLAUDE.md principle. The function in `packages/core/src/db/workflows.ts:911` is preserved for use by the explicit `archon workflow cleanup` command. UI - `packages/web/src/components/layout/TopNav.tsx` — replace the binary pulse dot on the Dashboard nav with a numeric count badge sourced from `/api/dashboard/runs` `counts.running`. Hidden when count is 0. Same 10s polling interval as before. No animation — a steady factual count is honest; a pulse would imply system judgment. - `packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx` (new) — shadcn AlertDialog wrapper for destructive workflow-run actions, mirroring the codebase-delete pattern in `sidebar/ProjectSelector.tsx`. Caller passes the existing button as `trigger` slot; dialog handles open/close via Radix. - `packages/web/src/components/dashboard/WorkflowRunCard.tsx` — replace 4 `window.confirm()` callsites (Reject, Abandon, Cancel, Delete) with ConfirmRunActionDialog. Each gets a context-appropriate description. - `packages/web/src/components/dashboard/WorkflowHistoryTable.tsx` — replace 1 `window.confirm()` (Delete) with the same dialog. CHANGELOG entries under [Unreleased]: Fixed for coleam00#1216, two Changed entries for the nav badge and dialog upgrade. No new tests: the web package has no React component testing infrastructure (existing `bun test` covers `src/lib/` and `src/stores/` only). Type-check + lint + manual UI verification + the backend reproducer are the verification levels. Closes coleam00#1216. * review: address PR coleam00#1231 nits — stale doc + 3 code polish PR review surfaced one real correctness issue in docs and three small code polish items. None block merge; addressing for cleanliness. - packages/docs-web/src/content/docs/guides/authoring-workflows.md:486 removed the "auto-marked as failed on next startup" paragraph that described the now-deleted behavior. Replaced with a "Crashed servers / orphaned runs" note pointing users at `archon workflow cleanup` and the dashboard Cancel/Abandon buttons; explains the auto-resume mechanism still works once the row reaches a terminal status. - ConfirmRunActionDialog: narrow `onConfirm` from `() => void | Promise<void>` to `() => void`. All five callsites are synchronous wrappers around React Query mutations whose error handling lives at the page level (`runAction` in DashboardPage). The union widened the API for no current caller. Documented in the JSDoc what to do if an awaiting caller appears later. - TopNav: dropped the redundant `String(runningCount)` cast in the aria-label — template literal coerces. Also rewrote the comment above the `listDashboardRuns` query: the previous version implied `limit=1` constrained `counts.running`; in fact `counts` is a server-side aggregate independent of `limit`, and `limit=1` only minimises the `runs` array we discard. * review: correct remediation docs — cleanup ≠ abandon CodeRabbit caught a factual error I introduced in the doc update: `archon workflow cleanup` calls `deleteOldWorkflowRuns(days)` which DELETEs old terminal rows (`completed`/`failed`/`cancelled` older than N days) for disk hygiene. It does NOT transition stuck `running` rows. The correct remediation for a stuck `running` row is either the dashboard's per-row Cancel/Abandon button (already documented) or `archon workflow abandon <run-id>` from the CLI (existing subcommand, see packages/cli/src/cli.ts:366-374). Fixed three locations: - packages/docs-web/.../guides/authoring-workflows.md — replaced the vague "clean up explicitly" with concrete Web UI / CLI instructions and an explicit "Not to be confused with `archon workflow cleanup`" callout to close off the ambiguity CodeRabbit flagged. - packages/server/src/index.ts — comment updated to point at the correct remediation (`archon workflow abandon`) and clarify that `archon workflow cleanup` is unrelated disk-hygiene. - CHANGELOG.md — same correction in the [Unreleased] Fixed entry.
Summary
archon servestartup unconditionally flipped allrunningworkflow rows tofailedviafailOrphanedRuns()atpackages/server/src/index.ts:213. This killed CLI workflows actively executing in another process. Reproducer: start a workflow in one terminal, start the server in another while it's still running — the workflow's status flips tofailedmid-execution and the CLI exits non-zero even though every node completed successfully. Filed as bug: server startup marks actively-running workflows as failed via failOrphanedRuns() #1216, discovered during PR fix(providers): replace Claude SDK embed with explicit binary-path resolver #1217 smoke testing.failOrphanedRuns()call from server startup (matches the CLI precedent atpackages/cli/src/cli.ts:256-258). UI gets a numeric count badge on the Dashboard nav (replacing a binary pulse dot) and AlertDialog confirmations for destructive workflow-run actions (replacing 5window.confirm()callsites).failOrphanedRuns()function itself inpackages/core/src/db/workflows.ts:911is preserved — it's still used byarchon workflow cleanup(the explicit user-driven path). Codex provider behavior unchanged. No DB migration. No new dependencies. No timer-based heuristic introduced anywhere — per the new CLAUDE.md principle.UX Journey
Before
After
Architecture Diagram
Before
After
Connection inventory:
archon workflow cleanupretains the linksidebar/ProjectSelector.tsx:142–165patternLabel Snapshot
risk: low(removal of an autonomous mutation; UI changes are additive replacements of an existing pattern)size: M(6 files, +197 / −72; bulk in WorkflowRunCard's 4 dialog conversions)core, server, webserver:index,web:dashboard,web:layoutChange Metadata
bug(primary: fixes bug: server startup marks actively-running workflows as failed via failOrphanedRuns() #1216) +refactor(secondary: dialog UX)serverLinked Issue
Validation Evidence (required)
End-to-end reproducer (the bug fix verification):
Result with this PR applied:
Workflow completed successfully.(exit 0) ✓orphan/fail_orphans/orphaned_workflow_runs_failedevents ✓status='completed', notfailed✓Without this PR (verified before the fix): Terminal A exits 1 with "Workflow failed", server log emits
db.orphaned_workflow_runs_failed { count: 1 }— exactly the run that was in flight.Regression sweep:
Security Impact (required)
Compatibility / Migration
failOrphanedRuns()retained for explicit cleanup)Behavioral change for operators: Server restarts no longer auto-mark
runningworkflow rows asfailed. Truly orphaned rows from a crashed server now persist asrunninguntil cleaned up viaarchon workflow cleanupor per-row Cancel/Abandon in the dashboard. The Dashboard nav count badge surfaces the count.Human Verification (required)
bun run dev:serverstarts cleanly with no orphan-related log eventsfailOrphanedRuns()function is preserved and still callable by the explicitarchon workflow cleanuppathcreateWorkflowStoreimport in server/index.ts was also removed (caught by TS noUnusedLocals)ConfirmRunActionDialogdoes NOT swallow promise rejections fromonConfirm— errors propagate to the parent'srunActionhelper which already displays them viaactionErrorstatebun testonly coverssrc/lib/andsrc/stores/); adding@testing-library/reactwould be significant scope creep matching no existing pattern. Type-check + lint + manual UI verification + the backend reproducer are the verification levels in this PR.Side Effects / Blast Radius (required)
runningrows from crashed servers will accumulate in the DB until explicit cleanup. The count badge surfaces them; users can click into the dashboard and Cancel per row. This is the intended trade-off per CLAUDE.md "No Autonomous Lifecycle Mutation Across Process Boundaries".listDashboardRuns({ status: 'running', limit: 1 })in TopNav adds one query per 10s wherelistWorkflowRunswas previously called. Same frequency, slightly heavier endpoint (returns enriched run + counts vs raw run array). Thelimit: 1keeps the runs payload trivially small; we only consumecounts.running.archon workflow statusCLI command continues to work and lists running rowsdb.orphaned_workflow_runs_failedlog event is now only emitted by the explicit cleanup path, so its presence post-merge is a useful signal that someone ran cleanup intentionallyRollback Plan (required)
git revert 7a00e047ondev. One commit, atomic. No DB changes to reverse.window.confirm(worse UX but functional)Risks and Mitigations
archon workflow cleanupor the dashboard explicitly. Some may not realize the behavior changed.db.orphaned_workflow_runs_failedevent, removing a false-positive signal.window.confirmelsewhere — ProjectSelector pattern, this PR's pattern, and any I missed).window.confirmin the touched files (WorkflowRunCard.tsx,WorkflowHistoryTable.tsx). Other components inpackages/web/may still usewindow.confirmand should be reviewed in a follow-up sweep — out of scope here.Summary by CodeRabbit
New Features
Changed
Documentation