Skip to content

fix: stop server startup from auto-failing in-flight workflow runs (#1216)#1231

Merged
Wirasm merged 3 commits intodevfrom
fix/orphan-runs-no-startup-cleanup
Apr 15, 2026
Merged

fix: stop server startup from auto-failing in-flight workflow runs (#1216)#1231
Wirasm merged 3 commits intodevfrom
fix/orphan-runs-no-startup-cleanup

Conversation

@Wirasm
Copy link
Copy Markdown
Collaborator

@Wirasm Wirasm commented Apr 15, 2026

Summary

  • Problem: Every archon serve startup unconditionally flipped all running workflow rows to failed via failOrphanedRuns() at packages/server/src/index.ts:213. This killed CLI workflows actively executing in another process. Reproducer: start a workflow in one terminal, start the server in another while it's still running — the workflow's status flips to failed mid-execution and the CLI exits non-zero even though every node completed successfully. Filed as bug: server startup marks actively-running workflows as failed via failOrphanedRuns() #1216, discovered during PR fix(providers): replace Claude SDK embed with explicit binary-path resolver #1217 smoke testing.
  • Why it matters: Every server restart silently corrupted in-flight workflow state. Users with CI/cron-driven server restarts could lose long-running workflows without an actionable signal. The dag-executor's defensive between-layer check was the only thing preventing partial corruption — but that protection means valuable work (completed nodes, accumulated cost, generated artifacts) gets recorded with status=failed.
  • What changed: Backend removes the failOrphanedRuns() call from server startup (matches the CLI precedent at packages/cli/src/cli.ts:256-258). UI gets a numeric count badge on the Dashboard nav (replacing a binary pulse dot) and AlertDialog confirmations for destructive workflow-run actions (replacing 5 window.confirm() callsites).
  • What did not change (scope boundary): The failOrphanedRuns() function itself in packages/core/src/db/workflows.ts:911 is preserved — it's still used by archon workflow cleanup (the explicit user-driven path). Codex provider behavior unchanged. No DB migration. No new dependencies. No timer-based heuristic introduced anywhere — per the new CLAUDE.md principle.

UX Journey

Before

Terminal A                               Server (Terminal B)              UI
──────────                               ───────────────────              ──
archon workflow run e2e-claude-smoke
  ├─ creates run row (status=running)
  └─ executes nodes…
                                          archon serve  ──┐
                                          ├─ failOrphanedRuns()
                                          │  UPDATE remote_agent_workflow_runs
                                          │  SET status='failed'
                                          │  WHERE status='running'  ❌ kills A's row
                                          └─ binds port, ready
  ├─ next node finishes
  └─ between-layer status check
     sees status='failed'
     ↓
  ❌ Workflow failed:                                                    Dashboard:
     "Workflow did not complete                                           pulse dot
     successfully" (exit 1)                                               (binary signal)

After

Terminal A                               Server (Terminal B)              UI
──────────                               ───────────────────              ──
archon workflow run e2e-claude-smoke
  ├─ creates run row (status=running)
  └─ executes nodes…
                                          archon serve  ──┐
                                          *no failOrphanedRuns() call*
                                          └─ binds port, ready
  ├─ all nodes complete
  └─ between-layer status check
     sees status='running'
     ↓
  ✓ Workflow completed                                                    Dashboard nav:
    successfully (exit 0)                                                  [📊 Dashboard 1]
                                                                          (numeric count
                                                                           badge, hidden if 0)

User sees an unfamiliar "running" workflow on the dashboard?
   → clicks the workflow card → AlertDialog → "Cancel workflow" → confirmed → row marked cancelled
   (no system-driven heuristic; user owns the decision)

Architecture Diagram

Before

                 ┌──────────────────────────────┐
                 │  packages/server/src/index.ts│
                 │   startServer()              │
                 │   ├─ DB connect              │
                 │   ├─ failOrphanedRuns()  ──> │ ❌ mutates ALL `running` rows
                 │   │                          │    regardless of process owner
                 │   └─ bind port               │
                 └──────────────────────────────┘
                                    │
                                    ▼
                 ┌──────────────────────────────────────┐
                 │  packages/core/src/db/workflows.ts    │
                 │  failOrphanedRuns()                   │
                 │  UPDATE … SET status='failed'         │
                 │  WHERE status='running'  (no scope)   │
                 └──────────────────────────────────────┘

After

                 ┌──────────────────────────────┐
                 │  packages/server/src/index.ts│ [~]
                 │   startServer()              │
                 │   ├─ DB connect              │
                 │   ├─ // explanatory comment   │
                 │   │    *no autonomous mutation*│
                 │   └─ bind port               │
                 └──────────────────────────────┘

                 ┌──────────────────────────────────────┐
                 │  packages/core/src/db/workflows.ts    │ (unchanged)
                 │  failOrphanedRuns()                   │
                 │  Now only called by                   │
                 │  `archon workflow cleanup` (explicit) │
                 └──────────────────────────────────────┘

                 ┌────────────────────────────────────────────────────┐
                 │  packages/web/src/components/layout/TopNav.tsx [~] │
                 │   Dashboard nav: pulse-dot ──> count badge          │
                 │   reads /api/dashboard/runs counts.running          │
                 └────────────────────────────────────────────────────┘

                 ┌────────────────────────────────────────────────────┐
                 │  packages/web/src/components/dashboard/             │
                 │   ConfirmRunActionDialog.tsx  [+]                   │
                 │   shadcn AlertDialog wrapper, mirrors               │
                 │   sidebar/ProjectSelector codebase-delete pattern   │
                 │                                                      │
                 │   WorkflowRunCard.tsx  [~]                          │
                 │   4× window.confirm() ──> ConfirmRunActionDialog    │
                 │                                                      │
                 │   WorkflowHistoryTable.tsx  [~]                     │
                 │   1× window.confirm() ──> ConfirmRunActionDialog    │
                 └────────────────────────────────────────────────────┘

Connection inventory:

From To Status Notes
server/index.ts createWorkflowStore() removed no longer needed at startup; archon workflow cleanup retains the link
server/index.ts failOrphanedRuns() removed the offending invocation
TopNav.tsx listDashboardRuns new replaces listWorkflowRuns; reads counts.running
TopNav.tsx listWorkflowRuns removed superseded by listDashboardRuns query
WorkflowRunCard.tsx ConfirmRunActionDialog new 4 callsites (Reject, Abandon, Cancel, Delete)
WorkflowHistoryTable.tsx ConfirmRunActionDialog new 1 callsite (Delete)
ConfirmRunActionDialog.tsx shadcn AlertDialog primitives new mirrors sidebar/ProjectSelector.tsx:142–165 pattern
WorkflowRunCard.tsx window.confirm removed 4 callsites
WorkflowHistoryTable.tsx window.confirm removed 1 callsite

Label Snapshot

  • Risk: risk: low (removal of an autonomous mutation; UI changes are additive replacements of an existing pattern)
  • Size: size: M (6 files, +197 / −72; bulk in WorkflowRunCard's 4 dialog conversions)
  • Scope: core, server, web
  • Module: server:index, web:dashboard, web:layout

Change Metadata

Linked Issue

Validation Evidence (required)

bun run type-check     # 10/10 packages: Exited with code 0
bun run lint           # 0 errors, 0 warnings
bun run format:check   # All matched files use Prettier code style
bun run test           # one pre-existing failure on dev (cleanup-service.test.ts —
                       # `runScheduledCleanup > continues processing after error on
                       # one environment`); verified to also fail on origin/dev
                       # without this branch's changes. NOT introduced by this PR.

End-to-end reproducer (the bug fix verification):

# Terminal A
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --no-worktree

# Terminal B (during A's execution)
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run dev:server

Result with this PR applied:

  • Terminal A → Workflow completed successfully. (exit 0) ✓
  • Server log → zero orphan / fail_orphans / orphaned_workflow_runs_failed events ✓
  • DB → run row ends with status='completed', not failed

Without this PR (verified before the fix): Terminal A exits 1 with "Workflow failed", server log emits db.orphaned_workflow_runs_failed { count: 1 } — exactly the run that was in flight.

Regression sweep:

grep -n 'window\.confirm' \
  packages/web/src/components/dashboard/WorkflowRunCard.tsx \
  packages/web/src/components/dashboard/WorkflowHistoryTable.tsx
# zero matches
grep -nE "failOrphanedRuns\(\)" packages/server/src/index.ts
# zero matches

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes (no API contract change; failOrphanedRuns() retained for explicit cleanup)
  • Config/env changes? No
  • Database migration needed? No

Behavioral change for operators: Server restarts no longer auto-mark running workflow rows as failed. Truly orphaned rows from a crashed server now persist as running until cleaned up via archon workflow cleanup or per-row Cancel/Abandon in the dashboard. The Dashboard nav count badge surfaces the count.

Human Verification (required)

  • Verified scenarios:
    • End-to-end reproducer (above) — bug confirmed fixed
    • bun run dev:server starts cleanly with no orphan-related log events
    • CLI workflow completes cleanly even with concurrent server start
    • Type-check + lint + format all green across all 10 packages
  • Edge cases checked:
    • The failOrphanedRuns() function is preserved and still callable by the explicit archon workflow cleanup path
    • The unused createWorkflowStore import in server/index.ts was also removed (caught by TS noUnusedLocals)
    • The ConfirmRunActionDialog does NOT swallow promise rejections from onConfirm — errors propagate to the parent's runAction helper which already displays them via actionError state
  • What was not verified:
    • UI manual interaction with the new AlertDialogs (no browser available in this environment) — the AlertDialog primitive and the mirrored ProjectSelector pattern are both production-tested elsewhere; the change is essentially a render-shape swap
    • The dashboard nav badge update timing (relies on existing 10s polling; should appear within 10s of a workflow start)
    • No new component tests added — the web package has no React component test infrastructure (bun test only covers src/lib/ and src/stores/); adding @testing-library/react would be significant scope creep matching no existing pattern. Type-check + lint + manual UI verification + the backend reproducer are the verification levels in this PR.

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: Server startup; CLI workflows running concurrently with server restarts; web UI Dashboard nav + workflow run cards + history table
  • Potential unintended effects:
    • Truly orphaned running rows from crashed servers will accumulate in the DB until explicit cleanup. The count badge surfaces them; users can click into the dashboard and Cancel per row. This is the intended trade-off per CLAUDE.md "No Autonomous Lifecycle Mutation Across Process Boundaries".
    • listDashboardRuns({ status: 'running', limit: 1 }) in TopNav adds one query per 10s where listWorkflowRuns was previously called. Same frequency, slightly heavier endpoint (returns enriched run + counts vs raw run array). The limit: 1 keeps the runs payload trivially small; we only consume counts.running.
  • Guardrails / monitoring for early detection:
    • The dashboard nav count badge is the primary visibility signal — operators see it grow if orphans accumulate
    • archon workflow status CLI command continues to work and lists running rows
    • Existing db.orphaned_workflow_runs_failed log event is now only emitted by the explicit cleanup path, so its presence post-merge is a useful signal that someone ran cleanup intentionally

Rollback Plan (required)

  • Fast rollback command/path: git revert 7a00e047 on dev. One commit, atomic. No DB changes to reverse.
  • Feature flags or config toggles: None.
  • Observable failure symptoms:
    • If for some reason the dashboard nav badge fails to render: existing pulse-dot pattern is restored by the revert
    • If the AlertDialogs misbehave: revert restores window.confirm (worse UX but functional)
    • If the bug fix introduces an unforeseen regression: very unlikely (the change is a removal of an unconditional mutation), but revert is safe and restores the prior behavior including the bug

Risks and Mitigations

  • Risk: Operators who relied on server restarts to "tidy up" stuck workflows will need to use archon workflow cleanup or the dashboard explicitly. Some may not realize the behavior changed.
    • Mitigation: CHANGELOG entry under [Unreleased] documents the change. The Dashboard nav count badge surfaces stuck workflows visibly. Server log on startup no longer emits the misleading db.orphaned_workflow_runs_failed event, removing a false-positive signal.
  • Risk: A pause in dialog-confirmation polish leaves the codebase with two destructive-confirm patterns (AlertDialog in some places, window.confirm elsewhere — ProjectSelector pattern, this PR's pattern, and any I missed).
    • Mitigation: This PR replaces all window.confirm in the touched files (WorkflowRunCard.tsx, WorkflowHistoryTable.tsx). Other components in packages/web/ may still use window.confirm and should be reviewed in a follow-up sweep — out of scope here.

Summary by CodeRabbit

  • New Features

    • Custom confirmation dialogs replace browser prompts for destructive workflow actions (Abandon, Cancel, Delete, Reject), providing contextual titles and descriptions.
  • Changed

    • Server no longer auto-marks actively-running workflows as failed on startup; stuck runs must be handled by users via dashboard actions or the CLI.
    • Dashboard tab shows running workflow count as a numeric badge with aria-labels.
  • Documentation

    • Guides updated to reflect orphaned-run, resume, and cleanup behavior.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e577b3e-5d3f-432d-8a1d-8feb9dd836d5

📥 Commits

Reviewing files that changed from the base of the PR and between e7e7be9 and fc1a41d.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • packages/docs-web/src/content/docs/guides/authoring-workflows.md
  • packages/server/src/index.ts
  • packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx
  • packages/web/src/components/dashboard/WorkflowHistoryTable.tsx
  • packages/web/src/components/dashboard/WorkflowRunCard.tsx
  • packages/web/src/components/layout/TopNav.tsx
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/server/src/index.ts
  • packages/web/src/components/dashboard/WorkflowHistoryTable.tsx
  • packages/web/src/components/dashboard/WorkflowRunCard.tsx
  • packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx

📝 Walkthrough

Walkthrough

Server startup no longer auto-fails running workflow rows (removed failOrphanedRuns()); orphaned-run cleanup is user-driven. Web UI: native window.confirm() prompts replaced by a reusable confirmation dialog component, and the top navigation now shows a numeric running-workflow count badge.

Changes

Cohort / File(s) Summary
Changelog
CHANGELOG.md
Documented removal of server startup orphan-run cleanup, explicit user-driven orphan handling, dashboard running-count badge, and replacement of native confirm prompts with a custom dialog.
Server startup
packages/server/src/index.ts
Removed createWorkflowStore().failOrphanedRuns() startup call and added comment that orphaned-run cleanup is not run at boot.
Confirm dialog component
packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx
New ConfirmRunActionDialog React component wrapping AlertDialog; accepts trigger, title, description, confirmLabel, and onConfirm.
Dashboard destructive actions
packages/web/src/components/dashboard/WorkflowHistoryTable.tsx, packages/web/src/components/dashboard/WorkflowRunCard.tsx
Replaced inline window.confirm() flows with ConfirmRunActionDialog; action callbacks moved to dialog onConfirm.
Top navigation
packages/web/src/components/layout/TopNav.tsx
Query switched to listDashboardRuns(..., forCount: true) and UI changed from pulsing dot to numeric runningCount badge with appropriate aria-label.
Docs
packages/docs-web/src/content/docs/guides/authoring-workflows.md
Removed claim that server restart auto-fails running runs; documented that stuck running rows remain and require explicit user/CLI actions; clarified archon workflow cleanup only removes old terminal runs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hop and nudge the running rows anew,
No more false fails when servers restart you.
A friendly dialog asks before we part—
Badges count the runners close to my heart.
Hooray for cleanups done by human art!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: removing server startup's auto-failing of in-flight workflow runs, which is the core bug fix.
Description check ✅ Passed The PR description comprehensively covers the template sections: problem statement, impact analysis, detailed before/after flows, architecture diagrams with connection inventory, change metadata, validation evidence with end-to-end reproducer, security/compatibility/human verification sections, and rollback plan.
Linked Issues check ✅ Passed The PR fully addresses issue #1216 by implementing the preferred solution: removing failOrphanedRuns() from server startup, matching the CLI precedent, and preserving the function for explicit cleanup via archon workflow cleanup.
Out of Scope Changes check ✅ Passed All changes are scoped to the primary objectives: backend orphan-cleanup removal, web UI count badge replacement, and confirmation dialog replacements. No unrelated refactors, dependency additions, or unintended scope creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/orphan-runs-no-startup-cleanup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/web/src/components/layout/TopNav.tsx (1)

18-23: Query key change may cause brief staleness after chat operations.

ChatInterface invalidates ['workflowRuns'] but this query uses ['dashboardRuns', { status: 'running', forCount: true }]. Workflow state changes triggered from chat won't immediately update this count. The 10s refetchInterval provides eventual consistency, so this is a minor UX gap rather than a bug.

Consider adding ['dashboardRuns'] to ChatInterface's invalidation list if immediate consistency is desired.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/web/src/components/layout/TopNav.tsx` around lines 18 - 23, The
dashboardRuns useQuery (queryKey ['dashboardRuns', { status: 'running',
forCount: true }], queryFn listDashboardRuns) can become briefly stale because
ChatInterface currently only invalidates ['workflowRuns']; update ChatInterface
to also invalidate the ['dashboardRuns'] query key (or include the matching key
shape) after chat-triggered workflow changes so runningCount reflects updates
immediately instead of waiting for the 10s refetchInterval.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/web/src/components/layout/TopNav.tsx`:
- Around line 18-23: The dashboardRuns useQuery (queryKey ['dashboardRuns', {
status: 'running', forCount: true }], queryFn listDashboardRuns) can become
briefly stale because ChatInterface currently only invalidates ['workflowRuns'];
update ChatInterface to also invalidate the ['dashboardRuns'] query key (or
include the matching key shape) after chat-triggered workflow changes so
runningCount reflects updates immediately instead of waiting for the 10s
refetchInterval.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dc978b37-07be-4213-bd83-1f87825e78ea

📥 Commits

Reviewing files that changed from the base of the PR and between f61d576 and 7a00e04.

📒 Files selected for processing (6)
  • CHANGELOG.md
  • packages/server/src/index.ts
  • packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx
  • packages/web/src/components/dashboard/WorkflowHistoryTable.tsx
  • packages/web/src/components/dashboard/WorkflowRunCard.tsx
  • packages/web/src/components/layout/TopNav.tsx

@Wirasm
Copy link
Copy Markdown
Collaborator Author

Wirasm commented Apr 15, 2026

PR Review Summary

Six specialized agents reviewed this PR in parallel: code-reviewer, silent-failure-hunter, type-design-analyzer, comment-analyzer, docs-impact, code-simplifier.

Critical Issues (0 found)

None.

Important Issues (0 found)

None.

Suggestions (4 found)

Agent Suggestion Location
docs-impact Docs describe the old auto-fail-on-restart behavior — directly contradicted by this PR packages/docs-web/src/content/docs/guides/authoring-workflows.md:486
code-simplifier onConfirm: () => void | Promise<void> — the Promise branch is unused; every callsite is synchronous and void onConfirm() never awaits. Narrow to () => void. packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx:22
code-simplifier Redundant String(runningCount) — already typed number, template literal coerces automatically packages/web/src/components/layout/TopNav.tsx:64
comment-analyzer Comment implies limit=1 limits what counts.running returns; actually limit only constrains the runs array — counts is a separate aggregate packages/web/src/components/layout/TopNav.tsx:19-21

Strengths

  • Principle compliance: Removal of failOrphanedRuns() at startup is a textbook application of CLAUDE.md's "No Autonomous Lifecycle Mutation Across Process Boundaries". The server/src/index.ts comment cites the principle by name, references the CLI precedent, and links bug: server startup marks actively-running workflows as failed via failOrphanedRuns() #1216 — flagged as exemplary by comment-analyzer.
  • No silent failures: void onConfirm() in ConfirmRunActionDialog is intentional and documented; errors propagate to the parent runAction helper in DashboardPage.tsx:272-284 which surfaces them via actionError state. No swallowed rejections anywhere in the diff.
  • Type design: Props interface for the new dialog scored 7.5/10 average across four dimensions. Description typed as ReactNode enables rich formatting without dangerouslySetInnerHTML.
  • Clean dead-code removal: createWorkflowStore import removed alongside the orphan call; no leftovers.
  • Mirrors existing pattern: ConfirmRunActionDialog follows sidebar/ProjectSelector.tsx:142-165 faithfully; the asChild+<div> divergence in AlertDialogDescription is an intentional fix for nested block elements under a ReactNode description.

Documentation Issues

  • packages/docs-web/.../authoring-workflows.md:486 — Describes the removed behavior as current. Replace with text noting that stuck running runs must now be cleaned up via archon workflow cleanup or dashboard Abandon/Cancel.
  • CHANGELOG.md [Unreleased] entry is accurate; no update needed.
  • CLAUDE.md principle already codified (c4ab0a2); no update needed.
  • CLI reference docs for archon workflow cleanup remain accurate.

Verdict

READY TO MERGE — with one doc fix recommended.

The only substantive finding is the stale docs-web page describing the removed behavior. The three code suggestions are nits (4 lines total). Nothing blocks merge.

Recommended Actions

  1. Update packages/docs-web/src/content/docs/guides/authoring-workflows.md:486 to reflect the new behavior.
  2. Optional polish: narrow onConfirm to () => void, drop String() cast, tighten counts.running comment.
  3. Note the pre-existing cleanup-service.test.ts failure on dev is unrelated (documented in PR body; verified on origin/dev).

Wirasm added a commit that referenced this pull request Apr 15, 2026
PR review surfaced one real correctness issue in docs and three small
code polish items. None block merge; addressing for cleanliness.

- packages/docs-web/src/content/docs/guides/authoring-workflows.md:486
  removed the "auto-marked as failed on next startup" paragraph that
  described the now-deleted behavior. Replaced with a "Crashed servers /
  orphaned runs" note pointing users at `archon workflow cleanup` and
  the dashboard Cancel/Abandon buttons; explains the auto-resume
  mechanism still works once the row reaches a terminal status.

- ConfirmRunActionDialog: narrow `onConfirm` from
  `() => void | Promise<void>` to `() => void`. All five callsites are
  synchronous wrappers around React Query mutations whose error
  handling lives at the page level (`runAction` in DashboardPage). The
  union widened the API for no current caller. Documented in the JSDoc
  what to do if an awaiting caller appears later.

- TopNav: dropped the redundant `String(runningCount)` cast in the
  aria-label — template literal coerces. Also rewrote the comment above
  the `listDashboardRuns` query: the previous version implied `limit=1`
  constrained `counts.running`; in fact `counts` is a server-side
  aggregate independent of `limit`, and `limit=1` only minimises the
  `runs` array we discard.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/docs-web/src/content/docs/guides/authoring-workflows.md`:
- Line 486: Update the paragraph to clarify that the CLI command "archon
workflow cleanup" invokes deleteOldWorkflowRuns(days) and only deletes terminal
runs (failed, cancelled, completed) older than N days; it does NOT clean up
orphaned "running" rows. For stuck "running" rows instruct users to use the
dashboard per-row Cancel/Abandon buttons or rely on the server-side
failOrphanedRuns() which runs on server startup, and adjust the statement about
auto-resume to reference that rows become failed/cancelled when explicitly
cleaned so subsequent invocations can auto-resume from completed nodes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0602701d-9068-49bd-a413-50af37a216e6

📥 Commits

Reviewing files that changed from the base of the PR and between 7a00e04 and e7e7be9.

📒 Files selected for processing (3)
  • packages/docs-web/src/content/docs/guides/authoring-workflows.md
  • packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx
  • packages/web/src/components/layout/TopNav.tsx
✅ Files skipped from review due to trivial changes (1)
  • packages/web/src/components/layout/TopNav.tsx
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx

Comment thread packages/docs-web/src/content/docs/guides/authoring-workflows.md Outdated
Wirasm added 3 commits April 15, 2026 11:55
…1216)

`failOrphanedRuns()` at server startup unconditionally flipped every
`running` workflow row to `failed`, including runs actively executing in
another process (CLI / adapters). The dag-executor's between-layer
status check then bailed out of the run, exit code 1 — even though every
node had completed successfully. Same class of bug the CLI already
learned (see comment at packages/cli/src/cli.ts:256-258).

Per the new CLAUDE.md principle "No Autonomous Lifecycle Mutation Across
Process Boundaries", we don't replace the call with a timer-based
heuristic. Instead we remove it and surface running workflows to the
user with one-click actions.

Backend
- `packages/server/src/index.ts` — remove the `failOrphanedRuns()` call
  at startup. Replace with explanatory comment referencing the CLI
  precedent and the CLAUDE.md principle. The function in
  `packages/core/src/db/workflows.ts:911` is preserved for use by the
  explicit `archon workflow cleanup` command.

UI
- `packages/web/src/components/layout/TopNav.tsx` — replace the binary
  pulse dot on the Dashboard nav with a numeric count badge sourced
  from `/api/dashboard/runs` `counts.running`. Hidden when count is 0.
  Same 10s polling interval as before. No animation — a steady factual
  count is honest; a pulse would imply system judgment.

- `packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx`
  (new) — shadcn AlertDialog wrapper for destructive workflow-run
  actions, mirroring the codebase-delete pattern in
  `sidebar/ProjectSelector.tsx`. Caller passes the existing button as
  `trigger` slot; dialog handles open/close via Radix.

- `packages/web/src/components/dashboard/WorkflowRunCard.tsx` — replace
  4 `window.confirm()` callsites (Reject, Abandon, Cancel, Delete) with
  ConfirmRunActionDialog. Each gets a context-appropriate description.

- `packages/web/src/components/dashboard/WorkflowHistoryTable.tsx` —
  replace 1 `window.confirm()` (Delete) with the same dialog.

CHANGELOG entries under [Unreleased]: Fixed for #1216, two Changed
entries for the nav badge and dialog upgrade.

No new tests: the web package has no React component testing
infrastructure (existing `bun test` covers `src/lib/` and `src/stores/`
only). Type-check + lint + manual UI verification + the backend
reproducer are the verification levels.

Closes #1216.
PR review surfaced one real correctness issue in docs and three small
code polish items. None block merge; addressing for cleanliness.

- packages/docs-web/src/content/docs/guides/authoring-workflows.md:486
  removed the "auto-marked as failed on next startup" paragraph that
  described the now-deleted behavior. Replaced with a "Crashed servers /
  orphaned runs" note pointing users at `archon workflow cleanup` and
  the dashboard Cancel/Abandon buttons; explains the auto-resume
  mechanism still works once the row reaches a terminal status.

- ConfirmRunActionDialog: narrow `onConfirm` from
  `() => void | Promise<void>` to `() => void`. All five callsites are
  synchronous wrappers around React Query mutations whose error
  handling lives at the page level (`runAction` in DashboardPage). The
  union widened the API for no current caller. Documented in the JSDoc
  what to do if an awaiting caller appears later.

- TopNav: dropped the redundant `String(runningCount)` cast in the
  aria-label — template literal coerces. Also rewrote the comment above
  the `listDashboardRuns` query: the previous version implied `limit=1`
  constrained `counts.running`; in fact `counts` is a server-side
  aggregate independent of `limit`, and `limit=1` only minimises the
  `runs` array we discard.
CodeRabbit caught a factual error I introduced in the doc update:
`archon workflow cleanup` calls `deleteOldWorkflowRuns(days)` which
DELETEs old terminal rows (`completed`/`failed`/`cancelled` older than
N days) for disk hygiene. It does NOT transition stuck `running` rows.

The correct remediation for a stuck `running` row is either the
dashboard's per-row Cancel/Abandon button (already documented) or
`archon workflow abandon <run-id>` from the CLI (existing subcommand,
see packages/cli/src/cli.ts:366-374).

Fixed three locations:
- packages/docs-web/.../guides/authoring-workflows.md — replaced the
  vague "clean up explicitly" with concrete Web UI / CLI instructions
  and an explicit "Not to be confused with `archon workflow cleanup`"
  callout to close off the ambiguity CodeRabbit flagged.
- packages/server/src/index.ts — comment updated to point at the
  correct remediation (`archon workflow abandon`) and clarify that
  `archon workflow cleanup` is unrelated disk-hygiene.
- CHANGELOG.md — same correction in the [Unreleased] Fixed entry.
@Wirasm Wirasm force-pushed the fix/orphan-runs-no-startup-cleanup branch from e7e7be9 to fc1a41d Compare April 15, 2026 08:58
@Wirasm Wirasm merged commit 882fc58 into dev Apr 15, 2026
4 checks passed
@Wirasm Wirasm deleted the fix/orphan-runs-no-startup-cleanup branch April 15, 2026 09:05
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…oleam00#1216) (coleam00#1231)

* fix: stop server startup from auto-failing in-flight workflow runs (coleam00#1216)

`failOrphanedRuns()` at server startup unconditionally flipped every
`running` workflow row to `failed`, including runs actively executing in
another process (CLI / adapters). The dag-executor's between-layer
status check then bailed out of the run, exit code 1 — even though every
node had completed successfully. Same class of bug the CLI already
learned (see comment at packages/cli/src/cli.ts:256-258).

Per the new CLAUDE.md principle "No Autonomous Lifecycle Mutation Across
Process Boundaries", we don't replace the call with a timer-based
heuristic. Instead we remove it and surface running workflows to the
user with one-click actions.

Backend
- `packages/server/src/index.ts` — remove the `failOrphanedRuns()` call
  at startup. Replace with explanatory comment referencing the CLI
  precedent and the CLAUDE.md principle. The function in
  `packages/core/src/db/workflows.ts:911` is preserved for use by the
  explicit `archon workflow cleanup` command.

UI
- `packages/web/src/components/layout/TopNav.tsx` — replace the binary
  pulse dot on the Dashboard nav with a numeric count badge sourced
  from `/api/dashboard/runs` `counts.running`. Hidden when count is 0.
  Same 10s polling interval as before. No animation — a steady factual
  count is honest; a pulse would imply system judgment.

- `packages/web/src/components/dashboard/ConfirmRunActionDialog.tsx`
  (new) — shadcn AlertDialog wrapper for destructive workflow-run
  actions, mirroring the codebase-delete pattern in
  `sidebar/ProjectSelector.tsx`. Caller passes the existing button as
  `trigger` slot; dialog handles open/close via Radix.

- `packages/web/src/components/dashboard/WorkflowRunCard.tsx` — replace
  4 `window.confirm()` callsites (Reject, Abandon, Cancel, Delete) with
  ConfirmRunActionDialog. Each gets a context-appropriate description.

- `packages/web/src/components/dashboard/WorkflowHistoryTable.tsx` —
  replace 1 `window.confirm()` (Delete) with the same dialog.

CHANGELOG entries under [Unreleased]: Fixed for coleam00#1216, two Changed
entries for the nav badge and dialog upgrade.

No new tests: the web package has no React component testing
infrastructure (existing `bun test` covers `src/lib/` and `src/stores/`
only). Type-check + lint + manual UI verification + the backend
reproducer are the verification levels.

Closes coleam00#1216.

* review: address PR coleam00#1231 nits — stale doc + 3 code polish

PR review surfaced one real correctness issue in docs and three small
code polish items. None block merge; addressing for cleanliness.

- packages/docs-web/src/content/docs/guides/authoring-workflows.md:486
  removed the "auto-marked as failed on next startup" paragraph that
  described the now-deleted behavior. Replaced with a "Crashed servers /
  orphaned runs" note pointing users at `archon workflow cleanup` and
  the dashboard Cancel/Abandon buttons; explains the auto-resume
  mechanism still works once the row reaches a terminal status.

- ConfirmRunActionDialog: narrow `onConfirm` from
  `() => void | Promise<void>` to `() => void`. All five callsites are
  synchronous wrappers around React Query mutations whose error
  handling lives at the page level (`runAction` in DashboardPage). The
  union widened the API for no current caller. Documented in the JSDoc
  what to do if an awaiting caller appears later.

- TopNav: dropped the redundant `String(runningCount)` cast in the
  aria-label — template literal coerces. Also rewrote the comment above
  the `listDashboardRuns` query: the previous version implied `limit=1`
  constrained `counts.running`; in fact `counts` is a server-side
  aggregate independent of `limit`, and `limit=1` only minimises the
  `runs` array we discard.

* review: correct remediation docs — cleanup ≠ abandon

CodeRabbit caught a factual error I introduced in the doc update:
`archon workflow cleanup` calls `deleteOldWorkflowRuns(days)` which
DELETEs old terminal rows (`completed`/`failed`/`cancelled` older than
N days) for disk hygiene. It does NOT transition stuck `running` rows.

The correct remediation for a stuck `running` row is either the
dashboard's per-row Cancel/Abandon button (already documented) or
`archon workflow abandon <run-id>` from the CLI (existing subcommand,
see packages/cli/src/cli.ts:366-374).

Fixed three locations:
- packages/docs-web/.../guides/authoring-workflows.md — replaced the
  vague "clean up explicitly" with concrete Web UI / CLI instructions
  and an explicit "Not to be confused with `archon workflow cleanup`"
  callout to close off the ambiguity CodeRabbit flagged.
- packages/server/src/index.ts — comment updated to point at the
  correct remediation (`archon workflow abandon`) and clarify that
  `archon workflow cleanup` is unrelated disk-hygiene.
- CHANGELOG.md — same correction in the [Unreleased] Fixed entry.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: server startup marks actively-running workflows as failed via failOrphanedRuns()

1 participant