
feat(web): surface stalled workflow runs with abandon action + fix(workflows): archon-architect analyze node prompt/tools conflict #1374

Closed
MartienM wants to merge 2 commits into coleam00:dev from MartienM:fix/stalled-workflow-ui


@MartienM MartienM commented Apr 23, 2026

Summary

  • Web UI: surfaces stalled workflow runs (no events for 5 min) with a warning banner and one-click Abandon action; remaps dangling running/pending node states and unmatched tool_called events after terminal status so the UI stops rendering spinners for completed runs.
  • Workflow fix: removes an impossible "Write a file" instruction from the archon-architect analyze node. The node had denied_tools: [Write, Edit, Bash], so the Claude SDK looped on ToolSearch trying to obtain Write, then stalled silently and emitted no events for hours.

Context

Hit this in production: ran archon-architect on a real codebase, it stalled 1m23s into the analyze node with no node_failed or workflow_failed event. The UI polled forever, the conversation slot stayed locked, and there was no visible way to abandon without dropping to CLI.

Root cause was two unrelated issues that together produced a silent, unrecoverable hang:

  1. analyze node prompt told the agent to Write a structured assessment to $ARTIFACTS_DIR/architecture-assessment.md while denied_tools: [Write, Edit, Bash] made that impossible. The agent retried ToolSearch "select:Write" five times over 12 seconds, announced it would spawn a sub-agent, then produced no further events. The file was never read anywhere downstream — plan consumes the assessment via $analyze.output (assistant response text), not via disk.
  2. The analyze node had no timeout, so the engine never marked the run as failed; and abandonWorkflow() only mutates the DB row — it emits no node_failed events, so the UI keeps showing spinners on the last-running node.
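
The first root cause (a prompt that asks for a tool the node denies) could in principle be caught before a run starts. The following is a minimal sketch of such a check, not something this PR adds: `findDeniedToolMentions` and `escapeRegExp` are hypothetical helpers, and the whole-word, case-sensitive match is a rough heuristic that assumes prompts name tools the way `denied_tools` spells them.

```typescript
/** Escapes regex metacharacters so a tool name can be embedded in a pattern. */
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

/**
 * Hypothetical lint helper: returns the denied tools that the prompt text
 * mentions as whole, case-sensitive words (e.g. "Write a structured ...").
 * This is a heuristic and can produce false positives.
 */
function findDeniedToolMentions(
  prompt: string,
  deniedTools: readonly string[],
): string[] {
  return deniedTools.filter((tool) =>
    new RegExp(`\\b${escapeRegExp(tool)}\\b`).test(prompt),
  );
}

// The two prompt variants quoted in this PR description.
const deniedTools = ["Write", "Edit", "Bash"];
const oldPrompt =
  "Write a structured assessment to $ARTIFACTS_DIR/architecture-assessment.md";
const newPrompt = "Produce a structured assessment as your final response";

const oldConflicts = findDeniedToolMentions(oldPrompt, deniedTools);
const newConflicts = findDeniedToolMentions(newPrompt, deniedTools);
```

Under these assumptions the old prompt is flagged for `Write` and the reworded prompt is not flagged at all.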

What changes

packages/web/src/components/workflows/WorkflowExecution.tsx

  • STALE_THRESHOLD_MS = 5 * 60 * 1000 — derived from max(events.created_at) vs Date.now(), re-evaluated each second via the existing setTick timer.
  • Yellow warning banner between the tab strip and the body when a running run has had no events for >5min, with "Abandon run" button wired to abandonWorkflowRun.
  • Post-merge node remap: when the workflow is terminal, in-flight running/pending DAG nodes are mapped to skipped so stale node_started rows (without a terminator) don't paint a spinner.
  • Post-merge tool-call remap: when the run is terminal, unmatched tool_called events get duration: 0 so WorkflowLogs stops treating them as running.
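
The stall check in the first bullet can be sketched roughly as below. This is an illustration under stated assumptions rather than the component's actual code: the `WorkflowEvent` shape, the `latestEventMs` and `isStalled` names, and the decision to treat a run with no events as "not stalled" are all hypothetical. Only `STALE_THRESHOLD_MS`, `created_at`, and the comparison against `Date.now()` come from the description above.

```typescript
// Hypothetical event shape; only `created_at` is named in the PR description.
interface WorkflowEvent {
  created_at: string; // ISO-8601 timestamp
}

const STALE_THRESHOLD_MS = 5 * 60 * 1000; // 5 minutes, as stated above

/** Newest parseable event timestamp in ms, or null if there is none. */
function latestEventMs(events: readonly WorkflowEvent[]): number | null {
  let latest: number | null = null;
  for (const event of events) {
    const t = Date.parse(event.created_at);
    if (!Number.isNaN(t) && (latest === null || t > latest)) {
      latest = t;
    }
  }
  return latest;
}

/**
 * A run counts as stalled only while it is still "running" and its newest
 * event is older than the threshold. `nowMs` is injectable so a timer tick
 * (or a test) can supply the current time.
 */
function isStalled(
  status: string,
  events: readonly WorkflowEvent[],
  nowMs: number = Date.now(),
): boolean {
  if (status !== "running") return false;
  const latest = latestEventMs(events);
  if (latest === null) return false; // assumption: no events means nothing to compare
  return nowMs - latest > STALE_THRESHOLD_MS;
}
```

Passing the clock in as a parameter keeps the function pure, which fits a component that re-evaluates it on every tick.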

Deliberately kept out-of-scope: autonomous lifecycle mutation. Per the "No Autonomous Lifecycle Mutation Across Process Boundaries" rule in CLAUDE.md, the engine still doesn't auto-fail stalled runs — the user confirms via the button. CLI-started runs stay in user hands.

.archon/workflows/defaults/archon-architect.yaml

  • Replaced "Write a structured assessment to $ARTIFACTS_DIR/..." with "Produce a structured assessment as your final response" and a note explaining the node cannot write files. Downstream plan already consumes $analyze.output; the file was never referenced anywhere.

packages/workflows/src/defaults/bundled-defaults.generated.ts

  • Regenerated via bun run generate:bundled so check:bundled passes.

Test plan

  • Type-check passes (bun --filter @archon/web type-check)
  • Lint passes (eslint --max-warnings 0 on changed file)
  • bun run check:bundled passes
  • Manual: reproduced the original stall on archon-architect run 0fae53d26e8158074af432e119f99194, opened the run page, observed the banner after 5min, clicked Abandon, saw run transition to cancelled and both node + tool spinners disappear.
  • Re-ran archon-architect after the prompt fix — completed successfully in 21m10s (all 7 nodes including analyze → plan chain through $analyze.output).

Not in scope

  • Engine-side stale detection (heartbeat watchdog that emits workflow_stalled events). Considered, decided to start with the UI-only signal because (a) it's the minimum needed to unblock users on this class of failure, (b) moving to engine-side is a larger PR that should be evaluated separately against the autonomous-mutation principle, and (c) the UI fix works regardless of how the run was started (web, CLI, adapter).
  • A node-level timeout default for AI nodes. Related but orthogonal — timeouts catch "legitimately too long", this PR catches "silently stopped emitting". Both have their place.

Summary by CodeRabbit

  • New Features

    • Added ability to abandon long-running workflow runs from the UI.
    • Added detection and warning banner for stalled workflows.
  • Bug Fixes

    • Improved handling of workflow terminal states to prevent spinners and incorrect status displays.
  • Performance

    • Increased validation timeout from 5 to 30 minutes for longer validation runs.

MartienM and others added 2 commits April 22, 2026 16:15
When the Claude SDK subprocess stops emitting events without signalling
failure, runs stay in `running` forever — the UI polls indefinitely and
the conversation slot stays locked. WorkflowExecution now shows a
warning banner when no workflow_event has arrived in 5 minutes, with a
one-click Abandon action. Abandon keeps lifecycle-mutation in user hands
(per CLAUDE.md's "No Autonomous Lifecycle Mutation" rule).

Also patches dangling state that abandonWorkflow() does not emit events
for: on terminal status, still-`running` DAG nodes are remapped to
`skipped`, and unmatched tool_called events get `duration: 0` so
WorkflowLogs stops rendering a permanent spinner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tect analyze node

The analyze node has `denied_tools: [Write, Edit, Bash]` by design — it
is the "pure diagnosis" phase of the workflow's trust ladder (measure →
analyze → plan → execute). But the prompt explicitly told the agent to
write the assessment to $ARTIFACTS_DIR/architecture-assessment.md. The
agent would then loop on ToolSearch trying to obtain a Write tool, fail,
and stall indefinitely — no node-level timeout on analyze, so the run
never terminates.

The downstream `plan` node already consumes the assessment via
$analyze.output (the assistant response text), not the file. The file
was never read anywhere. Changed the prompt to ask for a structured
response instead of a file write, matching what the workflow actually
uses.

Also regenerates bundled-defaults.generated.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Apr 23, 2026

Walkthrough

The PR modifies the archon-architect workflow to output analysis results directly as structured response data rather than writing to a file, increases validation timeout to 30 minutes, and adds stall detection, improved event handling, state normalization, and workflow abandonment capabilities to the WorkflowExecution component.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Workflow Configuration: .archon/workflows/defaults/archon-architect.yaml, packages/workflows/src/defaults/bundled-defaults.generated.ts | Modified analyze node to output structured assessment as final response (consumed via $analyze.output) instead of writing to architecture-assessment.md; increased validate node timeout from 300s to 1800s (30 min); adjusted scan-metrics prompt for TYPE SAFETY GAPS statistics calculation. |
| WorkflowExecution Component: packages/web/src/components/workflows/WorkflowExecution.tsx | Added stall detection with warning banner when workflow has no new events for threshold duration; introduced tool-event duration mapping (assigning duration: 0 to unmatched tool_called events in terminal workflows); refactored workflow state merging to remap running/pending DAG nodes to skipped in terminal state; added user-initiated abandon control with abandoningState, API call, and cache invalidation. |
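
The terminal-state normalization summarized in the second row might look roughly like the sketch below. The type names, the status unions, and the `normalizeNodes` / `finalizeToolCalls` helpers are assumptions made for illustration; the PR only states that running/pending nodes become skipped and that unmatched tool_called events get duration: 0 once the run is terminal. Here an "unmatched" tool call is modelled as one whose `duration` is still undefined.

```typescript
// Hypothetical status unions; the real types are generated from the OpenAPI spec.
type RunStatus = "pending" | "running" | "completed" | "failed" | "cancelled";
type NodeStatus = "pending" | "running" | "completed" | "failed" | "skipped";

interface DagNodeState {
  id: string;
  status: NodeStatus;
}

interface ToolCall {
  name: string;
  duration?: number; // undefined means no matching completion event was seen
}

const TERMINAL_STATUSES: ReadonlySet<RunStatus> = new Set<RunStatus>([
  "completed",
  "failed",
  "cancelled",
]);

/** Once the run is terminal, in-flight nodes are shown as skipped, not spinning. */
function normalizeNodes(
  runStatus: RunStatus,
  nodes: readonly DagNodeState[],
): DagNodeState[] {
  if (!TERMINAL_STATUSES.has(runStatus)) return [...nodes];
  return nodes.map((node): DagNodeState =>
    node.status === "running" || node.status === "pending"
      ? { ...node, status: "skipped" }
      : node,
  );
}

/** Once the run is terminal, unmatched tool calls get duration 0 so they render as finished. */
function finalizeToolCalls(
  runStatus: RunStatus,
  calls: readonly ToolCall[],
): ToolCall[] {
  if (!TERMINAL_STATUSES.has(runStatus)) return [...calls];
  return calls.map((call): ToolCall =>
    call.duration === undefined ? { ...call, duration: 0 } : call,
  );
}
```

Both helpers return new arrays and leave non-terminal runs untouched, so live runs keep their spinners.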

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant WorkflowUI as WorkflowExecution Component
    participant EventMonitor as Event Tracker
    participant API as API Service
    participant QueryCache as Query Cache

    rect rgba(100, 150, 200, 0.5)
    Note over EventMonitor,WorkflowUI: Stall Detection Flow
    EventMonitor->>WorkflowUI: Process workflow events
    WorkflowUI->>EventMonitor: Calculate latest event timestamp
    EventMonitor-->>WorkflowUI: Check time elapsed vs STALE_THRESHOLD_MS
    WorkflowUI->>WorkflowUI: Stall detected?
    alt Stall detected
        WorkflowUI->>User: Display warning banner with AlertTriangle
    end
    end

    rect rgba(200, 100, 100, 0.5)
    Note over User,QueryCache: Workflow Abandonment Flow
    User->>WorkflowUI: Click abandon button
    WorkflowUI->>WorkflowUI: Set abandoning state
    WorkflowUI->>API: POST abandonWorkflowRun(runId)
    API-->>WorkflowUI: Success/Error response
    alt Success
        WorkflowUI->>QueryCache: Invalidate ['workflowRun', runId]
    else Error
        WorkflowUI->>WorkflowUI: Log error, enable button again
    end
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through workflow trees, 🐰
Detecting stalls with watchful ease—
When runs grow old, no events to show,
A warning appears: "Best let this go!"
Abandon buttons now make things right,
The architect analyzes—output, not file!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes two main changes: web UI stalled workflow detection with abandon action, and the archon-architect workflow prompt fix. |
| Description check | ✅ Passed | Description includes Summary, Context, What changes (with per-file details), Test plan with checkmarks, and Not in scope sections; however, lacks explicit UX Journey and Architecture Diagram sections from template. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
packages/web/src/components/workflows/WorkflowExecution.tsx (1)

531-545: Silent abandon failures leave the user confused.

If abandonWorkflowRun rejects (e.g., 404/500), the banner stays up and the user just sees "Abandoning…" flip back to "Abandon run" with no feedback. Consider surfacing a toast or inline error — otherwise repeated clicks look like the button is broken.
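
One way to act on this suggestion is sketched below. It is not the code in the PR: the `AbandonDeps` interface is a hypothetical way to keep the example self-contained, `showError` stands in for whatever toast or inline-error helper the app provides, and `invalidateRun` stands in for the `queryClient.invalidateQueries` call. The signature of `abandonWorkflowRun` is also assumed.

```typescript
// Hypothetical dependency bundle so the handler can run outside React.
interface AbandonDeps {
  abandonWorkflowRun: (runId: string) => Promise<void>; // assumed signature
  invalidateRun: (runId: string) => Promise<void>; // stand-in for cache invalidation
  setAbandoning: (value: boolean) => void;
  showError: (message: string) => void; // hypothetical toast / inline error helper
}

/**
 * Abandons a run and reports failures to the user instead of only logging them.
 * Resolves to true on success and false when the request was rejected.
 */
async function handleAbandon(runId: string, deps: AbandonDeps): Promise<boolean> {
  deps.setAbandoning(true);
  try {
    await deps.abandonWorkflowRun(runId);
    await deps.invalidateRun(runId);
    return true;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    deps.showError(`Failed to abandon run ${runId}: ${message}`);
    return false;
  } finally {
    // Always re-enable the button, whether the request succeeded or failed.
    deps.setAbandoning(false);
  }
}
```

Returning a boolean rather than rethrowing is a design choice for this sketch; a caller that needs the original error could rethrow after calling `showError`.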

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/web/src/components/workflows/WorkflowExecution.tsx` around lines 531
- 545, The catch in handleAbandon currently only logs errors, leaving the UI
silent; modify handleAbandon to surface failures to the user by calling the
app's toast/error UI (e.g., a showToast or setBannerError helper) when
abandonWorkflowRun rejects, include the runId and err.message (use err
instanceof Error ? err.message : String(err)), and keep the existing finally
block that calls setAbandoning(false); update the catch to also avoid swallowing
useful info (so you may rethrow or return after showing the toast if your flow
requires it) and ensure references are to handleAbandon, abandonWorkflowRun,
queryClient.invalidateQueries, setAbandoning and abandoning so the change is
made in the correct function.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 81271561-2f1d-4a49-a87d-979ab5cad3af

📥 Commits

Reviewing files that changed from the base of the PR and between b99cee4 and bf90a35.

📒 Files selected for processing (3)
  • .archon/workflows/defaults/archon-architect.yaml
  • packages/web/src/components/workflows/WorkflowExecution.tsx
  • packages/workflows/src/defaults/bundled-defaults.generated.ts

@MartienM
Author

Decided to keep these fixes in my personal fork for now rather than pursuing upstream contribution. Thanks @coderabbitai for the review — the suggestion has been applied on my local branch. Closing.

@MartienM MartienM closed this Apr 23, 2026

coderabbitai Bot commented Apr 23, 2026

@MartienM: Thanks for letting me know! Glad the review was helpful. Best of luck with the fixes on your fork — the stall detection banner and the archon-architect prompt correction are solid improvements. Feel free to come back anytime if you decide to upstream them or need a review on future changes. 🐇

(´• ω •`) ノ


🧠 Learnings used
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-20T18:47:37.611Z
Learning: Applies to packages/core/src/**/*.{ts,tsx} : Workflow commands: `/workflow list`, `/workflow reload`, `/workflow status`, `/workflow cancel`, `/workflow resume <id>` (re-runs failed workflow, skipping completed nodes), `/workflow abandon <id>`, `/workflow cleanup [days]` (CLI only)

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-20T18:47:37.611Z
Learning: Applies to packages/cli/src/**/*.{ts,tsx} : CLI workflow commands: `bun run cli workflow list`, `bun run cli workflow run <workflow> <prompt>`, `bun run cli workflow status`, `bun run cli workflow resume <run-id>`, `bun run cli workflow abandon <run-id>`, `bun run cli workflow cleanup [days]`

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-20T18:47:37.611Z
Learning: When a process cannot reliably distinguish 'actively running elsewhere' from 'orphaned by a crash' (work started by different process/input source), it must not autonomously mark work as failed/cancelled/abandoned; surface the ambiguous state to user and provide one-click action

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-20T18:47:37.611Z
Learning: Applies to packages/workflows/src/**/*.{ts,tsx} : `interactive: true` at the workflow level forces foreground execution on web (required for approval-gate workflows in the web UI)

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-20T18:47:37.611Z
Learning: Applies to packages/web/src/**/*.{ts,tsx} : `archon/web` is React frontend (Vite + Tailwind v4 + shadcn/ui + Zustand), SSE streaming. `WorkflowRunStatus`, `WorkflowDefinition`, and `DagNode` derived from `src/lib/api.generated.d.ts` (generated from OpenAPI spec); never import from `archon/workflows`

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-20T18:47:37.611Z
Learning: Applies to packages/workflows/src/workflow-discovery.ts : Resilient workflow loading: One broken YAML doesn't abort discovery; errors shown in `/workflow list`

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-20T18:47:37.611Z
Learning: Applies to packages/server/src/routes/**/*.ts : Web UI REST API: Workflow management endpoints (`GET /api/workflows`, `POST /api/workflows/validate`, `GET /api/workflows/:name`, `PUT /api/workflows/:name`, `DELETE /api/workflows/:name`), workflow run lifecycle (`POST /api/workflows/runs/{runId}/resume`, `POST /api/workflows/runs/{runId}/abandon`, `DELETE /api/workflows/runs/{runId}`)

Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-20T18:47:37.611Z
Learning: Applies to packages/workflows/src/**/*.{ts,tsx} : Workflow nodes support types: `command:` (named command file), `prompt:` (inline prompt), `bash:` (shell script, stdout captured as `$nodeId.output`), `loop:` (iterative AI prompt), `approval:` (human gate with optional `capture_response`), `script:` (TypeScript/Python/named script, stdout captured, supports `deps:` and `timeout:`, requires `runtime: bun` or `runtime: uv`). All node types support `when:` conditions, `trigger_rule`, `$nodeId.output` substitution, and per-node overrides of provider/model
