feat(web): surface stalled workflow runs with abandon action + fix(workflows): archon-architect analyze node prompt/tools conflict #1374
When the Claude SDK subprocess stops emitting events without signalling failure, runs stay in `running` forever: the UI polls indefinitely and the conversation slot stays locked.
WorkflowExecution now shows a warning banner when no workflow_event has arrived in 5 minutes, with a one-click Abandon action. Abandon keeps lifecycle-mutation in user hands (per CLAUDE.md's "No Autonomous Lifecycle Mutation" rule).
Also patches dangling state that abandonWorkflow() does not emit events for: on terminal status, still-`running` DAG nodes are remapped to `skipped`, and unmatched tool_called events get `duration: 0` so WorkflowLogs stops rendering a permanent spinner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tect analyze node
The analyze node has `denied_tools: [Write, Edit, Bash]` by design: it is the "pure diagnosis" phase of the workflow's trust ladder (measure → analyze → plan → execute). But the prompt explicitly told the agent to write the assessment to $ARTIFACTS_DIR/architecture-assessment.md. The agent would then loop on ToolSearch trying to obtain a Write tool, fail, and stall indefinitely; with no node-level timeout on analyze, the run never terminates.
The downstream `plan` node already consumes the assessment via $analyze.output (the assistant response text), not the file. The file was never read anywhere. Changed the prompt to ask for a structured response instead of a file write, matching what the workflow actually uses.
Also regenerates bundled-defaults.generated.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 Walkthrough
The PR modifies the archon-architect workflow to output analysis results directly as structured response data rather than writing to a file, increases validation timeout to 30 minutes, and adds stall detection, improved event handling, state normalization, and workflow abandonment capabilities to the WorkflowExecution component.
Sequence Diagram(s)
sequenceDiagram
participant User
participant WorkflowUI as WorkflowExecution Component
participant EventMonitor as Event Tracker
participant API as API Service
participant QueryCache as Query Cache
rect rgba(100, 150, 200, 0.5)
Note over EventMonitor,WorkflowUI: Stall Detection Flow
EventMonitor->>WorkflowUI: Process workflow events
WorkflowUI->>EventMonitor: Calculate latest event timestamp
EventMonitor-->>WorkflowUI: Check time elapsed vs STALE_THRESHOLD_MS
WorkflowUI->>WorkflowUI: Stall detected?
alt Stall detected
WorkflowUI->>User: Display warning banner with AlertTriangle
end
end
rect rgba(200, 100, 100, 0.5)
Note over User,QueryCache: Workflow Abandonment Flow
User->>WorkflowUI: Click abandon button
WorkflowUI->>WorkflowUI: Set abandoning state
WorkflowUI->>API: POST abandonWorkflowRun(runId)
API-->>WorkflowUI: Success/Error response
alt Success
WorkflowUI->>QueryCache: Invalidate ['workflowRun', runId]
else Error
WorkflowUI->>WorkflowUI: Log error, enable button again
end
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
packages/web/src/components/workflows/WorkflowExecution.tsx (1)
531-545: Silent abandon failures leave the user confused.
If `abandonWorkflowRun` rejects (e.g., 404/500), the banner stays up and the user just sees "Abandoning…" flip back to "Abandon run" with no feedback. Consider surfacing a toast or inline error; otherwise repeated clicks look like the button is broken.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/web/src/components/workflows/WorkflowExecution.tsx` around lines 531 - 545, The catch in handleAbandon currently only logs errors, leaving the UI silent; modify handleAbandon to surface failures to the user by calling the app's toast/error UI (e.g., a showToast or setBannerError helper) when abandonWorkflowRun rejects, include the runId and err.message (use err instanceof Error ? err.message : String(err)), and keep the existing finally block that calls setAbandoning(false); update the catch to also avoid swallowing useful info (so you may rethrow or return after showing the toast if your flow requires it) and ensure references are to handleAbandon, abandonWorkflowRun, queryClient.invalidateQueries, setAbandoning and abandoning so the change is made in the correct function.
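One way to read the suggestion above is sketched here. This is not the code in `WorkflowExecution.tsx`: the `AbandonDeps` interface, `describeAbandonError`, `invalidateRun`, and `showToast` are hypothetical names introduced for illustration, and the real component presumably calls `queryClient.invalidateQueries` and React state setters directly. The dependencies are injected only so the sketch stays self-contained.

```typescript
// Builds the user-facing message; tolerates rejections that are not Error objects.
function describeAbandonError(runId: string, err: unknown): string {
  const reason = err instanceof Error ? err.message : String(err);
  return `Failed to abandon run ${runId}: ${reason}`;
}

// Hypothetical dependency bundle standing in for the component's real wiring.
interface AbandonDeps {
  abandonWorkflowRun: (runId: string) => Promise<void>;
  invalidateRun: (runId: string) => Promise<void> | void; // assumed wrapper around the query cache
  setAbandoning: (value: boolean) => void;
  showToast: (message: string) => void; // assumed toast helper
}

async function handleAbandon(runId: string, deps: AbandonDeps): Promise<void> {
  deps.setAbandoning(true);
  try {
    await deps.abandonWorkflowRun(runId);
    await deps.invalidateRun(runId);
  } catch (err) {
    // Surface the failure instead of only logging it.
    deps.showToast(describeAbandonError(runId, err));
  } finally {
    // Re-enable the button whether the call succeeded or failed.
    deps.setAbandoning(false);
  }
}
```

Keeping the message formatting in a separate pure function makes the error path easy to check without rendering the component.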
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 81271561-2f1d-4a49-a87d-979ab5cad3af
📒 Files selected for processing (3)
.archon/workflows/defaults/archon-architect.yaml
packages/web/src/components/workflows/WorkflowExecution.tsx
packages/workflows/src/defaults/bundled-defaults.generated.ts
Decided to keep these fixes in my personal fork for now rather than pursuing upstream contribution. Thanks @coderabbitai for the review; the suggestion has been applied on my local branch. Closing.
Summary
- Normalize dangling `running`/`pending` node states and unmatched `tool_called` events after terminal status so the UI stops rendering spinners for completed runs.
- Remove the `Write a file` instruction from the `archon-architect` analyze node, which had `denied_tools: [Write, Edit, Bash]` and caused the Claude SDK to loop on `ToolSearch` trying to obtain `Write`, then stall silently with no events for hours.
Context
Hit this in production: ran `archon-architect` on a real codebase, and it stalled 1m23s into the `analyze` node with no `node_failed` or `workflow_failed` event. The UI polled forever, the conversation slot stayed locked, and there was no visible way to abandon without dropping to CLI.
Root cause was two unrelated issues that together produced a silent, unrecoverable hang:
- The `analyze` node prompt told the agent to `Write a structured assessment to $ARTIFACTS_DIR/architecture-assessment.md` while `denied_tools: [Write, Edit, Bash]` made that impossible. The agent retried `ToolSearch "select:Write"` five times over 12 seconds, announced it would spawn a sub-agent, then produced no further events. The file was never read anywhere downstream: `plan` consumes the assessment via `$analyze.output` (assistant response text), not via disk.
- The `analyze` node had no timeout, so the engine never marked the run as failed; and `abandonWorkflow()` only mutates the DB row. It emits no `node_failed` events, so the UI keeps showing spinners on the last-running node.
What changes
`packages/web/src/components/workflows/WorkflowExecution.tsx`
- Stall detection: `STALE_THRESHOLD_MS = 5 * 60 * 1000`, derived from `max(events.created_at)` vs `Date.now()`, re-evaluated each second via the existing `setTick` timer.
- Warning banner when a `running` run has had no events for >5min, with "Abandon run" button wired to `abandonWorkflowRun`.
- On terminal status, `running`/`pending` DAG nodes are mapped to `skipped` so stale `node_started` rows (without a terminator) don't paint a spinner.
- Unmatched `tool_called` events get `duration: 0` so `WorkflowLogs` stops treating them as running.
Deliberately kept out-of-scope: autonomous lifecycle mutation. Per the "No Autonomous Lifecycle Mutation Across Process Boundaries" rule in CLAUDE.md, the engine still doesn't auto-fail stalled runs; the user confirms via the button. CLI-started runs stay in user hands.
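The stall check and the terminal-state normalization listed above can be sketched roughly as follows. Only `STALE_THRESHOLD_MS`, the `created_at` field, the `running`/`pending`/`skipped` states, and `duration: 0` come from the description; the type names, the function names, and the decision to treat an empty event list as "not stalled" are assumptions made for this illustration, not the component's actual code.

```typescript
// Hypothetical shapes; the real types in the web package may differ.
type NodeStatus = 'pending' | 'running' | 'completed' | 'failed' | 'skipped';
type RunStatus = 'running' | 'completed' | 'failed' | 'cancelled';

interface WorkflowEvent {
  created_at: string; // ISO-8601 timestamp
}

interface ToolEvent {
  name: string;
  duration?: number; // undefined while a tool_called event has no matching completion
}

const STALE_THRESHOLD_MS = 5 * 60 * 1000;

// True when a running run's newest event is older than the threshold.
// Assumption: with no events at all there is no timestamp to compare, so report false.
function isStalled(status: RunStatus, events: WorkflowEvent[], now: number): boolean {
  if (status !== 'running' || events.length === 0) return false;
  const latest = Math.max(...events.map((e) => Date.parse(e.created_at)));
  return now - latest > STALE_THRESHOLD_MS;
}

// Once the run is terminal, nodes left in running/pending are shown as skipped.
function normalizeNodeStatus(runStatus: RunStatus, nodeStatus: NodeStatus): NodeStatus {
  const terminal = runStatus !== 'running';
  if (terminal && (nodeStatus === 'running' || nodeStatus === 'pending')) return 'skipped';
  return nodeStatus;
}

// Once the run is terminal, unmatched tool calls get duration 0 so no spinner is drawn.
function normalizeToolEvents(runStatus: RunStatus, tools: ToolEvent[]): ToolEvent[] {
  if (runStatus === 'running') return tools;
  return tools.map((t) => (t.duration === undefined ? { ...t, duration: 0 } : t));
}
```

All three functions only derive display state from data the page already has; none of them writes to the run, which matches the rule that lifecycle mutation happens solely through the user's Abandon click.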
`.archon/workflows/defaults/archon-architect.yaml`
- Replaced "Write a structured assessment to `$ARTIFACTS_DIR/...`" with "Produce a structured assessment as your final response" and a note explaining the node cannot write files. Downstream `plan` already consumes `$analyze.output`; the file was never referenced anywhere.
`packages/workflows/src/defaults/bundled-defaults.generated.ts`
- Regenerated via `bun run generate:bundled` so `check:bundled` passes.
Test plan
- … (`bun --filter @archon/web type-check`)
- … (`eslint --max-warnings 0` on changed file)
- `bun run check:bundled` passes
- … `archon-architect` run `0fae53d26e8158074af432e119f99194`, opened the run page, observed the banner after 5min, clicked Abandon, saw run transition to `cancelled` and both node + tool spinners disappear.
- … `archon-architect` after the prompt fix: completed successfully in 21m10s (all 7 nodes including `analyze` → `plan` chain through `$analyze.output`).
Not in scope
- Engine-side stall detection (… `workflow_stalled` events). Considered, decided to start with the UI-only signal because (a) it's the minimum needed to unblock users on this class of failure, (b) moving to engine-side is a larger PR that should be evaluated separately against the autonomous-mutation principle, and (c) the UI fix works regardless of how the run was started (web, CLI, adapter).
Summary by CodeRabbit
New Features
Bug Fixes
Performance