feat: Add persistent failed operations UI with retry/remove actions by leex279 · Pull Request #854 · coleam00/Archon

leex279 · 2025-11-09T19:16:27Z

Summary

Implements comprehensive error visibility for crawl operations that previously disappeared after failure.

Problem

Failed crawl operations (e.g., MemoryError) were disappearing from the UI after 5-30 seconds, leaving users without:

Visibility into what went wrong
Error diagnostics
Recovery options
Knowledge of which URLs failed

Solution

Backend Changes:

✅ Extended error state cleanup delay from 30s to 300s (5 minutes)
✅ Added include_failed query parameter to /api/progress/ endpoint
✅ Store original request parameters in error state for retry
✅ Failed operations persist with full error details and logs

Frontend Changes:

✅ Created FailedOperationsSection component with error display
✅ Added useFailedOperations hook for querying failed operations
✅ Implemented useRetryKnowledgeSource hook for restarting crawls
✅ Added useRemoveFailedOperation hook for dismissal
✅ Integrated into Knowledge Base view

Features

✅ Failed operations remain visible with red error indication
✅ Retry button restarts crawl with original parameters
✅ Remove button with confirmation modal
✅ Expandable error details with full message and logs
✅ Auto-cleanup after 5 minutes (user-controlled dismissal)
✅ Separate from active operations for clarity

Test Plan

Backend: Ruff checks pass
Frontend: Biome checks pass
Frontend: TypeScript compilation passes
Manual: Test failed crawl appears in Failed Operations section
Manual: Verify retry button restarts crawl
Manual: Verify remove button dismisses operation
Manual: Check 5-minute auto-cleanup
Manual: Verify expandable error details work

Screenshots

To be added during manual testing

🤖 Generated with Claude Code

Implements comprehensive error visibility for crawl operations that previously disappeared after failure. ## Backend Changes - Extended error state cleanup delay from 30s to 300s (5 minutes) for user visibility - Added `include_failed` query parameter to `/api/progress/` endpoint - Store original request parameters in error state for retry functionality - Failed operations persist with full error details and logs ## Frontend Changes - Created FailedOperationsSection component with error display and actions - Added useFailedOperations hook for querying failed operations - Implemented useRetryKnowledgeSource hook for restarting failed crawls - Added useRemoveFailedOperation hook for explicit user dismissal - Integrated failed operations display into Knowledge Base view ## Features - Failed operations remain visible with red error indication - Retry button restarts crawl with original parameters - Remove button allows explicit dismissal with confirmation modal - Expandable error details show full error message and crawl logs - Auto-cleanup after 5 minutes (no auto-dismissal before user sees error) - Separate from active operations list for clear UX Fixes issue where MemoryError and other crawl failures would silently disappear from the UI, leaving users without visibility or recovery options. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2025-11-09T19:16:36Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/fix-error-operation-visibility

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

The listFailedOperations service was returning all operations when include_failed=true. Now it properly filters to only return operations with status 'error' or 'failed', preventing active operations from appearing in the Failed Operations section.

Wirasm · 2026-04-06T07:24:12Z

🔄 This repository is being replaced by a new version of Archon.

The original Python/MCP codebase is being archived to the archive/v1-python-mcp branch. The new Archon (TypeScript workflow engine for AI coding agents) is replacing it.

This PR is being closed as part of the migration. Thank you for your contribution!

* fix(workflows): idle timeout too aggressive — break after result, reset on all messages (#854) The idle timeout (5 min default) caused two problems: (1) after a node's AI finished (result message), the loop waited for the subprocess to exit, wasting 5 min on hangs; (2) tool messages didn't reset the timer, so long Bash calls (tests, builds) triggered false timeouts on actively working nodes. Changes: - Break out of the for-await loop immediately after receiving the result message in both command/prompt and loop node paths — no more post-completion waste - Remove shouldResetTimer predicate so all message types (including tool) reset the timer — timeout only fires on complete silence - Increase STEP_IDLE_TIMEOUT_MS from 5 min to 30 min — with every message resetting the timer, this is a deadlock detector, not a work limiter Fixes #854 * fix(workflows): update withIdleTimeout JSDoc to match new timer behavior Remove the tool-exclusion example from the shouldResetTimer docs since that pattern was just removed from all call sites. Clarify that most callers should omit the parameter. * fix(workflows): address review findings — log cleanup errors, add break tests, fix stale docs - Log generator cleanup errors in withIdleTimeout instead of silently swallowing - Add behavioral tests for break-after-result in both command/prompt and loop nodes - Fix stale "5 minutes" default in docs/loop-nodes.md (now 30 minutes) - Clarify shouldResetTimer test names and comments (utility API, not executor behavior) - Extract effectiveIdleTimeout in loop node path (matches command/prompt pattern) - Remove redundant iterResult alias in withIdleTimeout

* fix: auto-select first DAG node on workflow run load (#781) The workflow output panel showed "No output available" on initial load because no DAG node was selected by default. Users had to manually click a node in the Logs tab to see output. Changes: - Add useEffect that auto-selects the first DAG node when workflow data loads and no node is selected - Prefer the currently running node for live workflows Fixes #781 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: formatting and test fixes from validation - Fix formatting in workflows.test.ts and api.ts (prettier) - Replace Hono() with OpenAPIHono({ defaultHook: validationErrorHook }) in api.conversations.test.ts to match route registration requirements Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: extract selectInitialNode to testable pure function Extract DAG node auto-selection logic from useEffect into a pure function with 5 unit tests covering all branches (undefined, empty, no running, prefer running, multiple running). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: workflow lifecycle overhaul — path-based guards, interrupted status, resume/abandon (#871) * feat: add interrupted to WorkflowRunStatus schema Implements US-001 from PRD. Changes: - Add 'interrupted' to workflowRunStatusSchema z.enum in packages/workflows/src/schemas/workflow-run.ts - Add 'interrupted' to workflowRunStatusSchema in packages/server/src/routes/schemas/workflow.schemas.ts - Add interrupted: z.number() to dashboardRunsResponseSchema counts object - Add 'interrupted' to dashboardValidStatuses in API handler - Add interrupted: 0 to DashboardRunsResult counts interface and runtime object in packages/core/src/db/workflows.ts * feat: update IWorkflowStore interface & DB query implementations Implements US-002 from PRD. Changes: - IWorkflowStore: rename getActiveWorkflowRun → getActiveWorkflowRunByPath(workingPath) - IWorkflowStore: drop conversationId from findResumableRun signature - IWorkflowStore: add interruptOrphanedRuns() method - db/workflows: add getActiveWorkflowRunByPath querying status IN ('running', 'interrupted') - db/workflows: update findResumableRun to query by workflow_name + working_path only, include 'interrupted' status - db/workflows: add interruptOrphanedRuns() UPDATE SET status='interrupted' WHERE status='running' - store-adapter: wire all three new/modified methods - executor: update call sites to use renamed methods (type-check requirement) - tests: update all mock stores and add new tests for getActiveWorkflowRunByPath and interruptOrphanedRuns * feat: replace staleness guard with path-based lifecycle Implements US-003 from PRD. Changes: - executor.ts: remove STALE_MINUTES staleness auto-kill; replace with status-based guard — 'running' blocks, 'interrupted' offers resume/abandon - server/src/index.ts: replace failStaleWorkflowRuns() with createWorkflowStore().interruptOrphanedRuns() on startup - executor-preamble.test.ts: replace staleness detection tests with concurrent run guard tests covering 'running' and 'interrupted' cases * feat: command handler — /workflow status, resume, and abandon Implements US-004. Replaces time-based stale heuristics with explicit lifecycle commands for workflow management. Changes: - Remove WORKFLOW_SLOW_THRESHOLD_MS and WORKFLOW_STALE_THRESHOLD_MS constants - Replace /workflow status with global view: lists all running+interrupted runs across all worktrees (ID, name, working path, status, started-at) - Add /workflow resume <id>: validates state then calls resumeWorkflowRun - Add /workflow abandon <id>: validates state then calls failWorkflowRun - Add statuses[] filter to listWorkflowRuns for IN (...) queries - Update /workflow help text and default case usage string - Update /status command to remove stale warning that referenced removed constants - Replace old /workflow status tests with new behavior coverage - Add /workflow resume and /workflow abandon test coverage * feat: CLI workflow status, resume, and abandon subcommands Implements US-005 from PRD. Changes: - Implement workflowStatusCommand: lists all running+interrupted runs with ID, name, path, status, age; supports --json flag - Add workflowResumeCommand: validates run state then calls resumeWorkflowRun - Add workflowAbandonCommand: validates run state then calls failWorkflowRun('Abandoned by user') - Replace findLastFailedRun usage in --resume path with findResumableRun(workflowName, cwd) - Wire resume/abandon subcommands in cli.ts - Update tests: replace "not implemented" test with status/resume/abandon coverage * feat: Web UI interrupted status badge and dashboard support Implements US-006 from PRD. Changes: - api.generated.d.ts: add 'interrupted' to WorkflowRunStatus enum and DashboardRunsResponse.counts - api.ts: add interrupted field to DashboardCounts interface - WorkflowExecution.tsx: add 'interrupted' to TERMINAL_STATUSES; add amber color to StatusBadge - WorkflowRunCard.tsx: add amber dot and badge for interrupted status - StatusSummaryBar.tsx: add 'interrupted' to STATUS_CHIPS filter list - DashboardPage.tsx: include interrupted in activeRuns filter and counts default * refactor: remove dead timer-based workflow staleness code Implements US-007 from PRD. Changes: - Remove findLastFailedRun() from db/workflows.ts (CLI path unified on findResumableRun in US-005) - Remove failStaleWorkflowRuns() from db/workflows.ts (replaced by interruptOrphanedRuns in US-002) - Remove IDatabase import from db/workflows.ts (no longer needed) - Remove failStaleWorkflowRuns tests from db/workflows.test.ts grep -r 'STALE' packages/ (workflow-timer variant), grep -r 'findLastFailedRun' and grep -r 'failStaleWorkflowRuns' all return zero matches. * fix: address review feedback — truncated IDs, resume semantics, type safety - Use full UUIDs in resume/abandon command suggestions (was .slice(0, 8)) - Add completed_at to interruptOrphanedRuns for correct duration metrics - Fix resume commands: mark interrupted→failed to unblock path guard, let next workflow invocation auto-resume via findResumableRun - Collapse dual status/statuses fields into status?: T | T[] - Extract TERMINAL/RESUMABLE/ACTIVE_WORKFLOW_STATUSES constants - Add explicit else-if for interrupted status in executor path guard - Add structured logging to CLI workflow commands - Restore conversationId to cmd.workflow_status_failed log - Add tests: listWorkflowRuns statuses filter, interrupted auto-resume, DB error handling for resume/abandon - Update docs: commands-reference, cli-user-guide, authoring-workflows, CLAUDE.md * refactor: simplify CLI commands and status filter logic - Extract getRunOrThrow helper for shared run lookup pattern - Use WorkflowRun[] instead of Awaited<ReturnType<...>> - Remove single-item special case in listWorkflowRuns (IN works for all) - Use ?? instead of || for null-coalescing consistency - Remove unused ACTIVE_WORKFLOW_STATUSES constant - Add inline comment on completed_at for interrupted runs * fix: handle SQLite string dates in formatAge SQLite returns started_at as a string, not a Date object. formatAge now accepts Date | string and converts accordingly. Found during E2E testing against real SQLite database. * feat: UX improvements — real resume, dashboard actions, cleanup command - CLI resume now actually re-executes the workflow (calls workflowRunCommand with --resume internally instead of just flipping DB status) - Remove truncated IDs from executor guard messages (full ID in commands only) - Add Resume/Abandon/Delete buttons to dashboard workflow run cards - Add Delete button to history table rows - Add API endpoints: POST resume, POST abandon, DELETE workflow run - Add CLI workflow cleanup command (deletes terminal runs older than N days) - Add deleteWorkflowRun and deleteOldWorkflowRuns DB functions * refactor: simplify API handlers, dashboard actions, and log conventions - Use RESUMABLE/TERMINAL_WORKFLOW_STATUSES constants in API handlers (was inline string checks diverging from CLI/command-handler) - Extract makeRunAction helper in DashboardPage (4 identical handlers → 1) - Fix log event names to use domain prefix convention (api.workflow_run_*) - Use Ban icon for Abandon to distinguish from Cancel's XCircle - Use instanceof Date guard in formatAge for clarity - Add comment on delete handler's active-status guard * refactor: simplify workflow lifecycle — remove interrupted, single resume path Rework the primitives for a clean foundation: Status model: 5 statuses (pending, running, completed, failed, cancelled). Remove 'interrupted' entirely — server restart now marks orphaned runs as 'failed' directly (with metadata.failure_reason = 'server_restart'). Resume model: one path. The executor's implicit findResumableRun detects prior failed runs and skips completed nodes. The CLI --resume flag reuses the prior run's worktree but lets the executor handle node-skipping (no more preCreatedRun bypass). Chat /workflow resume tells the user to re-invoke (auto-resume kicks in). Path guard: only blocks 'running' status (was running + interrupted). Guards always run regardless of preCreatedRun. Cancellation: generalized from status === 'cancelled' to status !== 'running' at all 3 check points (streaming, loop iterations, DAG layers). Ready for future 'paused' status with zero changes. - interruptOrphanedRuns → failOrphanedRuns - Remove preCreatedRun bypass from CLI --resume path - Generalize 3 cancellation check points in dag-executor - Update all API endpoints, command handlers, UI components - Update all tests and documentation * fix: cancel/complete race, abandon semantics, UTC dates - Fix cancel/complete race condition: dag-executor now checks DB status before calling completeWorkflowRun or failWorkflowRun, preventing a cancel during the final layer from being overwritten to completed - Abandon uses cancelWorkflowRun instead of failWorkflowRun, so abandoned runs don't get auto-resumed by findResumableRun - Fix formatAge UTC bug: SQLite dates without Z suffix now parsed as UTC * fix: address PR review — SQL safety, transactions, error handling, docs, tests - Validate olderThanDays before SQL interpolation in deleteOldWorkflowRuns - Wrap multi-statement deletes in transactions (deleteOldWorkflowRuns, deleteWorkflowRun) - Fix deleteWorkflowRun error double-wrap (don't re-wrap "not found" errors) - Handle null getWorkflowRunStatus in DAG executor (treat as deleted, abort) - Fix mock name mismatch: interruptOrphanedRuns → failOrphanedRuns in 3 test files - Fix default mock getWorkflowRunStatus to return 'running' instead of null - Add NaN guard to formatAge (returns 'unknown' on unparseable dates) - Fix stale 'interrupted' references in route summary and delete comment - Include working path in /workflow resume response - Align deleteOldWorkflowRuns return type to { count } for consistency - Document workflow cleanup command in CLAUDE.md, CLI user guide, commands reference - Document new API endpoints (resume, abandon, delete) in CLAUDE.md - Add tests for deleteOldWorkflowRuns, deleteWorkflowRun, workflowCleanupCommand - Fix workflowAbandonCommand test to assert cancelWorkflowRun call * refactor: simplify code per review — extract helper, cleaner date parsing, consistent guards - Extract duplicated status-check blocks into skipIfStatusChanged helper in dag-executor - Simplify formatAge to single-pass date parsing with Z suffix (ISO 8601) - Use TERMINAL_WORKFLOW_STATUSES constant in delete route guard - Rename cancelError → actionError in DashboardPage (covers 4 actions now) - Fix merge conflict: add IDatabase import, getRunningWorkflows from dev - Fix api.conversations.test.ts: add missing workflow mocks, fix Hono → OpenAPIHono * fix: address review findings — double-rollback, missing guards, log context, tests - Fix double-rollback in deleteWorkflowRun by removing inner rollback() call and letting the outer catch handle it - Add terminal-status guard inside deleteWorkflowRun itself, not just in the route handler, to prevent deletion of running workflows - Add rollback failure logging to the rollback() helper - Add runId to error logs in resume/abandon/delete API route handlers - Add workingPath to getActiveWorkflowRunByPath error log - Add workflowRunId to dag-executor status-check warn logs - Wrap workflowRunCommand in try/catch in workflowResumeCommand with structured logging and null guard for user_message - Clean up stale 'interrupted' references in JSDoc - Fix missing / prefix on workflow cleanup in commands-reference.md - Add API route tests for POST /resume, POST /abandon, DELETE /:runId * refactor: apply code simplifications from review - Replace fragile startsWith string matching in deleteWorkflowRun catch with a typed WorkflowRunGuardError class - Reorder listWorkflowRuns placeholder generation: capture startIdx before pushing values for clarity - Replace curried makeRunAction factory in DashboardPage with a plain runAction helper function - Move skipIfStatusChanged helper definition before its call sites in dag-executor to match reading order * fix(workflows): idle timeout too aggressive on DAG nodes (#854) (#886) * fix(workflows): idle timeout too aggressive — break after result, reset on all messages (#854) The idle timeout (5 min default) caused two problems: (1) after a node's AI finished (result message), the loop waited for the subprocess to exit, wasting 5 min on hangs; (2) tool messages didn't reset the timer, so long Bash calls (tests, builds) triggered false timeouts on actively working nodes. Changes: - Break out of the for-await loop immediately after receiving the result message in both command/prompt and loop node paths — no more post-completion waste - Remove shouldResetTimer predicate so all message types (including tool) reset the timer — timeout only fires on complete silence - Increase STEP_IDLE_TIMEOUT_MS from 5 min to 30 min — with every message resetting the timer, this is a deadlock detector, not a work limiter Fixes #854 * fix(workflows): update withIdleTimeout JSDoc to match new timer behavior Remove the tool-exclusion example from the shouldResetTimer docs since that pattern was just removed from all call sites. Clarify that most callers should omit the parameter. * fix(workflows): address review findings — log cleanup errors, add break tests, fix stale docs - Log generator cleanup errors in withIdleTimeout instead of silently swallowing - Add behavioral tests for break-after-result in both command/prompt and loop nodes - Fix stale "5 minutes" default in docs/loop-nodes.md (now 30 minutes) - Clarify shouldResetTimer test names and comments (utility API, not executor behavior) - Extract effectiveIdleTimeout in loop node path (matches command/prompt pattern) - Remove redundant iterResult alias in withIdleTimeout * feat: add approval gate node type for human-in-the-loop workflows (#888) * feat: add approval gate node type for human-in-the-loop workflows Add `approval` as a fifth DAG node type that pauses workflow execution until a human approves or rejects. Built on existing resume infrastructure. - New `approval` node type in YAML workflows with `message` field - `paused` workflow status (non-terminal, resumable) - Approve/reject via REST API, CLI, chat commands, and Web UI - Approval comment available as `$node.output` in downstream nodes - Dashboard shows amber pulsing badge with approval message for paused runs - Path guard blocks new workflows on paused worktrees * fix: resolve 7 bugs in approval gate implementation Critical: - Parse metadata JSON on SQLite (returned as TEXT string, not object) - Suppress spurious "Workflow stopped (paused)" message after approval gates Medium: - CLI exits 0 on pause (not error code 1) - workflow status shows paused runs alongside running - Docs clarify auto-resume is CLI-only; API/chat mark resumable Low: - Skip completed_at when transitioning to failed for approval resume - Warn on AI fields (model, hooks, etc.) set on approval nodes - Add rowCount check to updateWorkflowRun Also: extract ApprovalContext type, add .min(1) to approval.message, fix stale "Only failed runs" error messages, fix log domain prefix, add recovery instructions to CLI approve error message. * feat(workflows): add interactive PRD workflow with approval gates (#891) Guided PRD creation through 3 approval gates: 1. Foundation questions (who/what/why/why now) 2. Deep dive questions (vision/users/JTBD/constraints) 3. Scope questions (MVP/hypothesis/exclusions) AI researches market + codebase between gates. Each gate carries the user's answers via $gate.output to downstream nodes. Final node generates the PRD from accumulated context. * fix(cli): pass codebase_id to workflowRunCommand on resume/approve (#890) * fix(cli): pass codebase_id to workflowRunCommand on resume/approve (#889) When workflow resume or approve is called, the worktree path is passed as cwd, but the codebase was registered under the main checkout path. This caused findCodebaseByDefaultCwd to return null, triggering auto-registration which tried to create a source symlink that already existed, crashing the command. Changes: - Add codebaseId option to WorkflowRunOptions - Look up codebase by stored ID before falling back to auto-registration - Pass run.codebase_id from workflowResumeCommand and workflowApproveCommand - Add test verifying getCodebase is called with stored codebase_id on resume Fixes #889 * fix(cli): address review findings for PR #890 - Add domain prefix to 5 log events in codebase-lookup block (cli.codebase_lookup_failed, cli.db_connection_hint, cli.codebase_id_lookup_failed, cli.codebase_auto_registered, cli.codebase_auto_registration_failed) - Add inline comment clarifying intentional fallthrough in catch block - Add test for workflowApproveCommand codebase_id forwarding - Add test for error fallback path when getCodebase throws - Update pre-existing test assertion to match renamed log event * fix(workflows): improve interactive PRD — first-principles prompting + validation node (#893) Two improvements to the interactive PRD workflow: 1. First-principles prompting: All research/technical/generate nodes now instruct the AI to start from what exists, verify claims by reading actual files, prefer extending over creating, and cite exact file:line. 2. Validation node: New final node reads the generated PRD and checks every technical claim against the codebase — file paths, endpoint names, DB schemas, event types, component names. Corrects inline and documents what was fixed. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Rasmus Widing <152263317+Wirasm@users.noreply.github.com>

… (coleam00#886) * fix(workflows): idle timeout too aggressive — break after result, reset on all messages (coleam00#854) The idle timeout (5 min default) caused two problems: (1) after a node's AI finished (result message), the loop waited for the subprocess to exit, wasting 5 min on hangs; (2) tool messages didn't reset the timer, so long Bash calls (tests, builds) triggered false timeouts on actively working nodes. Changes: - Break out of the for-await loop immediately after receiving the result message in both command/prompt and loop node paths — no more post-completion waste - Remove shouldResetTimer predicate so all message types (including tool) reset the timer — timeout only fires on complete silence - Increase STEP_IDLE_TIMEOUT_MS from 5 min to 30 min — with every message resetting the timer, this is a deadlock detector, not a work limiter Fixes coleam00#854 * fix(workflows): update withIdleTimeout JSDoc to match new timer behavior Remove the tool-exclusion example from the shouldResetTimer docs since that pattern was just removed from all call sites. Clarify that most callers should omit the parameter. * fix(workflows): address review findings — log cleanup errors, add break tests, fix stale docs - Log generator cleanup errors in withIdleTimeout instead of silently swallowing - Add behavioral tests for break-after-result in both command/prompt and loop nodes - Fix stale "5 minutes" default in docs/loop-nodes.md (now 30 minutes) - Clarify shouldResetTimer test names and comments (utility API, not executor behavior) - Extract effectiveIdleTimeout in loop node path (matches command/prompt pattern) - Remove redundant iterResult alias in withIdleTimeout

* fix: auto-select first DAG node on workflow run load (coleam00#781) The workflow output panel showed "No output available" on initial load because no DAG node was selected by default. Users had to manually click a node in the Logs tab to see output. Changes: - Add useEffect that auto-selects the first DAG node when workflow data loads and no node is selected - Prefer the currently running node for live workflows Fixes coleam00#781 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: formatting and test fixes from validation - Fix formatting in workflows.test.ts and api.ts (prettier) - Replace Hono() with OpenAPIHono({ defaultHook: validationErrorHook }) in api.conversations.test.ts to match route registration requirements Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: extract selectInitialNode to testable pure function Extract DAG node auto-selection logic from useEffect into a pure function with 5 unit tests covering all branches (undefined, empty, no running, prefer running, multiple running). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: workflow lifecycle overhaul — path-based guards, interrupted status, resume/abandon (coleam00#871) * feat: add interrupted to WorkflowRunStatus schema Implements US-001 from PRD. Changes: - Add 'interrupted' to workflowRunStatusSchema z.enum in packages/workflows/src/schemas/workflow-run.ts - Add 'interrupted' to workflowRunStatusSchema in packages/server/src/routes/schemas/workflow.schemas.ts - Add interrupted: z.number() to dashboardRunsResponseSchema counts object - Add 'interrupted' to dashboardValidStatuses in API handler - Add interrupted: 0 to DashboardRunsResult counts interface and runtime object in packages/core/src/db/workflows.ts * feat: update IWorkflowStore interface & DB query implementations Implements US-002 from PRD. Changes: - IWorkflowStore: rename getActiveWorkflowRun → getActiveWorkflowRunByPath(workingPath) - IWorkflowStore: drop conversationId from findResumableRun signature - IWorkflowStore: add interruptOrphanedRuns() method - db/workflows: add getActiveWorkflowRunByPath querying status IN ('running', 'interrupted') - db/workflows: update findResumableRun to query by workflow_name + working_path only, include 'interrupted' status - db/workflows: add interruptOrphanedRuns() UPDATE SET status='interrupted' WHERE status='running' - store-adapter: wire all three new/modified methods - executor: update call sites to use renamed methods (type-check requirement) - tests: update all mock stores and add new tests for getActiveWorkflowRunByPath and interruptOrphanedRuns * feat: replace staleness guard with path-based lifecycle Implements US-003 from PRD. Changes: - executor.ts: remove STALE_MINUTES staleness auto-kill; replace with status-based guard — 'running' blocks, 'interrupted' offers resume/abandon - server/src/index.ts: replace failStaleWorkflowRuns() with createWorkflowStore().interruptOrphanedRuns() on startup - executor-preamble.test.ts: replace staleness detection tests with concurrent run guard tests covering 'running' and 'interrupted' cases * feat: command handler — /workflow status, resume, and abandon Implements US-004. Replaces time-based stale heuristics with explicit lifecycle commands for workflow management. Changes: - Remove WORKFLOW_SLOW_THRESHOLD_MS and WORKFLOW_STALE_THRESHOLD_MS constants - Replace /workflow status with global view: lists all running+interrupted runs across all worktrees (ID, name, working path, status, started-at) - Add /workflow resume <id>: validates state then calls resumeWorkflowRun - Add /workflow abandon <id>: validates state then calls failWorkflowRun - Add statuses[] filter to listWorkflowRuns for IN (...) queries - Update /workflow help text and default case usage string - Update /status command to remove stale warning that referenced removed constants - Replace old /workflow status tests with new behavior coverage - Add /workflow resume and /workflow abandon test coverage * feat: CLI workflow status, resume, and abandon subcommands Implements US-005 from PRD. Changes: - Implement workflowStatusCommand: lists all running+interrupted runs with ID, name, path, status, age; supports --json flag - Add workflowResumeCommand: validates run state then calls resumeWorkflowRun - Add workflowAbandonCommand: validates run state then calls failWorkflowRun('Abandoned by user') - Replace findLastFailedRun usage in --resume path with findResumableRun(workflowName, cwd) - Wire resume/abandon subcommands in cli.ts - Update tests: replace "not implemented" test with status/resume/abandon coverage * feat: Web UI interrupted status badge and dashboard support Implements US-006 from PRD. Changes: - api.generated.d.ts: add 'interrupted' to WorkflowRunStatus enum and DashboardRunsResponse.counts - api.ts: add interrupted field to DashboardCounts interface - WorkflowExecution.tsx: add 'interrupted' to TERMINAL_STATUSES; add amber color to StatusBadge - WorkflowRunCard.tsx: add amber dot and badge for interrupted status - StatusSummaryBar.tsx: add 'interrupted' to STATUS_CHIPS filter list - DashboardPage.tsx: include interrupted in activeRuns filter and counts default * refactor: remove dead timer-based workflow staleness code Implements US-007 from PRD. Changes: - Remove findLastFailedRun() from db/workflows.ts (CLI path unified on findResumableRun in US-005) - Remove failStaleWorkflowRuns() from db/workflows.ts (replaced by interruptOrphanedRuns in US-002) - Remove IDatabase import from db/workflows.ts (no longer needed) - Remove failStaleWorkflowRuns tests from db/workflows.test.ts grep -r 'STALE' packages/ (workflow-timer variant), grep -r 'findLastFailedRun' and grep -r 'failStaleWorkflowRuns' all return zero matches. * fix: address review feedback — truncated IDs, resume semantics, type safety - Use full UUIDs in resume/abandon command suggestions (was .slice(0, 8)) - Add completed_at to interruptOrphanedRuns for correct duration metrics - Fix resume commands: mark interrupted→failed to unblock path guard, let next workflow invocation auto-resume via findResumableRun - Collapse dual status/statuses fields into status?: T | T[] - Extract TERMINAL/RESUMABLE/ACTIVE_WORKFLOW_STATUSES constants - Add explicit else-if for interrupted status in executor path guard - Add structured logging to CLI workflow commands - Restore conversationId to cmd.workflow_status_failed log - Add tests: listWorkflowRuns statuses filter, interrupted auto-resume, DB error handling for resume/abandon - Update docs: commands-reference, cli-user-guide, authoring-workflows, CLAUDE.md * refactor: simplify CLI commands and status filter logic - Extract getRunOrThrow helper for shared run lookup pattern - Use WorkflowRun[] instead of Awaited<ReturnType<...>> - Remove single-item special case in listWorkflowRuns (IN works for all) - Use ?? instead of || for null-coalescing consistency - Remove unused ACTIVE_WORKFLOW_STATUSES constant - Add inline comment on completed_at for interrupted runs * fix: handle SQLite string dates in formatAge SQLite returns started_at as a string, not a Date object. formatAge now accepts Date | string and converts accordingly. Found during E2E testing against real SQLite database. * feat: UX improvements — real resume, dashboard actions, cleanup command - CLI resume now actually re-executes the workflow (calls workflowRunCommand with --resume internally instead of just flipping DB status) - Remove truncated IDs from executor guard messages (full ID in commands only) - Add Resume/Abandon/Delete buttons to dashboard workflow run cards - Add Delete button to history table rows - Add API endpoints: POST resume, POST abandon, DELETE workflow run - Add CLI workflow cleanup command (deletes terminal runs older than N days) - Add deleteWorkflowRun and deleteOldWorkflowRuns DB functions * refactor: simplify API handlers, dashboard actions, and log conventions - Use RESUMABLE/TERMINAL_WORKFLOW_STATUSES constants in API handlers (was inline string checks diverging from CLI/command-handler) - Extract makeRunAction helper in DashboardPage (4 identical handlers → 1) - Fix log event names to use domain prefix convention (api.workflow_run_*) - Use Ban icon for Abandon to distinguish from Cancel's XCircle - Use instanceof Date guard in formatAge for clarity - Add comment on delete handler's active-status guard * refactor: simplify workflow lifecycle — remove interrupted, single resume path Rework the primitives for a clean foundation: Status model: 5 statuses (pending, running, completed, failed, cancelled). Remove 'interrupted' entirely — server restart now marks orphaned runs as 'failed' directly (with metadata.failure_reason = 'server_restart'). Resume model: one path. The executor's implicit findResumableRun detects prior failed runs and skips completed nodes. The CLI --resume flag reuses the prior run's worktree but lets the executor handle node-skipping (no more preCreatedRun bypass). Chat /workflow resume tells the user to re-invoke (auto-resume kicks in). Path guard: only blocks 'running' status (was running + interrupted). Guards always run regardless of preCreatedRun. Cancellation: generalized from status === 'cancelled' to status !== 'running' at all 3 check points (streaming, loop iterations, DAG layers). Ready for future 'paused' status with zero changes. - interruptOrphanedRuns → failOrphanedRuns - Remove preCreatedRun bypass from CLI --resume path - Generalize 3 cancellation check points in dag-executor - Update all API endpoints, command handlers, UI components - Update all tests and documentation * fix: cancel/complete race, abandon semantics, UTC dates - Fix cancel/complete race condition: dag-executor now checks DB status before calling completeWorkflowRun or failWorkflowRun, preventing a cancel during the final layer from being overwritten to completed - Abandon uses cancelWorkflowRun instead of failWorkflowRun, so abandoned runs don't get auto-resumed by findResumableRun - Fix formatAge UTC bug: SQLite dates without Z suffix now parsed as UTC * fix: address PR review — SQL safety, transactions, error handling, docs, tests - Validate olderThanDays before SQL interpolation in deleteOldWorkflowRuns - Wrap multi-statement deletes in transactions (deleteOldWorkflowRuns, deleteWorkflowRun) - Fix deleteWorkflowRun error double-wrap (don't re-wrap "not found" errors) - Handle null getWorkflowRunStatus in DAG executor (treat as deleted, abort) - Fix mock name mismatch: interruptOrphanedRuns → failOrphanedRuns in 3 test files - Fix default mock getWorkflowRunStatus to return 'running' instead of null - Add NaN guard to formatAge (returns 'unknown' on unparseable dates) - Fix stale 'interrupted' references in route summary and delete comment - Include working path in /workflow resume response - Align deleteOldWorkflowRuns return type to { count } for consistency - Document workflow cleanup command in CLAUDE.md, CLI user guide, commands reference - Document new API endpoints (resume, abandon, delete) in CLAUDE.md - Add tests for deleteOldWorkflowRuns, deleteWorkflowRun, workflowCleanupCommand - Fix workflowAbandonCommand test to assert cancelWorkflowRun call * refactor: simplify code per review — extract helper, cleaner date parsing, consistent guards - Extract duplicated status-check blocks into skipIfStatusChanged helper in dag-executor - Simplify formatAge to single-pass date parsing with Z suffix (ISO 8601) - Use TERMINAL_WORKFLOW_STATUSES constant in delete route guard - Rename cancelError → actionError in DashboardPage (covers 4 actions now) - Fix merge conflict: add IDatabase import, getRunningWorkflows from dev - Fix api.conversations.test.ts: add missing workflow mocks, fix Hono → OpenAPIHono * fix: address review findings — double-rollback, missing guards, log context, tests - Fix double-rollback in deleteWorkflowRun by removing inner rollback() call and letting the outer catch handle it - Add terminal-status guard inside deleteWorkflowRun itself, not just in the route handler, to prevent deletion of running workflows - Add rollback failure logging to the rollback() helper - Add runId to error logs in resume/abandon/delete API route handlers - Add workingPath to getActiveWorkflowRunByPath error log - Add workflowRunId to dag-executor status-check warn logs - Wrap workflowRunCommand in try/catch in workflowResumeCommand with structured logging and null guard for user_message - Clean up stale 'interrupted' references in JSDoc - Fix missing / prefix on workflow cleanup in commands-reference.md - Add API route tests for POST /resume, POST /abandon, DELETE /:runId * refactor: apply code simplifications from review - Replace fragile startsWith string matching in deleteWorkflowRun catch with a typed WorkflowRunGuardError class - Reorder listWorkflowRuns placeholder generation: capture startIdx before pushing values for clarity - Replace curried makeRunAction factory in DashboardPage with a plain runAction helper function - Move skipIfStatusChanged helper definition before its call sites in dag-executor to match reading order * fix(workflows): idle timeout too aggressive on DAG nodes (coleam00#854) (coleam00#886) * fix(workflows): idle timeout too aggressive — break after result, reset on all messages (coleam00#854) The idle timeout (5 min default) caused two problems: (1) after a node's AI finished (result message), the loop waited for the subprocess to exit, wasting 5 min on hangs; (2) tool messages didn't reset the timer, so long Bash calls (tests, builds) triggered false timeouts on actively working nodes. Changes: - Break out of the for-await loop immediately after receiving the result message in both command/prompt and loop node paths — no more post-completion waste - Remove shouldResetTimer predicate so all message types (including tool) reset the timer — timeout only fires on complete silence - Increase STEP_IDLE_TIMEOUT_MS from 5 min to 30 min — with every message resetting the timer, this is a deadlock detector, not a work limiter Fixes coleam00#854 * fix(workflows): update withIdleTimeout JSDoc to match new timer behavior Remove the tool-exclusion example from the shouldResetTimer docs since that pattern was just removed from all call sites. Clarify that most callers should omit the parameter. * fix(workflows): address review findings — log cleanup errors, add break tests, fix stale docs - Log generator cleanup errors in withIdleTimeout instead of silently swallowing - Add behavioral tests for break-after-result in both command/prompt and loop nodes - Fix stale "5 minutes" default in docs/loop-nodes.md (now 30 minutes) - Clarify shouldResetTimer test names and comments (utility API, not executor behavior) - Extract effectiveIdleTimeout in loop node path (matches command/prompt pattern) - Remove redundant iterResult alias in withIdleTimeout * feat: add approval gate node type for human-in-the-loop workflows (coleam00#888) * feat: add approval gate node type for human-in-the-loop workflows Add `approval` as a fifth DAG node type that pauses workflow execution until a human approves or rejects. Built on existing resume infrastructure. - New `approval` node type in YAML workflows with `message` field - `paused` workflow status (non-terminal, resumable) - Approve/reject via REST API, CLI, chat commands, and Web UI - Approval comment available as `$node.output` in downstream nodes - Dashboard shows amber pulsing badge with approval message for paused runs - Path guard blocks new workflows on paused worktrees * fix: resolve 7 bugs in approval gate implementation Critical: - Parse metadata JSON on SQLite (returned as TEXT string, not object) - Suppress spurious "Workflow stopped (paused)" message after approval gates Medium: - CLI exits 0 on pause (not error code 1) - workflow status shows paused runs alongside running - Docs clarify auto-resume is CLI-only; API/chat mark resumable Low: - Skip completed_at when transitioning to failed for approval resume - Warn on AI fields (model, hooks, etc.) set on approval nodes - Add rowCount check to updateWorkflowRun Also: extract ApprovalContext type, add .min(1) to approval.message, fix stale "Only failed runs" error messages, fix log domain prefix, add recovery instructions to CLI approve error message. * feat(workflows): add interactive PRD workflow with approval gates (coleam00#891) Guided PRD creation through 3 approval gates: 1. Foundation questions (who/what/why/why now) 2. Deep dive questions (vision/users/JTBD/constraints) 3. Scope questions (MVP/hypothesis/exclusions) AI researches market + codebase between gates. Each gate carries the user's answers via $gate.output to downstream nodes. Final node generates the PRD from accumulated context. * fix(cli): pass codebase_id to workflowRunCommand on resume/approve (coleam00#890) * fix(cli): pass codebase_id to workflowRunCommand on resume/approve (coleam00#889) When workflow resume or approve is called, the worktree path is passed as cwd, but the codebase was registered under the main checkout path. This caused findCodebaseByDefaultCwd to return null, triggering auto-registration which tried to create a source symlink that already existed, crashing the command. Changes: - Add codebaseId option to WorkflowRunOptions - Look up codebase by stored ID before falling back to auto-registration - Pass run.codebase_id from workflowResumeCommand and workflowApproveCommand - Add test verifying getCodebase is called with stored codebase_id on resume Fixes coleam00#889 * fix(cli): address review findings for PR coleam00#890 - Add domain prefix to 5 log events in codebase-lookup block (cli.codebase_lookup_failed, cli.db_connection_hint, cli.codebase_id_lookup_failed, cli.codebase_auto_registered, cli.codebase_auto_registration_failed) - Add inline comment clarifying intentional fallthrough in catch block - Add test for workflowApproveCommand codebase_id forwarding - Add test for error fallback path when getCodebase throws - Update pre-existing test assertion to match renamed log event * fix(workflows): improve interactive PRD — first-principles prompting + validation node (coleam00#893) Two improvements to the interactive PRD workflow: 1. First-principles prompting: All research/technical/generate nodes now instruct the AI to start from what exists, verify claims by reading actual files, prefer extending over creating, and cite exact file:line. 2. Validation node: New final node reads the generated PRD and checks every technical claim against the codebase — file paths, endpoint names, DB schemas, event types, component names. Corrects inline and documents what was fixed. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Rasmus Widing <152263317+Wirasm@users.noreply.github.com>

… (coleam00#886) * fix(workflows): idle timeout too aggressive — break after result, reset on all messages (coleam00#854) The idle timeout (5 min default) caused two problems: (1) after a node's AI finished (result message), the loop waited for the subprocess to exit, wasting 5 min on hangs; (2) tool messages didn't reset the timer, so long Bash calls (tests, builds) triggered false timeouts on actively working nodes. Changes: - Break out of the for-await loop immediately after receiving the result message in both command/prompt and loop node paths — no more post-completion waste - Remove shouldResetTimer predicate so all message types (including tool) reset the timer — timeout only fires on complete silence - Increase STEP_IDLE_TIMEOUT_MS from 5 min to 30 min — with every message resetting the timer, this is a deadlock detector, not a work limiter Fixes coleam00#854 * fix(workflows): update withIdleTimeout JSDoc to match new timer behavior Remove the tool-exclusion example from the shouldResetTimer docs since that pattern was just removed from all call sites. Clarify that most callers should omit the parameter. * fix(workflows): address review findings — log cleanup errors, add break tests, fix stale docs - Log generator cleanup errors in withIdleTimeout instead of silently swallowing - Add behavioral tests for break-after-result in both command/prompt and loop nodes - Fix stale "5 minutes" default in docs/loop-nodes.md (now 30 minutes) - Clarify shouldResetTimer test names and comments (utility API, not executor behavior) - Extract effectiveIdleTimeout in loop node path (matches command/prompt pattern) - Remove redundant iterResult alias in withIdleTimeout

* fix: auto-select first DAG node on workflow run load (coleam00#781) The workflow output panel showed "No output available" on initial load because no DAG node was selected by default. Users had to manually click a node in the Logs tab to see output. Changes: - Add useEffect that auto-selects the first DAG node when workflow data loads and no node is selected - Prefer the currently running node for live workflows Fixes coleam00#781 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: formatting and test fixes from validation - Fix formatting in workflows.test.ts and api.ts (prettier) - Replace Hono() with OpenAPIHono({ defaultHook: validationErrorHook }) in api.conversations.test.ts to match route registration requirements Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: extract selectInitialNode to testable pure function Extract DAG node auto-selection logic from useEffect into a pure function with 5 unit tests covering all branches (undefined, empty, no running, prefer running, multiple running). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: workflow lifecycle overhaul — path-based guards, interrupted status, resume/abandon (coleam00#871) * feat: add interrupted to WorkflowRunStatus schema Implements US-001 from PRD. Changes: - Add 'interrupted' to workflowRunStatusSchema z.enum in packages/workflows/src/schemas/workflow-run.ts - Add 'interrupted' to workflowRunStatusSchema in packages/server/src/routes/schemas/workflow.schemas.ts - Add interrupted: z.number() to dashboardRunsResponseSchema counts object - Add 'interrupted' to dashboardValidStatuses in API handler - Add interrupted: 0 to DashboardRunsResult counts interface and runtime object in packages/core/src/db/workflows.ts * feat: update IWorkflowStore interface & DB query implementations Implements US-002 from PRD. Changes: - IWorkflowStore: rename getActiveWorkflowRun → getActiveWorkflowRunByPath(workingPath) - IWorkflowStore: drop conversationId from findResumableRun signature - IWorkflowStore: add interruptOrphanedRuns() method - db/workflows: add getActiveWorkflowRunByPath querying status IN ('running', 'interrupted') - db/workflows: update findResumableRun to query by workflow_name + working_path only, include 'interrupted' status - db/workflows: add interruptOrphanedRuns() UPDATE SET status='interrupted' WHERE status='running' - store-adapter: wire all three new/modified methods - executor: update call sites to use renamed methods (type-check requirement) - tests: update all mock stores and add new tests for getActiveWorkflowRunByPath and interruptOrphanedRuns * feat: replace staleness guard with path-based lifecycle Implements US-003 from PRD. Changes: - executor.ts: remove STALE_MINUTES staleness auto-kill; replace with status-based guard — 'running' blocks, 'interrupted' offers resume/abandon - server/src/index.ts: replace failStaleWorkflowRuns() with createWorkflowStore().interruptOrphanedRuns() on startup - executor-preamble.test.ts: replace staleness detection tests with concurrent run guard tests covering 'running' and 'interrupted' cases * feat: command handler — /workflow status, resume, and abandon Implements US-004. Replaces time-based stale heuristics with explicit lifecycle commands for workflow management. Changes: - Remove WORKFLOW_SLOW_THRESHOLD_MS and WORKFLOW_STALE_THRESHOLD_MS constants - Replace /workflow status with global view: lists all running+interrupted runs across all worktrees (ID, name, working path, status, started-at) - Add /workflow resume <id>: validates state then calls resumeWorkflowRun - Add /workflow abandon <id>: validates state then calls failWorkflowRun - Add statuses[] filter to listWorkflowRuns for IN (...) queries - Update /workflow help text and default case usage string - Update /status command to remove stale warning that referenced removed constants - Replace old /workflow status tests with new behavior coverage - Add /workflow resume and /workflow abandon test coverage * feat: CLI workflow status, resume, and abandon subcommands Implements US-005 from PRD. Changes: - Implement workflowStatusCommand: lists all running+interrupted runs with ID, name, path, status, age; supports --json flag - Add workflowResumeCommand: validates run state then calls resumeWorkflowRun - Add workflowAbandonCommand: validates run state then calls failWorkflowRun('Abandoned by user') - Replace findLastFailedRun usage in --resume path with findResumableRun(workflowName, cwd) - Wire resume/abandon subcommands in cli.ts - Update tests: replace "not implemented" test with status/resume/abandon coverage * feat: Web UI interrupted status badge and dashboard support Implements US-006 from PRD. Changes: - api.generated.d.ts: add 'interrupted' to WorkflowRunStatus enum and DashboardRunsResponse.counts - api.ts: add interrupted field to DashboardCounts interface - WorkflowExecution.tsx: add 'interrupted' to TERMINAL_STATUSES; add amber color to StatusBadge - WorkflowRunCard.tsx: add amber dot and badge for interrupted status - StatusSummaryBar.tsx: add 'interrupted' to STATUS_CHIPS filter list - DashboardPage.tsx: include interrupted in activeRuns filter and counts default * refactor: remove dead timer-based workflow staleness code Implements US-007 from PRD. Changes: - Remove findLastFailedRun() from db/workflows.ts (CLI path unified on findResumableRun in US-005) - Remove failStaleWorkflowRuns() from db/workflows.ts (replaced by interruptOrphanedRuns in US-002) - Remove IDatabase import from db/workflows.ts (no longer needed) - Remove failStaleWorkflowRuns tests from db/workflows.test.ts grep -r 'STALE' packages/ (workflow-timer variant), grep -r 'findLastFailedRun' and grep -r 'failStaleWorkflowRuns' all return zero matches. * fix: address review feedback — truncated IDs, resume semantics, type safety - Use full UUIDs in resume/abandon command suggestions (was .slice(0, 8)) - Add completed_at to interruptOrphanedRuns for correct duration metrics - Fix resume commands: mark interrupted→failed to unblock path guard, let next workflow invocation auto-resume via findResumableRun - Collapse dual status/statuses fields into status?: T | T[] - Extract TERMINAL/RESUMABLE/ACTIVE_WORKFLOW_STATUSES constants - Add explicit else-if for interrupted status in executor path guard - Add structured logging to CLI workflow commands - Restore conversationId to cmd.workflow_status_failed log - Add tests: listWorkflowRuns statuses filter, interrupted auto-resume, DB error handling for resume/abandon - Update docs: commands-reference, cli-user-guide, authoring-workflows, CLAUDE.md * refactor: simplify CLI commands and status filter logic - Extract getRunOrThrow helper for shared run lookup pattern - Use WorkflowRun[] instead of Awaited<ReturnType<...>> - Remove single-item special case in listWorkflowRuns (IN works for all) - Use ?? instead of || for null-coalescing consistency - Remove unused ACTIVE_WORKFLOW_STATUSES constant - Add inline comment on completed_at for interrupted runs * fix: handle SQLite string dates in formatAge SQLite returns started_at as a string, not a Date object. formatAge now accepts Date | string and converts accordingly. Found during E2E testing against real SQLite database. * feat: UX improvements — real resume, dashboard actions, cleanup command - CLI resume now actually re-executes the workflow (calls workflowRunCommand with --resume internally instead of just flipping DB status) - Remove truncated IDs from executor guard messages (full ID in commands only) - Add Resume/Abandon/Delete buttons to dashboard workflow run cards - Add Delete button to history table rows - Add API endpoints: POST resume, POST abandon, DELETE workflow run - Add CLI workflow cleanup command (deletes terminal runs older than N days) - Add deleteWorkflowRun and deleteOldWorkflowRuns DB functions * refactor: simplify API handlers, dashboard actions, and log conventions - Use RESUMABLE/TERMINAL_WORKFLOW_STATUSES constants in API handlers (was inline string checks diverging from CLI/command-handler) - Extract makeRunAction helper in DashboardPage (4 identical handlers → 1) - Fix log event names to use domain prefix convention (api.workflow_run_*) - Use Ban icon for Abandon to distinguish from Cancel's XCircle - Use instanceof Date guard in formatAge for clarity - Add comment on delete handler's active-status guard * refactor: simplify workflow lifecycle — remove interrupted, single resume path Rework the primitives for a clean foundation: Status model: 5 statuses (pending, running, completed, failed, cancelled). Remove 'interrupted' entirely — server restart now marks orphaned runs as 'failed' directly (with metadata.failure_reason = 'server_restart'). Resume model: one path. The executor's implicit findResumableRun detects prior failed runs and skips completed nodes. The CLI --resume flag reuses the prior run's worktree but lets the executor handle node-skipping (no more preCreatedRun bypass). Chat /workflow resume tells the user to re-invoke (auto-resume kicks in). Path guard: only blocks 'running' status (was running + interrupted). Guards always run regardless of preCreatedRun. Cancellation: generalized from status === 'cancelled' to status !== 'running' at all 3 check points (streaming, loop iterations, DAG layers). Ready for future 'paused' status with zero changes. - interruptOrphanedRuns → failOrphanedRuns - Remove preCreatedRun bypass from CLI --resume path - Generalize 3 cancellation check points in dag-executor - Update all API endpoints, command handlers, UI components - Update all tests and documentation * fix: cancel/complete race, abandon semantics, UTC dates - Fix cancel/complete race condition: dag-executor now checks DB status before calling completeWorkflowRun or failWorkflowRun, preventing a cancel during the final layer from being overwritten to completed - Abandon uses cancelWorkflowRun instead of failWorkflowRun, so abandoned runs don't get auto-resumed by findResumableRun - Fix formatAge UTC bug: SQLite dates without Z suffix now parsed as UTC * fix: address PR review — SQL safety, transactions, error handling, docs, tests - Validate olderThanDays before SQL interpolation in deleteOldWorkflowRuns - Wrap multi-statement deletes in transactions (deleteOldWorkflowRuns, deleteWorkflowRun) - Fix deleteWorkflowRun error double-wrap (don't re-wrap "not found" errors) - Handle null getWorkflowRunStatus in DAG executor (treat as deleted, abort) - Fix mock name mismatch: interruptOrphanedRuns → failOrphanedRuns in 3 test files - Fix default mock getWorkflowRunStatus to return 'running' instead of null - Add NaN guard to formatAge (returns 'unknown' on unparseable dates) - Fix stale 'interrupted' references in route summary and delete comment - Include working path in /workflow resume response - Align deleteOldWorkflowRuns return type to { count } for consistency - Document workflow cleanup command in CLAUDE.md, CLI user guide, commands reference - Document new API endpoints (resume, abandon, delete) in CLAUDE.md - Add tests for deleteOldWorkflowRuns, deleteWorkflowRun, workflowCleanupCommand - Fix workflowAbandonCommand test to assert cancelWorkflowRun call * refactor: simplify code per review — extract helper, cleaner date parsing, consistent guards - Extract duplicated status-check blocks into skipIfStatusChanged helper in dag-executor - Simplify formatAge to single-pass date parsing with Z suffix (ISO 8601) - Use TERMINAL_WORKFLOW_STATUSES constant in delete route guard - Rename cancelError → actionError in DashboardPage (covers 4 actions now) - Fix merge conflict: add IDatabase import, getRunningWorkflows from dev - Fix api.conversations.test.ts: add missing workflow mocks, fix Hono → OpenAPIHono * fix: address review findings — double-rollback, missing guards, log context, tests - Fix double-rollback in deleteWorkflowRun by removing inner rollback() call and letting the outer catch handle it - Add terminal-status guard inside deleteWorkflowRun itself, not just in the route handler, to prevent deletion of running workflows - Add rollback failure logging to the rollback() helper - Add runId to error logs in resume/abandon/delete API route handlers - Add workingPath to getActiveWorkflowRunByPath error log - Add workflowRunId to dag-executor status-check warn logs - Wrap workflowRunCommand in try/catch in workflowResumeCommand with structured logging and null guard for user_message - Clean up stale 'interrupted' references in JSDoc - Fix missing / prefix on workflow cleanup in commands-reference.md - Add API route tests for POST /resume, POST /abandon, DELETE /:runId * refactor: apply code simplifications from review - Replace fragile startsWith string matching in deleteWorkflowRun catch with a typed WorkflowRunGuardError class - Reorder listWorkflowRuns placeholder generation: capture startIdx before pushing values for clarity - Replace curried makeRunAction factory in DashboardPage with a plain runAction helper function - Move skipIfStatusChanged helper definition before its call sites in dag-executor to match reading order * fix(workflows): idle timeout too aggressive on DAG nodes (coleam00#854) (coleam00#886) * fix(workflows): idle timeout too aggressive — break after result, reset on all messages (coleam00#854) The idle timeout (5 min default) caused two problems: (1) after a node's AI finished (result message), the loop waited for the subprocess to exit, wasting 5 min on hangs; (2) tool messages didn't reset the timer, so long Bash calls (tests, builds) triggered false timeouts on actively working nodes. Changes: - Break out of the for-await loop immediately after receiving the result message in both command/prompt and loop node paths — no more post-completion waste - Remove shouldResetTimer predicate so all message types (including tool) reset the timer — timeout only fires on complete silence - Increase STEP_IDLE_TIMEOUT_MS from 5 min to 30 min — with every message resetting the timer, this is a deadlock detector, not a work limiter Fixes coleam00#854 * fix(workflows): update withIdleTimeout JSDoc to match new timer behavior Remove the tool-exclusion example from the shouldResetTimer docs since that pattern was just removed from all call sites. Clarify that most callers should omit the parameter. * fix(workflows): address review findings — log cleanup errors, add break tests, fix stale docs - Log generator cleanup errors in withIdleTimeout instead of silently swallowing - Add behavioral tests for break-after-result in both command/prompt and loop nodes - Fix stale "5 minutes" default in docs/loop-nodes.md (now 30 minutes) - Clarify shouldResetTimer test names and comments (utility API, not executor behavior) - Extract effectiveIdleTimeout in loop node path (matches command/prompt pattern) - Remove redundant iterResult alias in withIdleTimeout * feat: add approval gate node type for human-in-the-loop workflows (coleam00#888) * feat: add approval gate node type for human-in-the-loop workflows Add `approval` as a fifth DAG node type that pauses workflow execution until a human approves or rejects. Built on existing resume infrastructure. - New `approval` node type in YAML workflows with `message` field - `paused` workflow status (non-terminal, resumable) - Approve/reject via REST API, CLI, chat commands, and Web UI - Approval comment available as `$node.output` in downstream nodes - Dashboard shows amber pulsing badge with approval message for paused runs - Path guard blocks new workflows on paused worktrees * fix: resolve 7 bugs in approval gate implementation Critical: - Parse metadata JSON on SQLite (returned as TEXT string, not object) - Suppress spurious "Workflow stopped (paused)" message after approval gates Medium: - CLI exits 0 on pause (not error code 1) - workflow status shows paused runs alongside running - Docs clarify auto-resume is CLI-only; API/chat mark resumable Low: - Skip completed_at when transitioning to failed for approval resume - Warn on AI fields (model, hooks, etc.) set on approval nodes - Add rowCount check to updateWorkflowRun Also: extract ApprovalContext type, add .min(1) to approval.message, fix stale "Only failed runs" error messages, fix log domain prefix, add recovery instructions to CLI approve error message. * feat(workflows): add interactive PRD workflow with approval gates (coleam00#891) Guided PRD creation through 3 approval gates: 1. Foundation questions (who/what/why/why now) 2. Deep dive questions (vision/users/JTBD/constraints) 3. Scope questions (MVP/hypothesis/exclusions) AI researches market + codebase between gates. Each gate carries the user's answers via $gate.output to downstream nodes. Final node generates the PRD from accumulated context. * fix(cli): pass codebase_id to workflowRunCommand on resume/approve (coleam00#890) * fix(cli): pass codebase_id to workflowRunCommand on resume/approve (coleam00#889) When workflow resume or approve is called, the worktree path is passed as cwd, but the codebase was registered under the main checkout path. This caused findCodebaseByDefaultCwd to return null, triggering auto-registration which tried to create a source symlink that already existed, crashing the command. Changes: - Add codebaseId option to WorkflowRunOptions - Look up codebase by stored ID before falling back to auto-registration - Pass run.codebase_id from workflowResumeCommand and workflowApproveCommand - Add test verifying getCodebase is called with stored codebase_id on resume Fixes coleam00#889 * fix(cli): address review findings for PR coleam00#890 - Add domain prefix to 5 log events in codebase-lookup block (cli.codebase_lookup_failed, cli.db_connection_hint, cli.codebase_id_lookup_failed, cli.codebase_auto_registered, cli.codebase_auto_registration_failed) - Add inline comment clarifying intentional fallthrough in catch block - Add test for workflowApproveCommand codebase_id forwarding - Add test for error fallback path when getCodebase throws - Update pre-existing test assertion to match renamed log event * fix(workflows): improve interactive PRD — first-principles prompting + validation node (coleam00#893) Two improvements to the interactive PRD workflow: 1. First-principles prompting: All research/technical/generate nodes now instruct the AI to start from what exists, verify claims by reading actual files, prefer extending over creating, and cite exact file:line. 2. Validation node: New final node reads the generated PRD and checks every technical claim against the codebase — file paths, endpoint names, DB schemas, event types, component names. Corrects inline and documents what was fixed. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Rasmus Widing <152263317+Wirasm@users.noreply.github.com>

Wirasm closed this Apr 6, 2026

Wirasm deleted the feature/fix-error-operation-visibility branch April 6, 2026 07:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add persistent failed operations UI with retry/remove actions#854

feat: Add persistent failed operations UI with retry/remove actions#854
leex279 wants to merge 2 commits intomainfrom
feature/fix-error-operation-visibility

leex279 commented Nov 9, 2025

Uh oh!

coderabbitai Bot commented Nov 9, 2025

Review skipped

Uh oh!

Wirasm commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

leex279 commented Nov 9, 2025

Summary

Problem

Solution

Features

Test Plan

Screenshots

Uh oh!

coderabbitai Bot commented Nov 9, 2025

Review skipped

Uh oh!

Wirasm commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants