fix(workflows): fail loudly on SDK isError results in DAG and loop nodes#1291
fix(workflows): fail loudly on SDK isError results in DAG and loop nodes#1291
Conversation
Previously, `dag-executor` only failed nodes/iterations when the SDK returned an `error_max_budget_usd` result. Every other `isError: true` subtype — including `error_during_execution` — was silently `break`ed out of the stream with whatever partial output had accumulated, letting failed runs masquerade as successful ones with empty output. This is the most likely explanation for the "5-second crash" symptom in #1208: iterations finish instantly with empty text, the loop keeps going, and only the `claude.result_is_error` log tips the user off. Changes: - Capture the SDK's `errors: string[]` detail on result messages (previously discarded) and surface it through `MessageChunk.errors`. - Log `errors`, `stopReason` alongside `errorSubtype` in `claude.result_is_error` so users can see what actually failed. - Throw from both the general node path and the loop iteration path on any `isError: true` result, including the subtype and SDK errors detail in the thrown message. Note: this does not implement auto-retry. See PR comments on #1121 and the analysis on #1208 — a retry-with-fresh-session approach for loop iterations is not obviously correct until we see what `error_during_execution` actually carries in the reporter's env. This change is the observability + fail-loud step that has to come first so that signal is no longer silent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThe changes extend error handling throughout the Claude provider and DAG executor. The provider now captures SDK-provided error strings in the Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/workflows/src/dag-executor.ts (1)
1648-1669: Loop iteration fail-loud matches node path — looks correct.Throwing inside the
for awaitis caught by the enclosingtryat Line 1572 /catchat Line 1742, which emitsloop_iteration_failedwitherr.message(includes subtype and joinederrors) and returns{ state: 'failed' }, so the loop stops instead of burning iterations on repeatederror_during_execution— which directly addresses the#1208scenario described in the comment.Two small, optional points:
- The node path (Lines 767–785) and this block are nearly identical. Consider extracting a small helper like
buildSdkErrorFromResult(msg, { nodeId, iteration? })returning{ message, logContext }to keep them in sync as the error shape evolves.- Unlike the node path, the log payload here doesn't include
durationMs(iteration elapsed). SinceiterationStartis in scope, addingdurationMs: Date.now() - iterationStartwould make loop failures as diagnosable as node failures.♻️ Optional: add iteration duration to the log
getLog().error( { nodeId: node.id, iteration: i, errorSubtype: subtype, errors: msg.errors, sessionId: msg.sessionId, stopReason: msg.stopReason, + durationMs: Date.now() - iterationStart, }, 'loop_node.iteration_sdk_error' );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/workflows/src/dag-executor.ts` around lines 1648 - 1669, The loop error-handling duplicates logic from the node path; extract a helper (e.g., buildSdkErrorFromResult(msg, { nodeId, iteration? })) that returns { message, logContext } and use it here to build the thrown Error and the getLog().error payload so both places stay in sync; also add durationMs: Date.now() - iterationStart to the logContext in the loop block (use iterationStart in scope) so the log payload matches the node path and the thrown Error message uses the helper's message (including subtype and joined errors).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/workflows/src/dag-executor.ts`:
- Around line 1648-1669: The loop error-handling duplicates logic from the node
path; extract a helper (e.g., buildSdkErrorFromResult(msg, { nodeId, iteration?
})) that returns { message, logContext } and use it here to build the thrown
Error and the getLog().error payload so both places stay in sync; also add
durationMs: Date.now() - iterationStart to the logContext in the loop block (use
iterationStart in scope) so the log payload matches the node path and the thrown
Error message uses the helper's message (including subtype and joined errors).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a7b17bce-ab9d-4e07-9ebd-52d9c52c16fe
📒 Files selected for processing (4)
packages/providers/src/claude/provider.tspackages/providers/src/types.tspackages/workflows/src/dag-executor.test.tspackages/workflows/src/dag-executor.ts
Archon PR Validation ReportVerdict: APPROVE SummaryAll four claimed gaps confirmed on main and verified fixed on the feature branch. The fix is minimal, follows existing patterns (budget-cap throw structure), and includes focused regression tests for both DAG node and loop iteration paths. No issues found, no regressions. Bug Confirmation
IssuesNo blocking issues found. What's Done Well
Fix Quality: 5/5 | CLAUDE.md Compliance: Full Validated by archon-validate-pr workflow |
…ly features Merge upstream commit 4c6ddd9 (fix(workflows): fail loudly on SDK isError results, coleam00#1291) into spike/providers-refactor, bringing in the `IAssistantClient` -> `IAgentProvider` refactor and its associated loud-failure change for SDK `isError` results. Upstream changes absorbed: - IAssistantClient / WorkflowAssistantOptions -> IAgentProvider / SendQueryOptions + nodeConfig + assistantConfig (typed providers) - getAssistantClient -> getAgentProvider factory in WorkflowDeps - `errors: string[]` surfaced through MessageChunk; loud failure on all `isError: true` result subtypes (not just `error_max_budget_usd`) - New @archon/providers contract layer + @archon/providers/types subpath Fork-only features restored on top of the refactor: - Durable-progress loop tracking + "failed after partial execution" wording with { node_counts, failed_nodes } metadata on the 2-arg `failWorkflowRun` call (source: fork commit 5f2377e) - Workflow-level Codex tuning (modelReasoningEffort, webSearchMode, additionalDirectories) expressed as top-level AgentRequestOptions fields, with node > workflow > config precedence; BASH_NODE_AI_FIELDS extended so loader warns on these on non-AI nodes (source: fork commit b6c1905, re-expressed against the new contract) - Matching Zod DagNode schema extensions + transform conditional spreads for the three Codex tuning fields Validation: - check:bundled up-to-date - type-check clean across 10 packages - lint 0 errors / 0 warnings - format:check clean - full per-package test suite green (workflows 5 batches, core 7, adapters 3, isolation 3; all exit 0) Per CLAUDE.md fork policy, this merge stays on the spike branch. `dev` remains a pristine `upstream/dev` mirror and is not touched by this commit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
dag-executorpreviously only failed nodes onerror_max_budget_usd; every otherisError: trueSDK result (includingerror_during_execution) was silentlybreaked with partial/empty output.claude.result_is_errorlog line reveals something broke.Changes
packages/providers/src/types.ts— add optionalerrors?: string[]to theresultMessageChunk.packages/providers/src/claude/provider.ts— capture the SDK'serrors: string[]array (previously discarded), include it andstopReasonin theclaude.result_is_errorlog, and pass it through to consumers.packages/workflows/src/dag-executor.ts— two sites:isError: truewith a message including the subtype and SDKerrorsdetail.breaking. Surrounding try/catch already maps this to a cleanloop_iteration_failedevent +{ state: 'failed' }return.error_during_executionsubtype with a populatederrorsarray.Why not auto-retry with fresh session?
See my analysis on #1208 and #1121. Short version:
error_during_executionis an SDK catch-all — can be stale session, tool error, MCP crash, token refresh, network interruption. Treating all of them as "reset session and retry" would mask several distinct root causes and regress context continuity across approval gates.dag-executor.test.ts:3533intentionally asserts session passing on interactive loop resume — that's the designed behavior.error_during_executioncarries in the reporter's environment, so any retry heuristic would be speculative.PR #1121 takes a similar retry-on-stale-session approach in the orchestrator chat path, where it is correct (different error shape, different layer). See comment on #1121 for the scope note.
Test plan
bun run type-check— cleanbun run lint— cleanbun run format:check— cleanfails node when SDK returns error_during_execution resultloop iteration fails loudly when SDK returns error_during_executionerror_max_budget_usdtests still pass (behavior unchanged for budget path)🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests