fix(workflows): fail loudly on SDK isError results in DAG and loop nodes by coleam00 · Pull Request #1291 · coleam00/Archon

coleam00 · 2026-04-18T17:20:25Z

Summary

dag-executor previously only failed nodes on error_max_budget_usd; every other isError: true SDK result (including error_during_execution) silently broke out of the stream with partial/empty output.
This likely explains the "5-second crash" symptom in fix: interactive loop resume crashes with error_during_execution (stale session) #1208: failed iterations finish instantly with empty text, the loop keeps going, and only the claude.result_is_error log line reveals something broke.
This PR is the observability + fail-loud step — prerequisite to any retry work. No auto-retry here.

Changes

packages/providers/src/types.ts — add optional errors?: string[] to the result MessageChunk.
packages/providers/src/claude/provider.ts — capture the SDK's errors: string[] array (previously discarded), include it and stopReason in the claude.result_is_error log, and pass it through to consumers.
packages/workflows/src/dag-executor.ts — two sites:
- General node path: after the existing budget-cap case, throw on any other isError: true with a message including the subtype and SDK errors detail.
- Loop iteration path: add the same check so failed iterations raise instead of silently breaking. Surrounding try/catch already maps this to a clean loop_iteration_failed event + { state: 'failed' } return.
Tests for both paths using the error_during_execution subtype with a populated errors array.
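
The general-node check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `ResultChunk` shape, the `errorSubtype` field name on the chunk, the `assertResultOk` helper, and the message wording are all assumptions. Only `isError`, `errors`, `stopReason`, and the two subtype names come from the description.

```typescript
// Hypothetical, simplified result chunk. The real MessageChunk has more variants.
interface ResultChunk {
  type: 'result';
  isError?: boolean;
  errorSubtype?: string; // assumed field name for the SDK subtype
  errors?: string[]; // optional SDK detail added by this PR
  stopReason?: string;
}

// Hypothetical helper: throws for any error result instead of letting the
// caller break out of the stream with partial output.
function assertResultOk(nodeId: string, msg: ResultChunk): void {
  if (!msg.isError) return;

  const subtype = msg.errorSubtype ?? 'unknown';

  // Budget-cap case first (it already failed nodes before this PR).
  if (subtype === 'error_max_budget_usd') {
    throw new Error(`Node '${nodeId}' exceeded its budget cap`);
  }

  // Catch-all for every other isError result, including error_during_execution.
  const detail = msg.errors?.length ? `: ${msg.errors.join('; ')}` : '';
  throw new Error(`Node '${nodeId}' failed: SDK returned ${subtype}${detail}`);
}
```

Putting the subtype and the joined `errors` into the thrown message means the failure surfaces in the workflow's own error reporting, not only in the `claude.result_is_error` log line.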

Why not auto-retry with fresh session?

See my analysis on #1208 and #1121. Short version:

error_during_execution is an SDK catch-all — can be stale session, tool error, MCP crash, token refresh, network interruption. Treating all of them as "reset session and retry" would mask several distinct root causes and regress context continuity across approval gates.
The existing unit test at dag-executor.test.ts:3533 intentionally asserts session passing on interactive loop resume — that's the designed behavior.
Without this PR's observability fix first, we can't even see what error_during_execution carries in the reporter's environment, so any retry heuristic would be speculative.

PR #1121 takes a similar retry-on-stale-session approach in the orchestrator chat path, where it is correct (different error shape, different layer). See comment on #1121 for the scope note.

Test plan

bun run type-check — clean
bun run lint — clean
bun run format:check — clean
New unit test: fails node when SDK returns error_during_execution result
New unit test: loop iteration fails loudly when SDK returns error_during_execution
Existing error_max_budget_usd tests still pass (behavior unchanged for budget path)
Existing interactive-loop-resume test still passes (no regression in session-threading design)

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved error handling during workflow execution—SDK error details are now properly captured and reported.
- Workflow nodes and loop iterations now explicitly fail when encountering SDK errors with detailed error messages.
Tests
- Added regression tests for SDK error handling in node and loop execution scenarios.

Previously, `dag-executor` only failed nodes/iterations when the SDK returned an `error_max_budget_usd` result. Every other `isError: true` subtype, including `error_during_execution`, silently hit a `break` out of the stream with whatever partial output had accumulated, letting failed runs masquerade as successful ones with empty output.

This is the most likely explanation for the "5-second crash" symptom in #1208: iterations finish instantly with empty text, the loop keeps going, and only the `claude.result_is_error` log tips the user off.

Changes:

- Capture the SDK's `errors: string[]` detail on result messages (previously discarded) and surface it through `MessageChunk.errors`.
- Log `errors` and `stopReason` alongside `errorSubtype` in `claude.result_is_error` so users can see what actually failed.
- Throw from both the general node path and the loop iteration path on any `isError: true` result, including the subtype and SDK errors detail in the thrown message.

Note: this does not implement auto-retry. See PR comments on #1121 and the analysis on #1208: a retry-with-fresh-session approach for loop iterations is not obviously correct until we see what `error_during_execution` actually carries in the reporter's env. This change is the observability + fail-loud step that has to come first so that signal is no longer silent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai · 2026-04-18T17:20:38Z

📝 Walkthrough

The changes extend error handling throughout the Claude provider and DAG executor. The provider now captures SDK-provided error strings in the result MessageChunk when errors occur, while the executor explicitly handles non-budget SDK errors by logging detailed context and throwing failures for nodes and loop iterations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Claude Provider Error Extension: `packages/providers/src/claude/provider.ts`, `packages/providers/src/types.ts` | Extended `MessageChunk` result variant with optional `errors?: string[]` field; provider now extracts `sdkErrors` from `resultMsg.errors` and conditionally includes them in emitted chunks when `is_error` is true. |
| DAG Executor Error Handling: `packages/workflows/src/dag-executor.ts` | Added explicit checks for `msg.isError` in both node and loop iteration execution paths; logs error context (subtype, errors, sessionId, stopReason, elapsed duration) and throws to fail operations instead of silently breaking the stream. |
| Error Handling Test Coverage: `packages/workflows/src/dag-executor.test.ts` | Added two regression tests verifying that `error_during_execution` SDK results cause immediate executor failure with proper event emissions containing the subtype and error details for both node and loop execution scenarios. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

refactor: decompose provider sendQuery() into explicit helper boundaries #1162: Introduced streamClaudeMessages normalization and initial is_error handling that this PR directly extends with SDK error string propagation.

Poem

Through the burrow of code, a rabbit hops with care,
Error whispers now flow through the provider's air,
Executor listens close, logs what went wrong,
From Claude's SDK truth, we're finally strong! 🐰✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix(workflows): fail loudly on SDK isError results in DAG and loop nodes' directly matches the PR's main objective: adding error handling to surface failures instead of silently breaking. |
| Description check | ✅ Passed | The description provides a clear summary, specific file changes, rationale for not auto-retrying, and a test plan, comprehensively addressing all required sections. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

🧹 Nitpick comments (1)

packages/workflows/src/dag-executor.ts (1)
1648-1669: Loop iteration fail-loud matches node path — looks correct.

Throwing inside the for await is caught by the enclosing try at Line 1572 / catch at Line 1742, which emits loop_iteration_failed with err.message (includes subtype and joined errors) and returns { state: 'failed' }, so the loop stops instead of burning iterations on repeated error_during_execution — which directly addresses the #1208 scenario described in the comment.
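
The control flow described here (throw inside the `for await`, caught by the enclosing `try`, mapped to a failed state) can be illustrated with a small self-contained sketch. The `Chunk` and `LoopResult` types, the `runLoop` function, and the string-based event list are hypothetical simplifications. Only the `loop_iteration_failed` event name and the `{ state: 'failed' }` return come from the review text.

```typescript
// Hypothetical, simplified stream chunk.
type Chunk =
  | { type: 'assistant'; content: string }
  | { type: 'result'; isError?: boolean; errorSubtype?: string; errors?: string[] };

interface LoopResult {
  state: 'completed' | 'failed';
  events: string[]; // stand-in for emitted workflow events
}

async function runLoop(
  iterations: number,
  runIteration: (i: number) => AsyncIterable<Chunk>
): Promise<LoopResult> {
  const events: string[] = [];
  for (let i = 1; i <= iterations; i++) {
    try {
      for await (const msg of runIteration(i)) {
        if (msg.type === 'result' && msg.isError) {
          const detail = msg.errors?.length ? `: ${msg.errors.join('; ')}` : '';
          // Throwing here leaves the stream and lands in the catch below.
          throw new Error(`iteration ${i} failed: ${msg.errorSubtype ?? 'unknown'}${detail}`);
        }
      }
    } catch (err) {
      // One failed iteration stops the whole loop rather than burning more iterations.
      events.push(`loop_iteration_failed: ${(err as Error).message}`);
      return { state: 'failed', events };
    }
  }
  return { state: 'completed', events };
}
```

With a stream that always yields an `error_during_execution` result, this sketch returns after the first iteration with a single failure event, which is the behavior the comment above credits with addressing the #1208 scenario.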

Two small, optional points:

The node path (Lines 767–785) and this block are nearly identical. Consider extracting a small helper like buildSdkErrorFromResult(msg, { nodeId, iteration? }) returning { message, logContext } to keep them in sync as the error shape evolves.

Unlike the node path, the log payload here doesn't include durationMs (iteration elapsed). Since iterationStart is in scope, adding durationMs: Date.now() - iterationStart would make loop failures as diagnosable as node failures.
♻️ Optional: add iteration duration to the log
             getLog().error(
               {
                 nodeId: node.id,
                 iteration: i,
                 errorSubtype: subtype,
                 errors: msg.errors,
                 sessionId: msg.sessionId,
                 stopReason: msg.stopReason,
+                durationMs: Date.now() - iterationStart,
               },
               'loop_node.iteration_sdk_error'
             );
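
A helper along the lines of the first optional point might look like the sketch below. The name `buildSdkErrorFromResult` and the `{ message, logContext }` return shape follow the suggestion. The `SdkResult` and `ErrorScope` types, the `startedAt` option used to compute `durationMs`, and the message wording are assumptions made for illustration.

```typescript
// Hypothetical subset of the result chunk fields used for error reporting.
interface SdkResult {
  errorSubtype?: string;
  errors?: string[];
  sessionId?: string;
  stopReason?: string;
}

// `startedAt` is an assumed extra option so both call sites can report durationMs.
interface ErrorScope {
  nodeId: string;
  iteration?: number;
  startedAt?: number;
}

function buildSdkErrorFromResult(
  msg: SdkResult,
  scope: ErrorScope
): { message: string; logContext: Record<string, unknown> } {
  const subtype = msg.errorSubtype ?? 'unknown';
  const detail = msg.errors?.length ? `: ${msg.errors.join('; ')}` : '';
  const where =
    scope.iteration === undefined
      ? `Node '${scope.nodeId}'`
      : `Node '${scope.nodeId}' iteration ${scope.iteration}`;

  // Same keys as the log payload in the diff above; iteration and durationMs
  // are only present when the caller supplies them.
  const logContext: Record<string, unknown> = {
    nodeId: scope.nodeId,
    errorSubtype: subtype,
    errors: msg.errors,
    sessionId: msg.sessionId,
    stopReason: msg.stopReason,
  };
  if (scope.iteration !== undefined) logContext['iteration'] = scope.iteration;
  if (scope.startedAt !== undefined) logContext['durationMs'] = Date.now() - scope.startedAt;

  return { message: `${where} failed: SDK returned ${subtype}${detail}`, logContext };
}
```

Both call sites would then pass `logContext` to `getLog().error(...)` and throw `new Error(message)`, so the log payload and the thrown message cannot drift apart as the error shape evolves.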
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workflows/src/dag-executor.ts` around lines 1648 - 1669, The loop
error-handling duplicates logic from the node path; extract a helper (e.g.,
buildSdkErrorFromResult(msg, { nodeId, iteration? })) that returns { message,
logContext } and use it here to build the thrown Error and the getLog().error
payload so both places stay in sync; also add durationMs: Date.now() -
iterationStart to the logContext in the loop block (use iterationStart in scope)
so the log payload matches the node path and the thrown Error message uses the
helper's message (including subtype and joined errors).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a7b17bce-ab9d-4e07-9ebd-52d9c52c16fe

📥 Commits

Reviewing files that changed from the base of the PR and between d89bc76 and dfbb4ac.

📒 Files selected for processing (4)

packages/providers/src/claude/provider.ts
packages/providers/src/types.ts
packages/workflows/src/dag-executor.test.ts
packages/workflows/src/dag-executor.ts

coleam00 · 2026-04-18T19:54:20Z

Archon PR Validation Report

Verdict: APPROVE

Summary

All four claimed gaps confirmed on main and verified fixed on the feature branch. The fix is minimal, follows existing patterns (budget-cap throw structure), and includes focused regression tests for both DAG node and loop iteration paths. No issues found, no regressions.

Bug Confirmation

| Claim | Main | Feature |
|---|---|---|
| General node silently breaks on non-budget isError | Confirmed | Fixed: catch-all throw after budget check |
| Loop iteration has zero isError checking | Confirmed | Fixed: equivalent throw, stops loop on failure |
| SDK `errors[]` discarded by provider | Confirmed | Fixed: captured with Array.isArray guard |
| MessageChunk lacks `errors` field | Confirmed | Fixed: `errors?: string[]` added |
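
For illustration, the provider-side capture with an `Array.isArray` guard might look like the sketch below. `RawResultMessage`, `toResultChunk`, and the `errorSubtype` field name are hypothetical. Dropping non-string entries is an extra assumption, not something the report states. Only `is_error`, `errors`, and the `errors?: string[]` field on the chunk are taken from the PR.

```typescript
// Hypothetical raw SDK result message. `errors` is typed as unknown because
// the guard exists precisely for the case where the shape is not what we expect.
interface RawResultMessage {
  is_error?: boolean;
  subtype?: string;
  errors?: unknown;
}

interface ResultChunk {
  type: 'result';
  isError?: boolean;
  errorSubtype?: string; // assumed field name
  errors?: string[];
}

function toResultChunk(resultMsg: RawResultMessage): ResultChunk {
  const chunk: ResultChunk = { type: 'result' };
  if (!resultMsg.is_error) return chunk; // errors only travel on error results

  chunk.isError = true;
  if (resultMsg.subtype !== undefined) chunk.errorSubtype = resultMsg.subtype;

  // Guard: only forward `errors` when the SDK actually sent an array.
  if (Array.isArray(resultMsg.errors)) {
    const sdkErrors = resultMsg.errors.filter((e): e is string => typeof e === 'string');
    if (sdkErrors.length > 0) chunk.errors = sdkErrors;
  }
  return chunk;
}
```

Keeping `errors` optional on the chunk means existing consumers that never read it are unaffected.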

Issues

No blocking issues found.

What's Done Well

Scope discipline: fail-loud first, retry later (once error shapes are observable)
Pattern consistency with existing budget-cap handling
Enriched logging (stopReason + errors) closes the observability gap
Both regression tests are focused and effective

Fix Quality: 5/5 | CLAUDE.md Compliance: Full

Validated by archon-validate-pr workflow

…ly features

Merge upstream commit 4c6ddd9 (fix(workflows): fail loudly on SDK isError results, coleam00#1291) into spike/providers-refactor, bringing in the `IAssistantClient` -> `IAgentProvider` refactor and its associated loud-failure change for SDK `isError` results.

Upstream changes absorbed:
- IAssistantClient / WorkflowAssistantOptions -> IAgentProvider / SendQueryOptions + nodeConfig + assistantConfig (typed providers)
- getAssistantClient -> getAgentProvider factory in WorkflowDeps
- `errors: string[]` surfaced through MessageChunk; loud failure on all `isError: true` result subtypes (not just `error_max_budget_usd`)
- New @archon/providers contract layer + @archon/providers/types subpath

Fork-only features restored on top of the refactor:
- Durable-progress loop tracking + "failed after partial execution" wording with { node_counts, failed_nodes } metadata on the 2-arg `failWorkflowRun` call (source: fork commit 5f2377e)
- Workflow-level Codex tuning (modelReasoningEffort, webSearchMode, additionalDirectories) expressed as top-level AgentRequestOptions fields, with node > workflow > config precedence; BASH_NODE_AI_FIELDS extended so loader warns on these on non-AI nodes (source: fork commit b6c1905, re-expressed against the new contract)
- Matching Zod DagNode schema extensions + transform conditional spreads for the three Codex tuning fields

Validation:
- check:bundled up-to-date
- type-check clean across 10 packages
- lint 0 errors / 0 warnings
- format:check clean
- full per-package test suite green (workflows 5 batches, core 7, adapters 3, isolation 3; all exit 0)

Per CLAUDE.md fork policy, this merge stays on the spike branch. `dev` remains a pristine `upstream/dev` mirror and is not touched by this commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coleam00 mentioned this pull request Apr 18, 2026

fix: interactive loop resume crashes with error_during_execution (stale session) #1208

Open

coderabbitai Bot reviewed Apr 18, 2026

coleam00 merged commit 4c6ddd9 into dev Apr 18, 2026
4 checks passed

coleam00 deleted the fix/dag-executor-iserror-fail-loud branch April 18, 2026 20:02

coderabbitai Bot mentioned this pull request Apr 19, 2026

fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop #1294

Merged

coderabbitai Bot mentioned this pull request Apr 21, 2026

fix(workflows): filter user-plugin MCP noise out of workflow warnings #1327

Merged

3 tasks

narigondelsiglo mentioned this pull request Apr 22, 2026

feat(providers): retry 400 tool use concurrency errors from Claude SDK #1341

Open

prospapledge88 mentioned this pull request Apr 24, 2026

chore: cherry-pick Tier 1-2 upstream fixes (14 commits) prospapledge88/Archon#6

Open

6 tasks
