fix(workflows): fail loudly on SDK isError results in DAG and loop nodes#1291

Merged
coleam00 merged 1 commit into dev from fix/dag-executor-iserror-fail-loud
Apr 18, 2026

@coleam00
Owner

@coleam00 coleam00 commented Apr 18, 2026

Summary

  • dag-executor previously failed nodes only on error_max_budget_usd; every other isError: true SDK result (including error_during_execution) silently broke out of the stream, leaving partial or empty output.
  • This likely explains the "5-second crash" symptom in #1208 (fix: interactive loop resume crashes with error_during_execution (stale session)): failed iterations finish instantly with empty text, the loop keeps going, and only the claude.result_is_error log line reveals that something broke.
  • This PR is the observability + fail-loud step — prerequisite to any retry work. No auto-retry here.

Changes

  • packages/providers/src/types.ts — add optional errors?: string[] to the result MessageChunk.
  • packages/providers/src/claude/provider.ts — capture the SDK's errors: string[] array (previously discarded), include it and stopReason in the claude.result_is_error log, and pass it through to consumers.
  • packages/workflows/src/dag-executor.ts — two sites:
    • General node path: after the existing budget-cap case, throw on any other isError: true with a message including the subtype and SDK errors detail.
    • Loop iteration path: add the same check so failed iterations raise instead of silently breaking. Surrounding try/catch already maps this to a clean loop_iteration_failed event + { state: 'failed' } return.
  • Tests for both paths using the error_during_execution subtype with a populated errors array.
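
A minimal sketch of the general node path described above, assuming a simplified result chunk. Only the errors field and the two subtypes come from this PR; the interface shape, the function name assertResultOk, and the message wording are hypothetical and not the actual dag-executor code.

```typescript
// Hypothetical, simplified result chunk. Only `errors?: string[]` is
// confirmed by the PR description; the other fields are assumptions.
interface ResultChunk {
  type: 'result';
  isError?: boolean;
  errorSubtype?: string;
  errors?: string[];
}

// Throws on any isError result instead of breaking out with partial output.
// (Hypothetical function name; the real check lives inline in dag-executor.)
function assertResultOk(nodeId: string, msg: ResultChunk): void {
  if (!msg.isError) return;

  if (msg.errorSubtype === 'error_max_budget_usd') {
    // Pre-existing budget-cap case; behavior unchanged by the PR.
    throw new Error(`Node '${nodeId}' exceeded its budget cap`);
  }

  // New catch-all: include the subtype and the SDK's error detail.
  const subtype = msg.errorSubtype ?? 'unknown';
  const detail =
    msg.errors && msg.errors.length > 0 ? `: ${msg.errors.join('; ')}` : '';
  throw new Error(`Node '${nodeId}' failed: SDK returned ${subtype}${detail}`);
}
```

Keeping the budget check first means the existing budget-path message stays stable, while every other subtype falls through to the generic throw.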

Why not auto-retry with fresh session?

See my analysis on #1208 and #1121. Short version:

  • error_during_execution is an SDK catch-all — can be stale session, tool error, MCP crash, token refresh, network interruption. Treating all of them as "reset session and retry" would mask several distinct root causes and regress context continuity across approval gates.
  • The existing unit test at dag-executor.test.ts:3533 intentionally asserts session passing on interactive loop resume — that's the designed behavior.
  • Without this PR's observability fix first, we can't even see what error_during_execution carries in the reporter's environment, so any retry heuristic would be speculative.

PR #1121 takes a similar retry-on-stale-session approach in the orchestrator chat path, where it is correct (different error shape, different layer). See comment on #1121 for the scope note.

Test plan

  • bun run type-check — clean
  • bun run lint — clean
  • bun run format:check — clean
  • New unit test: fails node when SDK returns error_during_execution result
  • New unit test: loop iteration fails loudly when SDK returns error_during_execution
  • Existing error_max_budget_usd tests still pass (behavior unchanged for budget path)
  • Existing interactive-loop-resume test still passes (no regression in session-threading design)
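
The scenario the two new unit tests cover can be approximated without a test framework. The sketch below uses a fake stream and a stand-in runNode function; both names, the chunk shape, and the error text 'session not found' are assumptions for illustration, not the real test fixtures in dag-executor.test.ts.

```typescript
// Hypothetical chunk union; the real MessageChunk has more variants.
type Chunk =
  | { type: 'assistant'; content: string }
  | { type: 'result'; isError?: boolean; errorSubtype?: string; errors?: string[] };

// Fake provider stream: some partial text, then a failing result.
async function* failingStream(): AsyncGenerator<Chunk> {
  yield { type: 'assistant', content: 'partial output' };
  yield {
    type: 'result',
    isError: true,
    errorSubtype: 'error_during_execution',
    errors: ['session not found'],
  };
}

// Stand-in for the executor's node path (hypothetical name and shape).
async function runNode(stream: AsyncIterable<Chunk>): Promise<string> {
  let output = '';
  for await (const msg of stream) {
    if (msg.type === 'assistant') output += msg.content;
    if (msg.type === 'result') {
      if (msg.isError) {
        const detail = (msg.errors ?? []).join('; ');
        throw new Error(`SDK returned ${msg.errorSubtype ?? 'unknown'}: ${detail}`);
      }
      break; // Successful result: stop reading the stream.
    }
  }
  return output;
}

// Runs the scenario and returns the failure message, or null if nothing threw.
async function failureMessage(): Promise<string | null> {
  try {
    await runNode(failingStream());
    return null;
  } catch (err) {
    return (err as Error).message;
  }
}
```

A regression test would assert that failureMessage() resolves to a string naming the subtype and the SDK detail, rather than to null with the partial output returned as if the node had succeeded.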

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved error handling during workflow execution—SDK error details are now properly captured and reported.
    • Workflow nodes and loop iterations now explicitly fail when encountering SDK errors with detailed error messages.
  • Tests

    • Added regression tests for SDK error handling in node and loop execution scenarios.

Previously, `dag-executor` only failed nodes/iterations when the SDK
returned an `error_max_budget_usd` result. Every other `isError: true`
subtype — including `error_during_execution` — was silently `break`ed
out of the stream with whatever partial output had accumulated, letting
failed runs masquerade as successful ones with empty output.

This is the most likely explanation for the "5-second crash" symptom in
#1208: iterations finish instantly with empty text, the loop keeps
going, and only the `claude.result_is_error` log tips the user off.

Changes:
- Capture the SDK's `errors: string[]` detail on result messages
  (previously discarded) and surface it through `MessageChunk.errors`.
- Log `errors`, `stopReason` alongside `errorSubtype` in
  `claude.result_is_error` so users can see what actually failed.
- Throw from both the general node path and the loop iteration path
  on any `isError: true` result, including the subtype and SDK errors
  detail in the thrown message.

Note: this does not implement auto-retry. See PR comments on #1121 and
the analysis on #1208 — a retry-with-fresh-session approach for loop
iterations is not obviously correct until we see what
`error_during_execution` actually carries in the reporter's env.
This change is the observability + fail-loud step that has to come
first so that signal is no longer silent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 18, 2026

Walkthrough

The changes extend error handling throughout the Claude provider and DAG executor. The provider now captures SDK-provided error strings in the result MessageChunk when errors occur, while the executor explicitly handles non-budget SDK errors by logging detailed context and throwing failures for nodes and loop iterations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Claude Provider Error Extension: packages/providers/src/claude/provider.ts, packages/providers/src/types.ts | Extended MessageChunk result variant with optional errors?: string[] field; provider now extracts sdkErrors from resultMsg.errors and conditionally includes them in emitted chunks when is_error is true. |
| DAG Executor Error Handling: packages/workflows/src/dag-executor.ts | Added explicit checks for msg.isError in both node and loop iteration execution paths; logs error context (subtype, errors, sessionId, stopReason, elapsed duration) and throws to fail operations instead of silently breaking the stream. |
| Error Handling Test Coverage: packages/workflows/src/dag-executor.test.ts | Added two regression tests verifying that error_during_execution SDK results cause immediate executor failure with proper event emissions containing the subtype and error details for both node and loop execution scenarios. |
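
The provider-side change summarized above could look roughly like the following. The names resultMsg, sdkErrors, errors, and is_error come from the walkthrough; the interface shapes, the function name toResultChunk, and the extra filter that keeps only string entries are assumptions made for this sketch.

```typescript
// Hypothetical slice of the SDK result message. `errors` is typed as
// unknown here so the guard below has something to do.
interface SdkResultMessage {
  is_error?: boolean;
  subtype?: string;
  errors?: unknown;
}

// Hypothetical, simplified result chunk emitted to consumers.
interface ResultChunk {
  type: 'result';
  isError?: boolean;
  errorSubtype?: string;
  errors?: string[];
}

// Maps an SDK result message to a chunk (hypothetical function name).
function toResultChunk(resultMsg: SdkResultMessage): ResultChunk {
  // Accept only a real array, and (as an assumption) only string entries.
  const sdkErrors = Array.isArray(resultMsg.errors)
    ? resultMsg.errors.filter((e): e is string => typeof e === 'string')
    : undefined;

  return {
    type: 'result',
    // Error fields are attached only when the SDK flags the result as an error.
    ...(resultMsg.is_error
      ? { isError: true, errorSubtype: resultMsg.subtype }
      : {}),
    ...(resultMsg.is_error && sdkErrors ? { errors: sdkErrors } : {}),
  };
}
```

Conditional spreads keep successful results free of an empty errors key, so consumers can treat the presence of errors as meaningful.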

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix(workflows): fail loudly on SDK isError results in DAG and loop nodes' directly matches the PR's main objective: adding error handling to surface failures instead of silently breaking. |
| Description check | ✅ Passed | The description provides a clear summary, specific file changes, rationale for not auto-retrying, and test plan, all addressing the required sections comprehensively. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
packages/workflows/src/dag-executor.ts (1)

1648-1669: Loop iteration fail-loud matches node path — looks correct.

Throwing inside the for await is caught by the enclosing try at Line 1572 / catch at Line 1742, which emits loop_iteration_failed with err.message (includes subtype and joined errors) and returns { state: 'failed' }, so the loop stops instead of burning iterations on repeated error_during_execution — which directly addresses the #1208 scenario described in the comment.
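
The control flow described in this paragraph can be sketched as follows. The loop_iteration_failed event name and the { state: 'failed' } return value are taken from the PR; the function name runLoop, its parameters, and the 'completed' state are hypothetical simplifications of the real loop executor.

```typescript
// 'failed' matches the PR; 'completed' is an assumed success state.
type LoopResult = { state: 'completed' } | { state: 'failed' };

// Hypothetical, simplified outcome of one iteration.
interface IterationResult {
  isError?: boolean;
  errorSubtype?: string;
  errors?: string[];
}

interface LoopEvent {
  type: 'loop_iteration_failed';
  iteration: number;
  error: string;
}

async function runLoop(
  maxIterations: number,
  runIteration: (i: number) => Promise<IterationResult>,
  emit: (event: LoopEvent) => void
): Promise<LoopResult> {
  for (let i = 1; i <= maxIterations; i++) {
    try {
      const result = await runIteration(i);
      if (result.isError) {
        // Fail loudly: the enclosing catch turns this into a clean failure.
        const detail = (result.errors ?? []).join('; ');
        throw new Error(
          `Iteration ${i} failed: SDK returned ${result.errorSubtype ?? 'unknown'} (${detail})`
        );
      }
    } catch (err) {
      emit({ type: 'loop_iteration_failed', iteration: i, error: (err as Error).message });
      return { state: 'failed' }; // Stop instead of burning further iterations.
    }
  }
  return { state: 'completed' };
}
```

With an iteration callback that always reports an error, the loop runs exactly once, emits a single loop_iteration_failed event, and returns the failed state.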

Two small, optional points:

  1. The node path (Lines 767–785) and this block are nearly identical. Consider extracting a small helper like buildSdkErrorFromResult(msg, { nodeId, iteration? }) returning { message, logContext } to keep them in sync as the error shape evolves.
  2. Unlike the node path, the log payload here doesn't include durationMs (iteration elapsed). Since iterationStart is in scope, adding durationMs: Date.now() - iterationStart would make loop failures as diagnosable as node failures.
♻️ Optional: add iteration duration to the log
             getLog().error(
               {
                 nodeId: node.id,
                 iteration: i,
                 errorSubtype: subtype,
                 errors: msg.errors,
                 sessionId: msg.sessionId,
                 stopReason: msg.stopReason,
+                durationMs: Date.now() - iterationStart,
               },
               'loop_node.iteration_sdk_error'
             );
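
One possible shape for the buildSdkErrorFromResult helper proposed in point 1 is sketched below. The helper name and its { message, logContext } return value come from the review comment; the field names on the input types, the startedAt and now parameters, and the message wording are assumptions, and the helper was not part of the merged change.

```typescript
// Hypothetical slice of the result chunk used by both call sites.
interface SdkErrorResult {
  errorSubtype?: string;
  errors?: string[];
  sessionId?: string;
  stopReason?: string;
}

// `startedAt` is an assumed addition so durationMs can be computed here.
interface SdkErrorContext {
  nodeId: string;
  iteration?: number;
  startedAt: number;
}

function buildSdkErrorFromResult(
  msg: SdkErrorResult,
  ctx: SdkErrorContext,
  now: number = Date.now() // Injectable clock, for deterministic tests.
): { message: string; logContext: Record<string, unknown> } {
  const subtype = msg.errorSubtype ?? 'unknown';
  const detail = msg.errors && msg.errors.length > 0 ? ` (${msg.errors.join('; ')})` : '';
  const where =
    ctx.iteration === undefined
      ? `Node '${ctx.nodeId}'`
      : `Node '${ctx.nodeId}' iteration ${ctx.iteration}`;

  return {
    message: `${where} failed: SDK returned ${subtype}${detail}`,
    logContext: {
      nodeId: ctx.nodeId,
      // Only loop call sites pass an iteration number.
      ...(ctx.iteration !== undefined ? { iteration: ctx.iteration } : {}),
      errorSubtype: subtype,
      errors: msg.errors,
      sessionId: msg.sessionId,
      stopReason: msg.stopReason,
      durationMs: now - ctx.startedAt,
    },
  };
}
```

Both the node path and the loop path could then log logContext and throw new Error(message), which would also give loop failures the durationMs field that point 2 asks for.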
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workflows/src/dag-executor.ts` around lines 1648 - 1669, The loop
error-handling duplicates logic from the node path; extract a helper (e.g.,
buildSdkErrorFromResult(msg, { nodeId, iteration? })) that returns { message,
logContext } and use it here to build the thrown Error and the getLog().error
payload so both places stay in sync; also add durationMs: Date.now() -
iterationStart to the logContext in the loop block (use iterationStart in scope)
so the log payload matches the node path and the thrown Error message uses the
helper's message (including subtype and joined errors).
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a7b17bce-ab9d-4e07-9ebd-52d9c52c16fe

📥 Commits

Reviewing files that changed from the base of the PR and between d89bc76 and dfbb4ac.

📒 Files selected for processing (4)
  • packages/providers/src/claude/provider.ts
  • packages/providers/src/types.ts
  • packages/workflows/src/dag-executor.test.ts
  • packages/workflows/src/dag-executor.ts

@coleam00
Owner Author

Archon PR Validation Report

Verdict: APPROVE

Summary

All four claimed gaps confirmed on main and verified fixed on the feature branch. The fix is minimal, follows existing patterns (budget-cap throw structure), and includes focused regression tests for both DAG node and loop iteration paths. No issues found, no regressions.

Bug Confirmation

| Claim | Main | Feature |
|---|---|---|
| General node silently breaks on non-budget isError | Confirmed | Fixed: catch-all throw after budget check |
| Loop iteration has zero isError checking | Confirmed | Fixed: equivalent throw, stops loop on failure |
| SDK errors[] discarded by provider | Confirmed | Fixed: captured with Array.isArray guard |
| MessageChunk lacks errors field | Confirmed | Fixed: errors?: string[] added |

Issues

No blocking issues found.

What's Done Well

  • Scope discipline: fail-loud first, retry later (once error shapes are observable)
  • Pattern consistency with existing budget-cap handling
  • Enriched logging (stopReason + errors) closes the observability gap
  • Both regression tests are focused and effective

Fix Quality: 5/5 | CLAUDE.md Compliance: Full


Validated by archon-validate-pr workflow

@coleam00 coleam00 merged commit 4c6ddd9 into dev Apr 18, 2026
4 checks passed
@coleam00 coleam00 deleted the fix/dag-executor-iserror-fail-loud branch April 18, 2026 20:02
matzls added a commit to matzls/Archon that referenced this pull request Apr 19, 2026
…ly features

Merge upstream commit 4c6ddd9 (fix(workflows): fail loudly on SDK isError
results, coleam00#1291) into spike/providers-refactor, bringing in the
`IAssistantClient` -> `IAgentProvider` refactor and its associated
loud-failure change for SDK `isError` results.

Upstream changes absorbed:
- IAssistantClient / WorkflowAssistantOptions -> IAgentProvider /
  SendQueryOptions + nodeConfig + assistantConfig (typed providers)
- getAssistantClient -> getAgentProvider factory in WorkflowDeps
- `errors: string[]` surfaced through MessageChunk; loud failure on all
  `isError: true` result subtypes (not just `error_max_budget_usd`)
- New @archon/providers contract layer + @archon/providers/types subpath

Fork-only features restored on top of the refactor:
- Durable-progress loop tracking + "failed after partial execution"
  wording with { node_counts, failed_nodes } metadata on the 2-arg
  `failWorkflowRun` call (source: fork commit 5f2377e)
- Workflow-level Codex tuning (modelReasoningEffort, webSearchMode,
  additionalDirectories) expressed as top-level AgentRequestOptions
  fields, with node > workflow > config precedence; BASH_NODE_AI_FIELDS
  extended so loader warns on these on non-AI nodes (source: fork
  commit b6c1905, re-expressed against the new contract)
- Matching Zod DagNode schema extensions + transform conditional spreads
  for the three Codex tuning fields

Validation:
- check:bundled up-to-date
- type-check clean across 10 packages
- lint 0 errors / 0 warnings
- format:check clean
- full per-package test suite green (workflows 5 batches, core 7,
  adapters 3, isolation 3; all exit 0)

Per CLAUDE.md fork policy, this merge stays on the spike branch. `dev`
remains a pristine `upstream/dev` mirror and is not touched by this
commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>