fix(claude): surface nested Claude Code hangs with timeout + diagnostics#1071
fix(claude): surface nested Claude Code hangs with timeout + diagnostics#1071
Conversation
Two changes addressing the silent-hang class of bug in #1067: 1. CLAUDECODE=1 startup warning in the CLI When `archon` is invoked from inside a Claude Code session (parent env has CLAUDECODE=1), print a warning with a pointer to #1067 and the `archon serve` workaround. Suppressable via ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1. 2. First-event timeout on the SDK query generator Wrap the Claude Agent SDK's `query()` async iterable so the first yielded message is raced against a 60-second timeout (configurable via ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS). If the subprocess spawns but produces no output within the window, log a diagnostic dump (subprocess env keys, parent CLAUDE_CODE_* keys, SDK model, stderr tail, process state) and throw a descriptive error pointing at #1067. Uses Promise.race on the first iterator.next() rather than relying on abortController.abort() alone — this unblocks us even if the SDK ignores the abort, which is the exact pathological case we're defending against. The reporter in #1067 saw 30+ minutes of silence; with this change they would have seen a clear error with actionable evidence within 60 seconds. Context: the hang does not reproduce on current dev in our environment (see the RCA at .agents/rca/issue-1067-nested-claude-hang.md), so this is defense + diagnostics rather than a root-cause fix. If the next reporter hits the same hang, the diagnostic dump will pinpoint the cause. Validation: - All 9 packages type-check clean, lint clean, format clean - 3 new unit tests for the timeout path (fast path, timeout path, error message discoverability) - E2E: `archon version` with CLAUDECODE=1 prints the warning; `archon workflow run archon-assist` still completes normally (timeout doesn't fire on fast path)
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis change introduces a first-event timeout mechanism for Claude API queries (default 60 seconds, configurable) with diagnostic logging when no initial event arrives within the deadline. Additionally, a CLI startup warning detects nested Claude execution and notifies users unless explicitly suppressed. Changes
Sequence DiagramsequenceDiagram
participant Client as ClaudeClient
participant Controller as AbortController
participant SDK as Anthropic SDK
participant Timeout as Timeout Handler
participant Logger as Diagnostic Logger
Client->>Timeout: withFirstMessageTimeout(iterator, timeoutMs)
Timeout->>SDK: Race first event yield
alt First Event Arrives in Time
SDK->>Timeout: yield chunk
Timeout->>Client: return chunk
else Timeout Expires
Timeout->>Controller: abort()
Timeout->>Logger: buildFirstEventHangDiagnostics()
Logger->>Logger: collect working dir, attempt, model, stderr, env vars
Logger->>Logger: log error with diagnostics
Timeout->>Client: throw TimeoutError with issue link
Client->>Client: detect firstEventTimedOut, rethrow without retry
end
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Superseded by #1092 (now merged). |
Follow-up to #1068. Addresses the silent-hang class of bug reported in #1067.
Context
#1067 reported that `archon workflow run` hangs at `dag_node_started` for 30+ minutes with no error. The RCA (in `.agents/rca/issue-1067-nested-claude-hang.md`) could not reproduce the hang on current `dev` or the v0.3.5 binary — Archon already strips `CLAUDECODE` from the subprocess env (`packages/core/src/clients/claude.ts:167-170`). The reporter's hang is likely environmental (keychain/TCC/SessionCreate state, macOS launchd, specific Claude Code CLI version, or an upstream SDK quirk we didn't trace).
Since we can't pinpoint the root cause, this PR adds defense in depth + diagnostics so the next user hitting this class of issue gets a loud, actionable failure in under 60 seconds instead of 30 minutes of silence.
Changes
1. CLAUDECODE=1 startup warning
`packages/cli/src/cli.ts` — prints a warning when `CLAUDECODE=1` is inherited from the parent env, pointing at #1067 and the `archon serve` workaround. Suppressable via `ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1`.
```
⚠ Detected CLAUDECODE=1 — you appear to be running `archon` from inside a Claude Code session.
If workflows hang silently at dag_node_started, this is a known class of issue.
Workaround: run `archon serve` from a regular shell and use the web UI or HTTP API.
Details: #1067
```
2. First-event timeout on the SDK query generator
`packages/core/src/clients/claude.ts` — wraps `query({ prompt, options })` in a `withFirstMessageTimeout` helper that races the first `iterator.next()` against a 60-second timeout (configurable via `ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS`).
If the SDK produces no output within the window:
Why Promise.race instead of just abortController: the pathological case we're defending against is "SDK ignores abort" — so we need an independent unblocking mechanism. Promise.race guarantees the first-event wait resolves within the timeout regardless of SDK behavior.
The wrapper is zero-overhead on the happy path: after the first message arrives, it just forwards `iterator.next()` calls directly.
Validation
Files
Rollback
Single commit, no schema or API changes. `git revert ` cleanly reverts all three changes. The timeout is configurable via env var, so operators can set it to a very high value if they want to effectively disable it without a revert.
Not in this PR
Summary by CodeRabbit