fix(claude): surface nested Claude Code hangs with timeout + diagnostics #1071

Closed
Wirasm wants to merge 1 commit into dev from fix/claude-hang-diagnostics

Wirasm (Collaborator) commented Apr 11, 2026

Follow-up to #1068. Addresses the silent-hang class of bug reported in #1067.

Context

#1067 reported that `archon workflow run` hangs at `dag_node_started` for 30+ minutes with no error. The RCA (in `.agents/rca/issue-1067-nested-claude-hang.md`) could not reproduce the hang on current `dev` or the v0.3.5 binary — Archon already strips `CLAUDECODE` from the subprocess env (`packages/core/src/clients/claude.ts:167-170`). The reporter's hang is likely environmental (keychain/TCC/SessionCreate state, macOS launchd, specific Claude Code CLI version, or an upstream SDK quirk we didn't trace).

Since we can't pinpoint the root cause, this PR adds defense in depth + diagnostics so the next user hitting this class of issue gets a loud, actionable failure in under 60 seconds instead of 30 minutes of silence.

Changes

1. CLAUDECODE=1 startup warning

`packages/cli/src/cli.ts` — prints a warning when `CLAUDECODE=1` is inherited from the parent env, pointing at #1067 and the `archon serve` workaround. Suppressable via `ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1`.

```
⚠ Detected CLAUDECODE=1 — you appear to be running `archon` from inside a Claude Code session.
If workflows hang silently at dag_node_started, this is a known class of issue.
Workaround: run `archon serve` from a regular shell and use the web UI or HTTP API.
Details: #1067
```
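The check behind this warning can be sketched as a small pure function. This is a minimal illustration, not the code in `packages/cli/src/cli.ts`: the function name `shouldWarnAboutNestedClaude` is hypothetical, and it assumes the suppression variable must be exactly `1` to take effect, which the description implies but does not state.

```typescript
// Hypothetical helper for illustration; the real check in cli.ts may differ.
function shouldWarnAboutNestedClaude(
  env: Record<string, string | undefined>,
): boolean {
  // Warn only when CLAUDECODE=1 was inherited from the parent environment.
  if (env["CLAUDECODE"] !== "1") return false;
  // Assumption: only the exact value "1" suppresses the warning.
  if (env["ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING"] === "1") return false;
  return true;
}
```

At CLI startup this would be called with `process.env`, printing the warning text above via `console.warn` when it returns `true`.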

2. First-event timeout on the SDK query generator

`packages/core/src/clients/claude.ts` — wraps `query({ prompt, options })` in a `withFirstMessageTimeout` helper that races the first `iterator.next()` against a 60-second timeout (configurable via `ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS`).

If the SDK produces no output within the window:

  1. Log `claude.first_event_timeout` at error level with a diagnostic dump:
    • Subprocess env keys (names only, no values — no secrets leaked)
    • Parent process keys matching `CLAUDECODE`, `CLAUDE_CODE_`, `ANTHROPIC_`
    • SDK model, process platform, UID, TTY state, CLAUDECODE value, CLAUDE_CODE_ENTRYPOINT
    • Last 20 stderr lines from the subprocess
  2. Call `controller.abort()` (best effort — the SDK may or may not respond)
  3. Throw `Claude Code subprocess produced no output within {N}ms. See logs... Details: https://github.com/coleam00/Archon/issues/1067`
  4. Do NOT retry — a wedged subprocess will produce the same hang again
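The diagnostic dump in step 1 could be assembled roughly as follows. This is a sketch under assumptions, not the `buildFirstEventHangDiagnostics` from this PR: the name `sketchHangDiagnostics`, the field names, and the choice of prefix matching for the parent keys are all illustrative. It covers only the env-key and stderr parts of the dump.

```typescript
// Illustrative shape only; field names are assumptions, not the PR's schema.
interface HangDiagnosticsSketch {
  subprocessEnvKeys: string[];
  parentClaudeKeys: string[];
  stderrTail: string[];
}

// Patterns listed in the PR description; prefix matching is an assumption.
const PARENT_KEY_PREFIXES = ["CLAUDECODE", "CLAUDE_CODE_", "ANTHROPIC_"];

function sketchHangDiagnostics(
  subprocessEnv: Record<string, string | undefined>,
  parentEnv: Record<string, string | undefined>,
  stderrLines: string[],
): HangDiagnosticsSketch {
  return {
    // Names only: values are never read, so secrets cannot leak into logs.
    subprocessEnvKeys: Object.keys(subprocessEnv).sort(),
    parentClaudeKeys: Object.keys(parentEnv)
      .filter((key) => PARENT_KEY_PREFIXES.some((p) => key.startsWith(p)))
      .sort(),
    // Keep only the last 20 stderr lines from the subprocess.
    stderrTail: stderrLines.slice(-20),
  };
}
```

Collecting key names rather than entries keeps the dump safe to paste into a public issue.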

Why `Promise.race` instead of just `abortController`: the pathological case we're defending against is "SDK ignores abort", so we need an independent unblocking mechanism. `Promise.race` guarantees the first-event wait settles within the timeout regardless of SDK behavior, because the timeout rejection wins the race even if `iterator.next()` never returns.

The wrapper is zero-overhead on the happy path: after the first message arrives, it just forwards `iterator.next()` calls directly.
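A minimal sketch of the race described above, assuming nothing about the SDK beyond it returning an `AsyncIterable`. The names `firstMessageTimeoutSketch` and `FirstEventTimeoutError` are hypothetical, and the `onTimeout` callback stands in for the logging and `controller.abort()` steps; the actual `withFirstMessageTimeout` may be structured differently.

```typescript
// Hypothetical error type so callers can recognise the timeout and skip retries.
class FirstEventTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`Claude Code subprocess produced no output within ${timeoutMs}ms.`);
    this.name = "FirstEventTimeoutError";
  }
}

async function* firstMessageTimeoutSketch<T>(
  source: AsyncIterable<T>,
  timeoutMs: number,
  onTimeout: () => void, // e.g. log diagnostics, then controller.abort()
): AsyncGenerator<T> {
  const iterator = source[Symbol.asyncIterator]();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new FirstEventTimeoutError(timeoutMs)), timeoutMs);
  });

  // Only the first next() is raced; the timer is cleared either way.
  const first = await Promise.race([iterator.next(), timeout])
    .catch((err: unknown) => {
      if (err instanceof FirstEventTimeoutError) onTimeout();
      throw err;
    })
    .finally(() => clearTimeout(timer));

  if (first.done) return;
  yield first.value;

  // Happy path after the first message: forward next() calls directly.
  while (true) {
    const next = await iterator.next();
    if (next.done) return;
    yield next.value;
  }
}
```

On timeout the sketch deliberately does not await `iterator.return()`: if the source is wedged, a cleanup call could block in the same way the first `next()` did.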

Validation

Files

  • `packages/cli/src/cli.ts` — CLAUDECODE warning (+15 lines)
  • `packages/core/src/clients/claude.ts` — `withFirstMessageTimeout`, `getFirstEventTimeoutMs`, `buildFirstEventHangDiagnostics`, wire-in (+137 lines)
  • `packages/core/src/clients/claude.test.ts` — 3 new tests (+87 lines)

Rollback

Single commit, no schema or API changes. `git revert ` cleanly reverts all three changes. The timeout is configurable via env var, so operators can set it to a very high value if they want to effectively disable it without a revert.
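One caveat when choosing "a very high value": Node's `setTimeout` treats any delay above 2147483647 ms (about 24.8 days) as 1 ms, so an unclamped oversized value would fire almost immediately instead of disabling the timeout. The sketch below shows one way to parse and clamp the variable. The name `parseFirstEventTimeoutMs` is hypothetical, and whether the PR's `getFirstEventTimeoutMs` clamps or falls back like this is not stated here.

```typescript
const DEFAULT_FIRST_EVENT_TIMEOUT_MS = 60_000;
// Largest delay setTimeout honours (2^31 - 1 ms); larger values become 1 ms in Node.
const MAX_TIMER_DELAY_MS = 2_147_483_647;

// Hypothetical parser for ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS.
function parseFirstEventTimeoutMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_FIRST_EVENT_TIMEOUT_MS;
  const parsed = Number(raw);
  // Assumption: invalid or non-positive input falls back to the default.
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_FIRST_EVENT_TIMEOUT_MS;
  return Math.min(Math.ceil(parsed), MAX_TIMER_DELAY_MS);
}
```

With this shape, setting the variable to `2147483647` is the practical way to effectively disable the timeout.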

Not in this PR

Summary by CodeRabbit

  • Bug Fixes
    • Added timeout detection for Claude client when responses aren't received within expected timeframes, including improved diagnostic information in error messages.
    • Added CLI warning in specific environment configurations to help users troubleshoot setup issues.

Two changes addressing the silent-hang class of bug in #1067:

1. CLAUDECODE=1 startup warning in the CLI

When `archon` is invoked from inside a Claude Code session (parent env has
CLAUDECODE=1), print a warning with a pointer to #1067 and the
`archon serve` workaround. Suppressable via
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1.

2. First-event timeout on the SDK query generator

Wrap the Claude Agent SDK's `query()` async iterable so the first yielded
message is raced against a 60-second timeout (configurable via
ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS). If the subprocess spawns but
produces no output within the window, log a diagnostic dump
(subprocess env keys, parent CLAUDE_CODE_* keys, SDK model, stderr tail,
process state) and throw a descriptive error pointing at #1067.

Uses Promise.race on the first iterator.next() rather than relying on
abortController.abort() alone — this unblocks us even if the SDK ignores
the abort, which is the exact pathological case we're defending against.
The reporter in #1067 saw 30+ minutes of silence; with this change they
would have seen a clear error with actionable evidence within 60 seconds.

Context: the hang does not reproduce on current dev in our environment
(see the RCA at .agents/rca/issue-1067-nested-claude-hang.md), so this
is defense + diagnostics rather than a root-cause fix. If the next
reporter hits the same hang, the diagnostic dump will pinpoint the cause.

Validation:
- All 9 packages type-check clean, lint clean, format clean
- 3 new unit tests for the timeout path (fast path, timeout path,
  error message discoverability)
- E2E: `archon version` with CLAUDECODE=1 prints the warning;
  `archon workflow run archon-assist` still completes normally
  (timeout doesn't fire on fast path)

coderabbitai (Bot) commented Apr 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6694a4cf-4816-47c8-9cd0-7db1f4b7299a

📥 Commits

Reviewing files that changed from the base of the PR and between 536584d and 17e47f3.

📒 Files selected for processing (3)
  • packages/cli/src/cli.ts
  • packages/core/src/clients/claude.test.ts
  • packages/core/src/clients/claude.ts

📝 Walkthrough

This change introduces a first-event timeout mechanism for Claude API queries (default 60 seconds, configurable) with diagnostic logging when no initial event arrives within the deadline. Additionally, a CLI startup warning detects nested Claude execution and notifies users unless explicitly suppressed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **CLI Startup Warning**<br>`packages/cli/src/cli.ts` | Detects nested Claude execution via `CLAUDECODE=1` environment variable and conditionally emits a `console.warn` message with guidance unless `ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING` is set. |
| **Claude First-Event Timeout**<br>`packages/core/src/clients/claude.ts`, `packages/core/src/clients/claude.test.ts` | Implements first streamed message timeout (default 60s, configurable via `ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS`) with `withFirstMessageTimeout()` race logic, `buildFirstEventHangDiagnostics()` for diagnostic snapshots, and error-level logging on timeout. Retry logic now detects and rethrows timeout-specific errors without retry. Tests cover timeout expiration scenarios, immediate yield validation, and error message verification including issue references. |

Sequence Diagram

sequenceDiagram
    participant Client as ClaudeClient
    participant Controller as AbortController
    participant SDK as Anthropic SDK
    participant Timeout as Timeout Handler
    participant Logger as Diagnostic Logger

    Client->>Timeout: withFirstMessageTimeout(iterator, timeoutMs)
    Timeout->>SDK: Race first event yield
    alt First Event Arrives in Time
        SDK->>Timeout: yield chunk
        Timeout->>Client: return chunk
    else Timeout Expires
        Timeout->>Controller: abort()
        Timeout->>Logger: buildFirstEventHangDiagnostics()
        Logger->>Logger: collect working dir, attempt, model, stderr, env vars
        Logger->>Logger: log error with diagnostics
        Timeout->>Client: throw TimeoutError with issue link
        Client->>Client: detect firstEventTimedOut, rethrow without retry
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 twitches nose knowingly
A timeout waits with patient grace,
While Claude takes time to find its place,
With warnings bright for nested calls,
The diagnostics catch it all! ⏰✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding timeout and diagnostics for nested Claude Code hangs, which is the core focus of the PR. |
| Description check | ✅ Passed | The description covers all critical template sections: context/problem, specific changes with code examples, validation evidence, backward compatibility, and rollback plan. It is comprehensive and well-structured. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Wirasm (Collaborator, Author) commented Apr 12, 2026

Superseded by #1092 which consolidates this PR and #1068 into a single PR with docs updates.

Wirasm (Collaborator, Author) commented Apr 12, 2026

Superseded by #1092 (now merged).

Wirasm closed this Apr 12, 2026