feat(orchestrator): auto-reset stale Claude SDK sessions + bare 'reset' command for Slack #1121
mhooooo wants to merge 1 commit into coleam00:dev
…lack

When the Claude Code SDK rejects a resume attempt with "No conversation found" (the SDK session ID is gone), the orchestrator now transparently resets the session and retries the query instead of surfacing an error that the user has to /reset manually.

Also accepts bare 'reset' without the leading slash on Slack, since Slack intercepts /reset as its own slash command.

Changes:
- claude.ts: classify stale_session as a non-retryable error class (checked before 'crash', so specific wins over generic); export STALE_SESSION_PATTERNS as the single source of truth for both the classifier and the orchestrator's isStaleSessionError() helper
- session-transitions.ts: new 'stale-session-cleared' transition (deactivates, so the next message creates a fresh session)
- orchestrator-agent.ts: isStaleSessionError() helper; SLACK_BARE_COMMANDS normalization scoped to Slack platform only; handleStreamMode and handleBatchMode wrap their AI query loops in runStreamQuery() / runBatchQuery() functions so a catch block can reset sessionForQuery and re-run with the fresh session ID; state reset before retry (allMessages/allChunks/assistantMessages/commandDetected) so partial content from the failed attempt never bleeds into the fresh response
- claude.test.ts: stale_session classification tests, including priority over 'crash' on overlapping error messages, and .cause assertions
- orchestrator.test.ts: parameterized stream/batch retry tests covering successful reset+retry, no-third-retry guard, null-session skip, and fresh session ID assertion on retry

Ported from the dynamous/remote-coding-agent fork (commit 229217cf) with the following intentional deltas against the new v0.3.5 base:
- Dropped defaultCodebase auto-scoping block (Patch 1 not carried: CONFIG-REPLACEABLE per investigation verdict)
- Slack bare-command normalization scoped to Slack platform only (fork shipped unscoped initially; this change came in a later review-findings sub-commit)
- runStreamQuery/runBatchQuery keep upstream's 4-arg aiClient.sendQuery signature including requestOptions (fork was on pre-v0.3.2 3-arg shape)
- Upstream deterministic command list preserved (help, status, reset, workflow, register-project, update-project, remove-project, commands, init, worktree); fork only had 5
- No CHANGELOG / bun.lock / package.json / docs changes; those will be rebuilt on top of the v0.3.5 base

Upstream-PR candidate for coleam00/Archon.
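The reset-and-retry flow described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `queryWithStaleSessionRetry`, the `SendQuery` type, and the `clearSession` and `isStale` callbacks are hypothetical stand-ins for `runStreamQuery()` / `runBatchQuery()`, `aiClient.sendQuery`, the `'stale-session-cleared'` transition, and `isStaleSessionError()`.

```typescript
// Hypothetical stand-in for the provider call: resumes `sessionId` when it is
// non-null and yields text chunks as they stream in.
type SendQuery = (prompt: string, sessionId: string | null) => AsyncIterable<string>;

async function queryWithStaleSessionRetry(
  sendQuery: SendQuery,
  prompt: string,
  sessionId: string | null,
  clearSession: () => Promise<void>, // assumed to deactivate the stored session
  isStale: (err: unknown) => boolean // assumed stale-session predicate
): Promise<string[]> {
  let chunks: string[] = [];

  // The query loop lives in a function so the catch block can re-run it.
  const runQuery = async (resumeId: string | null): Promise<void> => {
    for await (const chunk of sendQuery(prompt, resumeId)) {
      chunks.push(chunk);
    }
  };

  try {
    await runQuery(sessionId);
  } catch (err) {
    // Nothing to reset when no session was being resumed, and other error
    // classes are not this wrapper's concern.
    if (sessionId === null || !isStale(err)) throw err;
    await clearSession();
    chunks = []; // drop partial output from the failed attempt
    await runQuery(null); // single retry; a second failure propagates
  }
  return chunks;
}
```

Because the retry call sits inside the catch block rather than a loop, a failure on the fresh session is thrown to the caller, which matches the no-third-retry guard the tests describe.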
|
Related to #1208 — overlapping area or partial fix. |
Scope note: this PR does not fix #1208
PR #1121 is a solid fix for the orchestrator chat path's stale-session problem and worth merging on its own merits. But to set expectations cleanly, it should not be framed as the fix for #1208. Three concrete gaps prevent it from reaching that failure mode.
1. Different layer
PR #1121's retry wrapper lives in
2. Different error delivery shape
PR #1121 catches thrown errors. The #1208 failure arrives as a yielded result message with
For reference,
3. Different pattern match
Independent evidence that #1208 is not a stale-session rejection
Suggested framing
Keep this PR focused on the orchestrator chat path. If you want to address #1208 with the same pattern, it's a separate change in |
Previously, `dag-executor` only failed nodes/iterations when the SDK returned an `error_max_budget_usd` result. Every other `isError: true` subtype, including `error_during_execution`, was silently `break`ed out of the stream with whatever partial output had accumulated, letting failed runs masquerade as successful ones with empty output.

This is the most likely explanation for the "5-second crash" symptom in #1208: iterations finish instantly with empty text, the loop keeps going, and only the `claude.result_is_error` log tips the user off.

Changes:
- Capture the SDK's `errors: string[]` detail on result messages (previously discarded) and surface it through `MessageChunk.errors`.
- Log `errors` and `stopReason` alongside `errorSubtype` in `claude.result_is_error` so users can see what actually failed.
- Throw from both the general node path and the loop iteration path on any `isError: true` result, including the subtype and SDK errors detail in the thrown message.

Note: this does not implement auto-retry. See PR comments on #1121 and the analysis on #1208: a retry-with-fresh-session approach for loop iterations is not obviously correct until we see what `error_during_execution` actually carries in the reporter's env. This change is the observability + fail-loud step that has to come first so that signal is no longer silent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
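As a rough illustration of the fail-loud step this commit describes, the check could look something like the sketch below. The `ResultChunk` shape and the `assertResultOk` helper are assumptions made for the example; only the `isError`, `errorSubtype`, and `errors` field names come from the commit message, and the real `MessageChunk` type may differ.

```typescript
// Simplified, hypothetical view of a result chunk.
interface ResultChunk {
  type: 'result';
  isError: boolean;
  errorSubtype?: string;
  errors?: string[];
}

// Throws on any error result instead of breaking out of the stream silently,
// so a failed node or iteration cannot pass as a success with empty output.
function assertResultOk(chunk: ResultChunk, nodeId: string): void {
  if (!chunk.isError) return;
  const subtype = chunk.errorSubtype ?? 'unknown';
  const detail =
    chunk.errors && chunk.errors.length > 0 ? `: ${chunk.errors.join('; ')}` : '';
  throw new Error(`Node '${nodeId}' failed with SDK result error (${subtype})${detail}`);
}
```

Including the subtype and the SDK's error strings in the thrown message keeps the diagnostic detail next to the failure rather than only in the log line.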
Archon PR Validation Report
Verdict: REQUEST_CHANGES
Summary
Both claimed bugs are confirmed on main: stale Claude SDK sessions surface as cryptic errors with no auto-recovery, and Slack intercepts
Bug Confirmation
Required Changes
What's Done Well
Fix Quality Score: 3.6/5
Validated by archon-validate-pr workflow |
|
Thanks for this PR. After running I did what I could from the maintainer side and need the rest from you:
✅ Done by maintainer
Retargeted this PR's base from
❌ Needs a rebase (not a simple
|
| In this PR (old main) | On current dev |
|---|---|
| `packages/core/src/clients/claude.ts` | `packages/providers/src/claude/provider.ts` |
| `packages/core/src/clients/claude.test.ts` | `packages/providers/src/claude/provider.test.ts` |
| `IAssistantClient` interface | `IAgentProvider` interface |
| `aiClient.sendQuery(...)` (in orchestrator) | `provider.sendQuery(...)` |
| monolithic `sendQuery` body | decomposed into helpers (see #1162) |
Porting guide for your three buckets of changes:
- `STALE_SESSION_PATTERNS` + `classifySubprocessError` changes: port to `packages/providers/src/claude/provider.ts`. The classifier still exists there (around line 111) with the same signature. Just add the `stale_session` branch and the exported patterns constant. Note there's now a `classifyAndEnrichError()` wrapper helper (around line 810) that builds the enriched error + `shouldRetry`; plug `stale_session` in there as non-retryable, same shape as the existing `auth` branch. Export from `@archon/providers` so the orchestrator can import it.
- `classifySubprocessError` tests: port to `packages/providers/src/claude/provider.test.ts`. The test file structure is similar enough that most assertions transfer directly.
- Orchestrator changes (`orchestrator-agent.ts`, `orchestrator.test.ts`, `session-transitions.ts`): `session-transitions.ts` should merge cleanly (just add `'stale-session-cleared'`). The orchestrator has diverged ~279 lines, so `handleStreamMode`/`handleBatchMode` will need manual re-application of your `runStreamQuery`/`runBatchQuery` retry pattern against the current shape. Preserve the existing `isError` handling that now lives in those functions; don't drop it when you re-apply your changes.
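For the first bucket, the classifier change might look roughly like the sketch below. It is written under assumptions: the real `classifySubprocessError` signature, the set of error classes, and the auth and crash pattern lists are not shown in this thread, so `ErrorClass`, `AUTH_PATTERNS`, `CRASH_PATTERNS`, and the `shouldRetry` function here are placeholders. Only the two stale-session strings and the "checked before crash, non-retryable" rule come from the PR.

```typescript
// Placeholder error classes; the provider's real union may differ.
type ErrorClass = 'stale_session' | 'auth' | 'crash' | 'unknown';

// Shared by the classifier and (via export, in the real code) the orchestrator.
const STALE_SESSION_PATTERNS: readonly string[] = [
  'no conversation found',
  'conversation not found',
];

// Hypothetical pattern lists, for illustration only.
const AUTH_PATTERNS: readonly string[] = ['unauthorized', 'invalid api key'];
const CRASH_PATTERNS: readonly string[] = ['exited with code'];

function matchesAny(message: string, patterns: readonly string[]): boolean {
  const lower = message.toLowerCase();
  return patterns.some((pattern) => lower.includes(pattern));
}

function classifySubprocessError(message: string): ErrorClass {
  // The specific class is checked first so it wins when a message also
  // matches the generic crash patterns.
  if (matchesAny(message, STALE_SESSION_PATTERNS)) return 'stale_session';
  if (matchesAny(message, AUTH_PATTERNS)) return 'auth';
  if (matchesAny(message, CRASH_PATTERNS)) return 'crash';
  return 'unknown';
}

// Stale sessions are not retried at this layer; the orchestrator resets the
// session and retries once instead.
function shouldRetry(errorClass: ErrorClass): boolean {
  return errorClass === 'crash';
}
```

Keeping the stale-session branch non-retryable here avoids resuming the same dead session ID several times before the orchestrator gets a chance to reset it.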
🐛 One more bug to fix while you're in there
orchestrator-agent.ts:552 (on your branch) — the approval-routing gate uses the raw message instead of effectiveMessage:
// Current (bug):
if (!message.startsWith('/')) {
const pausedRun = await workflowDb.getPausedWorkflowRun(conversation.id);
...
}
// Fix:
if (!effectiveMessage.startsWith('/')) {
const pausedRun = await workflowDb.getPausedWorkflowRun(conversation.id);
...
}
Scenario this protects against: a Slack user has a paused interactive workflow, types `reset` (bare) to get out. Your normalization correctly rewrites it to `/reset` in `effectiveMessage`, but this gate still sees the raw `reset` (no slash), routes into the natural-language-approval path, stores `reset` as the approval response, and the user is stuck. The fix is a one-token change.
Same nit applies lower down in handleUpdateProject/handleRemoveProject call sites — those pass message but probably want effectiveMessage too, for consistency. Lower priority.
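To make the interaction between the normalization and the approval gate concrete, here is a small sketch. The function names `normalizeMessage` and `shouldRouteToApproval`, the `platform` string values, and the exact matching rule (trimmed, case-insensitive, whole message) are assumptions for illustration; only `SLACK_BARE_COMMANDS`, `effectiveMessage`, and the `startsWith('/')` check come from the thread.

```typescript
// Bare words that Slack users may type instead of the slash form.
const SLACK_BARE_COMMANDS = new Set<string>(['reset']);

// Rewrites a bare command to its slash form on Slack only; every other
// platform and every longer message passes through unchanged.
function normalizeMessage(message: string, platform: string): string {
  if (platform !== 'slack') return message;
  const candidate = message.trim().toLowerCase();
  return SLACK_BARE_COMMANDS.has(candidate) ? `/${candidate}` : message;
}

// The approval gate has to look at the normalized message; otherwise a bare
// "reset" is stored as the approval response for a paused workflow.
function shouldRouteToApproval(effectiveMessage: string): boolean {
  return !effectiveMessage.startsWith('/');
}
```

With this shape, `shouldRouteToApproval(normalizeMessage('reset', 'slack'))` is `false`, so the reset is handled as a command, while the same text on another platform still goes down the natural-language path.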
Summary
- Base retarget: done ✅
- Rebase-as-port + approval bug fix: needs you 👇
- Design review: all green 👍 — this is a real bug, and the fix approach is correct. Just needs the new-architecture re-spelling.
Once you push the ported version, I'll re-run validation.
|
@mhooooo, quick ping. The design review is done and green; the last open item is the port to the new architecture. Here's the short checklist so there's no ambiguity about what would unblock merge:
Required to proceed
Out of scope for this PR (confirmed by Cole's analysis): #1208 is not fixed by this change: different code path (
Timeline
If you can push the ported version within the next ~5 days, we'll merge as soon as validation is green. If that's not feasible for you, let us know and we'll pick it up ourselves using Cole's porting guide as the spec. We'd still credit you via
Thanks for the clean design and the thorough tests. The bug analysis and the architectural split (client detects, orchestrator retries once) are exactly right. |
Summary
When the Claude Code SDK rejects a resume attempt with `No conversation found` or `conversation not found` (stale session ID), the orchestrator currently surfaces this as a retryable `crash` class error and eventually propagates it to the user, who then has to manually `/reset` and retry their message. This PR makes the orchestrator detect the stale-session error class, auto-reset the session transparently, and retry the query once with a fresh session ID, so the user never sees the interruption.
Also accepts bare `reset` without the leading slash on Slack. Slack intercepts `/reset` as its own built-in command, so users on Slack couldn't use the text slash-command to reset their Archon session.
Motivation
When running Archon as a long-lived personal assistant (heartbeat + Slack DM + multi-day conversations), the SDK occasionally loses session state: the `assistant_session_id` becomes stale, the next query fails, and the conversation is effectively dead until the user notices and manually resets. Auto-reset turns a visible failure into a 0.5s hiccup.
Changes
`packages/core/src/clients/claude.ts`
- Export `STALE_SESSION_PATTERNS` as a single source of truth
- Extend `classifySubprocessError()` to include `'stale_session'`
`packages/core/src/state/session-transitions.ts`
- `'stale-session-cleared'` trigger
`packages/core/src/orchestrator/orchestrator-agent.ts`
- `isStaleSessionError()` helper using shared patterns
- `SLACK_BARE_COMMANDS` normalization (Slack-only)
Tests
- `classifySubprocessError` tests for stale-session patterns (case-insensitive, both paths)
- `reset` → `/reset`, Slack-only
Scope
Carried forward from `dynamous-community/remote-coding-agent @ v0.2.12` (commit 229217cf). Rebased onto latest main. Happy to split into two PRs if preferred.
Test plan
- `bun run type-check`: clean
- `bun test packages/core/src/clients/claude.test.ts`
- `bun test packages/core/src/orchestrator/orchestrator.test.ts`
- `reset` on Slack → treated as `/reset`