Skip to content

feat(orchestrator): auto-reset stale Claude SDK sessions + bare 'reset' command for Slack#1121

Open
mhooooo wants to merge 1 commit intocoleam00:devfrom
mhooooo:feat/stale-session-auto-reset
Open

feat(orchestrator): auto-reset stale Claude SDK sessions + bare 'reset' command for Slack#1121
mhooooo wants to merge 1 commit intocoleam00:devfrom
mhooooo:feat/stale-session-auto-reset

Conversation

@mhooooo
Copy link
Copy Markdown

@mhooooo mhooooo commented Apr 12, 2026

Summary

When the Claude Code SDK rejects a resume attempt with No conversation found or conversation not found (stale session ID), the orchestrator currently surfaces this as a retryable crash class error and eventually propagates it to the user, who then has to manually /reset and retry their message. This PR makes the orchestrator detect the stale-session error class, auto-reset the session transparently, and retry the query once with a fresh session ID — so the user never sees the interruption.

Also accepts bare reset without the leading slash on Slack. Slack intercepts /reset as its own built-in command, so users on Slack couldn't use the text slash-command to reset their Archon session.

Motivation

Running Archon as a long-lived personal assistant (heartbeat + Slack DM + multi-day conversations), the SDK occasionally loses session state — the assistant_session_id becomes stale, the next query fails, and the conversation is effectively dead until the user notices and manually resets. Auto-reset turns a visible failure into a 0.5s hiccup.

Changes

packages/core/src/clients/claude.ts

  • Export STALE_SESSION_PATTERNS as a single source of truth
  • Extend classifySubprocessError() to include 'stale_session'
  • Check stale_session before crash (specific wins over generic)

packages/core/src/state/session-transitions.ts

  • New 'stale-session-cleared' trigger

packages/core/src/orchestrator/orchestrator-agent.ts

  • isStaleSessionError() helper using shared patterns
  • Stream + batch handlers wrap AI-query in retry-once on stale session
  • State cleared before retry (no partial content bleed)
  • SLACK_BARE_COMMANDS normalization (Slack-only)

Tests

  • classifySubprocessError tests for stale-session patterns (case-insensitive, both paths)
  • Priority: overlapping crash + stale_session → stale_session wins
  • Stream + batch: successful reset + retry, no-third-retry guard, skip on fresh session
  • Bare command normalization: reset/reset, Slack-only

Scope

  • Claude only (not codex)
  • One retry max
  • Bare-command normalization scoped to Slack

Carried forward from dynamous-community/remote-coding-agent @ v0.2.12 (commit 229217cf). Rebased onto latest main. Happy to split into two PRs if preferred.

Test plan

  • bun run type-check — clean
  • bun test packages/core/src/clients/claude.test.ts
  • bun test packages/core/src/orchestrator/orchestrator.test.ts
  • Manual: trigger stale-session error → verify auto-recovery
  • Manual: bare reset on Slack → treated as /reset

…lack

When the Claude Code SDK rejects a resume attempt with "No conversation
found" (the SDK session ID is gone), the orchestrator now transparently
resets the session and retries the query instead of surfacing an error
that the user has to /reset manually.

Also accepts bare 'reset' without the leading slash on Slack, since Slack
intercepts /reset as its own slash command.

Changes:
- claude.ts: classify stale_session as a non-retryable error class
  (checked before 'crash' — specific wins over generic); export
  STALE_SESSION_PATTERNS as the single source of truth for both the
  classifier and the orchestrator's isStaleSessionError() helper
- session-transitions.ts: new 'stale-session-cleared' transition
  (deactivates — next message creates a fresh session)
- orchestrator-agent.ts: isStaleSessionError() helper; SLACK_BARE_COMMANDS
  normalization scoped to Slack platform only; handleStreamMode and
  handleBatchMode wrap their AI query loops in runStreamQuery() /
  runBatchQuery() functions so a catch block can reset sessionForQuery
  and re-run with the fresh session ID; state reset before retry
  (allMessages/allChunks/assistantMessages/commandDetected) so partial
  content from the failed attempt never bleeds into the fresh response
- claude.test.ts: stale_session classification tests, including priority
  over 'crash' on overlapping error messages, and .cause assertions
- orchestrator.test.ts: parameterized stream/batch retry tests covering
  successful reset+retry, no-third-retry guard, null-session skip, and
  fresh session ID assertion on retry

Ported from the dynamous/remote-coding-agent fork (commit 229217cf) with
the following intentional deltas against the new v0.3.5 base:
- Dropped defaultCodebase auto-scoping block (Patch 1 not carried —
  CONFIG-REPLACEABLE per investigation verdict)
- Slack bare-command normalization scoped to Slack platform only
  (fork shipped unscoped initially; this change came in a later
  review-findings sub-commit)
- runStreamQuery/runBatchQuery keep upstream's 4-arg aiClient.sendQuery
  signature including requestOptions (fork was on pre-v0.3.2 3-arg shape)
- Upstream deterministic command list preserved (help, status, reset,
  workflow, register-project, update-project, remove-project, commands,
  init, worktree) — fork only had 5
- No CHANGELOG / bun.lock / package.json / docs changes — those will
  be rebuilt on top of the v0.3.5 base

Upstream-PR candidate for coleam00/Archon.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 12, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5b0c1f56-b2d4-49d0-a474-16a6b86c57cb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 17, 2026

Related to #1208 — overlapping area or partial fix.

@coleam00
Copy link
Copy Markdown
Owner

Scope note — this PR does not fix #1208

PR #1121 is a solid fix for the orchestrator chat path's stale-session problem and worth merging on its own merits. But to set expectations cleanly, it should not be framed as the fix for #1208. Three concrete gaps prevent it from reaching that failure mode.

1. Different layer

PR #1121's retry wrapper lives in orchestrator-agent.ts (handleStreamMode / handleBatchMode), which is called by handleMessage for Slack/Web/Telegram chat.

dag-executor.ts (workflow loops — where #1208's archon-piv-loop crashes) invokes the provider directly. It never calls handleMessage. The retry try/catch in this PR does not execute on that path.

2. Different error delivery shape

PR #1121 catches thrown errors. The #1208 failure arrives as a yielded result message with isError: true, errorSubtype: 'error_during_execution' (packages/providers/src/claude/provider.ts:754-774). No exception is thrown, so the try/catch wrapper never fires.

For reference, dag-executor.ts:757 only handles isError when errorSubtype === 'error_max_budget_usd'. All other subtypes — including error_during_executionbreak silently at line 767 with whatever partial output accumulated. That matches the "5–7 second crash" symptom in #1208: fast-failing iteration produces empty output, loop continues.

3. Different pattern match

STALE_SESSION_PATTERNS = ['no conversation found', 'conversation not found']. The string 'error_during_execution' matches neither. Even if it did throw, isStaleSessionError() would return false.

Independent evidence that #1208 is not a stale-session rejection

  • Cross-process session resume works (verified with a two-process test, separate PIDs, session ID handed off — magic word recalled successfully).
  • Invalid session UUIDs produce "No conversation found with session ID: ..." (thrown), not error_during_execution (yielded result).
  • The reporter's logs show two different session IDs between resuming_session and result_is_error — the SDK allocated a new session successfully; failure happened during that session's execution, not during resume handshake.

Suggested framing

Keep this PR focused on the orchestrator chat path. If you want to address #1208 with the same pattern, it's a separate change in dag-executor.ts around the msg.isError handling at line 757, not a throw/catch wrapper — and the retryable subtypes / detection logic is a different design question (still analyzing that).

coleam00 added a commit that referenced this pull request Apr 18, 2026
Previously, `dag-executor` only failed nodes/iterations when the SDK
returned an `error_max_budget_usd` result. Every other `isError: true`
subtype — including `error_during_execution` — was silently `break`ed
out of the stream with whatever partial output had accumulated, letting
failed runs masquerade as successful ones with empty output.

This is the most likely explanation for the "5-second crash" symptom in
#1208: iterations finish instantly with empty text, the loop keeps
going, and only the `claude.result_is_error` log tips the user off.

Changes:
- Capture the SDK's `errors: string[]` detail on result messages
  (previously discarded) and surface it through `MessageChunk.errors`.
- Log `errors`, `stopReason` alongside `errorSubtype` in
  `claude.result_is_error` so users can see what actually failed.
- Throw from both the general node path and the loop iteration path
  on any `isError: true` result, including the subtype and SDK errors
  detail in the thrown message.

Note: this does not implement auto-retry. See PR comments on #1121 and
the analysis on #1208 — a retry-with-fresh-session approach for loop
iterations is not obviously correct until we see what
`error_during_execution` actually carries in the reporter's env.
This change is the observability + fail-loud step that has to come
first so that signal is no longer silent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coleam00
Copy link
Copy Markdown
Owner

Archon PR Validation Report

Verdict: REQUEST_CHANGES

Summary

Both claimed bugs are confirmed on main — stale Claude SDK sessions surface as cryptic errors with no auto-recovery, and Slack intercepts /reset leaving Slack users stuck. The fix design is architecturally sound (client-level detection, orchestrator-level retry-once with session transition). However, three issues must be addressed before merge.

Bug Confirmation

Claim Main Feature
Stale session errors require manual /reset Confirmed Fixed (auto-reset + retry-once)
Slack intercepts /reset Confirmed Fixed (bare reset normalization)

Required Changes

  1. Retarget PR to dev — Per project git workflow, all feature work merges into dev, not main.

  2. Rebase onto current dev — PR is based on an old main snapshot. Current dev has diverged with:

    • isError handling in batch/stream result events (will be silently lost on merge)
    • Type rename: AssistantRequestOptionsSendQueryOptions
    • Import path changes to @archon/providers/types
  3. Fix approval routing bugorchestrator-agent.ts:551 uses message.startsWith('/') instead of effectiveMessage.startsWith('/'). On Slack, bare reset during a workflow approval gate gets treated as approval text instead of the /reset command.

What's Done Well

  • Correct separation of concerns (detect at client, retry at orchestrator)
  • Triple retry guard prevents infinite loops
  • Shared STALE_SESSION_PATTERNS constant — single source of truth
  • Comprehensive test coverage across stream and batch modes

Fix Quality Score: 3.6/5


Validated by archon-validate-pr workflow

@coleam00 coleam00 changed the base branch from main to dev April 18, 2026 19:18
@coleam00
Copy link
Copy Markdown
Owner

Thanks for this PR — after running archon-validate-pr and a manual review, the bug analysis and fix design are solid. Both issues are confirmed on current main: classifySubprocessError() has no stale_session class, and Slack users can't send /reset. The client-level detection + orchestrator-level retry-once split is the right shape.

I did what I could from the maintainer side and needs the rest from you:

✅ Done by maintainer

Retargeted this PR's base from maindev. Per CLAUDE.md, all feature work merges into dev; main is a release-only branch. No action needed from you on this.

❌ Needs a rebase (not a simple git rebase)

This PR was written against main (~v0.3.6), but dev has undergone a package split that moves every file this PR touches. I tried a straight rebase and it hit a structural conflict — the old paths literally don't exist anymore. You'll need to port the changes to the new locations.

What moved on dev:

In this PR (old main) On current dev
packages/core/src/clients/claude.ts packages/providers/src/claude/provider.ts
packages/core/src/clients/claude.test.ts packages/providers/src/claude/provider.test.ts
IAssistantClient interface IAgentProvider interface
aiClient.sendQuery(...) (in orchestrator) provider.sendQuery(...)
monolithic sendQuery body decomposed into helpers (see #1162)

Porting guide for your three buckets of changes:

  1. STALE_SESSION_PATTERNS + classifySubprocessError changes — port to packages/providers/src/claude/provider.ts. The classifier still exists there (around line 111) with the same signature. Just add the stale_session branch and the exported patterns constant. Note there's now a classifyAndEnrichError() wrapper helper (around line 810) that builds the enriched error + shouldRetry; plug stale_session in there as non-retryable, same shape as the existing auth branch. Export from @archon/providers so the orchestrator can import it.

  2. classifySubprocessError tests — port to packages/providers/src/claude/provider.test.ts. The test file structure is similar enough that most assertions transfer directly.

  3. Orchestrator changes (orchestrator-agent.ts, orchestrator.test.ts, session-transitions.ts)session-transitions.ts should merge cleanly (just add 'stale-session-cleared'). The orchestrator has diverged ~279 lines, so handleStreamMode / handleBatchMode will need manual re-application of your runStreamQuery/runBatchQuery retry pattern against the current shape. Preserve the existing isError handling that now lives in those functions — don't drop it when you re-apply your changes.

🐛 One more bug to fix while you're in there

orchestrator-agent.ts:552 (on your branch) — the approval-routing gate uses the raw message instead of effectiveMessage:

// Current (bug):
if (!message.startsWith('/')) {
  const pausedRun = await workflowDb.getPausedWorkflowRun(conversation.id);
  ...
}

// Fix:
if (!effectiveMessage.startsWith('/')) {
  const pausedRun = await workflowDb.getPausedWorkflowRun(conversation.id);
  ...
}

Scenario this protects against: a Slack user has a paused interactive workflow, types reset (bare) to get out. Your normalization correctly rewrites it to /reset in effectiveMessage, but this gate still sees the raw reset (no slash), routes into the natural-language-approval path, stores reset as the approval response, and the user is stuck. The fix is a one-token change.

Same nit applies lower down in handleUpdateProject/handleRemoveProject call sites — those pass message but probably want effectiveMessage too, for consistency. Lower priority.

Summary

  • Base retarget: done ✅
  • Rebase-as-port + approval bug fix: needs you 👇
  • Design review: all green 👍 — this is a real bug, and the fix approach is correct. Just needs the new-architecture re-spelling.

Once you push the ported version, I'll re-run validation.

coleam00 added a commit that referenced this pull request Apr 18, 2026
Previously, `dag-executor` only failed nodes/iterations when the SDK
returned an `error_max_budget_usd` result. Every other `isError: true`
subtype — including `error_during_execution` — was silently `break`ed
out of the stream with whatever partial output had accumulated, letting
failed runs masquerade as successful ones with empty output.

This is the most likely explanation for the "5-second crash" symptom in
#1208: iterations finish instantly with empty text, the loop keeps
going, and only the `claude.result_is_error` log tips the user off.

Changes:
- Capture the SDK's `errors: string[]` detail on result messages
  (previously discarded) and surface it through `MessageChunk.errors`.
- Log `errors`, `stopReason` alongside `errorSubtype` in
  `claude.result_is_error` so users can see what actually failed.
- Throw from both the general node path and the loop iteration path
  on any `isError: true` result, including the subtype and SDK errors
  detail in the thrown message.

Note: this does not implement auto-retry. See PR comments on #1121 and
the analysis on #1208 — a retry-with-fresh-session approach for loop
iterations is not obviously correct until we see what
`error_during_execution` actually carries in the reporter's env.
This change is the observability + fail-loud step that has to come
first so that signal is no longer silent.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 20, 2026

@mhooooo — quick ping. The design review is done and green; the last open item is the port to the new architecture. Here's the short checklist so there's no ambiguity about what would unblock merge:

Required to proceed

  1. Rebase onto current dev (already retargeted from main for you). Note that straight git rebase dev will hit structural conflicts because packages/core/src/clients/claude.ts has moved. Use @coleam00's porting guide (comment here) — it maps every old path/type to its new home.

  2. Re-apply your three changes in the new locations:

    • STALE_SESSION_PATTERNS + classifySubprocessError stale_session branch → packages/providers/src/claude/provider.ts (classifier is ~line 111; plug into classifyAndEnrichError() ~line 810 as non-retryable, same shape as the auth branch). Export the constant from @archon/providers.
    • classifySubprocessError tests → packages/providers/src/claude/provider.test.ts.
    • session-transitions.ts 'stale-session-cleared' trigger → merges cleanly.
    • orchestrator-agent.ts runStreamQuery/runBatchQuery retry wrappers → re-apply against the current shape. Important: preserve the isError handling that now lives in those functions — don't drop it when you re-apply.
  3. Fix the approval-routing bug at orchestrator-agent.ts:552 — use effectiveMessage.startsWith('/') instead of message.startsWith('/'). Without this, your own Slack-bare-reset normalization doesn't reach the approval gate correctly.

  4. Drop homebrew/archon.rb from the diff — that version bump and the SHAs are a merge artifact from the old main base; rebasing onto dev will naturally drop it.

Out of scope for this PR (confirmed by Cole's analysis): #1208 is not fixed by this change — different code path (dag-executor, not orchestrator-agent), different error shape (yielded isError, not thrown). Please don't broaden scope to try to address it here.

Timeline

If you can push the ported version within the next ~5 days, we'll merge as soon as validation is green. If that's not feasible for you, let us know and we'll pick it up ourselves using Cole's porting guide as the spec — we'd still credit you via Co-authored-by: on the commits.

Thanks for the clean design and the thorough tests — the bug analysis and the architectural split (client detects, orchestrator retries once) is exactly right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants