fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop by kagura-agent · Pull Request #1294 · coleam00/Archon

kagura-agent · 2026-04-19T08:23:59Z

Summary

Problem: After container restart, sending a message in an existing conversation silently fails and enters an infinite failure loop — the expired Claude API session ID is persisted even on error
Why it matters: Users are stuck with broken conversations after any restart, with no recovery path except manual DB intervention
What changed: updateSession() and tryPersistSessionId() now accept string | null; both handleStreamMode and handleBatchMode clear session ID on error_during_execution
What did not change: Conversation history (remote_agent_messages), session creation flow, normal (non-error) session persistence

UX Journey

Before

User                   Archon                   Claude API
────                   ──────                   ─────────
sends message ──────▶  loads stale session ID
                       tries Claude API ──────▶  rejects (expired)
                       persists SAME stale ID
  sees error ◀──────── returns error
sends again ─────────▶ loads SAME stale ID
                       tries Claude API ──────▶  rejects again
  (infinite loop)      never recovers

After

User                   Archon                   Claude API
────                   ──────                   ─────────
sends message ──────▶  loads stale session ID
                       tries Claude API ──────▶  rejects (expired)
                       [clears session ID → NULL]
  sees error ◀──────── returns error
sends again ─────────▶ [no stored ID → creates new session]
                       tries Claude API ──────▶  accepts (fresh)
  sees reply ◀──────── streams response
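
The two journeys above can be reproduced with a small in-memory simulation. This is a minimal sketch, not Archon's code: `FakeClaudeApi`, `SessionStore`, `handleTurn`, and `run` are hypothetical names, and the `clearOnError` flag exists only to toggle between the before and after behavior. The fake API is assumed to echo the rejected ID back in its error result.

```typescript
// Hypothetical in-memory stand-ins; none of these names come from Archon.
type TurnResult =
  | { ok: true; sessionId: string }
  | { ok: false; errorSubtype: 'error_during_execution'; sessionId: string };

class FakeClaudeApi {
  private valid = new Set<string>();
  private counter = 0;

  // Resuming an unknown (expired) session ID fails; passing no ID starts a fresh session.
  send(sessionId: string | null): TurnResult {
    if (sessionId !== null && !this.valid.has(sessionId)) {
      return { ok: false, errorSubtype: 'error_during_execution', sessionId };
    }
    const id = sessionId ?? `fresh-${++this.counter}`;
    this.valid.add(id);
    return { ok: true, sessionId: id };
  }
}

interface SessionStore {
  assistantSessionId: string | null;
}

function handleTurn(api: FakeClaudeApi, store: SessionStore, clearOnError: boolean): TurnResult {
  const result = api.send(store.assistantSessionId);
  if (!result.ok && clearOnError) {
    store.assistantSessionId = null; // after: drop the stale ID
  } else {
    store.assistantSessionId = result.sessionId; // before: stale ID is written back on error
  }
  return result;
}

// Runs several turns against a store that starts with a stale ID
// and reports whether each turn succeeded.
function run(clearOnError: boolean, turns: number): boolean[] {
  const api = new FakeClaudeApi();
  const store: SessionStore = { assistantSessionId: 'stale-session-id' };
  return Array.from({ length: turns }, () => handleTurn(api, store, clearOnError).ok);
}
```

With `clearOnError` off, every turn fails because the same stale ID is reloaded each time. With it on, only the first turn fails and the next one starts a fresh session.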

Architecture Diagram

Before

handleStreamMode / handleBatchMode
    │
    ▼
error_during_execution ──▶ updateSession(sessionId)  ──▶ DB (stale ID persisted)

After

handleStreamMode / handleBatchMode
    │
    ▼
error_during_execution ──▶ [updateSession(null)]  ──▶ DB (ID cleared)

Connection inventory:

| From | To | Status | Notes |
| --- | --- | --- | --- |
| handleStreamMode | updateSession | modified | Now passes `null` on error |
| handleBatchMode | tryPersistSessionId | modified | Now passes `null` on error |
| updateSession | DB | modified | Accepts `string \| null` |
| tryPersistSessionId | DB | modified | Accepts `string \| null` |

Label Snapshot

Risk: risk: low
Size: size: S
Scope: core
Module: core:orchestrator

Security Impact

No security impact. This change only affects internal session state management — no auth, no user data exposure, no new inputs.

Human Verification

Verified by restarting container and sending message to existing conversation — session recovers automatically
All 89 orchestrator-agent tests pass
All 28 sessions DB tests pass
TypeScript type-check clean (tsc --noEmit)

Side Effects / Blast Radius

Only affects error recovery path (error_during_execution)
Normal session flow is unchanged
Worst case: session is cleared unnecessarily, resulting in a fresh session (no data loss, just context reload)

Rollback Plan

Revert the single commit. The only behavioral change is in the error handler — reverting restores original (broken) behavior of persisting stale IDs.

Closes #1280

Summary by CodeRabbit

Release Notes

Bug Fixes
- Enhanced session management to automatically clear invalid sessions when execution errors occur, improving reliability.
- Expanded error logging to include additional details for better troubleshooting of result processing failures.
Tests
- Added comprehensive test coverage for session state clearing during error recovery scenarios.
- Improved test structure with named mock utilities for better maintainability.

coderabbitai · 2026-04-19T08:24:15Z

Walkthrough

`updateSession` now accepts `null`, which allows a persisted assistant session ID to be cleared. Orchestrator result handling uses this to set the stored assistant_session_id to NULL when it receives an error_during_execution result, preventing stale-session retry loops.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Session persistence `packages/core/src/db/sessions.ts`, `packages/core/src/db/sessions.test.ts` | `updateSession` signature changed to accept `sessionId: string \| null`; tests added/updated to verify NULL is bound and not-found behavior remains. |
| Orchestrator logic `packages/core/src/orchestrator/orchestrator-agent.ts`, `packages/core/src/orchestrator/orchestrator-agent.test.ts` | `tryPersistSessionId` accepts `string \| null`. `handleStreamMode`/`handleBatchMode` detect `msg.isError && msg.errorSubtype === 'error_during_execution'`, log clearing, persist `null` for assistant session, prevent using the failed sessionId for that turn; tests added to assert transition + `updateSession(..., null)` in both stream and batch. |
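
The detection condition quoted in the table can be isolated as a small pure function. The sketch below is illustrative only: `ResultChunk` is a hypothetical shape built from the fields named in this thread (`isError`, `errorSubtype`, `sessionId`), and `isStaleSessionError` and `sessionIdToPersist` are made-up helper names. The real `MessageChunk` type and handler structure may differ.

```typescript
// Hypothetical shape, modelled only on the fields named in this thread.
interface ResultChunk {
  type: 'result';
  isError?: boolean;
  errorSubtype?: string;
  sessionId?: string;
}

// True for the error subtype this PR treats as a stale-session signal.
function isStaleSessionError(msg: ResultChunk): boolean {
  return msg.isError === true && msg.errorSubtype === 'error_during_execution';
}

// Decide what to persist for this turn:
//   null      -> clear the stored assistant session ID
//   string    -> persist the ID reported by the result
//   undefined -> nothing to persist
function sessionIdToPersist(msg: ResultChunk): string | null | undefined {
  if (isStaleSessionError(msg)) return null;
  return msg.sessionId;
}
```

Returning `null` and `undefined` as distinct values keeps "clear the column" separate from "leave it alone", which is the distinction the `string | null` signature change makes possible.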

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Orch as Orchestrator
  participant AI as AIClient
  participant DB as Database

  Client->>Orch: send user message
  Orch->>DB: read assistant_session_id (may exist)
  Orch->>AI: forward message (include assistant_session_id if present)
  AI-->>Orch: result (isError=true, subtype=error_during_execution)
  Orch->>DB: tryPersistSessionId(session.id, NULL)
  DB-->>Orch: ack
  Orch->>AI: subsequent message uses no assistant_session_id (fresh session)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

fix(core): surface auth errors instead of silently dropping them #1089: Modifies orchestrator result handling and session persistence codepaths; overlaps on error/result handling and session-id updates.
fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291: Alters handling of SDK result errors and execution failure flows; related to error_during_execution behaviour and persistence.

Poem

🐰 I nudge the stale ID into the night,
Clearing cobwebs so the convo's bright.
No looping echoes, just fresh, hopeful talk—
I hop, I clear, then skip away to walk. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: clearing stale session IDs on error_during_execution to prevent infinite failure loops. |
| Description check | ✅ Passed | The PR description comprehensively covers all required template sections with clear explanations, UX/architecture diagrams, validation evidence, security impact, and rollback plan. |
| Linked Issues check | ✅ Passed | The code changes fully address issue `#1280`: updateSession and tryPersistSessionId now accept null; both handleStreamMode and handleBatchMode clear session ID on error_during_execution as required. |
| Out of Scope Changes check | ✅ Passed | All changes are tightly scoped to the error recovery path for stale session IDs; conversation history, session creation flow, and normal session persistence remain unchanged as specified. |


Wirasm · 2026-04-20T06:36:48Z

@kagura-agent related to #1208 — overlapping area or partial fix.

Wirasm · 2026-04-20T06:38:02Z

Hi @kagura-agent — thanks for opening this PR.

This repository uses a PR template at .github/pull_request_template.md with several required sections. A few of them appear to be empty or placeholder here:

Security Impact (required)
Human Verification (required)
Side Effects / Blast Radius (required)
Rollback Plan (required)

Could you fill those out (even briefly)? The template helps reviewers understand scope, risk, and rollback — it speeds up review significantly.

If a section genuinely doesn't apply, just write "N/A" in it rather than leaving it blank.

kagura-agent · 2026-04-20T07:49:52Z

Thanks @Wirasm! Filled out all the template sections.

Re #1208 — I took a look and this PR addresses a different failure mode: #1208 is about initial session creation, while this one handles the case where a previously valid session expires after container restart. The fix is also in a different code path (error handler clearing stale IDs vs. session creation logic). Happy to add a note cross-referencing #1208 if that helps!

Wirasm · 2026-04-29T09:13:55Z

Review Summary

Verdict: minor-fixes-needed

This is a tightly-scoped bug fix that clears stale assistant_session_id values when error_during_execution errors occur, breaking infinite failure loops. The code is clean, type-safe, and CLAUDE.md-compliant — good work. The gap is test coverage: the new session-clearing branch has no tests, and updateSession's null path also needs one. Both are small, targeted additions.

Blocking issues

orchestrator-agent.ts:958–964 and orchestrator-agent.ts:1088–1094 — The new branch that calls await tryPersistSessionId(session.id, null) on error_during_execution is untested. Add a test for both handleStreamMode and handleBatchMode that yields a synthetic result event { type: 'result', isError: true, errorSubtype: 'error_during_execution', sessionId: 'stale-session-id' } and asserts updateSession is called with (sessionId, null).
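
A framework-free sketch of the requested test shape follows. Everything here is hypothetical: `consumeStream` is a stripped-down stand-in for `handleStreamMode`, `Chunk` only carries the fields mentioned in this review, and the mock is hand-rolled rather than using the repo's test utilities.

```typescript
// Hypothetical chunk type and handler; the real handleStreamMode does far more.
interface Chunk {
  type: 'assistant' | 'result';
  content?: string;
  isError?: boolean;
  errorSubtype?: string;
  sessionId?: string;
}

type UpdateSession = (id: string, assistantSessionId: string | null) => Promise<void>;

// Stand-in for the stream handler: consumes chunks, then persists or clears on result.
async function consumeStream(
  sessionRowId: string,
  stream: AsyncIterable<Chunk>,
  updateSession: UpdateSession
): Promise<void> {
  for await (const msg of stream) {
    if (msg.type !== 'result') continue;
    if (msg.isError && msg.errorSubtype === 'error_during_execution') {
      await updateSession(sessionRowId, null); // clearing takes precedence over msg.sessionId
    } else if (msg.sessionId) {
      await updateSession(sessionRowId, msg.sessionId);
    }
  }
}

// Synthetic stream that yields the error result described in the review.
async function* staleSessionStream(): AsyncGenerator<Chunk> {
  yield {
    type: 'result',
    isError: true,
    errorSubtype: 'error_during_execution',
    sessionId: 'stale-session-id',
  };
}

// Hand-rolled mock that records the arguments of each call.
function makeUpdateSessionMock(): { fn: UpdateSession; calls: Array<[string, string | null]> } {
  const calls: Array<[string, string | null]> = [];
  const fn: UpdateSession = async (id, value) => {
    calls.push([id, value]);
  };
  return { fn, calls };
}
```

A test would feed `staleSessionStream()` into the handler and assert that the mock recorded exactly one call with `null` as the second argument. A batch-mode test would follow the same pattern.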

Suggested fixes

sessions.ts:60–62 / sessions.test.ts — updateSession is now called with null, but the null path has no test. Add a test that verifies await updateSession('session-123', null) generates UPDATE remote_agent_sessions SET assistant_session_id = NULL WHERE id = $1 and does not throw.
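
For orientation, a nullable update could look roughly like the sketch below. It is an assumption-heavy illustration: the injected `QueryFn` is hypothetical (the real `updateSession` takes no such argument), the not-found handling is assumed, and this version binds `null` as a parameter rather than emitting the literal `NULL` quoted above. Only the table and column names come from this thread.

```typescript
// Hypothetical query executor; stands in for whatever DB client the repo uses.
type QueryFn = (sql: string, params: unknown[]) => Promise<{ rowCount: number }>;

// Sketch of a nullable update: null is bound like any other parameter value,
// so the same statement both sets and clears the column.
async function updateSession(
  query: QueryFn,
  id: string,
  assistantSessionId: string | null
): Promise<void> {
  const result = await query(
    'UPDATE remote_agent_sessions SET assistant_session_id = $1 WHERE id = $2',
    [assistantSessionId, id]
  );
  if (result.rowCount === 0) {
    // Assumed not-found behavior; the real function may report this differently.
    throw new Error(`Session not found: ${id}`);
  }
}
```

A test for the null path can pass a recording `QueryFn` and assert that the first bound parameter is `null` and that the call resolves without throwing.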

Minor / nice-to-have

orchestrator-agent.ts:958–963 / orchestrator-agent.ts:1088–1093 — The 'clearing_stale_session_id' warning logs only conversationId and errorSubtype. Add other available fields from msg (e.g., toolName) to help diagnose the root cause of the execution error.
orchestrator-agent.ts:958–964 and orchestrator-agent.ts:1088–1094 — The interaction between the new clearing branch and the existing error-path persist call is not explicitly tested when a result carries both an error and a sessionId. Add a test asserting updateSession is called with null (clearing takes precedence).
orchestrator-agent.ts:333 — When tryPersistSessionId logs a failure, it uses newSessionId: null in the payload, which may mislead log readers into thinking the intent was to set null. Rename the field to persistedValue: assistantSessionId.

Compliments

Well-scoped PR with a clear before/after UX diagram, architecture diagram, connection inventory, and rollback plan — exactly the kind of documentation that makes future maintainers' lives easier.
The tryPersistSessionId error-swallowing is intentional and appropriate (non-critical side effect), and the failure is logged with context — good error handling design.
Both handleStreamMode and handleBatchMode are updated consistently — no subtle asymmetry.
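
The error-swallowing pattern praised above can be sketched as follows. This is not the repo's implementation: the injected `update` function, the in-memory `log` array, the boolean return value, and the `session_persist_failed` event name are all placeholders. Only the function name and the `persistedValue` field name come from this thread.

```typescript
// Hypothetical logger entry and updater types, for illustration only.
interface LogEntry {
  event: string;
  payload: Record<string, unknown>;
}

type UpdateSessionFn = (id: string, assistantSessionId: string | null) => Promise<void>;

// Best-effort persist: a DB failure is logged with context and swallowed so that
// a non-critical side effect does not abort the user's turn.
async function tryPersistSessionId(
  update: UpdateSessionFn,
  log: LogEntry[],
  sessionRowId: string,
  assistantSessionId: string | null
): Promise<boolean> {
  try {
    await update(sessionRowId, assistantSessionId);
    return true;
  } catch (err) {
    log.push({
      event: 'session_persist_failed', // placeholder event name
      payload: {
        sessionRowId,
        persistedValue: assistantSessionId, // the value we tried to write, possibly null
        error: err instanceof Error ? err.message : String(err),
      },
    });
    return false;
  }
}
```

Logging the attempted value under a neutral name such as `persistedValue` avoids implying that `null` was a mistake when the caller intended to clear the column.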

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality.

…o prevent infinite failure loop

When a Claude API session expires (e.g. after container restart), the orchestrator persists the new (failed) session ID from the error result, causing every subsequent message in that conversation to hit the same error: an infinite failure loop.

Fix: on error_during_execution result, set assistant_session_id to NULL instead of persisting the failed session ID. The next message starts a fresh session with full context rebuilt from the DB. Conversation history is unaffected since it lives in remote_agent_messages, independent of the Claude session.

Changes:
- updateSession() and tryPersistSessionId() now accept string | null
- Both handleStreamMode and handleBatchMode clear session ID on error_during_execution

Fixes coleam00#1280

… feedback Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>

kagura-agent · 2026-04-29T09:32:22Z

Thanks for the thorough review @Wirasm! All items addressed:

Blocking — stale session clearing tests:

Added tests for both handleStreamMode and handleBatchMode that yield a synthetic error_during_execution result and assert updateSession is called with (sessionId, null)

Suggested — updateSession(id, null) SQL test:

Added test in sessions.test.ts verifying the NULL path generates correct SQL

Minor fixes:

Enriched warning logs with errors and stopReason fields (note: toolName doesn't exist on result type chunks per the MessageChunk type definition)
persistedValue rename was already in the original commit

All 100 orchestrator tests + 29 sessions tests pass. Rebased on latest dev.

coderabbitai

🧹 Nitpick comments (1)

packages/core/src/orchestrator/orchestrator-agent.ts (1)
983-992: Normalize result-path log event names to the repo convention.

The new/updated event names (clearing_stale_session_id, ai_result_error) don’t follow the required {domain}.{action}_{state} format.
♻️ Suggested naming adjustment
- 'clearing_stale_session_id'
+ 'orchestrator.session_clear_failed'

- 'ai_result_error'
+ 'orchestrator.ai_result_failed'
As per coding guidelines, "Structured logging with Pino: use event naming format {domain}.{action}_{state} (standard states: _started, _completed, _failed, _validated, _rejected)."

Also applies to: 999-1007, 1127-1136, 1143-1151
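
One way to check names against the quoted `{domain}.{action}_{state}` format is a regular expression. The sketch below is an assumption about how strictly the convention is meant: it accepts only the five standard states listed in the guideline, allows dotted multi-part domains, and uses hypothetical names (`STANDARD_STATES`, `EVENT_NAME`, `isConventionalEventName`). The repo may not enforce the convention this way, or at all.

```typescript
// The five standard states quoted from the coding guideline.
const STANDARD_STATES = ['started', 'completed', 'failed', 'validated', 'rejected'] as const;

// One or more dot-separated lowercase domain segments, then a dot,
// then a snake_case action ending in one of the standard states.
const EVENT_NAME = new RegExp(
  `^[a-z][a-z0-9]*(\\.[a-z][a-z0-9]*)*\\.[a-z][a-z0-9_]*_(${STANDARD_STATES.join('|')})$`
);

function isConventionalEventName(name: string): boolean {
  return EVENT_NAME.test(name);
}
```

Under this check, `clearing_stale_session_id` and `ai_result_error` are rejected because they have no domain prefix and no standard state suffix, while the two suggested replacements are accepted.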
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/orchestrator/orchestrator-agent.ts` around lines 983 - 992,
The log event names like 'clearing_stale_session_id' and 'ai_result_error' do
not follow the repo convention `{domain}.{action}_{state}`; update the strings
passed to getLog().warn/getLog().error (and any other getLog() calls around the
same areas) to use that format (choose appropriate domain names such as
"session" or "ai.result", an action verb like "clear_stale_session" or "result",
and a standard state suffix like
`_started`/`_completed`/`_failed`/`_validated`/`_rejected`) so each call (e.g.,
the getLog().warn inside the clearing stale session flow and the getLog().error
for ai result errors) emits events matching `{domain}.{action}_{state}`; apply
the same renaming to the other occurrences referenced (lines ~999-1007,
~1127-1136, ~1143-1151).


📥 Commits

Reviewing files that changed from the base of the PR and between d928859 and ec93193.

📒 Files selected for processing (4)

packages/core/src/db/sessions.test.ts
packages/core/src/db/sessions.ts
packages/core/src/orchestrator/orchestrator-agent.test.ts
packages/core/src/orchestrator/orchestrator-agent.ts

kagura-agent force-pushed the fix/1280-stale-session-id-loop branch from 553c6a2 to d928859 Compare April 19, 2026 09:09

Wirasm mentioned this pull request Apr 20, 2026

fix: interactive loop resume crashes with error_during_execution (stale session) #1208

Open

voidborne-d mentioned this pull request Apr 27, 2026

fix(orchestrator): clear stale assistant session ID on error_during_execution (#1280) #1370

Closed

kagura-agent and others added 2 commits April 29, 2026 17:24

test(orchestrator): add stale session clearing tests + address review…

ec93193

… feedback Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>

kagura-agent force-pushed the fix/1280-stale-session-id-loop branch from d928859 to ec93193 Compare April 29, 2026 09:32

coderabbitai Bot reviewed Apr 29, 2026

Wirasm merged commit cbcca8c into coleam00:dev Apr 29, 2026
4 checks passed

Wirasm mentioned this pull request Apr 29, 2026

Release 0.3.10 #1488

Merged

Wirasm merged 2 commits into coleam00:dev from kagura-agent:fix/1280-stale-session-id-loop
