
fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop #1294

Merged
Wirasm merged 2 commits into coleam00:dev from kagura-agent:fix/1280-stale-session-id-loop
Apr 29, 2026

Conversation

@kagura-agent
Contributor

@kagura-agent kagura-agent commented Apr 19, 2026

Summary

  • Problem: After container restart, sending a message in an existing conversation silently fails and enters an infinite failure loop — the expired Claude API session ID is persisted even on error
  • Why it matters: Users are stuck with broken conversations after any restart, with no recovery path except manual DB intervention
  • What changed: updateSession() and tryPersistSessionId() now accept string | null; both handleStreamMode and handleBatchMode clear session ID on error_during_execution
  • What did not change: Conversation history (remote_agent_messages), session creation flow, normal (non-error) session persistence

UX Journey

Before

User                   Archon                   Claude API
────                   ──────                   ─────────
sends message ──────▶  loads stale session ID
                       tries Claude API ──────▶  rejects (expired)
                       persists SAME stale ID
  sees error ◀──────── returns error
sends again ─────────▶ loads SAME stale ID
                       tries Claude API ──────▶  rejects again
  (infinite loop)      never recovers

After

User                   Archon                   Claude API
────                   ──────                   ─────────
sends message ──────▶  loads stale session ID
                       tries Claude API ──────▶  rejects (expired)
                       [clears session ID → NULL]
  sees error ◀──────── returns error
sends again ─────────▶ [no stored ID → creates new session]
                       tries Claude API ──────▶  accepts (fresh)
  sees reply ◀──────── streams response
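
The two journeys above can be reproduced with a toy model. This is a minimal sketch, not the Archon orchestrator: `simulateTurns`, the `Policy` names, and the `fresh-N` IDs are hypothetical, and the "API" is reduced to a set of IDs it still accepts.

```typescript
// Toy model of the loop described above. All names here are hypothetical.
type Policy = 'persist-on-error' | 'clear-on-error';

interface Turn {
  ok: boolean;
  storedAfter: string | null;
}

function simulateTurns(policy: Policy, turns: number): Turn[] {
  // After a restart the DB still holds an ID that the API no longer accepts.
  let stored: string | null = 'stale-session-id';
  let nextFresh = 1;
  const validIds = new Set<string>();
  const log: Turn[] = [];

  for (let i = 0; i < turns; i++) {
    let ok: boolean;
    if (stored === null) {
      // No stored ID: a fresh session is created and its ID is stored.
      stored = `fresh-${nextFresh++}`;
      validIds.add(stored);
      ok = true;
    } else if (validIds.has(stored)) {
      ok = true; // resuming a live session succeeds
    } else {
      ok = false; // the expired ID is rejected
      if (policy === 'clear-on-error') stored = null;
      // 'persist-on-error' leaves the same stale ID in place
    }
    log.push({ ok, storedAfter: stored });
  }
  return log;
}
```

With `'persist-on-error'` every turn fails; with `'clear-on-error'` the first turn fails, the stored ID becomes `null`, and the next turn succeeds on a fresh session.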

Architecture Diagram

Before

handleStreamMode / handleBatchMode
    │
    ▼
error_during_execution ──▶ updateSession(sessionId)  ──▶ DB (stale ID persisted)

After

handleStreamMode / handleBatchMode
    │
    ▼
error_during_execution ──▶ [updateSession(null)]  ──▶ DB (ID cleared)

Connection inventory:

| From | To | Status | Notes |
|---|---|---|---|
| handleStreamMode | updateSession | modified | Now passes null on error |
| handleBatchMode | tryPersistSessionId | modified | Now passes null on error |
| updateSession | DB | modified | Accepts string \| null |
| tryPersistSessionId | DB | modified | Accepts string \| null |

Label Snapshot

  • Risk: risk: low
  • Size: size: S
  • Scope: core
  • Module: core:orchestrator

Security Impact

No security impact. This change only affects internal session state management — no auth, no user data exposure, no new inputs.

Human Verification

  • Verified by restarting container and sending message to existing conversation — session recovers automatically
  • All 89 orchestrator-agent tests pass
  • All 28 sessions DB tests pass
  • TypeScript type-check clean (tsc --noEmit)

Side Effects / Blast Radius

  • Only affects error recovery path (error_during_execution)
  • Normal session flow is unchanged
  • Worst case: session is cleared unnecessarily, resulting in a fresh session (no data loss, just context reload)

Rollback Plan

Revert the single commit. The only behavioral change is in the error handler — reverting restores original (broken) behavior of persisting stale IDs.

Closes #1280

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced session management to automatically clear invalid sessions when execution errors occur, improving reliability.
    • Expanded error logging to include additional details for better troubleshooting of result processing failures.
  • Tests

    • Added comprehensive test coverage for session state clearing during error recovery scenarios.
    • Improved test structure with named mock utilities for better maintainability.

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

📝 Walkthrough

updateSession now accepts null, which allows a persisted assistant session ID to be cleared. The orchestrator's result handling now clears the stored assistant_session_id (sets it to NULL) when it receives an error_during_execution result, which prevents stale-session retry loops.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Session persistence: packages/core/src/db/sessions.ts, packages/core/src/db/sessions.test.ts | updateSession signature changed to accept sessionId: string \| null; tests added/updated to verify NULL is bound and not-found behavior remains. |
| Orchestrator logic: packages/core/src/orchestrator/orchestrator-agent.ts, packages/core/src/orchestrator/orchestrator-agent.test.ts | tryPersistSessionId accepts string \| null. handleStreamMode/handleBatchMode detect msg.isError && msg.errorSubtype === 'error_during_execution', log clearing, persist null for assistant session, prevent using the failed sessionId for that turn; tests added to assert transition + updateSession(..., null) in both stream and batch. |
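
The detection rule in the summary can be read as a small decision function. The following is a sketch under assumptions: `ResultChunk` is a hypothetical reduction of the result message to the fields named above, and `sessionIdToPersist` is an illustrative helper, not a function from the PR.

```typescript
// Hypothetical shape of a result chunk, reduced to the fields named in the walkthrough.
interface ResultChunk {
  type: 'result';
  isError?: boolean;
  errorSubtype?: string;
  sessionId?: string;
}

// Returns the value to persist for this turn:
//   null      -> clear the stored ID (execution error)
//   string    -> persist the ID reported by the result
//   undefined -> nothing to persist
function sessionIdToPersist(msg: ResultChunk): string | null | undefined {
  if (msg.isError && msg.errorSubtype === 'error_during_execution') {
    return null;
  }
  return msg.sessionId;
}
```

Checking the error branch first means a result that carries both an error and a sessionId still clears the stored value.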

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Orch as Orchestrator
  participant AI as AIClient
  participant DB as Database

  Client->>Orch: send user message
  Orch->>DB: read assistant_session_id (may exist)
  Orch->>AI: forward message (include assistant_session_id if present)
  AI-->>Orch: result (isError=true, subtype=error_during_execution)
  Orch->>DB: tryPersistSessionId(session.id, NULL)
  DB-->>Orch: ack
  Orch->>AI: subsequent message uses no assistant_session_id (fresh session)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nudge the stale ID into the night,
Clearing cobwebs so the convo's bright.
No looping echoes, just fresh, hopeful talk—
I hop, I clear, then skip away to walk. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: clearing stale session IDs on error_during_execution to prevent infinite failure loops. |
| Description check | ✅ Passed | The PR description comprehensively covers all required template sections with clear explanations, UX/architecture diagrams, validation evidence, security impact, and rollback plan. |
| Linked Issues check | ✅ Passed | The code changes fully address issue #1280: updateSession and tryPersistSessionId now accept null; both handleStreamMode and handleBatchMode clear session ID on error_during_execution as required. |
| Out of Scope Changes check | ✅ Passed | All changes are tightly scoped to the error recovery path for stale session IDs; conversation history, session creation flow, and normal session persistence remain unchanged as specified. |


@kagura-agent kagura-agent force-pushed the fix/1280-stale-session-id-loop branch from 553c6a2 to d928859 Compare April 19, 2026 09:09
@Wirasm
Collaborator

Wirasm commented Apr 20, 2026

@kagura-agent related to #1208 — overlapping area or partial fix.

@Wirasm
Collaborator

Wirasm commented Apr 20, 2026

Hi @kagura-agent — thanks for opening this PR.

This repository uses a PR template at .github/pull_request_template.md with several required sections. A few of them appear to be empty or placeholder here:

  • Security Impact (required)
  • Human Verification (required)
  • Side Effects / Blast Radius (required)
  • Rollback Plan (required)

Could you fill those out (even briefly)? The template helps reviewers understand scope, risk, and rollback — it speeds up review significantly.

If a section genuinely doesn't apply, just write "N/A" in it rather than leaving it blank.

@kagura-agent
Contributor Author

Thanks @Wirasm! Filled out all the template sections.

Re #1208 — I took a look and this PR addresses a different failure mode: #1208 is about initial session creation, while this one handles the case where a previously valid session expires after container restart. The fix is also in a different code path (error handler clearing stale IDs vs. session creation logic). Happy to add a note cross-referencing #1208 if that helps!

@Wirasm
Collaborator

Wirasm commented Apr 29, 2026

Review Summary

Verdict: minor-fixes-needed

This is a tightly-scoped bug fix that clears stale assistant_session_id values when error_during_execution errors occur, breaking infinite failure loops. The code is clean, type-safe, and CLAUDE.md-compliant — good work. The gap is test coverage: the new session-clearing branch has no tests, and updateSession's null path also needs one. Both are small, targeted additions.

Blocking issues

  • orchestrator-agent.ts:958–964 and orchestrator-agent.ts:1088–1094 — The new branch that calls await tryPersistSessionId(session.id, null) on error_during_execution is untested. Add a test for both handleStreamMode and handleBatchMode that yields a synthetic result event { type: 'result', isError: true, errorSubtype: 'error_during_execution', sessionId: 'stale-session-id' } and asserts updateSession is called with (sessionId, null).
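
One possible shape for such a test, written without a test framework so it stays self-contained. Everything here is a stand-in: `handleStreamSketch` is a hypothetical reduction of the handler, and the recorded `calls` array plays the role of the mocked updateSession. The real tests live in orchestrator-agent.test.ts and are not reproduced on this page.

```typescript
// Stand-ins for illustration only; the real handlers and mocks live in the repo's test files.
interface Chunk {
  type: string;
  isError?: boolean;
  errorSubtype?: string;
  sessionId?: string;
}

type UpdateCall = [string, string | null];

// Async generator that plays the role of the AI client stream.
async function* syntheticStream(): AsyncGenerator<Chunk> {
  yield {
    type: 'result',
    isError: true,
    errorSubtype: 'error_during_execution',
    sessionId: 'stale-session-id',
  };
}

// Hypothetical handler under test: consumes the stream and persists or clears the ID.
async function handleStreamSketch(
  dbSessionId: string,
  stream: AsyncIterable<Chunk>,
  updateSession: (id: string, value: string | null) => Promise<void>,
): Promise<void> {
  for await (const msg of stream) {
    if (msg.type !== 'result') continue;
    if (msg.isError && msg.errorSubtype === 'error_during_execution') {
      await updateSession(dbSessionId, null); // clearing takes precedence
    } else if (msg.sessionId) {
      await updateSession(dbSessionId, msg.sessionId);
    }
  }
}

// Runs the handler against the synthetic stream and returns the recorded calls.
async function runClearingTest(): Promise<UpdateCall[]> {
  const calls: UpdateCall[] = [];
  await handleStreamSketch('session-123', syntheticStream(), async (id, value) => {
    calls.push([id, value]);
  });
  return calls;
}
```

The assertion of interest is that exactly one call is recorded and that its second argument is `null`, even though the synthetic result carries a sessionId.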

Suggested fixes

  • sessions.ts:60–62 / sessions.test.ts: updateSession is now called with null, but the null path has no test. Add a test that verifies await updateSession('session-123', null) generates UPDATE remote_agent_sessions SET assistant_session_id = NULL WHERE id = $1 and does not throw.
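
A sketch of how the null path could be exercised. It assumes a parameterized query in which the value is bound as a parameter, so passing `null` binds SQL NULL; the SQL text in the repository may differ from this, for example by writing NULL literally as in the suggestion above. `QueryFn`, `updateSessionSketch`, and the not-found error are hypothetical.

```typescript
// Hypothetical minimal query interface; the real DB layer is not shown on this page.
type QueryFn = (sql: string, params: unknown[]) => Promise<{ rowCount: number }>;

// Sketch of an updateSession that accepts `string | null`.
// Passing null binds SQL NULL, which clears the stored assistant session ID.
async function updateSessionSketch(
  query: QueryFn,
  id: string,
  assistantSessionId: string | null,
): Promise<void> {
  const result = await query(
    'UPDATE remote_agent_sessions SET assistant_session_id = $1 WHERE id = $2',
    [assistantSessionId, id],
  );
  if (result.rowCount === 0) {
    // Assumed not-found behavior; the actual error type is not shown in the PR page.
    throw new Error(`Session not found: ${id}`);
  }
}
```

A test can pass a recording `QueryFn` and assert that the first bound parameter is `null` and that the call resolves without throwing.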

Minor / nice-to-have

  • orchestrator-agent.ts:958–963 / orchestrator-agent.ts:1088–1093 — The 'clearing_stale_session_id' warning logs only conversationId and errorSubtype. Add other available fields from msg (e.g., toolName) to help diagnose the root cause of the execution error.

  • orchestrator-agent.ts:958–964 and orchestrator-agent.ts:1088–1094 — The interaction between the new clearing branch and the existing error-path persist call is not explicitly tested when a result carries both an error and a sessionId. Add a test asserting updateSession is called with null (clearing takes precedence).

  • orchestrator-agent.ts:333 — When tryPersistSessionId logs a failure, it uses newSessionId: null in the payload, which may mislead log readers into thinking the intent was to set null. Rename the field to persistedValue: assistantSessionId.

Compliments

  • Well-scoped PR with a clear before/after UX diagram, architecture diagram, connection inventory, and rollback plan — exactly the kind of documentation that makes future maintainers' lives easier.
  • The tryPersistSessionId error-swallowing is intentional and appropriate (non-critical side effect), and the failure is logged with context — good error handling design.
  • Both handleStreamMode and handleBatchMode are updated consistently — no subtle asymmetry.
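
The error-swallowing pattern praised above can be sketched as a best-effort wrapper. This is an illustration under assumptions: `Updater`, `WarnFn`, the boolean return value, and the event name are hypothetical, while the `persistedValue` field name follows the rename discussed in the review.

```typescript
// Hypothetical logger and updater types; only the control flow is the point here.
type Updater = (id: string, value: string | null) => Promise<void>;
type WarnFn = (fields: Record<string, unknown>, event: string) => void;

// Best-effort persistence: a DB failure is logged with context and then swallowed,
// so the user-facing turn is not aborted by a non-critical side effect.
async function tryPersistSketch(
  update: Updater,
  warn: WarnFn,
  id: string,
  assistantSessionId: string | null,
): Promise<boolean> {
  try {
    await update(id, assistantSessionId);
    return true;
  } catch (err) {
    warn(
      { sessionId: id, persistedValue: assistantSessionId, err: String(err) },
      'session_persist_failed', // hypothetical event name
    );
    return false;
  }
}
```

Logging the attempted value under `persistedValue` keeps the payload accurate whether the caller was storing an ID or clearing it with `null`.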

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality.

kagura-agent and others added 2 commits April 29, 2026 17:24
…o prevent infinite failure loop

When a Claude API session expires (e.g. after container restart), the orchestrator
persists the new (failed) session ID from the error result, causing every subsequent
message in that conversation to hit the same error — an infinite failure loop.

Fix: on error_during_execution result, set assistant_session_id to NULL instead of
persisting the failed session ID. The next message starts a fresh session with full
context rebuilt from the DB. Conversation history is unaffected since it lives in
remote_agent_messages, independent of the Claude session.

Changes:
- updateSession() and tryPersistSessionId() now accept string | null
- Both handleStreamMode and handleBatchMode clear session ID on error_during_execution

Fixes coleam00#1280
… feedback

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>
@kagura-agent kagura-agent force-pushed the fix/1280-stale-session-id-loop branch from d928859 to ec93193 Compare April 29, 2026 09:32
@kagura-agent
Contributor Author

Thanks for the thorough review @Wirasm! All items addressed:

Blocking — stale session clearing tests:

  • Added tests for both handleStreamMode and handleBatchMode that yield a synthetic error_during_execution result and assert updateSession is called with (sessionId, null)

Suggested — updateSession(id, null) SQL test:

  • Added test in sessions.test.ts verifying the NULL path generates correct SQL

Minor fixes:

  • Enriched warning logs with errors and stopReason fields (note: toolName doesn't exist on result type chunks per the MessageChunk type definition)
  • persistedValue rename was already in the original commit

All 100 orchestrator tests + 29 sessions tests pass. Rebased on latest dev.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
packages/core/src/orchestrator/orchestrator-agent.ts (1)

983-992: Normalize result-path log event names to the repo convention.

The new/updated event names (clearing_stale_session_id, ai_result_error) don’t follow the required {domain}.{action}_{state} format.

♻️ Suggested naming adjustment
- 'clearing_stale_session_id'
+ 'orchestrator.session_clear_failed'

- 'ai_result_error'
+ 'orchestrator.ai_result_failed'

As per coding guidelines, "Structured logging with Pino: use event naming format {domain}.{action}_{state} (standard states: _started, _completed, _failed, _validated, _rejected)."

Also applies to: 999-1007, 1127-1136, 1143-1151
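
The naming rule quoted above can be checked mechanically. The sketch below is an assumption about how such a check might look, not tooling from the repository: `isConventionalEventName` is a hypothetical helper, and it accepts only the five standard states listed in the guideline, which may be stricter than the guideline intends.

```typescript
// The five standard states quoted from the coding guidelines.
const STANDARD_STATES = ['started', 'completed', 'failed', 'validated', 'rejected'] as const;

// Checks the {domain}.{action}_{state} shape:
// a non-empty domain before the last dot, then a non-empty action, then _{state}.
function isConventionalEventName(name: string): boolean {
  const lastDot = name.lastIndexOf('.');
  if (lastDot <= 0) return false; // no dot, or empty domain
  const rest = name.slice(lastDot + 1);
  return STANDARD_STATES.some(
    (state) => rest.length > state.length + 1 && rest.endsWith(`_${state}`),
  );
}
```

Under this check the two flagged names are rejected because they have no domain prefix, and the two suggested replacements are accepted.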

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/orchestrator/orchestrator-agent.ts` around lines 983 - 992,
The log event names like 'clearing_stale_session_id' and 'ai_result_error' do
not follow the repo convention `{domain}.{action}_{state}`; update the strings
passed to getLog().warn/getLog().error (and any other getLog() calls around the
same areas) to use that format (choose appropriate domain names such as
"session" or "ai.result", an action verb like "clear_stale_session" or "result",
and a standard state suffix like
`_started`/`_completed`/`_failed`/`_validated`/`_rejected`) so each call (e.g.,
the getLog().warn inside the clearing stale session flow and the getLog().error
for ai result errors) emits events matching `{domain}.{action}_{state}`; apply
the same renaming to the other occurrences referenced (lines ~999-1007,
~1127-1136, ~1143-1151).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d3dd9ef-3822-4b18-b28f-0d28548c5bb6

📥 Commits

Reviewing files that changed from the base of the PR and between d928859 and ec93193.

📒 Files selected for processing (4)
  • packages/core/src/db/sessions.test.ts
  • packages/core/src/db/sessions.ts
  • packages/core/src/orchestrator/orchestrator-agent.test.ts
  • packages/core/src/orchestrator/orchestrator-agent.ts

@Wirasm Wirasm merged commit cbcca8c into coleam00:dev Apr 29, 2026
4 checks passed
@Wirasm Wirasm mentioned this pull request Apr 29, 2026


Development

Successfully merging this pull request may close these issues.

Stale Claude session ID causes silent failure loop on container restart

2 participants