fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop#1294
Conversation
📝 WalkthroughWalkthroughAllow clearing persisted assistant session IDs by accepting Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant Orch as Orchestrator
participant AI as AIClient
participant DB as Database
Client->>Orch: send user message
Orch->>DB: read assistant_session_id (may exist)
Orch->>AI: forward message (include assistant_session_id if present)
AI-->>Orch: result (isError=true, subtype=error_during_execution)
Orch->>DB: tryPersistSessionId(session.id, NULL)
DB-->>Orch: ack
Orch->>AI: subsequent message uses no assistant_session_id (fresh session)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
553c6a2 to
d928859
Compare
|
@kagura-agent related to #1208 — overlapping area or partial fix. |
|
Hi @kagura-agent — thanks for opening this PR. This repository uses a PR template at
Could you fill those out (even briefly)? The template helps reviewers understand scope, risk, and rollback — it speeds up review significantly. If a section genuinely doesn't apply, just write "N/A" in it rather than leaving it blank. |
|
Thanks @Wirasm! Filled out all the template sections. Re #1208 — I took a look and this PR addresses a different failure mode: #1208 is about initial session creation, while this one handles the case where a previously valid session expires after container restart. The fix is also in a different code path (error handler clearing stale IDs vs. session creation logic). Happy to add a note cross-referencing #1208 if that helps! |
Review SummaryVerdict: minor-fixes-needed This is a tightly-scoped bug fix that clears stale Blocking issues
Suggested fixes
Minor / nice-to-have
Compliments
Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality. |
…o prevent infinite failure loop When a Claude API session expires (e.g. after container restart), the orchestrator persists the new (failed) session ID from the error result, causing every subsequent message in that conversation to hit the same error — an infinite failure loop. Fix: on error_during_execution result, set assistant_session_id to NULL instead of persisting the failed session ID. The next message starts a fresh session with full context rebuilt from the DB. Conversation history is unaffected since it lives in remote_agent_messages, independent of the Claude session. Changes: - updateSession() and tryPersistSessionId() now accept string | null - Both handleStreamMode and handleBatchMode clear session ID on error_during_execution Fixes coleam00#1280
… feedback Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>
d928859 to
ec93193
Compare
|
Thanks for the thorough review @Wirasm! All items addressed: Blocking — stale session clearing tests:
Suggested —
Minor fixes:
All 100 orchestrator tests + 29 sessions tests pass. Rebased on latest |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/core/src/orchestrator/orchestrator-agent.ts (1)
983-992: Normalize result-path log event names to the repo convention.The new/updated event names (
clearing_stale_session_id,ai_result_error) don’t follow the required{domain}.{action}_{state}format.♻️ Suggested naming adjustment
- 'clearing_stale_session_id' + 'orchestrator.session_clear_failed' - 'ai_result_error' + 'orchestrator.ai_result_failed'As per coding guidelines, "Structured logging with Pino: use event naming format
{domain}.{action}_{state}(standard states:_started,_completed,_failed,_validated,_rejected)."Also applies to: 999-1007, 1127-1136, 1143-1151
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/core/src/orchestrator/orchestrator-agent.ts` around lines 983 - 992, The log event names like 'clearing_stale_session_id' and 'ai_result_error' do not follow the repo convention `{domain}.{action}_{state}`; update the strings passed to getLog().warn/getLog().error (and any other getLog() calls around the same areas) to use that format (choose appropriate domain names such as "session" or "ai.result", an action verb like "clear_stale_session" or "result", and a standard state suffix like `_started`/`_completed`/`_failed`/`_validated`/`_rejected`) so each call (e.g., the getLog().warn inside the clearing stale session flow and the getLog().error for ai result errors) emits events matching `{domain}.{action}_{state}`; apply the same renaming to the other occurrences referenced (lines ~999-1007, ~1127-1136, ~1143-1151).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/core/src/orchestrator/orchestrator-agent.ts`:
- Around line 983-992: The log event names like 'clearing_stale_session_id' and
'ai_result_error' do not follow the repo convention `{domain}.{action}_{state}`;
update the strings passed to getLog().warn/getLog().error (and any other
getLog() calls around the same areas) to use that format (choose appropriate
domain names such as "session" or "ai.result", an action verb like
"clear_stale_session" or "result", and a standard state suffix like
`_started`/`_completed`/`_failed`/`_validated`/`_rejected`) so each call (e.g.,
the getLog().warn inside the clearing stale session flow and the getLog().error
for ai result errors) emits events matching `{domain}.{action}_{state}`; apply
the same renaming to the other occurrences referenced (lines ~999-1007,
~1127-1136, ~1143-1151).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8d3dd9ef-3822-4b18-b28f-0d28548c5bb6
📒 Files selected for processing (4)
packages/core/src/db/sessions.test.tspackages/core/src/db/sessions.tspackages/core/src/orchestrator/orchestrator-agent.test.tspackages/core/src/orchestrator/orchestrator-agent.ts
Summary
updateSession()andtryPersistSessionId()now acceptstring | null; bothhandleStreamModeandhandleBatchModeclear session ID onerror_during_executionremote_agent_messages), session creation flow, normal (non-error) session persistenceUX Journey
Before
After
Architecture Diagram
Before
After
Connection inventory:
nullon errornullon errorstring | nullstring | nullLabel Snapshot
risk: lowsize: Scorecore:orchestratorSecurity Impact
No security impact. This change only affects internal session state management — no auth, no user data exposure, no new inputs.
Human Verification
tsc --noEmit)Side Effects / Blast Radius
error_during_execution)Rollback Plan
Revert the single commit. The only behavioral change is in the error handler — reverting restores original (broken) behavior of persisting stale IDs.
Closes #1280
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests