fix(daemon): rehydrate conversation when surface-action arrives after eviction#29895
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f61da8055e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| WHERE content LIKE ? | ||
| ORDER BY created_at DESC | ||
| LIMIT 1`, | ||
| `%"surfaceId":"${surfaceId}"%`, |
There was a problem hiding this comment.
Escape LIKE wildcards in surfaceId lookup
The DB fallback uses a LIKE pattern built directly from request surfaceId, so % and _ in user input are treated as wildcards. A request like surfaceId: "%" will match any ui_surface row and can resolve the wrong conversation, causing triggerSurfaceAction/undoSurfaceAction (when conversationId is omitted) to run against unrelated conversations instead of returning 404. Escape wildcard characters (or switch to exact JSON extraction) before querying.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 8fe70d63a0. Now escapes %, _, and \ before binding and uses LIKE ? ESCAPE '\\'. New test (escapes LIKE wildcards in surfaceId) asserts the SQL uses ESCAPE and that a request with surfaceId: "%" produces the parameter %"surfaceId":"\\%"% rather than matching every row.
| } | ||
| if (!resolvedConversationId) return undefined; | ||
|
|
||
| return await getOrCreateConversation(resolvedConversationId); |
There was a problem hiding this comment.
Fail missing conversation IDs instead of creating new state
When a caller provides conversationId and it is not in memory, this path always calls getOrCreateConversation(resolvedConversationId). Because that API creates conversations when they do not exist, surface-action requests with stale/invalid IDs now create a fresh conversation and continue dispatching instead of preserving the previous NotFound behavior. This turns an action endpoint into a conversation-creation path and can route actions into unintended IDs; validate existence/ownership before calling getOrCreateConversation.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 8fe70d63a0. resolveSurfaceConversation now always validates the surface against the messages table before calling getOrCreateConversation, even when the caller supplies a conversationId. If the DB row is missing → NotFoundError (no phantom). If the row's conversation_id doesn't match the caller-supplied id → also NotFoundError. New tests cover both paths and the previous "rehydrate without DB scan" optimization is gone.
| let resolvedConversationId: string | undefined = conversationId ?? undefined; | ||
| if (!resolvedConversationId) { | ||
| const row = rawGet<{ conversation_id: string }>( | ||
| `SELECT conversation_id FROM messages | ||
| WHERE content LIKE ? | ||
| ORDER BY created_at DESC | ||
| LIMIT 1`, | ||
| `%"surfaceId":"${surfaceId}"%`, | ||
| ); | ||
| if (row) resolvedConversationId = row.conversation_id; | ||
| } | ||
| if (!resolvedConversationId) return undefined; | ||
|
|
||
| return await getOrCreateConversation(resolvedConversationId); |
There was a problem hiding this comment.
🔴 resolveSurfaceConversation always returns a conversation for any provided conversationId, creating phantom conversations for invalid IDs
When a caller provides a conversationId that doesn't exist in memory, resolveSurfaceConversation unconditionally calls getOrCreateConversation(conversationId) at line 136. Because getOrCreateConversation (conversation-store.ts:194) always succeeds — creating a new empty in-memory Conversation even when no DB record exists — the function never returns undefined when conversationId is truthy.
This changes the error behavior from the old code: previously, findConversation(bogusId) returned undefined and the handler threw a clean NotFoundError (404). Now, a phantom empty conversation is created, setConversation registers it in the in-memory store, and the handler proceeds to dispatch the surface action on it. For handleSurfaceAction, this means the surface won't be found in pendingSurfaceActions and the code falls through to the history-restored path (conversation-surfaces.ts:1258), potentially enqueuing an LLM message on a ghost conversation. For handleSurfaceUndo, it calls handleSurfaceUndo(surfaceId) on an empty conversation with unpredictable results. Both paths return 500 instead of the expected 404.
Prompt for agents
In resolveSurfaceConversation (surface-action-routes.ts:114-137), when conversationId is provided but findConversation returns undefined, the function calls getOrCreateConversation which always succeeds — even for IDs that have no DB record. This creates phantom conversations and changes the error from 404 to 500.
The fix should verify that the conversation actually exists in the database before calling getOrCreateConversation. Options:
1. After getOrCreateConversation returns, check that the rehydrated conversation's surfaceState contains the requested surfaceId. If not, return undefined.
2. Query the DB first (e.g. check if any messages exist for that conversation_id) before calling getOrCreateConversation.
3. Use a hypothetical findConversation-only variant that doesn't create if missing.
Option 1 is simplest and also catches the case where the conversation exists but the surface doesn't belong to it.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Yep, agreed — same fix as the Codex finding above. 8fe70d63a0 makes the DB scan the gate before getOrCreateConversation, so a stale or invalid conversationId returns 404 instead of creating a phantom. Two new tests (returns 404 when caller-supplied conversationId has no matching surface in DB and ...mismatches the DB row) lock this in.
| test("returns 404 with caller-supplied conversationId that cannot be rehydrated", async () => { | ||
| // rehydrated stays null → getOrCreateConversation throws inside the mock, | ||
| // which surfaces as a route-level miss. | ||
| const handler = findHandler("triggerSurfaceAction"); | ||
|
|
||
| let caught: unknown; | ||
| try { | ||
| await handler({ | ||
| body: { | ||
| conversationId: "conv-deleted", | ||
| surfaceId: "surf-x", | ||
| actionId: "act-x", | ||
| }, | ||
| }); | ||
| } catch (err) { | ||
| caught = err; | ||
| } | ||
|
|
||
| expect(caught).toBeInstanceOf(Error); | ||
| // We tried to rehydrate before giving up. | ||
| expect(getOrCreateCalls).toEqual(["conv-deleted"]); | ||
| }); |
There was a problem hiding this comment.
🚩 Test mock for getOrCreateConversation diverges from production behavior on invalid IDs
The test at line 234-255 ("returns 404 with caller-supplied conversationId that cannot be rehydrated") sets rehydrated = null so the mock throws an Error. In production, getOrCreateConversation never throws for an unknown ID — it creates a new empty conversation. This means the test validates mock behavior rather than real rehydration failure semantics. The test expects Error (not NotFoundError), which is consistent with the mock throwing, but doesn't reflect the actual production behavior where a phantom conversation would be created and the surface action would proceed (and likely fail differently). This is related to BUG-0001.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Resolved by the production fix — the test no longer relies on the mock throwing. getOrCreateConversation is now only called after the DB scan confirms the surface exists, so the test for "no matching surface" sets rawGetReturn = null and asserts that getOrCreateCalls stays empty. The mock no longer needs to throw to model the failure path.
| } | ||
| if (!resolvedConversationId) return undefined; | ||
|
|
||
| return await getOrCreateConversation(resolvedConversationId); |
There was a problem hiding this comment.
🚩 Rehydrated conversation has no-op sendToClient — surface events silently dropped
When getOrCreateConversation creates a new Conversation (conversation-store.ts:199), it sets sendToClient = () => {} (a no-op). After rehydration through resolveSurfaceConversation, the conversation's client callback is never updated to a real sender. This means any SSE events (like ui_surface_complete) emitted during surface action handling on a rehydrated conversation won't reach the client via the direct sendToClient channel. Events published through assistantEventHub (broadcast) would still work, so the practical impact depends on which event paths the surface action code uses. This is pre-existing behavior of getOrCreateConversation but becomes newly relevant with this PR's rehydration path.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Acknowledged — this is a real but pre-existing limitation of getOrCreateConversation, not something this PR makes worse. After this fix the user is no longer wedged on a 404; if their action triggers an LLM response, the response body of the action POST conveys the immediate result, and any new SSE stream the client opens against the rehydrated conversation will get fresh events normally. Updating sendToClient from the surface-action route would require knowing the caller's stream, which the current API doesn't carry — better as a separate followup that adds a stream-aware variant or hooks SSE establishment to refresh the sender. I'll file a tracking ticket.
f61da80 to
bd293e0
Compare
… eviction Surfaces persist as ui_surface content blocks in the messages table, but surfaceState/pendingSurfaceActions live only in memory. After a daemon restart or LRU eviction, a stale tab can still render a surface whose owning conversation is no longer loaded — clicking an action on it gets a 404 from /v1/surface-actions because findConversation* miss. Add resolveSurfaceConversation: on miss, scan the messages table for the ui_surface block, then call getOrCreateConversation to rehydrate. Apply to both the action and undo handlers. Includes regression tests covering live hits, DB-fallback rehydrate (with and without caller-supplied conversationId), and the no-surface-found 404 path.
bd293e0 to
8fe70d6
Compare
Bug
Clicking an action button on a visible surface returns 404 with the message
"Failed to submit. Please try again.". Web clientsubmitSurfaceActionPOSTs/v1/assistants/{assistant_id}/surface-actions/, Django proxies to the daemon, the daemon'shandleSurfaceActionthrowsNotFoundError("No active conversation found").Root cause
Surfaces persist as
ui_surfacecontent blocks in themessagestable — durable, survive restart. ButsurfaceStateandpendingSurfaceActionsare populated only in memory byrestoreSurfaceStateFromHistory()when a conversation is actively loaded.After:
ConversationEvictor(default 30 min idle / 100-conv LRU / 3GB RSS)…the tab still shows the surface (durable in DB), but the in-memory conversation is gone.
findConversation*returns undefined → 404.Fix
New helper
resolveSurfaceConversation:findConversation(conversationId)(orfindConversationBySurfaceId(surfaceId)if noconversationIdprovided). If hit, return immediately.messagesfor"surfaceId":"<id>"(with LIKE wildcards insurfaceIdproperly escaped) to find the owning conversation.NotFoundError. The DB scan is the source of truth that the surface exists; without it,getOrCreateConversationwould silently create a phantom empty conversation for any caller-supplied id.conversationIdthat mismatches the DB row →NotFoundError.getOrCreateConversation(row.conversation_id)to load and rehydratesurfaceStateviarestoreSurfaceStateFromHistory(). Action dispatches normally.Applied to both the surface-action handler and the surface-undo handler.
Tests
New
surface-action-routes.test.tscovers:surfaceId(no DB scan)conversationId(no DB scan)conversationIdafter validating DB rowconversationIdhas no matching surface in DBconversationIdmismatches the DB row's ownerLIKEwildcards insurfaceIdso e.g.surfaceId: "%"can't match unrelated rowsaccepted=falserejection bubbles up asBadRequestErrorAll 12 pass. Typecheck clean. OpenAPI spec unchanged (no API surface delta).
Known limitation (followup)
A rehydrated conversation gets a no-op
sendToClientfromgetOrCreateConversation, so direct SSE writes during action handling on the rehydrated path won't reach the client through that callback. Broadcasts viaassistantEventHubstill work, and the action's response body still conveys the immediate result. A separate followup will add a stream-aware path that refreshessendToClientwhen the client reattaches an SSE stream to the rehydrated conversation.Context
The user-visible escape hatch for the same bug landed separately on web in vellum-assistant-platform#6179 (merged) — it stops the composer from getting wedged when an interactive surface is up by treating "user types" as an implicit dismiss. That fix lets the user move past a surface without clicking. This PR restores the click-the-action path, which carries different intent than typing past the surface (e.g. "close as not-a-vuln" vs. "send a message").