Skip to content

fix(daemon): rehydrate conversation when surface-action arrives after eviction#29895

Merged
dvargasfuertes merged 1 commit into
mainfrom
apollo/surface-action-rehydrate
May 8, 2026
Merged

fix(daemon): rehydrate conversation when surface-action arrives after eviction#29895
dvargasfuertes merged 1 commit into
mainfrom
apollo/surface-action-rehydrate

Conversation

@vellum-apollo-bot
Copy link
Copy Markdown
Contributor

@vellum-apollo-bot vellum-apollo-bot Bot commented May 7, 2026

Bug

Clicking an action button on a visible surface returns 404 with the message "Failed to submit. Please try again.". Web client submitSurfaceAction POSTs /v1/assistants/{assistant_id}/surface-actions/, Django proxies to the daemon, the daemon's handleSurfaceAction throws NotFoundError("No active conversation found").

Root cause

Surfaces persist as ui_surface content blocks in the messages table — durable, survive restart. But surfaceState and pendingSurfaceActions are populated only in memory by restoreSurfaceStateFromHistory() when a conversation is actively loaded.

After:

  • a daemon restart, or
  • LRU eviction by ConversationEvictor (default 30 min idle / 100-conv LRU / 3GB RSS)

…the tab still shows the surface (durable in DB), but the in-memory conversation is gone. findConversation* returns undefined → 404.

Fix

New helper resolveSurfaceConversation:

  1. Try findConversation(conversationId) (or findConversationBySurfaceId(surfaceId) if no conversationId provided). If hit, return immediately.
  2. On miss, scan messages for "surfaceId":"<id>" (with LIKE wildcards in surfaceId properly escaped) to find the owning conversation.
  3. If no DB row exists → NotFoundError. The DB scan is the source of truth that the surface exists; without it, getOrCreateConversation would silently create a phantom empty conversation for any caller-supplied id.
  4. If the caller supplied a conversationId that mismatches the DB row → NotFoundError.
  5. Otherwise, call getOrCreateConversation(row.conversation_id) to load and rehydrate surfaceState via restoreSurfaceStateFromHistory(). Action dispatches normally.

Applied to both the surface-action handler and the surface-undo handler.

Tests

New surface-action-routes.test.ts covers:

  • live in-memory hit by surfaceId (no DB scan)
  • live in-memory hit by conversationId (no DB scan)
  • DB-fallback rehydrate when in-memory misses
  • rehydrate with caller-supplied conversationId after validating DB row
  • 404 when surface is not in memory and not in DB (no phantom created)
  • 404 when caller-supplied conversationId has no matching surface in DB
  • 404 when caller-supplied conversationId mismatches the DB row's owner
  • escapes LIKE wildcards in surfaceId so e.g. surfaceId: "%" can't match unrelated rows
  • accepted=false rejection bubbles up as BadRequestError
  • undo handler: live hit, DB-fallback rehydrate, 404

All 12 pass. Typecheck clean. OpenAPI spec unchanged (no API surface delta).

Known limitation (followup)

A rehydrated conversation gets a no-op sendToClient from getOrCreateConversation, so direct SSE writes during action handling on the rehydrated path won't reach the client through that callback. Broadcasts via assistantEventHub still work, and the action's response body still conveys the immediate result. A separate followup will add a stream-aware path that refreshes sendToClient when the client reattaches an SSE stream to the rehydrated conversation.

Context

The user-visible escape hatch for the same bug landed separately on web in vellum-assistant-platform#6179 (merged) — it stops the composer from getting wedged when an interactive surface is up by treating "user types" as an implicit dismiss. That fix lets the user move past a surface without clicking. This PR restores the click-the-action path, which carries different intent than typing past the surface (e.g. "close as not-a-vuln" vs. "send a message").

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f61da8055e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

WHERE content LIKE ?
ORDER BY created_at DESC
LIMIT 1`,
`%"surfaceId":"${surfaceId}"%`,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Escape LIKE wildcards in surfaceId lookup

The DB fallback uses a LIKE pattern built directly from request surfaceId, so % and _ in user input are treated as wildcards. A request like surfaceId: "%" will match any ui_surface row and can resolve the wrong conversation, causing triggerSurfaceAction/undoSurfaceAction (when conversationId is omitted) to run against unrelated conversations instead of returning 404. Escape wildcard characters (or switch to exact JSON extraction) before querying.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 8fe70d63a0. Now escapes %, _, and \ before binding and uses LIKE ? ESCAPE '\\'. New test (escapes LIKE wildcards in surfaceId) asserts the SQL uses ESCAPE and that a request with surfaceId: "%" produces the parameter %"surfaceId":"\\%"% rather than matching every row.

}
if (!resolvedConversationId) return undefined;

return await getOrCreateConversation(resolvedConversationId);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fail missing conversation IDs instead of creating new state

When a caller provides conversationId and it is not in memory, this path always calls getOrCreateConversation(resolvedConversationId). Because that API creates conversations when they do not exist, surface-action requests with stale/invalid IDs now create a fresh conversation and continue dispatching instead of preserving the previous NotFound behavior. This turns an action endpoint into a conversation-creation path and can route actions into unintended IDs; validate existence/ownership before calling getOrCreateConversation.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8fe70d63a0. resolveSurfaceConversation now always validates the surface against the messages table before calling getOrCreateConversation, even when the caller supplies a conversationId. If the DB row is missing → NotFoundError (no phantom). If the row's conversation_id doesn't match the caller-supplied id → also NotFoundError. New tests cover both paths and the previous "rehydrate without DB scan" optimization is gone.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

View 2 additional findings in Devin Review.

Open in Devin Review

Comment on lines +123 to +136
let resolvedConversationId: string | undefined = conversationId ?? undefined;
if (!resolvedConversationId) {
const row = rawGet<{ conversation_id: string }>(
`SELECT conversation_id FROM messages
WHERE content LIKE ?
ORDER BY created_at DESC
LIMIT 1`,
`%"surfaceId":"${surfaceId}"%`,
);
if (row) resolvedConversationId = row.conversation_id;
}
if (!resolvedConversationId) return undefined;

return await getOrCreateConversation(resolvedConversationId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 resolveSurfaceConversation always returns a conversation for any provided conversationId, creating phantom conversations for invalid IDs

When a caller provides a conversationId that doesn't exist in memory, resolveSurfaceConversation unconditionally calls getOrCreateConversation(conversationId) at line 136. Because getOrCreateConversation (conversation-store.ts:194) always succeeds — creating a new empty in-memory Conversation even when no DB record exists — the function never returns undefined when conversationId is truthy.

This changes the error behavior from the old code: previously, findConversation(bogusId) returned undefined and the handler threw a clean NotFoundError (404). Now, a phantom empty conversation is created, setConversation registers it in the in-memory store, and the handler proceeds to dispatch the surface action on it. For handleSurfaceAction, this means the surface won't be found in pendingSurfaceActions and the code falls through to the history-restored path (conversation-surfaces.ts:1258), potentially enqueuing an LLM message on a ghost conversation. For handleSurfaceUndo, it calls handleSurfaceUndo(surfaceId) on an empty conversation with unpredictable results. Both paths return 500 instead of the expected 404.

Prompt for agents
In resolveSurfaceConversation (surface-action-routes.ts:114-137), when conversationId is provided but findConversation returns undefined, the function calls getOrCreateConversation which always succeeds — even for IDs that have no DB record. This creates phantom conversations and changes the error from 404 to 500.

The fix should verify that the conversation actually exists in the database before calling getOrCreateConversation. Options:

1. After getOrCreateConversation returns, check that the rehydrated conversation's surfaceState contains the requested surfaceId. If not, return undefined.
2. Query the DB first (e.g. check if any messages exist for that conversation_id) before calling getOrCreateConversation.
3. Use a hypothetical findConversation-only variant that doesn't create if missing.

Option 1 is simplest and also catches the case where the conversation exists but the surface doesn't belong to it.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, agreed — same fix as the Codex finding above. 8fe70d63a0 makes the DB scan the gate before getOrCreateConversation, so a stale or invalid conversationId returns 404 instead of creating a phantom. Two new tests (returns 404 when caller-supplied conversationId has no matching surface in DB and ...mismatches the DB row) lock this in.

Comment on lines +234 to +255
test("returns 404 with caller-supplied conversationId that cannot be rehydrated", async () => {
// rehydrated stays null → getOrCreateConversation throws inside the mock,
// which surfaces as a route-level miss.
const handler = findHandler("triggerSurfaceAction");

let caught: unknown;
try {
await handler({
body: {
conversationId: "conv-deleted",
surfaceId: "surf-x",
actionId: "act-x",
},
});
} catch (err) {
caught = err;
}

expect(caught).toBeInstanceOf(Error);
// We tried to rehydrate before giving up.
expect(getOrCreateCalls).toEqual(["conv-deleted"]);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Test mock for getOrCreateConversation diverges from production behavior on invalid IDs

The test at line 234-255 ("returns 404 with caller-supplied conversationId that cannot be rehydrated") sets rehydrated = null so the mock throws an Error. In production, getOrCreateConversation never throws for an unknown ID — it creates a new empty conversation. This means the test validates mock behavior rather than real rehydration failure semantics. The test expects Error (not NotFoundError), which is consistent with the mock throwing, but doesn't reflect the actual production behavior where a phantom conversation would be created and the surface action would proceed (and likely fail differently). This is related to BUG-0001.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved by the production fix — the test no longer relies on the mock throwing. getOrCreateConversation is now only called after the DB scan confirms the surface exists, so the test for "no matching surface" sets rawGetReturn = null and asserts that getOrCreateCalls stays empty. The mock no longer needs to throw to model the failure path.

}
if (!resolvedConversationId) return undefined;

return await getOrCreateConversation(resolvedConversationId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Rehydrated conversation has no-op sendToClient — surface events silently dropped

When getOrCreateConversation creates a new Conversation (conversation-store.ts:199), it sets sendToClient = () => {} (a no-op). After rehydration through resolveSurfaceConversation, the conversation's client callback is never updated to a real sender. This means any SSE events (like ui_surface_complete) emitted during surface action handling on a rehydrated conversation won't reach the client via the direct sendToClient channel. Events published through assistantEventHub (broadcast) would still work, so the practical impact depends on which event paths the surface action code uses. This is pre-existing behavior of getOrCreateConversation but becomes newly relevant with this PR's rehydration path.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — this is a real but pre-existing limitation of getOrCreateConversation, not something this PR makes worse. After this fix the user is no longer wedged on a 404; if their action triggers an LLM response, the response body of the action POST conveys the immediate result, and any new SSE stream the client opens against the rehydrated conversation will get fresh events normally. Updating sendToClient from the surface-action route would require knowing the caller's stream, which the current API doesn't carry — better as a separate followup that adds a stream-aware variant or hooks SSE establishment to refresh the sender. I'll file a tracking ticket.

@vellum-apollo-bot vellum-apollo-bot Bot force-pushed the apollo/surface-action-rehydrate branch from f61da80 to bd293e0 Compare May 7, 2026 17:18
… eviction

Surfaces persist as ui_surface content blocks in the messages table, but
surfaceState/pendingSurfaceActions live only in memory. After a daemon
restart or LRU eviction, a stale tab can still render a surface whose
owning conversation is no longer loaded — clicking an action on it gets
a 404 from /v1/surface-actions because findConversation* miss.

Add resolveSurfaceConversation: on miss, scan the messages table for the
ui_surface block, then call getOrCreateConversation to rehydrate. Apply
to both the action and undo handlers.

Includes regression tests covering live hits, DB-fallback rehydrate (with
and without caller-supplied conversationId), and the no-surface-found 404
path.
@vellum-apollo-bot vellum-apollo-bot Bot force-pushed the apollo/surface-action-rehydrate branch from bd293e0 to 8fe70d6 Compare May 7, 2026 20:56
@dvargasfuertes dvargasfuertes merged commit 0b1c0dd into main May 8, 2026
13 checks passed
@dvargasfuertes dvargasfuertes deleted the apollo/surface-action-rehydrate branch May 8, 2026 01:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant