fix(daemon): rehydrate conversation when surface-action arrives after eviction by vellum-apollo-bot[bot] · Pull Request #29895 · vellum-ai/vellum-assistant

vellum-apollo-bot · 2026-05-07T16:32:15Z

Bug

Clicking an action button on a visible surface returns 404 with the message "Failed to submit. Please try again.". Web client submitSurfaceAction POSTs /v1/assistants/{assistant_id}/surface-actions/, Django proxies to the daemon, the daemon's handleSurfaceAction throws NotFoundError("No active conversation found").

Root cause

Surfaces persist as ui_surface content blocks in the messages table — durable, survive restart. But surfaceState and pendingSurfaceActions are populated only in memory by restoreSurfaceStateFromHistory() when a conversation is actively loaded.

After:

a daemon restart, or
LRU eviction by ConversationEvictor (default 30 min idle / 100-conv LRU / 3GB RSS)

…the tab still shows the surface (durable in DB), but the in-memory conversation is gone. findConversation* returns undefined → 404.

Fix

New helper resolveSurfaceConversation:

Try findConversation(conversationId) (or findConversationBySurfaceId(surfaceId) if no conversationId provided). If hit, return immediately.
On miss, scan messages for "surfaceId":"<id>" (with LIKE wildcards in surfaceId properly escaped) to find the owning conversation.
If no DB row exists → NotFoundError. The DB scan is the source of truth that the surface exists; without it, getOrCreateConversation would silently create a phantom empty conversation for any caller-supplied id.
If the caller supplied a conversationId that mismatches the DB row → NotFoundError.
Otherwise, call getOrCreateConversation(row.conversation_id) to load and rehydrate surfaceState via restoreSurfaceStateFromHistory(). Action dispatches normally.

Applied to both the surface-action handler and the surface-undo handler.

Tests

New surface-action-routes.test.ts covers:

live in-memory hit by surfaceId (no DB scan)
live in-memory hit by conversationId (no DB scan)
DB-fallback rehydrate when in-memory misses
rehydrate with caller-supplied conversationId after validating DB row
404 when surface is not in memory and not in DB (no phantom created)
404 when caller-supplied conversationId has no matching surface in DB
404 when caller-supplied conversationId mismatches the DB row's owner
escapes LIKE wildcards in surfaceId so e.g. surfaceId: "%" can't match unrelated rows
accepted=false rejection bubbles up as BadRequestError
undo handler: live hit, DB-fallback rehydrate, 404

All 12 pass. Typecheck clean. OpenAPI spec unchanged (no API surface delta).

Known limitation (followup)

A rehydrated conversation gets a no-op sendToClient from getOrCreateConversation, so direct SSE writes during action handling on the rehydrated path won't reach the client through that callback. Broadcasts via assistantEventHub still work, and the action's response body still conveys the immediate result. A separate followup will add a stream-aware path that refreshes sendToClient when the client reattaches an SSE stream to the rehydrated conversation.

Context

The user-visible escape hatch for the same bug landed separately on web in vellum-assistant-platform#6179 (merged) — it stops the composer from getting wedged when an interactive surface is up by treating "user types" as an implicit dismiss. That fix lets the user move past a surface without clicking. This PR restores the click-the-action path, which carries different intent than typing past the surface (e.g. "close as not-a-vuln" vs. "send a message").

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f61da8055e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-07T16:36:39Z

+       WHERE content LIKE ?
+       ORDER BY created_at DESC
+       LIMIT 1`,
+      `%"surfaceId":"${surfaceId}"%`,


Escape LIKE wildcards in surfaceId lookup

The DB fallback uses a LIKE pattern built directly from request surfaceId, so % and _ in user input are treated as wildcards. A request like surfaceId: "%" will match any ui_surface row and can resolve the wrong conversation, causing triggerSurfaceAction/undoSurfaceAction (when conversationId is omitted) to run against unrelated conversations instead of returning 404. Escape wildcard characters (or switch to exact JSON extraction) before querying.

Useful? React with 👍 / 👎.

Good catch — fixed in 8fe70d63a0. Now escapes %, _, and \ before binding and uses LIKE ? ESCAPE '\\'. New test (escapes LIKE wildcards in surfaceId) asserts the SQL uses ESCAPE and that a request with surfaceId: "%" produces the parameter %"surfaceId":"\\%"% rather than matching every row.

chatgpt-codex-connector · 2026-05-07T16:36:40Z

+  }
+  if (!resolvedConversationId) return undefined;
+
+  return await getOrCreateConversation(resolvedConversationId);


Fail missing conversation IDs instead of creating new state

When a caller provides conversationId and it is not in memory, this path always calls getOrCreateConversation(resolvedConversationId). Because that API creates conversations when they do not exist, surface-action requests with stale/invalid IDs now create a fresh conversation and continue dispatching instead of preserving the previous NotFound behavior. This turns an action endpoint into a conversation-creation path and can route actions into unintended IDs; validate existence/ownership before calling getOrCreateConversation.

Useful? React with 👍 / 👎.

Fixed in 8fe70d63a0. resolveSurfaceConversation now always validates the surface against the messages table before calling getOrCreateConversation, even when the caller supplies a conversationId. If the DB row is missing → NotFoundError (no phantom). If the row's conversation_id doesn't match the caller-supplied id → also NotFoundError. New tests cover both paths and the previous "rehydrate without DB scan" optimization is gone.

devin-ai-integration

Devin Review found 3 potential issues.

View 2 additional findings in Devin Review.

devin-ai-integration · 2026-05-07T16:37:01Z

+  let resolvedConversationId: string | undefined = conversationId ?? undefined;
+  if (!resolvedConversationId) {
+    const row = rawGet<{ conversation_id: string }>(
+      `SELECT conversation_id FROM messages
+       WHERE content LIKE ?
+       ORDER BY created_at DESC
+       LIMIT 1`,
+      `%"surfaceId":"${surfaceId}"%`,
+    );
+    if (row) resolvedConversationId = row.conversation_id;
+  }
+  if (!resolvedConversationId) return undefined;
+
+  return await getOrCreateConversation(resolvedConversationId);


🔴 resolveSurfaceConversation always returns a conversation for any provided conversationId, creating phantom conversations for invalid IDs

When a caller provides a conversationId that doesn't exist in memory, resolveSurfaceConversation unconditionally calls getOrCreateConversation(conversationId) at line 136. Because getOrCreateConversation (conversation-store.ts:194) always succeeds — creating a new empty in-memory Conversation even when no DB record exists — the function never returns undefined when conversationId is truthy.

This changes the error behavior from the old code: previously, findConversation(bogusId) returned undefined and the handler threw a clean NotFoundError (404). Now, a phantom empty conversation is created, setConversation registers it in the in-memory store, and the handler proceeds to dispatch the surface action on it. For handleSurfaceAction, this means the surface won't be found in pendingSurfaceActions and the code falls through to the history-restored path (conversation-surfaces.ts:1258), potentially enqueuing an LLM message on a ghost conversation. For handleSurfaceUndo, it calls handleSurfaceUndo(surfaceId) on an empty conversation with unpredictable results. Both paths return 500 instead of the expected 404.

Prompt for agents

In resolveSurfaceConversation (surface-action-routes.ts:114-137), when conversationId is provided but findConversation returns undefined, the function calls getOrCreateConversation which always succeeds — even for IDs that have no DB record. This creates phantom conversations and changes the error from 404 to 500. The fix should verify that the conversation actually exists in the database before calling getOrCreateConversation. Options: 1. After getOrCreateConversation returns, check that the rehydrated conversation's surfaceState contains the requested surfaceId. If not, return undefined. 2. Query the DB first (e.g. check if any messages exist for that conversation_id) before calling getOrCreateConversation. 3. Use a hypothetical findConversation-only variant that doesn't create if missing. Option 1 is simplest and also catches the case where the conversation exists but the surface doesn't belong to it.

Was this helpful? React with 👍 or 👎 to provide feedback.

Yep, agreed — same fix as the Codex finding above. 8fe70d63a0 makes the DB scan the gate before getOrCreateConversation, so a stale or invalid conversationId returns 404 instead of creating a phantom. Two new tests (returns 404 when caller-supplied conversationId has no matching surface in DB and ...mismatches the DB row) lock this in.

devin-ai-integration · 2026-05-07T16:37:02Z

+  test("returns 404 with caller-supplied conversationId that cannot be rehydrated", async () => {
+    // rehydrated stays null → getOrCreateConversation throws inside the mock,
+    // which surfaces as a route-level miss.
+    const handler = findHandler("triggerSurfaceAction");
+
+    let caught: unknown;
+    try {
+      await handler({
+        body: {
+          conversationId: "conv-deleted",
+          surfaceId: "surf-x",
+          actionId: "act-x",
+        },
+      });
+    } catch (err) {
+      caught = err;
+    }
+
+    expect(caught).toBeInstanceOf(Error);
+    // We tried to rehydrate before giving up.
+    expect(getOrCreateCalls).toEqual(["conv-deleted"]);
+  });


🚩 Test mock for getOrCreateConversation diverges from production behavior on invalid IDs

The test at line 234-255 ("returns 404 with caller-supplied conversationId that cannot be rehydrated") sets rehydrated = null so the mock throws an Error. In production, getOrCreateConversation never throws for an unknown ID — it creates a new empty conversation. This means the test validates mock behavior rather than real rehydration failure semantics. The test expects Error (not NotFoundError), which is consistent with the mock throwing, but doesn't reflect the actual production behavior where a phantom conversation would be created and the surface action would proceed (and likely fail differently). This is related to BUG-0001.

Was this helpful? React with 👍 or 👎 to provide feedback.

Resolved by the production fix — the test no longer relies on the mock throwing. getOrCreateConversation is now only called after the DB scan confirms the surface exists, so the test for "no matching surface" sets rawGetReturn = null and asserts that getOrCreateCalls stays empty. The mock no longer needs to throw to model the failure path.

devin-ai-integration · 2026-05-07T16:37:06Z

+  }
+  if (!resolvedConversationId) return undefined;
+
+  return await getOrCreateConversation(resolvedConversationId);


🚩 Rehydrated conversation has no-op sendToClient — surface events silently dropped

When getOrCreateConversation creates a new Conversation (conversation-store.ts:199), it sets sendToClient = () => {} (a no-op). After rehydration through resolveSurfaceConversation, the conversation's client callback is never updated to a real sender. This means any SSE events (like ui_surface_complete) emitted during surface action handling on a rehydrated conversation won't reach the client via the direct sendToClient channel. Events published through assistantEventHub (broadcast) would still work, so the practical impact depends on which event paths the surface action code uses. This is pre-existing behavior of getOrCreateConversation but becomes newly relevant with this PR's rehydration path.

Was this helpful? React with 👍 or 👎 to provide feedback.

Acknowledged — this is a real but pre-existing limitation of getOrCreateConversation, not something this PR makes worse. After this fix the user is no longer wedged on a 404; if their action triggers an LLM response, the response body of the action POST conveys the immediate result, and any new SSE stream the client opens against the rehydrated conversation will get fresh events normally. Updating sendToClient from the surface-action route would require knowing the caller's stream, which the current API doesn't carry — better as a separate followup that adds a stream-aware variant or hooks SSE establishment to refresh the sender. I'll file a tracking ticket.

… eviction Surfaces persist as ui_surface content blocks in the messages table, but surfaceState/pendingSurfaceActions live only in memory. After a daemon restart or LRU eviction, a stale tab can still render a surface whose owning conversation is no longer loaded — clicking an action on it gets a 404 from /v1/surface-actions because findConversation* miss. Add resolveSurfaceConversation: on miss, scan the messages table for the ui_surface block, then call getOrCreateConversation to rehydrate. Apply to both the action and undo handlers. Includes regression tests covering live hits, DB-fallback rehydrate (with and without caller-supplied conversationId), and the no-surface-found 404 path.

chatgpt-codex-connector Bot reviewed May 7, 2026

View reviewed changes

devin-ai-integration Bot reviewed May 7, 2026

View reviewed changes

vellum-apollo-bot Bot force-pushed the apollo/surface-action-rehydrate branch from f61da80 to bd293e0 Compare May 7, 2026 17:18

vellum-apollo-bot Bot force-pushed the apollo/surface-action-rehydrate branch from bd293e0 to 8fe70d6 Compare May 7, 2026 20:56

dvargasfuertes approved these changes May 8, 2026

View reviewed changes

dvargasfuertes merged commit 0b1c0dd into main May 8, 2026
13 checks passed

dvargasfuertes deleted the apollo/surface-action-rehydrate branch May 8, 2026 01:16

vellum-apollo-bot Bot mentioned this pull request May 8, 2026

feat(context): preserve verbatim tail-anchor across compaction #29918

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(daemon): rehydrate conversation when surface-action arrives after eviction#29895

fix(daemon): rehydrate conversation when surface-action arrives after eviction#29895
dvargasfuertes merged 1 commit into
mainfrom
apollo/surface-action-rehydrate

vellum-apollo-bot Bot commented May 7, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 7, 2026

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Uh oh!

chatgpt-codex-connector Bot May 7, 2026

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 7, 2026

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Uh oh!

devin-ai-integration Bot May 7, 2026

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Uh oh!

devin-ai-integration Bot May 7, 2026

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vellum-apollo-bot Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bug

Root cause

Fix

Tests

Known limitation (followup)

Context

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

vellum-apollo-bot Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vellum-apollo-bot Bot commented May 7, 2026 •

edited

Loading