Skip to content

fix(web): fail stale tool calls when the assistant restarts mid-execution#32599

Merged
dvargasfuertes merged 3 commits into
mainfrom
apollo/fail-stale-tool-calls
May 29, 2026
Merged

fix(web): fail stale tool calls when the assistant restarts mid-execution#32599
dvargasfuertes merged 3 commits into
mainfrom
apollo/fail-stale-tool-calls

Conversation

@vellum-apollo-bot
Copy link
Copy Markdown
Contributor

Summary

Catch the case where the assistant daemon restarts (or silently dies) mid-tool-execution and never delivers the tool_result SSE for the call it was running. The web client's tool-call bubble would otherwise spin forever — Vargas's screenshot from May 29: a bash chip stuck at "Verifying no stale refs + running typecheck" across reload because the daemon went away before the tool result hit the wire.

This adds Hack #5 — fail stale tool calls to sanitizeDisplayMessages. Predicate:

  • status === "running"
  • startedAt is set
  • pendingConfirmation is null/undefined (a tool awaiting user approval is correctly stalled — its execution clock hasn't started)
  • now - max(startedAt, lastProgressAt ?? 0) > effectiveTimeoutMs + STALE_GRACE_MS

effectiveTimeoutMs prefers progressTimeoutSec (the daemon-advertised per-tool value), falling back to DEFAULT_TOOL_EXECUTION_TIMEOUT_SEC. That fallback constant is now exposed from @vellumai/assistant-api so backend enforcement and frontend detection reference the same wire-contract default.

STALE_GRACE_MS = 30_000 absorbs daemon-side delivery lag between the server-side timeout firing and the synthetic error result actually crossing the SSE.

How this differs from Hack #4

Hack #4 (repair dangling tool calls) only fires when a subsequent assistant message proves the tool completed server-side. Hack #5 is the timeout-based fallback for the case where no such proof ever arrives.

Stale-tool ticker

useMemo([messages]) only re-evaluates when messages change, so the silent-daemon-death case (no new messages) would never trigger the predicate. chat-route-content now bumps a staleToolNowMs state every 15s only while at least one tool call is running — outside that window the interval never starts, so idle conversations have zero extra renders.

Changes

  • assistant/src/api/constants/tool-execution.ts (new): canonical DEFAULT_TOOL_EXECUTION_TIMEOUT_SEC = 120.
  • assistant/src/api/index.ts: re-export.
  • assistant/src/config/schemas/timeouts.ts + assistant/src/tools/execution-timeout.ts: consume the constant in place of inline 120 literals.
  • apps/web/src/domains/chat/utils/sanitize-display-messages.ts: Hack Create top level setup.sh script #5 implementation. Optional nowMs parameter on sanitizeDisplayMessages for deterministic tests.
  • apps/web/src/domains/chat/utils/sanitize-display-messages.test.ts: 11 new tests.
  • apps/web/src/domains/chat/components/chat-route-content.tsx: 15-second stale-tool ticker, scoped to "while a tool is running".

Test plan

  • cd apps/web && bun test src/domains/chat/utils/sanitize-display-messages.test.ts — 46 pass (35 prior + 11 new).
  • cd assistant && bun test src/__tests__/tool-execute-pipeline.test.ts src/__tests__/recording-handler.test.ts src/__tests__/tool-execution-pipeline.benchmark.test.ts — all pass; default-timeout consumers still see 120s.
  • cd apps/web && bun run typecheck && bun run lint — clean.
  • cd assistant && bun run typecheck — clean.

Follow-up

The current synthetic message is purely client-side decoration. Server-side, the daemon should ideally persist an "in-flight tool" record so the next process can emit a synthetic tool_result on boot — at which point Hack #5 (and Hack #4) become removable. Tracked separately.

Expose DEFAULT_TOOL_EXECUTION_TIMEOUT_SEC from @vellumai/assistant-api
and consume it in sanitizeDisplayMessages as a new Hack #5: any
running tool call whose elapsed time exceeds the configured execution
timeout (plus a small grace window) is rewritten to a synthetic
client-side error.

The case this catches: the assistant restarts mid-tool-execution and
never delivers the tool_result SSE for the call it was running. The
client retains status: "running" forever, leaving a permanent
spinner on the last bubble across reloads.

Hack #4 (repair dangling tool calls) only fires when a SUBSEQUENT
assistant message proves the tool completed server-side. Hack #5 is
the timeout-based fallback for the case where no such proof ever
arrives.

- assistant/src/api/constants/tool-execution.ts (new): canonical
  default constant.
- assistant/src/api/index.ts: re-export.
- assistant/src/config/schemas/timeouts.ts +
  assistant/src/tools/execution-timeout.ts: consume the constant in
  place of inline 120 literals.
- apps/web/src/domains/chat/utils/sanitize-display-messages.ts:
  Hack #5 — failStaleToolCalls. Predicate gates on running status,
  startedAt presence, no pendingConfirmation, and
  now - max(startedAt, lastProgressAt) > effectiveTimeoutMs +
  STALE_GRACE_MS. effectiveTimeoutMs prefers progressTimeoutSec (the
  daemon-advertised per-tool value) and falls back to the canonical
  default.
- Tests: 11 new cases covering threshold semantics, progress-based
  liveness, pending-confirmation safety, missing-startedAt safety,
  last-assistant inclusion, sibling preservation, and COW identity.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bfebba4c2c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +397 to +399
tc.progressTimeoutSec && tc.progressTimeoutSec > 0
? tc.progressTimeoutSec
: DEFAULT_TOOL_EXECUTION_TIMEOUT_SEC;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use the effective timeout before failing tools

For any deployment with timeouts.toolExecutionTimeoutSec raised above 120s, or for bash/host_bash calls that pass timeout_seconds above 120s (the executor allows shell calls up to shellMaxTimeoutSec, default 600s, in assistant/src/tools/executor.ts), this fallback still marks the chip as failed after only the hard-coded default plus grace. I only found tool_progress parsed/handled on the web side, not emitted by the assistant, so many legitimate long-running calls will have no progressTimeoutSec and will be shown as a client-side timeout while the assistant can still be executing and may later return a real result.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed the analysis — searched both directions: grep -rn '"tool_progress"' assistant/ returns nothing; only the web side parses the event. So today the progressTimeoutSec branch is dead code in the deployed system and the 120s+grace fallback is the only path.

Impact:

  • Default-config deployments (toolExecutionTimeoutSec = 120, bash without an explicit timeout_seconds): predicate is correct.
  • Raised toolExecutionTimeoutSec server-side: false positives.
  • bash / host_bash with input.timeout_seconds > 120s (executor clamps to shellMaxTimeoutSec, default 600s — see executor.ts:508-516): false positives.

Proposed follow-up (separate PR so this lands clean): add effectiveTimeoutSec?: number to ToolUseStartEvent, populated on the daemon side from computePerToolTimeoutMs. The web stream-message-updater stamps it onto the tool call alongside startedAt. The sanitizer then prefers this value over the hard-coded fallback, and the predicate becomes correct regardless of server config or per-call timeout_seconds.

The current PR is still strictly better than the status quo (infinite spinner → user reload) for the default-config case, which is what Vargas's screenshot was. Holding the wire-contract change for the follow-up unless he wants it folded in here.

Comment on lines +820 to +836
const hasRunningTool = useMemo(
() =>
messages.some(
(m) => m.toolCalls?.some((tc) => tc.status === "running") ?? false,
),
[messages],
);
const [staleToolNowMs, setStaleToolNowMs] = useState(() => Date.now());
useEffect(() => {
if (!hasRunningTool) return;
const STALE_TOOL_TICK_MS = 15_000;
const id = setInterval(
() => setStaleToolNowMs(Date.now()),
STALE_TOOL_TICK_MS,
);
return () => clearInterval(id);
}, [hasRunningTool]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect this to be here, I'm fine with this being just in sanitizeDisplayMessages. I recognize we don't catch the case of the exact moment where we go from not timeout to timeout

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — force-pushed c7bfd87082 and the ticker block is gone. The sanitizer will re-evaluate on the next messages change (next tool call, next assistant text, page reload), which covers the practical case of the user coming back to a stuck conversation. Accepting the at-transition miss as documented.

@vellum-apollo-bot vellum-apollo-bot Bot force-pushed the apollo/fail-stale-tool-calls branch from bfebba4 to c7bfd87 Compare May 29, 2026 19:43

export function sanitizeDisplayMessages(
messages: DisplayMessage[],
options: SanitizeDisplayMessagesOptions = {},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete these options here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 855d0fc45e — dropped the SanitizeDisplayMessagesOptions interface and the second parameter. sanitizeDisplayMessages is back to a one-arg signature reading Date.now() inline. Tests pin the clock via a mockNow(...) helper backed by spyOn(Date, "now"), restored in afterEach. All 46 tests pass.

removeInvalidMessages,
removeDuplicateTrailingAssistant,
repairDanglingToolCalls,
(msgs) => failStaleToolCalls(msgs, nowMs),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here, can call the now() within this method

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same commit — Date.now() is now called directly inside sanitizeDisplayMessages at the pipeline boundary, no options pass-through.

…tests

Per review feedback: drop the SanitizeDisplayMessagesOptions interface
and the optional second parameter on sanitizeDisplayMessages. Read
Date.now() directly at the pipeline boundary instead.

Tests now pin the clock via bun:test's `spyOn(Date, "now")` through a
local `mockNow(...)` helper and restore in afterEach. Behavior is
identical to the previous param-injection approach but the production
signature stays one-arg.

function failStaleToolCalls(
messages: DisplayMessage[],
nowMs: number,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you didn't remove this arg and inline'd it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I half-did it last round. Fixed in 1577edfa29nowMs is gone from failStaleToolCalls' signature, Date.now() is called at the top of that function, and the pipeline now references failStaleToolCalls directly instead of wrapping it in a closure. sanitizeDisplayMessages no longer touches the wall clock at all.

Round 2 of the same feedback: don't thread `nowMs` from the pipeline.
`failStaleToolCalls` now reads `Date.now()` at its own top, the
pipeline can use `failStaleToolCalls` as a plain reference instead
of a closure wrapper, and `sanitizeDisplayMessages` doesn't know or
care about the wall clock.
@dvargasfuertes dvargasfuertes merged commit 381dcb3 into main May 29, 2026
18 checks passed
@dvargasfuertes dvargasfuertes deleted the apollo/fail-stale-tool-calls branch May 29, 2026 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant