fix(web): fail stale tool calls when the assistant restarts mid-execution#32599
Conversation
Expose DEFAULT_TOOL_EXECUTION_TIMEOUT_SEC from @vellumai/assistant-api and consume it in sanitizeDisplayMessages as a new Hack #5: any running tool call whose elapsed time exceeds the configured execution timeout (plus a small grace window) is rewritten to a synthetic client-side error. The case this catches: the assistant restarts mid-tool-execution and never delivers the tool_result SSE for the call it was running. The client retains status: "running" forever, leaving a permanent spinner on the last bubble across reloads. Hack #4 (repair dangling tool calls) only fires when a SUBSEQUENT assistant message proves the tool completed server-side. Hack #5 is the timeout-based fallback for the case where no such proof ever arrives. - assistant/src/api/constants/tool-execution.ts (new): canonical default constant. - assistant/src/api/index.ts: re-export. - assistant/src/config/schemas/timeouts.ts + assistant/src/tools/execution-timeout.ts: consume the constant in place of inline 120 literals. - apps/web/src/domains/chat/utils/sanitize-display-messages.ts: Hack #5 — failStaleToolCalls. Predicate gates on running status, startedAt presence, no pendingConfirmation, and now - max(startedAt, lastProgressAt) > effectiveTimeoutMs + STALE_GRACE_MS. effectiveTimeoutMs prefers progressTimeoutSec (the daemon-advertised per-tool value) and falls back to the canonical default. - Tests: 11 new cases covering threshold semantics, progress-based liveness, pending-confirmation safety, missing-startedAt safety, last-assistant inclusion, sibling preservation, and COW identity.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bfebba4c2c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| tc.progressTimeoutSec && tc.progressTimeoutSec > 0 | ||
| ? tc.progressTimeoutSec | ||
| : DEFAULT_TOOL_EXECUTION_TIMEOUT_SEC; |
There was a problem hiding this comment.
Use the effective timeout before failing tools
For any deployment with timeouts.toolExecutionTimeoutSec raised above 120s, or for bash/host_bash calls that pass timeout_seconds above 120s (the executor allows shell calls up to shellMaxTimeoutSec, default 600s, in assistant/src/tools/executor.ts), this fallback still marks the chip as failed after only the hard-coded default plus grace. I only found tool_progress parsed/handled on the web side, not emitted by the assistant, so many legitimate long-running calls will have no progressTimeoutSec and will be shown as a client-side timeout while the assistant can still be executing and may later return a real result.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Confirmed the analysis — searched both directions: grep -rn '"tool_progress"' assistant/ returns nothing; only the web side parses the event. So today the progressTimeoutSec branch is dead code in the deployed system and the 120s+grace fallback is the only path.
Impact:
- Default-config deployments (
toolExecutionTimeoutSec = 120, bash without an explicittimeout_seconds): predicate is correct. - Raised
toolExecutionTimeoutSecserver-side: false positives. bash/host_bashwithinput.timeout_seconds > 120s(executor clamps toshellMaxTimeoutSec, default 600s — seeexecutor.ts:508-516): false positives.
Proposed follow-up (separate PR so this lands clean): add effectiveTimeoutSec?: number to ToolUseStartEvent, populated on the daemon side from computePerToolTimeoutMs. The web stream-message-updater stamps it onto the tool call alongside startedAt. The sanitizer then prefers this value over the hard-coded fallback, and the predicate becomes correct regardless of server config or per-call timeout_seconds.
The current PR is still strictly better than the status quo (infinite spinner → user reload) for the default-config case, which is what Vargas's screenshot was. Holding the wire-contract change for the follow-up unless he wants it folded in here.
| const hasRunningTool = useMemo( | ||
| () => | ||
| messages.some( | ||
| (m) => m.toolCalls?.some((tc) => tc.status === "running") ?? false, | ||
| ), | ||
| [messages], | ||
| ); | ||
| const [staleToolNowMs, setStaleToolNowMs] = useState(() => Date.now()); | ||
| useEffect(() => { | ||
| if (!hasRunningTool) return; | ||
| const STALE_TOOL_TICK_MS = 15_000; | ||
| const id = setInterval( | ||
| () => setStaleToolNowMs(Date.now()), | ||
| STALE_TOOL_TICK_MS, | ||
| ); | ||
| return () => clearInterval(id); | ||
| }, [hasRunningTool]); |
There was a problem hiding this comment.
I don't expect this to be here, I'm fine with this being just in sanitizeDisplayMessages. I recognize we don't catch the case of the exact moment where we go from not timeout to timeout
There was a problem hiding this comment.
Done — force-pushed c7bfd87082 and the ticker block is gone. The sanitizer will re-evaluate on the next messages change (next tool call, next assistant text, page reload), which covers the practical case of the user coming back to a stuck conversation. Accepting the at-transition miss as documented.
bfebba4 to
c7bfd87
Compare
|
|
||
| export function sanitizeDisplayMessages( | ||
| messages: DisplayMessage[], | ||
| options: SanitizeDisplayMessagesOptions = {}, |
There was a problem hiding this comment.
delete these options here
There was a problem hiding this comment.
Done in 855d0fc45e — dropped the SanitizeDisplayMessagesOptions interface and the second parameter. sanitizeDisplayMessages is back to a one-arg signature reading Date.now() inline. Tests pin the clock via a mockNow(...) helper backed by spyOn(Date, "now"), restored in afterEach. All 46 tests pass.
| removeInvalidMessages, | ||
| removeDuplicateTrailingAssistant, | ||
| repairDanglingToolCalls, | ||
| (msgs) => failStaleToolCalls(msgs, nowMs), |
There was a problem hiding this comment.
and here, can call the now() within this method
There was a problem hiding this comment.
Same commit — Date.now() is now called directly inside sanitizeDisplayMessages at the pipeline boundary, no options pass-through.
…tests Per review feedback: drop the SanitizeDisplayMessagesOptions interface and the optional second parameter on sanitizeDisplayMessages. Read Date.now() directly at the pipeline boundary instead. Tests now pin the clock via bun:test's `spyOn(Date, "now")` through a local `mockNow(...)` helper and restore in afterEach. Behavior is identical to the previous param-injection approach but the production signature stays one-arg.
|
|
||
| function failStaleToolCalls( | ||
| messages: DisplayMessage[], | ||
| nowMs: number, |
There was a problem hiding this comment.
you didn't remove this arg and inline'd it
There was a problem hiding this comment.
You're right, I half-did it last round. Fixed in 1577edfa29 — nowMs is gone from failStaleToolCalls' signature, Date.now() is called at the top of that function, and the pipeline now references failStaleToolCalls directly instead of wrapping it in a closure. sanitizeDisplayMessages no longer touches the wall clock at all.
Round 2 of the same feedback: don't thread `nowMs` from the pipeline. `failStaleToolCalls` now reads `Date.now()` at its own top, the pipeline can use `failStaleToolCalls` as a plain reference instead of a closure wrapper, and `sanitizeDisplayMessages` doesn't know or care about the wall clock.
Summary
Catch the case where the assistant daemon restarts (or silently dies) mid-tool-execution and never delivers the
tool_resultSSE for the call it was running. The web client's tool-call bubble would otherwise spin forever — Vargas's screenshot from May 29: abashchip stuck at "Verifying no stale refs + running typecheck" across reload because the daemon went away before the tool result hit the wire.This adds Hack #5 — fail stale tool calls to
sanitizeDisplayMessages. Predicate:status === "running"startedAtis setpendingConfirmationis null/undefined (a tool awaiting user approval is correctly stalled — its execution clock hasn't started)now - max(startedAt, lastProgressAt ?? 0) > effectiveTimeoutMs + STALE_GRACE_MSeffectiveTimeoutMsprefersprogressTimeoutSec(the daemon-advertised per-tool value), falling back toDEFAULT_TOOL_EXECUTION_TIMEOUT_SEC. That fallback constant is now exposed from@vellumai/assistant-apiso backend enforcement and frontend detection reference the same wire-contract default.STALE_GRACE_MS = 30_000absorbs daemon-side delivery lag between the server-side timeout firing and the synthetic error result actually crossing the SSE.How this differs from Hack #4
Hack #4 (repair dangling tool calls) only fires when a subsequent assistant message proves the tool completed server-side. Hack #5 is the timeout-based fallback for the case where no such proof ever arrives.
Stale-tool ticker
useMemo([messages])only re-evaluates when messages change, so the silent-daemon-death case (no new messages) would never trigger the predicate.chat-route-contentnow bumps astaleToolNowMsstate every 15s only while at least one tool call is running — outside that window the interval never starts, so idle conversations have zero extra renders.Changes
assistant/src/api/constants/tool-execution.ts(new): canonicalDEFAULT_TOOL_EXECUTION_TIMEOUT_SEC = 120.assistant/src/api/index.ts: re-export.assistant/src/config/schemas/timeouts.ts+assistant/src/tools/execution-timeout.ts: consume the constant in place of inline120literals.apps/web/src/domains/chat/utils/sanitize-display-messages.ts: Hack Create top levelsetup.shscript #5 implementation. OptionalnowMsparameter onsanitizeDisplayMessagesfor deterministic tests.apps/web/src/domains/chat/utils/sanitize-display-messages.test.ts: 11 new tests.apps/web/src/domains/chat/components/chat-route-content.tsx: 15-second stale-tool ticker, scoped to "while a tool is running".Test plan
cd apps/web && bun test src/domains/chat/utils/sanitize-display-messages.test.ts— 46 pass (35 prior + 11 new).cd assistant && bun test src/__tests__/tool-execute-pipeline.test.ts src/__tests__/recording-handler.test.ts src/__tests__/tool-execution-pipeline.benchmark.test.ts— all pass; default-timeout consumers still see 120s.cd apps/web && bun run typecheck && bun run lint— clean.cd assistant && bun run typecheck— clean.Follow-up
The current synthetic message is purely client-side decoration. Server-side, the daemon should ideally persist an "in-flight tool" record so the next process can emit a synthetic
tool_resulton boot — at which point Hack #5 (and Hack #4) become removable. Tracked separately.